Special characters in attributes

 
 
SDG
 
      09-19-2007
Hi, I'm writing a web scraper to extract text from a web page, and I
need to know what characters can be present inside an attribute of a
tag.
So far, in the code of my program, I've written that attributes can
contain these characters: '!=@/ \[]#.:_()-&;?
Did I forget something? I've looked for an official
specification (like a regular expression for HTML, or even just for
attributes), but so far I haven't found anything.
Thanks a lot
 
 
 
 
 
David Dorward
 
      09-19-2007
On Sep 19, 1:30 pm, SDG <giuffsa...@hotmail.it> wrote:
> Hi, I'm writing a web scraper to extract text from a web page, and I
> need to know what characters can be present inside an attribute of a
> tag.


Any, although some attributes have limits on what is allowed (e.g. the
width attribute takes an integer or an integer followed by a percent
sign), and those limits aren't usually expressed by the DTD. Some
characters (& for example) also have special meaning.
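
To make those two points concrete, here is a minimal sketch, assuming
Python (the thread never names a language). The names WIDTH_PATTERN,
is_valid_width and decode_attribute are invented for this illustration;
html.unescape is the standard-library call that resolves character
references such as &amp;.

```python
import html
import re

# Hypothetical check for the "integer or integer followed by %" rule
# described above; the DTD itself does not express this limit.
WIDTH_PATTERN = re.compile(r"[0-9]+%?")


def is_valid_width(value):
    """Return True if value looks like '200' or '50%'."""
    return WIDTH_PATTERN.fullmatch(value.strip()) is not None


def decode_attribute(raw):
    """Resolve character references (e.g. &amp;) in a raw attribute value."""
    return html.unescape(raw)


if __name__ == "__main__":
    print(is_valid_width("50%"))                 # a percentage width
    print(is_valid_width("wide"))                # not a number at all
    print(decode_attribute("page?a=1&amp;b=2"))  # & written as &amp;
```

The point of the second function is that & is allowed in a value, but a
scraper has to decode it rather than copy it through literally.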

--
David Dorward
http://dorward.me.uk/
http://blog.dorward.me.uk/

 
 
 
 
 
Jukka K. Korpela
 
      09-19-2007
Scripsit SDG:

> Hi, I'm writing a web scraper to extract text from a web page,


Sounds like reinventing the wheel. Do you intend to reinvent it from
scratch, or are you using some software package for parsing HTML?

> and I
> need to know what characters can be present inside an attribute of a
> tag.


Apparently you are not using some software package for parsing HTML. Do you
really think you are competent enough to consider SGML parsing, XML parsing,
and tagsoup parsing, including their conflicts?
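
For comparison, this is roughly what the "use a package" route looks
like. It is only a sketch, assuming Python and its standard-library
html.parser module, neither of which is mentioned in this thread; the
names AttributeCollector and collect_attributes are invented here. The
parser, not your own character list, decides where an attribute value
starts and ends.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attribute name, attribute value) triples."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already
        # removed and character references already decoded.
        for name, value in attrs:
            self.found.append((tag, name, value))


def collect_attributes(markup):
    """Parse markup and return every attribute found in start tags."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


if __name__ == "__main__":
    sample = (
        '<a href="page?x=1&amp;y=2" title=\'Price: $5 + tax\'>link</a>'
        "<td width=50%>cell</td>"
    )
    for triple in collect_attributes(sample):
        print(triple)
```

Note that the sample includes "+" and "$" in a quoted value and an
unquoted value, none of which needed a hand-written list of characters.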

> So far, in the code of my program, I've written that attributes can
> contain these characters: '!=@/ \[]#.:_()-&;?


What an interesting set of characters. I think it's probably the set you
found lying on your keyboard, excluding - for some odd reason - letters and
digits. And you didn't notice e.g. the poor lonesome "+" or the
innocent-looking "$".

> Did I forget something?


Oh, just about 1,000,000 characters. (I'm not kidding. The character set of
HTML is defined as UCS, commonly known as the Unicode character set, though
more formally the ISO 10646 set. Currently only about 100,000 code points
have been allocated, but can you disallow, in HTML parsing, the unassigned
code points? Hardly.)
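
If you want to check the numbers yourself, here is a rough sketch,
assuming Python and its unicodedata module (my choice, not something
from this thread). The names TOTAL_CODE_POINTS, is_allocated and
count_allocated are invented for the illustration. "Allocated" is taken
here to mean any general category other than Cn (unassigned), Co
(private use) and Cs (surrogate), which is one reasonable reading, not
the only one. The count depends on the Unicode version bundled with your
Python, so it will be higher than the 2007 figure.

```python
import sys
import unicodedata

# The Unicode code space runs from U+0000 to U+10FFFF.
TOTAL_CODE_POINTS = sys.maxunicode + 1

# Categories treated as "not allocated" in this sketch (an assumption):
# Cn = unassigned, Co = private use, Cs = surrogate.
NOT_ALLOCATED = {"Cn", "Co", "Cs"}


def is_allocated(code_point):
    """Return True if the code point has an assigned character."""
    return unicodedata.category(chr(code_point)) not in NOT_ALLOCATED


def count_allocated():
    """Count allocated code points in this Python's Unicode database."""
    return sum(1 for cp in range(TOTAL_CODE_POINTS) if is_allocated(cp))


if __name__ == "__main__":
    print("code space:", TOTAL_CODE_POINTS)
    print("allocated :", count_allocated())
```

Either way, a parser sees code points, not "allocated characters", so
an unassigned code point inside an attribute value still has to be
passed through.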

> I've looked for an official
> specification (like a regular expression for HTML, or even just for
> attributes), but so far I haven't found anything.


There are several official specifications for HTML. Didn't you know this?
The character repertoire allowed inside an attribute value depends on the
declaration of the attribute, but it can be CDATA, i.e. arbitrary character
data, excluding just the string delimiter (" or ') and, with some variation
between HTML versions, the ampersand character & as such in many or all
contexts. So the question is what can and needs to be _excluded_ (or, better,
treated as markup errors).
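
Turned around that way, the rule is short enough to sketch. The
following assumes Python and a quoted CDATA attribute value; it is not a
full SGML, XML or tag-soup parser, and the names QUOTED_VALUE,
BARE_AMPERSAND, parse_quoted_value and bare_ampersand_positions are
invented for the illustration. It accepts any character except the
delimiter in use, and reports ampersands that do not start a character
reference so they can be flagged as markup errors rather than silently
rejected.

```python
import re

# A quoted value: anything except the delimiter that opened it.
QUOTED_VALUE = re.compile(r'"([^"]*)"|\'([^\']*)\'')

# An ampersand that does not begin a numeric or named reference.
BARE_AMPERSAND = re.compile(
    r"&(?!#[0-9]+;|#[xX][0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def parse_quoted_value(text):
    """Return the content of a quoted value, or None if it is malformed."""
    match = QUOTED_VALUE.fullmatch(text)
    if match is None:
        return None
    # Exactly one of the two groups participates in a successful match.
    return match.group(1) if match.group(1) is not None else match.group(2)


def bare_ampersand_positions(value):
    """Return the indexes of ampersands that look like markup errors."""
    return [m.start() for m in BARE_AMPERSAND.finditer(value)]


if __name__ == "__main__":
    print(parse_quoted_value('"it\'s +$ fine"'))
    print(parse_quoted_value('"broken"value"'))
    print(bare_ampersand_positions("x=1&y=2"))
    print(bare_ampersand_positions("x=1&amp;y=2"))
```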

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

 
 
 
 