Special characters and validation

 
 
JD
 
      01-29-2009
I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML character
number n" errors.

I decided to write a little tool in C that reads in the copy and
substitutes character entity references for any characters that will
cause the above error. However, I'm confused about what to include in
this program and what to leave out. For example, even though there's an
entity reference for the copyright symbol, I've found I can put this
symbol directly in the source and the page still validates. In that
case, why use the entity reference at all?

Is there a definitive list somewhere of which characters need to be
encoded and which do not?

I use the HTML 4.01 Strict doctype and my documents have ISO-8859-1
encoding according to 'Page Info' in FF3.
 
rf
 
      01-29-2009
JD wrote:
> I frequently receive website copy in the form of Word documents. If I
> copy and paste the content directly from Word into my text editor, I
> often find that my web pages fail to validate due to "non SGML
> character number n" errors.


This stuff is usually because of Word's "smart quotes" feature, and others.
All such "helpfull" features can be turned off.


 
Zach
 
      01-29-2009

"JD" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
<...>
> Is there a definitive list somewhere of which characters need to be
> encoded and which do not?
>


space
! !
" " &quot;
# #
$ $
% %
& & &amp;
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< < &lt;
= =
> > &gt;

? ?
@ @
A A
B B
C C
D D
E E
F F
G G
H H
I I
J J
K K
L L
M M
N N
O O
P P
Q Q
R R
S S
T T
U U
V V
W W
X X
Y Y
Z Z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
~ ~
 
, ‚ ‚
f ƒ ƒ
" „ „
. … …
? † †
? ‡ ‡
^ ˆ ˆ
? ‰ ‰
S Š Š
< ‹ ‹
O Œ Œ
' ‘ ‘
' ’ ’
" “ “
" ” ”
. • •
- – –
- — —
~ ˜ ˜
T ™ ™
s š &#353;
> › ›

o œ œ
Y Ÿ Ÿ
  &nbsp;
¡ &iexcl;
¢ &cent;
£ &pound;
¤ &curren;
¥ &yen;
¦ &brvbar;
§ &sect;
¨ &uml;
© &copy;
ª &ordf;
« &laquo;
¬ &not;
* ­ ­
® &reg;
¯ &macr;
° &deg;
± &plusmn;
² &sup2;
³ &sup3;
´ &acute;
µ &micro;
¶ &para;
· &middot;
¸ &cedil;
¹ &sup1;
º &ordm;
» &raquo;
¼ &frac14;
½ &frac12;
¾ &frac34;
¿ &iquest;
À &Agrave;
Á &Aacute;
 &Acirc;
à &Atilde;
Ä &Auml;
Å &Aring;
Æ &AElig;
Ç &Ccedil;
È &Egrave;
É &Eacute;
Ê &Ecirc;
Ë &Euml;
Ì &Igrave;
Í &Iacute;
Î &Icirc;
Ï &Iuml;
Ð &ETH;
Ñ &Ntilde;
Ò &Ograve;
Ó &Oacute;
Ô &Ocirc;
Õ &Otilde;
Ö &Ouml;
× &times;
Ø &Oslash;
Ù &Ugrave;
Ú &Uacute;
Û &Ucirc;
Ü &Uuml;
Ý &Yacute;
Þ &THORN;
ß &szlig;
à &agrave;
á &aacute;
â &acirc;
ã &atilde;
ä &auml;
å &aring;
æ &aelig;
ç &ccedil;
è &egrave;
é &eacute;
ê &ecirc;
ë &euml;
ì &igrave;
í &iacute;
î &icirc;
ï &iuml;
ð &eth;
ñ &ntilde;
ò &ograve;
ó &oacute;
ô &ocirc;
õ &otilde;
ö &ouml;
÷ &divide;
ø &oslash;
ù &ugrave;
ú &uacute;
û &ucirc;
ü &uuml;
ý &yacute;
þ &thorn;
ÿ &yuml;
? € &euro;





 
Jukka K. Korpela
 
      01-29-2009
Zach wrote:

>> Is there a definitive list somewhere of which characters need to be
>> encoded and which do not?
>>

>
> space


Of course, stuff copied from somewhere without any citation and without even
saying how it is supposed to answer the question ranks you as Very Clueless.

Please do not stop using the same forged "identity" before you get a clue.
Thank you in advance.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

 
Jukka K. Korpela
 
      01-29-2009
rf wrote:
> JD wrote:
>> I frequently receive website copy in the form of Word documents. If I
>> copy and paste the content directly from Word into my text editor, I
>> often find that my web pages fail to validate due to "non SGML
>> character number n" errors.

>
> This stuff is usually because of Word's "smart quotes" feature, and
> others. All such "helpfull" features can be turned off.


The only reason to quote the word "helpful" here is that you misspelled it.

"Smart quotes" are the correct quotes. What's wrong here is their encoding,
as opposed to the declared or implied encoding of the page, but that's not
a reason to convert correct characters to something incorrect or at least
inferior.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
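
To make the encoding mismatch concrete, here is a minimal C sketch. It assumes the pasted text arrives as Windows-1252 bytes, which is typical for text copied out of Word on Windows but is an assumption about the input, not something stated in the thread. In Windows-1252 the bytes 0x80 to 0x9F hold the typographic quotes, dashes and similar characters, whereas in an ISO-8859-1 HTML 4.01 document those numbers are not valid characters, which is what produces the "non SGML character number n" message. The names cp1252_high and write_escaped are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0 marks the five byte values that Windows-1252 leaves undefined. */
static const unsigned cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Copy the NUL-terminated Windows-1252 string src into dst, replacing each
   defined byte in 0x80..0x9F with a decimal numeric character reference.
   All other bytes (including the undefined ones) are copied unchanged.
   Returns the length written, or -1 if dst (capacity cap) is too small. */
static int write_escaped(const char *src, char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t len = strlen(src);
    size_t out = 0;

    if (cap == 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)src[i];
        unsigned cp = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : 0;
        if (cp != 0) {
            /* Emit &#NNNN; for the real Unicode code point. */
            int n = snprintf(dst + out, cap - out, "&#%u;", cp);
            if (n < 0 || (size_t)n >= cap - out)
                return -1;
            out += (size_t)n;
        } else {
            if (out + 1 >= cap)
                return -1;
            dst[out++] = (char)b;
        }
    }
    dst[out] = '\0';
    return (int)out;
}
```

For instance, the Windows-1252 bytes 0x93 'H' 'i' 0x94 (curly quotes around "Hi") come out as &#8220;Hi&#8221;. The characters themselves are kept; only their representation changes.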

 
Zach
 
      01-29-2009

"Jukka K. Korpela" <(E-Mail Removed)> wrote in message
news:YMlgl.125771$(E-Mail Removed) i.fi...
> Zach wrote:
>
>>> Is there a definitive list somewhere of which characters need to be
>>> encoded and which do not?
>>>

>>
>> space

>
> Of course, stuff copied from somewhere without any citation and without
> even saying how it is supposed to answer the question ranks you as Very
> Clueless.
>
> Please do not stop using the same forged "identity" before you get a clue.
> Thank you in advance.
>
> --
> Yucca, http://www.cs.tut.fi/~jkorpela/


I answered the guy's question.

Zach,



 
JD
 
      01-30-2009
Zach wrote:
> "Jukka K. Korpela" <(E-Mail Removed)> wrote in message
> news:YMlgl.125771$(E-Mail Removed) i.fi...
>> Zach wrote:
>>
>>>> Is there a definitive list somewhere of which characters need to be
>>>> encoded and which do not?
>>>>
>>> space

>> Of course, stuff copied from somewhere without any citation and without
>> even saying how it is supposed to answer the question ranks you as Very
>> Clueless.
>>
>> Please do not stop using the same forged "identity" before you get a clue.
>> Thank you in advance.
>>
>> --
>> Yucca, http://www.cs.tut.fi/~jkorpela/

>
> I answered the guy's question.


How, by supplying an indiscriminate list of character entity references?
That's like giving somebody the entire alphabet when they ask which
letters are vowels.
 
Zach
 
      01-30-2009

"JD" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

<< snipped >>

>> I answered the guy's question.

>
> How, by supplying an indiscriminate list of character entity references?
> That's like giving somebody the entire alphabet when they ask which
> letters are vowels.


oooooooooooooooooooooooooooooooooooooooooooooooooo

Oh. Oh. If a response isn't to your liking, then say so politely.

oooooooooooooooooooooooooooooooooooooooooooooooooo

You wrote: "Is there a definitive list somewhere of which characters need to
be encoded and which do not?"

I would:
1. transform the text into an array of characters
2. see what the ASCII value of each character is
3. see if the ASCII value is < or > certain values
4. if so, see whether it is contained in the list I gave you
5. if it is, substitute

Zach.
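
Steps 1 to 3 above amount to a range check on each byte. A hedged sketch in C follows. It assumes the input is single-byte text (ISO-8859-1 or Windows-1252), which is an assumption about the source file and not something the steps themselves guarantee. The names byte_class, classify_byte and count_c1_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum byte_class {
    BYTE_ASCII,     /* 0x00..0x7F: identical in ASCII, ISO-8859-1, Windows-1252 */
    BYTE_C1_RANGE,  /* 0x80..0x9F: control codes in ISO-8859-1; Windows-1252
                       places smart quotes, dashes and similar here */
    BYTE_LATIN1     /* 0xA0..0xFF: printable ISO-8859-1 characters */
};

/* Step 2 and 3: look at the numeric value and compare it to fixed bounds. */
static enum byte_class classify_byte(unsigned char b)
{
    if (b < 0x80)
        return BYTE_ASCII;
    if (b < 0xA0)
        return BYTE_C1_RANGE;
    return BYTE_LATIN1;
}

/* Step 1: walk the text as an array of bytes and count those in
   0x80..0x9F, i.e. the ones that would need to be looked up and
   substituted before the page is served as ISO-8859-1. */
static size_t count_c1_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (classify_byte(buf[i]) == BYTE_C1_RANGE)
            n++;
    return n;
}
```

Only the middle class needs a lookup table in this scheme; bytes in the other two classes can be written out unchanged in an ISO-8859-1 page.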









 
Zach
 
      01-30-2009

"Ben C" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On 2009-01-30, Zach <(E-Mail Removed)> wrote:
>>
>> "JD" <(E-Mail Removed)> wrote in message
>> news:(E-Mail Removed)...
>>
>><< snipped >>
>>
>>>> I answered the guy's question.
>>>
>>> How, by supplying an indiscriminate list of character entity references?
>>> That's like giving somebody the entire alphabet when they ask which
>>> letters are vowels.

>>
>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>
>> Oh. Oh. If a response isn't to your liking, then say so politely.
>>
>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>
>> You wrote: "Is there a definitive list somewhere of which characters need
>> to be encoded and which do not?"
>>
>> I would:
>> 1. transform the text into an array of characters
>> 2. see what the ASCII value of each character is

>
> It might not have an ASCII value (nor even an ISO-8859-1 value) which is
> the whole problem.
>
>> 3. see if the ASCII value is < or > certain values

>
> If all the characters have ASCII values, then it is not necessary to
> check if they are outside any particular range-- the OP was using
> ISO-8859-1 of which ASCII is a subset.
>
>> 4. if so, see whether it is contained in the list I gave you
>> 5. if it is, substitute

>
> Then any character whose Unicode value is outside the range that
> ISO-8859-1 can encode needs to be substituted. There's no other list to
> check them against, unless you are thinking of using e.g. "&nbsp;" instead of
> " ", which is more readable. In that case I suppose you get the
> list from http://www.w3.org/TR/REC-html40/sgml/entities.html.




"the OP was using ISO-8859-1 "
Re: http://htmlhelp.com/reference/charset/
Sorry, I don't understand why character-for-character conversion wouldn't
work.

Zach.


 
Harlan Messinger
 
      01-30-2009
Zach wrote:
> "Ben C" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>> On 2009-01-30, Zach <(E-Mail Removed)> wrote:
>>> "JD" <(E-Mail Removed)> wrote in message
>>> news:(E-Mail Removed)...
>>>
>>> << snipped >>
>>>
>>>>> I answered the guy's question.
>>>> How, by supplying an indiscriminate list of character entity references?
>>>> That's like giving somebody the entire alphabet when they ask which
>>>> letters are vowels.
>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>>
>>> Oh. Oh. If a response isn't to your liking, then say so politely.
>>>
>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
>>>
>>> You wrote: "Is there a definitive list somewhere of which characters need
>>> to be encoded and which do not?"
>>>
>>> I would:
>>> 1. transform the text into an array of characters
>>> 2. see what the ASCII value of each character is

>> It might not have an ASCII value (nor even an ISO-8859-1 value) which is
>> the whole problem.
>>
>>> 3. see if the ASCII value is < or > certain values

>> If all the characters have ASCII values, then it is not necessary to
>> check if they are outside any particular range-- the OP was using
>> ISO-8859-1 of which ASCII is a subset.
>>
>>> 4. if so, see whether it is contained in the list I gave you
>>> 5. if it is, substitute

>> Then any character whose Unicode value is outside the range that
>> ISO-8859-1 can encode needs to be substituted. There's no other list to
>> check them against, unless you are thinking of using e.g. "&nbsp;" instead of
>> " ", which is more readable. In that case I suppose you get the
>> list from http://www.w3.org/TR/REC-html40/sgml/entities.html.

>
>
>
> "the OP was using ISO-8859-1 "
> Re: http://htmlhelp.com/reference/charset/
> Sorry, I don't understand why character-for-character conversion wouldn't
> work.


If the source is not encoded as ASCII and contains non-ASCII characters,
then an application that reads the source as though it *were* encoded as
ASCII *will not correctly read the non-ASCII characters*. It can't
convert them to anything if it can't read them.

The list you gave happens to have very little to do with the question
that was asked. It includes characters that are part of the ASCII encoding.
It also includes characters that aren't part of the ASCII encoding. It
also omits thousands of characters that aren't part of the ASCII
encoding. If the encoding to be used to store or transmit them is ASCII,
then all of them numbered above 127 have to be converted to an &
reference. If the encoding to be used is UTF-8 then none of them has to
be. For other encodings, the consequences vary.
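
The rule in the last paragraph can be sketched as a small C function. This is a sketch under assumptions: the text has already been decoded to Unicode code points, and the names target_encoding, needs_reference and format_reference are hypothetical. The ISO-8859-1 case follows the point quoted above that any character outside the range ISO-8859-1 can encode has to be substituted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum target_encoding { ENC_ASCII, ENC_ISO_8859_1, ENC_UTF_8 };

/* True if code point cp cannot be stored directly in enc and so has to be
   written as a character reference. Markup-significant characters
   (&, <, >, ") are a separate concern and are not handled here. */
static bool needs_reference(unsigned long cp, enum target_encoding enc)
{
    switch (enc) {
    case ENC_ASCII:      return cp > 0x7F;  /* everything above 127 */
    case ENC_ISO_8859_1: return cp > 0xFF;  /* everything above 255 */
    case ENC_UTF_8:      return false;      /* UTF-8 can encode them all */
    }
    return false;
}

/* Format the decimal numeric character reference for cp into buf.
   Returns the length written, or -1 if it does not fit in cap bytes. */
static int format_reference(unsigned long cp, char *buf, size_t cap)
{
    assert(buf != NULL);
    int n = snprintf(buf, cap, "&#%lu;", cp);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Under this rule the copyright sign (U+00A9) needs no reference in an ISO-8859-1 page, which matches JD's observation that the literal symbol validates, while a right single quotation mark (U+2019) would be written as &#8217; unless the page is UTF-8.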
 