newbie question about character encoding: what does 0xC0 0x8A have in common with 0xE0 0x80 0x8A???

 
 
Jake Barnes
      11-17-2005

I'm afraid the below is almost gibberish to me. What do these 5
formulations have in common? Is it true that they all specify the same
character? How is that possible?


====================================

http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs

An important note for developers of UTF-8 decoding routines: For
security reasons, a UTF-8 decoder must not accept UTF-8 sequences that
are longer than necessary to encode a character. For example, the
character U+000A (line feed) must be accepted from a UTF-8 stream only
in the form 0x0A, but not in any of the following five possible
overlong forms:

0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A

 
Jukka K. Korpela
      11-17-2005
"Jake Barnes" <(E-Mail Removed)> wrote:

> I'm afraid the below is almost gibberish to me.


Is it relevant to you? Is it an XML issue?

> An important note for developers of UTF-8 decoding routines:


Are you developing a UTF-8 decoder? How does that relate to XML?
(XML can be UTF-8 encoded, and often is, but so what?)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
 
Richard Tobin
      11-17-2005
In article <(E-Mail Removed)>,
Jake Barnes <(E-Mail Removed)> wrote:

>I'm afraid the below is almost gibberish to me. What do these 5
>formulations have in common? Is it true that they all specify the same
>character? How is that possible?


UTF-8 represents Unicode characters as variable-length sequences of
bytes, with smaller Unicode numbers having shorter sequences.

Characters below 0x80 (those requiring at most 7 bits, the ASCII
characters) are represented as a single byte, and are the same as in
their ASCII representations. So the example you quote, the line feed
character, is represented as 0x0A.

Characters from 0x80 to 0x7FF (those requiring between 8 and 11 bits)
are represented by two bytes. In binary, the bytes are 110xxxxx
10xxxxxx, the 11 bits being distributed with the high-order 5 in the
first byte and the low-order 6 in the second byte.

To put it another way, a character c is represented as 0xC0 + (c >> 6)
followed by 0x80 + (c & 0x3F).
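
As a rough illustration of that formula, here is a minimal Python sketch. The function name encode_two_byte is made up for this example and does not come from any library. It only packs the bits; it makes no attempt to check that the two-byte form is the right length for the character.

```python
def encode_two_byte(c: int) -> bytes:
    """Pack a code point into the two-byte layout 110xxxxx 10xxxxxx.

    Hypothetical helper for illustration only: it does not check
    whether a shorter form exists for c.
    """
    if not 0 <= c <= 0x7FF:
        raise ValueError("the two-byte form only has room for 11 bits")
    first = 0xC0 + (c >> 6)     # high-order 5 bits go in the first byte
    second = 0x80 + (c & 0x3F)  # low-order 6 bits go in the second byte
    return bytes([first, second])


# U+00E9 (e with acute accent) lies in the 0x80..0x7FF range, so the
# result should match what Python's own UTF-8 encoder produces.
print(encode_two_byte(0xE9) == "\u00e9".encode("utf-8"))  # True
```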

Now you *could* represent 0x0A in this two-byte form, as 11000000
10001010 (0xC0 0x8A), but UTF-8 says that you must not do this: you
must use the single byte version. And a UTF-8 decoder must give an
error if it encounters a linefeed encoded as 0xC0 0x8A.
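
You can see this behaviour in an existing decoder. The sketch below uses Python's built-in strict UTF-8 codec, which is my choice for illustration; the helper name decodes_as_utf8 is made up here.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(b"\x0a"))      # True: the one-byte line feed
print(decodes_as_utf8(b"\xc0\x8a"))  # False: the overlong two-byte form
```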

Similarly, characters from 0x800 to 0xFFFF (those requiring between 12
and 16 bits) are represented by three bytes. In binary, the bytes are
1110xxxx 10xxxxxx 10xxxxxx, with 4 of the 16 bits in the first byte and 6
in each of the second and third.
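
The same bit-packing can be sketched in Python for the three-byte layout. As before, encode_three_byte is a hypothetical name used only for this example, and the function does not check whether a shorter form would do.

```python
def encode_three_byte(c: int) -> bytes:
    """Pack a code point into 1110xxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper for illustration only: it does not check
    whether a shorter form exists for c.
    """
    if not 0 <= c <= 0xFFFF:
        raise ValueError("the three-byte form only has room for 16 bits")
    return bytes([
        0xE0 + (c >> 12),          # high-order 4 bits
        0x80 + ((c >> 6) & 0x3F),  # middle 6 bits
        0x80 + (c & 0x3F),         # low-order 6 bits
    ])


# U+20AC (euro sign) lies in the 0x800..0xFFFF range.
print(encode_three_byte(0x20AC) == "\u20ac".encode("utf-8"))  # True
```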

Again you *could* represent 0x0A in this three-byte form, as 11100000
10000000 10001010 (0xE0 0x80 0x8A), but again UTF-8 says you must not.

And so on. Each length of UTF-8 sequence has enough bits to represent
all the characters from zero to some limit, but it must only be used
for representing the characters that can't be represented by a shorter
sequence.
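
To make the rule concrete, here is a hedged Python sketch of a decoder for a single sequence. The names PATTERNS and decode_one are made up for this example. It covers the 5- and 6-byte forms only because they appear in the passage you quoted; the current UTF-8 definition (RFC 3629) stops at 4 bytes. It also skips other checks a real decoder needs, such as rejecting surrogates. With strict=False it shows that all the overlong forms carry the same bits as 0x0A; with strict=True it applies the "shortest form only" check.

```python
# One row per sequence length:
# (lead-byte mask, expected value, payload mask, length, smallest code point)
PATTERNS = [
    (0x80, 0x00, 0x7F, 1, 0x0),
    (0xE0, 0xC0, 0x1F, 2, 0x80),
    (0xF0, 0xE0, 0x0F, 3, 0x800),
    (0xF8, 0xF0, 0x07, 4, 0x10000),
    (0xFC, 0xF8, 0x03, 5, 0x200000),
    (0xFE, 0xFC, 0x01, 6, 0x4000000),
]


def decode_one(seq: bytes, strict: bool = True) -> int:
    """Decode exactly one UTF-8-style sequence into a code point."""
    lead = seq[0]
    for mask, value, payload, length, minimum in PATTERNS:
        if (lead & mask) == value:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    code = lead & payload
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (b & 0x3F)
    # The check discussed above: a shorter sequence could have held it.
    if strict and code < minimum:
        raise ValueError("overlong sequence")
    return code


overlong = [
    bytes.fromhex("C08A"),
    bytes.fromhex("E0808A"),
    bytes.fromhex("F080808A"),
    bytes.fromhex("F88080808A"),
    bytes.fromhex("FC808080808A"),
]
# Without the check, every form yields the line feed, 0x0A.
print(all(decode_one(s, strict=False) == 0x0A for s in overlong))  # True
```

With the default strict=True, each of those five inputs raises ValueError, and only the single byte 0x0A decodes to the line feed.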

-- Richard
 
Ian Rastall
      11-18-2005
On Thu, 17 Nov 2005 23:11:55 +0000 (UTC), "Jukka K. Korpela"
<(E-Mail Removed)> wrote:

>How does that relate to XML?


Jukka's being ornery, but he does have an excellent introduction to
character code issues here: http://www.cs.tut.fi/~jkorpela/chars.html

Hope that's helpful, although the word "newbie" doesn't usually refer
to someone who is wondering how six different hexadecimal numbers can
refer to the same character, so maybe the link is of no use.

Ian
 
Jake Barnes
      12-05-2005

Jukka K. Korpela wrote:
> "Jake Barnes" <(E-Mail Removed)> wrote:
>
> > I'm afraid the below is almost gibberish to me.

>
> Is it relevant to you? Is it an XML issue?
>
> > An important note for developers of UTF-8 decoding routines:

>
> Are you developing a UTF-8 decoder? How does that relate to XML?
> (XML can be UTF-8 encoded, and often is, but so what?)


I wrote a PHP script to generate an RSS feed from some weblog entries,
but the XML dies because there are garbage characters in the feed. Some
people using the weblog script have been typing their entries in
Microsoft Word or other word processors, and then copying and pasting
the text into the weblogs. I was trying to figure out how to clean up
the feed. To do so, I've been forced to study character encoding
issues.

 