UTF-8 question from Dive into Python 3

 
 
Adam Skutt
Guest
Posts: n/a
 
      01-19-2011
On Jan 19, 9:00 am, Tim Harig <(E-Mail Removed)> wrote:
>
> So, you can always assume a big-endian and things will work out correctly
> while you cannot always make the same assumption as little endian
> without potential issues. The same holds true for any byte stream data.


You need to spend some serious time programming a serial port or other
byte/bit-stream oriented interface, and then you'll realize the folly
of your statement.

> That is why I say that byte streams are essentially big endian. It is
> all a matter of how you look at it.


It is nothing of the sort. Some byte streams are, in fact, little
endian: when the bytes are combined into larger objects, the
least-significant byte in the object comes first. A lot of
industrial/embedded stuff has byte streams with the LSB leading in the
sequence; CAN comes to mind as an example.

The only way to know is for the standard describing the stream to tell
you what to do.
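
To make that concrete, here is a minimal Python sketch. The two byte values are made up for illustration and do not come from any real protocol. The same two bytes yield different integers depending on which byte order the (hypothetical) standard prescribes:

```python
# Two bytes as they might arrive on a wire; values are invented for illustration.
raw = bytes([0x12, 0x34])

# Big-endian: the first byte is the most significant one.
as_big = int.from_bytes(raw, byteorder="big")

# Little-endian: the first byte is the least significant one.
as_little = int.from_bytes(raw, byteorder="little")

print(hex(as_big), hex(as_little))
```

Nothing in the bytes themselves says which reading is the right one; only the stream's specification can.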

>
> I prefer to look at all data as endian even if it doesn't create
> endian issues because it forces me to consider any endian issues that
> might arise. If none do, I haven't really lost anything. If you simply
> assume that any byte sequence cannot have endian issues you ignore the
> possibility that such issues might arise.


No, you must assume nothing unless you're told how to combine the
bytes within a sequence into a larger element. Plus, not all byte
streams support such operations! Some byte streams really are just a
sequence of bytes and the bytes within the stream cannot be
meaningfully combined into larger data types. If I give you a series
of 8-bit (so 1 byte) samples from an analog-to-digital converter, tell
me how to combine them into a 16, 32, or 64-bit integer. You cannot
do it without altering the meaning of the samples; it is a completely
nonsensical operation.
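
A small Python sketch of the ADC case, assuming four invented 8-bit readings: each byte is already a complete value, and fusing pairs into 16-bit integers produces numbers that correspond to no sample at all.

```python
import struct

# Hypothetical 8-bit ADC readings; each byte is one complete sample.
samples = bytes([10, 200, 13, 197])

# The meaningful view: one integer per byte.
readings = list(samples)

# Forcing pairs of samples into little-endian 16-bit integers "works"
# mechanically, but the results no longer describe any measurement.
fused = struct.unpack("<2H", samples)

print(readings)
print(fused)
```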

Adam
 
 
 
 
 
Tim Harig
Guest
Posts: n/a
 
      01-19-2011
On 2011-01-19, Adam Skutt <(E-Mail Removed)> wrote:
> On Jan 19, 9:00 am, Tim Harig <(E-Mail Removed)> wrote:
>> That is why I say that byte streams are essentially big endian. It is
>> all a matter of how you look at it.

>
> It is nothing of the sort. Some byte streams are, in fact, little
> endian: when the bytes are combined into larger objects, the
> least-significant byte in the object comes first. A lot of
> industrial/embedded stuff has byte streams with the LSB leading in the
> sequence; CAN comes to mind as an example.


You are correct. Point well made.
 
 
 
 
 
Tim Harig
Guest
Posts: n/a
 
      01-19-2011
On 2011-01-19, Antoine Pitrou <(E-Mail Removed)> wrote:
> On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
> Tim Harig <(E-Mail Removed)> wrote:
>> UTF-8 has no apparent endianness if you only store it as a byte stream.
>> It does however have a byte order. If you store it using multibytes
>> (six bytes for all UTF-8 possibilities), which is useful if you want
>> to have one storage container for each letter as opposed to one for
>> each byte(1)

>
> That's a ridiculous proposition. Why would you waste so much space?


Space is only one tradeoff. There are many others to consider. I have
created data structures with much higher overhead than that because
they happen to make the problem easier and significantly faster for the
operations that I am performing on the data.

For many operations, it is just much faster and simpler to use a single
character-based container, as opposed to having to process an entire byte
stream to determine individual letters from the bytes, or to having
adaptive-size containers to store the data.

> UTF-8 exists *precisely* so that you can save space with most scripts.


UTF-8 has many reasons for existing. One of the biggest is that it
is compatible with tools that were designed to process ASCII and other
8-bit encodings.

> If you are ready to use 4+ bytes per character, just use UTF-32 which
> has much nicer properties.


I already mentioned UTF-32/UCS-4 as a probable alternative; but, I might
not want to have to worry about converting the encodings back and forth
before and after processing them. That said, and more importantly, many
variable length byte streams may not have alternate representations as
Unicode does.
 
 
Antoine Pitrou
Guest
Posts: n/a
 
      01-19-2011
On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
Tim Harig <(E-Mail Removed)> wrote:
>
> For many operations, it is just much faster and simpler to use a single
> character-based container, as opposed to having to process an entire byte
> stream to determine individual letters from the bytes, or to having
> adaptive-size containers to store the data.


You *have* to "process the entire byte stream" in order to determine
boundaries of individual letters from the bytes if you want to use a
"character based container", regardless of the exact representation.
Once you do that it shouldn't be very costly to compute the actual code
points. So, "much faster" sounds a bit dubious to me; especially if you
factor in the cost of memory allocation, and the fact that a larger
container will fit less easily in a data cache.
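
As a rough illustration of why the code points come almost for free once the boundaries are known, here is a hand-rolled decoder sketch. It assumes well-formed input and does no validation, and the name `decode_utf8` is mine; in real code you would simply call `bytes.decode("utf-8")`.

```python
def decode_utf8(data: bytes) -> list:
    """Decode well-formed UTF-8 into a list of code points (sketch, no validation)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Finding the boundary: the lead byte tells us the sequence length.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        else:  # assumed to be 0b11110xxx, since input is taken as well-formed
            length, cp = 4, lead & 0x07
        # Computing the code point: a shift and a mask per continuation byte.
        for b in data[i + 1 : i + length]:
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += length
    return code_points


sample = "aé€😀"
print(decode_utf8(sample.encode("utf-8")))
```

The branching happens while locating the boundary; the arithmetic that follows is only a handful of shifts and masks.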

> That said, and more importantly, many
> variable length byte streams may not have alternate representations as
> Unicode does.


This whole thread is about UTF-8 (see title) so I'm not sure what kind
of relevance this is supposed to have.


 
 
Tim Harig
Guest
Posts: n/a
 
      01-19-2011
On 2011-01-19, Antoine Pitrou <(E-Mail Removed)> wrote:
> On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
> Tim Harig <(E-Mail Removed)> wrote:
>>
>> For many operations, it is just much faster and simpler to use a single
>> character-based container, as opposed to having to process an entire byte
>> stream to determine individual letters from the bytes, or to having
>> adaptive-size containers to store the data.

>
> You *have* to "process the entire byte stream" in order to determine
> boundaries of individual letters from the bytes if you want to use a
> "character based container", regardless of the exact representation.


Right, but I only have to do that once. After that, I can directly address
any piece of the stream that I choose. If I leave the information as a
simple UTF-8 stream, I would have to walk the stream again: I would have to
walk through the first byte of every character from the beginning, making
sure that I counted each multibyte character only once, until I found
the character that I actually wanted. Converting to a fixed byte
representation (UTF-32/UCS-4) or separating all of the bytes for each
UTF-8 character into 6-byte containers both make it possible to simply index
the letters by a constant size. You will note that Python does the former.
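
A sketch of the difference, with a hypothetical helper `nth_char_utf8` and an arbitrary sample string. In raw UTF-8 the n-th character is found by counting lead bytes from the start, while in a fixed-width encoding such as UTF-32 it is a plain slice:

```python
def nth_char_utf8(data: bytes, n: int) -> bytes:
    """Return the bytes of the n-th character (0-based) by walking the stream."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    seen = -1      # index of the most recent character start
    begin = None   # byte offset where character n starts
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte, so a new character
            seen += 1
            if seen == n:
                begin = i
            elif seen == n + 1:
                return data[begin:i]
    if begin is None:
        raise IndexError("character index out of range")
    return data[begin:]  # character n was the last one in the stream


text = "naïve café"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # fixed 4 bytes per code point, no BOM

third_from_utf8 = nth_char_utf8(utf8, 2).decode("utf-8")      # linear walk
third_from_utf32 = utf32[4 * 2 : 4 * 3].decode("utf-32-le")   # constant-time slice
print(third_from_utf8, third_from_utf32, text[2])
```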

UTF-32/UCS-4 conversion is definitely superior if you are actually
doing any major processing, but it adds the complexity and overhead of
requiring the bit twiddling to make the conversions (once in, once again out).
Some programs don't really care enough about what the data actually
contains to make it worthwhile. They just want to be able to use the
characters as black boxes.

> Once you do that it shouldn't be very costly to compute the actual code
> points. So, "much faster" sounds a bit dubious to me; especially if you


You could, I suppose, keep a separate list of pointers to each letter so that
you could use the pointer list for indexing, or keep a list of the
character sizes so that you can add them and calculate the variable-width
index; but that adds overhead as well.
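
For what it is worth, the pointer-list idea might look like the sketch below. The names `char_offsets` and `char_at` are invented for illustration. One pass records where each character starts, and indexing then becomes a list lookup at the cost of one extra integer per character:

```python
def char_offsets(data: bytes) -> list:
    """Byte offsets at which each UTF-8 character starts (one pass over the data)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, offsets: list, n: int) -> bytes:
    """Bytes of character n, using the precomputed offset table."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return data[start:end]


utf8 = "naïve café".encode("utf-8")
offsets = char_offsets(utf8)          # the overhead: one int per character
print(offsets)
print(char_at(utf8, offsets, 2).decode("utf-8"))
```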
 
 
Antoine Pitrou
Guest
Posts: n/a
 
      01-19-2011
On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
Tim Harig <(E-Mail Removed)> wrote:
> On 2011-01-19, Antoine Pitrou <(E-Mail Removed)> wrote:
> > On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
> > Tim Harig <(E-Mail Removed)> wrote:
> >>
> >> For many operations, it is just much faster and simpler to use a single
> >> character-based container, as opposed to having to process an entire byte
> >> stream to determine individual letters from the bytes, or to having
> >> adaptive-size containers to store the data.

> >
> > You *have* to "process the entire byte stream" in order to determine
> > boundaries of individual letters from the bytes if you want to use a
> > "character based container", regardless of the exact representation.

>
> Right, but I only have to do that once.


You only have to decode once as well.

> If I leave the information as a
> simple UTF-8 stream,


That's not what we are talking about. We are talking about the supposed
benefits of your 6-byte representation scheme versus proper decoding
into fixed width code points.

> UTF-32/UCS-4 conversion is definitely superior if you are actually
> doing any major processing, but it adds the complexity and overhead of
> requiring the bit twiddling to make the conversions (once in, once again out).


"Bit twiddling" is not something processors are particularly bad at.
Actually, modern processors are much better at arithmetic and logic
than at recovering from mispredicted branches, which seems to suggest
that discovering boundaries probably eats most of the CPU cycles.

> Converting to a fixed byte
> representation (UTF-32/UCS-4) or separating all of the bytes for each
> UTF-8 into 6 byte containers both make it possible to simply index the
> letters by a constant size. You will note that Python does the
> former.


Indeed, Python chose the wise option. Actually, I'd be curious about any
real-world software which successfully chose your proposed approach.


 
 
Tim Harig
Guest
Posts: n/a
 
      01-19-2011
On 2011-01-19, Antoine Pitrou <(E-Mail Removed)> wrote:
> On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
> Tim Harig <(E-Mail Removed)> wrote:
>> Converting to a fixed byte
>> representation (UTF-32/UCS-4) or separating all of the bytes for each
>> UTF-8 into 6 byte containers both make it possible to simply index the
>> letters by a constant size. You will note that Python does the
>> former.

>
> Indeed, Python chose the wise option. Actually, I'd be curious of any
> real-world software which successfully chose your proposed approach.


The point is basically the same. I created an example because it
was simpler to follow for demonstration purposes than an actual UTF-8
conversion to any official multibyte format. You obviously have no
other purpose than to be contrary, so we ended up following tangents.

As soon as you start to convert to a multibyte format the endian issues
occur. For UTF-8 on big endian hardware, this is anti-climactic because
all of the bits are already stored in proper order. Little endian systems
will probably convert to a native endian format. If you choose
to ignore that, that is your prerogative. Have a nice day.
 
 
Antoine Pitrou
Guest
Posts: n/a
 
      01-19-2011
On Wed, 19 Jan 2011 19:18:49 +0000 (UTC)
Tim Harig <(E-Mail Removed)> wrote:
> On 2011-01-19, Antoine Pitrou <(E-Mail Removed)> wrote:
> > On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
> > Tim Harig <(E-Mail Removed)> wrote:
> >> Converting to a fixed byte
> >> representation (UTF-32/UCS-4) or separating all of the bytes for each
> >> UTF-8 into 6 byte containers both make it possible to simply index the
> >> letters by a constant size. You will note that Python does the
> >> former.

> >
> > Indeed, Python chose the wise option. Actually, I'd be curious of any
> > real-world software which successfully chose your proposed approach.

>
> The point is basically the same. I created an example because it
> was simpler to follow for demonstration purposes than an actual UTF-8
> conversion to any official multibyte format. You obviously have no
> other purpose than to be contrary [...]


Right. You were the one who jumped in and tried to lecture everyone on
how UTF-8 was "big-endian", and now you are abandoning the one esoteric
argument you found in support of that.

> As soon as you start to convert to a multibyte format the endian issues
> occur.


Ok. Good luck with your "endian issues" which don't exist.
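
A quick way to check this in Python, using an arbitrary two-character string: the UTF-8 bytes are a single fixed sequence, whereas UTF-16 has two serializations that differ only in byte order. Once decoded, the code points are plain integers, and how the machine stores those integers is not visible to the program.

```python
text = "é€"

# UTF-8: one byte sequence, defined by the encoding itself.
utf8 = text.encode("utf-8")

# UTF-16: two possible serializations, so the byte order must be stated.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Decoding yields integers; no byte order is attached to them.
code_points = [ord(c) for c in utf8.decode("utf-8")]

print(utf8, utf16_le, utf16_be, [hex(cp) for cp in code_points])
```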


 
 
Terry Reedy
Guest
Posts: n/a
 
      01-19-2011
On 1/19/2011 1:02 PM, Tim Harig wrote:

> Right, but I only have to do that once. After that, I can directly address
> any piece of the stream that I choose. If I leave the information as a
> simple UTF-8 stream, I would have to walk the stream again, I would have to
> walk through the first byte of all the characters from the beginning to
> make sure that I was only counting multibyte characters once until I found
> the character that I actually wanted. Converting to a fixed byte
> representation (UTF-32/UCS-4) or separating all of the bytes for each
> UTF-8 into 6 byte containers both make it possible to simply index the
> letters by a constant size. You will note that Python does the former.


The idea of using a custom fixed-width padded version of a UTF-8 stream
was initially shocking to me, but I can imagine that there are
specialized applications, which slice and dice uninterpreted segments,
for which that is appropriate. However, it is not germane to the folly
of prefixing standard UTF-8 streams with a 3-byte magic number,
mislabelled a 'byte-order mark', thus making them non-standard.
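
For anyone who wants to see the three bytes in question, a minimal sketch with Python's standard codecs (the payload "hi" is arbitrary): the plain `utf-8` codec keeps the prefix as a U+FEFF character, while `utf-8-sig` strips it on decoding.

```python
import codecs

# The 3-byte prefix under discussion, followed by an arbitrary payload.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# The plain codec treats the prefix as an ordinary U+FEFF character.
plain = data.decode("utf-8")

# The "utf-8-sig" codec drops a leading signature if one is present.
stripped = data.decode("utf-8-sig")

print(codecs.BOM_UTF8, repr(plain), repr(stripped))
```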

--
Terry Jan Reedy

 
 
jmfauth
Guest
Posts: n/a
 
      01-20-2011
On Jan 19, 11:33 pm, Terry Reedy <(E-Mail Removed)> wrote:
> On 1/19/2011 1:02 PM, Tim Harig wrote:
>
> > Right, but I only have to do that once. After that, I can directly address
> > any piece of the stream that I choose. If I leave the information as a
> > simple UTF-8 stream, I would have to walk the stream again, I would have to
> > walk through the first byte of all the characters from the beginning to
> > make sure that I was only counting multibyte characters once until I found
> > the character that I actually wanted. Converting to a fixed byte
> > representation (UTF-32/UCS-4) or separating all of the bytes for each
> > UTF-8 into 6 byte containers both make it possible to simply index the
> > letters by a constant size. You will note that Python does the former.

>
> The idea of using a custom fixed-width padded version of a UTF-8 stream
> was initially shocking to me, but I can imagine that there are
> specialized applications, which slice and dice uninterpreted segments,
> for which that is appropriate. However, it is not germane to the folly
> of prefixing standard UTF-8 streams with a 3-byte magic number,
> mislabelled a 'byte-order mark', thus making them non-standard.
>



Unicode Book, 5.2.0, Chapter 2, Section 14, Page 51 - Paragraph
*Unicode Signature*.
 