Velocity Reviews


jmfauth 10-12-2010 01:28 PM

Wrong default endianness in utf-16 and utf-32 !?
 
I hope my understanding is correct and I'm not dreaming.

When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

(+ technical papers)

It appears Python is just working in the opposite way.

>>> sys.version
2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
>>> repr(u'abc'.encode('utf-16-le'))
'a\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16-be'))
'\x00a\x00b\x00c'
>>> repr(u'abc'.encode('utf-16'))
'\xff\xfea\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

Ditto with utf-32 and with utf-16/utf-32 in Python 3.1.2
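
For readers on Python 3, where encode() returns bytes, the same check can be sketched as below. This is an illustration added to the thread, not part of the original post; the names native_suffix and results are made up for the example. On CPython it is expected to report True for both codecs, because the generic encoders write in the host's byte order.

```python
import sys

# 'le' on little-endian hosts, 'be' on big-endian hosts.
native_suffix = 'le' if sys.byteorder == 'little' else 'be'

text = 'abc'
results = {}
for name, bom_len in (('utf-16', 2), ('utf-32', 4)):
    encoded = text.encode(name)
    payload = encoded[bom_len:]  # drop the leading BOM
    explicit = text.encode(name + '-' + native_suffix)
    # True when the generic codec used the host's byte order.
    results[name] = (payload == explicit)

print(sys.byteorder, results)
```

Comparing against sys.byteorder rather than a hard-coded '-le' keeps the check meaningful on big-endian hosts as well.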

I attempted to find some precise discussions on that subject
and I failed.

Any thoughts?


Antoine Pitrou 10-12-2010 01:47 PM

Re: Wrong default endianness in utf-16 and utf-32 !?
 
On Tue, 12 Oct 2010 06:28:23 -0700 (PDT)
jmfauth <wxjmfauth@gmail.com> wrote:

> I hope my understanding is correct and I'm not dreaming.
>
> When an endianness is not specified (BE, LE, unmarked forms),
> the Unicode Consortium specifies that the default byte serialization
> should be big-endian.
>
> [...]
>
> It appears Python is just working in the opposite way.
>
> [...]
> >>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
> True


Python uses the host's endianness by default. So, on a little-endian
machine, utf-16 and utf-32 will use little-endian encoding.
While decoding, though, the BOM is read by both of these codecs, so
there should be no interoperability problems:

>>> '\xff\xfea\x00b\x00c\x00'.decode('utf-16')
u'abc'
>>> '\xfe\xff\x00a\x00b\x00c'.decode('utf-16')
u'abc'


(do note, though, that the explicit utf*-be and utf*-le variants do not
add a BOM)
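
A small Python 3 sketch of both points follows. It is an added illustration, not part of the original reply, and the variable names are made up. The generic codec picks the byte order from the BOM, while the explicit variants neither write a BOM when encoding nor strip one when decoding.

```python
import codecs

# Build BOM-prefixed streams in both byte orders, independent of the host.
le_bytes = codecs.BOM_UTF16_LE + 'abc'.encode('utf-16-le')
be_bytes = codecs.BOM_UTF16_BE + 'abc'.encode('utf-16-be')

# The generic codec reads the BOM and selects the byte order from it.
decoded_le = le_bytes.decode('utf-16')
decoded_be = be_bytes.decode('utf-16')

# Explicit variants do not emit a BOM when encoding...
no_bom = 'abc'.encode('utf-16-be')
# ...and do not strip one when decoding: it survives as U+FEFF.
kept_bom = be_bytes.decode('utf-16-be')

print(decoded_le, decoded_be, no_bom, repr(kept_bom))
```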

Regards

Antoine.



jmfauth 10-12-2010 02:49 PM

Re: Wrong default endianness in utf-16 and utf-32 !?
 
On 12 oct, 15:47, Antoine Pitrou <solip...@pitrou.net> wrote:
> On Tue, 12 Oct 2010 06:28:23 -0700 (PDT)
>
> Python uses the host's endianness by default. So, on a little-endian
> machine, utf-16 and utf-32 will use little-endian encoding.

Thanks. I had never been aware of this.

John Machin 10-12-2010 08:00 PM

Re: Wrong default endianness in utf-16 and utf-32 !?
 
jmfauth <wxjmfauth <at> gmail.com> writes:

> When an endianness is not specified (BE, LE, unmarked forms),
> the Unicode Consortium specifies that the default byte serialization
> should be big-endian.
>
> See http://www.unicode.org/faq//utf_bom.html
> Q: Which of the UTFs do I need to support?
> and
> Q: Why do some of the UTFs have a BE or LE in their label,
> such as UTF-16LE?


Sometimes it is necessary to read right to the end of an answer:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: [snip] the unmarked form uses big-endian byte serialization by default, but
may include a byte order mark at the beginning to indicate the actual byte
serialization used.
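
If big-endian output with a BOM is wanted regardless of the host, one option is to prepend the BOM to the explicit big-endian encoding. The following Python 3 sketch is an added illustration; the function name encode_utf16_be_with_bom is hypothetical and not part of Python.

```python
import codecs

def encode_utf16_be_with_bom(text):
    """Return big-endian UTF-16 bytes preceded by a byte order mark."""
    return codecs.BOM_UTF16_BE + text.encode('utf-16-be')

data = encode_utf16_be_with_bom('abc')
print(data)                   # b'\xfe\xff\x00a\x00b\x00c'
print(data.decode('utf-16'))  # abc
```

Because the stream carries a BOM, the generic utf-16 decoder reads it back correctly on hosts of either byte order.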


jmfauth 10-13-2010 07:07 AM

Re: Wrong default endianness in utf-16 and utf-32 !?
 
On 12 oct, 22:00, John Machin <sjmac...@lexicon.net> wrote:
> jmfauth <wxjmfauth <at> gmail.com> writes:
>
> > When an endianness is not specified (BE, LE, unmarked forms),
> > the Unicode Consortium specifies that the default byte serialization
> > should be big-endian.
> >
> > See http://www.unicode.org/faq//utf_bom.html
> > Q: Which of the UTFs do I need to support?
> > and
> > Q: Why do some of the UTFs have a BE or LE in their label,
> > such as UTF-16LE?
>
> Sometimes it is necessary to read right to the end of an answer:
>
> Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?
>
> A: [snip] the unmarked form uses big-endian byte serialization by default, but
> may include a byte order mark at the beginning to indicate the actual byte
> serialization used.




Well, English is not my native language; however, I think I read it
correctly.

My question had nothing to do with the BOM, the encoding/decoding,
or the BOM inclusion. My question was:

"What should I understand by 'utf-16'? 'utf-16-le' or 'utf-16-be'?"

And Antoine gave an answer.


