Problem with sets and Unicode strings

 
 
Dennis Benzinger
 
      06-27-2006
Hi!

The following program in a UTF-8 encoded file:


# -*- coding: UTF-8 -*-

FIELDS = ("Fächer", )
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

print u"Fächer" in FROZEN_FIELDS
print u"Fächer" in FIELDS_SET
print u"Fächer" in FIELDS


gives this output


False
False
Traceback (most recent call last):
  File "test.py", line 9, in ?
    print u"Fächer" in FIELDS
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)


Why do the first two print statements succeed while the third one fails
with an exception?

Why does the use of set/frozenset remove the exception?


Thanks,
Dennis
 
Serge Orlov
 
      06-27-2006
On 6/27/06, Dennis Benzinger wrote:
> Why do the first two print statements succeed and the third one fails
> with an exception?


Actually, all three statements fail to produce the correct result.

> Why does the use of set/frozenset remove the exception?


Because sets use a hash algorithm to find matches, whereas the last
statement directly compares a unicode string with a byte string. Byte
strings can only contain ASCII characters; that's why Python raises an
exception. The problem is very easy to fix: use unicode strings for
all non-ASCII strings.
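
A minimal sketch of that fix, assuming the only change needed is the u prefix on the literal stored in FIELDS. The results list is an addition for illustration and is not part of the original program. The sketch is written so that it should run under both Python 2 and Python 3.3+:

```python
# -*- coding: UTF-8 -*-

# Same program as in the question, but the stored literal is now a
# unicode string, so both sides of every comparison have the same type.
FIELDS = (u"Fächer",)
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

# "results" is a hypothetical name used only in this sketch.
results = [
    u"Fächer" in FROZEN_FIELDS,
    u"Fächer" in FIELDS_SET,
    u"Fächer" in FIELDS,
]
print(results)
```

With unicode on both sides, no implicit decoding is needed and all three membership tests should agree.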
 
Dennis Benzinger
 
      06-27-2006
Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger wrote:
>> Why do the first two print statements succeed and the third one fails
>> with an exception?

>
> Actually all three statements fail to produce correct result.


So this is a bug in Python?

>> Why does the use of set/frozenset remove the exception?
>
> Because sets use hash algorithm to find matches, whereas the last
> statement directly compares a unicode string with a byte string. Byte
> strings can only contain ascii characters, that's why python raises an
> exception. The problem is very easy to fix: use unicode strings for
> all non-ascii strings.


No, byte strings contain characters which are at least 8-bit wide
<http://docs.python.org/ref/types.html>. But I don't understand what
Python is trying to decode and why the exception says something about
the ASCII codec, because my file is encoded with UTF-8.


Dennis
 
Serge Orlov
 
      06-27-2006
On 6/27/06, Dennis Benzinger wrote:
> Serge Orlov wrote:
> > On 6/27/06, Dennis Benzinger wrote:
> >> Why do the first two print statements succeed and the third one fails
> >> with an exception?

> >
> > Actually all three statements fail to produce correct result.

>
> So this is a bug in Python?


No.

> > > Why does the use of set/frozenset remove the exception?
> >
> > Because sets use hash algorithm to find matches, whereas the last
> > statement directly compares a unicode string with a byte string. Byte
> > strings can only contain ascii characters, that's why python raises an
> > exception. The problem is very easy to fix: use unicode strings for
> > all non-ascii strings.

>
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>.


Yes, but later it's written that non-ASCII characters do not have
a universal meaning assigned to them. In other words, if you put byte
0xE4 into a byte string, all Python knows about it is that it's *some*
character. If you put character U+00E4 into a unicode string, Python
knows it's a "latin small letter a with diaeresis". Trying to compare
*some* character with a specific character is obviously undefined.

> But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.


Because byte strings can come from different sources (network, files,
etc.), not only from the source of your program, Python cannot assume
all of them are UTF-8. It assumes they are ASCII, because most
widespread text encodings are ASCII-based. Actually it's a guess,
since there are UTF-16, UTF-32 and other non-ASCII encodings. If you
want to experience life without guesses, put
sys.setdefaultencoding("undefined") into site.py.
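
The guess described above can be made visible by spelling out the decoding step by hand instead of relying on implicit coercion. This is only a sketch: it assumes Python 2.6+ or Python 3 (for the b"" literal and the `except ... as` syntax), and the variable names are illustrative.

```python
# The UTF-8 encoded bytes of "Fächer", as they sit in the byte string.
raw = b"F\xc3\xa4cher"

# Decoding with the ASCII codec fails on the first non-ASCII byte
# (0xc3, at position 1), which is what the traceback reports.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the real encoding gives the intended six-character text.
as_utf8 = raw.decode("utf-8")

# A wrong but "successful" guess: Latin-1 maps every byte to some
# character, so the result has seven characters and is not the same text.
as_latin1 = raw.decode("latin-1")

print(ascii_error)
print(as_utf8 == u"F\xe4cher", as_utf8 == as_latin1)
```

Decoding explicitly at the point where bytes enter the program avoids depending on the default encoding at all.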
 
Robert Kern
 
      06-27-2006
Dennis Benzinger wrote:
> Serge Orlov wrote:
>> On 6/27/06, Dennis Benzinger wrote:
>>> Why do the first two print statements succeed and the third one fails
>>> with an exception?

>> Actually all three statements fail to produce correct result.

>
> So this is a bug in Python?


No.

>>> Why does the use of set/frozenset remove the exception?
>>
>> Because sets use hash algorithm to find matches, whereas the last
>> statement directly compares a unicode string with a byte string. Byte
>> strings can only contain ascii characters, that's why python raises an
>> exception. The problem is very easy to fix: use unicode strings for
>> all non-ascii strings.

>
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>. But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.


Please read

http://www.amk.ca/python/howto/unicode

The string in all of the containers (FIELDS, FROZEN_FIELDS, FIELDS_SET) is a
regular byte string, not a Unicode string. The encoding declaration only
controls how the file is parsed. The string literal that you use for FIELDS is a
regular string literal, not a Unicode string literal, so the object it creates
is an 8-bit byte string. The tuple containment test is attempting to compare
your Unicode string object to the regular string object for equality. Python
does these comparisons by attempting to decode the regular string into a Unicode
string. Since there is no encoding information present on regular strings at
this point (since the encoding declaration in your file only controls parsing,
nothing else), Python assumes ASCII and throws an exception otherwise.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

 
Laurent Pointal
 
      06-28-2006
Dennis Benzinger wrote:
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>. But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.


[addendum to the other replies]

The file encoding directive is used by Python to convert u"xxx" strings
into unicode objects using the right conversion rules when compiling the code.
When a string is written simply as "xxx", it's an 8-bit string with NO
encoding data associated. When these strings must be converted, they are
assumed to use sys.getdefaultencoding() [generally ascii -
forced ascii in Python 2.5]

So a short reply: the UTF-8 directive has no effect on 8-bit strings;
use unicode strings to handle non-ASCII text correctly.

A+

Laurent.

 
Dennis Benzinger
 
      06-28-2006
Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger wrote:
>> Serge Orlov wrote:
>> > On 6/27/06, Dennis Benzinger wrote:
>> >> Why do the first two print statements succeed and the third one fails
>> >> with an exception?
>> >
>> > Actually all three statements fail to produce correct result.
>>
>> So this is a bug in Python?
>
> No.
>
>> >> Why does the use of set/frozenset remove the exception?
>> >
>> > Because sets use hash algorithm to find matches, whereas the last
>> > statement directly compares a unicode string with a byte string. Byte
>> > strings can only contain ascii characters, that's why python raises an
>> > exception. The problem is very easy to fix: use unicode strings for
>> > all non-ascii strings.
>>
>> No, byte strings contain characters which are at least 8-bit wide
>> <http://docs.python.org/ref/types.html>.
>
> Yes, but later it's written that non-ascii characters do not have
> universal meaning assigned to them. In other words if you put byte
> 0xE4 into a bytes string all python knows about it is that it's *some*
> character. If you put character U+00E4 into a unicode string python
> knows it's a "latin small letter a with diaeresis". Trying to compare
> *some* character with a specific character is obviously undefined.
> [...]


But <http://docs.python.org/ref/comparisons.html> says:

Strings are compared lexicographically using the numeric equivalents
(the result of the built-in function ord()) of their characters. Unicode
and 8-bit strings are fully interoperable in this behavior.

Doesn't this mean that Unicode and 8-bit strings can be compared and
that this comparison is well defined? (even if it is not meaningful)


Thanks for your answers,
Dennis
 
Dennis Benzinger
 
      06-28-2006
Robert Kern wrote:
> Dennis Benzinger wrote:
>> Serge Orlov wrote:
>>> On 6/27/06, Dennis Benzinger wrote:
>>>> Why do the first two print statements succeed and the third one fails
>>>> with an exception?
>>> Actually all three statements fail to produce correct result.

>>
>> So this is a bug in Python?

>
> No.
> [...]


But I'd say that it's not intuitive that for sets x in y can be false
(without raising an exception!) while doing the same with a tuple
raises an exception. Where is this difference documented?


Thanks,
Dennis
 
Diez B. Roggisch
 
      06-28-2006
> But <http://docs.python.org/ref/comparisons.html> says:
>
> Strings are compared lexicographically using the numeric equivalents
> (the result of the built-in function ord()) of their characters. Unicode
> and 8-bit strings are fully interoperable in this behavior.
>
> Doesn't this mean that Unicode and 8-bit strings can be compared and
> this comparison is well defined? (even if it is not meaningful)


Obviously not - otherwise you wouldn't have the problems you observed,
would you?

What happens, of course, is that in a string-to-unicode comparison the
string gets coerced to a unicode value - using the default encoding!


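# -*- coding: latin1 -*-

# Explicit decode with the matching codec: both sides are unicode.
print "ö".decode("latin1") == u"ö"
# Implicit coercion uses the default (ASCII) codec, which cannot
# decode the non-ASCII byte in the 8-bit string.
print "ö" == u"ö"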



So - they are fully interoperable and the comparison is well defined - when
the coercion is successful.

Diez
 
Diez B. Roggisch
 
      06-28-2006
> But I'd say that it's not intuitive that for sets x in y can be false
> (without raising an exception!) while doing the same with a tuple
> raises an exception. Where is this difference documented?


2.3.7 Set Types -- set, frozenset

....

Set elements are like dictionary keys; they need to define both __hash__ and
__eq__ methods.
....

And it has to hold that

a == b => hash(a) == hash(b)

but NOT

hash(a) == hash(b) => a == b

Thus if the hashes vary, the set doesn't bother to actually compare the
values.
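
The difference between set and tuple membership can be observed with a small experiment. This is only a sketch relying on CPython's behaviour as I understand it; the Probe class and its hash_value parameter are hypothetical helpers invented for this illustration, not anything from the standard library.

```python
class Probe(object):
    """Hypothetical helper that counts how often __eq__ is called."""

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value
        self.eq_calls = 0

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        self.eq_calls += 1
        return self.value == getattr(other, "value", None)


stored = Probe("x", hash_value=1)
lookup = Probe("x", hash_value=2)  # equal value, deliberately different hash

# Set lookup goes by hash first; with differing hashes the elements
# are not compared at all, so the "equal" element is not found.
in_set = lookup in set([stored])
set_eq_calls = stored.eq_calls + lookup.eq_calls

# Tuple lookup walks the elements and compares each one with ==.
in_tuple = lookup in (stored,)
tuple_eq_calls = stored.eq_calls + lookup.eq_calls - set_eq_calls

print(in_set, set_eq_calls, in_tuple, tuple_eq_calls)
```

Probe deliberately breaks the rule a == b => hash(a) == hash(b). The byte string and the unicode string in the original program end up in the same situation: their hashes differ, so the set never reaches the comparison that would have raised the exception.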

Diez
 