strxfrm works with unicode string ?

 
 
nicolas.riesch@genevoise.ch
      06-17-2005
I am trying to use strxfrm with unicode strings, but it does not work.
This is what I did:

>>> import locale
>>> s=u'\u00e9'
>>> print s
é
>>> locale.setlocale(locale.LC_ALL, '')
'French_Switzerland.1252'
>>> locale.strxfrm(s)
Traceback (most recent call last):
  File "<pyshell#20>", line 1, in -toplevel-
    locale.strxfrm(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)
>>>


Does anyone see what I did wrong?

 
Gerald Klix
      06-17-2005
How about:

import locale
s=u'\u00e9'
print s


locale.setlocale(locale.LC_ALL, '')


locale.strxfrm( s.encode( "latin-1" ) )

---
HTH,
Gerald

(E-Mail Removed) wrote:
> I am trying to use strxfrm with unicode strings, but it does not work.
> This is what I did:
>
> >>> import locale
> >>> s=u'\u00e9'
> >>> print s
> >>> locale.setlocale(locale.LC_ALL, '')
> 'French_Switzerland.1252'
> >>> locale.strxfrm(s)
> Traceback (most recent call last):
> File "<pyshell#20>", line 1, in -toplevel-
> locale.strxfrm(s)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 0: ordinal not in range(128)
>
> Does anyone see what I did wrong?
>


--
GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634

 
nicolas.riesch@genevoise.ch
      06-17-2005
Grüezi, Gerald

Well, ok, but I don't understand why I should first convert a pure
unicode string into a byte string.
The encoding ( here, latin-1) seems an arbitrary choice.

Your solution works, but is it a workaround or the real way to use
strxfrm ?
It seems a little artificial to me, but perhaps I haven't understood
something ...

Does this mean that you cannot pass a unicode string to strxfrm ?

Bonne journée !

 
Gerald Klix
      06-17-2005
Sali Nicolas,
please see below for my answers.

(E-Mail Removed) wrote:
> Grüezi, Gerald
>
> Well, ok, but I don't understand why I should first convert a pure
> unicode string into a byte string.
> The encoding ( here, latin-1) seems an arbitrary choice.

Well, "latin-1" is simply the only encoding that I know works on
my xterm and that I can type without spelling errors.
>
> Your solution works, but is it a workaround or the real way to use
> strxfrm ?
> It seems a little artificial to me, but perhaps I haven't understood
> something ...

In Python 2.3.4 I had some strange encounters with the locale module.
In the end I considered it broken, at least when it came to currency
formatting.
>
> Does this mean that you cannot pass a unicode string to strxfrm ?

This works here with my home-grown Python 2.4 on Jurassic Debian Woody:

import locale
s=u'\u00e9'
print s

print locale.setlocale(locale.LC_ALL, '')
print repr( locale.strxfrm( s.encode( "latin-1" ) ) )
print repr( locale.strxfrm( s.encode( "utf-8" ) ) )

The output is rather strange:


de_DE
"\x10\x01\x05\x01\x02\x01'@/locale"
"\x0c\x01\x0c\x01\x04\x01'@/locale"
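
The two results differ because strxfrm never sees the character itself, only the bytes produced by the chosen encoding, and it presumably interprets those bytes according to the locale's own encoding. A minimal sketch of what each call actually hands over (the variable names are only for illustration; written to run on both Python 2.6+ and Python 3):

```python
# -*- coding: utf-8 -*-
# Show the byte strings that the two encode() calls hand to strxfrm.
s = u'\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = s.encode("latin-1")  # one byte: 0xE9
utf8_bytes = s.encode("utf-8")      # two bytes: 0xC3 0xA9

print(repr(latin1_bytes))
print(repr(utf8_bytes))
print(len(latin1_bytes), len(utf8_bytes))
```

If the locale expects a single-byte encoding, the two UTF-8 bytes would be collated as two separate characters, which would explain the longer-looking second result.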

Another (not so) weird thing happens when I unset LANG.

bear@special:~ > unset LANG
bear@special:~ > python2.4 ttt.py
Traceback (most recent call last):
  File "ttt.py", line 3, in ?
    print s
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)

Actually, it's weirder that printing works with LANG=de_DE.
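
The traceback is what you get whenever a non-ASCII unicode string has to be converted with the 'ascii' codec, which appears to be what print falls back on here once LANG is unset. A small sketch that reproduces the error directly and shows one possible fallback (the names `failure` and `safe` are made up for this example; runs on Python 2.6+ and Python 3):

```python
# -*- coding: utf-8 -*-
# Reproduce the 'ascii' codec failure without involving print or the terminal.
s = u'\u00e9'

failure = None
try:
    s.encode("ascii")  # strict mode: raises for any code point >= 128
except UnicodeEncodeError as exc:
    failure = str(exc)

# One possible fallback: replace unencodable characters with '?'
safe = s.encode("ascii", "replace")

print(failure)
print(repr(safe))
```

Encoding explicitly like this makes the script behave the same way regardless of the LANG setting, at the cost of losing the accented character.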

Back to your question. A quick glance at the C source of
_localemodule.c reveals:

if (!PyArg_ParseTuple(args, "s:strxfrm", &s))

So yes, strxfrm does not accept unicode!

I am inclined to consider this a bug.
At least it is not consistent with strcoll.
strcoll accepts either two plain strings or two unicode strings,
at least when HAVE_WCSCOLL was defined when Python
was compiled on your platform.
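
Since strcoll takes unicode strings, it can serve as the comparison function for sorting instead of using strxfrm as a key. A rough sketch, assuming a newer Python (2.7 or 3.x) where functools.cmp_to_key is available; `locale_sorted` is a hypothetical helper name, and the "C" locale is used only so that the example behaves the same everywhere:

```python
import functools
import locale


def locale_sorted(words):
    """Return words sorted by the current LC_COLLATE rules, using locale.strcoll."""
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))


# The "C" locale exists on every platform; a real program would more
# likely call locale.setlocale(locale.LC_ALL, '') as in the posts above.
locale.setlocale(locale.LC_COLLATE, "C")

result = locale_sorted([u"b", u"c", u"a"])
print(result)
```

On Python 2.4 itself, the equivalent would be to pass locale.strcoll as the cmp argument of sorted or list.sort.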

BTW: Which platform do you use?

HTH,
Gerald

PS: If you have access to irc, you can also ask at
irc://irc.freenode.net#python.de.



--
GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634

 
Magnus Lycka
      06-21-2005
(E-Mail Removed) wrote:
> Grüezi, Gerald
>
> Well, ok, but I don't understand why I should first convert a pure
> unicode string into a byte string.
> The encoding ( here, latin-1) seems an arbitrary choice.


Yes. The correct choice would be 'cp1252', not 'latin-1',
since that's what your locale setting indicates.
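
Rather than hard-coding either name, the encoding can be asked from the locale. A minimal sketch, assuming locale.getpreferredencoding() reports the code page of the active locale (typically 'cp1252' for a locale such as French_Switzerland.1252); `encode_for_locale` is a hypothetical helper, not part of the locale module:

```python
import locale


def encode_for_locale(text, encoding=None):
    """Encode a unicode string with the locale's encoding (hypothetical helper)."""
    if encoding is None:
        # Ask the locale which encoding it prefers instead of guessing.
        encoding = locale.getpreferredencoding()
    return text.encode(encoding)


# The encoding is passed explicitly here so the result does not depend
# on the machine running the example.
e_acute = encode_for_locale(u'\u00e9', 'cp1252')
euro = encode_for_locale(u'\u20ac', 'cp1252')
print(repr(e_acute))
print(repr(euro))
```

For é the two encodings happen to agree, but they differ elsewhere: cp1252 has the euro sign at byte 0x80, while latin-1 cannot encode it at all. On Python 2, the byte string returned by the helper is what would be passed to locale.strxfrm.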

It seems to me that Python is on a journey from the ASCII
world to the Unicode world, and it will take a few more
versions before it gets there. Going from 2.2 to 2.3 was
a bumpy part of the ride, and it's still not smooth.

Just try to use raw_input with national characters. As far
as I remember, it hasn't worked (on Windows at least) since
2.2.

The clear improvement in 2.3 is that if you print unicode
strings to stdout, they will look correct both in the GUI
and in text mode (cmd.exe). That never worked before, since
Windows uses different code pages for GUI programs and for
text mode (which is supposed to be DOS compatible).
 