the unicode saga continues...

 
 
Ethan Furman
 
      11-14-2009
So I've added unicode support to my dbf package, but I also have some
rather large programs that aren't ready to make the switch over yet. So
as a workaround I added a (rather lame) option to convert the
unicode-ified data that was decoded from the dbf table back into an
encoded format.

Here's the fun part: in figuring out what the option should be for use
with my system, I tried some tests...

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'\xed'
í
>>> print u'\xed'.encode('cp437')
í
>>> print u'\xed'.encode('cp850')
í
>>> print u'\xed'.encode('cp1252')
φ
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'cp1252')

My confusion lies in my apparent codepage (cp1252), and the discrepancy
with character u'\xed', which is absolutely an i with an accent; yet when
I encode with cp1252 and print it, I get an o with a line.

Can anybody clue me in to what's going on here?

~Ethan~
 
 
 
 
 
Ulrich Eckhardt
 
      11-14-2009
Ethan Furman wrote:
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'\xed'
> í
> >>> print u'\xed'.encode('cp437')
> í
> >>> print u'\xed'.encode('cp850')
> í
> >>> print u'\xed'.encode('cp1252')
> φ
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
>
> My confusion lies in my apparent codepage (cp1252), and the discrepancy
> with character u'\xed', which is absolutely an i with an accent; yet when
> I encode with cp1252 and print it, I get an o with a line.

^^^^^^^^^^^^^^^^^^^^^^
For the record: I read a small Greek letter phi in your posting, not an o
with a line. If I encode according to my default locale (UTF-8), I get the
letter i with the accent. If I encode with codepage 1252, I get a marker for
an invalid character on my terminal. This is using Debian though, not MS
Windows.

Try printing the repr() of that. The point is that internally, you have the
code point U+00ED (u'\xed'). You then encode it in various codepages, which
yields byte strings representing it ('\xa1', '\xa1' and '\xed' for cp437,
cp850 and cp1252 respectively), useful for storing on disk when the file
uses that codepage, or for other forms of I/O.
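
To make the bytes above concrete, here is a minimal sketch. It is written so it should run on both Python 2.6+ and Python 3.3+; the helper names encode_all and shown_as are made up for illustration, and the console codepage of cp437 is an assumption about the original poster's setup. It prints code points rather than glyphs so the output does not itself depend on the terminal.

```python
TEXT = u'\xed'  # U+00ED, LATIN SMALL LETTER I WITH ACUTE


def encode_all(text, codepages=('cp437', 'cp850', 'cp1252')):
    """Return (codepage, encoded bytes) pairs for text (hypothetical helper)."""
    return [(cp, text.encode(cp)) for cp in codepages]


def shown_as(raw, console_codepage='cp437'):
    """Decode raw bytes the way a console using console_codepage would show them."""
    return raw.decode(console_codepage)


for cp, raw in encode_all(TEXT):
    glyph = shown_as(raw)
    # Print the code point instead of the glyph to stay ASCII-only.
    print('%-6s %r -> U+%04X' % (cp, raw, ord(glyph)))
```

The cp437 and cp850 lines end in U+00ED (the accented i), while the cp1252 line ends in U+03C6, the small Greek phi: byte 0xED means "i with acute" in cp1252 but "phi" in cp437.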

Now, with a Unicode string, print knows what to do: it encodes it
according to the default locale and sends the resulting bytes to stdout.
With a byte string, I think it forwards the content directly to stdout.
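
A rough sketch of that idea, assuming the encoding print uses for Unicode strings on a terminal is the one reported by sys.stdout.encoding (which need not match locale.getdefaultlocale()). The helper name encode_for_stream and the 'ascii' fallback are illustrative choices, not anything from the dbf package.

```python
import sys


def encode_for_stream(text, stream=None, fallback='ascii'):
    """Encode text into the bytes that `stream` expects (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout
    # Not every file-like object has an encoding; pipes and files may report None.
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target codepage cannot represent.
    return text.encode(encoding, 'replace')


# The encoding the interpreter believes the terminal uses (may be None).
print(repr(getattr(sys.stdout, 'encoding', None)))
```

Encoding explicitly like this, and looking at the repr() of the result, avoids guessing which codepage the display is really using.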

Note:
* If you want to verify your code, use 'print repr(..)' instead.
* I could imagine that your locale is simply not set up correctly.

Uli

 
 
 
 
 
Martin v. Löwis
 
      11-14-2009
> Can anybody clue me in to what's going on here?

It's as Mark says: the console encoding is cp437 on your system, not
cp1252.

Windows has *two* default code pages at any point in time: the
OEM code page, and the ANSI code page. Either one depends on the
Windows release (Western, Japanese, etc.), and can be set by the
administrator. The OEM code page is primarily used for the console
(and then also as the encoding on the FAT filesystem); the ANSI
code page is used in all other places (that don't use Unicode APIs).

In addition, the console code page may deviate from the OEM code
page, if you run chcp.exe.
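
If you want to see these values from Python, something along these lines should work. It is only a sketch: the function name windows_code_pages is made up, and it calls the Win32 functions GetACP, GetOEMCP and GetConsoleOutputCP from kernel32 through ctypes, so it returns None anywhere other than Windows.

```python
import sys


def windows_code_pages():
    """Return the ANSI, OEM and console output code pages, or None off Windows."""
    if sys.platform != 'win32':
        return None
    import ctypes  # imported here so the module still loads on other platforms
    kernel32 = ctypes.windll.kernel32
    return {
        'ansi': kernel32.GetACP(),      # used by non-Unicode ("ANSI") APIs
        'oem': kernel32.GetOEMCP(),     # default for the console
        # 0 if the process has no console attached; changes after chcp.exe
        'console_output': kernel32.GetConsoleOutputCP(),
    }


print(windows_code_pages())
```

On a setup like the one in this thread, one would expect the 'ansi' entry to be 1252 and the other two to be 437, but that depends on the Windows release and on whether chcp has been run.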

Regards,
Martin
 