Need debugging knowhow for my creeping Unicodephobia

 
 
mk
      02-11-2010
MRAB wrote:

> When working with Unicode in Python 2, you should use the 'unicode' type
> for text (Unicode strings) and limit the 'str' type to binary data
> (bytestrings, ie bytes) only.


Well OK, always use u'something', that's simple -- but isn't str what I
get from files and sockets and the like?

> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
> 'bytes' for binary data (bytes!).


Neat, except that the process of porting most projects and external
libraries to P3 seems to be, how should I put it, standing still? Or am
I wrong? But that's the impression I get?

Take web frameworks for example. Does any of them have serious plans and
work in place to port to P3?

> Strictly speaking, only Unicode can be encoded.


How so? Can't bytestrings containing characters of, say, koi8r encoding
be encoded?

> What Python 2 is doing here is trying to be helpful: if it's already a
> bytestring then decode it first to Unicode and then re-encode it to a
> bytestring.


It's really cumbersome sometimes, even if two libraries are written by
one author: for instance, Mako and SQLAlchemy are written by the same
guy. They are both top-of-the line in my humble opinion, but when you
connect them you get things like this:

1. You query a SQLAlchemy object that happens to have string fields in
the relational DB.

2. The corresponding Python attributes of those objects then have type
str, not unicode.

3. Then I pass those objects to Mako for HTML rendering.

Typically it works, but only if none of the characters falls outside the
ASCII range. If one does, an unsuspecting user gets a
UnicodeDecodeError.

Sure, I wrote myself a helper that iterates over the keyword dictionary,
converts every str to unicode, and only then passes the dictionary to
render_unicode. It's overhead, though. It would be nicer to get
everything as unicode from the db, pass it straight to rendering, and
have it just work. (Unless there's something in filters that I missed:
there are encoding settings for templates and tags, but I didn't find
anything on automatic conversion of objects passed to the method that
renders the template.)
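
A helper of the kind described above might look roughly like this. It is only a sketch: the names to_unicode and unicode_kwargs are made up for illustration, and it assumes the database hands back UTF-8 encoded byte strings, which has to be checked for the actual setup.

```python
def to_unicode(value, encoding='utf-8'):
    """Decode byte strings to text; leave every other value untouched."""
    if isinstance(value, bytes):  # 'bytes' is an alias for 'str' on Python 2.6+
        return value.decode(encoding)
    return value


def unicode_kwargs(kwargs, encoding='utf-8'):
    """Return a copy of kwargs with all byte-string values decoded."""
    return dict((key, to_unicode(val, encoding)) for key, val in kwargs.items())


# Hypothetical row data: one encoded byte string, one non-string value.
row = {'name': u'Za\u017c\u00f3\u0142\u0107'.encode('utf-8'), 'age': 30}
context = unicode_kwargs(row)
```

The resulting context dictionary holds only text and non-string values, so it could then be handed to render_unicode as keyword arguments.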

But maybe I'm whining.


> Unfortunately, the default encoding is ASCII, and the bytestring isn't
> valid ASCII. Python 2 is being 'helpful' in a bad way!


And the default encoding is coded in such way so it cannot be changed in
sitecustomize (without code modification, that is).

Regards,
mk

 
Robert Kern
      02-11-2010
On 2010-02-11 15:43 PM, mk wrote:
> MRAB wrote:


>> Strictly speaking, only Unicode can be encoded.

>
> How so? Can't bytestrings containing characters of, say, koi8r encoding
> be encoded?


I think he means that only unicode objects can be encoded using the .encode()
method, as clarified by his next sentence:

>> What Python 2 is doing here is trying to be helpful: if it's already a
>> bytestring then decode it first to Unicode and then re-encode it to a
>> bytestring.
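
To make the "decode first, then re-encode" behaviour concrete, here is a small sketch. The implicit step that Python 2 performs is written out by hand so that the snippet also runs on Python 3; the sample word and the variable names are just for illustration.

```python
# Already-encoded bytes: a Russian word in KOI8-R.
raw = u'\u043f\u0440\u0438\u0432\u0435\u0442'.encode('koi8_r')

# Roughly what Python 2 does for raw.encode('utf-8'):
# an implicit decode with the default (ASCII) codec comes first.
try:
    raw.decode('ascii').encode('utf-8')
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# Naming the real source encoding makes the same round trip succeed.
utf8_bytes = raw.decode('koi8_r').encode('utf-8')
```

The failure comes from the hidden ASCII decode, not from the encode step itself, which is why the error message mentions decoding even though .encode() was called.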


--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

 
Steve Holden
      02-11-2010
mk wrote:
> MRAB wrote:
>
>> When working with Unicode in Python 2, you should use the 'unicode' type
>> for text (Unicode strings) and limit the 'str' type to binary data
>> (bytestrings, ie bytes) only.

>
> Well OK, always use u'something', that's simple -- but isn't str what I
> get from files and sockets and the like?
>

Yes, which is why you need to know what encoding was used to create it.
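
As a rough illustration, with io.BytesIO standing in for a file or socket and Latin-1 assumed as the encoding the sender used:

```python
import io

# Stand-in for a file or socket: it hands back bytes, not text.
stream = io.BytesIO(u'na\u00efve caf\u00e9\n'.encode('latin-1'))

data = stream.read()           # byte string ('str' on Python 2, 'bytes' on Python 3)
text = data.decode('latin-1')  # text, now that the encoding has been named
```

Decoding once, right where the bytes enter the program, keeps the rest of the code working with text only.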

>> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
>> 'bytes' for binary data (bytes!).

>
> Neat, except that the process of porting most projects and external
> libraries to P3 seems to be, how should I put it, standing still? Or am
> I wrong? But that's the impression I get?
>

No, it's probably not going as quickly as you would like, but it's
certainly not standing still. Some of these libraries are substantial
works, and there were changes to the C API that take quite a bit of work
to adapt existing code to.

> Take web frameworks for example. Does any of them have serious plans and
> work in place to port to P3?
>

There have already been demonstrations of partially-working Python 3
Django. I can't speak to the rest.

>> Strictly speaking, only Unicode can be encoded.

>
> How so? Can't bytestrings containing characters of, say, koi8r encoding
> be encoded?
>

It's just terminology. If a bytestring contains koi8r characters then
(as you unconsciously recognized by your use of the word "encoding") it
already *has* been encoded.

>> What Python 2 is doing here is trying to be helpful: if it's already a
>> bytestring then decode it first to Unicode and then re-encode it to a
>> bytestring.

>
> It's really cumbersome sometimes, even if two libraries are written by
> one author: for instance, Mako and SQLAlchemy are written by the same
> guy. They are both top-of-the line in my humble opinion, but when you
> connect them you get things like this:
>
> 1. you query SQLAlchemy object, that happens to have string fields in
> relational DB.
>
> 2. Corresponding Python attributes of those objects then have type str,
> not unicode.
>

Yes, a relational database will often return ASCII, but nowadays people
are increasingly using encoded Unicode. In that case you need to be
aware of the encoding that has been used to render the Unicode values
into the byte strings (which in Python 2 are of type str) so that you
can decode them into Unicode.

> 3. then I pass those objects to Mako for HTML rendering.
>
> Typically, it works: but if and only if a character in there does not
> happen to be out of ASCII range. If it does, you get UnicodeDecodeError
> on an unsuspecting user.
>

Well first you need to be clear what you are passing to Mako.

> Sure, I wrote myself a helper that iterates over keyword dictionary to
> make sure to convert all str to unicode and only then passes the
> dictionary to render_unicode. It's an overhead, though. It would be
> nicer to have it all unicode from db and then just pass it for rendering
> and having it working. (unless there's something in filters that I
> missed, but there's encoding of templates, tags, but I didn't find
> anything on automatic conversion of objects passed to method rendering
> template)
>

Some database modules will distinguish between fields of type varchar
and nvarchar, returning Unicode objects for the latter. You will need to
ensure that the module knows which encoding is used in the database.
This is usually automatic.

> But maybe I'm whining.
>

Nope, just struggling with a topic that is far from straightforward the
first time you encounter it.
>
>> Unfortunately, the default encoding is ASCII, and the bytestring isn't
>> valid ASCII. Python 2 is being 'helpful' in a bad way!

>
> And the default encoding is coded in such way so it cannot be changed in
> sitecustomize (without code modification, that is).
>

Yes, the default encoding is not always convenient.
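
To see which default a given interpreter uses, something like this should do. The values in the comment are what stock builds normally report; a modified installation may differ.

```python
import sys

# Usually 'ascii' on a stock Python 2 and 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```

Passing an explicit encoding to every .decode() and .encode() call avoids depending on this value at all.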

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

 
Terry Reedy
      02-11-2010
On 2/11/2010 4:43 PM, mk wrote:

> Neat, except that the process of porting most projects and external
> libraries to P3 seems to be, how should I put it, standing still?


What is important are the libraries, so more new projects can start in
3.x. There is a slow trickle of 3.x support announcements.

> But maybe I'm whining.


Or perhaps explaining why 3.x unicode improvements are needed.

tjr

 
Nobody
      02-12-2010
On Wed, 10 Feb 2010 12:17:51 -0800, Anthony Tolle wrote:

> 4. Consider switching to Python 3.x, since there is only one string
> type (unicode).


However, one drawback of Python 3.x is that the repr() of a Unicode string
is no longer restricted to ASCII. There is an ascii() function which
behaves like the 2.x repr(), but the interpreter uses repr() for
displaying the value of an expression typed at the interactive prompt,
which results in "can't encode" errors if the string cannot be converted
to your locale's encoding.
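
A quick Python 3 sketch of the difference between the two functions (the sample string is arbitrary):

```python
text = 'caf\u00e9'

shown_by_repr = repr(text)    # keeps the non-ASCII character as-is
shown_by_ascii = ascii(text)  # escapes it, much as repr() did in Python 2

print(shown_by_ascii)
```

Printing the ascii() form is safe on any console, since the result contains only ASCII characters.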

 
John Nagle
      02-13-2010
kj wrote:

>>>   x = '%s' % y
>>>   x = '%s' % z
>>>   print y
>>>   print z
>>>   print y, z


Bear in mind that most Python implementations assume the "console"
only handles ASCII. So "print" output is converted to ASCII, which
can fail. (Actually, all modern Windows and Linux systems support
Unicode consoles, but Python somehow doesn't get this.)
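
One defensive option, sketched here with a made-up helper name (for_console), is to replace whatever the console encoding cannot represent before printing. This assumes that lossy output is acceptable, which may not be the case everywhere.

```python
import sys


def for_console(text, encoding=None):
    """Return text with characters the console cannot show replaced by '?'."""
    # sys.stdout.encoding can be None when output is piped (notably on Python 2).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)


print(for_console(u'caf\u00e9', 'ascii'))
```

With an ASCII console the accented character comes out as a question mark instead of raising an error; with a UTF-8 console the text passes through unchanged.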

John Nagle
 
John Nagle
      02-13-2010
kj wrote:
> Some people have mathphobia. I'm developing a wicked case of
> Unicodephobia.
>
> I have read a *ton* of stuff on Unicode. It doesn't even seem all
> that hard. Or so I think. Then I start writing code, and WHAM:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(12


First, you haven't told us what platform you're on. Windows? Linux?
Something else?

If you're on Windows, and running Python from the command line, try
"cmd /u" before running Python. This will get you a Windows console that
will print Unicode. Python recognizes this, and "print" calls will
go out to the console in Unicode, which will then print the correct
characters if they're in the font being used by the Windows console.
Most European languages are covered in the standard font.

If you're using IDLE, or some Python debugger, it may need to be
told to have its window use Unicode.

John Nagle
 