utf8 encoding problem

 
 
Wichert Akkerman
 
      01-22-2004
I'm struggling with what should be a trivial problem but I can't seem to
come up with a proper solution: I am working on a CGI that takes utf-8
input from a browser. The input is nicely encoded so you get something
like this:

firstname=t%C3%A9s

where %C3%A9 is a single character in utf-8 encoding. Passing this
through urllib.unquote does not help:

>>> urllib.unquote(u't%C3%A9st')
u't%C3%A9st'

The problem turned out to be that urllib.unquote() processes
its input character by character which breaks when it tries to call
chr() for a character: it gets a character which is not valid ascii
(outside the legal range) or valid unicode (it's only half a utf-8
character) and as a result it fails:

>>> chr(195) + u""
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
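
The failure described above can be reproduced in isolation. The following is only a sketch, written for Python 3 rather than the Python 2 interpreter used in the post, and `try_decode` is a hypothetical helper introduced purely for illustration. It shows that the first byte of the two-byte sequence is neither valid ASCII nor valid UTF-8 on its own, while the complete sequence decodes fine.

```python
# Python 3 sketch (assumption: the original post used Python 2).
# "é" (U+00E9) occupies two bytes in UTF-8: 0xC3 0xA9.
encoded = "\u00e9".encode("utf-8")


def try_decode(raw, codec):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# The lone first byte fails under ASCII because 0xC3 is above 127 ...
first_byte_ascii = try_decode(encoded[:1], "ascii")
# ... and under UTF-8 because it is only half of a two-byte sequence.
first_byte_utf8 = try_decode(encoded[:1], "utf-8")
# Decoding both bytes together succeeds.
whole_utf8 = try_decode(encoded, "utf-8")

print(encoded, first_byte_ascii, first_byte_utf8, whole_utf8)
```

This suggests that the escapes have to be turned back into a complete byte string first, and only then decoded as a whole.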


I can't seem to find a working method to do this conversion correctly.
Can someone point me in the right direction? (and please cc me on
replies since I'm not currently subscribed to this list/newsgroup).

Wichert.

--
Wichert Akkerman <(E-Mail Removed)> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.


 
Erik Max Francis
 
      01-22-2004
Wichert Akkerman wrote:

> I'm struggling with what should be a trivial problem but I can't seem to
> come up with a proper solution: I am working on a CGI that takes utf-8
> input from a browser. The input is nicely encoded so you get something
> like this:
>
> firstname=t%C3%A9s
>
> where %C3%A9 is a single character in utf-8 encoding. Passing this
> through urllib.unquote does not help:
>
> >>> urllib.unquote(u't%C3%A9st')
> u't%C3%A9st'


Unquote it as a normal string, then convert it to Unicode.

>>> import urllib
>>> x = 't%C3%A9s'
>>> y = urllib.unquote(x)
>>> y
't\xc3\xa9s'
>>> z = unicode(y, 'utf-8')
>>> z
u't\xe9s'
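
For anyone reading this on Python 3, where `urllib.unquote` and `unicode()` no longer exist under those names, a rough equivalent of the session above might look like the sketch below. It assumes the standard-library `urllib.parse` module and that the browser really did send UTF-8; the variable names are illustrative only.

```python
# Python 3 sketch of the same "unquote to bytes, then decode" approach.
# Assumption: the form data was percent-encoded from UTF-8 text.
from urllib.parse import parse_qs, unquote, unquote_to_bytes

quoted = "t%C3%A9s"

# Step 1: undo the percent-escapes, yielding raw bytes.
raw = unquote_to_bytes(quoted)

# Step 2: decode the complete byte string as UTF-8.
text = raw.decode("utf-8")

# unquote() can do both steps at once when given the encoding.
direct = unquote(quoted, encoding="utf-8")

# For a whole query string such as the one in the question,
# parse_qs() splits the fields and decodes the values.
form = parse_qs("firstname=t%C3%A9s", encoding="utf-8")

print(repr(raw), repr(text), repr(direct), form)
```

The two-step form mirrors the Python 2 session most closely; `parse_qs` is probably the more convenient choice when the input is a complete form submission rather than a single value.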

--
__ Erik Max Francis && (E-Mail Removed) && http://www.alcyone.com/max/
/ \ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
\__/ I do not promise to consider race or religion in my appointments.
I promise only that I will not consider them. -- John F. Kennedy
 