utf-8 encoding issue

 
 
Marc Petitmermet
      09-19-2003
The line below looks up the name "Öttinger" (with the German umlaut) of
an author using the mysql console:

mysql> select author from records where author like '%&#xD6;ttinger%';

This successfully finds all entries in the records database where
"Öttinger" is the author or the co-author.

In a web form, the user enters "Öttinger" and wants to search with this
search string. My idea is now to convert the search string (which also
could be e.g. some Cyrillic text) into unicode and then to utf-8:

unicode(search_string).encode('utf-8')

This gives me the utf-8 encoded version of the string but not yet in the
correct representation. How can I get the correct one (is this the hex
version? I don't know the correct terminology.)?

In short: how do I e.g. convert a string containing a "Ö" into a string
containing a "%&#xD6;"?

Regards,
Marc
 
Fredrik Lundh
      09-19-2003
Marc Petitmermet wrote:

> In a web form, the user enters "Öttinger" and wants to search with this
> search string. My idea is now to convert the search string (which also
> could be e.g. some Cyrillic text) into unicode and then to utf-8:
>
> unicode(search_string).encode('utf-8')
>
> This gives me the utf-8 encoded version of the string but not yet in the
> correct representation. How can I get the correct one (is this the hex
> version? I don't know the correct terminology.)?
>
> In short: how do I e.g. convert a string containing a "Ö" into a string
> containing a "%&#xD6;"?


that's not UTF-8, that's HTML/XML-style charrefs.

if mysql translates the charrefs to unicode characters, you can simply
use:

s = u.encode("ascii", "xmlcharrefreplace")

where "u" is a unicode string.
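For readers on Python 3, where every str is already Unicode, the call above can be sketched roughly as follows. This is only an illustration under that assumption (the thread itself targets Python 2), and the variable names are made up for the example. Note that the handler emits decimal charrefs, not hexadecimal ones.

```python
# Python 3 sketch; the original post targets Python 2 unicode objects.
u = "\u00d6ttinger"  # "Öttinger"

# UTF-8 yields raw bytes, not a character reference.
utf8_bytes = u.encode("utf-8")

# xmlcharrefreplace swaps each non-ASCII character for a decimal charref.
# encode() returns bytes in Python 3, so decode back to str for further use.
s = u.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(utf8_bytes)  # b'\xc3\x96ttinger'
print(s)           # &#214;ttinger
```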

if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all charrefs are hexadecimal charrefs,
you can use something like:

import re
# Python 2: rewrite decimal charrefs (&#214;) as hexadecimal ones (&#xd6;)
def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
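A Python 3 take on the same two-step idea might look like the sketch below. The helper name to_hex_charrefs is hypothetical, and the lowercase hex digits are an assumption; if the database stores uppercase forms such as &#xD6;, "%X" would be needed instead of "%x".

```python
import re

def to_hex_charrefs(text):
    """Replace non-ASCII characters with hexadecimal charrefs such as &#xd6;."""
    # Step 1: non-ASCII characters become decimal charrefs (&#214;).
    ascii_text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: rewrite each decimal charref in hexadecimal form.
    return re.sub(r"&#(\d+);",
                  lambda m: "&#x%x;" % int(m.group(1)),
                  ascii_text)

print(to_hex_charrefs("\u00d6ttinger"))  # &#xd6;ttinger
```

As with the original snippet, any decimal charref already present in the input is rewritten too, since step 2 cannot tell it apart from one produced in step 1.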

decoding the charrefs *before* you add the strings to the database
is a better idea, though.
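In current Python 3 (3.4 or later), decoding charrefs before storing could be sketched with the standard library's html.unescape. The function name decode_charrefs and the sample value are hypothetical; html.unescape also converts named entities such as &amp;amp;, which may or may not be wanted here.

```python
import html

def decode_charrefs(text):
    """Turn numeric charrefs (hex or decimal) back into real characters."""
    return html.unescape(text)

stored = "&#xD6;ttinger"  # assumed form of a value carrying a hex charref
decoded = decode_charrefs(stored)

print(decoded == "\u00d6ttinger")  # True
```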

</F>




 