Velocity Reviews - Computer Hardware Reviews

Encoding troubles

 
 
JB
      05-17-2010
I'm working on the webapp of our company intranet and I had a question
about proper handling of user input that's causing encoding issues.

Some of the users take notes in Microsoft Office and copy/paste these
into textareas of the webapp. Some of the characters from Word such
as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
to the database using sqlalchemy they appear as – and other
characters.

What's the proper handling (conversion?) of user input before it gets
to my database? Do I need to start making a list of the offending
characters and .replace them? Or is there a means to decode/encode the
user input to something more generic? Thanks for your time.
 
 
Neil Hodgson
      05-17-2010
JB:

> as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> to the database using sqlalchemy they appear as – and other
> characters.


The encoding is UTF-8. Normally the best way to handle encodings is
to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
and perform most processing in Unicode.

Neil
 
 
Bryan
      05-18-2010
Neil Hodgson wrote:
> JB:
>
> > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> > to the database using sqlalchemy they appear as – and other
> > characters.

>
> The encoding is UTF-8. Normally the best way to handle encodings is
> to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
> and perform most processing in Unicode.


Good advice to work in Unicode (and in Python 3.X str is unicode), but
I'd guess the encoding he's getting is "Windows-1252". The default
character set of HTTP is ISO-8859-1, but Microsoft likes to use
Windows-1252 in its place.

What to do about it? First, try specifying utf-8 in the form
containing the textarea, as in

<form action="process.cgi" accept-charset="utf-8">

Note that specifying ISO-8859-1 will not work, in that Microsoft will
still use Windows-1252. I've heard they've gotten better at supporting
utf-8, but I haven't tested.

When a request comes in, check for a Content-Type header that names
the character set, which should be:

Content-Type: application/x-www-form-urlencoded; charset=utf-8

Then you can decode to a unicode object as Neil Hodgson explained.
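
A minimal sketch of that header check, in Python 3 and using only the
standard library. The function names charset_from_content_type and
parse_form are hypothetical, and falling back to Windows-1252 when no
usable charset is declared is an assumption based on the browser
behaviour described above. Most web frameworks do this step for you.

```python
import codecs
from email.message import Message
from urllib.parse import parse_qs


def charset_from_content_type(header_value, default="windows-1252"):
    """Return the charset named in a Content-Type header, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset parameter lower-cased, or `default` if absent.
    charset = msg.get_content_charset(default)
    try:
        codecs.lookup(charset)  # make sure Python knows this codec
    except LookupError:
        charset = default
    return charset


def parse_form(body, content_type):
    """Parse a urlencoded request body into a dict of str lists."""
    charset = charset_from_content_type(content_type)
    # A urlencoded body is plain ASCII; the declared charset applies to
    # the bytes hidden behind the %XX escapes.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="replace")
```

With this, parse_form(b"note=%E2%80%93", "application/x-www-form-urlencoded; charset=utf-8")
should hand back the en dash as a proper string rather than three
stray characters.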

In case you still have to deal with Windows-1252, Python knows how to
translate Windows-1252 to the best-fit in Unicode. In current Python
2.x:

ustring = unicode(raw_string, 'Windows-1252')

In Python 3.X, what comes from a socket is bytes, and str means
unicode:

ustring = str(raw_bytes, 'Windows-1252')
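
If you cannot be sure which of the two encodings you were sent, a
common trick is to try strict UTF-8 first and only then fall back to
Windows-1252. The sketch below is Python 3, assumes the input is one of
those two encodings, and uses a hypothetical helper name
(decode_user_input). It also shows how decoding with the wrong codec
produces the kind of garbage JB reported.

```python
def decode_user_input(raw_bytes):
    """Decode bytes that are assumed to be UTF-8 or Windows-1252."""
    try:
        # Windows-1252 punctuation bytes on their own are not valid
        # UTF-8, so a strict decode rejects them.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values that
        # Windows-1252 leaves undefined.
        return raw_bytes.decode("windows-1252", errors="replace")


en_dash = "\u2013"
as_utf8 = en_dash.encode("utf-8")          # b'\xe2\x80\x93'
as_1252 = en_dash.encode("windows-1252")   # b'\x96'

# Both byte forms come back as the same Unicode character.
same_from_utf8 = decode_user_input(as_utf8) == en_dash
same_from_1252 = decode_user_input(as_1252) == en_dash

# UTF-8 bytes read as Windows-1252 give three characters instead of one.
mojibake = as_utf8.decode("windows-1252")  # '\xe2\u20ac\u201c'
```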


Of course this all assumes that JB's database likes Unicode. If it
chokes, then alternatives include encoding back to utf-8 and storing
as binary, or translating characters to some best-fit in the set the
database supports.
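
Both fallbacks can be sketched in a few lines of Python 3. The mapping
is an illustrative assumption and deliberately incomplete (it only
covers a handful of characters Word tends to substitute), and the
helper names to_ascii_best_fit and to_utf8_blob are hypothetical.

```python
# Typographic characters mapped to plain ASCII stand-ins.
WORD_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def to_ascii_best_fit(text):
    """Replace known typographic characters; anything else becomes '?'."""
    text = text.translate(WORD_PUNCTUATION)
    return text.encode("ascii", errors="replace").decode("ascii")


def to_utf8_blob(text):
    """Encode text to UTF-8 bytes for storage in a binary column."""
    return text.encode("utf-8")
```

The best-fit route loses information, so it is probably only worth it
when the database really cannot store Unicode.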


--
--Bryan Olson
 