handling unicode data

 
 
Filipe
      06-28-2006
Hi all,

I'm starting to learn python but am having some difficulties with how
it handles the encoding of data I'm reading from a database. I'm using
pymssql to access data stored in a SqlServer database, and the
following is the script I'm using for testing purposes.

-----------------------------------------------------------------------------
import pymssql

mssqlConnection = pymssql.connect(host='localhost',user='sa',password='password',database='TestDB')
cur = mssqlConnection.cursor()
query="Select ID, Term from TestTable where ID > 200 and ID < 300;"
cur.execute(query)
row = cur.fetchone()
results = []
while row is not None:
    term = row[1]
    print type(row[1])
    print term
    results.append(term)
    row = cur.fetchone()
cur.close()
mssqlConnection.close()
print results
-----------------------------------------------------------------------------

In the console output, for a record where I expected to see "França"
I'm getting the following:

"<type 'str'>" - When I print the type (print type(row[1]))
"Fran+a" - When I print the "term" variable (print term)
"Fran\xd8a" - When I print all the query results (print results)


The values in "Term" column in "TestTable" are stored as unicode (the
column's datatype is nvarchar), yet, the python data type of the values
I'm reading is not unicode.
It all seems to be an encoding issue, but I can't see what I'm doing
wrong..
Any thoughts?

thanks in advance,
Filipe

 
Fredrik Lundh
      06-28-2006
Filipe wrote:

> In the console output, for a record where I expected to see "França"
> I'm getting the following:
>
> "<type 'str'>" - When I print the type (print type(row[1]))
> "Fran+a" - When I print the "term" variable (print term)
> "Fran\xd8a" - When I print all the query results (print results)
>
> The values in "Term" column in "TestTable" are stored as unicode (the
> column's datatype is nvarchar), yet, the python data type of the values
> I'm reading is not unicode.
> It all seems to be an encoding issue, but I can't see what I'm doing
> wrong..


looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
strings. there might be some configuration option for this; see

in worst case, you could do something like

def unicodify(value):
    if isinstance(value, str):
        value = unicode(value, "iso-8859-1")
    return value

term = unicodify(row[1])

but it's definitely better if you can get the DB-API driver to do the right thing.

</F>
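
For readers on Python 3, where `str` is already Unicode and undecoded driver data would arrive as `bytes`, the same fallback might look like the sketch below. The default encoding only mirrors the one suggested in this post; it is an assumption about what the driver returns, not something confirmed for pymssql. The sample `row` is a stand-in for a fetched record.

```python
def unicodify(value, encoding="iso-8859-1"):
    """Decode raw bytes to text; pass every other value through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in for a fetched row; a real row would come from cursor.fetchone().
row = (201, b"Fran\xd8a")
term = unicodify(row[1])

# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(term))  # 'Fran\xd8a'
```

Non-bytes values such as the integer ID pass through untouched, so the helper can be applied to every column of a row.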



 
Martin v. Löwis
      06-28-2006
Fredrik Lundh wrote:
> looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
> strings. there might be some configuration option for this; see
>


Where did you want to point the OP here?

> in worst case, you could do something like
>
> def unicodify(value):
>     if isinstance(value, str):
>         value = unicode(value, "iso-8859-1")
>     return value
>
> term = unicodify(row[1])
>
> but it's definitely better if you can get the DB-API driver to do the right thing.


It seems pymssql does not support such a thing.

Also, it appears that DB-Library (the API used by pymssql) always
returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
so the "right" encoding to use is "mbcs".

Notice that Microsoft plans to abandon DB-Library, so it might be
best to switch to a different module for SQL Server access.

Regards,
Martin
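
A minimal sketch of the "mbcs" suggestion, written for Python 3: the "mbcs" codec exists only on Windows, where it maps to the machine's ANSI code page, so the sketch checks for it first. The helper names and the `cp1252` fallback are hypothetical; the fallback assumes Western European data and only matters when the code runs on a non-Windows machine.

```python
import codecs


def ansi_codec_name(fallback="cp1252"):
    """Return "mbcs" where Python provides it (Windows), else the fallback."""
    try:
        codecs.lookup("mbcs")
    except LookupError:
        return fallback
    return "mbcs"


def decode_ansi(raw, fallback="cp1252", errors="strict"):
    """Decode bytes that are assumed to be in the Windows ANSI code page."""
    return raw.decode(ansi_codec_name(fallback), errors)


print(ansi_codec_name())
# The result depends on the machine's ANSI code page when "mbcs" is used.
print(ascii(decode_ansi(b"Fran\xe7a", errors="replace")))
```

Because the ANSI code page is a property of the machine, the same bytes can decode differently on two Windows boxes; naming the code page explicitly is more predictable when the server's code page is known.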
 
Filipe
      06-28-2006
Hi Fredrik,

Thanks for the reply.
Instead of:
term = row[1]
I tried:
term = unicode(row[1], "iso-8859-1")

but the following error was returned when printing "term":
Traceback (most recent call last):
  File "test.py", line 11, in ?
    print term
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 31: character maps to <undefined>

Is it possible some unicode strings are not printable to the console?
It's odd, because I can manually write in the console the same string
I'm trying to print.
I also tried other encodings, besides iso-8859-1, but got the same
error.

Do you think this has something to do with the DB-API driver? I don't
even know where to start if I have to change something in there.

Cheers,
Filipe

 
Fredrik Lundh
      06-28-2006
Filipe wrote:

> Thanks for the reply.
> Instead of:
> term = row[1]
> I tried:
> term = unicode(row[1], "iso-8859-1")
>
> but the following error was returned when printing "term":
> Traceback (most recent call last):
>   File "test.py", line 11, in ?
>     print term
>   File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 31: character maps to <undefined>


works for me, given your example:

>>> s = "Fran\xd8a"
>>> unicode(s, "iso-8859-1")
u'Fran\xd8a'

what does

print repr(row[1])

print in this case ?

</F>

 
Filipe
      06-28-2006
Hi,

Martin v. Löwis wrote:
> Also, it appears that DB-Library (the API used by pymssql) always
> returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
> so the "right" encoding to use is "mbcs".


do you mean using something like the following line?
term = unicode(row[1], "mbcs")

What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
quite a newbie to python)

> Notice that Microsoft plans to abandon DB-Library, so it might be
> best to switch to a different module for SQL Server access.


I've done some searching and settled for pymssql, but it's not too late
to change yet.
I've found these options to connect to a MSSqlServer database:

Pymssql
http://pymssql.sourceforge.net/

ADODB for Python (windows only)
http://phplens.com/lens/adodb/adodb-py-docs.htm

SQLServer for Python (discontinued?)
http://www.object-craft.com.au/projects/mssql/

mxODBC (commercial license)
http://www.egenix.com/files/python/mxODBC.html

ASPN Recipe
http://aspn.activestate.com/ASPN/Coo.../Recipe/144183


Pymssql seemed like the best choice. The ASPN Recipe I mention doesn't
look bad either, but there don't seem to be as many people using it
as there are using pymssql. I'll look a little further though.

 
Filipe
      06-28-2006
Fredrik Lundh wrote:
> works for me, given your example:
> >>> s = "Fran\xd8a"
> >>> unicode(s, "iso-8859-1")
> u'Fran\xd8a'
>
> what does
> print repr(row[1])
>
> print in this case ?


It prints:
'Fran\xd8a'

The error I'm getting is being thrown when I print the value to the
console. If I just convert it to unicode all seems ok (except for not
being able to show it in the console, that is...).

For example, when I try this:
print unicode("Fran\xd8a", "iso-8859-1")

I get the error:
Traceback (most recent call last):
  File "a.py", line 1, in ?
    print unicode("Fran\xd8a", "iso-8859-1")
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 4: character maps to <undefined>

 
Marc 'BlackJack' Rintsch
      06-28-2006
Filipe wrote:

> The error I'm getting is being thrown when I print the value to the
> console. If I just convert it to unicode all seems ok (except for not
> being able to show it in the console, that is...).
>
> For example, when I try this:
> print unicode("Fran\xd8a", "iso-8859-1")
>
> I get the error:
> Traceback (most recent call last):
>   File "a.py", line 1, in ?
>     print unicode("Fran\xd8a", "iso-8859-1")
>   File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 4: character maps to <undefined>


It is not the `unicode()` call that fails here but the ``print``, because
printing a unicode string means it has to be encoded into a byte string
again. And whatever encoding the target of the print (your console) uses,
it does not contain the unicode character u'\xd8'. From the traceback it
seems your terminal uses `cp437` as its encoding.

As you can see here: http://www.wordiq.com/definition/CP437 there's no Ø
in that character set.

Ciao,
Marc 'BlackJack' Rintsch
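
The failure described above can be reproduced without a database or a cp437 console. The sketch below is Python 3 and encodes explicitly instead of relying on `print`; `console_safe` is a hypothetical helper, and substituting `?` for unencodable characters is just one possible policy, not something proposed in the thread.

```python
text = "Fran\u00d8a"  # the decoded value from the thread, u'Fran\xd8a'

# Encoding for a cp437 console fails because cp437 has no slot for U+00D8.
try:
    text.encode("cp437")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first unencodable character


def console_safe(value, encoding="cp437"):
    """Round-trip through the console encoding, replacing what it cannot show."""
    return value.encode(encoding, errors="replace").decode(encoding)


print(failed_at)           # 4
print(console_safe(text))  # Fran?a
```

Characters that cp437 does contain, such as ç, survive the round trip unchanged, so only the unrepresentable ones are lost.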
 
Frank Millman
      06-29-2006

Filipe wrote:
> Hi,
>
> I've done some searching and settled for pymssql, but it's not too late
> to change yet.
> I've found these options to connect to a MSSqlServer database:
>
> Pymssql
> http://pymssql.sourceforge.net/
>
> ADODB for Python (windows only)
> http://phplens.com/lens/adodb/adodb-py-docs.htm
>
> SQLServer for Python (discontinued?)
> http://www.object-craft.com.au/projects/mssql/
>
> mxODBC (commercial license)
> http://www.egenix.com/files/python/mxODBC.html
>
> ASPN Recipe
> http://aspn.activestate.com/ASPN/Coo.../Recipe/144183
>


You did not mention the odbc module from Mark Hammond's win32
extensions. This is what I use, and it works for me. I believe it is
not 100% DB-API 2.0 compliant, but I have not had any problems.

I have not tried connecting to the database from a Linux box (or from
another Windows box, for that matter). I don't know if there are any
implications there.

Frank Millman

 
Martin v. Löwis
      06-29-2006
Filipe wrote:
>> Also, it appears that DB-Library (the API used by pymssql) always
>> returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
>> so the "right" encoding to use is "mbcs".

>
> do you mean using something like the following line?
> term = unicode(row[1], "mbcs")


Correct.

> What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
> quite a newbie to python)


It's more of a SQL Server thing than a Python thing. See AutoAnsiToOem
in

http://support.microsoft.com/default...B;EN-US;199819

Regards,
Martin
 