Re: 'ascii' codec can't encode character u'\xf3'

 
 
Martin Slouf
      08-17-2004
i had similar errors:

Traceback (most recent call last):
  File "/home/martin/skripty/accounts.py", line 125, in ?
    main(sys.argv)
  File "/home/martin/skripty/accounts.py", line 119, in main
    print_accounts(accounts, url_part)
  File "/home/martin/skripty/accounts.py", line 94, in print_accounts
    print str(i).encode("utf-8", "replace")
UnicodeEncodeError: 'ascii' codec can't encode characters in position
151-152: ordinal not in range(128)

- - - -

the solution seems to be:

0. string is not in unicode encoding (assumption)
1. before printing out, convert the string to unicode
2. when printing, convert to whatever charset you like

though i dont understand much why (ive solved it a minute ago). the
code should be:

str = "any nonunicode string"
print unicode(str).encode("iso-8859-2", "replace")

comments:

1. why the string is not in unicode can have several reasons -- i guess:
- does ogg store tags in unicode?
- you have parsed an xml file with the encoding attribute set (that
is what i do)
- etc

2. the "replace" parameter in encode causes non-printable chars to be
replaced with '?' (you can use "ignore" or "strict", see your python
doc)

3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
a funny thing -- the first line of code converts from unknown (but the
programmer must know it) to unicode and the second one converts it back
from unicode to unknown (now the programmer tells that secret to python)


4. i would like to know from any python expert whether/why/why not:

* my assumptions are right

* why does it behave like that? -- if you search google you get
thousands of errors like this -- with no proper solutions i must add

* is there an easier portable way (no sitecustomize.py changes)
to do it

* i was looking in site.py, where the sys.setdefaultencoding()
function gets deleted, but from the comments i cannot tell
why -- do you know? why is the user not allowed to change the
default encoding? it seems reasonable to me that he/she could do that.

thx.

m.

 
Martin v. Löwis
      08-17-2004
Martin Slouf wrote:
> the solution seems to be:
>
> 0. string is not in unicode encoding (assumption)
> 1. before printing out, convert the string to unicode
> 2. when printing, convert to whatever charset you like


There is an alternative, if the print is a debug print:

- print a repr() of the unicode object instead of
the unicode object itself. This will work on all
terminals, and show hex escapes of non-ASCII characters.

> 1. why the string is not in unicode can have several reasons -- i guess:
> - does ogg stores tags in unicode?
> - you have parsed an xml file with encoding attribute set (that
> is what i do)
> - etc


Correct.

> 2. "replace" parameter in encode causes non-printable chars to be
> replaced with '?' (you can use "ignore" or strict", see your python
> doc)


Correct.
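To make the three error handlers concrete, here is a small sketch. It uses Python 3 syntax, which is an assumption of this example (the thread itself is about Python 2), and the sample word is invented for illustration. Strictly speaking, the handlers act on characters the target codec cannot represent, which is not quite the same as "non-printable".

```python
# Hypothetical sample: the Czech name Dvorak with its diacritics
# (r-caron is U+0159, a-acute is U+00E1).
text = "Dvo\u0159\u00e1k"

# "replace" substitutes one '?' per character the codec cannot represent.
replaced = text.encode("ascii", "replace")

# "ignore" silently drops those characters.
ignored = text.encode("ascii", "ignore")

# "strict" (the default) raises instead of losing data.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced)
print(ignored)
print(strict_failed)
```

"replace" and "ignore" both lose information, so they suit display output better than data that will be read back later.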

> 3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
> a funny thing -- first line of code converts from unknown (but the
> programmer must know it) to unicode and the second one converts it back
> from unicode to unknown (now the programmer tells that secret to python)
>


No. unicode(text) uses the system default encoding
(sys.getdefaultencoding()) which normally is ASCII.
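
A minimal sketch of that difference, assuming Python 3 syntax, where `bytes.decode(encoding)` plays the role of Python 2's `unicode(text, encoding)`. The byte string is invented sample data. Note that Python 3's `decode()` defaults to UTF-8, so "ascii" is passed explicitly here to mimic the Python 2 default described above.

```python
# Hypothetical input: the name Dvorak (with diacritics) as ISO-8859-2 bytes.
raw = b"Dvo\xf8\xe1k"

# Decoding with the ASCII codec fails on the first byte >= 0x80,
# which is what Python 2's unicode(raw) ran into by default.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the real encoding is what makes the conversion work.
text = raw.decode("iso-8859-2")

# Encoding back with the same codec reproduces the original bytes.
round_trip = text.encode("iso-8859-2", "replace")

print(ascii_ok)
print(round_trip == raw)
```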

Printing a Unicode string to a terminal should work fine if the terminal
is properly configured. What that means depends on your operating
system.

> * my assumptions are right


Most of them.

>
> * why is that behaviour? -- if you search google you get
> thousands of errors like this -- with no proper solutions i must add


There is a proper solution. Unfortunately, very similar yet different
problems cause the same error message, and each problem has a different
proper solution:

- A Unicode error is raised when trying to combine a Unicode string
and a byte string, if the byte string contains non-ASCII characters,
e.g.

u"Martin v. " + "Löwis"

The proper solution is to convert the second string into a Unicode
object, e.g. through

unicode("Löwis", "iso-8859-1")

- A unicode error is raised when a Unicode string is printed to
a terminal. The proper solution is that the system administrator
or the user should properly administer the locale, so that Python
knows what characters the terminal can print. For characters that
are then still non-printable, repr() is the proper solution.

- A unicode error is raised when a library does not support Unicode
for some reason. The proper solution is to fix the library. A
proper work-around is to explicitly convert Unicode strings into
the encoding that the library expects.
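
The first and third cases can be sketched as follows. This assumes Python 3 syntax, where mixing text and bytes raises a TypeError at once; in the Python 2 of this thread the same mistake surfaced as a Unicode error only when the byte string held non-ASCII data. The variable names are illustrative.

```python
prefix = "Martin v. "

# The surname from the example above as ISO-8859-1 bytes (o-umlaut is 0xF6).
surname_bytes = b"L\xf6wis"

# Case 1: text and bytes cannot be combined directly.
try:
    full_name = prefix + surname_bytes
    mixing_failed = False
except TypeError:
    mixing_failed = True
    # The fix: decode the byte string with its real encoding first.
    full_name = prefix + surname_bytes.decode("iso-8859-1")

# Case 3: for a library that only accepts byte strings, encode explicitly
# into whatever encoding that library expects (ISO-8859-1 is assumed here).
for_library = full_name.encode("iso-8859-1")

print(mixing_failed)
print(for_library)
```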

> * is there an easier portable way (no sitecustomize.py changes)
> to do it


Yes, see above.

> * i was looking in site.py and there is deleted the
> sys.setdefaultencoding() function, but from the comments i do
> not know why -- you know it? why is user not allowed to change the
> default encoding? it seems reasonable to me if he/she could do that.


Yes, but that would not be a proper solution. It would mean that your
script now only works on your system, and fails on a system where
the default encoding has not been changed, or has been changed to
something else. Users should use a proper solution instead.

Regards,
Martin
 
Martin Slouf
      08-17-2004
thank you for the reply, great info! it helped me to understand it better;
but of course, some additional questions have arisen.

some of these questions/comments may seem stupid (i.e. the answers are
obvious), but im new to python and i want to make sure i get it right; thx
for your patience.

> There is an alternative, if the print is a debug print:
>
> - print a repr() of the unicode object instead of
> the unicode object itself. This will work on all
> terminals, and show hex escapes of non-ASCII characters.


just to make sure:

override the object's __repr__(self) method to sth. like:

class my_string(str):
    def __repr__(self):
        tmp = unicode(self.attribute1 + " " + self.attribute2)
        return tmp

and use the 'my_string' class without any worries instead of the classical
string?

>
> No. unicode(text) uses the system default encoding
> (sys.getdefaultencoding()) which normally is ASCII.
>
> Printing a Unicode string to a terminal should work fine if the terminal
> is properly configured. What that means depends on your operating
> system.


my system is debian GNU/Linux stable, ive been using it for a very, very long
time, though i have not changed any terminal settings but the very
basics. My locales are properly set, im using LC_* environment
variables to set the default locale to a czech environment with the ISO-8859-2
charset. The terminal is capable of displaying 8bit charsets, im not sure
about unicode charsets -- never tried, never needed. All other
locale-sensitive programs are satisfied (e.g. the java interpreter -- this
should be much like python).

i guess in germany it is quite the same, maybe ISO-8859-1 is preferred

example output from my system:

>>> import locale
>>> loc = locale.getdefaultlocale()
>>> loc

['cs_CZ', 'ISO8859-2']

so i guess this is ok.

but the problem may be in my 'site.py', where setting the encoding
according to my locale is done in code like this:

if 0:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]

so i guess it is never done

did you change it yourself? do you think this is the 'portable
solution'? i guess not -- another system, another locale; maybe staying
with ascii is the best.

>
> >
> > * why is that behaviour? -- if you search google you get
> >thousands of errors like this -- with no proper solutions i must add

>
> There is a proper solution. Unfortunately, very similar yet different
> problems cause the same error message, and each problem has a different
> proper solution:
>


well, if a piece of information like the one you gave me were contained in
the standard python documentation, there would probably be less
misunderstanding about this issue.

> - A Unicode error is raised when trying to combine a Unicode string
> and a byte string, if the byte string contains non-ASCII characters,
> e.g.
>
> u"Martin v. " + "Löwis"
>
> The proper solution is to convert the second string into a Unicode
> object, e.g. through
>
> unicode("Löwis", "iso-8859-1")
>


if i use
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
at the beginning of every script of mine, does the example above still have
to be converted -- because of the iso-8859-1 you use in "Löwis"?

what would change if i use
#! /usr/bin/env python
# -*- coding: ISO-8859-1 -*-
?

can i omit the conversion (i.e. is it done automatically for me as if
i wrote
u"Martin v. " + unicode("Löwis", "ISO-8859-1")
)?

> - A unicode error is raised when a Unicode string is printed to
> a terminal. The proper solution is that the system administrator
> or the user should properly administer the locale, so that Python
> knows what characters the terminal can print. For characters that
> are then still non-printable, repr() is the proper solution.


see above for comments on my setting. if you have done such a
customization (and it differs from mine) and you have experience with
linux, may i ask you for recommendations?

>
> - A unicode error is raised when a library does not support Unicode
> for some reason. The proper solution is to fix the library. A
> proper work-around is to explicitly convert Unicode strings into
> the encoding that the library expects.
>


i dont understand -- which library? do you mean for example the ogg vorbis
c-library when used with python bindings? -- in that case, what can
i do as a developer? -- find out what encoding is used and do the
tricky things i did -- now properly understood:

1. convert from "unknown" to unicode
    tmp = unicode("string", "library-charset-specification")

2. print it like
    print tmp.encode("my-terminal-charset-specification")

question:

can library-charset-specification be omitted if i specify it in a
comment at the very beginning of a script (as i guessed above) -- or
can my-terminal-charset-specification be omitted if specified in the comment
-- or can i omit both if they are equal?

if im about to use the __repr__(self) method, i would do the conversion
inside that method and return tmp, as i tried above, right?

>
> > * i was looking in site.py and there is deleted the
> >sys.setdefaultencoding() function, but from the comments i do
> >not know why -- you know it? why is user not allowed to change the
> >default encoding? it seems reasonable to me if he/she could do that.

>
> Yes, but that would not be a proper solution. It would mean that your
> script now only works on your system, and fails on a system where
> the default encoding has not been changed, or has been changed to
> something else. Users should use a proper solution instead.


i thought that every programmer could call
sys.setdefaultencoding() at the start of the script to set it to
whatever he needs. it should work on every system that has the proper
encoding files (though site.py has a comment about MS Windows -- it
breaks that rule).

>
> Regards,
> Martin


once again, thank you a lot.

Regards,
Martin (also)
 
Paul Prescod
      08-17-2004
Martin Slouf wrote:

> thank you for reply, great info! it helped me to better understand it;
> but of course, some additional questions have risen.
>
> maybe some of those question/comments may seem stupid (ie. clear), but
> im new to python and i want to assure myself i get it right; thx for
> patience.
>
>
>>There is an alternative, if the print is a debug print:
>>
>>- print a repr() of the unicode object instead of
>> the unicode object itself. This will work on all
>> terminals, and show hex escapes of non-ASCII characters.

>
>
> just to make sure:
>
> override the object's __repr__(self) method to st. like:


No, he means instead of:

print foo

use:

print repr(foo)

Paul Prescod

 
 
John Roth
      08-17-2004
"Martin Slouf" <> wrote in message
news:mailman.1775.1092723467.5135.python-...
> i had similar errors:
>
> Traceback (most recent call last):
> File "/home/martin/skripty/accounts.py", line 125, in ?
> main(sys.argv)
> File "/home/martin/skripty/accounts.py", line 119, in main
> print_accounts(accounts, url_part)
> File "/home/martin/skripty/accounts.py", line 94, in print_accounts
> print str(i).encode("utf-8", "replace")
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 151-152: ordinal not in range(128)
>
> - - - -
>
> the solution seems to be:
>
> 0. string is not in unicode encoding (assumption)
> 1. before printing out, convert the string to unicode
> 2. when printing, convert to whatever charset you like
>
> though i dont understand much why (ive solved it a minute ago the
> code should be:
>
> str = "any nonunicode string"
> print unicode(str).encode("iso-8859-2", "replace")


I think the terminology is backwards. If you use a unicode string
(that is, u"foo"), that string will be in unicode. That's what Python
does with unicode strings. However, it can't be read or written
as such - it has to be decoded from something else (utf-8,
iso-8859-2, whatever) after being read, and encoded to something
(utf-8, iso-8859-1, whatever) to be written.

A string on disk isn't in "unicode"; it's always in some
encoded format, which is usually utf-8. Or it's in some
single-byte format such as iso-8859-1. Or a far eastern
multi-byte format. A string only winds up in unicode
when it's comfortably ensconced in a unicode string.
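
A small sketch of that distinction, assuming Python 3 syntax (where `str` is the unicode string and `bytes` is the encoded form). One piece of text has several byte representations, and each has to be decoded with the codec that produced it.

```python
# The surname from earlier in the thread; o-umlaut is U+00F6.
text = "L\u00f6wis"

# Two different encoded forms of the same five characters.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

# UTF-8 needs two bytes for the o-umlaut, ISO-8859-1 needs one.
print(len(text), len(as_utf8), len(as_latin1))

# Decoding each form with its own codec yields the same text again.
same_text = as_utf8.decode("utf-8") == as_latin1.decode("iso-8859-1")
print(same_text)
```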

> comments:
>
> 1. why the string is not in unicode can have several reasons -- i guess:
> - does ogg stores tags in unicode?
> - you have parsed an xml file with encoding attribute set (that
> is what i do)
> - etc
>
> 2. "replace" parameter in encode causes non-printable chars to be
> replaced with '?' (you can use "ignore" or strict", see your python
> doc)
>
> 3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
> a funny thing -- first line of code converts from unknown (but the
> programmer must know it) to unicode and the second one converts it back
> from unicode to unknown (now the programmer tells that secret to python
>


Well, the encoding declaration tells Python what to do with unicode
string literals that it finds in the Python source. It doesn't do anything
else.

> 4. i would like to know from any python expert whether/why/why not:
>
> * my assumptions are right


As I said above, the terminology is backwards. "Pure"
unicode only exists in unicode strings. Everything else
is some encoded character set or other in regular single
byte strings, ***including unicode encoded as utf-8.***

> * why is that behaviour? -- if you search google you get
> thousands of errors like this -- with no proper solutions i must add


There's a lot of confusion out there. Lots of people are under
the impression that the encoding declaration somehow does
something magical with unicode, when all (and I need to
emphasize that, ALL) it does is decode the text of unicode
literals in the source code, using the specified encoding.
Everything outside of unicode literals is treated as a stream
of 8-bit bytes, regardless of the programmer's intentions.

Before the encoding declaration existed, if you wanted to
include unicode characters in your program you had
to use an editor that encoded in utf-8 and put them
in single byte strings, and then decode those strings
into unicode strings. This was fairly error-prone since
you could drop utf-8 encoded characters somewhere
they didn't belong, causing bugs that were very difficult to find.

> * is there an easier portable way (no sitecustomize.py changes)
> to do it


The best thing is to ignore the encoding declaration and
write the program as if it wasn't there. On input you need
to somehow determine the encoding of the data and then
decode that into a unicode string; on output you need
to do the reverse and encode the unicode string into a
single byte string before writing it.

You can simplify some of this by using the open
function in the codecs module. That lets you
declare the encoding on open so that the
encoding and decoding happens transparently.
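
A rough sketch of declaring the encoding at open time. It assumes Python 3, where the built-in `open` accepts an `encoding` argument much as `codecs.open` does; that substitution, the temporary file and the sample text are all assumptions of this example.

```python
import os
import tempfile

# Hypothetical sample text: the name Dvorak with its Czech diacritics.
text = "Dvo\u0159\u00e1k"

path = os.path.join(tempfile.mkdtemp(), "name.txt")

# Writing: the file object encodes the text on the way out.
with open(path, "w", encoding="iso-8859-2") as handle:
    handle.write(text)

# Reading the raw bytes shows what actually landed on disk.
with open(path, "rb") as handle:
    on_disk = handle.read()

# Reading with the same declared encoding decodes transparently.
with open(path, "r", encoding="iso-8859-2") as handle:
    read_back = handle.read()

print(on_disk)
print(read_back == text)
```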

> * i was looking in site.py and there is deleted the
> sys.setdefaultencoding() function, but from the comments i do
> not know why -- you know it? why is user not allowed to change the
> default encoding? it seems reasonable to me if he/she could do that.


That's someone else's answer. I'm not going to get into
the politics behind that, other than to say that there are
very serious release to release compatibility considerations
here.

John Roth

 
Martin v. Löwis
      08-17-2004
Martin Slouf wrote:
>>- print a repr() of the unicode object instead of
>> the unicode object itself. This will work on all
>> terminals, and show hex escapes of non-ASCII characters.

>
>
> just to make sure:
>
> override the object's __repr__(self) method to st. like:
>
> class my_string(string):
> def __repr__(self)
> tmp = unicode(self.attribute1 + " " + self.attribute2)
> return tmp
>
> and use 'my_string' class without any worries instead of classical
> string?


No. Assume yyy is a Unicode object which potentially contains
non-printable characters. Instead of doing

print yyy

do

print repr(yyy)
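
For readers on Python 3 (an assumption; the thread is about Python 2), the closest counterpart is the built-in `ascii()`, because Python 3's `repr()` leaves printable non-ASCII characters as they are. A minimal sketch with an invented sample value:

```python
# Hypothetical value that an ASCII-only terminal could not display.
yyy = "Dvo\u0159\u00e1k"

# ascii() returns a quoted, escaped form that contains only ASCII,
# so printing it is safe on any terminal.
debug_form = ascii(yyy)
print(debug_form)

# The escaped form can always be encoded as ASCII.
safe_bytes = debug_form.encode("ascii")
```

The escaped output is meant for debugging, not for showing to end users.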

> my system is debian GNU/Linux stable, im using it for a very, very long
> time, though i did not changed any terminal settings but the very
> basics. My locales are properly set, im using LC_* environment
> variables to set default locale to czech environment with ISO-8859-2
> charset. Terminal is capable of displaying 8bit charsets, im not sure
> about unicode charsets -- never tried, never needed.


I see. Could it be that you are using Python 2.1, then? Because in
Python 2.3, printing Czech characters to the terminal should work
just fine. Please try this:

Python 2.3.4 (#2, Aug 5 2004, 09:33:45)
[GCC 3.3.4 (Debian 1:3.3.4-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'ISO-8859-15'

> if 0:
> # Enable to support locale aware default string encodings.
> import locale
> loc = locale.getdefaultlocale()
> if loc[1]:
> encoding = loc[1]
>
> so i guess it is never done


You don't need to change the default encoding. Instead,
sys.stdout.encoding is used for printing to the terminal (in 2.3 and
later).
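
As a sketch of how a script might respect the terminal's encoding explicitly: the helper `encode_for_stream` and the `FakeTerminal` class below are hypothetical names invented for this example, and the code assumes Python 3 syntax. Only the `encoding` attribute of the stream comes from the discussion above.

```python
import sys


class FakeTerminal:
    """Hypothetical stand-in for a stream that reports an encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


def encode_for_stream(text, stream=None):
    """Encode text for a stream, replacing what its encoding cannot hold."""
    if stream is None:
        stream = sys.stdout
    # Streams that are not terminals may report no encoding; fall back to ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")


name = "Dvo\u0159\u00e1k"  # hypothetical sample text
latin2_bytes = encode_for_stream(name, FakeTerminal("ISO-8859-2"))
ascii_bytes = encode_for_stream(name, FakeTerminal("ascii"))
print(latin2_bytes)
print(ascii_bytes)
```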

> did you yourself changed it?


No. It will work out of the box.

> well, if a piece of information like you gave to me was contained in
> standard python documentation, probably there will be less
> misunderstanding about this issue.


What piece specifically are you referring to? It is all mentioned
in the standard Python documentation.

> #! /usr/bin/env python
> # -*- coding: UTF-8 -*-
> at the begginnig of my every script, the example above still has to
> be converted -- because of the iso-8859-1 you use in "Löwis"?


Yes, and no. Yes, it still has to be converted. UTF-8 is *not*
Unicode; it is a byte encoding, and you cannot mix Unicode
strings and byte strings. No, if I use UTF-8 in my source code,
then "Löwis" will be encoded in UTF-8, not in ISO-8859-1.

> can i ommit the conversion (ie. is it done automatically for me as if
> i write
> u"Martin v. " + unicode("Löwis", "ISO-8859-1")
> )?


You can, but you shouldn't. So I won't tell you how you could do that.

> dont understand -- which library?


The ODBC library, for example, or PyQt.

Regards,
Martin
 
 
Martin Slouf
      08-18-2004
ok, thanks for your time while answering my questions.

my python is

Python 2.3.3 (#1, May 1 2004, 16:13:07)
[GCC 3.2.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding

'ISO-8859-2'

so im fine with it -- it is just strange that it used ascii when
the sys default is ISO-8859-2.

on the other hand: no matter now -- im 'overencoded' -- and from now on i
will explicitly call the conversion functions in my python scripts
(those are not programs) to assure myself everything is fine

i see that the solution i came up with was quite right, though i didnt
understand it much. now i know how it works and im satisfied.

thanks to all of you.

martin.


 