Velocity Reviews - Computer Hardware Reviews

Is setdefaultencoding bad?

 
 
moerchendiser2k3
      02-23-2011
Hi, I embedded Py2.6.1 in my app and I use UTF-8 encoded strings
everywhere in the interface between my app and Python, so I can
simply write:

print u"\uC042"
print u"\uC042".encode("utf_8")

and get the corresponding character in the console. But currently
sys.getdefaultencoding() still returns ascii. Should I change it in
site.py to utf-8, or is this not recommended somehow? I often read
that it's strongly discouraged, but I can't find an explanation why.

Thanks for any hints!!
Bye, moerchendiser2k3
 
Nobody
      02-23-2011
On Tue, 22 Feb 2011 19:34:21 -0800, moerchendiser2k3 wrote:

> Hi, I embedded Py2.6.1 in my app and I use UTF-8 encoded strings
> everywhere in the interface between my app and Python, so I can
> simply write:
>
> print u"\uC042"
> print u"\uC042".encode("utf_8")
>
> and get the corresponding character in the console. But currently
> sys.getdefaultencoding() still returns ascii. Should I change it in
> site.py to utf-8, or is this not recommended somehow? I often read
> that it's strongly discouraged, but I can't find an explanation why.


You shouldn't use it.

If your code needs to run on any system other than your own, it can't rely
upon the default encoding being set to anything in particular. So
changing the default encoding is an easy way to end up writing code which
doesn't work on any system except your own.

And you can't change the default encoding outside of site.py because the
value has to be constant throughout the lifetime of the process.

IIRC, if you use a unicode string as a dictionary key, and the key can be
converted using the default encoding, the hash is calculated on the
encoded byte string (so that if you have equivalent unicode and byte
strings, both hash to the same value). If you were to change the default
encoding after any dictionaries have been created (internally, Python uses
dictionaries quite extensively), subsequent lookups would use the wrong
hash values.

 
moerchendiser2k3
      02-23-2011
OK, but is it still fine that the interface handles UTF-8 strings?
The default encoding is still ASCII.
 
 
Chris Rebert
      02-23-2011
On Wed, Feb 23, 2011 at 3:07 AM, moerchendiser2k3 wrote:
> OK, but is it still fine that the interface handles UTF-8 strings?
> The default encoding is still ASCII.


Yes, that's fine. UTF-8 is an excellent encoding choice, and
encoding/decoding should always be done explicitly in Python, so the
"default encoding" ideally ought to never come into play (and indeed,
Python 3 does away with bug-prone implicit encoding/decoding entirely
FWICT). Having ASCII as the "default encoding" ensures that implicit
encoding/decoding bugs are relatively apparent.
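
To make "explicitly" concrete, here is a minimal sketch of converting at the interface boundary. The names INTERFACE_ENCODING, from_host and to_host are made up for illustration and are not from the original post. It is written so it runs on Python 3, but the same pattern applies to 2.6.

```python
# Assumption: the host application exchanges UTF-8 encoded byte strings.
INTERFACE_ENCODING = "utf-8"


def from_host(raw_bytes):
    """Decode bytes received from the host application into text."""
    return raw_bytes.decode(INTERFACE_ENCODING)


def to_host(text):
    """Encode text into bytes before handing it back to the host."""
    return text.encode(INTERFACE_ENCODING)


text = u"\uC042"
wire = to_host(text)            # explicit encode on the way out
assert from_host(wire) == text  # explicit decode on the way in

# The ASCII codec refuses non-ASCII text outright, which is why an ASCII
# default makes accidental implicit conversions easy to spot.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

With this in place the default encoding never gets consulted, so it does not matter what it is set to.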

Cheers,
Chris
--
http://blog.rebertia.com
 
 
Nobody
      02-24-2011
On Wed, 23 Feb 2011 04:14:29 -0800, Chris Rebert wrote:

>> OK, but is it still fine that the interface handles UTF-8 strings?
>> The default encoding is still ASCII.
>
> Yes, that's fine. UTF-8 is an excellent encoding choice, and
> encoding/decoding should always be done explicitly in Python, so the
> "default encoding" ideally ought to never come into play (and indeed,
> Python 3 does away with bug-prone implicit encoding/decoding entirely
> FWICT).


On Unix, you have to go out of your way to avoid the use of implicit
encoding/decoding with the "filesystem" encoding. This is because Unix
extensively uses byte strings with no associated encoding, but Python 3
tries to use Unicode for everything.

3.0 was essentially unusable as a Unix scripting language for this reason,
as argv and environ were converted to Unicode, with no possibility of
recovering from unconvertible sequences.

3.1 added the surrogate-escape mechanism which allows recovery of the
original byte sequences, albeit with some effort (i.e. you had to
explicitly re-encode os.environ and sys.argv using the surrogateescape
error handler).

3.2 adds os.environb (bytes version of os.environ), but it appears that
sys.argv still has to be encoded manually. It also provides os.fsencode()
and os.fsdecode() to simplify the conversion.
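
A rough sketch of that round trip, for anyone who wants to try it. The file name and the helper argv_as_bytes are made up for illustration. The os.fsencode()/os.fsdecode() part is guarded because it assumes a POSIX system, where the filesystem codec uses the surrogateescape handler.

```python
import os

# Hypothetical file name as raw bytes; 0xFF can never appear in valid UTF-8.
raw = b"report-\xff.txt"

# Decode with the surrogateescape handler (Python 3.1+): the bad byte
# becomes the lone surrogate U+DCFF instead of raising an error.
as_text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = as_text.encode("utf-8", "surrogateescape")


def argv_as_bytes(argv):
    """Re-encode already-decoded arguments with os.fsencode() (Python 3.2+)."""
    return [os.fsencode(arg) for arg in argv]


if os.name == "posix":
    # Assumption: on POSIX the filesystem codec uses surrogateescape,
    # so decoding and re-encoding gives back the same bytes.
    fs_restored = os.fsencode(os.fsdecode(raw))
```

Calling argv_as_bytes(sys.argv) would be the "encode manually" step for command-line arguments.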

Most functions accept bytes arguments, and most either return bytes when
passed bytes or (if the function takes no arguments) have a bytes
equivalent. But variables tend to be Unicode strings with no bytes version
(os.environb is the exception rather than the rule), and some functions
have no bytes equivalent (e.g. os.ctermid(), os.uname(), os.ttyname();
fortunately it's rather unlikely that the result from any of these
functions will contain non-ASCII characters).
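
As a small illustration of "bytes in, bytes out", the sketch below uses os.listdir() and os.getcwdb() on Python 3.2 or later. The temporary directory and the file name example.txt are just scaffolding for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name so both listings are predictable.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    names_text = os.listdir(tmp)                # str path in, str names out
    names_bytes = os.listdir(os.fsencode(tmp))  # bytes path in, bytes names out

# os.getcwd() takes no arguments, so it has a separate bytes twin.
cwd_bytes = os.getcwdb()
```

Here names_text holds str objects and names_bytes holds bytes objects for the same directory entry.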

 