Dr. Dobb's Python-URL! - weekly Python news and links (Dec 30)

 
 
Thomas Heller
Guest
Posts: n/a
 
      01-04-2005
Skip Montanaro writes:

> michele> BTW what's the difference between .encode and .decode ?
>
> I started to answer, then got confused when I read the docstrings for
> unicode.encode and unicode.decode:
>
> >>> help(u"\xe4".decode)
>
> Help on built-in function decode:
>
> decode(...)
> S.decode([encoding[,errors]]) -> string or unicode
>
> Decodes S using the codec registered for encoding. encoding defaults
> to the default encoding. errors may be given to set a different error
> handling scheme. Default is 'strict' meaning that encoding errors raise
> a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
> as well as any other name registerd with codecs.register_error that is
> able to handle UnicodeDecodeErrors.
>
> >>> help(u"\xe4".encode)
>
> Help on built-in function encode:
>
> encode(...)
> S.encode([encoding[,errors]]) -> string or unicode
>
> Encodes S using the codec registered for encoding. encoding defaults
> to the default encoding. errors may be given to set a different error
> handling scheme. Default is 'strict' meaning that encoding errors raise
> a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
> 'xmlcharrefreplace' as well as any other name registered with
> codecs.register_error that can handle UnicodeEncodeErrors.
>
> It probably makes sense to one who knows, but for the feeble-minded like
> myself, they seem about the same.
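
The error handlers named in these docstrings can be tried directly. This is a minimal sketch, assuming Python 3 rather than the Python 2 of the thread: there `str` plays the role of the thread's `unicode`, and `bytes` the role of its `str`. The variable names are illustrative only.

```python
# Python 3 sketch: str is the decoded (Unicode) form, bytes the encoded form.
text = "\xe4"  # the character ä

# encode: Unicode text -> bytes. ASCII cannot represent ä, so the
# error handler decides what happens.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml_ref = text.encode("ascii", "xmlcharrefreplace")

# decode: bytes -> Unicode text. 0xE4 is not a valid ASCII byte.
raw = b"\xe4"
decoded_replaced = raw.decode("ascii", "replace")
decoded_ignored = raw.decode("ascii", "ignore")

# The default handler is 'strict', which raises instead of substituting.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

print(ignored, replaced, xml_ref)
print(repr(decoded_replaced), repr(decoded_ignored))
print(type(strict_error).__name__)
```

Note that 'xmlcharrefreplace' only makes sense in the encode direction, which is why it appears in one docstring and not the other.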


It seems also the error messages aren't too helpful:

>>> "ä".encode("latin-1")

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
>>>


Hm, why does the 'encode' call complain about decoding?

Why do string objects have an encode method, and why do unicode objects
have a decode method, and what does this error message want to tell me:

>>> u"ä".decode("latin-1")

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>>


Thomas
 
 
 
 
 
Max M
Guest
Posts: n/a
 
      01-04-2005
wrote:

> uhm ... then there is a misprint in the discussion of the recipe;
> BTW what's the difference between .encode and .decode ?
> (yes, I have been living in happy ASCII-land until now ...



# -*- coding: latin-1 -*-


# Here I make a unicode string
unicode_file = u'Some danish characters æøå' #.encode('hex')
print type(unicode_file)
print repr(unicode_file)
print ''


# I can convert this unicode string to an ordinary string.
# Because æøå are in the latin-1 charmap, it can be understood as
# a latin-1 string.
# The æøå characters even have the same values in both.
latin1_file = unicode_file.encode('latin-1')
print type(latin1_file)
print repr(latin1_file)
print latin1_file
print ''


## I can *not* convert it to ascii
#ascii_file = unicode_file.encode('ascii')
#print ''


# I can also convert it to utf-8
utf8_file = unicode_file.encode('utf-8')
print type(utf8_file)
print repr(utf8_file)
print utf8_file
print ''


# utf8_file is now an ordinary string. Again, it can help to think of it
# as a file format.
#
# I can convert this file/string back to unicode again by using the
# decode method. It tells Python to decode this "file format" as utf-8
# when it loads it into a unicode string. And we are back where we started.


unicode_file = utf8_file.decode('utf-8')
print type(unicode_file)
print repr(unicode_file)
print ''


# So basically you can encode a unicode string into a special
# string/file format, and you can decode a string from a special
# string/file format back into unicode.


###################################


<type 'unicode'>
u'Some danish characters \xe6\xf8\xe5'

<type 'str'>
'Some danish characters \xe6\xf8\xe5'
Some danish characters æøå

<type 'str'>
'Some danish characters \xc3\xa6\xc3\xb8\xc3\xa5'
Some danish characters æøå

<type 'unicode'>
u'Some danish characters \xe6\xf8\xe5'
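
For readers on Python 3, the same round trip can be sketched as follows. This is an illustration under the assumption that `str` stands in for the `unicode` type above and `bytes` for the ordinary string; the variable names are made up for the example.

```python
# Python 3 sketch of the same round trip: str is Unicode text, bytes is
# the encoded "file format".
text = "Some danish characters \u00e6\u00f8\u00e5"  # æøå

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # two bytes each for æ, ø and å

# ASCII has no representation for æøå, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding with the matching codec brings us back where we started.
round_trip = utf8_bytes.decode("utf-8")

print(repr(latin1_bytes))
print(repr(utf8_bytes))
print(round_trip == text)
```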





--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 
 
 
 
 
Max M
Guest
Posts: n/a
 
      01-04-2005
Thomas Heller wrote:

> It seems also the error messages aren't too helpful:
>
> >>> "ä".encode("latin-1")
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
>
> Hm, why does the 'encode' call complain about decoding?


Because it tries to print it out to your console and fails. While writing
to the console it tries to convert to ascii.

Besides, you should write:

u"ä".encode("latin-1") to get a latin-1 encoded string.


 
 
Thomas Heller
Guest
Posts: n/a
 
      01-04-2005
Max M writes:

> Thomas Heller wrote:
>
>> It seems also the error messages aren't too helpful:
>>
>> >>> "ä".encode("latin-1")
>>
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in ?
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
>>
>> Hm, why does the 'encode' call complain about decoding?
>
> Because it tries to print it out to your console and fail. While
> writing to the console it tries to convert to ascii.


Wrong, same error without trying to print something:

>>> x = "ä".encode("latin-1")

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
>>>


>
> Beside, you should write:
>
> u"ä".encode("latin-1") to get a latin-1 encoded string.


I know, but the question was: why does a unicode string have an encode
method, and why does it complain about decoding (which has already been
answered in the meantime).

Thomas
 
 
Walter Dörwald
Guest
Posts: n/a
 
      01-04-2005
Skip Montanaro wrote:
> aahz> Here's the stark simple recipe: when you use Unicode, you *MUST*
> aahz> switch to a Unicode-centric view of the universe. Therefore you
> aahz> encode *FROM* Unicode and you decode *TO* Unicode. Period. It's
> aahz> similar to the way floating point contaminates ints.
>
> That's what I do in my code. Why do Unicode objects have a decode method
> then?


Because MAL implemented it! >;->

It first encodes in the default encoding and then decodes the result
with the specified encoding, so if u is a unicode object
u.decode("utf-16")
is an abbreviation of
u.encode().decode("utf-16")

In the same way str has an encode method, so
s.encode("utf-16")
is an abbreviation of
s.decode().encode("utf-16")
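
The two abbreviations can be spelled out to show where the surprising errors come from. The sketch below assumes Python 3, where the implicit step no longer exists, so it is written out by hand; `py2_str_encode`, `py2_unicode_decode`, `error_name` and `DEFAULT_ENCODING` are hypothetical names for this illustration, and "ascii" is assumed as the default encoding.

```python
# Hypothetical emulation, in Python 3, of the hidden conversion step.
DEFAULT_ENCODING = "ascii"  # assumed default encoding


def py2_str_encode(raw, encoding, default=DEFAULT_ENCODING):
    """Emulate s.encode(enc) on a byte string: decode first, then encode."""
    return raw.decode(default).encode(encoding)


def py2_unicode_decode(text, encoding, default=DEFAULT_ENCODING):
    """Emulate u.decode(enc) on Unicode text: encode first, then decode."""
    return text.encode(default).decode(encoding)


def error_name(func, *args):
    """Return the name of the Unicode error raised by func, or None."""
    try:
        func(*args)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# The "encode" call fails in its hidden decode step, and vice versa.
print(error_name(py2_str_encode, b"\x84", "latin-1"))
print(error_name(py2_unicode_decode, "\xe4", "latin-1"))
# Pure ASCII data survives the hidden step untouched.
print(py2_str_encode(b"abc", "latin-1"))
```

Seen this way, the 'ascii' codec named in the tracebacks is the default encoding used in the hidden step, not the encoding passed by the caller.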

Bye,
Walter Dörwald
 
 
Carl Banks
Guest
Posts: n/a
 
      01-05-2005
Skip Montanaro wrote:
> I started to answer, then got confused when I read the docstrings for
> unicode.encode and unicode.decode:

[snip]


It certainly is confusing. When I first started Unicoding, I pretty
much stuck to Aahz's rule of thumb, without understanding the details,
and still do that. But now I do understand it.

Although encodings are bijective (i.e., equivalent one-to-one
mappings), they are not apolar. One side of the encoding is
arbitrarily labeled the encoded form; the other is arbitrarily labeled
the decoded form. (This is not a relativistic system, here.) The
encode method maps from the decoded to the encoded set. The decode
method does the inverse.

That's it. The only real technical difference between encode and
decode is the direction they map in.

By convention, the decoded form is a Python unicode string, and the
encoded form is the byte string.

I believe it's technically possible (but very rude) to write an
"inverse encoding", where the "encoded" form is a unicode string, and
the decoded form is a UTF-8 byte string.

Also, note that there are some encodings unrelated to Unicode. For
example, try this:

>>> "abcd".encode("base64")
This is an encoding between two byte strings.
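
In Python 3 the string methods accept only text encodings, so a bytes-to-bytes codec like this one is reached through the `codecs` module instead. A minimal sketch, assuming Python 3.4 or later:

```python
import codecs

# base64 maps bytes to bytes; no Unicode text is involved in either direction.
encoded = codecs.encode(b"abcd", "base64")
decoded = codecs.decode(encoded, "base64")

print(encoded)
print(decoded)
```

The codec appends a trailing newline to the encoded form, which is why the comparison in any round-trip check should be made on the decoded bytes.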


--
CARL BANKS

 
 
Max M
Guest
Posts: n/a
 
      01-05-2005
Carl Banks wrote:

> Also, note that there are some encodings unrelated to Unicode. For
> example, try this:
>
> >>> "abcd".encode("base64")
> This is an encoding between two byte strings.


Yes. This can be especially nice when you need to use restricted charsets.

I needed to use unicode objects as Zope ids. But Zope only accepts a
subset of ascii as ids.

So I used:


hex_id = u'INBOX'.encode('utf-8').encode('hex')
>>494e424f58


And I can get the unicode representation back with:

unicode_id = hex_id.decode('hex').decode('utf-8')
>>u'INBOX'


In that case hex_id.decode('hex') doesn't return a unicode, but a utf-8
encoded string.
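
The same two-step idea carries over to Python 3, where the hex step is a method on `bytes` rather than a codec name. This is a sketch only; `to_ascii_id` and `from_ascii_id` are hypothetical helper names, not part of Zope or of the code above.

```python
def to_ascii_id(name):
    """Unicode text -> UTF-8 bytes -> ASCII-only hex digits."""
    return name.encode("utf-8").hex()


def from_ascii_id(hex_id):
    """ASCII hex digits -> UTF-8 bytes -> Unicode text."""
    return bytes.fromhex(hex_id).decode("utf-8")


inbox_id = to_ascii_id("INBOX")
print(inbox_id)
print(from_ascii_id(inbox_id))
```

The intermediate value from `bytes.fromhex` is still UTF-8 encoded bytes, so the final decode step is what yields text again.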

 