urllib.unquote and unicode

George Sakkis
      12-19-2006
The following snippet results in a different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')


# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

George

 
Leo Kislov
      12-19-2006

George Sakkis wrote:
> The following snippet results in a different outcome for (at least) the
> last three major releases:
>
> >>> import urllib
> >>> urllib.unquote(u'%94')

>
> # Python 2.3.4
> u'%94'
>
> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)
>
> # Python 2.5
> u'\x94'
>
> Is the current version the "right" one or is this function supposed to
> change every other week ?


IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
codec so that the result is the same as
urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can consider
the current behaviour undefined, just as when you pass a random object
into some function you can get a different outcome in different Python
versions.
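
(A minimal sketch of the byte-string route described above, assuming
Python 2.x; the helper name is made up for illustration:)

import urllib

def unquote_bytes(u):
    # Percent-escapes are plain ASCII, so encoding the unicode input to a
    # byte string first keeps unquote() working purely on octets.
    return urllib.unquote(u.encode('ascii'))

print repr(unquote_bytes(u'%94'))   # '\x94' -- a byte string, same on 2.3/2.4/2.5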

-- Leo

 
Peter Otten
      12-19-2006
George Sakkis wrote:

> The following snippet results in a different outcome for (at least) the
> last three major releases:
>
>>>> import urllib
>>>> urllib.unquote(u'%94')


> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)


Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.unquote(u"%94")

u'\x94'
>>>


From the above I infer that the 2.4.2 behaviour was considered a bug.

Peter

 
Fredrik Lundh
      12-19-2006
George Sakkis wrote:

> The following snippet results in a different outcome for (at least) the
> last three major releases:
>
>>>> import urllib
>>>> urllib.unquote(u'%94')

>
> # Python 2.3.4
> u'%94'
>
> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)
>
> # Python 2.5
> u'\x94'
>
> Is the current version the "right" one or is this function supposed to
> change every other week ?


why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.
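
(A minimal sketch of the round trip being described, assuming Python 2.x
and UTF-8 as the application's chosen encoding:)

import urllib

text = u'\x94'                               # non-ASCII unicode character
quoted = urllib.quote(text.encode('utf-8'))  # encode first, then quote: '%C2%94'
octets = urllib.unquote(quoted)              # unquote yields octets: '\xc2\x94'
assert octets.decode('utf-8') == text        # decode only at the very end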

</F>

 
Duncan Booth
      12-19-2006
"Leo Kislov" <(E-Mail Removed)> wrote:

> George Sakkis wrote:
>> The following snippet results in a different outcome for (at least) the
>> last three major releases:
>>
>> >>> import urllib
>> >>> urllib.unquote(u'%94')

>>
>> # Python 2.3.4
>> u'%94'
>>
>> # Python 2.4.2
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position
>> 0: ordinal not in range(128)
>>
>> # Python 2.5
>> u'\x94'
>>
>> Is the current version the "right" one or is this function supposed
>> to change every other week ?

>
> IMHO, none of the results is right. Either the unicode string should be
> rejected by raising ValueError, or it should be encoded with the ascii
> codec so that the result is the same as
> urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can consider
> the current behaviour undefined, just as when you pass a random object
> into some function you can get a different outcome in different Python
> versions.


I agree with you that none of the results is right, but not that the
behaviour should be undefined.

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

That means that the string u'\x94' should be encoded as %c2%94. The
string %94 should generate a unicode decode error, but it should be the
utf-8 codec raising the error not the ascii codec.
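
(A sketch of what that would mean in practice, assuming Python 2.x and
doing the UTF-8 codec work by hand around the current quote/unquote:)

import urllib

# u'\x94' is two octets in UTF-8, so it should percent-encode to two escapes:
print urllib.quote(u'\x94'.encode('utf-8'))   # '%C2%94'

# A lone %94 is not valid UTF-8, so decoding the unquoted octets fails in
# the utf-8 codec rather than the ascii codec:
try:
    urllib.unquote('%94').decode('utf-8')
except UnicodeDecodeError, e:
    print e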

Unfortunately RFC3986 isn't entirely clear-cut on this issue:

> When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be percent-
> encoded. For example, the character A would be represented as "A",
> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
> as "%C3%80", and the character KATAKANA LETTER A would be represented
> as "%E3%82%A2".


I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.

Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
for a unicode string.

 
George Sakkis
      12-19-2006
Fredrik Lundh wrote:
> George Sakkis wrote:
>
> > The following snippet results in a different outcome for (at least) the
> > last three major releases:
> >
> >>>> import urllib
> >>>> urllib.unquote(u'%94')

> >
> > # Python 2.3.4
> > u'%94'
> >
> > # Python 2.4.2
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> > ordinal not in range(128)
> >
> > # Python 2.5
> > u'\x94'
> >
> > Is the current version the "right" one or is this function supposed to
> > change every other week ?

>
> why are you passing non-ASCII Unicode strings to a function designed for
> fixing up 8-bit strings in the first place? if you do proper encoding
> before you quote things, it'll work the same way in all Python releases.


I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus char encodings; I have the
impression that whatever generated it used ord() to encode reserved
characters instead of the proper hex representation in latin-1. If
that's the case, unquote() won't do anyway and I'd have to go with
chr() on the number part.
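
(A hedged sketch of that chr()-on-the-number-part repair, assuming the
page's generator really wrote '%' followed by a decimal ord() value; the
function name and the two-to-three-digit guess are invented:)

import re

def unquote_decimal(u):
    # Treat the digits after '%' as a decimal ordinal rather than hex,
    # and map them back to a (unicode) character.
    return re.sub(ur'%(\d{2,3})', lambda m: unichr(int(m.group(1))), u)

print repr(unquote_decimal(u'foo%94bar'))   # u'foo^bar', since chr(94) is '^'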

George

 
Martin v. Löwis
      12-19-2006
Duncan Booth wrote:
> The way that uri encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8 and then each byte which is not in
> the permitted range for characters is encoded as % followed by two hex
> characters.


Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.

> Unfortunately RFC3986 isn't entirely clear-cut on this issue:
>
>> When a new URI scheme defines a component that represents textual
>> data consisting of characters from the Universal Character Set [UCS],
>> the data should first be encoded as octets according to the UTF-8
>> character encoding [STD63]; then only those octets that do not
>> correspond to characters in the unreserved set should be percent-
>> encoded. For example, the character A would be represented as "A",
>> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> as "%C3%80", and the character KATAKANA LETTER A would be represented
>> as "%E3%82%A2".


This is irrelevant; it talks about new URI schemes only.

> I think it leaves open the possibility that existing URI schemes which do
> not support unicode characters can use other encodings, but given that the
> original posting started by decoding a unicode string I think that utf-8
> should definitely be assumed in this case.


No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin
 
Duncan Booth
      12-20-2006
"Martin v. Löwis" <(E-Mail Removed)> wrote:

> Duncan Booth wrote:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not
>> in the permitted range for characters is encoded as % followed by two
>> hex characters.

>
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?


I'm not sure I have time to read the various RFCs in depth right now,
so I may have to come back to this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken with respect to their handling of unicode. In
particular, % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encode it or throw a suitable
exception (not the KeyError it seems to throw now).

My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
u'\xa3', which is a character, not an octet. I think it should always
return a byte string, or it should calculate a byte string and then decode
it according to some suitable encoding, or it should throw an exception
[choose any of the above].

Adding an optional encoding parameter to quote/unquote would be one
option, although since you can encode/decode the argument yourself it
doesn't add much.
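
(One possible shape for that, sketched as a wrapper rather than a change
to urllib itself; the name and the utf-8 default are invented, assuming
Python 2.x:)

import urllib

def unquote_text(s, encoding='utf-8'):
    # Percent-unescape to octets first, then decode once with an explicit
    # encoding, so the caller rather than urllib picks the character set.
    if isinstance(s, unicode):
        s = s.encode('ascii')    # the escapes themselves are plain ASCII
    return urllib.unquote(s).decode(encoding)

print repr(unquote_text(u'%C2%94'))                  # u'\x94'
print repr(unquote_text('%A3', encoding='latin-1'))  # u'\xa3'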

> No, the http scheme is defined by RFC 2616 instead. It doesn't really
> talk about encodings, but hints an interpretation in 3.2.3:


The applicable RFC is 3986. See RFC 2616, section 3.2.1:
> For definitive information on URL syntax and semantics, see "Uniform
> Resource Identifiers (URI):
> Generic Syntax and Semantics," RFC 2396 [42] (which replaces RFCs
> 1738 [4] and RFC 1808 [11]).


and RFC 2396:
> Obsoleted by: 3986



> Now, RFC 2396 already says that URIs are sequences of characters,
> not sequences of octets, yet RFC 2616 fails to recognize that issue
> and refuses to specify a character set for its scheme (which
> RFC 2396 says that it could).


and RFC 2277, section 3.1, says that it MUST identify which charset is used
(although that's just a best-practices document, not a standard). (The block
capitals are the RFC's, not mine.)

> The conventional wisdom is that the choice of URI encoding for HTTP
> is a server-side decision; for that reason, IRIs were introduced.


Yes, I know that in practice some systems use other character sets.
 
Walter Dörwald
      12-21-2006
Martin v. Löwis wrote:
> Duncan Booth wrote:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not in
>> the permitted range for characters is encoded as % followed by two hex
>> characters.

>
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?
>
> In URIs, it is entirely unspecified what the encoding is of non-ASCII
> characters, and whether % escapes denote characters in the first place.


http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
Walter
 
Martin v. Löwis
      12-21-2006
>>> The way that uri encoding is supposed to work is that first the input
>>> string in unicode is encoded to UTF-8 and then each byte which is not in
>>> the permitted range for characters is encoded as % followed by two hex
>>> characters.

>> Can you back up this claim ("is supposed to work") by reference to
>> a specification (ideally, chapter and verse)?

> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1


Thanks. Unfortunately, this isn't normative, but a "we recommend". In
addition, it talks about URIs found in HTML only. If somebody writes
a user agent in Python, they are certainly free to follow this
recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin
 