Re: utf8 encoding problem

 
 
Wichert Akkerman
 
      01-22-2004
Previously Denis S. Otkidach wrote:
> You have to pass 8-bit string, but not unicode. The following
> code works as expected:
>
> >>> urllib.unquote('t%C3%A9st').decode('utf-8')
> u't\xe9st'


Ah, that does work indeed, thanks.
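
For readers on Python 3, where urllib.unquote no longer exists, a rough
equivalent of the same unescape-then-decode sequence might look like the
sketch below. It assumes the form data was submitted as UTF-8; the variable
names are illustrative only.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = 't%C3%A9st'

# Step 1: undo the %HH escapes, which yields raw bytes.
raw = unquote_to_bytes(escaped)

# Step 2: decode those bytes with the encoding the form was submitted in
# (UTF-8 is an assumption here).
text = raw.decode('utf-8')

# unquote() can do both steps at once when told the encoding.
same_text = unquote(escaped, encoding='utf-8')

# Decoding with the wrong encoding does not fail; it silently produces
# two garbage characters in place of the single accented one.
wrong = unquote(escaped, encoding='latin-1')
```

The last line shows the failure mode to watch for: the bytes are interpreted
one character per byte instead of as a single UTF-8 sequence.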

> P.S. According to the HTML standard, with
> application/x-www-form-urlencoded content type form data are
> restricted to ASCII codes:
> http://www.w3.org/TR/html4/interact/...#form-data-set
> http://www.w3.org/TR/html4/interact/...#submit-format


Luckily that is not true, otherwise it would be completely impossible to
have websites using non-ascii input. To be specific, the encoding used
for HTML forms is determined by:

1. The accept-charset attribute of the form element, if present. Not
   all browsers handle this, though.
2. The encoding used for the HTML page containing the form.
3. ASCII otherwise.

This is specified in section 17.3 of the HTML 4.01 standard you are
referring to.
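
As a minimal sketch, the three-step fallback above could be written as
follows. The function name and parameters are hypothetical, and taking the
first entry of accept-charset is a simplifying assumption, not something
the standard prescribes.

```python
def pick_form_charset(accept_charset=None, page_charset=None):
    """Guess the charset a browser would use for a form (hypothetical helper)."""
    if accept_charset:
        # accept-charset is a space- and/or comma-separated list;
        # this sketch simply takes the first entry.
        candidates = accept_charset.replace(',', ' ').split()
        if candidates:
            return candidates[0].lower()
    if page_charset:
        # Fall back to the encoding of the page containing the form.
        return page_charset.lower()
    # Nothing else is known.
    return 'ascii'
```

For example, pick_form_charset(None, 'ISO-8859-1') falls through to the page
encoding, and calling it with no arguments yields 'ascii'.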

Wichert.

--
Wichert Akkerman <(E-Mail Removed)> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.


 
 
 
 
 
Martin v. Löwis
 
      01-24-2004
Wichert Akkerman wrote:


>>P.S. According to HTML standard, with
>>application/x-www-form-urlencoded content type form data are
>>restricted to ASCII codes:

[...]
> Luckily that is not true, otherwise it would be completely impossible to
> have websites using non-ascii input. To be specific, the encoding used
> for HTML forms is determined by: [algorithm omitted]


As Denis explains, it is true. See 17.13.4

application/x-www-form-urlencoded
.... Non-alphanumeric characters are replaced by `%HH', a percent sign
and two hexadecimal digits representing the ASCII code of the character.

So this form is restricted only to characters which have an ASCII code,
i.e. ASCII characters.
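
A short illustration of the point, using Python 3's urllib.parse (an
assumption; the thread itself uses Python 2). For ASCII input the %HH value
is the character's ASCII code, as the spec describes. For a non-ASCII
character the escapes depend entirely on which encoding the sender picked,
and nothing in the escaped string says which one that was.

```python
from urllib.parse import quote_plus

# Pure ASCII: '&' becomes %26, its ASCII code; a space becomes '+'.
ascii_only = quote_plus('a b&c')

# The same non-ASCII character escaped under two different encodings.
as_utf8 = quote_plus('t\xe9st', encoding='utf-8')
as_latin1 = quote_plus('t\xe9st', encoding='latin-1')
```

Here as_utf8 is 't%C3%A9st' and as_latin1 is 't%E9st'; a receiver has to
know the encoding by some other means to tell them apart.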

To have non-ASCII input, use multipart/form-data:

multipart/form-data
....
The content type "multipart/form-data" should be used for submitting
forms that contain files, non-ASCII data, and binary data.

This reconfirms that you should use it for non-ASCII.

Regards,
Martin

 
 
 
 
 
Skip Montanaro
 
      01-24-2004

>> Luckily that is not true, otherwise it would be completely impossible
>> to have websites using non-ascii input. To be specific, the encoding
>> used for HTML forms is determined by: [algorithm omitted]


Martin> As Denis explains, it is true. See 17.13.4

Sorry, but I'm coming to this discussion late. See "17.13.4" of what
document?

Thx,

Skip

 
 
Andrew Clover
 
      01-24-2004
Martin v. Loewis <(E-Mail Removed)> wrote:

> As Denis explains, it is true. See 17.13.4


Indeed. [Skip: http://www.w3.org/TR/html4/interact/...html#h-17.13.4 ]

> To have non-ASCII input, use multipart/form-data:


Quite so, in theory. Of course in reality, no browser today includes a
Content-Type header in the subparts of a multipart/form-data submission,
so there's nowhere to specify a charset here either! argh.

multipart/form-data as implemented in current UAs is just as encoding-unaware
as application/x-www-form-urlencoded, sadly. In practical terms it does not
really matter much which is used.

[...waiting for the glorious day when UTF-8 and UCS-4 are the only acceptable
encodings; and on that day, Shift-JIS will be the first against the wall oh
let me blummin' well tell you my brother...]

--
Andrew Clover
mailto:(E-Mail Removed)
http://www.doxdesk.com/
 
 
Martin v. Löwis
 
      01-25-2004
Andrew Clover wrote:
> Quite so, in theory. Of course in reality, no browser today includes a
> Content-Type header in the subparts of a multipart/form-data submission,
> so there's nowhere to specify a charset here either! argh.


Right. In this case, the algorithm Wichert quotes should apply.
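
A rough sketch of what that means on the receiving side, in Python 3 (an
assumption; the part below is a hand-written example, not captured browser
output, and page_charset stands in for whatever encoding the server knows
the form page used):

```python
import email

# One multipart/form-data part, as a browser might send it: no charset
# is declared anywhere in its headers.
raw_part = (
    b'Content-Disposition: form-data; name="comment"\r\n'
    b'\r\n'
    b't\xc3\xa9st'
)

# Split the headers from the body at the first blank line.
header_block, _, body = raw_part.partition(b'\r\n\r\n')
headers = email.message_from_bytes(header_block)

# No charset parameter is present, so this is None.
declared = headers.get_content_charset()

# Fall back to the encoding of the page that contained the form.
page_charset = 'utf-8'
text = body.decode(declared or page_charset)
```

If the fallback guess is wrong, the decode step either raises or produces
garbage, which is the practical problem under discussion.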

I once tried to study why browsers won't send Content-Type headers.
Actually, they *do* send Content-Type headers, but omit the charset=
parameter. I submitted various bug reports, and the Mozilla people
replied that they tried to, and found that various CGI scripts would
break when confronted with the standards-conforming request, but
work when they get the deprecated form.

So it looks like this situation will extend indefinitely.

> multipart/form-data as implemented in current UAs is just as encoding-unaware
> as application/x-www-form-urlencoded, sadly. In practical terms it does not
> really matter much which is used.


Right - for practical terms, standards don't matter much. As this thread
shows, the form used *does* matter in practical terms though: Users
of application/x-www-form-urlencoded are now confronted with the
unescaping-then-decoding issue, which apparently is a challenge.

Regards,
Martin

 