How to get UTF-8 from a urlencoded web form?

 
 
Yohan N. Leder
      07-15-2006
Hello.

All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
Apache2.

I'm trying to obtain (and display) user data that comes from a web form
with enctype 'application/x-www-form-urlencoded', without success. I
can do it if the form is 'multipart/form-data', but not if it is
'application/x-www-form-urlencoded'.

Here is a script that shows the difference:

---- BEGIN ----
#!/usr/bin/perl -w
my $this = "utf8_and_webform.pl";

require 5.8.0;
use utf8;
binmode(STDOUT, ':utf8');
print "Content-type: text/html; charset=UTF-8\n\n";
if (defined $ENV{'QUERY_STRING'} && length($ENV{'QUERY_STRING'}) > 0)
{&see;}
else {&ask;}
exit 0;

sub ask
{ # provide web forms for user to enter data
print <<PAGE
<html><head><title>Test about UTF-8 and web form</title></head><body>
Use the form you want and see the resulting data.
<p>
FORM with enctype as 'application/x-www-form-urlencoded' :<br>
<form action='$this?x' method='post' accept-charset='UTF-8'
enctype='application/x-www-form-urlencoded'>
<textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
<input type='submit' value='send'>
</form></p>
<p>
FORM with enctype as 'multipart/form-data' :<br>
<form action='$this?x' method='post' accept-charset='UTF-8'
enctype='multipart/form-data'>
<textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
<input type='submit' value='send'>
</form></p></body></html>
PAGE

}

sub see
{ # display data which come from user form
my $data='';

binmode(STDIN, ':utf8'); # or ':encoding('UTF-8')'
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});

# OR
#use Encode qw(decode);
#read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
#$data = decode('UTF-8', $data);

print $data;

}
----- END ----

For example, if I submit the 'urlencoded' form (the first one, at the top
of the generated web page when you run the script without any URL
parameter) with the letter 'é' (accented e) in the textarea, I get
'msg=%C3%A9' displayed in the browser (after it has been processed by
the see() sub).

Whereas if I submit the same 'é' from the 'multipart/form-data' form (the
second one, at the bottom of the generated web page), I get a correctly
interpreted UTF-8 'é', as expected.

How can I get this same UTF-8 'é' when the form uses the
'application/x-www-form-urlencoded' enctype? How should I modify the
see() sub for this urlencoded form case?
 
Gunnar Hjalmarsson
      07-15-2006
Yohan N. Leder wrote:
> if I submit the 'urlencoded' form (the first one, at top of
> generated web page, if you run the script without any url parameter)
> with the letter 'é' (accentuated e) inside the textarea, I get 'msg=%C3%
> A9' displayed in the browser (knowing this has been proceeded through
> the see() sub).
>
> While, if I submit the same 'é' from the 'multipart/form-data' form (the
> second one, at bottom of generated web page), I get a well interpreted
> UTF-8 'é' as expected.
>
> How to get this same UTF-8 'é' when form uses 'application/x-www-form-
> urlencoded' enctype ?


The problem is covered by this FAQ entry:
http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Yohan N. Leder
      07-15-2006
Gunnar Hjalmarsson wrote:
> The problem is covered by this FAQ entry:
> http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG
>


It doesn't explain the problem; it just makes it go away by using CGI.pm,
and I would like to understand the problem.
 
Gunnar Hjalmarsson
      07-15-2006
Yohan N. Leder wrote:
> Gunnar Hjalmarsson wrote:
>>The problem is covered by this FAQ entry:
>>http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

>
> It doesn't explain the problem, but remove the problem using CGI.pm, and
> I would like to understand the problem.


Excellent learning approach.

The browser automatically URI-escapes 'unsafe' characters when you make
a GET or an x-www-form-urlencoded POST request. Hence those characters
need to be unescaped on the server side, by the program that handles the
request. CGI.pm, as well as other modules for parsing CGI data, takes
care of that.

You can study the docs for the Perl module URI::Escape for a better
explanation.

I suppose you should also read up on the HTTP protocol.
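
To make the unescaping step concrete, here is a minimal sketch of what
such a parser does with an x-www-form-urlencoded body, using only core
modules. The sub name parse_urlencoded is made up for this illustration,
and the regex substitution stands in for what uri_unescape from
URI::Escape does. It assumes the browser submitted the form as UTF-8 and
it ignores repeated field names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn "name=value&name2=value2" into a hash ref
# of decoded character strings. Assumes the form was submitted as UTF-8.
sub parse_urlencoded {
    my ($body) = @_;
    my %param;
    for my $pair (split /[&;]/, $body) {
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                               # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX -> one raw byte
            $_ = decode('UTF-8', $_);              # UTF-8 bytes -> characters
        }
        $param{$name} = $value;                    # last value wins
    }
    return \%param;
}

my $form = parse_urlencoded('msg=%C3%A9');
binmode(STDOUT, ':encoding(UTF-8)');
print $form->{msg}, "\n";   # the single character 'é'
```

In a script like the one posted, the body read from STDIN in see() would
be handed to a sub of this kind instead of being printed as-is.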

HTH

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Bart Van der Donck
      07-16-2006
Yohan N. Leder wrote:

> All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
> Apache2.
>
> I'm trying to obtain (and display) user data which come from a web form
> with enctype as 'application/x-www-form-urlencoded' and don't succeed. I
> can do-it if the form is a 'multipart/form-data' but not a
> 'application/x-www-form-urlencoded'.


[snip code ]

> For example, if I submit the 'urlencoded' form (the first one, at top of
> generated web page, if you run the script without any url parameter)
> with the letter 'é' (accentuated e) inside the textarea, I get 'msg=%C3%
> A9' displayed in the browser (knowing this has been proceeded through
> the see() sub).
>
> While, if I submit the same 'é' from the 'multipart/form-data' form (the
> second one, at bottom of generated web page), I get a well interpreted
> UTF-8 'é' as expected.
>
> How to get this same UTF-8 'é' when form uses 'application/x-www-form-
> urlencoded' enctype ? How to modify the see() sub for this urlencoded
> form case ?


That shouldn't be particularly mysterious. You're specifying the page's
charset as UTF-8 in its header (where you say "Content-type: text/html;
charset=UTF-8"), so the browser encodes the 'é' character (U+00E9, dec
233/hex E9/eacute/LATIN SMALL LETTER E WITH ACUTE) as UTF-8 before
URL-encoding it. In UTF-8, 'é' is the two bytes C3 A9, which look like
the literal 'Ã©' when read as ISO-8859-1: the code point for Ã is C3,
and for © it's A9. Thus the expected value becomes %C3%A9.

Encoding é -> Ã© (bytes C3 A9) -> %C3%A9 :

#!/usr/bin/perl -w
my $posteddata = <STDIN>;
print <<PAGE
Content-type: text/html; charset=UTF-8

<html><body>
Posted data: $posteddata<hr>
<form action='f.pl' method='post'>
<textarea name='msg'></textarea>
<input type='submit'>
</form></body></html>
PAGE

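Whereas the "normal" form encoding, for a page served without a UTF-8
charset (where browsers typically fall back to ISO-8859-1 or a close
relative), would be é -> %E9: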

#!/usr/bin/perl -w
my $posteddata = <STDIN>;
print <<PAGE
Content-type: text/html

<html><body>
Posted data: $posteddata<hr>
<form action='f.pl' method='post'>
<textarea name='msg'></textarea>
<input type='submit'>
</form></body></html>
PAGE

P.S. 'application/x-www-form-urlencoded' is the default form encoding
type anyhow, so there is actually no need to set it as a form
attribute.
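
The two encodings above can be reproduced from Perl. The following is a
small sketch, not what a browser literally runs: the sub name
percent_encode is made up for the illustration, and it percent-encodes
every byte unconditionally, whereas a browser leaves safe ASCII
characters alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string in the given charset,
# then write every resulting byte as %XX.
sub percent_encode {
    my ($text, $charset) = @_;
    my $bytes = encode($charset, $text);
    return join '', map { sprintf('%%%02X', ord($_)) } split //, $bytes;
}

my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE

print percent_encode($e_acute, 'UTF-8'), "\n";        # prints %C3%A9
print percent_encode($e_acute, 'ISO-8859-1'), "\n";   # prints %E9
```

The only difference between the two results is the charset used to turn
the character into bytes before the percent sign is applied.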

Recommended literature:
http://home.tiscali.nl/t876506/utf8tbl.html (search for string C3A9 on
that page)
Table CPs < 256: http://en.wikipedia.org/wiki/ISO_8859-1
And of course Perl FAQ/docs, as Gunnar pointed out.

--
Bart

 
Yohan N. Leder
      07-16-2006
Gunnar Hjalmarsson wrote:
> Excellent learning approach.


Thanks. Better than treating everything as an eternally mysterious black
box.

> The browser automatically URI escapes 'unsafe' characters when you make
> a GET or an x-www-form-urlencoded POST request. Hence those characters
> need to be unescaped by the web server. CGI.pm as well as other modules
> for parsing CGI data takes care of that.
>


Hm, understood !

> You can study the docs for the Perl module URI::Escape for a better
> explanation.


I'll do it for sure
 
Yohan N. Leder
      07-16-2006
Bart Van der Donck wrote:
> That shouldn't be particularly mysterious. You're specifying the page's
> charset as UTF-8 in its header (where you say "Content-type: text/html;
> charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
> literal 'Ã©'
>


That's indeed what I want. However, Gunnar's explanation shows the key
to the problem: URI escaping when the form uses the *urlencoded* enctype.
 
Bart Van der Donck
      07-16-2006
Yohan N. Leder wrote:

> Bart Van der Donck wrote:
> > That shouldn't be particularly mysterious. You're specifying the page's
> > charset as UTF-8 in its header (where you say "Content-type: text/html;
> > charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
> > literal 'Ã©'

>
> Effectively what I want. However the gunnar explanation show the key of
> the problem : URI escaping when *urlencoded* enctype for form.


Yes, the URL encoding is done on the browser's side by default, before
the name/value pairs are sent. This behaviour can be altered by adding
enctype="multipart/form-data" as an extra attribute to
<form method="post">. The main reason this feature exists is the
transfer of (binary) files to the gateway software on the server.
Thus, if you want to send 'é' from a page in ISO-8859-1, the browser
will pass it as "%E9" by default. It's up to your Perl script to decode
it back to 'é'. With the multipart/form-data encoding type, 'é' is just
passed as 'é', in the raw bytes of the page's charset. On a UTF-8 page,
the browser first converts 'é' to its UTF-8 byte sequence (C3 A9), and
then passes the URL-encoded value of those bytes (%C3%A9).

--
Bart

 
Bart Van der Donck
      07-16-2006
Gunnar Hjalmarsson wrote:

> [...]
> The browser automatically URI escapes 'unsafe' characters when you make
> a GET or an x-www-form-urlencoded POST request. Hence those characters
> need to be unescaped by the web server. CGI.pm as well as other modules
> for parsing CGI data takes care of that.


#!/usr/bin/pedant
I think the correct terminology here is actually URL-encoding (or
percent-encoding) instead of URI-escaping
(http://en.wikipedia.org/wiki/URL_encoding).

--
Bart

 
Bart Van der Donck
      07-16-2006
A. Sinan Unur wrote:

> [...]
> Escaping is a general method of changing the meaning of the characters
> following a designated special character. In this case, % is the special
> character, and it changes the meaning of the characters following it.
> Characters not allowed in URIs are replaced with these escape sequences.


Yes, but escaping would then only refer to the %-sign, not to what
follows. In '%E9', '%' is the escape character and 'E9' the encoded
value of 'é'. E9 has nothing to do with escaping; otherwise it would
have been %é (or \é).

So I think we're both 50% right here

--
Bart

 