How to send utf-8 data using LWP::UserAgent?

 
 
Gert Brinkmann
 
      07-25-2006
Hello,

I am using LWP::UserAgent to send utf-8 encoded xml-data to a web-server.

my $req = HTTP::Request->new (
POST => "http://myhost:8181",
HTTP::Headers->new (
'content-type' => "text/xml; charset=utf-8",
),
$xml_data,
);

my $ua = LWP::UserAgent->new;
my $resp = $ua->simple_request($req);

The problem is that LWP seems to convert the utf-8 data to iso-latin. I
have checked this by listening on port 8181 via: "netcat -l -p 8181".
German umlauts show up there correctly readable as äöüß, but IMHO they
should not.

I have also checked that the terminal is not converting the data, by writing
a file containing the string "gört" with gedit and netcat'ing it to
port 8181. The result is: "gÃ¶rt" as expected.

What am I doing wrong?

Thanks,
Gert

 
Peter J. Holzer
 
      07-25-2006
On Tue, 25 Jul 2006 20:19:03 +0200, Gert Brinkmann wrote:
> I am using LWP::UserAgent to send utf-8 encoded xml-data to a web-server.
>
> my $req = HTTP::Request->new (
> POST => "http://myhost:8181",
> HTTP::Headers->new (
> 'content-type' => "text/xml; charset=utf-8",
> ),
> $xml_data,
> );
>
> my $ua = LWP::UserAgent->new;
> my $resp = $ua->simple_request($req);
>
> The problem ist, that lwp seems to convert the utf-8 data to iso-latin. I
> have checked this by listening on the port 8181 via: "netcat -l -p 8181".
> German Umlauts do occur there correctly readable as äöüß, but IMHO should
> not.

[...]
> What am I doing wrong?


You are not providing a complete script to demonstrate your problem.
Where does $xml_data come from? How do you know that it contains UTF-8?

Dump $xml_data in hex to see what it really contains:

printf STDERR "%x ", ord($_) for (split//, $xml_data);

If "gört" is printed as
67 f6 72 74
it's not UTF-8. It should be
67 c3 b6 72 74
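
A minimal sketch of that check, assuming the "gört" example from this thread (the helper name hexdump is made up for illustration). It dumps the same word once as a character string and once as UTF-8 encoded bytes, so the two hex lines above can be reproduced:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the ordinals of a string as space-separated hex,
# the same idea as the printf one-liner above.
sub hexdump {
    my ($string) = @_;
    return join ' ', map { sprintf('%x', ord($_)) } split(//, $string);
}

my $chars = "g\x{f6}rt";               # four characters: g, o-umlaut, r, t
my $bytes = encode('UTF-8', $chars);   # five bytes: the umlaut becomes c3 b6

print hexdump($chars), "\n";   # 67 f6 72 74
print hexdump($bytes), "\n";   # 67 c3 b6 72 74
```

If the data handed to LWP dumps like the first line, it has not been encoded to UTF-8 yet.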

hp

--
_ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
|_|_) | Sysadmin WSR | > ist?
| | | (E-Mail Removed) | Was sonst wäre der Sinn des Erfindens?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd

 
Gert Brinkmann
 
      07-26-2006

Thank you, Peter, for your answer.

Peter J. Holzer wrote:
> You are not providing a complete script to demonstrate your problem.


Yes, sorry. I was so sure that the input to LWP was correct... but it
is not.

> Where does $xml_data come from? How do you know that it contains UTF-8?


I did a check via dumping data into a file:

-----------------
binmode $fh;
print $fh "isutf8=", (Encode::is_utf8($text,0)?1:0), "; correct=",
(Encode::is_utf8($text,1)?1:0), "; debugprint=$text\n";
-----------------

the result was:
-----------------
isutf8=1; correct=1; ...gört...
-----------------

I had only looked at the utf-8 flag and the utf-8-is-correct flag. But now,
after rechecking with your hexdump printout, I see that it is a mistake
that "gört" is printed out readable.

Why does the is_utf8($text,1) routine tell me that the utf-8 string is
correct utf-8 even if there is an iso-latin "ö" in the string?

Hmm, now I have to find out why the "ö" is not correctly stored as utf-8. This
charset/encoding topic is so unbelievably complicated.

Thank you again,
Gert

 
 
Gert Brinkmann
 
      07-26-2006
Gert Brinkmann wrote:

> Why does the is_utf8($text,1) routine tells me, that the utf-8 String is
> correct utf-8 even if there is an iso-latin "ö" in the string?


Ok. The string is completely correct. It is tagged as utf8 and it contains
utf8. But the question is: why is utf8 converted to iso-latin again when
it is written to the "binmode'd" file?

Here is a test-script:
-----------------------------------------------
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

my $x = 'gört';
$x = Encode::encode("utf-8", $x);
Encode::_utf8_on($x);

open (my $fh, ">foo.log") or die "could not open foo.log";
binmode $fh;
print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
"; correct=", (Encode::is_utf8($x,1)?1:0),";\n";
print $fh $x;
print $fh "\n";
close $fh;
-----------------------------------------------

Executing it gives the following:
$ perl utf8test.pl ; cat foo.log
isutf8=1; correct=1;
gört

I have also tried with
binmode $fh, ":raw"
or ":bytes", but it does not make any difference.

Gert

 
 
Alan J. Flavell
 
      07-26-2006
On Wed, 26 Jul 2006, Gert Brinkmann wrote:

> Gert Brinkmann wrote:
>
> > Why does the is_utf8($text,1) routine tells me, that the utf-8
> > String is correct utf-8 even if there is an iso-latin "ö" in the
> > string?

>
> Ok. The string is completely correct. It is tagged as utf8 and it
> contains utf8.


Without being able to tell you the precise answer, I suspect this is a
consequence of Perl's attempt to be compatible with earlier versions.
If your string contains nothing more than iso-8859-1 characters, then
in some circumstances it will be treated as such, even though a
utf8-ified version of the string is available to those who ask for it
nicely. If there had been just one character in the string that was
outside of the iso-8859-1 repertoire, I suspect you would have seen
different behaviour.

I *think* a careful perusal of perldoc perlunicode for the relevant
Perl version should help.

But there are some hunches in what I say above, and ICBW. Hope it's
vaguely useful.
 
 
Ben Morrow
 
      07-26-2006

Quoth Gert Brinkmann <(E-Mail Removed)>:
> Gert Brinkmann wrote:
>
> > Why does the is_utf8($text,1) routine tells me, that the utf-8 String is
> > correct utf-8 even if there is an iso-latin "ö" in the string?

>
> Ok. The string is completely correct. It is tagged as utf8 and it contains
> utf8. But the question ist: Why is utf8 converted to iso-latin again, when
> writing it into the "binmode'd" file?
>
> Here is a test-script:
> -----------------------------------------------
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use Encode;
>
> my $x = 'gört';
> $x = Encode::encode("utf-8", $x);


This is wrong. (I'm surprised you didn't get an error.) encode converts
from characters to bytes; you want to convert from bytes (in whatever
your source file is in, probably iso8859-1) into characters, so you want

$x = Encode::decode iso8859_1 => $x;

An alternative to this would be to use the encoding pragma to tell Perl
what charset your source file uses.

> Encode::_utf8_on($x);


NO! You should never need to call the _utf8_o{n,ff} functions.

> open (my $fh, ">foo.log") or die "could not open foo.log";


open my $fh, '>:encoding(utf8)', 'foo.log' or die...;

Tell Perl what you want, or it doesn't know what to give you.
:encoding(utf8) is (IMHO) preferable to :utf8 as you get better error
handling.

> binmode $fh;


This says '$fh is for binary data'. That means that each character
printed to $fh will be written out as a single byte if possible, IOW
the string will be printed in ISO8859-1. Characters above \xff will give
a 'wide character in print' warning, and (I think, but this situation is
Wrong anyway) utf8 output.
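
A small sketch of the difference, under the assumption that a temporary file stands in for foo.log (the helper name bytes_written is made up for illustration). The same character string is written once through a :raw handle and once through an :encoding(UTF-8) handle, and the bytes that landed on disk are read back and shown in hex:

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary file through the given
# layer and return the raw bytes that ended up on disk, as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, ">$layer", $path or die "cannot write $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
    open my $in, '<:raw', $path or die "cannot read $path: $!";
    my $raw = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $raw);
}

# A character string: g, o-umlaut, r, t (decoded from UTF-8 bytes).
my $chars = decode('UTF-8', "g\xc3\xb6rt");

my $as_raw  = bytes_written(':raw', $chars);              # one byte per character
my $as_utf8 = bytes_written(':encoding(UTF-8)', $chars);  # encoded on output

print "raw:   $as_raw\n";    # raw:   67 f6 72 74
print "utf-8: $as_utf8\n";   # utf-8: 67 c3 b6 72 74
```

The :raw result is the ISO8859-1 output described above; only the handle with an encoding layer produces the two-byte c3 b6 sequence.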


> print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
> "; correct=", (Encode::is_utf8($x,1)?1:0),";\n";


Again, you don't need to care about the state of the internal utf8 flag.
Just tell Perl you want $x to be characters, not bytes.

Ben

--
I must not fear. Fear is the mind-killer. I will face my fear and
I will let it pass through me. When the fear is gone there will be
nothing. Only I will remain.
(E-Mail Removed) Frank Herbert, 'Dune'
 
 
Gert Brinkmann
 
      07-27-2006

Thank you, Ben,

with this information I have to reread the utf8- and Encode-perldocs to
really "internalize"(?) this topic.

Ben Morrow wrote:
>> Encode::_utf8_on($x);

>
> NO! You should never need to call the _utf8_o{n,ff} functions.


But what do you do if you receive a CGI parameter that was sent from a
web browser in utf-8? On the server side, AFAIK, you do not get the information
from http about which charset was used. If you know that the script is working in
your completely utf-8 enabled web application, it should be utf-8. But is
the $parameter CGI variable correctly tagged as utf-8 by the CGI module? In
my understanding it receives utf-8 text strings and stores them in a
non-utf-8 variable that you have to tag as utf-8 yourself. Isn't that so?

Thanks,
Gert

 
 
Ben Morrow
 
      07-27-2006

Quoth Gert Brinkmann <(E-Mail Removed)>:
>
> Thank you, Ben,
>
> with this information I have to reread the utf8- and Encode-perldocs to
> really "internalize"(?) this topic.


The most important point (and I'm not sure the Perl docs currently make
this entirely clear) is that you always have to know whether a given
string is a sequence of *characters* or a sequence of *bytes*. This is
not the same as whether the perl-internal utf8 flag is on, due to perl's
back-compat stuff.

Basically, all input is in bytes, and all text data should be decoded to
characters before processing. Binary data obviously shouldn't. So on
input (from any source that doesn't do the decoding for you) you need to
determine (somehow) what charset the data is expected to be in, and
decode it. Then on output (again to any destination that takes bytes
directly) you need to decide (somehow) what charset you want and encode
the data before output.

One way of making this easier is to push the :encoding layer onto a
filehandle (see PerlIO::encoding): this does the de/encoding for you
automatically so the filehandle now appears to be a stream of characters
rather than a stream of bytes.
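
The decode-on-input, encode-on-output rule can be sketched as follows. This is only an illustration, assuming the incoming bytes are known to be UTF-8 and that UTF-8 is also wanted on output; the variable names are made up:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input always arrives as bytes. Here: "gört" as five UTF-8 bytes.
my $input_bytes = "g\xc3\xb6rt";

# Decode once, right at the boundary, to get four characters.
my $text = decode('UTF-8', $input_bytes);

# Process as characters: uc() sees one o-umlaut, not two bytes.
my $upper = uc $text;

# Encode once, right before output. A byte string like this is what one
# would hand to something that sends bytes as-is (for instance as the
# content of an HTTP request, which is an assumption about the original
# question, not something tested here).
my $output_bytes = encode('UTF-8', $upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $input_bytes, length $text, length $output_bytes;
# 5 bytes in, 4 characters, 5 bytes out
```

Keeping the two conversions at the edges means everything in between only ever sees characters.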

[Note to pacify Alan: my use of the term 'charset' above (and yours
below) corresponds to the MIME parameter of the same name, rather than
to a 'character set' proper]

> Ben Morrow wrote:
> >> Encode::_utf8_on($x);

> >
> > NO! You should never need to call the _utf8_o{n,ff} functions.

>
> But what are you doing if you receive a CGI-parameter that was sent from a
> web-browser in utf-8? On server side AFAIK you do not get the information
> from http which charset was used. If you know that the script is working in
> your completely utf-8 enabled web-application it should be utf-8. But is
> the $parameter CGI variable correctly tagged as utf-8 by the CGI module? In
> my understanding it receives utf-8 textstrings and stores it into an
> non-utf-8 variable that has to be utf-8-tagged by yourself. Isn't it?


I don't really understand the situation you're describing (but then my
knowledge of CGI programming is somewhat limited). Are you saying the
data is known to be in UTF8, or that you don't know what charset it's
in?

A string that contains a sequence of bytes that happen to be valid UTF8
is not at all the same thing as a string that contains the sequence of
characters represented by those bytes. In fact, converting from one to
the other is what the Encode::decode function is for.

The internal utf8 flag *does not* mean 'this string is in UTF8' in any
sense that matters to a user of Perl. What it means is 'this string
contains characters rather than bytes, *AND* some of those characters
are above 0xff'. Or sometimes '... *AND* some of those characters used
to be above 0xff but aren't any more, but I haven't noticed that yet'.
Do you begin to see now why this is a property of the string you really
don't care about?
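
A short sketch of why the flag is not worth looking at, using only the built-in utf8::upgrade and utf8::is_utf8 functions (the variable names are made up). Two strings hold the same four characters but differ in their internal flag, and Perl still treats them as equal:

```perl
use strict;
use warnings;

my $plain    = "g\x{f6}rt";   # four characters, stored internally as Latin-1
my $upgraded = "g\x{f6}rt";
utf8::upgrade($upgraded);     # same characters, internal storage is now UTF-8

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same          = $plain eq $upgraded      ? 1 : 0;

print "flag on plain:    $flag_plain\n";      # flag on plain:    0
print "flag on upgraded: $flag_upgraded\n";   # flag on upgraded: 1
print "strings equal:    $same\n";            # strings equal:    1
print "lengths: ", length($plain), " and ", length($upgraded), "\n";
# lengths: 4 and 4
```

The flag only describes how Perl stores the string, so it says nothing about whether the contents are decoded characters or raw bytes.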

Ben

--
Musica Dei donum optimi, trahit homines, trahit deos. |
Musica truces mollit animos, tristesque mentes erigit.|(E-Mail Removed)
Musica vel ipsas arbores et horridas movet feras. |
 
 
Alan J. Flavell
 
      07-27-2006
On Thu, 27 Jul 2006, Gert Brinkmann wrote:

> Ben Morrow wrote:
> >> Encode::_utf8_on($x);

> >
> > NO! You should never need to call the _utf8_o{n,ff} functions.

>
> But what are you doing if you receive a CGI-parameter that was sent
> from a web-browser in utf-8?


An interesting question - but not, I think, a question to which the
answer could ever be _utf8_on($x)

> On server side AFAIK you do not get the information
> from http which charset was used.


The simplest case (and recommended, except that the old NN4.* does not
work, if anybody still cares), is to send out the page which contains
the form, as utf-8, and the browser will respond by submitting the
form in utf-8 encoding.

More complex things can happen if Accept-charset is used. I don't
think I would want to go there, as there seems to be no advantage in
it.

Some browsers, in some situations, unilaterally add to the submitted
data an extra name=value pair, with the name "_charset_" and the value
being the submission encoding that they are using. You can't rely on
getting this, though.

> But is the $parameter CGI variable correctly tagged as utf-8 by the
> CGI module?


"tagging as utf-8" is something which Perl does behind the scenes when
you apply appropriate encode/decode operations on data. Except in
some very obscure situations, it's not something that it makes any
sense to set directly, as Ben has already shown.
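
For the form case, a hedged sketch of what "apply appropriate decode operations" could look like. It assumes the parameter value arrives as raw bytes and that the form page was served as utf-8; decode_param is a made-up helper, not part of the CGI module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of one form field into characters,
# or return undef if the bytes are not valid UTF-8.
sub decode_param {
    my ($raw) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $raw intact.
    my $text = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = decode_param("g\xc3\xb6rt");   # valid UTF-8 bytes
my $bad  = decode_param("g\xf6rt");       # a Latin-1 byte, not valid UTF-8

print defined $good ? "good: " . length($good) . " characters\n" : "good: rejected\n";
# good: 4 characters
print defined $bad  ? "bad: accepted\n" : "bad: rejected\n";
# bad: rejected
```

Unlike _utf8_on, decode checks the bytes, so a browser that submitted in some other encoding is noticed instead of silently producing a broken string.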

> In my understanding it receives utf-8 textstrings and stores it into
> an non-utf-8 variable that has to be utf-8-tagged by yourself. Isn't
> it?


I thought Ben already addressed that point. Ah, and via googroups I
see that he has already responded, although it hasn't yet reached my
news swerver. So I'll leave it there for now.
 