XML::Simple and utf8 woes

 
 
Guest
Posts: n/a
 
      03-18-2006
Dear wizards,

I use XML::Simple to parse an XML file and
also to write it out. The problem lies in the
utf8 character data contained in the XML
source. While the XMLin() function seems
to read them properly, the XMLout() function
tries to replace utf8 material by multibyte
nonsense.

Below is my minimal example, run under perl 5.8.5
on a Fedora C3 box. Just compare the output
of the script (in w.xml) with its input, in DATA.

Please advise on how to fix the broken utf8 output.

Thanks in advance,
Oliver.

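#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

print "Reading data from XML source...\n";
# Parse the XML in the __DATA__ section; fold <lemma> on its "name" attribute.
my $data = XMLin(\*DATA,
    ForceArray => ['manju', 'hauer'],
    ContentKey => '-content',
    KeyAttr    => ['name'],
);
print "Retrieve and display data example:\n";
my $k = '0004.1';
print $k . ": " . $data->{lemma}->{$k}->{manju}->[0] . "\n";
print "Writing data to XML file...\n";
# Write the structure back out without numeric character escapes.
XMLout($data,
    NumericEscape => 0,
    RootName      => 'wuti',
    XMLDecl       => 1,
    OutputFile    => 'w.xml',
);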
__DATA__
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<wuti>
<lemma name="0004.1">
<hauer>in der Morgendämmerung (H).</hauer>
<manju>farhûn suwaliyame</manju>
</lemma>
<lemma name="0004.2">
<hauer>Morgendämmerung.</hauer>
<manju>gersi fersi</manju>
</lemma>
</wuti>

--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 
ngoc
Guest
Posts: n/a
 
      03-18-2006

> Below is my minimal example, run under perl 5.8.5
> on a Fedora C3 box. Just compare the output
> of the script (in w.xml) with its input, in DATA.

I tried your code on Windows XP, and it produces utf-8 output. However, if
I set RootName => 'unicode here', only the root name comes out changed
(a manual fix helps); the other parts are in utf-8. I suggest you:

1. Save your Perl program in utf-8 encoding.

2. In theory this step is not necessary, but it may help:

open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
XMLout($ref, OutputFile => $fh);

3. Try it in a Windows XP or 2000 environment to see whether the result is different.
 
Guest
Posts: n/a
 
      03-18-2006
ngoc <(E-Mail Removed)> wrote:
: I tried your code in Windows XP. It gives utf-8 output.

Really? I'll have to try tomorrow, don't have an XP box here right now.

: RootName => 'unicode here', only the output of rootname is changed
: (manual fix will help), other parts are in utf-8.

Sounds interesting, I'll try this one, too.

: 1. To save your perl program in utf-8 encoding.

Doesn't make sense, I write everything in utf-8 environment. Did you
notice the a-umlaut and u-caret in the data?

: 2. This step in theory is not necessary. But maybe it helps

: open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
: XMLout($ref, OutputFile => $fh);

I had tried this already before posting, but to no avail.

: 3. Try in Windows XP or 2000 environment to see it is different

Tomorrow.

Thanks, Oliver.

--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 
 
Guest
Posts: n/a
 
      03-20-2006
(E-Mail Removed)-berlin.de wrote:

: Really? I'll have to try tomorrow, don't have an XP box here right now.

I still don't have an XP system at hand.

If you run the code with the -CS flag given to perl, even the innocent
print statement in the middle of the code will output two characters
instead of one utf8-encoded character, and this doesn't change the broken
output of the XMLout() statement.

This is beyond any expectation created after reading the perlrun manpage.

However, if XML::Simple is instructed in the XMLout statement to escape
all non-ASCII characters, then, miraculously, the correct utf8 replacements
appear. It really drives me nuts.

Oliver.
--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 
 
fhscobey
Guest
Posts: n/a
 
      03-20-2006
Hi,
You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
issues crop back up for some reason. Our application deals 100% in
UTF-8 data, but all source code is ISO-8859-1. We really had some
issues getting UTF-8 stuff to work (we started back when 5.8.0 came
out) and found that using 5.8.1, with some well placed ...

Encode::_utf8_on($content);
Encode::_utf8_off($content);

.... seemed to do the trick for us. So you might try to make sure the
UTF-8 flag is turned on for your XML data, and then try and parse it.
We are using some older versions of modules, which at the time, were
just starting to deal with the change in Perl 5.8 to treat content
internally as UTF-8 encoded. Note: I believe Perl 5.8.7 has some
issues with the Encode module specifically with UTF-8, check with
bugs.perl.org for more information.

All of this may seem strange, but I can tell you when we wrote our
application, it worked fine with Perl 5.8.0 and 5.8.1. I've tried
5.8.3|5|7 and all versions are giving us garbled data out.

Also, if you are reading your data in from a handle, you absolutely
have to declare the handle to be UTF-8 encoded. [i.e. open(FH,
"<:utf8", "file");].
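
As an alternative to switching the flag by hand with Encode::_utf8_on (which the Encode documentation marks as internal), here is a minimal sketch of declaring the encoding on both the input and the output handle. It uses only core Perl; the in-memory "file" and the variable names ($source, $written, $line) are made up for illustration, and it does not involve XML::Simple at all.

```perl
use strict;
use warnings;

# UTF-8 bytes for "farh" + U+00FB (u with circumflex) + "n",
# standing in for the contents of a file on disk (illustrative data).
my $source = "farh\xC3\xBBn\n";

# Read through a decoding layer: the program sees characters, not bytes.
open my $in, '<:encoding(UTF-8)', \$source or die "open for reading: $!";
my $line = <$in>;
close $in;
chomp $line;

# Write through an encoding layer: characters become UTF-8 bytes again.
my $written = '';
open my $out, '>:encoding(UTF-8)', \$written or die "open for writing: $!";
print {$out} "$line\n";
close $out or die "close: $!";

printf "characters read: %d\n", length $line;
print  "round trip ok\n" if $written eq $source;
```

With a real file, the same two open() calls would take a file name instead of a scalar reference. Whether this alone cures the XMLout() output on 5.8.5 is not established here.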

Not sure if this helps you at all,
- Jeff

 
 
Guest
Posts: n/a
 
      03-21-2006
fhscobey <(E-Mail Removed)> wrote:
: Hi,
: You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
: issues crop back up for some reason. Our application deals 100% in
: UTF-8 data, but all source code is ISO-8859-1. We really had some
: issues getting UTF-8 stuff to work (we started back when 5.8.0 came
: out) and found that using 5.8.1, with some well placed ...

: Encode::_utf8_on($content);
: Encode::_utf8_off($content);

: issues with the Encode module specifically with UTF-8, check with
: bugs.perl.org for more information.

Hi Jeff,

You really saved my day. So it's _not_ my personal failure to
understand how utf8 in Perl works, but really a problem, version-
dependent too. Thank you.

Anyway, of course, when using file handles, I make sure the line
discipline is set to :utf8, but it does not always help. See my other
answer to the Perl and UTF8 posting.

Best regards,
Oliver.


--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 
 
Chronos Tachyon
Guest
Posts: n/a
 
      03-23-2006
[Whoops, meant to post, not mail]

The problem seems to be the absence of a "use utf8;" pragma. Perl is
assuming that your code (including the __DATA__ section) is in ISO-8859-1.
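
To illustrate what happens when UTF-8 bytes are treated as ISO-8859-1 characters and then encoded again, here is a small sketch using only the core Encode module. The variable names are illustrative, and this shows the general double-encoding effect; it is not a confirmed diagnosis of what XML::Simple does in the script above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "a-umlaut" as it sits in a UTF-8 file: two bytes, 0xC3 0xA4.
my $bytes = "\xC3\xA4";

# Decoding turns the two bytes into one character (U+00E4).
my $char = decode('UTF-8', $bytes);

# Encoding the character string once gives back the original two bytes.
my $once = encode('UTF-8', $char);

# Encoding the undecoded bytes treats each byte as a Latin-1 character,
# so every byte is encoded separately and the result is four bytes long.
my $twice = encode('UTF-8', $bytes);

# Helper for display: hex dump of a byte string.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "decoded length: %d\n", length $char;
printf "encoded once:   %s\n", hex_dump($once);
printf "encoded twice:  %s\n", hex_dump($twice);
```

The "encoded twice" line shows C3 83 C2 A4, which a UTF-8 terminal renders as two odd characters in place of the single a-umlaut.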

[Addendum: FWIW, your newsreader is also making the same assumption.]

--
Donald King, a.k.a. Chronos Tachyon
http://chronos-tachyon.net/
 
 
fhscobey
Guest
Posts: n/a
 
      03-23-2006
Donald brings up a good point. If your source is not ISO-8859-1 (which
I believe you mentioned), you have to use the utf8 pragma. But, I also
believe if you were to try using Perl 5.8.0, you would have to use this
pragma even if it was only the data your script was dealing with.
Starting with 5.8.1+, they deprecated the use of this pragma, to only
be used for telling Perl what encoding your source was in.

See http://perldoc.perl.org/utf8.html for more information.

 
 
Guest
Posts: n/a
 
      03-24-2006
Chronos Tachyon <(E-Mail Removed)> wrote:

: The problem seems to be the absence of a "use utf8;" pragma. Perl is
: assuming that your code (including the __DATA__ section) is in ISO-8859-1.

No, I don't think so, as inserting the utf8 pragma doesn't change anything.
I tried it, and the output is still not in utf8.

: [Addendum: FWIW, your newsreader is also making the same assumption.]

That is a different story, on a different machine. My production code
runs in a true utf8 environment; this one here is only used for
communications. Thank you for the hint, nonetheless!

Oliver.

--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 
 
Guest
Posts: n/a
 
      03-24-2006
fhscobey <(E-Mail Removed)> wrote:
: Donald brings up a good point. If your source is not ISO-8859-1 (which
: I believe you mentioned), you have to use the utf8 pragma. But, I also
: believe if you were to try using Perl 5.8.0, you would have to use this
: pragma even if it was only the data your script was dealing with.
: Starting with 5.8.1+, they deprecated the use of this pragma, to only
: be used for telling Perl what encoding your source was in.

: See http://perldoc.perl.org/utf8.html for more information.

I read that, and also studied the various options to switch -C (see
perlrun for that), and I am really confused why the behaviour of my
system is so out of sync with the descriptions in the documentation.

Oliver.

--
Dr. Oliver Corff e-mail: (E-Mail Removed)-berlin.de
 