Erik Wasser 03-02-2006 03:17 PM

Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 
Hello Usenet.

I'm somewhat confused by XML and UTF-8. I'm working with XML::Simple
and trying to decode some XML containing German umlauts (ISO-8859-1).
The first line of the XML declares the encoding correctly (see the code
below), but I get different results depending on whether XML::Simple
uses its default parser, XML::SAX, or a second parser, XML::Parser.
The following code decodes a tiny XML document with each parser and
prints the UTF-8 flags of the resulting strings.

Can someone run this code on their machine and post the results? Thanks.
The results on my machine are:

äöü (0) cmp (0) = -1
(1) cmp (0) = 0

The first line was produced with XML::SAX and the second with
XML::Parser. My conclusions:

1) Line 1 is wrong, line 2 is correct.
2) The output should be line 2 both times.
3) There is a bug in XML::SAX.

Your opinion?

The code (written in ISO-8859-1 on disc):

#!/usr/bin/perl -w

use strict;
use warnings;

use XML::Simple;
use Encode;

foreach (1..2)
{
    # Parse a minimal document declared as ISO-8859-1
    my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>äöü</a>");
    # Reference string, taken from this ISO-8859-1 source file
    my $q2 = "äöü";

    printf "%s (%d) cmp %s (%d) = %d\n"
        , $q1, Encode::is_utf8($q1)
        , $q2, Encode::is_utf8($q2)
        , $q1 cmp $q2;

    # and again with the non-default parser
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}

PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
expat-1.95.8.
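
The two result lines can be reproduced without any XML parser at all. The following is only a sketch using the core Encode module; the variable names are made up for illustration, and it says nothing about which component is actually responsible for the difference. It compares the Latin-1 bytes of the umlauts against (a) the same text properly decoded into a character string and (b) the same text left as raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Latin-1 bytes for the three umlauts, written as escapes so that the
# encoding of this source file does not matter.
my $latin1_bytes = "\xE4\xF6\xFC";

# A properly decoded character string.
my $decoded = decode('ISO-8859-1', $latin1_bytes);

# The same text as raw UTF-8 bytes (six bytes, no UTF8 flag).
my $utf8_bytes = encode('UTF-8', $decoded);

printf "decoded:    flag=%d cmp=%d\n",
    is_utf8($decoded) ? 1 : 0, $decoded cmp $latin1_bytes;
printf "utf8 bytes: flag=%d cmp=%d\n",
    is_utf8($utf8_bytes) ? 1 : 0, $utf8_bytes cmp $latin1_bytes;

# prints:
# decoded:    flag=1 cmp=0
# utf8 bytes: flag=0 cmp=-1
```

A flag of 1 with cmp = 0 has the shape of the second reported line, and a flag of 0 with cmp = -1 has the shape of the first. That would be consistent with one parser returning undecoded UTF-8 bytes, but the sketch alone does not prove it.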

--
So long... Fuzz

A. Sinan Unur 03-02-2006 03:57 PM

Re: Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 
fuzz@uni-paderborn.de (Erik Wasser) wrote in
news:nsejd3-rep.ln1@wasser-7359.user.cis.dfn.de:

> I'm somewhat confused by XML and UTF-8. I'm working with XML::Simple
> and trying to decode some XML containing German umlauts (ISO-8859-1).
> The first line of the XML declares the encoding correctly (see the
> code below), but I get different results depending on whether
> XML::Simple uses its default parser, XML::SAX, or a second parser,
> XML::Parser. The following code decodes a tiny XML document with each
> parser and prints the UTF-8 flags of the resulting strings.
>
> Can someone run this code on his machine and post the results? Thanks.
> The results on my machine are this:
>
> äöü (0) cmp (0) = -1
> (1) cmp (0) = 0
>
> The first line was produced with XML::SAX and the second with
> XML::Parser. My conclusions:
>
> 1) Line 1 is wrong, line 2 is correct.
> 2) The output should be line 2 both times.
> 3) There is a bug in XML::SAX.
>
> Your opinion?
>
> The code (written in ISO-8859-1 on disc):
>
> #!/usr/bin/perl -w
>
> use strict;
> use warnings;
>
> use XML::Simple;
> use Encode;
>
> foreach (1..2)
> {
> my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>äöü</a>");
> my $q2 = "äöü";
>
> printf "%s (%d) cmp %s (%d) = %d\n"
> , $q1, Encode::is_utf8($q1)
> , $q2, Encode::is_utf8($q2)
> , $q1 cmp $q2;
> # and again with the non default parser
> $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
> }
>
> PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
> expat-1.95.8.


First off, let me say I don't know much about this stuff. I am on the US
English version of XP. I copied and pasted the code above into Gvim, and
then ran it. I got:


D:\Home\asu1\UseNet\clpmisc> r > results.txt

D:\Home\asu1\UseNet\clpmisc> cat results.txt
(1) cmp (0) = 0
(1) cmp (0) = 0

I would be inclined to look at what changed in XML-SAX between versions
0.12 and 0.13, but then, as I said, I don't know much about encodings
etc.

I have XML-SAX-0.12 and XML-Parser-2.34 and

D:\Home\asu1\UseNet\clpmisc> perl -v

This is perl, v5.8.7 built for MSWin32-x86-multi-thread
(with 14 registered patches, see perl -V for more detail)

Copyright 1987-2005, Larry Wall

Binary build 815 [211909] provided by ActiveState
http://www.ActiveState.com
ActiveState is a division of Sophos.
Built Nov 2 2005 08:44:52

Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html


robic0 03-05-2006 01:30 AM

Re: Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 
On Thu, 02 Mar 2006 15:57:53 GMT, "A. Sinan Unur" <1usa@llenroc.ude.invalid> wrote:

>fuzz@uni-paderborn.de (Erik Wasser) wrote in
>news:nsejd3-rep.ln1@wasser-7359.user.cis.dfn.de:
>
>> I'm somewhat confused by XML and UTF-8. I'm working with XML::Simple
>> and trying to decode some XML containing German umlauts (ISO-8859-1).
>> The first line of the XML declares the encoding correctly (see the
>> code below), but I get different results depending on whether
>> XML::Simple uses its default parser, XML::SAX, or a second parser,
>> XML::Parser. The following code decodes a tiny XML document with each
>> parser and prints the UTF-8 flags of the resulting strings.
>>
>> Can someone run this code on his machine and post the results? Thanks.
>> The results on my machine are this:
>>


You didn't try to decode in German! You might have changed the "code page"
to German to get a different character set, but it doesn't matter: I'm
looking at your characters in whatever "code page" is active on my machine.
UTF-8 is an encoding of Unicode. It's not discernible unless you have a
Unicode-aware renderer. You can't just cut and paste characters onto the
page and have them turn into Unicode. If you open or save a Unicode
document in a Unicode-aware editor, the displayed characters are not
noticeably Unicode, so it's not something that can be cut and pasted into
a newsgroup as code to be tested! UTF-8, even when multi-byte, is
transparent to the user and known only to the renderer. Unicode data read
from a file into a parser (or into a UTF-8-aware Perl program) is treated
as Unicode in its variable representation and in its interaction with
other variables. If a regex is applied to Unicode data from a
Unicode-aware Perl parser, it works every time.

robic0 03-05-2006 01:43 AM

Re: Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 

Just a follow-up: I know your question was about XML, but if you want to
use Unicode outside the 0-127 range in a regex, you might want to use
code points as in this simple example (which just lists various ranges):

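# Code point ranges, stored as strings for use inside a regex character
# class. They match the non-ASCII NameStartChar ranges of the XML 1.0 grammar.
my @UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);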
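
A list like this only becomes useful once the ranges are joined into a character class. The following is a minimal sketch of one way to do that, not necessarily how the list was meant to be used; it takes just the first three ranges, and the names @ranges, $class and $re are illustrative.

```perl
use strict;
use warnings;

# The first three ranges from the list above. Single quotes keep the
# backslashes literal, so the regex engine sees \x{C0}-\x{D6} and so on.
my @ranges = ('\x{C0}-\x{D6}', '\x{D8}-\x{F6}', '\x{F8}-\x{2FF}');

# Join the ranges into one bracketed character class and compile a
# regex that accepts strings made up only of those characters.
my $class = '[' . join('', @ranges) . ']';
my $re    = qr/^$class+$/;

my $umlauts = "\x{E4}\x{F6}\x{FC}";   # the three umlauts, as code points

print "umlauts match\n"          if $umlauts =~ $re;
print "ASCII does not match\n"   unless "abc" =~ $re;
print "U+00D7 does not match\n"  unless "\x{D7}" =~ $re;
```

U+00D7 (the multiplication sign) falls in the gap between the first two ranges, which is why it is rejected.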

Erik Wasser 03-05-2006 11:49 AM

Re: Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 
robic0 wrote:

> Just a follow-up: I know your question was about XML, but if you want to
> use Unicode outside the 0-127 range in a regex, you might want to use
> code points as in this simple example (which just lists various ranges):


My question was: why do two XML parsers give different results? It is
the different results that confuse me, not Unicode itself.

--
So long... Fuzz

Peter J. Holzer 03-05-2006 10:09 PM

Re: Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)
 
Erik Wasser wrote:

[XML::Simple gives correct results with XML::Parser, but wrong results
with XML::SAX]

> My question was: why do two XML parsers give different results? It is
> the different results that confuse me, not Unicode itself.


Looks like a bug in XML::SAX or one of the libraries it uses.
However, like Sinan, I cannot reproduce it here on a Debian Sarge
system:

perl, v5.8.4 built for i386-linux-thread-multi
XML::Simple version 2.14
XML::SAX version 0.12
XML::Parser version 2.34
libexpat1 1.95.8-3

So it may be caused by something weird in your environment.
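
To compare environments more easily, a small script along the following lines could print the relevant versions. This is only a sketch: module_version is a hypothetical helper, and the last step relies on XML::SAX->parsers, which returns the list of SAX parsers registered on the machine. That list can differ between installations.

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# 'unknown' if it loads but has no $VERSION, or undef if it cannot
# be loaded at all.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return undef unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

for my $module (qw(XML::Simple XML::SAX XML::Parser)) {
    my $version = module_version($module);
    printf "%-12s %s\n", $module, defined $version ? $version : 'not installed';
}

# If XML::SAX loaded, list the SAX parsers it knows about.
if ($INC{'XML/SAX.pm'}) {
    print "registered SAX parser: $_->{Name}\n" for @{ XML::SAX->parsers };
}
```

Posting this output from each machine would show whether the differing results go along with a different XML::SAX version or a different registered SAX parser.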

hp

--
This is not a signature

