LibXML element->toString vs document->toString

 
 
Fergus McMenemie
 
      07-12-2012
Hi, I have been driven mad by the following, which took ages to track
down. What is going on? It appears it is invalid to use toString on the
document object.


#! /usr/local/bin/perl -w
use strict;
use warnings;
use utf8;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xc5\x81"></plugin>
 
Fergus McMenemie
 
      07-13-2012
Ben Morrow <(E-Mail Removed)> wrote:

> Quoth http://www.velocityreviews.com/forums/(E-Mail Removed) (Fergus McMenemie):
> > Hi, I have been driven mad by the following, which took ages to track
> > down. What is going on? It appears it is invalid to use toString on the
> > document object.
> >
> >
> > #! /usr/local/bin/perl -w
> > use strict;
> > use warnings;
> > use utf8;
> > use Encode;
> > use XML::LibXML;
> > binmode(STDOUT, ":utf8");
> >
> > my $src= join("",<DATA>);
> > print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );

>
> Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
> which is internal to perl and none of your business. (The Encode
> documentation is not as clear about this as it might be, because it only
> became clear through experience that this is the only approach which
> works.)
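
To illustrate the point about the flag, here is a minimal sketch using only core modules. It shows two strings that compare equal but differ in the internal flag, and a strict decode as one way to ask whether octets are well-formed UTF-8. The helper name octets_are_valid_utf8 is made up for this example and is not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# The same five ASCII characters, stored two different ways internally.
my $plain    = "hello";
my $upgraded = "hello";
utf8::upgrade($upgraded);    # changes the internal representation only

# The two strings are equal for every Perl-level operation...
my $same = ($plain eq $upgraded) ? 1 : 0;

# ...but the internal flag differs, so it says nothing about validity.
my $flag_plain    = Encode::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = Encode::is_utf8($upgraded) ? 1 : 0;

# Hypothetical helper: test octets by attempting a strict decode.
sub octets_are_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its argument
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $good = octets_are_valid_utf8("\xc5\x81");    # U+0141 encoded as UTF-8
my $bad  = octets_are_valid_utf8("\xff\xfe");    # 0xFF never occurs in UTF-8

print "same=$same flag_plain=$flag_plain flag_upgraded=$flag_upgraded ",
      "good=$good bad=$bad\n";
```

The strict decode answers the question the test program was really asking, without looking at perl's internals.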


Agreed, the warnings are there. However, it did appear to make the
issue clearer. This example is rather goofy, and posting it to Usenet
added a few more wrinkles. My original code and the real program
contained the actual characters. However, my Usenet reader would not
let me post the real chars, hence the octets.

My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.

> What are you actually trying to find out?

I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.
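
One possible shape for such a pattern, as a sketch only: a small helper that takes any node and returns its document, so callers need not care which kind of object they were handed. The name document_of is hypothetical; isa, ownerDocument, documentElement and findnodes are the standard XML::LibXML methods.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: accept a document or any node inside it and
# return the owning XML::LibXML::Document.
sub document_of {
    my ($node) = @_;
    return $node->isa('XML::LibXML::Document') ? $node : $node->ownerDocument;
}

my $doc  = XML::LibXML->load_xml(string => '<plugin name="demo"><item/></plugin>');
my $root = $doc->documentElement;
my ($item) = $root->findnodes('item');

# All three starting points lead to the same document and root element.
my @names = map { document_of($_)->documentElement->nodeName } ($doc, $root, $item);
print join(",", @names), "\n";
```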
 
Fergus McMenemie
 
      07-14-2012
Ben Morrow <(E-Mail Removed)> wrote:

> > > What are you actually trying to find out?

> > I have to pass references to DOM objects around all over the
> > place. I find I am having to make use of either documentElement()
> > or ownerDocument() depending on what I am doing. I would like to have
> > a consistent "pattern" for doing this. I would like to settle on
> > passing the document object around, but it is annoying that I can't then
> > use toString.

>
> I'm afraid I don't understand. When I run the original program I get the
> results I would have expected: the first prints the XML without the
> <?xml?>, the second prints it with it. What is going wrong for you?


Thanks for the tip. My code now reads:-

use strict;
use warnings;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
$src =~ s/\\x([0-9a-f][0-9a-f])/chr hex $1/egi;
$src = Encode::decode "utf8", $src;
print "LibXML VERSION=$XML::LibXML::VERSION\n";
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xef\xbd\xb1\xef\xbd\xb2\xef\xbd\xb3\xef\xbd\xb4\xef\xbd\xb5"></plugin>


It fails on my Mac running OS X Snow Leopard. The 'real' version is
running with perl 5.12 on CentOS and also fails there. Not sure about the
version of LibXML.

Does it work for you?

 
Fergus McMenemie
 
      07-14-2012
Ben Morrow <(E-Mail Removed)> wrote:

> Quoth (E-Mail Removed) (Fergus McMenemie):
> > Ben Morrow <(E-Mail Removed)> wrote:
> > > Quoth (E-Mail Removed) (Fergus McMenemie):

> > > > Hi, I have been driven mad by the following, which took ages to track
> > > > down. What is going on? It appears it is invalid to use toString on the
> > > > document object.
> > > >
> > > >
> > > > #! /usr/local/bin/perl -w
> > > > use strict;
> > > > use warnings;
> > > > use utf8;
> > > > use Encode;
> > > > use XML::LibXML;
> > > > binmode(STDOUT, ":utf8");
> > > >
> > > > my $src= join("",<DATA>);
> > > > print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
> > >
> > > Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
> > > which is internal to perl and none of your business. (The Encode
> > > documentation is not as clear about this as it might be, because it only
> > > became clear through experience that this is the only approach which
> > > works.)

> >
> > Agreed, the warnings are there. However, it did appear to make the
> > issue clearer. This example is rather goofy, and posting it to Usenet
> > added a few more wrinkles. My original code and the real program
> > contained the actual characters. However, my Usenet reader would not
> > let me post the real chars, hence the octets.

>
> It can certainly be difficult, given that Usenet officially doesn't
> support anything but ASCII. Unofficially, if you can get your newsreader
> to produce it, articles in UTF-8 with 'Content-type: text/plain;
> charset=UTF-8' seem to work perfectly well.
>
> Another thing you can do is explicitly decode the data in the program
> you post; possibly something like
>
> my $str = <DATA>;
> $str =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;
> $str = Encode::decode "utf8", $str;
>
> This uses URL-encoding rather than backslashes; you can pick whatever is
> convenient for the data you are trying to post.
>
> > My issue is that document->toString does not appear to work. Please
> > ignore the use of is_utf8.

>
> OK.
>
> > > What are you actually trying to find out?

> > I have to pass references to DOM objects around all over the
> > place. I find I am having to make use of either documentElement()
> > or ownerDocument() depending on what I am doing. I would like to have
> > a consistent "pattern" for doing this. I would like to settle on
> > passing the document object around, but it is annoying that I can't then
> > use toString.

>
> I'm afraid I don't understand. When I run the original program I get the
> results I would have expected: the first prints the XML without the
> <?xml?>, the second prints it with it. What is going wrong for you?
>
> Ben

 
Fergus McMenemie
 
      07-17-2012
Ben Morrow <(E-Mail Removed)> wrote:

> > What gives you that idea? RFC 5536 explicitly allows MIME-encoded
> > data, e.g.,

>
> Ooh, they've actually published an update. I didn't know that.


My newsreader does not properly support UTF-8. I guess lots of others still
don't either.

MacSoup - my soup's gone off!
 
Fergus McMenemie
 
      07-17-2012
Ben Morrow <(E-Mail Removed)> wrote:

> Yes, it works as documented for me. Are you getting confused by the fact
> that ->toString produces a byte string for whole documents, but a
> character string for just an element? Read the 'ENCODINGS SUPPORT'
> section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
> printing a whole document, because the document isn't necessarily in
> UTF-8.


Duh!
Thanks, I don't know how I managed to miss that bit.
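
For anyone finding this thread later, a minimal sketch of the distinction described above, assuming a document that declares encoding="utf-8". The variable names are illustrative only. The element's toString gives characters, which need encoding on the way out, while the document's toString gives octets that are already encoded and should be printed without a :utf8 layer.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# U+0141 written as its two UTF-8 octets, as in the first test program.
my $src = qq{<?xml version="1.0" encoding="utf-8"?>\n}
        . qq{<plugin name="\xc5\x81"></plugin>\n};

my $doc = XML::LibXML->load_xml(string => $src);

my $elem_chars = $doc->documentElement->toString;    # character string
my $doc_bytes  = $doc->toString;                      # octets in the document's encoding

# The element string holds one character U+0141; the document string
# holds the two octets that encode it.
my $elem_has_char  = ($elem_chars =~ /\x{141}/)  ? 1 : 0;
my $doc_has_octets = ($doc_bytes  =~ /\xc5\x81/) ? 1 : 0;

# Print both through a raw handle: the document goes out as-is, the
# element string is encoded explicitly first.
binmode(STDOUT, ':raw');
print $doc_bytes;
print Encode::encode('UTF-8', $elem_chars), "\n";
```

Printing $doc_bytes through a :utf8 layer would encode the octets a second time, which is one way to get the mangled output seen with the original program.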
 