XML::Twig doctype and entity handling

 
 
Zed Pobre
      09-04-2008
I'm writing a program that needs to extract a clump of XML metadata
stored inside of a noncompliant HTML file and then perform a number of
operations on that metadata. (Specifically, for those curious, this
is part of a Mobipocket .prc to IDPF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML
entities may be found in the metadata portion. In particular, &copy;
is the first entity that XML::Parser will choke on in my current test
data.

Could someone please provide me with an example of how to get
XML::Twig to recognize XHTML entities? (Or even just &copy; to get me
started?) I came up with a workaround that slurps the input file, uses
a regular expression to split the metadata out into a temporary file,
and then runs tidy on it, but it's something of an evil hack, given
that I have to read the results of that back into XML::Twig anyway.

My last attempt at getting XML::Twig to read this looks like this:

$mobihtmltwig = XML::Twig->new(
    load_DTD                 => 1,
    twig_roots               => { 'metadata' => 1 },
    twig_handlers            => { 'metadata' => \&twig_cut_metadata },
    output_encoding          => 'utf8',
    pretty_print             => 'indented',
    twig_print_outside_roots => 'HTML'
);

$mobihtmltwig->set_doctype(
    'package',
    "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
    "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN");

$mobihtmltwig->entity_list->add_new_ent(copy => "©");

print $mobihtmltwig->entity_names,"\n";

$mobihtmltwig->parsefile($mobihtmlfile);


It dies at the parsefile command with:

undefined entity at line 1, column 413, byte 413 at
/usr/lib/perl5/XML/Parser.pm line 187

Byte 413 is the first &copy;. This is despite 'copy' being present in
the entity list.

Thanks for any help,

--
Zed Pobre <(E-Mail Removed)> a.k.a. Zed Pobre <(E-Mail Removed)>
PGP key and fingerprint available on finger; encrypted mail welcomed.
 
Peter J. Holzer
      09-06-2008
["Followup-To:" header set to comp.lang.perl.misc.]
On 2008-09-04 23:11, Zed Pobre <(E-Mail Removed)> wrote:
> I'm writing a program that needs to extract a clump of XML metadata
> stored inside of a noncompliant HTML file and then perform a number of
> operations on that metadata. (Specifically, for those curious, this
> is part of a Mobipocket .prc to IDPF .epub ebook converter.)
>
> The HTML file in question has no doctype declaration, and XHTML
> entities may be found in the metadata portion. In particular, &copy;
> is the first entity that XML::Parser will choke on in my current test
> data.
>
> Could someone please provide me with an example of how to get
> XML::Twig to recognize XHTML entities?


Just prepend a declaration. For example here is a snippet from one of my
scripts which deals with a similar situation:

while ($lines[0] =~ /\s*<use /) {
    shift @lines;
}
my $encoding = "utf-8";
if ($lines[0] =~ / charset=["'](.*?)["']/) {
    $encoding = $1;
}
my $text = join('', (
    "<?xml version='1.0' encoding='$encoding' ?>\n",
    "<!DOCTYPE protokoll SYSTEM 'http://www.luga.at/dtd/protokoll.dtd'\n",
    " [\n",
    " <!ENTITY euro '€'>\n",
    " <!ENTITY mdash '—'>\n",
    " <!ENTITY rArr '⇒'>\n",
    " ]\n",
    ">\n",
    @lines
    )
);

This first strips off a few extra lines (which start with "<use "), then
extracts the encoding from the first remaining line and then prepends an
XML declaration with the encoding and a doctype declaration with a few
entities.
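
Applied to the &copy; case from the original question, the same idea might look like the sketch below. It assumes XML::Twig is installed; the helper name with_entity_prolog, the sample document and the choice of entities are made up for illustration. The entities are declared in an internal DTD subset, which the parser reads before it reaches the content.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: prepend an XML declaration and a DOCTYPE whose
# internal subset declares each named entity as a numeric character reference.
sub with_entity_prolog {
    my ($root, $entities, $body) = @_;
    my $decls = join '',
        map { sprintf("  <!ENTITY %s '&#%d;'>\n", $_, $entities->{$_}) }
        sort keys %$entities;
    return qq{<?xml version="1.0" encoding="utf-8"?>\n}
         . '<!DOCTYPE ' . $root . " [\n" . $decls . "]>\n"
         . $body;
}

# Made-up sample input that uses &copy; without declaring it.
my $body = '<metadata><rights>&copy; 2008 Example Press</rights></metadata>';
my $xml  = with_entity_prolog('metadata', { copy => 169, mdash => 8212 }, $body);

my $twig = XML::Twig->new;
$twig->parse($xml);    # without the prolog this dies with "undefined entity"
print "parsed root: ", $twig->root->tag, "\n";
```

Only the entities that actually occur in the input need to be listed; the full XHTML set could be declared the same way.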

hp
 
Zed Pobre
      09-07-2008
Peter J. Holzer <(E-Mail Removed)> wrote:
>
>
> ["Followup-To:" header set to comp.lang.perl.misc.]
> On 2008-09-04 23:11, Zed Pobre <(E-Mail Removed)> wrote:
>> I'm writing a program that needs to extract a clump of XML metadata
>> stored inside of a noncompliant HTML file and then perform a number of
>> operations on that metadata. (Specifically, for those curious, this
>> is part of a Mobipocket .prc to IDPF .epub ebook converter.)
>>
>> The HTML file in question has no doctype declaration, and XHTML
>> entities may be found in the metadata portion. In particular, &copy;
>> is the first entity that XML::Parser will choke on in my current test
>> data.
>>
>> Could someone please provide me with an example of how to get
>> XML::Twig to recognize XHTML entities?

>
> Just prepend a declaration. For example here is a snippet from one of my
> scripts which deals with a similar situation:


Thanks for the suggestion, but I think you misunderstand the situation
-- the input file looks something like this (and I don't have control
over its generation):

<html><head><metadata> <dc-metadata [...] </metadata></head><body>[...]

The goal is to avoid slurping the file, but extract and separate the
<metadata>...</metadata> block from the HTML via XML::Twig, outputting
HTML with the metadata block removed, parsing and modifying the XML
metadata block, then outputting that as a separate file. The source
files involved average half a megabyte in size, and can reach several
megabytes.

My hope was to use XML::Twig to keep memory usage down, and certainly
to avoid a twig root involving the entire HTML+XML metadata structure. At
least, the Twig documentation implied that it could do this in a
low-memory fashion, pulling out only the parts needed. The
documentation also lists functions (that are either buggy or that I am
apparently using incorrectly) to define an entity list or assign a
doctype prior to a parse. I'm hoping that someone can give an example
of correct usage.

My current workaround is actually somewhat similar to yours, except at
a file level: I have a subroutine that slurps the file, regexps out
the metadata block, saves the metadata block to a new file with a
proper XML header and doctype prepended, saves everything else to a
HTML-only file, and then returns, so I can call XML::Twig only on the
outputted XML file. This works, but still allocates a potentially
huge amount of memory during the splitting process, even if that
memory is available to Twig after it returns.

I've been contemplating bludgeoning out a low-memory solution with
sysread, since the metadata will always be at the top of the file and
has never so far been larger than about 8kb, but was hoping to see if
someone knew how to get Twig working first.
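
A minimal sketch of that sysread idea, using only core Perl, is below. The helper name read_metadata_block, the default chunk size and the byte limit are illustrative assumptions, as is the sample file written for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read from the start of a file in fixed-size chunks and return the first
# <metadata>...</metadata> block, or nothing if none is found within
# roughly $limit bytes. Defaults (8 KiB chunks, 64 KiB limit) are assumptions.
sub read_metadata_block {
    my ($path, $chunk, $limit) = @_;
    $chunk ||= 8192;
    $limit ||= 65536;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buf = '';
    while (length($buf) < $limit) {
        my $n = sysread($fh, $buf, $chunk, length $buf);   # append to $buf
        die "Read error on $path: $!" unless defined $n;
        last if $n == 0;                                    # end of file
        # Searching the whole buffer each time copes with a closing tag
        # that straddles a chunk boundary.
        if ($buf =~ m{(<metadata[\s>].*?</metadata>)}s) {
            close $fh;
            return $1;
        }
    }
    close $fh;
    return;
}

# Made-up sample file: a well-formed metadata block inside broken HTML.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} '<html><head><metadata> <dc-metadata/> </metadata></head><body><p>unclosed';
close $tmp;

my $block = read_metadata_block($name, 16);   # tiny chunks to exercise the loop
print "$block\n";   # prints <metadata> <dc-metadata/> </metadata>
```

Memory use is bounded by the limit rather than by the size of the HTML file, since reading stops as soon as the closing tag has been seen.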

Thanks again,

--
Zed Pobre <(E-Mail Removed)> a.k.a. Zed Pobre <(E-Mail Removed)>
PGP key and fingerprint available on finger; encrypted mail welcomed.
 
John Bokma
      09-07-2008
Zed Pobre <(E-Mail Removed)> wrote:

> I've been contemplating bludgeoning out a low-memory solution with
> sysread, since the metadata will always be at the top of the file and
> has never so far been larger than about 8kb, but was hoping to see if
> someone knew how to get Twig working first.


If you want to reduce memory to a minimum you can't avoid using a
streaming solution. I probably would use XML::Parser or SAX.

It's not clear if by HTML you actually mean XHTML (I guess yes, otherwise
you might bump into problems with XML parsing)

--
John http://johnbokma.com/ - Hacking & Hiking in Mexico

Perl help in exchange for a gift:
http://johnbokma.com/perl/help-in-ex...or-a-gift.html
 
Zed Pobre
      09-08-2008
John Bokma <(E-Mail Removed)> wrote:
>
> Zed Pobre <(E-Mail Removed)> wrote:
>
>> I've been contemplating bludgeoning out a low-memory solution with
>> sysread, since the metadata will always be at the top of the file and
>> has never so far been larger than about 8kb, but was hoping to see if
>> someone knew how to get Twig working first.

>
> If you want to reduce memory to a minimum you can't avoid using a
> streaming solution. I probably would use XML::Parser or SAX.
>
> It's not clear if by HTML you actually mean XHTML (I guess yes, otherwise
> you might bump into problems with XML parsing)


Unfortunately, I really do mean HTML, and very badly formed HTML at
that. The only part that can be relied upon to be well-formed is the
<metadata>...</metadata> clump that I was trying to extract with
twig_roots without actually parsing the rest of the file.

It turns out that this isn't possible even with XML::Twig.

One of the kind monks over at perlmonks.org pointed out that there's
nothing stopping me from passing parsefile() a pipe, so I got past the
doctype problem by passing it 'cat oeb12doctype.xml input.html|', at
which point the parse() cheerfully got so far as splitting off the
HTML with all of the metadata elements removed before die-ing horribly
on a mismatched tag. According to the Twig documentation, there is no
way to proceed and get the extracted elements anyway, so this entire
technique has been a dead end, though amusingly this technique does
work to split out the HTML without the <metadata> elements, since
twig_print_outside_roots will finish up before the parser dies from
mismatched tags. That probably isn't reliable, though.

I'll have to constrain the memory use by doing the initial split in
10k chunks, I guess.

Thanks for the help.

--
Zed Pobre <(E-Mail Removed)> a.k.a. Zed Pobre <(E-Mail Removed)>
PGP key and fingerprint available on finger; encrypted mail welcomed.
 
Peter J. Holzer
      09-09-2008
["Followup-To:" header set to comp.lang.perl.misc.]
On 2008-09-08 19:17, Zed Pobre <(E-Mail Removed)> wrote:
> John Bokma <(E-Mail Removed)> wrote:
>> Zed Pobre <(E-Mail Removed)> wrote:
>>> I've been contemplating bludgeoning out a low-memory solution with
>>> sysread, since the metadata will always be at the top of the file and
>>> has never so far been larger than about 8kb, but was hoping to see if
>>> someone knew how to get Twig working first.

>>
>> If you want to reduce memory to a minimum you can't avoid using a
>> streaming solution. I probably would use XML::Parser or SAX.
>>
>> It's not clear if by HTML you actually mean XHTML (I guess yes,
>> otherwise you might bump into problems with XML parsing)

>
> Unfortunately, I really do mean HTML, and very badly formed HTML at
> that.


Then you cannot/should not use an XML parser. XML is designed to have a
strict syntax, and all XML parsers I know rely on this and enforce it
(more or less).

> The only part that can be relied upon to be well-formed is the
><metadata>...</metadata> clump that I was trying to extract with
> twig_roots without actually parsing the rest of the file.
>
> It turns out that this isn't possible even with XML::Twig.
>
> One of the kind monks over at perlmonks.org pointed out that there's
> nothing stopping me from passing parsefile() a pipe, so I got past the
> doctype problem by passing it 'cat oeb12doctype.xml input.html|',


I was going to suggest that when I read about your memory constraints
(although I think slurping a few MB into a single string won't
hurt you - the big problem with large XML files is usually the parsed
representation, a tree with hundreds of thousands of nodes for a few
MB).

However, if you know that everything inside your metadata elements is
well-formed XML and that the metadata elements aren't nested and that
there are no CDATA sections which might contain the string "<metadata>"
or "</metadata>", you can easily extract those sections:

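$/ = "</metadata>";

print $preamble, "<fakeroot>";
while (<>) {
    # each record ends with </metadata>, except the trailing one, which
    # holds only the HTML after the last block and is skipped:
    next unless m{</metadata>\z};
    # throw away everything up to <metadata (/s lets . match newlines):
    s/.*(?=<metadata[\s>])//s;
    print $_;
}
print "</fakeroot>";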

put that in a subprocess and pass the pipe to XML::Twig. (or maybe
XML::Twig has a method with which you can feed it chunks of input - I
think it does, but a quick scanning of the man page didn't reveal it).


> at which point the parse() cheerfully got so far as splitting off the
> HTML with all of the metadata elements removed before die-ing horribly
> on a mismatched tag. According to the Twig documentation, there is no
> way to proceed and get the extracted elements anyway,


I think you can get the extracted elements, but there is no way to
continue after the first parse error. So unless you are certain that the
first error occurs after the sections you are interested in you can't
use that.
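
A small sketch of that behaviour, assuming XML::Twig is installed and using a made-up input string: the handler stores each metadata element as it is closed, and the parse is wrapped in eval so that whatever was collected before the first error is still available afterwards.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up input: a well-formed metadata block followed by broken markup.
my $doc = '<html><head><metadata><title>Example</title></metadata></head>'
        . '<body><p>mismatched</b></body></html>';

my @found;    # serialized copy of every <metadata> element seen so far
my $twig = XML::Twig->new(
    twig_handlers => {
        metadata => sub {
            my ($t, $elt) = @_;
            push @found, $elt->sprint;
        },
    },
);

# parse() dies on the first well-formedness error, so trap it.
my $ok  = eval { $twig->parse($doc); 1 };
my $err = $ok ? '' : $@;
print "parse failed: $err\n" unless $ok;
print "metadata blocks captured before the error: ", scalar(@found), "\n";
print "$_\n" for @found;
```

Anything that appears after the first error is lost, so this only helps when the interesting elements come before the first piece of broken markup.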

hp
 
Peter J. Holzer
      09-09-2008
On 2008-09-09 13:25, Peter J. Holzer <(E-Mail Removed)> wrote:
> However, if you know that everything inside your metadata elements is
> well-formed XML and that the metadata elements aren't nested and that
> there are no CDATA sections which might contain the string "<metadata>"
> or "</metadata>", you can easily extract those sections:
>
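> $/ = "</metadata>";
>
> print $preamble, "<fakeroot>";
> while (<>) {
>     # each record ends with </metadata>, except the trailing one, which
>     # holds only the HTML after the last block and is skipped:
>     next unless m{</metadata>\z};
>     # throw away everything up to <metadata (/s lets . match newlines):
>     s/.*(?=<metadata[\s>])//s;
>     print $_;
> }
> print "</fakeroot>";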
>
> put that in a subprocess and pass the pipe to XML::Twig. (or maybe
> XML::Twig has a method with which you can feed it chunks of input - I
> think it does, but a quick scanning of the man page didn't reveal it).


Sorry, I missed that there is only one metadata element and that it is
always near the start of the document. In that case it's a lot simpler.
You don't need a loop, as you only need to read one element, and you can
pass that element straight to the parse method of an XML::Twig object
without wrapping it in a fake root element.
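
A sketch of that simpler variant follows. It assumes XML::Twig is installed; the helper name first_metadata_record and the sample input are hypothetical. If the block contains entities such as &copy;, a DOCTYPE that declares them would still have to be prepended before parsing.

```perl
use strict;
use warnings;
use XML::Twig;

# Return the first <metadata>...</metadata> element read from a filehandle,
# reading no further than the closing tag.
sub first_metadata_record {
    my ($fh) = @_;
    local $/ = '</metadata>';          # one "line" now ends at the closing tag
    my $record = <$fh>;
    return unless defined $record && $record =~ m{</metadata>\z};
    # Drop the HTML in front of the opening tag; give up if there is none.
    $record =~ s/\A.*?(?=<metadata[\s>])//s or return;
    return $record;
}

# Made-up sample input, opened as an in-memory file for the demonstration.
my $html = "<html><head>\n<metadata> <dc-metadata><title>Example</title>"
         . "</dc-metadata> </metadata></head><body><p>never parsed";
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

my $xml = first_metadata_record($fh) or die "no metadata block found";
my $twig = XML::Twig->new;
$twig->parse($xml);                    # only the well-formed element is parsed
print $twig->root->first_descendant('title')->text, "\n";
```

The rest of the HTML is never handed to the parser, so its mismatched tags no longer matter.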
 