Peter J. Holzer <hjp-> wrote:
>
>
> ["Followup-To:" header set to comp.lang.perl.misc.]
> On 2008-09-04 23:11, Zed Pobre <> wrote:
>> I'm writing a program that needs to extract a clump of XML metadata
>> stored inside of a noncompliant HTML file and then perform a number of
>> operations on that metadata. (Specifically, for those curious, this
>> is part of a Mobipocket .prc to IDPF .epub ebook converter.)
>>
>> The HTML file in question has no doctype declaration, and XHTML
>> entities may be found in the metadata portion. In particular, &copy;
>> is the first entity that XML::Parser will choke on in my current test
>> data.
>>
>> Could someone please provide me with an example of how to get
>> XML::Twig to recognize XHTML entities?
>
> Just prepend a declaration. For example here is a snippet from one of my
> scripts which deals with a similar situation:

Thanks for the suggestion, but I think you misunderstand the situation
-- the input file looks something like this (and I don't have control
over its generation):

    <html><head><metadata> <dc-metadata [...] </metadata></head><body>[...]

The goal is to avoid slurping the file, but extract and separate the
<metadata>...</metadata> block from the HTML via XML::Twig, outputting
HTML with the metadata block removed, parsing and modifying the XML
metadata block, then outputting that as a separate file. The source
files involved average half a megabyte in size, and can reach several
megabytes.

My hope was to use XML::Twig to keep memory usage down, and certainly
to avoid a twig root involving the entire HTML + XML metadata
structure. At least, the Twig documentation implied that it could do
this in a low-memory fashion, pulling out only the parts needed. The
documentation also lists functions (that are either buggy or that I am
apparently using incorrectly) to define an entity list or assign a
doctype prior to a parse. I'm hoping that someone can give an example
of correct usage.

My current workaround is actually somewhat similar to yours, except at
a file level: I have a subroutine that slurps the file, regexps out
the metadata block, saves the metadata block to a new file with a
proper XML header and doctype prepended, saves everything else to an
HTML-only file, and then returns, so I can call XML::Twig only on the
resulting XML file. This works, but still allocates a potentially
huge amount of memory during the splitting process, even if that
memory is available to Twig after it returns.

I've been contemplating bludgeoning out a low-memory solution with
sysread, since the metadata will always be at the top of the file and
has so far never been larger than about 8 KB, but was hoping to see if
someone knew how to get Twig working first.
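
A minimal sketch of that sysread idea, using core Perl only. The sub
name split_metadata, the 8 KB chunk size, and the 64 KB cap on how far
to look for </metadata> are all hypothetical choices, and only &copy;
is declared in the prepended doctype; a real file would likely need
more of the XHTML entity set. It also assumes the metadata is in an
encoding that an XML parser will accept without an explicit encoding
declaration.

```perl
use strict;
use warnings;

# Hypothetical helper: split $in into a standalone XML metadata file
# and an HTML file with the <metadata> block removed. Memory use is
# bounded by roughly $max_head plus one chunk, not by the file size.
sub split_metadata {
    my ($in, $meta_out, $html_out, $max_head) = @_;
    $max_head ||= 64 * 1024;    # assumed cap on where the block ends
    my $chunk = 8 * 1024;       # assumed read size

    open my $ifh, '<:raw', $in or die "Cannot open $in: $!";

    # Accumulate chunks until the closing tag shows up or the cap is hit.
    my $head = '';
    while ($head !~ m{</metadata>}i && length($head) < $max_head) {
        my $got = sysread($ifh, $head, $chunk, length $head);
        die "Read error on $in: $!" unless defined $got;
        last if $got == 0;      # end of file
    }

    $head =~ m{\A(.*?)(<metadata\b.*?</metadata>)(.*)\z}is
        or die "No complete <metadata> block near the top of $in";
    my ($before, $meta, $after) = ($1, $2, $3);

    # Write the metadata with an XML header and a doctype whose
    # internal subset declares the entity (only &copy; here).
    open my $mfh, '>:raw', $meta_out or die "Cannot open $meta_out: $!";
    print {$mfh} qq{<?xml version="1.0"?>\n},
                 qq{<!DOCTYPE metadata [ <!ENTITY copy "&#169;"> ]>\n},
                 $meta, "\n";
    close $mfh or die "Cannot close $meta_out: $!";

    # Write the HTML around the block, then stream the rest of the
    # input through one chunk at a time.
    open my $hfh, '>:raw', $html_out or die "Cannot open $html_out: $!";
    print {$hfh} $before, $after;
    while (1) {
        my $got = sysread($ifh, my $buf, $chunk);
        die "Read error on $in: $!" unless defined $got;
        last if $got == 0;
        print {$hfh} $buf;
    }
    close $hfh or die "Cannot close $html_out: $!";
    close $ifh;

    return length $meta;
}
```

The file written to $meta_out could then be handed to XML::Twig's
parsefile as in the current workaround; this only replaces the
slurping step, and does nothing to make Twig itself recognize the
entities in the original HTML.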
Thanks again,
--
Zed Pobre <> a.k.a. Zed Pobre <>
PGP key and fingerprint available on finger; encrypted mail welcomed.