Re: XML/XHTML/HTML differences, bugs... and howto

 
 
Stefan Behnel
      01-24-2013
Andrew Robinson, 23.01.2013 16:22:
> Good day,
>
> I've been exploring XML parsers in python; particularly:
> xml.etree.cElementTree; and I'm trying to figure out how to do it
> incrementally, for very large XML files -- although I don't think the
> problems are restricted to incremental parsing.
>
> First problem:
> I've come across an issue where etree silently drops text without telling
> me; and separate.
>
> I am under the impression that XHTML is a subset of XML (e.g. defined tags),
> and that once an HTML file is converted to XHTML, the body of the document
> can be handled entirely as XML.
>
> If I convert a (partial/contrived) html file like:
>
> <html>
> <div>
> <p> This is example <b>bold</b> text.
> </div>
> </html>
>
> to XHTML, I might do --right or wrong-- (1):
>
> <html>
> <div>
> <p /> This is example <b>bold</b> text.
> </div>
> </html>
>
> or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>"
>
> But, when I parse with etree, in example (1) both "This is an example" and
> "text." are dropped;
> The missing text is part of the start, or end event tags, in the
> incrementally parsed method.
>
> Likewise: In example (2), only "text" gets dropped.


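Nope, nothing is dropped. Text that follows an element's closing tag is
stored in that element's .tail attribute, not in .text. You should read the
manual on this. Here's a tutorial: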

http://lxml.de/tutorial.html#elements-contain-text

This is using lxml.etree, which is the Python XML library most people use
these days. It's ElementTree compatible, so the tutorial also works for ET
(unless stated otherwise).
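
To make the text/tail split concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree on a compacted version of example (1) from the question. The variable names are only illustrative, and the whitespace was trimmed so the values are easy to read:

```python
import xml.etree.ElementTree as ET

# Example (1) from the question, written on one line.
doc = "<html><div><p/> This is example <b>bold</b> text.</div></html>"
root = ET.fromstring(doc)

p = root.find("div/p")
b = root.find("div/b")

# <p/> is empty, so it has no .text; the words after it live in p.tail.
print(repr(p.text))   # None
print(repr(p.tail))   # ' This is example '

# "bold" is inside <b>, the words after </b> are b.tail.
print(repr(b.text))   # 'bold'
print(repr(b.tail))   # ' text.'

# itertext() walks .text and .tail in document order, so nothing is lost.
full = "".join(root.find("div").itertext())
print(repr(full))     # ' This is example bold text.'
```

The same attributes are what you see in incremental parsing, with the caveat that an element's .tail may not be filled in yet when its "end" event is reported.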


> Isn't XML supposed to error out when invalid xml is parsed?


It does.
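
As a quick check, here is a small sketch that feeds the original HTML snippet (with the unclosed <p>) to the standard library parser. Both XHTML conversions in the question are well-formed, which is why they parse without complaint; the unconverted one is not, and the parser rejects it:

```python
import xml.etree.ElementTree as ET

# The original HTML from the question: <p> is never closed.
html_like = "<html><div><p> This is example <b>bold</b> text.</div></html>"

error = None
try:
    ET.fromstring(html_like)
except ET.ParseError as exc:
    # The parser refuses input that is not well-formed XML.
    error = exc

print("parse failed:", error)
```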


> I have an XML file which will grow larger than memory on a target machine,
> so here's what I want to do:
>
> Given a source XML file, and a destination file:
> 1) iteratively scan part of the source tree.
> 2) Optionally modify some of the scanned tree.
> 3) Write partial scan/tree out to the destination file.
> 4) Free memory of no-longer needed (partial) source XML.
> 5) Continue scanning a new section of the source file, i.e. go to step 1
> until source file is exhausted.
>
> But, I don't see a way to write portions of an XML tree, or iteratively
> write a tree to disk.
> How can this be done?


There are several ways to do it. Python has a couple of external libraries
available that are made specifically for generating markup incrementally.

lxml also gained that feature recently. It's not documented yet, but here
are usage examples:

https://github.com/lxml/lxml/blob/ma...tal_xmlfile.py
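
For comparison, the five steps from the question can also be sketched with the standard library alone. This is not the lxml feature linked above, just one possible approach under stated assumptions: the function name stream_copy and its modify callback are made up for illustration, every direct child of the root is treated as one chunk, and the root's attributes, namespaces and any text directly inside the root are ignored.

```python
import io
import xml.etree.ElementTree as ET


def stream_copy(source, dest, modify=None):
    """Copy XML from source to dest one top-level child at a time.

    Hypothetical helper: source and dest are binary file objects.
    Assumes a root without attributes or namespaces.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # first event: start of the root
    dest.write(b"<" + root.tag.encode() + b">")

    depth = 0                            # nesting level below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:                   # a direct child of the root ended
            if modify is not None:
                modify(elem)             # step 2: optional modification
            dest.write(ET.tostring(elem))  # step 3: write the chunk out
            root.remove(elem)            # step 4: release the chunk

    dest.write(b"</" + root.tag.encode() + b">")


# Small demonstration with in-memory files.
src = io.BytesIO(b"<root><item>1</item><item>2</item></root>")
dst = io.BytesIO()


def times_ten(elem):
    elem.text = str(int(elem.text) * 10)


stream_copy(src, dst, modify=times_ten)
print(dst.getvalue())
```

Removing each chunk from the root after writing it is what keeps memory use bounded; without that step the root keeps a reference to every child it has seen.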

Stefan


 