Trying to parse a HUGE(1gb) xml file

 
 
spaceman-spiff
      12-20-2010
Hi c.l.p folks

This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long post.

I am trying to parse a ginormous ( ~ 1gb) xml file.


0. I am a python & xml n00b, & have been relying on the excellent beginner book DIP (Dive Into Python 3, by MP (Mark Pilgrim)). Mark, if u are reading this, you are AWESOME & so is your witty & humorous writing style.


1. Almost all examples of parsing xml in python that i have seen start off with these 4 lines of code.

import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot() #my huge xml has 1 root at the top level
print root

2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in-memory(RAM), which represents the entire document.
I tried this code, which works fine for a small (~1MB) file, but when i run this simple 4 line py code in a terminal for my HUGE target file (1GB), nothing happens.
In a separate terminal, i run the top command, & i can see a python process, with memory (the VIRT column) increasing from 100MB all the way up to 2100MB.

I am guessing that, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins, nothing happens.
I don't get an error, seg fault or out_of_memory exception.

My hardware setup : I have a win7 pro box with 8gb of RAM & an intel core2 quad cpu (q9400).
On this i am running sun virtualbox(3.2.12), with ubuntu 10.10 as guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest ubuntu os.

3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
[http://www.ibm.com/developerworks/xm...-hiperfparse/]

When i ran the same 4 line code above, but with lxml's ElementTree (using the import below in line 1 of the code above)
import lxml.etree as lxml_etree

i can see the memory consumption of the python process (which is running the code) shoot up to ~2700mb & then python (or the os?) kills the process as it nears the total system memory (2gb)

I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (http://imgur.com/HAoHA.png)

4. I then investigated some streaming libraries, but am confused - there is SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse interface[http://effbot.org/zone/element-iterparse.htm]

Which one is the best for my situation ?

Any & all code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of the c.l.p community would be greatly appreciated.
Plz feel free to email me directly too.

thanks a ton

cheers
ashish

email :
ashish.makani
domain:gmail.com

p.s.
Other useful links on xml parsing in python
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1...as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xm...x-hiperfparse/
5. http://effbot.org/zone/element-index.htm
http://effbot.org/zone/element-iterparse.htm
6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML


 
Adam Tauno Williams
      12-20-2010
On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
> Hi c.l.p folks
> This is a rather long post, but i wanted to include all the details &
> everything i have tried so far myself, so please bear with me & read
> the entire boringly long post.
> I am trying to parse a ginormous ( ~ 1gb) xml file.


Do that hundreds of times a day.

> 0. I am a python & xml n00b, & have been relying on the excellent
> beginner book DIP (Dive Into Python 3, by MP (Mark Pilgrim)). Mark, if
> u are reading this, you are AWESOME & so is your witty & humorous
> writing style.
> 1. Almost all examples of parsing xml in python that i have seen start off with these 4 lines of code.
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot() #my huge xml has 1 root at the top level
> print root


Yes, this is a terrible technique; most examples are crap.

> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
> function builds & returns a tree object, in-memory(RAM), which
> represents the entire document.
> I tried this code, which works fine for a small (~1MB) file, but when i
> run this simple 4 line py code in a terminal for my HUGE target file
> (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python
> process, with memory (the VIRT column) increasing from 100MB all the
> way up to 2100MB.


Yes, this is using DOM. DOM is evil and the enemy, full-stop.

> I am guessing that, as this happens (over the course of 20-30 mins), the
> tree representing the document is being slowly built in memory, but even
> after 30-40 mins, nothing happens.
> I don't get an error, seg fault or out_of_memory exception.


You need to process the document as a stream of elements; aka SAX.

> 3. I also tried using lxml, but an lxml tree is much more expensive,
> as it retains more info about a node's context, including references
> to its parent.
> [http://www.ibm.com/developerworks/xm...-hiperfparse/]
> When i ran the same 4 line code above, but with lxml's ElementTree
> (using the import below in line 1 of the code above)
> import lxml.etree as lxml_etree


You're still using DOM; DOM is evil.

> Which one is the best for my situation ?
> Any & all
> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
> the c.l.p community would be greatly appreciated.
> Plz feel free to email me directly too.


<http://docs.python.org/library/xml.sax.html>

<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>

 
Tim Harig
      12-20-2010
[Wrapped to meet RFC1855 Netiquette Guidelines]
On 2010-12-20, spaceman-spiff wrote:
> This is a rather long post, but i wanted to include all the details &
> everything i have tried so far myself, so please bear with me & read
> the entire boringly long post.
>
> I am trying to parse a ginormous ( ~ 1gb) xml file.

[SNIP]
> 4. I then investigated some streaming libraries, but am confused - there
> is SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
> interface[http://effbot.org/zone/element-iterparse.htm]


I have made extensive use of SAX and it will certainly work for low
memory parsing of XML. I have never used "iterparse"; so, I cannot make
an informed comparison between them.

> Which one is the best for my situation ?


Your post was long but it failed to tell us the most important piece
of information: What does your data look like and what are you trying
to do with it?

SAX is a low level API that provides a callback interface allowing you to
process various elements as they are encountered. You can therefore
do anything you want to the information, as you encounter it, including
outputting and discarding small chunks as you process it; ignoring
most of it and saving only what you want to in-memory data structures;
or saving all of it to a more random access database or on-disk data
structure that you can load and process as required.

What you need to do will depend on what you are actually trying to
accomplish. Without knowing that, I can only affirm that SAX will work
for your needs without providing any information about how you should
be using it.
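
For illustration, here is a minimal sketch of the callback style described above, using the standard library's xml.sax on Python 3. The tag name "param", the class name ParamCollector, the helper collect_params and the sample document are all made up for the example (the thread has not yet said what the real data looks like). It keeps only the text of the wanted elements and lets everything else stream past.

```python
import io
import xml.sax


class ParamCollector(xml.sax.ContentHandler):
    """Keep the text of every wanted element; ignore everything else."""

    def __init__(self, wanted="param"):
        super().__init__()
        self.wanted = wanted    # hypothetical tag name, adjust to the real data
        self.values = []        # the only data that stays in memory
        self._buffer = None     # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name == self.wanted:
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces, so accumulate.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.wanted and self._buffer is not None:
            self.values.append("".join(self._buffer).strip())
            self._buffer = None


def collect_params(source, wanted="param"):
    """Parse a file name or binary stream and return the collected texts."""
    handler = ParamCollector(wanted)
    xml.sax.parse(source, handler)
    return handler.values


sample = b"<config><param>mtu=1500</param><junk>x</junk><param>vlan=7</param></config>"
print(collect_params(io.BytesIO(sample)))
```

The sketch does not handle a wanted element nested inside another wanted element; whether that matters depends on the data.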
 
Terry Reedy
      12-20-2010
On 12/20/2010 2:49 PM, Adam Tauno Williams wrote:
>
> Yes, this is a terrible technique; most examples are crap.


> Yes, this is using DOM. DOM is evil and the enemy, full-stop.


> You're still using DOM; DOM is evil.


For serial processing, DOM is superfluous superstructure.
For random access processing, some might disagree.

>
>> Which one is the best for my situation ?
>> Any & all
>> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
>> the c.l.p community would be greatly appreciated.
>> Plz feel free to email me directly too.

>
> <http://docs.python.org/library/xml.sax.html>
>
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>


For Python (unlike Java), wrapping module functions as class static
methods is superfluous superstructure that only slows things down.

raise Exception(...) # should be something specific like
raise ValueError(...)

--
Terry Jan Reedy

 
Stefan Behnel
      12-21-2010
Adam Tauno Williams, 20.12.2010 20:49:
> On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
>> This is a rather long post, but i wanted to include all the details &
>> everything i have tried so far myself, so please bear with me & read
>> the entire boringly long post.
>> I am trying to parse a ginormous ( ~ 1gb) xml file.

>
> Do that hundreds of times a day.
>
>> 0. I am a python & xml n00b, & have been relying on the excellent
>> beginner book DIP (Dive Into Python 3, by MP (Mark Pilgrim)). Mark, if
>> u are reading this, you are AWESOME & so is your witty & humorous
>> writing style.
>> 1. Almost all examples of parsing xml in python that i have seen start off with these 4 lines of code.
>> import xml.etree.ElementTree as etree


Try

import xml.etree.cElementTree as etree

instead. Note the leading "c", which hints at the C implementation of
ElementTree. It's much faster and much more memory friendly than the Python
implementation.
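
A small sketch of how that import is commonly written so the same script runs on different Python versions. As far as I know, xml.etree.cElementTree was deprecated in Python 3.3 (where plain xml.etree.ElementTree started using the C accelerator automatically) and removed in Python 3.9, so a fallback is needed on newer interpreters. The tiny document at the end is only a made-up smoke test.

```python
try:
    # Older Pythons (2.x): explicitly ask for the C implementation.
    import xml.etree.cElementTree as etree
except ImportError:
    # Newer Pythons: the plain module already uses the C accelerator.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<a><b/></a>")
print(root.tag)
```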


>> tree = etree.parse('*path_to_ginormous_xml*')
>> root = tree.getroot() #my huge xml has 1 root at the top level
>> print root

>
> Yes, this is a terrible technique; most examples are crap.
>
>> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
>> function builds & returns a tree object, in-memory(RAM), which
>> represents the entire document.
>> I tried this code, which works fine for a small (~1MB) file, but when i
>> run this simple 4 line py code in a terminal for my HUGE target file
>> (1GB), nothing happens.
>> In a separate terminal, i run the top command, & i can see a python
>> process, with memory (the VIRT column) increasing from 100MB all the
>> way up to 2100MB.

>
> Yes, this is using DOM. DOM is evil and the enemy, full-stop.


Actually, ElementTree is not "DOM", it's modelled after the XML Infoset.
While I agree that DOM is, well, maybe not "the enemy", but not exactly
beautiful either, ElementTree is really a good thing, likely also in this case.


>> I am guessing that, as this happens (over the course of 20-30 mins), the
>> tree representing the document is being slowly built in memory, but even
>> after 30-40 mins, nothing happens.
>> I don't get an error, seg fault or out_of_memory exception.

>
> You need to process the document as a stream of elements; aka SAX.


IMHO, this is the worst advice you can give.

Stefan

 
Stefan Behnel
      12-21-2010
spaceman-spiff, 20.12.2010 21:29:
> I am sorry i left out what exactly i am trying to do.
>
> 0. Goal: I am looking for a specific element. There are several 10s/100s of occurrences of that element in the 1gb xml file.
> The contents of the xml are just a dump of config parameters from a packet switch (although imho, the contents of the xml don't matter).
>
> I need to detect them & then, for each one, i need to copy all the content b/w the element's start & end tags & create a smaller xml file.


Then cElementTree's iterparse() is your friend. It allows you to basically
iterate over the XML tags while it's building an in-memory tree from them.
That way, you can either remove subtrees from the tree if you don't need
them (to save memory) or otherwise handle them in any way you like, such as
serialising them into a new file (and then deleting them).
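
A rough sketch of that approach, assuming Python 3 and the standard library's xml.etree.ElementTree (which provides the same iterparse() interface). The tag name "port", the function name extract_elements and the sample document are invented for illustration; the real element name from the switch dump would go in their place. Each matching subtree is serialised once it is complete, and everything outside a match is cleared and detached from its parent as soon as its end tag is seen, so the in-memory tree stays small.

```python
import io
import xml.etree.ElementTree as etree


def extract_elements(source, tag):
    """Yield every <tag> subtree as an XML string, freeing memory as we go."""
    path = []     # currently open elements, root first
    inside = 0    # nesting depth inside wanted elements
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem)
            if elem.tag == tag:
                inside += 1
            continue
        path.pop()                    # "end": elem is now complete
        if elem.tag == tag:
            inside -= 1
            if inside == 0:
                elem.tail = None      # do not serialise text after the end tag
                yield etree.tostring(elem, encoding="unicode")
        if inside == 0:
            elem.clear()              # drop children, text and attributes
            if path:
                path[-1].remove(elem)  # detach the empty shell from its parent


sample = (b'<config><section><port id="1"><speed>10</speed></port>'
          b'<noise>zzz</noise><port id="2"><speed>100</speed></port>'
          b'</section></config>')
for piece in extract_elements(io.BytesIO(sample), "port"):
    print(piece)
```

Passing a file name instead of the BytesIO object should work the same way, and each yielded string could be written to its own small file instead of being printed.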

Also note that the iterparse implementation in lxml.etree allows you to
specify a tag name to restrict the iterator to these tags. That's usually a
lot faster, but it also means that you need to take more care to clean up
the parts of the tree that the iterator stepped over. Depending on your
requirements and the amount of manual code optimisation that you want to
invest, either cElementTree or lxml.etree may perform better for you.

It seems that you already found the article by Liza Daly about high
performance XML processing with Python. Give it another read, it has a
couple of good hints and examples that will help you here.

Stefan

 
Stefan Sonnenberg-Carstens
      12-22-2010
On 20.12.2010 20:34, spaceman-spiff wrote:
> [SNIP]

Normally (what is normal, anyway?) such files are auto-generated,
and are something that has an apparent similarity with a database query
result, encapsulated in xml.
Most of the time the structure is the same for every "row" that's in there.
So, a very unpythonic, but fast, way would be to let awk reassemble the
records and write them in csv format to stdout.
Then pipe that to your python cruncher of choice and let it do the hard
work.
The awk part can be done in python anyway, so you could skip that.

And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs java, btw.)

Cheers

 
Nobody
      12-23-2010
On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote:

> Normally (what is normal, anyway?) such files are auto-generated,
> and are something that has an apparent similarity with a database query
> result, encapsulated in xml.
> Most of the time the structure is the same for every "row" that's in there.
> So, a very unpythonic, but fast, way would be to let awk reassemble the
> records and write them in csv format to stdout.


awk works well if the input is formatted such that each line is a record;
it's not so good otherwise. XML isn't a line-oriented format; in
particular, there are many places where both newlines and spaces are just
whitespace. A number of XML generators will "word wrap" the resulting XML
to make it more human readable, so line-oriented tools aren't a good idea.
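
To make the point concrete, here is a small made-up demonstration in Python 3. The element and attribute names ("rows", "row", "id", "name") and both helper functions are hypothetical. The same record is written once on a single line and once "word wrapped"; a line-by-line matcher only finds the first form, while an XML parser sees the two documents as equivalent.

```python
import re
import xml.etree.ElementTree as etree

one_line = '<rows><row id="1" name="eth0"/></rows>'
wrapped = '<rows>\n  <row id="1"\n       name="eth0"/>\n</rows>'

# Naive line-oriented matcher: expects a whole record on one line.
row_re = re.compile(r'<row id="(\d+)" name="(\w+)"/>')


def rows_by_line(text):
    found = []
    for line in text.splitlines():
        match = row_re.search(line)
        if match:
            found.append(match.groups())
    return found


def rows_by_parser(text):
    root = etree.fromstring(text)
    return [(row.get("id"), row.get("name")) for row in root.iter("row")]


print(rows_by_line(one_line), rows_by_line(wrapped))
print(rows_by_parser(one_line), rows_by_parser(wrapped))
```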


 
Stefan Sonnenberg-Carstens
      12-23-2010
On 23.12.2010 21:27, Nobody wrote:
> On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote:
>
>> Normally (what is normal, anyway?) such files are auto-generated,
>> and are something that has an apparent similarity with a database query
>> result, encapsulated in xml.
>> Most of the time the structure is the same for every "row" that's in there.
>> So, a very unpythonic, but fast, way would be to let awk reassemble the
>> records and write them in csv format to stdout.

> awk works well if the input is formatted such that each line is a record;

You shouldn't tell it to awk.
> it's not so good otherwise. XML isn't a line-oriented format; in
> particular, there are many places where both newlines and spaces are just
> whitespace. A number of XML generators will "word wrap" the resulting XML
> to make it more human readable, so line-oriented tools aren't a good idea.

I have never seen awk fail at this task.

For large datasets I always have huge question marks if one says "xml".
But I don't want to start a flame war.
 
Steve Holden
      12-25-2010
On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
> For large datasets I always have huge question marks if one says "xml".
> But I don't want to start a flame war.


I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data, but what else are they to use?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/

 