Velocity Reviews - Computer Hardware Reviews

Re: Trying to parse a HUGE(1gb) xml file

 
 
spaceman-spiff
12-20-2010
Hi,

First up, thanks for your prompt reply.
I will make sure I read RFC 1855 before posting again, but right now I am chasing a hard deadline.

I am sorry I left out what exactly I am trying to do.

0. Goal: I am looking for a specific element. There are several tens/hundreds of occurrences of that element in the 1 GB XML file.
The contents of the XML are just a dump of config parameters from a packet switch (although, IMHO, the contents of the XML don't matter).

I need to detect them, and then for each one, copy all the content between the element's start and end tags and create a smaller XML file.

1. Can you point me to some examples/samples of using SAX, especially ones dealing with really large XML files?

2. This brings me to another question which I forgot to ask in my original post:
is simply opening the file and using regular expressions to look for the element I need a *good* approach?
While researching my problem, some articles seemed to advise against this, especially since it is known a priori that the file is XML, and since regex code gets complicated very quickly and is not very readable.

But is that just a "style"/"elegance" issue, and for my particular problem (detecting a certain element, then writing a smaller XML file for each pair of start and end tags of said element), is the open-file-and-regex approach something you would recommend?

Thanks again for your super-prompt response.

cheers
ashish
 
Adam Tauno Williams
12-20-2010
On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
> I need to detect them, and then for each one, copy all the
> content between the element's start and end tags and create a
> smaller xml file.


Yep, I do that a lot; via iterparse.

> 1. Can you point me to some examples/samples of using SAX,
> especially ones dealing with really large XML files?


SAX is equivalent to iterparse (iterparse is a way to do, essentially,
SAX-like processing).

I provided an iterparse example already. See the Read_Rows method in
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
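For reference, a minimal sketch of that iterparse pattern as applied to the original question; the function name, the tag name `node` in the test, and the output naming are my placeholders, not taken from the linked Read_Rows method:

```python
import xml.etree.ElementTree as ET

def extract_elements(big_xml_path, target_tag, out_prefix="part"):
    """Write each target element from a huge XML file to its own
    small XML file, keeping memory usage flat."""
    count = 0
    # Iterate over "end" events: each target element is complete
    # when its end tag is reached, so it can be written and dropped.
    for event, elem in ET.iterparse(big_xml_path, events=("end",)):
        if elem.tag == target_tag:
            count += 1
            ET.ElementTree(elem).write("%s_%d.xml" % (out_prefix, count))
            elem.clear()  # discard the subtree we just wrote out
    return count
```

This only stays flat if the target elements themselves are small; siblings already processed are freed by `clear()`.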

> 2. This brings me to another question which I forgot to ask in my original post.
> Is simply opening the file and using regexes to look for the element I need a *good* approach?


No.


 
Tim Harig
12-20-2010
On 2010-12-20, spaceman-spiff <(E-Mail Removed)> wrote:
> 0. Goal: I am looking for a specific element. There are several
> tens/hundreds of occurrences of that element in the 1 GB XML file.
> The contents of the XML are just a dump of config parameters from a
> packet switch (although, IMHO, the contents of the XML don't matter).


Then you need:

1. To detect whenever you move inside the type of element you are
   seeking and whenever you move out of it. As long as these elements
   cannot be nested inside each other, this is an easy binary task.
   If they can be nested, then you will need to maintain some kind of
   level count or recursively decompose each level.

2. Once you have obtained a complete element (from its start tag to
   its end tag), you will need to test whether you have the single
   correct element that you are looking for.

Something like this (untested) will work if the target tag cannot be nested
in another target tag:

import xml.sax

targetName = "node"  # placeholder: the element type you are seeking

class tagSearcher(xml.sax.ContentHandler):

    def startDocument(self):
        self.inTarget = False

    def startElement(self, name, attrs):
        if name == targetName:
            self.inTarget = True
        elif self.inTarget:
            pass  # save element information

    def endElement(self, name):
        if name == targetName:
            self.inTarget = False
            # test the saved information to see if you have the
            # one you want:
            #
            # if it's the piece you are looking for, then you can
            # process the information you have saved
            #
            # if not, discard the accumulated information and move on

    def characters(self, content):
        if self.inTarget:
            pass  # save the content

yourHandler = tagSearcher()
yourParser = xml.sax.make_parser()
yourParser.parse(inputXML, yourHandler)

Then you just walk through the document picking up and discarding each
target element type until you have the one that you are looking for.

> I need to detect them, and then for each one, copy all the content
> between the element's start and end tags and create a smaller xml file.


Easy enough; but with SAX you will have to recreate the tags from the
information they contain, because tags are not included in the
characters() events; so you will need to save the information from each
tag as you come across it. This could probably be done more
automatically using saxutils.XMLGenerator; but I haven't actually
worked with it before. xml.dom.pulldom also looks interesting.
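A rough sketch of that XMLGenerator idea; the class name is invented for the example, and it copies each target element verbatim into its own in-memory buffer, using a depth count so nested targets are handled:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class TagExtractor(xml.sax.ContentHandler):
    """Echo each occurrence of one target element, tags included,
    into its own StringIO buffer."""

    def __init__(self, target):
        self.target = target
        self.depth = 0       # nesting depth inside a target element
        self.buffers = []    # one buffer per extracted element
        self.gen = None      # XMLGenerator for the current buffer

    def startElement(self, name, attrs):
        if name == self.target and self.depth == 0:
            buf = io.StringIO()
            self.buffers.append(buf)
            self.gen = XMLGenerator(buf)
            self.gen.startDocument()
        if name == self.target:
            self.depth += 1
        if self.depth:
            self.gen.startElement(name, attrs)  # re-emit the tag

    def endElement(self, name):
        if self.depth:
            self.gen.endElement(name)
        if name == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.gen = None  # element finished; buffer is complete

    def characters(self, content):
        if self.depth:
            self.gen.characters(content)
```

Each `buffers` entry could just as easily be a file opened per element; writing to disk instead of memory is a one-line change.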

> 1. Can you point me to some examples/samples of using SAX, especially
> ones dealing with really large XML files?


There is nothing special about large files with SAX. SAX is very simple.
It walks through the document and calls the functions that you give it
for each event as it reaches various elements. Your callback functions
(methods of a handler) do everything with the information. SAX does
nothing more than call your functions. There are events for reaching a
start tag, an end tag, and the characters between tags, as well as
events for beginning and ending a document.

> 2. This brings me to another question which I forgot to ask in my
> original post. Is simply opening the file and using regexes to look
> for the element I need a *good* approach? While researching my
> problem, some articles seemed to advise against this, especially
> since it is known a priori that the file is XML, and since regex code
> gets complicated very quickly and is not very readable.
>
> But is that just a "style"/"elegance" issue, and for my particular
> problem (detecting a certain element, then writing a smaller XML file
> for each pair of start and end tags of said element), is the
> open-file-and-regex approach something you would recommend?


It isn't an invalid approach if it works for your situation. I have
used it before for very simple problems. The thing is, XML is a
context-free data format, which makes it difficult to write precise
regular expressions, especially where tags of the same type can be
nested.

It can be very error prone. It's really easy to have a regex work for
your tests and then fail, either by matching too much or by failing to
match, because you didn't anticipate a given piece of data. I wouldn't
consider it a robust solution.
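To make that failure mode concrete, a toy demonstration: a non-greedy pattern that behaves correctly on flat input silently truncates nested elements (the `<node>` tag here is invented for the example):

```python
import re

flat = "<node>a</node><node>b</node>"
nested = "<node>outer<node>inner</node>tail</node>"

pat = re.compile(r"<node>(.*?)</node>", re.DOTALL)

# Works on the flat document...
assert pat.findall(flat) == ["a", "b"]
# ...but on nested input the non-greedy match stops at the *first*
# closing tag, cutting the outer element short and losing "tail".
assert pat.findall(nested) == ["outer<node>inner"]
```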
 
John Nagle
12-22-2010
On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
> On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
>> I need to detect them, and then for each one, copy all the
>> content between the element's start and end tags and create a
>> smaller xml file.
>
> Yep, I do that a lot; via iterparse.
>
>> 1. Can you point me to some examples/samples of using SAX,
>> especially ones dealing with really large XML files?


I've just subclassed HTMLParser for this. It's slow, but
100% Python. Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but I don't
need a parse tree of the XML.

> SAX is equivalent to iterparse (iterparse is a way to do, essentially,
> SAX-like processing).


Iterparse does try to build a tree, although you can discard the
parts you don't want. If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree. You have to keep clearing the "root"
element to discard that.
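The root-clearing pattern described above looks roughly like this; the function and tag names are placeholders for the example:

```python
import xml.etree.ElementTree as ET

def count_targets(path, target_tag):
    # Ask for "start" events too, so we can grab a reference to the
    # root element before any children are built.
    context = ET.iterparse(path, events=("start", "end"))
    event, root = next(context)  # first event: start of the root
    seen = 0
    for event, elem in context:
        if event == "end" and elem.tag == target_tag:
            seen += 1
            # ... process elem here ...
            root.clear()  # keep the "junk tree" from accumulating
    return seen
```

Clearing the root after each processed element discards finished siblings, so memory stays bounded by one element's subtree.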

> I provided an iterparse example already. See the Read_Rows method in
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>


I don't quite see the point of creating a class with only static
methods. That's basically a verbose way to create a module.
>
>> 2. This brings me to another question which I forgot to ask in my original post.
>> Is simply opening the file and using regexes to look for the element I need a *good* approach?

>
> No.


If the XML file has a very predictable structure, that may not be
a bad idea. It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
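A toy illustration of that fixed-format case; the record layout below is invented for the example, not taken from the poster's switch dump:

```python
import re

# A rigid, machine-generated record layout: every record on its own
# line, fields always in the same order, no nesting, no attributes.
data = """<rec><id>1</id><val>10</val></rec>
<rec><id>2</id><val>20</val></rec>"""

records = [
    {"id": m.group(1), "val": m.group(2)}
    for m in re.finditer(r"<rec><id>(\d+)</id><val>(\d+)</val></rec>", data)
]
assert records == [{"id": "1", "val": "10"}, {"id": "2", "val": "20"}]
```

The moment the producer adds an attribute or reorders fields, this breaks, which is the trade-off against a real parser.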

John Nagle


 