Velocity Reviews

-   -   Re: Trying to parse a HUGE(1gb) xml file (http://www.velocityreviews.com/forums/t740509-re-trying-to-parse-a-huge-1gb-xml-file.html)

spaceman-spiff 12-20-2010 08:29 PM

Re: Trying to parse a HUGE(1gb) xml file
 
Hi,

First up, thanks for your prompt reply.
I will make sure I read RFC 1855 before posting again, but right now I'm chasing a hard deadline :)

I am sorry I left out what exactly I am trying to do.

0. Goal: I am looking for a specific element. There are several tens or hundreds of occurrences of that element in the 1gb XML file.
The contents of the XML are just a dump of config parameters from a packet switch (although IMHO the contents of the XML don't matter).

I need to detect them, and then for each one I need to copy all the content between the element's start and end tags and create a smaller XML file.

1. Can you point me to some examples/samples of using SAX, especially ones dealing with really large XML files?

2. This brings me to another question, which I forgot to ask in my OP (original post).
Is simply opening the file and using regex to look for the element I need a *good* approach?
While researching my problem, some articles seemed to advise against this, especially since it's known a priori that the file is XML, and since regex code gets complicated very quickly and is not very readable.

But is that just a "style"/"elegance" issue, or, for my particular problem (detecting a certain element, and then creating/writing a smaller XML file corresponding to each pair of start and end tags of said element), is the open-file-and-regex approach something you would recommend?

Thanks again for your super-prompt response :)

cheers
ashish

Adam Tauno Williams 12-20-2010 08:33 PM

Re: Trying to parse a HUGE(1gb) xml file
 
On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
> I need to detect them & then for each 1, i need to copy all the
> content b/w the element's start & end tags & create a smaller xml
> file.


Yep, I do that a lot, via iterparse.

> 1. Can you point me to some examples/samples of using SAX,
> especially , ones dealing with really large XML files.


SAX is equivalent to iterparse (iterparse is, essentially, a way to do
SAX-like processing).

I provided an iterparse example already. See the Read_Rows method in
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
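To make the idea concrete, here is a minimal, self-contained iterparse sketch of that approach; the element name "target" and the tiny inline sample are just stand-ins for the real tag name and the real 1gb file:

```python
import io
import xml.etree.ElementTree as ET

# stand-in for the huge file: same shape, tiny size
sample = io.StringIO(
    "<config><other>x</other>"
    "<target id='1'>a</target><target id='2'>b</target></config>"
)

def extract_elements(source, tag):
    """Yield the serialized text of each matching element as it is parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free the element's children once we are done with it

chunks = list(extract_elements(sample, "target"))
# each chunk could now be written out as its own smaller xml file
```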

> 2.This brings me to another q. which i forgot to ask in my OP(original post).
> Is simply opening the file, & using reg ex to look for the element i need, a *good* approach ?


No.



Tim Harig 12-20-2010 09:37 PM

Re: Trying to parse a HUGE(1gb) xml file
 
On 2010-12-20, spaceman-spiff <ashish.makani@gmail.com> wrote:
> 0. Goal :I am looking for a specific element..there are several 10s/100s
> occurrences of that element in the 1gb xml file. The contents of the xml,
> is just a dump of config parameters from a packet switch( although imho,
> the contents of the xml dont matter)


Then you need:

1. To detect whenever you move inside the type of element you are
seeking and whenever you move out of it. As long as these
elements cannot be nested inside each other, this is an
easy binary task. If they can be nested, then you will
need to maintain some kind of level count or recursively
decompose each level.

2. Once you have obtained a complete element (from its start tag to
its end tag), you will need to test whether you have the
single correct element that you are looking for.

Something like this (untested) will work if the target tag cannot be nested
in another target tag:

import xml.sax

class tagSearcher(xml.sax.ContentHandler):

    def __init__(self, targetName):
        self.targetName = targetName

    def startDocument(self):
        self.inTarget = False

    def startElement(self, name, attrs):
        if name == self.targetName:
            self.inTarget = True
        elif self.inTarget:
            pass  # save element information

    def endElement(self, name):
        if name == self.targetName:
            self.inTarget = False
            # test the saved information to see if you have the
            # one you want:
            #
            # if it's the piece you are looking for, then you can
            # process the information you have saved
            #
            # if not, discard the accumulated information and move on

    def characters(self, content):
        if self.inTarget:
            pass  # save the content

yourHandler = tagSearcher("yourTargetName")
yourParser = xml.sax.make_parser()
yourParser.parse(inputXML, yourHandler)

Then you just walk through the document picking up and discarding each
target element type until you have the one that you are looking for.

> I need to detect them & then for each 1, i need to copy all the content
> b/w the element's start & end tags & create a smaller xml file.


Easy enough; but with SAX you will have to recreate the tags from
the information that they contain, because they will be skipped by the
characters() events; so you will need to save the information from each tag
as you come across it. This could probably be done more automatically
using saxutils.XMLGenerator, but I haven't actually worked with it
before. xml.dom.pulldom also looks interesting.
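Since xml.dom.pulldom came up, here is a rough self-contained sketch of that route (the tag name "target" and the inline sample are placeholders); expandNode() materializes just the matched element's subtree, so you get a small DOM fragment per hit instead of a tree for the whole file:

```python
from xml.dom import pulldom

# stand-in for the huge file
sample = ("<config><other>x</other>"
          "<target id='1'>a</target><target id='2'>b</target></config>")

pieces = []
events = pulldom.parseString(sample)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "target":
        events.expandNode(node)   # pull only this element's subtree into memory
        pieces.append(node.toxml())
```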

> 1. Can you point me to some examples/samples of using SAX, especially ,
> ones dealing with really large XML files.


There is nothing special about large files with SAX. SAX is very simple.
It walks through the document and calls the functions that you
give it for each event as it reaches various elements. Your callback
functions (methods of a handler) do everything with the information.
SAX does nothing more than call your functions. There are events for
reaching a start tag, an end tag, and the characters between tags,
as well as some for beginning and ending a document.

> 2.This brings me to another q. which i forgot to ask in my OP(original
> post). Is simply opening the file, & using reg ex to look for the element
> i need, a *good* approach ? While researching my problem, some article
> seemed to advise against this, especially since its known apriori, that
> the file is an xml & since regex code gets complicated very quickly &
> is not very readable.
>
> But is that just a "style"/"elegance" issue, & for my particular problem
> (detecting a certain element, & then creating(writing) a smaller xml
> file corresponding to, each pair of start & end tags of said element),
> is the open file & regex approach, something you would recommend ?


It isn't an invalid approach if it works for your situation. I have
used it before for very simple problems. The thing is, XML is a context-free
data format, which makes it difficult to generate precise regular
expressions, especially where tags of the same type can be nested.

It can be very error-prone. It's really easy to have a regex work for
your tests and then fail, either by matching too much or by failing to match,
because you didn't anticipate a given piece of data. I wouldn't consider
it a robust solution.
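For illustration, a tiny sketch of the failure mode: a lazy regex that looks right on flat input silently mis-captures as soon as the same tag nests (the tag name "t" is made up):

```python
import re

flat = "<a><t>one</t><t>two</t></a>"
nested = "<a><t>outer<t>inner</t></t></a>"

pat = re.compile(r"<t>(.*?)</t>", re.S)
flat_matches = pat.findall(flat)      # matches each element cleanly
nested_matches = pat.findall(nested)  # wrong: stops at the *first* closing tag
```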

John Nagle 12-22-2010 10:28 PM

Re: Trying to parse a HUGE(1gb) xml file
 
On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
> On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
>> I need to detect them & then for each 1, i need to copy all the
>> content b/w the element's start & end tags & create a smaller xml
>> file.

>
> Yep, do that a lot; via iterparse.
>
>> 1. Can you point me to some examples/samples of using SAX,
>> especially , ones dealing with really large XML files.


I've just subclassed HTMLParser for this. It's slow, but
100% Python. Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but I don't
need a parse tree of the XML.

> SAX is equivalent to iterparse (iterparse is, essentially, a way to do
> SAX-like processing).


Iterparse does try to build a tree, although you can discard the
parts you don't want. If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree. You have to keep clearing the "root"
element to discard that.
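That root-clearing pattern can be sketched like this, with a tiny in-memory input standing in for the real file:

```python
import io
import xml.etree.ElementTree as ET

source = io.StringIO("<root>" + "<item>x</item>" * 5 + "</root>")

context = ET.iterparse(source, events=("start", "end"))
event, root = next(context)   # the first event is the start of the root element
count = 0
for event, elem in context:
    if event == "end" and elem.tag == "item":
        count += 1            # ...process the element here...
        root.clear()          # then drop processed children so no junk tree builds up
```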

> I provided an iterparse example already. See the Read_Rows method in
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>


I don't quite see the point of creating a class with only static
methods. That's basically a verbose way to create a module.
>
>> 2.This brings me to another q. which i forgot to ask in my OP(original post).
>> Is simply opening the file, & using reg ex to look for the element i need, a *good* approach ?

>
> No.


If the XML file has a very predictable structure, that may not be
a bad idea. It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
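For instance, if every record really is one predictable line, a sketch like this (with a hypothetical record layout) is about all the "parser" you need:

```python
import re

# hypothetical fixed-format records, one per line
data = ("<rec><name>alpha</name><val>1</val></rec>\n"
        "<rec><name>beta</name><val>2</val></rec>")

pat = re.compile(r"<rec><name>(\w+)</name><val>(\d+)</val></rec>")
rows = pat.findall(data)
# rows is a list of (name, val) tuples, one per record
```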

John Nagle



