Velocity Reviews - Computer Hardware Reviews

Splitting an XML file on the basis of XML tags

 
 
bijeshn@gmail.com
      04-02-2008
Hi all,

I have an XML file with the following structure:

<r1>
  <r2>          ---|
    <r3>           |
      <r4>         |
        ..         |---> constitutes one record
      </r4>        |
    </r3>          |
  </r2>         ---|
  <r2>
    ..             ---> there are n records in between
  </r2>
  <r2>          ---|
    <r3>           |
      <r4>         |
        ..         |---> constitutes one record
      </r4>        |
    </r3>          |
  </r2>         ---|
</r1>


Here <r1> is the root tag of the XML, and each <r2>...</r2> element
constitutes one record. What I would like to do is extract everything
(XML tags and data) between the nth <r2> tag and the (n+k)th <r2> tag.
The extracted data is to be written to a separate file.

Thanks...







 
Chris
      04-02-2008


You could create a generator expression out of it:

txt = """<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>
"""
pieces = txt.split('r2>')  # splits at both <r2> and </r2>
last = len(pieces) - 1
# keep only the pieces that hold record content, skipping the root tags
# and the whitespace between records
a = ('<r2>%sr2>' % piece for j, piece in enumerate(pieces)
     if 0 < j < last and piece.replace('>', '').replace('<', '').strip())

Now you have a generator you can iterate through with next(a)
(a.next() in Python 2), or alternatively you could just create a list
out of it by replacing the outer parens with square brackets.
 
bijeshn
      04-03-2008


Hmmm... will look into it. Thanks.

The XML file is almost a TB in size, so SAX will have to be the parser.
I'm thinking of doing something to split the file using SAX. Any
suggestions along those lines? If there are any other suitable parsers,
please suggest them.
 
Steve Holden
      04-03-2008
bijeshn wrote:
> the XML file is almost a TB in size...
>

Good grief. When will people stop abusing XML this way?

> so SAX will have to be the parser.... i'm thinking of doing something
> to split the file using SAX
> ... Any suggestions on those lines..? If there are any other parsers
> suitable, please suggest...


You could try pulldom, but the documentation is disgraceful.

ElementTree.iterparse *might* help.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
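
For reference, a minimal sketch of what the pulldom approach mentioned above might look like. It assumes the record tag is r2 as in the original post; the helper name iter_records and the in-memory sample are made up for illustration, and this has not been tried on anything near a TB.

```python
import io
from xml.dom import pulldom


def iter_records(source, tag="r2"):
    """Yield each <tag> element as an XML string, one record at a time.

    `source` may be a filename or a binary file object.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build a small DOM for this record only, not the whole document.
            events.expandNode(node)
            yield node.toxml()


# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.BytesIO(
    b"<r1><r2><r3><r4>1</r4></r3></r2><r2><r3><r4>2</r4></r3></r2></r1>"
)
records = list(iter_records(sample))
print(records)
```

Because the function is a generator, a caller could count records as they arrive and write only the wanted range to a separate file.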

 
Marco Mariani
      04-03-2008
Steve Holden wrote:

>> the XML file is almost a TB in size...
>>

> Good grief. When will people stop abusing XML this way?


Not before somebody writes a clever xmlfs for the linux kernel :-/

 
Marco Mariani
      04-03-2008
Marco Mariani wrote:

>>> the XML file is almost a TB in size...
>>>

>> Good grief. When will people stop abusing XML this way?

>
> Not before somebody writes a clever xmlfs for the linux kernel :-/


Ok.

I meant it as a joke, but somebody has been there and done that.

Twice.


http://xmlfs.modry.cz/user_documentation/

http://www.haifa.ibm.com/projects/st...lfs/index.html
 
Chris
      04-03-2008
On Apr 3, 8:51 am, Steve Holden <st...@holdenweb.com> wrote:
> Good grief. When will people stop abusing XML this way?

I abuse it because I can (and because I don't generally work with XML
files larger than 20-30 MB).
And the OP never said the XML file was 1 TB in size, which makes things
different.
 
Diez B. Roggisch
      04-03-2008
> I abuse it because I can (and because I don't generally work with XML
> files larger than 20-30 MB).
> And the OP never said the XML file was 1 TB in size, which makes things
> different.


Even with small XML files your advice was not very sound. Yes, it's
tempting to use regexes to process XML. But usually one falls flat on
one's face soon, because of whitespace or attribute order or <foo></foo>
versus <foo/>, and so on.

Use an XML parser. That's what they are for. And especially with the
pythonic ones like ElementTree (and the compatible lxml), it's even more
straightforward than using regexes.


Diez
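
To illustrate the point about whitespace, attribute order and <foo></foo> versus <foo/>, here is a small sketch using the standard library's ElementTree. The three sample strings are made up for the illustration; they differ as text but should parse to the same element.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same element (hypothetical samples).
variants = [
    '<foo a="1" b="2"></foo>',      # explicit open and close tags
    '<foo b="2" a="1"/>',           # self-closing, attributes reordered
    '<foo   a = "1"\n b="2" />',    # extra whitespace and a line break
]

parsed = [ET.fromstring(text) for text in variants]

# Reduce each element to (tag, sorted attributes, text) for comparison.
summaries = [(el.tag, sorted(el.attrib.items()), el.text) for el in parsed]

print(len(set(variants)))                       # distinct as strings
print(all(s == summaries[0] for s in summaries))  # identical once parsed
```

Text-based splitting has to anticipate every such variation by hand, whereas the parser normalizes them away.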
 
bijeshn
      04-04-2008

Yeah, I plan to use SAX, but how do you do it with that?

Forget 1 TB for now. How do you split an XML file of something like
70-80 GB according to my requirement (described in the original post)?
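
One way a SAX-based split might look, as a rough sketch only. The class name RecordSlicer and its parameters are hypothetical; it assumes records are <r2> elements directly under the root <r1>, counts them from 1, and forwards only the events of the wanted records to xml.sax.saxutils.XMLGenerator.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class RecordSlicer(xml.sax.ContentHandler):
    """Copy records first..last (1-based, inclusive) to `out`."""

    def __init__(self, out, first, last, record_tag="r2", root_tag="r1"):
        super().__init__()
        self.writer = XMLGenerator(out, encoding="utf-8")
        self.first, self.last = first, last
        self.record_tag, self.root_tag = record_tag, root_tag
        self.count = 0        # records seen so far
        self.depth = 0        # nesting depth inside the current record
        self.copying = False  # True while inside a wanted record

    def startDocument(self):
        self.writer.startDocument()
        self.writer.startElement(self.root_tag, {})

    def endDocument(self):
        self.writer.endElement(self.root_tag)
        self.writer.endDocument()

    def startElement(self, name, attrs):
        if self.depth == 0:
            if name != self.record_tag:
                return  # the root element itself; nothing to copy
            self.count += 1
            self.copying = self.first <= self.count <= self.last
        self.depth += 1
        if self.copying:
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.depth == 0:
            return  # closing tag of the root element
        if self.copying:
            self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            self.copying = False

    def characters(self, content):
        if self.copying:
            self.writer.characters(content)


# Tiny in-memory stand-in for the real file (hypothetical data).
source = io.BytesIO(
    b"<r1>"
    + b"".join(b"<r2><r3><r4>%d</r4></r3></r2>" % i for i in range(1, 6))
    + b"</r1>"
)
out = io.StringIO()
xml.sax.parse(source, RecordSlicer(out, first=2, last=3))
result = out.getvalue()
print(result)
```

For a real file, `source` would be the path to the big XML file and `out` a file opened for writing with UTF-8 encoding. Note that this sketch keeps parsing to the end of the input even after the last wanted record; stopping early would need extra handling.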
 
Stefan Behnel
      04-07-2008

What do you mean by "written down to a separate file"? Do you have a specific
format in mind?

In general, you can try this:

>>> from xml.etree import ElementTree as ET
>>> itercontext = ET.iterparse("thefile.xml", events=("start", "end"))
>>> event, root = next(itercontext)  # first event: start of the root element
>>> for event, element in itercontext:
...     if event == "end" and element.tag == "r2":
...         print(ET.tostring(element, encoding="unicode"))  # write record subtree as XML
...         root.clear()  # one record done, clean up everything

http://effbot.org/zone/element-iterparse.htm

You can also do things like

...         print(element.findtext("r3/r4"))

Read the ElementTree tutorial to learn how to extract your data:

http://effbot.org/zone/element.htm#s...or-subelements

Stefan
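
Building on the iterparse loop above, a sketch of how the nth to (n+k)th records might be written to a separate file. The function name write_records and its parameters are hypothetical, records are counted from 1, and <r2> is assumed to occur only as a direct child of the root.

```python
import io
import xml.etree.ElementTree as ET


def write_records(source, out, first, last, record_tag="r2", root_tag="r1"):
    """Copy records first..last (1-based, inclusive) from `source` to `out`.

    `out` is a binary file object. Returns the number of records read.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    out.write(f"<{root_tag}>".encode())
    count = 0
    for event, element in context:
        if event == "end" and element.tag == record_tag:
            count += 1
            if first <= count <= last:
                out.write(ET.tostring(element))
            root.clear()  # drop finished records so memory use stays small
            if count >= last:
                break  # nothing after the last wanted record is needed
    out.write(f"</{root_tag}>".encode())
    return count


# Tiny in-memory stand-in for the real file (hypothetical data).
src = io.BytesIO(
    b"<r1>"
    + b"".join(b"<r2><r3><r4>%d</r4></r3></r2>" % i for i in range(1, 6))
    + b"</r1>"
)
dst = io.BytesIO()
write_records(src, dst, first=2, last=4)
extracted = dst.getvalue()
print(extracted.decode())
```

With real data, `src` and `dst` would be files opened in binary mode. Wrapping the records in a new root element keeps the output well-formed; whether that is the desired output format is a question for the original poster.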
 