Velocity Reviews - Computer Hardware Reviews

Removing elements from large XML documents

 
 
Jakub Moskal
      03-28-2007
Hi,

I need to remove certain elements from the XML document tree based on
given parameters, e.g. I have a document with a structure as follows:

<country>
<city>
<street name="streetName" />
</city>
</country>

and I want to remove all <country> nodes for which the street name is
"someName" (I know the example is lame, but it exposes my problem).

Initially I used DOM, and whenever I found a <street> element with a
name attribute that I didn't want, I removed the enclosing country using:
root.removeChild(node.getParent().getParent().getParent()).
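
For reference, a minimal sketch of this kind of DOM-based removal, assuming the standard org.w3c.dom API (where the parent accessor is getParentNode()). The class and method names are hypothetical, and it assumes the <country>/<city>/<street> nesting shown above, so that two steps up from a <street> element reach its <country>:

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the original poster's code.
class DomCountryRemover {

    // Removes every <country> containing a <street> whose name attribute
    // equals unwantedName; returns the number of countries removed.
    static int removeCountries(Document doc, String unwantedName) {
        NodeList streets = doc.getElementsByTagName("street");
        // The NodeList is live, so collect the targets first and remove afterwards.
        Set<Node> doomed = new LinkedHashSet<>();
        for (int i = 0; i < streets.getLength(); i++) {
            Element street = (Element) streets.item(i);
            if (unwantedName.equals(street.getAttribute("name"))) {
                // street -> city -> country (assumed nesting)
                doomed.add(street.getParentNode().getParentNode());
            }
        }
        for (Node country : doomed) {
            country.getParentNode().removeChild(country);
        }
        return doomed.size();
    }

    // Parses the whole document into memory, which is exactly what
    // stops this approach from scaling to very large files.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```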

It worked just fine with small files, but problems occurred when I
started dealing with docs that are 10-60MB in size. DOM loads the
entire document tree into memory, so this solution doesn't scale
at all: on most computers I run out of memory. I don't want to
give the JVM more memory, because I don't feel that this is the
direction I should take; it's not a universal
solution.

SAX parses the document serially, so I can't find a way to remove the
great-grandparent node of the current element with it. XSLT processing
works similarly to DOM, and memory issues occur.

Is there anything else out there that would help me solve this issue?
Would chopping the file into smaller pieces be a good solution?

Any help greatly appreciated,
Jakub.

 
Tom Hawtin
      03-28-2007
Jakub Moskal wrote:
>
> SAX parses the document serially, so I can't find a way to remove the
> great-grandparent node of the current element with it. XSLT processing
> works similarly to DOM, and memory issues occur.


(Strictly, whether XSLT uses a DOM is implementation-dependent. There was
some talk of making Xalan work in a streaming mode several years ago,
but XSLT isn't seen as being as sexy as it once was.)

My suggestion is that when you hit a <country> element, you switch to a
temporary stream (StringWriter, say). When you find a <street> element
you don't want, you switch the output to a null stream. At the end of
the </country> element (or before) write the temporary stream to the
real output stream, and switch back.

(I suggest not using RandomAccessFile to jump backwards, as it is
excessively slow.)
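
The buffering idea above might be sketched roughly as follows with a SAX handler. This is only an illustration under assumptions: the class name CountryFilter and the filter helper are hypothetical, <country> elements are assumed not to nest, and the sketch re-serializes only elements, attributes and text (no XML declaration, comments, processing instructions or namespace handling).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffers each <country> and drops it if it
// contains a <street> with the unwanted name.
class CountryFilter extends DefaultHandler {
    private final Writer realOut;       // the real output stream
    private final String unwantedName;  // street name that disqualifies a country
    private StringWriter buffer;        // temporary stream; non-null only inside a <country>
    private boolean discard;            // true once an unwanted <street> was seen

    CountryFilter(Writer realOut, String unwantedName) {
        this.realOut = realOut;
        this.unwantedName = unwantedName;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (qName.equals("country")) {
            buffer = new StringWriter();   // switch to the temporary stream
            discard = false;
        } else if (buffer != null && qName.equals("street")
                && unwantedName.equals(atts.getValue("name"))) {
            discard = true;                // switch to the "null stream"
        }
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        write(tag.append('>').toString());
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        write("</" + qName + ">");
        if (qName.equals("country")) {
            StringWriter done = buffer;
            buffer = null;                 // switch back to the real output
            if (!discard) {
                write(done.toString());    // keep this country
            }
            discard = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        write(escape(new String(ch, start, length)));
    }

    private void write(String s) throws SAXException {
        if (discard) {
            return;                        // output is being thrown away
        }
        try {
            (buffer != null ? buffer : realOut).write(s);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Convenience wrapper for small inputs; real use would pass file-backed streams.
    static String filter(String xml, String unwantedName) throws Exception {
        StringWriter out = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new CountryFilter(out, unwantedName));
        return out.toString();
    }
}
```

With this shape, only one <country> is held in memory at a time, so memory use depends on the largest single country rather than on the size of the whole document. For a large file, the handler would be given a buffered file writer and the parser a file input instead of strings.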

Tom Hawtin
 