Seek in huge xml-files

 
 
Bogomir Engel
      08-08-2008
Hi all,

For a student project I need to look up information in XML files that
are several GB in size. Data has to be displayed depending on what the
user enters in the GUI, and parsing the whole file for every input is
not feasible. We can't use DOM, since it would load the whole file into
memory, so our current approaches are based on SAX. We thought of
generating some sort of index that stores, for every data set, its byte
offset in the file. The project has to be implemented in Java, so we
wanted to do something like

Reader.skip(offsetBytes)

That way we could jump to the location of our data set without having to
parse the whole file. The problem is that we have no idea how to obtain
the index information. How can you find out where in the file the SAX
parser currently is (meaning the byte offset)?

Another point: in our tests, skipping bytes in the SAX parser's input
source before parsing produced this exception:

Content is not allowed in prolog

So we are wondering whether it's possible to jump to a given position
and then parse from there.

I'd be thankful for any advice, since I'm quite stuck right now. Many thanks!
Bogomir Engel
 
Bogomir Engel
      09-03-2008
We successfully completed the application.

javax.xml.stream.Location offers a method getCharacterOffset() which
does exactly what we needed.
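
To illustrate the idea, here is a minimal sketch of building such an index with StAX. It is not our project code: the class OffsetIndexer, the method buildIndex and the element/attribute names passed to it are made up for the example. The Javadoc of getCharacterOffset() says the value is a byte offset for byte-stream input and a character offset for character-based input; the sketch reads from a String, so it records character offsets. Which exact position within the start tag gets reported can differ between StAX implementations, so it is worth checking against your parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, named for this example only.
final class OffsetIndexer {

    /**
     * Walks the document once and maps the value of keyAttribute on every
     * element called elementName to the offset reported by the parser.
     */
    static Map<String, Integer> buildIndex(String xml, String elementName, String keyAttribute) {
        Map<String, Integer> index = new LinkedHashMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        String key = reader.getAttributeValue(null, keyAttribute);
                        // Offset as reported by this StAX implementation.
                        index.put(key, reader.getLocation().getCharacterOffset());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return index;
    }
}
```

For a multi-GB file one would pass an InputStream to createXMLStreamReader instead of a String and write the index to disk rather than keep it in a map.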

This article was quite helpful:
Parsing XML documents partially with StAX
http://www.ibm.com/developerworks/xm...tx2/index.html

StAX is a very useful tool when you don't have the memory for DOM and
SAX offers too little control. With StAX the application pulls events
from the parser, so it can decide at any point how to proceed with
parsing.
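
A small sketch of what that control can look like: the loop below stops pulling events as soon as the wanted data set has been found, so the rest of the input is never parsed. EarlyStop and findFirst are hypothetical names for this example, and it assumes the matching element contains only text, because getElementText() throws if it meets child elements.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, named for this example only.
final class EarlyStop {

    /**
     * Returns the text of the first elementName element whose keyAttribute
     * equals wanted, or null if no such element exists.
     */
    static String findFirst(String xml, String elementName, String keyAttribute, String wanted) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())
                            && wanted.equals(reader.getAttributeValue(null, keyAttribute))) {
                        // Reads up to the matching end tag, then we stop parsing.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

With SAX the parser drives the callbacks, and the usual way to stop early is to throw an exception from a handler.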

By the way, given a mapping definition, JiBX (http://jibx.sourceforge.net/,
highly recommended) created ordinary Java objects from the XML data sets
for us. We saved the byte offsets in those objects during the initial
parsing pass.
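
For anyone wondering how a stored offset can be used afterwards, here is a rough sketch under some assumptions: the index holds the start byte offset and the byte length of each data set, the file is UTF-8, and each data set is self-contained (no namespace prefixes or entities declared further up in the file). FragmentReader, readSlice and textOf are hypothetical names, not our actual code. The slice is handed to the parser as a small document of its own, which avoids the "Content is not allowed in prolog" error that a parser tends to raise when it starts at an arbitrary position and sees text before any root element.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, named for this example only.
final class FragmentReader {

    /** Reads length bytes starting at byte offset without touching the rest of the file. */
    static byte[] readSlice(String path, long offset, int length) {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] buffer = new byte[length];
            file.seek(offset);      // jump straight to the data set
            file.readFully(buffer); // fails if the file ends too early
            return buffer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses a slice holding exactly one complete element and returns its character data. */
    static String textOf(byte[] fragment) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(fragment), "UTF-8");
            StringBuilder text = new StringBuilder();
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("fragment is not a well-formed element", e);
        }
    }
}
```

Note that Reader.skip() counts characters, not bytes, so for byte offsets something like RandomAccessFile.seek() or InputStream.skip() is the better fit.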
 
 
 
 