Large XML files

 
 
jdev8080
 
      12-20-2005
We are looking at creating large XML files containing binary data
(encoded as base64) and passing them to transformers that will parse
and transform the data into different formats.

Basically, we have images that have associated metadata and we are
trying to develop a unified delivery mechanism. Our XML documents may
be as large as 1GB and contain up to 100,000 images.
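
For a rough idea of the shape we have in mind (element and attribute
names below are placeholders, not our real schema):

<imageSet>
  <image id="img00001">
    <metadata>
      <name>...</name>
      <mimeType>image/jpeg</mimeType>
      <captured>...</captured>
    </metadata>
    <content encoding="base64">...</content>
  </image>
  <!-- repeated for up to 100,000 images -->
</imageSet>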

My question is, has anyone done anything like this before?

What are the performance considerations?

Do the current parsers support this size of XML file?

Has anyone used fast infoset for this type of problem?

Is there a better way to deliver large sets of binary files (e.g. zip
files or something like that)?

Any input would be great. If there is a better board to post this,
please let me know.

Thx,

Bret

 
Jürgen Kahrs
 
      12-20-2005
jdev8080 wrote:

> Basically, we have images that have associated metadata and we are
> trying to develop a unified delivery mechanism. Our XML documents may
> be as large as 1GB and contain up to 100,000 images.
>
> My question is, has anyone done anything like this before?


Yes, Andrew Schorr told me that he processes files
of this size. After some experiments with Pyxie, he
now uses xgawk with the XML extension of GNU Awk.

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

> What are the performance considerations?


Andrew stores each item in a separate XML file and
then concatenates all the XML files into one large file,
often larger than 1 GB. My own performance measurements
tell me that a modern PC should parse about 10 MB/s.

> Do the current parsers support this size of XML file?


Yes, but probably only SAX-like parsers.
DOM-like parsers have to store the complete file
in memory and are therefore limited by the amount
of memory. In reality, no DOM parser to date is able
to read XML files larger than about 500 MB. If I am wrong
about this, I bet that someone will correct me.
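
A minimal sketch of that streaming style in Java, assuming the images
sit in <image> elements (the element name is just a guess at your
schema):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ImageCounter extends DefaultHandler {
    private long count = 0;

    // Called once per start tag; only the current event is held in
    // memory, so the total file size does not matter.
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        if ("image".equals(qName)) {
            count++;
        }
    }

    public void endDocument() {
        System.out.println("images seen: " + count);
    }

    public static void main(String[] args) throws Exception {
        // Read the XML from standard input, e.g. fed by a gzip pipe.
        SAXParserFactory.newInstance()
                        .newSAXParser()
                        .parse(System.in, new ImageCounter());
    }
}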

> Is there a better way to deliver large sets of binary files (i.e. zip
> files or something like that)?


I store such files in .gz format. When reading them, it
is a good idea _not_ to unzip them to disk first. Use gzip to
produce a stream of data which is immediately processed by
the SAX parser:

gzip -dc large_file.xml.gz | parser ...

The advantage of this approach is that at any given moment
only part of the file occupies space in memory. This is
extremely fast, and your server can run a hundred such
processes on each CPU in parallel.
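
If you prefer to keep the decompression inside the same JVM rather
than a shell pipeline, java.util.zip can wrap the stream directly; a
small sketch, reusing the hypothetical ImageCounter handler from the
SAX example above:

import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;

public class GzipParse {
    public static void main(String[] args) throws Exception {
        // Decompress on the fly; neither the compressed nor the
        // decompressed file is ever held completely in memory.
        GZIPInputStream in =
            new GZIPInputStream(new FileInputStream("large_file.xml.gz"));
        SAXParserFactory.newInstance()
                        .newSAXParser()
                        .parse(in, new ImageCounter());
        in.close();
    }
}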
 
Jimmy Zhang
 
      01-08-2006
You can also try VTD-XML (http://vtd-xml.sf.net), whose in-memory
representation takes about 1.3~1.5x the size of the XML file.
Currently it only supports file sizes up to 1 GB, so if you have 2 GB
of physical memory you can load everything into memory and perform
random access on it like DOM (with DOM you would of course get an
OutOfMemory exception). Support for larger files is on the way.
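
For a feel of the random-access style, roughly what a VTD-XML lookup
looks like (written from memory of the project's sample code, so
check the class and method names against the current release; the
file name and the //image path are just placeholders):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdDemo {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // VTD-XML indexes the raw bytes instead of building a DOM
        // tree, which is where the ~1.3-1.5x memory figure comes from.
        if (vg.parseFile("images.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("//image");          // placeholder XPath
            int i;
            while ((i = ap.evalXPath()) != -1) {
                // toString(i) returns the element name at the match.
                System.out.println("match: " + vn.toString(i));
            }
        }
    }
}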

"Jürgen Kahrs" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> jdev8080 wrote:
>
>> Basically, we have images that have associated metadata and we are
>> trying to develop a unified delivery mechanism. Our XML documents may
>> be as large as 1GB and contain up to 100,000 images.
>>
>> My question is, has anyone done anything like this before?

>
> Yes, Andrew Schorr told me that he processes files
> of this size. After some experiments with Pyxie, he
> now uses xgawk with the XML extension of GNU Awk.
>
> http://home.vrweb.de/~juergen.kahrs/gawk/XML/
>
>> What are the performance considerations?

>
> Andrew stores each item in a separate XML file and
> the concatenates all the XML files to one large file,
> often large than 1 GB. My own performance measurements
> tell me that a modern PC should parse about 10 MB/s.
>
>> Do the current parsers support this size of XML file?

>
> Yes, but probably only SAX-like parsers.
> DOM-like parsers have to store the complete file
> in memory and are therefore limited by the amount
> of memory. In reality, no DOM parsers to date is able
> to read XML files larger than about 500 M. If I am wrong
> about this, I bet that someone will correct me.
>
>> Is there a better way to deliver large sets of binary files (i.e. zip
>> files or something like that)?

>
> I store such files in .gz format. When reading them, it
> is a good idea _not_ to unzip them. Use gzip to produce
> a stream of data which will be immediately processed by
> the SAX parser:
>
> gzip -c large_file.xml | parser ...
>
> The advantage of this approach is that at each time instant,
> only part of the file will occupy space in memory. This is
> extremely fast and your server can run a hundred of such
> processes on each CPU in parallel.



 