Parsing large XML files FAST

 
 
PedroX
      06-26-2005
Hello:

I need to parse some large XML files and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
XML documents are around 10 MB in size. It's taking over an hour to parse
files of that size!?

I don't really need to use ASP or a web server at all because I am parsing
all in my own computer. Is there any executable that can do this parsing
faster than the way I was doing it?

Thanks in advance.


 
PedroX
      06-26-2005
I wrote:

> I need to parse some large XML files, and save the data in an Access DB. I
> was using MSXML 2 and ASP, but it turns out to be extremely slow when the

I made a mistake. I am actually using MSXML 4.0.


 
Brian Staff
      06-27-2005
> Is there any executable that can do this parsing
> faster than the way I was doing it?


>> objXMLDoc.selectNodes("//node_name")


I am not an expert on techniques of parsing, but if performance were a
problem for me, I would try and use as much explicit node naming as
possible...for instance I would maybe recode the above statement to be
something like this:

objXMLDoc.selectNodes("rootNode/childNode/node_name")

I know if _I_ were the parser, I would be able to find those nodes in a 10 MB
structure quicker using the second technique rather than the first.
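For anyone who wants to try the two query styles side by side, here is a minimal sketch. It uses Python's xml.etree.ElementTree as a stand-in, not MSXML, and the document shape (rootNode/childNode/node_name plus an unrelated branch) is made up for illustration. Note that ElementTree paths are relative to the root element, so the explicit path starts at childNode.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the node names mirror the example path in the post.
xml_text = (
    "<rootNode>"
    "<childNode><node_name>a</node_name><node_name>b</node_name></childNode>"
    "<other><deep><leaf>x</leaf></deep></other>"
    "</rootNode>"
)
root = ET.fromstring(xml_text)

# Descendant search: every element under the root has to be examined.
anywhere = root.findall(".//node_name")

# Explicit path: only children of the root named childNode, then their
# children named node_name. The <other> branch is never walked.
explicit = root.findall("childNode/node_name")

print([n.text for n in anywhere])   # ['a', 'b']
print([n.text for n in explicit])   # ['a', 'b']
```

Both queries return the same nodes here; the difference is how much of the tree has to be visited to find them, which is what grows with document size.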

JAT - Brian


 
Jürgen Kahrs
      06-27-2005
PedroX wrote:

> I need to parse some large XML files, and save the data in an Access DB. I
> was using MSXML 2 and ASP, but it turns out to be extremely slow when then
> XML documents are like 10 mb in size. It's taking over an hour to parse such
> sizes!?


Andrew Schorr had a similar problem. He read
XML files larger than a gigabyte and stored the
data in Postgres. He also had problems finding the
right tool for it. Eventually, he used an extension
of the GNU Awk language:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

Use Google and you will find his explanations
in comp.lang.awk.

> I don't really need to use ASP or a web server at all because I am parsing
> all in my own computer. Is there any executable that can do this parsing
> faster than the way I was doing it?


XML parsers can read large files fast only with
the SAX approach (or similar event-driven models).
The DOM model simply can't do this because it has
to hold the complete XML tree in memory.
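To make the event-driven idea concrete, here is a small sketch using Python's standard xml.sax module (an assumption for illustration; it is not MSXML's SAX interface). The element name "item" and the ItemCounter class are hypothetical. The parser calls the handler as it reads, and no tree is ever built.

```python
import io
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and collects their text without building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.titles = []
        self._buf = None  # text pieces of the <item> currently being read

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "item":
            self.titles.append("".join(self._buf))
            self._buf = None


sample = b"<items><item>one</item><item>two</item></items>"
handler = ItemCounter()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.count, handler.titles)  # 2 ['one', 'two']
```

For a real file, a filename or an open file object can be passed to xml.sax.parse in place of the BytesIO object.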
 
PedroX
      06-27-2005

"Brian Staff" wrote

> I am not an expert on techniques of parsing, but if performance were a
> problem for me, I would try and use as much explicit node naming as
> possible...for instance I would maybe recode the above statement to be
> something like this:
>
> objXMLDoc.selectNodes("rootNode/childNode/node_name")


WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Thank you !!!!!!!!!!!!!!!!


 
Bryce K. Nielsen
      06-27-2005
> WOW. That DID make a difference.
> What was taking over an hour before now takes about 2 minutes!
> Thank you !!!!!!!!!!!!!!!!
>


This will make a huge difference. Remember that with XPath, the //node_name
means that it will search *every* node in the entire document. If you make
it more specific, it will be a lot faster.

However, when dealing with 10 MB+ documents, you should really start using
SAX and not DOM. I was unaware that VBScript couldn't do SAX; since MSXML's
SAX parser is just a COM object, I figured you could (I've just never
tried). If you can't implement the interface, you could always create a COM
wrapper that does specifically what you need and call that from your ASP
page. I.e., using VB, create a COM object that takes an XML string and
uses the SAX parser to do the inserts into Access, etc.

But the point is, for 10 MB+, stay away from DOM and use SAX...
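A rough sketch of that "take an XML string, SAX-parse it, do the inserts" idea, under heavy assumptions: it is written in Python rather than VB/COM, it uses an in-memory SQLite database as a stand-in for Access, and the record/id/name layout, the RowLoader class, and the load_xml function are all invented for illustration.

```python
import sqlite3
import xml.sax


class RowLoader(xml.sax.ContentHandler):
    """Inserts one table row per <record> element as the parser streams past it."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.row = None    # fields of the <record> being read
        self.field = None  # name of the child element being read
        self.text = []

    def startElement(self, name, attrs):
        if name == "record":
            self.row = {}
        elif self.row is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "record":
            self.conn.execute(
                "INSERT INTO records (id, name) VALUES (?, ?)",
                (self.row.get("id"), self.row.get("name")),
            )
            self.row = None
        elif self.field == name:
            self.row[name] = "".join(self.text)
            self.field = None


def load_xml(xml_string, conn):
    # One transaction for the whole document: commit on success, roll back on error.
    with conn:
        xml.sax.parseString(xml_string.encode("utf-8"), RowLoader(conn))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT, name TEXT)")
load_xml(
    "<data><record><id>1</id><name>Ann</name></record>"
    "<record><id>2</id><name>Bob</name></record></data>",
    conn,
)
rows = conn.execute("SELECT id, name FROM records ORDER BY id").fetchall()
print(rows)  # [('1', 'Ann'), ('2', 'Bob')]
```

Wrapping all inserts in a single transaction is a design choice of this sketch; committing row by row tends to be much slower on file-based databases.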

Bryce K. Nielsen
SysOnyx, Inc. (www.sysonyx.com)
Makers of xmlDig, the XML-SQL Extractor
http://www.sysonyx.com/products/xmldig

P.S. Why did you cross-post this? I typically find better results when I
post messages to one board at a time...


 
Jürgen Kahrs
      06-27-2005
PedroX wrote:

> WOW. That DID make a difference.
> What was taking over an hour before now takes about 2 minutes!


Expat (an XML/SAX parser) needs about 2 seconds for 10 MB.
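Expat can be tried without compiling anything, for example through Python's standard xml.parsers.expat binding. The sketch below only counts start tags; the sample document is made up, and it makes no claim about timing.

```python
import xml.parsers.expat

counts = {}


def start_element(name, attrs):
    # Called by Expat once for every start tag it encounters.
    counts[name] = counts.get(name, 0) + 1


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
# The second argument marks this as the final chunk of input.
parser.Parse(b"<root><row/><row/><row/></root>", True)
print(counts)  # {'root': 1, 'row': 3}
```

For a file on disk, parser.ParseFile(f) with a file opened in binary mode reads the input in chunks instead of taking it as one string.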
 
Brian Staff
      06-27-2005
> WOW. That DID make a difference.
> What was taking over an hour before now takes about 2 minutes!


Well, it was a bit of a guess on my part<g> - but it is encouraging to know
that explicit XPath naming really does make a difference.

Brian

 
PedroX
      06-27-2005
> But the point is, 10mb+, stay away from DOM, use SAX...

I wanted to, but the whole thing (including the alternative, .NET's
XmlTextReader) is just beyond my comprehension. I found no tutorials that I
could understand. I know VBScript and JavaScript / JScript, and that's pretty
much it: no Java, no C, no Visual Basic per se (although it is similar to
VBScript).
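For readers who find callback-style SAX hard to follow, a loop-driven reader is often easier to reason about. The sketch below uses Python's ElementTree.iterparse, which is only loosely comparable in spirit to XmlTextReader and is not the .NET API; the orders/order/total layout is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

sample = io.BytesIO(
    b"<orders>"
    b"<order id='1'><total>9.50</total></order>"
    b"<order id='2'><total>3.25</total></order>"
    b"</orders>"
)

totals = {}
# iterparse yields each element as soon as its end tag has been read,
# so the code is an ordinary for loop instead of a set of callbacks.
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == "order":
        totals[elem.get("id")] = elem.findtext("total")
        elem.clear()  # discard this order's children and text once handled

print(totals)  # {'1': '9.50', '2': '3.25'}
```

The whole program reads top to bottom, which may be closer to how a VBScript or JScript loop over a node list looks.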





 
Bryce K. Nielsen
      06-27-2005
> Well, it was a bit of a guess on my part<g> - but it is encouraging to know
> that explicit Xpath naming does really make a difference.
>


Yeah, it will. The double-slash is like a wildcard: search *every* node for
this XPath. If you use an explicit path, it knows to only look in one area.
Also don't forget that the result set of a wildcard search could be large,
whereas an explicit one will probably only return the one node...
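The result-set point can be seen in a tiny example. This is a sketch using Python's xml.etree.ElementTree rather than MSXML, with a made-up catalog document in which the same element name appears in two different branches.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <title> occurs under <book> and, deeper, under <magazine>.
xml_text = (
    "<catalog>"
    "<book><title>A</title></book>"
    "<book><title>B</title></book>"
    "<magazine><issue><title>C</title></issue></magazine>"
    "</catalog>"
)
root = ET.fromstring(xml_text)

# Descendant search matches <title> at any depth, in any branch.
wide = [t.text for t in root.findall(".//title")]

# Explicit path matches only <title> elements directly under <book>.
narrow = [t.text for t in root.findall("book/title")]

print(wide)    # ['A', 'B', 'C']
print(narrow)  # ['A', 'B']
```

So besides visiting more of the tree, the double-slash form can also hand back nodes you did not intend to process.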

-BKN


 