How to speed up XML reading

 
 
Ramon F Herrera
 
      09-11-2012

My application makes a large number of XPath() retrievals, and that
code accounts for most of the run time. The rest of the tasks take a
negligible amount of CPU and disk. In short, all the app does is read
XML variables and write them to a PDF file.

See a previous, very related post below.

-Ramon

=============================================

> You can't compare SAX and DOM. SAX works at the parsing level, whereas
> DOM is for manipulating an XML document. DOM is mostly built on top of
> a SAX system. You can use it, or ignore it and build your own SAX code.
> However, creating your own SAX handler is much more complex, and the
> final result could be much slower than with pure DOM usage.


Very true. (Though some DOM parsers/loaders bypass SAX for greater
efficiency; I believe Xerces actually uses lower-level events to drive
its DOM construction.)

SAX does require that you manage all the state information, which may
or may not include building something like the DOM for part or all of
the document. How fast or slow that will be depends entirely on the
problem at hand and how good your code is.

If you've got time, doing it all via SAX may be worth trying. But it
isn't always going to be a magic bullet.

As I said in my other post, the first thing to do is to find out
whether this is even a significant part of your application's
processing time.

--
Joe Kesselman,

 
 
 
 
 
Ramon F Herrera
 
      09-11-2012
A related thread is: "Why is SAX faster than DOM?"

-RFH
 
 
 
 
 
Ramon F Herrera
 
      09-11-2012

Tools used:
C++
Xerces-C
XQilla
Developed under Linux, ported to Windows


A very important lesson that I learned follows. Xerces implements a
reasonably/very fast XPath retrieval BUT it does so at the expense of
flexibility. The only type of XPath retrieval supported by Xerces is
the MINIMAL one:

string neededVariable = XPath("/this/is/the/variable/that/i/need");

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather:

string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

After running some benchmarks I have concluded that my best option is
to use a combination of the two XPath engines: Xerces for the "easy"
stuff and XQilla for the more complex.
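
One way to automate that split is a small check on the path string. This is only a sketch: it assumes, going by the observation above rather than by the Xerces documentation, that the fast engine accepts nothing but plain /a/b/c child paths. The names `Engine`, `isSimplePath` and `chooseEngine` are hypothetical and belong to neither library.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical tag for the two engines discussed in this thread.
enum class Engine { Fast, Full };

// True for paths of the form /name/name/... whose names use only
// letters, digits, '_' and '-'. Anything else ('[', '@', '=', ':',
// '*', '.', "//") is conservatively treated as complex.
bool isSimplePath(const std::string& path) {
    if (path.size() < 2 || path.front() != '/') return false;
    bool prevSlash = false;
    for (char c : path) {
        if (c == '/') {
            if (prevSlash) return false;   // "//" is a descendant step
            prevSlash = true;
        } else if (std::isalnum(static_cast<unsigned char>(c)) ||
                   c == '_' || c == '-') {
            prevSlash = false;
        } else {
            return false;                  // predicate, attribute, axis, ...
        }
    }
    return !prevSlash;                     // a trailing '/' is not a valid path
}

// Route simple paths to the fast engine and everything else to the full one.
Engine chooseEngine(const std::string& path) {
    assert(!path.empty() && "an XPath expression is required");
    return isSimplePath(path) ? Engine::Fast : Engine::Full;
}
```

The check errs on the side of the slower engine: a path it misclassifies as complex still gets the right answer, just more slowly.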

-Ramon

 
 
Alain Ketterlin
 
      09-12-2012
Ramon F Herrera writes:

[...]
> A very important lesson that I learned follows. Xerces implements a
> reasonably/very fast XPath retrieval BUT it does so at the expense of
> flexibility. The only type of XPath retrieval supported by Xerces is
> the MINIMAL one:
>
> string neededVariable = XPath("/this/is/the/variable/that/i/need");
>
> If the path contains any character like "[", "@", "=", etc. I must
> resort to XQilla, which is wonderful (a LOT easier to code than pure
> Xerces), but as slow as molasses in cold weather:
>
> string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");


... would have the same effect as ancestor::table since the query starts
at document root.

> After running some benchmarks I have concluded that my best option is
> to use a combination of the two XPath engines: Xerces for the "easy"
> stuff and XQilla for the more complex.


XPath may require DOM if you use funny axes, e.g., preceding-sibling::*
and, maybe, ancestor.

However, for the request you show above, a hand-coded SAX parser keeping
a simple stack (with @titledetail cached where appropriate) can extract what
you want. XPath, and any generic query language for that matter, is far
more powerful, and will therefore most likely be slower.
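
A minimal sketch of such a handler for the query quoted above, with several assumptions. The class `TitleDetailHandler` and the `AttrMap` alias are hypothetical stand-ins. The two methods would be called from whatever start-element and end-element callbacks the SAX parser provides. The query is read as "the titledetail attribute of the root table, provided the root has a joint child".

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for the attribute list a real SAX parser would hand over.
using AttrMap = std::map<std::string, std::string>;

class TitleDetailHandler {
public:
    // Call from the parser's start-element callback.
    void startElement(const std::string& name, const AttrMap& attrs) {
        if (stack_.empty() && name == "table") {
            // Root <table>: cache titledetail in case a <joint> child follows.
            auto it = attrs.find("titledetail");
            rootHasTitle_ = (it != attrs.end());
            if (rootHasTitle_) rootTitle_ = it->second;
        } else if (stack_.size() == 1 && stack_.back() == "table" &&
                   name == "joint") {
            // <joint> directly under the root <table>: the query matches
            // whenever the cached attribute exists.
            matched_ = rootHasTitle_;
        }
        stack_.push_back(name);
    }

    // Call from the parser's end-element callback.
    void endElement(const std::string& name) {
        assert(!stack_.empty() && stack_.back() == name);
        stack_.pop_back();
    }

    bool matched() const { return matched_; }

    // Only meaningful when matched() is true.
    const std::string& titleDetail() const { return rootTitle_; }

private:
    std::vector<std::string> stack_;  // names of the currently open elements
    std::string rootTitle_;
    bool rootHasTitle_ = false;
    bool matched_ = false;
};
```

The handler never builds a tree. Its state is the stack of open element names plus one cached string, which is why it can be cheaper than a generic engine for this single query.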

(Generating the SAX handler for any given XPath query is left as an
exercise for the reader.)

-- Alain.
 
 
Ramon F Herrera
 
      09-12-2012
On Sep 12, 4:32 am, Alain Ketterlin wrote:

> (Generating the SAX handler for any given XPath query
> is left as an exercise for the reader.)
>
> -- Alain.


Merci, Alain.

Actually, I think that the solution to my performance problem is to
implement (via SAX?) the reading of the whole XML file and insert the
variables in my own data structures. That must speed up the variable
retrieval substantially BUT an XML guru is required, which I am not.
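
A rough sketch of what "my own data structures" could look like: one pass over the document fills a hash map from element path to text, and every later retrieval is a map lookup. The class `PathIndex` and its method names are hypothetical. The three event methods would be called from the parser's start-element, character-data and end-element callbacks. The sketch assumes that only the first element at each path matters and that only an element's own direct text is wanted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class PathIndex {
public:
    void startElement(const std::string& name) {
        path_ += '/';
        path_ += name;
        lengths_.push_back(name.size() + 1);  // remember how much to trim later
        text_.emplace_back();                 // fresh text buffer for this element
    }

    // A parser may deliver one text node in several chunks.
    void characters(const std::string& chunk) {
        if (!text_.empty()) text_.back() += chunk;
    }

    void endElement() {
        assert(!lengths_.empty());
        values_.emplace(path_, text_.back());  // emplace keeps the first occurrence
        path_.resize(path_.size() - lengths_.back());
        lengths_.pop_back();
        text_.pop_back();
    }

    // Returns nullptr when no element with that path was seen.
    const std::string* find(const std::string& path) const {
        auto it = values_.find(path);
        return it == values_.end() ? nullptr : &it->second;
    }

private:
    std::string path_;                  // path of the currently open element
    std::vector<std::size_t> lengths_;  // length of each "/name" segment
    std::vector<std::string> text_;     // direct text of each open element
    std::unordered_map<std::string, std::string> values_;
};
```

After the single parsing pass, `find("/this/is/the/variable/that/i/need")` costs one hash lookup instead of one XPath evaluation. Paths with predicates or axes would still need a real XPath engine.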

In the meantime, I downloaded libxml and will see how well it
performs. Perhaps that is the solution to my problem. Being written in
C, it should be faster than Xerces-C++.

-Ramon

 
 
Manuel Collado
 
      09-12-2012
On 12/09/2012 14:52, Ramon F Herrera wrote:
>...
> Actually, I think that the solution to my performance problem is to
> implement (via SAX?) the reading of the whole XML file and insert the
> variables in my own data structures. That must speed up the variable
> retrieval substantially BUT an XML guru is required, which I am not.
>
> In the meantime, I downloaded libxml and will see how well it
> performs. Perhaps that is the solution to my problem. Being written in
> C, it should be faster than Xerces-C++


You could try Expat, written in C.

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado



 
 
Joe Kesselman
 
      09-14-2012
On 9/11/2012 1:52 PM, Ramon F Herrera wrote:
> A related thread is: "Why is SAX faster than DOM?"


(Answer: It isn't always. Depends on the patterns of access to the data.)


--
Joe Kesselman,
http://www.love-song-productions.com...lam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
 
Joe Kesselman
 
      09-14-2012
On 9/11/2012 2:20 PM, Ramon F Herrera wrote:
> If the path contains any character like "[", "@", "=", etc. I must
> resort to XQilla, which is wonderful (a LOT easier to code than pure
> Xerces), but as slow as molasses in cold weather


You might want to look at Xalan. There was a fair amount of work put
into Xalan performance; I don't know how XQilla compares to that.

Or, if you're using IBM's Java environment, you might want to look at
the XML support that ships with that JRE, which is another design
iteration past Xalan. Or, in WebSphere, the WebSphere XML feature, which
supports XPath 2.0, XSLT 2.0, and XQuery and is yet another design
iteration.

With all of these, remember that the JAXP/TRAX APIs allow precompiling a
path or query. And remember that the performance can be improved if the
document is cached in memory in the appropriate internal representation.
(The Xerces implementation is single-pass, I believe; if you want to run
more than one path the advantage goes away quickly because you have to
reparse the input document.)


--
Joe Kesselman,
http://www.love-song-productions.com...lam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
 
Joe Kesselman
 
      09-14-2012
> Actually, I think that the solution to my performance problem is to
> implement (via SAX?) the reading of the whole XML file and insert the
> variables in my own data structures. That must speed up the variable
> retrieval substantially BUT an XML guru is required, which I am not.


In many cases, yes, XML should be used as your "portability" level, and
custom internal representations should be used within the application.
Of course the downside is that you then have to implement a lot more of
your own logic rather than being able to take advantage of the XML-level
utilities.

> In the meantime, I downloaded libxml and will see how well it
> performs. Perhaps that is the solution to my problem. Being written in
> C, it should be faster than Xerces-C++


C++ isn't necessarily slower than C. That depends on the details of the
code, both in coding style and in algorithms. Remember, an infinite
speedup of something that accounts for only 1% of runtime is only a 1%
real improvement.
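
That rule of thumb is Amdahl's law. A small sketch for plugging in your own profiler numbers follows; the function name `overallSpeedup` is made up for illustration.

```cpp
#include <cassert>
#include <limits>

// Amdahl's law: overall speedup when a fraction `p` of the runtime
// (0 <= p <= 1) is made `s` times faster (s > 0; infinity is allowed).
double overallSpeedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.01 and s set to `std::numeric_limits<double>::infinity()`, the result is 1 / 0.99, about 1.01, which is the 1% case above. If XPath retrieval really is 90% of the runtime, a 10x faster engine gives 1 / 0.19, a bit over 5x overall.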

--
Joe Kesselman,
http://www.love-song-productions.com...lam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
 
Ramon F Herrera
 
      09-16-2012
On Sep 14, 12:04 am, Joe Kesselman wrote:
> On 9/11/2012 2:20 PM, Ramon F Herrera wrote:
>
> > If the path contains any character like "[", "@", "=", etc. I must
> > resort to XQilla, which is wonderful (a LOT easier to code than pure
> > Xerces), but as slow as molasses in cold weather

>


> You might want to look at Xalan. There was a fair amount
> of work put into Xalan performance; I don't know how XQilla
> compares to that.


Following your earlier advice, I looked into it. It seems that
Xalan has reached a dead end. It won't even compile on a regular Linux
box.

What I discovered is that most of the action is in libxml. See my
thread "Dramatic performance gains with Libxml" (I develop under C/C++).

-Ramon
 