
"drop-in" DOM replacement for minidom?

 
 
Paul Miller
 
      08-13-2003
We've run into minidom's inability to handle large (20+MB) XML files, and
need a replacement that can handle it. Unfortunately, we're pretty
dependent on a DOM, so a pulldom or SAX replacement is likely out of the
question for now.

Has someone done a more efficient minidom replacement module that we can
just drop in? Preferably written in C?


 
Geoff Gerrietts
 
      08-13-2003
Quoting Paul Miller:
> We've run into minidom's inability to handle large (20+MB) XML files, and
> need a replacement that can handle it. Unfortunately, we're pretty
> dependent on a DOM, so a pulldom or SAX replacement is likely out of the
> question for now.
>
> Has someone done a more efficient minidom replacement module that we can
> just drop in? Preferably written in C?


I've posted on a related topic in the past, when a friend of mine was
blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
every response I got was of the general form "well what the hell are
you using DOM for? are you defective?" Some were more diplomatic than
others.

My friend also had some more challenging problems. He was running on a
DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
byte-ordering problems. PyRXP wouldn't compile for him, if I recall
correctly -- or maybe there were licensing problems? Anyway, he
ultimately settled on using pulldom; that gave him simplicity, speed,
and a small enough memory profile that it satisfied his needs.

Obviously it won't help in your case.

I don't think you'll find something that precisely mimics the minidom
module's interface, so you're going to hafta do some retooling.
However, I believe that if you can get 4Suite to compile, you might
find some love in there. There's a cDomlette component (labelled at
the time of my last reading as "experimental") that builds the parse
tree in C, with a minimal memory consumption.

Here's a link to something that should tell you how to make it work
(though when I personally used cDomlette, I seem to remember it being
harder than this....)

http://uche.ogbuji.net/tech/akara/no...1-01/domlettes

Also, you may be interested in looking at the comparisons done by the
PyRXP folks on their page:

http://www.reportlab.com/xml/pyrxp.html

Best of luck!

--G.

--
Geoff Gerrietts <geoff at gerrietts net>
"Whenever people agree with me I always feel I must be wrong." --Oscar Wilde

 
Armin Wittfoth
 
      08-14-2003
Harry George wrote:
> Paul Miller writes:
>
> Switching to
> SAX was a major improvement in mem usage and thus in parse time.
>


As an alternative you can easily build a custom, lightweight, Object
Model. I'm using one designed naively to reflect the set of elements
used in the several XML schemas we use. I use SAX to parse the
document into our object model and have the convenience of programming
with the nicer (in some ways DOM like) interface.

Basically there is a class Element which is a subclass of list (possible
since Python 2.2). By convention it can contain either a unicode string
(character data) or another Element. The XML attributes can be stored
either in a dictionary or, as I eventually did, directly as attributes of
the object. Record the parent element (aka location), add some methods
such as nextSibling(), and you're on your way.

In our case I've adopted a naive approach, i.e. there is a separate
class for every type of XML element (all of which ultimately derive from
Element). This suffers from being non-general (i.e. tied to the
specific set of schemas we use), but it has the advantage that you
don't have to look up what kind of Element you are dealing with and
decide what to do with it; you can use polymorphism instead.
Further, there is no conceptual difference between a chunk of XML and
the Python object structure (i.e. Elements within Elements) used to
represent it.

It was because Python is so ideally suited to this kind of thing
that I originally adopted it. As an aside, I wrote an XSLT stylesheet
which reads the various XML Schema files (I only write DTDs myself,
relying on converters to generate the xsd) and writes out the Python stub
code (i.e. it creates the basic class definition for each element, adding
the appropriate attributes etc.). That saves a lot of boring boilerplate
typing and allows for quick and accurate code updates if new
attributes are added to the schema.

Going about it in this kind of way, you get something of much lighter
weight than DOM, but which does have that nice structural (as opposed
to SAX's event-driven) way of working with XML.
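
A rough sketch of the approach described above, written for current Python
rather than 2.2. This is not the actual code from that project: the names
TreeBuilder and parse_string, and the optional tag-to-class mapping, are
made up for illustration. Only xml.sax and the list-subclass idea come from
the description.

```python
import xml.sax


class Element(list):
    """A list whose items are text strings or other Element objects."""

    def __init__(self, name, attrs=None, parent=None):
        super().__init__()
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent

    def nextSibling(self):
        """Return the node after this one in the parent, or None."""
        if self.parent is None:
            return None
        # Compare by identity: list equality would match look-alike siblings.
        for i, node in enumerate(self.parent):
            if node is self:
                if i + 1 < len(self.parent):
                    return self.parent[i + 1]
                return None
        return None


class TreeBuilder(xml.sax.ContentHandler):
    """SAX handler that assembles Element objects as events arrive."""

    def __init__(self, classes=None):
        super().__init__()
        # Hypothetical mapping of tag name -> Element subclass.
        self.classes = classes or {}
        self.root = None
        self.current = None

    def startElement(self, name, attrs):
        cls = self.classes.get(name, Element)
        node = cls(name, dict(attrs.items()), self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.append(node)
        self.current = node

    def endElement(self, name):
        self.current = self.current.parent

    def characters(self, content):
        # Whitespace-only runs are dropped; long text may arrive in chunks.
        if self.current is not None and content.strip():
            self.current.append(content)


def parse_string(text, classes=None):
    builder = TreeBuilder(classes)
    xml.sax.parseString(text.encode("utf-8"), builder)
    return builder.root
```

Passing something like classes={"book": Book}, where Book derives from
Element, gives the one-class-per-element-type behaviour, so methods can be
dispatched polymorphically instead of by inspecting tag names.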
 
 
Bengt Richter
 
      08-14-2003
On Wed, 13 Aug 2003 11:09:39 -0500, Paul Miller wrote:

>We've run into minidom's inability to handle large (20+MB) XML files, and
>need a replacement that can handle it. Unfortunately, we're pretty
>dependent on a DOM, so a pulldom or SAX replacement is likely out of the
>question for now.
>
>Has someone done a more efficient minidom replacement module that we can
>just drop in? Preferably written in C?
>

I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
Can you assume that you are dealing with valid (error-free) XML as input?

Regards,
Bengt Richter
 
 
Uche Ogbuji
 
      08-15-2003
Geoff Gerrietts wrote:
> Quoting Paul Miller:
> > We've run into minidom's inability to handle large (20+MB) XML files, and
> > need a replacement that can handle it. Unfortunately, we're pretty
> > dependent on a DOM, so a pulldom or SAX replacement is likely out of the
> > question for now.
> >
> > Has someone done a more efficient minidom replacement module that we can
> > just drop in? Preferably written in C?

>
> I've posted on a related topic in the past, when a friend of mine was
> blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
> every response I got was of the general form "well what the hell are
> you using DOM for? are you defective?" Some were more diplomatic than
> others.


My response is usually more like "what are you using a single 30MB XML
file for?"

I've long maintained that when working with XML, keeping document sizes
modest is very important, regardless of what tools you're using.

But that having been said, some documents are 30MB, and it makes sense
that they're 30MB, and that's just the way it is.


> My friend also had some more challenging problems. He was running on a
> DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
> byte-ordering problems.


4Suite used to have byte-ordering problems, originally reported under
Solaris 9, and also affecting some Mac OS X users. Those are fixed
now.


> PyRXP wouldn't compile for him, if I recall
> correctly -- or maybe there were licensing problems? Anyway, he
> ultimately settled on using pulldom; that gave him simplicity, speed,
> and a small enough memory profile that it satisfied his needs.
>
> Obviously it won't help in your case.


pulldom is always worth considering.

http://www-106.ibm.com/developerwork...tipulldom.html
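
For readers who haven't used it, here is a minimal sketch of the usual
pulldom pattern, assuming the large file is mostly a long run of repeated
record elements. The function name iter_records and the tag name passed to
it are hypothetical; pulldom.parse, START_ELEMENT and expandNode are from
the standard library.

```python
import io
from xml.dom import pulldom


def iter_records(stream, tag):
    """Yield one small minidom subtree per <tag> element in the stream."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build a DOM for this element only, not for the whole document.
            events.expandNode(node)
            yield node


# Usage: each yielded node supports the familiar minidom calls.
sample = io.BytesIO(
    b"<root><item id='1'><name>a</name></item>"
    b"<item id='2'><name>b</name></item></root>"
)
for item in iter_records(sample, "item"):
    name = item.getElementsByTagName("name")[0].firstChild.data
    print(item.getAttribute("id"), name)
```

Code that only needs DOM access within one record at a time can often keep
its minidom calls unchanged inside the loop. It won't help where a node has
to refer to another node far away in the document.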

> I don't think you'll find something that precisely mimics the minidom
> module's interface, so you're going to hafta do some retooling.
> However, I believe that if you can get 4Suite to compile,


Which I hardly expect to be a problem.

> you might
> find some love in there. There's a cDomlette component (labelled at
> the time of my last reading as "experimental")


cDomlette hasn't been experimental for nearly a year now. We use it
heavily in production.


> that builds the parse
> tree in C, with a minimal memory consumption.


And fast parse and mutation time.


> Here's a link to something that should tell you how to make it work
> (though when I personally used cDomlette, I seem to remember it being
> harder than this....)
>
> http://uche.ogbuji.net/tech/akara/no...1-01/domlettes


Your memories must be from long ago. That API is how it's been for
a while.


> Also, you may be interested in looking at the comparisons done by the
> PyRXP folks on their page:
>
> http://www.reportlab.com/xml/pyrxp.html
>
> Best of luck!


Ditto.

--Uche
http://uche.ogbuji.net
 
 
Paul Miller
 
      08-15-2003
>>Has someone done a more efficient minidom replacement module that we can
>>just drop in? Preferably written in C?
>>

>I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
>Can you assume that you are dealing with valid (error-free) XML as input?


Yes, it is assumed to be valid. We don't even use a DTD. But we use the DOM
to point to later nodes in the tree by following references in nodes higher
in the tree.

But, building a sparse object model initially and resolving references
later might be the right solution.
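
A sketch of what that could look like, under assumptions: the attribute
names "id" and "ref" are invented here, since our real files use their own
conventions, and Node, SparseBuilder and load are illustrative names only.
A SAX pass keeps just the nodes that carry an id or a reference, and the
references are resolved once parsing has finished, so a node can point at
something that appears later in the file.

```python
import xml.sax


class Node:
    """Sparse record: keeps only what the application needs."""

    __slots__ = ("tag", "attrs", "target")

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.target = None  # filled in by SparseBuilder.resolve()


class SparseBuilder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.by_id = {}    # id value -> Node
        self.pending = []  # nodes whose "ref" is not yet resolved

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        # Elements with neither attribute are simply not kept.
        if "id" in node.attrs:
            self.by_id[node.attrs["id"]] = node
        if "ref" in node.attrs:
            self.pending.append(node)

    def resolve(self):
        """Second pass: turn reference strings into object links."""
        for node in self.pending:
            # Raises KeyError if a reference points at a missing id.
            node.target = self.by_id[node.attrs["ref"]]
        self.pending.clear()


def load(data):
    """Parse XML bytes and return the id -> Node index, references resolved."""
    builder = SparseBuilder()
    xml.sax.parseString(data, builder)
    builder.resolve()
    return builder.by_id
```

Because resolution happens after the whole document has been seen, forward
references cost nothing extra; memory use depends on how many nodes carry
ids or references rather than on the full tree.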


 