XML - XSL Translation of a large XML file using Java results in OutOfMemory
#1
Hi
I'm attempting additions/changes to a Java program that (among other things) uses XSLT to transform a large (96 MB) XML file. It runs fine on small XML files but generates OutOfMemory exceptions with large XML files. I tried a simple punt of -Xmx512MB but that didn't work. In the future, the input XML file may become considerably bigger than 96 MB, so even if it did work, it probably would be putting off the inevitable to some later date.

I'm using JavaSE 1.4.2_11 and the XSL/XML libraries that come with it. The translation is from and to an XML file. The code I inherited looks a lot like most of the example code you can find on the net for doing an XSLT transformation. The relevant part is:

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer(xsltSource);
    transformer.transform(new StreamSource(new StringReader(x)), xsltDest);

where xsltSource is XSLT in the form of a string, generated by code immediately above the snip shown, and "x" is the input XML to be transformed.

Things I tried:

1. I modified the above code to use a file instead of a String as the XML to be transformed and a file for the XSLT that specifies the transformation. It works fine with small XML input files but not with large ones. I assume this code is using the DOM parser, and there is simply not enough room in memory to house the input XML file.

2. Based on some old (years old) newsgroup posts I found, I tried using a SAX equivalent of the above code, assuming that SAX takes in, parses and transforms the input XML file either piecemeal (maybe element by element?) or that SAX uses the complete virtual memory of the computer. But this code also results in successful runs on small input XML files and OutOfMemory errors on large ones.
Here is a snip of the SAX code (adapted from a chapter of Burke's "XSLT and Java" at the O'Reilly website):

    FileInputStream brXSLT = new FileInputStream("C:/Documents and Settings/Lenny/Desktop/OCCxsl.xsl");

    // Set up the transformer
    TransformerFactory transFact = TransformerFactory.newInstance();
    SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
    Source xsltSource = new StreamSource(brXSLT);
    TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);

    // Set up input source
    InputSource inxml = new InputSource(inXML);
    SAXSource saxSource = new SAXSource(inxml);

    // Set the destination for the XSLT transformation
    transHand.setResult(new StreamResult(outXML));

    // Attach the XSLT processor to the XMLReader
    String parserClass = "org.apache.crimson.parser.XMLReaderImpl";
    XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

    // Parse the input file to an output file
    reader.setContentHandler(transHand);
    reader.parse(inxml);

I'm considering making a custom parser of the input XML file which basically identifies elements of the input XML file and treats each element as if it were a complete document, e.g. send the content handler

    ch.startDocument()
    ch.startElement(..)   // pass through the original element
    ch.characters(..)     // "
    ch.endElement(..)     // "
    ch.endDocument()

for each element in the input XML file. But being a newbie to XSLT, I don't know if this is worth pursuing, or even if it would work; I'm hoping there are simpler, more straightforward ways of accomplishing the same thing at a higher level. It does seem pretty clumsy, even if it would work.

I found a reply on the web to someone who had a similar problem, to the effect that a "SAX pipeline" should be used. But there was no further elaboration, and so far, I haven't figured out what a SAX pipeline is or how it would help. Any advice, references to examples, or actual examples would be greatly appreciated.
Non-procedural programming is taking quite a bit of effort to understand! Thanks in advance for your help.

Lenny Wintfeld

PS: I've had this up on comp.lang.java.programmer for most of the day with no replies. It bridges both specialties, which is why I'm trying here.
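Before going further it may be worth confirming that the heap option actually took effect, since the Java launcher documentation writes heap sizes with a single-letter suffix (for example -Xmx512m). The following is a minimal sketch, not part of the original program; the class name HeapCheck is made up for illustration. It only reports the maximum heap the running JVM believes it has.

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with the same -Xmx option as the real program and compare.
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If the reported figure does not change when the option is changed, the option is probably not reaching the JVM in the form expected.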
#2
In general, XSLT can't operate as a streaming processor, since its use of XPaths assumes the entire document is available in memory (or at least can be re-read) at once. Some processors use more compact models than others and thus may be able to handle larger documents in the same memory; this is part of why Xalan created its own model, known as DTM, rather than using an off-the-shelf DOM implementation.

If you're willing to limit the kinds of stylesheets you write to ones which _only_ process the document in forward order, you can of course set up a minimal data model which just contains one (or a few) nodes; Xalan's SQL extension works that way, actually.

Yes, automatically recognizing which stylesheets (or portions thereof) are streamable would be a Good Thing, but it's still something of a Holy Grail for XSLT implementers. If you look in the archives of the Xalan mailing list, you'll see much past discussion of this, and of possible approaches to dealing with it. Look in particular for the keywords "streaming", "pruning", and "filtering". Folks are continuing to research this, but it is not an easy problem.

But until someone does get a handle on this problem... Sometimes, if you have to process large documents, the only good answer is to drop down from XSLT to a lower level and code the processing yourself as a direct SAX application. That lets you take advantage of whatever streaming/pruning/filtering opportunities exist, as well as letting you code a special-purpose (and thus more compact) model for any data you do have to retain. High-level languages are a good thing, but some problems are still best addressed by low-level bit-twiddling.
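To make the "direct SAX application" idea concrete, here is a minimal sketch of a handler that streams through a document and retains only a small special-purpose model, in this case a count per element name. The class name ElementCounter and the sample document are invented for illustration, and the code uses current Java syntax (generics) rather than 1.4.2; only standard JAXP/SAX calls are used.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ElementCounter extends DefaultHandler {
    // The only state kept while streaming: one counter per element name.
    private final Map<String, Integer> counts = new TreeMap<String, Integer>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Integer n = counts.get(qName);
        counts.put(qName, n == null ? 1 : n + 1);
    }

    /** Parses the input once, front to back, and returns the per-element counts. */
    static Map<String, Integer> count(InputSource in) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<list><item/><item/><note/></list>";
        System.out.println(count(new InputSource(new StringReader(xml))));
        // prints {item=2, list=1, note=1}
    }
}
```

Memory use here depends on the number of distinct element names, not on the size of the input, which is the property a tree-building XSLT processor cannot offer in general.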
#3
Joe Kesselman wrote:
> In general, XSLT can't operate as a streaming processor, since its use
> of XPaths assumes the entire document is available in memory (or at
> least can be re-read) at once. Some processors use more compact models
> than others and thus may be able to handle larger documents in the same
> memory; this is part of why Xalan created its own model, known as DTM,
> rather than using an off-the-shelf DOM implementation.

Perhaps it's appropriate to mention Omnimark, which uses a technique sometimes known as "write-behind" (borrowed from the hardware field). Instead of having an addressing scheme (XPath) for accessing objects out of document sequence, it provides for the placement of references to named anchors at the places where you know (or have computed) you will need to access such objects, and then creating the anchors themselves when you encounter them in document order. When the last event in document order has triggered, the "write-behind" reconciliation takes place, and all the values of the anchors are slotted into the places reserved for them by the references. (At least, this is how it used to work: I haven't used it for years.)

///Peter
--
XML FAQ: http://xml.silmaril.ie/
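The write-behind idea can be sketched in a few lines of Java. This is only an illustration of the concept as described above, not Omnimark's API or implementation: the class WriteBehindBuffer and its methods text, reference, define and resolve are all hypothetical names. Output is recorded in document order, references reserve a slot, and the slots are filled once every anchor has been seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WriteBehindBuffer {
    // Output pieces in document order; each piece is literal text or an anchor name.
    private final List<String> pieces = new ArrayList<String>();
    private final List<Boolean> isReference = new ArrayList<Boolean>();
    private final Map<String, String> anchors = new HashMap<String, String>();

    void text(String s)             { pieces.add(s);      isReference.add(Boolean.FALSE); }
    void reference(String anchor)   { pieces.add(anchor); isReference.add(Boolean.TRUE); }
    void define(String anchor, String value) { anchors.put(anchor, value); }

    /** Reconciliation step: slot every anchor value into its reserved place. */
    String resolve() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pieces.size(); i++) {
            if (isReference.get(i)) {
                String value = anchors.get(pieces.get(i));
                if (value == null) {
                    throw new IllegalStateException("Undefined anchor: " + pieces.get(i));
                }
                out.append(value);
            } else {
                out.append(pieces.get(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        WriteBehindBuffer out = new WriteBehindBuffer();
        out.text("<report total=\"");
        out.reference("total");          // value not known yet
        out.text("\"><item>3</item><item>4</item></report>");
        out.define("total", "7");        // computed after the items were seen
        System.out.println(out.resolve());
        // prints <report total="7"><item>3</item><item>4</item></report>
    }
}
```

Note that this sketch still holds the output pieces in memory until reconciliation; what it avoids is keeping an addressable tree of the input.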
#4
Thanks very much for your reply and advice. It's a shame that the XSL transform engines can't (at least as an option) use virtual memory as their target environment for XML data file transformations. It looks like I may have a long row to hoe in doing the equivalent of the transform using procedural code! The sad part is, the transformations that are done to these XML files using XSLT seem to be custom made for XSLT!

Just a couple of quick follow-ups:

1. Note that the transformation that is being done is XML to XML. Except for a sort, which could be broken out of the XSLT stylesheet and done procedurally after the transformation is complete, all other transformations in the stylesheet are local to small elements in the XML being transformed and there are no dependencies between these. With those restrictions, is there a way to mechanize a sequential (element-by-element) transformation? If so, could you point me to some examples?

2. I'm tantalized by the reference that I noted in my original post to a suggestion that a "SAX Pipeline" be used to process very large XML files. To me that sounds like a sequential processor of XML with XSLT. Do you know where I could get additional info on a "SAX Pipeline", or might this have been some wishful thinking on the part of its author?

Once again, thanks for your feedback.

Lenny Wintfeld
#5
Lenny Wintfeld wrote:
> Just a couple of quick follow ups: 1. Note that the transformation that
> is being done is XML to XML. Except for a sort, which could be broken
> out of the XSLT stylesheet and done procedurally after the
> transformation is complete, all other transformations in the stylesheet
> are local to small elements in the xml being transformed and there are
> no dependencies between these. With those restrictions, is there a way
> to mechanize a sequential (element-by-element) transformation? If so
> could you point me to some examples? 2. I'm tantalized by the reference

It sounds like your focus is on large files (> 100 MB) and you may be willing to give up XSL and Java in order to solve the problem. The following tool is not so specialized in producing XML files, but it can handle 1 GB of data within 1 or 2 minutes:

http://home.vrweb.de/~juergen.kahrs/...of-an-XML-file

> that I noted in my original post to a suggestion that a "SAX Pipeline"
> be used to process very large XML files. To me that sounds like a
> sequential processor of XML with XSLT. Do you know where I could get
> additional info on a "SAX Pipeline", or might this have been some
> wishful thinking on the part of its author?

Maybe this one helps:

Pipestreaming microformats
http://www-128.ibm.com/developerwork...matters44.html
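For readers wondering what a "SAX pipeline" might look like in Java, one common reading is a chain of SAX stages where each stage receives events, changes or drops some, and passes the rest on. The following is a minimal sketch under that assumption: the class names SaxPipeline and RenameFilter and the item/entry element names are invented for illustration. The stages are a parser, one XMLFilterImpl subclass, and an identity TransformerHandler used purely as a serializer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class SaxPipeline {
    /** Pipeline stage: renames one element and passes every other event through. */
    static class RenameFilter extends XMLFilterImpl {
        private final String from;
        private final String to;

        RenameFilter(XMLReader parent, String from, String to) {
            super(parent);
            this.from = from;
            this.to = to;
        }

        private String rename(String name) { return from.equals(name) ? to : name; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, rename(local), rename(qName), atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            super.endElement(uri, rename(local), rename(qName));
        }
    }

    static String run(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();

        // Last stage: an identity TransformerHandler that just serializes events.
        SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        // parser -> filter -> serializer; events flow through without building a tree.
        RenameFilter filter = new RenameFilter(parser, "item", "entry");
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<list><item>a</item><item>b</item></list>"));
    }
}
```

A TransformerHandler built from a real stylesheet can also be a stage in such a chain, but as noted earlier in the thread, that stage generally builds its whole input in memory, so a pipeline only streams as long as every stage does.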
#6
Lenny Wintfeld wrote:
> Thanks very much for your reply and advice. It's a shame that the XSL
> transform engines can't (at least as an option) use virtual memory as
> their target environment for xml data file transformations.

Generally, XSLT transformers *will* use virtual memory if the language they're running in and the operating system they're running on support it -- they just don't try to do the memory management themselves; they trust the system to do it for them. And in fact Java does use virtual memory... but the JVM you're using won't let you set that limit high enough for this particular document.

> It looks
> like I may have a long row to hoe in doing the equivalent of the
> transform using procedural code! The sad part is, the transformations
> that are done to these XML files using XSLT seem to be custom made for
> XSLT!

I know how you feel. All I can say is that I know folks who are working on finding ways to address this, so In The Future Things Should Be Better. The concepts are relatively straightforward; the hard part is translating them into rules the machine can apply.

> transformation is complete, all other transformations in the stylesheet
> are local to small elements in the xml being transformed and there are
> no dependencies between these. With those restrictions, is there a way
> to mechanize a sequential (element-by-element) transformation?

I agree that this is exactly the kind of problem that ought to be streamable... There's no portable way to leverage that, but a specific XSLT processor may have a way to handle it.

To take the example I know best: Xalan's internal data representation does happen to have the ability to "prune off" the most recently added nodes, so an explicit call to an extension function could, theoretically, discard the element once you're done processing it. In fact, one of Xalan's more obscure and underdocumented extensions does discard trees, though only in specific situations; we added that to handle the foreach-over-a-list-of-document()s situation... but I don't think there's a generalized version which would address your case. (We'd started investigating one, actually, then Other Priorities Intervened.)

> could you point me to some examples? 2. I'm tantalized by the reference
> that I noted in my original post to a suggestion that a "SAX Pipeline"
> be used to process very large XML files. To me that sounds like a
> sequential processor of XML with XSLT.

I think that was probably intended to be a reference to hand-coded SAX processing. But actually, you *could* do a compromise: hand-code a SAX processor which essentially breaks the large document up into a series of smaller ones and runs XSLT transforms on each one via its API (e.g. TrAX, if you're working in Java), then reassembles the output of those transformations into a single document again.

--
()  ASCII Ribbon Campaign  | Joe Kesselman
/\  Stamp out HTML e-mail! | System architexture and kinetic poetry
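The compromise described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested solution for the original 96 MB file: the class name ChunkedTransform, the item/entry sample data and the choice to treat each direct child of the root as one chunk are all invented here. It compiles the stylesheet once into a Templates object, feeds each chunk to a fresh TransformerHandler as its own small document, and writes the results between a copy of the root tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedTransform extends DefaultHandler {
    private final SAXTransformerFactory factory;
    private final Templates templates;  // stylesheet compiled once, reused per chunk
    private final Writer out;
    private int depth = 0;
    private TransformerHandler chunk;   // non-null while inside a child of the root
    private StringWriter chunkOut;

    ChunkedTransform(SAXTransformerFactory factory, Templates templates, Writer out) {
        this.factory = factory;
        this.templates = templates;
        this.out = out;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        try {
            if (depth == 1) {
                out.write("<" + qName + ">");   // root wrapper; its attributes are dropped here
                return;
            }
            if (depth == 2) {                   // a new chunk begins: start a small document
                chunk = factory.newTransformerHandler(templates);
                chunk.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                chunkOut = new StringWriter();
                chunk.setResult(new StreamResult(chunkOut));
                chunk.startDocument();
            }
            chunk.startElement(uri, local, qName, atts);
        } catch (IOException | TransformerConfigurationException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (chunk != null) chunk.characters(ch, start, length);  // text directly under the root is skipped
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        try {
            if (depth == 1) {
                out.write("</" + qName + ">");
            } else {
                chunk.endElement(uri, local, qName);
                if (depth == 2) {
                    chunk.endDocument();        // the stylesheet has run on this chunk once this returns
                    out.write(chunkOut.toString());
                    chunk = null;               // let the chunk's tree be garbage collected
                }
            }
        } catch (IOException e) {
            throw new SAXException(e);
        }
        depth--;
    }

    static String run(String xml, String xslt) throws Exception {
        SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
        Templates templates = stf.newTemplates(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(new ChunkedTransform(stf, templates, out));
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

The sketch ignores namespace prefix mappings, comments and processing instructions, and the stylesheet must be written so that each chunk is the whole document it sees. Anything that needs a view across chunks, such as the sort mentioned earlier in the thread, still has to happen outside this loop.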
#7
Jurgen, I looked at your reference to xmlgawk in some detail, and it seems pretty encouraging, not only for the problem I stated, but for web tie-ins on XML data. I will look at your document in more detail and at the references (especially XMLBooster, xmllib and Expat). But in the meantime could you let me know directly, or provide me with some info on the following:

How would I tie in xmlgawk to my primary application(s) in Java? Would I do the equivalent of an exec(..) of the awk processor and then look for an exit code, or is there a library that ties it in more directly (similar to the XSLT library for Java)?

I'm looking forward to seeing if xmlgawk would be a reasonable half step between purely procedural code and XSLT, either permanently, or until XSLT can handle the kinds of XML files I'm called on to process.

Thanks for the reference!

Lenny W.