Velocity Reviews

mandibdc@gmail.com 06-30-2006 02:55 PM

Search for string, then extract entire XML element where it appears. How?
 
I need to extract some elements from a very large XML file. Because of
the size, I'd like to work with it on my Linux machine as a text file.

Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

The XML document consists of a series of <item> elements:

<?xml version="1.0" encoding="UTF-8"?>
<item>
<property1>100</property1>
<property2>
<id>0</id>
<code>ThisIsTheStringINeedToMatch</code>
</property2>
<keyword>
<value>value1</value>
<value>value2</value>
</keyword>
<color>
<type>21</type>
<shade>1</shade>
</color>
</item>

How would you approach this? I can write a script to find each code,
but I'm not sure how to then search forwards/backwards to extract the
enclosing <item> element.

Thanks!

M


Joe Kesselman 06-30-2006 03:44 PM

Re: Search for string, then extract entire XML element where it appears. How?
 
mandibdc@gmail.com wrote:
> Basically, I am going to have a list of specific strings I'm searching
> for. For each string, I need to search through the XML file, and when
> I find that string (in the tag <code>), copy the entire <item> XML
> element that the code appears in, into another text file.
>
> How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.
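As a rough illustration of the DOM approach (not something posted in the thread), here is a minimal Python sketch using the standard library's xml.dom.minidom. The wrapping <items> root, the second sample item, and the helper names enclosing_item and items_with_code are assumptions made for the example. Note that a DOM loads the whole document into memory, so this only suits files that fit.

```python
from xml.dom import minidom

# Assumed sample: the <items> root and the second <item> are made up for illustration.
SAMPLE = """<items>
  <item>
    <property2><id>0</id><code>ThisIsTheStringINeedToMatch</code></property2>
  </item>
  <item>
    <property2><id>1</id><code>SomethingElse</code></property2>
  </item>
</items>"""


def enclosing_item(node):
    """Walk up the parentNode links until an <item> element is found."""
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "item":
            return node
        node = node.parentNode
    return None


def items_with_code(xml_text, wanted):
    """Return the serialized <item> elements whose <code> text is in `wanted`."""
    doc = minidom.parseString(xml_text)
    found = []
    for code in doc.getElementsByTagName("code"):
        # Join the plain text children of <code>.
        text = "".join(
            child.data for child in code.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        if text.strip() in wanted:
            item = enclosing_item(code)
            if item is not None:
                found.append(item.toxml())
    return found


matches = items_with_code(SAMPLE, {"ThisIsTheStringINeedToMatch"})
for xml_fragment in matches:
    print(xml_fragment)
```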

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.
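One possible shape for that buffering, sketched in Python with the standard xml.sax module. The ItemFilter class, the <items> root and the sample data are assumptions for illustration only. The handler re-serializes events while inside an <item> and keeps the buffer only if a matching <code> was seen. It assumes <item> elements are not nested, and it does not preserve comments or processing instructions.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Assumed sample: the <items> root and the second <item> are made up for illustration.
SAMPLE = """<items>
  <item>
    <property2><id>0</id><code>ThisIsTheStringINeedToMatch</code></property2>
  </item>
  <item>
    <property2><id>1</id><code>SomethingElse</code></property2>
  </item>
</items>"""


class ItemFilter(xml.sax.ContentHandler):
    """Buffer each <item> as text; keep it only if its <code> is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self.buffer = None     # serialized pieces while inside an <item>
        self.code_text = None  # text pieces while inside a <code>
        self.keep = False

    def startElement(self, name, attrs):
        if name == "item":
            self.buffer, self.keep = [], False
        if self.buffer is not None:
            attr_text = "".join(
                f" {key}={quoteattr(value)}" for key, value in attrs.items()
            )
            self.buffer.append(f"<{name}{attr_text}>")
            if name == "code":
                self.code_text = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(escape(content))
        if self.code_text is not None:
            self.code_text.append(content)

    def endElement(self, name):
        if self.buffer is None:
            return
        self.buffer.append(f"</{name}>")
        if name == "code":
            if "".join(self.code_text).strip() in self.wanted:
                self.keep = True
            self.code_text = None
        elif name == "item":
            if self.keep:
                self.matches.append("".join(self.buffer))
            self.buffer = None  # discard the buffer either way


handler = ItemFilter({"ThisIsTheStringINeedToMatch"})
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for xml_fragment in handler.matches:
    print(xml_fragment)
```

For a real file, xml.sax.parse(path, handler) would stream from disk instead of a string.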

mandibdc@gmail.com 06-30-2006 05:57 PM

Re: Search for string, then extract entire XML element where it appears. How?
 
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.


Joe Kesselman 06-30-2006 06:38 PM

Re: Search for string, then extract entire XML element where it appears. How?
 
mandibdc@gmail.com wrote:
> I was hoping to just write a text parsing script using perl, for
> example...


Can't help; I'm not a perl user, and I tend not to reinvent wheels
unless necessary.

Jürgen Kahrs 06-30-2006 06:58 PM

Re: Search for string, then extract entire XML element where it appears. How?
 
mandibdc@gmail.com wrote:

> I was hoping to just write a text parsing script using perl, for
> example...
>
> But I'm open to suggestions as to how most effectively to extract data
> from this large file.



I think Joe Kesselman summarized your set of
options really comprehensively. Look at the
data and decide which kind of output you need.
You mentioned that (in case of a match), you
need the whole element. Do you need the element
exactly, with all possible sub-elements to
arbitrary depth?

If the tree hierarchy is rather flat, then you
could use a SAX-like parser, as described by Joe.
SAX-like parsers are available for most languages,
even Perl, bash, and gawk (which I prefer).

Joe Kesselman 07-01-2006 03:50 AM

Re: Search for string, then extract entire XML element where it appears. How?
 
If it's a particularly huge file, I'd go with the buffered-SAX
semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
between SAX and DOM intended for this sort of chunk-at-a-time processing.)

Iterate through the document. For each item element, build an in-memory
subtree, check its <code>, output it if it's one you want, and then
discard it. This way you don't have to keep the whole source document in
memory at once. As a refinement, for even better efficiency, discard the
partly-built subtree (and ignore events until it ends) as soon as you see
that the <code> isn't one you're looking for.
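A minimal sketch of this chunk-at-a-time idea in Python, using xml.etree.ElementTree.iterparse from the standard library as a stand-in for StAX-style processing. The extract_items name, the <items> root and the sample data are assumptions for illustration. It builds each <item> subtree, checks property2/code, and clears the subtree afterwards. It does not implement the early-discard refinement, and the emptied <item> elements stay attached to the root, so a truly huge file may also need the root cleared periodically.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample: the <items> root and the second <item> are made up for illustration.
SAMPLE = """<items>
  <item>
    <property2><id>0</id><code>ThisIsTheStringINeedToMatch</code></property2>
  </item>
  <item>
    <property2><id>1</id><code>SomethingElse</code></property2>
  </item>
</items>"""


def extract_items(source, wanted):
    """Yield serialized <item> elements whose property2/code is in `wanted`.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        code = elem.findtext("property2/code", default="").strip()
        if code in wanted:
            # strip() drops any trailing whitespace carried in elem.tail
            yield ET.tostring(elem, encoding="unicode").strip()
        elem.clear()  # free the subtree before moving to the next item


results = list(
    extract_items(io.BytesIO(SAMPLE.encode("utf-8")), {"ThisIsTheStringINeedToMatch"})
)
for xml_fragment in results:
    print(xml_fragment)
```

With a real file, passing the path (for example extract_items("big.xml", wanted), where big.xml is a placeholder name) and writing each yielded fragment to the output file keeps memory use roughly proportional to one item.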

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Peter Flynn 07-03-2006 01:54 AM

Re: Search for string, then extract entire XML element where it appears. How?
 
mandibdc@gmail.com wrote:
> I was hoping to just write a text parsing script using perl, for
> example...


Don't. There are subtleties about the way in which XML is formed
which will conspire to bite you in the ass if you use a tool that
isn't XML-aware.
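To make a few of those subtleties concrete, here is a small Python sketch (an illustration added here, not from the original post). The three hypothetical documents all encode the same <code> value, "A&B", in different but equally valid ways: an entity reference, a CDATA section, and a character reference with extra whitespace inside the tags. A plain substring search misses all three, while an XML parser reports the same text for each.

```python
import xml.etree.ElementTree as ET

# Three hypothetical serializations of the same <code> value, "A&B".
variants = [
    "<item><code>A&amp;B</code></item>",
    "<item><code><![CDATA[A&B]]></code></item>",
    "<item><code >A&#38;B</code\n></item>",
]
target = "A&B"

# Naive text matching: look for the literal tag-plus-value string.
naive_hits = ["<code>" + target + "</code>" in doc for doc in variants]

# XML-aware matching: parse, then compare the decoded element text.
parsed_hits = [ET.fromstring(doc).findtext("code") == target for doc in variants]

print(naive_hits)
print(parsed_hits)
```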

Using Perl with one of the several XML APIs is fine, of course.

> But I'm open to suggestions as to how most effectively to extract data
> from this large file.


How large is large? XSLT runs pretty fast on a modern system, and what
you want to do isn't exactly rocket science (or if it is, I know any
number of unemployed rocket scientists who can do it for you :-)

This seems to do the job:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml"/>

<xsl:template match="items">
<items>
<xsl:apply-templates/>
</items>
</xsl:template>

<xsl:template match="item">
<xsl:if test="contains(property2/code,'Match')">
<xsl:copy-of select="."/>
</xsl:if>
</xsl:template>

</xsl:stylesheet>

///Peter
--
XML FAQ: http://xml.silmaril.ie/


All times are GMT.