Searching XML

 
 
Nash Kabbara
Guest
Posts: n/a
 
      10-26-2004
Hi all,

I just finished writing a log reader that reads XML logs (about 1 to 2 MB
large). At the command line you can specify the file name, the name of the
element, and its value, like so: logreader log.txt MyElement myvalue

In retrospect, I've noticed that it takes a long time to process. The time
is spent comparing the value of all tags named MyElement to myvalue.
Namely:

NodeList nodeList = m_document.getElementsByTagName(MyElement);
for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++) {
    // getTextNode merely returns the text value of the node
    if (getTextNode(nodeList.item(index)).trim().equals(myvalue)) {
        counter++;
        tempIndex[arrIndex++] = index;
    }
}

This takes around 20 seconds to complete processing. So my question is: is
there some way I can extract XML elements based on the element value?
For example, XPath allows you to choose elements based on attribute value, so
I'm wondering, is there a similar mechanism that allows you to grab
elements based on their value?


Thanks.
 
 
 
 
 
Jeff Kish
Guest
Posts: n/a
 
      10-26-2004
On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <> wrote:

>This takes around 20 seconds to complete processing. So my question is: is
>there some way I can extract XML elements based on the element value?
>For example, XPath allows you to choose elements based on attribute value, so
>I'm wondering, is there a similar mechanism that allows you to grab
>elements based on their value?
>
>
>Thanks.

Here is a query that selects data based on element values...

This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

for $b in document("books.xml")//book
where some $a in $b/author
satisfies ($a/last="Stevens" and $a/first="W.")
return $b/title

returns these results:

<title>TCP/IP Illustrated</title>,
<title>Advanced Programming in the UNIX Environment</title>


Using this data:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor><last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>
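
For anyone without an XQuery engine at hand, roughly the same filter can be written as a plain XPath 1.0 expression and run from Java with the standard javax.xml.xpath API. This is only a sketch: the class name BookTitles and the trimmed inline copy of the data are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BookTitles {
    // Trimmed, hypothetical inline copy of the sample data (two books only).
    static final String SAMPLE =
        "<bib>"
        + "<book year=\"1994\"><title>TCP/IP Illustrated</title>"
        + "<author><last>Stevens</last><first>W.</first></author></book>"
        + "<book year=\"2000\"><title>Data on the Web</title>"
        + "<author><last>Abiteboul</last><first>Serge</first></author>"
        + "<author><last>Buneman</last><first>Peter</first></author></book>"
        + "</bib>";

    // Books that have at least one author whose last AND first name match.
    static final String EXPR =
        "//book[author[last='Stevens' and first='W.']]/title";

    static List<String> stevensTitles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList titles = (NodeList) xpath.evaluate(
                EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < titles.getLength(); i++) {
                result.add(titles.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stevensTitles(SAMPLE));
    }
}
```

The nested predicate author[...] plays the role of the "some $a in $b/author satisfies" clause: both conditions must hold for the same author element.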

HTH

 
 
 
 
 
Andy Dingley
Guest
Posts: n/a
 
      10-26-2004
On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <>
wrote:

>This takes around 20 seconds to complete processing.


I'm not surprised ! getElementsByTagName is always slow, but it's
also inefficient here because it's having to look everywhere in the
structure to find elements to test their names. If you can improve
the search by looking for elements as children or grand-children,
rather than searching everywhere for them, then this can be a good
tweak.

XML is often incredibly powerful, but this excess power can lead to
inefficiencies if it's being used "by default" when you didn't really
need it.

> So my question is, is
>there some way where I can extract xml elements based on the element value.


Yes, XPath ! Just use "//MyElementName"

Or if MyElementName is supplied by the users, then use a [...]
predicate and the local-name() function to get the name of the
element, then compare it to the value of an element name supplied as a
parameter.

<xsl:param name="elmName">MyElementName</xsl:param>
...
//*[local-name() = string($elmName)]


XQuery (and various other incarnations) will do it too, and with
better performance. However it's sometimes hard to find XQuery
features in an environment, but most will have XSLT and XPath
available.
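
If the platform turns out to be Java, one way to try the parameterised local-name() predicate is through the standard javax.xml.transform API, passing the element name in as a stylesheet parameter. The sketch below is an assumption-laden illustration: the class name SelectByName, the inline stylesheet, and the choice to print one matching element's text per line are all made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SelectByName {
    // Hypothetical stylesheet: prints the text of every element whose
    // local name equals the elmName parameter, one per line.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:param name=\"elmName\">MyElementName</xsl:param>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"//*[local-name() = string($elmName)]\">"
        + "<xsl:value-of select=\".\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String select(String xml, String elementName) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            // Overrides the default value declared by xsl:param.
            transformer.setParameter("elmName", elementName);
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement>a</MyElement><Other>b</Other>"
            + "<MyElement>c</MyElement></log>";
        System.out.print(select(xml, "MyElement"));
    }
}
```

Because the name arrives as a parameter, the same compiled stylesheet can be reused for whatever element name the user types on the command line.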
 
 
Jeff Kish
Guest
Posts: n/a
 
      10-26-2004
On Tue, 26 Oct 2004 12:09:25 +0100, Andy Dingley <>
wrote:

>Yes, XPath ! Just use "//MyElementName"

I like Andy's answer better.
Jeff Kish
 
 
Nash Kabbara
Guest
Posts: n/a
 
      10-26-2004
Hi Andy,

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue). So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>. Sorry for
not being clear. It seems your XPath examples get elements based on their
name, but not their value.


Nash
 
Andy Dingley
Guest
Posts: n/a
 
      10-26-2004
On Tue, 26 Oct 2004 10:09:27 -0500, Nash Kabbara <>
wrote:

> Thanks for the response. Actually the lag is not in getElementsByTagName,
>but by the loop I have that compares the values of the tags with what the
>user is looking for (myvalue).


I don't recognise the coding platform - what is it ?

There's a lot you can do to improve that loop.
- Use an iterator, not an array index.
- Be suspicious of that .getLength() method, especially in a loop
bound. Is that a per-iteration overhead you've given yourself?
- Never trim() when you can rtrim().
- Never trim() when you can use a space-ignoring comparison instead.
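
Assuming the platform is Java with the W3C DOM, the first two points could be sketched as a single sibling walk that never indexes into a NodeList and never calls getLength(). The class name SiblingWalk and the sample document in main are hypothetical, and whether this is actually faster depends on the DOM implementation, so it is worth timing against the original loop.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SiblingWalk {
    // Counts descendant elements named elementName whose trimmed text equals value,
    // walking first-child / next-sibling links instead of an indexed NodeList.
    static int countMatches(Node parent, String elementName, String value) {
        int count = 0;
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and so on
            }
            if (child.getNodeName().equals(elementName)
                    && child.getTextContent().trim().equals(value)) {
                count++;
            }
            count += countMatches(child, elementName, value);
        }
        return count;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<log><entry><MyElement> myvalue </MyElement></entry>"
            + "<MyElement>other</MyElement><MyElement>myvalue</MyElement></log>");
        System.out.println(countMatches(doc, "MyElement", "myvalue"));
    }
}
```

If the elements of interest are known to sit at a fixed depth, dropping the recursive call and walking only that level narrows the search further.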

The trouble with much XML optimisation is that it becomes sensitive to
the data you feed it. Do you have a lot of matching elements to walk
through, or is finding the set of elements the main problem ?


> So I was wondering if there's a built in
>mechanism that pulls elements based on their Value. When I say "Value" I
>mean their content, not their name. i.e <Element>value</Element>.


Yes, XPath !

Use a similar predicate, "//*[string (.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.
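
From Java, a predicate like this can be evaluated with the standard javax.xml.xpath API, binding the variables through an XPathVariableResolver. What follows is a sketch under assumptions: the class name SelectByValue is made up, the name test from the earlier reply is combined with the content test, and normalize-space() stands in for the trim() in the original loop (it also collapses internal runs of whitespace, which trim() does not).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SelectByValue {
    // Name test plus content test; both variables are supplied from Java.
    static final String EXPR =
        "//*[local-name() = $elmName][normalize-space(string(.)) = $elmContents]";

    static int countByValue(String xml, String elementName, String value) {
        try {
            Map<String, Object> vars = new HashMap<>();
            vars.put("elmName", elementName);
            vars.put("elmContents", value);

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Resolve $elmName and $elmContents from the map above.
            xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));

            NodeList hits = (NodeList) xpath.evaluate(
                EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement> myvalue </MyElement>"
            + "<MyElement>other</MyElement><Other>myvalue</Other></log>";
        System.out.println(countByValue(xml, "MyElement", "myvalue"));
    }
}
```

Note that without the name test, "//*[string(.) = $elmContents]" also matches any ancestor whose entire text content happens to equal the value, since an element's string value is the concatenation of all its descendant text.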

--
Smert' spamionam
 
 
Tjerk Wolterink
Guest
Posts: n/a
 
      10-26-2004
I think you're coding in Java.

It is better to use SAX, the Simple API for XML.
You then don't have to load the entire DOM,
and you can do some optimizations.

SAX is a good choice if what you want to do is not too complex.
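
A minimal sketch of what that could look like for the log reader, assuming the target elements contain only text and are never nested inside one another. The class name SaxCounter and the sample input are hypothetical; for a real log file, SAXParser.parse also accepts a java.io.File.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCounter extends DefaultHandler {
    private final String elementName;
    private final String value;
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget;
    private int count;

    SaxCounter(String elementName, String value) {
        this.elementName = elementName;
        this.value = value;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (qName.equals(elementName)) {
            inTarget = true;
            text.setLength(0); // start collecting text for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (inTarget) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTarget && qName.equals(elementName)) {
            if (text.toString().trim().equals(value)) {
                count++;
            }
            inTarget = false;
        }
    }

    static int count(String xml, String elementName, String value) {
        try {
            SaxCounter handler = new SaxCounter(elementName, value);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement> myvalue </MyElement>"
            + "<MyElement>other</MyElement><MyElement>myvalue</MyElement></log>";
        System.out.println(count(xml, "MyElement", "myvalue"));
    }
}
```

The handler keeps only the text of the element currently being read, so memory use stays flat regardless of the size of the log.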

Greetz
Tjerk

 
Jeff Kish
Guest
Posts: n/a
 
      10-26-2004
<snip>
>Yes, XPath !
>
>Use a similar predicate, "//*[string (.) = $elmContents]"
>
>string() is optional (because in this context it's the default
>behaviour) but it's good practice to use it in situations like this,
>because it makes reading your code a lot clearer in the future.

<snip>
Lots of good info in this thread!
Yes, SAX if you don't need to load your entire document into memory.

Oh.. regarding xquery..

for $b in document("books.xml")//*[.="TCP/IP Illustrated"]
return
<temp>{string($b/.), name($b/.)}</temp>

{-- results in this output
<temp>TCP/IP Illustrated title</temp>
--}

Jeff Kish
 