Extracting data from xml file

 
 
Mag Gam
Guest
Posts: n/a
 
      03-03-2007
Hi All,
I am new to XML, and trying to extract some data from a file.

The file looks like this:
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
<TAPE>
<CATALOG>

I am trying to get
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99


What is the best method to do this? Is there a tool or utility you can
recommend for Windows?

 
 
 
 
 
Joe Kesselman
Guest
Posts: n/a
 
      03-03-2007
> What is the best method to do this?

Lots of tutorials exist on the web. My standard recommended starting
point: http://www.ibm.com/xml

(I'd probably hardcode it using DOM or SAX. But it might be easier for a
novice to write an XSLT stylesheet. There are other tools which might be
easier again, but they're less well standardized and I hesitate to
recommend that a novice invest in learning them.)


--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
 
 
 
 
 
roy axenov
Guest
Posts: n/a
 
      03-03-2007
On Mar 3, 7:57 pm, "Mag Gam" wrote:
> <CATALOG>
> <CD>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>10.90</PRICE>
> <YEAR>1985</YEAR>
> </CD>
> <TAPE>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>6.99</PRICE>
> <YEAR>1985</YEAR>
> <TAPE>
> <CATALOG>


This is not well-formed and therefore not XML. If that's
your real data, XML tools are quite unlikely to help you.
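
To see what "not well-formed" means in practice, here is a minimal sketch in Python (my choice for illustration; nobody in this thread proposed it) using the standard library's xml.etree.ElementTree. The sample is abbreviated, and the helper name is_well_formed is made up for this example. A conforming parser rejects the data as posted and accepts it once the last two tags are proper closing tags.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the sample as posted: the last two tags
# open new elements instead of closing the existing ones.
BROKEN = """<CATALOG>
<CD><ARTIST>Bob Dylan</ARTIST><PRICE>10.90</PRICE></CD>
<TAPE><ARTIST>Bob Dylan</ARTIST><PRICE>6.99</PRICE>
<TAPE>
<CATALOG>"""

# The same data with the closing tags written correctly.
FIXED = """<CATALOG>
<CD><ARTIST>Bob Dylan</ARTIST><PRICE>10.90</PRICE></CD>
<TAPE><ARTIST>Bob Dylan</ARTIST><PRICE>6.99</PRICE>
</TAPE>
</CATALOG>"""


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(BROKEN))  # False
print(is_well_formed(FIXED))   # True
```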

Assuming it's just another case of 'oh, for some reason I
just typed that in instead of using copy-paste'...

> I am trying to get
> Artist: Bob Dylan
> Company: Columbia
> CD Price: 10.90
> Tape Price: 6.99


Another day, another grouping problem...

> What is the best method to do this? Is there a tool or
> utility you can recommend for Windows?


Define 'best'. Define 'utility'. I don't believe there's a
DWIM-type tool that would automagically, well, do what you
mean at a click of a button. Therefore, it's a programming
problem. You could use a DOM or SAX parser in your language
of choice, as Joseph proposed. Or you could use XSLT. Or
maybe XQuery or xmlgawk. In case it's XSLT/XQuery, I
believe there are many GUI tools that might make working
with the code easier for you; I'm not sure if there are any
good open source ones, though. If you'd be happy with
Unix-style small tools, there's a number of open source
XSLT processors, including Saxon (it's written in Java, so
it shouldn't be a problem running it on a Windows box),
xsltproc and xalan (if there are no native ports, Cygwin or
MinGW will probably save the day). In short, you should
determine what you want then google for it. Come back with
specific questions.

Here's a transformation that does more or less what you
want with your sample data (after it's been fixed, of
course):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Items with the same title, artist and company form one group. -->
  <xsl:key name="id" match="CD|TAPE"
      use="concat(TITLE,ARTIST,COMPANY)"/>
  <!-- Indexes under 'true' the first item of each group. -->
  <xsl:key name="first" match="CD|TAPE"
      use="generate-id()=
           generate-id(key('id',concat(TITLE,ARTIST,COMPANY))[1])"/>
  <xsl:output method="text"/>
  <xsl:template match="@*|node()"/>
  <xsl:template match="/">
    <xsl:apply-templates select="key('first',true())"/>
  </xsl:template>
  <xsl:template match="CD|TAPE">
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
    <xsl:apply-templates
        select="key('id',concat(TITLE,ARTIST,COMPANY))"
        mode="prices"/>
  </xsl:template>
  <xsl:template match="TITLE">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="ARTIST">
    <xsl:text>Artist: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="COMPANY">
    <xsl:text>Company: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="@*|node()" mode="prices"/>
  <xsl:template match="CD|TAPE" mode="prices">
    <xsl:apply-templates mode="prices"/>
  </xsl:template>
  <xsl:template match="CD/PRICE" mode="prices">
    <xsl:text>CD Price: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="TAPE/PRICE" mode="prices">
    <xsl:text>Tape Price: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
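
For comparison, the same grouping idea can be sketched outside XSLT. This is only an illustration in Python with the standard library's xml.etree.ElementTree, not a translation anyone in the thread posted; the names LABELS, group_releases and report are made up for the example. It assumes the sample data with the closing tags fixed, and groups CD and TAPE entries on the same title, artist and company before listing one price per format.

```python
import xml.etree.ElementTree as ET

# The sample catalog with the last two tags corrected.
CATALOG = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""

# Output label for each element name that carries a price.
LABELS = {"CD": "CD Price", "TAPE": "Tape Price"}


def group_releases(xml_text):
    """Group CD/TAPE elements by (title, artist, company) in document order."""
    groups = {}  # plain dicts keep insertion order in Python 3.7+
    for item in ET.fromstring(xml_text):
        if item.tag not in LABELS:
            continue
        key = (item.findtext("TITLE"),
               item.findtext("ARTIST"),
               item.findtext("COMPANY"))
        groups.setdefault(key, []).append(
            (LABELS[item.tag], item.findtext("PRICE")))
    return groups


def report(xml_text):
    """Format each group as Title/Artist/Company lines followed by its prices."""
    lines = []
    for (title, artist, company), prices in group_releases(xml_text).items():
        lines += [f"Title: {title}", f"Artist: {artist}", f"Company: {company}"]
        lines += [f"{label}: {price}" for label, price in prices]
    return "\n".join(lines)


print(report(CATALOG))
```

The dictionary plays the role of the 'id' key in the stylesheet: every entry with the same title, artist and company ends up in the same bucket.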

--
roy axenov

 
 
Jürgen Kahrs
Guest
Posts: n/a
 
      03-03-2007
Mag Gam wrote:
> Hi All,
> I am new to XML, and trying to extract some data from a file.
>
> The file looks like this:
> <CATALOG>
> <CD>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>10.90</PRICE>
> <YEAR>1985</YEAR>
> </CD>
> <TAPE>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>6.99</PRICE>
> <YEAR>1985</YEAR>
> <TAPE>
> <CATALOG>


The last two lines are not correct (closing tags should begin with /).

> I am trying to get
> Artist: Bob Dylan
> Company: Columbia
> CD Price: 10.90
> Tape Price: 6.99
>
>
> What is the best method to do this? Is there a tool or utility you can
> recommend for Windows?


One of the many tools that can solve the problem is XMLgawk:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/


The following script solves your problem.

@load xml
XMLCHARDATA { data = $0 }
XMLENDELEM == "ARTIST" && index(XMLPATH, "CD") { print "Artist:", data}
XMLENDELEM == "COMPANY" && index(XMLPATH, "CD") { print "Company:", data}
XMLENDELEM == "PRICE" && index(XMLPATH, "CD") { print "CD Price:", data}
XMLENDELEM == "PRICE" && index(XMLPATH, "TAPE") { print "Tape Price:", data}

Invoke the script like this and it will produce the
following output:

xgawk -f catalog.awk catalog.xml
Artist: Bob Dylan
Company: Columbia
CD Price: 10.90
Tape Price: 6.99
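
The script above is event-driven: it remembers the latest character data and acts when an element ends. The same pattern can be sketched with a SAX handler, one of the approaches mentioned earlier in the thread. This is only an illustration in Python's standard xml.sax module; the names RULES, CatalogHandler and extract are made up for the example. It assumes the corrected sample, and it tests the direct parent element rather than doing a substring search on the path as index(XMLPATH, "CD") does.

```python
import xml.sax

# The sample catalog with the last two tags corrected.
CATALOG = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""

# (parent element, element) -> label to print when the element ends.
RULES = {
    ("CD", "ARTIST"): "Artist",
    ("CD", "COMPANY"): "Company",
    ("CD", "PRICE"): "CD Price",
    ("TAPE", "PRICE"): "Tape Price",
}


class CatalogHandler(xml.sax.ContentHandler):
    """Collects 'Label: value' lines for the elements listed in RULES."""

    def __init__(self):
        super().__init__()
        self.path = []   # names of the currently open elements
        self.text = []   # character data seen since the last tag
        self.lines = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        self.text.append(content)

    def endElement(self, name):
        parent = self.path[-2] if len(self.path) > 1 else None
        label = RULES.get((parent, name))
        if label is not None:
            self.lines.append(f"{label}: {''.join(self.text).strip()}")
        self.path.pop()
        self.text = []


def extract(xml_text):
    """Run the handler over xml_text and return the collected lines."""
    handler = CatalogHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


print("\n".join(extract(CATALOG)))
```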



 
 
Mag Gam
Guest
Posts: n/a
 
      03-04-2007
On Mar 3, 2:51 pm, Jürgen Kahrs wrote:
> Mag Gam wrote:
> > Hi All,
> > I am new to XML, and trying to extract some data from a file.

>
> > The file looks like this:
> > <CATALOG>
> > <CD>
> > <TITLE>Empire Burlesque</TITLE>
> > <ARTIST>Bob Dylan</ARTIST>
> > <COUNTRY>USA</COUNTRY>
> > <COMPANY>Columbia</COMPANY>
> > <PRICE>10.90</PRICE>
> > <YEAR>1985</YEAR>
> > </CD>
> > <TAPE>
> > <TITLE>Empire Burlesque</TITLE>
> > <ARTIST>Bob Dylan</ARTIST>
> > <COUNTRY>USA</COUNTRY>
> > <COMPANY>Columbia</COMPANY>
> > <PRICE>6.99</PRICE>
> > <YEAR>1985</YEAR>
> > <TAPE>
> > <CATALOG>

>
> The last two lines are not correct (closing tags should begin with /).
>
> > I am trying to get
> > Artist: Bob Dylan
> > Company: Columbia
> > CD Price: 10.90
> > Tape Price: 6.99

>
> > What is the best method to do this? Is there a tool or utility you can
> > recommend for Windows?

>
> One of the many tools that can solve the problem is XMLgawk:
>
> http://home.vrweb.de/~juergen.kahrs/gawk/XML/
>
> The following script solves your problem.
>
> @load xml
> XMLCHARDATA { data = $0 }
> XMLENDELEM == "ARTIST" && index(XMLPATH, "CD") { print "Artist:", data}
> XMLENDELEM == "COMPANY" && index(XMLPATH, "CD") { print "Company:", data}
> XMLENDELEM == "PRICE" && index(XMLPATH, "CD") { print "CD Price:", data}
> XMLENDELEM == "PRICE" && index(XMLPATH, "TAPE") { print "Tape Price:", data}
>
> Invoke the script like this and it will produce the
> following output:
>
> xgawk -f catalog.awk catalog.xml
> Artist: Bob Dylan
> Company: Columbia
> CD Price: 10.90
> Tape Price: 6.99



Thanks everyone!
I am very new to XML and still learning the ropes.

Roy:
I have yet to try your XSL solution, but I will. I know the XML I posted
was not well-formed; I only used it as an example.
Let's assume this is my new .xml file: http://msdn2.microsoft.com/en-us/library/ms762271.aspx
(with some slight modifications, such as a second author on the first book)

<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<author>II Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>

How would I get 'Book Title' and 'Book Author' for each book?

TIA

 
 
git
Guest
Posts: n/a
 
      03-04-2007
On Sat, 03 Mar 2007 09:57:38 -0800, Mag Gam wrote:

> Hi All,
> I am new to XML, and trying to extract some data from a file.
>
> The file looks like this:
> <CATALOG>
> <CD>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>10.90</PRICE>
> <YEAR>1985</YEAR>
> </CD>
> <TAPE>
> <TITLE>Empire Burlesque</TITLE>
> <ARTIST>Bob Dylan</ARTIST>
> <COUNTRY>USA</COUNTRY>
> <COMPANY>Columbia</COMPANY>
> <PRICE>6.99</PRICE>
> <YEAR>1985</YEAR>
> <TAPE>
> <CATALOG>
>
> I am trying to get
> Artist: Bob Dylan
> Company: Columbia
> CD Price: 10.90
> Tape Price: 6.99
>
>
> What is the best method to do this? Is there a tool or utility you can
> recommend for Windows?


On Windows, for someone who just wants to get on with the job rather than
learn XSLT or XPath, I would recommend coding it all in JScript (or
VBScript). Use the MSXML parser that comes with Windows and walk over
the DOM to find the data you want.
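
To give a rough idea of what walking the DOM looks like, here is a sketch that uses Python's standard xml.dom.minidom instead of JScript and MSXML, purely so that it is easy to run anywhere. It is not the code from the articles linked below; the sample is abbreviated and corrected, and the helper names child_text and walk are made up for the example.

```python
from xml.dom import minidom

# Abbreviated sample catalog with the closing tags corrected.
CATALOG = """<CATALOG>
<CD>
<ARTIST>Bob Dylan</ARTIST>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
</CD>
<TAPE>
<ARTIST>Bob Dylan</ARTIST>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
</TAPE>
</CATALOG>"""


def child_text(parent, tag):
    """Return the text of the first direct child element named tag, or None."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            return "".join(t.data for t in node.childNodes
                           if t.nodeType == t.TEXT_NODE)
    return None


def walk(xml_text):
    """Visit each child element of the root and pull out artist and price."""
    doc = minidom.parseString(xml_text)
    rows = []
    for item in doc.documentElement.childNodes:
        if item.nodeType != item.ELEMENT_NODE:
            continue  # skip the whitespace text nodes between elements
        rows.append((item.tagName,
                     child_text(item, "ARTIST"),
                     child_text(item, "PRICE")))
    return rows


for kind, artist, price in walk(CATALOG):
    print(kind, artist, price)
```

The check on nodeType matters: in a DOM tree the line breaks between elements show up as text nodes, which trips up most first attempts.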

I am working on examples of this technique on my blog/site:

http://nerds-central.blogspot.com/20...pt-exsead.html

http://nerds-central.blogspot.com/20...atom-feed.html
(I promise that I will write the follow-up to that second article real
soon! And I am working on VBScript examples as well.)

Feel free to join the Nerds-Central email group to ask more questions if
you like the method:
http://tech.groups.yahoo.com/group/nerds-central/

Cheers

AJ


--
Cubical Land:
www.cubicalland.com
Nerds-Central:
nerds-central.blogspot.com

 
 
Jürgen Kahrs
Guest
Posts: n/a
 
      03-04-2007
Mag Gam wrote:

> How would I get 'Book Title' and 'Book Author' ?


Use this XMLgawk script:

@load xml
XMLCHARDATA { data = $0 }
XMLENDELEM == "author" { author = data }
XMLENDELEM == "title" { title = data }
XMLENDELEM == "book" { print author, title}


And you will get the following output from the XML
data that you posted:

xgawk -f catalog2.awk catalog2.xml

II Gambardella, Matthew XML Developer's Guide
Ralls, Kim Midnight Rain
Corets, Eva Maeve Ascendant
Corets, Eva Oberon's Legacy
Corets, Eva The Sundered Grail
Randall, Cynthia Lover Birds
Thurman, Paula Splish Splash
Knorr, Stefan Creepy Crawlies
Kress, Peter Paradox Lost
O'Brien, Tim Microsoft .NET: The Programming Bible
O'Brien, Tim MSXML3: A Comprehensive Guide
Galos, Mike Visual Studio 7: A Comprehensive Guide
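
Note that the first line of the output shows only the second author of bk101: each author element overwrites the previous value, so only the last one is left when the book element ends. If every author is wanted, one option is to collect them per book. A minimal sketch follows, in Python with the standard xml.etree.ElementTree rather than XMLgawk, on an abbreviated copy of the posted data; the function name titles_and_authors is made up for the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the posted catalog: first two books, fewer fields.
BOOKS = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<author>II Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
</book>
</catalog>"""


def titles_and_authors(xml_text):
    """Return a (title, [authors]) pair for every book element."""
    result = []
    for book in ET.fromstring(xml_text).iter("book"):
        authors = [a.text for a in book.findall("author")]
        result.append((book.findtext("title"), authors))
    return result


for title, authors in titles_and_authors(BOOKS):
    print(f"{title}: {'; '.join(authors)}")
```

With the abbreviated data this prints both authors for the first book and one for the second.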
 