Velocity Reviews - Computer Hardware Reviews

xml.parsers.expat loading xml into a dict and whitespace

 
 
kaens
 
      05-23-2007
Hey everyone, this may be a stupid question, but I noticed the
following and as I'm pretty new to using xml and python, I was
wondering if I could get an explanation.

Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node)

so we have
<parent>
<option1>foo</option1>
<option2>bar</option2>
. . .
</parent>

(I'm using xml.parsers.expat)
the parser sets a flag that says it's in the parent, and sets the
value of the current tag it's processing in the start tag handler.
The character data handler sets a dictionary value like so:

dictName[curTag] = data

after I'm done processing the file, I print out the dict, and the first value is
<a few bits of whitespace> : <a whole bunch of whitespace>

There are comments in the xml file - is this what is causing this?
There are also blank lines. . .but I don't see how a blank line would
be interpreted as a tag. Comments though, I could see that happening.

Actually, I just did a test on an xml file that had no comments or
whitespace and got the same behaviour.

If I feed it the following xml file:

<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

wtf.

For reference, here's the handler functions:

def handleCharacterData(self, data):
    if self.inOptions and self.curTag != "options":
        self.options[self.curTag] = data

def handleStartElement(self, name, attributes):
    if name == "options":
        self.inOptions = True
    if self.inOptions:
        self.curTag = name

def handleEndElement(self, name):
    if name == "options":
        self.inOptions = False
    self.curTag = ""

Sorry if the whitespace in the code got mangled (fingers crossed...)
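
A likely explanation, offered as a reading of the output rather than a confirmed diagnosis: expat calls the character data handler for every run of text, including the newlines between tags, and it may deliver one element's text in several pieces. If curTag is reset to "" on every end tag, as the printed result suggests, each of those newlines is stored under the key "". The sketch below (Python 3 syntax; the class and method names are made up for illustration) buffers text only while a child element is open and stores it when that element closes:

```python
import xml.parsers.expat


class OptionsParser:
    """Collects the text of each child of <options> into a dict."""

    def __init__(self):
        self.options = {}
        self.in_options = False
        self.cur_tag = None   # child element currently open, if any
        self.chunks = []      # pieces of text seen for that element

    def start(self, name, attributes):
        if name == "options":
            self.in_options = True
        elif self.in_options:
            self.cur_tag = name
            self.chunks = []

    def char_data(self, data):
        # Called for whitespace between tags too; ignore it unless
        # a child element is open.
        if self.cur_tag is not None:
            self.chunks.append(data)

    def end(self, name):
        if name == "options":
            self.in_options = False
        elif name == self.cur_tag:
            # Join the pieces only once the element is complete.
            self.options[name] = "".join(self.chunks)
            self.cur_tag = None

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.EndElementHandler = self.end
        parser.CharacterDataHandler = self.char_data
        parser.Parse(text, True)
        return self.options


SAMPLE = """<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""

options = OptionsParser().parse(SAMPLE)
print(options)
```

With this arrangement the text between elements never reaches the dict, because no child tag is open when it arrives.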
 
 
Steven Bethard
 
      05-23-2007
kaens wrote:
> Let's say I write a simple xml parser, for an xml file that just loads
> the content of each tag into a dict (the xml file doesn't have
> multiple hierarchies in it, it's flat other than the parent node)

[snip]
> <options>
> <one>hey</one>
> <two>bee</two>
> <three>eff</three>
> </options>
>
> it prints out:
> " :
>
> three : eff
> two : bee
> one : hey"


I don't have a good answer for your expat code, but if you're not
married to that, I strongly suggest you look into ElementTree[1]::

>>> xml = '''\
... <options>
... <one>hey</one>
... <two>bee</two>
... <three>eff</three>
... </options>
... '''
>>> import xml.etree.cElementTree as etree
>>> tree = etree.fromstring(xml)
>>> d = {}
>>> for child in tree:
...     d[child.tag] = child.text
...
>>> d
{'three': 'eff', 'two': 'bee', 'one': 'hey'}


[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions
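
As a side note on why the whitespace does not end up in the dict here: ElementTree keeps the text between elements as well, but on separate attributes (`text` on the parent for whatever precedes its first child, `tail` on each child for whatever follows its end tag), so `child.text` holds only the element's own content. A minimal sketch in Python 3 syntax, where plain `xml.etree.ElementTree` is the module to import; the variable names are only illustrative:

```python
import xml.etree.ElementTree as etree

doc = """<options>
<one>hey</one>
<two>bee</two>
</options>"""

root = etree.fromstring(doc)

# The newline after <options> is kept as the root's own text.
print(repr(root.text))

for child in root:
    # child.text is the element's content; child.tail is whatever
    # follows its end tag, here the newline before the next tag.
    print(child.tag, repr(child.text), repr(child.tail))

options = {child.tag: child.text for child in root}
print(options)
```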

STeVe
 
 
kaens
 
      05-23-2007
> [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> an earlier python, just Google for it -- there are standalone versions


I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
 
 
kaens
 
      05-23-2007
Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.
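
For anyone trying this on a recent Python 3: `getiterator()` is no longer available there, and because the file is flat, iterating over the root gives the same children without needing the root check. A sketch under those assumptions, using an in-memory stand-in for options.xml (its contents are guessed from the sample earlier in the thread):

```python
import io
import xml.etree.ElementTree as etree

# Stand-in for options.xml; the real file's contents are an assumption.
options_file = io.StringIO(
    "<options>\n"
    "<one>hey</one>\n"
    "<two>bee</two>\n"
    "<three>eff</three>\n"
    "</options>\n"
)

tree = etree.parse(options_file)
root = tree.getroot()

# Iterating over an element yields its direct children only,
# so the root itself never ends up in the dict.
options = {child.tag: child.text for child in root}

for key, value in options.items():
    print(key, ":", value)
```

This only covers the flat layout described above; nested elements would need `root.iter()` or a recursive walk instead.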

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.

On 5/23/07, kaens <> wrote:
> > [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> > an earlier python, just Google for it -- there are standalone versions

>
> I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
>

 
 
Steven Bethard
 
      05-23-2007
kaens wrote:
> Now the code looks like this:
>

[snip ElementTree code]
>
> freaking easy. Compare with making a generic xml parser class, and
> inheriting from it for doing different things with different xml
> files. This does exactly the right thing. I'm sure it's not perfect
> for all cases, and I'm sure there will be times when I want something
> closer to expat, but this is PERFECT for what I need to do right now.
>
> That settles it, I'm addicted to python now. I swear I had a little
> bit of a nerdgasm. This is orders of magnitude smaller than what I had
> before, way easier to read and way easier to maintain.
>
> Thanks again for the point in the right direction, Steve.


You're welcome. In return, you've helped me to augment my vocabulary
with an important new word "nerdgasm".

STeVe
 
 
Stefan Behnel
 
      05-23-2007
kaens wrote:
> Now the code looks like this:
>
> import xml.etree.ElementTree as etree
>
> optionsXML = etree.parse("options.xml")
> options = {}
>
> for child in optionsXML.getiterator():
>     if child.tag != optionsXML.getroot().tag:
>         options[child.tag] = child.text
>
> for key, value in options.items():
>     print key, ":", value


Three things to add:

Importing cElementTree instead of ElementTree should speed this up pretty
heavily, but:

Consider using iterparse():

http://effbot.org/zone/element-iterparse.htm

*untested*:

from xml.etree import cElementTree as etree

iterevents = etree.iterparse("options.xml")
options = {}

for event, child in iterevents:
    if child.tag != "parent":
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value
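
The root tag is hard-coded in the loop above. One way around that, sketched here in Python 3 syntax and not taken from the effbot page, is to ask iterparse() for "start" events as well and remember the first element seen, which is the root. The in-memory file and its contents are stand-ins for options.xml:

```python
import io
import xml.etree.ElementTree as etree

# Stand-in for options.xml; the real file's contents are an assumption.
options_file = io.BytesIO(
    b"<options>\n<one>hey</one>\n<two>bee</two>\n</options>\n"
)

options = {}
root = None

for event, elem in etree.iterparse(options_file, events=("start", "end")):
    if event == "start":
        if root is None:
            root = elem  # the first "start" event belongs to the root
        continue
    # On "end" the element's text is complete and safe to read.
    if elem is not root:
        options[elem.tag] = elem.text

for key, value in options.items():
    print(key, ":", value)
```

Reading `elem.text` on the "end" event matters because the text is not guaranteed to be filled in yet when the "start" event fires.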


Note that this also works with lxml.etree. But using lxml.objectify is maybe
actually what you want:

http://codespeak.net/lxml/dev/objectify.html

*untested*:

from lxml import etree, objectify

# setup
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
# spelled setElementClassLookup() in older lxml releases
parser.set_element_class_lookup(lookup)

# parse; getroot() gives the objectified root element rather than the tree
parent = etree.parse("options.xml", parser).getroot()

# get to work
option1 = parent.option1
...

# or, if you prefer dictionaries:
options = vars(parent)
for key, value in options.items():
    print key, ":", value


Have fun,

Stefan
 