XML SAX parser bug?

 
 
mitsura@skynet.be
 
      01-19-2006
Hi,

I think I ran into a bug in the XML SAX parser.

Part of my program consists of reading a rather large XML file (about
10 MB) containing a few thousand elements.
I have the following problem: sometimes the SAX parser misreads a
line.
Let me explain: the XML file contains a few thousand lines like this:
"
<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
"
where 'n91c90a.cmc.com' is the name of a system and thus changes per
system.
In a few cases, the SAX parser misreads the line. The parser sometimes
splits the characters of the line into:
"WINOSSPI:Storage@@n" and "91c90a.cmc.com".
I put a 'print characters' line in the 'characters' method of the
parser; that is how I found out.
It only happens for a few of the thousand lines, but you can imagine
that it is very annoying.

I checked for errors in the XML file but the file seems ok.

Is this a bug or am I doing something wrong?
I am new to Python.

I am using Python 2.4.1, pyWin32 extension 2.4 and PyXML 0.8.4

Any help very much appreciated.

Kris

 
Fredrik Lundh
 
      01-19-2006
(E-Mail Removed) wrote:

> I think I ran into a bug in the XML SAX parser.
>
> Part of my program consists of reading a rather large XML file (about
> 10 MB) containing a few thousand elements.
> I have the following problem: sometimes the SAX parser misreads a
> line.
> Let me explain: the XML file contains a few thousand lines like this:
> "
> <TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
> "
> where 'n91c90a.cmc.com' is the name of a system and thus changes per
> system.
> In a few cases, the SAX parser misreads the line. The parser sometimes
> splits the characters of the line into:
> "WINOSSPI:Storage@@n" and "91c90a.cmc.com".
> I put a 'print characters' line in the 'characters' method of the
> parser; that is how I found out.
> It only happens for a few of the thousand lines, but you can imagine
> that it is very annoying.
>
> I checked for errors in the XML file but the file seems ok.
>
> Is this a bug or am I doing something wrong?


it's not a bug; the parser is free to split up character runs (due to buffering,
entities or character references, etc). it's up to you to merge character runs
into strings.
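
To see the splitting for yourself, here is a minimal sketch (written for Python 3, not the Python 2.4 of the original post) that feeds the parser the sample line in two pieces, roughly what happens when a read buffer ends in the middle of a text run. The names ChunkRecorder and record_chunks are made up for this illustration, and how many chunks you actually get depends on the parser and its buffering; only the joined text is guaranteed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Records every characters() call so the splits become visible."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)


def record_chunks(pieces):
    """Feed the document piece by piece, mimicking buffered reads."""
    handler = ChunkRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return handler.chunks


chunks = record_chunks([
    "<TargetRef>WINOSSPI:Storage@@n",
    "91c90a.cmc.com</TargetRef>",
])
print(chunks)           # the number of pieces depends on the parser
print("".join(chunks))  # joined, the text is complete again
```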

</F>



 
mitsura@skynet.be
 
      01-19-2006

Fredrik Lundh wrote:

> it's not a bug; the parser is free to split up character runs (due to buffering,
> entities or character references, etc). it's up to you to merge character runs
> into strings.
>
> </F>

Thanks for the feedback,

but how do I detect that the parser has split up the characters? I guess
I need to detect it in order to reconstruct the complete string.

 
Martin v. Löwis
 
      01-19-2006
(E-Mail Removed) wrote:
> but how do I detect that the parser has split up the characters? I guess
> I need to detect it in order to reconstruct the complete string.


Don't try to detect it. Instead, assume it always happens, and collect
the strings in characters(), rather than processing them. Do something
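like this:

    from xml.sax.handler import ContentHandler

    class Handler(ContentHandler):
        def startElement(self, name, attrs):
            self.chardata = ""           # start a fresh buffer for each element

        def characters(self, data):
            self.chardata += data        # may be called several times per text run

        def endElement(self, name):
            process(self.chardata)       # process() stands for your own code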

This is simplified - you might have to deal with nested elements,
somehow.
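
One possible way to deal with nested elements is to keep a stack with one buffer per open element, so that the text of a child does not end up in its parent. This is only a sketch for Python 3; the class name TextCollector, the wrapping Targets element and the second sample value are assumptions made up for the example, not part of the original file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collects the complete text of each element, nested or not."""

    def __init__(self):
        super().__init__()
        self.buffers = []   # one list of text pieces per open element
        self.results = []   # (element name, complete text) pairs

    def startElement(self, name, attrs):
        self.buffers.append([])

    def characters(self, data):
        if self.buffers:
            self.buffers[-1].append(data)   # never process here, only collect

    def endElement(self, name):
        text = "".join(self.buffers.pop())  # the text is complete only now
        self.results.append((name, text))


# Hypothetical sample document; the second value is invented.
doc = (b"<Targets>"
       b"<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>"
       b"<TargetRef>a &amp; b</TargetRef>"
       b"</Targets>")

handler = TextCollector()
xml.sax.parseString(doc, handler)
target_refs = [text for name, text in handler.results if name == "TargetRef"]
print(target_refs)
```

Each element only sees its own direct text here; if you need the text of the children included in the parent, append the popped text to the parent's buffer in endElement.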

Regards,
Martin
 
uche.ogbuji@gmail.com
 
      02-07-2006
(E-Mail Removed) wrote:
> Fredrik Lundh wrote:
> > (E-Mail Removed) wrote:
> > > I think I ran into a bug in the XML SAX parser.
> > >
> > > Part of my program consists of reading a rather large XML file (about
> > > 10 MB) containing a few thousand elements.
> > > I have the following problem: sometimes the SAX parser misreads a
> > > line.

> >
> > it's not a bug; the parser is free to split up character runs (due to buffering,
> > entities or character references, etc). it's up to you to merge character runs
> > into strings.

>
> but how do I detect that the parser has split up the characters? I guess
> I need to detect it in order to reconstruct the complete string.


Here's a recipe:

http://aspn.activestate.com/ASPN/Coo.../Recipe/265881

Using this filter you can then write SAX code that assumes normalized
text events. Also, 4Suite's SAX implementation, Saxlette,
automatically does this text event merging for you at C speed:

http://4suite.org/docs/CoreManual.xml#saxlette
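
For readers who cannot reach the recipe, the general idea of such a filter can be sketched with the standard library's XMLFilterBase. To be clear, this is not the recipe's code and not Saxlette; it is a simplified Python 3 illustration, and the names TextMergingFilter and Recorder are invented here. It only flushes on element boundaries and processing instructions, and it does not handle namespace-mode events. The `&#46;` in the sample is a character reference for "." added on purpose, since references are one of the things that can make a parser split a text run.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class TextMergingFilter(XMLFilterBase):
    """Buffers characters() events and forwards them as one merged event."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._pending = []

    def _flush(self):
        if self._pending:
            super().characters("".join(self._pending))
            self._pending = []

    def characters(self, content):
        self._pending.append(content)   # hold back until the run is over

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)

    def processingInstruction(self, target, data):
        self._flush()
        super().processingInstruction(target, data)


class Recorder(ContentHandler):
    """Downstream handler that can now assume merged text events."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def characters(self, content):
        self.texts.append(content)


recorder = Recorder()
merged = TextMergingFilter(xml.sax.make_parser())
merged.setContentHandler(recorder)
merged.parse(io.BytesIO(
    b"<TargetRef>WINOSSPI:Storage@@n91c90a&#46;cmc.com</TargetRef>"))
print(recorder.texts)
```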

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

 