Velocity Reviews - Computer Hardware Reviews

SAX parsing goes 'all funny' on value [en]

 
 
Fred
      12-13-2003
Hi,

I am parsing a small XML document and the parsing goes 'all funny'
when parsing this element: <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>

I've created a subclass of org.xml.sax.helpers.DefaultHandler, and an
instance of this subclass is set on my
org.apache.xerces.parsers.SAXParser:

SAXParser parser = new SAXParser();
parser.setContentHandler(pdh);
parser.setErrorHandler(pdh);

I've found that the

public void characters(char[] ch, int offset, int length) throws
SAXException

method is called once per element parsed. My debug output confirms
this. e.g. when parsing <useragent>MobileExplorer/3.00 (Mozilla/1.22;
compatible; MMEF300; Microsoft; Windows; GenericLarge)</useragent> it
reads:

D: reading characters...(useragent) length=89, offset=721,
found='MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300;
Microsoft; Windows; GenericLarge)'
D: ending element (useragent) current element value is :
[MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300; Microsoft;
Windows; GenericLarge)]


But... when parsing <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>
the debug output reads

D: reading characters...(useragent) length=16, offset=1097,
found='Mozilla/4.61 [en'
D: reading characters...(useragent) length=1, offset=0, found=']'
D: reading characters...(useragent) length=11, offset=1114, found='
(WinNT; I)'
D: ending (useragent) current element value is : [ (WinNT; I)]

It calls the characters method three times?!
Does the [en] bit in the element value have anything to do with this?
Would like to understand what and why.

(As a 'temp fix' I thought to have the DefaultHandler's characters(...)
method concatenate the characters read until endElement(...) is
invoked, but that seems to break everything.)

Thanks for your input.
Fred.
 
 
 
 
 
Julian Reschke
      12-13-2003
Fred wrote:

> (As a 'temp fix' I thought to have the DefaultHandlers characters(...)
> method concatenate characters read, till the endElement(...) is
> invoked; but that seems to break everything.)


I think that's how SAX is supposed to work. There's no guarantee that
the text of an element arrives in a single characters(...) event.
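
For what it's worth, here is a minimal sketch of the concatenating approach. The class name AccumulatingHandler and the parse(...) helper are made up for illustration, and the sketch goes through the JDK's javax.xml.parsers.SAXParserFactory rather than instantiating the Xerces SAXParser directly, so it assumes nothing beyond the standard library. It also assumes the elements of interest are leaf elements containing only text, since the buffer is reset on every startElement. The debug output above suggests the original handler kept only the last chunk it saw, which would explain the value ending up as ' (WinNT; I)'.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of each <useragent> element,
// however many characters(...) calls the parser decides to make.
public class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> userAgents = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start a fresh buffer for each element (assumes text-only leaf elements).
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append this chunk; do not overwrite what earlier calls delivered.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only at endElement is the element's text known to be complete.
        if ("useragent".equals(qName)) {
            userAgents.add(text.toString());
        }
        text.setLength(0);
    }

    public List<String> getUserAgents() {
        return userAgents;
    }

    // Convenience helper (an assumption of this sketch): parse a string of XML.
    public static List<String> parse(String xml) throws Exception {
        AccumulatingHandler handler = new AccumulatingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getUserAgents();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<agents><useragent>Mozilla/4.61 [en] (WinNT; I)</useragent></agents>";
        System.out.println(parse(xml));
    }
}
```

The key point is that the value is only read in endElement(...); anything done with it inside characters(...) may be looking at a partial string.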
 
 
 
 
 
Eric Bohlman
      12-14-2003
Julian Reschke wrote:

> Fred wrote:
>
>> (As a 'temp fix' I thought to have the DefaultHandlers characters(...)
>> method concatenate characters read, till the endElement(...) is
>> invoked; but that seems to break everything.)

>
> I think that's how SAX is supposed to work. There's no guarantee that
> you're only getting a single event here.


It *is* how SAX is supposed to work. Keep in mind that character data in
XML can be arbitrarily long; if a parser had to deliver character data in a
single chunk, it could find itself constantly allocating and reallocating
buffers. Not imposing such a requirement greatly simplifies buffer
management in a parser: it can use a fixed-size internal buffer and simply
call the character handler whenever everything up to the end of the buffer
is character data, rather than having to shift everything around. That can
greatly speed up parsing.
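
To make the buffer argument concrete, here is a toy sketch of the idea, not a description of how Xerces or any real parser is implemented. The class FixedBufferDemo, the CharacterSink interface, and the buffer size of 16 are all invented for illustration; the callback merely mirrors the shape of ContentHandler.characters(char[], int, int). The reader fills one fixed-size array and hands off whatever it holds each time, so a long run of text naturally arrives in several pieces.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: delivers text in pieces no larger than a fixed buffer.
public class FixedBufferDemo {

    // Hypothetical callback with the same shape as ContentHandler.characters.
    public interface CharacterSink {
        void characters(char[] ch, int start, int length);
    }

    // Reads from 'in' through one reusable buffer and reports each fill.
    // The buffer is never grown or reallocated, whatever the text length.
    public static void deliver(Reader in, int bufferSize, CharacterSink sink) throws IOException {
        char[] buffer = new char[bufferSize];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            if (n > 0) {
                sink.characters(buffer, 0, n);
            }
        }
    }

    // Collects the delivered pieces as strings. Each piece is copied,
    // because the underlying array is overwritten by the next read.
    public static List<String> chunks(String text, int bufferSize) throws IOException {
        List<String> out = new ArrayList<>();
        deliver(new StringReader(text), bufferSize,
                (ch, start, length) -> out.add(new String(ch, start, length)));
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String piece : chunks("Mozilla/4.61 [en] (WinNT; I)", 16)) {
            System.out.println("found='" + piece + "'");
        }
    }
}
```

Note that the array passed to the callback is reused between calls, which is also why a real handler must copy or append the characters it is given rather than hold on to the array.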
 
 
 
 