Velocity Reviews - Computer Hardware Reviews


Split file that contains multiple XML

 
 
Dominik Stadler
 
      06-23-2005
Hi,

We have a file that contains multiple XML-Messages in the form of:

<FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>

I know this looks broken from the start, but we cannot change the
application that generates this kind of data, so we need a way to cope
with it.

How would I go about reading and splitting this? I don't think Xerces
has functionality for this, right?

I thought about using SAX to parse the complete text (we should get an
error at the second message) and then reading the character/line
position from the error message, but this sounds like a hack to me. Is
there some other way?

Thanks... Dominik.
 
 
 
 
 
Andrew Schorr
 
      06-24-2005
Hi,

I have actually done exactly what you suggested using the Expat SAX
parser (this feature is now included in the XML gawk extension found at
http://sourceforge.net/projects/xmlgawk).
The key is to call XML_Parse until it returns XML_STATUS_ERROR.
At that point, call XML_GetCurrentByteIndex to find the location of the
error. You can then close out the parsing of the previous document and
start parsing the new one, which begins at the returned error offset
into the file. To see how this is done, look in xml_puller.c in the
SourceForge repository:

http://cvs.sourceforge.net/viewcvs.p....6&view=markup

Or you can just use xmlgawk and not worry about implementing this
yourself.
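
For readers who want to try the idea without C, here is a minimal
sketch using Python's standard Expat binding. It is not the
xml_puller.c code. The function name split_concatenated_xml is made up
for illustration, and parser.ErrorByteIndex stands in for
XML_GetCurrentByteIndex. As a design choice of this sketch, only the
"junk after document element" error is treated as a message boundary;
any other error is re-raised, so input that triggers a different error
at the boundary is not handled.

```python
import xml.parsers.expat as expat
from xml.parsers.expat import errors


def split_concatenated_xml(data: bytes) -> list:
    """Split bytes holding several back-to-back XML documents (hypothetical helper)."""
    # Numeric code Expat reports when a second root element follows the first.
    junk_code = errors.codes[errors.XML_ERROR_JUNK_AFTER_DOC_ELEMENT]
    documents = []
    offset = 0
    while data[offset:].strip():
        # A parser cannot be reused after an error, so create one per document.
        parser = expat.ParserCreate()
        try:
            parser.Parse(data[offset:], True)
        except expat.ExpatError as err:
            # Byte position of the error, relative to the start of this parse.
            boundary = parser.ErrorByteIndex
            if err.code != junk_code or boundary <= 0:
                raise  # a genuine well-formedness error, not a boundary
            documents.append(data[offset:offset + boundary].strip())
            offset += boundary
        else:
            # The remainder parsed cleanly, so it is the last document.
            documents.append(data[offset:].strip())
            break
    return documents
```

Each returned chunk can then be handed to whatever parser the
application already uses.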

Regards,
Andy

 
 
 
 
 
Richard Tobin
 
      06-24-2005
Dominik Stadler wrote:

>We have a file that contains multiple XML-Messages in the form of:
>
><FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>


>How would I go about reading/splitting this?


Wrap an element around it so it is

<x><FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE></x>

and then just extract the children of that element.
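
A minimal sketch of this wrapping approach, assuming the whole file
fits in memory and using Python's standard ElementTree rather than
Xerces. The function name split_by_wrapping is hypothetical; the
wrapper element name x is taken from the example above.

```python
import xml.etree.ElementTree as ET


def split_by_wrapping(text: str) -> list:
    """Wrap concatenated messages in one root element and return its children."""
    # The wrapper turns several root elements into one well-formed document.
    root = ET.fromstring("<x>" + text + "</x>")
    # Each direct child of the wrapper is one original message.
    return list(root)
```

Each child is an Element; ET.tostring(child) turns it back into text if
the messages need to be written out separately.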

-- Richard
 
 
Andrew Schorr
 
      06-27-2005
It seems to me that this method will not work if any of the messages
contain XML declaration headers. When I try your technique by feeding
some concatenated messages, each of which contains an XML declaration,
into the Expat parser, I get this error message:

xml declaration not at start of external entity

But I expect your technique should work fine if there are no XML
declarations in the messages.
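
The failure can be reproduced with a short sketch in Python, whose
ElementTree parser is built on Expat. The helper name
wrapping_fails_on_declaration is made up for illustration, and the
wording of the error text may differ between Expat versions, so the
check compares the numeric error code instead of the message.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import errors


def wrapping_fails_on_declaration(text: str) -> bool:
    """Return True if wrapping `text` in <x>...</x> fails on a misplaced XML declaration."""
    misplaced = errors.codes[errors.XML_ERROR_MISPLACED_XML_PI]
    try:
        ET.fromstring("<x>" + text + "</x>")
    except ET.ParseError as err:
        # ParseError carries the numeric Expat error code in .code
        return err.code == misplaced
    return False
```

With input such as '<?xml version="1.0"?><A/><?xml version="1.0"?><B/>'
the wrapped text is rejected, while '<A/><B/>' parses without error.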

Regards,
Andy

 
 
 
 