large xml file...

 
 
boris
 
      08-23-2011
Hi all,
I need to process a large XML file and dump some documents to a different
file based on the content of some elements.

Let's say I need to check the content of <text3> and dump the whole <doc> to
a different file:

<doc>
<text1>
<text2>
<text3> ... etc

</doc>

I'm trying to do this using SAX. Are there any examples of how to do this?
Is SAX OK for this task?
Thanks.


 
Ian Shef
      08-23-2011
boris <(E-Mail Removed)> wrote in news:j2uqp4$n8h$1
@speranza.aioe.org:

> hi all,
> I need to process large xml file and dump some documents to a different
> file based on content of some elements.
>
> let's say I need to check content of <text3> and dump the whole <doc> to
> a different file:
>
> <doc>
> <text1>
> <text2>
> <text3> ... etc
>
> </doc>
>
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?
> thanks.
>
>
>


What you are asking is unclear to me.
Do you mean that <text3> will determine whether you dump the whole <doc> to
another file?
Do you mean that <text3> will determine what file the whole <doc> will be
dumped to?
Or do you mean that the whole <doc> will be dumped to some other file, and
while you are at it, <text3> will also be checked and reported in some way?

Can you read the "large xml file" twice?
Can you put the whole "large xml file" (or at least the part preceding
<text3>) into memory?
Can you copy the "large xml file" to another file while it is being
processed?

Sorry about the questions, but I need clarification. I have used SAX and
may be able to help. SAX has its uses, but it is not so good when 'memory'
of earlier content is involved, unless _you_ provide the memory. SAX excels
when processing can take place in a single pass with very little looking
backwards. Because it does not keep what it has already parsed, it uses
less memory than methods that build the whole document in memory.
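
As a minimal sketch of what "you provide the memory" can look like with SAX:
the handler below remembers only the text of the <text3> element it is
currently inside. The class name Text3Collector and the collect helper are
made up for illustration; the javax.xml.parsers and org.xml.sax calls are
standard JDK API. It assumes the default, non-namespace-aware parser, so the
element name arrives in qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <text3> element.
final class Text3Collector extends DefaultHandler {
    final List<String> values = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <text3>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("text3".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append each one.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("text3".equals(qName) && current != null) {
            values.add(current.toString());
            current = null;
        }
    }

    // Convenience wrapper for trying the handler on an in-memory string.
    static List<String> collect(String xml) throws Exception {
        Text3Collector handler = new Text3Collector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

Everything outside <text3> is seen once and forgotten, which is why the
memory use stays small no matter how large the input file is.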





 
boris
      08-23-2011
On 08/22/2011 08:43 PM, Ian Shef wrote:

> Do you mean that <text3> will determine whether you dump the whole <doc> to
> another file?

yes


> Can you read the "large xml file" twice?

I would like to read it once.

> Can you put the whole "large xml file" (or at least the part preceding
> <text3>) into memory?

no.


 
boris
      08-23-2011
> On 08/22/2011 08:43 PM, Ian Shef wrote:

> > Can you put the whole "large xml file" (or at least the part preceding
> > <text3>) into memory?

> no.


No, I can load the whole file. 1 doc is not a problem...




 
Arne Vajhøj
 
      08-23-2011
On 8/22/2011 8:05 PM, boris wrote:
> I need to process large xml file and dump some documents to a different
> file based on content of some elements.
>
> let's say I need to check content of <text3> and dump the whole <doc> to
> a different file:
>
> <doc>
> <text1>
> <text2>
> <text3> ... etc
>
> </doc>
>
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?


SAX or StAX seem like the most obvious choices given the context.

Any textbook SAX example should lead you to working code.

I can post some code, but I doubt that it will show anything
that various books and tutorials do not.

Arne


 
Ian Shef
 
      08-23-2011
boris <(E-Mail Removed)> wrote in
news:j2utnu$t1q$(E-Mail Removed):

>> On 08/22/2011 08:43 PM, Ian Shef wrote:

>
>> > Can you put the whole "large xml file" (or at least the part
>> > preceding <text3>) into memory?

>> no.

>
> No, I can load the whole file. 1 doc is not a problem...
>
>
>
>


As you are processing, you can save the XML yourself (e.g. as a List of
String_s).

Based on the result of evaluating <text3>, you can choose to:

Open an output file, copy the List of String_s to it, and then copy any
succeeding XML to the same file; or discard the List and discontinue
processing.

Alternatively, you can save the XML to a file as you process it. When you
evaluate <text3>, you can choose to continue saving to the file, or delete
the file and discontinue processing.
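
The buffer-then-decide approach above could be sketched roughly as follows.
This is only one way to do it and rests on several assumptions: it buffers
StAX XMLEvent objects rather than a List of String_s, it treats "evaluating
<text3>" as a simple substring test, and it wraps the kept documents in a
made-up <matches> root element so the output is a single well-formed
document. The names DocFilter and copyMatching are hypothetical.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class DocFilter {
    /** Copies each <doc> whose <text3> text contains keyword; returns how many were copied. */
    static int copyMatching(Reader in, Writer out, String keyword) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        List<XMLEvent> buffer = new ArrayList<>(); // events of the current <doc> only
        StringBuilder text3 = null;                // non-null only while inside <text3>
        boolean inDoc = false;
        boolean keep = false;
        int copied = 0;

        writer.add(events.createStartElement("", "", "matches")); // assumed wrapper element
        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            if (ev.isStartElement()) {
                String name = ev.asStartElement().getName().getLocalPart();
                if (name.equals("doc")) {
                    buffer.clear();
                    keep = false;
                    inDoc = true;
                } else if (inDoc && name.equals("text3")) {
                    text3 = new StringBuilder();
                }
            }
            if (inDoc) {
                buffer.add(ev);
            }
            if (text3 != null && ev.isCharacters()) {
                text3.append(ev.asCharacters().getData());
            }
            if (ev.isEndElement()) {
                String name = ev.asEndElement().getName().getLocalPart();
                if (name.equals("text3") && text3 != null) {
                    keep |= text3.indexOf(keyword) >= 0; // the "evaluation" step
                    text3 = null;
                } else if (name.equals("doc")) {
                    if (keep) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                        copied++;
                    }
                    buffer.clear(); // either way, forget this <doc>
                    inDoc = false;
                }
            }
        }
        writer.add(events.createEndElement("", "", "matches"));
        writer.flush();
        writer.close();
        reader.close();
        return copied;
    }
}
```

Only one <doc> is held in memory at a time, and the input is read once.
With real files, in and out would be something like a FileReader and a
FileWriter (or streams with an explicit encoding).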





 
boris
 
      08-24-2011
On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
>
> SAX or StAX seems as the most obvious choices given the context.
>
> Any textbook SAX example should lead you to working code.
>
> I can post some code, but I doubt that it will show anything
> various books and tutorials does not.
>
> Arne
>
>

I tried to accumulate the whole XML (<doc>...</doc>) as a string using
SAX, but in this case all special characters are processed by the parser
and come out as plain characters, not as "predefined entities" like &quot;

Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same result.





 
Andreas Leitgeb
 
      08-24-2011
boris <(E-Mail Removed)> wrote:
> Using stax, I get correct xml, if I print events right away, but if I
> store them in collection and print them later, I don't get the same result.


That sounds more like a bug in your code for "storing" and "printing later"
than a problem with stax itself.

 
Arne Vajhøj
 
      08-24-2011
On 8/24/2011 2:40 PM, boris wrote:

> I tried to accumulate the whole xml(<doc>...</doc>) as string using sax,
> but in this case all special characters are processed by parser
> and are just characters and not "predefined entities" like &quot;
>
> Using stax, I get correct xml, if I print events right away, but if I
> store them in collection and print them later, I don't get the same
> result.


Any correct XML parser should convert the XML &quot; to a " in
a Java String.

Any correct XML formatter/serializer should convert it back again
when generating new XML.

Arne
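
The round trip described above can be tried with a small sketch like the one
below. It uses the JDK's StAX cursor API, which is my choice for the
illustration and not something the post prescribes; the class and method
names (EntityRoundTrip, parseText, serialize) are made up. It only shows
that the parser hands back plain characters and that writeCharacters
escapes & and < again on output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class EntityRoundTrip {
    /** Returns all character data in the document, with predefined entities already resolved. */
    static String parseText(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            // Text may arrive in several CHARACTERS events, so append each piece.
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    /** Writes text as the content of a single element and returns the resulting XML. */
    static String serialize(String elementName, String text) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement(elementName);
        writer.writeCharacters(text); // escapes markup-significant characters such as & and <
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return out.toString();
    }
}
```

For example, parseText("<t>a &amp; b &quot;c&quot;</t>") gives back the
Java String a & b "c", and passing a string containing & or < to serialize
produces &amp; or &lt; in the output. So accumulating raw parser text into
a String and writing it out unescaped is what loses the entities, not the
parser itself.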


 
Stanimir Stamenkov
 
      08-25-2011
Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:

> Any correct XML parser should convert the XML &quot; to a " in
> a Java String.
>
> Any correct XML formatter/serializer should convert it back again
> when generating new XML.


I think any sane XML serializer should not output " as &quot; in
text content.

--
Stanimir
 