Velocity Reviews - Computer Hardware Reviews

How to import only part of a large XML file?

 
 
ccc31807
11-16-2011
On Nov 11, 5:39 pm, Dwight Army of Champions
<(E-Mail Removed)> wrote:
> I have a very large XML file that I want to load, but I don't want to
> necessarily load the entire document; that takes too long. What I want
> to do instead is grab only the key/value pairs that meet certain
> criteria, like only grab entries whose value falls within a certain
> date range for the key date_of_entry. Can I just use XML::Simple for
> this or do I need a better module?


This depends on the nature of your input. I do this kind of thing
every day, and use a simple regular expression to filter the file. Of
course, you still have to read every line of the file to make sure
that you catch all of your intended targets, but you would have to do
that anyway. This is the kind of task for which it's a lot easier to
hand roll your own parser than it is to look for, evaluate, learn,
install, and use some third party module. In my opinion anyway. For
example:

SCRIPT
#! perl
use warnings;
use strict;

my %filter;
while (<DATA>)
{
    next unless /\w/;
    chomp;
    if ($_ =~ m!<order>(\d+)</order>!)
    {
        my $key = $1;
        while (<DATA>)
        {
            last if $_ =~ m!</pres>!;
            next unless $_ =~ m!<last>(\w+)</last>!;
            $filter{$key} = $1;
        }
    }
}
print "Finished processing file\n";
foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
exit(0);

__DATA__
<pres>
<order>1</order>
<first>George</first>
<last>Washington</last>
<year>1788</year>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
<year>1796</year>
</pres>
<pres>
<order>3</order>
<first>Thomas</first>
<last>Jefferson</last>
<year>1800</year>
</pres>

OUTPUT
$ perl filter_test.plx
Finished processing file
1 => Washington
2 => Adams
3 => Jefferson
 
Rainer Weikusat
11-16-2011
ccc31807 <(E-Mail Removed)> writes:
> On Nov 11, 5:39 pm, Dwight Army of Champions
> <(E-Mail Removed)> wrote:
>> I have a very large XML file that I want to load, but I don't want to
>> necessarily load the entire document; that takes too long. What I want
>> to do instead is grab only the key/value pairs that meet certain
>> criteria, like only grab entries whose value falls within a certain
>> date range for the key date_of_entry. Can I just use XML::Simple for
>> this or do I need a better module?

>
> This depends on the nature of your input.


[...]

> while (<DATA>)
> {
>     next unless /\w/;
>     chomp;
>     if ($_ =~ m!<order>(\d+)</order>!)
>     {
>         my $key = $1;
>         while (<DATA>)
>         {
>             last if $_ =~ m!</pres>!;
>             next unless $_ =~ m!<last>(\w+)</last>!;
>             $filter{$key} = $1;
>         }
>     }
> }


AFAIK, a well-formed XML file could have an order description looking like
this:

<order

>1</order

>

meaning, it is not really possible to parse XML without doing a
character-by-character lexical analysis of the input data stream
first.
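A minimal sketch of this point (in Python's stdlib rather than Perl, purely for illustration): a line-at-a-time regex never sees a start-tag that is legally split across lines, while a real XML parser, which tokenizes the character stream, is unaffected.

```python
import io
import re
import xml.etree.ElementTree as ET

# The start-tag is legally split across lines: XML permits whitespace
# inside a tag between the element name and '>'.
doc = "<pres><order\n>1</order\n><last>Washington</last></pres>"

# A line-by-line regex never sees '<order>...</order>' on one line.
regex_hits = [m.group(1) for line in doc.splitlines()
              for m in re.finditer(r"<order>(\d+)</order>", line)]

# A real XML parser lexes the character stream and is unaffected.
orders = [el.text for _, el in ET.iterparse(io.StringIO(doc))
          if el.tag == "order"]

print(regex_hits)  # []
print(orders)      # ['1']
```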
 
Rainer Weikusat
11-16-2011
Dwight Army of Champions <(E-Mail Removed)> writes:
> I have a very large XML file that I want to load, but I don't want to
> necessarily load the entire document; that takes too long. What I want
> to do instead is grab only the key/value pairs that meet certain
> criteria, like only grab entries whose value falls within a certain
> date range for the key date_of_entry.


This is impossible. Technically, XML is a sequential character stream
and any structured data encoded as XML can only be recovered by
aggregating characters into tokens based on the rules for XML tokens
and parsing the resulting token stream.

 
Willem
11-16-2011
Rainer Weikusat wrote:
) AFAIK, a well-formed XML file could have an order description looking like
) this:
)
) <order
)
) >1</order
)
) >
)
) meaning, it is not really possible to parse XML without doing a
) character-by-character lexical analysis of the input data stream
) first.

Indeed. To me, this is an argument that XML is usually a bad choice,
especially when you use it to store, transmit and retrieve data.

It's a _markup_ language, people! Not a data storage language.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
ccc31807
11-16-2011
On Nov 16, 11:50 am, Rainer Weikusat <(E-Mail Removed)> wrote:
> AFAIK, a well-formed XML file could have an order description looking like
> this:
>
> <order
>
> >1</order
>
> >
>
> meaning, it is not really possible to parse XML without doing a
> character-by-character lexical analysis of the input data stream
> first.


As I said, it depends on the nature of your input. XML handles
'ragged' data as well as the kind of normalized data we would expect
to use for an RDBMS. If you aren't sure of the format of your data,
you obviously have to validate it somehow. Part of this might be
removing whitespace at the beginnings and ends of lines. Sometimes it
might be removing newlines from several lines until you match some
kind of closing tag.

I don't advocate reinventing wheels. I also don't advocate searching
for a CPAN module as the first step in solving a particular
programming problem. If you need to run a script continually
processing the same kind of input, it might pay to cobble together
some code that does EXACTLY what you need, no more and no less, than
to use someone else's code.

I say this as a promiscuous user of CPAN modules -- hardly a week goes
by that I don't install a new module for one reason or another -- and
frequently I just look at the source, modify it to do what I need, and
don't use or require the module.

TIMTOWTDI, CC.
 
ccc31807
11-16-2011
On Nov 11, 7:11 pm, Dwight Army of Champions
<(E-Mail Removed)> wrote:
> <?xml version="1.0"?>
> <library>
> <book>
>         <title>Dreamcatcher</title>
>         <author>Stephen King</author>
>         <genre>Horror</genre>
>         <pages>899</pages>
>         <price>23.99</price>
>         <rating>5</rating>
>         <publication_date>11/27/2001</publication_date>
> </book>
...
> </library>


If I had this kind of file, and it was a static file, I would read it
into some kind of database. If you used something like SQLite, you
could read it into a table <book> element by <book> element, and then
use normal SQL to munge your data.
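A sketch of that approach (shown here in Python's stdlib for brevity; the inlined document and column layout are made up for the example): load the table <book> element by <book> element, then query it with ordinary SQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical library file in the shape Dwight posted, inlined here.
xml_doc = """<library>
<book><title>Dreamcatcher</title><author>Stephen King</author>
<pages>899</pages><price>23.99</price></book>
<book><title>Carrie</title><author>Stephen King</author>
<pages>199</pages><price>9.99</price></book>
</library>"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE book (title TEXT, author TEXT, pages INT, price REAL)")

# Read the table in one <book> element at a time.
for book in ET.fromstring(xml_doc).iter("book"):
    db.execute("INSERT INTO book VALUES (?, ?, ?, ?)",
               [book.findtext(f) for f in ("title", "author", "pages", "price")])

# Then use normal SQL to munge the data.
for row in db.execute("SELECT title FROM book WHERE pages > 500"):
    print(row[0])  # Dreamcatcher
```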

Alternatively, you could convert the file into CSV format, which in many
ways is a lot easier to handle than XML.

It strikes me that using XML for this kind of work is overkill, unless
you had a specific requirement to use XML. If you had to use XML it
might pay to learn a little XSLT and use that instead of Perl. Perl is
a great language for string processing, but in some cases XSLT works
better.

CC.
 
Klaus
11-16-2011
On 16 nov, 18:17, ccc31807 <(E-Mail Removed)> wrote:
> On Nov 11, 7:11 pm, Dwight Army of Champions
> <(E-Mail Removed)> wrote:
>
> > <?xml version="1.0"?>
> > <library>
> > <book>
> >         <title>Dreamcatcher</title>
> >         <author>Stephen King</author>
> >         <genre>Horror</genre>
> >         <pages>899</pages>
> >         <price>23.99</price>
> >         <rating>5</rating>
> >         <publication_date>11/27/2001</publication_date>
> > </book>
> ...
> > </library>
>
> If I had this kind of file, and it was a static file, I would read it
> into some kind of database. If you used something like SQLite, you
> could read it into a table <book> element by <book> element, and then
> use normal SQL to munge your data.
>
> Alternatively, you could convert the file into CSV format, which in many
> ways is a lot easier to handle than XML.


Converting to CSV is as easy as:

use strict;
use warnings;

use XML::Reader;
use Text::CSV_XS;

my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
  { root => '/library/book', branch => [
      '/title',
      '/author',
      '/genre',
      '/pages',
      '/price',
      '/rating',
      '/publication_date',
  ]},
  { root => '/library/music', branch => [
      '/title',
      '/artist',
      '/release_date',
      '/label',
  ]});

my $csv = Text::CSV_XS->new({ sep_char => ',', binary => 1, eol => $/ });
open my $ofh, '>', 'out.csv' or die $!;

while ($rdr->iterate) {
    $csv->print($ofh, [ ($rdr->rx == 0 ? 'book' : 'music'), $rdr->value ]);
}

close $ofh;
 
Klaus
11-16-2011
On 16 nov, 17:32, ccc31807 <(E-Mail Removed)> wrote:
> On Nov 11, 5:39 pm, Dwight Army of Champions
> <(E-Mail Removed)> wrote:
> > I have a very large XML file that I want to load, but I don't want to
> > necessarily load the entire document; that takes too long. What I want
> > to do instead is grab only the key/value pairs that meet certain
> > criteria, like only grab entries whose value falls within a certain
> > date range for the key date_of_entry. Can I just use XML::Simple for
> > this or do I need a better module?

>
> This depends on the nature of your input. I do this kind of thing
> every day, and use a simple regular expression to filter the file. Of
> course, you still have to read every line of the file to make sure
> that you catch all of your intended targets, but you would have to do
> that anyway. This is the kind of task for which it's a lot easier to
> hand roll your own parser than it is to look for, evaluate, learn,
> install, and use some third party module. In my opinion anyway. For
> example:
>
> SCRIPT
> #! perl
> use warnings;
> use strict;
> my %filter;
> while (<DATA>)
> {
>     next unless /\w/;
>     chomp;
>     if ($_ =~ m!<order>(\d+)</order>!)
>     {
>         my $key = $1;
>         while (<DATA>)
>         {
>             last if $_ =~ m!</pres>!;
>             next unless $_ =~ m!<last>(\w+)</last>!;
>             $filter{$key} = $1;
>         }
>     }
> }
>
> print "Finished processing file\n";
> foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
> exit(0);


Using XML::Reader, it's even easier:

use strict;
use warnings;

use XML::Reader;

my %filter;

my $rdr = XML::Reader->new(\*DATA,
  {mode => 'branches'},
  { root => '/data/pres', branch => [
      '/order',
      '/last',
  ]});

while ($rdr->iterate) {
    my ($order, $last) = $rdr->value;
    $filter{$order} = $last;
}

print "Finished processing file\n";
foreach my $key (sort keys %filter) {
    print "$key => $filter{$key}\n";
}

__DATA__
<data>
<pres>
<order>1</order>
<first>George</first>
<last>Washington</last>
<year>1788</year>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
<year>1796</year>
</pres>
<pres>
<order>3</order>
<first>Thomas</first>
<last>Jefferson</last>
<year>1800</year>
</pres>
</data>
 
Rainer Weikusat
11-17-2011
ccc31807 <(E-Mail Removed)> writes:
> On Nov 16, 11:50 am, Rainer Weikusat <(E-Mail Removed)> wrote:
>> AFAIK, a well-formed XML file could have an order description looking like
>> this:
>>
>> <order
>>
>> >1</order
>>
>> >
>>
>> meaning, it is not really possible to parse XML without doing a
>> character-by-character lexical analysis of the input data stream
>> first.

>
> As I said, it depends on the nature of your input. XML handles
> 'ragged' data as well as the kind of normalized data we would expect
> to use for an RDBMS. If you aren't sure of the format of your data,
> you obviously have to validate it somehow. Part of this might be
> removing whitespace at the beginning and ends of lines. Sometimes it
> might be removing newlines from several lines until you match some
> kind of closing tag.


The point I was trying to make is that the kind of input your (example) code
can deal with needs to follow the rules of a grammar which is a proper
subset of the XML grammar.
 
ccc31807
11-17-2011
On Nov 17, 2:33 pm, Rainer Weikusat <(E-Mail Removed)> wrote:
> The point I was trying to make is that the kind of input your (example) code
> can deal with needs to follow the rules of a grammar which is a proper
> subset of the XML grammar.


Yes, I understood your point. We all have to deal with messy data, and
faulty input will kill an application with no hope of recovery if you
don't deal with the possibility of corrupted data.

That said, if you are confident of the format of your input (as you
might have with an input file generated from a database) it might be
quicker and easier to hand roll your own.

If you have XML, you can use a SAX parser to process your input
element by element, and I assume that it would handle your whitespace
example without a problem.
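A minimal sketch of the SAX approach (Python's stdlib xml.sax here, for illustration; Perl's XML::SAX works on the same event model): the parser does the tokenization, including tags split across lines, and the handler only ever sees complete elements.

```python
import xml.sax

class PresFilter(xml.sax.ContentHandler):
    """Collect order => last-name pairs, element by element."""
    def __init__(self):
        super().__init__()
        self.filter = {}
        self.text = ""
        self.order = None
    def startElement(self, name, attrs):
        self.text = ""              # reset the buffer at every element start
    def characters(self, content):
        self.text += content        # text may arrive in several chunks
    def endElement(self, name):
        if name == "order":
            self.order = self.text.strip()
        elif name == "last":
            self.filter[self.order] = self.text.strip()

# The second <pres> has its tags split across lines, like Rainer's
# example; the SAX parser handles it without any special effort.
doc = """<data><pres><order>1</order><last>Washington</last></pres>
<pres><order
>2</order
><last>Adams</last></pres></data>"""

handler = PresFilter()
xml.sax.parseString(doc.encode(), handler)
print(handler.filter)  # {'1': 'Washington', '2': 'Adams'}
```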

I don't deal with XML much, and I really appreciate the posts from
others that illustrate scripts with XML::Reader and the like. I didn't
have XML::Reader but installed it yesterday, and have spent several
hours piddling with it.

CC.

 