Data analysis of large collection of XML files

 
 
Henry S. Thompson
      11-24-2008
Well, I did a small experiment. The W3C XML Schema Test Suite [1] has
approximately 40000 XML files in it (test schemas and files). The
following UN*X pipeline

> find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 wc -c | cut -d ' ' -f 1 | stats


run at the root of the suite produced the following output:

n = 39377
NA = 0
min = 7
max = 878065
sum = 3.57162e+07
ss = 3.35058e+12
mean = 907.033
var = 8.42692e+07
sd = 9179.83
se = 46.2608
skew = 89.0122
kurt = 8280.51

and took 1 minute 18 seconds real time on a modest Sun.
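
For anyone without the `stats` command to hand, here is a rough awk stand-in. It is only a sketch: the function name `summarize` is made up for illustration, it prints a subset of the fields shown above, and it assumes the sample (n-1) form of the variance, which is consistent with the n, sum, ss and var figures in that output but is not confirmed by the tool's documentation.

```shell
# summarize: read one number per line on stdin and print n, min, max, sum, mean, sd.
# Hypothetical stand-in for `stats`; assumes the sample variance (n-1 denominator).
summarize() {
  awk '
    NF == 0 { next }                        # skip blank lines
    {
      x = $1 + 0                            # first field, forced to a number
      if (n == 0 || x < min) min = x
      if (n == 0 || x > max) max = x
      n++; sum += x; ss += x * x            # running count, sum, sum of squares
    }
    END {
      if (n == 0) { print "n = 0"; exit }
      mean = sum / n
      var = (n > 1) ? (ss - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0                  # guard against rounding error
      printf "n = %d\nmin = %g\nmax = %g\nsum = %g\nmean = %g\nsd = %g\n", \
             n, min, max, sum, mean, sqrt(var)
    }'
}
```

Piping the byte counts from the pipeline above into `summarize` instead of `stats` should give comparable n, min, max and mean values.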

How much overhead does XML parsing add to this? I used lxprintf, one
of the down-translation tools in the LT XML toolkit [2] [3]:

> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null | while read l; do echo $l | wc -c; done | stats


(average the length of every 'name' attribute anywhere in the corpus)

with the following output:

n = 57418
NA = 0
min = 1
max = 2004
sum = 700156
ss = 3.29944e+07
mean = 12.194
var = 425.949
sd = 20.6385
se = 0.0861301
skew = 39.7485
kurt = 3428.03

and this took 8 minutes 50 seconds real time on the same machine.
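
A side note on the `while read l; do echo $l | wc -c; done` step: it starts one `wc` process per extracted name, and `wc -c` also counts the newline that `echo` adds, so each value is one more than the attribute length. A single awk process can do the same job; this is a sketch only, and `line_lengths` is a made-up name.

```shell
# line_lengths: print the length of each input line, excluding the newline.
# One awk process handles the whole stream instead of one wc per line.
# Note: length() counts characters or bytes depending on the awk and locale.
line_lengths() {
  awk '{ print length($0) }'
}
```

Used in place of the loop, the tail of the pipeline would read `... 2>/dev/null | line_lengths | stats`.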

So, to parse and traverse 39377 XML documents, averaging about 900
bytes each, took at most a factor of 6.8 longer than to run them all
through wc. Or, it took 0.0135 seconds per document, on average, to do
the statistics using a fast XML tool to do the data extraction.
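
The two figures come straight from the timings and the document count reported above. A small sketch of the arithmetic, with made-up helper names `to_secs` and `compare`:

```shell
# to_secs MIN SEC: convert a minutes/seconds timing to whole seconds.
to_secs() { echo $(( $1 * 60 + $2 )); }

# compare BASE_SECS XML_SECS DOCS: print the slowdown factor of the XML run
# relative to the baseline, and the mean XML time per document.
compare() {
  awk -v b="$1" -v a="$2" -v n="$3" 'BEGIN {
    printf "factor = %.3f\nper_doc = %.4f\n", a / b, a / n
  }'
}

# 1 min 18 s for wc, 8 min 50 s for lxprintf, 39377 documents.
compare "$(to_secs 1 18)" "$(to_secs 8 50)" 39377
```

This prints a factor of 6.795 and 0.0135 seconds per document.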

Maybe this helps you plan.

ht

[1] http://www.w3.org/XML/2004/xml-schem...ite/index.html
[2] http://www.ltg.ed.ac.uk/~richard/ltxml2/ltxml2.html
[3] http://www.ltg.ed.ac.uk/software/ltxml2
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail:
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 
Hermann Peifer
      11-24-2008
On Nov 24, 3:39 pm, h...@inf.ed.ac.uk (Henry S. Thompson) wrote:
>
> So, to parse and traverse 39377 XML documents, averaging 900 bytes
> long, took at most a factor of 6.8 longer than to run them all through wc.
>


This more or less confirms my personal (unofficial and undocumented)
statistics, which say that on average, XML documents contain 10% data
and 90% "packaging waste". Often enough, one just has to get rid of
the packaging in order to do something meaningful with the actual
values.

On the other hand, processing data in XML format with dedicated XML
tools also has its advantages, no doubt about that.

Hermann
 
Henry S. Thompson
      11-24-2008
ht writes:

> So, to parse and traverse 39377 XML documents, averaging 900 bytes
> long, took at most a factor of 6.8 longer than to run them all through
> wc. Or, it took 0.0135 seconds per document, on average, to do the
> statistics using a fast XML tool to do the data extraction.


Well, I decided the two pipes I used were too different, so I did a
more nearly equivalent comparison:

> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    xargs -n 1 wc -c | \
    cut -d ' ' -f 1 > /tmp/fsize

[count the length in bytes of each file]

> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null > /tmp/names

[just extract the value of all the 'name' attributes anywhere in any
of the files]

The first (no parse, just wc) case took 1 min 17 secs; the second, XML
parse and XPath evaluate, took 3 mins 40 secs, so the relevant measures
are

2.857 times as long for the XML condition
0.006 seconds per XML file (so we're down to realistic times for
your 50M file collection, given that a) this machine is slow and b)
the files are coming in via NFS)
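
To put a number on "realistic times", here is a sketch that scales the measured run to 50M files. It assumes the cost per file stays constant at that scale, which is an assumption and not something measured in this thread; `extrapolate` is a made-up helper name.

```shell
# extrapolate SECS DOCS TARGET: scale a measured run linearly to TARGET documents.
# Assumes constant per-document cost (no caching, I/O or directory-size effects).
extrapolate() {
  awk -v s="$1" -v n="$2" -v t="$3" 'BEGIN {
    per_doc = s / n                 # measured seconds per document
    total = per_doc * t             # projected seconds for TARGET documents
    printf "per_doc = %.4f s\ntotal = %.1f hours (%.1f days)\n", \
           per_doc, total / 3600, total / 86400
  }'
}

# 3 min 40 s for 39377 files, scaled to 50M files.
extrapolate $((3 * 60 + 40)) 39377 50000000
```

On these numbers the projection comes out at about 77.6 hours, or a little over 3 days, for the XML condition on this machine.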

ht
 
 
Hermann Peifer
      11-24-2008
On Nov 23, 4:17 pm, Peter Flynn <peter.n...@m.silmaril.ie> wrote:
> Ken Starks wrote:
>
> > How big is 50 million? Well, suppose each file takes one second to
> > parse and convert to csv, you might like to know that
> >
> > 50 million seconds = 578.703704 days
> > http://www.google.co.uk/search?hl=en...econds+in+days
>
> I think that no matter which way you do it, it's going to take a
> significant number of days to wade through them.
>


Well, I think that the estimate of 1 second for processing a 3K file
is not very realistic. It is far too high, and consequently it
wouldn't take a significant number of days to process 50M XML files.

I just took a 3K sample XML file and copied it 1M times: 1000
directories, with 1000 XML files each. Calculating the average value
of some element across all 1000000 files took me:

- 10 minutes with text processing tools (grep and awk)
- 20 minutes with XML parsing tools (xmlgawk)

Extrapolated linearly to 50M files, this would mean a processing time
of roughly 8 and 17 hours respectively.
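
For illustration, the text-tools variant could look something like the sketch below. This is not the exact command used for the timings above: the function name `avg_element` and the element name in the usage line are made up, it relies on `grep -o` (available in GNU and BSD grep), and it assumes each element sits on one line with no attributes, CDATA or entities.

```shell
# avg_element ELEMENT DIR: average the numeric content of <ELEMENT>...</ELEMENT>
# across all *.xml files under DIR, using text tools only (no XML parsing).
avg_element() {
  find "$2" -type f -name '*.xml' -exec grep -h -o "<$1>[^<]*</$1>" {} + |
    awk -F'[<>]' '
      { sum += $3; n++ }            # $3 is the text between the two tags
      END {
        if (n) printf "%d values, mean %g\n", n, sum / n
        else   print "0 values"
      }'
}
```

Usage would be along the lines of `avg_element price /path/to/files`. It skips well-formedness checks entirely, which is the trade-off against a real XML parser.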

Hermann
 