ht writes:
> So, to parse and traverse 39377 XML documents, averaging 900 bytes
> long, took at most a factor of 6.8 longer than to run them all through
> wc. Or, it took 0.0135 seconds per document, on average, to do the
> statistics using a fast XML tool to do the data extraction.
Well, I decided the two pipes I used were too different, so I did a
more nearly equivalent comparison:
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    xargs -n 1 wc -c | \
    cut -d ' ' -f 1 > /tmp/fsize
[count the length in bytes of each file]
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null > /tmp/names
[just extract the value of all the 'name' attributes anywhere in any
of the files]
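
For anyone without lxprintf to hand, here is a rough Python sketch of the same kind of extraction: parse one document and print the value of every 'name' attribute on any element. It uses the standard-library xml.etree.ElementTree rather than the tool timed above, and the function name name_attributes and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def name_attributes(xml_text):
    """Return the value of every 'name' attribute, in document order."""
    root = ET.fromstring(xml_text)
    # iter() visits the root element and all of its descendants.
    return [el.attrib["name"] for el in root.iter() if "name" in el.attrib]


# Hypothetical sample document, not taken from the timed collection.
sample = (
    '<schema>'
    '<element name="a"><attribute name="b"/></element>'
    '<group ref="c"/>'
    '</schema>'
)

names = name_attributes(sample)
print("\n".join(names))
```

Elements without a 'name' attribute (the group element here) are simply skipped.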
The first (no parse, just wc) case took 1min17secs, the second, XML
parse and XPath evaluate, took 3mins40secs, so the relevant measures
are
2.857 times as long for the XML condition
0.006 seconds per XML file (so we're down to realistic times for
your 50M file collection, given that a) this machine is slow and b)
the files are coming in via NFS)
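
The arithmetic behind those two figures, as a small Python sketch. It assumes the file count is the same 39377 documents quoted above, and the 50M extrapolation assumes the per-file cost scales linearly, which is only a guess.

```python
# Wall-clock times reported above, converted to seconds.
wc_seconds = 1 * 60 + 17    # 1min17secs, find | wc
xml_seconds = 3 * 60 + 40   # 3mins40secs, find | lxprintf

# Assumption: the same 39377 documents as in the quoted earlier run.
n_docs = 39377

ratio = xml_seconds / wc_seconds
per_file = xml_seconds / n_docs

# Naive linear extrapolation to a 50M file collection.
hours_for_50m = 50_000_000 * per_file / 3600

print(f"{ratio:.3f} times as long for the XML condition")
print(f"{per_file:.3f} seconds per XML file")
print(f"about {hours_for_50m:.0f} hours for 50M files")
```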
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail:
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]