Java and huge XML file to be parsed

 
 
Dimitri Maziuk
 
      06-24-2004
Roedy Green sez:
> On Wed, 23 Jun 2004 22:34:52 +0000 (UTC), Dimitri Maziuk
><dima@127.0.0.1> wrote or quoted :


> Think of what fraction of the
> planet's XML or HTML documents would pass a complete W3C validation
> suite, perhaps under 1%. Using a binary format solves that problem in
> one fell swoop with the additional benefits of:
>
> 1. more compact, faster download.
> 2. faster processing.
> 3. tighter specification.
> 4. fewer people have to understand it.
> 5. simpler classes needed to process it, important in handhelds.


That is assuming

0. software that translated source into binary works correctly:
we know it doesn't. And when it doesn't we get to the interesting
part: failure modes. HTML browser can fail to "View source" and
user will still see the content. Binary browser?

1. binary representation is not necessarily more compact. E.g.
using double-byte characters vs. single-byte + charset header.

2. nobody cares about processing speed. The bottleneck is network
I/O, not CPU speed. What we do care about is byte ordering, original
word sizes, and all other fun stuff you need to deal with when
getting raw bytes over the wire.

3. non-issue as there's no reason why text markup format specs
must necessarily be less tight than binary format specs. What
happened in Real Life when Nutscrape, Microshaft, and whathaveyou
add feechoorz to their software and then shove them up HTML
specs would happen with any format, binary, shminary.

4. non-issue. Your own estimate is that 1% of HTML is good, ergo
only 1% of webshite designers and authors of HTML editing software
understand HTML. Ergo, they don't _have_ to understand it already,
obscuring the format further won't change anything.

(Obviously, the assumption that people will make better $foo if
they don't understand $foo is in itself rather amusing. E.g.
people would make better cars if they didn't understand how cars
work.)

5. who said anything about classes? You can process HTML with sed:
s/<[^>]*>//g will give you nice plain text output, and you can add
bells and whistles as appropriate for your hardware.
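
The same crude tag stripping can be sketched in Java. This is only an illustration, with hypothetical class and method names. A greedy pattern such as `<.+>` would swallow everything between the first `<` and the last `>` on a line, so the sketch uses a negated character class instead. It makes no attempt to handle comments, scripts, entities, or a `>` inside an attribute value.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches one tag: '<', then any run of characters other than '>', then '>'.
    // The negated class keeps each match inside a single tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with every tag-like span removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "Hello, world!"
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
    }
}
```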

Furrfu
Dima
--
I'm going to exit now since you don't want me to replace the printcap. If you
change your mind later, run -- magicfilter config script
 
 
 
 
 
Roedy Green
 
      06-25-2004
On Thu, 24 Jun 2004 22:59:09 +0000 (UTC), Dimitri Maziuk
<dima@127.0.0.1> wrote or quoted :

>
>0. software that translated source into binary works correctly:
> we know it doesn't.


The odds of it working are extremely high. A bug will soon be noticed
and fixed because there are so many other programs cross checking it.


The odds of a human doing it manually perfectly are extremely low. It
is the sort of mind-numbing task computers excel at.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
 
 
 
Roedy Green
 
      06-25-2004
On Thu, 24 Jun 2004 22:59:09 +0000 (UTC), Dimitri Maziuk
<dima@127.0.0.1> wrote or quoted :

>
>1. binary representation is not necessarily more compact. E.g.
> using double-byte characters vs. single-byte + charset header.



Then use a binary representation with UTF-8 (one byte per ASCII
character) or another even more compact encoding such as Huffman coding.

The point is when you don't worry about making the format convenient
for humans you can make it optimally convenient for computers, i.e.
some optimal combination of:

fast to process,
compact to transport,
processible with small amounts of RAM.
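
To put rough numbers on the encoding-size argument, here is a small Java sketch (class and method names are made up for illustration). It compares how many bytes the same text takes in a single-byte charset, in UTF-8, and in a double-byte encoding. Note that UTF-8 is one byte per character only for the ASCII range; other characters take two or more bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSize {
    // Number of bytes needed to store the string in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00e9llo"; // "héllo", written with a Unicode escape

        // UTF_16BE is used rather than UTF_16 so no byte-order mark is counted.
        for (String s : new String[] {ascii, accented}) {
            System.out.println(s
                + " latin1=" + byteLength(s, StandardCharsets.ISO_8859_1)
                + " utf8=" + byteLength(s, StandardCharsets.UTF_8)
                + " utf16=" + byteLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

For the ASCII string the three sizes are 5, 5 and 10 bytes; for the accented one they are 5, 6 and 10.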

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
Grant Wagner
 
      06-25-2004
Jezuch wrote:

> Roedy Green wrote:
> > It is just people would have used more appropriate tools to create the
> > web content.

>
> This one is *the* problem. People are lazy. Imagine what would happen if you
> developed something like this and said to them "it's all fine, but you have
> to use THIS tool". I presume that no one would bother to get it...


Actually, the problem isn't with laziness, it's with being told what to do, and
how to do it.

If I wrote perfectly acceptable HTML in Notepad, and then was told "that's fine,
but you have to use *this* tool to do it all again because you can't import what
you've done", I'd either tell them to go stuff it, or I'd find a way to upload
my hand-coded Notepad version, even if it meant writing a "compiler" to turn my
perfectly acceptable HTML into whatever tokenized mish-mash-mess was "required".

HTTP is _all_ text/byte-stream, it's what allows me to do:

print "Content-Type: text/plain\n\n";
print "this\n";
print "is\n";
print "a\n";
print "new\n";
print "line";

in Perl and have it come out on the browser correctly.
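
For readers following along in Java, the same CGI-style output can be sketched as below. The class and method names are hypothetical; the point is only that the whole response is ordinary text: a header line, a blank line, then the body.

```java
class PlainTextResponse {
    // Builds CGI-style output: one header line, a blank line, then the body lines
    // joined with newlines (no trailing newline, matching the Perl prints above).
    static String build(String... lines) {
        StringBuilder out = new StringBuilder("Content-Type: text/plain\n\n");
        out.append(String.join("\n", lines));
        return out.toString();
    }

    public static void main(String[] args) {
        // A CGI program would write exactly this to standard output.
        System.out.print(build("this", "is", "a", "new", "line"));
    }
}
```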

And thank <insert your chosen deity here> they did it that way is all I have to
say.

--
| Grant Wagner <(E-Mail Removed)>

 
 
Roedy Green
 
      06-25-2004
On Fri, 25 Jun 2004 20:34:32 GMT, Grant Wagner
<(E-Mail Removed)> wrote or quoted :

>If I wrote perfectly acceptable HTML in Notepad, and then was told "that's fine,
>but you have to use *this* tool to do it all again because you can't import what
>you've done",


Nobody is stopping you from using it, there is just one more step. It
is no different than being told you must put your letter in an
envelope before posting it.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
Roedy Green
 
      06-25-2004
On Fri, 25 Jun 2004 20:45:04 GMT, Roedy Green
<(E-Mail Removed)> wrote or quoted :

>Nobody is stopping you from using it, there is just one more step. It
>is no different than being told you must put your letter in an
>envelope before posting it.


For communication there has to be SOME standard. Think of it this way.
You are infringing on MY rights by insisting I use a fluffy, badly
specified error-prone format. You are further deliberately trying to
drive me mad by putting malformed HTML on your website that crashes my
browser.

I should, in the American tradition, SUE you for damages, pain and
suffering.


--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
Dale King
 
      04-15-2006
Hello, Roedy Green !
You wrote:

> On Thu, 17 Jun 2004 22:25:00 -0400, Sudsy <(E-Mail Removed)>
> wrote or quoted :
>
> >Try modifying those with a simple
> >text editor!
>
> Why use an ancient tool like that? It is like doing data entry with
> NOTEPAD.


And what specifically is wrong with allowing someone to edit it
with the simplest of tools? That isn't even an option with a
binary format.

> For heaven's sake. Surely we could create an editor that
> created, edited and searched a compact XML-like representation that
> made it IMPOSSIBLE to create syntax errors and almost correct data.

Sure we could create an editor for each and every format out
there, but that would sure be a lot of work. Each format would
have its own editor. And we also end up duplicating that work
when we want to transform from one format to another.

Or we can use XML where the parser or editor only has to be
created once. The parser already exists and there are XML editors
that do just what you describe. And transformation from one
format to another is easy as well.
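
As a rough sketch of what "transformation from one format to another" can look like with the XML tooling that ships with Java, the example below runs a small XSLT stylesheet over an XML string and produces CSV. The element names (`items`, `item`, `id`), the stylesheet, and the class name are all invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XmlToCsv {
    // Hypothetical stylesheet: emits one "id,text" line per <item> under <items>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/items'>"
      + "<xsl:for-each select='item'>"
      + "<xsl:value-of select='@id'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the text output.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transform(
            "<items><item id='1'>apple</item><item id='2'>pear</item></items>"));
    }
}
```

The parser and the transformer are generic; only the stylesheet is specific to this made-up format.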

So if you want to keep reinventing the wheel feel free. The rest
of the world has got better things to do.

> It is not as though we failed to notice what a MESS HTML became from
> lack of such a representation. The idiots took the worst features of
> HTML.


No they didn't. The primary problems with HTML are that it is
about presentation, it is not well-formed, and it is not validated.
None of these is true of XML.
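
To illustrate the well-formedness point, here is a minimal Java sketch (hypothetical class name) that asks the standard XML parser whether a string is well-formed. Typical tag-soup HTML with an unclosed `<br>` is rejected, while the XML-style `<br/>` is accepted. This checks well-formedness only; validating against a DTD or schema is a separate step not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>One<br>Two</p>"));  // prints false
        System.out.println(isWellFormed("<p>One<br/>Two</p>")); // prints true
    }
}
```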

> It is amazing that such an IDIOTIC format caught on.


Quite understandable why it caught on.
--
Dale King
My Blog: http://daleking.homedns.org/Blog
 
 
Dale King
 
      04-15-2006
Hello, Roedy Green !
You wrote:

> On Thu, 17 Jun 2004 23:49:28 +0200, Katrin Tomanek
> <(E-Mail Removed)> wrote or quoted :
>
> >I've got a really big XML File (about 215 MBytes), which I have to parse.
>
> ARRGH. That file is probably 20 times the size it would be if stored
> in some sensible format.


There seems to be an underlying false assumption by the OP and
probably by Roedy. The fact that it is 215 MB on disk does not
mean that the in-memory version of it will be anywhere near that
large. When using a SAX-style (event-driven) tool, rather than DOM, you
have control over the objects created.
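
As a hedged illustration of keeping memory use independent of file size: an event-driven (SAX) parser hands each element to a callback and builds no tree, so the caller decides which objects exist. A DOM parser, by contrast, builds the whole tree itself. The sketch below just counts elements; the element name `record` and the class name are assumptions, since the thread never shows the actual file's structure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RecordCounter {
    // Streams through the input and counts elements with the given name.
    // Only a single counter is kept in memory, whatever the input size.
    static int countElements(InputStream in, String name) {
        int[] count = {0};
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if (qName.equals(name)) {
                        count[0]++;
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        byte[] xml = "<records><record/><record/><other/></records>"
            .getBytes(StandardCharsets.UTF_8);
        // prints 2
        System.out.println(countElements(new ByteArrayInputStream(xml), "record"));
    }
}
```

For a large file on disk, the same method could be given a `java.io.FileInputStream` instead of the in-memory stream used here.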

> It will take 100 times as long to parse as
> some sensible binary format.


I find that to be a gross exaggeration, but neither of us has
hard data. I would also say that the development time for coding
the parser and editor for a binary format is 100 times that of
using XML. That development time can be lessened, though, by
using XML to describe and edit the data and then transforming it
into the binary format.

> PHOOEY ON XML! I knew this insanity would happen.


What insanity? That it would actually be put to good use despite
your objections?
--
Dale King
My Blog: http://daleking.homedns.org/Blog
 