Efficient format for huge amount of data

 
 
Gabriel Genellina
      01-20-2004
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
.... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using Serialized, but since the source program is not
written in Java it may be hard to replicate exactly the Serialized
format - btw, where is it documented? if documented at all...

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL
 

Marco Schmidt
      01-20-2004
Gabriel Genellina:

[...]

>Maybe a binary format is more efficient, but I don't know which could
>be the best way, nor how to implement it.


There are DataInputStream and DataOutputStream. They have read and
write methods for the primitive types of Java and for Strings. Byte order
is big endian, valid intervals for the primitive types are defined in
the Java specs (e.g. char from 0 to 65535), and the format of String
serialization is described in the API docs of readUTF/writeUTF.

So if each record looks like the data you described above, an
element class could be:

class Element {
    String s;
    int i;
    float f;
    String s2;
}

And reading and writing could work like this:

Element read(DataInputStream in) throws IOException {
    Element elem = new Element();
    elem.s = in.readUTF();
    elem.i = in.readInt();
    elem.f = in.readFloat();
    elem.s2 = in.readUTF();
    return elem;
}

void write(DataOutputStream out, Element elem) throws IOException {
    out.writeUTF(elem.s);
    out.writeInt(elem.i);
    out.writeFloat(elem.f);
    out.writeUTF(elem.s2);
}

There is no single best way of doing persistent storage. Personally
I'd work with databases whenever it's feasible. I don't like self-made
binary formats like the above very much. You can't change things
easily, at least not if you have to convert existing data from binary
format A to B. Other people will have to study your format and write
and maintain dedicated code.

However, the format is more efficient (less space and faster to parse)
than ASCII text.
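
Since the record structure is only known at runtime, one option is to put a small header in front of the records that names the field types. The following is a minimal sketch of that idea, not a definitive format: the class name RecordStream, the type codes 'S', 'I' and 'D', and the header layout (schema string, then record count) are all assumptions made for illustration. A non-Java writer would have to emit the same bytes: big-endian numbers, and strings as a 2-byte big-endian length followed by modified UTF-8 (which matches standard UTF-8 as long as the text has no NUL or supplementary characters).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout: writeUTF(schema), writeInt(recordCount), then each record's fields.
// Hypothetical type codes: 'S' = String (writeUTF), 'I' = int, 'D' = double.
class RecordStream {

    static byte[] write(String schema, List<Object[]> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(schema);
            out.writeInt(records.size());
            for (Object[] rec : records) {
                for (int i = 0; i < schema.length(); i++) {
                    switch (schema.charAt(i)) {
                        case 'S': out.writeUTF((String) rec[i]); break;
                        case 'I': out.writeInt((Integer) rec[i]); break;
                        case 'D': out.writeDouble((Double) rec[i]); break;
                        default: throw new IllegalArgumentException("unknown type code");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<Object[]> read(byte[] data) {
        List<Object[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String schema = in.readUTF();   // the structure is discovered here, at runtime
            int count = in.readInt();
            for (int r = 0; r < count; r++) {
                Object[] rec = new Object[schema.length()];
                for (int i = 0; i < rec.length; i++) {
                    switch (schema.charAt(i)) {
                        case 'S': rec[i] = in.readUTF(); break;
                        case 'I': rec[i] = in.readInt(); break;
                        case 'D': rec[i] = in.readDouble(); break;
                        default: throw new IOException("unknown type code");
                    }
                }
                records.add(rec);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Object[]> records = new ArrayList<>();
        records.add(new Object[] {"A string", 123, 4.567, "X"});
        records.add(new Object[] {"Another string", 89, 10.0, "Y"});
        byte[] data = write("SIDS", records);
        for (Object[] rec : read(data)) {
            System.out.println(java.util.Arrays.toString(rec));
        }
    }
}
```

For a real file, the byte array streams would be replaced by buffered file streams; the in-memory round trip is only there to keep the sketch self-contained.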

Regards,
Marco
--
Please reply in the newsgroup, not by email!
Java programming tips: http://jiu.sourceforge.net/javatips.html
Other Java pages: http://www.geocities.com/marcoschmidt.geo/java.html
 

Thomas Schodt
      01-20-2004
Marco Schmidt wrote:

> There are DataInputStream and DataOutputStream. Both have read and
> write method for the primitive types of Java and Strings. Byte order
> is big endian


So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.
 

Andrew Hobbs
      01-20-2004

"Gabriel Genellina" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> I have to pass a huge amount of data to a Java program. The source
> program is not written in Java but I have control over both programs
> and can arrange any suitable format at both ends.
>
> The dataset is a sequence of records, all records having the same
> structure. This structure is only known at runtime, and it's built on
> simple types like string, integer, double, etc.
>
> I could use an ASCII file to transfer data, like this:
>
> "A string", 123, 4.567, "X"
> "Another string", 89, 10.0, "Y"
> "Third line", -1, 0.0, "Z"
> ... many more lines, 100K or 1M ...
>
> but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
> for each line + the various wrapper classes like Integer, Double... I
> think this may be very slow for a large file.


How large are you talking about? 1 Mbyte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn and use the commas as
delimiters.

We wrote our own parser and reading a 1 MByte file off disc, parsing it into
floats and strings and then drawing the 3D structure that it represents
takes a fraction of a second. If you want to see what I mean then log onto
www.metasense.com.au and try the free trial version. Click on the Chemistry
and then the DNA folder and try out some of those molecules. The largest is
almost 1 M in size and it loads and displays on my machine in about 1/2
second. It might take longer for you depending upon the speed of your
connection.
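
A hand-written parser of the kind described above could look roughly like the following. This is only a sketch under assumptions: the class name LineParser is made up, quoted strings are assumed to contain no escaped quotes, and it is not the parser used in the product mentioned in this post. It walks the line one character at a time and splits on commas that are outside double quotes.

```java
import java.util.ArrayList;
import java.util.List;

class LineParser {

    // Splits one line on commas outside double quotes.
    // Quotes are dropped; unquoted values are trimmed. Escaped quotes are not handled.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;   // currently between an opening and a closing quote
        boolean quoted = false;     // the current field has seen a quote
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (!inQuotes) {
                    cur.setLength(0);   // discard whitespace before the opening quote
                }
                inQuotes = !inQuotes;
                quoted = true;
            } else if (c == ',' && !inQuotes) {
                fields.add(quoted ? cur.toString() : cur.toString().trim());
                cur.setLength(0);
                quoted = false;
            } else if (inQuotes || !quoted) {
                cur.append(c);          // skip anything after a closing quote
            }
        }
        fields.add(quoted ? cur.toString() : cur.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        List<String> f = split("\"A string\", 123, 4.567, \"X\"");
        String s = f.get(0);
        int i = Integer.parseInt(f.get(1));
        double d = Double.parseDouble(f.get(2));
        System.out.println(s + " | " + i + " | " + d + " | " + f.get(3));
    }
}
```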

Cheers

Andrew

--
************************************************** ******
Andrew Hobbs PhD

MetaSense Pty Ltd - www.metasense.com.au
12 Ashover Grove
Carine W.A.
Australia 6020

61 8 9246 2026

************************************************** *******






 

Christian Holm
      01-20-2004
"Gabriel Genellina" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
<snip>
> but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
> for each line + the various wrapper classes like Integer, Double... I
> think this may be very slow for a large file.


I wouldn't worry too much about speed. I've written something very similar,
and was able to parse a 600 MB text file using the method above in about a
minute. Your case may be a bit more time-consuming, but it will probably
still be fast enough.
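
To check this on your own machine, a rough timing harness along the following lines may help. It is a sketch only: the class name ParseTiming and the generated sample data are assumptions, splitting on ',' alone assumes no commas inside the quoted strings, and a single run like this says nothing precise about steady-state speed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class ParseTiming {

    // Parses lines shaped like: "text", int, double, "text".
    // Returns the sum of the int column as a simple check that parsing happened.
    static long parse(BufferedReader in) {
        long sum = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, ",");
                String s1 = unquote(st.nextToken());
                int i = Integer.parseInt(st.nextToken().trim());
                double d = Double.parseDouble(st.nextToken().trim());
                String s2 = unquote(st.nextToken());
                sum += i;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    // Trims the token and removes the surrounding double quotes.
    static String unquote(String token) {
        String t = token.trim();
        return t.substring(1, t.length() - 1);
    }

    // Builds made-up sample data with the given number of lines.
    static String sample(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < lines; n++) {
            sb.append("\"Row ").append(n).append("\", ").append(n).append(", 4.567, \"X\"\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = sample(100_000);
        long start = System.nanoTime();
        long sum = parse(new BufferedReader(new StringReader(text)));
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("parsed 100000 lines in " + ms + " ms (checksum " + sum + ")");
    }
}
```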

Christian


 

Thomas Weidenfeller
      01-20-2004
Gabriel Genellina wrote:
> I have to pass a huge amount of data

[...]
> ... many more lines, 100K or 1M ...


1M is not a huge amount of data. I eat that for breakfast - twice.

> but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
> for each line + the various wrapper classes like Integer, Double... I
> think this may be very slow for a large file.


Try it. Slow is a relative term, but I don't think you will run into
trouble here.

> Maybe a binary format is more efficient, but I don't know which could
> be the best way, nor how to implement it.


A ByteBuffer might be the fastest.
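
As a rough illustration of the ByteBuffer route, here is a minimal sketch. The record layout (a 4-byte length followed by UTF-8 bytes, then an int, then a double) and the class name BufferRecords are assumptions for the example, not a format anyone in this thread defined. One practical point is that ByteBuffer lets the reader choose the byte order, so a writer on a little-endian machine could dump values natively and the Java side would call order(ByteOrder.LITTLE_ENDIAN) instead.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class BufferRecords {

    // Hypothetical record layout: int byte-length, UTF-8 bytes, int, double.
    static ByteBuffer encode(String s, int i, double d) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + utf8.length + 4 + 8);
        buf.order(ByteOrder.BIG_ENDIAN);   // the default, stated explicitly
        buf.putInt(utf8.length).put(utf8).putInt(i).putDouble(d);
        buf.flip();                        // switch from writing to reading
        return buf;
    }

    // Reads one record from the buffer's current position.
    static Object[] decode(ByteBuffer buf) {
        byte[] utf8 = new byte[buf.getInt()];
        buf.get(utf8);
        String s = new String(utf8, StandardCharsets.UTF_8);
        int i = buf.getInt();
        double d = buf.getDouble();
        return new Object[] {s, i, d};
    }

    public static void main(String[] args) {
        ByteBuffer buf = encode("A string", 123, 4.567);
        Object[] rec = decode(buf);
        System.out.println(rec[0] + " | " + rec[1] + " | " + rec[2]);
    }
}
```

For a file on disk, the buffer could come from FileChannel.map or be filled by FileChannel.read rather than being built in memory as it is here.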

> I've considered using Serialized, but since the source program is not
> written in Java it may be hard to replicate exactly the Serialized
> format - btw, where is it documented? if documented at all...


AFAIR the low-level details are documented in the
Data[Output|Input]Stream or Object[Input|Output]Stream API
documentation. There is also a specification on Sun's Java web site
(the Java Object Serialization Specification).

/Thomas

 

nos
      01-20-2004

"Gabriel Genellina" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> I have to pass a huge amount of data to a Java program. The source
> program is not written in Java but I have control over both programs
> and can arrange any suitable format at both ends.
>
> The dataset is a sequence of records, all records having the same
> structure. This structure is only known at runtime, and it's built on
> simple types like string, integer, double, etc.
>
> I could use an ASCII file to transfer data, like this:
>
> "A string", 123, 4.567, "X"
> "Another string", 89, 10.0, "Y"
> "Third line", -1, 0.0, "Z"
> ... many more lines, 100K or 1M ...
>
> but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
> for each line + the various wrapper classes like Integer, Double... I
> think this may be very slow for a large file.
>
> Maybe a binary format is more efficient, but I don't know which could
> be the best way, nor how to implement it.
> I've considered using Serialized, but since the source program is not
> written in Java it may be hard to replicate exactly the Serialized
> format - btw, where is it documented? if documented at all...
>
> Any ideas are welcome.
> Thanks,
>
> Gabriel Genellina
> Softlab SRL


I would put one value per line. This avoids tokenizing and
the file size doesn't change much.
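
A sketch of how the one-value-per-line layout might be read, assuming the field types are supplied at runtime as a string of hypothetical type codes ('S' string, 'I' int, 'D' double) and that no string value contains a line break. The class name OneValuePerLine is made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class OneValuePerLine {

    // types holds one hypothetical code per field: 'S' string, 'I' int, 'D' double.
    static List<Object[]> read(BufferedReader in, String types) {
        if (types.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        List<Object[]> records = new ArrayList<>();
        try {
            String line = in.readLine();
            while (line != null) {
                Object[] rec = new Object[types.length()];
                for (int i = 0; i < rec.length; i++) {
                    if (line == null) {
                        throw new IllegalStateException("file ends inside a record");
                    }
                    switch (types.charAt(i)) {
                        case 'S': rec[i] = line; break;
                        case 'I': rec[i] = Integer.valueOf(line.trim()); break;
                        case 'D': rec[i] = Double.valueOf(line.trim()); break;
                        default: throw new IllegalArgumentException("unknown type code");
                    }
                    line = in.readLine();   // advance to the next value
                }
                records.add(rec);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "A string\n123\n4.567\nX\nAnother string\n89\n10.0\nY\n";
        for (Object[] rec : read(new BufferedReader(new StringReader(text)), "SIDS")) {
            System.out.println(java.util.Arrays.toString(rec));
        }
    }
}
```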


 

William Brogden
      01-20-2004

"Gabriel Genellina" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> I have to pass a huge amount of data to a Java program. The source
> program is not written in Java but I have control over both programs
> and can arrange any suitable format at both ends.
>
> The dataset is a sequence of records, all records having the same
> structure. This structure is only known at runtime, and it's built on
> simple types like string, integer, double, etc.
>
> I could use an ASCII file to transfer data, like this:
>
> "A string", 123, 4.567, "X"
> "Another string", 89, 10.0, "Y"
> "Third line", -1, 0.0, "Z"
> ... many more lines, 100K or 1M ...
>
> but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
> for each line + the various wrapper classes like Integer, Double... I
> think this may be very slow for a large file.


A StreamTokenizer would be much more flexible and you would only need to
create one.
Calling eolIsSignificant(true) makes end-of-line a token, which lets you
tell when each line ends.
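
A minimal sketch of the StreamTokenizer approach, assuming the comma-separated layout from the original question; the class name TokenizerDemo is made up. With the default settings, double-quoted text comes back in sval, numbers come back in nval (always as double, so 123 arrives as 123.0), and each comma is returned as an ordinary single-character token that the loop ignores.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenizerDemo {

    // Returns one list per input line, holding String and Double values in order.
    static List<List<Object>> parse(Reader reader) {
        StreamTokenizer st = new StreamTokenizer(reader);
        st.eolIsSignificant(true);   // report end-of-line as TT_EOL
        List<List<Object>> rows = new ArrayList<>();
        List<Object> row = new ArrayList<>();
        try {
            int t;
            while ((t = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (t == StreamTokenizer.TT_EOL) {
                    if (!row.isEmpty()) {
                        rows.add(row);
                    }
                    row = new ArrayList<>();
                } else if (t == StreamTokenizer.TT_NUMBER) {
                    row.add(st.nval);    // numbers are always parsed as double
                } else if (t == '"') {
                    row.add(st.sval);    // body of a double-quoted string
                }
                // any other token (such as ',') is skipped
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!row.isEmpty()) {
            rows.add(row);               // last line without a trailing newline
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "\"Another string\", 89, 10.0, \"Y\"\n\"Third line\", -1, 0.0, \"Z\"\n";
        for (List<Object> r : parse(new StringReader(text))) {
            System.out.println(r);
        }
    }
}
```

If the integer columns must stay integers, the values would need to be converted from nval afterwards, since the tokenizer itself does not distinguish 89 from 89.0.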
Bill






 

Chris
      01-20-2004
I doubt that speed will be an issue for you.

I've been working on some address handling software for a mate,
comma-delimited records, file size usually around the 3-4 MB mark,
using BufferedReader and StringTokenizer for parsing - it generally
takes a minute or so to process (and it looks like the in-memory
processing I'm doing is considerably more complex than your
requirements).

Try it and see!

- sarge
 

Jon A. Cruz
      01-20-2004
Thomas Schodt wrote:
>
> So be sure to use htons() / htonl() in the non-Java app before stuffing
> the data on the stream.


Actually, try not to use them.

Instead use explicit byte math to get values out in an explicit order.

Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

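To give you the rough idea:

#include <stdint.h>

/* Store u at dst as four bytes, most significant byte first. */
void write32( unsigned char* dst, uint32_t u )
{
    *dst++ = (u >> 24) & 0x0ff;
    *dst++ = (u >> 16) & 0x0ff;
    *dst++ = (u >> 8) & 0x0ff;
    *dst++ = (u >> 0) & 0x0ff;
}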
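
On the Java side, bytes written in that order are exactly what DataInputStream.readInt() expects. The sketch below, with a made-up class name BigEndianCheck, mirrors the C routine with explicit byte math and compares the result against readInt(); it is an illustration, not code from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BigEndianCheck {

    // Assembles a 32-bit value from four bytes, most significant byte first.
    static int read32(byte[] src, int offset) {
        return ((src[offset] & 0xff) << 24)
             | ((src[offset + 1] & 0xff) << 16)
             | ((src[offset + 2] & 0xff) << 8)
             |  (src[offset + 3] & 0xff);
    }

    // Reads the same four bytes through DataInputStream for comparison.
    static int readWithStream(byte[] src) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(src))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {0x12, 0x34, 0x56, 0x78};   // as write32 would store 0x12345678
        System.out.println(Integer.toHexString(read32(bytes, 0)));        // prints 12345678
        System.out.println(Integer.toHexString(readWithStream(bytes)));   // prints 12345678
    }
}
```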

 