Large Data files and sizes - what are you doing about them?

 
 
Rich_Elswick
 
      03-09-2006
Hi all,

I am parsing large data sets (62 gigs in one file). I can parse them
out into smaller files fine with Perl, which is what we have to do
anyway (i.e. hex data becomes ASCII .csv files of different decoded
variables). I am working with CAN data (Controller Area Network, for
those who know it) collected by Vector CANalyzer.

After parsing (1 file becomes ~100 smaller files), the largest data
file is about 2 gigs right now, but who knows how large it could
become in the future. I then use GD::Graph to read through the data
files and rapidly generate some .png files for review (I have issues
with this as well and will post those questions some other time). I
run this on the whole batch of 100 files, one file at a time, using a
batch program that calls a separate Perl program for each graph,
because GD::Graph loads the entire data set into memory before
graphing it. This limits me to data files smaller than ~20 megs, based
on system memory. I suppose I could add memory to the machine, but
that 1. costs money, 2. makes me request it from IT (not easy), and
3. still doesn't work with a 2 gig file.

I was wondering 2 things.

1. Is there a better way of graphing this data, which uses less memory?
2. What is everyone else out there using?

Please, no comments about simply sampling the data (e.g. taking every
fifth line) and graphing the sample; we have already considered this,
and it may end up being how we resolve our issues.

Thanks,
Rich Elswick
Test Engineer
Cobasys LLC
http://www.cobasys.com

 
 
 
 
 
Dr.Ruud
 
      03-09-2006
Rich_Elswick wrote:

> 1. Is there a better way of graphing this data, which uses less
> memory?


Maybe RRD.

http://search.cpan.org/~tcaine/POE-C...ool/RRDTool.pm
(and others)

--
Affijn, Ruud

"Gewoon is een tijger."
 
 
 
 
 
xhoster@gmail.com
 
      03-09-2006
"Rich_Elswick" <(E-Mail Removed)> wrote:
> 1. Is there a better way of graphing this data, which uses less memory?


It seems to me that if you are trying to plot 2 gigs' worth of data, then
at least one of two things is probably the case. Either most of the data
points fall almost exactly on top of each other, and therefore you can get
the same image by plotting fewer than all of them. Or the resulting image
is a blob of partially or nearly overlapping symbols, which would convey
little information other than blobbiness, and thus by plotting fewer than
all of them you get a graph that is more informative than plotting them all.
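
To make the first case concrete, here is a minimal sketch (not anything
GD::Graph provides) that passes a point only if no earlier point landed
on the same pixel. The name make_pixel_filter, the option names, and the
idea that the axis ranges and plot size are known up front are all
assumptions for illustration. One bit per pixel is kept, so memory is
bounded by the image size rather than by the size of the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that answers "is this the first
# point to land on its pixel?". Axis ranges and plot size are assumed
# to be known before the data is read.
sub make_pixel_filter {
    my (%opt) = @_;
    my $bits = '';    # bit vector, one bit per pixel of the plot area

    return sub {
        my ($x, $y) = @_;

        # Points outside the assumed axis ranges are never plotted.
        return 0 if $x < $opt{x_min} || $x > $opt{x_max};
        return 0 if $y < $opt{y_min} || $y > $opt{y_max};

        # Scale data coordinates to integer pixel coordinates.
        my $px = int(($x - $opt{x_min}) / ($opt{x_max} - $opt{x_min})
                     * ($opt{width} - 1));
        my $py = int(($y - $opt{y_min}) / ($opt{y_max} - $opt{y_min})
                     * ($opt{height} - 1));

        # Skip the point if its pixel is already marked; otherwise mark it.
        my $idx = $py * $opt{width} + $px;
        return 0 if vec($bits, $idx, 1);
        vec($bits, $idx, 1) = 1;
        return 1;
    };
}
```

Reading the file line by line and keeping only the rows for which the
closure returns true leaves at most width x height points to hand to the
plotting code, however large the input is.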

Since you don't want to hear about sampling, I would suggest two
alternatives which are related to sampling but aren't the same. One would
be filtering, where you exclude points if you know that they are
effectively on top of a previous, included, point. The other would be
summarization--instead of taking every 500th point to plot, like in
sampling, you take the mean of all 500 and plot that, or you take the min,
max, and median of each group of 500 and plot those 3 things rather than
all 500.
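
A minimal sketch of the summarization idea, assuming the decoded file is
a headerless-or-not CSV whose second column is the value to plot (that
layout, and the names summarize_stream and csv_value_iterator, are
assumptions for illustration). It holds only one group in memory at a
time and emits min, mean, and max per group; a median could be added by
sorting each group before emitting.

```perl
use strict;
use warnings;
use List::Util   qw(min max sum);
use Scalar::Util qw(looks_like_number);

# Reduce a stream of numbers to one (min, mean, max) triple per group.
# $next is a code ref returning the next value, or undef at end of data;
# $emit is a code ref called once per group with the three summaries.
sub summarize_stream {
    my ($next, $group_size, $emit) = @_;
    my @group;
    while (defined(my $value = $next->())) {
        push @group, $value;
        if (@group == $group_size) {
            $emit->(min(@group), sum(@group) / @group, max(@group));
            @group = ();
        }
    }
    # Flush a final, shorter group so trailing data is not dropped.
    $emit->(min(@group), sum(@group) / @group, max(@group)) if @group;
    return;
}

# Iterator over the second CSV column (assumed layout: "time,value").
# Non-numeric fields, such as a header row, are skipped.
sub csv_value_iterator {
    my ($fh) = @_;
    return sub {
        while (defined(my $line = <$fh>)) {
            chomp $line;
            my (undef, $value) = split /,/, $line;
            return $value if defined $value && looks_like_number($value);
        }
        return;
    };
}

# Small demonstration on in-memory data, with a group size of 3.
my $csv = "time,value\n0.0,1\n0.1,5\n0.2,3\n0.3,10\n0.4,20\n";
open my $fh, '<', \$csv or die "cannot open in-memory data: $!";
summarize_stream(csv_value_iterator($fh), 3, sub {
    my ($min, $mean, $max) = @_;
    printf "%g,%g,%g\n", $min, $mean, $max;
});
close $fh;
```

With a group size of 500, a 2 gig file shrinks to roughly 1/500th as
many rows, which can then be written to a small file and graphed in the
usual way.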

> 2. What is everyone else out there using?


I use GD::Graph with sampling and summarization techniques.

Sometimes I use GD::Graph to set up my axes, labels, titles and such
on a dummy data set, but then use GD directly to draw the actual data
points on the canvas provided by GD::Graph. This way all the data doesn't
need to be in memory at once. However, you need to use the internal
methods of GD::Graph to figure out what coordinates to supply to GD, so
this is a lot of work and is fragile.

I also use R and/or gnuplot to draw some types of images (e.g. contour
plots) which summarize very large datasets without actually drawing each
point. These are standalone programs, and I only use Perl to massage
their inputs, but I think there are modules which will help interface Perl
with both of them.

Xho

 
 
 
 