Frequency in large datasets

 
 
Cosmic Cruizer
Guest
Posts: n/a
 
      05-01-2008
I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

My program keeps aborting after a few minutes because the computer runs out
of memory. I have four gigs of RAM and the total paging file is 10 megs,
but Perl does not appear to be using it.

How can I find the frequency of each line using such a large dataset? I
tried to have two output files where I kept moving the data back and forth
each time I grabbed the next line from TEMP instead of using $seen{$_}++,
but I did not have much success.
 
 
 
 
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      05-01-2008
Cosmic Cruizer wrote:
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {
>     $seen{$_}++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";
>
> My program keeps aborting after a few minutes because the computer runs out
> of memory.


This line:

> foreach (<TEMP>) {


reads the whole file into memory. You should read the file line by line
instead by replacing it with:

while (<TEMP>) {
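
(A foreach loop evaluates <TEMP> in list context, building a list of
every line in the file before the loop body runs; the while condition
reads in scalar context, one line per iteration, so only the current
line and the %seen hash stay in memory.)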

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
 
 
 
A. Sinan Unur
Guest
Posts: n/a
 
      05-01-2008
Cosmic Cruizer <(E-Mail Removed)> wrote in
news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:

> I've been able to reduce my dataset by 75%, but it still leaves me
> with a file of 47 gigs. I'm trying to find the frequency of each line
> using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {


Well, that is simply silly. You have a huge file yet you try to read all
of it into memory. Ain't gonna work.

How long is each line and how many unique lines do you expect?

If the number of unique lines is small relative to the number of total
lines, I do not see any difficulty if you get rid of the boneheaded for
loop.

>     $seen{$_}++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";



my %seen;

open my $TEMP, '<', $tempfile
    or die "Cannot open '$tempfile': $!";

++$seen{$_} while <$TEMP>;

close $TEMP
    or die "Cannot close '$tempfile': $!";

> My program keeps aborting after a few minutes because the computer
> runs out of memory. I have four gigs of ram and the total paging files
> is 10 megs, but Perl does not appear to be using it.


I don't see much point to having a 10 MB swap file. To make the best use
of 4 GB physical memory, AFAIK, you need to be running a 64 bit OS.

> How can I find the frequency of each line using such a large dataset?
> I tried to have two output files where I kept moving the data back and
> forth each time I grabbed the next line from TEMP instead of using
> $seen{$_}++, but I did not have much success.


If the number of unique lines is large, I would periodically store the
current counts, clear the hash, keep processing the original file. Then,
when you reach the end of the original data file, go back to the stored
counts (which will have multiple entries for each unique line) and
aggregate the information there.
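
A minimal sketch of that spill-and-merge idea (the scratch file name,
the tab separator, and the one-million-key limit are my own illustrative
choices, not from the post; it also assumes input lines contain no tabs):

my $spillfile = 'partial_counts.txt';   # hypothetical scratch file
my $limit     = 1_000_000;              # flush after this many distinct keys

open my $in,    '<', $tempfile  or die "Cannot open '$tempfile': $!";
open my $spill, '>', $spillfile or die "Cannot open '$spillfile': $!";

my %seen;
while ( my $line = <$in> ) {
    chomp $line;
    $seen{$line}++;
    if ( keys %seen >= $limit ) {
        # spill the partial counts to disk and start a fresh hash
        print {$spill} "$seen{$_}\t$_\n" for keys %seen;
        %seen = ();
    }
}
print {$spill} "$seen{$_}\t$_\n" for keys %seen;
close $spill or die "Cannot close '$spillfile': $!";
close $in;

# Second pass: aggregate the partial counts. Each unique line may have
# several records here. This assumes the set of *unique* lines fits in
# memory; if it does not, sort the spill file and sum adjacent records.
my %total;
open my $parts, '<', $spillfile or die "Cannot open '$spillfile': $!";
while ( my $rec = <$parts> ) {
    chomp $rec;
    my ( $count, $line ) = split /\t/, $rec, 2;
    $total{$line} += $count;
}
close $parts;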

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      05-01-2008
Cosmic Cruizer <(E-Mail Removed)> wrote:
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {
>     $seen{$_}++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";


If each line shows up a million times on average, that shouldn't
be a problem. If each line shows up twice on average, then it won't
work so well with 4G of RAM. We don't know which of those is closer to
your case.
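
To put rough numbers on that: assuming lines average around 50 bytes
(the post doesn't say), 47 GB is on the order of a billion lines, and a
Perl hash entry carries roughly a hundred bytes of overhead on top of
the key itself. A few million unique lines fit comfortably in 4 GB;
hundreds of millions have no chance.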

> My program keeps aborting after a few minutes because the computer runs
> out of memory. I have four gigs of ram and the total paging files is 10
> megs, but Perl does not appear to be using it.


If the program is killed due to running out of memory, then I would
say that the program *does* appear to be using the available memory. What
makes you think it isn't using it?


> How can I find the frequency of each line using such a large dataset?


I probably wouldn't use Perl, but rather the OS's utilities. For example
on linux:

sort big_file | uniq -c
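
GNU sort handles files far larger than memory by spilling sorted runs to
temporary files and merging them, so the pipeline works even on a 47 GB
input, given enough free disk space. Appending "| sort -rn" lists the
most frequent lines first.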


> I
> tried to have two output files where I kept moving the data back and forth
> each time I grabbed the next line from TEMP instead of using $seen{$_}++,
> but I did not have much success.


But in line 42.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      05-01-2008
Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
> Cosmic Cruizer wrote:
> > I've been able to reduce my dataset by 75%, but it still leaves me with
> > a file of 47 gigs. I'm trying to find the frequency of each line using:
> >
> > open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> > foreach (<TEMP>) {
> >     $seen{$_}++;
> > }
> > close(TEMP) || die "cannot close file $tempfile: $!";
> >
> > My program keeps aborting after a few minutes because the computer runs
> > out of memory.

>
> This line:
>
> > foreach (<TEMP>) {

>
> reads the whole file into memory. You should read the file line by line
> instead by replacing it with:
>
> while (<TEMP>) {


Duh, I completely overlooked that.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
 
Cosmic Cruizer
Guest
Posts: n/a
 
      05-01-2008
Gunnar Hjalmarsson <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> Cosmic Cruizer wrote:
>> I've been able to reduce my dataset by 75%, but it still leaves me
>> with a file of 47 gigs. I'm trying to find the frequency of each line
>> using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
>> foreach (<TEMP>) {
>>     $seen{$_}++;
>> }
>> close(TEMP) || die "cannot close file $tempfile: $!";
>>
>> My program keeps aborting after a few minutes because the computer
>> runs out of memory.

>
> This line:
>
>> foreach (<TEMP>) {

>
> reads the whole file into memory. You should read the file line by
> line instead by replacing it with:
>
> while (<TEMP>) {
>


<sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
else I used the while statement to get me to this point. This solves the
problem.

Thank you.
 
 
Jürgen Exner
Guest
Posts: n/a
 
      05-01-2008
Cosmic Cruizer <(E-Mail Removed)> wrote:
>I've been able to reduce my dataset by 75%, but it still leaves me with a
>file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {


This slurps the whole file (yes, all 47GB) into a list and then iterates
over that list. Read the file line-by-line instead:

while (<TEMP>) {

This should work unless you have a lot of different data points.
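
If there are too many distinct lines for the hash to fit in memory, one
escape hatch is a disk-backed hash. A rough sketch using DB_File
(assuming the module and Berkeley DB are available; the database file
name is made up):

use DB_File;
use Fcntl;

# Tie %seen to a B-tree file on disk: memory use stays bounded,
# at the cost of much slower updates.
my %seen;
tie %seen, 'DB_File', 'counts.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
    or die "cannot tie counts.db: $!";

while (<TEMP>) {
    $seen{$_}++;
}

untie %seen;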

jue
 
 
Ben Bullock
Guest
Posts: n/a
 
      05-01-2008
A. Sinan Unur <(E-Mail Removed)> wrote:
> Cosmic Cruizer <(E-Mail Removed)> wrote in
> news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:
>
>> I've been able to reduce my dataset by 75%, but it still leaves me
>> with a file of 47 gigs. I'm trying to find the frequency of each line
>> using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
>> foreach (<TEMP>) {

>
> Well, that is simply silly. You have a huge file yet you try to read all
> of it into memory. Ain't gonna work.


I'm not sure why it's silly as such - perhaps he didn't know that
"foreach" would read the whole file into memory.


> If the number of unique lines is small relative to the number of total
> lines, I do not see any difficulty if you get rid of the boneheaded for
> loop.


Again, why is it "boneheaded"? The fact that foreach reads the entire
file into memory isn't something I'd expect people to know
automatically.

 
 
A. Sinan Unur
Guest
Posts: n/a
 
      05-01-2008
(E-Mail Removed) (Ben Bullock) wrote in
news:fvbj3s$l7u$(E-Mail Removed):

> A. Sinan Unur <(E-Mail Removed)> wrote:
>> Cosmic Cruizer <(E-Mail Removed)> wrote in
>> news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:
>>


....

>>> foreach (<TEMP>) {

>>
>> Well, that is simply silly. You have a huge file yet you try to read
>> all of it into memory. Ain't gonna work.

>
> I'm not sure why it's silly as such - perhaps he didn't know that
> "foreach" would read all the file into memory.


Well, I assumed he didn't. But this is one of those things, had I found
myself doing it, after spending hours and hours trying to work out a way
of processing the file, I would have slapped my forehead and said, "now
that was just a silly thing to do". Coupled with the "ain't" I assumed
my meaning was clear. I wasn't calling the OP names, but trying to get a
message across very strongly.

>> If the number of unique lines is small relative to the number of
>> total lines, I do not see any difficulty if you get rid of the
>> boneheaded for loop.

>
> Again, why is it "boneheaded"?


Because there is no hope of anything working so long as that for loop is
there.

> The fact that foreach reads the entire file into memory isn't
> something I'd expect people to know automatically.


Maybe this helps:

From perlfaq3.pod:

<blockquote>
* How can I make my Perl program take less memory?

....

Of course, the best way to save memory is to not do anything to waste it
in the first place. Good programming practices can go a long way toward
this:

* Don't slurp!

Don't read an entire file into memory if you can process it line by
line. Or more concretely, use a loop like this:

    while (<FILE>) {
        # ...
    }
</blockquote>

Maybe you would like to read the rest.

So, calling the for loop boneheaded is a little stronger than "Bad
Idea", but then what is simply a bad idea with a 200 MB file (things
will still work but less efficiently) is boneheaded with a 47 GB file
(there is no chance of the program working).

There is a reason "Don't slurp!" appears with an exclamation mark and as
the first recommendation in the FAQ list answer.

Hope this helps you become more comfortable with the notion that reading
a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
Wall does it, if Superman does it ... you get the picture I hope.

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
 
nolo contendere
Guest
Posts: n/a
 
      05-01-2008
On May 1, 7:26 am, "A. Sinan Unur" <(E-Mail Removed)> wrote:
> (E-Mail Removed) (Ben Bullock) wrote in news:fvbj3s$l7u$(E-Mail Removed):
>
> > A. Sinan Unur <(E-Mail Removed)> wrote:

>
> Hope this helps you become more comfortable with the notion that reading
> a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
> Wall does it, if Superman does it ... you get the picture I hope.
>


I don't think it would be boneheaded if Superman did it...I mean, he's
SUPERMAN.
 