out of memory

 
 
friend.05@gmail.com
      10-31-2008
Hi,

I want to parse large log files (GBs in size)

and I am reading 2-3 such files into a hash of arrays.

But since the hash gets very big, the script is running out of memory.

What other approaches can I take?


Example code:

open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
    (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
     undef) = split('\|');
    push @{$time_table{"$cli_ip|$id"}}, $time;
}
close $INFO;


In the above code $file is very big (GBs in size), so I am getting out
of memory!
 
 
 
 
 
Jürgen Exner
      10-31-2008
"(E-Mail Removed)" <(E-Mail Removed)> wrote:
>I want to parse large log files (GBs in size)
>
>and I am reading 2-3 such files into a hash of arrays.
>
>But since the hash gets very big, the script is running out of memory.
>
>What other approaches can I take?


"Doctor, it hurts when I do this."
"Well, then don't do it."

Simple: don't read them into RAM but process them line by line.

>Example code:
>
>open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
>while (<$INFO>)


Oh, you are processing them line by line,

>{
> (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
>undef) = split('\|');
> push @{$time_table{"$cli_ip|$id"}}, $time;
>}
>close $INFO;


If for whatever reason your requirement (sic!!!) is to create an array
with all this data, then you need better hardware and probably a 64-bit
OS and Perl.

Of course a much better approach would probably be to trade time for
space and find a different algorithm to solve your original problem
(which you didn't tell us about) by using less RAM in the first place. I
personally don't see any need to store more than one data set in RAM for
"parsing log files", but of course I don't know what kind of log files
you are talking about and what information you want to compute from
those log files.
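For instance, here is a minimal one-pass sketch that keeps only a small
per-key summary (a count and the last time seen) instead of every
timestamp; whether such a summary is enough obviously depends on the
original problem, and the file name is made up while the field positions
are only assumed from the posted snippet:

use strict;
use warnings;

my $file = 'big.log';    # assumed file name

# One pass over the log, keeping a small fixed-size summary per
# "cli_ip|id" key instead of an ever-growing list of timestamps.
my %summary;
open my $info, '<', $file or die "Cannot open $file :$!\n";
while (<$info>) {
    chomp;
    my ($time, $cli_ip, $id) = (split /\|/)[3, 4, 7];
    my $key = "$cli_ip|$id";
    $summary{$key}{count}++;
    $summary{$key}{last_time} = $time;
}
close $info;

printf "%d distinct cli_ip|id keys seen\n", scalar keys %summary;

The hash still grows with the number of distinct cli_ip|id keys, but not
with the total number of lines.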

Another common solution is to use a database to handle large sets of
data.

jue
 
 
 
 
 
Juha Laiho
      10-31-2008
"(E-Mail Removed)" <(E-Mail Removed)> said:
>I want to parse large log files (GBs in size)
>
>and I am reading 2-3 such files into a hash of arrays.
>
>But since the hash gets very big, the script is running out of memory.


Do you really need to have the whole file available in order to
extract the data you're interested in?

>Example code:
>
>open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
>while (<$INFO>)
>{
> (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
>undef) = split('\|');
> push @{$time_table{"$cli_ip|$id"}}, $time;
>}
>close $INFO;
>
>In the above code $file is very big (GBs in size), so I am getting out
>of memory!


So, you're storing times based on client ip and id, if I read correctly.

How about not keeping that data in memory, but writing it out as you
gather it?
- to a text file, to be processed further in a next stage of the script
- to a database format file (via DB_File module, or one of its sister
modules), so that you can do fast indexed searches on the data
- to a "real" database in a proper relational structure, to allow
you to do any kind of relational reporting rather easily

Also, where $time above apparently is a string containing some kind of
a timestamp, you could convert that timestamp into something else
(number of seconds from epoch comes to mind) that takes a lot less
memory than a string representation such as "2008-10-31 18:33:24".
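
As a rough sketch of the DB_File idea combined with the epoch conversion
(the file name is made up, the field positions are taken from the posted
snippet, and the timestamp format is assumed to match the example above):

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);
use Time::Piece;

# Tie a hash to an on-disk Berkeley DB file, so the data lives on disk
# rather than in RAM. DB_File values are plain strings, so the times are
# appended to a "|"-separated list per key.
tie my %time_table, 'DB_File', 'time_table.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie time_table.db: $!\n";

my $file = 'big.log';    # assumed file name
open my $info, '<', $file or die "Cannot open $file :$!\n";
while (<$info>) {
    chomp;
    my ($time, $cli_ip, $id) = (split /\|/)[3, 4, 7];

    # Store seconds since the epoch instead of the full timestamp string.
    my $epoch = Time::Piece->strptime($time, '%Y-%m-%d %H:%M:%S')->epoch;

    my $key = "$cli_ip|$id";
    $time_table{$key} = defined $time_table{$key}
        ? "$time_table{$key}|$epoch"
        : $epoch;
}
close $info;
untie %time_table;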
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)
 
 
friend.05@gmail.com
      10-31-2008
On Oct 31, 12:37 pm, Juha Laiho <(E-Mail Removed)> wrote:
> "(E-Mail Removed)" <(E-Mail Removed)> said:
> >I want to parse large log files (GBs in size)
> >
> >and I am reading 2-3 such files into a hash of arrays.
> >
> >But since the hash gets very big, the script is running out of memory.
>
> Do you really need to have the whole file available in order to
> extract the data you're interested in?
>
> >Example code:
> >
> >open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
> >while (<$INFO>)
> >{
> >    (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
> >     undef) = split('\|');
> >    push @{$time_table{"$cli_ip|$id"}}, $time;
> >}
> >close $INFO;
> >
> >In the above code $file is very big (GBs in size), so I am getting out
> >of memory!
>
> So, you're storing times based on client ip and id, if I read correctly.
>
> How about not keeping that data in memory, but writing it out as you
> gather it?
> - to a text file, to be processed further in a next stage of the script
> - to a database format file (via DB_File module, or one of its sister
>   modules), so that you can do fast indexed searches on the data
> - to a "real" database in a proper relational structure, to allow
>   you to do any kind of relational reporting rather easily
>
> Also, where $time above apparently is a string containing some kind of
> a timestamp, you could convert that timestamp into something else
> (number of seconds from epoch comes to mind) that takes a lot less
> memory than a string representation such as "2008-10-31 18:33:24".


Thanks.

If I output to a text file and read it again later, will I be able to
search it based on a key? (I mean, when I read it again, will I be able
to use it as a hash or not?)
 
 
xhoster@gmail.com
      10-31-2008
"(E-Mail Removed)" <(E-Mail Removed)> wrote:
> Hi,
>
> I want to parse large log files (GBs in size)
>
> and I am reading 2-3 such files into a hash of arrays.
>
> But since the hash gets very big, the script is running out of memory.
>
> What other approaches can I take?


The other approaches you can take depend on what you are trying to
do.

>
> Example code:
>
> open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
> while (<$INFO>)
> {
> (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
> undef) = split('\|');
> push @{$time_table{"$cli_ip|$id"}}, $time;
> }
> close $INFO;


You could get some improvement by having just a hash rather than a hash of
arrays. Replace the push with, for example:

$time_table{"$cli_ip|$id"} .= "$time|";

Then you would have to split the hash values into a list/array one at a
time as they are needed.
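
For example, pulling one entry back out later might look like this (the
trailing "|" left by the concatenation is harmless, since split drops
empty trailing fields); the key and values here are made up:

use strict;
use warnings;

# Toy data in the concatenated-string form suggested above.
my %time_table = ( '10.0.0.1|42' => '1225471004|1225471100|' );

# Split one key's value back into a list only when it is needed.
my @times = split /\|/, $time_table{'10.0.0.1|42'};
print "@times\n";    # 1225471004 1225471100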



Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
 
Jürgen Exner
      10-31-2008
"(E-Mail Removed)" <(E-Mail Removed)> wrote:
>If I output to a text file and read it again later, will I be able to
>search it based on a key? (I mean, when I read it again, will I be able
>to use it as a hash or not?)


That depends upon what you do with the data when reading it in again. Of
course you can construct a hash, but then you wouldn't have gained
anything. Why would this hash be any smaller than the one you were
trying to construct the first time?

Your current approach (put everything into a hash) and your current
hardware are incompatible.

Either get larger hardware (expensive) or rethink your basic approach,
e.g. use a database system, or compute your desired results on the fly
while parsing through the file, or write intermediate results to a file
in a format that later can be processed line by line, or any other of
the gazillion ways of conserving RAM. Don't you learn those techniques
in basic computer science classes any more?
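
For the database route, a minimal sketch using DBI with DBD::SQLite might
look like the following; the driver, the table layout, and the file name
are assumptions for illustration only:

use strict;
use warnings;
use DBI;

# An on-disk SQLite database, so the data never has to fit in RAM.
my $dbh = DBI->connect('dbi:SQLite:dbname=log.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS events (cli_ip TEXT, id TEXT, time TEXT)');

my $sth = $dbh->prepare('INSERT INTO events (cli_ip, id, time) VALUES (?, ?, ?)');

my $file = 'big.log';    # assumed file name
open my $info, '<', $file or die "Cannot open $file :$!\n";
while (<$info>) {
    chomp;
    my ($time, $cli_ip, $id) = (split /\|/)[3, 4, 7];
    $sth->execute($cli_ip, $id, $time);
}
close $info;

# Index the key columns so later lookups can replace the in-memory hash.
$dbh->do('CREATE INDEX IF NOT EXISTS idx_key ON events (cli_ip, id)');
$dbh->commit;
$dbh->disconnect;

Once the rows are in SQLite you can let the database do the grouping and
counting with SQL instead of holding everything in a Perl hash.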

jue
 
 
friend.05@gmail.com
      10-31-2008
On Oct 31, 1:22 pm, Jürgen Exner <(E-Mail Removed)> wrote:
> "(E-Mail Removed)" <(E-Mail Removed)> wrote:
> >If I output to a text file and read it again later, will I be able to
> >search it based on a key? (I mean, when I read it again, will I be able
> >to use it as a hash or not?)
>
> That depends upon what you do with the data when reading it in again. Of
> course you can construct a hash, but then you wouldn't have gained
> anything. Why would this hash be any smaller than the one you were
> trying to construct the first time?
>
> Your current approach (put everything into a hash) and your current
> hardware are incompatible.
>
> Either get larger hardware (expensive) or rethink your basic approach,
> e.g. use a database system, or compute your desired results on the fly
> while parsing through the file, or write intermediate results to a file
> in a format that later can be processed line by line, or any other of
> the gazillion ways of conserving RAM. Don't you learn those techniques
> in basic computer science classes any more?
>
> jue


Outputting to a file and reading it again will take a lot of time. It
will be very slow.

Will it help with speed if I use the DB_File module?
 
 
friend.05@gmail.com
      10-31-2008
On Oct 31, 1:41 pm, "(E-Mail Removed)" <(E-Mail Removed)>
wrote:
> On Oct 31, 1:22 pm, Jürgen Exner <(E-Mail Removed)> wrote:
>
> > Either get larger hardware (expensive) or rethink your basic approach,
> > e.g. use a database system, or compute your desired results on the fly
> > while parsing through the file, or write intermediate results to a file
> > in a format that later can be processed line by line, or any other of
> > the gazillion ways of conserving RAM.
>
> Outputting to a file and reading it again will take a lot of time. It
> will be very slow.
>
> Will it help with speed if I use the DB_File module?


Here is what I am trying to do.

I have two large files. I will read one file and see if each entry is
also present in the second file. I also need to count how many times it
appears in both files, and do other processing accordingly.

So if I process both files line by line, it will be like this: e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times over file2, so 100 iterations in total. I am dealing
with millions of lines, so this approach will be very slow.


This is my current code. It runs fine with small files.



open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
    (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
     undef) = split('\|');
    push @{$time_table{"$cli_ip|$dns_id"}}, $time;
}


open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!\n";
while (<$INFO_PRI>)
{
    (undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef,
     $pri_id, undef, $query, undef) = split('\|');
    $pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
    push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
}

@pri_ip_id_table_ = keys(%pri_ip_id_table);

for($i = 0; $i < @pri_ip_id_table_; $i++)    # file 2
{
    if($time_table{"$pri_ip_dns_table_[$i]"})    # chk if it is there in file 1
    {
        # do some processing.
    }

}



So for the above example, which approach will be best?


Thanks for your help.
 
 
Charlton Wilbur
      10-31-2008
>>>>> "JE" == Jürgen Exner <(E-Mail Removed)> writes:

JE> Don't you learn those techniques in basic computer science
JE> classes any more?

The assumption that someone who is getting paid to program has had -- or
even has had any interest in -- computer science classes gets less
tenable with each passing day.

Charlton


--
Charlton Wilbur
(E-Mail Removed)
 
 
J. Gleixner
      10-31-2008
(E-Mail Removed) wrote:
> On Oct 31, 1:41 pm, "(E-Mail Removed)" <(E-Mail Removed)>
> wrote:
>> Outputting to a file and reading it again will take a lot of time. It
>> will be very slow.
>>
>> Will it help with speed if I use the DB_File module?
>
> Here is what I am trying to do.
>
> I have two large files. I will read one file and see if each entry is
> also present in the second file. I also need to count how many times it
> appears in both files, and do other processing accordingly.
>
> So if I process both files line by line, it will be like this: e.g.
> file1 has 10 lines and file2 has 10 lines; for each line of file1 it
> will loop 10 times over file2, so 100 iterations in total. I am dealing
> with millions of lines, so this approach will be very slow.


Maybe you shouldn't do your own math. It'd be 10 reads for each file,
so 20 in total.
>
>
> This is my current code. It runs fine with small files.
>

use strict;
use warnings;

>
>
> open ($INFO, '<', $file) or die "Cannot open $file :$!\n";

open( my $INFO, ...

> while (<$INFO>)
> {
> (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
> undef) = split('\|');


my( $time, $cli_ip, $ser_ip, $id ) = (split( /\|/ ))[3,4,5,7];

> push @{$time_table{"$cli_ip|$dns_id"}}, $time;
> }

close( $INFO );
>
>
> open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!
> \n";


open( my $INFO_PRI, ...

> while (<$INFO_PRI>)
> {
> (undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef,
> $pri_id, undef, $query, undef) = split('\|');


my( $pri_time, $pri_cli_ip, $pri_id, $query ) = (split( /\|/ ))[3,4,7,9];

> $pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
> push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
> }


Read one file into memory/hash, if possible. As you're processing
the second one, store/push some data to process later, or process
it at that time, if it matches your criteria. There's no need to
store both in memory.
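
A minimal sketch of that: count the keys of the first file in a hash,
then stream the second file once and look each line up. Field positions
are taken from the snippets above; the file names are assumed.

use strict;
use warnings;

my ($file, $pri_file) = ('file1.log', 'file2.log');    # assumed names

# Pass 1: count how often each "cli_ip|id" key occurs in the first file.
my %count1;
open my $info, '<', $file or die "Cannot open $file :$!\n";
while (<$info>) {
    chomp;
    my ($cli_ip, $id) = (split /\|/)[4, 7];
    $count1{"$cli_ip|$id"}++;
}
close $info;

# Pass 2: stream the second file; no nested loop, no second big hash.
open my $info_pri, '<', $pri_file or die "Cannot open $pri_file :$!\n";
while (<$info_pri>) {
    chomp;
    my ($pri_cli_ip, $pri_id) = (split /\|/)[4, 7];
    my $key = "$pri_cli_ip|$pri_id";
    if (exists $count1{$key}) {
        # Key appears in both files; $count1{$key} is its count in file 1.
        # ... do the other processing here ...
    }
}
close $info_pri;

That way only the first file's keys (and counts) live in memory; the
second file is never held in RAM at all.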

>
> @pri_ip_id_table_ = keys(%pri_ip_id_table);
>
> for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2


Ugh.. the keys for %pri_ip_id_table are 'something|somethingelse';
how that works with that for loop is probably not what one
would expect.

> {
> if($time_table{"$pri_ip_dns_table_[$i]"}) #chk if it
> is there in file 1


Really? Where is pri_ip_dns_table_ defined?

> So for the above example, which approach will be best?

 