Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Perl Misc (http://www.velocityreviews.com/forums/f67-perl-misc.html)
-   -   best practice to avoiding excessive memory usage?? (http://www.velocityreviews.com/forums/t900663-best-practice-to-avoiding-excessive-memory-usage.html)

Chris 11-17-2006 02:38 PM

best practice to avoiding excessive memory usage??
 
I've come across the Perl issue of inefficient memory use when
dealing with large datasets. What are people's opinions on the best way
to work around this problem?

e.g.

My input file has this layout:
# Input 1_8:
0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
# Output 1_8:
0 0 1
# Input 1_9:
0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
# Output 1_9:
0 0 1
# Input 1_10:
0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
# Output 1_10:
0 0 1

With ~73000 pairs of inputs and outputs, the file is ~260MB in size.
However, reading the file into an array with the following code
snippet results in 1.2GB of memory usage:

#!/usr/bin/perl

use strict;
use warnings;

my ($patfile) = @ARGV;

open(my $fh, '<', $patfile) or die "Cannot open $patfile: $!";
my @array;
my $flag = 0;
my $i = 0;

while (<$fh>) {
    $flag = 0 if /^# Output/;          # an output header ends a block
    $flag = 1 and next if /^# Input/;  # an input header starts one
    if ($flag) {
        chomp;
        print "$i\n";
        $array[$i] = [ split ];        # one anonymous array per data line
        ++$i;
    }
}
exit;

I've read about the various workarounds that access the array via a file
on disk, but they don't seem very conducive to working with
complex data structures. Can you let me know your favourite
methods for working more efficiently? At the moment I'm just
reading/writing the files a bit at a time.
TIA

xhoster@gmail.com 11-17-2006 03:43 PM

Re: best practice to avoiding excessive memory usage??
 
Chris <ithinkiam@gmail.com> wrote:
> I've come across the perl issue of inefficient use of memory when
> dealing with large datasets. What are people's opinions on the best way
> to work around this problem.


That depends entirely on what you are trying to do with the data. You
haven't shown us anything about what you are trying to do. The code you
showed us does nothing but take memory and burn CPU cycles.

> e.g.
>
> My input file has this layout:
> # Input 1_8:
> 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
> # Output 1_8:
> 0 0 1
> # Input 1_9:
> 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
> # Output 1_9:
> 0 0 1
> # Input 1_10:
> 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
> # Output 1_10:
> 0 0 1
>
> With ~73000 pairs of input and outputs. The file is ~260Mb is size.
> However when reading the file into an array with the following code
> snippet results in 1.2Gb of memory usage:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my ($patfile) = @ARGV;
>
> open(my $FH, $patfile) or die;
> my @array;
> my $flag = 0;
> my $i = 0;
>
> while (<$FH>) {
> $flag = 0 if (/^\# Output/);
> $flag = 1 and next if (/^\# Input/);
> if ($flag) {
> chomp;
> print "$i\n";
> $array[$i] = [ split ];
> ++$i;
> }
> }
> exit;


This program reads in data and does nothing with it. You may as well
move the "exit" up to just before the "use strict;"

>
> I've read about the various work-arounds to access the array via a file
> on disk,


Which ones?

> but they don't seem to be very conducive for working with
> complex data structures.


Why not? What problems did you encounter?

> Can you guys/gals let me know of their
> favourite method to work more efficiently as at the moment I'm just
> reading/writing the files a bit at a time?


Reading and writing the files a bit at a time is an efficient method.
At least as far as memory is concerned.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB

Chris 11-17-2006 04:58 PM

Re: best practice to avoiding excessive memory usage??
 
xhoster@gmail.com wrote:

> Chris <ithinkiam@gmail.com> wrote:
>> I've come across the perl issue of inefficient use of memory when
>> dealing with large datasets. What are people's opinions on the best
>> way to work around this problem.

>
> That depends entirely on what you are trying to do with the data. You
> haven't shown us anything about what you are trying to do. The code
> you showed us does nothing but take memory and burn CPU cycles.


Exactly. I was trying to give an example of the inefficient use of
memory by perl - nothing more and nothing less.
[snip]

>>
>> I've read about the various work-arounds to access the array via a
>> file on disk,

>
> Which ones?


The ones in the FAQ. 'How can I make my Perl program take less memory?'

>> but they don't seem to be very conducive for working with
>> complex data structures.

>
> Why not? What problems did you encounter?


AFAICS you can either store 1D arrays as lines in a file or use some sort
of DB to manage the data. I may use these in the future, but at the
moment I'm looking for a reasonably straightforward method to make an
existing program more memory efficient.

>
>> Can you guys/gals let me know of their
>> favourite method to work more efficiently as at the moment I'm just
>> reading/writing the files a bit at a time?

>
> Reading and writing the files a bit at a time is an efficient method.
> At least as far as memory is concerned.
>


OK. That's what I'll do for the time being. However, I'm still
interested in hearing how other people have overcome this problem.
Thanks.

Martijn Lievaart 11-17-2006 05:03 PM

Re: best practice to avoiding excessive memory usage??
 
On Fri, 17 Nov 2006 15:43:48 +0000, xhoster wrote:

>> Can you guys/gals let me know of their
>> favourite method to work more efficiently as at the moment I'm just
>> reading/writing the files a bit at a time?

>
> Reading and writing the files a bit at a time is an efficient method.
> At least as far as memory is concerned.


That is the best method.

Others include:

- Add more memory. 1.26G data usage is not that much and memory is cheap.

- Process the file in stages, producing intermediary results (and files)
to make the next stage efficient.

- Put the data in a database. (Optionally producing a new datafile from
the database after processing).
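
The database route could look like this minimal sketch using DBI with SQLite (module choice, table layout, and file names are my own illustration, not from the thread; it assumes DBD::SQLite is installed):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # CPAN: DBI plus DBD::SQLite

# Load each input/output pair into SQLite, then stream rows back one at
# a time instead of holding ~73000 records in RAM at once.
my $dbh = DBI->connect('dbi:SQLite:dbname=patterns.db', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS patterns
          (id INTEGER PRIMARY KEY, input TEXT, output TEXT)');

my $ins = $dbh->prepare('INSERT INTO patterns (input, output) VALUES (?, ?)');
$ins->execute('0.28496 0.10340 0.33403', '0 0 1');   # one pair per row

my $sel = $dbh->prepare('SELECT input, output FROM patterns');
$sel->execute;
while (my ($input, $output) = $sel->fetchrow_array) {
    my @values = split ' ', $input;   # split only when the row is needed
    # ... process @values against $output ...
}
$dbh->disconnect;
```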

M4
--
Redundancy is a great way to introduce more single points of failure.


xhoster@gmail.com 11-17-2006 06:00 PM

Re: best practice to avoiding excessive memory usage??
 
Chris <ithinkiam@gmail.com> wrote:
> >
> >> Can you guys/gals let me know of their
> >> favourite method to work more efficiently as at the moment I'm just
> >> reading/writing the files a bit at a time?

> >
> > Reading and writing the files a bit at a time is an efficient method.
> > At least as far as memory is concerned.
> >

>
> OK. That's what I'll do for the time being. However, I'm still
> interested in hearing how other people have overcome this problem.


I've used probably dozens of different methods to overcome the problem of
excess memory use, but each one is suited to only specific kinds of
problems. Changing algorithms so that you don't keep everything in memory
at once. Using Perl to transform the problem into something that can be
solved by the system sort routine. Changing languages to something more
memory efficient, either entirely, or using Inline, or just using Perl to
pre-process into a C-friendly format, then using C, then using Perl to
post-process back into the desired format. Using DBM::Deep. Storing
"records" as whole strings and splitting them on the fly when needed
(occasionally using tied arrays or hashes to hide this fact).
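
As an illustration of the system-sort idea, a minimal sketch (the sort key and file names are assumptions, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rather than sorting a huge array inside Perl, write one record per
# line and let sort(1) sort the file out-of-core.
sub external_sort {
    my ($infile, $outfile) = @_;
    # Numeric sort on the first whitespace-separated field.
    system('sort', '-n', '-k1,1', '-o', $outfile, $infile) == 0
        or die "system sort failed: $?";
}

# Usage (file names are hypothetical):
# external_sort('records.txt', 'records.sorted');
```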

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB

Ted Zlatanov 11-17-2006 08:53 PM

Re: best practice to avoiding excessive memory usage??
 
On 17 Nov 2006, ithinkiam@gmail.com wrote:

> OK. That's what I'll do for the time being. However, I'm still
> interested in hearing how other people have overcome this problem.


As the size of your data grows, the solutions grow more complex too.
Everyone knows how to manage data that is 1% of the system memory well.
Few manage data that is 500% of the system memory well.

Depending on your application you'll have to find the right solution.
Usually you'll end up with a database (not necessarily RDBMS) or
you'll split your data into several manageable pieces, to be processed
and loaded sequentially on one server or in parallel on multiple
servers.

For most problems, using a RDBMS database is the fastest, cheapest,
simplest way to manage large amounts of data. You see, then you can
just blame the DBAs when things don't work right :)

Ted

Mumia W. (reading news) 11-17-2006 10:12 PM

Re: best practice to avoiding excessive memory usage??
 
On 11/17/2006 08:38 AM, Chris wrote:
> I've come across the perl issue of inefficient use of memory when
> dealing with large datasets. What are people's opinions on the best way
> to work around this problem.
[snip]


Arrays have a lot of overhead, so don't split the lines into arrays;
just put them into the main array without splitting.

When you need the data from a line, split it then.
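
A minimal sketch of that approach (the data and variable names are illustrative, and it stores both input and output lines for simplicity):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines;   # one scalar per record: far cheaper than one array per record

while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /^#/;   # skip the "# Input"/"# Output" headers
    push @lines, $line;      # store the raw line, unsplit
}

# Split a record only at the moment its values are needed.
my @values = split ' ', $lines[0];
print scalar(@values), " values in first record\n";   # prints "3 values in first record"

__DATA__
# Input 1_8:
0.28496 0.10340 0.33403
# Output 1_8:
0 0 1
```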


--
paduille.4060.mumia.w@earthlink.net

Peter J. Holzer 11-18-2006 01:37 PM

Re: best practice to avoiding excessive memory usage??
 
On 2006-11-17 14:38, Chris <ithinkiam@gmail.com> wrote:
> I've come across the perl issue of inefficient use of memory when
> dealing with large datasets.


You aren't the first one. There are modules for dealing with large
numeric arrays for a reason.

> What are people's opinions on the best way
> to work around this problem.


So far I haven't needed them but searching CPAN for appropriate modules
would certainly be among the first things I'd try. I have also
bookmarked something called "PDL - The Perl Data Language" just in case
I'll ever need it.
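
For what it's worth, a minimal PDL sketch (assuming the PDL module from CPAN is installed; nobody in this thread has actually tried it):

```perl
use strict;
use warnings;
use PDL;

# A piddle stores its values as packed C doubles (8 bytes each) instead
# of full Perl scalars, so a 73000-row numeric matrix is far smaller.
my $inputs = pdl([ [0.28496, 0.10340, 0.33403],
                   [0.38225, 0.98944, 0.03805] ]);

print $inputs->info, "\n";   # dimensions and element type
print $inputs->sum,  "\n";
```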

[snip]
> With ~73000 pairs of input and outputs. The file is ~260Mb is size.
> However when reading the file into an array with the following code
> snippet results in 1.2Gb of memory usage:


This is not surprising. Perl scalars take quite a bit of space. Assuming
no overhead from memory management (which is hardly realistic), a floating
point number takes 20 bytes, and a string takes 25 + n bytes (where n is
the length of the string).

> $array[$i] = [ split ];


You are storing your values as strings here. Since all your values seem
to be 7 characters long you could reduce the size of each element from
32 to 20 bytes, saving almost 40 %, by converting each value into a
number:

$array[$i] = [ map { $_ + 0 } split ];

In reality, the space saving may be less or more, depending on the
memory management of your perl implementation, the exact shape of your
data and other conditions.

Note that this solution is brittle: If you access the elements of your
arrays in a string context, perl may convert them back into strings, and
you will need even more space than you needed in the first place.
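
The effect can be measured directly with the CPAN module Devel::Size (my suggestion, not something from the thread); exact numbers vary with perl version and platform, so none are asserted here:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module

# Compare the footprint of one row stored as strings vs. as numbers.
my @as_strings = split ' ', '0.28496 0.10340 0.33403 0.86176';
my @as_numbers = map { $_ + 0 } @as_strings;

printf "strings: %d bytes, numbers: %d bytes\n",
       total_size(\@as_strings), total_size(\@as_numbers);
```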

hp


--
_ | Peter J. Holzer | > Why should one invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | hjp@hjp.at | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd

Chris 11-20-2006 11:02 AM

Re: best practice to avoiding excessive memory usage??
 
Peter J. Holzer wrote:

> On 2006-11-17 14:38, Chris <ithinkiam@gmail.com> wrote:
>> I've come across the perl issue of inefficient use of memory when
>> dealing with large datasets.

>
> You aren't the first one. There are modules for dealing with large
> numeric arrays for a reason.
>
> So far I haven't needed them but searching CPAN for appropriate
> modules would certainly be among the first things I'd try. I have also
> bookmarked something called "PDL - The Perl Data Language" just in
> case I'll ever need it.


Yes. I've seen that one, it looks very useful indeed. I'm sure I'll use
it in the future.

>> With ~73000 pairs of input and outputs. The file is ~260Mb is size.
>> However when reading the file into an array with the following code
>> snippet results in 1.2Gb of memory usage:

>
> You are storing your values as strings here. Since all your values
> seem to be 7 characters long you could reduce the size of each element
> from 32 to 20 bytes, saving almost 40 %, by converting each value into
> a number:
>
> $array[$i] = [ map { $_ + 0 } split ];
>
> In reality, the space saving may be less or more, depending on the
> memory management of your perl implementation, the exact shape of your
> data and other conditions.


Indeed, the above makes almost no difference (~100MB) in my example
code... :(


Chris 11-20-2006 11:05 AM

Re: best practice to avoiding excessive memory usage??
 
Thanks for all the useful replies. I now have better ideas for future
memory management.

