Perl storing huge data (300MB) in a scalar

 
 
kalpanashtty@gmail.com
 
      12-05-2006
Hello,
This is regarding an issue we face while storing large data in a
scalar variable. The problem is explained below:

We have a log file with 10 lines, and each line is approximately 300MB
long (continuous, with no newlines within it). Using Perl we read each
line and store it in a scalar variable. This works at first, but as the
script reads these huge lines we eventually see "Out of memory", and
memory consumption keeps increasing.

Has anyone faced this problem, and do you know how to handle this kind
of scenario?

Kalpana

 
 
 
 
 
J.D. Baldwin
 
      12-05-2006

In the previous article, <(E-Mail Removed)> wrote:
> Has anyone faced this problem, and do you know how to handle this
> kind of scenario?


I had a similar problem a few months back with huge log data that
wasn't broken by newlines. perldoc -f getc has what you probably
need. Something along the lines of:

my $chunk = '';
for ( 1 .. $howmanycharsdoyouwantatonce )
{
    $chunk .= getc FHANDLE;
}
--
_+_ From the catapult of |If anyone disagrees with any statement I make, I
_|70|___=}- J.D. Baldwin |am quite prepared not only to retract it, but also
\ / (E-Mail Removed)|to deny under oath that I ever made it. -T. Lehrer
***~~~~-----------------------------------------------------------------------
 
 
 
 
 
John W. Krahn
 
      12-05-2006
J.D. Baldwin wrote:
> In the previous article, <(E-Mail Removed)> wrote:
>>Has anyone faced this problem, and do you know how to handle this
>>kind of scenario?

>
> I had a similar problem a few months back with huge log data that
> wasn't broken by newlines. perldoc -f getc has what you probably
> need. Something along the lines of:
>
> my $chunk = '';
> for ( 1..$howmanycharsdoyouwantatonce )
> {
> $chunk .= getc FHANDLE;
> }


Read one character at a time? Ick!

read FHANDLE, my $chunk, $howmanycharsdoyouwantatonce;

Or:

local $/ = \$howmanycharsdoyouwantatonce;
my $chunk = <FHANDLE>;
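
For the original 300MB-per-line case, that read-a-fixed-number-of-bytes
trick can go in a loop so the whole line never has to sit in memory at
once. A rough sketch -- the file name, chunk size and pattern below are
made up for illustration, not taken from the original post:

use strict;
use warnings;

my $chunk_size = 8 * 1024 * 1024;        # 8MB per read (pick what fits your memory)
open my $fh, '<', 'huge.log' or die "Cannot open huge.log: $!";

local $/ = \$chunk_size;                 # <$fh> now returns at most $chunk_size bytes
while ( my $chunk = <$fh> ) {
    # work on one chunk at a time, e.g. count occurrences of a marker
    my $hits = () = $chunk =~ /ERROR/g;
    print "hits in this chunk: $hits\n" if $hits;
}
close $fh;

Note that a match spanning two chunks would be missed by this naive
split, so keep some overlap (or use a real record separator, as
suggested later in the thread) if that matters.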



John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall
 
 
xhoster@gmail.com
 
      12-05-2006
(E-Mail Removed) wrote:
> Hello,
> This is regarding an issue we face while storing large data in a
> scalar variable. The problem is explained below:
>
> We have a log file with 10 lines, and each line is approximately 300MB
> long (continuous, with no newlines within it). Using Perl we read each
> line and store it in a scalar variable. This works at first, but as the
> script reads these huge lines we eventually see "Out of memory", and
> memory consumption keeps increasing.
>
> Has anyone faced this problem, and do you know how to handle this kind
> of scenario?


I write code that doesn't have this problem. Since you haven't shown
us any of your code, I can't tell you which part of your code is the
problem.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
 
J.D. Baldwin
 
      12-05-2006

In the previous article, John W. Krahn <(E-Mail Removed)> wrote:
> Read one character at a time? Ick!


There seemed to be a good reason at the time. Anyway, performance
wasn't an issue.

> local $/ = \$howmanycharsdoyouwantatonce;
> my $chunk = <FHANDLE>;


That's a cool trick, thanks.
--
_+_ From the catapult of |If anyone disagrees with any statement I make, I
_|70|___=}- J.D. Baldwin |am quite prepared not only to retract it, but also
\ / (E-Mail Removed)|to deny under oath that I ever made it. -T. Lehrer
***~~~~-----------------------------------------------------------------------
 
 
greg.ferguson@icrossing.com
 
      12-05-2006

J.D. Baldwin wrote:
> In the previous article, John W. Krahn <(E-Mail Removed)> wrote:
> > Read one character at a time? Ick!

>
> There seemed to be a good reason at the time. Anyway, performance
> wasn't an issue.
>
> > local $/ = \$howmanycharsdoyouwantatonce;
> > my $chunk = <FHANDLE>;

>
> That's a cool trick, thanks.


You might want to check further into the $/ Perl variable...

http://perldoc.perl.org/perlvar.html#$RS

If there are literal strings in your file that you can use as a pseudo
end-of-line, then set $/ to that string and read the file as normal.
You'll have the advantage of not needing to check whether you read too
little or too much and then reconstruct your lines.
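
For instance, if each logical record in those huge lines happens to end
with some fixed marker, the read loop stays completely ordinary. The
"|END|" string and file name below are just placeholders, not something
from the original post:

use strict;
use warnings;

open my $fh, '<', 'huge.log' or die "Cannot open huge.log: $!";

local $/ = '|END|';              # assumed pseudo end-of-line marker
while ( my $record = <$fh> ) {
    chomp $record;               # chomp removes the $/ string here, not "\n"
    # each $record is now one manageable piece of the 300MB line
}
close $fh;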

Perl does well with file I/O, but it will grunt when it has to allocate
big chunks of memory to read the lines. If you read about slurp you'll
see it's almost never a good idea, and from doing some benchmarking I
found I was better off reading line by line, or in reasonably sized
fixed blocks. So I'd go about finding some way of determining the real
end-of-record marker.

 
 
Ala Qumsieh
 
      12-06-2006
(E-Mail Removed) wrote:

> Has anyone faced this problem, and do you know how to handle this
> kind of scenario?


use a 64-bit compiled version of Perl?

--Ala

 
 
J.D. Baldwin
 
      12-07-2006

In the previous article, <(E-Mail Removed)> wrote:
> If you read about slurp you'll see it's almost never a good idea
> [...]


So, a question then:

I have a very short script that reads the output of wget $URL like so:

my $wget_out = `/path/to/wget $URL`;

I am absolutely assured that the output from this URL will be around
10-15K every time. Furthermore, I need to search for a short string
that always appears near the end of the output (so there is no
advantage to cutting off the input after some shorter number of
characters).

So now that you have educated me a little, I am doing this:

$/ = \32000;    # much bigger than ever needed, small enough
                # to avoid potential memory problems in the
                # unlikely event of runaway output from wget

my $wget_out = `/path/to/wget $URL`;

if ( $wget_out =~ /$string_to_match/ )
{
    # do "OK" thing
}
else
{
    # do "not so OK" thing
}

Performance is important, but not extremely so; this script runs many
times per hour to validate the output of certain web servers. So if
there is overhead to the "obvious" line-by-line read-and-match method
of doing the same thing (which will always have to read about 200
lines before matching), then doing it that way is wasteful.

In your opinion, is this an exception to the "almost never a good
idea," or is this a case for slurping?

Also, if I can determine the absolute earliest point at which
$string_to_match could possibly appear, I suppose I can get a big
efficiency gain out of

my $earliest_char = 8_000;    # string of interest appears after
                              # AT LEAST 8,000 characters

if ( substr($wget_out, $earliest_char) =~ /$string_to_match/ )
{
    ...

Yes?
--
_+_ From the catapult of |If anyone disagrees with any statement I make, I
_|70|___=}- J.D. Baldwin |am quite prepared not only to retract it, but also
\ / (E-Mail Removed)|to deny under oath that I ever made it. -T. Lehrer
***~~~~-----------------------------------------------------------------------
 
 
xhoster@gmail.com
 
      12-07-2006
(E-Mail Removed) wrote:
> In the previous article, <(E-Mail Removed)> wrote:
> > If you read about slurp you'll see it's almost never a good idea
> > [...]


I would disagree. Slurp is quite often a good idea. Slurping data that
is (or has the potential to be) very large, when doing so is utterly
unnecessary, is rarely a good idea, though.

>
> So, a question then:
>
> I have a very short script that reads the output of wget $URL like so:
>
> my $wget_out = `/path/to/wget $URL`;
>
> I am absolutely assured that the output from this URL will be around
> 10-15K every time.


So how does this get turned into 300MB?

> Furthermore, I need to search for a short string
> that always appears near the end of the output (so there is no
> advantage to cutting off the input after some shorter number of
> characters).
>
> So now that you have educated me a little, I am doing this:
>
> $/ = \32000; # much bigger than ever needed, small enough
> # to avoid potential memory problems in the
> # unlikely event of runaway output from wget
>
> my $wget_out = `/path/to/wget $URL`;


Backticks in scalar context are not line oriented, so $/ is irrelevant
to them. Even in list context, backticks seem to slurp the whole thing
and only apply $/ to it after slurping.

If you are really worried about runaway wget, you should either open a pipe
and read from it yourself:

open my $fh, "/path/to/get $URL |" or die $!;
$/=\32000;
my $wget_out=<$fh>;

or just use system tools to do it and forget about $/ altogether:

my $wget_out = `/path/to/wget $URL | head -c 32000`;
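
Putting the capped pipe read together with the original OK / not-so-OK
check, a rough self-contained sketch looks like this (the @ARGV line and
the error handling are just for illustration, not part of the script
above):

use strict;
use warnings;

my ( $URL, $string_to_match ) = @ARGV;

open my $fh, "/path/to/wget $URL |" or die "Cannot start wget: $!";
local $/ = \32000;                  # never read more than 32000 bytes
my $wget_out = <$fh>;
close $fh;                          # may report failure if wget had more output; fine here

if ( defined $wget_out && $wget_out =~ /$string_to_match/ ) {
    # do "OK" thing
}
else {
    # do "not so OK" thing (also covers empty output)
}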

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
 
Tad McClellan
 
      12-07-2006
J.D. Baldwin <(E-Mail Removed)> wrote:

> my $wget_out = `/path/to/wget $URL`;



You can make it more portable by doing it in native Perl
rather than shelling out:

use LWP::Simple;
my $wget_out = get $URL;
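
Note that get() returns undef if the fetch fails, so the OK / not-so-OK
check from earlier in the thread can fold that case in. A rough sketch
($URL and $string_to_match are just placeholders here):

use strict;
use warnings;
use LWP::Simple;

my ( $URL, $string_to_match ) = @ARGV;   # placeholders for this sketch

my $content = get($URL);                 # undef if the request fails
if ( defined $content && $content =~ /$string_to_match/ ) {
    # do "OK" thing
}
else {
    # do "not so OK" thing (also covers a failed fetch)
}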


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
 
 