Can this be more efficient?

 
 
Colin chaplin
      03-15-2005
Hi All

I've got a little quick and dirty Perl program that scans through logfiles
for certain words and counts them up. Trouble is, it's quite slow, and I was
wondering if you smart chaps could suggest a way of speeding it up. The
logfiles are about 50MB in size, and I'm running it on a reasonably powerful
PC.

The pseudocode is:

For each logfile in the directory:
    read into an @rray
    close file
    for each line of array
        do text processing stuff

print results on screen

I've played about with not reading into an array, just reading the file line
by line; no obvious change either way.

As I said, it's written using basic Perl data types, so I was wondering if
there is a quicker way to store and process this data?


 
Paul Lalli
      03-15-2005
"Colin chaplin" <(E-Mail Removed)> wrote in message
news:d16qmq$46t$1$(E-Mail Removed)...
> Hi All
>
> I've got a little quick and dirty Perl program that scans through logfiles
> for certain words and counts them up. Trouble is, it's quite slow, and I was
> wondering if you smart chaps could suggest a way of speeding it up. The
> logfiles are about 50MB in size, and I'm running it on a reasonably powerful
> PC.
>
> The pseudocode is:
>
> For each logfile in the directory:
>     read into an @rray
>     close file
>     for each line of array
>         do text processing stuff
>
> print results on screen
>
> I've played about with not reading into an array, just reading the file line
> by line; no obvious change either way.


I don't know about speed efficiency, but this is certainly more
efficient space-wise.

How are you determining "no obvious change"? Have you used the
Benchmark module?
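If you want actual numbers, something along these lines (just a rough
sketch -- the file name and search word are placeholders) would compare
slurping the file against reading it line by line:

    use strict;
    use Benchmark qw(cmpthese);

    my $file = 'opr20050101.log';    # placeholder -- substitute a real logfile
    my $word = 'searchedWord';       # placeholder search word

    cmpthese( 5, {
        slurp => sub {
            open my $fh, '<', $file or die "$file: $!";
            my @lines = <$fh>;
            my $count = 0;
            $count++ for grep /\b\Q$word\E\b/, @lines;
        },
        line_by_line => sub {
            open my $fh, '<', $file or die "$file: $!";
            my $count = 0;
            while (<$fh>) { $count++ if /\b\Q$word\E\b/ }
        },
    } );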

Also, is there any reason you can't just use grep, rather than an entire
Perl script?

If you really want Perl...

perl -ne '$words++ if /\bsearchedWord\b/; END { print "$words words found\n" }' *.log

(Note that this assumes the word will appear at most once per line.)
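If the word can show up several times on a line, a variation like this (same
placeholder word) counts every match by using the match operator in list
context:

perl -ne '$words += () = /\bsearchedWord\b/g; END { print "$words words found\n" }' *.log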

Paul Lalli

 
Colin chaplin
      03-15-2005
> > I've got a little quick and dirty Perl program that scans through logfiles
> > for certain words and counts them up. Trouble is, it's quite slow, and
>
> > by line, no obvious change either way
>
> I don't know about speed efficiency, but this is certainly more
> efficient space-wise.
>
> How are you determining "no obvious change"? Have you used the
> Benchmark module?
>

By the very scientific "gosh this is taking a long time" make-a-cup-of-tea,
tidy-up, "oh it's still going" approach. I was looking for a magic bullet
perhaps that'd make it work in a matter of seconds. Not so much to speed
this up as it's only going to run a few times, but so in future when I do
similar it's better.


> Also, is there any reason you can't just use grep, rather than an entire
> Perl script?
>


I was being a little brief with my description; there's quite a lot of
pattern matching going on there that I use a function for. Almost doable in
grep, I guess, but apart from speed the program works pretty well, so I'm
happy to keep it. I'd post it here but I'd probably only get slated for
coding style >

> If you really want Perl...
>
> perl -ne '$words++ if /\bsearchedWord\b/; END { print "$words words found\n" }' *.log
>

I can't get that syntax to run from the command line, not sure why (NT
platform: "can't find string terminator ' before EOF"), but if I understand
the -ne switch, it's effectively opening every file and going through it line
by line, which is pretty much what I'm doing now?

I did just have a look at $/ but not sure that would help me out.
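(From what I read, $/ is just the input record separator -- set it to undef
and <> slurps a whole file in one go, roughly like the lines below -- so I
don't think it changes the matching itself.)

    local $/;                    # undef: slurp mode
    my $whole_file = <INFILE>;   # reads the entire file as one string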

Thanks for your time!

Colin


 
Tad McClellan
      03-15-2005
Colin chaplin wrote:

> for certain words and counts them up. Trouble is, it's quite slow



> The pseudocode is:
>
> For each logfile in the directory:
>     read into an @rray
>     close file
>     for each line of array
>         do text processing stuff
>
> print results on screen



The pseudo-answer is: improve the "do text processing stuff" part.

Of course we can't help you with that, since we cannot see that part...


--
Tad McClellan    SGML consulting / Perl programming
Fort Worth, Texas
 
Paul Lalli
      03-15-2005
"Colin chaplin" <(E-Mail Removed)> wrote in message
news:d16uk2$9h5$1$(E-Mail Removed)...
> Paul Lalli wrote:
>
> > Also, is there any reason you can't just use grep, rather than an
> > entire Perl script?
>
> I was being a little brief with my description,


You should always make your description as complete as possible.
How else can you expect people to help you do what you want to do?

> there's quite a lot of
> pattern matching going on there that I use a function for. Almost doable
> in grep, I guess, but apart from speed the program works pretty well, so
> I'm happy to keep it. I'd post it here but I'd probably only get slated
> for coding style >


You say that as though it's a bad thing...

> > If you really want Perl...
> >
> > perl -ne '$words++ if /\bsearchedWord\b/; END { print "$words words found\n" }' *.log
>
> I can't get that syntax to run from the command line, not sure why (NT
> platform: "can't find string terminator ' before EOF")


Windows requires command-line arguments to be delimited by " instead of ',
which means you'll have to backslash all the " characters inside the code.
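Something like this should then work from cmd.exe (an untested sketch -- and
note that cmd.exe may not expand *.log for you, so the wildcard might need
separate handling):

perl -ne "$words++ if /\bsearchedWord\b/; END { print \"$words words found\n\" }" *.log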

> but if I understand
> the -ne switch, it's effectively opening every file, going through it line
> by line, which is pretty much what I'm doing now?


No. What you said you were currently doing is reading the file line by
line, storing each line into an array, and then going through the array
line by line. Quite different.
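To make the difference concrete, the two shapes look roughly like this (file
name, search word, and variable names are just placeholders):

    my $file  = 'some.log';   # placeholder
    my $count = 0;

    # Shape 1 -- what you described: read everything into an array, then loop over it.
    open my $fh, '<', $file or die "$file: $!";
    my @lines = <$fh>;
    for my $line (@lines) {
        $count++ if $line =~ /\bsearchedWord\b/;
    }

    # Shape 2 -- what -n does: handle each line as it is read, never holding the whole file.
    open my $fh2, '<', $file or die "$file: $!";
    while ( my $line = <$fh2> ) {
        $count++ if $line =~ /\bsearchedWord\b/;
    }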


Paul Lalli

 
xhoster@gmail.com
      03-15-2005
"Colin chaplin" <(E-Mail Removed)> wrote:
> Hi All
>
> I've got a little quick and dirty Perl program that scans through
> logfiles for certain words and counts them up. Trouble is, it's quite
> slow and I was wondering if you smart chaps could suggest a way of
> speeding it up. The logfiles are about 50MB in size, and I'm running it
> on a reasonably powerful PC.
>
> The pseudocode is:
>
> For each logfile in the directory:
>     read into an @rray
>     close file
>     for each line of array
>         do text processing stuff
>
> print results on screen


So, how about not doing the text processing stuff? If it is still slow,
then you know the slow step is one of the earlier steps. If it is not still
slow, you know it is the text processing stuff. Then you can show us the
real code used to do the part that is slow, and give us real numbers, so we
can try to replicate the problem.
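Or, if you would rather time the steps than delete them, a rough sketch with
the Time::HiRes module (placeholder file name) separates the read time from
the processing time:

    use Time::HiRes qw(gettimeofday tv_interval);

    my $t0 = [gettimeofday];
    open my $fh, '<', 'opr20050101.log' or die $!;   # placeholder file
    my @lines = <$fh>;
    close $fh;

    my $t1 = [gettimeofday];
    # ... do the text processing stuff on @lines here ...
    my $t2 = [gettimeofday];

    printf "read: %.2fs  process: %.2fs\n",
           tv_interval( $t0, $t1 ), tv_interval( $t1, $t2 );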

Xho

Ala Qumsieh
      03-15-2005
Colin chaplin wrote:
> I'd post it here but I'd probably only get slated for coding
> style >


What you're saying is akin to me going to a car mechanic and telling her
"my car is broken. how do I fix it?" without me allowing her to see the car.

You won't get any useful replies like that. Plus, a little critique of
your coding style should be a Good Thing (tm).

--Ala
 
Colin chaplin
      03-15-2005

"Ala Qumsieh" <(E-Mail Removed)> wrote in message
news:g4GZd.21999$(E-Mail Removed). com...
> Colin chaplin wrote:
> > I'd post it here but I'd probably only get slated for coding
> > style >
>
> What you're saying is akin to me going to a car mechanic and telling her
> "my car is broken. how do I fix it?" without me allowing her to see the car.
>
> You won't get any useful replies like that. Plus, a little critique of
> your coding style should be a Good Thing (tm).



Ok, here goes, warts and all (I removed names to protect the guilty). The
program reads through Mailsweeper log files and spits out how many emails
for certain domains are recorded in the logs, and also a list of recipients
and the number of emails they receive. The log is a trimmed version of an SMTP
communication.


Please note:
* I'm not a programmer, be gentle
* This isn't a production program
* Clever regular expressions make my head explode when I look at them a
while after writing the code
* I'm barely a techy these days
* Don't run with scissors

In answer to another post, I had the program set up so that it read line by
line, and changed it to gobble the entire file at once to see if it would have
a performance impact.

Thanks to all for taking the time!

&opendir (".");
$DOMS{'thisdom'}=12;

sub opendir
{
    # Recursively goes through directories and process all mailsweeper files
    my $dir = $_[0];
    my $name;
    opendir(DIRHANDLE, $dir) || die "Cannot opendir $dir: $!";
    foreach $name (sort readdir(DIRHANDLE))
    {
        if (($name eq '.') || ($name eq '..') || (lc($name) eq lc($currentlog)))
        {
        }
        else
        {
            # Only open the file if it looks like a MSW Logfile
            if ($name =~ /\Aopr[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.log/i)
            {
                &processFile($dir . "\\" . $name);
            }
            if (-d $dir . "\\" . $name)
            {
                print "opening " . $dir . "\\" . $name . "\n";
                &opendir($dir . "\\" . $name);
            }
        }
    }

    closedir(DIRHANDLE);
}


sub processFile
{
    my $filename = $_[0];
    my @filetext;

    print " Opening $filename\n";
    open (INFILE, $filename) || die ("EK $!: $filename");
    while (<INFILE>)
    {
        push(@filetext, $_);
    }
    close (INFILE);
    foreach (@filetext)
    {
        if (/RCPT\sTO\:/i)
        {
            #print "[$_]\n";
            if (/(\<.*?\>)/)
            {
                $bits = $1;
                $bits =~ s/\<//g;
                $bits =~ s/\>//g;
                $bits =~ s/RCPT TO\://g;
                $bits =~ s/\s//g;
                $bits = lc ($bits);
                if (&valid_domain($bits))
                {
                    $DOMS{$bits}++;
                }
                # print $bits;
            }
        }
        # else {print "NO: $_\n";}
    }
}

$DOMMER{'testdom'}=5;

open (OUTFILE, ">allemails.csv") || die ("eeK:$!");
open (OUTF, ">alldoms.csv") || die ("yikes:$!");
foreach $key ( sort by_this keys %DOMS )
{
    ($pre, $aft) = split(/\@/, $key);
    $DOMMER{$aft} = $DOMMER{$aft} + 1;
    print OUTFILE "\"$key\",\"$DOMS{$key}\"\n";
}

foreach $key ( sort by_this keys %DOMMER )
{
    print OUTF "\"$key\",\"$DOMMER{$key}\"\n";
}


sub by_this
{
    ($tat, $abit) = split (/\@/, $a);
    ($tat, $bbit) = split (/\@/, $b);
    ($abit cmp $bbit) || ($a cmp $b);
}

sub valid_domain
{
    #decide if this is a xxxxx doman
    my $dom = $_[0];
    ($tat, $dom) = split(/\@/, $dom);
    if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}
    if ($dom =~ (/thatdomain.co\.uk/)) {return 1;}
    if ($dom =~ (/otherdomain\.com/)) {return 1;}

    return (0);
}


 
Martin Kissner
      03-15-2005
Colin chaplin wrote:

> I was being a little brief with my description, there's quite a lot of
> pattern matching going on there that I use a function for.


The regexes might be the reason why the script takes such a long time
to run. I am pretty new to Perl and am reading the Camel Book. In
Chapter 5, "Pattern Matching", there are examples which are said to take
years (actually millions of years) to finish because of backtracking.

So one approach to improving the performance of your script might be to
optimize your regexes.
HTH
Martin

--
perl -e '$S=[[73,116,114,115,31,96],[108,109,114,102,99,112],
[29,77,98,111,105,29],[100,93,95,103,97,110]];
for(0..3){for$s(0..5){print(chr($S->[$_]->[$s]+$_+1))}}'
 
Fabian Pilkowski
      03-15-2005
* Colin chaplin wrote:
>
> Please note:
> * I'm not a programmer, be gentle


... but learning something new can't hurt, can it?

> * This isnt a production program


... but speeding up a program is almost always interesting.

> * Clever regular expressions make my head explode when I look at them a
> while after writing the code


... when clever regexes are necessary, I use them (and write a short
comment alongside to remind myself later).

> * I'm barely a techy these days
> * Don't run this scissors
>
> In answer to another post, I had the program setup so that it read line by
> line, and changed it to gobble the entire file at once to see if would have
> a performance impact
>
> Thanks to all for taking the time!
>
> &opendir (".");
> $DOMS{'thisdom'}=12;
>
> sub opendir {
> # Recursively goes through directories and process all mailsweeper files


[...]

> }


To me it looks smarter to do such a recursive directory traversal with
a standard Perl module like File::Find. Your sub could look like this:

    use File::Find;
    sub my_opendir {
        my $dir = shift;
        find( sub {
            # $_ contains the current filename only, while
            # $File::Find::name contains the complete pathname
            if ( $_ =~ /opr\d{8}\.log$/i ) {
                processFile( $File::Find::name );
            }
        }, $dir );
    }

Btw, it's always a bad idea to give your own functions the same names as
Perl's built-in ones. Hence I named your sub »my_opendir« instead of
»opendir«.
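You would call it just as before, e.g. »my_opendir(".");«.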

>
> sub processFile {
> my $filename=$_[0];
> my @filetext;
> print " Opening $filename\n";
> open (INFILE,$filename) || die ("EK $!: $filename");
> while (<INFILE>){
> push(@filetext,$_);
> }
> close (INFILE);


Eh, this will read the complete file into the array @filetext. That can
be done more simply with »my @filetext = <INFILE>;« -- but that's not what
some others in this thread mean by "use a while loop". See below.

> foreach (@filetext){
> if (/RCPT\sTO\:/i){
> #print "[$_]\n";
> if (/(\<.*?\>)/){
> $bits=$1;
> $bits =~ s/\<//g;
> $bits =~ s/\>//g;


If you don't capture those angle brackets into $1, you won't have to delete
them afterwards.

> $bits =~ s/RCPT TO\://g;
> $bits =~ s/\s//g;
> $bits = lc ($bits);


The search pattern above is modified by »i«, i.e. it matches
case-insensitively. Here you delete all uppercase occurrences of "RCPT TO:"
but not the lowercase ones. Perhaps you should do the lc() first.

> if (&valid_domain($bits)){
> $DOMS{$bits}++;
> }
> # print $bits;
> }
> }
> # else {print "NO: $_\n";}
> }
> }


Using a while loop to read in a file line by line:

    sub processFile {
        my $filename = shift;
        print " Opening $filename\n";
        open INFILE, $filename or die "EK $!: $filename";
        while ( <INFILE> ) {
            if ( /RCPT\sTO\:/i and /<(.*?)>/ ) {
                my $bits = lc $1;
                $bits =~ s/RCPT TO\://g;
                $bits =~ s/\s//g;
                $DOMS{$bits}++ if valid_domain($bits);
            }
        }
        close INFILE;
    }

>
> $DOMMER{'testdom'}=5;
>
> open (OUTFILE,">allemails.csv") || die ("eeK:$!");
> open (OUTF,">alldoms.csv") || die ("yikes:$!");
> foreach $key ( sort by_this keys %DOMS )
> {
>
> ($pre,$aft) = split(/\@/,$key);
> $DOMMER{$aft}=$DOMMER{$aft}+1;
> print OUTFILE "\"$key\",\"$DOMS{$key}\"\n";


Before using backslashed quotes like that, learn about Perl's quoting
operators, like q// and qq//:

print OUTFILE qq{"$key","$DOMS{$key}"\n};

>
> }
>
> foreach $key ( sort by_this keys %DOMMER )
> {
> print OUTF "\"$key\",\"$DOMMER{$key}\"\n";
>
> }
>
>
> sub by_this
> {
> ($tat,$abit)=split (/\@/,$a);
> ($tat,$bbit)=split (/\@/,$b);
> ($abit cmp $bbit) || ($a cmp $b);
> }


If a function returns values which aren't interesting at the moment, just
throw them away:

( undef, $abit ) = split /@/, $a;

>
> sub valid_domain
> {
> #decide if this is a xxxxx doman
>
> my $dom=$_[0];
> ($tat,$dom)=split(/\@/,$dom);


See above and throw $tat away

>
> if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}
> if ($dom =~(/thatdomain.co\.uk/)) {return 1;}
> if ($dom =~(/otherdomain\.com/)) {return 1;}


In Perl you can also write an if as a statement modifier after the
expression. In cases like this I think it's more readable, due to fewer
parentheses:

    return 1 if $dom =~ /thisdomain\.co\.uk/;
    return 1 if $dom =~ /thatdomain\.co\.uk/;
    return 1 if $dom =~ /otherdomain\.com/;

You could do it with »or« also:

    return 1 if $dom =~ /thisdomain\.co\.uk/
             or $dom =~ /thatdomain\.co\.uk/
             or $dom =~ /otherdomain\.com/;

I hope you like my suggestions.

regards,
fabian
 