Parsing large web server logfiles efficiently

 
 
ashutosh.gaur@gmail.com
      01-14-2006
Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

Following is a snippet of my perl routine....

open(INFO, $in_file);
open(DAT, $out_file);

while (<INFO>) {

my ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent);

($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
or next;

my $decrypt_url = <decrypting subroutine> $url;

print DAT $host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $decrypt_url, $protocol, $status,
$bytes, $referer, $agent, "\n";
}

---------------------------------------------------------------------------------------------------
This script takes about 50 minutes to process all the 8 files. I need
some suggestions to improve the performance and bring the processing
time down.

The hardware is a good 8-CPU (1.2 GHz) machine with 8 GB of memory.
This machine will be used solely for file processing and running one
more application (Informatica).

thanks
Ash

 
l v
      01-14-2006
(E-Mail Removed) wrote:
> Hi
> I'm a perl newbie. I've been given the task of parsing through very
> large (500MB) web server log files in an efficient manner. I need to
> parse about 8 such files in parallel and create corresponding csv files
> as output. This needs to be done every hour. In other words, the entire
> parsing of about 8 files should complete well within 30 minutes. The
> remaining 30 minutes are required for other database related activities
> that need to be performed on the csv files generated by the perl
> script.
>
> Following is a snippet of my perl routine....
>
> open(INFO, $in_file);
> open(DAT, $out_file);
>
> while (<INFO>) {
>
> my ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent);


You can declare your variables in the next statement, so delete the
statement above and instead put my at the beginning of the assignment,
as shown below.

>
> *my* ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent) =
> /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
> (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
> or next;


Try replacing the regexp with split() on space(s) into an array.

>
> my $decrypt_url = <decrypting subroutine> $url;
>
> print DAT $host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $decrypt_url, $protocol, $status,
> $bytes, $referer, $agent, "\n";


You can then use print map { "$_," } @array or join() to add in the
commas for your CSV output, as in the sketch below.
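
A minimal sketch of that idea, assuming no field ever contains an
embedded space (not true of the request and agent fields in a real
log, so treat it as a starting point only):

while (my $line = <INFO>) {
    chomp $line;
    my @fields = split / /, $line;        # split on single spaces
    print DAT join(',', @fields), "\n";   # rejoin with commas for CSV
}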

> }
>
> ---------------------------------------------------------------------------------------------------
> This script takes about 50 minutes to process all the 8 files. I need
> some suggestions to improve the performance and bring the processing
> time down.
>

[snip]
>
> thanks
> Ash


I'm sure there are much more efficient ways, but something to start
with.

Len

 
Tad McClellan
      01-14-2006
(E-Mail Removed) <(E-Mail Removed)> wrote:

> open(INFO, $in_file);



You should always, yes *always*, check the return value from open():

open(INFO, $in_file) or die "could not open '$in_file' $!";
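
The same goes for the output handle, and the three-argument form with
lexical filehandles is safer still:

open(my $info, '<', $in_file)  or die "could not open '$in_file' $!";
open(my $dat,  '>', $out_file) or die "could not open '$out_file' $!";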


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
MikeGee
      01-15-2006

(E-Mail Removed) wrote:
> Hi
> I'm a perl newbie. I've been given the task of parsing through very
> large (500MB) web server log files in an efficient manner. I need to
> parse about 8 such files in parallel and create corresponding csv files
> as output. This needs to be done every hour. In other words, the entire
> parsing of about 8 files should complete well within 30 minutes. The
> remaining 30 minutes are required for other database related activities
> that need to be performed on the csv files generated by the perl
> script.
>
> Following is a snippet of my perl routine....
>
> open(INFO, $in_file);
> open(DAT, $out_file);
>
> while (<INFO>) {
>
> my ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent);
>
> ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent) =
> /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
> (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
> or next;
>
> my $decrypt_url = <decrypting subroutine> $url;
>
> print DAT $host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $decrypt_url, $protocol, $status,
> $bytes, $referer, $agent, "\n";
> }
>
> ---------------------------------------------------------------------------------------------------
> This script takes about 50 minutes to process all the 8 files. I need
> some suggestions to improve the performance and bring the processing
> time down.
>
> The hardware is a good 8-CPU (1.2 GHz) machine with 8 GB of memory.
> This machine will be used solely for file processing and running one
> more application (Informatica).
>
> thanks
> Ash


Another approach to take is substituting commas for spaces in the
string rather than capturing all the fields. If your fields never
contain spaces then:

tr/ /,/

Couple that with sysread/syswrite, and you should get some big
improvements.
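
A rough sketch of that block-wise idea (the chunk size is arbitrary,
and it still ignores the quoted fields that legitimately contain
spaces):

my $buf;
while (sysread(INFO, $buf, 1_048_576)) {   # read 1 MB at a time
    $buf =~ tr/ /,/;                       # turn every space into a comma
    syswrite(DAT, $buf);
}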

 
it_says_BALLS_on_your forehead
      01-15-2006

MikeGee wrote:
> (E-Mail Removed) wrote:
> > Hi
> > I'm a perl newbie. I've been given the task of parsing through very
> > large (500MB) web server log files in an efficient manner. I need to
> > parse about 8 such files in parallel and create corresponding csv files
> > as output. This needs to be done every hour. In other words, the entire
> > parsing of about 8 files should complete well within 30 minutes. The
> > remaining 30 minutes are required for other database related activities
> > that need to be performed on the csv files generated by the perl
> > script.
> >
> > Following is a snippet of my perl routine....
> >
> > open(INFO, $in_file);
> > open(DAT, $out_file);
> >
> > while (<INFO>) {
> >
> > my ($host, $ident_user, $auth_user, $date, $time,
> > $time_zone, $method, $url, $protocol, $status, $bytes,
> > $referer, $agent);
> >
> > ($host, $ident_user, $auth_user, $date, $time,
> > $time_zone, $method, $url, $protocol, $status, $bytes,
> > $referer, $agent) =
> > /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
> > (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
> > or next;
> >
> > my $decrypt_url = <decrypting subroutine> $url;
> >
> > print DAT $host, $ident_user, $auth_user, $date, $time,
> > $time_zone, $method, $decrypt_url, $protocol, $status,
> > $bytes, $referer, $agent, "\n";
> > }
> >
> > ---------------------------------------------------------------------------------------------------
> > This script takes about 50 minutes to process all the 8 files. I need
> > some suggestions to improve the performance and bring the processing
> > time down.
> >
> > The hardware is a good 8-CPU (1.2 GHz) machine with 8 GB of memory.
> > This machine will be used solely for file processing and running one
> > more application (Informatica).
> >
> > thanks
> > Ash

>
> Another approach to take is substituting commas for spaces in the
> string rather than capturing all the fields. If your fields never
> contain spaces then:
>
> tr/ /,/
>
> Couple that with sysread/syswrite, and you should get some big
> improvements.


Can you explain this more? Why does this improve performance? Isn't
this only for fixed-length unbuffered input?

 
xhoster@gmail.com
      01-15-2006
(E-Mail Removed) wrote:
> Hi
> I'm a perl newbie. I've been given the task of parsing through very
> large (500MB) web server log files in an efficient manner. I need to
> parse about 8 such files in parallel and create corresponding csv files
> as output. This needs to be done every hour. In other words, the entire
> parsing of about 8 files should complete well within 30 minutes. The
> remaining 30 minutes are required for other database related activities
> that need to be performed on the csv files generated by the perl
> script.


You parse the file into a csv file, and then re-parse the csv file to do
database stuff with it? Wouldn't it be more efficient to do the database
stuff directly in the script below?
>
> Following is a snippet of my perl routine....
>
> open(INFO, $in_file);
> open(DAT, $out_file);


You should check the success of these. I assume the string in $out_file
begins with a ">"?

>
> while (<INFO>) {
>
> my ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent);
>
> ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent)


You could combine the my into the same statement as the assignment.
Probably not much faster, but certainly more readable.

> =
> /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
> (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
> or next;
>
> my $decrypt_url = <decrypting subroutine> $url;
>
> print DAT $host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $decrypt_url, $protocol, $status,
> $bytes, $referer, $agent, "\n";
> }
>
> ---------------------------------------------------------------------------------------------------


> This script takes about 50 minutes to process all the 8 files.


That script only processes one file. How do the other 7 get processed?

It takes me less than 3 minutes to process one 635 MB file on a single
CPU 3 GHz machine.

> I need
> some suggestions to improve the performance and bring the processing
> time down.


Take out the print DAT. How long does it take? Take out the decrypting
subroutine, too. How long does it take now? Take out the regex. How long
does it take now?
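
For example, a rough timing harness (a sketch; Time::HiRes ships with
recent perls):

use Time::HiRes qw(time);

my $t0 = time;
while (<INFO>) {
    # Variant 1: leave the loop empty (pure read speed).
    # Variant 2: add the regex back, discarding the captures.
    # Variant 3: add the decrypt call back.
    # Variant 4: add the print back.
}
printf "elapsed: %.2f s\n", time - $t0;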

You could try changing the regex to a split, with some post-processing of
the elements. But from some testing I've done, I doubt that will save more
than 20% or so, and that is without the necessary post-processing.

> The hardware is a good 8-CPU (1.2 GHz) machine with 8 GB of memory.


So, when the program is running, what is happening on the machine? Is the
CPU pegged? Is the network bandwidth pegged? Is the disk bandwidth
pegged? Are you using all 8 CPUs?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
RedGrittyBrick
      01-16-2006
(E-Mail Removed) wrote:
> Hi
> I'm a perl newbie. I've been given the task of parsing through very
> large (500MB) web server log files in an efficient manner. I need to
> parse about 8 such files in parallel and create corresponding csv files
> as output. This needs to be done every hour. In other words, the entire
> parsing of about 8 files should complete well within 30 minutes. The
> remaining 30 minutes are required for other database related activities
> that need to be performed on the csv files generated by the perl
> script.
>
> Following is a snippet of my perl routine....
>
> open(INFO, $in_file);
> open(DAT, $out_file);
>
> while (<INFO>) {
>
> my ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent);
>
> ($host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $url, $protocol, $status, $bytes,
> $referer, $agent) =
> /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
> (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
> or next;
>
> my $decrypt_url = <decrypting subroutine> $url;


I wonder if that mysterious "decrypting" subroutine is where the
bottleneck is? What does it do?

>
> print DAT $host, $ident_user, $auth_user, $date, $time,
> $time_zone, $method, $decrypt_url, $protocol, $status,
> $bytes, $referer, $agent, "\n";
> }
>
> ---------------------------------------------------------------------------------------------------
> This script takes about 50 minutes to process all the 8 files. I need
> some suggestions to improve the performance and bring the processing
> time down.
>
> The hardware is a good 8-CPU (1.2 GHz) machine with 8 GB of memory.
> This machine will be used solely for file processing and running one
> more application (Informatica).

 
Ash
      01-16-2006
Hi Xho,
I did some modifications based on your suggestions and some on my own.
I tried reading the entire file into a scalar and removed the regex.
All I did was substitute each space with a comma:
--------------------------------------------------------------
undef $/;
$_ = <INFO>;

# Replace every space with a comma
s/ /,/g;
---------------------------------------------------------------
The time came down to 45 seconds for a file. However, doing it this
way, I'll not be able to apply the decrypting subroutine. Moreover,
though it didn't occur, there is a possibility of memory problems with
this approach. The other approach of moving into an array and using the
same regex didn't improve the performance much (just about 5 minute
gain)

I did not understand your questions regarding pegging. Is there a way I
can peg the cpu and the bandwidths? How can I make sure that my script
uses all the available CPUs?

This script runs in parallel for all 8 files.

The decrypting subroutine is currently being developed by a separate
team. I'm not sure how efficient it will be. What I wanted was to make
my script efficient before even plugging that routine in.

thanks to all of you for your inputs
Ash

 
xhoster@gmail.com
      01-17-2006
"Ash" <(E-Mail Removed)> wrote:
> Hi Xho,
> I did some modifications based on your suggestions and some on my own.


Hi Ash,

I don't think you understood my suggestions. Those suggestions were on
ways to *diagnose* the problems more accurately, not ways to fix them. If
it gets much faster when you comment out the print, then you know the print
is the problem. If it gets much faster when you comment out the decrypt,
you know that that is the problem. etc.

> I tried reading the entire file into a scalar and removed the regex.
> All I did was substitute each space with a comma:
> --------------------------------------------------------------
> undef $/;
> $_ = <INFO>;
>
> # Replace every space with a comma
> s/ /,/g;
> ---------------------------------------------------------------
> The time came down to 45 seconds for a file.


Just one file, or all 8 in parallel? If the latter, then we now know that
reading the files from disk (or at least slurping them) is not the
bottleneck, but we don't know much more than that.

> However, doing it this
> way, I'll not be able to apply the decrypting subroutine.


Right. There is no point in testing the performance of s/ /,/g as that
doesn't do what needs to be done. I wanted you to take out the regex
entirely. Read the line, and then throw it away and go read the next line.
If that is much faster than reading the line, doing the regex, throwing
away the result of the regex and just going to the next line, then you know
the regex is the bottleneck. Then you will know where to focus your
efforts.

> Moreover,
> though it didn't occur, there is a possibility of memory problems with
> this approach.


Absolutely. You want to test the line-by-line approach; there is no point
in testing the slurping approach, as that is not a viable alternative.

> The other approach of moving into an array and using the
> same regex didn't improve the performance much (just about 5 minute
> gain)
>
> I did not understand your questions regarding pegging. Is there a way I
> can peg the cpu and the bandwidths?


By "pegging" I mean using all of the resource which is available, so that
that resource becomes the bottleneck.

> How can I make sure that my script
> uses all the available CPUs?


You use OS-specific tools to do that. On a unix-like system, "top" is a
good one. However, how you interpret the results of "top" is OS-dependent.
On Solaris, I think it should list each of the 8 processes as getting
nearly 12.5% of the CPU. If not, then it is probably not CPU bound, but
rather IO bound.


>
> This script runs in parallel for all 8 files.


And it takes 50 minutes for the last of the 8 to finish? How long does
it take if you only process 4 in parallel? (If it still takes 50 minutes,
that suggests you are CPU bound. If it takes substantially less, that
suggests you are IO bound.)

>
> The decrypting subroutine is currently being developed by a separate
> team. I'm not sure how efficient it will be. What I wanted was to make
> my script efficient before even plugging that routine in.


So what are you currently using, just an empty dummy subroutine?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Ash
      01-17-2006
Hi Xho

> I don't think you understood my suggestions. Those suggestions were on
> ways to *diagnose* the problems more accurately, not ways to fix them. If
> it gets much faster when you comment out the print, then you know the print
> is the problem. If it gets much faster when you comment out the decrypt,
> you know that that is the problem. etc.


Going by the methodical approach you suggested, I figured out that it
is the regex that's the bottleneck. A straightforward line-by-line read
followed by a write, for 8 files running in parallel, took a little over
2 minutes. Putting the regex back brought the time back up to about 50
minutes.

> Just one file, or all 8 in parallel? If the latter, then we now know that
> reading the files from disk (or at least slurping them) is not the
> bottleneck, but we don't know much more than that.


Reading/writing is not the bottleneck

> > However, doing it this
> > way, I'll not be able to apply the decrypting subroutine.

>
> Right. There is no point in testing the performance of s/ /,/g as that
> doesn't do what needs to be done. I wanted you to take out the regex
> entirely. Read the line, and then throw it away and go read the next line.
> If that is much faster than reading the line, doing the regex, throwing
> away the result of the regex and just going to the next line, then you know
> the regex is the bottleneck. Then you will know where to focus your
> efforts.
>
> > Moreover,
> > though it didn't occur, there is a possibility of memory problems with
> > this approach.

>
> Absolutely. You want to test the line-by-line approach; there is no point
> in testing the slurping approach, as that is not a viable alternative.
>
> > The other approach of moving into an array and using the
> > same regex didn't improve the performance much (just about 5 minute
> > gain)
> >
> > I did not understand your questions regarding pegging. Is there a way I
> > can peg the cpu and the bandwidths?

>
> By "pegging" I mean using all of the resource which is available, so that
> that resource becomes the bottleneck.
>
> > How can I make sure that my script
> > uses all the available CPUs?

>
> You use OS-specific tools to do that. On a unix-like system, "top" is a
> good one. However, how you interpret the results of "top" is OS-dependent.
> On Solaris, I think it should list each of the 8 processes as getting
> nearly 12.5% of the CPU. If not, then it is probably not CPU bound, but
> rather IO bound.



>
> >
> > This script runs in parallel for all 8 files.

>
> And it takes 50 minutes for the last of the 8 to finish? How long does
> it take if you only process 4 in parallel? (If it still takes 50 minutes,
> that suggests you are CPU bound. If it takes substantially less, that
> suggests you are IO bound.)


The process (with regex) takes 47 minutes for a single file and about
50 minutes for 8 parallel files. So it's CPU bound.

> >
> > The decrypting subroutine is currently being developed by a separate
> > team. I'm not sure how efficient it will be. What I wanted was to make
> > my script efficient before even plugging that routine in.

>
> So what are you currently using, just an empty dummy subroutine?

Right now I don't even have a dummy routine there. I'm just throwing the
encrypted result into the output file.

This is a sample log entry
-------------------------------------------------------------------------------
151.205.97.52 - - [23/Aug/2005:11:56:31 +0000] "GET
/liveupdate-aka.symantec.com/common$20client$20core_103.0.3_english_livetri.zip
HTTP/1.1" 304 162 "-" "Symantec LiveUpdate" "-"
-------------------------------------------------------------------------------

The url /liveupdate-aka.s.....
will be encrypted. I tried a few other regexes to parse the line but
didn't find anything that would improve performance.
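
One more variant I may try (a sketch, untested): capture the bracketed
and quoted fields whole, then split them afterwards, so the engine never
has to backtrack through the non-greedy (.+?):

while (my $line = <INFO>) {
    # No $ anchor at the end, so the extra trailing "-" field in the
    # sample above does not break the match.
    my ($host, $ident_user, $auth_user, $datetime, $request,
        $status, $bytes, $referer, $agent) =
        $line =~ m{^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\S+) (\S+) "([^"]*)" "([^"]*)"}
        or next;
    my ($date, $rest)      = split /:/, $datetime, 2;  # "23/Aug/2005" | "11:56:31 +0000"
    my ($time, $time_zone) = split / /, $rest, 2;      # "11:56:31" | "+0000"
    my ($method, $url, $protocol) = split / /, $request, 3;
    # ... decrypt $url here, then print the CSV line ...
}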

thanks
Ash

 