Velocity Reviews


dniq00@gmail.com 08-27-2008 07:59 PM

perl multithreading performance
 
Hello, oh almighty perl gurus!

I'm trying to implement multithreaded processing for the humongous
amount of logs that I'm currently processing in 1 process on a 4-CPU
server.

For each line, the script checks whether the line contains a GET
request; if it does, it goes through a list of pre-compiled regular
expressions, trying to find a matching one. Once a match is found, it
uses another, somewhat more complex regexp associated with that match
to extract data from the line. I have split it into two separate
matches because only about 30% of all lines will match, and I don't
want to run the more complex extraction regexp against lines I know
won't match. The goal is to count how many lines matched each specific
regexp; the end result is built as a hash whose keys are the data
extracted by the second regexp and whose values are the match counts.
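
To give an idea of the shape of it, here's a stripped-down sketch of
the single-process matching loop (the patterns and hash keys are made
up, not the real ones):

Code
# Simplified sketch only - the real list has many more patterns.
my @filters = (
    { match => qr{GET /search\?},  extract => qr{GET /search\?q=([^&\s]+)} },
    { match => qr{GET /download/}, extract => qr{GET /download/(\w+)}      },
);

my %counters;
while ( my $line = <> ) {
    next unless index( $line, 'GET' ) >= 0;        # cheap pre-check
    foreach my $f (@filters) {
        next unless $line =~ $f->{match};          # simple first-stage match
        $counters{$1}++ if $line =~ $f->{extract}; # then the expensive extract
        last;
    }
}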

Anyway, currently all this is done in a single process, which parses
approx. 30000 lines per second. The CPU usage for this process is
100%, so the bottleneck is in the parsing part.

I have changed the script to use threads + threads::shared +
Thread::Queue. I read data from logs like this:

Code
until( $no_more_data ) {
    my @buffer;
    # read up to $buffer_size lines into a buffer
    foreach ( 1 .. $buffer_size ) {
        if ( my $line = <> ) {
            push( @buffer, $line );
        }
        else {
            # end of input: flush the last buffer and tell the workers to stop
            $no_more_data = 1;
            $q_in->enqueue( \@buffer );
            foreach ( 1 .. $cpu_count ) {
                $q_in->enqueue( undef );    # one "stop" marker per worker
            }
            last;
        }
    }
    $q_in->enqueue( \@buffer ) unless $no_more_data;
}

Then, I create $cpu_count threads, each of which does something like this:

Code
sub parser {
    my $counters = {};
    # keep dequeueing buffers until an undef "stop" marker arrives
    while ( my $buffer = $q_in->dequeue() ) {
        foreach my $line ( @{ $buffer } ) {
            # do its thing: match the line and update $counters
        }
    }
    return $counters;
}
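
The per-thread results get merged after the workers finish, roughly
like this (simplified sketch):

Code
# Each parser() returns its own hashref of counters, so nothing is
# shared while the threads are running; only the merge touches them all.
my %total;
foreach my $thr ( threads->list() ) {
    my $counters = $thr->join();
    $total{$_} += $counters->{$_} foreach keys %{ $counters };
}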

Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
faster than the single-process script, consumes about 2-3 times more
memory, and about as many times more CPU.

I've also tried abandoning Thread::Queue and just using threads::shared
with a lock/cond_wait/cond_signal combination, without much success.

I've tried playing with $cpu_count and $buffer_size, and found that
raising $buffer_size beyond 1000 doesn't make much difference, while
$cpu_count > 2 actually makes things a lot worse.

Any ideas why in the world it's so slow? I did some research and
couldn't find much info, other than that the way I'm doing it is pretty
much the way it should be done, unless I'm missing something...

Hope somebody can enlighten me...

THANKS!

Leon Timmermans 08-27-2008 08:39 PM

Re: perl multithreading performance
 
On Wed, 27 Aug 2008 12:59:36 -0700, dniq00 wrote:
>
> Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
> faster than the single-process script, consumes about 2-3 times more
> memory, and about as many times more CPU.
>
> I've also tried abandoning Thread::Queue and just using threads::shared
> with a lock/cond_wait/cond_signal combination, without much success.
>
> I've tried playing with $cpu_count and $buffer_size, and found that
> raising $buffer_size beyond 1000 doesn't make much difference, while
> $cpu_count > 2 actually makes things a lot worse.
>
> Any ideas why in the world it's so slow? I did some research and
> couldn't find much info, other than that the way I'm doing it is pretty
> much the way it should be done, unless I'm missing something...
>
> Hope somebody can enlighten me...
>
> THANKS!


The speed of perl's threading depends on how much you share between
threads. Sharing the lines before processing them can become a
bottleneck; I suspect that's the problem in your case. You probably want
to divide the work first, and only use shared resources to report back
the results. Making a program scale over multiple processors isn't easy.
Sean O'Rourke's entry in the wide finder benchmark
(http://www.cs.ucsd.edu/~sorourke/wf.pl) offers an interesting approach
to this, though it isn't exactly optimized for readability.
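
Something along these lines, perhaps (untested sketch, sub and variable
names made up): give each worker its own list of files up front, and
only combine the counter hashes after joining.

Code
use strict;
use warnings;
use threads;

my $nworkers = 4;
my @files    = @ARGV;

# Deal the files out round-robin, so each worker owns its own slice.
my @slices;
push @{ $slices[ $_ % $nworkers ] }, $files[$_] for 0 .. $#files;

my @workers = map { threads->create( \&parse_files, @{ $slices[$_] || [] } ) }
              0 .. $nworkers - 1;

# The only "shared" step is collecting the returned hashrefs.
my %total;
foreach my $thr (@workers) {
    my $counters = $thr->join();
    $total{$_} += $counters->{$_} for keys %{ $counters };
}

sub parse_files {
    my %counters;
    foreach my $file (@_) {
        open my $fh, '<', $file or die "$file: $!";
        while ( my $line = <$fh> ) {
            # ... match against the regexp list and update %counters ...
        }
        close $fh;
    }
    return \%counters;
}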

Regards,

Leon Timmermans

Ted Zlatanov 08-27-2008 09:06 PM

Re: perl multithreading performance
 
On Wed, 27 Aug 2008 12:59:36 -0700 (PDT) dniq00@gmail.com wrote:

d> For each line, the script checks whether the line contains a GET
d> request; if it does, it goes through a list of pre-compiled regular
d> expressions, trying to find a matching one. Once a match is found, it
d> uses another, somewhat more complex regexp associated with that match
d> to extract data from the line. I have split it into two separate
d> matches because only about 30% of all lines will match, and I don't
d> want to run the more complex extraction regexp against lines I know
d> won't match. The goal is to count how many lines matched each specific
d> regexp; the end result is built as a hash whose keys are the data
d> extracted by the second regexp and whose values are the match counts.

d> Anyway, currently all this is done in a single process, which parses
d> approx. 30000 lines per second. The CPU usage for this process is
d> 100%, so the bottleneck is in the parsing part.
....
d> Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
d> faster than the single-process script, consumes about 2-3 times more
d> memory, and about as many times more CPU.
....
d> Any ideas why in the world it's so slow? I did some research and
d> couldn't find much info, other than that the way I'm doing it is pretty
d> much the way it should be done, unless I'm missing something...

You may be hitting the limits of I/O. Try feeding your script
pre-canned data from memory in a loop and see if that improves
performance. It also depends on what kind of processing you are doing
on input lines.
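
For example, something like this (sample lines made up) takes the disk
out of the picture and measures just the parsing:

Code
use strict;
use warnings;
use Benchmark qw(timethis);

# Made-up sample lines, held in memory so only the parsing cost is measured.
my @sample = (
    '10.0.0.1 - - [27/Aug/2008:12:00:00 -0700] "GET /index.html HTTP/1.1" 200 1234',
    '10.0.0.2 - - [27/Aug/2008:12:00:01 -0700] "POST /submit HTTP/1.1" 200 56',
) x 5_000;

timethis( 100, sub {
    foreach my $line (@sample) {
        next unless index( $line, 'GET' ) >= 0;
        # ... run the real regexp list against $line here ...
    }
} );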

Also, check out the swatch log file monitor; it may already do what you
need.

Ted

dniq00@gmail.com 08-27-2008 09:15 PM

Re: perl multithreading performance
 
On Aug 27, 5:06 pm, Ted Zlatanov <t...@lifelogs.com> wrote:
> You may be hitting the limits of I/O. Try feeding your script
> pre-canned data from memory in a loop and see if that improves
> performance.


No, the I/O is fine - $q_in->pending is pretty much always > 1, and as
the script does its thing, the number of pending buffers sometimes goes
beyond 10.

> It also depends on what kind of processing you are doing
> on input lines.


Just trying to match multiple regexps against each line.

> Also, check out the swatch log file monitor, it may do what you need
> already.


Nope, it doesn't :( I already have the single-threaded script, which
has been working for years now, but the amount of logs it needs to
process keeps growing. I'm basically at the point where it can only
just keep up with the speed at which the logs are being written, so if
there's a back-log for whatever reason it might not catch up. That's
why I'm looking into how I can improve its performance.

dniq00@gmail.com 08-27-2008 09:25 PM

Re: perl multithreading performance
 
On Aug 27, 4:39 pm, Leon Timmermans <faw...@gmail.com> wrote:
> On Wed, 27 Aug 2008 12:59:36 -0700, dniq00 wrote:
>
> > Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
> > faster than the single-process script, consumes about 2-3 times more
> > memory, and about as many times more CPU.
> >
> > I've also tried abandoning Thread::Queue and just using
> > threads::shared with a lock/cond_wait/cond_signal combination,
> > without much success.
> >
> > I've tried playing with $cpu_count and $buffer_size, and found that
> > raising $buffer_size beyond 1000 doesn't make much difference, while
> > $cpu_count > 2 actually makes things a lot worse.
> >
> > Any ideas why in the world it's so slow? I did some research and
> > couldn't find much info, other than that the way I'm doing it is
> > pretty much the way it should be done, unless I'm missing something...
> >
> > Hope somebody can enlighten me...
> >
> > THANKS!
>
> The speed of perl's threading depends on how much you share between
> threads. Sharing the lines before processing them can become a
> bottleneck; I suspect that's the problem in your case. You probably
> want to divide the work first, and only use shared resources to report
> back the results. Making a program scale over multiple processors
> isn't easy. Sean O'Rourke's entry in the wide finder benchmark
> (http://www.cs.ucsd.edu/~sorourke/wf.pl) offers an interesting
> approach to this, though it isn't exactly optimized for readability.
>
> Regards,
>
> Leon Timmermans


Thanks for the link - trying to figure out whattahellisgoingon
there :) Looks like he basically mmaps the input and starts reading it
at different offsets. Thing is, I'm using <> as input, which can
contain hundreds of gigabytes of data, so I'm not sure how that's going
to work out...


Martijn Lievaart 08-27-2008 09:53 PM

Re: perl multithreading performance
 
On Wed, 27 Aug 2008 14:15:34 -0700, dniq00 wrote:

> Nope, it doesn't :( I already have the single-threaded script, which
> has been working for years now, but the amount of logs it needs to
> process keeps growing. I'm basically at the point where it can only
> just keep up with the speed at which the logs are being written, so if
> there's a back-log for whatever reason it might not catch up. That's
> why I'm looking into how I can improve its performance.


Perl threading, well, frankly, sucks. You may want to switch to another
language with regex support that meets your needs. I would go for C++
(with Boost), but then I know that language very well.

M4

xhoster@gmail.com 08-27-2008 10:17 PM

Re: perl multithreading performance
 
dniq00@gmail.com wrote:
> Hello, oh almighty perl gurus!
>
> I'm trying to implement multithreaded processing for the humongous
> amount of logs that I'm currently processing in 1 process on a 4-CPU
> server.


Start 4 processes, telling each one to work on a different log file.
Either do this from the command line, or implement it with fork or system,
depending on how automatic it all has to be.
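
A bare-bones sketch of the fork route might look like this
(process_one_log is a made-up stand-in for your existing parsing code;
each child dumps its counts to its own file for merging afterwards):

Code
# One child process per log file; no shared memory, no locking.
my @pids;
foreach my $log (@ARGV) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        # child: parse one file, write its counts, and exit
        process_one_log( $log, "$log.counts" );   # hypothetical helper
        exit 0;
    }
    push @pids, $pid;    # parent keeps going
}
waitpid( $_, 0 ) for @pids;
# ... read the *.counts files back in and add them up ...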

> Anyway, currently all this is done in a single process, which parses
> approx. 30000 lines per second.


If you just check for GET (and then ignore the result), how many lines per
second would it do?

> The CPU usage for this process is
> 100%, so the bottleneck is in the parsing part.
>
> I have changed the script to use threads + threads::shared +
> Thread::Queue. I read data from logs like this:
>
> Code
> until( $no_more_data ) {
>     my @buffer;
>     foreach ( 1 .. $buffer_size ) {
>         if ( my $line = <> ) {
>             push( @buffer, $line );
>         }
>         else {
>             $no_more_data = 1;
>             $q_in->enqueue( \@buffer );
>             foreach ( 1 .. $cpu_count ) {
>                 $q_in->enqueue( undef );
>             }
>             last;
>         }
>     }
>     $q_in->enqueue( \@buffer ) unless $no_more_data;
> }
>
> Then, I create $cpu_count threads, each of which does something like this:


What do you mean "then"? If you wait until all lines are enqueued before
you create the consumer threads, your entire log file will be in memory!

>
> Code
> sub parser {
>     my $counters = {};
>     while ( my $buffer = $q_in->dequeue() ) {
>         foreach my $line ( @{ $buffer } ) {
>             # do its thing
>         }
>     }
>     return $counters;
> }


When $counters is returned, what do you do with it? That could be
another synchronization bottleneck.

>
> Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
> faster than the single-process script, consumes about 2-3 times more
> memory, and about as many times more CPU.


That doesn't surprise me.

> I've also tried abandoning Thread::Queue and just using
> threads::shared with a lock/cond_wait/cond_signal combination, without
> much success.


This also doesn't surprise me. Synchronizing shared access is hard and
often slow.


>
> I've tried playing with $cpu_count and $buffer_size, and found that
> raising $buffer_size beyond 1000 doesn't make much difference, while
> $cpu_count > 2 actually makes things a lot worse.
>
> Any ideas why in the world it's so slow? I did some research and
> couldn't find much info, other than that the way I'm doing it is
> pretty much the way it should be done, unless I'm missing something...
>
> Hope somebody can enlighten me...


If you post fully runnable dummy code, and a simple program which
generates log-file data to put through it, I probably couldn't resist
the temptation to play around with it and find the bottlenecks.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

cartercc 08-28-2008 11:45 AM

Re: perl multithreading performance
 
On Aug 27, 5:53 pm, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:

> Perl threading, well, frankly, sucks. You may want to switch to another
> language with regex support that meets your needs. I would go for C++
> (with Boost), but then I know that language very well.


I've been playing with Erlang. In this case, you could probably spawn
separate threads per line and have them all run concurrently. I
haven't done a 'real' project (yet) but I've written some toy scripts
that tear through large files in fractions of milliseconds.

CC

Ted Zlatanov 08-28-2008 01:39 PM

Re: perl multithreading performance
 
On Wed, 27 Aug 2008 23:53:09 +0200 Martijn Lievaart <m@rtij.nl.invlalid> wrote:

ML> On Wed, 27 Aug 2008 14:15:34 -0700, dniq00 wrote:
>> Nope, it doesn't :( I already have the single-threaded script, which
>> has been working for years now, but the amount of logs it needs to
>> process keeps growing. I'm basically at the point where it can only
>> just keep up with the speed at which the logs are being written, so if
>> there's a back-log for whatever reason it might not catch up. That's
>> why I'm looking into how I can improve its performance.


ML> Perl threading, well, frankly, sucks. You may want to switch to
ML> another language with regex support that meets your needs. I would
ML> go for C++ (with Boost), but then I know that language very well.

Hadoop is a nice non-Perl framework for this kind of work.

Ted

J. Gleixner 08-28-2008 03:32 PM

Re: perl multithreading performance
 
dniq00@gmail.com wrote:
> Hello, oh almighty perl gurus!
>
> I'm trying to implement multithreaded processing for the humongous
> amount of logs that I'm currently processing in 1 process on a 4-CPU
> server.
>
> For each line, the script checks whether the line contains a GET
> request; if it does, it goes through a list of pre-compiled regular
> expressions, trying to find a matching one. [...]


> Any ideas why in the world it's so slow? I did some research and
> couldn't find much info, other than that the way I'm doing it is
> pretty much the way it should be done, unless I'm missing something...


Another, much easier/faster approach would be:

grep ' GET ' file | your_script.pl

The earlier you can filter the data down to just the lines that need
work, the better, and you're not going to get much faster than grep.
The more refined you can make that initial filtering, so that only the
lines you're interested in are sent to your program, the better.

