"it_says_BALLS_on_your forehead" <> wrote:
> > balancing on the fly is almost surely going
> > to be better than some precomputed balancing based on the assumption
> > that size = time.
> >
> > $pm = new Parallel::ForkManager(20);
> >
> > foreach $file (sort {$files{$b}<=>$files{$a}} keys %files) {
> > my $pid = $pm->start and next;
> > ##Process the $file
> > $pm->finish; # Terminates the child process
> > }
> > $pm->wait_all_children;
> >
> > ...
>
> i admit i'm not too familiar with Threads/Forks (the only fork i use is
> the one called from system() ).
One advantage of Parallel::ForkManager is that you don't need to be all
that familiar with fork. It handles most of it for you, as long as you
follow the example of having a "$pm->start and next" near the top of the
loop and a "$pm->finish" at the end of the loop. ($pm->finish actually
calls exit in the child process, so anything between the finish and the end
of the loop is not executed.)
> also, i've read that Perl threading
> isn't too stable.
Forking on Linux is rock solid. On Windows, Perl emulates fork using
threads, but I think it is stable enough for what you are doing.
> i've looked on the web a little, but have not found
> anything that describes how to do all of the following:
>
> 1) instantiate N processes (or threads)
$pm = new Parallel::ForkManager($N);
(Doesn't actually instantiate them, but declares how many you want
running at once, once you get around to starting them.)
> 2) start each process parsing a log file
That is what the "foreach...$pm->start and next" does. It starts a process
on the next log file, unless there are already 20 (or $N) outstanding
processes. In that case, it waits for one of those outstanding processes to
end, then starts a process on the next log file.
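If it helps to see the idea without the module, here is a rough sketch of that throttling using only plain fork and wait. This is not ForkManager's actual code, just my approximation of the behavior described above; run_throttled and its arguments are names I made up for the illustration.

```perl
use strict;
use warnings;

# Run $work->($item) in a child process for each item, keeping at most
# $max children alive at once. Returns the number of children reaped.
sub run_throttled {
    my ($max, $work, @items) = @_;
    my ($running, $reaped) = (0, 0);
    for my $item (@items) {
        # Already at the limit: block until any one child exits.
        if ($running >= $max) {
            wait();
            $running--;
            $reaped++;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {      # child: do the work, then exit
            $work->($item);
            exit 0;
        }
        $running++;           # parent: one more child outstanding
    }
    # No more items to start: wait for the stragglers.
    while ($running > 0) {
        wait();
        $running--;
        $reaped++;
    }
    return $reaped;
}

# Eight short dummy tasks, at most three at a time.
my $n = run_throttled(3, sub { select(undef, undef, undef, 0.1) }, 1 .. 8);
print "reaped $n children\n";
```

The module does the same bookkeeping for you, plus extras like callbacks, which is why I'd use it rather than rolling your own.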
> 3) the first process that is done looks at a shared or global queue and
> pulls the next log file from that and processes until the queue is
> empty.
ForkManager uses inversion of control (or at least something like it). The
first slave process that is done finishes. As part of finishing, it
notifies the master process. The master process keeps the queue, and uses
it to start the next process, to replace the one that finished.
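If you want to watch the master being notified, ForkManager lets you register a callback with run_on_finish, which runs in the parent each time it reaps a child. A minimal sketch, with made-up file names and no real processing:

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @files = qw(a.log b.log c.log d.log);   # hypothetical file names
my $pm = Parallel::ForkManager->new(2);

my @done;
# Runs in the parent (master) each time it reaps a finished child.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident) = @_;
    push @done, $ident;
    print "parent: $ident finished (pid $pid, exit $exit_code)\n";
});

foreach my $file (@files) {
    # The argument to start is an identifier passed back to the callback.
    $pm->start($file) and next;   # parent: move on to the next file
    # child: process $file here
    $pm->finish(0);               # child exits with status 0
}
$pm->wait_all_children;
print scalar(@done), " files processed\n";
```

The queue here is just the foreach list in the parent; the children never see it.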
....
>
> if i can get the on-the-fly thing working, that would be preferable.
> then sorting would not even be helpful, would it?
I find that it is helpful, especially when the lengths of the various
tasks vary by orders of magnitude.
Let's say your largest task will take 20 minutes for 1 process/CPU to
process, and all the rest of your tasks combined will take 20 minutes for
the other 19 CPUs to process. If you start the largest task first, then in
20 minutes you are done. If you start the largest task last, then say it
takes 15 minutes before it gets started[1], and then 20 minutes for it to
run, so the time to completion is 35 minutes.
By starting the tasks in decreasing order of run time (longest first), it
lets the shorter tasks pack around the longer ones in an efficient way.
(I just did a test on a uniform distribution of run-lengths[2], and
"processing" from long to short took 8:25 while short to long took 9:10. I
think the difference can be larger if the dispersion in runtimes is
greater)
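You can also check the packing argument without forking anything. The sketch below simulates handing each task, in order, to whichever slot frees up first, which is how I'd model the start-when-a-child-finishes behavior. It ignores fork overhead, and makespan is a name I made up for the illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hand tasks, in the given order, to whichever of $workers slots becomes
# idle first. Returns the total elapsed time (the makespan).
sub makespan {
    my ($workers, @durations) = @_;
    my @free_at = (0) x $workers;   # time at which each slot becomes idle
    for my $d (@durations) {
        # Find the slot that becomes idle soonest.
        my $slot = 0;
        for my $i (1 .. $#free_at) {
            $slot = $i if $free_at[$i] < $free_at[$slot];
        }
        $free_at[$slot] += $d;      # the task runs there until it is done
    }
    return max(@free_at);
}

my @tasks = (1 .. 100);
printf "long to short: %d s\n", makespan(10, reverse @tasks);   # 505 s
printf "short to long: %d s\n", makespan(10, @tasks);           # 550 s
```

Those are 8:25 and 9:10, the same as the measured times. 505 s is also the best possible here, since the tasks add up to 5050 s of work spread over 10 slots.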
Xho
[1] Since all-but-the-longest take 20 minutes to finish on 19 CPUs, they
will take ~19 minutes to finish on 20 CPUs (since we haven't yet started
the longest task, the shorter ones will have 20 CPUs to use, not 19).
However, the longest one doesn't need to wait for all of the shorter ones
to finish before it starts, it only needs to wait for 381 out of the 400 of
the shorter ones to finish. So I just pulled 15 minutes out of my ass, as
a guess of how long it will take for 381 of them to finish.
[2]
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);   # at most 10 children at once

## Do it with and without the "reverse".
foreach my $file (reverse 1..100) {
    $pm->start and next;   # parent: go on to the next task
    sleep $file;           # child: stand-in for $file seconds of work
    $pm->finish;           # terminates the child process
}
$pm->wait_all_children;