Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Alternative to Parallel::ForkManager

Alternative to Parallel::ForkManager

 
 
xhoster@gmail.com
04-24-2008
nolo contendere <(E-Mail Removed)> wrote:
> On Apr 23, 5:24 pm, (E-Mail Removed) wrote:
> > nolo contendere <(E-Mail Removed)> wrote:
> > > Scenario:
> > >       I am expecting 3 files in a drop directory. They won't
> > > necessarily all arrive at the same time. I want to begin processing
> > > each file as soon as it arrives (or as close to arrival time as
> > > is reasonable).

> >
> > What is the relationship between the 3 files? Presumably, this whole
> > thing will happen more than once, right, otherwise you wouldn't need
> > to automate it? So what is the difference between "3 files show up,
> > and that happens 30 times" and just "90 files show up"?

>
> The timing.


I still don't understand. How is the timing significant? If you want
each file to start being processed as soon as it shows up, then what
difference does it make whether they tend to show up in clumps of three?
As soon as they show up is as soon as they show up, regardless of when that
is. Is there something significant that the clumps of three have in common
*other* than merely their timing?


> This is what I have, and again, I think I just needed to move the glob
> function ( get_files() below ) into each thread.


If you do that, the "threads" will be fighting over the files. You will
have to code that very, very carefully. But depending on your answer to my
first question, it might be moot.


> I won't know the
> exact filename beforehand, so can't pass that to the child process and
> have it wait for it.


Is there a pattern, like one file will end in _1, one end in _2, and one
end in _3? If so, give each child a different (and mutually exclusive)
pattern to glob on. That way they won't fight over the files.
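
A rough sketch of that mutually-exclusive-glob idea (the scratch directory, the `job_1`/`job_2`/`job_3` filenames, and the `*_N` patterns are all invented for illustration — adapt them to your real naming scheme):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch dir standing in for the drop directory, pre-seeded with one
# file per suffix so the demo terminates.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(job_1 job_2 job_3)) {
    open my $fh, '>', "$dir/$name" or die $!;
    close $fh;
}

# One pattern per child. Because the patterns are mutually exclusive,
# no two children can ever claim the same file.
my @patterns = map { "$dir/*_$_" } 1 .. 3;
for my $i (0 .. $#patterns) {
    my @mine = glob $patterns[$i];
    print "child $i would take: @mine\n";
}
```

In the real program each child would run its own glob-and-sleep loop on its private pattern instead of this single pass.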


> my $done = 0;
> while ( is_before($stop_checking_time) && !$done ) {
>     get_files( $loadcount, \$filecount, \@files, \$num_threads );


...
> my $pm = Parallel::ForkManager->new( $num_threads );


$num_threads determines the *maximum* number of processes that will be live
at any one time. This should be determined based on the number of CPUs or
the amount of main memory or the IO bandwidth that your server has. It
should not be determined by the count of the number of tasks to be done, as
you seem to be doing here.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
nolo contendere
04-24-2008
On Apr 24, 2:35 pm, (E-Mail Removed) wrote:
> nolo contendere <(E-Mail Removed)> wrote:
> > On Apr 23, 5:24 pm, (E-Mail Removed) wrote:
> > > nolo contendere <(E-Mail Removed)> wrote:
> > > > Scenario:
> > > >       I am expecting 3 files in a drop directory. They won't
> > > > necessarily all arrive at the same time. I want to begin processing
> > > > each file as soon as it arrives (or as close to arrival time as
> > > > is reasonable).

>
> > > What is the relationship between the 3 files? Presumably, this whole
> > > thing will happen more than once, right, otherwise you wouldn't need
> > > to automate it? So what is the difference between "3 files show up,
> > > and that happens 30 times" and just "90 files show up"?

>
> > The timing.

>
> I still don't understand. How is the timing significant? If you want
> each file to start being processed as soon as it shows up, then what
> difference does it make whether they tend to show up in clumps of three?
> As soon as they show up is as soon as they show up, regardless of when that
> is. Is there something significant that the clumps of three have in common
> *other* than merely their timing?


The difference lies in my implementation of the solution, not
necessarily the problem. Historically I've used Parallel::ForkManager
in cases where there were many more jobs to do than there were CPUs,
and those jobs were all ready to be done. In that scenario, I would
initiate <num CPU> processes, loop through the jobs and assign a job
to each process until all the jobs were done. In the case mentioned in
this thread, not all the jobs are ready at process-initiation time. If
I were to use my old implementation, it's possible that only 2 of the
3 files would have arrived at time t0, so only 2 processes would be
kicked off. Shortly after, the 3rd file shows up, but my script
doesn't notice until the last of the 2 processes finishes, and so
processing of the 3rd file must wait.

By including the sleep/check code in the logic for each process, I can
handle this case more efficiently.

So to answer your earlier questions of the difference it makes, and
the significance: it changed my thought process (hopefully for the
better) around how to handle this incarnation of staggered-yet-
concurrent job processing.

>
> > This is what I have, and again, I think I just needed to move the glob
> > function ( get_files() below ) into each thread.

>
> If you do that, the "threads" will be fighting over the files. You will
> have to code that very, very carefully. But depending on your answer to my
> first question, it might be moot.


Yes, the "very, very carefully" is why I posted to begin with, hoping
for an elegant and efficient solution.

>
> > I won't know the
> > exact filename beforehand, so can't pass that to the child process and
> > have it wait for it.

>
> Is there a pattern, like one file will end in _1, one end in _2, and one
> end in _3? If so, give each child a different (and mutually exclusive)
> pattern to glob on. That way they won't fight over the files.
>


The pattern is: <CLASS>_YYYYMMDDhhmmss_nnn

The glob is on <CLASS>_

There may be skips in the 'nnn' sequence, which is why, rather than
attempting to be more specific with the glob pattern, I had hoped to
mark the files as 'being-processed' by either an atomic rename or an
atomic mv to a work/tmp dir.
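
A minimal sketch of that atomic-claim idea (the `try_claim` helper and all paths below are invented): `rename` within one filesystem is atomic, so whichever worker renames the file into its work dir first wins, and any later attempt simply sees the rename fail because the source is gone.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# try_claim($file, $workdir): returns the file's new path if we won the
# race to move it into $workdir, or undef if someone else already did.
sub try_claim {
    my ($file, $workdir) = @_;
    my $base   = (File::Spec->splitpath($file))[2];
    my $target = File::Spec->catfile($workdir, $base);
    return rename($file, $target) ? $target : undef;
}

# Demo with scratch directories standing in for the drop and work dirs.
my $drop = tempdir(CLEANUP => 1);
my $work = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($drop, "CLASS_20080424120000_001");
open my $fh, '>', $file or die $!;
close $fh;

my $claimed = try_claim($file, $work);   # first claim should win
my $again   = try_claim($file, $work);   # a second claim should lose
print defined $claimed ? "claimed\n" : "lost\n";
print defined $again   ? "claimed\n" : "lost\n";
```

One caveat: `rename` is only atomic within a single filesystem, so the work dir has to live on the same mount as the drop dir.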

> > my $done = 0;
> > while ( is_before($stop_checking_time) && !$done ) {
> >     get_files( $loadcount, \$filecount, \@files, \$num_threads );

>
> ...
>
> >         my $pm = Parallel::ForkManager->new( $num_threads );

>
> $num_threads determines the *maximum* number of processes that will be live
> at any one time. This should be determined based on the number of CPUs or
> the amount of main memory or the IO bandwidth that your server has. It
> should not be determined by the count of the number of tasks to be done, as
> you seem to be doing here.


Yeah, I know, it's dangerous. There *shouldn't* be more than 40 files
at a time (I know, I know, stupid to believe this will actually be
true), and each process calls a Tcl script which loads the file
into a Sybase table (I don't have control over this). I think this
is less bound by CPU and more by IO, so I don't think $num_procs >
$num_cpus should be that much of an issue. Of course, I could be
wrong. This would require testing.
 
xhoster@gmail.com
04-24-2008
nolo contendere <(E-Mail Removed)> wrote:
> On Apr 24, 2:35 pm, (E-Mail Removed) wrote:
> >
> > I still don't understand. How is the timing significant? If you
> > want each file to start being processed as soon as it shows up, then
> > what difference does it make whether they tend to show up in clumps of
> > three? As soon as they show up is as soon as they show up, regardless
> > of when that is. Is there something significant that the clumps of
> > three have in common *other* than merely their timing?

>
> The difference lies in my implementation of the solution, not
> necessarily the problem.

...
> So to answer your earlier questions of the difference it makes, and
> the significance: it changed my thought process (hopefully for the
> better) around how to handle this incarnation of staggered-yet-
> concurrent job processing.


OK, so let me try to change your thought process yet again, then.

The master process does all the waiting. That way, it fights with no
one but itself. First it waits for a file to exist (if necessary) then
it waits for ForkManager to let it start a new process (if necessary).
It does a rename in between. There is no one else trying to do the rename,
so no worry about race conditions (unless you unwisely start two master
processes!).


## Set a reasonable upper limit of 10. May never be reached!
my $pm = Parallel::ForkManager->new(10);

while ( is_before($stop_checking_time) && !$done ) {
    my @files = glob "${class}_*";
    sleep 1 unless @files;
    foreach my $file (@files) {
        my $new_name = "foo_$file";
        ## it is important that the renamed file won't match the glob
        ## the next time through the loop!
        rename $file, $new_name or die $!;
        $pm->start() and next;
        process($new_name);
        $pm->finish();
    }
}

If the main process remembers what files were already started, then
it could remember to skip those ones the next time through and wouldn't
need to bother with the renaming.
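
If the master goes that remember-instead-of-rename route, the bookkeeping can be as small as one hash (the `new_files` helper and the sample filenames are invented; the names just follow the <CLASS>_YYYYMMDDhhmmss_nnn pattern from upthread):

```perl
use strict;
use warnings;

# Files already handed to a child, keyed by name. Only safe because a
# single master process does all the globbing and dispatching.
my %started;

# Given the current glob results, return only the files we have not
# started yet, marking them as started as we go.
sub new_files {
    my @found = @_;
    return grep { !$started{$_}++ } @found;
}

# First poll: two files have arrived, both are new.
my @batch1 = new_files("CLASS_20080424120000_001",
                       "CLASS_20080424120000_002");
print "batch 1: @batch1\n";

# Second poll: same two plus a newcomer; only the newcomer comes back.
my @batch2 = new_files("CLASS_20080424120000_001",
                       "CLASS_20080424120000_002",
                       "CLASS_20080424120500_003");
print "batch 2: @batch2\n";
```

Each batch would then be fed to `$pm->start` exactly as in the loop above, with no rename needed.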

Of course, you could always change the sleep loop into some kind of FS
change notifier, as was discussed elsewhere in the thread. But doing a
glob once a second is probably not going to be a problem. At least, I
wouldn't worry about it until it proves itself to be a problem.

> >
> > If you do that, the "threads" will be fighting over the files. You
> > will have to code that very, very carefully. But depending on your
> > answer to my first question, it might be moot.

>
> Yes, the "very, very carefully" is why I posted to begin with, hoping
> for an elegant and efficient solution.


It's best to avoid needing to do it at all, like above. But if I needed
to do this, this is how I would try it. If you are using threads and they
have the same PID, then you will have to use something other than $$.

my $file    = "some_file_we_are_fighting_over";
my $newname = "$$.$file";
if (-e $newname) {
    die "Should never happen. A previous job must have had the same PID, " .
        "accomplished the rename, then failed to clean up after itself";
}
if (rename $file, $newname) {
    ## we won. Do whatever needs doing;
    ## but since we are paranoid...
    -e $newname or die "$newname: how could this be?";
    process($newname);
} elsif ($!{ENOENT}) {
    ## The file didn't exist; most likely someone else beat us to it.
    ## Do nothing, fall through to the next iteration.
} else {
    ## something *else* went wrong. What could it be?
    die "Rename $file, $newname failed in an unexpected way: $!";
}

> > $num_threads determines the *maximum* number of processes that will be
> > live at any one time. This should be determined based on the number of
> > CPUs or the amount of main memory or the IO bandwidth that your server
> > has. It should not be determined by the count of the number of tasks
> > to be done, as you seem to be doing here.

>
> Yeah, I know, it's dangerous. There *shouldn't* be more than 40 files
> at a time (I know, I know, stupid to believe this will actually be
> true),


But there is no reason to take this risk. Hard code 40 as the max number
of processes (I'd probably go lower myself, but if you think 40 is the
number below which you don't need to worry...). If there are ever more
than forty, then some will have to wait in line rather than crash your
machine. If there are never more than forty, then hardcoding the value of
40 instead of passing around $num_threads doesn't change the behavior at
all (and makes the code cleaner to boot).


Xho

 
Martijn Lievaart
04-24-2008
On Thu, 24 Apr 2008 07:29:15 -0700, nolo contendere wrote:

> On Apr 24, 4:15 am, Peter Makholm <(E-Mail Removed)> wrote:
>> Ben Morrow <(E-Mail Removed)> writes:
>> > SGI::FAM only works under Irix. I've been meaning to port it to other
>> > systems that support fam (and gamin, the GNU rewrite) but haven't got
>> > round to it yet.

>>
>> Never used the module myself (should have made that clear) and I have
>> to admit that my only reason to assume that it is usable on other
>> platforms is that File::Tail::FAM talks about Linux.
>>
>> //Makholm

>
> I appreciate the effort Peter, however I'm currently stuck on Solaris.


I thought FAM works on Solaris, so you may not be completely out of luck.
I haven't used any of the FAM modules myself, though.

M4
 
nolo contendere
04-25-2008
On Apr 24, 4:23 pm, (E-Mail Removed) wrote:
> nolo contendere <(E-Mail Removed)> wrote:
> > On Apr 24, 2:35 pm, (E-Mail Removed) wrote:

>
> > > I still don't understand. How is the timing significant? If you
> > > want each file to start being processed as soon as it shows up, then
> > > what difference does it make whether they tend to show up in clumps of
> > > three? As soon as they show up is as soon as they show up, regardless
> > > of when that is. Is there something significant that the clumps of
> > > three have in common *other* than merely their timing?

>
> > The difference lies in my implementation of the solution, not
> > necessarily the problem.

> ...
> > So to answer your earlier questions of the difference it makes, and
> > the significance: it changed my thought process (hopefully for the
> > better) around how to handle this incarnation of staggered-yet-
> > concurrent job processing.

>
> OK, so let me try to change your thought process yet again, then.


Xho, I don't know how much they're paying you right now, but I'm
certain it's not enough. Thanks for your help!
 