Velocity Reviews - Computer Hardware Reviews

waitpid woes on Solaris, Perl 5.8.8

A
      02-13-2007
I am getting intermittent, unexpected results from waitpid on Solaris 9
running Perl 5.8.8.

Here is the scenario (the bare-bones code is below).

Program_A, written in Perl, is invoked about a million times every
day. Most of the time, it invokes (using fork-exec) Program_B, which
is written in C++. Program_A uses waitpid to get the exit code of
Program_B.
It works fine most of the time, but a few dozen times every
day the waitpid apparently fails, and when it fails, I get

$? is -1
$! is "No child processes"

In all of the cases I have investigated, the child process, Program_B,
started and completed gracefully with "exit(0)", and of course the
PIDs in the trace logs of both processes match.

In such a case, the output from the code below is

Child pid=5196, exitCode=0xffffffff (No child processes)
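(Editor's note on reading that number.) In Perl, `$?` is -1 when the wait call itself failed and no status was collected at all; `0xffffffff` is what a 32-bit perl prints for -1 with a `%#08x` format. Any other value packs the child's exit code into the high byte and the terminating signal into the low seven bits. A minimal sketch of decoding such a value follows; the helper name `describe_status` is hypothetical and not part of Program_A.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a raw wait status (as found in $?) into text.
sub describe_status {
    my ($status) = @_;
    return 'waitpid failed, no status collected' if $status == -1;
    my $signal = $status & 127;    # low 7 bits: terminating signal, if any
    my $core   = $status & 128;    # bit 7: core dump flag
    my $exit   = $status >> 8;     # high byte: exit code
    return sprintf('killed by signal %d%s', $signal, $core ? ' (core dumped)' : '')
        if $signal;
    return "exited with code $exit";
}

print describe_status(-1), "\n";        # waitpid failed, no status collected
print describe_status(0), "\n";         # exited with code 0
print describe_status(3 << 8), "\n";    # exited with code 3
print describe_status(9), "\n";         # killed by signal 9
```

So the line above says that no exit status was obtained at all, not that Program_B exited with a strange code.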

Program_A itself is transient and short-lived and, depending on its
input, executes Program_B at most once.

What am I doing wrong?
How can I detect and correct this?

Thanks for your help.

# ------------------------------------------- begin code -------------------------------------------------
#!/usr/local/bin/perl

# program_A

my $cpid;
my $ec = undef;
my $em = undef;

sub getChildStatus
{
    my $tc = undef;
    my $tm = undef;
    my $r  = undef;

    while ( 1 ) {
        $r  = waitpid($cpid, 0);
        $tc = $?;
        $em = $!;
        last if ( -1 == $r || $r == $cpid );
        print STDERR "waitpid($cpid, 0) returned $r ( $? )\n";
    }
    if ( $cpid == $r ) {
        $ec = $tc;
        $em = $tm;
    }
}

sub sigCLDhandler
{
    my $sig = shift;
    print STDERR "caught SIG $sig\n";
    getChildStatus;
}


sub runIt
{
    my $oldSigCld = $SIG{CLD};
    local $SIG{CLD} = \&sigChldHandler;

    $cpid = fork;
    if ( ! defined $cpid ) {
        print STDERR "fork failed [ $! ]\n";
        return;
    }

    if ( 0 == $cpid ) {
        print STDERR "child pid $$ starting\n";

        exec program_B, .. .. ..;

        print STDERR "child pid $$: exec failed [$!], exiting with -1\n";
        exit(-1);
    } # 0 == $cpid i.e. the child

    getChildStatus; # only the parent reaches here
    $SIG{CLD} = $oldSigCld ;
} # runIt

#
# main
#
runIt;
if ( $ec ) {
    printf STDERR "Child pid=$cpid exitcode=%#08x msg=(%s)\n", $ec, $em;
}

# ------------------------------------------- end code -------------------------------------------------

 
xhoster@gmail.com
      02-13-2007
"A" <(E-Mail Removed)> wrote:
> What am I doing wrong?


You are mucking with $SIG{CLD} when, as far as I can tell, you have
no need to. getChildStatus (and the waitpid in it) can get called twice,
once from the signal handler and once from runIt. If it does get called
twice, the second time that child no longer exists, as it was already
waited on. Remove the $SIG{CLD} stuff.
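(Editor's sketch.) Following that advice, the parent can simply fork, exec, and make one blocking waitpid call with no SIGCLD/SIGCHLD handler installed, so exactly one place ever reaps the child and reads `$?`. This is a minimal sketch, not the poster's program: the sub name `run_and_wait` is hypothetical, and Program_B is stood in for by perl itself running `exit 3`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Fork, exec the given command, and reap it with a single blocking waitpid.
# No $SIG{CLD}/$SIG{CHLD} handler is installed, so nothing else can reap
# the child or overwrite $? and $! before we read them.
sub run_and_wait {
    my @argv = @_;
    my $pid  = fork;
    die "fork failed: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: replace this process with the command.
        exec { $argv[0] } @argv
            or print STDERR "exec of $argv[0] failed: $!\n";
        POSIX::_exit(127);    # skip END blocks and inherited buffered output
    }

    # Parent: the only place this child is ever waited on.
    my $r      = waitpid($pid, 0);
    my $status = $?;          # copy at once, before anything can change it
    die "waitpid($pid, 0) returned $r: $!\n" unless $r == $pid;
    return $status;
}

# Hypothetical stand-in for Program_B: perl itself, exiting with code 3.
my $status = run_and_wait($^X, '-e', 'exit 3');
printf "child exit code %d\n", $status >> 8;
```

Copying `$?` into a lexical on the very next statement is worth doing even without handlers, since a later `system`, backtick, or `close` on a piped filehandle also sets `$?`.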

Xho

A
      02-14-2007
On Feb 13, 3:44 pm, (E-Mail Removed) wrote:
>
> You are mucking with $SIG{CLD} when, as far as I can tell, you have
> no need to. getChildStatus (and the waitpid in it) can get called twice,
> once from the sig handler and once from the runIt. If it does get called
> twice, the second time that child no longer exists, as it was already
> waited on. Remove the $SIG{CLD} stuff.
>
> Xho
>


Thanks for your reply.

First, there's a typo in my original message.

The third line after the while(1) in getChildStatus should be
$tm = $!;
instead of
$em = $!;

Now, to the point that the waitpid could get called twice.

Please note that the code is designed to guard against this: the
assignments to the globals $ec and $em are done if and only if waitpid
returns the matching pid.
So, even if it is called twice, the second waitpid returns -1, and
getChildStatus returns without modifying the globals.

Regarding your advice to remove the $SIG{CLD} code, there are 3 statements:

the first statement saves the handler,
the second statement installs the one needed by this
routine,
and the last one re-installs the saved handler.

Which one(s) would you suggest I remove?

Yes, there's a deficiency (a bug, if you will) in the code: the old
$SIG{CLD} should be re-installed if fork fails, but that, I think, is
of no consequence to the problem at hand.

Thanks again.

 
xhoster@gmail.com
      02-14-2007
"A" <(E-Mail Removed)> wrote:
> Thanks for your reply.
>
> First, there's a typo in my original message.
>
> The third line after the while(1) in getChildStatus should be
> $tm = $!;
> instead of
> $em = $!;
>
> Now, to the point that the waitpid could get called twice.
>
> Please note that the code is designed to guard against this, the
> assignments to the globals $ec and $em are done if and only if waitpid
> returns the matching pid.


The waitpid of one getChildStatus returns the expected pid and sets the
global $? and $!. Before it can do anything else, the waitpid of the other
getChildStatus returns -1 and overwrites the global $? and $! with its
own values, but for this one $r does not pass the if test, so it returns control
to the first getChildStatus. The first getChildStatus has the right pid
recorded in $r (as that was a lexical and didn't get overwritten), but has
the wrong $? and $! because they did get overwritten, and now those get
recorded into your $tc and $tm.
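(Editor's sketch.) The window described above can be reproduced deterministically. This demo is an assumption-laden stand-in, not Program_A: it reaps a child that exited with code 3, then sends itself SIGUSR1 (used instead of SIGCLD so the timing does not depend on the scheduler) before reading `$?`. The handler waits on the already-reaped pid, as a second getChildStatus would. With the careless handler the main code should typically see the handler's -1 rather than the child's status; localizing `$!` and `$?` inside the handler keeps the interrupted code's values intact. All sub names here are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

my $handler_calls = 0;
my $reaped_pid;    # a child that the main code has already waited on

# Like getChildStatus run from the handler: waits on a child that is
# already gone, so waitpid returns -1 and sets $? to -1 and $! to ECHILD.
sub careless_handler {
    $handler_calls++;
    waitpid($reaped_pid, 0);
}

# Same work, but the interrupted code's $? and $! are restored on exit.
sub careful_handler {
    local ($!, $?);
    $handler_calls++;
    waitpid($reaped_pid, 0);
}

# Reap a child that exits with code 3, then signal ourselves before
# reading $?, which is the window the race in this thread falls into.
sub status_after_signal {
    my ($handler) = @_;
    local $SIG{USR1} = $handler;
    $reaped_pid = fork;
    die "fork failed: $!\n" unless defined $reaped_pid;
    POSIX::_exit(3) if $reaped_pid == 0;
    waitpid($reaped_pid, 0);    # $? is now 3 << 8
    kill USR1 => $$;            # handler runs at Perl's next safe point
    return $?;
}

printf "careless handler: \$? = %d\n", status_after_signal(\&careless_handler);
printf "careful handler:  \$? = %d\n", status_after_signal(\&careful_handler);
```

Localizing only protects `$?` and `$!`; the second waitpid still fails, which is why having a single reaper is the more robust fix.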

>
> On your advice to remove the $SIG{CLD}, there are 3 statements,
>
> the first statement saves the handler,
> the second statement installs the current one needed by this
> routine
> and the last one re-installs the saved handler.
>
> which one(s) would you suggest I remove?


Probably all of them, but it is not really possible to know from what you
give. We would need to see the code that set the original handler that is
getting saved and then restored. If the handler you inherit is necessary,
then why would it be safe to overwrite it with something else for even the
duration of this routine? On the other hand, if the handler you inherit is
not necessary, then what is the point of saving and re-installing it? If
there is no other code which installs a handler in the first place, then I'd
remove all three of those things. (And even if not, remove at least two;
see below.)

> Yes, there's a deficiency (bug, if you will) in the code. The
> $SIG{CLD} should be re-installed if fork fails, but that I think, is
> of no consequence to the problem at hand.


Since you use local to install the handler, I think the old one will be
reinstalled upon fork failure anyway. Saving the old one explicitly and
reinstalling it explicitly seems to be unnecessary, assuming the local is doing
its job.
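(Editor's sketch.) A small illustration of that point: a `%SIG` entry set with `local` is put back when the enclosing sub is left by any route, including an early `return` like runIt's fork-failure path, so the explicit save and restore add nothing. SIGUSR1 and the handler names are hypothetical stand-ins for the real SIGCLD handlers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical handlers: one stands in for whatever the caller installed,
# the other for the handler runIt installs for its own use.
sub outer_handler { print "outer handler called\n" }
sub inner_handler { print "inner handler called\n" }

$SIG{USR1} = \&outer_handler;

# Same shape as runIt: local installs the temporary handler, and Perl puts
# the previous value back when the sub is left, however it is left.
sub run_it {
    my ($simulate_fork_failure) = @_;
    local $SIG{USR1} = \&inner_handler;
    return 'fork failed' if $simulate_fork_failure;    # early return path
    return 'ran';
}

for my $fail (1, 0) {
    my $result   = run_it($fail);
    my $restored = $SIG{USR1} == \&outer_handler ? 'yes' : 'no';
    print "$result: outer handler restored? $restored\n";
}
```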

Xho

Mark
      02-15-2007
On Feb 13, 11:22 am, "A" <(E-Mail Removed)> wrote:
> I am getting intermittent unexpected result from waitpid on Solaris 9
>
> sub runIt
> {
> my $oldSigCld = $SIG{CLD};
> local $SIG{CLD} = \&sigChldHandler;


I think you meant sigCLDhandler here.

 
A
      02-20-2007
On Feb 14, 7:54 pm, "Mark" <(E-Mail Removed)> wrote:
> On Feb 13, 11:22 am, "A" <(E-Mail Removed)> wrote:
>
> > I am getting intermittent unexpected result from waitpid on Solaris 9
>
> > sub runIt
> > {
> > my $oldSigCld = $SIG{CLD};
> > local $SIG{CLD} = \&sigChldHandler;
>
> I think you meant sigCLDhandler here.


Yes!

 
A
      02-20-2007
On Feb 14, 12:31 pm, (E-Mail Removed) wrote:
> "A" <(E-Mail Removed)> wrote:
> > On Feb 13, 3:44 pm, (E-Mail Removed) wrote:

>
> > > You are mucking with $SIG{CLD} when, as far as I can tell, you have
> > > no need to. getChildStatus (and thewaitpidin it) can get called
> > > twice, once from the sig handler and once from the runIt. If it does
> > > get called twice, the second time that child no longer exists, as it
> > > was already waited on. Remove the $SIG{CLD} stuff.

>
> > > Xho

>
> > > - Show quoted text -

>
> > Thanks for your reply.

>
> > First, there's a typo in my original message.

>
> > The third line after the while(1) in getChildStatus should be
> > $tm = $!;
> > instead of
> > $em = $!;

>
> > Now, to the point that thewaitpidcould get called twice.

>
> > Please note that the code is designed to guard against this, the
> > assignments to the globals $ec and $em are done if and only ifwaitpid
> > returns the matching pid.

>
> The waitpid of one getChildStatus returns the expected pid and sets the
> global $? and $!. Before it can do anything else, the waitpid of the other
> getChildStatus returns -1 and overwrites the global $? and $! with its
> own values, but for this one $r does not pass the if test, so it returns control
> to the first getChildStatus. The first getChildStatus has the right pid
> recorded in $r (as that was a lexical and didn't get overwritten), but has
> the wrong $? and $! because they did get overwritten, and now those get
> recorded into your $tc and $tm.
>


Thanks for your explanation. Yes, it has every indication of being a
race condition.

>
> > On your advice to remove the $SIG{CLD}, there are 3 statements,
>
> > the first statement saves the handler,
> > the second statement installs the current one needed by this
> > routine
> > and the last one re-installs the saved handler.
>
> > which one(s) would you suggest I remove?
>
> Probably all of them, but it is not really possible to know from what you
> give. We would need to see the code that set the original handler that is
> getting saved and then restored. If the handler you inherit is necessary,
> then why would it be safe to overwrite it with something else for even the
> duration of this routine? On the other hand, if the handler you inherit is
> not necessary, then what is the point of saving and re-installing it? If
> there is no other code which installs a handler in the first place, then I'd
> remove all three of those things. (And even if not, remove at least two,
> see below)
>
> > Yes, there's a deficiency (bug, if you will) in the code. The
> > $SIG{CLD} should be re-installed if fork fails, but that I think, is
> > of no consequence to the problem at hand.
>
> Since you use local to install the handler, I think the old one will be
> reinstalled upon fork failure anyway. Saving the old one explicitly and
> reinstalling it explicitly seems to be unnecessary, assuming the local is doing
> its job.
>
> Xho
>


I am not sure I understand your remark that the rest of the code
should be a factor in determining what signal handling should be used
here. The rest of the code may or may not contain other routines which
may or may not install their own handlers, and the same may be true
for the calling routine.

I have been scouring our logs since my earlier posts, and here is
what I found:
this is a "bug" in Perl 5.8.8 itself, at least in Perl 5.8.8 on
Solaris 9.


The original Program_A (in Perl) and Program_B (C++) had been running
unchanged for about a year. We had been using the Perl 5.6.1 that came
with Solaris. We discovered a separate "bug" in that Perl (related
to file locking), so we migrated to Perl 5.8.8, and the "errors"
I described in my original post started appearing at exactly the same time.

More tellingly, these applications run on a set of servers that are
diverse both in network topology and in geographic location. The upgrade to
Perl 5.8.8 was done in stages, and the "problem" started on each
machine on precisely the date that machine was upgraded to Perl 5.8.8.

So, it appears that we traded one Perl bug for another.

Thanks.

 
xhoster@gmail.com
      02-21-2007
"A" <(E-Mail Removed)> wrote:
>
> >
> > > On your advice to remove the $SIG{CLD}, there are 3 statements,
> >
> > > the first statement saves the handler,
> > > the second statement installs the current one needed by this
> > > routine
> > > and the last one re-installs the saved handler.
> >
> > > which one(s) would you suggest I remove?
> >
> > Probably all of them, but it is not really possible to know from what
> > you give. We would need to see the code that set the original handler
> > that is getting saved and then restored. If the handler you inherit is
> > necessary, then why would it be safe to overwrite it with something
> > else for even the duration of this routine? On the other hand, if the
> > handler you inherit is not necessary, then what is the point of saving
> > and re-installing it? If there is no other code which installs a
> > handler in the first place, then I'd remove all three of those things.
> > (And even if not, remove at least two, see below)


....

> I am not sure I understand your remark that the rest of the code
> should be a factor in determining what signal handling should be used
> here.


If the rest of the code installed a SIGCHLD handler, it probably did so
for a reason: it expects to be signaled when a child it started exits,
and expects to do something necessary upon that signal. So suppose you
replace that handler for a brief period, and before you re-install it, the
child that the other part of the code spawned exits. Now the originally
installed handler is restored, but it is never going to get called,
because your code already ate that signal. That is probably bad.
Why is that bad? I don't know, because I don't know what the rest of the
code does. But presumably it wouldn't have installed a signal handler if
it didn't need to, and if it did need it, then having it not get called is
bad. On the other hand, the code you showed us installed a signal handler
despite (apparently) not needing to, so maybe assuming that the rest of the
code would only install a signal handler if it needed it is a bad
assumption.
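(Editor's sketch.) If the rest of the program does rely on its own SIGCHLD handler, one way to avoid eating its signals is to chain to the previous handler instead of replacing it, and to make the reaping itself safe to call from both the handler and the main code. This is one possible design, not a standard idiom: `reap` and `run_child` are hypothetical names, the child is a stand-in that just exits with a given code, and only a previous handler that is a code reference gets chained (string handlers such as a sub name, 'IGNORE' or 'DEFAULT' are not forwarded).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %status_of;    # pid => raw wait status, recorded by reap() only

# Non-blocking reap of one specific child. It may be called from both the
# SIGCHLD handler and the main code: whichever call collects the child
# records $?, and local() undoes any $?/$! changes made by the other call.
sub reap {
    my ($pid) = @_;
    local ($!, $?);
    my $r = waitpid($pid, WNOHANG);
    $status_of{$pid} = $? if $r == $pid;
    return $r;
}

# Run a child (a hypothetical stand-in that just exits with $code) while
# chaining to whatever SIGCHLD handler was installed before, so the rest
# of the program is still notified about its own children.
sub run_child {
    my ($code) = @_;
    my $pid;
    my $previous = $SIG{CHLD};
    local $SIG{CHLD} = sub {
        reap($pid) if defined $pid;
        $previous->(@_) if ref $previous eq 'CODE';
    };

    $pid = fork;
    die "fork failed: $!\n" unless defined $pid;
    POSIX::_exit($code) if $pid == 0;

    # Poll as well, in case the signal was handled before $pid was set.
    until (exists $status_of{$pid}) {
        my $r = reap($pid);
        die "child $pid was reaped by someone else\n"
            if $r == -1 && !exists $status_of{$pid};
        select(undef, undef, undef, 0.05) unless exists $status_of{$pid};
    }
    return $status_of{$pid};
}

my $status = run_child(5);
printf "child exited with code %d\n", $status >> 8;
```

Waiting on a specific pid with WNOHANG means this reaper never collects children that belong to other parts of the program. A previous handler that itself calls `waitpid(-1, ...)` could still grab this child first, which the polling loop reports rather than hanging.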

Xho
