Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > speaking of forking -- parallel database fetches?

Reply
Thread Tools

speaking of forking -- parallel database fetches?

 
 
DJ Stunks
Guest
Posts: n/a
 
      11-24-2006
Hey all,

I have a question about performing two long-running selects from a
database in parallel. I've only written a few scripts which do things
in parallel so I'm not an expert by any means.

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete rather than pulling all the rows in the child, and serializing
and passing to the parent.

Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

TIA,
-jp

#!/usr/bin/perl <pseudocode>

use strict;
use warnings;

my @kids = (
{ query => 'select * from big_table' },
{ query => 'select * from another_big_table' },
);

my $pid;
for my $kid (@kids) {
$pid = open( my $fh, '|-');
die "Can't fork: $!\n" if not defined $pid;

@{ $kid }{ 'pid','handle' } = ($pid,$fh);
}

if ( $pid == 0) { # I'm one of the children

# connect to the db
# prepare query
# execute query
# wait for results
# foreach @row = $sth->fetchrow_array
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row
#
# while ( my $line = < $kids[1]{handle} >) {
# etc.
}

__END__

 
Reply With Quote
 
 
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      11-24-2006
"DJ Stunks" <(E-Mail Removed)> wrote:
> Hey all,
>
> I have a question about performing two long-running selects from a
> database in parallel. I've only written a few scripts which do things
> in parallel so I'm not an expert by any means.
>
> I have two straightforward SELECT statements, but both take about two
> minutes to complete. I'd like to run them in parallel, but I obviously
> need access to all the rows - what's the best way to do so?


How many rows are we talking?

> I was thinking something along these lines (pseudocode below) but I was
> hoping there would be some way to give the parent access to the
> statement handle itself so it could pull the rows once the queries were
> complete


I'm pretty sure that that ain't gonna happen.

> rather than pulling all the rows in the child, and serializing
> and passing to the parent.
>
> Any ideas or maybe modules which could be handy? (I looked at both
> Acme::Spork and Parallel::ForkManager but neither are appropriate)


Parallel::Jobs might help, although it isn't very easy to use and wants to
run commands rather than Perl code (so you would have to fire up a new Perl
interpreter from scratch, and it wouldn't inherit variables, etc, from the
existing one). I've posted here a while ago a simple mod of
Parallel::ForkManager to allow back tack from the child to the parent, but
it send the data as a slug of serialized data, not as a stream. Whether
this is deadly or not depends on how many rows.

http://groups.google.com/group/comp....frm/thread/9c8
ef79472740156/

Also, there is Parallel::Simple, which I never used but looks promising.

>
> my $pid;
> for my $kid (@kids) {
> $pid = open( my $fh, '|-');
> die "Can't fork: $!\n" if not defined $pid;
>
> @{ $kid }{ 'pid','handle' } = ($pid,$fh);
> }
>
> if ( $pid == 0) { # I'm one of the children


How does the child know which child it is? It would have to loop over
@kids comparing it's pid to the stored pid (which it can't do, because
the stored pid is stored only in the parent, not the child). Better to
move this else block into the inside of the the for my $kid loop, that way
the kid automatically know who it is by looking in $kid.

....

> # wait for results
> # foreach @row = $sth->fetchrow_array


Technically, you can't waid for results as a separate action from
calling fetchrow_array. fetchrow_array is inherently waiting for results.

> # print join( $;, @row ), "\n"
>
> # exit; (exit or waitpid? do I know the parent read everything?)


exit. You almost certainly don't want to waitpid (for what? The parent?).
At this point, you stuffed everything you have into the buffer up to the
parent. That is all you an do. If the parent doesn't read it all, what
is the child to do about it?

> }
> else { # I'm the parent
>
> # while ( my $line = < $kids[0]{handle} >) {
> # @row = split $;, $line;
> # do whatever with @row


You will want to "close $kids[0]{handle} or [die|warn] $! $?;" to make sure
the kid ended things on a good note. The close implicilty waits for the
the child to exit.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
 
 
 
Rocco Caputo
Guest
Posts: n/a
 
      11-25-2006
On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
>
> I have two straightforward SELECT statements, but both take about two
> minutes to complete. I'd like to run them in parallel, but I obviously
> need access to all the rows - what's the best way to do so?

[...]
> Any ideas or maybe modules which could be handy? (I looked at both
> Acme::Spork and Parallel::ForkManager but neither are appropriate)


There are four ways to do this with POE. Whether they're appropriate
is left as an exercise.

http://search.cpan.org/search?query=POE+DBI&mode=dist

--
Rocco Caputo - http://poe.perl.org/
 
Reply With Quote
 
DJ Stunks
Guest
Posts: n/a
 
      11-25-2006

Rocco Caputo wrote:
> On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
> >
> > I have two straightforward SELECT statements, but both take about two
> > minutes to complete. I'd like to run them in parallel, but I obviously
> > need access to all the rows - what's the best way to do so?

> [...]
> > Any ideas or maybe modules which could be handy? (I looked at both
> > Acme::Spork and Parallel::ForkManager but neither are appropriate)

>
> There are four ways to do this with POE. Whether they're appropriate
> is left as an exercise.
>
> http://search.cpan.org/search?query=POE+DBI&mode=dist


this looks great, time to start reading

you know POE had occurred to me, my search terms must have been bad
choices.

Thanks to both who responded,
-jp

 
Reply With Quote
 
Peter J. Holzer
Guest
Posts: n/a
 
      11-26-2006
On 2006-11-24 19:23, DJ Stunks <(E-Mail Removed)> wrote:
> I have two straightforward SELECT statements, but both take about two
> minutes to complete. I'd like to run them in parallel, but I obviously
> need access to all the rows - what's the best way to do so?
>
> I was thinking something along these lines (pseudocode below) but I was
> hoping there would be some way to give the parent access to the
> statement handle itself so it could pull the rows once the queries were
> complete


Theoretically, that's possible. In practice, database client libaries
don't support that.

[...]
> if ( $pid == 0) { # I'm one of the children
>
> # connect to the db
> # prepare query
> # execute query
> # wait for results
> # foreach @row = $sth->fetchrow_array
> # print join( $;, @row ), "\n"
>
> # exit; (exit or waitpid? do I know the parent read everything?)
> }
> else { # I'm the parent
>
> # while ( my $line = < $kids[0]{handle} >) {
> # @row = split $;, $line;
> # do whatever with @row
> #
> # while ( my $line = < $kids[1]{handle} >) {
> # etc.
> }


You are reading all the lines from kid 0 before the first line from kid 1
here. This means that kid 1 will block as soon as it has written enough
records to fill the pipe (typically a few kB). Not a problem if your
queried only return a few rows, but if they are long-running because
they return lots of rows, they won't really run in parallel.

You may want to look at select (or IO::Select), but be warned that select
and <> don't mix well - you need to use sysread instead.

hp

--
_ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
|_|_) | Sysadmin WSR | > ist?
| | | http://www.velocityreviews.com/forums/(E-Mail Removed) | Was sonst wäre der Sinn des Erfindens?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd
 
Reply With Quote
 
DJ Stunks
Guest
Posts: n/a
 
      11-27-2006
DJ Stunks wrote:
> Rocco Caputo wrote:
> > On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
> > >
> > > I have two straightforward SELECT statements, but both take about two
> > > minutes to complete. I'd like to run them in parallel, but I obviously
> > > need access to all the rows - what's the best way to do so?

> > [...]
> > > Any ideas or maybe modules which could be handy? (I looked at both
> > > Acme::Spork and Parallel::ForkManager but neither are appropriate)

> >
> > There are four ways to do this with POE. Whether they're appropriate
> > is left as an exercise.
> >
> > http://search.cpan.org/search?query=POE+DBI&mode=dist

>
> this looks great, time to start reading


Well, I went with EasyDBI which looks ideal, however, my queries are
both coming back with the error "Died". The queries work through
straight DBI, and both of them take the right amount of time through
EasyDBI, but no rows return, just the error.

I'm not sure if anyone is familiar with this module. It doesn't look
like I can get any more detailed debugging as to what "Died". I'll try
one of the other 4 suggestions tomorrow and see how it goes.

If it matters, the queries use the DBD::ODBC driver and I'm running on
Win32 (ActiveState Perl 5.8.7 [813], DBD::ODBC 1.13, POE 0.9500,
EasyDBI 1.14)

Thanks for the suggestion though, I'll keep at it.
-jp

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Database Database Database Database scott93727@gmail.com Computer Information 0 09-27-2012 02:43 AM
DataBase DataBase DataBase DataBase scott93727@gmail.com Computer Information 0 09-26-2012 09:40 AM
Parallel in, Parallel out shift register Vivek Menon VHDL 5 06-08-2011 03:56 PM
Parallel port control with USB->Parallel converter Soren Python 4 02-14-2008 03:18 PM
Parallel/ Multithreading /Forking? Prabh Perl Misc 4 10-28-2004 08:38 PM



Advertisments