Velocity Reviews - Perl Misc
Can this database update script be optimized?
(http://www.velocityreviews.com/forums/t884745-can-this-database-update-script-be-optimized.html)

Cheez 01-22-2004 07:14 AM

Can this database update script be optimized?
 
Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I check for duplicate records by first doing a SELECT.
If the record exists, I 1) increase a counter by one and 2) concatenate
a string onto a text field. If the record doesn't exist, I INSERT the
data normally.

It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size. Initially I have about 40 of these files to
process to create a table with about 2 million records.

My questions are:

1. Using Perl, is this the most efficient approach?
2. Would a native SQL query using the database GUI be faster?
3. I am really poor at choosing when and where to use 'die'
statements. Any pointers on where I should put these would be great.

Thanks,
Cheez


code below:


print "\nINSERTER_SEQUENCE: inserting data into table\n";

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

open(FILE, "seqparsed.txt");

while (my $data = <FILE>) {

($wherefound,$sequence) = split /\t/, $data;

dupcheck (); #this sub does all the work

}

close (FILE);

sub dupcheck {

$sth = $dbh->prepare("SELECT sequence, wherefound FROM
sequence5 WHERE sequence=?");

$sth->execute($sequence);

#selecting any records that match $sequence

if (@row = $sth->fetchrow_array) {

#a match results in frequency++
#and update of where it was found

my ($seq, $wheref) = @row;

$wherefound = "$wherefound, $wheref";

$sth = $dbh->prepare("update sequence5 SET
frequency=frequency+1, wherefound=? WHERE sequence=?");

$sth->execute($wherefound,$sequence);

}

else {

# if no records match $sequence then INSERT new data

$sth = $dbh->prepare("INSERT INTO sequence5 (wherefound,
sequence) VALUES (?, ?)");

$sth->execute( $wherefound, $sequence);

}
}

gnari 01-22-2004 08:24 AM

Re: Can this database update script be optimized?
 
"Cheez" <cheez2112@yahoo.com> wrote in message
news:1e85f7c8.0401212314.26159394@posting.google.c om...
> Hi there,
>
>
> It's slow. I am performing this update with about 50,000 records at a
> time. I think there's an exponential increase in processing time as
> the table grows in size.


have you checked your indexes?
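
For illustration, a minimal sketch of adding an index on the lookup
column (table and column names are those from the original post; if
'sequence' is a TEXT column, MySQL needs a key prefix length):

    # one-time setup, hypothetical index name; untested
    $dbh->do("CREATE INDEX idx_sequence ON sequence5 (sequence(64))");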

gnari





pkent 01-22-2004 04:55 PM

Re: Can this database update script be optimized?
 
In article <1e85f7c8.0401212314.26159394@posting.google.com>,
cheez2112@yahoo.com (Cheez) wrote:

> I have a script that inserts and updates a mySQL table using data from
> a text file. I am checking for duplicate records by first doing a

....
> 2. Would a native SQL query using the database GUI be faster?


Ghod no, with all that GUI stuff and dialogs and menus in the way :-)
... or do you mean the database's native client libraries? If you do,
well, the Perl DBI hooks directly into them (in all cases I know of),
so you get the best speed.


> 3. I am really poor at choosing when and where to use 'die'
> statements. Any pointers on where I should put these would be great.


use the 'RaiseError' => 1 option in the DBI connect() call - there are
examples in the docs. And always check the return value of open().

And always use strict and warnings (or -w on older versions of perl).
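
A minimal sketch of that setup (connection parameters are the ones from
the original script; illustrative, not tested against the poster's
database):

    use strict;
    use warnings;
    use DBI;

    # RaiseError makes every failed DBI call die with the error,
    # so you don't need explicit checks after each prepare/execute
    my $dbh = DBI->connect("dbi:mysql:mydb", "", "",
                           { RaiseError => 1 });

    # and always check that the file actually opened
    open(FILE, "seqparsed.txt")
        or die "can't open seqparsed.txt: $!";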


> sub dupcheck {
>
> $sth = $dbh->prepare("SELECT sequence, wherefound FROM
> sequence5 WHERE sequence=?");


This is a bad, or at least inefficient, thing to do, because you call
this sub once for every line - and you're prepare()ing the same thing
every time. You need to prepare() _once_ outside the loop and then use
the resulting statement handle over and over inside the loop.


> $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound,
> sequence) VALUES (?, ?)");


ditto for this one - prepare once, use many times.
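
A sketch of that restructuring (the same three statements as the
original script, just prepared once before the read loop; untested):

    my $select = $dbh->prepare(
        "SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    my $update = $dbh->prepare(
        "UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
    my $insert = $dbh->prepare(
        "INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

    # inside the per-line loop only execute() is called, e.g.
    #   $select->execute($sequence);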

P

--
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply

ctcgag@hotmail.com 01-22-2004 05:09 PM

Re: Can this database update script be optimized?
 
cheez2112@yahoo.com (Cheez) wrote:
> Hi there,
>
> I have a script that inserts and updates a mySQL table using data from
> a text file. I am checking for duplicate records by first doing a
> SELECT. If the record exists and 1) increase a counter by one and 2)
> concatenate text in a field with a string.


Are you sure you don't want master=>detail tables, rather than
concatting into a field?
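
(For anyone unfamiliar with the idea: rather than appending each new
location onto one ever-growing text field, keep one row per sighting in
a second table. A hypothetical schema sketch with invented table and
column names:

    # master: one row per distinct sequence
    $dbh->do("CREATE TABLE sequences (
                  id       INT AUTO_INCREMENT PRIMARY KEY,
                  sequence TEXT)");

    # detail: one row per place a sequence was found
    $dbh->do("CREATE TABLE sightings (
                  sequence_id INT,
                  wherefound  VARCHAR(255))");

The frequency then falls out as a COUNT(*) over the detail table
instead of a hand-maintained counter.)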

> If the record doesn't
> exist, I INSERT the data normally.
>
> It's slow.


How slow is it? How fast does it need to be?

> I am performing this update with about 50,000 records at a
> time.


You are performing the update one record at a time, 50,000 times.

> I think there's an exponential increase in processing time as
> the table grows in size.


I really doubt it. I bet it is, at worst, N^2 (without an index, each
SELECT scans every row already in the table, so N lookups over a
growing table cost on the order of N^2 comparisons). If you can
demonstrate that it is exponential, that would be quite interesting
(and perhaps useful for finding the problem).


> Initially I have about 40 of these files to
> process to create a table with about 2 million records.


40*50,000 = 2 million. So how many times does the "pre-existing record"
branch of the code actually get executed?

> My questions are:
>
> 1. using Perl is this the most efficient approach?


I would move all the prepares to outside the inner loops, although
for MySQL I doubt it matters. If the insert branch is frequent and the
update branch is rare, I'd do the insert pre-emptively, and then do the
update only if the insert throws you a primary key violation.

But most likely, it isn't a Perl problem but a mysql one. Is "sequence" a
primary key or otherwise indexed?
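
A sketch of that insert-first idea (assuming RaiseError => 1 is set and
'sequence' has a unique index, so a duplicate INSERT actually fails;
the CONCAT trick and the error-message test are MySQL-specific and
untested):

    # pushing the concatenation into SQL removes the SELECT entirely
    my $update = $dbh->prepare(
        "UPDATE sequence5
            SET frequency = frequency + 1,
                wherefound = CONCAT(wherefound, ', ', ?)
          WHERE sequence = ?");
    my $insert = $dbh->prepare(
        "INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

    eval { $insert->execute($wherefound, $sequence) };  # optimistic insert
    if ($@) {
        # assume a duplicate-key failure; re-throw anything else
        die $@ unless $@ =~ /Duplicate entry/;
        $update->execute($wherefound, $sequence);
    }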

> 2. Would a native SQL query using the database GUI be faster?


I have no idea what you are talking about. I doubt that this is a
question for a perl group.

> 3. I am really poor at choosing when and where to use 'die'
> statements. Any pointers on where I should put these would be great.


First, use strict. Pass $wherefound and $sequence into dupcheck() as
arguments. You need to die (or otherwise handle) a failure to open.

Use RaiseError=>1 in your db connection. If you don't do that, then
you should have "die" statements in damn near every line.

Xho


Cheez 01-23-2004 06:32 AM

Re: Can this database update script be optimized?
 
Thanks again, all, for the replies. Very useful stuff. It did end up
taking a very long time: only about 20,000 records/hour were updated
using this script, which is too slow. Instead, I am going to build a
hash table using 'sequence' as the key and increment its value by one
for duplicates. Maybe I'll try that with all 3 million records.
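
A sketch of that hash-based plan (assuming the counts and locations for
all records fit in memory, and that the table is loaded only after all
the files have been read; illustrative, not tested):

    my (%freq, %where);
    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;
        $freq{$sequence}++;                        # count duplicates
        push @{ $where{$sequence} }, $wherefound;  # collect locations
    }

    # one INSERT per distinct sequence, after aggregating
    my $insert = $dbh->prepare(
        "INSERT INTO sequence5 (sequence, frequency, wherefound)
         VALUES (?, ?, ?)");
    for my $sequence (keys %freq) {
        $insert->execute($sequence, $freq{$sequence},
                         join(', ', @{ $where{$sequence} }));
    }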

I just can't say enough about how useful these comments are.

Cheers,
Cheez

Bruce Horrocks 01-23-2004 05:30 PM

Re: Can this database update script be optimized?
 
In message <1e85f7c8.0401212314.26159394@posting.google.com>, Cheez
<cheez2112@yahoo.com> writes
>2. Would a native SQL query using the database GUI be faster?


See my response to your previous "SQL insert/update duplicate entries"
post.

Regards,
--
Bruce Horrocks
Surrey
England
<firstname>@<surname>.plus.com -- fix the obvious for email

