Can this database update script be optimized?

 
 
Cheez
01-22-2004
Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I am checking for duplicate records by first doing a
SELECT. If the record exists, I 1) increase a counter by one and 2)
concatenate text in a field with a string. If the record doesn't
exist, I INSERT the data normally.

It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size. Initially I have about 40 of these files to
process to create a table with about 2 million records.

My questions are:

1. Using Perl, is this the most efficient approach?
2. Would a native SQL query using the database GUI be faster?
3. I am really poor at choosing when and where to use 'die' statements.
Any pointers on where I should put these would be great.

Thanks,
Cheez


code below:


print "\nINSERTER_SEQUENCE: inserting data into table\n";

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

open(FILE, "seqparsed.txt");

while (my $data = <FILE>) {
    ($wherefound, $sequence) = split /\t/, $data;
    dupcheck();    # this sub does all the work
}

close(FILE);

sub dupcheck {
    # select any records that match $sequence
    $sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    $sth->execute($sequence);

    if (@row = $sth->fetchrow_array) {
        # a match results in frequency++ and an update of where it was found
        my ($seq, $wheref) = @row;
        $wherefound = "$wherefound, $wheref";

        $sth = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
        $sth->execute($wherefound, $sequence);
    }
    else {
        # if no records match $sequence then INSERT new data
        $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");
        $sth->execute($wherefound, $sequence);
    }
}
 
gnari
01-22-2004
"Cheez" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> Hi there,
>
>
> It's slow. I am performing this update with about 50,000 records at a
> time. I think there's an exponential increase in processing time as
> the table grows in size.


Have you checked your indexes?
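
If 'sequence' is not indexed, every SELECT scans the whole table, and
each scan gets slower as the table grows. A one-off statement along
these lines could help (a sketch; the index name and prefix length are
guesses you would tune to your data):

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

# One-off: index the column used in every WHERE clause.
# The (50) prefix length is needed if sequence is a TEXT column;
# size it to however long your sequences really are.
$dbh->do("CREATE INDEX sequence_idx ON sequence5 (sequence(50))");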

gnari




 
pkent
01-22-2004
Cheez wrote:

> I have a script that inserts and updates a MySQL table using data from
> a text file. I am checking for duplicate records by first doing a

....
> 2. Would a native SQL query using the database GUI be faster?


Ghod no, with all that GUI stuff and dialogs and menus in the way
... or do you mean the database's native client libraries? If you do,
well, the Perl DBI hooks directly into them (in all cases I know of), so
you get the best speed.


> 3. I am really poor at choosing when and where to use 'die' statements.
> Any pointers on where I should put these would be great.


Use the 'RaiseError' => 1 option in the DBI connect() call - there are
examples in the docs. And always check the return value of open().

And always use strict and warnings (or -w on older versions of perl).
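
Put together, the top of the script might look something like this (a
sketch, not your exact code):

use strict;
use warnings;
use DBI;

# RaiseError => 1 makes every DBI call die on failure, so you
# don't need an explicit check after each prepare/execute.
my $dbh = DBI->connect("dbi:mysql:mydb", "", "",
    { RaiseError => 1 });

# And never ignore the return value of open().
open(FILE, "seqparsed.txt")
    or die "Can't open seqparsed.txt: $!";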


> sub dupcheck {
>     $sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");


This is a bad, or at least inefficient, thing to do, because you call
this sub once for every line - and you're prepare()ing the same thing
every time. You need to prepare() _once_ outside the loop and then use
the resulting statement handle over and over inside the loop.


> $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound,
> sequence) VALUES (?, ?)");


Ditto for this one - prepare once, use many times.
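
Roughly, the reshaped loop would be (a sketch, keeping your table and
variable names):

# Prepare all three statements once, before the read loop.
my $sel = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
my $upd = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
my $ins = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

while (my $data = <FILE>) {
    chomp $data;    # don't let the newline end up in the last field
    my ($wherefound, $sequence) = split /\t/, $data;

    # Re-execute the prepared handles; nothing is re-prepared per line.
    $sel->execute($sequence);
    if (my ($seq, $wheref) = $sel->fetchrow_array) {
        $upd->execute("$wherefound, $wheref", $sequence);
    }
    else {
        $ins->execute($wherefound, $sequence);
    }
}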

P

--
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply
 
 
ctcgag@hotmail.com
01-22-2004
Cheez wrote:
> Hi there,
>
> I have a script that inserts and updates a MySQL table using data from
> a text file. I am checking for duplicate records by first doing a
> SELECT. If the record exists, I 1) increase a counter by one and 2)
> concatenate text in a field with a string.


Are you sure you don't want master=>detail tables, rather than
concatting into a field?
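
I mean something along these lines (a sketch; the detail table name
and columns are invented for illustration):

# Master: one row per distinct sequence, with its count.
$dbh->do("CREATE TABLE sequence5 (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    sequence  TEXT,
    frequency INT DEFAULT 1
)");

# Detail: one row per place a sequence was found, instead of
# an ever-growing comma-separated string in one column.
$dbh->do("CREATE TABLE sequence5_locations (
    sequence_id INT,
    wherefound  VARCHAR(255)
)");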

> If the record doesn't
> exist, I INSERT the data normally.
>
> It's slow.


How slow is it? How fast does it need to be?

> I am performing this update with about 50,000 records at a
> time.


You are performing the update one record at a time, 50,000 times.

> I think there's an exponential increase in processing time as
> the table grows in size.


I really doubt it. I bet it is, at worst, N^2. If you can demonstrate
that it is exponential, that would be quite interesting (and perhaps useful
for finding the problem).


> Initially I have about 40 of these files to
> process to create a table with about 2 million records.


40*50,000 = 2 million. So how many times does the "pre-existing record"
branch of the code actually get executed?

> My questions are:
>
> 1. Using Perl, is this the most efficient approach?


I would move all the prepares to outside the inner loops, although
for MySQL I doubt it matters. If the insert branch is frequent and the
update branch is rare, I'd do the insert pre-emptively, and then do the
update only if the insert throws you a primary key violation.

But most likely, it isn't a Perl problem but a MySQL one. Is "sequence" a
primary key or otherwise indexed?
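
Shape-wise, the insert-first version is something like this (a sketch;
it assumes RaiseError is on and that 'sequence' is declared UNIQUE or
PRIMARY KEY, so a duplicate insert actually fails):

my $ins = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");
my $upd = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1,
    wherefound=CONCAT(wherefound, ?) WHERE sequence=?");

# Try the common case (a brand-new sequence) first.
eval {
    $ins->execute($wherefound, $sequence);
};
if ($@) {
    # Assume the failure was a duplicate-key violation; a careful
    # version would inspect $dbh->err for the duplicate-key code.
    $upd->execute(", $wherefound", $sequence);
}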

> 2. Would a native SQL query using the database GUI be faster?


I have no idea what you are talking about. I doubt that this is a
question for a Perl group.

> 3. I am really poor at choosing when and where to use 'die' statements.
> Any pointers on where I should put these would be great.


First, use strict. Pass $wherefound and $sequence into dupcheck() as
arguments. You need to die (or otherwise handle) a failure to open.
Use RaiseError=>1 in your db connection. If you don't do that, then
you should have "die" statements in damn near every line.

Xho

 
 
Cheez
01-23-2004
Thanks again all for the replies. Very useful stuff. It did end up
taking a very long time: only about 20,000 records/hour were updated
using this script, which is too slow. I am going to build a hash
using 'sequence' as the key and incrementing its value by one for
duplicates. Maybe try with all 3 million records?
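
Something like this is what I have in mind (a rough sketch; it assumes
all the aggregated data fits in memory):

my (%count, %where);

while (my $data = <FILE>) {
    chomp $data;
    my ($wherefound, $sequence) = split /\t/, $data;

    # Aggregate in memory: one entry per distinct sequence.
    $count{$sequence}++;
    $where{$sequence} = defined $where{$sequence}
        ? "$where{$sequence}, $wherefound"
        : $wherefound;
}

# One INSERT per distinct sequence -- no SELECT-then-UPDATE round trips.
my $ins = $dbh->prepare("INSERT INTO sequence5 (sequence, frequency, wherefound) VALUES (?, ?, ?)");
for my $sequence (keys %count) {
    $ins->execute($sequence, $count{$sequence}, $where{$sequence});
}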

I just can't say enough about how useful these comments are.

Cheers,
Cheez
 
 
Bruce Horrocks
01-23-2004
Cheez writes:
>2. Would a native SQL query using the database GUI be faster?


See my response to your previous "SQL insert/update duplicate entries"
post.

Regards,
--
Bruce Horrocks
Surrey
England
<firstname>@<surname>.plus.com -- fix the obvious for email
 