Re: How to find bad row with db api executemany()?

 
 
Dennis Lee Bieber
 
      03-29-2013
On Fri, 29 Mar 2013 14:53:30 -0400, Dave Angel <(E-Mail Removed)>
declaimed the following in gmane.comp.python.general:

> On 03/29/2013 10:48 AM, Roy Smith wrote:
> > I'm inserting a gazillion rows into a MySQL database using MySQLdb and cursor.executemany() for efficiency. Every once in a while, I get a row which violates some kind of database constraint and raises Error.
> >
> > I can catch the exception, but don't see any way to tell which row caused the problem. Is this information obtainable, short of retrying each row one by one?
> >

>
> I don't know the direct answer, or even if there is one (way to get
> MySQL to tell you which one failed), but ...
>
> Assuming that executeMany is much cheaper than a million calls to
> executeOne (or whatever).


If using MySQLdb, there isn't all that much difference... MySQLdb is
still compatible with MySQL v4 (and maybe even v3), and since those
versions don't have "prepared statements", .executemany() essentially
turns into something that creates a newline delimited "list" of
"identical" (but for argument substitution) statements and submits that
to MySQL.

Just look at the source code (cursors.py in MySQLdb package).
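In effect, it boils down to something like the following. This is only a simplified sketch of the idea, not MySQLdb's actual code; naive_executemany, conn and param_seq are made-up names, and the statement is assumed to use a lower-case " values " clause:

def naive_executemany(conn, cursor, sql, param_seq):
    # sql is assumed to look like: "INSERT INTO t (a, b) values (%s, %s)"
    head, _, values_part = sql.partition(" values ")
    groups = []
    for params in param_seq:
        # conn.literal() is MySQLdb's client-side quoting/escaping
        escaped = tuple(conn.literal(p) for p in params)
        groups.append(values_part % escaped)
    # one statement, one value group per row, joined with ",\n"
    cursor.execute(head + " values " + ",\n".join(groups))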

Hmmm... That gives a rally point. If it IS submitting one massive
string containing all the data copies, could the failure be coming from
submitting something too big for the client/server communication
channel?
--
Wulfraed Dennis Lee Bieber AF6VN
(E-Mail Removed)    HTTP://wlfraed.home.netcom.com/

 
 
 
 
 
Roy Smith
 
      03-30-2013
In article <(E-Mail Removed)>,
Dennis Lee Bieber <(E-Mail Removed)> wrote:

> If using MySQLdb, there isn't all that much difference... MySQLdb is
> still compatible with MySQL v4 (and maybe even v3), and since those
> versions don't have "prepared statements", .executemany() essentially
> turns into something that creates a newline delimited "list" of
> "identical" (but for argument substitution) statements and submits that
> to MySQL.


Shockingly, that does appear to be the case. I had thought during my
initial testing that I was seeing far greater throughput, but as I got
more into the project and started doing some side-by-side comparisons,
the differences went away.

We're sucking in a pretty huge amount of data. The source document is a
7 GB gzipped XML file. I'm not sure how big it is uncompressed (we use
gzip.GzipFile to uncompress on the fly) but I'm guessing something like
a 30x compression ratio so 200 GB? The last time we imported the whole
set, it ran for 21 days!

It turns out, the problems we were seeing were all inserts into a new
table we added. Apparently, the default charset is latin-1 and we
didn't notice that when we altered the schema! Once I noticed that all
the other tables were utf-8 and changed this one to be that, the
problems went away.
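For the record, the fix is a one-line ALTER per affected table (the table name here is hypothetical):

cursor.execute("ALTER TABLE new_table CONVERT TO CHARACTER SET utf8")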

Sadly, I ended up resorting to a truly ugly hack to diagnose the
problem. I catch the exception and parse the text message. Yuck.

try:
    self.executemany(self.sql_statement, self.sql_params)
except MySQLdb.Error as ex:
    code, message = ex.args
    # MySQL's error text ends with "... at row N", where N is 1-based
    m = re.search(r".* at row (\d+)$", message)
    if m:
        i = int(m.group(1)) - 1  # Python is 0-indexed, SQL is 1-indexed
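The brute-force alternative from the original question (retrying each row one by one) would also work as a fallback whenever the message can't be parsed. A minimal sketch, with find_bad_rows as a made-up helper:

import MySQLdb

def find_bad_rows(cursor, sql_statement, sql_params):
    # Replay a failed batch row by row and collect the offenders.
    bad = []
    for i, params in enumerate(sql_params):
        try:
            cursor.execute(sql_statement, params)
        except MySQLdb.Error as ex:
            bad.append((i, params, ex))
    return bad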


The other truly horrible part of the project was when I decided it was
bad for my import script to have too much schema knowledge hard-wired
in. So, I decided to use SQLAlchemy to introspect the database and
discover the column names, types, and defaults. It turns out, if an
integer column has a default (say, 0), the introspected data comes back
with the default as the string, '0'. WTF???
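The reflection was presumably along these lines (a sketch with made-up names and a hypothetical connection URL); reflected server defaults come back as SQL text, which is where the '0'-as-a-string surprise comes from:

from sqlalchemy import create_engine, MetaData

engine = create_engine("mysql://user:password@localhost/staging")  # hypothetical URL
meta = MetaData()
meta.reflect(bind=engine, only=["some_table"])                     # hypothetical table

for col in meta.tables["some_table"].columns:
    # server_default, when present, wraps the default as SQL text, so an
    # integer column defaulting to 0 shows up as the string '0'
    default = str(col.server_default.arg) if col.server_default is not None else None
    print("%s %s default=%r" % (col.name, col.type, default))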

Does Postgres's Python adapter handle executemany() in a sane way?
We're not wedded to MySQL in any way. We use it for exactly this one
process. We get these XML dumps from a supplier's SQL-Server database.
We stage the data in MySQL, then export what we need into MongoDB. We
could easily swap out the MySQL staging for Postgres if that worked
better.

Hmmm, we do take advantage of REPLACE INTO, which I think is a
non-standard MySQL addition. Not sure if Postgres supports that.
 
 
 
 
 
Chris Angelico
 
      03-30-2013
On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <(E-Mail Removed)> wrote:
> In article <(E-Mail Removed)>,
> Dennis Lee Bieber <(E-Mail Removed)> wrote:
>
>> If using MySQLdb, there isn't all that much difference... MySQLdb is
>> still compatible with MySQL v4 (and maybe even v3), and since those
>> versions don't have "prepared statements", .executemany() essentially
>> turns into something that creates a newline delimited "list" of
>> "identical" (but for argument substitution) statements and submits that
>> to MySQL.

>
> Shockingly, that does appear to be the case. I had thought during my
> initial testing that I was seeing far greater throughput, but as I got
> more into the project and started doing some side-by-side comparisons,
> the differences went away.


How much are you doing per transaction? The two extremes (everything
in one transaction, or each line in its own transaction) are probably
the worst for performance. See what happens if you pepper the code
with 'begin' and 'commit' statements (maybe every thousand or ten
thousand rows) to see if performance improves.
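A minimal sketch of that batching, using the DB-API commit() call rather than literal 'begin'/'commit' statements (names and batch size are arbitrary):

def load_in_batches(conn, cursor, sql, rows, batch_size=10000):
    # Commit every batch_size rows: big enough to amortize transaction
    # overhead, small enough to keep the server's transaction state modest.
    for i, params in enumerate(rows, 1):
        cursor.execute(sql, params)
        if i % batch_size == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch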

ChrisA
 
 
Roy Smith
 
      03-30-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <(E-Mail Removed)> wrote:
> > In article <(E-Mail Removed)>,
> > Dennis Lee Bieber <(E-Mail Removed)> wrote:
> >
> >> If using MySQLdb, there isn't all that much difference... MySQLdb is
> >> still compatible with MySQL v4 (and maybe even v3), and since those
> >> versions don't have "prepared statements", .executemany() essentially
> >> turns into something that creates a newline delimited "list" of
> >> "identical" (but for argument substitution) statements and submits that
> >> to MySQL.

> >
> > Shockingly, that does appear to be the case. I had thought during my
> > initial testing that I was seeing far greater throughput, but as I got
> > more into the project and started doing some side-by-side comparisons,
> > the differences went away.

>
> How much are you doing per transaction? The two extremes (everything
> in one transaction, or each line in its own transaction) are probably
> the worst for performance. See what happens if you pepper the code
> with 'begin' and 'commit' statements (maybe every thousand or ten
> thousand rows) to see if performance improves.
>
> ChrisA


We're doing it all in one transaction, on purpose. We start with an
initial dump, then get updates about once a day. We want to make sure
that the updates either complete without errors, or back out cleanly.
If we ever had a partial daily update, the result would be a mess.

Hmmm, on the other hand, I could probably try doing the initial dump the
way you describe. If it fails, we can just delete the whole thing and
start again.
 
 
Chris Angelico
 
      03-30-2013
On Sat, Mar 30, 2013 at 12:19 PM, Roy Smith <(E-Mail Removed)> wrote:
> In article <(E-Mail Removed)>,
> Chris Angelico <(E-Mail Removed)> wrote:
>
>> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <(E-Mail Removed)> wrote:
>> > In article <(E-Mail Removed)>,
>> > Dennis Lee Bieber <(E-Mail Removed)> wrote:
>> >
>> >> If using MySQLdb, there isn't all that much difference... MySQLdb is
>> >> still compatible with MySQL v4 (and maybe even v3), and since those
>> >> versions don't have "prepared statements", .executemany() essentially
>> >> turns into something that creates a newline delimited "list" of
>> >> "identical" (but for argument substitution) statements and submits that
>> >> to MySQL.
>> >
>> > Shockingly, that does appear to be the case. I had thought during my
>> > initial testing that I was seeing far greater throughput, but as I got
>> > more into the project and started doing some side-by-side comparisons,
>> > the differences went away.

>>
>> How much are you doing per transaction? The two extremes (everything
>> in one transaction, or each line in its own transaction) are probably
>> the worst for performance. See what happens if you pepper the code
>> with 'begin' and 'commit' statements (maybe every thousand or ten
>> thousand rows) to see if performance improves.
>>
>> ChrisA

>
> We're doing it all in one transaction, on purpose. We start with an
> initial dump, then get updates about once a day. We want to make sure
> that the updates either complete without errors, or back out cleanly.
> If we ever had a partial daily update, the result would be a mess.
>
> Hmmm, on the other hand, I could probably try doing the initial dump the
> way you describe. If it fails, we can just delete the whole thing and
> start again.


One transaction for the lot isn't nearly as bad as one transaction per
row, but it can consume a lot of memory on the server - or at least,
that's what I found last time I worked with MySQL. (PostgreSQL works
completely differently, and I'd strongly recommend doing it all as one
transaction if you switch.) It's not guaranteed to help, but if it
won't hurt to try, there's a chance you'll gain some performance.

ChrisA
 
 
Roy Smith
 
      03-30-2013
In article <(E-Mail Removed)>,
Roy Smith <(E-Mail Removed)> wrote:

> In article <(E-Mail Removed)>,
> Dennis Lee Bieber <(E-Mail Removed)> wrote:
>
> > If using MySQLdb, there isn't all that much difference... MySQLdb is
> > still compatible with MySQL v4 (and maybe even v3), and since those
> > versions don't have "prepared statements", .executemany() essentially
> > turns into something that creates a newline delimited "list" of
> > "identical" (but for argument substitution) statements and submits that
> > to MySQL.

>
> Shockingly, that does appear to be the case. I had thought during my
> initial testing that I was seeing far greater throughput, but as I got
> more into the project and started doing some side-by-side comparisons,
> it the differences went away.


OMG, this is amazing.

http://stackoverflow.com/questions/3945642/

It turns out, the MySQLdb executemany() runs a regex over your SQL and
picks one of two algorithms depending on whether it matches or not.

restr = (r"\svalues\s*"
         r"(\(((?<!\\)'[^\)]*?\)[^\)]*(?<!\\)?'"
         r"|[^\(\)]|"
         r"(?:\([^\)]*\))"
         r")+\))")

Leaving aside the obvious line-noise aspects, the operative problem here
is that it only looks for "values" (in lower case).

I've lost my initial test script which convinced me that executemany()
would be a win; I'm assuming I used lower case for that. Our production
code uses "VALUES".

The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
seconds. When I switch to "values", I'm getting more like 1000 rows in
100 ms!

A truly breathtaking bug.
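The obvious workaround, until the driver is fixed, is simply to hand executemany() a statement with the keyword spelled in lower case. A sketch (ensure_fast_path is a made-up helper):

import re

def ensure_fast_path(sql):
    # Rewrite the keyword to lower case so MySQLdb's executemany() regex
    # matches and takes the fast multi-row path.
    return re.sub(r"\bvalues\b", "values", sql, flags=re.IGNORECASE)

# cursor.executemany(ensure_fast_path("REPLACE INTO t (a, b) VALUES (%s, %s)"), rows)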
 
 
Chris Angelico
 
      03-30-2013
On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <(E-Mail Removed)> wrote:
> The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
> seconds. When I switch to "values", I'm getting more like 1000 rows in
> 100 ms!
>
> A truly breathtaking bug.


*facepalm*

Doubly facepalm because a regex could easily have tested for mixed case.

Especially facepalm because there's some way to do this that's faster
than straight INSERT statements, and it's not clearly documented as
"hey, guys, if you want to dump loads of data in, use COPY instead"
(it might be that, I don't know, but usually COPY isn't directly
transliterable with INSERT).

I agree. Breathtaking.

ChrisA
 
 
Roy Smith
 
      03-30-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> Especially facepalm because there's some way to do this that's faster
> than straight INSERT statements, and it's not clearly documented as
> "hey, guys, if you want to dump loads of data in, use COPY instead"
> (it might be that, I don't know, but usually COPY isn't directly
> transliterable with INSERT).


We're actually using REPLACE INTO. For the initial data load, we could
just as well do INSERT, but we need the REPLACE functionality as we roll
in the daily incremental updates.

This also explains why, even after provisioning our RDS instance for
2000 IOPS (that's AWS-speak for "we paid extra to get more disk
bandwidth"), we didn't see any performance improvement!
 
 
rusi
 
      03-30-2013
On Mar 30, 7:49 am, Chris Angelico <(E-Mail Removed)> wrote:
> On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <(E-Mail Removed)> wrote:
> > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
> > seconds. When I switch to "values", I'm getting more like 1000 rows in
> > 100 ms!

>
> > A truly breathtaking bug.

>
> *facepalm*
>
> Doubly facepalm because a regex could easily have tested for mixed case.
>
> Especially facepalm because there's some way to do this that's faster
> than straight INSERT statements, and it's not clearly documented as
> "hey, guys, if you want to dump loads of data in, use COPY instead"
> (it might be that, I don't know, but usually COPY isn't directly
> transliterable with INSERT).
>
> I agree. Breathtaking.
>
> ChrisA


I recently heard this:
A phone company needed to send out bulk SMSes to its customers, on the
order of millions.
A (noob?) Python programmer was assigned the task and used Django with
whatever the Django ORM is.
It took on the order of weeks to send out the SMSes.
A Python expert was called in. He threw out the Python and redid it
in SQL.
It was done in minutes.
 
 
Chris Angelico
 
      03-30-2013
On Sat, Mar 30, 2013 at 2:09 PM, Roy Smith <(E-Mail Removed)> wrote:
> In article <(E-Mail Removed)>,
> Chris Angelico <(E-Mail Removed)> wrote:
>
>> Especially facepalm because there's some way to do this that's faster
>> than straight INSERT statements, and it's not clearly documented as
>> "hey, guys, if you want to dump loads of data in, use COPY instead"
>> (it might be that, I don't know, but usually COPY isn't directly
>> transliterable with INSERT).

>
> We're actually using REPLACE INTO. For the initial data load, we could
> just as well do INSERT, but we need the REPLACE functionality as we roll
> in the daily incremental updates.
>
> This also explains why, even after provisioning our RDS instance for
> 2000 IOPS (that's AWS-speak for "we paid extra to get more disk
> bandwidth"), we didn't see any performance improvement!


Hmm. I heard around the forums that Amazon weren't that great at disk
bandwidth anyway, and that provisioning IO was often a waste of money.
But we never did all that much research on Amazon I/O
performance; shortly after doing some basic benchmarking, we decided
that the cloud was a poor fit for our system model, and went looking
at dedicated servers with their own RAID storage right there on the
bus.

ChrisA
 