Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > [SQL] Pick random rows from SELECT?

Reply
Thread Tools

[SQL] Pick random rows from SELECT?

 
 
Gilles Ganault
Guest
Posts: n/a
 
      09-21-2009
Hello

I have a working Python script that SELECTs rows from a database to
fetch a company's name from a web-based database.

Since this list is quite big and the site is the bottleneck, I'd like
to run multiple instances of this script, and figured a solution would
be to pick rows at random from the dataset, check in my local database
if this item has already been taken care of, and if not, download
details from the remote web site.

If someone's done this before, should I perform the randomization in
the SQL query (SQLite using the APSW wrapper
http://code.google.com/p/apsw/), or in Python?

Thank you.

Here's some simplified code:

sql = 'SELECT id,label FROM companies WHERE activity=1'
rows=list(cursor.execute(sql))
for row in rows:
id = row[0]
label = row[1]

print strftime("%H:%M")
url = "http://www.acme.com/details.php?id=%s" % id
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req).read()

name = re_name.search(response)
if name:
name = name.group(1)
sql = 'UPDATE companies SET name=? WHERE id=?'
cursor.execute(sql, (name,id) )
 
Reply With Quote
 
 
 
 
Peter Otten
Guest
Posts: n/a
 
      09-21-2009
Gilles Ganault wrote:

> I have a working Python script that SELECTs rows from a database to
> fetch a company's name from a web-based database.
>
> Since this list is quite big and the site is the bottleneck, I'd like
> to run multiple instances of this script, and figured a solution would
> be to pick rows at random from the dataset, check in my local database
> if this item has already been taken care of, and if not, download
> details from the remote web site.
>
> If someone's done this before, should I perform the randomization in
> the SQL query (SQLite using the APSW wrapper
> http://code.google.com/p/apsw/), or in Python?
>
> Thank you.
>
> Here's some simplified code:
>
> sql = 'SELECT id,label FROM companies WHERE activity=1'
> rows=list(cursor.execute(sql))
> for row in rows:
> id = row[0]
> label = row[1]
>
> print strftime("%H:%M")
> url = "http://www.acme.com/details.php?id=%s" % id
> req = urllib2.Request(url, None, headers)
> response = urllib2.urlopen(req).read()
>
> name = re_name.search(response)
> if name:
> name = name.group(1)
> sql = 'UPDATE companies SET name=? WHERE id=?'
> cursor.execute(sql, (name,id) )


I don't think you need to randomize the requests. Instead you could control
a pool of worker processes using

http://docs.python.org/library/multiprocessing.html

Peter

 
Reply With Quote
 
 
 
 
Dennis Lee Bieber
Guest
Posts: n/a
 
      09-22-2009
On Mon, 21 Sep 2009 10:59:38 +0200, Gilles Ganault <>
declaimed the following in gmane.comp.python.general:

> Since this list is quite big and the site is the bottleneck, I'd like
> to run multiple instances of this script, and figured a solution would
> be to pick rows at random from the dataset, check in my local database
> if this item has already been taken care of, and if not, download
> details from the remote web site.
>

You really think making MULTIPLE, overlapping requests to a web site
is going to be more efficient than just suffering the single transfer
time of one large query?

> If someone's done this before, should I perform the randomization in
> the SQL query (SQLite using the APSW wrapper
> http://code.google.com/p/apsw/), or in Python?
>

Pardon, I thought you implied the bottleneck is the web-site
database -- I'd worry about any web-site that exposes a file-server
based database to direct user access.

> Here's some simplified code:
>
> sql = 'SELECT id,label FROM companies WHERE activity=1'
> rows=list(cursor.execute(sql))
> for row in rows:
> id = row[0]
> label = row[1]
>
> print strftime("%H:%M")
> url = "http://www.acme.com/details.php?id=%s" % id
> req = urllib2.Request(url, None, headers)
> response = urllib2.urlopen(req).read()
>
> name = re_name.search(response)
> if name:
> name = name.group(1)
> sql = 'UPDATE companies SET name=? WHERE id=?'
> cursor.execute(sql, (name,id) )


Ah... You mean you are retrieving the names from a local database,
and then requesting web-site details based upon that name.

No matter how you look at it, you appear to want to process the
entire local list of companies... Multiple randomized local queries will
just add to the final run-time as you start to get duplicates -- and
have to reject that one to query for another random name.

I'd suggest either a pool of threads -- 5-10, each reading company
names from a shared QUEUE, which is populated by the main thread
(remember to commit() so that you don't block on database updates by the
threads). OR... determine how many companies there are, and start
threads feeding them <start> and <length> (length being #names /
#threads, round up -- start then being 0*length+1, 1*length+1, etc...)
and use those in thread specific selects using "... limit <length>
offset <start>"... This way each thread retrieves its own limited set of
companies (make sure to use the same sorting criteria).
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/

 
Reply With Quote
 
Gilles Ganault
Guest
Posts: n/a
 
      09-23-2009
On Mon, 21 Sep 2009 21:40:02 -0700, Dennis Lee Bieber
<> wrote:
> I'd suggest either a pool of threads -- 5-10, each reading company
>names from a shared QUEUE, which is populated by the main thread
>(remember to commit() so that you don't block on database updates by the
>threads). OR... determine how many companies there are, and start
>threads feeding them <start> and <length> (length being #names /
>#threads, round up -- start then being 0*length+1, 1*length+1, etc...)
>and use those in thread specific selects using "... limit <length>
>offset <start>"... This way each thread retrieves its own limited set of
>companies (make sure to use the same sorting criteria).


Thanks for the help
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Pick random choice from variables eli m Python 3 02-10-2013 06:03 PM
Math.random() and Math.round(Math.random()) and Math.floor(Math.random()*2) VK Javascript 15 05-02-2010 03:43 PM
random.random(), random not defined!? globalrev Python 4 04-20-2008 08:12 AM
When to pick quad core and when to pick dual core thingy NZ Computing 6 11-21-2006 07:08 AM
When to pick ASP.Net, when to pick desktop? tom c ASP .Net 5 11-01-2006 06:15 PM



Advertisments