Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Concurrent threads to pull web pages? (http://www.velocityreviews.com/forums/t700044-concurrent-threads-to-pull-web-pages.html)

Gilles Ganault 10-01-2009 09:28 AM

Concurrent threads to pull web pages?
 
Hello

I recently asked how to pull company IDs from an SQLite database,
have multiple instances of a Python script download each company's web
page from a remote server, e.g. www.acme.com/company.php?id=1, and use
regexes to extract some information from each page.

I need to run multiple instances to save time, since each page takes
about 10 seconds to be returned to the script/browser.

Since I've never written a multi-threaded Python script before, to
save time investigating, I was wondering if someone already had a
script that downloads web pages and saves some information into a
database.

Thank you for any tip.

exarkun@twistedmatrix.com 10-02-2009 01:33 AM

Re: Concurrent threads to pull web pages?
 
On 1 Oct, 09:28 am, nospam@nospam.com wrote:
>Hello
>
> I recently asked how to pull companies' ID from an SQLite
>database,
>have multiple instances of a Python script download each company's web
>page from a remote server, eg. www.acme.com/company.php?id=1, and use
>regexes to extract some information from each page.
>
>I need to run multiple instances to save time, since each page takes
>about 10 seconds to be returned to the script/browser.
>
>Since I've never written a multi-threaded Python script before, to
>save time investigating, I was wondering if someone already had a
>script that downloads web pages and save some information into a
>database.


There's no need to use threads for this. Have a look at Twisted:

http://twistedmatrix.com/trac/

Here's an example of how to use the Twisted HTTP client:

http://twistedmatrix.com/projects/we...les/getpage.py

Jean-Paul

MRAB 10-02-2009 01:46 AM

Re: Concurrent threads to pull web pages?
 
Gilles Ganault wrote:
> Hello
>
> I recently asked how to pull companies' ID from an SQLite database,
> have multiple instances of a Python script download each company's web
> page from a remote server, eg. www.acme.com/company.php?id=1, and use
> regexes to extract some information from each page.
>
> I need to run multiple instances to save time, since each page takes
> about 10 seconds to be returned to the script/browser.
>
> Since I've never written a multi-threaded Python script before, to
> save time investigating, I was wondering if someone already had a
> script that downloads web pages and save some information into a
> database.
>
> Thank you for any tip.


You could put the URLs into a queue and have multiple worker threads
repeatedly get a URL from the queue, download the page, and then put the
page into another queue for processing by another extraction thread.
This post might help:

http://mail.python.org/pipermail/pyt...er/195866.html
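A minimal sketch of that queue-and-workers pattern, using only the
standard library. Everything here (the `fetch_all` helper, the worker
count, the `fetch` callable) is an illustrative stand-in, not code from
the linked post:

```python
import queue
import threading

def fetch_all(urls, fetch, num_workers=4):
    """Fetch every URL using a pool of worker threads.

    `fetch` is any callable that takes a URL and returns the page body
    (e.g. a thin wrapper around urllib.request.urlopen).  Pages land on
    a result queue so a single consumer can parse them and write to
    SQLite without sharing a database connection across threads.
    """
    url_queue = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            url = url_queue.get()
            if url is None:           # sentinel: no more work
                url_queue.task_done()
                break
            try:
                results.put((url, fetch(url)))
            finally:
                url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)           # one sentinel per worker
    for t in threads:
        t.join()

    pages = {}
    while not results.empty():
        url, body = results.get()
        pages[url] = body
    return pages
```

The extraction step is collapsed into the final loop here; in a real
script it would be a separate thread pulling from the result queue and
running the regexes while downloads are still in flight.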


exarkun@twistedmatrix.com 10-02-2009 01:48 AM

Re: Concurrent threads to pull web pages?
 
On 01:36 am, kyle@kyleterry.com wrote:
>On Thu, Oct 1, 2009 at 6:33 PM, <exarkun@twistedmatrix.com> wrote:
>>On 1 Oct, 09:28 am, nospam@nospam.com wrote:
>>>Hello
>>>
>>> I recently asked how to pull companies' ID from an SQLite
>>>database,
>>>have multiple instances of a Python script download each company's
>>>web
>>>page from a remote server, eg. www.acme.com/company.php?id=1, and use
>>>regexes to extract some information from each page.
>>>
>>>I need to run multiple instances to save time, since each page takes
>>>about 10 seconds to be returned to the script/browser.
>>>
>>>Since I've never written a multi-threaded Python script before, to
>>>save time investigating, I was wondering if someone already had a
>>>script that downloads web pages and save some information into a
>>>database.

>>
>>There's no need to use threads for this. Have a look at Twisted:
>>
>> http://twistedmatrix.com/trac/
>>
>>Here's an example of how to use the Twisted HTTP client:
>>
>>http://twistedmatrix.com/projects/we...les/getpage.py

>
>I don't think he was looking for a framework... Specifically a
>framework
>that you work on.


He's free to use anything he likes. I'm offering an option he may not
have been aware of before. It's okay. It's great to have options.

Jean-Paul

Dennis Lee Bieber 10-02-2009 05:48 AM

Re: Concurrent threads to pull web pages?
 
On Fri, 02 Oct 2009 01:33:18 -0000, exarkun@twistedmatrix.com declaimed
the following in gmane.comp.python.general:

> There's no need to use threads for this. Have a look at Twisted:
>
> http://twistedmatrix.com/trac/
>


Strange... While I can easily visualize how to convert the problem
to a task pool -- especially given that code to do a single occurrence
is already in place...

... conversion to an event-dispatch based system is something /I/
can not imagine...

Twisted may be a magnificent effort... but it doesn't fit my mental
framework.
--
Wulfraed Dennis Lee Bieber KD6MOG
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/


exarkun@twistedmatrix.com 10-02-2009 03:09 PM

Re: Concurrent threads to pull web pages?
 
On 05:48 am, wlfraed@ix.netcom.com wrote:
>On Fri, 02 Oct 2009 01:33:18 -0000, exarkun@twistedmatrix.com declaimed
>the following in gmane.comp.python.general:
>>There's no need to use threads for this. Have a look at Twisted:
>>
>> http://twistedmatrix.com/trac/

>
> Strange... While I can easily visualize how to convert the
>problem
>to a task pool -- especially given that code to do a single occurrence
>is already in place...
>
> ... conversion to an event-dispatch based system is something
>/I/
>can not imagine...


The cool thing is that there's not much conversion to do from the single
request version to the multiple request version, if you're using
Twisted. The single request version looks like this:

getPage(url).addCallback(pageReceived)

And the multiple request version looks like this:

getPage(firstURL).addCallback(pageReceived)
getPage(secondURL).addCallback(pageReceived)

Since the APIs don't block, doing things concurrently ends up being the
easy thing.
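For readers without Twisted installed, the same fire-all-requests-then-
wait shape can be sketched with the standard library's asyncio (a
different framework from the one Jean-Paul describes; `fake_get_page`
is a stand-in for a real non-blocking HTTP call):

```python
import asyncio

# Stand-in for a real non-blocking HTTP request; asyncio.sleep(0)
# yields control the way a network wait would.
async def fake_get_page(url):
    await asyncio.sleep(0)
    return "<html>%s</html>" % url

async def main(urls):
    # Issue every request at once; none of them blocks the others.
    pages = await asyncio.gather(*(fake_get_page(u) for u in urls))
    return dict(zip(urls, pages))

pages = asyncio.run(main(["http://example.com/a", "http://example.com/b"]))
```

As with the Twisted version, going from one request to many is just a
matter of starting more of them before waiting on the results.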

Not to say it isn't a bit of a challenge to get into this mindset, but I
think anyone who wants to put a bit of effort into it can manage. :)
Getting used to using Deferreds in the first place (necessary to
write/use even the single request version) is probably where more people
have trouble.

Jean-Paul

