Web Spider

 
 
Thomas Lindgaard
07-06-2004
Hello

I'm a newcomer to the world of Python trying to write a web spider. I
downloaded the skeleton from

http://starship.python.net/crew/aahz...dPoolSpider.py

Some of the source is shown below.

A couple of questions:

1) Why use the

if __name__ == '__main__':

construct?

2) In Retrievepool.__init__ the Retriever.__init__ is called with
self.inputQueue and self.outputQueue as arguments. Does this mean that
each Retriever thread has a reference to Retrievepool.inputQueue and
Retrievepool.outputQueue (i.e. there is only one input queue and one
output queue, and the threads all share them, pushing and popping
whenever they want, which is safe due to the synchronized nature of
Queue)?

3) How many threads will be running? Spider.run initializes the
Retrievepool and this will consist of MAX_THREADS threads, so once the
crawler is running there will be the main thread (caught in the while loop
in Spider.run) and MAX_THREADS Retriever threads running, right?

Hmm... I think that's about it for now.

---------------------------------------------------------------------

MAX_THREADS = 3

...

class Retriever(threading.Thread):
    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            self.getPage()
            self.outputQueue.put(self.getLinks())

...


class RetrievePool:
    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

...


class Spider:
    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()
        self.URLs.sort()

...


if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    print
    for URL in spider.URLs:
        print URL


--
Regards
/Thomas

 
 
 
 
 
Peter Hansen
07-06-2004
Thomas Lindgaard wrote:

> A couple of questions:
>
> 1) Why use the
> if __name__ == '__main__':
> construct?


Answered indirectly in this FAQ:
http://www.python.org/doc/faq/progra...nt-module-name

> 2) In Retrievepool.__init__ the Retriever.__init__ is called with
> self.inputQueue and self.outputQueue as arguments. Does this mean that
> each Retriever thread has a reference to Retrievepool.inputQueue and
> Retrievepool.outputQueue


Yes, and that's sort of the whole point of the thing.
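
A stripped-down illustration of the same idea (my own toy example, not
code from the skeleton): several threads block on one shared Queue, each
get() hands an item to exactly one of them, and they all feed a single
output queue:

import threading, Queue

def worker(inq, outq):
    while 1:
        item = inq.get()        # blocks until something is available
        if item is None:        # sentinel telling this worker to stop
            break
        outq.put(item * 2)      # all workers share the same output queue

inq = Queue.Queue()
outq = Queue.Queue()
workers = [threading.Thread(target=worker, args=(inq, outq))
           for i in range(3)]
for w in workers:
    w.start()
for n in range(10):
    inq.put(n)
for w in workers:
    inq.put(None)               # one sentinel per worker
for w in workers:
    w.join()

results = []
while not outq.empty():
    results.append(outq.get())
results.sort()
print results

No locking needed in your own code; Queue does the synchronization.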

> 3) How many threads will be running? Spider.run initializes the
> Retrievepool and this will consist of MAX_THREADS threads, so once the
> crawler is running there will be the main thread (caught in the while loop
> in Spider.run) and MAX_THREADS Retriever threads running, right?


Yep. Good analysis. You could inject this somewhere to
check:

print len(threading.enumerate()), 'threads exist'
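
Or, as a quick standalone check of the same counting (a throwaway
example, nothing to do with the spider itself): start three worker
threads and you should see four threads reported, the main one plus the
workers:

import threading, time

def sleeper():
    time.sleep(2)               # keep the worker alive long enough to count

workers = [threading.Thread(target=sleeper) for i in range(3)]
for w in workers:
    w.start()
print len(threading.enumerate()), 'threads exist'   # prints: 4 threads exist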

-Peter
 
 
 
 
 
Thomas Lindgaard
07-07-2004
On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:

> Answered indirectly in this FAQ:
> http://www.python.org/doc/faq/progra...nt-module-name


Let me just see if I understood this correctly...

The reason for using the construct is to have two "modes" for the script:
one for running the script by itself (i.e. run main()) and one for when it
is imported from somewhere else (i.e. main() should not be run unless
called from the surrounding code).
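
So, with a made-up module of my own (the names below are just for
illustration, nothing from the skeleton):

# mymodule.py
def main():
    print 'running as a script'

if __name__ == '__main__':
    # only reached when the file is executed directly,
    # e.g. "python mymodule.py", not when it is imported
    main()

Then "import mymodule" from another script defines main() without calling
it, whereas running the file directly calls it.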

>> 2) In Retrievepool.__init__ the Retriever.__init__ is called with
>> self.inputQueue and self.outputQueue as arguments. Does this mean that
>> each Retriever thread has a reference to Retrievepool.inputQueue and
>> Retrievepool.outputQueue

>
> Yes, and that's sort of the whole point of the thing.


Okidoki

>> 3) How many threads will be running? Spider.run initializes the
>> Retrievepool and this will consist of MAX_THREADS threads, so once the
>> crawler is running there will be the main thread (caught in the while
>> loop in Spider.run) and MAX_THREADS Retriever threads running, right?

>
> Yep. Good analysis. You could inject this somewhere to check:


Thanks - sometimes it actually helps to read the code you want to build
on closely.

> print len(threading.enumerate()), 'threads exist'


Can a thread die spontaneously if for instance an exception is thrown?

--
Regards
/Thomas

 
 
Peter Hansen
07-07-2004
Thomas Lindgaard wrote:
> On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
>>Answered indirectly in this FAQ:
>>http://www.python.org/doc/faq/progra...nt-module-name

>
> Let me just see if I understood this correctly...
>
> The reason for using the construct is to have two "modes" for the script:
> one for running the script by itself (i.e. run main()) and one for when it
> is imported from somewhere else (i.e. main() should not be run unless
> called from the surrounding code).


Yep.
>
> Can a thread die spontaneously if for instance an exception is thrown?


The interactive prompt is your friend for such questions in Python.
Good to get in the habit of being able to check such stuff out
easily:

c:\>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, threading
>>> class Test(threading.Thread):
...     def run(self):
...         while 1:
...             time.sleep(5)
...             1/0
...
>>> a = Test()
>>> threading.enumerate()
[<_MainThread(MainThread, started)>]
>>> a.start()
>>> threading.enumerate()
[<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
>>> # wait a few seconds here
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
    self.run()
  File "<stdin>", line 5, in run
ZeroDivisionError: integer division or modulo by zero
>>> threading.enumerate()
[<_MainThread(MainThread, started)>]

Tada! The answer is yes.
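
One practical consequence for your spider: if getPage() ever raises (a
bad URL, a network error, whatever), that Retriever thread dies the same
way and the pool silently shrinks. A minimal way to guard against it (the
try/except below is my own addition, not part of the original skeleton):

class Retriever(threading.Thread):
    # __init__ unchanged from the skeleton

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            try:
                self.getPage()
                links = self.getLinks()
            except Exception, e:
                # without this, the exception would end the thread
                print 'Retriever failed on %s: %s' % (self.URL, e)
                links = []
            self.outputQueue.put(links)

Putting an empty list on the output queue keeps one result per queued
page, which presumably matters for the numPagesQueued bookkeeping in
Spider.run.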

-Peter
 