Re: web crawler in python

Philip Semanchuk

On Dec 9, 2009, at 7:39 PM, my name wrote:

> I'm currently planning on writing a web crawler in python but have a
> question as far as how I should design it. My goal is speed and maximum
> efficient use of the hardware\bandwidth I have available.
> As of now I have a Dual 2.4ghz xeon box, 4gb ram, 500gb sata and a 20mbps
> bandwidth cap (for now). Running FreeBSD.
> What would be the best way to design the crawler? Using the thread module?
> Would I be able to max out this connection with the hardware listed above
> using python threads?

I wrote a web crawler in Python (under FreeBSD, in fact) and I chose
to do it using separate processes. Process A would download pages and
write them to disk, process B would attempt to convert them to
Unicode, process C would evaluate the content, etc. That worked well
for me because the processes were very independent of one another so
they had very little data to share. Each process had a work queue
(Postgres database table); process A would feed B's queue, B would
feed C & D's queues, etc.

I should point out that my crawler spidered one site at a time. As a
result the downloading process spent a lot of time waiting (in order
to be polite to the remote Web server). This sounds pretty different
from what you want to do (and indeed from most crawlers).
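
For what it's worth, the "waiting to be polite" part can be as simple as enforcing a minimum gap between requests to the same server. The following is only a sketch: the PoliteThrottle class and the 2-second interval are illustrative assumptions, not what my crawler used.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between successive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested
        # without real delays.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep if needed; return the number of seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


if __name__ == "__main__":
    throttle = PoliteThrottle(2.0)  # assumed delay, pick what suits the site
    for url in ["http://example.com/a", "http://example.com/b"]:
        throttle.wait()
        print("would fetch", url)  # the actual download would go here
```

A crawler that works on many sites at once would keep one such throttle per host, so that a wait on one server does not hold up downloads from the others.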

Figuring out the best design for your crawler depends on a host of
factors that you haven't mentioned. (What are you doing with the
pages you download? Is the box doing anything else? Are you storing
the pages long term or discarding them? etc.) I don't think we can do
it for you -- I know *I* can't; I have a day job. But I encourage
you to try something out. If you find your code isn't giving what you
want, come back to the list with a specific problem. It's always
easier to help with specific than with general problems.

Good luck