Web Crawler - Python or Perl?

 
 
disappearedng@gmail.com
06-09-2008
Hi all,
I am currently planning to write my own web crawler. I know Python but
not Perl, and I am interested in knowing which of these two is the
better choice given the following scenario:

1) I/O issues: my biggest resource constraint will be the bandwidth
bottleneck.
2) Efficiency issues: the crawlers have to be fast, robust, and as
memory-efficient as possible. I am running all of my crawlers on cheap
PCs with about 500 MB of RAM and P3 to P4 processors.
3) Compatibility issues: most of these crawlers will run on Unix
(FreeBSD), so there should be a good compiler that can optimize my
code under these environments.

What are your opinions?
 
subeen
06-09-2008
On Jun 9, 11:48 pm, (E-Mail Removed) wrote:
> I am currently planning to write my own web crawler. I know Python but
> not Perl [...]
>
> What are your opinions?


It really doesn't matter whether you use Perl or Python for writing
web crawlers; I have used both. The scenarios you mentioned (I/O,
efficiency, compatibility) don't differ much between the two
languages, and both have fast I/O. In Python you can use the urllib2
module and/or Beautiful Soup for developing a crawler; in Perl you can
use the Mechanize or LWP modules. Both languages have good support for
regular expressions. I have heard Perl is slightly faster, though I
don't see the difference myself. Both are compatible with *nix. For
writing a good crawler the language is not important; it's the
technique that matters. See the fetch sketch below.
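
For illustration, fetching a page with urllib2 might look like this (a
minimal sketch in the Python 2 of the day; the URL and User-Agent
string are just placeholders):

import urllib2

def fetch(url):
    # Send a custom User-Agent; some sites reject the default one
    request = urllib2.Request(url,
                              headers={'User-Agent': 'my-crawler/0.1'})
    try:
        return urllib2.urlopen(request).read()
    except urllib2.URLError, e:  # Python 2 except syntax
        print 'failed to fetch %s: %s' % (url, e)
        return None

html = fetch('http://example.com/')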

regards,
Subeen.
http://love-python.blogspot.com/
 
Stefan Behnel
06-09-2008
(E-Mail Removed) wrote:
> 1) I/O issues: my biggest resource constraint will be the bandwidth
> bottleneck. [...]
> 3) Compatibility issues: most of these crawlers will run on Unix
> (FreeBSD), so there should be a good compiler that can optimize my
> code under these environments.


You should rethink your requirements. You expect to be I/O bound, so why do
you require a good "compiler"? Especially when asking about two interpreted
languages...

Consider using lxml (with Python); it has pretty much everything you need for
a web crawler, supports threaded parsing directly from HTTP URLs, and it's
plenty fast and pretty memory-efficient.

http://codespeak.net/lxml/
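
A minimal sketch of the URL-parsing route (example.com stands in for a
real target):

import lxml.html

# lxml.html can parse directly from an HTTP URL
doc = lxml.html.parse('http://example.com/').getroot()
doc.make_links_absolute('http://example.com/')
for element, attribute, link, pos in doc.iterlinks():
    print link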

Stefan
 
Stefan Behnel
06-09-2008
subeen wrote:
> can use the urllib2 module and/or Beautiful Soup for developing a crawler


Not if you care about a) speed and/or b) memory efficiency.

http://blog.ianbicking.org/2008/03/3...r-performance/

Stefan
 
subeen
06-09-2008
On Jun 10, 12:15 am, Stefan Behnel <(E-Mail Removed)> wrote:
> Not if you care about a) speed and/or b) memory efficiency.


Yeah, Beautiful Soup is slower, so it's better to use urllib2 for
fetching data and regular expressions for parsing it, along the lines
of the sketch below.
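
A rough sketch of that approach (the regex is deliberately simple and
will miss edge cases; example.com is a placeholder):

import re
import urllib2

html = urllib2.urlopen('http://example.com/').read()
# Crude but fast: pull href values out with a regular expression.
# Quick to write, but brittle on messy markup.
links = re.findall(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', html,
                   re.IGNORECASE)
print links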


regards,
Subeen.
http://love-python.blogspot.com/
 
Ray Cote
06-09-2008
At 11:21 AM -0700 6/9/08, subeen wrote:
> Yeah, Beautiful Soup is slower, so it's better to use urllib2 for
> fetching data and regular expressions for parsing it.


Beautiful Soup is a bit slower, but it will actually parse some of
the bizarre HTML you'll download off the web. We've written a couple
of crawlers to run over specific clients' sites (I note, we did _not_
create the content on these sites).

Expect to find HTML code that looks like this:

<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]

I don't know whether some of the quicker parsers discussed here
require well-formed HTML, since I've not used them. You may want to
consider using one of the quicker HTML parsers and, when they throw a
fit on the downloaded HTML, dropping back to Beautiful Soup -- which
usually gets _something_ useful off the page.
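
A sketch of that fallback pattern (assuming lxml as the faster parser
and the 2008-era BeautifulSoup 3; note the two branches return
different object types, so the caller has to handle both):

import lxml.etree
import lxml.html
from BeautifulSoup import BeautifulSoup

def parse_page(html):
    # Try the faster parser first...
    try:
        return lxml.html.fromstring(html)
    except lxml.etree.ParserError:
        # ...and drop back to Beautiful Soup on markup lxml rejects
        return BeautifulSoup(html)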

--Ray

--

Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com
 
Sebastian "lunar" Wiesner
06-09-2008
subeen <(E-Mail Removed)> wrote on Monday, 09 June 2008 at 20:21:

> Yeah, Beautiful Soup is slower, so it's better to use urllib2 for
> fetching data and regular expressions for parsing it.


BeautifulSoup is implemented on top of regular expressions. I doubt
that you can achieve a great performance gain by using plain regular
expressions, and even if you can, the gain is certainly not worth the
effort. Parsing markup with regular expressions is hard, and the result
will most likely be neither as fast nor as memory-efficient as
lxml.html.

I personally am absolutely happy with lxml.html. It's fast and
memory-efficient, yet powerful and easy to use, for example:
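
A small illustration of that ease of use (the snippet of broken markup
is made up):

import lxml.html

snippet = '<ul><li><form></li></form></ul><a href="/next">next</a>'
doc = lxml.html.fromstring(snippet)
# lxml.html repairs the broken nesting, and XPath works as usual
print doc.xpath('//a/@href')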

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
 
George Sakkis
06-09-2008
On Jun 9, 1:48 pm, (E-Mail Removed) wrote:

> I am currently planning to write my own web crawler. I know Python but
> not Perl [...]
>
> What are your opinions?


You mentioned *what* you want but not *why*. If it's for a real-world
production project, why reinvent a square wheel and not use (or at
least extend) an existing open source crawler with years of
development behind it? If it's a learning exercise, why bother about
performance so early?

In any case, since you said you know Python but not Perl, the choice
is almost a no-brainer, unless you're looking for an excuse to learn
Perl. In terms of performance they are comparable, and you can
probably manage crawls on the order of 10-100K pages at best. For
million-page or larger crawls, though, you'll have to resort to C/C++
sooner or later.

George
 
Stefan Behnel
06-09-2008
Ray Cote wrote:
> Beautiful Soup is a bit slower, but it will actually parse some of the
> bizarre HTML you'll download off the web.

[...]
> I don't know whether some of the quicker parsers discussed here
> require well-formed HTML, since I've not used them. You may want to
> consider using one of the quicker HTML parsers and, when they throw a
> fit on the downloaded HTML, dropping back to Beautiful Soup -- which
> usually gets _something_ useful off the page.


So does lxml.html. And if you still feel you need BS once in a while,
there's lxml.html.soupparser.

http://codespeak.net/lxml/elementsoup.html
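
A quick sketch of the soupparser route (the broken snippet is made up):

from lxml.html import soupparser

# Parses via BeautifulSoup but hands back ordinary lxml elements
root = soupparser.fromstring('<p>bro<b>ken</p> markup')
print root.xpath('//b/text()')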

Stefan
 
disappearedng@gmail.com
06-10-2008
As to the why as opposed to the what: I am attempting to build a
search engine right now, one that plans to crawl not just HTML but
other things too.

I am open to learning, but for the moment I don't want to learn
anything that doesn't really contribute to building my search engine.
Hence I want to see whether learning Perl would be helpful for the
later parts of it.

Victor
 