urllib2 and threading

 
 
robean
05-01-2009
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.

Here's the problem: the script simply crashes after getting a couple
of urls and takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.

The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.

Here's the code:

#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
    """subclass threading.Thread to create Thread instances"""
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)


def get_info_from_url(url):
    """ A dummy version of the function simply visits urls and prints
    the url of the page. """
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print "**** error ****", e.reason
    except urllib2.HTTPError, e:
        print "**** error ****", e.code
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

ulock = threading.Lock()
num_links = 10
threads = []  # store threads here
urls = []     # store urls here

fh = open("links.txt", "r")
for line in fh:
    urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
    t = MyThread(get_info_from_url, (urls[i],))
    threads.append(t)

# start the threads
for i in range(num_links):
    threads[i].start()

for i in range(num_links):
    threads[i].join()

print "all done"

 
Paul Rubin
05-01-2009
robean <(E-Mail Removed)> writes:
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> the example shown here is simplified and just confirms the url of the
> site visited.


Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
of pages and have multiple cpu's, you probably want parallel processes
rather than threads.
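
Very roughly, something like this (an untested sketch; it needs Python 2.6+
for the multiprocessing module, "links.txt" comes from your code, and
fetch_and_parse is just a placeholder for your real scraping):

import urllib2
from multiprocessing import Pool

def fetch_and_parse(url):
    """Fetch one page and return something small; the real parsing
    (Beautiful Soup or whatever) would go here instead."""
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError:
        return (url, None)
    return (url, len(html))

if __name__ == "__main__":
    urls = [line.strip() for line in open("links.txt")]
    pool = Pool(processes=4)          # roughly one worker process per core
    for url, result in pool.map(fetch_and_parse, urls):
        print url, result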

> wrong? I am new to both threading and urllib2, so it's possible that
> the SNAFU is quite obvious.
> ...
> ulock = threading.Lock()


Without looking at the code for more than a few seconds, using an
explicit lock like that is generally not a good sign. The usual
Python style is to send all inter-thread communications through
Queues. You'd dump all your url's into a queue and have a bunch of
worker threads getting items off the queue and processing them. This
really avoids a lot of lock-related headache. The price is that you
sometimes use more threads than strictly necessary. Unless it's a LOT
of extra threads, it's usually not worth the hassle of messing with
locks.
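
Roughly this shape, untested ("links.txt" comes from your code, the
worker count is arbitrary):

import urllib2
import threading
import Queue

NUM_WORKERS = 10   # arbitrary; tune to taste

def worker(url_queue, result_queue):
    """Pull urls off the queue until it is empty, fetch each one,
    and put the result (or the error) on the result queue."""
    while True:
        try:
            url = url_queue.get_nowait()
        except Queue.Empty:
            return
        try:
            page = urllib2.urlopen(url)
            result_queue.put((url, page.geturl()))
            page.close()
        except urllib2.HTTPError, e:       # note: HTTPError before URLError
            result_queue.put((url, "HTTP error %s" % e.code))
        except urllib2.URLError, e:
            result_queue.put((url, "URL error %s" % e.reason))

url_queue = Queue.Queue()
result_queue = Queue.Queue()
for line in open("links.txt"):
    url_queue.put(line.strip())

workers = [threading.Thread(target=worker, args=(url_queue, result_queue))
           for i in range(NUM_WORKERS)]
for w in workers:
    w.start()
for w in workers:
    w.join()

while not result_queue.empty():
    print result_queue.get()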

 
robean
05-01-2009
Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace

else:
    ulock.acquire()
    print page.geturl()  # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

with

else:
    pass

urllib2 starts raising URLErrors after the first 3-5 urls have
been visited. Do you have any sense what in the threads is corrupting
urllib2's behavior? Many thanks,

Robean



On May 1, 12:27 am, Paul Rubin <http://(E-Mail Removed)> wrote:
> robean <(E-Mail Removed)> writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple cpu's, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so it's possible that
> > the SNAFU is quite obvious.
> > ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign. The usual
> Python style is to send all inter-thread communications through
> Queues. You'd dump all your url's into a queue and have a bunch of
> worker threads getting items off the queue and processing them. This
> really avoids a lot of lock-related headache. The price is that you
> sometimes use more threads than strictly necessary. Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.


 
Stefan Behnel
05-01-2009
robean wrote:
> I am writing a program that involves visiting several hundred webpages
> and extracting specific information from the contents. I've written a
> modest 'test' example here that uses a multi-threaded approach to
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that)


Try lxml.html instead. It often parses HTML pages better than BS, can parse
directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
faster and more memory friendly than the combination of urllib2 and BS,
especially when threading is involved. It also supports CSS selectors for
finding page content, so your "elaborate scraping" might actually turn out
to be a lot simpler than you think.

http://codespeak.net/lxml/
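
For instance, something along these lines (a quick untested sketch; the
URL and the CSS selector are just placeholders):

import lxml.html

# lxml fetches and parses the page itself, no urllib2 needed
doc = lxml.html.parse("http://example.com/page.html").getroot()

# CSS selectors instead of hand-rolled tree walking
for link in doc.cssselect("div.content a"):
    print link.get("href"), link.text_content()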

These might be worth reading:

http://blog.ianbicking.org/2008/12/1...aping-library/
http://blog.ianbicking.org/2008/03/3...r-performance/

Stefan
 
shailen.tuli@gmail.com
05-01-2009
Performance-wise, lxml easily outperforms Beautiful Soup.

For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, though.
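
i.e. roughly this change (an untested sketch):

import urllib

def get_info_from_url(url):
    """Same dummy fetch as in the original post, but via urllib."""
    try:
        page = urllib.urlopen(url)    # urllib wraps most failures in IOError
        print page.geturl()
        page.close()
    except IOError, e:
        print "**** error ****", e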

- Shailen

On May 1, 9:29 am, Stefan Behnel <(E-Mail Removed)> wrote:
> robean wrote:
> > I am writing a program that involves visiting several hundred webpages
> > and extracting specific information from the contents. I've written a
> > modest 'test' example here that uses a multi-threaded approach to
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that)

>
> Try lxml.html instead. It often parses HTML pages better than BS, can parse
> directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
> faster and more memory friendly than the combination of urllib2 and BS,
> especially when threading is involved. It also supports CSS selectors for
> finding page content, so your "elaborate scraping" might actually turn out
> to be a lot simpler than you think.
>
> http://codespeak.net/lxml/
>
> These might be worth reading:
>
> http://blog.ianbicking.org/2008/12/1...r-performance/
>
> Stefan


 
Piet van Oostrum
05-01-2009
>>>>> robean <(E-Mail Removed)> (R) wrote:

>R> def get_info_from_url(url):
>R>     """ A dummy version of the function simply visits urls and prints
>R>     the url of the page. """
>R>     try:
>R>         page = urllib2.urlopen(url)
>R>     except urllib2.URLError, e:
>R>         print "**** error ****", e.reason
>R>     except urllib2.HTTPError, e:
>R>         print "**** error ****", e.code


There's a problem here. HTTPError is a subclass of URLError so it should
be first. Otherwise when you have an HTTPError (like a 404 File not
found) it will be caught by the "except URLError", but it will not have
a reason attribute, and then you get an exception in the except clause
and the thread will crash.
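
I.e., a minimal corrected version of that function, with the locking and
the rest left out, just to show the ordering (untested):

import urllib2

def get_info_from_url(url):
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        # the more specific subclass has to be listed first
        print "**** error ****", e.code
    except urllib2.URLError, e:
        # everything else URL-related (DNS failure, connection refused, ...)
        print "**** error ****", e.reason
    else:
        print page.geturl()
        page.close()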
--
Piet van Oostrum <(E-Mail Removed)>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
 
Aahz
05-02-2009
In article <(E-Mail Removed)>,
robean <(E-Mail Removed)> wrote:
>
>Here's the problem: the script simply crashes after getting a couple
>of urls and takes a long time to run (slower than a non-threaded
>version that I wrote and ran). Can anyone figure out what I am doing
>wrong? I am new to both threading and urllib2, so it's possible that
>the SNAFU is quite obvious.


For an example, see

http://www.pythoncraft.com/OSCON2001/index.html
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

"Typing is cheap. Thinking is expensive." --Roy Smith
 