Velocity Reviews


Lawrence D'Oliveiro 01-04-2009 04:45 AM

Beautiful BeautifulSoup
 
BeautifulSoup is a Python library for parsing HTML, which makes it handy for
scraping websites. It will even try to make sense of malformed HTML. Here's how
easy it is to grab strips off the dilbert.com site:

----
import sys
import re
import urllib
from BeautifulSoup import \
    BeautifulSoup as SoupParse

url_base = "http://dilbert.com"

# fetch the requested page of strips into a temporary file
TempFilename, Headers = urllib.urlretrieve("%s/strips/?Page=%d" % (url_base, int(sys.argv[1])))
Document = SoupParse(open(TempFilename, "r").read())
# every <img> whose src ends in ".strip.gif" is a comic strip
for img in Document.findAll("img", src = re.compile(r"\.strip\.gif$")) :
    filename = re.search(r"([^\/]+)$", img["src"]).group(1)
    img_url = "%s/%s" % (url_base, img["src"])
    sys.stderr.write("%s => %s\n" % (img_url, filename)) # debug
    urllib.urlretrieve(img_url, filename)
#end for
----

For instance, the oldest strips are currently on page 691. So if the script
above is saved as get_dilbert, the call

get_dilbert 691

will download the strips on that page into the current directory.
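
To see what the two regular expressions in the loop are doing without touching
the network, here is a rough standalone sketch. The helper name strip_target
and the sample src values are made up for illustration (I have not checked them
against the real site), and it uses urljoin in place of the "%s/%s" formatting
so that a src with a leading slash does not end up with a doubled slash.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

url_base = "http://dilbert.com"
strip_pattern = re.compile(r"\.strip\.gif$")

def strip_target(src):
    """Hypothetical helper: return (img_url, filename) for a strip image
    src, or None if the src does not end in ".strip.gif"."""
    if not strip_pattern.search(src):
        return None
    # the last path component becomes the local filename
    filename = re.search(r"([^/]+)$", src).group(1)
    # urljoin copes with src values with or without a leading slash
    return urljoin(url_base + "/", src), filename

# Made-up src values, not taken from the real site.
samples = ["/dyn/str_strip/example.strip.gif", "/img/logo.gif"]
for src in samples:
    print("%s => %s" % (src, strip_target(src)))
```

Only the first sample matches the strip pattern; the second is skipped, which is
the same filtering that findAll does with the src = re.compile(...) argument.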

