Beautiful BeautifulSoup
BeautifulSoup is a Python library for scraping websites. It will even try to
make sense of malformed HTML. Here's how easy it is to grab strips off the dilbert.com site:

----
import sys
import re
import urllib
from BeautifulSoup import \
    BeautifulSoup as SoupParse

url_base = "http://dilbert.com"
TempFilename, Headers = urllib.urlretrieve("%s/strips/?Page=%d" % (url_base, int(sys.argv[1])))
Document = SoupParse(open(TempFilename, "r").read())
for img in Document.findAll("img", src = re.compile(r"\.strip\.gif$")) :
    filename = re.search(r"([^\/]+)$", img["src"]).group(1)
    img_url = "%s/%s" % (url_base, img["src"])
    sys.stderr.write("%s => %s\n" % (img_url, filename)) # debug
    urllib.urlretrieve(img_url, filename)
#end for
----

For instance, the oldest strips are currently on page 691, so the call

    get_dilbert 691

will download the strips on that page into the current directory.
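The script above is written for Python 2 and the old BeautifulSoup 3 module. Most of the work is done by its two regular expressions: the first picks out images whose src ends in .strip.gif, and the second takes everything after the last slash as the local filename. Here is a minimal Python 3 sketch of just that step, using only the standard library. The function name strip_targets and the sample src values are made up for illustration and are not taken from the real site. It also swaps the "%s/%s" string formatting for urljoin, which avoids a doubled slash when src already starts with "/".

```python
import re
from urllib.parse import urljoin

URL_BASE = "http://dilbert.com"
# Same pattern the script passes to findAll as the src filter.
STRIP_SRC = re.compile(r"\.strip\.gif$")


def strip_targets(srcs, url_base=URL_BASE):
    """Return (img_url, filename) pairs for src values that look like strips.

    Hypothetical helper: it mirrors the body of the for loop in the script,
    minus the actual download.
    """
    targets = []
    for src in srcs:
        if not STRIP_SRC.search(src):
            continue  # not a strip image, skip it
        # Everything after the last "/" becomes the local filename.
        filename = re.search(r"([^/]+)$", src).group(1)
        # urljoin copes with both "/path" and "path" style src values.
        img_url = urljoin(url_base + "/", src)
        targets.append((img_url, filename))
    return targets


# Hypothetical src attributes, for illustration only.
sample_srcs = [
    "/dyn/str_strip/000/1234.strip.gif",
    "/img/logo.gif",
    "dyn/str_strip/000/5678.strip.gif",
]

for img_url, filename in strip_targets(sample_srcs):
    print("%s => %s" % (img_url, filename))
```

With the sample list, only the two .strip.gif entries come back, and each is paired with the name of the file it would be saved under in the current directory.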