bash.org scraping

Discussion in 'NZ Computing' started by Lawrence D'Oliveiro, Mar 20, 2008.

  1. Here's a script that returns quotes in text form from bash.org. For
    instance, to get quote number 284202, use a command like this:

    getbashquote 284202

    Or to get a full helping of the random-quotes page, type

    getbashquote random

    ----
    #!/usr/bin/python
    #+
    # This script retrieves the text of a quote with the specified
    # number from bash.org. Invoke this script as follows:
    #
    #     getbashquote quoteid
    #
    # where quoteid is the ID of the quote page to get.
    #
    # Created 2008 March 21 by Lawrence D'Oliveiro <_zealand>.
    #-

    import sys
    import urllib
    import HTMLParser
    import htmlentitydefs

    class QuoteGetter(HTMLParser.HTMLParser) :
        # collects the text of every <p class="qt"> element on the page.

        def handle_starttag(self, tag, attrs) :
            if not self.in_quote and tag == "p" and dict(attrs).get("class") == "qt" :
                self.in_quote = True
                self.cur_quote = ""
            #end if
        #end handle_starttag

        def handle_endtag(self, tag) :
            if self.in_quote and tag == "p" :
                self.in_quote = False
                self.quotes.append(self.cur_quote)
            #end if
        #end handle_endtag

        def handle_data(self, the_data) :
            if self.in_quote :
                self.cur_quote += the_data
            #end if
        #end handle_data

        def handle_charref(self, the_data) :
            if self.in_quote :
                # the_data is the text between "&#" and ";", e.g. "39" or "x27"
                if the_data[:1] in ("x", "X") :
                    code = int(the_data[1:], 16)
                else :
                    code = int(the_data)
                #end if
                self.cur_quote += unichr(code).encode("utf-8")
            #end if
        #end handle_charref

        def handle_entityref(self, the_data) :
            if self.in_quote :
                self.cur_quote += unichr(htmlentitydefs.name2codepoint[the_data]).encode("utf-8")
            #end if
        #end handle_entityref

        def __init__(self) :
            # the base constructor calls reset(), so it must be invoked explicitly
            HTMLParser.HTMLParser.__init__(self)
            self.in_quote = False
            self.quotes = []
        #end __init__

    #end QuoteGetter

    if len(sys.argv) != 2 :
        raise RuntimeError("Usage:\n\t%s quoteid\n" % sys.argv[0])
    #end if
    tempfilename, ignore_headers = urllib.urlretrieve("http://bash.org/?" + sys.argv[1])
    GetQuotes = QuoteGetter()
    GetQuotes.feed(open(tempfilename, "r").read())
    GetQuotes.close()
    First = True
    for Quote in GetQuotes.quotes :
        # print a blank line between consecutive quotes
        if First :
            First = False
        else :
            sys.stdout.write("\n")
        #end if
        sys.stdout.write(Quote + "\n")
    #end for
    Lawrence D'Oliveiro, Mar 20, 2008
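
    For anyone on Python 3, where the HTMLParser module has become
    html.parser, here is a minimal sketch of the same parsing idea. It is
    an adaptation, not the script as posted: the extract_quotes helper and
    the SAMPLE markup are made up for illustration, and the fetch step is
    left out so the parsing can be tried offline. It assumes, as the
    script does, that each quote sits in a <p class="qt"> element.

```python
from html.parser import HTMLParser


class QuoteGetter(HTMLParser):
    """Collect the text of every <p class="qt"> element."""

    def __init__(self):
        # With convert_charrefs=True the parser decodes references such as
        # &lt; and &#39; before calling handle_data, so the separate
        # handle_charref/handle_entityref methods are not needed.
        super().__init__(convert_charrefs=True)
        self.in_quote = False
        self.cur_quote = ""
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        if not self.in_quote and tag == "p" and dict(attrs).get("class") == "qt":
            self.in_quote = True
            self.cur_quote = ""

    def handle_endtag(self, tag):
        if self.in_quote and tag == "p":
            self.in_quote = False
            self.quotes.append(self.cur_quote)

    def handle_data(self, data):
        if self.in_quote:
            self.cur_quote += data


def extract_quotes(html_text):
    """Hypothetical helper: return the quote texts found in html_text."""
    parser = QuoteGetter()
    parser.feed(html_text)
    parser.close()
    return parser.quotes


# Made-up markup in the shape the script expects, not a real bash.org page.
SAMPLE = (
    '<p class="quote">#1</p>'
    '<p class="qt">&lt;alice&gt; it&#39;s alive<br />\n'
    '&lt;bob&gt; &amp; kicking</p>'
)

if __name__ == "__main__":
    for quote in extract_quotes(SAMPLE):
        print(quote)
```

    To use it against the live site you would still have to download the
    page and decode it to text before handing it to extract_quotes.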