Velocity Reviews - Computer Hardware Reviews

Parsing links within a html file.

 
 
Shriphani
      01-14-2008
Hello,
I have an HTML file named guide_ind.html that contains links to
other HTML files, such as guides.html#outline. How do I point
BeautifulSoup (the module I want to use) at guides.html#outline?
Thanks,
Shriphani P.
 
Hai Vu
      01-17-2008
On Jan 14, 9:59 am, Shriphani wrote:
> Hello,
> I have a html file over here by the name guide_ind.html and it
> contains links to other html files like guides.html#outline . How do I
> point BeautifulSoup (I want to use this module) to
> guides.html#outline ?
> Thanks
> Shriphani P.


Try Mark Pilgrim's excellent example at:
http://www.diveintopython.org/http_w...ces/index.html

From the above link, you can retrieve openanything.py which I use in
my example:

# list_url.py
# created by Hai Vu on 1/16/2008
# Python 2 only: sgmllib was removed in Python 3.

from openanything import fetch
from sgmllib import SGMLParser

class RetrieveURLs(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attributes):
        # Called for every <a> tag; keep only the href value(s)
        url = [v for k, v in attributes if k.lower() == 'href']
        self.urls.extend(url)
        print '\t%s' % (url)

# ----------------------------------------------------------------------
# main
def main():
    site = 'http://www.google.com'

    result = fetch(site)
    if result['status'] == 200:
        # Extract the list of URLs from the top page
        parser = RetrieveURLs()
        parser.feed(result['data'])
        parser.close()

        # Display the URLs we just retrieved
        print '\nURL retrieved from %s' % (site)
        print '\t' + '\n\t'.join(parser.urls)
    else:
        print 'Error (%d) retrieving %s' % (result['status'], site)

if __name__ == '__main__':
    main()
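
For readers on Python 3, where sgmllib is no longer available, the same idea can be sketched with the standard library alone. This is only an illustration, not the method from the post above: it swaps in html.parser for sgmllib, and the names LinkCollector and resolve_links, the sample markup, and the example.com base URL are all made up for the example. It also shows one way to handle a link such as guides.html#outline: resolve it against the page it came from, then split off the fragment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def resolve_links(html_text, base):
    """Return (document_url, fragment) pairs for each link in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    # urljoin makes the link absolute relative to base;
    # urldefrag then separates the '#fragment' part from the document URL.
    return [tuple(urldefrag(urljoin(base, url))) for url in parser.urls]


if __name__ == '__main__':
    # Hypothetical page content and base URL, for illustration only.
    sample = '<a href="guides.html#outline">Outline</a>'
    base = 'http://example.com/docs/guide_ind.html'
    for document, fragment in resolve_links(sample, base):
        print(document, fragment)
```

The part after the # is an anchor inside guides.html rather than a separate file, so the document URL is what gets fetched and parsed (with BeautifulSoup or anything else), and the fragment is then the name or id of the element to look for in that parsed page.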
 