HTML data extraction?

 
 
Dave Kuhlman
      12-22-2003

I recently read an article by Jon Udell about extracting data from
Web pages as a poor person's Web services. So, I have a question:

Is there any Python support for finding and extracting information
from HTML documents?

I'd like something that would do things like the following:

- return the data which is inside a <b> tag which is inside a
<li> tag.

- return the data which is inside a <a> tag that has attribute
href="http://www.python.org".

- Etc.

It would be a sort of structured grep for HTML.

I've found the HTMLParser and htmllib modules in the Python
standard library, but I'm wondering if there is anything at a
higher level.

Web searches did not turn up anything interesting.

Thanks for help.

Dave

--
http://www.rexx.com/~dkuhlman

 
djw
      12-22-2003
I don't know if there is anything at a higher level (I guess a Google
session would tell you that), but doing what you describe with the
HTMLParser module is very straightforward. All you have to do is keep
some state flags in the derived HTMLParser class that indicate the
found/not-found state of what you are looking for and have that control
the collection of data between the flags.

Starting with the example in the docs, with some (untested) additions:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_bold_tag = False
        self.in_list_tag = False
        self.data_in_bold_list = ''

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = True
        if tag == 'li': self.in_list_tag = True

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = False
        if tag == 'li': self.in_list_tag = False

    def handle_data(self, data):
        if self.in_bold_tag and self.in_list_tag:
            self.data_in_bold_list = ''.join([self.data_in_bold_list, data])

This is just an outline, but you get the idea...
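
For anyone reading this on Python 3, where the module is named html.parser, the same state-flag idea might look roughly like the sketch below. The class name, the href handling, and the sample markup are assumptions made up for illustration; counters replace the booleans so that nested tags do not reset the state too early. It only checks that a <b> and an <li> are both open, not which one was opened first.

```python
from html.parser import HTMLParser


class BoldInListParser(HTMLParser):
    """Hypothetical helper: collects text that is inside both <li> and <b>,
    and the text of <a> tags whose href equals a given URL."""

    def __init__(self, href):
        super().__init__()
        self.href = href
        self.li_depth = 0       # number of <li> tags currently open
        self.b_depth = 0        # number of <b> tags currently open
        self.in_link = False    # True while inside a matching <a>
        self.bold_in_list = []  # text chunks seen while <li> and <b> are open
        self.link_text = []     # text chunks seen inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "b":
            self.b_depth += 1
        elif tag == "a" and dict(attrs).get("href") == self.href:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag == "b" and self.b_depth:
            self.b_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.li_depth and self.b_depth:
            self.bold_in_list.append(data)
        if self.in_link:
            self.link_text.append(data)


# Made-up sample markup for the two queries from the original question.
sample = (
    '<ul><li>plain <b>bold item</b></li></ul>'
    '<p><b>bold outside list</b> '
    '<a href="http://www.python.org">Python home</a> '
    '<a href="http://example.com">other</a></p>'
)

parser = BoldInListParser("http://www.python.org")
parser.feed(sample)
parser.close()
print("".join(parser.bold_in_list))
print("".join(parser.link_text))
```

Collecting chunks in a list and joining at the end avoids rebuilding the string on every handle_data call, which the parser may invoke several times for one run of text.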

-Don





 
John J. Lee
      12-22-2003
[Sorry if this got posted twice, not sure what I did...]

Dave Kuhlman writes:
[...]
> I'd like something that would do things like the following:
>
> - return the data which is inside a <b> tag which is inside a
> <li> tag.
>
> - return the data which is inside a <a> tag that has attribute
> href="http://www.python.org".
>
> - Etc.
>
> It would be a sort of structured grep for HTML.


1. http://wwwsearch.sf.net/bits/pullparser.py

It's a port of Perl's HTML::TokeParser.

import pullparser

p = pullparser.PullParser(f)
p.get_tag("li")  # find the enclosing <li> first
p.get_tag("b")   # then the <b> that follows it
print p.get_text()


p = pullparser.PullParser(f)
for tag in p:
    tag = p.get_tag("a")
    if dict(tag.attrs).get("href") == "http://www.python.org":
        print p.get_text()

I'll release a beta version in a day or so with a couple of minor
changes (including that .get_text() will no longer raise
NoMoreTagsError) and a proper tarball package.


2. Stuff your data through mxTidy or uTidylib to get XHTML, then query
it with XPath from PyXML.

http://www.zvon.org/xxl/XPathTutoria.../examples.html

In fact, tidying HTML is sometimes necessary even if you don't need
XHTML or a tree-based API.
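
As a rough illustration of the tidy-then-query approach, here is a sketch that assumes the page has already been tidied into well-formed XHTML. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, in place of PyXML; the sample markup is made up. Real tidied output usually declares the XHTML namespace, in which case the tag names in the paths would need a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Made-up sample, assumed to be already well-formed (e.g. after tidying).
xhtml = """<html><body>
<ul><li>plain <b>bold item</b></li></ul>
<p><b>bold outside list</b>
<a href="http://www.python.org">Python home</a>
<a href="http://example.com">other</a></p>
</body></html>"""

root = ET.fromstring(xhtml)

# Text of <b> elements that are direct children of an <li>.
bold_in_list = [b.text for b in root.findall(".//li/b")]

# Text of <a> elements whose href attribute has a specific value.
links = [a.text for a in root.findall(".//a[@href='http://www.python.org']")]

print(bold_in_list)
print(links)
```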


3. microdom

http://www.xml.com/pub/a/2003/10/15/microdom.html

Haven't used it myself.


John
 