HTML Structure Extraction

dayzman@hotmail.com
12-08-2004
Hi,

I'm going to write a program that extracts the structure of HTML
documents. The structure would take the form of a tree that separates the
tags and pairs each start tag with its end tag. I think I will use
htmllib.HTMLParser. Is it appropriate for my application? If so, I
believe I will need to keep track of the nesting depth reached.

Any tips for such an application would be much appreciated.

Cheers,
Michael

 
Fredrik Lundh
12-08-2004
dayzman@hotmail.com wrote:

> I'm going to write a program that extracts the structure of HTML
> documents. The structure would take the form of a tree that separates the
> tags and pairs each start tag with its end tag. I think I will use
> htmllib.HTMLParser. Is it appropriate for my application? If so, I
> believe I will need to keep track of the nesting depth reached.


you mean like:

http://www.crummy.com/software/BeautifulSoup/
http://effbot.org/zone/element-tidylib.htm
http://utidylib.berlios.de/
http://www.xmlsoft.org/
http://effbot.org/zone/pythondoc-ele...reeBuilder.htm

and a few dozen others?

</F>
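
For readers who still want to see the hand-rolled approach from the original question, here is a minimal sketch. It is not taken from any of the libraries linked above. It assumes Python 3, where htmllib no longer exists, and uses the standard library's html.parser.HTMLParser as a stand-in. The names Node, StructureParser, extract_structure and outline are hypothetical, chosen only for this illustration. A stack of open elements replaces an explicit depth counter: the depth is simply the stack length minus one.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML (void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in the extracted structure tree (hypothetical helper)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class StructureParser(HTMLParser):
    """Builds a tree of tag names; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        # Stack of currently open elements; len(self.stack) - 1 is the depth.
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        # Void elements have no end tag, so they are never left open.
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag, which also closes
        # any unclosed tags inside it. Stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def extract_structure(html):
    """Parse an HTML string and return the root Node of its tag tree."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.root


def outline(node, depth=0):
    """Return the tree as a list of tag names indented by depth."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    sample = "<html><body><div><p>one</p><br><p>two</p></div></body></html>"
    print("\n".join(outline(extract_structure(sample))))
```

This sketch does not apply the implicit end-tag rules that browsers use (for example, a new li closing the previous one), so badly formed markup will produce a tree that differs from what a browser builds. Handling such cases is the main reason to prefer one of the existing libraries listed above.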



 