Velocity Reviews - Computer Hardware Reviews

Unstructured HTML extraction

I'm looking for a program that recovers the logical structure of
unstructured HTML documents. It should make good guesses about the
different font styles used to represent headings. For example, some pages
use <font size = 24> for headings and others use <h1>; both should
produce the same structure in the end. The output can be XML or another
format, and manual intervention should be kept to a minimum. Does anyone
know of such a program (preferably open source)?
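
To make the requirement concrete, here is a minimal sketch of the normalization described above, using only the Python standard library. It is an illustration, not an existing tool: the names HeadingNormalizer, extract_headings and headings_to_xml are hypothetical, and the font-size-to-level table is an assumption that a real program would have to estimate per document rather than hard-code.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingNormalizer(HTMLParser):
    """Collect (level, text) pairs from <h1>..<h6> and mapped <font size=...>."""

    def __init__(self, font_size_to_level):
        super().__init__()
        self.font_size_to_level = font_size_to_level  # assumed mapping
        self.headings = []   # list of (level, text)
        self._open = None    # (tag, level) of the heading being read
        self._depth = 0      # nested tags with the same name as the open one
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            # Already inside a heading: only track same-name nesting.
            if tag == self._open[0]:
                self._depth += 1
            return
        level = HEADING_TAGS.get(tag)
        if level is None and tag == "font":
            size = (dict(attrs).get("size") or "").strip()
            level = self.font_size_to_level.get(size)
        if level is not None:
            self._open = (tag, level)
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is None or tag != self._open[0]:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if text:
            self.headings.append((self._open[1], text))
        self._open = None


def extract_headings(html, font_size_to_level=None):
    """Return headings as (level, text), treating mapped font sizes like <hN>."""
    if font_size_to_level is None:
        font_size_to_level = {"24": 1, "18": 2}  # illustrative guess only
    parser = HeadingNormalizer(font_size_to_level)
    parser.feed(html)
    parser.close()
    return parser.headings


def headings_to_xml(headings):
    """Serialize (level, text) pairs as a flat XML document."""
    root = ET.Element("document")
    for level, text in headings:
        ET.SubElement(root, "heading", level=str(level)).text = text
    return ET.tostring(root, encoding="unicode")
```

With this sketch, extract_headings('<font size="24">Intro</font>') and extract_headings('<h1>Intro</h1>') both yield a level-1 heading "Intro", which is the "same structure" behaviour I am after. The hard part, guessing the size-to-level table automatically, is exactly what the sketch leaves out.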


