"Fixing html files"

 
 
John Resler
 
      03-16-2005
Hi all,
First I want to say I am fully aware of the huge scope of the problem
of parsing and correcting files of any sort. I have been using the jTidy
libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
I use and convert it to xhtml if possible. Not to complain about Tidy,
it is the only application I'm aware of that does what it does... I am
just curious if there are any other applications/libraries that perform
the same function, more completely?
 
 
 
 
 
David Carlisle
 
      03-16-2005
John Resler writes:

> Hi all,
> First I want to say I am fully aware of the huge scope of the problem
> of parsing and correcting files of any sort. I have been using the jTidy
> libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
> I use and convert it to xhtml if possible. Not to complain about Tidy,
> it is the only application I'm aware of that does what it does... I am
> just curious if there are any other applications/libraries that perform
> the same function, more completely?



Hard to quantify "more completely". Tidy does a better job than most.
An alternative route might be, for example, John Cowan's TagSoup
http://mercury.ccil.org/~cowan/XML/tagsoup/
which will let you feed most HTML into an XML processing
pipeline. It doesn't really do any cleaning up, but once you have it as
XML you just hit it with enough XSLT of your choice and it should all
come out looking lovely, er, in theory....

If you are feeling really brave, there's my htmlparse XSLT 2 stylesheet,
but this is decidedly unsupported.
http://www.dcarlisle.demon.co.uk/htmlparse.xsl

David
 
 
 
 
 
Nick Kew
 
      03-16-2005
John Resler wrote:
> Hi all,
> First I want to say I am fully aware of the huge scope of the
> problem of parsing and correcting files of any sort. I have been using
> the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up


Dave Raggett wrote the original tidy, but it's been some years since
he was in charge of it.

> the html I use and convert it to xhtml if possible. Not to complain
> about Tidy, it is the only application I'm aware of that does what it
> does... I am just curious if there are any other applications/libraries
> that perform the same function, more completely?


libxml2 parses html, including tagsoup html, and gives you SAX or DOM
APIs on it. You can then serialise that to better HTML or XHTML.
It's a different approach to tidy, and shares the same fundamental
problem of having to guess blindly when presented with heavy-duty
gibberish.
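
To make the parse-then-reserialise idea concrete, here is a minimal sketch. It does not use libxml2: as a stand-in it uses only the Python standard library (html.parser for the SAX-like events, xml.etree.ElementTree for the tree and the XML output). The names TreeBuilder, to_xml and VOID are made up for the example, and the only "guess" it makes on bad input is that a new <p> closes an open one.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content (a partial list, for illustration only).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Turn HTMLParser's SAX-like events into an ElementTree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("body")   # hypothetical wrapper element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes an open <p>: one of the guesses
        # any tag-soup parser has to make.
        if tag == "p" and self.stack[-1].tag == "p":
            self.stack.pop()
        # Valueless attributes (e.g. "checked") get their own name as value.
        attrib = {k: (k if v is None else v) for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                  # text following a child element
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def to_xml(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()                      # flush any buffered trailing text
    return ET.tostring(builder.root, encoding="unicode")
```

For example, to_xml("<p align=center>one<br>two<p>three") gives well-formed output with the attribute quoted and both paragraphs closed. Anything nastier than that needs many more rules, which is where the guessing comes in.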

A higher-level application based on libxml2 is AccessValet. Its
real purpose is (X)HTML accessibility analysis and reporting, but it
will also clean up (x)html. It takes a more brutal approach than
tidy: instead of attempting to substitute for crap, it strips it.
So if you take the default - which is strict output - it'll remove
everything that's deprecated in HTML4/XHTML1, and
<p align=center><font color=black>some text here<p>some more text
becomes
<p>some text here</p><p>some more text</p>

I wouldn't recommend it over tidy for that particular purpose, but it's
an option.
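
The strip-rather-than-substitute approach can be sketched in a few lines. This is not AccessValet's code, just an illustration of the idea using the Python standard library; StrictCleaner, clean and the three sets are hypothetical names, and the sets are small subsets rather than the full HTML4 lists of deprecated elements and attributes.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only, not the complete HTML4 Strict lists.
DEPRECATED_ELEMENTS = {"font", "center", "u", "s", "strike"}
DEPRECATED_ATTRS = {"align", "bgcolor", "color", "face"}
VOID = {"br", "hr", "img", "input", "meta", "link"}


class StrictCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open = []   # kept elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_ELEMENTS:
            return       # drop the tag itself, keep its text content
        if tag == "p" and "p" in self.open:
            self._close("p")             # a new <p> ends the previous one
        parts = []
        for name, value in attrs:
            if name in DEPRECATED_ATTRS:
                continue                 # strip, don't substitute
            value = name if value is None else value
            parts.append(f' {name}="{escape(value)}"')
        if tag in VOID:
            self.out.append(f"<{tag}{''.join(parts)} />")
        else:
            self.out.append(f"<{tag}{''.join(parts)}>")
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open:             # ignore stray or stripped end tags
            self._close(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close(self, tag):
        # Emit end tags for everything opened since (and including) `tag`.
        while self.open:
            top = self.open.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break


def clean(html):
    cleaner = StrictCleaner()
    cleaner.feed(html)
    cleaner.close()                      # flush any buffered trailing text
    while cleaner.open:                  # close whatever is still open
        cleaner.out.append(f"</{cleaner.open.pop()}>")
    return "".join(cleaner.out)
```

Feeding it the example above, clean("<p align=center><font color=black>some text here<p>some more text") returns <p>some text here</p><p>some more text</p>.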

You can also fix markup on the fly when serving it. The state of the
art there is mod_publisher, at
http://apache.webthing.com/mod_publisher/
which is far better than any of the tidy-in-a-webserver options.

--
Nick Kew
 