Velocity Reviews - Computer Hardware Reviews

Java API for correcting malformed HTML code

 
 
MCP
 
      06-09-2004
Hello,
What are the Java APIs out there that can simply correct malformed
HTML code, i.e. take an input stream of badly formed HTML and produce
an output stream of clean HTML code (parsable by the Swing HTML
parser)?
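
One way to check whether a candidate cleanup is "parsable by the Swing HTML parser" is to feed it to that parser and record what it reports. The following is only a sketch under assumptions: the class name SwingParserProbe and its Report helper are hypothetical, and the exact error messages the parser emits are not specified here. It uses the JDK's javax.swing.text.html.parser.ParserDelegator with a callback that collects start tags and error reports.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: runs the Swing HTML parser and records what it sees.
final class SwingParserProbe {

    /** Collects the start tags and error reports for one document. */
    static final class Report extends HTMLEditorKit.ParserCallback {
        final List<String> startTags = new ArrayList<>();
        final List<String> errors = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            startTags.add(tag.toString());
        }

        @Override
        public void handleError(String message, int pos) {
            errors.add(pos + ": " + message);
        }
    }

    /** Parses the given HTML text and returns what the parser reported. */
    static Report probe(String html) {
        Report report = new Report();
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), report, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        Report report = probe("<ul><li>one<li>two</ul>");
        System.out.println("start tags: " + report.startTags);
        System.out.println("errors:     " + report.errors);
    }
}
```

Running the same probe on the raw input and on the cleaned output gives a rough before/after comparison of what the Swing parser complains about.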
 
Thomas Weidenfeller
 
      06-09-2004
MCP wrote:
> What are the Java APIs out there that can simply correct malformed
> HTML code, i.e. take an input stream of badly formed HTML and produce
> an output stream of clean HTML code (parsable by the Swing HTML
> parser)?


Maybe this can help http://jtidy.sourceforge.net/ No idea if it fulfills
all your requirements.

/Thomas
 
Roedy Green
 
      06-09-2004
On 9 Jun 2004 06:03:20 -0700, MCP wrote or quoted:

>What are the Java APIs out there that can simply correct malformed
>HTML code, i.e. take an input stream of badly formed HTML and produce
>an output stream of clean HTML code (parsable by the Swing HTML
>parser)?


I have been bugging the HTMLValidator people to write such a beast. I
figured it could save me a ton of work if it did simple, unambiguous
corrections like inserting a missing </li> or converting a stray & to &amp;.

His fear is making a change that the user did not want. He did not
want to be morally liable for messing up the source.

I have done a number of one-shot programs to clean up various problems
in my website. They do it all with indexOf and substring. If you are
just trying to correct a single problem at a time, it can be pretty
simple.
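
A one-shot fixer of that kind might look roughly like the sketch below, which handles just one problem: a stray & that does not start an entity. The class name AmpersandFixer, the helper looksLikeEntity, and the 10-character cap on entity length are assumptions made for illustration, not anything taken from the programs mentioned above.

```java
// Hypothetical one-shot cleaner: escapes stray '&' using only indexOf and substring.
final class AmpersandFixer {

    /** Returns html with every '&' that does not start an entity replaced by "&amp;". */
    static String escapeStrayAmpersands(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        int from = 0;
        while (true) {
            int amp = html.indexOf('&', from);
            if (amp < 0) {
                // No more ampersands: copy the tail and finish.
                out.append(html.substring(from));
                return out.toString();
            }
            out.append(html.substring(from, amp));
            out.append(looksLikeEntity(html, amp) ? "&" : "&amp;");
            from = amp + 1;
        }
    }

    /** True if the text at position amp looks like &name; or &#123; or &#x1F; */
    private static boolean looksLikeEntity(String html, int amp) {
        int semi = html.indexOf(';', amp + 1);
        // Assumed cap: treat anything longer than 10 characters as not an entity.
        if (semi < 0 || semi == amp + 1 || semi - amp > 10) {
            return false;
        }
        String body = html.substring(amp + 1, semi);
        return body.matches("#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*");
    }

    public static void main(String[] args) {
        // Prints: Tom &amp; Jerry &amp; friends &#169; 2004
        System.out.println(escapeStrayAmpersands("Tom & Jerry &amp; friends &#169; 2004"));
    }
}
```

The check only tests whether the text has the shape of an entity, not whether the entity name is a real one, so something like "&foo;" is left alone.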

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
Andrew Thompson
 
      06-10-2004
On Wed, 09 Jun 2004 20:54:17 GMT, Roedy Green wrote:

> ..it could save me a ton of work if it did simple unambiguous
> corrections like insert missing </li>


(whispers) W3C definition for the <li>
is that it does not require a closing </li>..

<http://www.w3.org/TR/1999/REC-html401-19991224/struct/lists.html#didx-list>

--
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
 
 
Roedy Green
 
      06-10-2004
On Thu, 10 Jun 2004 04:03:36 GMT, Andrew Thompson wrote or quoted:

>(whispers) W3C definition for the <li>
>is that it does not require a closing </li>..


what about </td> and </tr>?

Anyway I like to have the HTML consistent.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
Andrew Thompson
 
      06-10-2004
On Thu, 10 Jun 2004 06:14:58 GMT, Roedy Green wrote:

> On Thu, 10 Jun 2004 04:03:36 GMT, Andrew Thompson wrote or quoted:
>
>>(whispers) W3C definition for the <li>
>>is that it does not require a closing </li>..

>
> what about </td> and </tr>?


I am pretty sure they need to be
explicitly closed. (shrugs) If in doubt,
leave one out and throw it at the validator
(which is usually quicker than finding the
element on W3C's site)

> Anyway I like to have the HTML consistent.


I know what you mean, it has taken
some training to *prevent* myself from
typing </p> and </li>..

--
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
 
 
Andrew Thompson
 
      06-10-2004
On Thu, 10 Jun 2004 18:37:46 GMT, arne thormodsen wrote:

>> I know what you mean, it has taken
>> some training to *prevent* myself from
>> typing </p> and </li>..
>>

>
> Why bother? All new browsers..


...not all browsers are new, not all users
can update, not all sites can afford to
turn away customers just because their
browser is not flavour of the month.

That's why.

--
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
 
 
arne thormodsen
 
      06-10-2004
>
> I know what you mean, it has taken
> some training to *prevent* myself from
> typing </p> and </li>..
>


Why bother? All new browsers interpret XHTML properly, so you might
as well make your HTML well-formed as XML too. Then you can use XML
tools to process it.
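
As a rough illustration of "use XML tools to process it", the sketch below runs the JDK's standard DOM parser over two fragments, one with every tag closed and one with the closing </li> tags omitted. The class name XhtmlProbe and its methods are hypothetical, and the fragments deliberately carry no DOCTYPE or named entities such as &nbsp;, which would need extra handling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: treats markup as XML and reports whether that works.
final class XhtmlProbe {

    /** Parses markup as XML; throws IllegalArgumentException if it is not well-formed. */
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + e.getMessage(), e);
        }
    }

    /** Counts elements with the given tag name in well-formed markup. */
    static int countElements(String markup, String tagName) {
        return parse(markup).getElementsByTagName(tagName).getLength();
    }

    /** True if an XML parser accepts the markup. */
    static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String closed = "<ul><li>one</li><li>two<br /></li></ul>";
        String unclosed = "<ul><li>one<li>two</ul>";
        System.out.println(countElements(closed, "li")); // prints 2
        System.out.println(isWellFormed(unclosed));      // prints false
    }
}
```

The second fragment is legal HTML 4.01 but is rejected by the XML parser, which is the trade-off being argued about in this thread.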

--arne


 
 
arne thormodsen
 
      06-10-2004


>
> Maybe this can help http://jtidy.sourceforge.net/ No idea if it fulfills
> all your requirements.
>


I've used it extensively in the past. It works pretty well.

--arne

> /Thomas



 
 
Christophe Vanfleteren
 
      06-10-2004
Andrew Thompson wrote:

> On Thu, 10 Jun 2004 18:37:46 GMT, arne thormodsen wrote:
>
>>> I know what you mean, it has taken
>>> some training to *prevent* myself from
>>> typing </p> and </li>..
>>>

>>
>> Why bother? All new browsers..

>
> ..not all browsers are new, not all users
> can update, not all sites can afford to
> turn away customers just because their
> browser is not flavour of the month.
>
> That's why.
>


I'm pretty sure even Netscape 4.7 or Lynx interprets </p> and </li>
correctly. Even pure XHTML should pose no problem for those, when you write
the empty elements like <br> as <br /> instead of <br/>. Any browser better
than those (that's all of the currently used browsers) should have no
problems if you close your tags.

As it says in the spec, the closing tags are not *required*; it doesn't say
that they shouldn't be present. And the advantages of writing
XML-compatible HTML are bigger than adjusting to the lowest common
denominator IMHO.

Have you got any example of a browser which breaks when you add the optional
closing tags?

--
Kind regards,
Christophe Vanfleteren
 