Java - Re: Junit - "Credible" HTML checker?

 
08-07-2009, 12:13 AM   #1


On Thu, 6 Aug 2009, bugbear wrote:

> I have some routines that generate HTML;
> it would be useful if (in my unit testing)
> I had a quick and dirty "is this valid HTML" test.
>
> I don't need an html renderer - something
> cruddy based on "likely" looking regexps would
> suit me very well.
>
> I'm simply trying to avoid doing full deploy + interactive
> testing of stuff (html) which isn't even "likely".
>
> Does anyone know of anything?


The Rolls-Royce here is HtmlUnit, which is a complete headless browser -
it reads HTML, parses CSS, runs JavaScript (courtesy of Rhino), etc. It
has interfaces which make it easy to ask questions like "get me all the
div elements", "get me all the paragraph elements with class errorReport",
"get me the text content of this element", etc, which is what you need for
testing.
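
To make those three questions concrete, here is a minimal sketch that asks
them with the JDK's built-in XPath support instead of HtmlUnit. It is a
stand-in, not HtmlUnit's API: the class name MarkupQueries and the method
select are hypothetical, and it assumes the generated markup is well-formed
XML with no DOCTYPE and no xmlns declaration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: runs an XPath query against generated markup.
// Assumes the markup parses as XML (no DOCTYPE, no namespace declaration).
final class MarkupQueries {

    private MarkupQueries() {
    }

    /** Returns the nodes matching the expression; fails if the markup or expression is bad. */
    static NodeList select(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(markup)),
                    XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression and also when the markup cannot be parsed.
            throw new IllegalArgumentException("Query failed: " + expression, e);
        }
    }
}
```

In a test, select(html, "//div") gives all the div elements,
select(html, "//p[@class='errorReport']") gives the error paragraphs, and
item(0).getTextContent() on the result gives the text of the first match.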

It's built on top of NekoHTML, which is a pretty decent HTML parser. Other
popular parsers are JTidy and TagSoup, but I think those are more lenient
in their parsing (Neko can be lenient, but tends more towards strictness),
and for what you want to do, you don't want leniency.

Apologies for the lack of URLs, but you strike me as the kind of chap who
is quite capable of using Google!

tom

--
The sunlights differ, but there is only one darkness. -- Ursula K. LeGuin,
'The Dispossessed'


Tom Anderson
08-07-2009, 06:27 PM   #2
Jean-Baptiste Nizet
 

The problem with HtmlUnit (in this particular case) is precisely that it
tries to work like a real browser, which means that it'll do its best to
give you a DOM tree even if the HTML is not valid at all.

JB.


Jean-Baptiste Nizet
08-07-2009, 07:04 PM   #3
Arne Vajhøj
 

If super-strict parsing is needed, then generating XHTML and running it
through a regular XML parser is an option.
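
As a rough sketch of that idea, the JDK's own XML parser can serve as the
strict checker in a unit test. The class name XhtmlCheck and the method
firstError are made up for illustration. The sketch only checks
well-formedness (tags closed and properly nested, attributes quoted), not
validity against the XHTML DTD, and it assumes the generated markup carries
no DOCTYPE, so HTML entities such as &nbsp; would be reported as errors.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: treats generated XHTML as plain XML and reports
// the first well-formedness problem the parser finds.
final class XhtmlCheck {

    private XhtmlCheck() {
    }

    /** Returns null if the markup is well-formed XML, otherwise the parser's message. */
    static String firstError(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow every error so nothing is printed to stderr or silently recovered.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    // Warnings do not affect well-formedness; ignore them.
                }

                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }

                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            // Not a markup problem: the parser could not be set up or the string could not be read.
            throw new IllegalStateException(e);
        }
    }
}
```

A JUnit test could then assert that firstError(generatedHtml) is null and
print the returned message when it is not.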

Arne


Arne Vajhøj
08-07-2009, 08:09 PM   #4
Chris Riesbeck
 

Some other options you might want to explore -- Validator.nu might be the
most appropriate if JTidy isn't:

COBRA: http://lobobrowser.org/cobra.jsp
Validator.nu: http://about.validator.nu/htmlparser/
HTMLCleaner: http://htmlcleaner.sourceforge.net/



Chris Riesbeck
08-10-2009, 08:05 PM   #5
Tom Anderson
 
On Fri, 7 Aug 2009, Jean-Baptiste Nizet wrote:

> The problem with HtmlUnit (in this particular case) is precisely that it
> tries to work like a real browser, which means that it'll do its best to
> give you a DOM tree even if the HTML is not valid at all.


Ah, but then it's simply a matter of bending the tool to your will. We
modified HtmlUnit to handle XHTML - and amongst other things, that means
being less tolerant of errors. Basically, we found HtmlUnit's central
parsing class, the one which wraps NekoHTML, and changed the set of options
it sets on Neko before a parse. We also had to modify a few other spots in
the parser chain, ISTR. I'll dig out the details tomorrow.

tom

--
For various unconvincing reasons, your call may be recorded.


Tom Anderson