
htmllib.py and parsing malformed HTML

 
 
KC
      09-02-2003
I have written a parser using htmllib.HTMLParser and it functions fine
unless the HTML is malformed. For example, in some instances, the
provider of the HTML leaves out the <TR> tags but includes the </TR> tags.

Apparently, htmllib and more likely sgmllib do not parse an end tag if a
corresponding start tag was not found. Does anyone know a way to "fool"
the parser into handling the end tag if a start tag was not found?

Thanks,

Kevin

 
Thomas Güttler
      09-02-2003
KC wrote:

> I have written a parser using htmllib.HTMLParser and it functions fine
> unless the HTML is malformed. For example, in some instances, the
> provider of the HTML leaves out the <TR> tags but includes the </TR> tags.
>
> Apparently, htmllib and more likely sgmllib do not parse an end tag if a
> corresponding start tag was not found. Does anyone know a way to "fool"
> the parser into handling the end tag if a start tag was not found?


Hi,

You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
parse the html.

thomas

 
KC
      09-02-2003
Thomas Güttler wrote:
>
> Hi,
>
> You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
> parse the html.


I appreciate the suggestion but unfortunately this will not work well
for me as the parser runs as part of a cron job. I wouldn't be able to
review the tidy error log in a timely fashion if there was a problem.

What would be really nice is a way to tell the parser it was "inside" a
<TR> when I encountered a <TD> after a closing </TR>. Browsers still
display the HTML correctly without a starting <TR>, but if the closing
</TR> is omitted everything gets mangled.

Any other suggestions?
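
One way to get the "inside a <TR>" behaviour described above is to track the row state by hand. The sketch below is not htmllib code: it uses Python 3's html.parser (htmllib and sgmllib no longer exist in Python 3), and the names RowRepairingParser and repair_rows are made up for illustration. It assumes that any <TD> or <TH> seen outside a row marks a missing <TR>, and it silently drops comments and declarations.

```python
from html.parser import HTMLParser


class RowRepairingParser(HTMLParser):
    """Re-emit markup, inserting a <tr> before any cell found outside a row."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.in_row = False  # True between a real or synthesized <tr> and </tr>
        self.out = []        # pieces of the repaired markup

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag in ("td", "th") and not self.in_row:
            # Cell outside any row: pretend the missing <tr> was there.
            self.out.append("<tr>")
            self.in_row = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def repair_rows(html):
    """Return html with a <tr> added wherever a row start was left out."""
    parser = RowRepairingParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling repair_rows on the vendor's page before handing it to the real parser would keep the repair separate from the parsing logic, at the cost of parsing the document twice.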

 
KC
      09-02-2003
KC wrote:
>
> What would be really nice is a way to tell the parser it was "inside" a
> <TR> when I encountered a <TD> after a closing </TR>. Browsers still
> display the HTML correctly without a starting <TR>, but if the closing
> </TR> is omitted everything gets mangled.
>

I solved this problem, perhaps not the most elegant way, but it is still
solved. Any suggestions on improvements are welcome. I added the
following method to my parser class to make this work:


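def parse_endtag(self, i):
    # Peek at the two characters after "</" to spot a closing TR tag.
    rawdata = self.rawdata
    tag = rawdata[i+2:i+4].strip().lower()
    if tag == 'tr':
        # Write the end tag directly, since the parser drops it
        # when no matching <TR> start tag was seen.
        self.fmtr.writer.send_tag('</TR>')
    return htmllib.HTMLParser.parse_endtag(self, i)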


I should also mention that I added the send_tag method to my writer
implementation which simply writes the given text to the output stream.
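
On the request for improvements: the slice rawdata[i+2:i+4] looks at exactly two characters, so any end tag whose name merely starts with "tr" would also match. A possible refinement, sketched in Python 3, reads the whole tag name with a regular expression. The helper end_tag_name, the name pattern, and the StreamWriter class are assumptions for illustration; only send_tag's behaviour (write the given text to the output stream) comes from the post above.

```python
import io
import re

# "</" followed by a tag name. The character class for names is an
# assumption, not taken from htmllib or sgmllib.
_END_TAG = re.compile(r"</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)")


def end_tag_name(rawdata, i):
    """Return the lower-cased name of the end tag starting at rawdata[i], or None."""
    match = _END_TAG.match(rawdata, i)
    return match.group(1).lower() if match else None


class StreamWriter:
    """Hypothetical stand-in for a writer that has a send_tag method."""

    def __init__(self, stream):
        self.stream = stream

    def send_tag(self, text):
        # Write the literal tag text straight to the output stream.
        self.stream.write(text)


def forward_tr_end_tags(rawdata, writer):
    """Send '</TR>' to writer once for every real </tr> end tag in rawdata."""
    for hit in re.finditer("</", rawdata):
        if end_tag_name(rawdata, hit.start()) == "tr":
            writer.send_tag("</TR>")
```

For example, forward_tr_end_tags("<td>x</td></TR></track>", StreamWriter(io.StringIO())) writes a single "</TR>", whereas the two-character slice would treat "</track>" as a row end as well.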


 
John J. Lee
      09-02-2003
KC writes:

> Thomas Güttler wrote:
> > Hi,
> > You could use tidy (http://www.w3.org/People/Raggett/tidy/) before
> > you parse the html.

>
> I appreciate the suggestion but unfortunately this will not work well
> for me as the parser runs as part of a cron job. I wouldn't be able
> to review the tidy error log in a timely fashion if there was a
> problem.

[...]

So, what about *your* code's error log (or the equivalent --
presumably an unhandled traceback)?? It's not obvious that your
solution (in a later post) will be any more robust than just piping
everything through HTMLTidy. In fact, since you will find a great
variety of nonsense in 'HTML as deployed', it seems likely that
HTMLTidy will do the better job.


John
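
For readers weighing the two positions, here is a rough Python 3 sketch of piping a page through HTML Tidy from an unattended job while keeping Tidy's messages in the job's own log. The function name tidy_or_passthrough, the logger name, and the fallback to the raw HTML are assumptions for illustration. The -q flag and the force-output option are standard Tidy options, and Tidy's exit status is 0 for clean input, 1 for warnings, and 2 for errors.

```python
import logging
import shutil
import subprocess

log = logging.getLogger("tidy_step")  # hypothetical logger name


def tidy_or_passthrough(html, exe="tidy"):
    """Return html cleaned by HTML Tidy, or unchanged if Tidy is unavailable."""
    path = shutil.which(exe)
    if path is None:
        log.warning("%s not found; using the raw HTML", exe)
        return html
    proc = subprocess.run(
        # -q suppresses the summary; force-output asks Tidy to emit
        # markup even when it reports errors.
        [path, "-q", "--force-output", "yes"],
        input=html,
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # 1 means warnings, 2 means errors; keep the details for later review.
        log.warning("tidy exit status %d: %s", proc.returncode, proc.stderr.strip())
    # Fall back to the original text if Tidy produced nothing at all.
    return proc.stdout or html
```

Because the diagnostics go through the logging module, they can be routed to the same place as the parser's own errors, so there is no separate Tidy error file to review.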
 
KC
      09-04-2003
John J. Lee wrote:

>
> So, what about *your* code's error log (or the equivalent --
> presumably an unhandled traceback)?? It's not obvious that your
> solution (in a later post) will be any more robust than just piping
> everything through HTMLTidy. In fact, since you will find a great
> variety of nonsense in 'HTML as deployed', it seems likely that
> HTMLTidy will do the better job.
>


If this parser were handling a "great variety of nonsense" I would
wholeheartedly agree with you. However, since this HTML is from a
single vendor and that vendor is a government entity, this solution was
better than integrating a third-party product. As with most
organizations, changing *our* code is much more acceptable to the powers
that be than bringing in a third-party product that would have to be
evaluated and approved over countless meetings. For many of
us, business and policy decisions often forge the direction for
technology usage within our organizations.

 
John J. Lee
      09-04-2003
KC writes:

> John J. Lee wrote:
>
> > So, what about *your* code's error log (or the equivalent --
> > presumably an unhandled traceback)?? It's not obvious that your

[...]
> If this parser was handling a "great variety of nonsense" I would
> wholeheartedly agree with you. However, since this HTML is from a
> single vendor and that vendor is a government entity, this solution


Oh, got you. Fair enough.


[...]
> for technology usage within our organizations.


You can always tell when someone's 'business button' has been pushed
when they use the word 'within'.


John
 
Jeremy Bowers
      09-05-2003
On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
> As with most organizations,
> changing *our* code is much more acceptable to the powers that be, than
> bringing in a third-party product that will have to be evaluated and have
> countless meetings over its approval. For many of us, business and policy
> decisions often forge the direction for technology usage within our
> organizations.


If you are having real problems with poor HTML, HTMLTidy may be worth
going to bat over. If you can find a simple solution that works on the
HTML you are processing, great, go with it, and it's worth researching in
your situation first. But HTML can go bad in more ways than you can
imagine (which is in fact part of the problem); if you are getting HTML
that's bad in a lot of little ways, you'll find the "apply a hack to fix
this file, apply a hack to fix that file" will start stepping on its own
toes.

HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
functionality that you can *not* replicate in a reasonable amount of time;
it's one of those packages that isn't so much a program that "does
something" as a program that represents many, many man-years of "knowledge
acquired".

I'm not trying to push anything, since I don't know your situation, but
HTMLTidy is one of those rare projects that you really shouldn't allow NMH
to scuttle unless you *really* need to. (Again, I mention if there's some
simple way you can characterize the bad HTML coming out of one single
program, go ahead and try to fix it; maybe you'll get lucky and a regex
will be enough.)
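
As an illustration of the "maybe a regex will be enough" case, here is a small Python 3 sketch for the specific defect in this thread (rows that have </TR> but no <TR>). The pattern and the function name add_missing_tr are assumptions, and the sketch only covers a cell that directly follows a <TABLE> start tag or a </TR>. It does nothing for cells that follow <TBODY>, comments, or other markup.

```python
import re

# A <td> or <th> that comes straight after <table ...> or </tr>
# (whitespace allowed) has no opening <tr> of its own.
_MISSING_TR = re.compile(
    r"(<table\b[^>]*>|</tr\s*>)(\s*)(?=<t[dh]\b)",
    re.IGNORECASE,
)


def add_missing_tr(html):
    """Insert <tr> before any cell that directly follows <table> or </tr>."""
    return _MISSING_TR.sub(r"\1\2<tr>", html)
```

Input that already has its <TR> tags passes through unchanged, so the function can run unconditionally ahead of the parser.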
 
KC
      09-05-2003
Jeremy Bowers wrote:
> On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
>

....

> that's bad in a lot of little ways, you'll find the "apply a hack to fix
> this file, apply a hack to fix that file" will start stepping on its own
> toes.

Oh yeah, I couldn't agree more. Any more requests for "hacks" and
HTMLTidy gets brought into the picture.
>
> HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
> functionality that you can *not* replicate in a reasonable amount of time;
> it's one of those packages that isn't so much a program that "does
> something" as a program that represents many, many man-years of "knowledge
> acquired".
>

Agreed. I like HTMLTidy very much and it's obvious it could save us
developers a lot of effort.


 