Question regarding HTMLParser module.

 
 
Adonis
      07-28-2003
When parsing my HTML files, I use handle_pi to capture some embedded Python
code. I have noticed that if the embedded code itself contains HTML,
HTMLParser parses that as well, so the code I exec is cut short and raises an
EOL SyntaxError. My current workaround is to write a different set of
characters, such as (tag) instead of <tag>, and convert them back to <tag>
in another function. Is there a way to tell HTMLParser to ignore the embedded
tags, or is there another alternative?

Any help would be greatly appreciated.
One more note: I am well aware of Zope, Webware, CherryPy, etc. for
py/html embedding, but I want this to be a learning experience.

HTML processing instruction:
<?
import time
print time.strftime('%b-%d-%Y')
print '<tt>testing!()</tt>'
>


error:
Traceback (most recent call last):
  File "C:\home\Adonis\python\t.py", line 40, in -toplevel-
    x.feed(z)
  File "C:\Python23\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python23\lib\HTMLParser.py", line 154, in goahead
    k = self.parse_pi(i)
  File "C:\Python23\lib\HTMLParser.py", line 232, in parse_pi
    self.handle_pi(rawdata[i+2: j])
  File "C:\home\Adonis\python\t.py", line 33, in handle_pi
    exec(data)
  File "<string>", line 4
    print '<tt
             ^
SyntaxError: EOL while scanning single-quoted string


 
Carl Banks
      07-28-2003
Adonis wrote:
> When parsing my HTML files, I use handle_pi to capture some embedded Python
> code. I have noticed that if the embedded code itself contains HTML,
> HTMLParser parses that as well, so the code I exec is cut short and raises an
> EOL SyntaxError. My current workaround is to write a different set of
> characters, such as (tag) instead of <tag>, and convert them back to <tag>
> in another function. Is there a way to tell HTMLParser to ignore the embedded
> tags, or is there another alternative?
>
> Any help would be greatly appreciated.
> One more note: I am well aware of Zope, Webware, CherryPy, etc. for
> py/html embedding, but I want this to be a learning experience.



Unfortunately, HTMLParser (and the similar sgmllib) fail miserably at
processing inline text such as embedded code. I know this very well; I have
an HTML-generating package that uses a lot of scripting and verbatim text.

What's happening in your case is that when HTMLParser processes a <?
tag, it naively reads text up to the next ">", so it thinks the > in
<tt> closes your <? tag. (It should at least have a flag indicating
whether it should read up to "?>" or just ">".)
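
To see the truncation for yourself, here is a minimal sketch. It assumes Python 3's html.parser (the successor of the HTMLParser module discussed here), and PICollector is just a hypothetical helper that records what the parser passes to each callback:

```python
# Sketch only: uses Python 3's html.parser, not the Python 2.3 HTMLParser module.
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper that logs the callbacks the parser triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


parser = PICollector()
parser.feed("<? print '<tt>testing</tt>' >")
parser.close()

# Everything the parser considered part of the processing instruction.
pi_chunks = [text for kind, text in parser.events if kind == "pi"]

for event in parser.events:
    print(event)
```

The processing instruction stops right before the first ">", so handle_pi only receives the text up to '<tt, and the remainder is reported as ordinary data and a closing tag.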

A workaround is to do something like this:

<? print '<tt\x3emonospaced</tt\x3e' >

where \x3e is the hex escape for > (\x29 would give you a closing
parenthesis instead). That's not quite as bad as replacing characters,
although it's still not perfect.
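
A rough sketch of the escape workaround, again assuming Python 3's html.parser and using \x3e as the hex escape for >. ExecPI and the output list are hypothetical names made up for the illustration, and exec should only ever see templates you trust:

```python
# Sketch only: Python 3 syntax, so print-style output goes through a list here.
from html.parser import HTMLParser


class ExecPI(HTMLParser):
    """Hypothetical handler that execs each processing instruction."""

    def __init__(self, namespace):
        super().__init__()
        self.namespace = namespace
        self.seen = []

    def handle_pi(self, data):
        # Strip surrounding whitespace so exec does not see a leading indent.
        code = data.strip()
        self.seen.append(code)
        exec(code, self.namespace)  # trusted templates only


output = []

# Raw string: the template text contains a literal backslash followed by x3e,
# so there is no ">" for the parser to trip over inside the code.
template = r"<? output.append('<tt\x3emonospaced</tt\x3e') >"

parser = ExecPI({"output": output})
parser.feed(template)
parser.close()

print(output)
```

The parser hands the whole statement to handle_pi because the only literal ">" is the one that ends the processing instruction. The escape is turned into ">" later, when exec compiles the string literal.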

Another possibility is to use sgmllib, but that's probably way more
trouble than it's worth, and still far from perfect. Basically,
sgmllib parsers have a method called verbatim that turns off HTML tag
processing, although entities and closing tags are still processed.
(You can more or less reconstruct entities and closing tags into the
original text, although the whitespace is lost.) This is what I do in
my own HTML-generating package.

I'll probably contribute some badly-needed remedies to HTMLParser
sometime, as the limitations of it and sgmllib are starting to get on
my nerves.
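
In the meantime, one alternative is to pull the code blocks out before the HTML ever reaches the parser. This is only a sketch under an assumption that is not in the original template: the blocks are closed with "?>" instead of a bare ">". PI_BLOCK and split_template are hypothetical names:

```python
import re

# Assumption: code blocks are written as <? ... ?> in the template.
PI_BLOCK = re.compile(r"<\?(.*?)\?>", re.DOTALL)


def split_template(source):
    """Split a template into ('html', text) and ('code', text) chunks, in order."""
    chunks = []
    pos = 0
    for match in PI_BLOCK.finditer(source):
        if match.start() > pos:
            chunks.append(("html", source[pos:match.start()]))
        chunks.append(("code", match.group(1).strip()))
        pos = match.end()
    if pos < len(source):
        chunks.append(("html", source[pos:]))
    return chunks


page = "<p>Today:</p><? print('<tt>testing!()</tt>') ?><hr>"
chunks = split_template(page)

for kind, text in chunks:
    print(kind, repr(text))
```

Only the 'html' chunks would then be fed to the parser, so tags inside the code are never seen by it. The obvious limitation is that a "?>" inside a Python string literal would still end the block early.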


--
CARL BANKS
 