Velocity Reviews - Computer Hardware Reviews

HTMLParser and write

Kai I Hendry
I am finding the HTMLParser example in the documentation a little lacking.

I want an example which parses and then writes the same HTML file (a fine test
case!). Does anyone know where I can find such an example? My initial attempt
is proving tricky. For example, do I really need to do things like ' %s="%s" '
% (name, value) with the attributes? What happens if a tag need not be closed
by handle_endtag? Why does my __init__ def not work? And what about the rest,
from decl to parsing entities?

import sys
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    #def __init__(self):
    #    self.tagsoup = []

    def handle_starttag(self, tag, attrs):
        sys.stdout.write('<%s' % tag)
        for attr in attrs:
            name, value = attr
            sys.stdout.write(' %s="%s" ' % (name, value))
        # Finish the start tag
        sys.stdout.write('>')

        #This is the whole tag
        #But, how do I know if it needs to be closed?
        #print self.get_starttag_text()

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_endtag(self, tag):
        sys.stdout.write('</%s>' % tag)

        #Something like this?
        #Or is there a better way?
        #print self.check_for_whole_start_tag

if __name__ == "__main__":
    h = MyHTMLParser()

    # __init__ def results in some sort of rawdata error, hence:
    h.tagsoup = []

    import urllib2
    html = urllib2.urlopen('')
    # Feed the downloaded page to the parser, then flush any buffered input
    h.feed(html.read())
    h.close()
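
On the __init__ question above: a rawdata error is what one would expect when a subclass defines __init__ without calling the base class initializer, because HTMLParser.__init__ is what creates the internal rawdata buffer that feed() appends to. Below is a minimal sketch of the usual fix. It is written for Python 3, where the module lives at html.parser (an assumption, since the thread uses the Python 2 module), and the class name TagCollector is made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class


class TagCollector(HTMLParser):
    """Hypothetical subclass that keeps its own state in tagsoup."""

    def __init__(self):
        # The base initializer calls reset(), which creates self.rawdata.
        # Without this call, feed() has no buffer to append to.
        super().__init__()
        self.tagsoup = []

    def handle_starttag(self, tag, attrs):
        # Record every start tag name in the order it is seen.
        self.tagsoup.append(tag)


collector = TagCollector()
collector.feed("<html><body><p>hello</p></body></html>")
collector.close()
print(collector.tagsoup)
```

In the Python 2 module used in this thread, the equivalent call would be HTMLParser.__init__(self) as the first line of the subclass __init__.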
Stephen Ferg
You're right. The example is REALLY feeble. Maybe this will help:

"""
Use HTMLParser to read in an HTML file and write it out again.
This will put all tag and attribute names into lowercase.

Revision 2, 2004-01-05: added handle_pi and improved attribute processing
"""

from HTMLParser import HTMLParser

class CustomizedParser(HTMLParser):

    def setOutfileName(self, argOutfileName):
        """Remember the output file, so it is easy to write to it."""
        self.OutfileName = argOutfileName
        self.Outfile = open(self.OutfileName, "w")

    def closeOutfile(self):
        self.Outfile.close()

    def write(self, argString):
        self.Outfile.write(argString)

    def handle_starttag(self, argTag, argAttrs):
        """argAttrs is a list of tuples.
        Each tuple is a pair of (attribute_name, attribute_value).
        """
        attributes = "".join([' %s="%s"' % (key, value) for key, value in argAttrs])
        self.Outfile.write("<%s%s>" % (argTag, attributes))

    def handle_startendtag(self, argTag, argAttrs):
        """argAttrs is a list of tuples.
        Each tuple is a pair of (attribute_name, attribute_value).
        """
        attributes = "".join([' %s="%s"' % (key, value) for key, value in argAttrs])
        self.Outfile.write("<%s%s/>" % (argTag, attributes))

    def handle_endtag(self, argTag):
        self.write("</%s>" % argTag)

    def handle_data(self, argString):
        self.write(argString)

    def handle_charref(self, argString):
        self.write("&#%s;" % argString)

    def handle_entityref(self, argString):
        self.write("&%s;" % argString)

    def handle_comment(self, argString):
        self.write("<!--%s-->" % argString)

    def handle_decl(self, argString):
        self.write("<!%s>" % argString)

    def handle_pi(self, argString):
        # handle a processing instruction
        self.write("<?%s>" % argString)

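def main(myInfileName, myOutfileName):
    myInfile = open(myInfileName, "r")
    myParser = CustomizedParser()
    myParser.setOutfileName(myOutfileName)
    # Parse the whole input; the handlers write to the output file as they go
    myParser.feed(myInfile.read())
    myParser.close()
    myInfile.close()
    myParser.closeOutfile()

def dq(s):
    """Enclose a string argument in double quotes"""
    return '"' + s + '"'

if __name__ == "__main__":
    print "Starting HTMLParserDemoProgram"
    # Raw strings keep the backslashes in the Windows paths literal
    main(r"c:\junk\slide01.html", r"c:\junk\slide01a.html")
    print "Ending HTMLParserDemoProgram"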
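
Rebuilding each tag with ' %s="%s"' loses the original quoting and turns a valueless attribute such as hidden into hidden="None", because the parser reports its value as None. One way around this, sketched here under assumptions, is to echo get_starttag_text(), which returns the most recently opened start tag exactly as it appeared in the input. The sketch targets Python 3 (html.parser) rather than the Python 2 module used in the thread. The class name EchoParser and the sample markup are made up for illustration. Passing convert_charrefs=False keeps entity and character references as separate events, so they can be written back unchanged.

```python
import io
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical round-trip parser: writes back what it reads."""

    def __init__(self, out):
        # convert_charrefs=False makes the parser call handle_entityref
        # and handle_charref instead of decoding references into data.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # Echo the start tag verbatim instead of rebuilding it from attrs.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> arrive here; echo them once.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.write("<?%s>" % data)


source = '<!DOCTYPE html><p class="a" hidden>Fish &amp; chips &#233;<br/><!-- note --></p>'
out = io.StringIO()
parser = EchoParser(out)
parser.feed(source)
parser.close()
print(out.getvalue() == source)
```

This only reproduces start tags verbatim. End tag names are still reported in lowercase, so an input such as </P> would come back as </p>, and badly malformed markup may not survive the round trip unchanged.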