Re: HTML to Text renderer

 
 
Ian Bicking
 
      11-03-2004
Robert Brewer wrote:
> Ian Bicking wrote:
>
>>Does anyone know of a module that can render HTML to text? Just a
>>subset of HTML, really; I'd like to compose emails using <p> tags and
>>whatnot, fill in all the values in the email template, then apply word
>>wrapping and other formatting. Also, it'll make using Zope Page
>>Templates with email easier.
>>
>>Even if all it supports is <p> and <br> that would be enough, but I'm
>>hoping there's something even more complete out there. I don't need
>>something as general as, say, Lynx; these templates would be written
>>with a specific renderer in mind.

>
>
> To clarify: you don't want the HTML tags merely stripped; you want to
> replace e.g. br with a line break and p with, say, two line breaks?


Right. And word wrapping too. Some other tags would also be
interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
to control alignment (e.g., <p align="">).
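A minimal sketch of the behaviour described here (<br> becomes a line break, <p> a blank line, and each line is word wrapped), assuming Python 3's html.parser and textwrap modules. The names ParagraphRenderer and html_to_text are made up for illustration; none of the other tags listed above are handled.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphRenderer(HTMLParser):
    """Collect text, treating <p> as a paragraph break and <br> as a line break."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # A list of paragraphs; each paragraph is a list of lines.
        self.paragraphs = [[""]]

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([""])   # start a new paragraph
        elif tag == "br":
            self.paragraphs[-1].append("")  # start a new line

    def handle_data(self, data):
        self.paragraphs[-1][-1] += data

    def render(self, width=72):
        blocks = []
        for lines in self.paragraphs:
            wrapped = []
            for line in lines:
                text = " ".join(line.split())  # collapse runs of whitespace
                if text:
                    wrapped.append(textwrap.fill(text, width=width))
            if wrapped:
                blocks.append("\n".join(wrapped))
        # Paragraphs are separated by one blank line.
        return "\n\n".join(blocks)


def html_to_text(html, width=72):
    parser = ParagraphRenderer()
    parser.feed(html)
    parser.close()
    return parser.render(width)


if __name__ == "__main__":
    print(html_to_text("<p>Dear customer,<br>thank you.</p><p>Regards</p>"))
```

Because entities are decoded by the parser (convert_charrefs=True), something like &amp; arrives in handle_data as a plain "&".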

--
Ian Bicking / (E-Mail Removed) / http://blog.ianbicking.org
 
 
 
 
 
Roger Binns
 
      11-03-2004
Ian Bicking wrote:
> Right. And word wrapping too. Some other tags would also be
> interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
> to control alignment (e.g., <p align="">).


Usually I resort to using one of the text-based browsers (e.g. lynx, links,
or w3m), which all have a mode that dumps the page as formatted plain text.
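A rough sketch of driving such a browser from Python 3, assuming lynx or w3m is installed and that the invocations below (`lynx -dump -stdin` and `w3m -dump -T text/html`) read HTML from standard input. The helper names find_dump_command and dump_html are hypothetical, and the exact output format depends on the browser and its configuration.

```python
import shutil
import subprocess

# Assumed invocations: each reads HTML on stdin and writes text to stdout.
DUMP_COMMANDS = {
    "lynx": ["lynx", "-dump", "-stdin"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
}


def find_dump_command():
    """Return the argument list for the first browser found on PATH, or None."""
    for name, argv in DUMP_COMMANDS.items():
        if shutil.which(name):
            return argv
    return None


def dump_html(html):
    """Pipe HTML through a text browser and return its plain-text rendering."""
    argv = find_dump_command()
    if argv is None:
        raise RuntimeError("no supported text browser found on PATH")
    result = subprocess.run(
        argv, input=html, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(dump_html("<p>Hello <b>world</b></p><hr><p>Bye</p>"))
```

This trades a pure-Python dependency for an external program, but it gets tables, rules, and block quotes rendered without any extra code.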

Roger


 
 
 
 
 
Marc Christiansen
 
      11-09-2004
Ian Bicking <(E-Mail Removed)> wrote:
> Robert Brewer wrote:
>> To clarify: you don't want the HTML tags merely stripped; you want to
>> replace e.g. br with a line break and p with, say, two line breaks?

>
> Right. And word wrapping too. Some other tags would also be
> interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
> to control alignment (e.g., <p align="">).


Have a look at htmllib.HTMLParser and formatter in the standard Python
lib (but also look at the source of htmllib). Maybe they provide what
you need.

HTH
Marc
 
 
Ivo Woltring
 
      11-09-2004

"Marc Christiansen" <(E-Mail Removed)-empire.de> wrote in message
news:(E-Mail Removed)-empire.de...
> Ian Bicking <(E-Mail Removed)> wrote:
> > Robert Brewer wrote:
> >> To clarify: you don't want the HTML tags merely stripped; you want to
> >> replace e.g. br with a line break and p with, say, two line breaks?

> >
> > Right. And word wrapping too. Some other tags would also be
> > interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
> > to control alignment (e.g., <p align="">).

>
> Have a look at htmllib.HTMLParser and formatter in the standard Python
> lib (but also look at the source of htmllib). Maybe they provide what
> you need.
>
> HTH
> Marc


Look at this code:

===CUT BELOW===
# Python 2 only: sgmllib was removed in Python 3.
from sgmllib import SGMLParser


class html2txt(SGMLParser):
    """Collect the text content of an HTML document, dropping all tags."""

    def reset(self):
        """reset() --> initialize the parser"""
        SGMLParser.reset(self)
        self.pieces = []

    def handle_data(self, text):
        """handle_data(text) --> appends the pieces to self.pieces
        handles all normal data not between brackets "<>"
        """
        self.pieces.append(text)

    def handle_entityref(self, ref):
        """called for each entity reference, e.g. for "&copy;", ref will be
        "copy". Only "&amp;" is translated (to "&"); all other entity
        references are dropped.
        """
        if ref == 'amp':
            self.pieces.append("&")

    def output(self):
        """Return processed HTML as a single string"""
        return " ".join(self.pieces)


if __name__ == "__main__":
    html = """<h1>just a piece of html</h1>
<div class="toc">
<ul>
<li><span class="section"><a
href="index.html#install.choosing">1.1. Which Python is right for
you?</a></span></li>
<li><span class="section"><a href="windows.html">1.2. Python
on Windows</a></span></li>
<li><span class="section"><a href="macosx.html">1.3. Python
on Mac OS X</a></span></li>
<li><span class="section"><a href="macos9.html">1.4. Python
on Mac OS 9</a></span></li>
<li><span class="section"><a href="redhat.html">1.5. Python
on RedHat Linux</a></span></li>
<li><span class="section"><a href="debian.html">1.6. Python
on Debian GNU/Linux</a></span></li>
<li><span class="section"><a href="source.html">1.7. Python
Installation from Source</a></span></li>
<li><span class="section"><a href="shell.html">1.8. The
Interactive Shell</a></span></li>
<li><span class="section"><a href="summary.html">1.9.
Summary</a></span></li>
</ul>
</div>
"""
    parser = html2txt()
    parser.reset()
    parser.feed(html)
    parser.close()
    print parser.output()
=== END CUT ===

The html2txt class is of course extensible and changeable. For me it was
important to convert HTML to text, but the behavior of the class can be
adjusted so that tags do other things. Hope it helps.
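As one possible illustration of that kind of extension, here is a sketch of a similar collector ported to Python 3's html.parser (sgmllib is not available there). The start_<tag>/end_<tag> dispatch is an assumption borrowed from sgmllib's naming convention, and the handlers shown for <li>, <hr>, and <pre> are arbitrary choices, not part of the class above.

```python
import re
from html.parser import HTMLParser


class Html2Txt(HTMLParser):
    """Text collector that lets subclasses hook individual tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        # Call start_<tag>(attrs) if such a method exists; ignore the tag otherwise.
        handler = getattr(self, "start_" + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        # Call end_<tag>() if such a method exists.
        handler = getattr(self, "end_" + tag, None)
        if handler is not None:
            handler()

    def handle_data(self, text):
        if not self.in_pre:
            text = re.sub(r"\s+", " ", text)  # collapse whitespace outside <pre>
        self.pieces.append(text)

    def start_li(self, attrs):
        self.pieces.append("\n* ")  # each list item becomes a bullet line

    def start_hr(self, attrs):
        self.pieces.append("\n" + "-" * 40 + "\n")  # horizontal rule

    def start_pre(self, attrs):
        self.in_pre = True
        self.pieces.append("\n")

    def end_pre(self):
        self.in_pre = False
        self.pieces.append("\n")

    def output(self):
        """Return the collected text with trailing spaces trimmed per line."""
        text = "".join(self.pieces)
        lines = [line.rstrip() for line in text.split("\n")]
        return "\n".join(lines).strip()


if __name__ == "__main__":
    parser = Html2Txt()
    parser.feed("<ul>\n<li>Python on Windows</li>\n<li>Python on Mac OS X</li>\n</ul>")
    parser.close()
    print(parser.output())
```

Supporting another tag is then a matter of adding one more start_<tag> or end_<tag> method in a subclass.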

Ivo.


 