text representation of HTML

 
 
Ksenia Marasanova
      07-19-2006
Hi,

I am looking for a library that will give me a very simple text
representation of HTML.
For example,
<div><h1>Title</h1><p>This is a <br />test</p></div>

will be transformed to:

Title

This is a
test


I want to send a plain-text alternative to an HTML email, and would
prefer to generate it automatically from the HTML source.
Any hints?

Thanks!
Ksenia.
 
Diez B. Roggisch
      07-19-2006
Ksenia Marasanova wrote:

> Hi,
>
> I am looking for a library that will give me very simple text
> representation of HTML.
> For example
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
>
> i want to send plain text alternative of html email, and would prefer
> to do it automatically from HTML source.
> Any hints?


html2text is a command-line tool. You can invoke it from Python using
subprocess.

Diez
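
A rough sketch of the subprocess route in present-day Python 3. It assumes an executable named html2text is on the PATH and that it reads HTML on standard input and writes text to standard output; neither the name nor that behaviour is confirmed in this thread, so the command is a parameter and the function name is made up for illustration.

```python
import subprocess


def html_to_text_via_cli(html, command=("html2text",)):
    """Pipe HTML into an external converter and return what it prints.

    Assumption: `command` names a program that reads HTML on stdin and
    writes plain text on stdout. The default "html2text" is only a guess
    at the executable name on a given system.
    """
    completed = subprocess.run(
        list(command),
        input=html,           # sent to the child process on stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # exchange str instead of bytes
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return completed.stdout
```

Calling html_to_text_via_cli(html_source) would then return whatever the tool printed. Any extra command-line options belong in the command tuple, after checking the installed tool's own help.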
 
Laurent Rahuel
      07-19-2006
Hi,

I guess stripogram would be more Pythonic:
http://sourceforge.net/project/showf...?group_id=1083

Regards,

Laurent


 
garabik-news-2005-05@kassiopeia.juls.savba.sk
      07-20-2006
Ksenia Marasanova wrote:
> Hi,
>
> I am looking for a library that will give me very simple text
> representation of HTML.
> For example
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
>
> i want to send plain text alternative of html email, and would prefer
> to do it automatically from HTML source.


something like this:

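import re
text = '<div><h1>Title</h1><p>This is a <br />test</p></div>'
# collapse runs of whitespace into a single space
text = re.sub(r'[\n \t]+', ' ', text)
# start a new line at every <p>, <br>, <br /> and <h1>..<h6> tag
text = re.sub(r'(?i)<(p|br|h[1-6])\b[^>]*>', '\n', text)
# drop all remaining tags
result = re.sub(r'<.+?>', '', text)
print(result)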

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
Duncan Booth
      07-20-2006
Ksenia Marasanova wrote:

> I am looking for a library that will give me very simple text
> representation of HTML.
> For example
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
>
> i want to send plain text alternative of html email, and would prefer
> to do it automatically from HTML source.
> Any hints?


Use htmllib:

>>> import htmllib, formatter, StringIO
>>> def cleanup(s):
...     out = StringIO.StringIO()
...     p = htmllib.HTMLParser(
...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
...     p.feed(s)
...     p.close()
...     if p.anchorlist:
...         print >>out
...         for idx,anchor in enumerate(p.anchorlist):
...             print >>out, "\n[%d]: %s" % (idx+1,anchor)
...     return out.getvalue()
...

>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test</p></div>''')

Title

This is a
test
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a href="http://python.org">a link</a> to the Python homepage</p></div>''')

Title

This is a
test with a link[1] to the Python homepage

[1]: http://python.org
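
For readers on Python 3, where htmllib and the StringIO module no longer exist, a similar result can be approximated with html.parser from the standard library. This is only a sketch of the same idea, not a port of the code above: the class name, the BLOCK_TAGS set and the whitespace handling are illustrative choices.

```python
import re
from html.parser import HTMLParser

# Closing one of these tags ends a paragraph (an illustrative selection).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class TextExtractor(HTMLParser):
    """Collect text, break lines at block tags, and number the links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments and line breaks, in order
        self.anchors = []    # href values, in order of appearance
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchors.append(href)
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # mark the link text with its footnote number
            self.parts.append("[%d]" % len(self.anchors))
            self._in_link = False
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # whitespace in HTML source is not significant: collapse it
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # trim each line, then allow at most one blank line in a row
    lines = [" ".join(line.split()) for line in "".join(parser.parts).split("\n")]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    if parser.anchors:
        notes = "\n".join(
            "[%d]: %s" % (i, url) for i, url in enumerate(parser.anchors, 1)
        )
        text += "\n\n" + notes
    return text
```

It makes no attempt to handle tables, lists or nested links.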


 
Tim Williams
      07-20-2006
On 20 Jul 2006 15:12:27 GMT, Duncan Booth wrote:
> Ksenia Marasanova wrote:
> > i want to send plain text alternative of html email, and would prefer
> > to do it automatically from HTML source.
> > Any hints?

>
> Use htmllib:
>
> >>> import htmllib, formatter, StringIO
> >>> def cleanup(s):
> ...     out = StringIO.StringIO()
> ...     p = htmllib.HTMLParser(
> ...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
> ...     p.feed(s)
> ...     p.close()
> ...     if p.anchorlist:
> ...         print >>out
> ...         for idx,anchor in enumerate(p.anchorlist):
> ...             print >>out, "\n[%d]: %s" % (idx+1,anchor)
> ...     return out.getvalue()
> ...
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test</p></div>''')
>
> Title
>
> This is a
> test
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a href="http://python.org">a link</a> to the Python homepage</p></div>''')
>
> Title
>
> This is a
> test with a link[1] to the Python homepage
>
> [1]: http://python.org
>


cleanup() doesn't handle scripts and styles too well. html2text does
a much better job with these and gives more structured output
(compatible with Markdown):

http://www.aaronsw.com/2002/html2text/

>>> import html2text
>>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br />test with <a href="http://python.org">a link</a> to the Python homepage</p></div>''')

# Title

This is a
test with [a link][1] to the Python homepage

[1]: http://python.org


HTH
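
To illustrate the script/style point in isolation: the contents of those elements reach a parser as ordinary character data, so a converter has to suppress them on purpose. A minimal Python 3 sketch using only the standard library follows; the class and function names are made up for illustration, and it shows the skipping only, not how html2text works internally.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect character data, ignoring anything inside <script>/<style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Tag boundaries add no whitespace here; only whitespace already
    # present in the text is kept, collapsed to single spaces.
    return " ".join("".join(parser.chunks).split())
```

Line breaks for block-level tags are left out on purpose to keep the example short.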
 
Ksenia Marasanova
      09-21-2006
Sorry for the late reply... better late than never.
Thanks to all for the tips. Stripogram is the winner, since it is the
most configurable and accepts a line-length parameter, which is handy
for email...

Ksenia.
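
For anyone landing here later, a sketch of the last step in Python 3: wrapping the converted text to a fixed line length and attaching it as the plain-text alternative of an HTML email. It uses only the standard library (textwrap and email.message) as a stand-in; stripogram's own line-length parameter is not shown because its exact signature does not appear in this thread. The function names and the 72-column default are assumptions.

```python
import textwrap
from email.message import EmailMessage


def wrap_lines(text, width=72):
    """Wrap every line to `width` columns, keeping existing line breaks."""
    return "\n".join(
        textwrap.fill(line, width=width) for line in text.splitlines()
    )


def build_message(subject, sender, recipient, html, plain_text):
    """Build a multipart/alternative message: plain text first, HTML second."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(wrap_lines(plain_text))    # becomes the text/plain part
    msg.add_alternative(html, subtype="html")  # becomes the text/html part
    return msg
```

The plain_text argument would be the output of whichever converter is chosen. Mail clients that cannot or will not render HTML fall back to the first part.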


 