Velocity Reviews - Computer Hardware Reviews


How to convert markup text to plain text in python?

 
 
geoffbache
 
      02-01-2008
I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Regards,
Geoff
 
 
 
 
 
Tim Chase
 
      02-01-2008
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")



Well, if all you want to do is remove everything from a "<" to a
">", you can use

>>> s = "<B>Today</B> is <U>Friday</U>"
>>> import re
>>> r = re.compile(r'<[^>]*>')
>>> print(r.sub('', s))
Today is Friday

It should even work for semi-pathological cases such as

s = """You can find my <a
href='http://example.com'>thesis</a
> online"""

where the tag contents are split across lines. There are more
pathological cases where the tags aren't well-formed, e.g.

s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case you get what you deserve for creating such
pathological input.

-tkc
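
A minimal sketch of how the regex approach above behaves on the two harder inputs, assuming Python 3. The helper name strip_tags is hypothetical and only wraps the same pattern; the outputs in the comments show that a tag split across lines is removed cleanly, while the malformed input leaves stray characters behind.

```python
import re

# "<", then any run of characters that are not ">" (newlines included), then ">"
TAG = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Hypothetical helper: delete everything from a "<" to the next ">"."""
    return TAG.sub('', text)


# Tag contents split across lines: [^>] also matches newlines, so this works.
split_tag = """You can find my <a
href='http://example.com'>thesis</a
> online"""
print(strip_tags(split_tag))   # You can find my thesis online

# Malformed markup: the lone ">" and the tail of the nested tag survive.
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
print(strip_tags(malformed))   # This has a > sign in it and -nested> tags
```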



 
 
 
 
 
ph
 
      02-01-2008
On 01-Feb-2008, geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


Quick but very dirty way:

import urllib.request
data = urllib.request.urlopen('http://google.com').read().decode('utf-8', 'replace')
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])
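
To see what the split-based one-liner above is doing without fetching a page, here is a small sketch on string literals, assuming Python 3. The function name strip_by_splitting is hypothetical. Each piece after a "<" starts with the tag body, so keeping only what follows the first ">" drops the tag. The second call shows why this is "very dirty": text in front of a stray ">" is lost too.

```python
def strip_by_splitting(data):
    """Hypothetical wrapper around the split/join one-liner."""
    # Split on "<", then keep only what follows the first ">" in each piece.
    return ''.join(piece.split('>', 1)[-1] for piece in data.split('<'))


print(strip_by_splitting("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# A ">" that is not part of a tag makes the text before it disappear.
print(repr(strip_by_splitting("1 > 0 is <B>true</B>")))     # ' 0 is true'
```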



 
 
Steve Holden
 
      02-01-2008
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

>
>
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print r.sub('', s)

> Today is Friday
>
> it should even work for semi-pathological cases such as
>
> s = """You can find my <a
> href='http://example.com'>thesis</a
> > online"""

>
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
>
> s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"
>
> in which case you get what you deserve for making such
> pathological conditions
>

The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

 
 
Tim Chase
 
      02-01-2008
>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>> >>> s = "<B>Today</B> is <U>Friday</U>"
>> >>> import re
>> >>> r = re.compile('<[^>]*>')
>> >>> print r.sub('', s)

>> Today is Friday
>>

[Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" --
> see http://www.crummy.com/software/BeautifulSoup/


Yes, for more pathological cases, BS does a great job of parsing
junk.

However, BS isn't batteries-included. [Aside: BS and pyparsing
are two common solutions to problems, and both would make great
additions to the standard library.] Using an RE to make a
best-effort guess is a good first approximation of a solution
that doesn't require downloading extra packages, no matter how
useful those packages may be.

-tkc
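
For a batteries-included route that is not a regex, one possible sketch uses the standard library's html.parser module, assuming Python 3. This is a swapped-in technique, not something proposed in the thread; the class name TagStripper and the function strip_tags are hypothetical. The parser calls handle_data for each run of text between tags, so collecting those runs yields the plain text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keep text nodes, ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(markup):
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```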



 
 
Paul McGuire
 
      02-01-2008
On Feb 1, 10:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:
> >> Well, if all you want to do is remove everything from a "<" to a
> >> ">", you can use
>
> >> >>> s = "<B>Today</B> is <U>Friday</U>"
> >> >>> import re
> >> >>> r = re.compile('<[^>]*>')
> >> >>> print r.sub('', s)
> >> Today is Friday
>
> [Tim's ramblings about pathological cases snipped]


pyparsing includes an example script for stripping tags from HTML
source. See it on the wiki at http://pyparsing.wikispaces.com/spac...tmlStripper.py.

-- Paul
 
 
Zentrader
 
      02-02-2008
On Feb 1, 8:07 am, geoffbache <geoff.ba...@pobox.com> wrote:
> I have some marked up text and would like to convert it to plain text,


If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file. On Linux
it would be

lynx -dump http://www.etc > text.txt

Lynx is also available for MS Windows, but I am not sure about the
other two.
 
 
Stefan Behnel
 
      02-03-2008
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


>>> import lxml.etree as et
>>> doc = et.HTML("<b>Today</b> is <u>Friday</u>")
>>> et.tostring(doc, method='text', encoding='unicode')
'Today is Friday'


http://codespeak.net/lxml

Stefan
 
 
Stefan Behnel
 
      02-11-2008
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


This might be of interest:

http://pypi.python.org/pypi/haufe.stripml

Stefan
 