Velocity Reviews - Computer Hardware Reviews

Stripping HTML with RE

 
 
Steveo
 
      11-09-2004
I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)

re.compile("<.*?>")

I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also let the HTML tag through (basically anything
with an H, a 1, or a 2 in it). How can I get this to strip all tags except
H1 and H2? Any help you could give would be great.

Steve
 
 
 
 
 
Steven Bethard
 
      11-09-2004
Steveo <stephen_p_barrett <at> hotmail.com> writes:
>
> I wanted to allow all H1 and H2 tags, so I changed it to:
>
> re.compile("<[^H1|^H2]*?>")
>
> This seemed to work, but it also let the HTML tag through (basically anything
> with an H, a 1, or a 2 in it). How can I get this to strip all tags except
> H1 and H2? Any help you could give would be great.
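
For what it's worth, the quoted pattern misbehaves because square brackets define a character class: [^H1|^H2] means "any single character other than H, 1, |, ^ or 2", not "anything but H1 or H2". A small check of that reading, using sample strings made up for illustration:

```python
import re

# Inside [...], characters are taken one at a time, so this class excludes
# the individual characters H, 1, |, ^ and 2 rather than the names H1 and H2.
broken = re.compile("<[^H1|^H2]*?>")

# Sample inputs are invented for this illustration.
print(broken.sub("", "<p>text</p>"))           # ordinary tags are stripped
print(broken.sub("", "<HTML>text</HTML>"))     # survives: the tag name contains an H
print(broken.sub("", '<td width="1">x</td>'))  # opening tag survives: it contains a 1
```

Any tag whose text contains one of those five characters is left alone, which matches what you are seeing.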


You probably want a lookahead assertion. From the docs at
http://docs.python.org/lib/re-syntax.html:

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion.
For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
'Asimov'.

So I would write your example something like:

>>> import re
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'

(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
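
A compiled version might look roughly like the sketch below. The names TAG_RE and strip_tags are only illustrative, and the sample input is made up:

```python
import re

# Compile the pattern once so it can be reused across many calls.
# TAG_RE and strip_tags are hypothetical names chosen for this sketch.
TAG_RE = re.compile(r'</?(?!H1|H2|/H1|/H2)[^>]*>')


def strip_tags(text):
    """Remove every tag except opening and closing H1/H2 tags."""
    return TAG_RE.sub('', text)


print(strip_tags('<H1>Title</H1><p>body</p>'))
```

The behaviour is the same as the re.sub calls above; compiling only avoids looking the pattern up again on each call.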

Steve

 
 
 
 
 
Miles Fender
 
      11-09-2004
Steveo wrote:
> I am currently stripping HTML from a string with the following code.
> (I know it's not the best way to strip HTML but bear with me)
> [...]


Instead of using REs, you might consider the StrippingParser
from the Python Cookbook:

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52281

It allows you to specify explicitly which tags you want to leave
intact, so you'll be able to change your mind later without futzing
about with a complex RE...
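
To give a feel for the parser-based idea, here is a minimal sketch. It is not the Cookbook recipe itself: it assumes Python 3's html.parser module, and the names AllowListStripper and strip_html are invented for this example.

```python
from html.parser import HTMLParser


class AllowListStripper(HTMLParser):
    """Hypothetical helper: keep text and allowed tags, drop every other tag."""

    def __init__(self, allowed=("h1", "h2")):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.allowed = {name.lower() for name in allowed}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; the raw start tag text is re-emitted.
        if tag in self.allowed:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <h1/>: emit it once, as written.
        if tag in self.allowed:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # End tags are rebuilt from the lowercased name.
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_html(text, allowed=("h1", "h2")):
    parser = AllowListStripper(allowed)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(strip_html('<H1 id="t">Title</H1><p>a &amp; b</p>'))
```

Changing which tags survive is then a matter of passing a different allowed tuple rather than editing a pattern. Note that in this sketch closing tags come back lowercased, and comments are dropped.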


Miles
 
 
Steven Bethard
 
      11-09-2004
I wrote:
> >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
> 'sdfsa'


Maybe slightly better:

>>> import re
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
'<H2>sdfsa</H2>'

I've just grouped things a bit differently so that I only have to write H1 and
H2 once.
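
One caveat with the pattern as written: it is case-sensitive, so lowercase <h1> tags get stripped, and it also keeps any tag whose name merely starts with H1 or H2. A possible refinement, offered as a sketch rather than a tested solution, adds re.IGNORECASE and a \b word boundary. The name KEEP_HEADINGS_RE and the sample string are made up:

```python
import re

# Assumed refinement of the pattern above: ignore case and require the tag
# name to end right after H1/H2, so that a tag like <H10> is not kept.
KEEP_HEADINGS_RE = re.compile(r'<(?!/?(?:H1|H2)\b)[^>]*>', re.IGNORECASE)

sample = '<h1 class="x">Title</h1><p>text</p><H10>x</H10>'
print(KEEP_HEADINGS_RE.sub('', sample))
```

Attributes on the kept tags are left untouched, since the lookahead only inspects the tag name.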

Steve

 
 
 
 