Easy way to remove HTML entities from an HTML document?

 
 
Robert Oschler
      07-25-2004
Is there a module/function to remove all the HTML entities from an HTML
document (e.g. - &nbsp, &amp, &apos, etc.)?

If not I'll just write one myself but I figured I'd save myself some time.

Thanks,
--
Robert


 
Christopher T King
      07-25-2004
On Sun, 25 Jul 2004, Robert Oschler wrote:

> Is there a module/function to remove all the HTML entities from an HTML
> document (e.g. - &nbsp, &amp, &apos, etc.)?


htmllib has this capability, but if you're not doing any other HTML
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
does nicely:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1)=='#':
        try:
            return chr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'

Unknown or invalid entities are left in &xxx; format, and numeric entities
that chr() cannot handle (such as Unicode code points above 255) are left in
&#nnn; format. If you want a Unicode string to be returned (and Unicode
entities interpreted), replace 'chr' with 'unichr', and
'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
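
For anyone on Python 3, the same idea can be sketched as follows. This is only an illustration under a few assumptions: htmlentitydefs is imported under its Python 3 name, html.entities; chr() covers the full Unicode range, so no unichr is needed; the pattern is tightened from .+? to \w+ so that a bare ampersand earlier in the text cannot absorb a following entity; and hexadecimal references (&#xNN;) are handled, which the code above does not attempt.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    body = m.group(2)
    try:
        if m.group(1) == '#':
            # Numeric reference: decimal (&#233;) or hexadecimal (&#xE9;).
            if body[:1] in ('x', 'X'):
                return chr(int(body[1:], 16))
            return chr(int(body))
        # Named entity such as &lt; or &eacute;.
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or invalid number: leave the text as it was.
        return m.group(0)

def converthtml(s):
    # re.sub calls convertentity once for every match.
    return re.sub(r'&(#?)(\w+);', convertentity, s)

print(converthtml('Some &lt;html&gt; string.'))
```

If the goal is simply to decode every entity, the standard library's html.unescape() (Python 3.4 and later) does this in one call; the regex version is mainly useful when unknown entities need custom handling.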

Hope this helps.

 
Michael Scarlett
      07-26-2004
"Robert Oschler" <no_replies@fake_email_address.invalid> wrote:
> Is there a module/function to remove all the HTML entities from an HTML
> document (e.g. - &nbsp, &amp, &apos, etc.)?
>
> If not I'll just write one myself but I figured I'd save myself some time.
>
> Thanks,



Check out Mark Pilgrim's site: http://diveintopython.org/html_processing/index.html
 
Robert Oschler
      07-26-2004
"Christopher T King" wrote:
>
> htmllib has this capability, but if you're not doing any other HTML
> parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
> does nicely:
>
> import re
> import htmlentitydefs
>
> def convertentity(m):
>     if m.group(1)=='#':
>         try:
>             return chr(int(m.group(2)))
>         except ValueError:
>             return '&#%s;' % m.group(2)
>     try:
>         return htmlentitydefs.entitydefs[m.group(2)]
>     except KeyError:
>         return '&%s;' % m.group(2)
>
> def converthtml(s):
>     return re.sub(r'&(#?)(.+?);',convert,s)
>
> converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'
>
> Unknown or invalid entities are left in &xxx; format, and numeric entities
> that chr() cannot handle (such as Unicode code points above 255) are left in
> &#nnn; format. If you want a Unicode string to be returned (and Unicode
> entities interpreted), replace 'chr' with 'unichr', and
> 'htmlentitydefs.entitydefs[m.group(2)]' with
> 'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
>
> Hope this helps.
>


Chris,

I believe the line that reads:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

Should read:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

Once I made that change it worked like a charm. I'm showing the correction
for future Usenet searchers.

So you can pass a function to re.sub() as the replacement pattern? Very
cool, I didn't know that. I think you could spend a year just learning
regular expressions and still miss something.


Thanks,
Robert.


 
Christopher T King
      07-27-2004
On Mon, 26 Jul 2004, Robert Oschler wrote:

> I believe the line that reads:
>
> def converthtml(s):
>     return re.sub(r'&(#?)(.+?);',convert,s)
>
> Should read:
>
> def converthtml(s):
>     return re.sub(r'&(#?)(.+?);',convertentity,s)


Oops, you're right. Mea culpa.

> So you can pass a function to re.sub() as the replacement pattern? Very
> cool, I didn't know that. I think you could spend a year just learning
> regular expressions and still miss something.


That feature is only mentioned briefly in the online docs, and not at all
in sre.sub's docstring. Surprising, since it's indeed a very useful
feature.
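
A minimal sketch of the feature discussed here: when the replacement argument to re.sub() is callable, it is called once per match with the match object, and whatever string it returns is substituted. The function name double and the sample string are made up for illustration.

```python
import re

def double(match):
    # The callable receives a match object and returns the replacement string.
    return str(int(match.group(0)) * 2)

# Each run of digits is replaced by twice its value.
result = re.sub(r'\d+', double, 'rooms 3 and 12')
print(result)  # rooms 6 and 24
```

A plain replacement string can only rearrange captured groups; a function can compute the replacement, which is what makes the dictionary lookup in convertentity possible.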

 
Robert Oschler
      07-27-2004
"Christopher T King" wrote:
>
> That feature is only mentioned briefly in the online docs, and not at all
> in sre.sub's docstring. Surprising, since it's indeed a very useful
> feature.
>


Chris,

Speaking of learning cool things by osmosis, do you know of a well commented
source of Python code, perhaps an Open Source project, that I could study to
learn more interesting techniques like the regexp tip you shared? I find
that studying other people's code is the best way to avoid getting in a
programming rut.

Thanks.

--
Robert



 
Christopher T King
      07-29-2004
On Tue, 27 Jul 2004, Robert Oschler wrote:

> Speaking of learning cool things by osmosis, do you know of a well commented
> source of Python code, perhaps an Open Source project, that I could study to
> learn more interesting techniques like the regexp tip you shared? I find
> that studying other people's code is the best way to avoid getting in a
> programming rut.


I seem to recall reading about that re.sub trick in something linked from
Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
are often links there to interesting and useful code snippets from
ActiveState's Python Cookbook and other sources; I'd say start there if
you want to find neat tricks you can do with Python.

I'm not sure of any particularly "well commented" Python projects though
(I've never really looked into that), but you'll probably find some
interesting small projects in the Vaults of Parnassus
(http://www.vex.net/parnassus/).

 
Robert Oschler
      07-30-2004

"Christopher T King" wrote:
>
> I seem to recall reading about that re.sub trick in something linked from
> Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
> are often links there to interesting and useful code snippets from
> ActiveState's Python Cookbook and other sources; I'd say start there if
> you want to find neat tricks you can do with Python.
>
> I'm not sure of any particularly "well commented" Python projects though
> (I've never really looked into that), but you'll probably find some
> interesting small projects in the Vaults of Parnassus
> (http://www.vex.net/parnassus/).
>


Thanks Chris and thanks for all your other help.

With your Python skill you should work for Google. Too bad you don't, you'd
be a wealthy man soon (Google IPO). Wish I did.

--
Robert


 
Christopher T King
      07-31-2004
On Fri, 30 Jul 2004, Robert Oschler wrote:

> With your Python skill you should work for Google. Too bad you don't,
> you'd be a wealthy man soon (Google IPO). Wish I did.


Thanks for the compliment. To work at Google is my dream job, and I'm
sure that of many others on this list, too (makes me wonder if any Google
employees read this list...).

 