Convert from unicode chars to HTML entities

 
 
Steven D'Aprano
      01-29-2007
I have a string containing Latin-1 characters:

s = u"© and many more..."

I want to convert it to HTML entities:

result =>
"&copy; and many more..."

Decimal/hex escapes would be acceptable:
"&#169; and many more..."
"&#xA9; and many more..."

I can look up tables of HTML entities on the web (they're a dime a
dozen), turn them into a dict mapping character to entity, then convert
the string by hand. Is there a "batteries included" solution that doesn't
involve reinventing the wheel?


--
Steven D'Aprano


 
Adonis Vargas
      01-29-2007
Steven D'Aprano wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>
> Decimal/hex escapes would be acceptable:
> "&#169; and many more..."
> "&#xA9; and many more..."
>
> I can look up tables of HTML entities on the web (they're a dime a
> dozen), turn them into a dict mapping character to entity, then convert
> the string by hand. Is there a "batteries included" solution that doesn't
> involve reinventing the wheel?
>
>


It's *very* ugly, but I'm pretty sure you can make it look prettier.

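import htmlentitydefs as entity

s = u"© and many more..."
t = u""
for i in s:
    if ord(i) in entity.codepoint2name:
        # Map the character to its entity name, then back to a code point
        name = entity.codepoint2name.get(ord(i))
        entityCode = entity.name2codepoint.get(name)
        # A numeric character reference needs the trailing ";"
        t += "&#" + str(entityCode) + ";"
    else:
        t += i
print t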

Hope this helps.

Adonis
 
Adonis Vargas
      01-29-2007
Adonis Vargas wrote:
[...]


or

import htmlentitydefs as entity

s = u"© and many more..."
t = u""
for i in s:
    if ord(i) in entity.codepoint2name:
        name = entity.codepoint2name.get(ord(i))
        t += "&" + name + ";"
    else:
        t += i
print t

Which I think is what you were looking for.
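
For readers on Python 3, where htmlentitydefs was renamed html.entities, the same idea can be sketched with str.translate instead of an explicit loop. This is only an illustration, not code from the thread; the names ENTITY_TABLE and to_named_entities are made up for the example.

```python
from html.entities import codepoint2name  # Python 3 name for htmlentitydefs

# Translation table: code point -> "&name;" for every entry in the table.
ENTITY_TABLE = {cp: "&%s;" % name for cp, name in codepoint2name.items()}


def to_named_entities(text):
    """Replace each character that has a named entity with '&name;'."""
    # str.translate makes a single pass over the input, so the "&" of a
    # generated reference is never escaped a second time.
    return text.translate(ENTITY_TABLE)


print(to_named_entities("\u00a9 and many more..."))
```

Because the table also contains "&", "<", ">" and the double quote, those characters are replaced as well.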

Adonis
 
Gabriel Genellina
      01-29-2007
On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano wrote:

> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>


Module htmlentitydefs contains the tables you're looking for, but you need
a few transforms:

<code>
# -*- coding: iso-8859-15 -*-
from htmlentitydefs import codepoint2name

# Map each character to its named entity reference.
unichr2entity = dict((unichr(code), u'&%s;' % name)
                     for code, name in codepoint2name.iteritems()
                     if code != 38)  # exclude "&" (code point 38)

def htmlescape(text, d=unichr2entity):
    # Escape "&" first so the "&" of generated references is not re-escaped.
    if u"&" in text:
        text = text.replace(u"&", u"&amp;")
    for key, value in d.iteritems():
        if key in text:
            text = text.replace(key, value)
    return text

print '%r' % htmlescape(u'hello')
print '%r' % htmlescape(u'"©® áé&ö <²³>')
</code>

Output:
u'hello'
u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'

The result is a unicode object with every character that has a named
entity replaced. Characters without a named entity are left untouched. As
the docs for htmlentitydefs say, "the definition provided here contains
all the entities defined by XHTML 1.0 that can be handled using simple
textual substitution in the Latin-1 character set (ISO-8859-1)."
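
One possible way to cover characters that have no named entity is to fall back to numeric character references for whatever is left. The following is a hedged Python 3 sketch (html.entities is the Python 3 name of htmlentitydefs); NAMED and htmlescape_all are hypothetical names, not part of any library.

```python
from html.entities import codepoint2name

# Code point -> "&name;" for every character with a named entity.
NAMED = {cp: "&%s;" % name for cp, name in codepoint2name.items()}


def htmlescape_all(text):
    """Use named entities where they exist, numeric references otherwise."""
    named = text.translate(NAMED)
    # Any remaining non-ASCII character becomes a decimal reference (&#N;).
    return named.encode("ascii", "xmlcharrefreplace").decode("ascii")


# U+00A9 has a name (copy); U+0416, a Cyrillic letter, does not.
print(htmlescape_all("\u00a9 \u0416 <b>"))
```

The result is a pure-ASCII string, which may or may not be what you want if the page is served in an encoding that can represent the characters directly.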

--
Gabriel Genellina

 
Leif K-Brooks
      01-29-2007
Steven D'Aprano wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>
> Decimal/hex escapes would be acceptable:
> "&#169; and many more..."
> "&#xA9; and many more..."


>>> s = u"© and many more..."
>>> s.encode('ascii', 'xmlcharrefreplace')
'&#169; and many more...'
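
The built-in handler only produces decimal references. If hex references are preferred, a custom error handler can be registered with codecs.register_error. This is a minimal Python 3 sketch, not something from the thread; the handler name "hexcharrefreplace" is invented for the example.

```python
import codecs


def hexcharrefreplace(exc):
    """Error handler: replace unencodable characters with &#xHEX; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#x%X;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "hexcharrefreplace" is a made-up name, not a standard handler.
codecs.register_error("hexcharrefreplace", hexcharrefreplace)

s = "\u00a9 and many more..."
print(s.encode("ascii", "xmlcharrefreplace"))  # decimal references
print(s.encode("ascii", "hexcharrefreplace"))  # hex references
```

In Python 3 both calls return bytes; decode with 'ascii' if a str is needed.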
 
Steven D'Aprano
      01-29-2007
On Sun, 28 Jan 2007 23:41:19 -0500, Leif K-Brooks wrote:

> >>> s = u"© and many more..."
> >>> s.encode('ascii', 'xmlcharrefreplace')
> '&#169; and many more...'


Wow. That's short and to the point. I like it.

A few issues:

(1) It doesn't seem to be reversible:

>>> '&#169; and many more...'.decode('latin-1')
u'&#169; and many more...'

What should I do instead?


(2) Are XML entities guaranteed to be the same as HTML entities?


(3) Is there a way to find out at runtime what encoders/decoders/error
handlers are available, and what they do?


Thanks,


--
Steven D'Aprano

 
Leif K-Brooks
      01-29-2007
Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?


Unfortunately, there's nothing in the standard library that can do that,
as far as I know. You'll have to write your own function. Here's one
I've used before (partially stolen from code in Python patch #912410
which was written by Aaron Swartz):

from htmlentitydefs import name2codepoint
import re

def _replace_entity(m):
    s = m.group(1)
    if s[0] == u'#':
        # Numeric character reference: decimal (&#169;) or hex (&#xA9;)
        s = s[1:]
        try:
            if s[0] in u'xX':
                c = int(s[1:], 16)
            else:
                c = int(s)
            return unichr(c)
        except ValueError:
            # Not a valid number or code point: leave the text unchanged
            return m.group(0)
    else:
        # Named entity reference such as &copy;
        try:
            return unichr(name2codepoint[s])
        except (ValueError, KeyError):
            return m.group(0)

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def unescape(s):
    return _entity_re.sub(_replace_entity, s)
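
For what it's worth, later Python 3 releases (3.4 onward) do ship a function for this, html.unescape, which handles named, decimal and hex references. A small sketch of the round trip, assuming Python 3; the helper name roundtrip is invented for the example.

```python
import html


def roundtrip(text):
    """Escape non-ASCII characters numerically, then reverse it."""
    escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return escaped, html.unescape(escaped)


escaped, restored = roundtrip("\u00a9 and many more...")
print(escaped)
print(restored)

# Named and hex references are decoded too.
print(html.unescape("&copy; &#xA9;"))
```

Note that the round trip is only exact when the original text does not itself contain something that looks like a reference, since xmlcharrefreplace does not escape "&".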

> (2) Are XML entities guaranteed to be the same as HTML entities?


XML defines one entity which doesn't exist in HTML: &apos;. But
xmlcharrefreplace only generates numeric character references, and those
should be the same between XML and HTML.

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?


From what I remember, that's not possible because the codec system is
designed so that functions taking names are registered instead of the
names themselves. But all of the standard codecs are documented at
<http://python.org/doc/current/lib/standard-encodings.html>, and all of
the standard error handlers are documented at
<http://python.org/doc/current/lib/codec-base-classes.html>.
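
Although the registry cannot be enumerated, a handler can at least be probed by name with codecs.lookup_error, which raises LookupError for an unknown name. A hedged Python 3 sketch; CANDIDATES and available_handlers are hypothetical names, and the candidate list is just a guess at names worth trying.

```python
import codecs

# Names to probe; the last one is deliberately bogus.
CANDIDATES = ["strict", "ignore", "replace", "xmlcharrefreplace",
              "backslashreplace", "no-such-handler"]


def available_handlers(names):
    """Return the subset of names that are registered error handlers."""
    found = []
    for name in names:
        try:
            codecs.lookup_error(name)
        except LookupError:
            continue
        found.append(name)
    return found


print(available_handlers(CANDIDATES))
```

This only answers "is this name registered?"; what each handler does still has to come from the documentation.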
 
Martin v. Löwis
      01-29-2007
Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?


For reverse processing, you need to parse it with an
SGML/XML parser.

> (2) Are XML entities guaranteed to be the same as HTML entities?


Please distinguish between the terms "entity", "entity
reference", and "character reference".

An (external parsed) entity is a named piece of text, such
as the copyright character. An entity reference is a reference
to such a thing, e.g. &copy;

A character reference is a reference to a character, not to
an entity. xmlcharrefreplace generates character references,
not entity references (let alone generating entities). The
character references in XML and HTML both refer to characters by
Unicode ordinal, so they are the same.
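
To make the distinction concrete, here is a small Python 3 sketch that builds all three spellings for one character. It is an illustration only; references_for is a made-up helper, and html.entities is the Python 3 name of htmlentitydefs.

```python
from html.entities import codepoint2name


def references_for(ch):
    """Return the entity reference and character references for ch."""
    cp = ord(ch)
    name = codepoint2name.get(cp)
    return {
        # Entity reference: only exists if the DTD defines a name.
        "entity_reference": "&%s;" % name if name else None,
        # Character references: always available, based on the ordinal.
        "decimal_character_reference": "&#%d;" % cp,
        "hex_character_reference": "&#x%X;" % cp,
    }


print(references_for("\u00a9"))
print(references_for("\u0416"))  # Cyrillic letter: no named entity
```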

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?


Not through Python code. In C code, you can look at the
codec_error_registry field of the interpreter object.

Regards,
Martin
 
Roberto Bonvallet
      02-08-2007
Steven D'Aprano wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."

[...]
> Is there a "batteries included" solution that doesn't involve
> reinventing the wheel?


recode is good for this kind of thing:

$ recode latin1..html -d mytextfile

It seems that there are recode bindings for Python:

$ apt-cache search recode | grep python
python-bibtex - Python interfaces to BibTeX and the GNU Recode library

HTH, cheers.
--
Roberto Bonvallet
 