Velocity Reviews - Computer Hardware Reviews

converting html escape sequences to unicode characters

 
 
harrelson
      12-10-2004
I have a list of about 2500 html escape sequences (decimal) that I need
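to convert to utf-8. Stuff like:

&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;
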
Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley

 
Kent Johnson
      12-10-2004
harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:
>
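> &#48708;
> &#54665;
> &#44592;
> &#47196;
> &#48372;
> &#45244;
> &#44144;
> &#50640;
> &#50836;
> &#45236;
> &#47732;
> &#44552;
> &#51060;
> &#50620;
> &#47560;
> &#51648;
> &#51104;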
>
> Anyone know what the decimal is representing? It doesn't seem to
> equate to a unicode codepoint...


In well-formed HTML (!) these should be the decimal values of Unicode code points. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
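
The snippet above is Python 2 (print statement, unichr()). For anyone on Python 3, a rough equivalent might look like the sketch below. It also encodes each character to UTF-8, since that was the original goal. The helper name describe() is made up for illustration and only three of the values are shown.

```python
import unicodedata

# A few of the decimal values from the thread.
nums = [48708, 54665, 44592]

def describe(num):
    """Return (character, Unicode name, UTF-8 bytes) for a decimal code point."""
    ch = chr(num)  # Python 3 spelling of Python 2's unichr()
    return ch, unicodedata.name(ch, 'Unknown'), ch.encode('utf-8')

for num in nums:
    ch, name, utf8 = describe(num)
    print(num, ch, name, utf8)
```

Each Hangul syllable in this range takes three bytes in UTF-8.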

Kent
 
Craig Ringer
      12-10-2004
On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:


I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비

(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.

>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
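
If the escapes are embedded in longer text rather than held one per list item, a regular expression plus chr() is probably a simpler route than building a "\u" escape string. The following is only a sketch in Python 3, and the names DECIMAL_REF and decode_decimal_refs are invented for the example. It handles decimal references only, and chr() will raise ValueError for a number outside the Unicode range.

```python
import html
import re

# Matches decimal numeric character references such as &#48708;
DECIMAL_REF = re.compile(r'&#(\d+);')

def decode_decimal_refs(text):
    """Replace each decimal reference in text with the character it names."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

sample = '&#48708;&#54665;&#44592;'
decoded = decode_decimal_refs(sample)
print(decoded)                  # the decoded str
print(decoded.encode('utf-8'))  # the same text as UTF-8 bytes

# The standard library's html.unescape() (Python 3.4+) also handles
# decimal references, along with hex and named ones.
print(html.unescape(sample) == decoded)
```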

--
Craig Ringer

 
Craig Ringer
      12-10-2004
On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
> On Fri, 2004-12-10 at 08:36, harrelson wrote:
> > I have a list of about 2500 html escape sequences (decimal) that I need
> > to convert to utf-8. Stuff like:

>
> I'm pretty sure this somewhat horrifying code does it, but is probably
> an example of what not to do:


It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...

--
Craig Ringer

 