Velocity Reviews - Computer Hardware Reviews

Unicode characters

 
 
Paul Johnston
Guest
Posts: n/a
 
      09-04-2006
Hi,
I have a string which I convert into a list and then read through,
printing each character's glyph and numeric representation:

#-*- coding: utf-8 -*-

thestring = "abcd"
thelist = list(thestring)

for c in thelist:
    print c,
    print ord(c)

This works fine for Latin characters, but when I put in a non-ASCII
character, a two-byte character comes out as two separate characters.
For example, an Arabic alef returns:

* 216
* 167

(The first asterisk is the empty set symbol, the second a double "s".)

Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
sequential listings, i.e.
216 167
216 168
216 169
So it is reading the correct details.


Is there any way to get the c in the for loop to recognise that it is
reading a multibyte character?
I have followed the info in PEP 0263 and am using Python 2.4.3 Build
12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2.

Cheers Paul
 
 
limodou
Guest
Posts: n/a
 
      09-04-2006
If the string is not a unicode object, it is stored as encoded bytes, so
iterating over it only gives you the individual bytes of each
character's encoding. You can convert it to unicode; then, if a
character's value is less than 128, it is ASCII, otherwise it may be a
multibyte character. For example:

a = 'string'
# Replace the placeholder with the actual encoding of the bytes, e.g. 'utf-8'
b = unicode(a, encoding_according_your_situation)
for i in b:
    if ord(i) < 128:
        print ord(i), 'ascii'
    else:
        print ord(i), 'multibytes'
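
The same check can be sketched for Python 3, where str plays the role of the unicode object discussed here and bytes plays the role of the byte string. This is only an illustration: the function name classify and the choice of UTF-8 as the default encoding are assumptions, not something from the thread.

```python
def classify(raw, encoding="utf-8"):
    """Decode raw bytes and label each character as ASCII or non-ASCII.

    Hypothetical helper; `encoding` must match how `raw` was produced.
    """
    text = raw.decode(encoding)
    result = []
    for ch in text:
        code_point = ord(ch)
        # ASCII covers code points 0..127 inclusive
        label = "ascii" if code_point < 128 else "non-ascii"
        # Number of bytes this one character occupies in the encoding
        width = len(ch.encode(encoding))
        result.append((code_point, label, width))
    return result


# 'a' followed by Arabic alef (U+0627), encoded as UTF-8
sample = "a\u0627".encode("utf-8")
for code_point, label, width in classify(sample):
    print(code_point, label, width)
# prints:
# 97 ascii 1
# 1575 non-ascii 2
```

Decoding first means the loop sees one item per character, and the byte width is computed only as extra information.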

--
I like python!
My Blog: http://www.donews.net/limodou
UliPad Site: http://wiki.woodpecker.org.cn/moin/UliPad
UliPad Maillist: http://groups.google.com/group/ulipad
 
 
Diez B. Roggisch
Guest
Posts: n/a
 
      09-04-2006
Use unicode objects instead of byte strings. The string literal in your
code is _not_ affected by the coding: header whatsoever.

The header applies only to

u"some text"

literals, which create unicode objects.

Normal string literals are just bytes: because your editor's encoding is
set properly, a multibyte character you type is stored as its
individual bytes.

In a nutshell: try your code using u"abcd".
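
To make the difference concrete, here is a small sketch written for Python 3 rather than the Python 2.4 used in this thread (in Python 3, bytes corresponds to the byte string and str to the unicode object). The variable names are illustrative only. It iterates over the same three Arabic letters once as UTF-8 bytes and once as text:

```python
# alef, beh, teh marbuta as Unicode escapes, so the file encoding does not matter
text = "\u0627\u0628\u0629"

# The byte-string view: what iterating over the encoded data yields
raw = text.encode("utf-8")
byte_values = list(raw)

# The unicode view: one item per character
code_points = [ord(c) for c in text]

print(byte_values)  # [216, 167, 216, 168, 216, 169]
print(code_points)  # [1575, 1576, 1577]
```

The byte values match the pairs reported in the question, which suggests the loop there was walking over UTF-8 bytes rather than characters.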
Diez
 