Velocity Reviews - Computer Hardware Reviews

unicode wrap unicode object?

 
 
ygao
      04-07-2006
>>> import sys
>>> sys.setdefaultencoding("utf-8")
>>> s='\xe9\xab\x98' # this is a UTF-8 string
>>> ss=U'\xe9\xab\x98'
>>> s
'\xe9\xab\x98'
>>> ss
u'\xe9\xab\x98'
>>>

How do I get ss from s?
Is there a way to do this?
Thanks!

 
Fredrik Lundh
      04-08-2006
"ygao" wrote:

> >>> import sys
> >>> sys.setdefaultencoding("utf-8")


hmm. what kind of bootleg python is that ?

>>> import sys
>>> sys.setdefaultencoding("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'setdefaultencoding'

(you're not supposed to change the default encoding. don't
do that; it'll only cause problems in the long run).
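
Editor's sketch of the explicit alternative, written for Python 3 (an assumption; the thread itself uses Python 2). Rather than changing the interpreter default, each conversion names its codec at the point where bytes enter or leave the program. The helper names `to_text` and `to_bytes` are hypothetical and only serve as illustration.

```python
import sys

# Hypothetical helpers: convert at the program's boundaries with an explicit
# codec instead of relying on (or changing) the interpreter default.
def to_text(raw, encoding="utf-8"):
    """Decode bytes read from a file, socket, or terminal into text."""
    return raw.decode(encoding)

def to_bytes(text, encoding="utf-8"):
    """Encode text just before writing it out."""
    return text.encode(encoding)

raw = b"\xe9\xab\x98"          # UTF-8 bytes for U+9AD8
text = to_text(raw)            # one character of text
roundtrip = to_bytes(text)     # the same three bytes again

# Python 3 has no sys.setdefaultencoding attribute at all.
has_setter = hasattr(sys, "setdefaultencoding")
print(repr(text), roundtrip, has_setter)
```

Naming the codec at every boundary keeps the behaviour the same on any machine, which is the long-run problem a changed default tends to hide.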

> >>> s='\xe9\xab\x98' # this is a UTF-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> s
> '\xe9\xab\x98'
> >>> ss
> u'\xe9\xab\x98'
> >>>

> how do I get ss from s?
> Can there be a way do this?


you have UTF-8 *bytes* in a Unicode text string? sounds like
someone's made a mistake earlier on...

anyway, iso-8859-1 is, in practice, a null transform that simply
converts unicode characters U+0000 to U+00FF to the bytes with the same values:

>>> s = ss.encode("iso-8859-1")
>>> s
'\xe9\xab\x98'
>>> s.decode("utf-8")
u'\u9ad8'
>>> import unicodedata
>>> unicodedata.name(s.decode("utf-8"))
'CJK UNIFIED IDEOGRAPH-9AD8'

but it's probably better to fix the code that puts UTF-8 data in your
Unicode strings (look for bogus iso-8859-1 conversions)
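
Editor's sketch of the same repair in Python 3 (an assumption; the session above is Python 2). The helper name `repair_utf8_in_latin1` is hypothetical. It only works when every character in the input is at or below U+00FF, which is exactly the case when UTF-8 bytes were wrongly decoded as iso-8859-1.

```python
import unicodedata

def repair_utf8_in_latin1(text):
    """Undo a wrong iso-8859-1 decode of UTF-8 data (hypothetical helper).

    Raises UnicodeEncodeError if text has characters above U+00FF, and
    UnicodeDecodeError if the recovered bytes are not valid UTF-8.
    """
    raw = text.encode("iso-8859-1")   # code points 0-255 become the same byte values
    return raw.decode("utf-8")        # reinterpret those bytes as UTF-8

ss = "\xe9\xab\x98"                   # three characters that are really UTF-8 bytes
fixed = repair_utf8_in_latin1(ss)
print(len(ss), len(fixed), unicodedata.name(fixed))
```

This is a workaround for data that is already damaged; the advice above still applies, and the cleaner fix is to find the place where the wrong decode happens.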

</F>



 
ygao
      04-08-2006
Sorry for my poor English.
I got a solution from someone else.
I must use UTF-8 for Chinese.


>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> s='\xe9\xab\x98' # this is a UTF-8 string
>>> ss=U'\xe9\xab\x98'
>>> ss1=ss.encode('unicode_escape').decode('string_escape')
>>> s1=s.decode('unicode_escape')
>>> s1==ss
True
>>> ss1==s
True
>>>


 
 
Fredrik Lundh
      04-08-2006
"ygao" wrote:

> I must use utf-8 for chinese.


yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
are designed to hold things that you've already decoded (that is, your
chinese text), not the raw UTF-8 bytes.

if you store the UTF-8 in an ordinary 8-bit string instead, you can use
the unicode constructor to convert things properly:

b = "... some utf-8 data ..."

# turn it into a unicode string
u = unicode(b, "utf-8")

# ... do something with it ...

# turn it back into a utf-8 string
s = u.encode("utf-8")

# or use some other encoding
s = u.encode("big5")

e.g.

>>> b = '\xe9\xab\x98'
>>> u = unicode(b, "utf-8")
>>> u.encode("utf-8")
'\xe9\xab\x98'
>>> u.encode("big5")
'\xb0\xaa'
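
Editor's sketch of the same round trip in Python 3 (an assumption; the session above is Python 2). There the 8-bit string is a `bytes` object and the built-in `str` plays the role of the Python 2 `unicode` constructor. The variable names `as_utf8` and `as_big5` are introduced here for illustration.

```python
b = b"\xe9\xab\x98"            # UTF-8 data as a bytes object

# turn it into text (Python 3 str plays the role of Python 2 unicode)
u = str(b, "utf-8")            # same as b.decode("utf-8")

# turn it back into UTF-8, or into some other encoding
as_utf8 = u.encode("utf-8")
as_big5 = u.encode("big5")

print(ascii(u), as_utf8, as_big5)
```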

</F>



 
ygao
      04-08-2006
thanks for your advice.

 
Martin v. Löwis
      04-08-2006
ygao wrote:
> I must use utf-8 for chinese.


Sure. But please don't do that:

> >>> import sys
> >>> reload(sys)
> >>> sys.setdefaultencoding("utf-8")


As Fredrik says, you should really avoid changing the
default encoding.

> >>> s='\xe9\xab\x98' # this is a UTF-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> ss1=ss.encode('unicode_escape').decode('string_escape')
> >>> s1=s.decode('unicode_escape')
> >>> s1==ss
> True
> >>> ss1==s
> True


OK. But how about this:

py> s='\xe9\xab\x98'
py> ss=u'\u9ad8'
py> s1=s.decode('utf-8')
py> s1==ss
True

Here, ss is a single character, which uses 3 bytes in UTF-8.
In your example, ss has three characters (U+00E9, U+00AB and U+0098),
which are not Chinese but come from the Latin-1 range.
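
Editor's sketch of one way to see the difference, in Python 3 (an assumption; the sessions above are Python 2). The helper name `describe` and the `"<no name>"` placeholder are hypothetical. The placeholder is needed because control characters such as U+0098 have no name in `unicodedata`.

```python
import unicodedata

def describe(text):
    """Return (code point, name) pairs; unnamed characters get a placeholder."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<no name>")) for ch in text]

wrong = "\xe9\xab\x98"                        # ygao's ss: three characters
right = b"\xe9\xab\x98".decode("utf-8")       # Martin's ss: one character

print(len(wrong), describe(wrong))
print(len(right), describe(right))
```

Checking the length and the character names like this is a quick test for whether a string holds decoded text or undecoded UTF-8 bytes.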

Regards,
Martin
 