compare unicode to non-unicode strings

 
 
Asterix
      08-31-2008
how could I test that those 2 strings are the same:

'séd' (repr is 's\\xc3\\xa9d')

u'séd' (repr is u's\\xe9d')
 
John Machin
      08-31-2008
On Aug 31, 11:04 pm, Asterix wrote:
> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')

No, the repr is 's\xc3\xa9d'.

> u'séd' (repr is u's\\xe9d')

No, the repr is u's\xe9d'.

To answer your question:



 
John Machin
      08-31-2008
On Aug 31, 11:04 pm, Asterix wrote:
> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')
>
> u'séd' (repr is u's\\xe9d')


[note: your reprs are wrong; change the \\ to \]

You need to decode the non-unicode string and compare the result with
the unicode string. You need to know the encoding used for the non-
unicode string. In the example that you gave, it's about 99.99% likely
that it's UTF-8.

>>> 's\xc3\xa9d'.decode('utf8')
u's\xe9d'
>>> u's\xe9d'.encode('utf8')
's\xc3\xa9d'
>>>
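
For readers on Python 3, where the non-unicode string is a bytes object and every str is Unicode, the same decode-then-compare step might look like the sketch below. The helper name same_text and its UTF-8 default are assumptions for illustration, not something from this thread.

```python
def same_text(raw, text, encoding="utf-8"):
    """Return True if `raw` (bytes), decoded with `encoding`, equals `text` (str)."""
    # Decode first: in Python 3 a bytes object never compares equal to a str.
    return raw.decode(encoding) == text


# The two strings from the question: UTF-8 bytes and the Unicode string.
print(same_text(b"s\xc3\xa9d", "s\xe9d"))
```

The encoding still has to be known or guessed, exactly as in the Python 2 session above.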


HTH,
John
 
Fredrik Lundh
      08-31-2008
Asterix wrote:

> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')
>
> u'séd' (repr is u's\\xe9d')


determine what encoding the former string is using (looks like UTF-8),
and convert it to Unicode before doing the comparison.

>>> b = 's\xc3\xa9d'
>>> u = u's\xe9d'
>>> b
's\xc3\xa9d'
>>> u
u's\xe9d'
>>> unicode(b, "utf-8")
u's\xe9d'
>>> unicode(b, "utf-8") == u
True
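
When the encoding is only a guess, a strict decode gives a cheap sanity check: bytes that are not valid UTF-8 raise UnicodeDecodeError, whereas Latin-1 accepts any byte sequence and can quietly produce the wrong text. A minimal Python 3 sketch, with decodes_as as a hypothetical helper name:

```python
def decodes_as(raw, encoding):
    """Return True if `raw` is a valid byte sequence in `encoding`."""
    try:
        raw.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = b"s\xc3\xa9d"    # 'séd' encoded as UTF-8
latin1_bytes = b"s\xe9d"      # 'séd' encoded as Latin-1

print(decodes_as(utf8_bytes, "utf-8"))    # valid UTF-8
print(decodes_as(latin1_bytes, "utf-8"))  # \xe9 opens a multi-byte sequence that 'd' cannot continue
# Latin-1 never raises, but decoding UTF-8 bytes with it does not give 'séd':
print(utf8_bytes.decode("latin-1") == "s\xe9d")
```

A successful decode does not prove the guess is right; it only rules out encodings the bytes cannot be.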

</F>

 
Méta-MCI (MVP)
      08-31-2008
By Toutatis!
If you had asked Ordralphabétix, or posted to one of the French newsgroups
devoted to Python, instead of redoing "La grande Traversée", the answer
might have come faster.

@-salutations
--
Michel Claveau


 
Matt Nordhoff
      08-31-2008
Asterix wrote:
> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')
>
> u'séd' (repr is u's\\xe9d')


You may also want to look at unicodedata.normalize(). For example, é can
be represented multiple ways:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'é')
u'\xe9'
>>> unicodedata.normalize('NFD', u'é')
u'e\u0301'
>>> u'\xe9' == u'e\u0301'
False

The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
WITH ACUTE). The second form is "decomposed", being made up of U+0065
(LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

Even though they represent the same thing to a human, they don't compare
as equal. But if you normalize them to the same form, they will.
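
As a rough sketch of that last point (Python 3 syntax; the helper name equal_normalized and the choice of NFC as the default form are assumptions, any single form applied to both sides would do):

```python
import unicodedata


def equal_normalized(a, b, form="NFC"):
    """Compare two Unicode strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\xe9"        # U+00E9
decomposed = "e\u0301"   # U+0065 followed by U+0301

print(composed == decomposed)                  # False
print(equal_normalized(composed, decomposed))  # True
```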

For more information, look at the unicodedata module's documentation:
<http://docs.python.org/lib/module-unicodedata.html>