
unicodedata . normalize (NFD - NFC) inconsistency

 
 
Christos TZOTZIOY Georgiou
 
      11-08-2004
I found at least one case where decombining and recombining a unicode
character does not result in the same character (see at end).

I have no extensive knowledge about Unicode, yet I believe that this
must be a problem of the Unicode 3.2 specification and not Python's.
However, I haven't found out how decomp_data (in unicodedata_db.h) is
built, nor have I found much information about the specifics of
Unicode 3.2. I thought I would post here so that someone more
knowledgeable could take a look.

If we find out that it's a problem with Python, I'll open a bug report
(and volunteer work).

*** Example ***

>>> import unicodedata as ud
>>> def report(utext):
...     for uchar in utext:
...         print ord(uchar), ud.name(uchar)
...
>>> u1 = u'\N{greek small letter alpha with oxia}'
>>> report(u1)
8049 GREEK SMALL LETTER ALPHA WITH OXIA
>>> u2 = ud.normalize('NFD', u1)
>>> report(u2)
945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
>>> u3 = ud.normalize('NFC', u2)
>>> report(u3)
940 GREEK SMALL LETTER ALPHA WITH TONOS
>>>


*** End of Example ***

I can understand the confusion: since, as far as I have found, there is
no COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
decomposition has to use the 'oxeia' (acute) accent...
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
 
 
 
 
 
Martin v. Löwis
 
      11-08-2004
Christos TZOTZIOY Georgiou wrote:
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.


Without checking the details: quite possibly. Could this be
an instance of python.org/sf/1054943?

Regards,
Martin
 
 
 
 
 
Brion Vibber
 
      11-09-2004
Christos TZOTZIOY Georgiou wrote:
> I found at least one case where decombining and recombining a unicode
> character does not result in the same character (see at end).
>
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.


I've been spending some time lately writing a normalizer (in PHP of all
things -- yeesh!), and yes Unicode is a scary world. Although it may
seem counterintuitive, it is in fact perfectly legitimate for a
character not to be its own canonical composition.

> >>> u1=u'\N{greek small letter alpha with oxia}'
> >>> report(u1)
> 8049 GREEK SMALL LETTER ALPHA WITH OXIA


This character has a "singleton decomposition": it decomposes into the
single character GREEK SMALL LETTER ALPHA WITH TONOS, which in turn
decomposes into GREEK SMALL LETTER ALPHA and a COMBINING ACUTE ACCENT.
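The two-step decomposition described above can be inspected directly. This is a minimal sketch in Python 3 (the thread itself uses Python 2), built on unicodedata.decomposition(); the helper name canonical_mapping is made up for illustration and is not part of any library.

```python
import unicodedata as ud

def canonical_mapping(ch):
    """Return the one-step canonical decomposition of ch as a string, or None.

    Hypothetical helper for illustration; ch must be a single character.
    """
    raw = ud.decomposition(ch)
    # An empty string means "no decomposition"; a leading "<tag>" marks a
    # compatibility (not canonical) mapping, which is skipped here.
    if not raw or raw.startswith("<"):
        return None
    return "".join(chr(int(cp, 16)) for cp in raw.split())

oxia = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"
step1 = canonical_mapping(oxia)    # singleton: one character maps to one character
step2 = canonical_mapping(step1)   # base letter followed by a combining accent

for label, text in (("start", oxia), ("step 1", step1), ("step 2", step2)):
    print(label, [ud.name(c) for c in text])
```

The first step yields exactly one character, which is what makes it a singleton; only the second step introduces a combining mark.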

It is by definition not normalized, so when you normalize it to form C
it will turn into GREEK SMALL LETTER ALPHA WITH TONOS; there is no way
to get "back" to the original character in a normalized string. For some
more info see:
http://www.unicode.org/unicode/repor...ion_List_Table
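To see which other characters behave this way, one can scan a block for code points that NFC does not map to themselves. This is a rough sketch in Python 3; the function name unstable_under_nfc and the choice of the Greek Extended block (U+1F00 to U+1FFF) are assumptions for illustration, and the exact list depends on the Unicode version shipped with the interpreter.

```python
import unicodedata as ud

def unstable_under_nfc(start, end):
    """Return the code points in [start, end] whose NFC form differs from themselves."""
    return [cp for cp in range(start, end + 1)
            if ud.normalize("NFC", chr(cp)) != chr(cp)]

# Greek Extended block, where ALPHA WITH OXIA lives.
greek_extended = unstable_under_nfc(0x1F00, 0x1FFF)

for cp in greek_extended:
    nfc = ud.normalize("NFC", chr(cp))
    names = " + ".join(ud.name(c, "?") for c in nfc)
    print(f"U+{cp:04X} {ud.name(chr(cp), '?')} -> {names}")
```

Characters such as ALPHA WITH VARIA do not show up in the list, because their decomposition recomposes to the same code point.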

> >>> u2=ud.normalize('NFD', u1)
> >>> report(u2)
> 945 GREEK SMALL LETTER ALPHA
> 769 COMBINING ACUTE ACCENT
>
> >>> u3=ud.normalize('NFC', u2)
> >>> report(u3)
> 940 GREEK SMALL LETTER ALPHA WITH TONOS


You should get this same result directly for ud.normalize('NFC', u1).
Converting directly to NFC should always give the same result as
converting to NFD and then NFC. Either will give you back the string you
started with if and only if it's already normalized to form C.
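Both claims are easy to check for the character in question. A small sketch in Python 3 follows; survives_nfc is a hypothetical helper name, not a library function.

```python
import unicodedata as ud

u1 = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"

# Normalizing straight to NFC, and going through NFD first.
direct = ud.normalize("NFC", u1)
via_nfd = ud.normalize("NFC", ud.normalize("NFD", u1))

def survives_nfc(text):
    """True if text is unchanged by NFC, i.e. it is already in form C."""
    return ud.normalize("NFC", text) == text

print(direct == via_nfd)     # both routes agree
print(survives_nfc(u1))      # the OXIA character is not in form C
print(survives_nfc(direct))  # the TONOS character is
```

On Python 3.8 and later, unicodedata.is_normalized("NFC", text) answers the same question as survives_nfc without building the normalized copy.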

-- brion vibber (brion @ pobox.com)
 
 
Christos TZOTZIOY Georgiou
 
      11-10-2004
On Mon, 08 Nov 2004 17:40:47 -0800, rumours say that Brion Vibber
might have written:

>I've been spending some time lately writing a normalizer (in PHP of all
>things -- yeesh!), and yes Unicode is a scary world.


....

>http://www.unicode.org/unicode/repor...ion_List_Table


Thanks for the pointer; it was very informative and explains why the
observed behaviour is well within what Unicode defines. Thanks also go
to Martin for taking a look at this.
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
 