convert Unicode to lower/uppercase?

 
 
jallan
      09-25-2003
Martin v. Löwis wrote:
> jallan writes:
>
> > A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
> > but doesn't include 1:many mappings such as the one needed for
> > uppercasing ß. Since many parsers now expect this file to have at most
> > single characters in the case mapping fields, an additional file
> > (SpecialCasing.txt) was added to provide the 1:many mappings. For more
> > information, see UTR #21- Case Mappings [MD]
> > >>

> >
> > Python specifications make an implied claim of full support for
> > Unicode and an implied claim that the function upper() uppercases a
> > string properly.

>
> This is a contradiction: SpecialCasing contains 1:n mappings, whereas
> .upper() can only return a single result. So how do you think
> SpecialCasing should be considered in the implementation of .upper()?


I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

One should fix the contradiction either by changing the behavior of
.upper() so that it properly cases all strings, or by documenting
clearly that .upper() does not handle particular kinds of casing. Of
course, users often don't read the documentation.
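
To make the 1:many case concrete, here is a minimal sketch. It assumes a current Python 3 interpreter, where str.upper() applies the full case mappings from SpecialCasing.txt; the Python 2 unicode type under discussion in this thread did not. The helper name length_change is made up for illustration.

```python
# Sketch only: assumes Python 3, where str.upper() uses Unicode
# full case mappings, so one character may become several.

def length_change(text):
    """Return (original length, length after uppercasing)."""
    return len(text), len(text.upper())

# U+00DF LATIN SMALL LETTER SHARP S and U+FB01 LATIN SMALL LIGATURE FI
# both have 1:many uppercase mappings in SpecialCasing.txt.
for sample in ["\u00df", "\ufb01", "stra\u00dfe"]:
    print(repr(sample), "->", repr(sample.upper()), length_change(sample))
```

The result is still a single string; it is just longer than the input.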

> > Users should not have to know such details. They may wish to know
> > where a particular function does not do what might be expected of it.

>
> Things are more difficult than they appear to be.


Yes.

Again and again one thinks one has a solution for a problem and then
exceptions turn up.

Again and again one finds things that one's code doesn't handle, often
from failing to analyze the problem fully in the initial stages and from
adopting algorithms that prove insufficient to handle the data found in
reality.

Jim Allan








 
 
 
 
Neil Hodgson
      09-25-2003
jallan:
> Martin v. Löwis wrote
> > ...
> > This is a contradiction: SpecialCasing contains 1:n mappings, whereas
> > .upper() can only return a single result. So how do you think
> > SpecialCasing should be considered in the implementation of .upper()?

>
> I am not aware that it is philosophically a *necessary* feature of
> .upper() that a single character not be replaced by a string of two or
> more characters.


That is not the issue. The issue is that .upper would have to return a
list or map of results (for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}), which would be
difficult for the caller to make use of without performing some additional
work, finding the correct result for its locale. It is simpler for the
caller to provide a locale argument in the .upper call or in its context.

Neil


 
 
 
 
 
Neil Hodgson
      09-25-2003
Me:

> for an illustrative but incorrect example
> "ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),


For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').
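
A rough sketch of the difference, assuming Python 3: the built-in upper() takes no locale, so it always gives the English-style result, and the Turkish form needs an extra step. The helper upper_turkic below is hypothetical, not a standard library function.

```python
# Sketch only: upper_turkic is a made-up helper, not a stdlib API.

def upper_turkic(text):
    """Uppercase using the Turkish/Azeri rule i -> U+0130 (dotted capital I)."""
    # Map plain 'i' first; the default upper() already maps
    # dotless U+0131 to plain 'I' and leaves U+0130 unchanged.
    return text.replace("i", "\u0130").upper()

word = "indigo"
print(ascii(word.upper()))         # locale-independent default
print(ascii(upper_turkic(word)))   # Turkish/Azeri variant
```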

Neil


 
 
jallan
      09-26-2003
"Neil Hodgson" wrote:
> Me:
>
> > for an illustrative but incorrect example
> > "ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),

>
> For a real example from the Microsoft web site, uppercasing "indigo"
> (u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
> (u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
> with dots above the 'I's for Turkish:
> (u'\u0130\u004e\u0044\u0130\u0047\u004f').
>


The file http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
purportedly contains *all* casings for all scripts for all languages
where the casings are not one-to-one or are otherwise not
straightforward.

The *only* locale oddities there are for Lithuanian and the two
languages Turkish and Azeri and concern only dot/no-dot variants of
the letters _i_, _I_, _j_, _J_ and no others.

There are *no* other locale-based oddities. The mess is thankfully
*very* limited in scope.

In my opinion, if the full Unicode casing specification is to be
followed, the most useful solution would be a parameter allowing the
user to choose among (1) normal Latin casing, (2) Turkish/Azeri or (3)
Lithuanian as the casing model for treatment of these letters.

The default for the parameter would either be based on current locale
or be normal Latin casing. I think the latter far better as it is
dangerous to have functions in a language differ from machine to
machine according to the current locale.
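
A minimal sketch of what such a parameter might look like, assuming Python 3. The function name upper_with_model and the model names are invented for illustration, and the Lithuanian branch is a simplification: it only drops a COMBINING DOT ABOVE that directly follows i, j or i-with-ogonek, whereas SpecialCasing.txt states the condition more generally.

```python
import re

# Hypothetical API: neither the function nor the model names exist
# in Python itself. The default is locale-independent Latin casing.

# Simplified Lithuanian rule: U+0307 directly after i, j or U+012F.
_LT_DOT_ABOVE = re.compile("([ij\u012f])\u0307")

def upper_with_model(text, model="latin"):
    """Uppercase text under an explicitly chosen casing model."""
    if model == "latin":
        return text.upper()
    if model == "turkic":
        # Turkish/Azeri: i uppercases to dotted capital I (U+0130).
        return text.replace("i", "\u0130").upper()
    if model == "lithuanian":
        # Remove the explicit dot before uppercasing.
        return _LT_DOT_ABOVE.sub(r"\1", text).upper()
    raise ValueError("unknown casing model: %r" % (model,))

print(ascii(upper_with_model("indigo")))
print(ascii(upper_with_model("indigo", model="turkic")))
print(ascii(upper_with_model("i\u0307\u0301", model="lithuanian")))
```

Because the model is passed explicitly and defaults to Latin, the result does not change from machine to machine with the current locale.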

Also, in case someone brings it up, it was formerly standard to
generally omit diacritics on capital letters in Portuguese and in
French (in France but not in Quebec!)

This is no longer the norm for either language. See
http://www.academie-francaise.fr/lan...l#accentuation
and http://www.press.uchicago.edu/Misc/C...haracters.html.

I have seen academic style sheets with a silly rule that diacritics
should be placed on capital letters as on lowercase letters except for
the word "A". See http://www.alphaacademic.co.uk/fcs.htm and
http://www.sagepub.com/journalManusc...pid=9669&sc=1:

<< We use accents on capital letters, but capital A does not take a
grave accent. >>

It would not hurt to make a casing table customizable for such unusual
styles. But that is beyond Unicode's specifications.

A programmer who wishes odd customization beyond the norms of a
language and Unicode specifications can do it through transformations
outside of normal casing.

Jim Allan
 
 
 
 