Velocity Reviews - Computer Hardware Reviews

convert Unicode to lower/uppercase?

 
 
Hallvard B Furuseth
 
      09-19-2003
Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

What I actually need to do is to compare a number of strings in a
case-insensitive manner, so I assume it's simplest to convert to
lower/upper first.

Possibly all strings will be from the latin-1 character set, so I could
convert to 8-bit latin-1, map to lowercase, and convert back, but that
seems rather cumbersome.

--
Hallvard
 
 
 
 
 
Peter Otten
 
      09-19-2003
nospam wrote:

> Has someone got a Python routine or module which converts Unicode
> strings to lowercase (or uppercase)?


Toiled and came up with:

>>> print u"abc".upper()
ABC
>>> u"ABCÄÖÜ".lower()
u'abc\xe4\xf6\xfc'

Peter
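
For readers on a current interpreter, the comparison the original question asks about can be sketched in Python 3, where every str is already Unicode. This is a minimal sketch under that assumption, not what the posters ran in 2003, and `equal_ignoring_case` is a hypothetical helper name. It relies on `str.casefold()`, which is meant for caseless matching, instead of converting both sides with lower() or upper().

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings case-insensitively using Unicode case folding."""
    # casefold() is more aggressive than lower(): it also folds
    # characters such as U+00DF (sharp s) to "ss".
    return a.casefold() == b.casefold()


print("abcäöü".upper())
print(equal_ignoring_case("ABCÄÖÜ", "abcäöü"))
print(equal_ignoring_case("Straße", "STRASSE"))
```

Folding both operands, rather than uppercasing one of them, keeps the comparison symmetric and avoids deciding which case is the canonical one.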
 
 
 
 
 
Hallvard B Furuseth
 
      09-19-2003
Thanks!

--
Hallvard
 
 
jallan
 
      09-21-2003
Peter Otten wrote:
> nospam wrote:
>
> > Has someone got a Python routine or module which converts Unicode
> > strings to lowercase (or uppercase)?
>
> Toiled and came up with:
>
> >>> print u"abc".upper()
> ABC
>
> >>> u"ABCÄÖÜ".lower()
> u'abc\xe4\xf6\xfc'
>
> Peter


But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is, the single
character "ß" should uppercase to two characters.

Jim Allan
 
 
Martin v. Löwis
 
      09-21-2003
jallan wrote:

> But that really doesn't work properly. According to Unicode specs and
> German usage the uppercase of "ß" is actually "SS", that is the single
> character "ß" should uppercase to two characters.


Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping, nor a lowercase mapping.

Also, in German, the uppercase mapping of ß is a matter of ongoing debate.
For example, the Duden from 1919 says (translated from German):

| For ß, SZ is used in capital lettering [...]. The use of _two_
| letters for _one_ sound is only a stopgap, which must end as soon
| as a suitable printing letter for the capital ß has been created.

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung (the official orthography).

Regards,
Martin

 
 
Asun Friere

      09-22-2003
"Martin v. Löwis" wrote:
> The usage of SZ has only been eliminated in the recent change of
> the amtliche Rechtschreibung.
>


And replaced with what? I.e., is there now a single capital for SZ?
 
 
Gerhard Häring
 
      09-22-2003
Asun Friere wrote:
> "Martin v. Löwis" wrote:
>> The usage of SZ has only been eliminated in the recent change of
>> the amtliche Rechtschreibung.
>
> And replaced with what? ie. is there now a single capital for SZ?


ß (sz) has not been completely eliminated. After *short* vowels it has
been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
vowels, it is still used (Maß, Gruß, ...).

-- Gerhard

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like elimination of
capitalization of nouns.

 
 
Peter Otten

      09-22-2003
"Martin v. Löwis" wrote:

> jallan wrote:
>
>> But that really doesn't work properly. According to Unicode specs and
>> German usage the uppercase of "ß" is actually "SS", that is the single
>> character "ß" should uppercase to two characters.
>
> Can you cite exact chapter and verse of the Unicode specs that say so?
> According to the Unicode database,
>
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> ß has neither an uppercase mapping, nor a lowercase mapping.


It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDAT...ialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

> Also, in German, the uppercase mapping of ß is of ongoing debate.


My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.
For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

Peter
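
The SpecialCasing.txt record quoted in this post can be read mechanically. The following is a minimal Python 3 sketch, assuming only the unconditional four-field layout shown in the quote (code; lower; title; upper; comment). `parse_special_casing` is a hypothetical helper written for illustration, not part of any library, and it ignores records that carry a condition list.

```python
def parse_special_casing(line: str):
    """Parse one unconditional SpecialCasing.txt record.

    Returns (char, lower, title, upper) as Python strings.
    """
    data = line.split("#", 1)[0]                 # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    code, lower, title, upper = fields[:4]

    def decode(seq: str) -> str:
        # Each field is a space-separated list of hexadecimal code points.
        return "".join(chr(int(cp, 16)) for cp in seq.split())

    return decode(code), decode(lower), decode(title), decode(upper)


record = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
char, lower, title, upper = parse_special_casing(record)
print(repr(char), repr(lower), repr(title), repr(upper))
```

The upper field decodes to two code points, which is exactly the one-to-many mapping that a table restricted to 1-1 entries cannot express.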
 
 
jallan
 
      09-22-2003
Peter Otten wrote:
> "Martin v. Löwis" wrote:
>
> > jallan wrote:
> >
> >> But that really doesn't work properly. According to Unicode specs and
> >> German usage the uppercase of "ß" is actually "SS", that is the single
> >> character "ß" should uppercase to two characters.
> >
> > Can you cite exact chapter and verse of the Unicode specs that say so?
> > According to the Unicode database,
> >
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> >
> > ß has neither an uppercase mapping, nor a lowercase mapping.
>
> It seems like UnicodeData.txt does not give the full story. Quoting from
> http://www.unicode.org/Public/UNIDAT...ialCasing.txt:
>
> [...]
> # (For compatibility, the UnicodeData.txt file only contains case mappings for
> # characters where they are 1-1, and does not have locale-specific mappings.)
> [...]
> # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
> [...]
> # The German es-zed is special--the normal mapping is to SS.
> # Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> [...]
>
> Thus, to comply with the standard, "ß".upper() --> "SS" is required.


Yes.

Also the Unicode main charts in the annotation for 00DF state:

uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative. See
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>

So is Python just another **** legacy implementation?

Jim Allan
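
The difference between the two kinds of mapping discussed here can be illustrated in Python 3. This is a sketch under the assumption of Python 3.3 or later, where str.upper() applies the length-increasing mappings. `simple_upper` is a hypothetical helper that only approximates a UnicodeData-only implementation by refusing any mapping that would change the string length; it is not how any real interpreter is implemented.

```python
def simple_upper(text: str) -> str:
    """Approximate a 1-1-only uppercase conversion.

    Characters whose uppercase form is longer than one character are
    left unchanged, as a length-preserving implementation would do.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


word = "Fuß"
print(simple_upper(word), len(simple_upper(word)))  # length is preserved
print(word.upper(), len(word.upper()))              # full mapping grows the string
```

Any code that assumes len(s) == len(s.upper()), for example when indexing into both strings in parallel, breaks under the full mapping.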
 
 
Martin v. Löwis
 
      09-22-2003
Asun Friere writes:

> > The usage of SZ has only been eliminated in the recent change of
> > the amtliche Rechtschreibung.
>
> And replaced with what? ie. is there now a single capital for SZ?


Unfortunately, I don't have a current Duden here, but I *think* you
now have to write double-S. There is, of course, the old MASSE vs
MASZE issue - I don't know whether this is considered relevant, as
capitalization is rare, anyway, and ambiguities can be clarified from
the context.

Regards,
Martin
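
The MASSE ambiguity mentioned above is easy to reproduce. A small Python 3 sketch, assuming Python 3.3 or later so that str.upper() maps ß to SS; the word list is chosen only for illustration:

```python
words = ["Masse", "Maße"]

# Both spellings collapse to the same uppercase form.
for w in words:
    print(w, "->", w.upper())

# Only one distinct uppercase string remains, so the mapping is lossy.
distinct = {w.upper() for w in words}
print(len(distinct))

# Lowercasing the result cannot restore the sharp s.
print("MASSE".lower())
```

Because the conversion is not reversible, uppercasing is unsuitable wherever the original spelling must be recovered later; keep the original string and use the converted form only for comparison.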

 