Velocity Reviews - Computer Hardware Reviews


Py 3.3, unicode / upper()

 
 
wxjmfauth@gmail.com
 
      12-19-2012
I was using the German word "Straße" (Strasse), the German
translation of "street", to illustrate the catastrophic and
completely wrong-by-design Unicode handling in Py3.3, this
time from a memory point of view (not speed):

>>> import sys
>>> sys.getsizeof('Straße')
43
>>> sys.getsizeof('STRAẞE')
50

instead of a sane (Py3.2)

>>> sys.getsizeof('Straße')
42
>>> sys.getsizeof('STRAẞE')
42


But, this is not the problem.
I was surprised to discover this:

>>> 'Straße'.upper()
'STRASSE'

I really, really do not know what I should think about that.
(It is a complex subject.) And the real question is why?

jmf

 
Thomas Bach
 
      12-19-2012
On Wed, Dec 19, 2012 at 06:23:00AM -0800, (E-Mail Removed) wrote:
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?


Because there is no definition for upper-case 'ß'. 'SS' is used as the
common replacement in this case. I think it's pretty smart!

Regards,
Thomas.
 
Stefan Krah
 
      12-19-2012
(E-Mail Removed) <(E-Mail Removed)> wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?


http://de.wikipedia.org/wiki/Gro%C3%...C3.9Fes_.C3.9F

"The current official rules[6] of the new German orthography know no
capital letter for ß: Every letter exists as a lowercase letter and as
a capital letter (exception: ß). For text set in all capitals, the rules
recommend replacing ß with SS: When writing in capital letters, one
writes SS, for example: Straße -- STRASSE."


According to the new official spelling rules the uppercase ß does not exist.
The recommendation is to use "SS" when writing in all-caps.


As to why: It has always been acceptable to replace ß with "ss" when ß
wasn't part of a character set. In the new spelling rules, ß has been
officially replaced with "ss" in some cases:

http://en.wiktionary.org/wiki/da%C3%9F


The uppercase ß isn't really needed, since ß does not occur at the beginning
of a word. As far as I know, most Germans wouldn't even know that it has
existed at some point or how to write it.



Stefan Krah


 
Chris Angelico
 
      12-19-2012
On Thu, Dec 20, 2012 at 1:23 AM, <(E-Mail Removed)> wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?


Not all strings can be uppercased and lowercased cleanly. Please stop
trotting out the old Box Hill-to-Camberwell arguments[1] yet again.

For comparison, try this string:

'𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝!'.upper()

And while you're at it, check out sys.getsizeof() on that sort of
string, compare your beloved 3.2 on that. Oh, and also check out len()
on it.
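
A minimal sketch of the comparison suggested above, assuming CPython 3.3 or later. The variable names are illustrative and not from the thread; the UTF-16 count only approximates what a 3.2 narrow build would report from len().

```python
import sys

# "Hello, world!" written with MATHEMATICAL BOLD letters (all outside the BMP).
s = ("\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, "
     "\U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!")

code_points = len(s)                           # len() on 3.3+ counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # roughly what a narrow build counts
unchanged = s.upper() == s                     # bold letters have no case mapping

print(code_points, utf16_units, unchanged)
print(sys.getsizeof(s))  # build- and version-dependent, so no value is claimed
```

The ten bold letters each need two UTF-16 code units, which is why the two counts differ by ten.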

[1] Melbourne's current ticketing system is based on zones, and
Camberwell is in zone 1, and Box Hill in zone 2. Detractors of public
transport point out that it costs far more to take the train from Box
Hill to Camberwell than it does to drive a car the same distance. It's
the same contrived example that keeps on getting trotted out time and
time again.

ChrisA
 
Johannes Bauer
 
      12-19-2012
On 19.12.2012 15:23, (E-Mail Removed) wrote:
> I was using the German word "Straße" (Strasse), the German
> translation of "street", to illustrate the catastrophic and
> completely wrong-by-design Unicode handling in Py3.3, this
> time from a memory point of view (not speed):
>
> >>> sys.getsizeof('Straße')
> 43
> >>> sys.getsizeof('STRAẞE')
> 50
>
> instead of a sane (Py3.2)
>
> >>> sys.getsizeof('Straße')
> 42
> >>> sys.getsizeof('STRAẞE')
> 42


How do those arbitrary numbers prove anything at all? Why do you draw
the conclusion that it's broken by design? What do you expect? You're
very vague here. Just to show how ridiculously pointless your numbers
are, your example gives 84 on Python3.2 for any input of yours.

> But, this is not the problem.
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?


Because in the German language the uppercase "ß" is virtually dead.

Regards,
Johannes

--
>> Where EXACTLY did you predict the quake again?
> At least not publicly!

Ah, the latest and to this day most brilliant stunt of our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$(E-Mail Removed)>
 
Johannes Bauer
 
      12-19-2012
On 19.12.2012 16:18, Johannes Bauer wrote:

> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.


....on Python3.2 on MY system is what I meant to say (x86_64 Linux). Sorry.

Also, further reading:

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F
http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Regards,
Johannes

 
Chris Angelico
 
      12-19-2012
On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <(E-Mail Removed)> wrote:
> On 19.12.2012 15:23, (E-Mail Removed) wrote:
>> I was using the German word "Straße" (Strasse), the German
>> translation of "street", to illustrate the catastrophic and
>> completely wrong-by-design Unicode handling in Py3.3, this
>> time from a memory point of view (not speed):
>>
>> >>> sys.getsizeof('Straße')
>> 43
>> >>> sys.getsizeof('STRAẞE')
>> 50
>>
>> instead of a sane (Py3.2)
>>
>> >>> sys.getsizeof('Straße')
>> 42
>> >>> sys.getsizeof('STRAẞE')
>> 42
>
> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.


You may not be familiar with jmf. He's one of our resident trolls, and
he has a bee in his bonnet about PEP 393 strings, on the basis that
they take up more space in memory than a narrow build of Python 3.2
would, for a string with lots of BMP characters and one non-BMP. In
3.2 narrow builds, strings were stored in UTF-16, with *surrogate
pairs* for non-BMP characters. This means that len() counts them
twice, as does string indexing/slicing. That's a major bug, especially
as your Python code will do different things on different platforms -
most Linux builds of 3.2 are "wide" builds, storing characters in four
bytes each.
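
The surrogate-pair double counting described above can be imitated on a current Python by encoding to UTF-16 and counting code units. This is only a sketch: `utf16_code_units` is a hypothetical helper, not something from the thread or from a 3.2 narrow build.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units that encode `text`, as integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


s = "a\U0001d407b"  # one non-BMP character between two ASCII letters
units = utf16_code_units(s)

print(len(s))                   # code points, as counted by 3.3+ and wide builds
print(len(units))               # code units: the non-BMP character counts twice
print([hex(u) for u in units])  # the middle two values are the surrogate pair
```

Indexing into `units` at position 1 or 2 yields half a character, which is the slicing problem mentioned for narrow builds.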

PEP 393 brings wide build semantics to all Pythons, while achieving
memory savings better than a narrow build can (with PEP 393 strings,
any all-ASCII or all-Latin-1 strings will be stored one byte per
character). Every now and then, though, jmf points out *yet again*
that his beloved and buggy narrow build consumes less memory and runs
faster than the oh so terrible 3.3 on some contrived example. It gets
rather tiresome.
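
A rough way to observe the per-character storage width under PEP 393, assuming CPython 3.3 or later (other implementations may store strings differently). `bytes_per_extra_char` is an illustrative helper, not part of any API.

```python
import sys


def bytes_per_extra_char(ch):
    """Estimate storage per character by growing a string made only of `ch`."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


widths = {
    "ASCII 'a'": bytes_per_extra_char("a"),
    "Latin-1 U+00DF": bytes_per_extra_char("\xdf"),
    "BMP U+1E9E": bytes_per_extra_char("\u1e9e"),
    "non-BMP U+1D407": bytes_per_extra_char("\U0001d407"),
}
for label, width in widths.items():
    print(label, width)
```

On CPython this is expected to report 1, 1, 2 and 4 bytes. That fits the 43 versus 50 figures at the top of the thread: 'STRAẞE' contains U+1E9E, which is outside Latin-1, so the whole string is stored at two bytes per character.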

Interestingly, IDLE on my Windows box can't handle the bolded
characters very well...

>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>> print(s)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(s)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407' in position 0: Non-BMP character not supported in Tk

I think this is most likely a case of "yeah, Windows XP just sucks".
But I have no reason or inclination to get myself a newer Windows to
find out if it's any different.

ChrisA
 
Ian Kelly
 
      12-19-2012
On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <(E-Mail Removed)> wrote:
> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP. In
> 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> pairs* for non-BMP characters. This means that len() counts them
> twice, as does string indexing/slicing. That's a major bug, especially
> as your Python code will do different things on different platforms -
> most Linux builds of 3.2 are "wide" builds, storing characters in four
> bytes each.


From what I've been able to discern, his actual complaint about PEP
393 stems from misguided moral concerns. With PEP-393, strings that
can be fully represented in Latin-1 can be stored in half the space
(ignoring fixed overhead) compared to strings containing at least one
non-Latin-1 character. jmf thinks this optimization is unfair to
non-English users and immoral; he wants Latin-1 strings to be treated
exactly like non-Latin-1 strings (I don't think he actually cares
about non-BMP strings at all; if narrow-build Unicode is good enough
for him, then it must be good enough for everybody). Unfortunately
for him, the Latin-1 optimization is rather trivial in the wider
context of PEP-393, and simply removing that part alone clearly
wouldn't be doing anybody any favors. So for him to get what he
wants, the entire PEP has to go.

It's rather like trying to solve the problem of wealth disparity by
forcing everyone to dump their excess wealth into the ocean.
 
Benjamin Peterson
 
      12-19-2012
<wxjmfauth <at> gmail.com> writes:
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?


Because that's what the Unicode spec says to do.
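
A short sketch of the two mappings involved, assuming Python 3.3 or later, where str.upper() applies the full Unicode case mappings. The variable names are illustrative.

```python
import unicodedata

sharp_s = "\u00df"          # ß
capital_sharp_s = "\u1e9e"  # ẞ

print(unicodedata.name(sharp_s))
print(unicodedata.name(capital_sharp_s))

upper_of_small = sharp_s.upper()            # full case mapping: ß becomes SS
lower_of_capital = capital_sharp_s.lower()  # ẞ lowercases to ß

print(upper_of_small, lower_of_capital)
```

The mapping is one-way: ẞ lowercases to ß, but ß uppercases to SS, not to ẞ.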



 
wxjmfauth@gmail.com
 
      12-19-2012
On Wednesday, 19 December 2012 15:52:23 UTC+1, Christian Heimes wrote:
> On 19.12.2012 15:23, (E-Mail Removed) wrote:
> > But, this is not the problem.
> > I was surprised to discover this:
> >
> > >>> 'Straße'.upper()
> > 'STRASSE'
> >
> > I really, really do not know what I should think about that.
> > (It is a complex subject.) And the real question is why?
>
> It's correct. LATIN SMALL LETTER SHARP S doesn't have an upper case
> form. However the unicode database specifies an upper case mapping from
> ß to SS. http://codepoints.net/U+00DF
>
> Christian


-----

Yes, it is correct (or can be considered correct).
I do not wish to discuss the typographical problem
of "Das Grosse Eszett". The web is full of pages on the
subject. However, I never managed to find an "official
position" from Unicode. The best information I found seems
to converge on U+1E9E now being the "supported"
uppercase form of U+00DF (see DIN).

What is bothering me is more the implementation. The Unicode
documentation says roughly this: if something cannot be
honoured, there is no harm, but do not implement a workaround.
In that case, I'm not sure Python is doing the best thing.

Even if "wrong", this can be considered programmatically correct
or logically acceptable (Py3.2)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
True

while this will *always* be problematic (Py3.3)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
False

jmf
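
The round trip reported above can be reproduced as follows, assuming Python 3.3 or later. The comparison via str.casefold() is an addition for illustration and was not proposed in the thread.

```python
word = "Stra\u00dfe"  # 'Straße'

# upper() expands ß to SS, and nothing on the way back restores it.
round_trip = word.upper().lower().capitalize()
print(round_trip)
print(round_trip == word)

# For caseless comparison, str.casefold() (new in 3.3) folds ß to ss
# on both sides instead of relying on a reversible upper/lower round trip.
same_ignoring_case = word.casefold() == "STRASSE".casefold()
print(same_ignoring_case)
```

Case conversion is not guaranteed to be reversible, so comparing case-folded forms is usually the safer test for "same word, different case".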

 