Performance of int/long in Python 3

 
 
rusi
Guest
Posts: n/a
 
      04-01-2013
On Apr 1, 5:15 pm, Roy Smith <(E-Mail Removed)> wrote:
> In article <515941d8$0$29967$c3e8da3$(E-Mail Removed) om>,
> Steven D'Aprano <(E-Mail Removed)> wrote:
>
> > [...]
> > >> OK, that leads to the next question. Is there any way I can (in Python
> > >> 2.7) detect when a string is not entirely in the BMP? If I could find
> > >> all the non-BMP characters, I could replace them with U+FFFD
> > >> (REPLACEMENT CHARACTER) and life would be good (enough).

>
> > Of course you can do this, but you should not. If your input data
> > includes character C, you should deal with character C and not just throw
> > it away unnecessarily. That would be rude, and in Python 3.3 it should be
> > unnecessary.

>
> The import job isn't done yet, but so far we've processed 116 million
> records and had to clean up four of them. I can live with that.
> Sometimes practicality trumps correctness.


That works out to 0.000003%. Of course I assume it is US-only data.
Still it's good to know how skewed the distribution is.
 
 
 
 
 
Steven D'Aprano
Guest
Posts: n/a
 
      04-01-2013
On Mon, 01 Apr 2013 06:11:50 -0700, rusi wrote:

> On Apr 1, 5:15 pm, Roy Smith <(E-Mail Removed)> wrote:


>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them. I can live with that.
>> Sometimes practicality trumps correctness.

>
> That works out to 0.000003%. Of course I assume it is US-only data.
> Still it's good to know how skewed the distribution is.


If the data included Japanese names, or used Emoji, it would be much
closer to 100% than 0.000003%.



--
Steven
 
 
 
 
 
Steven D'Aprano
Guest
Posts: n/a
 
      04-01-2013
On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:

> In article <515941d8$0$29967$c3e8da3$(E-Mail Removed) om>,
> Steven D'Aprano <(E-Mail Removed)> wrote:
>
>> [...]
>> >> OK, that leads to the next question. Is there any way I can (in
>> >> Python 2.7) detect when a string is not entirely in the BMP? If I
>> >> could find all the non-BMP characters, I could replace them with
>> >> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).

>>
>> Of course you can do this, but you should not. If your input data
>> includes character C, you should deal with character C and not just
>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>> should be unnecessary.

>
> The import job isn't done yet, but so far we've processed 116 million
> records and had to clean up four of them. I can live with that.
> Sometimes practicality trumps correctness.


Well, true. It has to be said that few programming languages (and
databases) make it easy to do the right thing. On the other hand, you're
a programmer. Your job is to write correct code, not easy code.


> It turns out, the problem is that the version of MySQL we're using


Well there you go. Why don't you use a real database?

http://www.postgresql.org/docs/9.2/s...multibyte.html



Postgresql has supported non-broken UTF-8 since at least version 8.1.


> doesn't support non-BMP characters. Newer versions do (but you have to
> declare the column to use the utf8mb4 character set). I could upgrade
> to a newer MySQL version, but it's just not worth it.


My brain just broke. So-called "UTF-8" in MySQL only includes up to a
maximum of three-byte characters. There has *never* been a time where
UTF-8 excluded four-byte characters. What were the developers thinking,
arbitrarily cutting out support for 50% of UTF-8?
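
For concreteness, here's an illustrative Python 3 session (the emoji code
point below is just an arbitrary example) showing exactly what gets cut off:
any character outside the BMP needs four bytes in UTF-8, which is what a
three-byte-max "utf8" column cannot store.

>>> s = '\U0001F600'          # a non-BMP character (plane 1)
>>> s.encode('utf-8')
b'\xf0\x9f\x98\x80'
>>> len(s.encode('utf-8'))    # four bytes: too long for MySQL's legacy "utf8"
4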



> Actually, I did try spinning up a 5.5 instance (one of the nice things
> of being in the cloud) and experimented with that, but couldn't get it
> to work there either. I'll admit that I didn't invest a huge amount of
> effort to make that work before just writing this:
>
> def bmp_filter(self, s):
>     """Filter a unicode string to remove all non-BMP (basic
>     multilingual plane) characters. All such characters are
>     replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).
>
>     """


I expect that in 5-10 years, applications that remove or mangle non-BMP
characters will be considered as unacceptable as applications that mangle
BMP characters. Or for that matter, applications that cannot handle names
with apostrophes.

Hell, if your customer base is in Asia, chances are that mangling non-BMP
characters is *already* considered unacceptable.


--
Steven
 
 
Chris Angelico
Guest
Posts: n/a
 
      04-01-2013
On Tue, Apr 2, 2013 at 4:07 AM, Steven D'Aprano
<(E-Mail Removed)> wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>> It turns out, the problem is that the version of MySQL we're using

>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/s...multibyte.html
>
>
>
> Postgresql has supported non-broken UTF-8 since at least version 8.1.


Not only that, but I *rely* on PostgreSQL to test-or-reject stuff that
comes from untrustworthy languages, like PHP. If it's malformed in any
way, it won't get past the database.

>> doesn't support non-BMP characters. Newer versions do (but you have to
>> declare the column to use the utf8mb4 character set). I could upgrade
>> to a newer MySQL version, but it's just not worth it.

>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers thinking,
> arbitrarily cutting out support for 50% of UTF-8?


Steven, you punctuated that wrongly.

What, were the developers *thinking*? Arbitrarily etc?

It really is brain-breaking. I could understand a naive UTF-8 codec
being too permissive (allowing over-long encodings, allowing
codepoints above what's allocated (eg FA 80 80 80 80, which would
notionally represent U+2000000), etc), but why should it arbitrarily
stop short? There must have been some internal limitation - that,
perhaps, collation was defined only within the BMP.
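
For comparison, CPython 3's own UTF-8 codec rejects both kinds of malformed
input; a quick illustrative check:

    # Both of these raise UnicodeDecodeError in CPython 3.x:
    for bad in (b'\xc0\xaf',                # over-long encoding of '/'
                b'\xfa\x80\x80\x80\x80'):   # 5-byte form, notionally U+2000000
        try:
            bad.decode('utf-8')
        except UnicodeDecodeError as e:
            print(bad, '->', e)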

ChrisA
 
 
MRAB
Guest
Posts: n/a
 
      04-01-2013
On 01/04/2013 18:07, Steven D'Aprano wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>
>> In article <515941d8$0$29967$c3e8da3$(E-Mail Removed) om>,
>> Steven D'Aprano <(E-Mail Removed)> wrote:
>>
>>> [...]
>>> >> OK, that leads to the next question. Is there any way I can (in
>>> >> Python 2.7) detect when a string is not entirely in the BMP? If I
>>> >> could find all the non-BMP characters, I could replace them with
>>> >> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).
>>>
>>> Of course you can do this, but you should not. If your input data
>>> includes character C, you should deal with character C and not just
>>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>>> should be unnecessary.

>>
>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them. I can live with that.
>> Sometimes practicality trumps correctness.

>
> Well, true. It has to be said that few programming languages (and
> databases) make it easy to do the right thing. On the other hand, you're
> a programmer. Your job is to write correct code, not easy code.
>
>
>> It turns out, the problem is that the version of MySQL we're using

>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/s...multibyte.html
>
>
>
> Postgresql has supported non-broken UTF-8 since at least version 8.1.
>
>
>> doesn't support non-BMP characters. Newer versions do (but you have to
>> declare the column to use the utf8mb4 character set). I could upgrade
>> to a newer MySQL version, but it's just not worth it.

>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers thinking,
> arbitrarily cutting out support for 50% of UTF-8?
>

[snip]
50%? The BMP is one of 17 planes, so wouldn't that be 94%?
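
A quick back-of-the-envelope check of that figure, counting whole planes of
the code space rather than assigned characters:

>>> planes = 17
>>> excluded = (planes - 1) / planes   # fraction of the code space outside the BMP
>>> round(excluded * 100, 1)
94.1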

 
 
jmfauth
Guest
Posts: n/a
 
      04-01-2013
---------


I'm not whining and I'm not complaining (and never did).
I have always presented facts.

I'm not especially interested in Python, I'm interested in
Unicode.

Usually when I post examples, they are confirmed.


What I see is this (standard downloadable Pythons on Windows 7 and
other Windows platforms/machines):

Py32
>>> import timeit
>>> timeit.repeat("'a' * 1000 + 'ẞ'")

[0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
>>> timeit.repeat("'a' * 1000 + 'z'")

[0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

Py33
import timeit
timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
timeit.repeat("'a' * 1000 + 'z'")
[0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

I systematically see such behaviour, in 99.99999% of my tests.
When there is something better, it is usually because something else
(3.2/3.3) has been modified.

I have my idea where this is coming from.

Question: when it is claimed that this has been tested,
do you mean stringbench.py, as proposed many times by Terry?
(Thanks for an answer.)

jmf

 
 
Chris Angelico
Guest
Posts: n/a
 
      04-01-2013
On Tue, Apr 2, 2013 at 6:15 AM, jmfauth <(E-Mail Removed)> wrote:
> Py32
>>>> import timeit
>>>> timeit.repeat("'a' * 1000 + 'ẞ'")

> [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
>>>> timeit.repeat("'a' * 1000 + 'z'")

> [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
>
> Py33
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.6640958193635527, 0.6469043692851528, 0.6458961423900007]


This is what's called a microbenchmark. Can you show me any instance
in production code where an operation like this is done repeatedly, in
a time-critical place? It's a contrived example, and it's usually
possible to find regressions in any system if you fiddle enough with
the example. Do you have, for instance, a web server that can handle
1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?
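
For what it's worth, a slightly less contrived comparison would time
something closer to real string-handling code; an illustrative sketch (the
workload below is hypothetical):

    import timeit

    # Hypothetical workload: join a mostly-ASCII list containing one
    # non-Latin-1 name, the way report or template code often does.
    setup = "names = ['ascii_name'] * 999 + ['ẞ-name']"
    stmt = "', '.join(names)"
    print(min(timeit.repeat(stmt, setup=setup, repeat=3, number=10000)))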

ChrisA
 
 
Mark Lawrence
Guest
Posts: n/a
 
      04-01-2013
On 01/04/2013 20:15, jmfauth wrote:
> ---------
>
>
> I'm not whining and I'm not complaining (and never did).
> I have always presented facts.


The only fact I'm aware of is an edge case that is being addressed on
the Python bug tracker; sorry, I'm too lazy to look up the number again.

>
> I'm not especially interested in Python, I'm interested in
> Unicode.


So why do you keep harping on about the same old edge case?

>
> Usually when I post examples, they are confirmed.


The only things you've ever posted are the same old boring
microbenchmarks. You never, ever comment on the memory savings that are,
IIRC, extremely popular with the Django folks amongst others. Neither do
you comment on the fact that the unicode implementation in Python 3.3 is
correct. I can only assume that you prefer a fast but buggy
implementation to a correct but slow one. Except that in many cases the
3.3 implementation is actually faster, so you conveniently ignore this.
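
As an aside, those memory savings are easy to demonstrate on a 3.3+
interpreter; an illustrative sketch (exact byte counts vary by build and
version):

    import sys

    # Under PEP 393 (Python 3.3+) the per-character storage adapts to the
    # widest code point in the string: 1, 2 or 4 bytes per character.
    ascii_s  = 'a' * 1000           # 1 byte per character
    bmp_s    = '\u1e9e' * 1000      # U+1E9E needs 2 bytes per character
    astral_s = '\U0001F600' * 1000  # non-BMP: 4 bytes per character
    for s in (ascii_s, bmp_s, astral_s):
        print(len(s), sys.getsizeof(s))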

>
>
> What I see is this (standard downloadable Pythons on Windows 7 and
> other Windows platforms/machines):
>
> Py32
>>>> import timeit
>>>> timeit.repeat("'a' * 1000 + 'ẞ'")

> [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
>>>> timeit.repeat("'a' * 1000 + 'z'")

> [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]
>
> Py33
> import timeit
> timeit.repeat("'a' * 1000 + 'ẞ'")
> [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
> timeit.repeat("'a' * 1000 + 'z'")
> [0.6640958193635527, 0.6469043692851528, 0.6458961423900007]
>
> I systematically see such behaviour, in 99.99999% of my tests.


You always run your microbenchmarks, never anything else.

> When there is something better, it is usually because something else
> (3.2/3.3) has been modified.
>
> I have my idea where this is coming from.


I know where this is coming from as it's been stated umpteen times on
numerous threads. As usual you simply ignore any facts that you feel
like, particularly with respect to any real world use cases.

>
> Question: when it is claimed that this has been tested,
> do you mean stringbench.py, as proposed many times by Terry?
> (Thanks for an answer.)


I find it amusing that you ask for an answer but refuse point blank to
provide answers yourself. I suspect that you've bitten off more than
you can chew.

>
> jmf
>


--
If you're using GoogleCrap™ please read this
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

 
 
jmfauth
Guest
Posts: n/a
 
      04-01-2013
On 1 Apr, 21:28, Chris Angelico <(E-Mail Removed)> wrote:
> On Tue, Apr 2, 2013 at 6:15 AM, jmfauth <(E-Mail Removed)> wrote:
> > Py32
> >>>> import timeit
> >>>> timeit.repeat("'a' * 1000 + 'ẞ'")

> > [0.7005365263669056, 0.6810694766790423, 0.6811978680727229]
> >>>> timeit.repeat("'a' * 1000 + 'z'")

> > [0.7105829560031083, 0.6904999426964764, 0.6938637184431968]

>
> > Py33
> > import timeit
> > timeit.repeat("'a' * 1000 + 'ẞ'")
> > [1.1484035160337613, 1.1233738895227505, 1.1215708962703874]
> > timeit.repeat("'a' * 1000 + 'z'")
> > [0.6640958193635527, 0.6469043692851528, 0.6458961423900007]

>
> This is what's called a microbenchmark. Can you show me any instance
> in production code where an operation like this is done repeatedly, in
> a time-critical place? It's a contrived example, and it's usually
> possible to find regressions in any system if you fiddle enough with
> the example. Do you have, for instance, a web server that can handle
> 1000 tps on 3.2 and only 600 tps on 3.3, all other things being equal?
>
> ChrisA


-----

Of course this is an example, like many I have given. Examples of this
kind you may find in apps.

Can you point to and give at least a handful of examples showing
there is no regression, at least to contradict me? The only
one I have succeeded in seeing (in months) is the one given by Steven, a
status quo.

I will happily accept them. The only thing I read is "this is faster",
"it has been tested", ...
jmf

 
 
Roy Smith
Guest
Posts: n/a
 
      04-01-2013
In article <5159beb6$0$29967$c3e8da3$(E-Mail Removed) om>,
Steven D'Aprano <(E-Mail Removed)> wrote:
>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them. I can live with that.
>> Sometimes practicality trumps correctness.

>
>Well, true. It has to be said that few programming languages (and
>databases) make it easy to do the right thing. On the other hand, you're
>a programmer. Your job is to write correct code, not easy code.


This is really getting off topic, but fundamentally, I'm an engineer.
My job is to build stuff that makes money for my company. That means
making judgement calls about what's not worth fixing, because the cost
to fix it exceeds the value.
 
 
 
 