Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Performance of int/long in Python 3

Reply
Thread Tools

Performance of int/long in Python 3

 
 
Terry Reedy
Guest
Posts: n/a
 
      03-28-2013
On 3/28/2013 10:38 AM, Chris Angelico wrote:

> PEP393 strings have two optimizations, or kinda three:
>
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
>
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?)


Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
amounts to a simple copy of the internal bytes. I do not know if *all*
the codecs for such encodings are 393-aware, but I do know that the
utf-8 and latin-1 group are. This is one operation that 3.3+ does much
faster than 3.2-


--
Terry Jan Reedy

 
Reply With Quote
 
 
 
 
Ian Kelly
Guest
Posts: n/a
 
      03-28-2013
On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <(E-Mail Removed)> wrote:
> PEP393 strings have two optimizations, or kinda three:
>
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
>
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?) Both are
> optimized down to a single byte per character.


The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header. The smaller header omits the utf8
pointer (which optionally points to an additional UTF-8 representation
of the string) and its associated length variable. These are not
needed for ASCII-only strings because an ASCII string can be directly
interpreted as a UTF-8 string for the same result. The smaller header
also omits the "wstr_length" field which, according to the PEP,
"differs from length only if there are surrogate pairs in the
representation." For an ASCII string, of course there would not be
any surrogate pairs.
 
Reply With Quote
 
 
 
 
Chris Angelico
Guest
Posts: n/a
 
      03-28-2013
On Fri, Mar 29, 2013 at 3:01 AM, Terry Reedy <(E-Mail Removed)> wrote:
> On 3/28/2013 10:38 AM, Chris Angelico wrote:
>
>> PEP393 strings have two optimizations, or kinda three:
>>
>> 1a) ASCII-only strings
>> 1b) Latin1-only strings
>> 2) BMP-only strings
>> 3) Everything else
>>
>> Options 1a and 1b are almost identical - I'm not sure what the detail
>> is, but there's something flagging those strings that fit inside seven
>> bits. (Something to do with optimizing encodings later?)

>
>
> Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
> amounts to a simple copy of the internal bytes. I do not know if *all* the
> codecs for such encodings are 393-aware, but I do know that the utf-8 and
> latin-1 group are. This is one operation that 3.3+ does much faster than
> 3.2-


Thanks Terry. So that's not so much a representation difference as a
flag that costs little or nothing to retain, and can improve
performance in the encode later on. Sounds like a useful tweak to the
basics of flexible string representation, without being particularly
germane to jmf's complaints.

ChrisA
 
Reply With Quote
 
Ian Kelly
Guest
Posts: n/a
 
      03-28-2013
On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <(E-Mail Removed)> wrote:
> The flexible string representation takes the problem from the
> other side, it attempts to work with the characters by using
> their representations and it (can only) fails...


This is false. As I've pointed out to you before, the FSR does not
divide characters up by representation. It divides them up by
codepoint -- more specifically, by the *bit-width* of the codepoint.
We call the internal format of the string "ASCII" or "Latin-1" or
"UCS-2" for conciseness and a point of reference, but fundamentally
all of the FSR formats are simply byte arrays of *codepoints* -- you
know, those things you keep harping on. The major optimization
performed by the FSR is to consistently truncate the leading zero
bytes from each codepoint when it is possible to do so safely. But
regardless of to what extent this truncation is applied, the string is
*always* internally just an array of codepoints, and the same
algorithms apply for all representations.
 
Reply With Quote
 
jmfauth
Guest
Posts: n/a
 
      03-28-2013
Chris,

Your problem with int/long, the start of this thread, is
very intersting.

This is not a demonstration, a proof, rather an illustration.

Assume you have a set of integers {0...9} and an operator,
let say, the addition.

Idea.
Just devide this set in two chunks, {0...4} and {5...9}
and work hardly to optimize the addition of 2 operands in
the sets {0...4}.

The problems.
- When optimizing "{0...4}", your algorithm will most probably
weaken "{5...9}".
- When using "{5...9}", you do not benefit from your algorithm, you
will be penalized just by the fact you has optimized "{0...4}"
- And the first mistake, you are just penalized and impacted by the
fact you have to select in which subset you operands are when
working with "{0...9}".

Very interestingly, working with the representation (bytes) of
these integers will not help. You have to consider conceptually
{0..9} as numbers.

Now, replace numbers by characters, bytes by "encoded code points",
and you have qualitatively the flexible string representation.

In Unicode, there is one more level of abstraction: one conceptually
neither works with characters, nor with "encoded code points", but
with unicode transformed formated "entities". (see my previous post).

That means you can work very hardly on the "bytes levels",
you will never solves the problem which is one level higher
in the unicode hierarchy:
character -> code point -> utf -> bytes (implementation)
with the important fact that this construct can only go
from left to right.

---

In fact, by proposing a flexible representation of ints, you may
just fall in the same trap the flexible string representation
presents.

----

All this stuff is explained in good books about the coding of the
characters and/or unicode.
The unicode.org documention explains it too. It is a little
bit harder to discover, because the doc is presenting always
this stuff from a "technical" perspective.
You get it when reading a large part of the Unicode doc.

jmf



 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      03-28-2013
On Fri, Mar 29, 2013 at 3:55 AM, jmfauth <(E-Mail Removed)> wrote:
> Assume you have a set of integers {0...9} and an operator,
> let say, the addition.
>
> Idea.
> Just devide this set in two chunks, {0...4} and {5...9}
> and work hardly to optimize the addition of 2 operands in
> the sets {0...4}.
>
> The problems.
> - When optimizing "{0...4}", your algorithm will most probably
> weaken "{5...9}".
> - When using "{5...9}", you do not benefit from your algorithm, you
> will be penalized just by the fact you has optimized "{0...4}"
> - And the first mistake, you are just penalized and impacted by the
> fact you have to select in which subset you operands are when
> working with "{0...9}".
>
> Very interestingly, working with the representation (bytes) of
> these integers will not help. You have to consider conceptually
> {0..9} as numbers.


Yeah, and there's an easy representation of those numbers. But let's
look at Python's representations of integers. I have a sneaking
suspicion something takes note of how large the number is before
deciding how to represent it. Look!

>>> sys.getsizeof(1)

14
>>> sys.getsizeof(1<<2)

14
>>> sys.getsizeof(1<<4)

14
>>> sys.getsizeof(1<<

14
>>> sys.getsizeof(1<<31)

18
>>> sys.getsizeof(1<<30)

18
>>> sys.getsizeof(1<<16)

16
>>> sys.getsizeof(1<<12345)

1660
>>> sys.getsizeof(1<<123456)

16474

Small numbers are represented more compactly than large ones! And it's
not like in REXX, where all numbers are stored as strings.

Go fork CPython and make the changes you suggest. Then run real-world
code on it and see how it performs. Or at very least, run plausible
benchmarks like the strings benchmark from the standard tests.

My original post about integers was based on two comparisons: Python 2
and Pike. Both languages have an optimization for "small" integers
(where "small" is "within machine word" - on rechecking some of my
stats, I find that I perhaps should have used a larger offset, as the
64-bit Linux Python I used appeared to be a lot faster than it should
have been), which Python 3 doesn't have. Real examples, real
statistics, real discussion. (I didn't include Pike stats in what I
posted, for two reasons: firstly, it would require a reworking of the
code, rather than simply "run this under both interpreters"; and
secondly, Pike performance is completely different from CPython
performance, and is non-comparable. Pike is more similar to PyPy, able
to compile - in certain circumstances - to machine code. So the
comparisons were Py2 vs Py3.)

ChrisA
 
Reply With Quote
 
jmfauth
Guest
Posts: n/a
 
      03-28-2013
On 28 mar, 17:33, Ian Kelly <(E-Mail Removed)> wrote:
> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <(E-Mail Removed)> wrote:
> > The flexible string representation takes the problem from the
> > other side, it attempts to work with the characters by using
> > their representations and it (can only) fails...

>
> This is false. *As I've pointed out to you before, the FSR does not
> divide characters up by representation. *It divides them up by
> codepoint -- more specifically, by the *bit-width* of the codepoint.
> We call the internal format of the string "ASCII" or "Latin-1" or
> "UCS-2" for conciseness and a point of reference, but fundamentally
> all of the FSR formats are simply byte arrays of *codepoints* -- you
> know, those things you keep harping on. *The major optimization
> performed by the FSR is to consistently truncate the leading zero
> bytes from each codepoint when it is possible to do so safely. *But
> regardless of to what extent this truncation is applied, the string is
> *always* internally just an array of codepoints, and the same
> algorithms apply for all representations.


-----

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back in an ascii oriented product.

If Python had imlemented Unicode correctly, there would
be no difference in using an "a", "", "" or any character,
what the narrow builds did.

If I am practically the only one, who speakes /discusses about
this, I can ensure you, this has been noticed.

Now, it's time to prepare the Asparagus, the "jambon cru"
and a good bottle a dry white wine.

jmf




 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      03-28-2013
On Fri, Mar 29, 2013 at 4:48 AM, jmfauth <(E-Mail Removed)> wrote:
> If Python had imlemented Unicode correctly, there would
> be no difference in using an "a", "", "" or any character,
> what the narrow builds did.


I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA
 
Reply With Quote
 
rurpy@yahoo.com
Guest
Posts: n/a
 
      03-28-2013
On 03/28/2013 01:48 AM, Steven D'Aprano wrote:
> On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:
>> More seriously Ive never seen anyone -- cause or person -- aided by
>> the use of excessively strong language.

>
> Of course not. By definition, if it helps, it wasn't *excessively* strong
> language.


For someone who delights in pointing out the logical errors
of others you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

>> IOW I repeat my support for Ned's request: Ad hominiem attacks are not
>> welcome, irrespective of the context/provocation.

>
> Insults are not ad hominem attacks.


Insults may or may not be ad hominem attacks. There is nothing
mutually exclusive about those terms.

> "You sir, are a bounder and a cad. Furthermore, your
> argument is wrong, because of reasons."
>
> may very well be an insult, but it also may be correct, and the reasons
> logically valid.


Those are two different statements. The first is an ad hominem
attack and is not welcome here. The second is an acceptable
response.

> "Your argument is wrong, because you are a bounder
> and a cad."
>
> is an ad hominem fallacy, because even bounders and cads may tell the
> truth occasionally, or be correct by accident.


That it is a fallacy does not mean it is not also an attack.

> I find it interesting that nobody has yet felt the need to defend JMF,
> and tell me I was factually incorrect about him (as opposed to merely
> impolite or mean-spirited).


Nothing "interesting" about it at all. Most of us (perhaps
unlike you) are not interested in discussing the personal
characteristics of posters here (in contrast to discussing
the technical opinions they post).

Further, "liar" is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.

>[...]
> I would rather that you called me a liar to my face
> and gave me the opportunity to respond, than for you to ignore everything
> I said.


Even if you personally would prefer someone to respond by
calling you a liar, your personal preferences do not form
a basis for desirable posting behavior here.

But again you're creating a false dichotomy. Those are not
the only two choices. A third choice is neither ignore you
nor call you a liar but to factually point out where you are
wrong, or (if it is a matter of opinion) why one holds a
different opinion. That was the point Ned Deily was making
I believe.

> I hope that we all agree that we want a nice, friendly, productive
> community where everyone is welcome.


I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.

> But some people simply cannot or
> will not behave in ways that are compatible with those community values.
> There are some people whom we *do not want here*


In other words, everyone is NOT welcome.

> -- spoilers and messers,
> vandals and spammers and cheats and liars and trolls and crackpots of all
> sorts.


Where those terms are defined by you and a handful of other
voracious posters. "Troll" in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who due attitude or
lack of full English fluency do not express themselves in
a sufficiently submissive way.

> We only disagree as to the best way to make it clear to them that
> they are not welcome so long as they continue their behaviour.


No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious that the original
posts are.

> [1] Although sadly, given the reality of communication on the Internet,
> sometimes kill-filing is the least-worst option.


Please, please, killfile jmfauth, ranting rick, xaw lee and
anyone else you don't like so that the rest of us can be spared
the orders of magnitude larger, more disruptive and more offensive
posts generated by your (plural) responses to them.

Believe or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.
 
Reply With Quote
 
Ned Deily
Guest
Posts: n/a
 
      03-28-2013
In article
<(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:
> I'd rather this list have some vinegar than it devolve into
> uselessness. Or, worse, if there's a hard-and-fast rule about
> courtesy, devolve into aspartame... everyone's courteous in words but
> hates each other underneath. Or am I taking the analogy too far?


I think you are positing false choices. No one - at least I'm not - is
advocating to avoid challenging false or misleading statements in the
interests of maintaining some false "see how well we all get along"
facade. The point is we can have meaningful, hard-nosed discussions
without resorting to personal insults, i.e. flaming. I think the
discussion in this topic over the past 24 hours or so demonstrates that.

--
Ned Deily,
http://www.velocityreviews.com/forums/(E-Mail Removed)

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Performance Tutorials Services - Boosting Performance by DisablingUnnecessary Services on Windows XP Home Edition Software Engineer Javascript 0 06-10-2011 02:18 AM
Re: Performance (pystone) of python 2.4 lower then python 2.3 ??? Andreas Kostyrka Python 0 12-17-2004 02:00 PM
Performance (pystone) of python 2.4 lower then python 2.3 ??? Lucas Hofman Python 13 12-16-2004 03:24 AM
RE: Straw poll on Python performance (was Re: Python is far from atop performer ...) Robert Brewer Python 1 01-10-2004 06:54 AM
Web Form Performance Versus Single File Performance jm ASP .Net 1 12-12-2003 11:14 PM



Advertisments