
Velocity Reviews > Newsgroups > Programming > Python > Re: Is LOAD_GLOBAL really that slow?

Re: Is LOAD_GLOBAL really that slow?

 
 
Carsten Haese
 
      08-30-2007
On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
> It seems a common opinion that global access is much slower than local
> variable access. However, my benchmarks show a relatively small
> difference:
>
> ./python -m timeit -r 10 -v -s 'x = [None] * 10000
> def foo():
>     for i in x:
>         list; list; list; list; list; list; list; list; list; list' 'foo()'
> 10 loops -> 0.0989 secs
> 100 loops -> 0.991 secs
> raw times: 0.999 0.985 0.987 0.985 0.985 0.982 0.982 0.982 0.981 0.985
> 100 loops, best of 10: 9.81 msec per loop
>
> ./python -m timeit -r 10 -v -s 'x = [None] * 10000
> def foo():
>     mylist = list
>     for i in x:
>         mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist;
>         mylist; mylist' 'foo()'
> 10 loops -> 0.0617 secs
> 100 loops -> 0.61 secs
> raw times: 0.603 0.582 0.582 0.583 0.581 0.583 0.58 0.583 0.584 0.582
> 100 loops, best of 10: 5.8 msec per loop
>
> So global access is about 70% slower than local variable access. To
> put that in perspective, two local variable accesses will take longer
> than a single global variable access.
>
> This is a very extreme benchmark though. In practice, other overheads
> will probably drop the difference to a few percent at most. Not that
> important in my book.


Your comparison is flawed: the function call and the inner for loop add
a measurement offset that makes the locals advantage seem smaller than
it is. In the interest of comparing just the local lookup against just
the global lookup, I think the following timings are more appropriate:

$ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): pass" "f(42)"
1000000 loops, best of 10: 0.3 usec per loop
$ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): x" "f(42)"
1000000 loops, best of 10: 0.331 usec per loop
$ python2.5 -mtimeit -r 10 -s"y=42" -s"def f(x): y" "f(42)"
1000000 loops, best of 10: 0.363 usec per loop

There is no loop overhead here, and after subtracting the function call
overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
global lookup, so local lookups are just about twice as fast as global
lookups.
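(For anyone who wants to reproduce this with the timeit module instead
of the command line, here's a rough sketch. The helper names are mine,
not from the thread, and the subtraction is noisy, so treat the numbers
as ballpark figures, not exact costs:)

```python
import timeit

y = 42  # module-level name, reached via LOAD_GLOBAL

def call_only(x):
    pass          # measures just the function call overhead

def local_lookup(x):
    x             # one LOAD_FAST on top of the call overhead

def global_lookup(x):
    y             # one LOAD_GLOBAL on top of the call overhead

N = 200_000
base = min(timeit.repeat(lambda: call_only(42), number=N, repeat=5))
loc = min(timeit.repeat(lambda: local_lookup(42), number=N, repeat=5))
glob_ = min(timeit.repeat(lambda: global_lookup(42), number=N, repeat=5))

# Subtract the call overhead to isolate the lookup itself; the absolute
# numbers depend entirely on your machine and interpreter version.
print("local lookup:  %.0f ns" % ((loc - base) / N * 1e9))
print("global lookup: %.0f ns" % ((glob_ - base) / N * 1e9))
```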

True, whether this difference is significant does depend on how many
name lookups your code makes and how much else it's doing, but if you're
doing a lot of number crunching and not a lot of I/O, the difference
might be significant. Also, even if using local names is only slightly
faster than using globals, it's still not slower, and the resulting code
is still more readable and more maintainable. Using locals is a win-win
scenario.
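To make the idiom concrete, here is what the localization looks like in
practice (the functions are made up for illustration; both return the
same values, the second just pays the lookup once instead of once per
iteration):

```python
import math

def norms_plain(points):
    # Each iteration pays LOAD_GLOBAL for math plus an attribute
    # lookup for sqrt.
    result = []
    for x, y in points:
        result.append(math.sqrt(x * x + y * y))
    return result

def norms_localized(points):
    sqrt = math.sqrt  # looked up once; inside the loop it's a LOAD_FAST
    result = []
    for x, y in points:
        result.append(sqrt(x * x + y * y))
    return result
```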

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net


 
 
 
 
 
Rhamphoryncus
 
      08-30-2007
On Aug 29, 8:33 pm, Carsten Haese <(E-Mail Removed)> wrote:
> Your comparison is flawed: the function call and the inner for loop add
> a measurement offset that makes the locals advantage seem smaller than
> it is. In the interest of comparing just the local lookup against just
> the global lookup, I think the following timings are more appropriate:

That's why I used far more name lookups, to minimize the overhead.


> $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): pass" "f(42)"
> 1000000 loops, best of 10: 0.3 usec per loop
> $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): x" "f(42)"
> 1000000 loops, best of 10: 0.331 usec per loop
> $ python2.5 -mtimeit -r 10 -s"y=42" -s"def f(x): y" "f(42)"
> 1000000 loops, best of 10: 0.363 usec per loop


On my box, the best results I got after several runs were 0.399,
0.447, and 0.464. Even less difference than my original results.


> There is no loop overhead here, and after subtracting the function call
> overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
> global lookup, so local lookups are just about twice as fast as global
> lookups.
>
> True, whether this difference is significant does depend on how many
> name lookups your code makes and how much else it's doing, but if you're
> doing a lot of number crunching and not a lot of I/O, the difference
> might be significant. Also, even if using local names is only slightly
> faster than using globals, it's still not slower, and the resulting code
> is still more readable and more maintainable. Using locals is a win-win
> scenario.


You get very small speed gains (assuming your code is doing anything
significant) for a lot of effort: trying out different options and
seeing whether they're actually faster on different boxes. The
readability cost is there, even if it's smaller than many of the other
obfuscations people attempt. If the speed gains were really that
important, you should rewrite in C, where you'd get far greater speed
gains.

So it only seems worthwhile when you really, *really* need to get a
slight speedup on your box, you don't need to get any more speedup
than that, and C is not an option.

Fwiw, I posted this after developing yet another patch to optimize
global lookups. It does sometimes show an improvement on specific
benchmarks, but overall it harms performance. Looking into why, it
doesn't make sense that a Python dictionary lookup can cost less than
two simple array indexes, but there you go. Python dictionaries are
already damn fast.

--
Adam Olsen, aka Rhamphoryncus

 
 
 
 
 
Chris Mellon
 
      08-30-2007
On 8/30/07, Rhamphoryncus <(E-Mail Removed)> wrote:
> On Aug 29, 8:33 pm, Carsten Haese <(E-Mail Removed)> wrote:
> > On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
> > There is no loop overhead here, and after subtracting the function call
> > overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
> > global lookup, so local lookups are just about twice as fast as global
> > lookups.
> >


__builtins__ lookups are an extra dict lookup slower than just global
variables, too. Don't forget those.
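The two-step lookup is easy to see from Python itself. LOAD_GLOBAL
checks the module's globals dict first and only then falls through to
the builtins, so shadowing a builtin name in globals short-circuits the
second step (a self-contained sketch):

```python
import builtins

def f():
    return len  # not in this module's globals, so it falls through to builtins

assert f() is builtins.len

# Shadowing the name in the module's globals intercepts the lookup:
globals()['len'] = 'shadowed'
try:
    assert f() == 'shadowed'
finally:
    del globals()['len']  # restore the fallthrough to the builtin
```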


> > True, whether this difference is significant does depend on how many
> > name lookups your code makes and how much else it's doing, but if you're
> > doing a lot of number crunching and not a lot of I/O, the difference
> > might be significant. Also, even if using local names is only slightly
> > faster than using globals, it's still not slower, and the resulting code
> > is still more readable and more maintainable. Using locals is a win-win
> > scenario.

>
> You get very small speed gains (assuming your code is doing anything
> significant), for a lot of effort (trying out different options,
> seeing if they're actually faster on different boxes.) The
> readability cost is there, even if it is smaller than many of the
> other obfuscations people attempt. If the speed gains were really
> that important you should rewrite in C, where you'd get far greater
> speed gains.
>


I've doubled the speed of a processing loop by moving global lookups
out of the loop. Rewriting in C would have taken at least a day, even
with Pyrex; localizing the lookup took about 2 minutes.
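My actual loop isn't worth posting, but the hoist typically looks like
this (hypothetical function; the bound-method trick is the same idea as
localizing a global):

```python
def process(values):
    out = []
    append = out.append  # hoist the attribute lookup out of the loop
    _abs = abs           # hoist the builtin lookup too
    for v in values:
        # Two LOAD_FASTs per iteration instead of a globals-then-builtins
        # dict lookup plus an attribute lookup.
        append(_abs(v))
    return out
```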

> So it only seems worthwhile when you really, *really* need to get a
> slight speedup on your box, you don't need to get any more speedup
> than that, and C is not an option.
>


It's not a huge optimization, but it's really easy to write if you
don't mind adding fake kwargs to your functions. Just for the heck of
it I also wrote a decorator that will re-write the bytecode so that
any global that can be looked up at function definition time will be
rewritten as a local (actually with LOAD_CONST). You can see it at
http://code.google.com/p/wxpsvg/wiki/GlobalsOptimization. Disclaimer:
While I've tested it with a variety of functions and it's never broken
anything, I've never actually used this for anything except an
intellectual exercise. Use at your own risk.
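The "fake kwargs" version relies on default values being evaluated once
at definition time and arriving in the function as locals (illustrative
name, not from the decorator code):

```python
import math

def hypot_fast(x, y, _sqrt=math.sqrt):
    # _sqrt is bound when the def executes; in the body it's a LOAD_FAST.
    # Callers aren't meant to pass it -- that's what makes it "fake".
    return _sqrt(x * x + y * y)
```

The downside, of course, is that the fake parameter shows up in the
function's signature, which is exactly the kind of clutter the bytecode
rewrite avoids.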

> Fwiw, I posted this after developing yet another patch to optimize
> global lookups. It does sometimes show an improvement on specific
> benchmarks, but overall it harms performance. Looking into why, it
> doesn't make sense that a python dictionary lookup can have less cost
> than two simple array indexes, but there you go. Python dictionaries
> are already damn fast.
>


I certainly believe that changes to Python's internals to make
LOAD_GLOBAL itself faster can be difficult, with even "obvious"
optimizations ending up slower. However, LOAD_FAST (and LOAD_CONST)
are faster than LOAD_GLOBAL, and for the reason you just stated, that
is unlikely to change.
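The opcode difference is easy to confirm with the stdlib dis module:

```python
import dis

y = 42

def uses_global():
    return y  # compiled to LOAD_GLOBAL

def uses_local(x):
    return x  # compiled to LOAD_FAST

ops_g = {i.opname for i in dis.get_instructions(uses_global)}
ops_l = {i.opname for i in dis.get_instructions(uses_local)}
assert "LOAD_GLOBAL" in ops_g
assert "LOAD_FAST" in ops_l
```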
 
 
Rhamphoryncus
 
      08-30-2007
On Aug 30, 12:04 pm, "Chris Mellon" <(E-Mail Removed)> wrote:
>
> __builtins__ lookups are an extra dict lookup slower than just global
> variables, too. Don't forget those.


Heh right, I forgot that. That's what my benchmark was actually
testing.


> I've doubled the speed of a processing loop by moving globals lookups
> out of the loop. Rewriting in C would have taken at least a day, even
> with Pyrex, localizing the lookup took about 2 minutes.


I'm guessing that was due to deep voodoo involving your processor's
pipeline, branch prediction, caching, etc. I'd be interested in
seeing small examples of it though.


> It's not a huge optimization, but it's really easy to write if you
> don't mind adding fake kwargs to your functions. Just for the heck of
> it I also wrote a decorator that will re-write the bytecode so that
> any global that can be looked up at function definition will be
> re-written as a local (actually with LOAD_CONST). You can see it at
> http://code.google.com/p/wxpsvg/wiki/GlobalsOptimization. Disclaimer:
> While I've tested it with a variety of functions and it's never broken
> anything, I've never actually used this for anything except an
> intellectual exercise. Use at your own risk.


Doubling the throughput while doing *real work* is definitely more
significant than maybe-or-maybe-not-quite-doubling without any real
work.

--
Adam Olsen, aka Rhamphoryncus

 
 
Chris Mellon
 
      08-30-2007
On 8/30/07, Rhamphoryncus <(E-Mail Removed)> wrote:
>
> I'm guessing that was due to deep voodoo involving your processor's
> pipeline, branch prediction, caching, etc. I'd be interested in
> seeing small examples of it though.
>


There's certainly deep voodoo involved in the function. For example, I
used to have a test for early exit, but the branch was slower than the
extra 3 function calls it prevented. I was very surprised to get that
result at such a high level, but I tested (over and over) and it was
very consistent. This was with psyco as well, so perhaps it's some
characteristic of the generated assembly.

Psyco, of course, is even less work and more benefit than localizing
globals, so that's my first step for "make it faster".

> Doubling the throughput while doing *real work* is definitely more
> significant than maybe-or-maybe-not-quite-doubling without any real
> work.
>


I wouldn't actually recommend that anyone use this, I just wrote it for fun.
 
 
 
 