Velocity Reviews > Microbenchmark: Summing over array of doubles

# Microbenchmark: Summing over array of doubles

Yaroslav Bulatov
Guest
Posts: n/a

 08-01-2004
I made an array of 10 million floats and timed how long it takes to
sum the elements. Here's what I got (millis):

gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwik...-numeric-speed

numarray overtakes Numeric at about 1,100 floats.
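For scale, the pure-Python end of such a benchmark can be sketched with just the standard library (the fill value and function name below are illustrative, not the script from the link):

```python
import time

def time_pure_python_sum(n=10_000_000):
    """Time summing n floats with the built-in sum(); returns (total, millis)."""
    data = [0.5] * n  # illustrative constant fill; the real script used random floats
    t0 = time.perf_counter()
    total = sum(data)
    t1 = time.perf_counter()
    return total, (t1 - t0) * 1000.0
```

With a constant fill of 0.5 the total is exact (5000000.0), so the harness is easy to sanity-check; the real script filled the array with random values.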

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.

Yaroslav

Duncan Booth
Guest
Posts: n/a

 08-01-2004
Yaroslav Bulatov wrote:

> I made an array of 10 million floats timed how long it takes to sum
> the elements, here's what I got (millis):

I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.

Also, without running it I don't see how most of your Python examples work
since you seem to call the timing function with the result of the
calculation which doesn't make sense.

Christopher T King
Guest
Posts: n/a

 08-01-2004
On 1 Aug 2004, Duncan Booth wrote:

> I just had a brief look at the code you posted. Are you not concerned about
> accuracy in any of your calculations? Summing a 10 million element array by
> simply accumulating each element into a running total is guaranteed to give
> a lousy result.

Lousy or not, I believe that's how numarray is implemented internally, so
at least all the benchmarks are the same. If you want accuracy summing
that many numbers, you're going to have to do it in software (likely by
summing each mantissa bit individually and reconstructing the float
afterward), so it will be abysmally slow (obviously not what the OP
wants).

Christopher T King
Guest
Posts: n/a

 08-01-2004
On 31 Jul 2004, Yaroslav Bulatov wrote:

> I'm doing intensive computation on arrays in Python, so if you have
> suggestions on Python/C solutions that could push the envelope, please
> let me know.

If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is in
the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even any fadd
instruction in the assembler output:

call clock
movl %eax, %esi
movl $9999999, %ebx
.L11:
decl %ebx
jns .L11
subl $16, %esp
call clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n",sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
numarray uses.

Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random_array? It shouldn't
affect your results, but it would speed up initialization time a bit.

There are a few inefficiencies in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have too big of
an impact on the results. It may be worth re-running the tests using
Psyco to remove much of the Python overhead.

For completeness, I'd also suggest both running the Java version using a
JIT compiler such as Kaffe, and compiling it natively using gcj (the
latter should approach the speed of C).

Duncan Booth
Guest
Posts: n/a

 08-02-2004
Christopher T King wrote:

> On 1 Aug 2004, Duncan Booth wrote:
>
>> I just had a brief look at the code you posted. Are you not concerned
>> about accuracy in any of your calculations? Summing a 10 million
>> element array by simply accumulating each element into a running
>> total is guaranteed to give a lousy result.

>
> Lousy or not, I believe that's how numarray is implemented internally,
> so at least all the benchmarks are the same. If you want accuracy
> summing that many numbers, you're going to have to do it in software
> (likely by summing each mantissa bit individually and reconstructing
> the float afterward), so it will be abysmally slow (obviously not what
> the OP wants).
>

My point is that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):

return 0

The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.
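A minimal sketch of that recursive halving (the function name and base-case size are my own choices, not code from this thread):

```python
def pairwise_sum(xs, lo=0, hi=None):
    """Sum xs[lo:hi] by recursively splitting the range in half."""
    if hi is None:
        hi = len(xs)
    if hi - lo <= 8:  # small base case: sum the block directly
        s = 0.0
        for i in range(lo, hi):
            s += xs[i]
        return s
    mid = (lo + hi) // 2
    return pairwise_sum(xs, lo, mid) + pairwise_sum(xs, mid, hi)
```

Because every addition combines sums over similarly sized ranges, the rounding error grows roughly with log n instead of n.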

If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel.
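The loop restructuring looks like this; in Python it only illustrates the shape, since the payoff comes in C, where four independent accumulators let the FPU overlap additions (the function name is mine):

```python
def sum_four_wide(xs):
    """Sum with four independent accumulators, then combine them."""
    a = b = c = d = 0.0
    n = len(xs) - len(xs) % 4  # largest multiple of 4
    for i in range(0, n, 4):
        a += xs[i]
        b += xs[i + 1]
        c += xs[i + 2]
        d += xs[i + 3]
    for j in range(n, len(xs)):  # leftover tail
        a += xs[j]
    return (a + b) + (c + d)
```

Since the four running totals carry no dependency on each other within an iteration, a C compiler can schedule their adds in parallel; the partial sums also stay closer in magnitude than one monolithic total.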

beliavsky@aol.com
Guest
Posts: n/a

 08-02-2004
Yaroslav Bulatov wrote:
> I made an array of 10 million floats timed how long it takes to sum
> the elements, here's what I got (millis):
>
> gcc -O2: 21
> Python with numarray: 104
> Python with Numeric: 302
> java: 325
> gcc: 348
> Python with Psyco: 1317
> Pure Python using sum: 2312
> Pure Python: 5631
>
> http://yaroslav.hopto.org/russianwik...-numeric-speed
>
> numarray takes over Numeric at about 1100 floats
>
> I'm doing intensive computation on arrays in Python, so if you have
> suggestions on Python/C solutions that could push the envelope, please
> let me know.

What hardware are you using?

On a 2.8 GHz Intel Pentium 4, a Fortran 95 code compiled with Compaq
Visual Fortran 6.6C using the -optimize:5 option takes 40
milliseconds. Probably you could improve the speed of the C program by
compiling it with the Intel C compiler. The Fortran program speed is
constrained by array accesses, not the additions (which are done in
hardware). A program to compute sum(xx**4) takes the same time as one
to compute sum(xx).

program xsum_double
! sum 1e7 doubles
implicit none
integer, parameter :: n = 10000000
real(kind=8) :: xx(n)
real :: t1,t2,xsum
call random_seed()
call random_number(xx)
call cpu_time(t1)
xsum = sum(xx)
call cpu_time(t2)
print*,1000*(t2-t1),xsum
end program xsum_double

Yaroslav Bulatov
Guest
Posts: n/a

 08-02-2004
Duncan Booth wrote in message
news:<Xns95395FD0EF85duncanrcpcouk@127.0.0.1>...
> Christopher T King wrote:
>
> > On 1 Aug 2004, Duncan Booth wrote:
> >
> >> I just had a brief look at the code you posted. Are you not concerned
> >> about accuracy in any of your calculations? Summing a 10 million
> >> element array by simply accumulating each element into a running
> >> total is guaranteed to give a lousy result.

> >
> > Lousy or not, I believe that's how numarray is implemented internally,
> > so at least all the benchmarks are the same. If you want accuracy
> > summing that many numbers, you're going to have to do it in software
> > (likely by summing each mantissa bit individually and reconstructing
> > the float afterward), so it will be abysmally slow (obviously not what
> > the OP wants).
> >

>
> My point being that speed isn't everything. Most applications doing large
> floating point calculations should be concerned about accuracy, and how not
> to add up a large array of floating point numbers was one of the first
> things I was taught in computer science classes. The fastest way to sum 10
> million numbers (albeit at some considerable loss of accuracy):
>
> return 0
>
>
> The 'correct' way to sum a large array of floats is, of course, to
> calculate a lot of partial sums and sum them. For example, naively, we
> might say that to sum the array we sum the first half, sum the second half
> and add the results together. This definition being recursive ensures that
> if the inputs are all of a similar size then all the intermediate
> calculations involve summing similarly sized values. A more sophisticated
> technique might also try to handle the case where not all values are a
> similar size.
>
> If you are worried about speed, then calculating partial sums is also the
> way to go: the naive technique used by the original poster leaves no scope
> for parallel calculation. It should be possible to write a slightly more
> complex loop in the C version that runs a lot faster on systems that are
> capable of performing multiple floating point instructions in parallel.

You are right, naive summing generates significant accuracy losses.
I estimated the error by summing [0.123456789012345]*1000000 and
comparing it to
1234567.89012345. All methods have an error of about 1e-4.

The method below sums the array at the same speed as a regular Python
sum loop, but reports an error below 1e-15.

from math import ceil, log

def sum2(arr):
    # in-place pairwise summation: on pass itr, fold element i+2**itr into i
    size = len(arr)
    for itr in range(int(ceil(log(size) / log(2)))):
        for i in xrange(0, size, 2 ** (itr + 1)):
            next_i = i + 2 ** itr
            if next_i < size:
                arr[i] += arr[next_i]
    return arr[0]

When arbitrary-precision numbers are used, this method has performance
O(nd) vs. O(n(d + log d)) for regular summing, where n is the size of
the array and d is the number of digits of the elements. In practice I
got about a 20% performance increase when summing an array of 1000
Decimals using the method above.

A better estimate of the error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate the
exact answer in any reasonable amount of time. (Using Decimals takes
over a second for a 4000-element array, with conversion.)
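One way to sanity-check these error figures without Decimal is math.fsum, which returns the correctly rounded sum of the actual floats; a hypothetical checker might look like this:

```python
from math import fsum

def naive_sum(xs):
    """Plain left-to-right accumulation, as in the original benchmarks."""
    s = 0.0
    for x in xs:
        s += x
    return s

data = [0.123456789012345] * 1_000_000
exact = fsum(data)  # correctly rounded sum of the floats actually stored
naive_err = abs(naive_sum(data) - exact)
```

On this input the naive total differs measurably from the exact sum, while a pairwise method should agree with fsum to within a few ulps.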

Yaroslav

Yaroslav Bulatov
Guest
Posts: n/a

 08-02-2004
Christopher T King wrote:
> On 31 Jul 2004, Yaroslav Bulatov wrote:
>
> > I'm doing intensive computation on arrays in Python, so if you have
> > suggestions on Python/C solutions that could push the envelope, please
> > let me know.

>
> If you're doing mostly vector calculations as opposed to summing, I've
> been doing some work on adding SIMD support to numarray, with pleasing
> results (around 2x speedups). I've also done some work adding local
> parallel processing support to numarray, with not-so-pleasing results
> (mostly due to Python overhead).
>
>
> numarray should be just as fast as the -O2 C version. I was puzzled at
> first as to where the speed discrepancy came from, but the culprit is in
> the -O2 flag: gcc -O2 noticies that sum is never used, and thus removes
> the loop entirely. As a matter of fact, there isn't even any fadd
> instruction in the assembler output:
>
> call clock
> movl %eax, %esi
> movl $9999999, %ebx
> .L11:
> decl %ebx
> jns .L11
> subl $16, %esp
> call clock
>
> As you can see, the 21ms you're seeing is the time spent counting down
> from 9,999,999 to 0. To obtain correct results, add a line such as
> 'printf("%f\n",sum);' after the main loop in the C version. This will
> force gcc to leave the actual calculation in place and give you accurate
> results.
>
> The above fix will likely render numarray faster than the C version.
> Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
> numarray uses.

You are right, how silly of me! Fixing the script now results in 130
millis mean, 8.42 millis standard deviation, which is slower than
numarray (104, 2.6 respectively). I wonder why numarray gives faster
results on such a simple task?

> Is there any reason why in the Python/numarray version, you use
> Numeric's RandomArray rather than numarray.random_array? It shouldn't
> affect your results, but it would speed up initialization time a bit.

There isn't a good reason, I simply didn't know about
numarray.random_array

>
> There are a few inefficiencies in the pytime module (mostly involving
> range() and *args/**kwargs), but I don't think they'll have too big of
> an impact on the results. It may be worth re-running the tests using
> Psyco to remove much of the Python overhead.
>
> For completeness, I'd also suggest both running the Java version using a
> JIT compiler such as Kaffe, and compiling it natively using gcj (the
> latter should approach the speed of C).

Tim Peters
Guest
Posts: n/a

 08-02-2004
[Yaroslav Bulatov]
....
> A better estimate of error difference would use random digits as
> opposed to [x]*10000000, but I don't know how I would calculate exact
> answer in any reasonable amount of time. (using Decimals takes over a
> second for 4000 array (with conversion))

The relevant technique is "floating point distillation":

It's possible to compute the exact sum of a list of floats using float
arithmetic, where, ignoring overflow, the result is a minimal set of
fp numbers whose mathematical (infinite precision) sum exactly equals
the mathematical sum of the inputs.

In the worst case the best of these methods can require O(log N)
passes over the list of N inputs, but in real life they usually
converge to the result in two passes independent of N, sometimes with
a third pass but over a very short list of floats.

The better-known "Kahan summation" technique is essentially a
partially-broken computation of the two highest-order components of
one distillation pass.

A case like summing [1e100] + [1.0] * (N-1) shows that left-to-right
can deliver arbitrarily bad results (the fp sum of that is 1e100
regardless of N).
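Both effects are easy to demonstrate: math.fsum does an exact, distillation-style sum, and a few lines give Kahan's compensated loop (the function name is mine):

```python
from math import fsum

def kahan_sum(xs):
    """Kahan compensated summation: c carries the low-order error."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c        # re-inject previously lost low-order bits
        t = s + y        # big + small: low bits of y may be rounded away here
        c = (t - s) - y  # recover what was just lost
        s = t
    return s

vals = [0.1] * 10
# sum(vals) drifts off 1.0, while fsum(vals) is exactly 1.0;
# and sum([1e100] + [1.0] * 9) stays 1e100, since each +1.0 rounds away.
```

The compensated loop keeps the error bounded by a couple of ulps independent of N, but as noted above it is only a partial fix: in the 1e100-first case the compensation term itself underflows relative to the total.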

Christopher T King
Guest
Posts: n/a

 08-03-2004
On 2 Aug 2004, Yaroslav Bulatov wrote:

> You are right, how silly of me! Fixing the script now results in 130
> millis mean, 8.42 millis standard deviation, which is slower than
> numarray (104, 2.6 respectively). I wonder why numarray gives faster
> results on such a simple task?

The reason is that your C loop uses array indexing, whereas numarray's
increments the array pointer on each iteration through the loop and
forgoes any indexing. Even though this translates into about the same
number of opcodes on x86, the latter is slightly faster because it
skips the (implied) bit-shift operation. Also don't forget to compile
your C version with -O3, as this can make a big difference in that
little loop (though maybe not in this case).
In principle, for simple access patterns like summation, etc., numarray
will always be faster than your "typical" C implementation, since its core
vector functions were designed to be very efficient. Of course, an
efficient C implementation will be as fast or faster, but you'd have to
use those efficiency tricks in every vector operation you perform, which
could be quite tedious without writing your own library. (The above
generalization of course only applies to very large arrays.)

Either way, your results are quite promising -- they show that Python with
numarray is as good or better than C when it comes to vector processing
applications, all while being easier to use. That's good publicity!