Velocity Reviews > C++ > Hi, how can I optimize the following code?

# Hi, how can I optimize the following code?

kenneth
Guest
Posts: n/a

 12-11-2006
I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I'm not sure whether this code really can be optimized, can it?
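[For the multithreading part of the question, one common scheme is to split the outer loop across threads, each accumulating into its own partial sum so the hot loop needs no locking. A sketch with C++11 std::thread; the function name, the interleaved work split, and the assumption that r is a global array are illustrative, not from the original post:]

```cpp
#include <cassert>
#include <cmath>
#include <thread>
#include <vector>

double r[3][500];   // coordinates, assumed to be a global as in the post

// Each thread handles an interleaved subset of the outer-loop indices
// and accumulates into its own slot of `partial`; the partial sums are
// combined once at the end, so no locking is needed in the inner loop.
double computePotThreaded(int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([t, nthreads, &partial] {
            double sum = 0.0;
            for (int i = t; i < 500; i += nthreads) {
                for (int j = 0; j < i - 1; ++j) {  // bounds as in the post
                    double dx = r[0][j] - r[0][i];
                    double dy = r[1][j] - r[1][i];
                    double dz = r[2][j] - r[2][i];
                    sum += 1.0 / std::sqrt(dx*dx + dy*dy + dz*dz);
                }
            }
            partial[t] = sum;
        });
    }
    for (std::thread& w : workers) w.join();
    double pot = 0.0;
    for (int t = 0; t < nthreads; ++t) pot += partial[t];
    return pot;
}
```

Because each thread sums its own subset, the additions happen in a different order than in the serial loop, so the result can differ from the serial one by a few units in the last place.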

Noah Roberts
Guest
Posts: n/a

 12-11-2006

kenneth wrote:
> I was trying to use multiple thread to optimize my following code, but
> met some problems, anyone can help me?
>
> k[3][500] are initialized.
>
> int computePot() {
> int i, j;
>
> for( i=0; i<500; i++ ) {
> for( j=0; j<i-1; j++ ) {
> distx = pow( (r[0][j] - r[0][i]), 2 );
> disty = pow( (r[1][j] - r[1][i]), 2 );
> distz = pow( (r[2][j] - r[2][i]), 2 );
> dist = sqrt( distx + disty + distz );
> pot += 1.0 / dist;
> }
> }
>
> I doubt that If this code really can be optimized?

You can quite often optimize an O(N^2) operation by coming up with an
algorithm that is not O(N^2). The compiler will hit your loops and
optimize them quite a bit, but that doesn't resolve the problem of
having an algorithm that is quadratic in N. I don't know what you are
trying to do; you might describe the problem in comp.programming and
see if they have some ideas.

In other words, look to the algorithms you are using not ways to
micro-optimize.

Tommi Höynälänmaa
Guest
Posts: n/a

 12-11-2006
kenneth wrote:
> int computePot() {
> int i, j;
>
> for( i=0; i<500; i++ ) {
> for( j=0; j<i-1; j++ ) {
> distx = pow( (r[0][j] - r[0][i]), 2 );
> disty = pow( (r[1][j] - r[1][i]), 2 );
> distz = pow( (r[2][j] - r[2][i]), 2 );
> dist = sqrt( distx + disty + distz );
> pot += 1.0 / dist;
> }
> }
>
> I doubt that If this code really can be optimized?
>

I can't say how to optimize for multithreading, but
don't use the pow function to compute squares.
pow takes the exponent as a floating-point number and may be
implemented with exponential and logarithm functions.

Use something like
----
inline double SquareDouble(double r)
{
    return r * r;
}
----
or
----
#define square(x) ((x)*(x))
----

If you use GNU C consider using the const attribute:
----
__attribute__((const)) inline double SquareDouble(double r)
{
    return r * r;
}
----
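[With such a helper, the pair contribution from the inner loop of the original post might look like the sketch below; the function name pairPotential is illustrative, and the parameter of SquareDouble is renamed to x so it doesn't shadow the r array:]

```cpp
#include <cassert>
#include <cmath>

// Parameter named x to avoid shadowing the coordinate array r.
inline double SquareDouble(double x) { return x * x; }

// Sketch: one pair's contribution from the original inner loop,
// with pow(value, 2) replaced by a plain multiplication.
double pairPotential(double r[3][500], int i, int j) {
    double distx = SquareDouble(r[0][j] - r[0][i]);
    double disty = SquareDouble(r[1][j] - r[1][i]);
    double distz = SquareDouble(r[2][j] - r[2][i]);
    return 1.0 / std::sqrt(distx + disty + distz);
}
```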

--
Tommi Höynälänmaa
e-mail: (E-Mail Removed)
homepage: http://www.iki.fi/tohoyn/

Jacek Dziedzic
Guest
Posts: n/a

 12-11-2006
kenneth wrote:
> I was trying to use multiple thread to optimize my following code, but
> met some problems, anyone can help me?

Multithreading needs platform-specific solutions, so that part is
out of the scope of this NG.

> k[3][500] are initialized.

What's k?

> int computePot() {
> int i, j;
>
> for( i=0; i<500; i++ ) {
> for( j=0; j<i-1; j++ ) {
> distx = pow( (r[0][j] - r[0][i]), 2 );
> disty = pow( (r[1][j] - r[1][i]), 2 );
> distz = pow( (r[2][j] - r[2][i]), 2 );
> dist = sqrt( distx + disty + distz );
> pot += 1.0 / dist;
> }
> }
>
> I doubt that If this code really can be optimized?

Sure it can. First of all, forget that 'pow' thing.
To get the square, just multiply the value by itself.
This will definitely make it much faster.

Another thing is that you don't need to index the i-th
elements inside the inner loop; you can just as well do it
outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.
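[Put together, the hoisted loop might read as below; a sketch assuming r and pot are globals as in the original post, with pow replaced by multiplication as suggested above:]

```cpp
#include <cassert>
#include <cmath>

double r[3][500];    // assumed globals, as in the original post
double pot = 0.0;

// Sketch of the hoisted loop: r[0][i], r[1][i] and r[2][i] are read
// once per outer iteration instead of once per pair, and each
// coordinate difference is computed once and squared by multiplication.
void computePotHoisted() {
    for (int i = 0; i < 500; ++i) {
        const double r0i = r[0][i];
        const double r1i = r[1][i];
        const double r2i = r[2][i];
        for (int j = 0; j < i - 1; ++j) {   // bounds as in the post
            const double dx = r[0][j] - r0i;
            const double dy = r[1][j] - r1i;
            const double dz = r[2][j] - r2i;
            pot += 1.0 / std::sqrt(dx*dx + dy*dy + dz*dz);
        }
    }
}
```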

HTH,
- J.

Chris Theis
Guest
Posts: n/a

 12-11-2006
"kenneth" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
>I was trying to use multiple thread to optimize my following code, but
> met some problems, anyone can help me?
>
> k[3][500] are initialized.
>
> int computePot() {
> int i, j;
>
> for( i=0; i<500; i++ ) {
> for( j=0; j<i-1; j++ ) {
> distx = pow( (r[0][j] - r[0][i]), 2 );
> disty = pow( (r[1][j] - r[1][i]), 2 );
> distz = pow( (r[2][j] - r[2][i]), 2 );
> dist = sqrt( distx + disty + distz );
> pot += 1.0 / dist;
> }
> }
>
> I doubt that If this code really can be optimized?

Hi,

in principle I would let the compiler worry about that and not try to
micro-optimize. However, you can move the assignments outside to register
variables, as Jacek already showed, and get rid of the pow(). Additionally,
it's cheap to check whether some "manual" loop-unrolling via meta-templates
might get you some speed.
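[The meta-template unrolling mentioned above might look something like the sketch below; the Unroll template and the step count of 4 are illustrative. A 2006-era version would pass a functor object where this sketch uses a lambda:]

```cpp
#include <cassert>

// "Manual" loop unrolling via a recursive template: Unroll<N>::run(base, f)
// expands to f(base), f(base+1), ..., f(base+N-1) at compile time,
// with no runtime loop counter for the inner steps.
template <int N>
struct Unroll {
    template <typename F>
    static void run(int base, F f) {
        f(base);
        Unroll<N - 1>::run(base + 1, f);
    }
};

template <>
struct Unroll<0> {
    template <typename F>
    static void run(int, F) {}   // recursion terminator
};

// Illustrative use: sum an array four elements per loop iteration.
double sumUnrolled(const double* a, int n) {  // n assumed divisible by 4
    double s = 0.0;
    for (int i = 0; i < n; i += 4)
        Unroll<4>::run(i, [&s, a](int k) { s += a[k]; });
    return s;
}
```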

Cheers
Chris

Nikolaos D. Bougalis
Guest
Posts: n/a

 12-11-2006
Jacek Dziedzic wrote:

> Another thing is that you don't need to index out
> the second vector inside the inner loop, you can as
> well do it outside, like in:
>
> for(i) {
> register double r0i=r[0][i]; // now evaluated only 500 times
> register double r1i=r[1][i];
> register double r2i=r[2][i];
> for(j) {
> distx = (r[0][j]-r0i)*(r[0][j]-r0i);
> // ...
> }
> }
>
> That should give your code a boost.

Doubtful. Any compiler worth its salt would have hoisted the evaluations out
of the loop by itself, just as any compiler worth its salt will convert
'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into 'distx = (r[0][j]-r0i); distx *=
distx;'.

But the fact is that code like what the OP showed /rarely/ benefits from
micro-optimizations; that's not to say that trying to make such optimizations
isn't necessary at times -- only that it is a much better idea to focus on
the algorithm and try to make it simpler.

-n

Ian Collins
Guest
Posts: n/a

 12-12-2006
kenneth wrote:
> I was trying to use multiple thread to optimize my following code, but
> met some problems, anyone can help me?
>
> k[3][500] are initialized.
>
> int computePot() {
> int i, j;
>
> for( i=0; i<500; i++ ) {
> for( j=0; j<i-1; j++ ) {
> distx = pow( (r[0][j] - r[0][i]), 2 );
> disty = pow( (r[1][j] - r[1][i]), 2 );
> distz = pow( (r[2][j] - r[2][i]), 2 );
> dist = sqrt( distx + disty + distz );
> pot += 1.0 / dist;
> }
> }
>
> I doubt that If this code really can be optimized?
>

You might be able to get some gains.

--
Ian Collins.

Jacek Dziedzic
Guest
Posts: n/a

 12-12-2006
Nikolaos D. Bougalis wrote:
> Jacek Dziedzic wrote:
>
>> Another thing is that you don't need to index out
>> the second vector inside the inner loop, you can as
>> well do it outside, like in:
>>
>> for(i) {
>> register double r0i=r[0][i]; // now evaluated only 500 times
>> register double r1i=r[1][i];
>> register double r2i=r[2][i];
>> for(j) {
>> distx = (r[0][j]-r0i)*(r[0][j]-r0i);
>> // ...
>> }
>> }
>>
>> That should give your code a boost.

>
> Doubtful. Any compiler worth its salt would have hoisted the
> evaluations out of the loop by itself. Just how any compiler worth its
> salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into: 'distx =
> (r[0][j]-r0i); distx *= distx;'

I disagree. First of all, I doubt any compiler would turn
pow(double,2) into double*double.

Secondly, the conversion of distx=something*something into
distx=something; distx*=distx does not necessarily speed
anything up; that would depend on the computer architecture,
so I doubt "any compiler worth its salt will convert" it.

Thirdly, even though a good compiler is supposed to
move out of the loop whatever does not depend on the loop
variable, my experience shows it is worth trying
to help it. I have recently spent a few days, if not weeks,
having to micro-optimize tight loops, working with a compiler
that is rather good at optimizing, the Intel compiler for
Itanium 2 (this platform _needs_ good optimization at
compiler level). I think you would be surprised at how
dumb the compiler sometimes was, or rather, at how typical,
simple, out-of-the-book optimizations made the code
faster even at -O3.

Therefore, since the OP asked for ways to optimize, I gave
him two hints. I'm 99% sure they won't slow the code down
and pretty sure they will speed it up. I'm too lazy to try
it out myself, though; anyway, it depends on the platform.

> But the fact is that code like what the OP showed /rarely/ benefits
> from micro-optimizations; that's not to say that trying to make such
> optimizations aren't necessary at times -- only that it is a much better
> idea to focus on the algorithm and try to make it simpler.

Of course. But I assumed the OP had run out of options on this one.
For example, this looked like a calculation of a harmonic potential.
Typically for calculations of a potential one assumes a cutoff
radius, beyond which the interactions are assumed to vanish.
Here, such an algorithmic optimization would turn this O(N^2) into
O(N). However, with the harmonic potential it is impossible,
because it decays too slowly. That led me to the conclusion that
the OP had already run out of options for algorithmic optimization
and is left with micro-optimizations.
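[For potentials where a cutoff is physically justified, the test itself is cheap, because the squared distance can be compared against the squared cutoff, avoiding the sqrt for rejected pairs. A sketch; the function name and the cutoff parameter rc are illustrative, and the O(N) cell-list bookkeeping is deliberately omitted:]

```cpp
#include <cassert>
#include <cmath>

// Sketch: only pairs within a cutoff radius rc contribute. Comparing
// squared distances against rc*rc avoids the sqrt for rejected pairs.
// (A genuinely O(N) version would also bin the points into cells of
// side rc and search neighbouring cells only; that bookkeeping is
// omitted here. Loop bounds as in the original post.)
double computePotCutoff(double r[3][500], double rc) {
    const double rc2 = rc * rc;
    double pot = 0.0;
    for (int i = 0; i < 500; ++i) {
        for (int j = 0; j < i - 1; ++j) {
            const double dx = r[0][j] - r[0][i];
            const double dy = r[1][j] - r[1][i];
            const double dz = r[2][j] - r[2][i];
            const double d2 = dx*dx + dy*dy + dz*dz;
            if (d2 < rc2)
                pot += 1.0 / std::sqrt(d2);
        }
    }
    return pot;
}
```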

howgh,
- J.

Nikolaos D. Bougalis
Guest
Posts: n/a

 12-12-2006
Jacek Dziedzic wrote:
> Nikolaos D. Bougalis wrote:
>> Jacek Dziedzic wrote:
>>
>>> Another thing is that you don't need to index out
>>> the second vector inside the inner loop, you can as
>>> well do it outside, like in:
>>>
>>> for(i) {
>>> register double r0i=r[0][i]; // now evaluated only 500 times
>>> register double r1i=r[1][i];
>>> register double r2i=r[2][i];
>>> for(j) {
>>> distx = (r[0][j]-r0i)*(r[0][j]-r0i);
>>> // ...
>>> }
>>> }
>>>
>>> That should give your code a boost.

>>
>> Doubtful. Any compiler worth its salt would have hoisted the
>> evaluations out of the loop by itself. Just how any compiler worth its
>> salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into: 'distx
>> = (r[0][j]-r0i); distx *= distx;'

>
> I disagree. First of all, I doubt any compiler would turn
> pow(double,2) into double*double.

I said the compiler would hoist the r[0][i], r[1][i] and r[2][i] out of the
loop. I said nothing about converting the pow(d,2) into d*d, although I don't
see why a compiler could not do that, if it had intrinsic support for pow()
and understood the semantics of the pow() call.

> Secondly, the conversion distx=something*something into
> distx=something, distx*=distx does not necessarily speed up
> anything, that would depend on the computer architecture,
> so I doubt "any compiler worth its salt will convert" it.

You are misrepresenting what I said. In your code "something" was a complex
expression that required evaluation. Do you honestly believe that any decent
compiler would calculate "a-b" twice to do (a-b)*(a-b)? Common subexpression
elimination is one of the most basic optimizations a compiler can and does
perform.

Basic optimizations aside, you would really be surprised what compilers can
do nowadays. I was quite stunned to see the latest CL.EXE, for example, do some
crazy tricks in a math-intensive function, shaving off two divisions by using
multiplications and additions instead and managing to keep both the U and V
pipes going full speed, significantly improving throughput. I couldn't even
come close to matching it, and I'm not a newbie at the optimization game.

> Thirdly, even though a good compiler is supposed to
> move outside of the loop what does not depend on the loop
> iterating variable, my experience show it is worth trying
> to help it. I have recently spent a few days, if not weeks,
> having to micro-optimize tight loops, working with a compiler
> that is rather good at optimizing, the Intel compiler for
> Itanium 2 (this platform _needs_ good optimization at
> compiler level). I think you would be surprised at how
> dumb sometimes the compiler was, or rather, how typical,
> simple and out-of-the-book optimizations made the code
> faster even at -O3.

I don't disagree that hand-optimization and a helping hand to the compiler
are necessary even today. What I am saying, is that the manual hoist operation
you showed and most such trivial optimizations are unlikely to have any
significant effect. The compiler will take care of simple things like that if
you compile with optimizations enabled and you don't use volatile every other
line.

A very good example of optimizations that programmers can make but compilers
cannot is easily demonstrated with this little function that sums a floating
point array (extern float *huge_array):

The naive approach is this:

float array_sum()
{
    float sum = 0;

    for(int i = 0; i < huge_array_size; i++)
        sum += huge_array[i];

    return sum;
}

The better approach takes advantage of the additions being
commutative and associative ("abelian"), something a compiler cannot
exploit because of restrictions placed by the ANSI/ISO C and C++
standards:

float array_sum()
{
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    /* assumes huge_array_size is a multiple of 4 */
    for(int i = 0; i < huge_array_size; i += 4)
    {
        sum0 += huge_array[i];
        sum1 += huge_array[i + 1];
        sum2 += huge_array[i + 2];
        sum3 += huge_array[i + 3];
    }

    return (sum0 + sum2) + (sum1 + sum3);
}
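[The restriction in question is that floating-point addition is not associative, so a conforming compiler may not regroup the sums on its own. A minimal demonstration, with values chosen so the regrouping visibly changes the result:]

```cpp
#include <cassert>

// The same three values summed with different grouping: with
// a = 1e8f, b = -1e8f, c = 1.0f, (a + b) + c gives 1.0f, while
// a + (b + c) gives 0.0f, because c is lost when added to b --
// 1.0f is below the rounding granularity of float near 1e8.
float sumLeft(float a, float b, float c)  { return (a + b) + c; }
float sumRight(float a, float b, float c) { return a + (b + c); }
```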

-n

Jacek Dziedzic
Guest
Posts: n/a

 12-12-2006
Nikolaos D. Bougalis wrote:
> Jacek Dziedzic wrote:
>> Nikolaos D. Bougalis wrote:
>>> Jacek Dziedzic wrote:
>>>
>>>> Another thing is that you don't need to index out
>>>> the second vector inside the inner loop, you can as
>>>> well do it outside, like in:
>>>>
>>>> for(i) {
>>>> register double r0i=r[0][i]; // now evaluated only 500 times
>>>> register double r1i=r[1][i];
>>>> register double r2i=r[2][i];
>>>> for(j) {
>>>> distx = (r[0][j]-r0i)*(r[0][j]-r0i);
>>>> // ...
>>>> }
>>>> }
>>>>
>>>> That should give your code a boost.
>>>
>>> Doubtful. Any compiler worth its salt would have hoisted the
>>> evaluations out of the loop by itself. Just how any compiler worth
>>> its salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into:
>>> 'distx = (r[0][j]-r0i); distx *= distx;'

>>
>> I disagree. First of all, I doubt any compiler would turn
>> pow(double,2) into double*double.

>
> I said the compiler would hoist the r[0][i], r[1][i] and r[2][i] out
> of the loop. I said nothing about converting the pow(d,2) into d*d,

Yes, but you said "doubtful" to my statement "that should give
your code a boost". What I meant was: for this code to give no
boost, the compiler would have to turn pow(double,2) into
double*double. Therefore, by being doubtful about this code's
potential boost, you imply the compiler does exactly that.

> although I don't see why a compiler could not do that, if it had
> intrinsic support for pow() and understood the semantics of the pow()
> call..

OK, perhaps a smart compiler could do that. But honestly
I wouldn't expect it.

>> Secondly, the conversion distx=something*something into
>> distx=something, distx*=distx does not necessarily speed up
>> anything, that would depend on the computer architecture,
>> so I doubt "any compiler worth its salt will convert" it.

>
> You are misrepresenting what I said. In your code "something" was a
> complex expression that required evaluation. Do you honestly believe
> that any decent compiler would calculate "a-b" twice to do (a-b)*(a-b)?

No, of course I don't believe that. This was a misunderstanding.
I meant that (a-b) would be stored into a register, there squared
and sent to distx. I somehow thought you implied that a memory
location would be squared, rather than a register.

> Common subexpression elimination is one of the most basic optimizations
> a compiler can and does perform.

Yes, I don't claim a-b will be evaluated twice.

> Basic optimizations aside, you would really be surprised what
> compilers can do nowadays. I was quite stunned to see the latest CL.EXE
> for example, do some crazy tricks in a math intensive function, shaving
> off two divisions by using multiplications and additions instead and
> managing to keep both the U and V pipes going full speed significantly
> improving throughput. I couldn't even come close to matching it and I'm
> not a newbie at the optimization game.

Yes, I also usually trust the compiler to optimize better, especially
since for the Itanium they are supposed to be good at it. Maybe I was
lucky then that I managed to squeeze some extra ticks from the code
I was working on by simple tricks like changing

for(i) a[i]=something;
into
while(i++) *(a_ptr+i)=something;

or the other way round.

> I don't disagree that hand-optimization and a helping hand to the
> compiler are necessary even today. What I am saying, is that the manual
> hoist operation you showed and most such trivial optimizations are
> unlikely to have any significant effect. The compiler will take care of
> simple things like that if you compile with optimizations enabled and
> you don't use volatile every other line.

OK, then that was just a suggestion for the OP to try and see for
himself. I think neither I can make a general statement "taking
things out of the loop will make things faster", nor can you make
a general statement "taking things out of the loop will not make
things faster" -- it all depends on the compiler.

We'll never know unless the OP tries.

I could have as well said "make sure you have -O3" and then
somebody could say "come on, it's on by default" -- just a
suggestion to try out.

>
> A very good example of optimizations that programmers can make but
> compilers cannot is easily demonstrated with this little function that
> sums a floating point array (extern float *huge_array):
>
> The naive approach is this:
>
> float array_sum()
> {
> float sum = 0;
>
> for(int i = 0; i < huge_array_size; i++)
> sum += huge_array[i];
>
> return sum;
> }
>
> The better approach takes advantage of abelian operations,
> something a compiler cannot do because of restrictions placed by
> the ANSI/ISO C and C++ standards:
>
> float array_sum()
> {
> float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
>
> for(int i = 0; i < huge_array_size; i += 4)
> {
> sum0 += huge_array[i];
> sum1 += huge_array[i + 1];
> sum2 += huge_array[i + 2];
> sum3 += huge_array[i + 3];
> }
>
> return (sum0 + sum2) + (sum1 + sum3);
> }

OK.

- J.