Velocity Reviews > Computing a distance matrix using SSE2

Computing a distance matrix using SSE2

Stefano Teso
Guest
Posts: n/a

 08-04-2009

Hi,

First of all I don't know if this is the right group to post in, sorry
if it is not; any suggestion on the right place for this kind of
discussion is warmly welcome.

I am writing a stochastic optimization algorithm which, as part of
the energy calculation procedure, requires computing the squared
distance between one 3D vector and each of an array of n other 3D
vectors. This is the most computationally intensive part of the
algorithm, so I wrote it using SSE2 intrinsics as defined in
<emmintrin.h>, using GCC 4.4. The vector array is stored in SoA
format, with all components (x, y, z) aligned to a 16-byte boundary,
and the same goes for the distance array.

The current code looks basically as follows:

static void
calc_distance_row (double *darray, const double *xarray, const double
*yarray, const double *zarray, const int i, const int lo, const int
hi)
{

/* computes the *square* of the distance between a fixed vector i and
all vectors j in the range [lo, hi), whose coords are specified in
xarray, yarray, zarray */

/* NOTE: dist2 is a rather obvious C function that uses no intrinsics
*/

int j, lo_aligned, hi_aligned;
__m128d ix, iy, iz, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;
const double *x, *y, *z;
double *d;

lo_aligned = (lo & ~3);
hi_aligned = (hi & ~3);

for (j = lo; j < lo_aligned; j++)
darray[j] = dist2 (xarray[i], yarray[i], zarray[i], xarray[j],
yarray[j], zarray[j]);

x = &xarray[lo_aligned];
y = &yarray[lo_aligned];
z = &zarray[lo_aligned];
d = &darray[lo_aligned];

ix = _mm_load1_pd (&xarray[i]);
iy = _mm_load1_pd (&yarray[i]);
iz = _mm_load1_pd (&zarray[i]);

for (j = lo_aligned; j < hi_aligned; j += 4) {

jx = _mm_load_pd (&x[0]);
jy = _mm_load_pd (&y[0]);
jz = _mm_load_pd (&z[0]);

dx = _mm_sub_pd (ix, jx);
dy = _mm_sub_pd (iy, iy);
dz = _mm_sub_pd (iz, iz);

mx = _mm_mul_pd (dx, dx);
my = _mm_mul_pd (dy, dy);
mz = _mm_mul_pd (dz, dz);

s0 = _mm_add_pd (mx, my);
s1 = _mm_add_pd (s0, mz);

_mm_store_pd (&d[0], s1);

/* the above chunk of code is replicated once with '2' in
place of '0' in the array indices */

x = &x[4];
y = &y[4];
z = &z[4];
d = &d[4];
}

for (j = hi_aligned; j < hi; j++)
darray[j] = dist2 (xarray[i], yarray[i], zarray[i], xarray[j],
yarray[j], zarray[j]);
}

(There may be typos in the above routine, anyway the real code was
tested against a simple minded C version, and proved to work as
expected, outperforming it by about 2x.)

The first and third loops are used to handle non-unrollable cases,
while the second loop -- the only one using SSE2 intrinsics -- is
unrolled twice, for a whopping 4 square distances computed per
iteration. The average array size is 100 to 1000 vectors, so no huge
number crunching here.

Is this the best way to perform this operation? Is there any obvious
pitfall in the above code? In particular, how do I check whether I am
incurring unaligned access penalties, and if so, how do I avoid them?
And furthermore, is there any way I could improve performance through
better cache usage / prefetching? I already tried unrolling the middle
loop further, but it yielded no improvement.

I would like to squeeze as much performance as possible from this
routine. Any suggestion is appreciated.

----
Stefano

Hendrik van der Heijden
Guest
Posts: n/a

 08-04-2009

Stefano Teso schrieb:
> I would like to squeeze as much performance as possible from this
> routine. Any suggestion is appreciated.

> dy = _mm_sub_pd (iy, iy);
> dz = _mm_sub_pd (iz, iz);

^ jy,jz ?!

Assembler output would have been nice.

The biggest improvement I can think of would be switching
from doubles to floats, completely or just for the distance
calculation. Do you need double precision?

A small thing: by explicitly loading into the temporaries jx, jy, jz
with _mm_load_pd, you may keep the compiler from using a memory-operand
instruction. The _mm_sub_pd ops could take a memory operand instead of
jx, jy, jz. Try dx = _mm_sub_pd(ix, *(const __m128d *) &x[0]);

Another one: if you do pointer arithmetic like x = &x[4], you have 5
running variables (x, y, z, d, and the loop counter) in the loop,
instead of a single index.

Hendrik vdH

Stefano Teso
Guest
Posts: n/a

 08-04-2009

Hi,

just a quick update. The code posted is indeed bugged, since lo_aligned
is always <= lo. I managed to eliminate the first and last loops by
allowing the computation to go lower than lo (down to lo_aligned) and
higher than hi (up to the newly defined hi_aligned = (hi + 3) & ~3),
and adding extra padding to the distance and coord arrays to handle
the otherwise-out-of-bounds memory accesses when j > hi. The new loop
looks like:

lo_aligned = lo & ~3;
hi_aligned = (hi + 3) & ~3;

for (j = lo_aligned; j < hi_aligned; j += 4) {
<distance code>
}

Any suggestion on how to improve the code is still very welcome.

Thanks,

----
Stefano

user923005
Guest
Posts: n/a

 08-04-2009

On Aug 4, 7:41 am, Stefano Teso
<(E-Mail Removed)> wrote:
> First of all I don't know if this is the right group to post, sorry if
> it is not; any suggestions on the right place for this kind of
> discussion is warmly welcome.
> [snip]

You want news:sci.math.num-analysis for questions about numerical
algorithms.

tni
Guest
Posts: n/a

 08-05-2009

Stefano Teso wrote:

> for (j = lo_aligned; j < hi_aligned; j++) {
>
> jx = _mm_load_pd (&x[0]);
> jy = _mm_load_pd (&y[0]);
> jz = _mm_load_pd (&z[0]);
>
> dx = _mm_sub_pd (ix, jx);
> dy = _mm_sub_pd (iy, iy);
> dz = _mm_sub_pd (iz, iz);

Don't do that, the load isn't necessary. Use proper indexing for the
arrays. (Get rid of x, y, z.)

for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
dx = _mm_sub_pd(ix, *(const __m128d *) &xarray[j]);
....

You did notice your bug (dy and dz are 0), right? I wouldn't be
surprised, if most of your loop was optimized away because of that, so
your performance numbers might be complete nonsense.

> mx = _mm_mul_pd (dx, dx);
> my = _mm_mul_pd (dy, dy);
> mz = _mm_mul_pd (dz, dz);
>
> s0 = _mm_add_pd (mx, my);
> s1 = _mm_add_pd (s0, mz);

Put all that stuff into a single statement (so that the compiler won't
be tempted to do unnecessary loads/stores):

s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
_mm_mul_pd(dz, dz));

> _mm_store_pd (&d[0], s1);

If you don't need the result right away, do a non-temporal store:
_mm_stream_pd(d + j, s);

Actually, get rid of s:
_mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
_mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));

> /* the above chunk of code is replicated once with '2' in
> place of '0' in the array indices */
>
> x = &x[4];
> y = &y[4];
> z = &z[4];
> d = &d[4];

Use proper array indexing. GCC is reasonably intelligent and will likely
make use of the complex addressing modes - no need for 4 pointer
increments.

> The first and third loops are used to handle non-unrollable cases,
> while the second loop -- the only one using SSE2 intrinsics -- is
> unrolled twice, for a whopping 4 square distances computed per
> iteration.

Loop unrolling will likely not buy you anything. On some CPUs, it may
even hurt performance (they have optimizations for small loops).

> Is this the best way to perform this operation? Is there any obvious
> pitfall in the above code? In particular, how do I check that I am
> incurring in unaligned access penalties, and if that is the case, how
> do I avoid them?

If there is an unaligned access, your program will simply crash.

> And even more, is there any way I could improve the
> performance by better cache usage / prefetching?

It's unlikely that prefetching will buy you anything (it may even hurt
performance).

If you control the memory allocation, you should allocate aligned memory
(_mm_malloc, _mm_free) in the first place.

You should really post the assembly. Disassemble your code with objdump.

sfuerst
Guest
Posts: n/a

 08-05-2009

One trick is to notice that |x-y|^2 == |x|^2 + |y|^2 - 2 x.y

This means that if you can pre-process your data, then you can remove
the subtractions from the inner loop, and shrink the critical path.

Simply construct the dot product of x and (-2y), and then add the
result to the saved values of |x|^2 and |y|^2. There are extra
additions, yes, but they may be done in parallel with the multiplies.

Since you only have on the order of a thousand points, the data will
fit in your cache. The extra |x|^2 values hopefully won't add too much
extra memory pressure.

Of course, you probably should benchmark this - sometimes the
"obvious" way is faster.

Steven

Stefano Teso
Guest
Posts: n/a

 08-05-2009

On 4 Aug, 21:54, "christian.bau"
<(E-Mail Removed)> wrote:
> crash if your data isn't aligned, so you will notice if your data
> isn't aligned. malloc () will typically return nicely aligned
> pointers.

I am using posix_memalign to allocate the coord data, so the
_mm_load_pd should never trigger any exceptions.

> You should worry mostly about making your code cache friendly when you
> call this multiple times. And depending on what you do with the
> results, there may be some mathematical ways to save operations.

Ok, but I don't know how to find, on GNU/Linux, which code paths
trigger cache misses when profiling... So I cannot really judge
whether the code is cache friendly or not.

----
Stefano

Stefano Teso
Guest
Posts: n/a

 08-05-2009

Hi,

On 5 Aug, 02:07, tni <(E-Mail Removed)> wrote:
> Stefano Teso wrote:
> > for (j = lo_aligned; j < hi_aligned; j++) {
>
> > jx = _mm_load_pd (&x[0]);
> > jy = _mm_load_pd (&y[0]);
> > jz = _mm_load_pd (&z[0]);
>
> > dx = _mm_sub_pd (ix, jx);
> > dy = _mm_sub_pd (iy, iy);
> > dz = _mm_sub_pd (iz, iz);

>
> Don't do that, the load isn't necessary. Use proper indexing for the
> arrays. (Get rid of x, y, z.)

Done.

>
> for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
> dx = _mm_sub_pd(ix, *(const __m128d *) &xarray[j]);
> ...

a) I removed the unrolling since it didn't improve the performance at
all (actually it didn't worsen it either, but why keep useless code in
the source?).

b) What is

> You did notice your bug (dy and dz are 0), right? I wouldn't be
> surprised, if most of your loop was optimized away because of that, so
> your performance numbers might be complete nonsense.

The bug was introduced when posting the code, it was not present in
the original source.

> > mx = _mm_mul_pd (dx, dx);
> > my = _mm_mul_pd (dy, dy);
> > mz = _mm_mul_pd (dz, dz);
>
> > s0 = _mm_add_pd (mx, my);
> > s1 = _mm_add_pd (s0, mz);
>
> Put all that stuff into a single statement (so that the compiler won't
> be tempted to do unnecessary loads/stores):
>
> s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
> _mm_mul_pd(dz, dz));

I tried this and it doesn't make any difference, performance wise, and
the assembly output looks unaffected. The compiler was probably doing
this already.

> > _mm_store_pd (&d[0], s1);

>
> If you don't need the result right away, do a non-temporal store:
> _mm_stream_pd(d + j, s);
>
> Actually, get rid of s:
> _mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
> _mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));

Unfortunately I need the values of the distance array right after this
function is called, but thanks for the tip. Using it BTW slows down
the function by about 100 cycles.

> Loop unrolling will likely not buy you anything. On some CPUs, it may
> even hurt performance (they have optimizations for small loops).

I just checked and you are right; I re-rolled the loop.

> You should really post the assembly. Disassemble your code with objdump.

Here you go. C source code:

/* static */ void
calc_dist_row (const double *x,
const double *y,
const double *z,
const int i,
const int bandlo,
const int bandhi)
{
__m128d ix, iy, iz;
__m128d dx, dy, dz;
__m128d mx, my, mz;
__m128d dd;
int j, _bandlo, _bandhi;

_bandlo = bandlo & ~1;
_bandhi = (bandhi + 1) & ~1;

ix = _mm_load1_pd (&x[i]);
iy = _mm_load1_pd (&y[i]);
iz = _mm_load1_pd (&z[i]);

for (j = _bandlo; j < _bandhi; j += 2) {

dx = _mm_sub_pd (ix, _mm_load_pd (&x[j]));
dy = _mm_sub_pd (iy, _mm_load_pd (&y[j]));
dz = _mm_sub_pd (iz, _mm_load_pd (&z[j]));

mx = _mm_mul_pd (dx, dx);
my = _mm_mul_pd (dy, dy);
mz = _mm_mul_pd (dz, dz);

dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

_mm_store_pd (&dist_row[j], dd);
}
}

And the assembly:

0000000000000a90 <calc_dist_row>:
a90: 48 63 c9 movsxd rcx,ecx
a93: 41 ff c1 inc r9d
a96: 41 83 e0 fe and r8d,0xfffffffffffffffe
a9a: 41 83 e1 fe and r9d,0xfffffffffffffffe
a9e: f2 0f 12 2c cf movddup xmm5,QWORD PTR [rdi+rcx*8]
aa3: f2 0f 12 24 ce movddup xmm4,QWORD PTR [rsi+rcx*8]
aa8: f2 0f 12 1c ca movddup xmm3,QWORD PTR [rdx+rcx*8]
aad: 45 39 c8 cmp r8d,r9d
ab0: 7d 67 jge b19 <calc_dist_row+0x89>
ab2: 49 63 c0 movsxd rax,r8d
ab5: 48 c1 e0 03 shl rax,0x3
ab9: 48 01 c7 add rdi,rax
abc: 48 01 c6 add rsi,rax
abf: 48 01 c2 add rdx,rax
ac2: 48 03 05 00 00 00 00 add rax,QWORD PTR [rip+0x0]
# ac9 <calc_dist_row+0x39>
ac9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
ad0: 66 0f 28 c5 movapd xmm0,xmm5
ad4: 66 0f 28 d4 movapd xmm2,xmm4
ad8: 66 0f 5c 07 subpd xmm0,XMMWORD PTR [rdi]
adc: 66 0f 5c 16 subpd xmm2,XMMWORD PTR [rsi]
ae0: 66 0f 28 cb movapd xmm1,xmm3
ae4: 66 0f 59 c0 mulpd xmm0,xmm0
ae8: 66 0f 5c 0a subpd xmm1,XMMWORD PTR [rdx]
aec: 66 0f 59 d2 mulpd xmm2,xmm2
af0: 66 0f 59 c9 mulpd xmm1,xmm1
af4: 66 0f 58 c2 addpd xmm0,xmm2
af8: 41 83 c0 02 add r8d,0x2
afc: 66 0f 58 c1 addpd xmm0,xmm1
b00: 48 83 c7 10 add rdi,0x10
b04: 66 0f 29 00 movapd XMMWORD PTR [rax],xmm0
b08: 48 83 c6 10 add rsi,0x10
b0c: 48 83 c2 10 add rdx,0x10
b10: 48 83 c0 10 add rax,0x10
b14: 45 39 c1 cmp r9d,r8d
b17: 7f b7 jg ad0 <calc_dist_row+0x40>
b19: f3 c3 repz ret
b1b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]

Thanks,

----
Stefano

Stefano Teso
Guest
Posts: n/a

 08-05-2009

> dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

Whoops! Ignore the square root here please, pasted the wrong code. The
assembly is the right one though.

----
Stefano

sfuerst
Guest
Posts: n/a

 08-05-2009

After some benchmarking, it looks like this trick actually helps,
making the code 33% faster.

#include <emmintrin.h>

#define SIZE 1024

static double dist_row[SIZE];

__attribute__((noinline)) void calc_dist_row1(const double *x, const
double *y, const double *z, int i, int bandlo, int bandhi)
{
__m128d ix, iy, iz;
__m128d dx, dy, dz;
__m128d mx, my, mz;
__m128d dd;
int j, _bandlo, _bandhi;

_bandlo = bandlo & ~1;
_bandhi = bandhi;

ix = _mm_load1_pd(&x[i]);
iy = _mm_load1_pd(&y[i]);
iz = _mm_load1_pd(&z[i]);

for (j = _bandlo; j < _bandhi; j += 2)
{
dx = _mm_sub_pd(ix, _mm_load_pd(&x[j]));
dy = _mm_sub_pd(iy, _mm_load_pd(&y[j]));
dz = _mm_sub_pd(iz, _mm_load_pd(&z[j]));

mx = _mm_mul_pd(dx, dx);
my = _mm_mul_pd(dy, dy);
mz = _mm_mul_pd(dz, dz);

dd = _mm_add_pd(_mm_add_pd(mx, my), mz);

_mm_store_pd(&dist_row[j], dd);
}
}

__attribute__((noinline)) void calc_dist_row2(const double *x, const
double *y, const double *z, int i, int bandlo, int bandhi)
{
typedef double v2df __attribute__((vector_size(16)));

v2df ix, iy, iz;
v2df dx, dy, dz;
v2df mx, my, mz;
v2df dd;
int j, _bandlo, _bandhi;

_bandlo = bandlo & ~1;
_bandhi = bandhi;

ix = (v2df) {x[i], x[i]};
iy = (v2df) {y[i], y[i]};
iz = (v2df) {z[i], z[i]};

for (j = _bandlo; j < _bandhi; j += 2)
{
dx = ix - *((v2df *) &x[j]);
dy = iy - *((v2df *) &y[j]);
dz = iz - *((v2df *) &z[j]);

mx = dx * dx;
my = dy * dy;
mz = dz * dz;

*((v2df *) &dist_row[j]) = mx + my + mz;
}
}

/* Slower! */
__attribute__((noinline)) void calc_dist_row3(const double *x, const
double *y, const double *z, const double *r, int i, int bandlo, int
bandhi)
{
typedef double v2df __attribute__((vector_size(16)));

v2df ix, iy, iz, ir;
v2df dx, dy, dz, rr;
v2df mx, my, mz;
v2df dd;
v2df mtwo = {-2, -2};

int j, _bandlo, _bandhi;

_bandlo = bandlo & ~1;
_bandhi = bandhi;

ix = (v2df) {x[i], x[i]};
iy = (v2df) {y[i], y[i]};
iz = (v2df) {z[i], z[i]};
ir = (v2df) {r[i], r[i]};

ix *= mtwo;
iy *= mtwo;
iz *= mtwo;

for (j = _bandlo; j < _bandhi; j += 2)
{
rr = *((v2df *) &r[j]) + ir;
dx = *((v2df *) &x[j]) * ix;
dy = *((v2df *) &y[j]) * iy;
dz = *((v2df *) &z[j]) * iz;

*((v2df *) &dist_row[j]) = dx + dy + dz + rr;
}
}

int main(void)
{
int i, j;

double x[SIZE];
double y[SIZE];
double z[SIZE];
double r[SIZE];

for (i = 0; i < SIZE; i++)
{
x[i] = i;
y[i] = i * i;
z[i] = i * i - 137;
r[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
}

for (j = 0; j < 1000; j++)
{
for (i = 0; i < 1000; i++)
{
/* 3.33s */
//calc_dist_row1(x, y, z, i, 0, SIZE);

/* 3.33s */
//calc_dist_row2(x, y, z, i, 0, SIZE);

/* 2.25s */
calc_dist_row3(x, y, z, r, i, 0, SIZE);
}
}
}

The first subroutine is the original version (without the slow sqrt
call which otherwise dominates the timing). The __attribute__
((noinline)) is to prevent the compiler from optimizing the function
calls away. The second subroutine is the same algorithm translated
into gcc's vector types, and thus much easier to read. The final
subroutine uses the distance formula expansion to reduce the critical
path, and allow more calculation in parallel.

In gcc 4.4, the resulting times for the program to run are given in
the comments in main(). Note that gcc 4.3 optimizes less well, and
the third subroutine is actually slower than the first two...
Here is the resulting asm for the inner loop:

Originally, it was:

L1:
movapd %xmm5,%xmm0
movapd %xmm4,%xmm2
movapd %xmm3,%xmm1
subpd (%rdi),%xmm0
subpd (%rsi),%xmm2
subpd (%rdx),%xmm1
mulpd %xmm0,%xmm0
mulpd %xmm2,%xmm2
mulpd %xmm1,%xmm1
addpd %xmm2,%xmm0
addpd %xmm1,%xmm0
movapd %xmm0,(%rax)
cmp %rcx,%rax
jne L1

After the algorithm change:
L1:
movapd (%rdi),%xmm0
movapd (%rsi),%xmm1
mulpd %xmm4,%xmm0
mulpd %xmm3,%xmm1
addpd %xmm1,%xmm0
movapd (%rdx),%xmm1
mulpd %xmm2,%xmm1
addpd %xmm1,%xmm0
movapd (%r9),%xmm1
addpd %xmm5,%xmm1
addpd %xmm1,%xmm0
movapd %xmm0,(%r10)
cmp %rax,%r8
jne L1

Steven
