I am confused. I have just gone through a two-day exercise tuning up an FFT routine, and at the last stage I find that, where the biggest gains were expected, all my conventional understanding of efficient C seems to have been turned on its head.

Up until this point I have achieved good speed improvements with an evolutionary approach: starting from the routine entry point, I modify the code, run it, validate and time a test case, and keep the change that gives the biggest gain (steepest-descent optimisation).

When I started, with a Cooley-Tukey routine, I was at 2.7 ms per 1024-point complex FFT (1 GHz Celeron, Win2000). I am still using a baby C system (Turbo C), but I would be surprised if that were the cause of the problem.

I had reduced this time to 0.6 ms per FFT call by the time I reached the heart of the code, the innermost of the 3 loops, where the data is multiplied by the complex exponentials:

tempr = wr*data[ixj]   - wi*data[ixj+1];   /* tempr + i*tempi = w * data[ixj] */
tempi = wr*data[ixj+1] + wi*data[ixj];
data[ixj]     = data[ixData]   - tempr;    /* lower leg = upper leg - w*lower */
data[ixj+1]   = data[ixData+1] - tempi;
data[ixData]   += tempr;                   /* upper leg = upper leg + w*lower */
data[ixData+1] += tempi;

This piece of code is executed 5120 times per transform (i.e. (N/2)*log2(N) = 512 * 10 for N = 1024), so there is a lot of opportunity to improve things. I have removed the superfluous array indexing and changed to full 64-bit arithmetic (which proved faster than 32-bit in previous tweaking) as follows:

#if defined(FFT_DOUBLE)
#  define srcPtr0_ doublePtr_1__
#  define srcPtr1_ doublePtr_2__
#  define tgtPtr0_ doublePtr_3__
#  define tgtPtr1_ doublePtr_4__
#endif

#if defined(FFT_FLOAT)
#  define srcPtr0_ floatPtr_1__
#  define srcPtr1_ floatPtr_2__
#  define tgtPtr0_ floatPtr_3__
#  define tgtPtr1_ floatPtr_4__
#endif

/* srcPtr0_/srcPtr1_ -> real/imaginary parts of data[ixj],
   tgtPtr0_/tgtPtr1_ -> real/imaginary parts of data[ixData] */
srcPtr1_ = srcPtr0_ = &(data[ixj]);    ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &(data[ixData]); ++tgtPtr1_;

curReal = wr * *srcPtr0_ - wi * *srcPtr1_;
curImag = wr * *srcPtr1_ + wi * *srcPtr0_;

*srcPtr0_ = *tgtPtr0_ - curReal;
*srcPtr1_ = *tgtPtr1_ - curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;

#undef srcPtr0_
#undef srcPtr1_
#undef tgtPtr0_
#undef tgtPtr1_

However, now I find the time per call has gone up significantly. Using the FFT_FLOAT option above, the call time increased by 39% compared with the original code.

Using the FFT_DOUBLE option it is a bit quicker, only 20% longer. So by rights I should just use the original code with all variables changed to doubles, but I am loath to do so just yet, until I understand why the current bottleneck exists.

Why should accessing vectorised data through pointer dereferences be so much slower than array indexing? Surely Turbo C cannot have got pointer dereferencing that badly wrong?
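One alternative I have not yet timed (just a sketch, using the same wr, wi, data, ixj and ixData as above) would be to pull each array element into a local variable once, do the arithmetic on the locals, and write the results back, rather than dereferencing pointers repeatedly:

/* Sketch only: read each element once into locals; older compilers can
   often keep these in FPU registers, whereas each pointer dereference
   may turn into a separate memory access. */
double ar = data[ixData], ai = data[ixData + 1];   /* upper-leg element */
double br = data[ixj],    bi = data[ixj + 1];      /* lower-leg element */
double tr = wr * br - wi * bi;                     /* w * lower         */
double ti = wr * bi + wi * br;

data[ixj]        = ar - tr;
data[ixj + 1]    = ai - ti;
data[ixData]     = ar + tr;
data[ixData + 1] = ai + ti;

Whether this actually helps obviously depends on what Turbo C's code generator does with it.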

Some other ancillary points that I would be grateful for feedback on:

I have written a software timer that squeezes better resolution out of clock() than the standard tick normally available (CLK_TCK is nominally 18.2 ticks per second, i.e. roughly 55 ms per tick).

I have timed some runs with a stopwatch; this is what I got:

_(" This is the standard value, ms per each tick in clock()")

#undef CLK_TCK 18.2

_(" This measured over 45s on a Celeron 1GHz, ms per each tick in

clock()")

#define CLK_TCK 56.12
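The calibration itself was nothing more than counting clock() ticks against the stopwatch; roughly along these lines (a sketch only, the 45 s figure and names are just illustrative):

#include <stdio.h>
#include <time.h>

int main(void)
{
    double  stopwatch_seconds = 45.0;   /* interval read off the stopwatch */
    clock_t start, stop;
    long    ticks;
    double  ms_per_tick;

    printf("Press Enter and start the stopwatch at the same time...\n");
    getchar();
    start = clock();

    printf("Press Enter again when the stopwatch reads 45 s...\n");
    getchar();
    stop = clock();

    ticks = (long)(stop - start);
    ms_per_tick = stopwatch_seconds * 1000.0 / (double)ticks;
    printf("%ld ticks in %.1f s  =>  %.2f ms per clock() tick\n",
           ticks, stopwatch_seconds, ms_per_tick);
    return 0;
}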

On an 850 MHz Athlon (Win-9x) system I can get 600 ns resolution; however, with all the system overheads in Win-2k, even using a 1 GHz Celeron, I only achieve around 1 ms resolution.

Is there any easy way to disable or reduce a lot of the Win-2k system workload to get more accurate timing? I also remember reading, about 5 years ago, about a "software stopwatch" that could read a far higher-resolution CPU timer; does anyone have a reference to this?
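Something along these lines is what I have in mind: a sketch using the Win32 QueryPerformanceCounter/QueryPerformanceFrequency calls, which I believe sit on top of a high-resolution hardware counter (on Pentium-class CPUs possibly the RDTSC time-stamp counter). It assumes a Win32 compiler and windows.h, so 16-bit Turbo C would not build it as-is:

#include <stdio.h>
#include <windows.h>

/* Convert a counter interval to seconds using the counter frequency. */
double elapsed_seconds(LARGE_INTEGER start, LARGE_INTEGER stop, LARGE_INTEGER freq)
{
    return (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    volatile double x = 0.0;
    long i;

    QueryPerformanceFrequency(&freq);   /* counter ticks per second */
    QueryPerformanceCounter(&t0);

    for (i = 0; i < 1000000L; ++i)      /* dummy work being timed */
        x += 1.0 / (double)(i + 1);

    QueryPerformanceCounter(&t1);
    printf("elapsed = %.9f s (counter running at %ld Hz)\n",
           elapsed_seconds(t0, t1, freq), (long)freq.QuadPart);
    return 0;
}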

These are typical of the results I am getting:

FFT_NRC starting to execute 5000 loops

FFT_NRC time = 934456 ±88(3.5s) ns

FFT_NRC_prev starting to execute 5000 loops

FFT_NRC_prev time = 776079 ±88(3.5s) ns

FFT_NRC time change :FFT_NRC_Prev = 158.377 ± 0.124(3.5s) µs (20.41%)

FFT_NRC_prev starting to execute 1000 loops

FFT_NRC_prev time = 777115 ±451(3.5s) ns

FFT_NRC starting to execute 1000 loops

FFT_NRC time = 935104 ±451(3.5s) ns

Thanks,

Paul.