Velocity Reviews > Is this the optimal FIR filter on all platforms?

Is this the optimal FIR filter on all platforms?

Johan Bergman
Guest
Posts: n/a

 10-25-2003
Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see there
were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan

#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}

Tim Prince
Guest
Posts: n/a

 10-25-2003
Johan Bergman wrote:

> Hi,
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach
>
> The example below calculates the output signal for 10000 lags from an FIR
> filter with 10000 taps. The input signal and the filter coefficients is
> just rubbish in this example. For my intended usage of this filter, it is
> not necessary to store the state of the filter.
>
> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on
> the different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
>
> The performance within parentheses was obtained when a certain line was
> commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see
> there were very different behaviors on the different platforms!
>
> The SunOS program performs extremely well without the critial line. Note
> that the 103 Mcycles means that only one cycle was spent per iteration in
> the inner for loop! But it performs terribly when the critical line is
> included, about 28 times slower!
>
> As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
> clock cycles per iteration in the inner loop, without the critical line
> and only about twice as much with the critical line.
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> Regards,
> Johan
>
>
> #include <malloc.h>
> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));
> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }

Assuming that you are using basically the same compiler and the same
optimization options in each case, you must be running on different
hardware. You've given no reason to believe that the OS is responsible.
If one or more of your platforms supports SSE/SSE2 instructions, and you
can change the operands to float/double, you should be able to get much
better performance. Split the sum into 4 or 8 partial sums, thus
parallelizing the sum reduction. This becomes OT if you use a C compiler
whichs supports parallel instructions only in asm.
--
Tim Prince

Arthur J. O'Dwyer
Guest
Posts: n/a

 10-25-2003

On Sat, 25 Oct 2003, Johan Bergman wrote:
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach

What in the world is an FIR filter? (Finite impulse-response, yes,
yes, I know, I looked it up. Point is, practically nobody in this
newsgroup will know, either, and they'll *all* have to look it up
if they really want to help you. You're making it harder for people

> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on the
> different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum

> Can someone help me to speed up the program on the Linux and SunOS
> platforms?

Well, first let's clean up your code and see what it looks like
then. Clean code is always easier to operate on.

> #include <malloc.h>

Non-standard header. 'malloc' is declared in <stdlib.h>.

#include <stdlib.h>

> int main()

[Some regulars strongly believe in 'int main(void)'. I
don't have a strong preference either way, but it's wise
to consider why you picked the style you did, even on
little things like this.]

> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =

Okay, yeah, 'const' is great. But really, this is ridiculous.
You don't think that all these consts are somehow making your
program more efficient, do you? (They're not hurting, but
over-constifying is Bad mostly because it's Ugly, and Ugly code
is hard to make Right.)

> (const int * const) malloc(nrof_taps*sizeof(int));

malloc(nrof_taps * sizeof *coeff_ptr);

No chance for error using this idiom.
Also note that casting a malloc() result to 'const' anything
is just silly, because free() takes a non-const pointer
argument. How are you going to free() this memory, hmm?

> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;

Way too many similarly-named variables, in my opinion.
Again, purely a stylistic issue, but again, clean code is
nice code. Note that in the cleaned-up version, I've pulled
in their scope to where they're actually used. This may
actually help with the efficiency of the program, since some
optimizers look for "windows" in which objects can be moved
into registers and suchlike. The wider their scopes are, the
harder it is for the optimizer to see them.

> tmp_output_ptr = (int *) output_ptr;

Casting away constness is Bad. Especially when you've just
finished jumping through hoops to make it const in the first
place. Geez.

> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;

STOP PERFORMING RANDOM CASTS! For Pete's sake!
Get a hold of yourself here!

> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;

This line (and the one below) are just begging to be
simplified. Let's expand it so we can see the order
of operations.

> }
> *tmp_output_ptr++ = sum;
> }

/* always */ return 0; /* from main */

> }

Okay, let's put it all together: no casts, clean names,
clean mallocs, and see what we get!

You see how all those tmp_*_ptr objects just melt away
once we start to see the outline of the program clearly.
In fact, we uncover a subtle bug in the memory allocation,
and fix it ('input' was being malloc'ed one element short)!
Finally, we strive for idiomatic identifiers: using 'i' and
'j' for the nested loop controls, and losing the silly
'_ptr' suffixes on the main data arrays.

#include <stdlib.h>

int main(void)
{
int nrof_lags = 10000; /* Could both of these be #defines? */
int nrof_taps = 10000; /* That would help a little. */

int *coeffs = malloc(nrof_taps * sizeof *coeffs);
int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
int *output = malloc(nrof_lags * sizeof *output);
int i, j;

for (i=0; i < nrof_lags; ++i)
{
int sum = 0;
for (j=0; j < nrof_taps; ++j)
sum += coeffs[j] * input[i+j];
output[i] = sum;
}

free(coeffs);
free(input);
free(output);
return 0;
}

There -- isn't that a heck of a lot simpler and easier to
If not, you might try changing 'int' to 'unsigned int' (in
fact, you might do that anyway, to guard against overflow
conditions); reversing the order of either or both loops;
if many of the 'coeffs[j]' are zero, try pre-processing
'coeffs' a little bit; little tweaks like that.

I think most of your performance differences are due to
different cache layouts on the different machines, but I
could easily be wrong.

How's the code do now?

HTH,
-Arthur

nobody
Guest
Posts: n/a

 10-25-2003
"Johan Bergman" <(E-Mail Removed)> wrote in message
news:_ylmb.34856\$(E-Mail Removed)...
> Hi,
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach
>
> The example below calculates the output signal for 10000 lags from an FIR
> filter with 10000 taps. The input signal and the filter coefficients is

just
> rubbish in this example. For my intended usage of this filter, it is not
> necessary to store the state of the filter.
>
> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on

the
> different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
>
> The performance within parentheses was obtained when a certain line was
> commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see

there
> were very different behaviors on the different platforms!
>
> The SunOS program performs extremely well without the critial line. Note
> that the 103 Mcycles means that only one cycle was spent per iteration in
> the inner for loop! But it performs terribly when the critical line is
> included, about 28 times slower!
>
> As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
> clock cycles per iteration in the inner loop, without the critical line

and
> only about twice as much with the critical line.
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> Regards,
> Johan
>
>
> #include <malloc.h>
> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));
> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }
>

No idea what the problem is, but if you comment out
*tmp_output_ptr++ = sum;
compiler may optimize away calculation of sum in previous statement,
which could explain performance differences. Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
2. use calloc() instead of malloc(). I'm not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc'd block?) in case of malloc() use on any particular platform.

I would appreciate posting of proper explanation and solution,
when you find it.

CBFalconer
Guest
Posts: n/a

 10-25-2003
Johan Bergman wrote:
>

.... snip ...
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> #include <malloc.h>

No such standard header. Use stdlib.h.

> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));

Don't cast malloc output.

> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));

Same here.

> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));

Same here.

> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;

Uninitialized data. Anything can happen.

> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }

I suggest you first create a legal program before measuring
performance.

--
Chuck F ((E-Mail Removed)) ((E-Mail Removed))
Available for consulting/temporary embedded and systems.

Arthur J. O'Dwyer
Guest
Posts: n/a

 10-25-2003

On Sat, 25 Oct 2003, nobody wrote:
>
> "Johan Bergman" <(E-Mail Removed)> wrote...
> >
> > Maybe someone can help me to optimize this C/C++ implementation of a FIR
> > filter (although I realize that I should probably consider an FFT approach

> Try
> 1. optimize code little bit, instead of
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> do
> for (tap=0; tap<nrof_taps; tap++)
> *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> (sum is cleared in each outer iteration and therefore I assume it's
> not used somewhere later)

That will almost certainly slow down the code immensely; remember
that 'tmp_output_ptr' points to a 100000-element array, which is highly
unlikely to be readily accessible. 'sum', on the other hand, should
have been optimized by the compiler to be stored very accessibly in
a register or suchlike.

> 2. use calloc() instead of malloc(). I'm not sure what are alignment
> implications (some additional instructions for acces to data via
> dereferenced
> ptr in malloc'd block?) in case of malloc() use on any particular platform.

On many modern systems, there won't even be a significant difference
between the performance of calloc and malloc -- but since malloc doesn't
need to write zeroes everywhere, it's faster in theory. OTOH, if the
OP used calloc, he would have a program that actually exhibited defined,
reproducible behavior.

-Arthur

nobody
Guest
Posts: n/a

 10-25-2003
"Arthur J. O'Dwyer" <(E-Mail Removed)> wrote in message
news(E-Mail Removed)...
>
> On Sat, 25 Oct 2003, nobody wrote:
> >
> > "Johan Bergman" <(E-Mail Removed)> wrote...
> > >
> > > Maybe someone can help me to optimize this C/C++ implementation of a

FIR
> > > filter (although I realize that I should probably consider an FFT

approach

>
> > Try
> > 1. optimize code little bit, instead of
> > sum = 0;
> > for (tap=0; tap<nrof_taps; tap++)
> > {
> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > }
> > *tmp_output_ptr++ = sum;
> > do
> > for (tap=0; tap<nrof_taps; tap++)
> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > (sum is cleared in each outer iteration and therefore I assume it's
> > not used somewhere later)

>
> That will almost certainly slow down the code immensely; remember
> that 'tmp_output_ptr' points to a 100000-element array, which is highly
> unlikely to be readily accessible. 'sum', on the other hand, should
> have been optimized by the compiler to be stored very accessibly in
> a register or suchlike.
>

I was thinking of that too, but array has 10K elemnts, not 100K, so
assuming 32 bit inegers, there is malloc'd something below 160K
from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
OP is running SunOS on hardware with more memory than that.
But nonwithstanding, you may be right.

> > 2. use calloc() instead of malloc(). I'm not sure what are alignment
> > implications (some additional instructions for acces to data via
> > dereferenced
> > ptr in malloc'd block?) in case of malloc() use on any particular

platform.
>
> On many modern systems, there won't even be a significant difference
> between the performance of calloc and malloc -- but since malloc doesn't
> need to write zeroes everywhere, it's faster in theory. OTOH, if the

That is true, but allocation is done outside of loop so this difference
would be IMHO practically inmesurable in given program (10K
multiplications and additions in inner loop times 10K iterations
and additions in outer loop. But I was really speculating, because
I can't explain those huge time execution differences (assuming
correct methodology was used to measure this time).

> OP used calloc, he would have a program that actually exhibited defined,
> reproducible behavior.
>
> -Arthur
>

Sheldon Simms
Guest
Posts: n/a

 10-25-2003
On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:

> "Arthur J. O'Dwyer" <(E-Mail Removed)> wrote in message
> news(E-Mail Removed)...
>>
>> On Sat, 25 Oct 2003, nobody wrote:
>> >
>> > Try
>> > 1. optimize code little bit, instead of
>> > sum = 0;
>> > for (tap=0; tap<nrof_taps; tap++)
>> > {
>> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
>> > }
>> > *tmp_output_ptr++ = sum;
>> > do
>> > for (tap=0; tap<nrof_taps; tap++)
>> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
>> > (sum is cleared in each outer iteration and therefore I assume it's
>> > not used somewhere later)

>>
>> That will almost certainly slow down the code immensely; remember
>> that 'tmp_output_ptr' points to a 100000-element array, which is highly
>> unlikely to be readily accessible. 'sum', on the other hand, should
>> have been optimized by the compiler to be stored very accessibly in
>> a register or suchlike.
>>

> I was thinking of that too, but array has 10K elemnts, not 100K, so
> assuming 32 bit inegers, there is malloc'd something below 160K
> from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> OP is running SunOS on hardware with more memory than that.
> But nonwithstanding, you may be right.

For what it's worth, I compiled both versions on my machine (which
has plenty of memory). The version with sum was faster:

with sum:
[sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test1 test1.c
[sheldon@wsxyz]\$ time ./test1

real 0m0.627s
user 0m0.560s
sys 0m0.000s

without sum:
[sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test2 test2.c
[sheldon@wsxyz]\$ time ./test2

real 0m0.710s
user 0m0.630s
sys 0m0.000s

nobody
Guest
Posts: n/a

 10-25-2003
"Sheldon Simms" <(E-Mail Removed)> wrote in message
news(E-Mail Removed)...
> On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
>
> > "Arthur J. O'Dwyer" <(E-Mail Removed)> wrote in message
> > news(E-Mail Removed)...
> >>
> >> On Sat, 25 Oct 2003, nobody wrote:
> >> >
> >> > Try
> >> > 1. optimize code little bit, instead of
> >> > sum = 0;
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > {
> >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > }
> >> > *tmp_output_ptr++ = sum;
> >> > do
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > (sum is cleared in each outer iteration and therefore I assume it's
> >> > not used somewhere later)
> >>
> >> That will almost certainly slow down the code immensely; remember
> >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
> >> unlikely to be readily accessible. 'sum', on the other hand, should
> >> have been optimized by the compiler to be stored very accessibly in
> >> a register or suchlike.
> >>

> > I was thinking of that too, but array has 10K elemnts, not 100K, so
> > assuming 32 bit inegers, there is malloc'd something below 160K
> > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> > OP is running SunOS on hardware with more memory than that.
> > But nonwithstanding, you may be right.

>
> For what it's worth, I compiled both versions on my machine (which
> has plenty of memory). The version with sum was faster:
>
> with sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test1 test1.c
> [sheldon@wsxyz]\$ time ./test1
>
> real 0m0.627s
> user 0m0.560s
> sys 0m0.000s
>
>
> without sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test2 test2.c
> [sheldon@wsxyz]\$ time ./test2
>
> real 0m0.710s
> user 0m0.630s
> sys 0m0.000s
>

Thanks. So I was wrong. My sincere apologies. Had I SunOS box
at my disposal, I would test it before posting

Arthur J. O'Dwyer
Guest
Posts: n/a

 10-25-2003

On Sat, 25 Oct 2003, nobody wrote:
>
> "Arthur J. O'Dwyer" <(E-Mail Removed)> wrote...
> > On Sat, 25 Oct 2003, nobody wrote:
> > > optimize code little bit, instead of
> > > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > > do
> > > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;

> >
> > That will almost certainly slow down the code immensely; remember
> > that 'tmp_output_ptr' points to a 100000-element array, which is highly
> > unlikely to be readily accessible. 'sum', on the other hand, should
> > have been optimized by the compiler to be stored very accessibly in
> > a register or suchlike.

>
> I was thinking of that too, but array has 10K elemnts, not 100K, so
> assuming 32 bit inegers, there is malloc'd something below 160K
> from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> OP is running SunOS on hardware with more memory than that.

More main memory, yes. More cache memory, no. I was referring to
the size of the cache in this case. You're right, though, that if
the arrays get really ungodly huge, then the OP might need to look
at the effects of whatever virtual memory facilities his OS is
trying to use.

This is way OT; I'll stop now.

-Arthur