Hi,

I'm currently working on matrix multiplication code (matrix times matrix), and have come across some interesting/confusing results concerning the running time of the (apparently) same algorithm when implemented in C or C++. I noticed that the C implementation is faster by a factor of 2.5 compared to an identical(?) C++ implementation.

The basic algorithm in C is:

    /* in function main() */
    idxC = 0;
    for (i = 0; i < dimension; i++)
        for (k = 0; k < dimension; k++) {
            idxA = i * dimension;
            endA = idxA + dimension;
            idxB = k;
            while (idxA < endA) {
                Celem[idxC] += Aelem[idxA] * Belem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }

where the matrices A, B, and C are each stored in a 1-dimensional array with line-by-line (row-major) numbering. For example, idxA points to the current element A(i,j) of matrix A. After idxA++, the index refers to element A(i,j+1). Similarly, idxA += dimension moves the index to element A(i+1,j). Thus, the code computes the matrix product C = A*B while avoiding explicit computation of matrix indices.

The respective algorithm in C++ is:

    template<class MatElem>
    void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                      FortranMatrix<MatElem>& C) const {
        /* ... */
        int idxC = 0;
        for (int i = 0; i < dimension; i++)
            for (int k = 0; k < dimension; k++) {
                int idxA = i * dimension;
                int endA = idxA + dimension;
                int idxB = k;
                while (idxA < endA) {
                    C.elem[idxC] += elem[idxA] * B.elem[idxB];
                    idxA++;
                    idxB += dimension;
                }
                idxC++;
            }
    }

Here, A, B, and C are objects of a matrix class that has a member elem (of type double*).

For the same matrix size (729x729), the C implementation takes around 10 seconds on my machine (Pentium III, gcc version 3.2, optimization flags -O3 and -funroll-loops). The C++ implementation takes approximately 25 seconds on the same machine using g++ (same version, same optimization flags).

Is anybody out there able to comment on the reasons for this huge loss of performance?

Using a static function with the double arrays as parameters, i.e.

    template<class MatElem>
    void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                         const MatElem* B,
                                         MatElem* C, int dimension) {
        /* ... */
        int idxC = 0;
        for (int i = 0; i < dimension; i++)
            for (int k = 0; k < dimension; k++) {
                int idxA = i * dimension;
                int endA = idxA + dimension;
                int idxB = k;
                while (idxA < endA) {
                    C[idxC] += A[idxA] * B[idxB];
                    idxA++;
                    idxB += dimension;
                }
                idxC++;
            }
    }

Now, this stamult version is almost identical to the C algorithm, but the code still takes around 20 seconds. I was actually prepared to see some performance loss when using C++ instead of C. However, I find it confusing to lose a factor of 2 in code that doesn't even come close to using the object-oriented features that might slow it down. I didn't try it without templates, but that can't make too much of a difference, can it?

I'd appreciate any comments on this, and thank you very much for taking the time to read this rather longish article.

Michael