Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Performance issue: matrix multiplication in C and C++


Performance issue: matrix multiplication in C and C++

 
 
Michael Bader
03-02-2004
Hi,

I'm currently working on matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C or C++: the implementation in C is faster
by a factor of 2.5 compared to an identical(?) C++ implementation.

The basic algorithm in C is:

/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

where matrices A, B, and C are each implemented as a 1-dimensional
array with line-by-line numbering (i.e. Fortran-style).
For example, idxA is the index of the current element A(i,j) of matrix A.
After idxA++, the index refers to element A(i,j+1).
Similarly, idxA += dimension moves the index to element A(i+1,j).
Thus, the code computes the matrix product C = A*B while avoiding
explicit computation of 2-dimensional matrix indices.

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Here, A, B, and C, are objects of some matrix class that have a
member elem (of type double*).

For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc, version 3.2,
optimization level -O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?

Using a static function, and the double-arrays as parameters, i.e.

template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was prepared to see some performance loss
when using C++ instead of C. However, I find it confusing to lose a
factor of 2 in code that doesn't come close to using any
object-oriented features that might slow it down.
I didn't try it without templates, but that can't make too
much of a difference, can it?

I'd appreciate any comment on this, and I would like to thank you
very much for taking the time to read this rather longish article.

Michael

 
Gianni Mariani
03-02-2004
Michael Bader wrote:
....
>
> The respective algorithm in C++ is:
>
> template<class MatElem>
> void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
>                                   FortranMatrix<MatElem>& C) const {
>   /* ... */
>   int idxC = 0;
>   for (int i = 0; i < dimension; i++)
>     for (int k = 0; k < dimension; k++) {
>       int idxA = i*dimension;
>       int endA = idxA + dimension;
>       int idxB = k;
>       while (idxA < endA) {
>         C.elem[idxC] += elem[idxA] * B.elem[idxB];


C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

-- still, you can also fetch the addresses this->elem, B.elem and C.elem
and place them in local temporaries (although the compiler can probably
do that for this->elem and B.elem, since *this and B are const, but not
for C).


> [...]
> Now, this is almost identical to the C algorithm, but the code still
> takes around 20 seconds. I was actually prepared to see certain
> performance losses when using C++ instead of C. However, I find it
> confusing that you lose a factor 2 in a code that's not even close to
> using objected oriented features that may slow the code down.
> I didn't try it without using templates, but that can't make too
> much of a difference, can it?


Which version of gcc are you using?

>
> I'd appreciate any comment on this, and I would like to thank you
> very much for taking the time to read this rather longish article


Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

G



 
Andre Kostur
03-02-2004
Michael Bader <(E-Mail Removed)> wrote in
news:c22ga6$m2a$(E-Mail Removed)-muenchen.de:

> [...]
> Here, A, B, and C, are objects of some matrix class that have a
> member elem (of type double*).


Show, don't tell. Where is the actual class declaration of FortranMatrix
<MatElem>? What _is_ MatElem? Show its declaration too.

 
Claudio Puviani
03-02-2004
"Michael Bader" <(E-Mail Removed)> wrote
> [...]
> For the same size of matrices (729x729), the C implementation takes
> around 10 seconds on my machine (Pentium III, gcc, version 3.2,
> optimization level -O3 and -funroll-loops).
> The C++-implementation takes approximately 25 seconds on the same
> machine using g++ (same version, same optimization parameters).
> [...]


The most likely reason is that g++ doesn't generate intermediate code that's as
good as gcc (in fact, g++ 3.x generates bigger and slower code than g++ 2.95).
This is fairly common. The worst difference I've seen was with Watcom's C and
C++ compilers in the early 90's where some code ran 10 times or more slower when
compiled with their C++ compiler than with their C compiler. This isn't a
difference between the C and C++ languages, but between the two compiler
implementations. C++ compilers just don't have as long a history as C compilers,
and in the time that C++ compiler writers have had to spend adding and
supporting the myriad features that C doesn't have, the C compiler writers have
been able to invest in better code generation. C++ will eventually catch up
(today's C++ compilers compare favorably with C compilers 10 years ago). In the
meantime, you can certainly write performance-critical code in C and link it
into your C++ application.

Claudio Puviani


 
NKOBAYE027
03-02-2004

"Gianni Mariani" <(E-Mail Removed)> wrote in message
news:c22kkr$(E-Mail Removed)...
> [...]
I would also move

> int idxA = i*dimension;
> int endA = idxA + dimension;

to the outer loop... no need to redeclare and initialise them each time
through the loop over k. Keep what depends upon the loop variable within
the loop and move everything else out.


 
Jason Leasure
03-02-2004


Michael Bader wrote:
> [...]
> The basic algorithm in C is:
>


somewhere up here you declare idxA, endA, idxB, etc. ONCE

> /* in function main() */
> idxC = 0;
> for (i = 0; i < dimension; i++)
>   for (k = 0; k < dimension; k++) {
>     idxA = i*dimension;
>     endA = idxA + dimension;
>     idxB = k;
>     while (idxA < endA) {
>       Celem[idxC] += Aelem[idxA] * Belem[idxB];
>       idxA++;
>       idxB += dimension;
>     }
>     idxC++;
>   }
>
> [...]
>
> The respective algorithm in C++ is:
>
> template<class MatElem>
> void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
> FortranMatrix<MatElem>& C) const {
> /* ... */
> int idxC = 0;
> for(int i=0; i< dimension; i++)
> for(int k=0; k< dimension; k++) {


> int idxA = i*dimension;
> int endA = idxA + dimension;
> int idxB = k;

here you declare each of these variables (and assign to them) dimension
squared times. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

move your declarations outside of the loop - keep the assignments
where you need them, though.


> [...]

guessingly yours,
Jason

 
Michael Bader
03-03-2004
Andre Kostur wrote:
> Show, don't tell. Where is the actual class declaration of FortranMatrix
> <MatElem>? What _is_ MatElem? Show its declaration too.


No problem, I just felt that the posting was already lengthy enough
without it ...

template<class MatElem> class FortranMatrix {

public:

    // Constructors
    FortranMatrix(int dim);
    FortranMatrix(const FortranMatrix<MatElem>& C);

    // element access
    // --> direct access by memory index
    inline MatElem& operator[] (int index) const;
    inline MatElem& operator[] (int index);

    // --> access via line index and column index
    inline MatElem& operator() (int line, int col) const;
    inline MatElem& operator() (int line, int col);

    inline void mult(const FortranMatrix<MatElem>& B,
                     FortranMatrix<MatElem>& C) const;

    /* ... additional methods skipped ... */

public:

    // Members
    MatElem* elem;   // matrix elements
    int dimension;   // number of lines (also number of columns)

};

MatElem is instantiated with double.

Regards

Michael


 
Michael Bader
03-03-2004
Jason Leasure wrote:
>> int idxC = 0;
>> for(int i=0; i< dimension; i++)
>> for(int k=0; k< dimension; k++) {

>
>> int idxA = i*dimension;
>> int endA = idxA + dimension;
>> int idxB = k;

> here you declare each of these variables (and assign to them) dimension
> squared times. I'd guess doing this amounts to moving the stack
> pointer around a lot more than you have to.


I gave that a try, but the performance gain is almost negligible.
Probably, the optimizer can figure that out on its own.

Thanks anyway

Michael

 
Michael Bader
03-03-2004
Claudio Puviani wrote:
> The most likely reason is that g++ doesn't generate intermediate code
> that's as good as gcc (in fact, g++ 3.x generates bigger and slower code
> than g++ 2.95).


Hi,

thanks for the suggestion, I'll definitely try to do the same
comparison using an older version of gcc/g++.
I also wanted to use an Intel compiler, but I need to ask
someone who owns a licence for that ...

Thanks

Michael

 
Michael Bader
03-03-2004
Gianni Mariani wrote:
> Try posting compilable code. The last example really does look like it
> should be at least as fast as the C code. Without examining the actual
> generated assembly code, it's hard to know why it's doing what it's doing.


I've put the code for download on
http://www5.in.tum.de/~bader/mmult/
(directory, listing should be allowed).

You need to set the flag __TEMPLATES_IN_HEADERS for compiling
(for g++, use option -D__TEMPLATES_IN_HEADERS).

If you can use a different compiler/architecture, the results would
also be interesting.

Thanks in advance

Michael

 