Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > loop optimization

Reply
Thread Tools

loop optimization

 
 
Jayden Shui
Guest
Posts: n/a
 
      02-02-2012
Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include <time.h>
#include <iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m < 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i <= ie; ++i)
{
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
{
a[i][j][k0] = a[i][j][k1];
}
}
}

}

clock_t t1 = clock();
cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";

return 0;
}
 
Reply With Quote
 
 
 
 
Jayden Shui
Guest
Posts: n/a
 
      02-02-2012
On Feb 2, 10:09*am, Paavo Helde <(E-Mail Removed)> wrote:
> Jayden Shui <(E-Mail Removed)> wrote in news:db0c071d-a79a-49fd-
> (E-Mail Removed):
>
> > Dear All,

>
> > I have a high dimensional array and want to assign one block of data
> > in the array to another block in the array. The two blocks are not
> > overlapped. The loops are very slow.

>
> Compared to what? It appears that you had to repeat this 100000 times
> just in order to get some meaningful timings - does not look slow to me.
>
> Other than that, try to increase your compiler optimization level. The
> inner loop is 2 iterations and should be rolled out, some compilers may
> require special flags for doing that (-funroll-loops for gcc, for
> example).
>
> hth
> Paavo
>
>
>
>
>
>
>
> > Code is attached. Any ways to
> > optimize it? Thanks a lot.

>
> > #include <time.h>
> > #include <iostream>
> > using namespace std;

>
> > int main()
> > {
> > * * float a[100][100][8];

>
> > * * int ib = 1, ie = 99;
> > * * int jb = 1, je = 99;
> > * * int kb0 = 0, ke0 = 1;
> > * * int kb1 = 3, ke1 = 2;

>
> > * * clock_t t0 = clock();

>
> > * * // Loop to count CPU time.
> > * * for (int m = 0; m < 100000; ++m) {

>
> > * * // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
> > slow
> > * * for (int i = ib; i <= ie; ++i)
> > * * {
> > * * * * for (int j = jb; j <= je; ++j)
> > * * * * {
> > * * * * * * for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
> > * * * * * * {
> > * * * * * * * * a[i][j][k0] = a[i][j][k1];
> > * * * * * * }
> > * * * * }
> > * * }

>
> > * * }

>
> > * * clock_t t1 = clock();
> > * * cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
> > \n";

>
> > * * return 0;
> > }


This is testing code for CPU time of the operations. My code has lots
of such similar operations. I hope using some loop optimization
techniques in http://en.wikipedia.org/wiki/Loop_optimization to speed
it up. I think the testing code should be run in less than 0.5 s in i7
after loop optimization (-O2 or O3 is really not enough) according to
my experience.
 
Reply With Quote
 
 
 
 
Victor Bazarov
Guest
Posts: n/a
 
      02-02-2012
On 2/2/2012 9:55 AM, Jayden Shui wrote:
> I have a high dimensional array and want to assign one block of data
> in the array to another block in the array. The two blocks are not
> overlapped. The loops are very slow. Code is attached. Any ways to
> optimize it? Thanks a lot.
>
> #include<time.h>
> #include<iostream>
> using namespace std;
>
> int main()
> {
> float a[100][100][8];
>
> int ib = 1, ie = 99;
> int jb = 1, je = 99;
> int kb0 = 0, ke0 = 1;
> int kb1 = 3, ke1 = 2;
>
> clock_t t0 = clock();
>
> // Loop to count CPU time.
> for (int m = 0; m< 100000; ++m) {
>
> // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
> slow
> for (int i = ib; i<= ie; ++i)
> {
> for (int j = jb; j<= je; ++j)
> {
> for (int k0 = kb0, k1 = kb1; k0<= ke0; ++k0, --k1)
> {
> a[i][j][k0] = a[i][j][k1];
> }
> }
> }
>
> }
>
> clock_t t1 = clock();
> cout<< "CPU Time = "<< double(t1 - t0) / CLOCKS_PER_SEC<< "s
> \n";
>
> return 0;
> }


Some compilers I've seen (don't recall which, though) could generate
better code if you helped them reduce the indexing by extracting the
"common subexpression" outside of the inner loop. Something like

for (int i = ...
{
float (&ai)[100][8] = a[i];
for (int j = ...
{
float (&aij)[8] = ai[j];
for (int k0 ...
{
aij[k0] = aij[k1];
}
}
}

Better compilers do that on their own. Try it and see if you get any
improvement. If not, you could (a) parallelize the outer loop (since
there is no overlap) but keep in mind it's only worth if the overhead of
starting worker threads is much smaller than the actual work each has to
perform, or (b) change your algorithm and the data structure.

Barring those choices, you're pretty much stuck with what you have.

V
--
I do not respond to top-posted replies, please don't ask
 
Reply With Quote
 
copx
Guest
Posts: n/a
 
      02-02-2012
On 02.02.2012 15:55, Jayden Shui wrote:
> Dear All,
>
> I have a high dimensional array and want to assign one block of data
> in the array to another block in the array. The two blocks are not
> overlapped. The loops are very slow. Code is attached. Any ways to
> optimize it? Thanks a lot.

[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?


 
Reply With Quote
 
Jayden Shui
Guest
Posts: n/a
 
      02-02-2012
On Feb 2, 11:12*am, copx <(E-Mail Removed)> wrote:
> On 02.02.2012 15:55, Jayden Shui wrote:> Dear All,
>
> > I have a high dimensional array and want to assign one block of data
> > in the array to another block in the array. The two blocks are not
> > overlapped. The loops are very slow. Code is attached. Any ways to
> > optimize it? Thanks a lot.

>
> [snip code]
>
> I compiled this and ran it on my desktop computer which is not very
> fast:
>
> Unoptimized build ("g++ loop.cpp"): 15.762s
> Optimized build ("g++ -O2 loop.cpp"): 0s
>
> Yes, that's a zero. It runs so fast you cannot even measure it using
> your method. Did you forget to enable optimization?


This is due to there is no output and the code in fact is not run in -
O2. Please add

int i, j, k;
cin >> i >> j >> k;
cin << a[i][j][k];

before return 0;
 
Reply With Quote
 
Victor Bazarov
Guest
Posts: n/a
 
      02-02-2012
On 2/2/2012 1:51 PM, Christian Gollwitzer wrote:
> Am 02.02.12 15:55, schrieb Jayden Shui:
>> Dear All,
>>
>>
>> // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]

>
> I don't really understand what kind of copy this performs. But if you
> can reformulate this operation to copy (large) contiguous blocks of
> memory, you could use memcpy, which is probably optimized better. As
> others have suggested, this problem is probably memory bandwidth
> limited. This means you can gain something by vectorizing as large
> blocks as possible at once.


The problem is that the order of elements in the destination block is
inverted (see the "-1"?) That prevents memcpy (or memmove) from being used.

V
--
I do not respond to top-posted replies, please don't ask
 
Reply With Quote
 
Juha Nieminen
Guest
Posts: n/a
 
      02-02-2012
Christian Gollwitzer <(E-Mail Removed)> wrote:
> I don't really understand what kind of copy this performs. But if you
> can reformulate this operation to copy (large) contiguous blocks of
> memory, you could use memcpy, which is probably optimized better. As
> others have suggested, this problem is probably memory bandwidth
> limited. This means you can gain something by vectorizing as large
> blocks as possible at once.


When dealing with multidimensional arrays, if you need to handle
(read, copy, modify...) large portions of it at adjacent indices,
it's always much more efficient to perform the operations on elements
that are in contiguous memory locations rather than jumping around.
This makes it important which index of a multidimensional array is
modified most often.

The reason for this is, obviously, caching. When you access an element
in the array, the CPU will load adjacent memory data to the cache. If
you then access an adjacent element of the array, it will most probably
be already loaded into the cache (closest to the CPU), which will make
it really fast. If, however, you jumped around the RAM by incrementing
the wrong index, you will get lots of cache misses.

For that reason eg. this:

for(unsigned i1 = 0; i1 < size1; ++i1)
for(unsigned i2 = 0; i2 < size2; ++i2)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

will usually be much faster than:

for(unsigned i2 = 0; i2 < size2; ++i2)
for(unsigned i1 = 0; i1 < size1; ++i1)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

memcpy() is often very fast not only because it has been optimized to
death during the last 30 years, but also because it accesses consecutive
memory locations, which is cache-friendly.
 
Reply With Quote
 
Asger Joergensen
Guest
Posts: n/a
 
      02-03-2012
Hi Jayden

Jayden Shui wrote:

> int ib = 1, ie = 99;
> int jb = 1, je = 99;
> int kb0 = 0, ke0 = 1;
> int kb1 = 3, ke1 = 2;

.....
> for (int j = jb; j <= je; ++j)
> {
> for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)


putting in the numbers:
for (int k0 = 0, k1 = 3; k0 <= ke0; ++k0, --k1)

so You are essentially only swapping the first four floats, right ?

well to my solution:

no need for all these [][][] they are just to make things slower
and in some cases simpler to understand.

// first set some sizes
const int lineSize = 8;
const int squareSize = 800;
const int cubeSize = 80000;

// get one chuck of memory of the right size
float cube[ cubeSize ];

float *square = cube;
float* squaresEnd = cube + cubeSize;

while( square < squaresEnd )
{
float* line = square;
float* linesEnd = square + squareSize;

while( line < linesEnd )
{
float* lineBegin = line;
float* lineEnd = line + 3; // lineSize ****

for(; lineBegin < lineEnd; ++lineBegin, --lineEnd )//**
*lineBegin = *lineEnd;

line += lineSize;
}

square += squareSize;
}

this is close to double speed on my machine, but I only have an old
Borland compiler, so it might not be the same for You.

****
I don't know if the 3 was what You intended, if not replace with
lineSize.
**
I removed the = from <= if they are pointing to the same float there
is really no need to swap.


Best regards
Asger-P
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Triple nested loop python (While loop insde of for loop inside ofwhile loop) Isaac Won Python 9 03-04-2013 10:08 AM
Zero Optimization and Sign Optimization??? Ravikiran C Programming 22 11-24-2008 03:19 AM
java loop optimization star Java 38 09-22-2005 09:13 AM
Loop Optimization, Array Alignment Rajeev C Programming 7 09-18-2004 12:37 PM
Simple loop optimization mrTriffid Java 7 02-08-2004 06:20 AM



Advertisments