
Velocity Reviews > Newsgroups > Programming > C++ > Performance Question ...


Performance Question ...

 
 
Konrad Mühler
06-10-2008
Hi,

I have a list of objects. I iterate over the list and read the values of
each object for many operations, like:

x = myList[i].value1 + myList[i].value2
etc.

My question:
Is it efficient to always use myList[i], or should I get a pointer to
the current element and use that instead, like

currentElement->value1 ...

Is it bad for performance to always let the list return the i-th element?

Thank you
Konrad
 
dizzy
06-10-2008
Konrad Mühler wrote:

> Hi,
>
> I've a list of objects. I iterate the list and read the value of each
> object for many operation like:
>
> x = myList[i].value1 + myList[i].value2
> etc.


So you don't have a std::list of objects, as std::list has no op[] (because
such an op[] would be too expensive). What does your list's op[] do? What is
its algorithmic complexity?

> My question:
> Is it efficient to always use myList[i] or should I get the pointer to
> the current element and use this instead like


Depends on your implementation of the list. If your list is actually something
like a std::vector, then an op[] is fast enough. If your data set is much
larger than your data cache, then that will be the performance bottleneck no
matter whether you use references/pointers/iterators to elements or a fast op[].

> currentElement->value1 ...
>
> Is it low-performance to always let the list return the i's element?


Have you benchmarked/profiled it? What results do you get on your
particular combination of platform, implementation, compiler (version),
and optimization flags?

Make your decision according to that.

--
Dizzy

 
Mirco Wahab
06-10-2008
Konrad Mühler wrote:
> I've a list of objects. I iterate the list and read the value of each
> object for many operation like:
> x = myList[i].value1 + myList[i].value2
> etc.
> My question:
> Is it efficient to always use myList[i] or should I get the pointer to
> the current element and use this instead like
> currentElement->value1 ...
> Is it low-performance to always let the list return the i's element?


That depends on the compiler. Newer versions of gcc will
optimize the difference away, whereas older ones
(and maybe Visual Studio) will not.

Try it:

==>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>

struct listitem { int value1, value2, value3; };

int test_array(listitem *list, int nitems)
{
    int x = 0;
    for (int j = 0; j < nitems; ++j)
        for (int i = 0; i < nitems; ++i)
            x += (list[i].value1 - list[j].value1) + (list[i].value2 - list[j].value2);
    return x;
}

int test_pointer(listitem *list, int nitems)
{
    int x = 0;
    for (listitem *pj = list; pj < list + nitems; ++pj)
        for (listitem *pi = list; pi < list + nitems; ++pi)
            x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
    return x;
}

double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int nitems)
{
    std::cout << s << std::endl;
    clock_t t = clock();
    int x = p(list, nitems); // fool optimizer! (x is always zero)
    return double(clock() - t + x) / CLOCKS_PER_SEC;
}

int main(int argc, char *argv[])
{
    int nitems = argc > 1 ? atoi(argv[1]) : 20000;
    listitem *list = new listitem[nitems];
    memset(list, 0x1A, sizeof(listitem) * nitems);

    double ta = 0, tp = 0;
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);

    std::cout << "length: " << nitems << std::endl
              << "array: " << ta << " sec" << std::endl
              << "pointer: " << tp << " sec" << std::endl;

    delete [] list;
    return 0;
}
<==

Regards

Mirco

 
Marcel Müller
06-10-2008
Hi,

Konrad Mühler wrote:
> I've a list of objects. I iterate the list and read the value of each
> object for many operation like:
>
> x = myList[i].value1 + myList[i].value2
> etc.
>
> My question:
> Is it efficient to always use myList[i] or should I get the pointer to
> the current element and use this instead like
>
> currentElement->value1 ...
>
> Is it low-performance to always let the list return the i's element?


What is the complexity of myList.operator[]?
The complexity of your loop will be one order higher (unless you have
something even less efficient somewhere else in the loop).

The STL containers usually do not provide operator[] unless it has low
complexity, usually O(1). So as long as you do not have an expensive
operator[] of your own, your loop will likely end up at O(n), which is
the minimum for a pass over the entire container content. The performance
impact is minimal in most cases.

However, if your operator[] has O(log(n)) (like SGI's rope
implementation) or O(n) like a linked list, you will have O(n*log(n)) or
O(n²) respectively, which is bad.


Marcel
 
Jerry Coffin
06-10-2008
In article <g2lsvb$fv2$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-halle.de says...
> Konrad Mühler wrote:
> > I've a list of objects. I iterate the list and read the value of each


[ ... ]

> That depends on the compiler. Newer versions of gcc will
> optimize the difference away, whereas older ones
> (and maybe Visual Studio) will not.


This depends heavily upon interpretation. Some people only use "list" to
mean "linked list". In that case, I find it difficult to believe that a
compiler is capable of optimizing out the differences between subscript-
style notation and an explicit iterator (a pointer, in this case).

Other people use "list" to mean almost any sort of linear data
structure. In this case, (e.g. using an array) optimizing away the
difference between array-style and pointer-style access has been routine
among nearly all compilers for years -- all versions of "Visual Studio"
(and versions of Microsoft C++ and Microsoft C since long before they
started using the "Visual" brand) have known how to do this kind of
optimization.

Likewise, while there may have been _some_ early version of gcc that
couldn't do it, I doubt that version was ever in wide use -- I'm pretty
sure by the time it was labeled "version 1.0", it could do this.

> try it:


[ code using dynamically allocated array elided ... ]

Just keep in mind that this may not respond to the question being asked
by the OP -- it all depends on what he means by "list".

--
Later,
Jerry.

The universe is a figment of its own imagination.
 
Mirco Wahab
06-10-2008
Jerry Coffin wrote:
> In article <g2lsvb$fv2$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-
> halle.de says...
>> That depends on the compiler. Newer versions of gcc will
>> optimize the difference away, whereas older ones
>> (and maybe Visual Studio) will not.

>
> This depends heavily upon interpretation. Some people only use "list" to
> mean "linked list". In that case, I find it difficult to believe that a
> compiler is capable of optimizing out the differences between subscript-
> style notation and an explicit iterator (a pointer, in this case).


Right. And this was my interpretation of the OP's list.
How does this:

>>> x = myList[i].value1 + myList[i].value2


look in your eyes?

> Other people use "list" to mean almost any sort of linear data
> structure. In this case, (e.g. using an array) optimizing away the
> difference between array-style and pointer-style access has been routine
> among nearly all compilers for years -- all versions of "Visual Studio"
> (and versions of Microsoft C++ and Microsoft C since long before they
> started using the "Visual" brand) have known how to do this kind of
> optimization.


Not entirely true. The code posted by me *will* show significant
differences for pointer vs. array formulation between gcc 3.x and
4.x *and* between VC++6 and VC++9. I tested it.

> [ code using dynamically allocated array elided ... ]
> Just keep in mind that this may not respond to the question being asked
> by the OP -- it all depends on what he means by "list".


OK, I'm not absolutely sure (but neither are you).

From interpreting the OP's code, this seems (at least
to me) a valid interpretation.

Regards & Thanks

Mirco
 
Jerry Coffin
06-10-2008
In article <g2mknj$md3$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-
halle.de says...

[ ... ]

> Right. And this was my interpretation of the OP's list.
> How does this:
>
> >>> x = myList[i].value1 + myList[i].value2

>
> look in your eyes?


This looks like it requires operator[] to be valid for whatever sort of
things myList is -- but that doesn't tell us a thing. Overloading
operator[] for a linked list is trivial.

> > Other people use "list" to mean almost any sort of linear data
> > structure. In this case, (e.g. using an array) optimizing away the
> > difference between array-style and pointer-style access has been routine
> > among nearly all compilers for years -- all versions of "Visual Studio"
> > (and versions of Microsoft C++ and Microsoft C since long before they
> > started using the "Visual" brand) have known how to do this kind of
> > optimization.

>
> Not entirely true. The code posted by me *will* show significant
> differences for pointer vs. array formulation between gcc 3.x and
> 4.x *and* between VC++6 and VC++9. I tested it


I glanced through your code, and frankly I wasn't much impressed. First
of all, what you timed was substantially different from what the OP
asked about. Second, you called the functions through a pointer
(repeatedly) and included the time to call the function via a pointer in
the overall time for the function. Third, you included the first run
through the data in the time for executing the code -- on the first run,
the data will frequently be in main memory, but on the second and
subsequent runs, it's likely to be in the cache, so the first run will
often be substantially slower than the second and subsequent runs.

Here's a bit of code that gets rid of at least those specific problems:

#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
#include <vector>
#include <algorithm>

struct listitem { int value1, value2; };

listitem gen_rand() {
    listitem ret;
    ret.value1 = rand();
    ret.value2 = rand();
    return ret;
}

int test_array(listitem const *list, size_t nitems)
{
    int x = 0;

    for (size_t i = 0; i < nitems; ++i)
        x += list[i].value1 + list[i].value2;
    return x;
}

int test_pointer(listitem const *list, size_t nitems)
{
    int x = 0;
    for (listitem const *pi = list; pi < list + nitems; ++pi)
        x += pi->value1 + pi->value2;
    return x;
}

int main(int argc, char *argv[]) {
    const int rep_count = 20;
    const int size = 2000000;

    static listitem list[size];

    size_t i;

    for (i = 0; i < size; i++)
        list[i] = gen_rand();

    int val1 = test_array(list, size); // prime the cache.
    int val2 = test_pointer(list, size);

    clock_t time1 = 0;

    // do the real tests
    for (i = 0; i < rep_count; i++) {
        clock_t start1 = clock();
        val1 = test_array(list, size);
        clock_t stop1 = clock();
        time1 += stop1 - start1;
    }

    clock_t time2 = 0;

    for (i = 0; i < rep_count; i++) {
        clock_t start2 = clock();
        val2 = test_pointer(list, size);
        clock_t stop2 = clock();
        time2 += stop2 - start2;
    }

    std::cout << "Array value: " << val1;
    std::cout << "\nPointer value: " << val2;

    std::cout << "\nArray Time: " << time1;
    std::cout << "\nPointer Time : " << time2;

    return 0;
}

Compiled with VC++ 6.0, using the default "Release" settings for a
console application, the results I get from a few runs of this are:

C:\C\source\Release>test_ptr
Array value: 1141859694
Pointer value: 1141859694
Array Time: 110
Pointer Time : 93
C:\C\source\Release>test_ptr
Array value: 1141859694
Pointer value: 1141859694
Array Time: 94
Pointer Time : 109
C:\C\source\Release>test_ptr
Array value: 1141859694
Pointer value: 1141859694
Array Time: 109
Pointer Time : 94
C:\C\source\Release>test_ptr
Array value: 1141859694
Pointer value: 1141859694
Array Time: 94
Pointer Time : 109

The only differences we're seeing are of a single clock tick, and
there's no apparent bias in either direction. For all practical
purposes the two are identical in speed. Looking at the assembly
language produced for each shows that it's not just for all practical
purposes:

test_array:
; _list$ = ecx
; _nitems$ = edx
; Line 18
push esi
; Line 19
xor eax, eax
; Line 21
xor esi, esi
test edx, edx
jbe SHORT $L10248
push edi
npad 6
$L10246:
; Line 22
mov edi, DWORD PTR [ecx+esi*8+4]
add edi, DWORD PTR [ecx+esi*8]
add esi, 1
add eax, edi
cmp esi, edx
jb SHORT $L10246
pop edi
$L10248:
pop esi
; Line 24
ret 0

test_pointer:
; _list$ = ecx
; _nitems$ = edx
; Line 29
lea edx, DWORD PTR [ecx+edx*8]
xor eax, eax
cmp ecx, edx
jae SHORT $L10257
push esi
npad 6
$L10255:
; Line 30
mov esi, DWORD PTR [ecx+4]
add esi, DWORD PTR [ecx]
add ecx, 8
add eax, esi
cmp ecx, edx
jb SHORT $L10255
pop esi
$L10257:
; Line 32
ret 0

I also did a quick test by stripping the source code down to the parts
we care about:

struct listitem { int value1, value2; };

int test_array(listitem const *list, int nitems)
{
    int x = 0;
    int i;
    for (i = 0; i < nitems; ++i)
        x += list[i].value1 + list[i].value2;
    return x;
}

int test_pointer(listitem const *list, int nitems)
{
    int x = 0;
    listitem const *pi;
    for (pi = list; pi < list + nitems; ++pi)
        x += pi->value1 + pi->value2;
    return x;
}

and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
Visual C++ 1.0). Here's the assembly language it produces for the two
inner loops:

test_array:
$F344:
; Line 8
mov ax,WORD PTR [bx-2]
add ax,WORD PTR [bx]
add cx,ax
add bx,4
dec dx
jne $F344

test_pointer:
$F355:
; Line 17
mov ax,WORD PTR [si+2]
add ax,WORD PTR [si]
add dx,ax
add si,4
cmp WORD PTR [bp-4],si ;list
ja $F355

In this case, there may be a small difference between the two -- but
it's the array version that's faster. Even on a machine from that time
frame, the difference would be too small to care much about though --
given the memory constraints of MS-DOS, the time difference for the
largest array you could create would still have been less than a
millisecond.

--
Later,
Jerry.

The universe is a figment of its own imagination.
 
Mirco Wahab
06-11-2008
Hi Jerry,

thanks for your meaningful response.

Jerry Coffin wrote:
> In article <g2mknj$md3$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-
> halle.de says...
>> How does this:
>>
>>>>> x = myList[i].value1 + myList[i].value2

>> look in your eyes?

>
> This looks like it requires operator[] to be valid for whatever sort of
> things myList is -- but that doesn't tell us a thing. Overloading
> operator[] for a linked list is trivial.


OK

>>> [...] optimizing away the
>>> difference between array-style and pointer-style access has been routine
>>> among nearly all compilers for years -- all versions of "Visual Studio"
>>> (and versions of Microsoft C++ and Microsoft C since long before they
>>> started using the "Visual" brand) have known how to do this kind of
>>> optimization.

>> Not entirely true. The code posted by me *will* show significant
>> differences for pointer vs. array formulation between gcc 3.x and
>> 4.x *and* between VC++6 and VC++9. I tested it

>
> I glanced through your code, and frankly I wasn't much impressed.


What the ...

> First of all, what you timed was substantially different from what the OP
> asked about. Second, you called the functions through a pointer
> (repeatedly) and included the time to call the function via a pointer in
> the overall time for the function. Third, you included the first run
> through the data in the time for executing the code -- on the first run,
> the data will frequently be in main memory, but on the second and
> subsequent runs, it's likely to be in the cache, so the first run will
> often be substantially slower than the second and subsequent runs.


All three points are somehow unfounded, imho.
1) What was timed was simply interleaved sequential access to
a block of data that is not too trivial with regard to L2
access on the underlying architecture. Taking out one
"dimension", as you did, also makes things too simple
for any optimizing compiler and wouldn't support my view ... ,
2) calling through the pointer was done 2x per test scenario, which
has to be compared with the NxN loop calculation in the function;
that leaves us at an error ratio of 2 / (20,000 x 20,000) (only) if
we assume the indirect call takes as many cycles as the inner-loop
calculation (which it obviously doesn't),
3) this doesn't really count because the data block initialization is
done immediately before the first test function invocation. BTW,
due to the "indirect call dispatcher design", this was easy to
add (and I added it - it didn't change any results, see enclosed source)

> Here's a bit of code that gets rid of at least those specific problems:
> #include <cstdlib>
> [snipped]
> ...
> struct listitem { int value1, value2; };
> ...

Why not include some other data item in between?

struct listitem { int value1, dummy, value2; };

> int test_array(listitem const *list, size_t nitems)
> {
> int x=0;
>
> for(size_t i=0; i<nitems; ++i)
> x += list[i].value1 + list[i].value2;
> return x;
> }


This is just a sequential block access which can
be translated to assembler using only one index
register, one counter, and one accumulator.

> [snipped]
> std::cout << "Array value: " << val1;
> std::cout << "\nPointer value: " << val2;
>
> std::cout << "\nArray Time: " << time1;
> std::cout << "\nPointer Time : " << time2;


On any of my machines, your code (using your
parameters) runs through in almost no time.
But I see your point. With such simple
constructs, almost any compiler can deduce
the optimal access sequence and can easily
avoid repeated index offset multiplications.

> Compiled with VC++ 6.0, using the default "Release" settings for a
> console application, the results I get from a few runs of this are:
> ...
> [snipped]


ohh, this is nice:

> test_array:
> ... $L10246:
> mov edi, DWORD PTR [ecx+esi*8+4]
> add edi, DWORD PTR [ecx+esi*8]
> add esi, 1
> add eax, edi
> cmp esi, edx
> jb SHORT $L10246
> ...


and

> test_pointer:
> ... $L10255:
> mov esi, DWORD PTR [ecx+4]
> add esi, DWORD PTR [ecx]
> add ecx, 8
> add eax, esi
> cmp ecx, edx
> jb SHORT $L10255
> ...


This makes my point absolutely clear: the
array access is *not* optimized away by
your compiler (maybe this is the asm from the
"debug" settings?). If the above is really
faster on a machine, that's a matter of
the instruction decoding/reordering or
whatever.

> I also did a quick test by stripping the source code down to the parts
> we care about:
> [simplified code snipped]
> ...
> and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
> Visual C++ 1.0). Here's the assembly language it produces for the two
> inner loops:


OK

> test_array:
> $F344:
> ; Line 8
> mov ax,WORD PTR [bx-2]
> add ax,WORD PTR [bx]
> add cx,ax
> add bx,4
> dec dx
> jne $F344
>
> test_pointer:
> $F355:
> ; Line 17
> mov ax,WORD PTR [si+2]
> add ax,WORD PTR [si]
> add dx,ax
> add si,4
> cmp WORD PTR [bp-4],si ;list
> ja $F355


Yes, that's simply using the adjacent positions
of struct members:

mov ax, WORD PTR [bx-2]
add ax, WORD PTR [bx]

and advancing the pointer into the data block
by the struct size (16-bit ints?)

add bx,4

The compiler obviously did convert the
"array access" into a "pointer access".

> In this case, there may be a small difference between the two -- but
> it's the array version that's faster. Even on a machine from that time
> frame, the difference would be too small to care much about though --
> given the memory constraints of MS-DOS, the time difference for the
> largest array you could create would still have been less than a
> millisecond.


Lets see what the new gcc 4.3.1 thinks about the inner loop
if full optimization is set (g++-4.3 -O3 -dA -c stuff.cxx -S):

[array]
.L6:
# basic block 3
addl 4(%ecx,%edx,8), %eax # displacement(base, index, scale)
addl (%ecx,%edx,8), %eax
addl $1, %edx
cmpl %edx, %ebx
jg .L6

[pointer]
.L14:
# basic block 3
addl 4(%edx), %eax
addl (%edx), %eax
addl $8, %edx
cmpl %ecx, %edx
jb .L14

What do we see here? The "array-loop" uses a different
addressing scheme (indirect register w/base and displacement)
compared to the "pointer-loop". This proves my point again,
the pointer loop will be optimized into simpler and
shorter machine instructions. Due to modern processor
architectures and instruction translation/reordering,
one can't deduce from the above code which is faster -
but this might *possibly* pop up on some architectures.

I modified my code and reran systematic comparisons
on some boxes:

--- 8< -----------------------------------------------------------

[C2Q-9300/3.33GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.1 sec pointer: 0.91 sec arr/ptr: 120.879%

[C2Q-9300/3.33GHz icc-10.1 -fast]
array: 1.1 sec pointer: 1.09 sec arr/ptr: 100.917%

[C2Q-6600/3.15GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.09 sec pointer: 0.91 sec arr/ptr: 119.78%

[C2Q-6600/3.15GHz icc-10.1 -fast]
array: 1.1 sec pointer: 1.11 sec arr/ptr: 99.0991%

[P4/2.6GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.61 sec pointer: 1.74 sec arr/ptr: 92.5287%

[P4/2.6GHz icc 10.1 -O3 -xN]
array: 1.97 sec pointer: 1.97 sec arr/ptr: 100%

[Athlon64/2.3GHz VC++6 /O2 /Zp4]
array: 1.875 sec pointer: 1.922 sec arr/ptr: 97.5546%

[Athlon64/2.3GHz VC++9 /O2 /Oy /Ot /Zp4]
array: 2.376 sec pointer: 1.89 sec arr/ptr: 125.714%

[Athlon64/2.3GHz gcc-mingw-3.4.2 -O3 -march=pentiumpro]
array: 3.656 sec pointer: 3.297 sec arr/ptr: 110.889%

--- 8< -----------------------------------------------------------

(using the code in the appendix). The *same* program shows
different behaviour, e.g. on P4 vs. Core2Q, obviously
due to the different chip internals ...

Regards

Mirco

code, modified: ==>

#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>

struct listitem { int value1, dummy, value2; };

double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int n)
{
    std::cout << s << ' ' << std::flush; // print some progress message ...
    clock_t t = clock();  // the indirect call does not contribute to run-time,
    int x = p(list, n);   // it's just a dispatcher (and prevents inlining)
    return double(clock() - t + x) / CLOCKS_PER_SEC; // fool optimizer! (x == zero)
}

int test_array(listitem *list, int nitems)
{
    int x = 0;
    for (int j = 0; j < nitems; ++j)
        for (int i = 0; i < nitems; ++i)
            x += (list[i].value1 - list[j].value1) + (list[i].value2 - list[j].value2);
    return x;
}

int test_pointer(listitem *list, int nitems)
{
    int x = 0;
    for (listitem *pj = list; pj < list + nitems; ++pj)
        for (listitem *pi = list; pi < list + nitems; ++pi)
            x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
    return x;
}

int main(int argc, char *argv[])
{
    int nitems = argc > 1 ? atoi(argv[1]) : 22222;
    listitem *list = new listitem[nitems];
    memset(list, 0x1A, sizeof(listitem) * nitems);
    timethis("warmup", test_array, list, nitems);

    double ta = 0, tp = 0;
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);

    std::cout << "length: " << nitems << std::endl
              << "array: " << ta << " sec" << '\t'
              << "pointer: " << tp << " sec" << '\t'
              << "arr/ptr: " << (ta/tp)*100. << "%" << std::endl;

    delete [] list;
    return 0;
}
<==
 
Jerry Coffin
06-11-2008
In article <g2o6p6$14gf$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-
halle.de says...

[ ... ]

> All three points are somehow unfounded, imho.
> 1) What was timed was simply interleaved sequential access to
> a block of data that is not too trivial with regard to L2
> access on the underlying architecture. Taking out one
> "dimension", as you did, also makes things too simple
> for any optimizing compiler and wouldn't support my view ... ,


The OP asked a fairly specific question, including a definition of the
structure of interest. Adding more to that structure simply because the
one he cares about doesn't "support your view" seems disingenuous at
best.

> 2) calling through the pointer was done 2x per test scenario, which
> has to be compared with the NxN loop calculation in the function;
> that leaves us at an error ratio of 2 / (20,000 x 20,000) (only) if
> we assume the indirect call takes as many cycles as the inner-loop
> calculation (which it obviously doesn't),


The problem isn't so much the indirection itself as the fact
that when making an indirect call, the compiler often can't (or at least
doesn't) carry out optimizations that it would carry out otherwise. If
the function would normally be called indirectly, that's fine and
wonderful -- but there's no indication that would be the case here. As
such, the testing method may be having a substantial effect on the
result.

> 3) this doesn't really count because the data block initialization is
> done immediately before the first test function invocation. BTW,
> due to the "indirect call dispatcher design", this was easy to
> add (and I added it - it didn't change any results, see enclosed source)
>
> > Here's a bit of code that gets rid of at least those specific problems:
> > #include <cstdlib>
> > [snipped]
> > ...
> > struct listitem { int value1, value2; };
> > ...

> Why not to include some other data item in between?


Because the OP specified exactly this structure as being what he cares
about.

> struct listitem { int value1, dummy, value2; };
>
> > int test_array(listitem const *list, size_t nitems)
> > {
> > int x=0;
> >
> > for(size_t i=0; i<nitems; ++i)
> > x += list[i].value1 + list[i].value2;
> > return x;
> > }

>
> This is just a sequential block access which can
> be translated to assembler using only one index
> register, one counter, and one accumulator.


Sure -- and it corresponds directly to what the OP said he cared about.

[ ... ]

> This makes my point absolutely clear: the
> array access is *not* optimized away by
> your compiler (maybe this is the asm from the
> "debug" settings?). If the above is really
> faster on a machine, that's a matter of
> the instruction decoding/reordering or
> whatever.


Yes and no -- I decided it was getting too long, but I probably should have
left in a more detailed analysis of the code. The fact is that yes, there is
a difference in the generated code. The other fact is that there's not
even a theoretical difference in speed. The parts that differ come down to
adding one each iteration and scaling it by 8, versus adding 8 each
iteration. The addition is the same speed either way. The scaling by 8
doesn't change the amount of time either. On the original 8088/8086, there
would have been a difference -- calculating an effective address took a
variable amount of time, and the more complex the calculation used, the
longer it took. On those processors, you really would have gained something
by transforming one form into the other -- and compilers at the time did.

On everything since (at least) the 286, there has been an effective-address
unit that can carry out the calculation in constant time. That being the
case, there's no difference in speed based on the complexity of the
effective address calculation -- addressing like [esi], [esi+ebx], and
[esi+ebx*8] all take _exactly_ the same amount of time to generate on
every reasonably modern processor.

[ ... ]

> mov ax, WORD PTR [bx-2]
> add ax, WORD PTR [bx]
>
> and advancing the pointer into the data block
> by the struct size (16 bit ints?)


Yes -- I said it was old! It generated code for MS-DOS and 16-bit
Windows.

>
> add bx,4
>
> Thie compiler obviously did convert the
> "array access" nto a "pointer access".


Quite true -- back when processors were sufficiently primitive that it
made a difference, compilers did the optimization. Nowadays, they don't,
because it's an ineffective waste of time.

[ ... ]

> What do we see here? The "array-loop" uses a different
> addressing scheme (indirect register w/base and displacement)
> compared to the "pointer-loop". This proves my point again,
> the pointer loop will be optimized into simpler and
> shorter machine instructions. Due to modern processor
> architectures and instruction translation/reordering,
> one can't deduce from the above code which is faster -
> but this might *possibly* pop up on some architectures.


Pardon my being blunt, but "Nonsense". People who write compilers have
known about how to do this transformation for _years_. The reason they
don't do it (at least on the x86) is simple: it's no longer effective.

--
Later,
Jerry.

The universe is a figment of its own imagination.
 
Mirco Wahab
06-11-2008
Jerry Coffin wrote:
> In article <g2o6p6$14gf$(E-Mail Removed)-freiberg.de>, (E-Mail Removed)-
> halle.de says...
>> All three points are somehow unfounded, imho.

>
> The OP asked a fairly specific question, including a definition of the
> structure of interest. Adding more to that structure simply because the
> one he cares about doesn't "support your view" seems disingenuous at
> best.


No, he didn't. It was simply an abstract question regarding the
efficiency of array vs. pointer access into data aggregates,
possibly simple structs (we don't even know). How can you
say: /a fairly specific question/?

Also, he didn't give any /definition of the structure of interest/
anywhere. He used a generalized idiom that wouldn't even compile.
What's your point in making such statements?

> The problem isn't entirely with the indirection itself, as with the fact
> that when making an indirect call, the compiler often can't (or at least
> doesn't) carry out optimizations that it would carry out otherwise. If
> the function would normally be called indirectly, that's fine and
> wonderful -- but there's no indication that would be the case here. As
> such, the testing method may be having a substantial effect on the
> result.


That's entirely the point. The compiler *shouldn't* do anything
that we don't want it to do. Why would we create a test case
that probably destroys our interesting part through unpredictable
*global* optimizations? So the indirect call served exactly
the purpose I intended. Fine.

>> Why not include some other data item in between?

>
> Because the OP specified exactly this structure as being what he cares
> about.


No, he used some symbolic code in order to make his point.
He even added /and so on/ (whatever that means).

>> This is just a sequential block access which can
>> be translated to assembler using only one index
>> register, one counter, and one accumulator.

>
> Sure -- and it corresponds directly to what the OP said he cared about.


Yes and no

>> This makes my point absolutely clear: the
>> array access is *not* optimized away by
>> your compiler (maybe this is the asm from the
>> "debug" settings?). If the above is really
>> faster on a machine, that's a matter of
>> the instruction decoding/reordering or
>> whatever.

>
> Yes and no -- I decided it was getting too long, but I probably should have
> left in a more detailed analysis of the code. The fact is that yes, there is
> a difference in the generated code. The other fact is that there's not
> even a theoretical difference in speed. The parts that differ come down to
> adding one each iteration and scaling it by 8, versus adding 8 each
> iteration. The addition is the same speed either way. The scaling by 8
> doesn't change the amount of time either.


That's not really conclusive to me. As you probably know, the scaled
index variant (mov edi, DWORD PTR [ecx+esi*8+4]) uses at least one byte
more of opcode length than the unscaled one (mov esi, DWORD PTR [ecx+4]),
according to the Intel instruction set manual. So your assertion
may well be wrong.

> On the original 8088/8086, there would have been a difference -- calculating an
> effective address took a variable amount of time, and the more complex the
> calculation used, the longer it took. On those processors, you really
> would have gained something by transforming one form into the other -- and
> compilers at the time did.
> On everything since (at least) the 286, there has been an effective-address
> unit that can carry out the calculation in constant time.


I'm not exactly sure that you are correct here, but I dimly
remember that scaled address generation was expensive
up until the Pentium or even the P6 (frequent AGI stalls on the P5?).

> That being the case, there's no difference in speed based on the complexity of the
> effective address calculation -- addressing like [esi], [esi+ebx],
> [esi+ebx*8] all take _exactly_ the same amount of time to generate on
> every reasonably modern processor.


You have all sorts of issues here: pipelining, instruction size
differences, used-up registers (modifications under way in the
pipeline) and so on. /all take _exactly_ the same amount of time/
is nothing I would bet my hand on.

>> What do we see here? The "array-loop" uses a different
>> addressing scheme (indirect register w/base and displacement)
>> compared to the "pointer-loop". This proves my point again,
>> the pointer loop will be optimized into simpler and
>> shorter machine instructions. Due to modern processor
>> architectures and instruction translation/reordering,
>> one can't deduce from the above code which is faster -
>> but this might *possibly* pop up on some architectures.

>
> Pardon my being blunt, but "Nonsense". People who write compilers have
> known about how to do this transformation for _years_. The reason they
> don't do it (at least on the x86) is simple: it's no longer effective.


Nice argument. As you know, there are several implementations of
the x86 instruction set, and from running the programs posted
earlier I can see a 10-25% difference in effectiveness between
addressing modes with the same program on different CPUs.
And this *has* to do with the OP's question: will there
be /bad performance/ or not.

What exactly is the *Nonsense* you found in your critique
of my paragraph above?

Regards & Thanks

Mirco
 