Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Array optimizing problem in C++?

Reply
Thread Tools

Array optimizing problem in C++?

 
 
Razii
Guest
Posts: n/a
 
      03-23-2008

From an old post by James Kanze

On Apr 9 2003, 5:42 pm, ka...@gabi-soft.de (James Kanze) wrote:

>When I pass an "array" to a function in C/C++, I actually pass
>a pointer to the first element. And the compiler, when it compiles the
>function, only sees pointers -- it has no way of determining any
>relationship between them. Consider, for example, a smoothing function
>(in C or earlier C++):
>
> void
> smooth( double* dest,
> double const* src,
> size_t len )
> {
> for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> }
> }
>
>
>The compiler cannot arrange to use the src[ i + 1 ] value from the
>preceding pass through the loop as the src[ i ] value in the current
>pass, since the write to dest[ i - 1 ] might have changed it. In Java,
>it can, since two arrays are either identical or disjoint.
>
>
>This sort of code shows up frequently. In benchmarks from Java
>suppliers comparing Java to C++, of course. But also in any number
>of applications dealing with physical measures: meteorology, geophysical
>research (oil prospection), etc.


Out of curiosity, I tried to test if the above is true. It didn't make
any difference. In fact, C++ was a bit faster (not by much, just 6%).
Probably due to array bound check in Java, if there is in indeed an
issue with C++ arrays, overall there is no difference.



==c++==
#include <iostream>
#include <cstdlib>
#include <ctime>

void fill (double* src, int len);
void smooth (double* dest,
double const* src,
int len );

int main()
{
const int len = 50000;

double src_array [len] = {0};
double dest_array [len] = {0};


fill(src_array, len);

clock_t start=clock();

for (int i = 0; i < 10000; i++)
smooth (dest_array, src_array, len);


clock_t endt=clock();

std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";

// doesn't work without the following cout on vc++
std::cout << dest_array [0] ;

return 0;
}


void smooth (double* dest, double const* src, int len )
{
for (int i = 1 ; i < len - 1 ; i++ ) {
dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;

}
}

void fill (double* src, int len)
{
srand((unsigned)std::time(0));

for (int i = 0 ; i < len ; ++ i ) {
src[i] = rand();
}
}




==== java ========

import java.util.*;

public class Arrays{
public static void main (String[] arg)
{
final int LEN = 50000;
double[] src_array = new double [LEN];
fill(src_array, LEN);
double[] dest_array = new double [LEN];

long start = System.currentTimeMillis();

for (int i = 0; i < 10000; i++)
smooth(dest_array, src_array, LEN);

long end = System.currentTimeMillis();

System.out.println("Time smooth(): " + (end - start) + " ms");


}


static void smooth (double[] dest, double[] src, int len )
{
for (int i = 1 ; i < len - 1 ; i++ ) {
dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
}


static void fill (double[] src, int len)
{
for (int i = 0 ; i < len; i++)
src[i] = Math.random();
}

}

 
Reply With Quote
 
 
 
 
jason.cipriani@gmail.com
Guest
Posts: n/a
 
      03-23-2008
On Mar 23, 4:08 am, Razii <DONTwhatever...@hotmail.com> wrote:
> From an old post by James Kanze
>
> On Apr 9 2003, 5:42 pm, ka...@gabi-soft.de (James Kanze) wrote:
>
>
>
> >When I pass an "array" to a function in C/C++, I actually pass
> >a pointer to the first element. And the compiler, when it compiles the
> >function, only sees pointers -- it has no way of determining any
> >relationship between them. Consider, for example, a smoothing function
> >(in C or earlier C++):

>
> > void
> > smooth( double* dest,
> > double const* src,
> > size_t len )
> > {
> > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> > }
> > }

>
> >The compiler cannot arrange to use the src[ i + 1 ] value from the
> >preceding pass through the loop as the src[ i ] value in the current
> >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
> >it can, since two arrays are either identical or disjoint.

>
> >This sort of code shows up frequently. In benchmarks from Java
> >suppliers comparing Java to C++, of course. But also in any number
> >of applications dealing with physical measures: meteorology, geophysical
> >research (oil prospection), etc.

>
> Out of curiosity, I tried to test if the above is true. It didn't make
> any difference. In fact, C++ was a bit faster (not by much, just 6%).
> Probably due to array bound check in Java, if there is in indeed an
> issue with C++ arrays, overall there is no difference.


If you are going for blinding speed be sure to use proper optimization
flags; otherwise you are only calculating benchmarks for poorly
optimized code, which, like most of your benchmarks, is meaningless:

$ g++ -O0 smooth.cpp
9884.867 ms

$ g++ -O3 smooth.cpp
8791.123 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
1207.944 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-
loops smooth.cpp
1084.385 ms

This is your code slightly modified to use QueryPerformanceCounter for
timing (using Windows). Most of the speed up is a result of -ffast-
math.

Based on the quality and rigor of the other tests I've seen you do,
I'd guess that you missed a few optimization flags on the Java side as
well.

Jason
 
Reply With Quote
 
 
 
 
jason.cipriani@gmail.com
Guest
Posts: n/a
 
      03-23-2008
On Mar 23, 4:35 am, "jason.cipri...@gmail.com"
<jason.cipri...@gmail.com> wrote:
> If you are going for blinding speed be sure to use proper optimization
> flags; otherwise you are only calculating benchmarks for poorly
> optimized code, which, like most of your benchmarks, is meaningless:
>
> $ g++ -O0 smooth.cpp
> 9884.867 ms
>
> $ g++ -O3 smooth.cpp
> 8791.123 ms
>
> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
> 1207.944 ms
>
> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-
> loops smooth.cpp
> 1084.385 ms


I left one out:

$ g++ -03 -mfpmath=sse -msse3 smooth.cpp
7379.883 ms

Compare to -O3 without having GCC generate SSE instructions for you.

Jason
 
Reply With Quote
 
jason.cipriani@gmail.com
Guest
Posts: n/a
 
      03-23-2008
Also, wrt James' original post:

> > void
> > smooth( double* dest,
> > double const* src,
> > size_t len )
> > {
> > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> > }
> > }

>
> >The compiler cannot arrange to use the src[ i + 1 ] value from the
> >preceding pass through the loop as the src[ i ] value in the current
> >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
> >it can, since two arrays are either identical or disjoint.


I am not sure what you would expect in either language. I believe that
James had been expecting it to cache [i+1] in an fpu register, and use
that instead of accessing the value the next time through.

As it stands right now, VS at least (I didn't check GCC) generates
assembler instructions that are more equivalent to this:

fpu_register = src[i - 1]
fpu_register += src[i]
fpu_register += src[i + 1]
fpu_register /= 3
dest[i - 1] = fpu_register

Using fld, fadd, fdiv, and fstp on Intel machines. It never loads
src[i + 1] anyways. I have not tested this or done any research, but I
suspect this is still a bit faster than:

fpu_register = src[i - 1]
fpu_register += other_fpu_register
other_fpu_register = src[i + 1]
fpu_register += other_fpu_register
fpu_register /= 3
dest[i - 1] = fpu_register

Which I believe is what the original expected "optimization" was.

I have removed comp.lang.java.programmer from this one, since it is
unrelated to Java.

Jason
 
Reply With Quote
 
Razii
Guest
Posts: n/a
 
      03-23-2008
On Sun, 23 Mar 2008 01:35:06 -0700 (PDT), ""
<> wrote:

>If you are going for blinding speed be sure to use proper optimization
>flags; otherwise you are only calculating benchmarks for poorly
>optimized code, which, like most of your benchmarks, is meaningless:


Where did you get that I am not using proper optimizing flag? I said
nothing about flags in the last post.

>$ g++ -O0 smooth.cpp
>9884.867 ms


I am using vc++

>I'd guess that you missed a few optimization flags on the Java side as
>well.


There are no optimization flags for java compiler or vm. Post proof
and links.
 
Reply With Quote
 
Jon Harrop
Guest
Posts: n/a
 
      03-23-2008
wrote:
> If you are going for blinding speed be sure to use proper optimization
> flags; otherwise you are only calculating benchmarks for poorly
> optimized code, which, like most of your benchmarks, is meaningless:
>
> $ g++ -O0 smooth.cpp
> 9884.867 ms
>
> $ g++ -O3 smooth.cpp
> 8791.123 ms
>
> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
> 1207.944 ms


GCC's -ffast-math option breaks semantics, so it is not a valid
optimization. I would not use it in a comparison against Java unless you
know that accuracy is not required for the given task.

> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-
> loops smooth.cpp
> 1084.385 ms
>
> This is your code slightly modified to use QueryPerformanceCounter for
> timing (using Windows). Most of the speed up is a result of -ffast-
> math.
>
> Based on the quality and rigor of the other tests I've seen you do,
> I'd guess that you missed a few optimization flags on the Java side as
> well.


I have altered all implementations of this benchmark to perform progressive
blurring from one array to the other and back rather than overwriting the
destination array repeatedly. This slightly improved performance in all
cases.

On my Athlon 64 X2 4400+ (2.2Ghz) running AMD64 Debian Linux, I get
best-of-three results:

g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench
g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2
ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench
java: 4.016 javac Arrays.java; time java Arrays

On the same hardware running 32-bit Windows XP Pro, I get:

F#: 3.970 fsc -O3

This shows an insignificant difference in performance which, if nothing
else, is interesting because the second C++ program uses bounds checking. I
was also surprised to see OCaml do so well on this benchmark because it
makes no attempt to hoist bounds checks. F# on .NET typically beats all
open source compilers on Linux on this kind of benchmark (float array
intensive). C# presumably would too.

The bench2 program is the following, more idiomatic, C++ version that uses
bounds checking:

#include <vector>
#include <iostream>
#include <cstdlib>
#include <ctime>

void smooth (std::vector<double> &dest, std::vector<double> &src)
{
for (int i = 1; i < src.size() - 1; i++)
dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3 ;
}

void fill (std::vector<double> &src)
{
srand((unsigned)std::time(0));

for (int i = 0; i < src.size(); ++ i )
src.at(i) = rand();
}

int main()
{
const int len = 50000;

std::vector<double> src_array(len);
std::vector<double> dest_array(len);

fill(src_array);

clock_t start=clock();

for (int i = 0; i < 5000; i++)
{
smooth(dest_array, src_array);
smooth(src_array, dest_array);
}

clock_t endt=clock();

std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";

return 0;
}

Here is my OCaml equivalent:

let smooth dst src =
let n = Array.length dst in
for i=1 to n-2 do
dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3.
done

let () =
let n = 50000 in
let src = Array.init n (fun _ -> Random.float 1.) in
let dst = Array.create n 0. in
let t = Sys.time() in
for i=1 to 5000 do
smooth dst src;
smooth src dst;
done;
Printf.printf "Time smooth(): %fs\n" (Sys.time() -. t)

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u
 
Reply With Quote
 
Jon Harrop
Guest
Posts: n/a
 
      03-23-2008
wrote:
> If you are going for blinding speed be sure to use proper optimization
> flags; otherwise you are only calculating benchmarks for poorly
> optimized code, which, like most of your benchmarks, is meaningless:
>
> $ g++ -O0 smooth.cpp
> 9884.867 ms
>
> $ g++ -O3 smooth.cpp
> 8791.123 ms
>
> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
> 1207.944 ms


GCC's -ffast-math option breaks semantics, so it is not a valid
optimization. I would not use it in a comparison against Java unless you
know that accuracy is not required for the given task.

> $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-
> loops smooth.cpp
> 1084.385 ms
>
> This is your code slightly modified to use QueryPerformanceCounter for
> timing (using Windows). Most of the speed up is a result of -ffast-
> math.
>
> Based on the quality and rigor of the other tests I've seen you do,
> I'd guess that you missed a few optimization flags on the Java side as
> well.


I have altered all implementations of this benchmark to perform progressive
blurring from one array to the other and back rather than overwriting the
destination array repeatedly. This slightly improved performance in all
cases.

On my Athlon 64 X2 4400+ (2.2Ghz) running AMD64 Debian Linux, I get
best-of-three results:

g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench
g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2
ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench
java: 4.016 javac Arrays.java; time java Arrays

On the same hardware running 32-bit Windows XP Pro, I get:

F#: 3.970 fsc -O3

This shows an insignificant difference in performance which, if nothing
else, is interesting because the second C++ program uses bounds checking. I
was also surprised to see OCaml do so well on this benchmark because it
makes no attempt to hoist bounds checks. F# on .NET typically beats all
open source compilers on Linux on this kind of benchmark (float array
intensive). C# presumably would too.

The bench2 program is the following, more idiomatic, C++ version that uses
bounds checking:

#include <vector>
#include <iostream>
#include <cstdlib>
#include <ctime>

void smooth (std::vector<double> &dest, std::vector<double> &src)
{
for (int i = 1; i < src.size() - 1; i++)
dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3 ;
}

void fill (std::vector<double> &src)
{
srand((unsigned)std::time(0));

for (int i = 0; i < src.size(); ++ i )
src.at(i) = rand();
}

int main()
{
const int len = 50000;

std::vector<double> src_array(len);
std::vector<double> dest_array(len);

fill(src_array);

clock_t start=clock();

for (int i = 0; i < 5000; i++)
{
smooth(dest_array, src_array);
smooth(src_array, dest_array);
}

clock_t endt=clock();

std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";

return 0;
}

Here is my OCaml equivalent:

let smooth dst src =
let n = Array.length dst in
for i=1 to n-2 do
dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3.
done

let () =
let n = 50000 in
let src = Array.init n (fun _ -> Random.float 1.) in
let dst = Array.create n 0. in
let t = Sys.time() in
for i=1 to 5000 do
smooth dst src;
smooth src dst;
done;
Printf.printf "Time smooth(): %fs\n" (Sys.time() -. t)

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u
 
Reply With Quote
 
red floyd
Guest
Posts: n/a
 
      03-23-2008
Razii wrote:
> On Sun, 23 Mar 2008 01:35:06 -0700 (PDT), ""
> <> wrote:
>
>> If you are going for blinding speed be sure to use proper optimization
>> flags; otherwise you are only calculating benchmarks for poorly
>> optimized code, which, like most of your benchmarks, is meaningless:

>
> Where did you get that I am not using proper optimizing flag? I said
> nothing about flags in the last post.
>
>> $ g++ -O0 smooth.cpp
>> 9884.867 ms

>
> I am using vc++
>


Well, then, you can't make sweeping statements about C++ vs. Java. You
can only make statements about *a particular implementation* of C++ vs.
Java.
 
Reply With Quote
 
Lew
Guest
Posts: n/a
 
      03-23-2008
red floyd wrote:
> Well, then, you can't make sweeping statements about C++ vs. Java. You
> can only make statements about *a particular implementation* of C++ vs.
> Java.


This is one of the procedural flaws common to all benchmarks. Generally
speaking, one can only hope for full disclosure of the conditions for a
benchmark, and that those conditions approximate some meaningful aspect of
one's actual business need.

This does not mean that benchmarks are useless, only that one cannot decry
them for limitations inherent to the benchmark process itself.

--
Lew
 
Reply With Quote
 
Razii
Guest
Posts: n/a
 
      03-23-2008
On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <> wrote:

>Well, then, you can't make sweeping statements about C++ vs. Java. You
>can only make statements about *a particular implementation* of C++ vs.
>Java.


I used the proper flag /O2 in vc++. Also, when you are deploying a
commercial software, you will have to use flags that target the
least-common-denominator processor. That's a disadvantage of C++ vs
JIT language. The JIT compiler knows what processor it is running on,
and can generate code specifically for that processor. Thus, I won't
use anything other than /O2 for c++... because, as I said, when you
are deploying a commercial software, you will have to use flags that
target the least-common-denominator processor anyway.


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Array optimizing problem in C++? Razii Java 114 04-16-2008 07:55 PM
Optimizing ruby constant array data Mike Austin Ruby 4 03-11-2006 12:51 PM
Firefox optimizing websites (help) Tiernan Firefox 11 03-07-2005 12:14 AM
much optimizing = bad code ? (array/vector) Hagen C++ 8 03-03-2005 12:49 AM
Optimizing array accesses Clemens Lode C++ 2 10-12-2003 01:59 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57