Array optimizing problem in C++?

 
 
courpron@gmail.com
 
      03-23-2008
On Mar 23, 9:08 am, Razii <(E-Mail Removed)> wrote:
> From an old post by James Kanze
>
> On Apr 9 2003, 5:42 pm, (E-Mail Removed) (James Kanze) wrote:
>
>
> >When I pass an "array" to a function in C/C++, I actually pass
> >a pointer to the first element. And the compiler, when it compiles the
> >function, only sees pointers -- it has no way of determining any
> >relationship between them. Consider, for example, a smoothing function
> >(in C or earlier C++):

>
> >    void
> >    smooth( double* dest,
> >            double const* src,
> >            size_t len )
> >    {
> >        for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> >            dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> >        }
> >    }

>
> >The compiler cannot arrange to use the src[ i + 1 ] value from the
> >preceding pass through the loop as the src[ i ] value in the current
> >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
> >it can, since two arrays are either identical or disjoint.

>
> >This sort of code shows up frequently. In benchmarks from Java
> >suppliers comparing Java to C++, of course. But also in any number
> >of applications dealing with physical measures: meteorology, geophysical
> >research (oil prospecting), etc.

>
> Out of curiosity, I tried to test if the above is true. It didn't make
> any difference. In fact, C++ was a bit faster (not by much, just 6%).
> Probably due to array bound checks in Java; if there is indeed an
> issue with C++ arrays, overall there is no difference.


The issue C++ has with arrays is known as pointer aliasing. C has the
keyword "restrict" to deal with this problem, but C++ does not. The
"restrict" keyword should be added in the next C++ standard; in the
meantime, most C++ compilers offer their own variant of it. C++ also
has std::valarray. Arrays constructed with std::valarray are free from
aliasing, so in principle the Java language has no performance edge
over the current version of the C++ language in this domain. However,
even if std::valarray is theoretically free from pointer aliasing, in
practice this isn't always the case, depending on the compiler.
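
To illustrate both options (a sketch only; __restrict is the
non-standard spelling accepted by g++ and Visual C++, and the function
names here are mine):

#include <cstddef>
#include <valarray>

// Sketch 1: the smooth function with a compiler-specific no-alias
// qualifier. Not standard C++; g++ and VC++ both accept __restrict.
void smooth_noalias( double* __restrict dest,
                     double const* __restrict src,
                     std::size_t len )
{
    for ( std::size_t i = 1 ; i < len - 1 ; ++ i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }
}

// Sketch 2: the same computation on std::valarray, which the standard
// allows the compiler to treat as alias-free. Assumes src.size() >= 3.
std::valarray<double> smooth_va( std::valarray<double> const& src )
{
    std::valarray<double> dest( src.size() - 2 );
    for ( std::size_t i = 1 ; i < src.size() - 1 ; ++ i )
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    return dest ;
}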

> if there is indeed an issue with C++ arrays, overall there is no difference.


With the right optimization flags, the smooth function would probably
get inlined, and static analysis of pointer aliasing should then take
place, allowing the C++ compiler to perform the same no-aliasing
optimizations.
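
For instance (a made-up call site, not Razii's actual benchmark), once
smooth is inlined here the compiler can see that the two arrays are
distinct objects and cannot alias:

#include <cstddef>
#include <vector>

void smooth( double* dest, double const* src, std::size_t len )
{
    for ( std::size_t i = 1 ; i < len - 1 ; ++ i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }
}

int main()
{
    // Two distinct vectors: after inlining smooth() at this call
    // site, the compiler can prove dest and src never overlap.
    std::vector<double> src( 100000, 1.0 );
    std::vector<double> dest( 100000 );
    smooth( &dest[ 0 ], &src[ 0 ], src.size() );
    return 0;
}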

> Probably due to array bound check in Java


A good Java compiler should have applied bounds-checking elimination
in this case, especially since the smooth function should be inlined.

As a side note, your program should handle ints instead of doubles, to
focus the measurement on the performance of array access.

Alexandre Courpron.
 
jason.cipriani@gmail.com
 
      03-23-2008
On Mar 23, 12:00 pm, Razii <(E-Mail Removed)> wrote:
> On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <(E-Mail Removed)> wrote:
> >Well, then, you can't make sweeping statements about C++ vs. Java. You
> >can only make statements about *a particular implementation* of C++ vs.
> >Java.

>
> I used the proper flag /O2 in vc++. Also, when you are deploying a
> commercial software, you will have to use flags that target the
> least-common-denominator processor. That's a disadvantage of C++ vs
> JIT language.


There are three things that are not correct about this:

1) The -ffast-math flag to GCC does not use any special processor
instructions. It simply relaxes certain floating-point guarantees,
which is safe when you know your code doesn't depend on them. This is
the flag with the major performance increase. Do not confuse "targeting
the least-common-denominator processor" with "not using an ideal
compiler for the application". Whether you compile your commercial code
with GCC or VC++ is irrelevant to what machine you are targeting. If
your goal is to target the least common denominator, then you would
leave off, for example, the SSE optimizations. However, in my example,
it was -ffast-math that was responsible for the largest speedup. Target
platform is not relevant.

2) It's more a consequence of only distributing least-common-
denominator binaries than of using C++. That's a choice you can make.
It would be completely reasonable, for example, for a vendor of high-
performance research applications to distribute SSE-optimized builds.
E.g. Folding@Home (non-commercial, http://folding.stanford.edu)
maintains a GPU-optimized version of their software as well as a
"normal" one; and their non-GPU client checks for various processor
extensions at run-time (such as SSE) and uses optimized code
accordingly. The same goes for 32- vs. 64-bit applications. This is
also common in OpenGL application development, where applications can
look for specific OpenGL features and take advantage of them, rather
than always targeting only the least common denominator of graphics
hardware. The same thing applies to Java wrt OpenGL. Depending on how
you *want* to look at it, an Intel machine with SSE support is just as
different from an Intel machine without SSE as it is from a PowerPC
with AltiVec (same with multicore machines). It just so happens that
you can easily target the least common denominator for those Intel
machines, but nothing stops you from releasing platform-optimized
builds. So it is not a disadvantage of C++ vs. anything; it is a
disadvantage of sticking to the least common denominator.
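
For what it's worth, that kind of run-time dispatch takes only a few
lines. A sketch using the Visual C++ __cpuid intrinsic (the transform_*
functions are invented here purely for illustration):

#include <intrin.h>

// Invented example code paths; real code would put SSE2 intrinsics in
// the first one and plain portable C++ in the second.
void transform_sse2( float* data, int n )    { /* SSE2 path */ }
void transform_generic( float* data, int n ) { /* portable path */ }

void transform( float* data, int n )
{
    int info[ 4 ];
    __cpuid( info, 1 );                            // CPUID function 1: feature bits
    bool has_sse2 = (info[ 3 ] & (1 << 26)) != 0;  // EDX bit 26 = SSE2
    if ( has_sse2 )
        transform_sse2( data, n );
    else
        transform_generic( data, n );
}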

3) A JIT language has an extra layer of abstraction before the
hardware, which makes it difficult to perform these particular kinds of
optimizations anyway. It's apples and oranges: the Java JIT compiler is
a completely different kind of beast from a native-code compiler, and
the optimizations available to it, and the approaches it can take to
making them, are too different from those of a native compiler for a
valid comparison. They're just different tools.

What you are talking about is the difference between targeting the
least common denominator of platforms and targeting specific platforms --
an issue present no matter what language you are using. You are not
talking about a difference between any two specific languages.
Furthermore, what *I* was talking about is a difference between
different compilers targeting the same platform (I am sorry my point
wasn't clear; I had meant to show that the compiler can generate very
fast code for you, in particular -ffast-math [which does not target
specific hardware features] on GCC, more so than the architecture-
specific options).

> The JIT compiler knows what processor it is running on,
> and can generate code specifically for that processor.


True, and a valid point. However, depending on the nature of what you
are compiling, you could easily lose, in the original optimizing pass,
a lot of information about the intent of the code, and that limits how
effective the processor-specific optimizations made by the JIT compiler
can be.

For example (this is a very specific example, but please extend it to
the general case): let's say you have some Java code that, I don't know,
multiplies the elements of two large arrays of floating-point values
together in a loop. You compile it to bytecode, and in the process
javac unrolls that loop to optimize a bit. Then you run it on a
platform that, say, provides a specific instruction for multiplying
large arrays of floating-point values together very quickly. If the
original compiler had targeted this platform to begin with, it could
have seen your access pattern and used that optimized instruction.
Instead it unrolled the loop, a least-common-denominator optimization.
Now the JIT compiler sees the unrolled loop, and because it does not
see the sequential access pattern that was very apparent in the
original code, it fails to analyze the unrolled loop and determine
that it should be "re-rolled" to use the optimized instruction
instead. Now this particular example might not be that great, I know,
because you could conceivably construct a JIT compiler that can
recognize and re-roll unrolled loops; but the general point is what I
said: by compiling code you can lose a lot of information that you
cannot recover, and this information can be critical to performing the
best optimizations off the bat.
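
To make the information loss concrete in C++ terms (an illustration
only, not a claim about what javac actually emits; assumes n is a
multiple of 4 to keep the sketch short):

// Original loop: the sequential access pattern is explicit, so a
// vectorizing compiler can map it onto a hardware multiply directly.
void mul( float const* a, float const* b, float* c, int n )
{
    for ( int i = 0 ; i < n ; ++ i )
        c[ i ] = a[ i ] * b[ i ];
}

// After a 4-way unroll: semantically identical, but a later compiler
// must first re-derive the access pattern from four separate
// statements before it can use a vector instruction.
void mul_unrolled( float const* a, float const* b, float* c, int n )
{
    for ( int i = 0 ; i < n ; i += 4 ) {
        c[ i     ] = a[ i     ] * b[ i     ];
        c[ i + 1 ] = a[ i + 1 ] * b[ i + 1 ];
        c[ i + 2 ] = a[ i + 2 ] * b[ i + 2 ];
        c[ i + 3 ] = a[ i + 3 ] * b[ i + 3 ];
    }
}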

Consequently, the JIT compiler must turn to other techniques for
optimization, such as caching compiled code for faster startup times,
etc. Again, it's a whole different beast.

> Thus, I won't
> use anything other than /O2 for c++... because, as I said, when you
> are deploying a commercial software, you will have to use flags that
> target the least-common-denominator processor anyway.


I addressed that above. Again: 1) some optimizations, like -ffast-math
or -funroll-loops (and be careful, of course; you can't blindly use
these, you have to make sure they make sense for your code), do not
target a specific processor; and 2) there is no rule that says you
have to target the least-common-denominator processor, and a heavy-
duty research application, such as what James Kanze *originally*
cited, would most certainly take advantage of special features of the
target hardware.

Jason
 
jason.cipriani@gmail.com
 
      03-23-2008
On Mar 23, 11:12 am, Razii <(E-Mail Removed)> wrote:
> I said
> nothing about flags in the last post.


As was pointed out in your other thread, these are important details
that you should not leave out, especially in this particular benchmark
(which is not I/O bound like the last one, and is heavily affected by
compiler optimizations).

Jason
 
 
jason.cipriani@gmail.com
 
      03-23-2008
<(E-Mail Removed)> wrote in message
news:0d80b694-8609-4f80-9aeb-(E-Mail Removed)...
> (I am sorry my point
> wasn't clear, I had meant to show that the compiler can generate very
> fast code for you, in particular -ffast-math [which does not target
> specific hardware features] on GCC moreso than the the architecture-
> specific options).


Here are some key missing timings with GCC and VS, Razii:

$ g++ -O2 smooth.cpp
8677.623 ms

$ g++ -O2 -ffast-math smooth.cpp
1077.232 ms

$ g++ -O2 -ffast-math -funroll-loops smooth.cpp
919.622 ms

With CL 14 (VS2005):
7275.611 ms

No platform-specific optimizations were used. Note the order of
magnitude speed up using -ffast-math and -funroll-loops with GCC,
which generates code that can still be run on the least-common-
denominator of Intel-compatible platforms.

Also note that -O2 provides slightly better performance than -O3. Just
goes to show why I shouldn't be using -O3, I guess.

Jason

 
 
jason.cipriani@gmail.com
 
      03-23-2008
On Mar 23, 1:18 pm, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> Also note that -O2 provides slightly better performance than -O3. Just
> goes to show why I shouldn't be using -O3, I guess.


I take that back: I got it down to 903 ms with -O3 over -O2. Perhaps my
machine was just in a bad mood yesterday.

I guess I shouldn't post in this thread for a while, I'm getting a
little uncomfortable with my post ratio here.

Jason
 
 
courpron@gmail.com
 
      03-23-2008
On Mar 23, 5:00 pm, Razii <(E-Mail Removed)> wrote:
> On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <(E-Mail Removed)> wrote:
> >Well, then, you can't make sweeping statements about C++ vs. Java. You
> >can only make statements about *a particular implementation* of C++ vs.
> >Java.

>
> I used the proper flag /O2 in vc++. Also, when you are deploying a
> commercial software, you will have to use flags that target the
> least-common-denominator processor. That's a disadvantage of C++ vs
> JIT language. The JIT compiler knows what processor it is running on,
> and can generate code specifically for that processor. Thus, I won't
> use anything other than /O2 for c++... because, as I said, when you
> are deploying a commercial software, you will have to use flags that
> target the least-common-denominator processor anyway.


You are using Visual C++. You should use at least the following flags,
in order to enable whole-program optimization and link-time code
generation:

cl /O2 /GL prog.cpp /link /ltcg

You should rerun all your benchmarks with at least those options
enabled. By the way, as I said earlier, you should use ints instead of
doubles. It will save you a lot of trouble.

Alexandre Courpron.

 
 
jason.cipriani@gmail.com
 
      03-23-2008
On Mar 23, 1:18 pm, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> With CL 14 (VS2005):
> 7275.611 ms


With /O2 and LTCG enabled!
 
 
James Kanze
 
      03-23-2008
On Mar 23, 09:58, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> Also, wrt James' original post:


> > > void
> > > smooth( double* dest,
> > >         double const* src,
> > >         size_t len )
> > > {
> > >     for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> > >         dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> > >     }
> > > }


> > >The compiler cannot arrange to use the src[ i + 1 ] value
> > >from the preceding pass through the loop as the src[ i ]
> > >value in the current pass, since the write to dest[ i - 1 ]
> > >might have changed it. In Java, it can, since two arrays
> > >are either identical or disjoint.


> I am not sure what you would expect in either language. I
> believe that James had been expecting it to cache [i+1] in an
> fpu register, and use that instead of accessing the value the
> next time through.


Exactly. It's a very common optimization.

> As it stands right now, VS at least (I didn't check GCC)
> generates assembler instructions that are more equivalent to
> this:


> fpu_register = src[i - 1]
> fpu_register += src[i]
> fpu_register += src[i + 1]
> fpu_register /= 3
> dest[i - 1] = fpu_register


> Using fld, fadd, fdiv, and fstp on Intel machines. It never
> loads src[i + 1] anyways. I have not tested this or done any
> research, but I suspect this is still a bit faster than:


> fpu_register = src[i - 1]
> fpu_register += other_fpu_register
> other_fpu_register = src[i + 1]
> fpu_register += other_fpu_register
> fpu_register /= 3
> dest[i - 1] = fpu_register


The fastest solution would pre-charge the first two values, and
only read one new value each time through the loop. (Don't
forget that memory bandwidth will be the limiting factor for
this type of loop.) Intel's stack-based FPU architecture may make
optimizing this somewhat more difficult, but it should still be
possible. You'll end up with more instructions, but fewer memory
accesses and a better run time.

But of course, you can't do this in C++, because the write to
dest[ i - 1 ] might modify one of the values you're holding in a
register. (Of course, some compilers do generate two versions of the
loop, with code which checks for aliasing first, and use one or the
other, depending on whether aliasing is present or not. But
this isn't the usual case.)
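
For illustration, here is roughly what that rotation looks like written
out by hand (my sketch; it is only correct when dest and src do not
overlap, which is precisely what the compiler cannot assume):

#include <cstddef>

// Sketch of the rotation: pre-charge two values, then read only one
// new value from src per iteration. Correct only if dest and src do
// not overlap.
void smooth_rotated( double* dest, double const* src, std::size_t len )
{
    if ( len < 3 )
        return ;
    double a = src[ 0 ] ;
    double b = src[ 1 ] ;
    for ( std::size_t i = 1 ; i < len - 1 ; ++ i ) {
        double c = src[ i + 1 ] ;   // the only read of src this pass
        dest[ i - 1 ] = (a + b + c) / 3 ;
        a = b ;                     // rotate the window
        b = c ;
    }
}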

--
James Kanze (GABI Software) email:(E-Mail Removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
 
Razii
 
      03-23-2008
On Sun, 23 Mar 2008 10:29:44 -0700 (PDT), James Kanze
<(E-Mail Removed)> wrote:

>> I am not sure what you would expect in either language. I
>> believe that James had been expecting it to cache [i+1] in an
>> fpu register, and use that instead of accessing the value the
>> next time through.

>
>Exactly. It's a very common optimization.


Post an example that I can test where aliasing is a problem. In the
example that I posted, C++ was faster.


 
 
Razii
 
      03-23-2008
On Sun, 23 Mar 2008 10:23:07 -0700 (PDT), (E-Mail Removed) wrote:

>You should make again all your benchmarks with at least those options
>enabled


What other benchmarks? I only posted one I/O example. This post was not
a benchmark but a challenge to Kanze to prove his point by posting an
example (that I can test) where aliasing in C++ makes optimizing arrays
hard (or impossible). With the example that he gave (which I used),

for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}

C++ was in fact faster. Obviously aliasing was not an issue in this
case, contrary to what he claimed.

 