Velocity Reviews


jhc0033@gmail.com 06-23-2008 10:36 AM

Intel vs Gnu compiler output quality
 
My experience has generally been that, for CPU-intensive tasks, the
Intel compiler produces code that is about as fast as that produced by
the Gnu compiler.

However, on this simple Shootout entry, Intel seems to be 4.5 times
faster:

http://shootout.alioth.debian.org/gp...lang=icpp&id=3

Any idea why?

Lionel B 06-23-2008 11:16 AM

Re: Intel vs Gnu compiler output quality
 
On Mon, 23 Jun 2008 03:36:58 -0700, jhc0033@gmail.com wrote:

> My experience has generally been that, for CPU-intensive tasks, the
> Intel compiler produces code that is about as fast as that produced by
> the Gnu compiler.
>
> However, on this simple Shootout entry, Intel seems to be 4.5 times
> faster:
>
> http://shootout.alioth.debian.org/gp...lang=icpp&id=3
>
> Any idea why?


Have you profiled the code? My guess would be that the bulk of the CPU
time is spent in the trig functions.

There are a host of possible explanations. Maybe the Intel trig
functions are faster (but do they compute to the same level of accuracy?).
Are the optimization levels really comparable? Does it make a difference
whether IEEE floating-point compliance is enforced (e.g. the GCC
-ffast-math flag can make quite a difference)? Short of analysing the
generated assembler it's probably impossible to say.
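
One crude way to check the trig-function guess without a full profiler is to time the library-call terms and the plain-arithmetic terms separately. The sketch below is only an illustration: the names time_seconds, trig_terms and arith_terms are made up here and are not part of the Shootout entry, and wall-clock timing like this is no substitute for a real profiler or for reading the generated assembler.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper: runs f(n) once, stores its value in 'result',
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F f, int n, double& result) {
    const auto t0 = std::chrono::steady_clock::now();
    result = f(n);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Terms that go through the math library (sin, cos).
double trig_terms(int n) {
    double sum = 0.0;
    for (int k = 1; k <= n; ++k) {
        const double kd = double(k);
        const double kd3 = kd * kd * kd;
        const double s = std::sin(kd);
        const double c = std::cos(kd);
        sum += 1.0 / (kd3 * s * s) + 1.0 / (kd3 * c * c);
    }
    return sum;
}

// Terms that need only multiplies, adds and divides.
double arith_terms(int n) {
    double sum = 0.0;
    for (int k = 1; k <= n; ++k) {
        const double kd = double(k);
        sum += 1.0 / kd + 1.0 / (kd * kd) + 1.0 / (kd * kd + kd);
    }
    return sum;
}
```

Calling time_seconds(trig_terms, n, r) and time_seconds(arith_terms, n, r) with the same n, once per compiler and flag set, should show whether the gap sits mostly in the sin/cos terms or is spread over the whole loop. Use the returned sums (print them, say) so the optimizer cannot discard the loops.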

In the past I've noticed that ICC seemed to be more aggressive at
vectorization, although recent versions of GCC do a better job (the
benchmarks don't specify which version of GCC was used) - and I'm not
sure if this is relevant here (you can test this: I think both compilers
will tell you what they vectorise if you ask them nicely).

In any case, this sort of benchmark is highly artificial and probably
quite irrelevant to real-life program performance. FWIW I too have found
very little to choose between ICC and GCC over a fair variety of real-
world numerically-intensive tasks (although I've also found later
versions of ICC on Linux to be unusably buggy).

Regards,

--
Lionel B

Mirco Wahab 06-23-2008 04:04 PM

Re: Intel vs Gnu compiler output quality
 
jhc0033@gmail.com wrote:
> My experience has generally been that, for CPU-intensive tasks, the
> Intel compiler produces code that is about as fast as that produced by
> the Gnu compiler.
>
> However, on this simple Shootout entry, Intel seems to be 4.5 times
> faster:
>
> http://shootout.alioth.debian.org/gp...lang=icpp&id=3
>
> Any idea why?


Because the Intel icc/icpc does magical
optimizations on this code and loads the
FPU stack (on x86) from ST(0) up to ST(6)
in the process, whereas g++ (4.3)
doesn't have the vigor to go further up
than ST(2).

From this it follows that the gcc code has to
do many more fldl/fildl and fst/fstp
to the L1, which isn't bad but is not even
close to FPU register fiddling.

That's it, basically.

Regards

M.

Lionel B 06-23-2008 04:27 PM

Re: Intel vs Gnu compiler output quality
 
On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:

> jhc0033@gmail.com wrote:
>> My experience has generally been that, for CPU-intensive tasks, the
>> Intel compiler produces code that is about as fast as that produced by
>> the Gnu compiler.
>>
>> However, on this simple Shootout entry, Intel seems to be 4.5 times
>> faster:
>>
>> http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3
>>
>> Any idea why?

>
> because the intel icc/icpc does magical optimizations on this code and
> loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
> whereas the g++ (4.3) doesn't have the vigor to go further up than
> ST(2).
>
> From this it follows that the gcc code has to do many more fldl/fildl
> and fst/fstp to the L1, which isn't bad but is not even close to FPU
> register fiddling.


That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?

> That's it, basically.


I'll quibble with that "basically" ;-)

--
Lionel B

Mirco Wahab 06-23-2008 05:03 PM

Re: Intel vs Gnu compiler output quality
 
Lionel B wrote:
> On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:
>> From this it follows that the gcc code has to do many more fldl/fildl
>> and fst/fstp to the L1, which isn't bad but is not even close to FPU
>> register fiddling.

>
> That all sounds very impressive... could you possibly explain what it
> means, roughly, to a non-assembler/microprocessor architecture expert?
> Also, what about on x86_64?


Shouldn't sound very impressive imho. The central part of said
benchmark is the following loop:
21:
for (int k = 1; k <= n; ++k, pot = -pot) {
    kd = double(k);
    kd2 = kd * kd;
    kd3 = kd * kd2;

    sink = std::sin(kd);
    cosk = std::cos(kd);

    res1 += std::pow(dt, kd);
    res2 += 1.0 / std::sqrt(kd);
    res3 += 1.0 / (kd2 + kd);
    res4 += 1.0 / (kd3 * sink * sink);
    res5 += 1.0 / (kd3 * cosk * cosk);
    res6 += 1.0 / kd;
    res7 += 1.0 / kd2;
    res8 += pot / kd;
    res9 += pot / (2.0 * kd - 1.0);
}
39:
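
For anyone who wants to compile the loop on its own and compare the two compilers' output, here is a minimal self-contained sketch. The excerpt above does not show the declarations, so the wrapper function partial_sums, the value dt = 2.0 / 3.0 and the starting value pot = 1.0 are assumptions made for this sketch, not copied from the Shootout source.

```cpp
#include <array>
#include <cmath>

// Sketch of the loop as a standalone function. The nine sums are returned
// in order res1..res9 (array indices 0..8).
std::array<double, 9> partial_sums(int n) {
    const double dt = 2.0 / 3.0;  // assumption: base of the power series
    double pot = 1.0;             // assumption: alternating sign, starts at +1

    double res1 = 0.0, res2 = 0.0, res3 = 0.0, res4 = 0.0, res5 = 0.0;
    double res6 = 0.0, res7 = 0.0, res8 = 0.0, res9 = 0.0;

    for (int k = 1; k <= n; ++k, pot = -pot) {
        const double kd = double(k);
        const double kd2 = kd * kd;
        const double kd3 = kd * kd2;

        const double sink = std::sin(kd);
        const double cosk = std::cos(kd);

        res1 += std::pow(dt, kd);
        res2 += 1.0 / std::sqrt(kd);
        res3 += 1.0 / (kd2 + kd);
        res4 += 1.0 / (kd3 * sink * sink);
        res5 += 1.0 / (kd3 * cosk * cosk);
        res6 += 1.0 / kd;
        res7 += 1.0 / kd2;
        res8 += pot / kd;
        res9 += pot / (2.0 * kd - 1.0);
    }
    return {{res1, res2, res3, res4, res5, res6, res7, res8, res9}};
}
```

With these assumed starting values, res8 is a partial sum of the alternating harmonic series (tending to ln 2) and res9 a partial sum of the Gregory series (tending to pi/4), which gives a quick sanity check that both compilers compute the same thing before their speed is compared.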

What one may see is a handful of operands (kd, kd2, etc.) that are
reused throughout the computation of the nine different
terms. To me, it looks like the
Intel compiler counts the occurrences of these
operands, keeps the "best" four or five in the upper
FPU registers (x86, ST(3) ... ST(7)),
and does the increments of the res[1-9] terms
entirely out of these FPU registers.

Example:
;;; res4 += 1.0 / (kd3 * sink * sink);
;;; res5 += 1.0 / (kd3 * cosk * cosk);
;;; res6 += 1.0 / kd;
;;; res7 += 1.0 / kd2;
gives:
    fdiv    %st, %st(2)       #36.27
    fdiv    %st, %st(1)       #32.34
    fxch    %st(1)            #32.13
    faddp   %st, %st(6)       #32.13
    fldl    112(%esp)         #33.34
    fxch    %st(6)            #32.13
    fstpl   80(%esp)          #32.13
    fld     %st(4)            #33.34
    fmul    %st(6), %st       #33.34
    fmulp   %st, %st(6)       #33.41
    fdiv    %st, %st(5)       #33.41
    fldl    96(%esp)          #33.13
    faddp   %st, %st(6)       #33.13
    fxch    %st(5)            #33.13
    fstpl   96(%esp)          #33.13
    fldl    104(%esp)         #34.34
    fmul    %st, %st(4)       #34.34
    fmulp   %st, %st(4)       #34.41
    fxch    %st(3)            #34.41
    fdivr   %st(4), %st       #34.41
    [snipped]

One can immediately see that the operations use and store values
across (almost) the full FPU register set, %st(0) .. %st(6).
Even the last register, %st(7), is used (elsewhere). A lot of
'fxch' operations are used too; 'fxch' is in effect '(FPU) register renaming'
and costs 0 cycles on newer x86. It is needed to get rid of
operands that are no longer used: they are 'renamed' from %st(7) to %st
(which is the 'top of stack'). To the application, the x86 FPU
is a stack and can only be used like a stack, except for 'renaming'.
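
For readers who don't speak assembler, the stack behaviour can be mimicked with a toy model. This is only a sketch: the X87Model struct and its member names are invented here, it covers just a few instructions in their AT&T operand order, and it ignores the tag word, stack overflow checks and 80-bit precision of the real FPU.

```cpp
#include <array>
#include <utility>

// Toy model of the x87 register stack: eight physical slots plus a
// "top" index. st(i) always means "i slots below the current top".
struct X87Model {
    std::array<double, 8> phys{};  // eight physical registers
    int top = 0;                   // physical index currently acting as st(0)

    double& st(int i) { return phys[(top + i) & 7]; }

    // fld: push a value; the old st(0) becomes st(1), and so on.
    void fld(double v) { top = (top + 7) & 7; st(0) = v; }

    // fld %st(i): push a copy of an existing stack entry.
    void fld_st(int i) { const double v = st(i); fld(v); }

    // fxch %st(i): exchange st(0) and st(i). Real CPUs can do this by
    // renaming; the model simply swaps the values.
    void fxch(int i) { std::swap(st(0), st(i)); }

    // faddp %st, %st(i): st(i) += st(0), then pop st(0).
    void faddp(int i) { st(i) += st(0); top = (top + 1) & 7; }

    // fstp: hand back st(0) (the "store") and pop it.
    double fstp() { const double v = st(0); top = (top + 1) & 7; return v; }
};
```

Since arithmetic and stores mostly work on st(0), a value parked deeper in the stack has to be brought to the top with fxch before it can be combined or written out. That is why the listing above is peppered with fxch, and why keeping seven or eight values live in registers avoids the extra loads and stores.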

x86_64 doesn't make a difference here. Only SSE would, which
isn't involved.

regards

M.

Lionel B 06-24-2008 08:20 AM

Re: Intel vs Gnu compiler output quality
 
On Mon, 23 Jun 2008 19:03:03 +0200, Mirco Wahab wrote:

> Lionel B wrote:
>> On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:
>>> From this it follows that the gcc code has to do many more
>>> fldl/fildl and fst/fstp to the L1, which isn't bad but is not even
>>> close to FPU register fiddling.

>>
>> That all sounds very impressive... could you possibly explain what it
>> means, roughly, to a non-assembler/microprocessor architecture expert?
>> Also, what about on x86_64?

>
> Shouldn't sound very impressive imho. The central part of said benchmark
> is the following loop:


[...]

Thanks,

--
Lionel B

