Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   C Programming (http://www.velocityreviews.com/forums/f42-c-programming.html)
-   -   Cannot optimize 64bit Linux code (http://www.velocityreviews.com/forums/t594931-cannot-optimize-64bit-linux-code.html)

legrape@gmail.com 02-28-2008 07:24 PM

Cannot optimize 64bit Linux code
 
I am porting a piece of C code to 64bit on Linux. I am using 64bit
integers. It is a floating point intensive code and when I compile
(gcc) on 64 bit machine, I don't see any runtime improvement when
optimizing -O3. If I construct a small program I can get significant
(>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
machine, it runs 5x faster on the 64 bit machine than does the 64bit
compiled code.

It seems like something is inhibiting the optimization. Someone on
comp.lang.fortran suggested it might be an alignment problem. I am
trying to go through and eliminate all 32 bit integers right now (this
is a pretty large hunk of code). But thought I would survey this
group, in case it is something naive I am missing.

Any opinion is welcomed. I really need this to run up to speed, and I
need the big address space. Thanks in advance.

Dick

Walter Roberson 02-28-2008 07:39 PM

Re: Cannot optimize 64bit Linux code
 
In article <83f5f291-4c86-48f6-8625-5ead760a46bf@e25g2000prg.googlegroups.com>,
<legrape@gmail.com> wrote:
>I am porting a piece of C code to 64bit on Linux. I am using 64bit
>integers. It is a floating point intensive code and when I compile
>(gcc) on 64 bit machine, I don't see any runtime improvement when
>optimizing -O3.


>It seems like something is inhibiting the optimization. Someone on
>comp.lang.fortran suggested it might be an alignment problem.


Possibly. It could possibly also be a cache issue: you might have
cache-line conflicts, or the larger size of your integers might
be causing your key data to no longer fit into cache.
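
For instance, a back-of-the-envelope check (the element count and the 1 MB
L2 figure below are made up, purely for illustration):

#include <stdio.h>

int main(void)
{
    unsigned long n = 200000UL;   /* made-up element count, for illustration only */

    printf("as 32-bit ints     : %lu bytes\n", n * sizeof(int));       /* ~800 KB, fits a 1 MB L2 */
    printf("as 64-bit integers : %lu bytes\n", n * sizeof(long long)); /* ~1.6 MB, no longer fits */
    return 0;
}

Doubling the element size can push a working set that used to fit in cache
out of it, which shows up as a large slowdown.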
--
"The shallow murmur, but the deep are dumb." -- Sir Walter Raleigh

santosh 02-28-2008 07:44 PM

Re: Cannot optimize 64bit Linux code
 
legrape@gmail.com wrote:

> I am porting a piece of C code to 64bit on Linux. I am using 64bit
> integers. It is a floating point intensive code and when I compile
> (gcc) on 64 bit machine, I don't see any runtime improvement when
> optimizing -O3. If I construct a small program I can get significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
> machine, it runs 5x faster on the 64 bit machine than does the 64bit
> compiled code.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32 bit integers right now (this
> is a pretty large hunk of code). But thought I would survey this
> group, in case it is something naive I am missing.
>
> Any opinion is welcomed. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.


This group may not be the best option. Maybe you should try a Linux or
GCC group?

If the same code and compilation commands produce such a runtime
difference, then perhaps the 64 bit version of the compiler and its
runtime libraries, as well as the system runtime libraries, are not yet
exploiting all the optimisations possible. Did you try giving gcc
permission to use intrinsics and SSE? Alignment could well be a problem,
though gcc *should* have chosen the best alignment for the target,
unless you specified otherwise. Are there any aspects of your code (like
choice of data types, compiler-specific pragmas, struct padding) that
are perhaps selected for 32 bit systems and thus less than optimal under
64 bit?
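
By intrinsics I mean something along the lines of the sketch below (not
from your code, just to show the flavour; on x86-64, SSE2 is always
present, so no extra switches should be needed for this):

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void)
{
    double in[2] = { 2.0, 3.0 };
    double out[2];
    __m128d v, r;

    v = _mm_loadu_pd(in);    /* load two doubles into one SSE register */
    r = _mm_sqrt_pd(v);      /* two square roots with a single instruction */
    _mm_storeu_pd(out, r);   /* store both results */

    printf("%f %f\n", out[0], out[1]);
    return 0;
}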

Did you try the Intel compiler? If it produces better code, that would
suggest that gcc isn't emitting good code for this target.


cr88192 02-29-2008 12:20 AM

Re: Cannot optimize 64bit Linux code
 

<legrape@gmail.com> wrote in message
news:83f5f291-4c86-48f6-8625-5ead760a46bf@e25g2000prg.googlegroups.com...
>I am porting a piece of C code to 64bit on Linux. I am using 64bit
> integers. It is a floating point intensive code and when I compile
> (gcc) on 64 bit machine, I don't see any runtime improvement when
> optimizing -O3. If I construct a small program I can get significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
> machine, it runs 5x faster on the 64 bit machine than does the 64bit
> compiled code.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32 bit integers right now (this
> is a pretty large hunk of code). But thought I would survey this
> group, in case it is something naive I am missing.
>
> Any opinion is welcomed. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.
>



OT:

this is actually an issue related to the mismatch between current processor
performance behavior, and the calling conventions used on Linux x86-64.

they were like:
let's base everything on a variant of the "register" calling convention, and
use SSE for all the floating point math rather than crufty old x87.

the problem is that, current processors don't quite agree, and in practice
this sort of thing actually goes *slower*...

it seems, actually, that x87, lots of mem loads/stores, and complex
addressing forms, can be used to better effect wrt performance than SSE,
register-heavy approaches, and the use of "simple" addressing forms (in
seeming opposition to current "optimization wisdom").

I can't give much explanation as to why this is exactly, but it has been my
observation (periodic performance testing during the ongoing
compiler-writing task...).

my guess is because these things are heavily optimized, given that much
existing x86 code uses them heavily (this may change in the future though,
as 64 bit code becomes more prevalent...).


my guess is that the calling convention was designed according to some
misguided sense of "optimization wisdom", rather than good solid benchmarks.

better performance could probably have been achieved at present just by
pretending the x86-64 was an x86 with more registers and
guaranteed-present SSE.

not only this, but the convention is designed in such a way as to be awkward
as well, and leaves open the question of how to effectively pull off
varargs...



or, at least, this is what happens on my processor (an Athlon 64 X2 4400+).

I don't know if it is similar on Intel chips.
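
as a concrete (if trivial) illustration of the varargs wrinkle mentioned
above, consider this sketch; the register detail in the comment is from my
reading of the SysV x86-64 ABI, so double-check it:

#include <stdio.h>
#include <stdarg.h>

/* sum 'count' doubles passed as variadic arguments */
static double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

int main(void)
{
    /* on x86-64 SysV the leading double args travel in xmm0..xmm7, and
       for a variadic call the caller also has to load %al with the
       number of vector registers used; that is the awkward part */
    printf("%f\n", sum_doubles(3, 1.0, 2.0, 3.0));
    return 0;
}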


> Dick



Bartc 02-29-2008 12:37 AM

Re: Cannot optimize 64bit Linux code
 

<legrape@gmail.com> wrote in message
news:83f5f291-4c86-48f6-8625-5ead760a46bf@e25g2000prg.googlegroups.com...
>I am porting a piece of C code to 64bit on Linux. I am using 64bit
> integers. It is a floating point intensive code and when I compile
> (gcc) on 64 bit machine, I don't see any runtime improvement when
> optimizing -O3. If I construct a small program I can get significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
> machine, it runs 5x faster on the 64 bit machine than does the 64bit
> compiled code.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32 bit integers right now (this
> is a pretty large hunk of code). But thought I would survey this
> group, in case it is something naive I am missing.
>
> Any opinion is welcomed. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.


Hesitant to attempt an answer as I know nothing about 64-bit or gcc, but..

Does the program compiled in 32-bit mode run faster when compiled with
optimisation than without (on either a 32 or 64-bit machine)? In other
words, what scale of improvement are you expecting? (This is about the main
program.)

Is the improvement really likely to be 5x or more? If not, then forget the
optimisation: it sounds like something is wrong with the 64-bit-compiled
version if the 32-bit version can run that much faster.

Do you have the capability to look at a sample of the code and see what
exactly the 64-bit compiler is generating? I doubt it's going to be as silly as
using (and emulating) 128-bit floats, but it does sound like there's
something seriously wrong. It seems unlikely that using int32 instead of
int64 would slow things down 5 times or more.

An alignment fault would be a compiler error; but you can print out a few
data addresses and see whether they are on 8/16-byte boundaries or whatever
is recommended.
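
Something as crude as the following would do (a rough sketch; the variables
are just placeholders for your own data):

#include <stdio.h>

int main(void)
{
    double a;
    double arr[4];
    long long n;

    /* print a few addresses together with their offset from a 16-byte boundary */
    printf("&a  : %p  offset mod 16 = %lu\n", (void *)&a,  (unsigned long)&a  % 16);
    printf("arr : %p  offset mod 16 = %lu\n", (void *)arr, (unsigned long)arr % 16);
    printf("&n  : %p  offset mod 16 = %lu\n", (void *)&n,  (unsigned long)&n  % 16);
    return 0;
}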

Is the small program doing anything similar to the big one? It may be
benefiting from smaller instruction/data cache requirements.

You might find that ints/pointers suddenly turn from 32-bits to 64-bits when
compiled on 64-bit (and therefore using twice the memory bandwidth if you
have a lot of them), that might hit some of the performance. You might like
to check the size of pointers, if you don't need 64-bit addressing.
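
For example, a quick check like this (sketch only), built once in 32-bit
mode and once in 64-bit mode, shows which types actually grew:

#include <stdio.h>

int main(void)
{
    printf("int       : %u bytes\n", (unsigned)sizeof(int));
    printf("long      : %u bytes\n", (unsigned)sizeof(long));
    printf("long long : %u bytes\n", (unsigned)sizeof(long long));
    printf("void *    : %u bytes\n", (unsigned)sizeof(void *));
    printf("double    : %u bytes\n", (unsigned)sizeof(double));
    return 0;
}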

--
Bart




cr88192 02-29-2008 04:04 AM

Re: Cannot optimize 64bit Linux code
 

"Bartc" <bc@freeuk.com> wrote in message
news:6tIxj.15225$XI.2979@text.news.virginmedia.com ...
>
> <legrape@gmail.com> wrote in message
> news:83f5f291-4c86-48f6-8625-5ead760a46bf@e25g2000prg.googlegroups.com...
>>I am porting a piece of C code to 64bit on Linux. I am using 64bit
>> integers. It is a floating point intensive code and when I compile
>> (gcc) on 64 bit machine, I don't see any runtime improvement when
>> optimizing -O3. If I construct a small program I can get significant
>> (>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
>> machine, it runs 5x faster on the 64 bit machine than does the 64bit
>> compiled code.
>>
>> It seems like something is inhibiting the optimization. Someone on
>> comp.lang.fortran suggested it might be an alignment problem. I am
>> trying to go through and eliminate all 32 bit integers right now (this
>> is a pretty large hunk of code). But thought I would survey this
>> group, in case it is something naive I am missing.
>>
>> Any opinion is welcomed. I really need this to run up to speed, and I
>> need the big address space. Thanks in advance.

>
> Hesitant to attempt an answer as I know nothing about 64-bit or gcc, but..
>
> Does the program compiled in 32-bit mode run faster when compiled with
> optimisation than without (on either a 32 or 64-bit machine)? In other
> words, what scale of improvement are you expecting? (This is about the
> main program.)
>
> Is the improvement really likely to be 5x or more? If not, then forget
> the optimisation: it sounds like something is wrong with the
> 64-bit-compiled version if the 32-bit version can run that much faster.
>


yes, that is a bit harsh...


> Do you have the capability to look at a sample of the code and see what
> exactly the 64-bit compiler is generating? I doubt it's going to be as
> silly as using (and emulating) 128-bit floats, but it does sound like
> there's
> something seriously wrong. It seems unlikely that using int32 instead of
> int64 would slow things down 5 times or more.
>


int32 vs int64: int32 should actually be faster on x86-64 (after all, 32-bit
ints have less-complex instruction encodings, i.e. no REX prefix, and the
core of x86-64 is, after all, still x86...).

as for emulating 128 bit floats, it is conceivably possible. I am aware, in
any case, that on x86-64 gcc uses a 128-bit long-double, but whether or not
this is an 80-bit float stuffed into a 128 bit space (doing magic of
shuffling between SSE regs and the FPU), or whether it uses emulated 128 bit
floats, I don't know (I have not investigated gcc's output in this case).

note that SSE does not support 80 bit floats, and the conventions used on
x86-64 generally don't use the FPU (it may be used for some calculations,
but not much else), so if using long double, it is very possible something
funky is going on.

if this is the case, maybe try switching over to double and see if anything
is different.
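
a quick way to see what gcc gives you (just a sketch; the 16-byte figure in
the comment is what I recall for x86-64 linux, so verify on your box):

#include <stdio.h>

int main(void)
{
    /* on x86-64 linux, long double is typically 80-bit x87 extended
       precision stored in 16 bytes; plain double is 8 bytes and maps
       cleanly onto the SSE registers */
    printf("sizeof(double)      = %u\n", (unsigned)sizeof(double));
    printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));
    return 0;
}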


> An alignment fault would be a compiler error; but you can print out a few
> data addresses and see whether they are on 8/16-byte boundaries or
> whatever
> is recommended.
>


yes. unless one is using "__attribute__((packed))" everywhere, it should not
be a problem...
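
i.e. about the only way I know of to get misaligned members out of gcc is to
ask for them explicitly, something like this sketch:

#include <stdio.h>

/* normal layout: gcc pads so that 'd' is naturally aligned */
struct normal {
    char c;
    double d;
};

/* packed layout: no padding, so 'd' can end up misaligned */
struct squeezed {
    char c;
    double d;
} __attribute__((packed));

int main(void)
{
    printf("sizeof(struct normal)   = %u\n", (unsigned)sizeof(struct normal));
    printf("sizeof(struct squeezed) = %u\n", (unsigned)sizeof(struct squeezed));
    return 0;
}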


> Is the small program doing anything similar to the big one? It may be
> benefiting from smaller instruction/data cache requirements.
>
> You might find that ints/pointers suddenly turn from 32-bits to 64-bits
> when
> compiled on 64-bit (and therefore using twice the memory bandwidth if you
> have a lot of them), that might hit some of the performance. You might
> like
> to check the size of pointers, if you don't need 64-bit addressing.
>



yes, I will agree here...


> --
> Bart
>
>
>



Dick Dowell 03-01-2008 08:47 PM

Re: Cannot optimize 64bit Linux code
 
Thanks for all the hints and thoughts.

My small program is:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <unistd.h>

/* link with -lm; clock_gettime may also need -lrt */
int main(void)
{
    struct timespec ts;
    double x, y;
    int i;
    long long n;

    n = 15000000;
    n *= 10000;
    fprintf(stderr, "LONG %lld\n", n);
    /*
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    */
    printf(" _POSIX_THREAD_CPUTIME _POSIX_CPUTIME %ld %ld\n",
           (long)_POSIX_THREAD_CPUTIME, (long)_POSIX_CPUTIME);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    n = ts.tv_nsec;
    fprintf(stderr, "Before %ld sec %ld nsec\n", (long)ts.tv_sec, (long)ts.tv_nsec);
    fprintf(stderr, "Before %ld sec %lld nsec\n", (long)ts.tv_sec, (long long)ts.tv_nsec);
    y = 3.3;
    for (i = 0; i < 111100000; i++) {
        x = sqrt(y);
        y += 1.0;
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    fprintf(stderr, "After %ld sec %ld nsec\n", (long)ts.tv_sec, (long)ts.tv_nsec);
    fprintf(stderr, "After %ld sec %lld nsec\n", (long)ts.tv_sec, (long long)ts.tv_nsec - n);
    return 0;
}

It shows considerable improvement with -O3.

I think the problem is something less esoteric than the cache, word
size, etc. One thing I didn't say: I have multithreading loaded, though
no new threads are created by these runs. I have tried a newer Red Hat;
I have not tried the Intel compilers.

Dick

Dick Dowell 03-01-2008 08:59 PM

Re: Cannot optimize 64bit Linux code
 
I think I misspoke on my timer program. That one was used to attempt
to measure thread time. You can remove the references to the timers
and run it. It only shows about a 2x improvement on optimization.

The large difference I have actually seen is a 32-bit compile done on
another machine and run on the 64-bit machine (12 sec), versus 64-bit
code compiled on the 64-bit machine (70 sec).

Sorry for the confusion.

Dick

Walter Roberson 03-02-2008 03:11 AM

Re: Cannot optimize 64bit Linux code
 
In article <ee716b06-2f24-487b-a22c-2128a12605da@s19g2000prg.googlegroups.com>,
Dick Dowell <dick.dowell@avagotech.com> wrote:
>Thanks for all the hints and thoughts.


>My small program is:


>#include <stdio.h>
>#include <math.h>
>#include <time.h>
>#include <unistd.h>
>
>/* link with -lm; clock_gettime may also need -lrt */
>int main(void)
>{
>    struct timespec ts;
>    double x, y;
>    int i;
>    long long n;
>
>    n = 15000000;
>    n *= 10000;
>    fprintf(stderr, "LONG %lld\n", n);
>    /*
>    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>    */
>    printf(" _POSIX_THREAD_CPUTIME _POSIX_CPUTIME %ld %ld\n",
>           (long)_POSIX_THREAD_CPUTIME, (long)_POSIX_CPUTIME);
>    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>    n = ts.tv_nsec;
>    fprintf(stderr, "Before %ld sec %ld nsec\n", (long)ts.tv_sec, (long)ts.tv_nsec);
>    fprintf(stderr, "Before %ld sec %lld nsec\n", (long)ts.tv_sec, (long long)ts.tv_nsec);
>    y = 3.3;
>    for (i = 0; i < 111100000; i++) {
>        x = sqrt(y);
>        y += 1.0;
>    }
>    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>    fprintf(stderr, "After %ld sec %ld nsec\n", (long)ts.tv_sec, (long)ts.tv_nsec);
>    fprintf(stderr, "After %ld sec %lld nsec\n", (long)ts.tv_sec, (long long)ts.tv_nsec - n);
>    return 0;
>}


>It shows considerable improvement with -O3.


You do not do anything with x after you compute it. Any good
optimizer would optimize away the x=sqrt(y) statement. Once that
is done, the optimizer could even eliminate the loop completely
and replace it by y += 111100000. Compilers that did one or
both of these optimizations would produce much faster code than
compilers that did not. Your problem might have nothing to do
with 64 bit integers and everything to do with which optimizations
the compiler performs.
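
If you want the loop to survive optimization, make its result observable.
For example, one possible rewrite of the timing loop (a sketch, not tested):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x, y, sum = 0.0;
    int i;

    y = 3.3;
    for (i = 0; i < 111100000; i++) {
        x = sqrt(y);
        sum += x;       /* the result is now used, so it cannot be discarded */
        y += 1.0;
    }
    /* printing the accumulated value forces the compiler to keep the work */
    fprintf(stderr, "checksum %f\n", sum);
    return 0;
}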
--
"The human mind is so strangely capricious, that, when freed from
the pressure of real misery, it becomes open and sensitive to the
ideal apprehension of ideal calamities." -- Sir Walter Scott

Dick Dowell 03-10-2008 04:35 PM

Re: Cannot optimize 64bit Linux code
 
Thanks for all the suggestions. I've discovered that the ineffectiveness
of the optimization is data dependent. I managed to profile the code,
and 78% of the runtime is spent in something called

__mul [1] (from the gprof output; the [1] just means it is the #1 CPU user)

Here's another line from gprof report
granularity: each sample hit covers 4 byte(s) for 0.01% of 109.71 seconds

index  % time    self  children    called     name
                                                  <spontaneous>
[1]      78.0   85.55      0.00                 __mul [1]
-----------------------------------------------

Dick

