Go Back   Velocity Reviews > Newsgroups > C Programming
User Name
Password
Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

Reply

C Programming - Optimization idea: put (y&1) in [] instead of if()

 
Thread Tools Search this Thread
Old 11-04-2009, 09:40 AM   #1
Default Optimization idea: put (y&1) in [] instead of if()


Is it OK to discuss C programming in this ng?

I just stumbled on an optimization which surprised
me, though I'm sure it's "old hat" to many of you.

An interest in random-number generation was kindled
by some recent Usenet threads, so I looked into
the Mersenne twister. There are downloadable
twisters on the 'Net but for fun I typed in some
C code of my own.

It was no surprise that my code ran slower, at first,
than the downloaded code which had obviously been
carefully optimized. Comparing the codes I noticed
one difference which turned out to have a huge effect.

Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !! I'm not saying the fragment
here runs twice as fast: The *entire twister*,
which contains another 20 ops or so in addition
to this fragment, runs more than twice as fast!

Experimental conditions:
gcc -O3 (4.1.2)
Pentium(R) T2390

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */

By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces
> andl $1, %edx
> ...
> negl %edx
> ...
> andl $-1727483681, %edx
> xorl %edx, %eax

which *I think* is a surprisingly clever way
to deal with this " y & 1 ?" without branching.

I avoided the "FASTWAY", at first, since "^ 0"
seemed like a time-waster.
I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)

Several weeks ago I suggested we post "surprising
optimizations." I hope those eager to tell us how
naive James must be about *this* optimization will
post some surprises of their own!

James Dow Allen


James Dow Allen
  Reply With Quote
Old 11-04-2009, 10:14 AM   #2
Noob
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
James Dow Allen wrote:

> Is it OK to discuss C programming in this ng?
>


The thing is, your question is really not about C.

You're asking why the machine code generated for

p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];

runs faster on a specific micro-architecture than the machine code
generated for

p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);

If you're specifically interested in x86 and amd64 performance, then you
want comp.lang.asm.x86

If you want to know why branches (actually, branch mispredicts) are so
expensive, irrespective of architecture, then you want comp.arch

Regards.


Noob
  Reply With Quote
Old 11-04-2009, 11:46 AM   #3
bartc
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()

"James Dow Allen" <> wrote in message
news:0199e8e6-ab0d-4cdf-83c7-...
> Is it OK to discuss C programming in this ng?
>


Well, it's mainly endless bickering between experts, but if you must...

> Here is the relevant fragment: If you turn on
> FASTWAY, the Mersenne twister runs *more than
> twice as fast* !!


> /* Fragment demonstrating a huge speed-up: */
> auto unsigned long y, *p;
> static unsigned long xorer[2] = {
> 0, 0x9908b0df,
> };
> /* ... */
> #ifdef FASTWAY
> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
> #else
> p[0] = y & 1
> ? (p[397] ^ y >> 1) ^ 0x9908b0df
> : (p[397] ^ y >> 1);
> #endif
> /* End Fragment */
>
> By the way,
> p[0] = (p[397] ^ y >> 1)
> ^ (y & 1 ? 0x9908b0df : 0);
> is *almost* as fast as FASTWAY. gcc -O3 produces


What about:

p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));

I suppose it comes down to whether a multiply is slower if faster than a
branch. An expression that converts 0,1 to ~0,0 would help (but I couldn't
think of one), then an xor instead of multiply can be used.

--
Bartc




bartc
  Reply With Quote
Old 11-04-2009, 12:20 PM   #4
James Dow Allen
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
My post was sincere. I thought C programmers
might want to discuss C programming! (Perhaps
I should have mentioned that I *am* already aware
that some non-Pentium architectures have
C compilers.

On Nov 4, 5:14 pm, Noob <r...@127.0.0.1> wrote:
> James Dow Allen wrote:
> > Is it OK to discuss C programming in this ng?
> >

>
> If you're specifically interested in x86 and amd64 performance, then
> you want comp.lang.asm.x86
>
> If you want to know why branches (actually, branch mispredicts) are so
> expensive, irrespective of architecture, then you want comp.arch


Well, although I think yours is just a parody post, in fact
I'm *not* interested in x86 performance and I *don't* want to learn
why branch mispredicts are expensive.

I *do* want to stimulate discussion about C programming by
C programmers. Can you recommend a ng for that?


I thought that was a great parody by Noob of
some of the topicality complaints in this ng!
(But Noob, please include a smiley-face in future;
you'd be amazed at some of the outrageous complaints
which are actually presented seriously here!)

Of course, *discussions of non-topicality* are
*always* topical!! Just as a patent lawyer can fix
an improper claim with six silly words (see below),
I could have avoided any objection by changing
each of my sentences into a *question* by prefixing
eight words:
"Would it not be non-topical to comment that ..."


(A patent lawyer can convert a claim which improperly
"patents an algorithm" into a proper claim by adding
6 words. Change:
"I claim the algorithm which ..."
to
"I claim a digital computing apparatus which
performs the algorithm which ..."
Apologies to patent lawyers for divulging this trick:
they've been charging $500 per word for these six words.)

James Dow Allen


James Dow Allen
  Reply With Quote
Old 11-04-2009, 12:36 PM   #5
Eric Sosman
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
bartc wrote:
>
> "James Dow Allen" <> wrote in message
> news:0199e8e6-ab0d-4cdf-83c7-...
>> Is it OK to discuss C programming in this ng?
>>

>
> Well, it's mainly endless bickering between experts, but if you must...
>
>> Here is the relevant fragment: If you turn on
>> FASTWAY, the Mersenne twister runs *more than
>> twice as fast* !!

>
>> /* Fragment demonstrating a huge speed-up: */
>> auto unsigned long y, *p;
>> static unsigned long xorer[2] = {
>> 0, 0x9908b0df,
>> };
>> /* ... */
>> #ifdef FASTWAY
>> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
>> #else
>> p[0] = y & 1
>> ? (p[397] ^ y >> 1) ^ 0x9908b0df
>> : (p[397] ^ y >> 1);
>> #endif
>> /* End Fragment */
>>
>> By the way,
>> p[0] = (p[397] ^ y >> 1)
>> ^ (y & 1 ? 0x9908b0df : 0);
>> is *almost* as fast as FASTWAY. gcc -O3 produces

>
> What about:
>
> p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));
>
> I suppose it comes down to whether a multiply is slower if faster than a
> branch. An expression that converts 0,1 to ~0,0 would help (but I
> couldn't think of one),


(y & 1) - 1

> then an xor instead of multiply can be used.


I suspect you mean "and."

--
Eric Sosman
lid


Eric Sosman
  Reply With Quote
Old 11-04-2009, 01:56 PM   #6
Seebs
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
On 2009-11-04, James Dow Allen <> wrote:
> Is it OK to discuss C programming in this ng?
>


No, this is the group for discussing things you wish to sell in
World of Warcraft. (You might think the in-game "trade" channel
was for that, but actually "trade" is a code word for "uninformed
political debate and gay-bashing".)

> /* Fragment demonstrating a huge speed-up: */
> auto unsigned long y, *p;
> static unsigned long xorer[2] = {
> 0, 0x9908b0df,
> };
> /* ... */
> #ifdef FASTWAY
> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
> #else
> p[0] = y & 1
> ? (p[397] ^ y >> 1) ^ 0x9908b0df
> : (p[397] ^ y >> 1);
> #endif
> /* End Fragment */


This is not especially surprising. On modern CPUs, branches are usually
much more expensive than virtually anything else. Your optimization
removes a branch, replacing it with a lookup. The chances are that even
optimizing the subexpression out might help.

> By the way,
> p[0] = (p[397] ^ y >> 1)
> ^ (y & 1 ? 0x9908b0df : 0);


Yup.

> I'm afraid I have a decades-old prejudice against
> memory accesses, but it appears that may be totally
> misplaced! Why is FASTWAY so much faster? I
> suppose because it avoids any conditional branches,
> is that correct? (The "y & 1 ?" condition will
> get a 50:50 split I think which might be worst-case
> for any branch-prediction.)


Yes.

> Several weeks ago I suggested we post "surprising
> optimizations." I hope those eager to tell us how
> naive James must be about *this* optimization will
> post some surprises of their own!


It's not hugely surprising, but it's a very nice piece of work -- simple,
trivially shown correct, and a huge speed improvement for a lot of common
cases.

-s
--
Copyright 2009, all wrongs reversed. Peter Seebach / usenet-
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!


Seebs
  Reply With Quote
Old 11-04-2009, 02:16 PM   #7
bartc
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()

"Eric Sosman" <> wrote in message
news:hcrsgs$3hc$...
> bartc wrote:
>>
>> "James Dow Allen" <> wrote in message
>> news:0199e8e6-ab0d-4cdf-83c7-...


>>> #ifdef FASTWAY
>>> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
>>> #else
>>> p[0] = y & 1
>>> ? (p[397] ^ y >> 1) ^ 0x9908b0df
>>> : (p[397] ^ y >> 1);
>>> #endif


>> p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));
>>
>> I suppose it comes down to whether a multiply is slower [or] faster than
>> a branch. An expression that converts 0,1 to ~0,0 would help (but I
>> couldn't think of one),

>
> (y & 1) - 1
>
>> then an xor instead of multiply can be used.

>
> I suspect you mean "and."


Yes, of course...

--
Bartc



bartc
  Reply With Quote
Old 11-04-2009, 09:36 PM   #8
Stephen Sprunk
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
James Dow Allen wrote:
> Experimental conditions:
> gcc -O3 (4.1.2)
> Pentium(R) T2390
>
> /* Fragment demonstrating a huge speed-up: */
> auto unsigned long y, *p;
> static unsigned long xorer[2] = {
> 0, 0x9908b0df,
> };
> /* ... */
> #ifdef FASTWAY
> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
> #else
> p[0] = y & 1
> ? (p[397] ^ y >> 1) ^ 0x9908b0df
> : (p[397] ^ y >> 1);
> #endif
> /* End Fragment */
> ...
> I avoided the "FASTWAY", at first, since "^ 0"
> seemed like a time-waster.
> I'm afraid I have a decades-old prejudice against
> memory accesses, but it appears that may be totally
> misplaced! Why is FASTWAY so much faster? I
> suppose because it avoids any conditional branches,
> is that correct? (The "y & 1 ?" condition will
> get a 50:50 split I think which might be worst-case
> for any branch-prediction.)


That's it, in a nutshell. Modern CPUs are pretty good at hiding the
latency of predictable/cacheable loads and of predictable branches.
Modern compilers are also good at hoisting loads, whereas it's much more
difficult to hoist a branch.

Your "slow" version had an unpredictable branch; your "fast" version
replaced that with a cacheable, hoistable load. You are using a
reasonably modern compiler and CPU, so a performance improvement is not
surprising.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking


Stephen Sprunk
  Reply With Quote
Old 11-05-2009, 03:17 AM   #9
James Dow Allen
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
On Nov 5, 4:36*am, Stephen Sprunk <step...@sprunk.org> wrote:
> James Dow Allen wrote:
> > I'm afraid I have a decades-old prejudice against
> > memory accesses, but it appears that may be totally
> > misplaced! *Why is FASTWAY so much faster? *I
> > suppose because it avoids any conditional branches,
> > is that correct? *(The "y & 1 ?" condition will
> > get a 50:50 split I think which might be worst-case
> > for any branch-prediction.)

>
> That's it, in a nutshell. *Modern CPUs are pretty good at hiding the
> latency of predictable/cacheable loads and of predictable branches.
> Modern compilers are also good at hoisting loads, whereas it's much more
> difficult to hoist a branch.


Thank you. Still, it seemed remarkable that *more than half* the
time spent by the entire twister (much longer piece of code than
the posted fragment) was wasted by this "unhoistable" branch,
especially since the FASTWAY would have been slower, I think,
on many Jurassic-era processors.

I shared my surprise altruistically(!) thinking others might want
to be reminded to look for this (unusual?) optimization idea.

Any other not-so-obvious optimization ideas?


> Stephen Sprunk * * * * "God does not play dice." *--Albert Einstein
> CCIE #3723 * * * * "God is an inveterate gambler, and He throws the
> K5SSS * * * *dice at every possible opportunity." --Stephen Hawking


Retrocausality is a fascinating possibility. When the method
of the Old One is finally revealed, I wonder if Einstein will
turn out to be right, after all.

James


James Dow Allen
  Reply With Quote
Old 11-05-2009, 11:48 PM   #10
Flash Gordon
 
Posts: n/a
Default Re: Optimization idea: put (y&1) in [] instead of if()
Noob wrote:
> James Dow Allen wrote:
>
>> Is it OK to discuss C programming in this ng?
>>

>
> The thing is, your question is really not about C.
>
> You're asking why the machine code generated for
>
> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
>
> runs faster on a specific micro-architecture than the machine code
> generated for
>
> p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df
> : (p[397] ^ y >> 1);
>
> If you're specifically interested in x86 and amd64 performance, then you
> want comp.lang.asm.x86
>
> If you want to know why branches (actually, branch mispredicts) are so
> expensive, irrespective of architecture, then you want comp.arch


Also if you want to know why it is architecture dependent, because some
architectures don't have pipelines, some have delayed branches, so it
can do the test, decide whether or not to branch, and then do something
else that is always required before the branch takes effect.

Don't assume all architectures work like the ones you know.
--
Flash Gordon


Flash Gordon
  Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off

Similar Threads
Thread Thread Starter Forum Replies Last Post
Search Engine Optimization Guidelines alpha Computer Support 1 02-25-2008 04:28 PM
Web Hosting, Designing, Email Marketing & Search Engine Optimization Packages karen.systems@gmail.com Computer Support 3 07-23-2007 01:38 AM
saraff pravesh pravesh saraff saraff.pravesh.cool@gmail.com Computer Support 31 05-31-2007 01:23 PM
must read content, pravesh saraff saraff.pravesh.cool@gmail.com Computer Support 0 05-30-2007 04:03 PM
BIOS Optimization Guide Rev. 8.21 Silverstrand Front Page News 0 08-24-2005 02:46 PM




SEO by vBSEO 3.3.2 ©2009, Crawlability, Inc.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46