![]() |
|
|
|||||||
![]() |
C Programming - Optimization idea: put (y&1) in [] instead of if() |
|
|
Thread Tools | Search this Thread |
|
|
#1 |
|
Is it OK to discuss C programming in this ng?
I just stumbled on an optimization which surprised me, though I'm sure it's "old hat" to many of you. An interest in random-number generation was kindled by some recent Usenet threads, so I looked into the Mersenne twister. There are downloadable twisters on the 'Net but for fun I typed in some C code of my own. It was no surprise that my code ran slower, at first, than the downloaded code which had obviously been carefully optimized. Comparing the codes I noticed one difference which turned out to have a huge effect. Here is the relevant fragment: If you turn on FASTWAY, the Mersenne twister runs *more than twice as fast* !! I'm not saying the fragment here runs twice as fast: The *entire twister*, which contains another 20 ops or so in addition to this fragment, runs more than twice as fast! Experimental conditions: gcc -O3 (4.1.2) Pentium(R) T2390 /* Fragment demonstrating a huge speed-up: */ auto unsigned long y, *p; static unsigned long xorer[2] = { 0, 0x9908b0df, }; /* ... */ #ifdef FASTWAY p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; #else p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df : (p[397] ^ y >> 1); #endif /* End Fragment */ By the way, p[0] = (p[397] ^ y >> 1) ^ (y & 1 ? 0x9908b0df : 0); is *almost* as fast as FASTWAY. gcc -O3 produces > andl $1, %edx > ... > negl %edx > ... > andl $-1727483681, %edx > xorl %edx, %eax which *I think* is a surprisingly clever way to deal with this " y & 1 ?" without branching. I avoided the "FASTWAY", at first, since "^ 0" seemed like a time-waster. I'm afraid I have a decades-old prejudice against memory accesses, but it appears that may be totally misplaced! Why is FASTWAY so much faster? I suppose because it avoids any conditional branches, is that correct? (The "y & 1 ?" condition will get a 50:50 split I think which might be worst-case for any branch-prediction.) Several weeks ago I suggested we post "surprising optimizations." I hope those eager to tell us how naive James must be about *this* optimization will post some surprises of their own! James Dow Allen James Dow Allen |
|
|
|
|
#2 |
|
Posts: n/a
|
James Dow Allen wrote:
> Is it OK to discuss C programming in this ng? > The thing is, your question is really not about C. You're asking why the machine code generated for p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; runs faster on a specific micro-architecture than the machine code generated for p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df : (p[397] ^ y >> 1); If you're specifically interested in x86 and amd64 performance, then you want comp.lang.asm.x86 If you want to know why branches (actually, branch mispredicts) are so expensive, irrespective of architecture, then you want comp.arch Regards. Noob |
|
|
|
#3 |
|
Posts: n/a
|
"James Dow Allen" <> wrote in message news:0199e8e6-ab0d-4cdf-83c7-... > Is it OK to discuss C programming in this ng? > Well, it's mainly endless bickering between experts, but if you must... > Here is the relevant fragment: If you turn on > FASTWAY, the Mersenne twister runs *more than > twice as fast* !! > /* Fragment demonstrating a huge speed-up: */ > auto unsigned long y, *p; > static unsigned long xorer[2] = { > 0, 0x9908b0df, > }; > /* ... */ > #ifdef FASTWAY > p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; > #else > p[0] = y & 1 > ? (p[397] ^ y >> 1) ^ 0x9908b0df > : (p[397] ^ y >> 1); > #endif > /* End Fragment */ > > By the way, > p[0] = (p[397] ^ y >> 1) > ^ (y & 1 ? 0x9908b0df : 0); > is *almost* as fast as FASTWAY. gcc -O3 produces What about: p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1)); I suppose it comes down to whether a multiply is slower if faster than a branch. An expression that converts 0,1 to ~0,0 would help (but I couldn't think of one), then an xor instead of multiply can be used. -- Bartc bartc |
|
|
|
#4 |
|
Posts: n/a
|
My post was sincere. I thought C programmers
might want to discuss C programming! (Perhaps I should have mentioned that I *am* already aware that some non-Pentium architectures have C compilers. On Nov 4, 5:14 pm, Noob <r...@127.0.0.1> wrote: > James Dow Allen wrote: > > Is it OK to discuss C programming in this ng? > > > > If you're specifically interested in x86 and amd64 performance, then > you want comp.lang.asm.x86 > > If you want to know why branches (actually, branch mispredicts) are so > expensive, irrespective of architecture, then you want comp.arch Well, although I think yours is just a parody post, in fact I'm *not* interested in x86 performance and I *don't* want to learn why branch mispredicts are expensive. I *do* want to stimulate discussion about C programming by C programmers. Can you recommend a ng for that? I thought that was a great parody by Noob of some of the topicality complaints in this ng! (But Noob, please include a smiley-face in future; you'd be amazed at some of the outrageous complaints which are actually presented seriously here!) Of course, *discussions of non-topicality* are *always* topical!! Just as a patent lawyer can fix an improper claim with six silly words (see below), I could have avoided any objection by changing each of my sentences into a *question* by prefixing eight words: "Would it not be non-topical to comment that ..." (A patent lawyer can convert a claim which improperly "patents an algorithm" into a proper claim by adding 6 words. Change: "I claim the algorithm which ..." to "I claim a digital computing apparatus which performs the algorithm which ..." Apologies to patent lawyers for divulging this trick: they've been charging $500 per word for these six words.) James Dow Allen James Dow Allen |
|
|
|
#5 |
|
Posts: n/a
|
bartc wrote:
> > "James Dow Allen" <> wrote in message > news:0199e8e6-ab0d-4cdf-83c7-... >> Is it OK to discuss C programming in this ng? >> > > Well, it's mainly endless bickering between experts, but if you must... > >> Here is the relevant fragment: If you turn on >> FASTWAY, the Mersenne twister runs *more than >> twice as fast* !! > >> /* Fragment demonstrating a huge speed-up: */ >> auto unsigned long y, *p; >> static unsigned long xorer[2] = { >> 0, 0x9908b0df, >> }; >> /* ... */ >> #ifdef FASTWAY >> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; >> #else >> p[0] = y & 1 >> ? (p[397] ^ y >> 1) ^ 0x9908b0df >> : (p[397] ^ y >> 1); >> #endif >> /* End Fragment */ >> >> By the way, >> p[0] = (p[397] ^ y >> 1) >> ^ (y & 1 ? 0x9908b0df : 0); >> is *almost* as fast as FASTWAY. gcc -O3 produces > > What about: > > p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1)); > > I suppose it comes down to whether a multiply is slower if faster than a > branch. An expression that converts 0,1 to ~0,0 would help (but I > couldn't think of one), (y & 1) - 1 > then an xor instead of multiply can be used. I suspect you mean "and." -- Eric Sosman lid Eric Sosman |
|
|
|
#6 |
|
Posts: n/a
|
On 2009-11-04, James Dow Allen <> wrote:
> Is it OK to discuss C programming in this ng? > No, this is the group for discussing things you wish to sell in World of Warcraft. (You might think the in-game "trade" channel was for that, but actually "trade" is a code word for "uninformed political debate and gay-bashing".) > /* Fragment demonstrating a huge speed-up: */ > auto unsigned long y, *p; > static unsigned long xorer[2] = { > 0, 0x9908b0df, > }; > /* ... */ > #ifdef FASTWAY > p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; > #else > p[0] = y & 1 > ? (p[397] ^ y >> 1) ^ 0x9908b0df > : (p[397] ^ y >> 1); > #endif > /* End Fragment */ This is not especially surprising. On modern CPUs, branches are usually much more expensive than virtually anything else. Your optimization removes a branch, replacing it with a lookup. The chances are that even optimizing the subexpression out might help. > By the way, > p[0] = (p[397] ^ y >> 1) > ^ (y & 1 ? 0x9908b0df : 0); Yup. > I'm afraid I have a decades-old prejudice against > memory accesses, but it appears that may be totally > misplaced! Why is FASTWAY so much faster? I > suppose because it avoids any conditional branches, > is that correct? (The "y & 1 ?" condition will > get a 50:50 split I think which might be worst-case > for any branch-prediction.) Yes. > Several weeks ago I suggested we post "surprising > optimizations." I hope those eager to tell us how > naive James must be about *this* optimization will > post some surprises of their own! It's not hugely surprising, but it's a very nice piece of work -- simple, trivially shown correct, and a huge speed improvement for a lot of common cases. -s -- Copyright 2009, all wrongs reversed. Peter Seebach / usenet- http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated! Seebs |
|
|
|
#7 |
|
Posts: n/a
|
"Eric Sosman" <> wrote in message news:hcrsgs$3hc$... > bartc wrote: >> >> "James Dow Allen" <> wrote in message >> news:0199e8e6-ab0d-4cdf-83c7-... >>> #ifdef FASTWAY >>> p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; >>> #else >>> p[0] = y & 1 >>> ? (p[397] ^ y >> 1) ^ 0x9908b0df >>> : (p[397] ^ y >> 1); >>> #endif >> p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1)); >> >> I suppose it comes down to whether a multiply is slower [or] faster than >> a branch. An expression that converts 0,1 to ~0,0 would help (but I >> couldn't think of one), > > (y & 1) - 1 > >> then an xor instead of multiply can be used. > > I suspect you mean "and." Yes, of course... -- Bartc bartc |
|
|
|
#8 |
|
Posts: n/a
|
James Dow Allen wrote:
> Experimental conditions: > gcc -O3 (4.1.2) > Pentium(R) T2390 > > /* Fragment demonstrating a huge speed-up: */ > auto unsigned long y, *p; > static unsigned long xorer[2] = { > 0, 0x9908b0df, > }; > /* ... */ > #ifdef FASTWAY > p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; > #else > p[0] = y & 1 > ? (p[397] ^ y >> 1) ^ 0x9908b0df > : (p[397] ^ y >> 1); > #endif > /* End Fragment */ > ... > I avoided the "FASTWAY", at first, since "^ 0" > seemed like a time-waster. > I'm afraid I have a decades-old prejudice against > memory accesses, but it appears that may be totally > misplaced! Why is FASTWAY so much faster? I > suppose because it avoids any conditional branches, > is that correct? (The "y & 1 ?" condition will > get a 50:50 split I think which might be worst-case > for any branch-prediction.) That's it, in a nutshell. Modern CPUs are pretty good at hiding the latency of predictable/cacheable loads and of predictable branches. Modern compilers are also good at hoisting loads, whereas it's much more difficult to hoist a branch. Your "slow" version had an unpredictable branch; your "fast" version replaced that with a cacheable, hoistable load. You are using a reasonably modern compiler and CPU, so a performance improvement is not surprising. S -- Stephen Sprunk "God does not play dice." --Albert Einstein CCIE #3723 "God is an inveterate gambler, and He throws the K5SSS dice at every possible opportunity." --Stephen Hawking Stephen Sprunk |
|
|
|
#9 |
|
Posts: n/a
|
On Nov 5, 4:36*am, Stephen Sprunk <step...@sprunk.org> wrote:
> James Dow Allen wrote: > > I'm afraid I have a decades-old prejudice against > > memory accesses, but it appears that may be totally > > misplaced! *Why is FASTWAY so much faster? *I > > suppose because it avoids any conditional branches, > > is that correct? *(The "y & 1 ?" condition will > > get a 50:50 split I think which might be worst-case > > for any branch-prediction.) > > That's it, in a nutshell. *Modern CPUs are pretty good at hiding the > latency of predictable/cacheable loads and of predictable branches. > Modern compilers are also good at hoisting loads, whereas it's much more > difficult to hoist a branch. Thank you. Still, it seemed remarkable that *more than half* the time spent by the entire twister (much longer piece of code than the posted fragment) was wasted by this "unhoistable" branch, especially since the FASTWAY would have been slower, I think, on many Jurassic-era processors. I shared my surprise altruistically(!) thinking others might want to be reminded to look for this (unusual?) optimization idea. Any other not-so-obvious optimization ideas? > Stephen Sprunk * * * * "God does not play dice." *--Albert Einstein > CCIE #3723 * * * * "God is an inveterate gambler, and He throws the > K5SSS * * * *dice at every possible opportunity." --Stephen Hawking Retrocausality is a fascinating possibility. When the method of the Old One is finally revealed, I wonder if Einstein will turn out to be right, after all. James James Dow Allen |
|
|
|
#10 |
|
Posts: n/a
|
Noob wrote:
> James Dow Allen wrote: > >> Is it OK to discuss C programming in this ng? >> > > The thing is, your question is really not about C. > > You're asking why the machine code generated for > > p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1]; > > runs faster on a specific micro-architecture than the machine code > generated for > > p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df > : (p[397] ^ y >> 1); > > If you're specifically interested in x86 and amd64 performance, then you > want comp.lang.asm.x86 > > If you want to know why branches (actually, branch mispredicts) are so > expensive, irrespective of architecture, then you want comp.arch Also if you want to know why it is architecture dependent, because some architectures don't have pipelines, some have delayed branches, so it can do the test, decide whether or not to branch, and then do something else that is always required before the branch takes effect. Don't assume all architectures work like the ones you know. -- Flash Gordon Flash Gordon |
|
![]() |
| Thread Tools | Search this Thread |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Search Engine Optimization Guidelines | alpha | Computer Support | 1 | 02-25-2008 04:28 PM |
| Web Hosting, Designing, Email Marketing & Search Engine Optimization Packages | karen.systems@gmail.com | Computer Support | 3 | 07-23-2007 01:38 AM |
| saraff pravesh pravesh saraff | saraff.pravesh.cool@gmail.com | Computer Support | 31 | 05-31-2007 01:23 PM |
| must read content, pravesh saraff | saraff.pravesh.cool@gmail.com | Computer Support | 0 | 05-30-2007 04:03 PM |
| BIOS Optimization Guide Rev. 8.21 | Silverstrand | Front Page News | 0 | 08-24-2005 02:46 PM |