Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Optimizing C

Reply
Thread Tools

Optimizing C

 
 
Richard G. Riley
Guest
Posts: n/a
 
      03-13-2006

Without resorting to asm chunks I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compilers hand in, if possible, a
compiler/underlying processor independant way : althought to be fair I
cant see this stuff on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
occurring extra unnecessary cpu bandwidth would be nice. By standard
bit fiddles I mean : shift & rotate, anding, oring, shift til bit
found etc. The operations will be occurring on sequences of
words and the smallest word will be the standard CPU word size (32
bits). By sequences of words : I might, for example, be locating
the first non zero bit from the left in sequence of 64 words.

Any pointers appreciated.
 
Reply With Quote
 
 
 
 
Eric Sosman
Guest
Posts: n/a
 
      03-13-2006
Richard G. Riley wrote:

> Without resorting to asm chunks I'm working on a few small routines
> which manipulate bitmasks. I'm looking for any guidance on writing C
> in a manner which tilts the compilers hand in, if possible, a
> compiler/underlying processor independant way : althought to be fair I
> cant see this stuff on anything other than x86, but who knows.
>
> I found some ok info here:
>
> http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
> http://www.devx.com/amd/Article/21314
>
> But any techniques for doing the standard bit fiddles without
> occurring extra unnecessary cpu bandwidth would be nice. By standard
> bit fiddles I mean : shift & rotate, anding, oring, shift til bit
> found etc. The operations will be occurring on sequences of
> words and the smallest word will be the standard CPU word size (32
> bits). By sequences of words : I might, for example, be locating
> the first non zero bit from the left in sequence of 64 words.


First, it's not at all clear what you're trying to speed
up, nor what you have measured and found to be too slow. A
"generic" optimization is often ill-directed; optimizing "on
general principles" is often ill-advised. If you have a specific
operation you'd like to speed up, it'd be a good idea to show
the actual code you're currently using, along with measurements
of its current speed and an estimate of the amount of speedup
you simply MUST achieve to make the code useful.

As to finding the first non-zero bit in an array of 64 ints,
the first thing to do is nail down what you mean by "first," lest
endianness issues pollute the solutions. Having done that, you
should also decide what form the answer should take -- it may be
simpler (or harder) to get a bit mask than a bit index, and you
may benefit by adjusting the rest of the code to use the more
easily obtained answer.

If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.

If you've used memchr() and have located a non-zero byte, it
might make sense to use a precalculated 256-element table to finish
the job without further "computation." Then again, it might not:
modern CPU's are much faster than memory, and might be able to
complete a multi-step computation in less time than a memory fetch.
More measurements are needed.

If you've searched int-by-int you could dive directly into
the second search, or you could first use a couple of comparisons
to decide which of the four bytes holds the "first" non-zero and
then proceed as if memchr() had located that byte. Which will be
faster? Measure, measure, measure.

If you need a bit index and don't want to use a table, the
obvious test-and-shift-and-count approach is probably best. If
you need a mask as opposed to an index, the `x & (x - 1)' trick
may be faster. Measure, measure, -- is there an echo in here?

Finally, an impression of the two Web pages you cited. A
quick glance suggests they're filled with well-intentioned and
sometimes reasonable advice, but ALL of it is out of context.
(Necessarily so, of course: the authors of the Web pages have no
idea what kind of program you're trying to write.) Even expert
advice can become nonsense when taken out of context. Your
grandmother's doctor is (presumably) an expert; do you therefore
take the same medicines that are prescribed for your grandmother?
IMHO, it is best to read and understand advice of the sort these
pages offer, but NOT to follow it until and unless you have reason
to believe it applies to your situation.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid
 
Reply With Quote
 
 
 
 
Laurent Deniau
Guest
Posts: n/a
 
      03-13-2006
Richard G. Riley wrote:
> Without resorting to asm chunks I'm working on a few small routines
> which manipulate bitmasks. I'm looking for any guidance on writing C
> in a manner which tilts the compilers hand in, if possible, a
> compiler/underlying processor independant way : althought to be fair I
> cant see this stuff on anything other than x86, but who knows.
>
> I found some ok info here:
>
> http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
> http://www.devx.com/amd/Article/21314
>
> But any techniques for doing the standard bit fiddles without
> occurring extra unnecessary cpu bandwidth would be nice. By standard
> bit fiddles I mean : shift & rotate, anding, oring, shift til bit
> found etc. The operations will be occurring on sequences of
> words and the smallest word will be the standard CPU word size (32
> bits). By sequences of words : I might, for example, be locating
> the first non zero bit from the left in sequence of 64 words.
>
> Any pointers appreciated.


Maybe http://www.jjj.de/bitwizardry/ is a good start.

a+, ld.
 
Reply With Quote
 
Richard G. Riley
Guest
Posts: n/a
 
      03-13-2006
On 2006-03-13, Eric Sosman <(E-Mail Removed)> wrote:
> Richard G. Riley wrote:
>
>> Without resorting to asm chunks I'm working on a few small routines
>> which manipulate bitmasks. I'm looking for any guidance on writing C
>> in a manner which tilts the compilers hand in, if possible, a
>> compiler/underlying processor independant way : althought to be fair I
>> cant see this stuff on anything other than x86, but who knows.
>>
>> I found some ok info here:
>>
>> http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
>> http://www.devx.com/amd/Article/21314
>>
>> But any techniques for doing the standard bit fiddles without
>> occurring extra unnecessary cpu bandwidth would be nice. By standard
>> bit fiddles I mean : shift & rotate, anding, oring, shift til bit
>> found etc. The operations will be occurring on sequences of
>> words and the smallest word will be the standard CPU word size (32
>> bits). By sequences of words : I might, for example, be locating
>> the first non zero bit from the left in sequence of 64 words.

>
> First, it's not at all clear what you're trying to speed
> up, nor what you have measured and found to be too slow. A


Hi Eric,

I'm looking for advice/pointers on the operators mentioned above. Its
not a case of anything being too slow : its a question of ensuring or
approaching optimum performance as early as possible. I did a lot of
this stuff at assembler level years ago when writing high performance
vga libraries, but want, if possible to leave it all in C now.

From my opening paragraph :

shift & rotate operations : left & right shift operations
anding : bitwise and operations
oring : bitwise or operations

I am writing some small libraries for generating image masks,
transparancy masks and edge silhouettes. Some more of the details are below.

> "generic" optimization is often ill-directed; optimizing "on
> general principles" is often ill-advised. If you have a specific
> operation you'd like to speed up, it'd be a good idea to show
> the actual code you're currently using, along with measurements
> of its current speed and an estimate of the amount of speedup
> you simply MUST achieve to make the code useful.


I'm looking for general guidelines : nothing spectacuarly
specific. Although you do give some good guidlines, some of which had
crossed my mind already, but certainly not all.

>
> As to finding the first non-zero bit in an array of 64 ints,
> the first thing to do is nail down what you mean by "first," lest
> endianness issues pollute the solutions. Having done that, you


Thats in the details : but here we can assume that the sequence is
high bit == left most pixel. Also that any colour issues are segregated
into 3 or more color planes : initial code can assume a monochrome
bitmap.

I dont really think endian comes into it since the operations will
be taken care of at the word level. Even if it wasnt monochrome then
the most specific X bits would be leftmost pixel where X is the color
depth : in this case I would see the only code issues being that of
increasing the bit width of the test mask and the number of shifts
done in order to check the next pixel.

(Just to clarify the endian thing : all work is in memory buffers and
is in no way related to video HW addressing.)

> should also decide what form the answer should take -- it may be
> simpler (or harder) to get a bit mask than a bit index, and you
> may benefit by adjusting the rest of the code to use the more
> easily obtained answer.


Thats something to think about ok. Although I always considered just
anding and shifting 0x80000000 until a non zero result was
indicated. The mask can be generated from the bit by subtracting one.
(e.g 0x4000000-1 -> 0x3fffffff) and readding the bit.

After left pixel is detected we detect right most and if we are
creating a mask of words for rest of scanline its then trivial.

>
> If one-bits are expected to be fairly sparse, you'll probably
> want to search through largish "chunks" to find one that isn't
> all zeroes, then search within that chunk for the proper bit.
> You could do the search an `int' at a time with an ordinary loop,
> or you might use memchr() to search byte-by-byte. The memchr()
> method might (or might not) take longer for the search, but note
> that it leaves a smaller space for the bit-by-bit search to explore.
> You'd need to implement both and measure.


Yes, check word for 0 before looking for a set bit : simple enough &
very practical operation with no real overhead at time of word
read. I dont expect the images to be sparse so I'm not sure its
worth the overhead of calling a library to find it : considering left
most pixel detection: so In C, this seems fairly optimal to me (and
this type of thing I'm looking for advice on)

(early caveat : all pseudo c code assuming 32bit word for clarity and not
getting lost in type sizes)


bitmap =0;
while(columnsToCheck-- && !(bitmap = *imagePtr++));
if(bitmap){
/* now find leftmost pixel */
}


should be optimal enough?

>
> If you've used memchr() and have located a non-zero byte, it


memchr() would certinaly not be on the agenda I think : it needs a
character to compare whereas the implicit recognition of "zero"
or "non zero" is faster. Also, I read in words (properly aligned of
course) and not bytes : bytes would certainly strangle the application.

> might make sense to use a precalculated 256-element table to finish
> the job without further "computation." Then again, it might not:


Cou you expand on that table bit a little : I dont know about this. Is
it something CPU specific?

> modern CPU's are much faster than memory, and might be able to
> complete a multi-step computation in less time than a memory fetch.
> More measurements are needed.
>
> If you've searched int-by-int you could dive directly into
> the second search, or you could first use a couple of comparisons
> to decide which of the four bytes holds the "first" non-zero and
> then proceed as if memchr() had located that byte. Which will be
> faster? Measure, measure, measure.
>
> If you need a bit index and don't want to use a table, the
> obvious test-and-shift-and-count approach is probably best. If
> you need a mask as opposed to an index, the `x & (x - 1)' trick


Could you expand on that? I would see it as simple as

(find left pixel mask)

/* we know the bitmask is non zero so a bit check is a must*/
for(bit=0x8000000;!(bit&bitmap);bit>>=1);
/*word bit mask protects all bits from leftmost pixel on*/
leftWordMask = bit | (bit-1);

> may be faster. Measure, measure, -- is there an echo in here?


Ive used a few profilers on Vax, Convex and on MS : but will be using
gcov for initial run time analysis on linux - but I'm not sure if its
what I want for real execution time profiling. But I'm certainly
looking for "C hints" on faster ways with bitmaps first : then I can
decide on a design approach : I think the code above (silly errors not
withstanding), appears to be relatively quick and the core "loop" areas are
small enough where I would expect their execution footprint to get cached.

>
> Finally, an impression of the two Web pages you cited. A
> quick glance suggests they're filled with well-intentioned and
> sometimes reasonable advice, but ALL of it is out of context.


Yes & no : it has some general advice on standard stuff which can help
: alignment, powers of two etc. It even discusses use of machine words
at some stage instead of smaller units which can cause excessive
overhead I think. Certainly helpful to anyone embarking on optimizing
anything in C for the first time.

> (Necessarily so, of course: the authors of the Web pages have no
> idea what kind of program you're trying to write.) Even expert
> advice can become nonsense when taken out of context. Your
> grandmother's doctor is (presumably) an expert; do you therefore
> take the same medicines that are prescribed for your grandmother?
> IMHO, it is best to read and understand advice of the sort these
> pages offer, but NOT to follow it until and unless you have reason
> to believe it applies to your situation.
>


I appreciate your reply,

thanks for taking the time to do so.
 
Reply With Quote
 
Richard G. Riley
Guest
Posts: n/a
 
      03-13-2006
On 2006-03-13, Laurent Deniau <(E-Mail Removed)> wrote:
>
> Maybe http://www.jjj.de/bitwizardry/ is a good start.
>
> a+, ld.


Great! Thanks : suprised I didnt google it up in the first place. Nice
resource, cheers.

 
Reply With Quote
 
James Dow Allen
Guest
Posts: n/a
 
      03-13-2006

Richard G. Riley wrote:
> Without resorting to asm chunks I'm working on a few small routines
> which manipulate bitmasks. I'm looking for any guidance on writing C
> in a manner which tilts the compilers hand in, if possible, a
> compiler/underlying processor independant way : althought to be fair I
> cant see this stuff on anything other than x86, but who knows.


Perhaps useless info if you only run on x86 but...

The Motorola 68020 has special bitfield opcodes which some
C compilers will generate *if* you declare the data as C bitfields.
(One of these special opcodes -- 'bfffo' -- is unusual and can come in
*very* handy sometimes, but I doubt any compiler will use *it*.)

James Dow Allen

 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      03-13-2006


Richard G. Riley wrote On 03/13/06 10:43,:
> On 2006-03-13, Eric Sosman <(E-Mail Removed)> wrote:
>
>>Richard G. Riley wrote:
>>
>>
>>>Without resorting to asm chunks I'm working on a few small routines
>>>which manipulate bitmasks. I'm looking for any guidance on writing C
>>>in a manner which tilts the compilers hand in, if possible, a
>>>compiler/underlying processor independant way : althought to be fair I
>>>cant see this stuff on anything other than x86, but who knows.
>>>
>>>I found some ok info here:
>>>
>>>http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
>>>http://www.devx.com/amd/Article/21314
>>>
>>>But any techniques for doing the standard bit fiddles without
>>>occurring extra unnecessary cpu bandwidth would be nice. By standard
>>>bit fiddles I mean : shift & rotate, anding, oring, shift til bit
>>>found etc. The operations will be occurring on sequences of
>>>words and the smallest word will be the standard CPU word size (32
>>>bits). By sequences of words : I might, for example, be locating
>>>the first non zero bit from the left in sequence of 64 words.

>>
>> First, it's not at all clear what you're trying to speed
>>up, nor what you have measured and found to be too slow. A

>
>
> Hi Eric,
>
> I'm looking for advice/pointers on the operators mentioned above. Its
> not a case of anything being too slow : its a question of ensuring or
> approaching optimum performance as early as possible. [...]


D.E. Knuth (elaborating on a remark by Hoare): "We should
forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil."

W.A. Wulf: "More computing sins are committed in the name
of efficiency (without necessarily achieving it) than for any
other single reason, including blind stupidity."

Jackson's Laws of Program Optimization:
FIRST LAW: Don't do it.
SECOND LAW (for experts only): Don't do it yet.

See also <http://www.flounder.com/optimization.htm>.

... and that's all I'll offer on the matter. Until you
have made measurements or done calculations that demonstrate
a performance problem, "approaching optimum performance as
early as possible" is folly. If you wish to behave foolishly,
go ahead -- but do it on your own; I'm not going to assist
you to your ruin.

--
(E-Mail Removed)

 
Reply With Quote
 
Richard G. Riley
Guest
Posts: n/a
 
      03-13-2006
On 2006-03-13, Eric Sosman <(E-Mail Removed)> wrote:
>
>
> D.E. Knuth (elaborating on a remark by Hoare): "We should
> forget about small efficiencies, say about 97% of the time:
> premature optimization is the root of all evil."
>
> W.A. Wulf: "More computing sins are committed in the name
> of efficiency (without necessarily achieving it) than for any
> other single reason, including blind stupidity."
>
> Jackson's Laws of Program Optimization:
> FIRST LAW: Don't do it.
> SECOND LAW (for experts only): Don't do it yet.
>
> See also <http://www.flounder.com/optimization.htm>.
>
> ... and that's all I'll offer on the matter. Until you
> have made measurements or done calculations that demonstrate
> a performance problem, "approaching optimum performance as
> early as possible" is folly. If you wish to behave foolishly,
> go ahead -- but do it on your own; I'm not going to assist
> you to your ruin.
>


Thanks for those references : all very valid if designing a large
system. I am designing/impementing a very low level system with fairly
specific requirements : and apriori knowledge of what can be achieved
at the earliest stages to guide the design to embrace known optimal
techniques is very much relevant. The key here is "known optimal
techniques" : if there are such techniques already worked out (as the
link another poster gave recomemnds) then it would be foolish to dismiss
those out of hand.
 
Reply With Quote
 
Mark F. Haigh
Guest
Posts: n/a
 
      03-13-2006
Richard G. Riley wrote:
> Without resorting to asm chunks I'm working on a few small routines
> which manipulate bitmasks. I'm looking for any guidance on writing C
> in a manner which tilts the compilers hand in, if possible, a
> compiler/underlying processor independant way : althought to be fair I
> cant see this stuff on anything other than x86, but who knows.
>
> I found some ok info here:
>
> http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
> http://www.devx.com/amd/Article/21314
>
> But any techniques for doing the standard bit fiddles without
> occurring extra unnecessary cpu bandwidth would be nice. By standard
> bit fiddles I mean : shift & rotate, anding, oring, shift til bit
> found etc. The operations will be occurring on sequences of
> words and the smallest word will be the standard CPU word size (32
> bits). By sequences of words : I might, for example, be locating
> the first non zero bit from the left in sequence of 64 words.
>
> Any pointers appreciated.


Isolate the bit scan routine into its own function, and write it in C
as simply as possible. Then special-case the function for x86/x86-64
using the appropriate bit scan machine instruction (bsf, bsr). Make
sure the function actually gets inlined.

This is really OT, but I mention it because I've yet to see a compiler
that actually generates bit scan instructions, and the speedup from
using them is around 10x (in my experience). Bit scanning is heavily
used in some areas of networking, particularly in packet
classification.

However, this is probably the _only_ thing you should code in assembly,
as bit twiddling C code is normally compiled to very tight assembly.
In fact, it's most often faster than non-expert hand written assembly,
simply because the compiler knows more about the minutae of the machine
architecture and instruction scheduling.


Mark F. Haigh
(E-Mail Removed)

 
Reply With Quote
 
Richard G. Riley
Guest
Posts: n/a
 
      03-13-2006
On 2006-03-13, Mark F. Haigh <(E-Mail Removed)> wrote:
> Richard G. Riley wrote:
>> Without resorting to asm chunks I'm working on a few small routines
>> which manipulate bitmasks. I'm looking for any guidance on writing C
>> in a manner which tilts the compilers hand in, if possible, a
>> compiler/underlying processor independant way : althought to be fair I
>> cant see this stuff on anything other than x86, but who knows.
>>
>> I found some ok info here:
>>
>> http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
>> http://www.devx.com/amd/Article/21314
>>
>> But any techniques for doing the standard bit fiddles without
>> occurring extra unnecessary cpu bandwidth would be nice. By standard
>> bit fiddles I mean : shift & rotate, anding, oring, shift til bit
>> found etc. The operations will be occurring on sequences of
>> words and the smallest word will be the standard CPU word size (32
>> bits). By sequences of words : I might, for example, be locating
>> the first non zero bit from the left in sequence of 64 words.
>>
>> Any pointers appreciated.

>
> Isolate the bit scan routine into its own function, and write it in C
> as simply as possible. Then special-case the function for x86/x86-64
> using the appropriate bit scan machine instruction (bsf, bsr). Make
> sure the function actually gets inlined.
>
> This is really OT, but I mention it because I've yet to see a compiler
> that actually generates bit scan instructions, and the speedup from
> using them is around 10x (in my experience). Bit scanning is heavily
> used in some areas of networking, particularly in packet
> classification.
>
> However, this is probably the _only_ thing you should code in assembly,
> as bit twiddling C code is normally compiled to very tight assembly.
> In fact, it's most often faster than non-expert hand written assembly,
> simply because the compiler knows more about the minutae of the machine
> architecture and instruction scheduling.
>


Thanks Mark : I'll take a look and might try that at a later date :
I'm "off" with assembler at the moment but I'll put that on the back
burner until the "optimized C" parts are in place - although to be
honest theres not a lot there other than good common sense
programming. The link another poster gave is interesting too

http://www.jjj.de/bitwizardry/

But I've just located an x86 reference PDF and they are certainly the
instructions for boundary bit checks.

>
> Mark F. Haigh
> (E-Mail Removed)
>


cheers,

richard.


--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matlo...g.html#tth_sEc
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
optimizing DropDownLists Alejandro Penate-Diaz ASP .Net 1 04-08-2005 01:54 PM
Firefox optimizing websites (help) Tiernan Firefox 11 03-07-2005 12:14 AM
Optimizing ViewState Ben Fidge ASP .Net 4 02-18-2005 02:39 PM
Optimizing using the IncrementalBuild property ASP .Net 5 07-28-2004 05:49 PM
Optimizing perfomance on T3 line Christoph Schad Cisco 12 01-01-2004 08:40 PM



Advertisments