Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Performance of hand-optimised assembly

Reply
Thread Tools

Performance of hand-optimised assembly

 
 
Seebs
Guest
Posts: n/a
 
      12-24-2011
On 2011-12-24, BartC <(E-Mail Removed)> wrote:
> So while modern compilers might be ultra-sophisticated, most of these
> aren't! Admittedly they are all free downloads..


The commercial compiler vendors usually regard gcc as a poor performer, I'm
told.

That said... One of the things compilers tend to be better at than humans
is *large* optimizations. Consider sqlite; while it's distributed with a
reasonably normal structure of source files and such, actual compilation is
done by merging all the sources into a single large file. Why? Because
the compiled output runs noticably faster.

Compilers can afford to try experiments and optimizations that human users
usually can't.

-s
--
Copyright 2011, all wrongs reversed. Peter Seebach / http://www.velocityreviews.com/forums/(E-Mail Removed)
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
I am not speaking for my employer, although they do rent some of my opinions.
 
Reply With Quote
 
 
 
 
Geoff
Guest
Posts: n/a
 
      12-24-2011
On 24 Dec 2011 19:24:14 GMT, Seebs <(E-Mail Removed)> wrote:

>On 2011-12-24, BartC <(E-Mail Removed)> wrote:
>> So while modern compilers might be ultra-sophisticated, most of these
>> aren't! Admittedly they are all free downloads..

>
>The commercial compiler vendors usually regard gcc as a poor performer, I'm
>told.


They naturally would, wouldn't they? I would expect them to come up
with benchmarks and lots of nice slide presentations touting their
products against each other and against the freeware stuff. I would be
more impressed with a commercial compiler vendor stating that xyz
freeware compiler is a good performer.

The trouble with GCC is that there are so many switches it's probably
impossible to test all the modes and actually come up with a truly
optimum configuration for a given target and application.

Back in the day, I was programming an embedded Z80/Z180 RTOS using
Microtec Research's MCC80. One day, I found an unusual bottleneck in a
function and nothing in the source could account for it. Examining the
generated assembly I found the thing was thrashing around with a
string of mov hl,hl; mov bc,hl; mov hl,bc; mov hl,hl; opcodes. Thirty
minutes on the phone with the maintenance programmer yielded
indication of a bug in the compiler. Fixed in another hour, emailed a
new release of the compiler. It's hard to do that these days.
 
Reply With Quote
 
 
 
 
Ben Bacarisse
Guest
Posts: n/a
 
      12-24-2011
"BartC" <(E-Mail Removed)> writes:
<snip>
> In any case the 'winning entry' (gcc -O3) bypassed the whole
> issue by inlining the code (which I considered cheating).


That's an interesting point. I think it would be enough to show that
"beating" gcc -O2 is both possible and worth-while, but inlining is a
tool available to compilers that is often not available to hand coders.
When it can be done, it's another point in favour of sticking to C and
relying on the compiler.

<snip>
--
Ben.
 
Reply With Quote
 
Philip Lantz
Guest
Posts: n/a
 
      12-24-2011
BartC wrote:
> Having to use that ghastly AT&T syntax is an argument *against* using
> assembly, even if it is faster...


You can use the Intel syntax with gas. Use the pseudo-op
.intel_syntax noprefix
You can even use it in gcc inline assembly--open each inline group with
that and close with
.att_syntax prefix
I only bother to do it for code groups with more than a few
instructions.
 
Reply With Quote
 
Ben Bacarisse
Guest
Posts: n/a
 
      12-24-2011
Willem <(E-Mail Removed)> writes:

> Ben Bacarisse wrote:
> ) I'm starting a new thread, but it is prompted by this remark from 88888
> ) Dihedral <(E-Mail Removed)> about a short binary search
> ) function I posted:
> )
> ) <snip>
> ) On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
> ) times. clang's code is a little better at -O0, but the optimised code
> ) is a little worse:
> ) -O0 -O1 -O2 -O3
> ) gcc 0.1128 0.0633 0.0571 0.0562
> ) g++ 0.1167 0.0633 0.0572 0.0564
> ) clang 0.1033 0.0669 0.0672 0.0672
> )
> ) These times are microseconds per call. The loop overhead has been
> ) subtracted[2].
>
> Have you tried enabling the processor-specific switches on gcc?


As per Jens's post, I just tried -march=native and got:

-O0 -O1 -O2 -O3
gcc 0.1127 0.0633 0.0571 0.0563
gcc -march=native 0.1124 0.0633 0.0567 0.0561

The tiny improvement in the -O2 and -O3 times is repeatable, but the -O0
time varies by two or three in the last digits anyway. It's not a
significant improvement on this processor.

> ) [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache


I can't at the moment test the AMD chip.

--
Ben.
 
Reply With Quote
 
Ben Bacarisse
Guest
Posts: n/a
 
      12-25-2011
James Harris <(E-Mail Removed)> writes:

> On Dec 23, 6:43*pm, Ben Bacarisse <(E-Mail Removed)> wrote:
>
> ...
>
> I don't understand some of your code. Queries below.
>
>> Here is a test program:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> size_t binsearch(int a[], size_t length, int key)

>
> Why use size_t instead of unsigned int or unsigned long int,
> especially as you add it to sum in the test harness which is long int?
> Why not use unsigned for sum in the test harness?


int is better yet because the output will then be same across machines.
Using *any* unsigned type means the "not found" -1 return becomes a
value that can vary from implementation to implementation. For this
test code I should have used int, but size_t is the "right" return type
for a binary search function like this and I did not think to change it.

>> {
>> * * *size_t low = 0, high = length;

>
> Why not high = length - 1?


Explained below.

>> * * *while (high > low) {
>> * * * * * size_t middle = low + (high - low)/2;

>
> Why not middle = (high + low) / 2?


To avoid "overflow". (Quotes because it's wrap-around with unsigned
types but it's still wrong.) You'd need peculiar input to see this
in practise but it seems nicer to avoid the issue altogether.

>> * * * * * if (key == a[middle])
>> * * * * * * * *return middle;
>> * * * * * else if (key > a[middle])
>> * * * * * * * *low = middle + 1;
>> * * * * * else high = middle;

>
> Why not high = middle - 1?


Unsigned types, basically. The symmetric version that adds or subtracts
one from middle always keeps the target index in [low..high]. That
means high must start at length - 1 which is annoying for unsigned types
like size_t (because length == 0 becomes a special case). I keep the
target index in [low..high) (i.e. a half open interval) so high is set
to length at the start and to middle when the range is being cut to
lower portion.

There are lots of ways round this, but I think they all complicate an
otherwise algorithm. Perhaps the simplest is just to test length first
and return -1 if it's 0 before doing any real work. Maybe you have a
favourite method that can use size_t bounds?

>> * * *}
>> * * *return -1;
>>
>> }

>
> The reason for the queries is I wrote an asm binsearch but when
> comparing it with your code realised that you had some differences in
> the approach.


Obviously there's no need to copy the approach in the body of the
function, but I'd say you should copy the prototype (at least at first).
If it turn out that you can get 20% better performance with asm that
uses different argument and/or return types, then that's an interesting
fact and should not be ignored just because I used size_t.

--
Ben.
 
Reply With Quote
 
Weland
Guest
Posts: n/a
 
      12-25-2011
> and so on) was a thing of the past. But maybe I'm wrong. Is it still
> possible to get a significant speed-up over C by hand coding? How
> much depends on the processor?


It is still possible, but it does greatly depend on the CPU and compiler.
For most common platforms, beating GCC is fairly difficult (or possible,
but not worthwile IMO). However, as you go deeper in their line of ports,
it becomes more and more possible.

On platforms that have only fairly recently become supported, like TI's
MSP430 (mind you, that is a microcontroller), it can be done and it is
worthwhile. I use msp430-gcc at work; most of the time it generates rea-
sonable code, but it doesn't always make the best decisions. I don't re-
member it generating incorrect code, but I do remember it generating
slow code more than once.

IMO, most of the time there's just no reason to do it on modern, high
performance architectures -- the difficulties in porting it later and the
additional time you spent just aren't worth it. But on memory-constrained
systems it might still be an option, as long as you have a sane separation
of portable and non-portable code.

--
(E-Mail Removed)
SDF Public Access UNIX System - http://sdf.org
% grep me no patterns and I'll tell you no lines
 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      12-25-2011
Well, after 3 days, there are some things that look
fairly obvious:


1) Nobody knows assembly here, besides io_x that is the only one
that really answered the question and presented an assembly
language program.
(Sorry if I missed one assembly language statement or discussion
in the text of all the messages posted into this strange thread)

2) Since the problem chosen by Ben was geared to problems where a
compiler is better than an assembly language programmer the
C programmers must win. Ben twicked the rules like this:

<quote>
I've felt (it's only a feeling) that hand coding for speed (rather
than to access features not available in C like extra wide
multiplies and so on) was a thing of the past.
<quote>

But exactly using features not available in "high level" languages
like C is what makes assembly endure today as a very good way of
programming (I would not say that assembly language is a "language"
in the sense of a "computer language"). So, if you eliminate one
of the main reasons, the other reasons left are surely not so
worthhile.

3) Anyway, I passed the favorite compiler of this group through the
program, and after 30 minutes I had improved by 8% even with -O2.

I wanted to post something but I am not so convinced. Is it worth?
I can hear already the screams of this crowd praying their "portability"
god when an old computer hacker tells them:

ASSEMBLY LIVES!

jacob




 
Reply With Quote
 
88888 Dihedral
Guest
Posts: n/a
 
      12-25-2011
jacob navia於 2011年12月26日星期一UTC+8上午3時12分54 寫道:
> Well, after 3 days, there are some things that look
> fairly obvious:
>
>
> 1) Nobody knows assembly here, besides io_x that is the only one
> that really answered the question and presented an assembly
> language program.
> (Sorry if I missed one assembly language statement or discussion
> in the text of all the messages posted into this strange thread)
>
> 2) Since the problem chosen by Ben was geared to problems where a
> compiler is better than an assembly language programmer the
> C programmers must win. Ben twicked the rules like this:
>
> <quote>
> I've felt (it's only a feeling) that hand coding for speed (rather
> than to access features not available in C like extra wide
> multiplies and so on) was a thing of the past.
> <quote>
>
> But exactly using features not available in "high level" languages
> like C is what makes assembly endure today as a very good way of
> programming (I would not say that assembly language is a "language"
> in the sense of a "computer language"). So, if you eliminate one
> of the main reasons, the other reasons left are surely not so
> worthhile.
>
> 3) Anyway, I passed the favorite compiler of this group through the
> program, and after 30 minutes I had improved by 8% even with -O2.
>
> I wanted to post something but I am not so convinced. Is it worth?
> I can hear already the screams of this crowd praying their "portability"
> god when an old computer hacker tells them:
>
> ASSEMBLY LIVES!
>
> jacob
>
>


If computing PI or RSA2048 encoding and decoding, nowadays the job is still
paid well to use 64 bit multipliers in HW directly.
 
Reply With Quote
 
BartC
Guest
Posts: n/a
 
      12-25-2011
"jacob navia" <(E-Mail Removed)> wrote in message
news:jd7sjj$gum$(E-Mail Removed)...

> 1) Nobody knows assembly here, besides io_x that is the only one
> that really answered the question and presented an assembly
> language program.


He did, but it's not clear how closely it corresponds to the C algorithm
presented. And anyway it seemed the results were of more interest than the
details, which are obviously going to be very specific to processor,
assembler, language and compiler.

(But here anyway is the asm code I used, in Nasm format, for x86-32, and
used as inline code for the body of a function taking parameters a, length
and key:

%define rlow ebx
%define rhigh ecx
%define rmiddle eax
%define rkey edi
%define rdata esi

mov rdata,[a]
mov rhigh,[length]
mov rkey,[key]
mov rlow,0

lab1:
cmp rhigh,rlow
jle labx

mov rmiddle,rhigh
sub rmiddle,rlow
shr rmiddle,1
add rmiddle,rlow

cmp [rmiddle*4+rdata],rkey
jz labret
jge lab3

lea rlow,[rmiddle+1]
jmp lab1

lab3:
mov rhigh,rmiddle
jmp lab1

labx: mov eax,-1
labret:
pop ebp
retn 12

)
--
Bartc

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
adding assembly to windows\assembly through bat file Grant Merwitz ASP .Net 3 09-15-2005 11:40 AM
Assembly's manifest definition does not match the assembly reference. Horatiu Margavan via .NET 247 ASP .Net 0 08-30-2004 04:14 PM
ASP.NET 2.0: What is the namespace and assembly name of generated assembly SA ASP .Net 0 08-09-2004 05:09 PM
Referencing assembly from GAC using @assembly fails Brent ASP .Net 1 01-23-2004 08:23 PM
can a strongly named assembly reference a regular assembly? Prasanna Padmanabhan ASP .Net 1 11-19-2003 06:21 AM



Advertisments