Velocity Reviews - Computer Hardware Reviews

Odd benchmark behavior

 
 
Nils M Holm
 
      03-26-2012

I ran across some odd (as far as I can tell) behavior recently, when
performing some fun benchmarks. (Yeah, but please bear with me!)

The goal was to find out exactly how bad the code generated by my
toy compiler is, so I fed the following program to GNU C (GCC),
Portable C (PCC), and my toy compiler (let's call it SCC):

-----[ p.c ] ---------------------------------------------------
#include <stdio.h>

#define MAX 20000

int main(void) {
    int i, j, p[MAX], k;

    for (k = 0, i = 3; k < MAX; i += 2) {
        for (j = 0; j < k; j++)
            if (i % p[j] == 0)
                break;
        if (j < k) continue;
        p[k++] = i;
    }
    printf("%d\n", p[k-1]);
    return 0;
}
----------------------------------------------------------------

Just a simple trial-division prime search (not really a sieve) to
calculate the 20,000th prime number, nothing fancy, just some tight
loops. Please note that I do NOT intend to show off the code quality
of SCC here; the code it emits is in fact abysmally bad. So I was
quite surprised to get the following run times for the generated
programs (median of three runs, almost no deviation):

gcc -O2 p.c: 12.39s
pcc -O p.c: 12.42s
scc p.c: 12.27s

I can explain the difference between GCC/PCC and SCC by lower
startup overhead, but shouldn't the better code quality of PCC
and GCC compensate for this over a run time of 12 seconds?

I am rather puzzled now. Any ideas?

The output of my toy compiler is attached below. You can download
the complete compiler source code here in case you want to run your
own experiments:

http://www.t3x.org/subc/

Any explanations or additional data (run times) would be greatly
appreciated.

-----[ compiler output (SubC) ]---------------------------------
.text
.globl Cmain
Cmain: pushl %ebp # main() {
movl %esp,%ebp
addl $-80012,%esp # int i, j, p[MAX], k;
movl $0,%eax # for (k=0, i=3;
movl %eax,-80012(%ebp)
movl $3,%eax
movl %eax,-4(%ebp)
L2: movl -80012(%ebp),%eax # k<MAX;
pushl %eax
movl $20000,%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L6
incl %edx
L6: movl %edx,%eax
orl %eax,%eax
jnz L7
jmp L4
L7: jmp L3
L5: movl $2,%eax # i+=2) {
pushl %eax
movl -4(%ebp),%eax
popl %ecx
addl %ecx,%eax
movl %eax,-4(%ebp)
jmp L2
L3: movl $0,%eax # for (j=0;
movl %eax,-8(%ebp)
L8: movl -8(%ebp),%eax # j<k;
pushl %eax
movl -80012(%ebp),%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L12
incl %edx
L12: movl %edx,%eax
orl %eax,%eax
jnz L13
jmp L10
L13: jmp L9
L11: movl -8(%ebp),%eax # j++) {
incl -8(%ebp)
jmp L8
L9: movl -4(%ebp),%eax # if (i % p[j] == 0)
pushl %eax
leal -80008(%ebp),%eax
pushl %eax
movl -8(%ebp),%eax
shll $2,%eax
popl %ecx
addl %ecx,%eax
movl (%eax),%eax
popl %ecx
xchgl %eax,%ecx
cdq
idivl %ecx
movl %edx,%eax
pushl %eax
movl $0,%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jne L14
incl %edx
L14: movl %edx,%eax
orl %eax,%eax
jnz L16
jmp L15
L16: jmp L10 # break;
L15: jmp L11
L10: movl -8(%ebp),%eax # if (j < k)
pushl %eax
movl -80012(%ebp),%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L17
incl %edx
L17: movl %edx,%eax
orl %eax,%eax
jnz L19
jmp L18
L19: jmp L5 # continue;
L18: leal -80008(%ebp),%eax # p[k++] = i;
pushl %eax
movl -80012(%ebp),%eax
incl -80012(%ebp)
shll $2,%eax
popl %ecx
addl %ecx,%eax
pushl %eax
movl -4(%ebp),%eax
popl %edx
movl %eax,(%edx)
jmp L5 # }
L4: .data
L20: .byte 37
.byte 'd'
.byte 10
.byte 0
.text # printf("%d\n", p[k-1]);
movl $L20,%eax
pushl %eax
leal -80008(%ebp),%eax
pushl %eax
movl -80012(%ebp),%eax
pushl %eax
movl $1,%eax
popl %ecx
xchgl %eax,%ecx
subl %ecx,%eax
shll $2,%eax
popl %ecx
addl %ecx,%eax
movl (%eax),%eax
pushl %eax
pushl $2
call Cprintf
addl $12,%esp
L1: addl $80012,%esp # }
popl %ebp
ret
----------------------------------------------------------------

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
 
 
 
 
 
BartC
 
      03-26-2012
"Nils M Holm" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

> The goal was to find out how bad exactly the code generated by my
> toy compiler is, so I passed the following program to GNU C (GCC),
> Portable C (PCC), and my toy compiler (let's call it SCC):


> [program snipped]


> gcc -O2 p.c: 12.39s
> pcc -O p.c: 12.42s
> scc p.c: 12.27s
>
> I can explain the difference between GCC/PCC and SCC by lower
> startup overhead, but shouldn't the better code quality of PCC
> and GCC compensate for this over a run time of 12 seconds?
>
> I am rather puzzled now. Any ideas?


I tried it with my compiler (not a C compiler, but for a similar
language), which, while not a toy, also generates some very bad code.
And while it was slower than the C compilers, all the results fell in
quite a narrow range: 3.1 to 3.4 seconds for the various C compilers,
and 3.5 seconds for mine (all x86-32).

The 3.1 seconds was for gcc -O3.

However, with the conventional sieve benchmark, gcc -O3 was over three
times as fast as my compiler, so perhaps there's something peculiar
about your benchmark.

(The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark,
but 4:1 with the regular sieve. However, the latter repeats the main
loop thousands of times, which I suspect -O3 takes advantage of. You
need a benchmark that isn't repetitive and doesn't use too much memory
either; otherwise the timings will be dominated by memory accesses.)

--
Bartc

 
 
 
 
 
BartC
 
      03-26-2012
"io_x" <(E-Mail Removed)> wrote in message
news:4f70351c$0$1385$(E-Mail Removed). ..
>
> "Nils M Holm" <(E-Mail Removed)> ha scritto nel messaggio


>> gcc -O2 p.c: 12.39s
>> pcc -O p.c: 12.42s
>> scc p.c: 12.27s

>
> i not understand why you build one algo that take 12s if there is
> one other can take 0s...


Because we're investigating code generation, not algorithms. A benchmark
that runs in 0 seconds on any compiler is not very useful!

--
Bartc

 
 
Ian Collins
 
      03-26-2012
On 03/26/12 08:51 PM, Nils M Holm wrote:
> I ran across some odd (as far as I can tell) behavior recently, when
> performing some fun benchmarks. (Yeah, but please bear with me!)
>
> The goal was to find out how bad exactly the code generated by my
> toy compiler is, so I passed the following program to GNU C (GCC),
> Portable C (PCC), and my toy compiler (let's call it SCC):
>
> [program and description snipped]
>
> gcc -O2 p.c: 12.39s
> pcc -O p.c: 12.42s
> scc p.c: 12.27s
>
> I can explain the difference between GCC/PCC and SCC by lower
> startup overhead, but shouldn't the better code quality of PCC
> and GCC compensate for this over a run time of 12 seconds?
>
> I am rather puzzled now. Any ideas?


There probably isn't a great deal an optimiser can do with your code.
I tried it (doubling MAX to 40000 to get a meaningful time) and got
run times between 3.3 and 3.4 seconds with everything from -g to -O3
(-fast with Sun cc).

--
Ian Collins
 
 
Nils M Holm
 
      03-26-2012
BartC <(E-Mail Removed)> wrote:
> (The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark, but
> 4:1 with the regular sieve, however the latter does repeat the main loop
> thousands of times, which I suspect -O3 takes advantage of. You need a
> different benchmark which isn't repetitive, and doesn't use too much memory
> either, otherwise the timings will be dominated by memory accesses.)


Very good point, I hadn't thought of that! I already suspected that
the results were closely tied to my benchmark program, because I did
some other tests where SCC performed as badly as expected.

Alright, off to implement the optimizer...

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
 
 
gwowen
 
      03-26-2012
On Mar 26, 10:40 am, "BartC" <(E-Mail Removed)> wrote:
> > i not understand why you build one algo that take 12s if there is
> > one other can take 0s...
>
> Because we're investigating code generation, not algorithms. A benchmark
> that runs in 0 seconds on any compiler is not very useful!


<CODE type="io_x" idiotic_obfuscation="off">

printf("224737\n");

</CODE>

I WIN!
 
 
BartC
 
      03-26-2012
"Nils M Holm" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

> if (i % p[j] == 0)


> it is in fact abysmally bad. So I was quite surprised to get the
> following run times of the generated programs (median of three
> runs, almost no deviation):
>
> gcc -O2 p.c: 12.39s
> pcc -O p.c: 12.42s
> scc p.c: 12.27s


> I am rather puzzled now. Any ideas?


> idivl %ecx


I suspect the % operation (which needs a divide) dominates the proceedings.
I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
seconds).

--
Bartc

 
 
Nils M Holm
 
      03-27-2012
BartC <(E-Mail Removed)> wrote:
> I suspect the % operation (which needs a divide) dominates the proceedings.
> I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
> seconds).


Yes, division is certainly an expensive operation, but I like your
initial guess better (memory access dominates the results). So I
rewrote the program slightly to give the optimizer a chance to do
better register allocation and voila:

scc p2.c: 30s
gcc -O2 p2.c: 15s

Here is the new program (using an even more brute-force algorithm):

----------------------------------------------------------------
#include <stdio.h>

#define MAX 10000

int main(void) {
    int i, j, k, n = 0;

    for (k = 0, i = 3; k < MAX; i += 2) {
        for (j = 2; j < i/2; j++)
            if (i % j == 0)
                break;
        if (j < i/2) continue;
        n = i;
        k++;
    }
    printf("%d\n", n);
    return 0;
}
----------------------------------------------------------------

I also tried a dummy remainder, and it indeed doubled the run time of
the SCC-generated executable, while the GCC-generated executable was
unaffected. I guess GCC simply removes the dead code.

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
 
 
BartC
 
      03-27-2012


"Nils M Holm" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> BartC <(E-Mail Removed)> wrote:
>> I suspect the % operation (which needs a divide) dominates the
>> proceedings.
>> I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to
>> 6.6
>> seconds).

>
> Yes, division is certainly an expensive operation, but I like your
> initial guess better (memory access dominates the results).


I don't think your program actually used that much memory (only 80KB or so).
Have a look at the NSIEVE benchmark:

http://shootout.alioth.debian.org/gp...sieve&lang=gcc

which allocates memory for some 5 million integers. (The overheads of
memory accesses were such that I was able to get within a factor of
2.5:1 of gcc -O3 using interpreted bytecode, compared with 6:1 when
only a few thousand integers were involved.)

--
Bartc

 
 
Nils M Holm
 
      03-27-2012
BartC <(E-Mail Removed)> wrote:
> I don't think your program actually used that much memory (only 80KB or so).
> Have a look at the NSIEVE benchmark:
>
> http://shootout.alioth.debian.org/gp...sieve&lang=gcc


I think it is not a matter of the amount of memory being allocated,
but of the amount of memory actually being accessed. The algorithm
used in the shootout accesses memory N+p(N) times, where N is the
upper limit of the range to check and p(N) is the number of primes
in that range. My algorithm accesses memory at least N^2/2 times, so
the difference is about the same as O(n) vs O(n^2), i.e. my program
does far more memory accesses within the same amount of memory.

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
 
 
 
 