Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > More optimisation

Reply
Thread Tools

More optimisation

 
 
kid joe
Guest
Posts: n/a
 
      06-16-2009
Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, whats the best way
to do that? Lets say for a silly example I have arrays a,b,c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
a[i] *= 2;
b[i] *= 3;
c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
a[i] *= 2;
for(long int i = 0; i < n; i++)
b[i] *= 3;
for(long int i = 0; i < n; i++)
c[i] *= 4;

Now you could argue that Method 1 should be faster because there are less
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because theres better data locality
so better use of cache.

I know people will say to profile it on my system! But Id be interested
in peoples thoughts on this on a variety of different hardwares.

Cheers,
Joe


--
...................... o _______________ _,
` Good Evening! , /\_ _| | .-'_|
`................, _\__`[_______________| _| (_|
] [ \, ][ ][ (_|


 
Reply With Quote
 
 
 
 
user923005
Guest
Posts: n/a
 
      06-16-2009
On Jun 16, 1:58*pm, kid joe <spamt...@spamtrap.invalid> wrote:
> Hi all,
>
> I know this is fairly system specific but I was hoping some people here
> with a wide experience of C implementations on different platforms could
> pass on their knowledge.
>
> If I want to perform actions on several large arrays, whats the best way
> to do that? Lets say for a silly example I have arrays a,b,c all of size n
> and I want to multiply everything in a by 2, everything in b by 3 and
> everything in c by 4.
>
> Method 1:
> for(long int i = 0; i < n; i++) {
> * a[i] *= 2;
> * b[i] *= 3;
> * c[i] *= 4;
>
> }
>
> Method 2:
> for(long int i = 0; i < n; i++)
> * a[i] *= 2;
> for(long int i = 0; i < n; i++)
> * b[i] *= 3;
> for(long int i = 0; i < n; i++)
> * c[i] *= 4;
>
> Now you could argue that Method 1 should be faster because there are less
> branching instructions, and branching is expensive. But then you could
> argue that Method 2 should be better because theres better data locality
> so better use of cache.
>
> I know people will say to profile it on my system! But Id be interested
> in peoples thoughts on this on a variety of different hardwares.


My opinion is that these optimizations are very, very silly.

You should not perform these optimizations unless:
1. The routine does not meet performance specifications
2. The profile reveals that initialization of arrays is the bottleneck

Array initialization is linear for a 1-D array, and so it is probably
the least of your worries.

It's like buying Pirelli P3 tires for an old Volkswagen beetle.
It's costly, it's silly, and it won't make the beetle go any faster.

P.S.
You are also right about the profile. I guess that different machines
and different compilers will give different answers. And if you
change the data types you may get different answers yet again.

P.P.S.
The first rule of optimization is "DON'T DO IT."
The second rule of optimization (for experts only) is "Don't do it
*yet*."

These are serious rules. A practice of premature optimization because
you think something might go faster due to tweaky little hacks is a
very, very, very, very bad coding practice. Not only will it make the
code more expensive, it is also going to be a complete waste of time
in almost all cases.

If you want to make code go faster, the very last place you should
ever imagine going is hopping on board the good ship "tweaky hacks."
Instead, improve the algorithm. Chances are very good you can find an
algorithm in the literature that will improve the code. If you cannot
find one, consider thinking about how to improve the algorithm
yourself.

I guess you will ignore this advice, since it did not stick last
time. That's too bad.
 
Reply With Quote
 
 
 
 
Kenny McCormack
Guest
Posts: n/a
 
      06-16-2009
In article <h1911r$vsv$>, kid joe <> wrote:
>Hi all,
>
>I know this is fairly system specific but I was hoping some people here
>with a wide experience of C implementations on different platforms could
>pass on their knowledge.
>
>If I want to perform actions on several large arrays, whats the best way
>to do that? Lets say for a silly example I have arrays a,b,c all of size n
>and I want to multiply everything in a by 2, everything in b by 3 and
>everything in c by 4.
>
>Method 1:
>for(long int i = 0; i < n; i++) {


Of course you know, all you will get in the way of responses is people
pointing out that the above is a syntax error in C (C89, of course,
since that's all there is).

Or you may, if you are lucky, get responses telling you, in patronizing
detail, how you need to flag the above as "C99" (or gcc, or whatever) only.

 
Reply With Quote
 
Lie Ryan
Guest
Posts: n/a
 
      06-16-2009
The fastest would be to use SSE, I think...
 
Reply With Quote
 
Tim Prince
Guest
Posts: n/a
 
      06-17-2009
kid joe wrote:

>
> I know this is fairly system specific but I was hoping some people here
> with a wide experience of C implementations on different platforms could
> pass on their knowledge.
>
> If I want to perform actions on several large arrays, whats the best way
> to do that? Lets say for a silly example I have arrays a,b,c all of size n
> and I want to multiply everything in a by 2, everything in b by 3 and
> everything in c by 4.
>
> Method 1:
> for(long int i = 0; i < n; i++) {
> a[i] *= 2;
> b[i] *= 3;
> c[i] *= 4;
> }
>
> Method 2:
> for(long int i = 0; i < n; i++)
> a[i] *= 2;
> for(long int i = 0; i < n; i++)
> b[i] *= 3;
> for(long int i = 0; i < n; i++)
> c[i] *= 4;
>
> Now you could argue that Method 1 should be faster because there are less
> branching instructions, and branching is expensive. But then you could
> argue that Method 2 should be better because theres better data locality
> so better use of cache.

For several years, all desktop CPUs (and laptops of similar families)
have supported at least 4 write combine buffers per logical or physical
processor, thus efficiently supporting 4 or more parallel paths to
memory. There are some still on the market which don't efficiently
support vectorized store without 16-byte alignment, so then you might
have an argument for splitting it up.
Without restrict or other indications to the compiler that the arrays
don't alias, you will kill performance with multiple arrays per loop.

Certain compilers, with appropriate options and syntax, will choose
these optimizations for you.
The loop length threshold where OpenMP parallelism pays off will be
lower with more work in the loop, if you remove the other possible
obstacles to optimization.
 
Reply With Quote
 
Stephen Sprunk
Guest
Posts: n/a
 
      06-17-2009
kid joe wrote:
> Method 1:
> for(long int i = 0; i < n; i++) {
> a[i] *= 2;
> b[i] *= 3;
> c[i] *= 4;
> }
>
> Method 2:
> for(long int i = 0; i < n; i++)
> a[i] *= 2;
> for(long int i = 0; i < n; i++)
> b[i] *= 3;
> for(long int i = 0; i < n; i++)
> c[i] *= 4;
>
> Now you could argue that Method 1 should be faster because there are less
> branching instructions, and branching is expensive. But then you could
> argue that Method 2 should be better because theres better data locality
> so better use of cache.


I expect the better data locality in method 2 would give better
performance unless n is very small; your loops aren't complex enough to
keep a modern desktop/server CPU busy in between waiting for cache line
fills.

Branching is _not_ expensive on such CPUs (though it may be on old or
embedded CPUs); what is really expensive is a _mispredicted_ branch, and
in simple for loops like the above, the branch should be correctly
predicted for every iteration except the last (and perhaps the first).
The rest of the branches are effectively free. The increments and
comparisons before the branches will just fill in pipeline bubbles and
are also effectively free. However, those costs may reassert themselves
if your actual loops are more complicated than shown above and there
aren't any bubbles to fill.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Isaac Jaffe
 
Reply With Quote
 
Kaz Kylheku
Guest
Posts: n/a
 
      06-17-2009
On 2009-06-16, kid joe <> wrote:
> Hi all,
>
> I know this is fairly system specific but I was hoping some people here
> with a wide experience of C implementations on different platforms could
> pass on their knowledge.
>
> If I want to perform actions on several large arrays, whats the best way
> to do that?


The best way is to look at the memory hiearchy of the processor you are using.
Try to avoid needless re-loading the same parts of an array from one level of
caching to another; e.g. from main memory to the L2 cache, or from L2 to L1.

> Lets say for a silly example I have arrays a,b,c all of size n
> and I want to multiply everything in a by 2, everything in b by 3 and
> everything in c by 4.
>
> Method 1:
> for(long int i = 0; i < n; i++) {
> a[i] *= 2;
> b[i] *= 3;
> c[i] *= 4;
> }
>
> Method 2:
> for(long int i = 0; i < n; i++)
> a[i] *= 2;
> for(long int i = 0; i < n; i++)
> b[i] *= 3;
> for(long int i = 0; i < n; i++)
> c[i] *= 4;
>
> Now you could argue that Method 1 should be faster because there are less
> branching instructions, and branching is expensive. But then you could
> argue that Method 2 should be better because theres better data locality
> so better use of cache.


This caching analysis is false. Neither method uses caching particularly well.
This program simply doesn't benefit from caching very much, because it
processes large (or presumably large) arrays sequentially. From a caching point
of view, it makes little difference whether we process each of three large
arrays separately, or march through them together.

The problem is that our program never looks at a value once it has
updated it. Caching provides the most benefit when you access the same thing
you have recently accessed. If you keep accessing new material, caching
provides less of a benefit. The organization of the cache system can still
speed up the accesses when the pattern is linear, because memory access is
parallelized: entire cache lines are being sucked into the cache in a burst
mode memory access, and in parallel through a wide bus, rather than individual
words. So the fact that you are processing words in order is a kind of locality
which helps to take advantage of the parallelism inherent in the organization
of the cache in the lines.

The one caching consideration may be that the arrays are subject to worst-case
cache aliasing. Suppose that a[], b[] and c[] are situated at such addresses
that their corresponding parts hash to the same cache line, and suppose that
your processor's cache is of the simplest possible type: direct mapped. So
a[0], b[0] and c[0] land on the same cache line and so on. This means that the
first example will trash horribly. The very first access a[0] *= 2 will load an
entire cache line. So you have a[1], a[2] through, say, a[7] ready for fast
access. But you don't take advantage of it: the next access after a[0] is not
a[1] but b[0]. And in our worst-case situation, this maps to the same cache
line as a[0], with a direct mapping. Oops! The cache line is dirty since we
just stored a[0], so the whole line has to be written out first. After being
written out, the cache line is loaded with b[0] through b[7]. The b[0] is
updated in the cache line, and now c[0] is accessed. Oops! The entire cache
line is flushed and re-loaded again with c[0] thorugh c[7]. Then the trashing
repeats with a[1], over the same three cache lines, and so on.

This potential problem does not threaten the second example, since you are
marching through large areas of memory in order; you are not accessing three
different large areas of memory at some fixed displacement from each other.

The first example will run about as fast as the second if the cache lines do
not conflict. That is to say, if the parallel elements of a[], b[] and c[] map
to three separate cache lines you are okay. It doesn't matter whether your
cache has 32 lines in total or 256. You just need three unique ones at any time
so that the loop can suck data without interference.
Aliasing problems can occur at multiple levels in the memory hiearchy.

And also in other kinds of caches like TLB's for virtual address translation.

Suppose you have a direct mapped TLB, and the code alternates its accesses
between two virtual pages that clash to the same TLB entry. Poof; you are
playing a game of TLB miss-and-reload ping pong.
 
Reply With Quote
 
Antoninus Twink
Guest
Posts: n/a
 
      06-17-2009
On 16 Jun 2009 at 21:33, user923005 wrote:
> These are serious rules. A practice of premature optimization because
> you think something might go faster due to tweaky little hacks is a
> very, very, very, very bad coding practice.


This is true up to a point.

However, *all other things being equal* it's obviously better to choose
code that's likely to run faster on most platforms (as opposed to the
Heathfield method of choosing code likely to run slower on most
platforms).

I think the OP was saying: "OK, I've got this situation where I could do
this in two different ways. Both are equally clear and readable, both
are correct and maintainable, so why not let probable speed on probable
architectures be the determining factor as to which one I pick?"

And why not indeed? Even if you see efficiency (or in Heathfield's case,
inefficiency) as the very last entry in your lexicographic order of the
desirable properties of code, it still beats tossing a coin.

 
Reply With Quote
 
Hamiral
Guest
Posts: n/a
 
      06-17-2009
kid joe a écrit :
> Method 2:
> for(long int i = 0; i < n; i++)
> a[i] *= 2;
> for(long int i = 0; i < n; i++)
> b[i] *= 3;
> for(long int i = 0; i < n; i++)
> c[i] *= 4;


I'd sayr method two is faster, but if you like micro-optimization, I'd
do it like this :

long int i;
int *tmp_ptr; (assuming a, b and c are int arrays)

tmp_ptr = a;
for(i=0; i<n; i++) {
*tmp_ptr <<= 1; // equivalent to *=2
tmp_ptr++;
}
tmp_ptr = b;
for(i=0; i<n; i++) {
*tmp_ptr *= 3;
tmp_ptr++;
}
tmp_ptr = c;
for(i=0; i<n; i++) {
*tmp_ptr <<= 2; // equivalent to *=4
tmp_ptr++;
}

But maybe assigning tmp_ptr each time makes you lose the speed you won
by using bitshift and pointer incrementation...

As a final note, I'd say you should rely on your compiler. Unless you're
on a very low end hardware, I'm really not sure you can't optimize
better than your compiler does...

Ham
 
Reply With Quote
 
Edwin van den Oetelaar
Guest
Posts: n/a
 
      06-17-2009
kid joe wrote:
> Hi all,
>
> I know this is fairly system specific but I was hoping some people here
> with a wide experience of C implementations on different platforms could
> pass on their knowledge.
>
> If I want to perform actions on several large arrays, whats the best way
> to do that? Lets say for a silly example I have arrays a,b,c all of size n
> and I want to multiply everything in a by 2, everything in b by 3 and
> everything in c by 4.
>
> Method 1:
> for(long int i = 0; i < n; i++) {
> a[i] *= 2;
> b[i] *= 3;
> c[i] *= 4;
> }
>
> Method 2:
> for(long int i = 0; i < n; i++)
> a[i] *= 2;
> for(long int i = 0; i < n; i++)
> b[i] *= 3;
> for(long int i = 0; i < n; i++)
> c[i] *= 4;



If your calculations are more expensive than *2 *3 *4 it could be worth running multiple threads if
you have more cores/cpu.
Method 2 can be run in parallel, method 1 not.

Just my $0.02, benchmark if in doubt.

Gr,
Edwin
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Search Engine Optimisation sorry.no.email@post_NG.com HTML 0 05-08-2006 11:54 AM
Network speed (question on optimisation) Dima Computer Support 1 04-09-2004 09:05 PM
Optimisation of regexps in Perl? Fredrik Ramsberg Perl 2 10-15-2003 08:30 AM
boolean loop optimisation Roedy Green Java 8 09-12-2003 10:24 AM
optimisation for the web chaz Digital Photography 2 08-12-2003 09:35 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57