Why large performance degradation?

 
 
Kevin Wan
07-28-2003
Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb.c:

/* array_copy_dumb.c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        DUMBCOPY; /* 1 */

    return 0;
}

array_copy_smart.c:

/* array_copy_smart.c */
#include <string.h>

#define SMARTCOPY memcpy(destination, source, 65536)

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        SMARTCOPY;

    return 0;
}

$ gcc -O3 array_copy_dumb.c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smart.c -o array_copy_smart
$ time array_copy_smart
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!
 
Carsten Hansen
07-28-2003

"Kevin Wan" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> Hello,
>
> Would anyone explain why there is a consistent large performance
> degradation with the dumb copy?
>
> Thanks in advance!
>
> array_copy_dumb.c:
>
> /* array_copy_dumb.c */
> #define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] =
> source[i]
>
> int main(void)
> {
> char source[65536], destination[65536];
> int i, j;
> for (j = 0; j < 100; ++j)
> DUMBCOPY; /* 1 */
>
> return 0;
> }
>
> array_copy_smart.c:
>
> /* array_copy_smart.c */
> #include <string.h>
>
> #define SMARTCOPY memcpy(destination, source, 65536)
>
> int main(void)
> {
> char source[65536], destination[65536];
> int i, j;
> for (j = 0; j < 100; ++j)
> SMARTCOPY;
>
> return 0;
> }
>
> $ gcc -O3 array_copy_dumb.c -o array_copy_dumb
> $ time array_copy_dumb
> real 4.99
> user 4.86
> sys 0.00
> $ gcc -O3 array_copy_smart.c -o array_copy_smart
> $ time array_copy_smart
> real 0.15
> user 0.15
> sys 0.00
>
> I just know it's due to cache! But would anyone explain it in detail?
>
> Thanks again!



Any decent C library will have a highly optimized version of memcpy,
typically implemented in assembly language.
At the least it will copy 4 bytes at a time where possible. Some
implementations copy 8 bytes at a time using floating-point
instructions, others use SIMD to copy 16 bytes at a time, and some play
tricks with cache lines.
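
For illustration, the several-bytes-at-a-time idea might look like the
sketch below. This is not any particular library's code: it assumes the
length is a multiple of the word size and that both buffers are suitably
aligned, and a real implementation would handle the unaligned head and
tail (usually in assembly, which also sidesteps C's aliasing rules).

/* copy_words: word-at-a-time copy sketch.
   Assumes n is a multiple of sizeof(unsigned long) and that both
   pointers are suitably aligned; real memcpy implementations handle
   the general case. */
#include <stddef.h>

void copy_words(void *dst, const void *src, size_t n)
{
    unsigned long *d = dst;
    const unsigned long *s = src;
    size_t words = n / sizeof(unsigned long);

    while (words--)
        *d++ = *s++;  /* moves 4 or 8 bytes per iteration */
}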

Your program has undefined behavior as you copy uninitialized data.

Carsten Hansen


 
Tim Prince
07-28-2003
Kevin Wan wrote:

> Would anyone explain why there is a consistent large performance
> degradation with the dumb copy?

[quoted code and timings snipped]

> I just know it's due to cache! But would anyone explain it in detail?
If you know so much, and aren't willing to read up on it or to tell us
anything about the architecture, why are you asking?
But do consider that memcpy() surely performs the copy 4, 8, 16 or more
bytes at a time, as appropriate to the architecture for which it was built.
--
Tim Prince
 
Martijn
07-28-2003
Tim Prince wrote:
> But do consider that memcpy() surely performs the copy 4, 8, 16 or
> more bytes at a time, as appropriate to the architecture for which it
> was built.


Apart from that: while it may be faster on your particular
implementation, it may not be faster on another. Some reasons:

- memcpy may not use array indexing, but increment pointers instead
(array indexing costs extra address calculations for the source and
destination on every iteration) - you could apply this to your own code
and see what happens (see the sketch after this list)

- memcpy may copy from back to front (on some systems a comparison with
0 is faster than one with an arbitrary number - only a very minor gain,
but over 65536 comparisons it may help)

- memcpy may make direct use of registers, whereas in your case the
compiler may not keep i in a register (hence some extra copying to and
from memory)

- memcpy may use movsb, movsw, movsd (x86 only) or similar string-move
assembly instructions
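
A hypothetical version of your loop applying the first two ideas,
pointer increments and a countdown toward 0, might look like this;
whether it actually wins depends entirely on the compiler and CPU:

/* copy_bytes: byte copy using pointer increments and a countdown.
   Purely illustrative; a good compiler may generate the same code
   for the indexed loop anyway. */
void copy_bytes(char *dst, const char *src, unsigned long n)
{
    while (n--)          /* loop test is a comparison with 0 */
        *dst++ = *src++; /* no per-iteration index arithmetic */
}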

Hope this helped,

--
Martijn Haak
http://www.serenceconcepts.nl


 
Nils Petter Vaskinn
07-28-2003
On Sun, 27 Jul 2003 20:38:06 -0700, Kevin Wan wrote:

> Hello,
>
> Would anyone explain why there is a consistent large performance
> degradation with the dumb copy?
>


> #define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

65536 copy operations


> #define SMARTCOPY memcpy(destination, source, 65536)

1 call to memcpy that has probably been optimized


You'll have to look at the implementation of memcpy to see what it does
differently.
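
One way to see the difference on your own system (assuming gcc, which
the timings above use) is to inspect the assembly the compiler emits
rather than hunting down the library source:

$ gcc -O3 -S array_copy_dumb.c     # inspect array_copy_dumb.s
$ gcc -O3 -S array_copy_smart.c    # the memcpy may be inlined or left
                                   # as a call into the C library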

hth
NPV
 
Dan Pop
07-28-2003
In <(E-Mail Removed)>, (E-Mail Removed) (Kevin Wan) writes:

>Would anyone explain why there is a consistent large performance
>degradation with the dumb copy?
>
>#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]
>
>#define SMARTCOPY memcpy(destination, source, 65536)


There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be optimised much by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it by a smart memcpy).

The first thing to do is to replace the byte-by-byte copying by a
register-by-register copying. The next thing is to play tricks with the
cache, so that the input data is already in the cache by the time you
need it. There may be even more processor-specific tricks for
accelerating the memory accesses when performed in a purely sequential
way. The processor may even have a specialised instruction for performing
this operation as fast as possible. To exploit all these things, memcpy
is typically implemented in assembly.
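
As a sketch of the cache trick (using GCC's __builtin_prefetch purely
for illustration; the cache-line size and prefetch distance below are
made-up tuning parameters, and this is not how any particular libc
implements memcpy):

/* copy_with_prefetch: request source cache lines ahead of use.
   64 is an assumed cache-line size and 256 an assumed prefetch
   distance; real implementations tune both per processor. */
#include <stddef.h>

void copy_with_prefetch(char *dst, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        if (i % 64 == 0)                       /* once per cache line */
            __builtin_prefetch(src + i + 256); /* fetch ahead of the loop */
        dst[i] = src[i];
    }
}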

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: (E-Mail Removed)
 
Randy Howard
07-29-2003
In article <bg3nee$2id$(E-Mail Removed)>, (E-Mail Removed) says...
> There are many ways of accelerating memcpy on modern architectures, while
> your dumb loop can't be optimised much by a compiler (unless the compiler
> writer has put some effort into recognising it as a dumb memcpy and
> replacing it by a smart memcpy).


I had a situation not too long ago where memcmp() on MSVC 6.0 was a
huge performance bottleneck. gcc seems to be smart enough to use
block-compare instructions for appropriate Intel CPUs, where the MS
compiler was not. This isn't normally important, but this app did
some very large block memcmp()'s, and it became a problem. Also,
knowing that the size of the blocks being compared would always be
evenly divisible by 4 allowed the compare to be heavily optimized on
Intel.

Rewriting it as an inline assembly version for WIN32 resulted in a 30%
performance improvement. Linux/gcc code compiled from the same source
(minus the memcmp() replacement) ran with basically identical
performance with no manual intervention.
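
To give a flavor of that optimization (a hypothetical sketch, not the
actual WIN32 assembly: it assumes the length is a multiple of 4, the
buffers are suitably aligned, and only equality matters rather than
memcmp's ordered result):

/* blocks_equal: 4-bytes-at-a-time equality test.
   Assumes n is a multiple of 4 and suitably aligned buffers; returns
   1 if equal, 0 otherwise. Unlike memcmp, it reports no ordering. */
#include <stddef.h>

int blocks_equal(const void *a, const void *b, size_t n)
{
    const unsigned int *pa = a;  /* 32-bit words on the platforms discussed */
    const unsigned int *pb = b;
    size_t words = n / 4;

    while (words--)
        if (*pa++ != *pb++)
            return 0;
    return 1;
}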


 
Randy Howard
07-29-2003
In article <(E-Mail Removed)>, (E-Mail Removed)
says...
> In any case, the moral of the story is "Don't do it yourself if the
> language gives you a way of doing it".


Usually, but not always. There isn't much point in not following the
above unless you have performance measurements showing that a particular
library routine is a bottleneck. In those cases, worry about finding a
better method.



 