strange performance behaviour for memcpy

Discussion in 'Windows 64bit' started by klaus, May 3, 2006.

  1. I use XP Pro x64 and Visual Studio 2005.
    1000 memcpy calls of 1 MB each take 1375 ticks when compiled for 64 bit.
    1000 memcpy calls of 1 MB each take 2719 ticks when compiled for 32 bit.
    => 64 bit seems to be about twice as fast

    10000 memcpy calls of 512 KB each take 6125 ticks when compiled for 64 bit.
    10000 memcpy calls of 512 KB each take 1422 ticks when compiled for 32 bit.
    => 64 bit seems to be more than four times slower!

    Could someone shed some light on this strange behaviour?
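
For comparison, the four measurements can be converted to throughput. The sketch below assumes that one "tick" is one millisecond (as with GetTickCount, or clock() on Visual C++); the post does not say which timer was used. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in MiB/s for `calls` copies of `mib_per_call` MiB each.
 * ASSUMPTION: one tick equals one millisecond. */
double throughput_mib_per_s(unsigned long calls, double mib_per_call,
                            unsigned long ticks_ms)
{
    assert(ticks_ms > 0);
    return (double)calls * mib_per_call / ((double)ticks_ms / 1000.0);
}

/* Prints the throughput for the four measurements quoted in the post. */
void print_throughputs(void)
{
    printf("1 MB,   64 bit: %.0f MiB/s\n", throughput_mib_per_s(1000, 1.0, 1375));
    printf("1 MB,   32 bit: %.0f MiB/s\n", throughput_mib_per_s(1000, 1.0, 2719));
    printf("512 KB, 64 bit: %.0f MiB/s\n", throughput_mib_per_s(10000, 0.5, 6125));
    printf("512 KB, 32 bit: %.0f MiB/s\n", throughput_mib_per_s(10000, 0.5, 1422));
}
```

Under that assumption the 64-bit build stays at roughly 730 to 820 MiB/s for both sizes, while the 32-bit build goes from about 370 MiB/s at 1 MB to about 3500 MiB/s at 512 KB. So the 64-bit figure is nearly flat, and it is the 32-bit figure that changes by almost a factor of ten between the two sizes.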
     
    klaus, May 3, 2006
    #1

  2. Rob Perkins Guest

    klaus wrote:
    > I use XP Pro x64 and Visual Studio 2005.
    > 1000 memcpy calls with 1MB width last 1375 ticks compiled for 64 bit.
    > 1000 memcpy calls with 1MB width last 2719 ticks compiled for 32 bit.
    > => 64 bit seems to be twice times faster
    >
    > 10000 memcpy calls with 512KB width last 6125 ticks compiled for 64 bit.
    > 10000 memcpy calls with 512KB width last 1422 ticks compiled for 32 bit.
    > => 64 bit seems to be more than four times slower!
    >
    > Could someone shed some light on this strange behaviour?


    Your process got preempted. Make your test with 20 million memcpy calls,
    or set your process to realtime priority before running your test. At
    your own risk, of course.

    Rob
     
    Rob Perkins, May 3, 2006
    #2

  3. "Rob Perkins" wrote:

    > klaus wrote:
    > > I use XP Pro x64 and Visual Studio 2005.
    > > 1000 memcpy calls with 1MB width last 1375 ticks compiled for 64 bit.
    > > 1000 memcpy calls with 1MB width last 2719 ticks compiled for 32 bit.
    > > => 64 bit seems to be twice times faster
    > >
    > > 10000 memcpy calls with 512KB width last 6125 ticks compiled for 64 bit.
    > > 10000 memcpy calls with 512KB width last 1422 ticks compiled for 32 bit.
    > > => 64 bit seems to be more than four times slower!
    > >
    > > Could someone shed some light on this strange behaviour?

    >
    > Your process got preempted. Make your test with 20 million memcpy calls,
    > or set your process to realtime priority before running your test. At
    > your own risk, of course.
    >
    > Rob
    >

    Rob,

    I do not think that this is caused by another thread. Regardless of how
    often I repeat the test, I get similar results. And, more importantly, an
    application compiled for win32 runs three times faster than the same
    application compiled for x64. This was the starting point for profiling the
    application, which led me to the memcpy problem.
     
    klaus, May 3, 2006
    #3
  4. Rob Perkins Guest


    > I do not think that this is caused by another thread. Regardless how often I
    > repeat the test, I have similar results. And, more importantly, an
    > application compiled for win32 runs three times faster in comparison to the
    > situation compiled for x64. This was the starting point to profile the
    > application and get down to the memcpy problem.


    As far as I can tell, the tests themselves are disparate. You adjusted
    the order of magnitude of the data copied at the same time as you
    adjusted the order of magnitude of the number of transfers.

    Make some tests with 4 KB memcpy calls, then 16 KB, then 64 KB, then
    256 KB, then 512 KB, then 1 MB. Make a series with 1000 calls, followed by
    10000, followed by 1000000 calls, using each of the memory sizes. That's
    18 test series.

    With that data, we might be able to begin to guess why.
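
A test matrix like this could be driven by something along the following lines. This is only a sketch in portable C: the names time_memcpy and run_sweep are invented here, and clock() stands in for whatever timer the original test used.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Times `calls` memcpy calls of `size` bytes each.
 * Returns the elapsed time in seconds as reported by clock(),
 * or -1.0 if the buffers could not be allocated. */
double time_memcpy(size_t size, unsigned long calls)
{
    assert(size > 0);
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);
    memset(dst, 0, size);            /* touch the pages before timing */

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < calls; i++) {
        src[i % size] = (unsigned char)i;  /* make every copy differ */
        memcpy(dst, src, size);
        sink ^= dst[i % size];             /* read the result back */
    }
    clock_t end = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Runs every size from the post (4 KB to 1 MB) for each call count given. */
void run_sweep(const unsigned long *counts, size_t ncounts)
{
    static const size_t sizes_kb[] = {4, 16, 64, 256, 512, 1024};
    for (size_t c = 0; c < ncounts; c++) {
        for (size_t s = 0; s < sizeof sizes_kb / sizeof sizes_kb[0]; s++) {
            double secs = time_memcpy(sizes_kb[s] * 1024, counts[c]);
            printf("%8lu calls x %4lu KB: %.3f s\n",
                   counts[c], (unsigned long)sizes_kb[s], secs);
        }
    }
}
```

Calling run_sweep with the counts 1000, 10000 and 1000000 gives the 18 series. Note that 1000000 calls of 1 MB move about 1 TB in total, so that last series takes a long time on any machine.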

    Rob
     
    Rob Perkins, May 3, 2006
    #4
  5. If you trace down into the assembly of memcpy for x86/x64 you will see that
    the x86 version uses SSE2 extensions to speed up the copy, whereas x64 seems
    to just copy one block (4 or 8 bytes?) at a time.

    I don't know exactly why one is faster than the other for different sizes
    (512KB vs 1MB), but the fact that the code being executed is radically
    different for x86 vs x64 would be enough for me to just say "huh, that's
    weird, oh well" and move on with things. :)

    If you want to do a 'fair' test then write your own memcpy that just loops
    copying 4/8 bytes at a time, i.e. don't use the optimized assembly
    routines... of course this will (should) be slower than using the asm
    versions.
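
A plain copy loop of that kind might look like the sketch below. The name copy_words is hypothetical. The two fixed-size memcpy calls inside the loop are the portable C idiom for an unaligned 8-byte load and store; compilers typically turn each into a single move instruction when optimizing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes, 8 bytes per iteration, then the remaining tail byte
 * by byte. Source and destination must not overlap. */
void *copy_words(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);   /* 8-byte load without aliasing issues */
        memcpy(d, &w, sizeof w);   /* 8-byte store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n > 0) {                /* tail of fewer than 8 bytes */
        *d++ = *s++;
        n--;
    }
    return dest;
}
```

Be aware that an optimizing compiler may recognize such a loop and substitute a call to its own memcpy, so it is worth checking the generated assembly before comparing timings.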


    Chris
     
    Chris Kushnir, May 3, 2006
    #5
  6. "Chris Kushnir" wrote:

    > If you trace down into the assembly of memcpy for x86/x64 you will see that
    > the x86 uses SSE2 extensions to speed up the copy whereas x64 seems to just
    > copy one block (4 or 8 bytes ?) at a time.
    >
    > I don't know exactly why one is faster than another for different sizes
    > (512KB vs 1MB) but the fact that the code being exectuted is radically
    > different for x86 vs x64 would be enough for to just say "huh, that's weird,
    > oh well" and move on with things. :)
    >
    > If you want to do a 'fair' test then write your own memcpy that just loops
    > copying 4/8 bytes at a time i.e. don't use the optimized assembly routines
    > .... of course this will (should) be slower than using the asm versions.
    >
    >
    > Chris


    … No, I do not want to move on with things. The application I want to
    compile for 64 bit is performance critical. And I found out that the overall
    performance under 64 bit was three times slower compared with 32 bit.
    Actually, the application uses memcpy with different sizes:
    for (int i = 1; i <= 32768; i++)
        memcpy(dest, source, i * 16);
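
To put numbers on this loop: the copy sizes run from 16 bytes up to 32768 * 16 = 524288 bytes (512 KB), so it covers the 512 KB case from the first post but never reaches 1 MB. One pass moves 16 * (32768 * 32769 / 2) = 8,590,196,736 bytes, a little over 8 GiB. The sketch below reproduces the arithmetic and the loop; the function names and the buffer setup are assumptions, since the post does not show how dest and source are allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Total bytes moved by:
 *   for (i = 1; i <= max_i; i++) memcpy(dest, source, i * step); */
unsigned long long workload_bytes(unsigned long long max_i,
                                  unsigned long long step)
{
    return step * (max_i * (max_i + 1) / 2);
}

/* Runs the loop once on freshly allocated buffers.
 * Returns 0 on success, -1 on allocation failure or a bad copy. */
int run_workload(size_t max_i, size_t step)
{
    assert(max_i > 0 && step > 0);
    size_t largest = max_i * step;
    unsigned char *source = malloc(largest);
    unsigned char *dest = malloc(largest);
    if (!source || !dest) {
        free(source);
        free(dest);
        return -1;
    }
    memset(source, 0x5A, largest);

    for (size_t i = 1; i <= max_i; i++)
        memcpy(dest, source, i * step);

    /* The last iteration copied the whole buffer, so they must match. */
    int result = (memcmp(dest, source, largest) == 0) ? 0 : -1;
    free(source);
    free(dest);
    return result;
}
```

Timing run_workload(32768, 16) in both a 32-bit and a 64-bit build would measure the application's actual copy pattern rather than a single fixed size.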

    So the crucial point for me is: when I move a 32-bit app to 64 bit and the
    app calls memcpy very often, which routine do I have to call, or what
    preparations do I have to make, to ensure the 64-bit application is not
    slower than the 32-bit one?
    Or do I have to wait for an improved version of the run-time library?
     
    klaus, May 4, 2006
    #6
  7. "Rob Perkins" wrote:

    >
    > > I do not think that this is caused by another thread. Regardless how often I
    > > repeat the test, I have similar results. And, more importantly, an
    > > application compiled for win32 runs three times faster in comparison to the
    > > situation compiled for x64. This was the starting point to profile the
    > > application and get down to the memcpy problem.

    >
    > As far as I can tell, the tests themselves are disparate. You adjusted
    > the order of magnitude of the data copied at the same time as you
    > adjusted the order of magnitude of the number of transfers
    >
    > Make some tests with 4 KB memcpy calls, then 16 KB calls, then 64, then,
    > 256, then 512, then 1 MB. Make a series with 1000 calls, followed by
    > 10000, followed by 1000000 calls, using each of the memory sizes. That's
    > 18 test series.
    >
    > With that data, we might be able to begin to guess why.
    >
    > Rob
    >

    Even the overall performance under 64 bit is more than twice as slow as
    under 32 bit. See my reply to Chris.
     
    klaus, May 4, 2006
    #7
  8. >> No, I do not want to move on with things.

    I wouldn't either, but the point I was trying to make is that you have no
    control over how MS implements memcpy.
    You have little choice but to accept it and hope that future versions of the
    CRT have a more optimized version.

    If you know x64 assembly you could write your own, but that's about your
    only option, as I see it.
    The only other thing I can think of is to try CopyMemory; traditionally,
    however, this just calls memcpy.

    For what it's worth, I have done testing on several of my apps and found
    that overall performance is the same for x86 and x64 compilations.
    If you are finding that overall app performance is significantly worse under
    x64, I would suspect that your app's run time is heavily dependent on memcpy,
    or on other functions that haven't been optimized for x64 yet.

    I would suggest you report this as a bug/enhancement request to MS:
    http://lab.msdn.microsoft.com/productfeedback/


    Chris
     
    Chris Kushnir, May 4, 2006
    #8
  9. > I don't know exactly why one is faster than another for different sizes
    > (512KB vs 1MB) but the fact that the code being exectuted is radically


    When copying 1 MB you are out of cache, and in this case the x86 version
    with SSE movdqa is slower.
    If you do two memcpy calls of 512K each inside the loop, copying two
    different 512K areas, you'll get the same slow result on x86.
    As far as I can see, the x64 version uses "prefetchnta", so it does better
    when you are out of cache, but it does not use SSE, so if your data is in
    cache the SSE version is faster.
    What about combining "prefetch" with SSE? :)
    If you know for sure that all your data is aligned and the source and
    destination do not overlap, you can write your own memcpy which will for
    sure be faster.

    The only small annoying problem is that x64 C compiler does not support
    "_asm {}"
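
One way around the missing inline assembler is compiler intrinsics, which are available in 64-bit builds. The following is only a sketch of the "prefetch plus SSE" idea under the stated preconditions (16-byte aligned buffers, size a multiple of 16, no overlap). The name copy_aligned16 and the prefetch distance of 256 bytes are arbitrary choices for illustration, not tuned values, and whether this beats the CRT memcpy would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_COPY 1
#else
#define HAVE_SSE2_COPY 0
#endif

/* Copies n bytes using aligned 16-byte SSE2 moves plus a non-temporal
 * prefetch hint. Preconditions: dest and src are 16-byte aligned, n is
 * a multiple of 16, and the buffers do not overlap. Falls back to
 * memcpy when SSE2 is not available at compile time. */
void *copy_aligned16(void *dest, const void *src, size_t n)
{
    assert(((uintptr_t)dest % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(n % 16 == 0);
#if HAVE_SSE2_COPY
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = n / 16;

    for (size_t i = 0; i < blocks; i++) {
        /* Every 64 bytes, hint the data 256 bytes (16 blocks) ahead,
         * as long as that address is still inside the source buffer. */
        if (i % 4 == 0 && i + 16 < blocks)
            _mm_prefetch((const char *)(s + i + 16), _MM_HINT_NTA);
        _mm_store_si128(d + i, _mm_load_si128(s + i));  /* movdqa pair */
    }
#else
    memcpy(dest, src, n);
#endif
    return dest;
}
```

The sizes in klaus's loop are all multiples of 16, so the size precondition would hold there; the alignment of dest and source would still have to be guaranteed by the allocation.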

    Regards,
    Sergey


    "Chris Kushnir" <> wrote in message
    news:%...
    > If you trace down into the assembly of memcpy for x86/x64 you will see
    > that the x86 uses SSE2 extensions to speed up the copy whereas x64 seems
    > to just copy one block (4 or 8 bytes ?) at a time.
    >
    > I don't know exactly why one is faster than another for different sizes
    > (512KB vs 1MB) but the fact that the code being exectuted is radically
    > different for x86 vs x64 would be enough for to just say "huh, that's
    > weird, oh well" and move on with things. :)
    >
    > If you want to do a 'fair' test then write your own memcpy that just loops
    > copying 4/8 bytes at a time i.e. don't use the optimized assembly routines
    > ... of course this will (should) be slower than using the asm versions.
    >
    >
    > Chris
    >
    >
     
    Sergey Kashyrin, May 4, 2006
    #9
  10. Guest Guest

    SK- [Thu, 4 May 2006 18:32:06 -0400]:
    >What if to combine "prefetch" with SSE :)


    From what I know, a memory-bound routine will
    -not- improve by using prefetch since all memory
    bus cycles are already being used. Throwing in
    prefetches could even slow a memcpy down.

    --
    40th Floor - Software @ http://40th.com/
    iPlay : the ultimate audio player for mobiles
    parametric eq, xfeed, reverb; all on a mobile
     
    Guest, May 5, 2006
    #10
