digitalmars.D.learn - drastic slowdown for copies

"Momo" <_ _.de> writes:
I'm currently investigating the speed difference between 
references and copies. It seems that copies suffer an immense 
slowdown once the struct reaches a size of >= 20 bytes.
In the code below, as long as my struct is smaller than 20 
bytes (e.g. 4 ints = 16 bytes), a copy is cheaper than a 
reference. But with 5 ints (= 20 bytes) the copy becomes about 3 
times slower. I got these results (times in msecs):

16 bytes:
by ref: 49
by copy: 34
by move: 32

20 bytes:
by ref: 51
by copy: 104
by move: 103

My question is: why?

My system is Win 8.1, 64 Bit and I'm using dmd 2.067.1 (32 bit)

Code:

import std.stdio;
import std.datetime;

struct S {
     int[4] values;
}

pragma(msg, S.sizeof);

void by_ref(ref const S s) {

}

void by_copy(const S s) {

}

enum size_t Loops = 10_000_000;

void main() {
     StopWatch sw;

     sw.start();
     for (size_t i = 0; i < Loops; i++) {
         S s = S();
         by_ref(s);
     }
     sw.stop();

     writeln("by ref: ", sw.peek().msecs);

     sw.reset();

     sw.start();
     for (size_t i = 0; i < Loops; i++) {
         S s = S();
         by_copy(s);
     }
     sw.stop();

     writeln("by copy: ", sw.peek().msecs);

     sw.reset();

     sw.start();
     for (size_t i = 0; i < Loops; i++) {
         by_copy(S());
     }
     sw.stop();

     writeln("by move: ", sw.peek().msecs);
}
May 28 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
16 bytes is 64 bit - the same size as a reference. So copying it 
is overall a bit less work - sending a 64 bit struct is as small 
as a 64 bit reference and you don't go through the pointer.

So up to then, it is a bit faster.


Add another byte and now the copy is too big to fit in a 
register, so it needs to spill over into somewhere else which 
means a bunch more work for the cpu.
May 28 2015
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/28/2015 11:27 PM, Adam D. Ruppe wrote:
 16 bytes is 64 bit
It's actually 128 bits.
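
The arithmetic can be checked at compile time. This is a minimal sketch; the struct names S4 and S5 are made up for illustration, and the pointer size printed depends on whether the program is built as 32 bit or 64 bit.

```d
import std.stdio : writeln;

// Hypothetical names: S4 mirrors the struct from the original post,
// S5 is the same struct with one more int.
struct S4 { int[4] values; }
struct S5 { int[5] values; }

static assert(S4.sizeof == 16);      // 4 ints of 4 bytes each
static assert(S4.sizeof * 8 == 128); // 16 bytes are 128 bits, not 64
static assert(S5.sizeof == 20);

void main() {
    writeln("S4: ", S4.sizeof, " bytes = ", S4.sizeof * 8, " bits");
    writeln("S5: ", S5.sizeof, " bytes = ", S5.sizeof * 8, " bits");
    // 4 bytes in a 32 bit build, 8 bytes in a 64 bit build
    writeln("pointer: ", (void*).sizeof, " bytes");
}
```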
May 28 2015
"Momo" <_ _.de> writes:
On Thursday, 28 May 2015 at 21:27:42 UTC, Adam D. Ruppe wrote:
 16 bytes is 64 bit - the same size as a reference. So copying 
 it is overall a bit less work - sending a 64 bit struct is as 
 small as a 64 bit reference and you don't go through the 
 pointer.

 So up to then, it is a bit faster.


 Add another byte and now the copy is too big to fit in a 
 register, so it needs to spill over into somewhere else which 
 means a bunch more work for the cpu.
But even in release mode (and with optimizations turned on) it is 
3 times slower. Can I somehow enforce references, like in C++? 
I have already tried in ref, const ref and immutable ref; none of them helps.
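
One option the thread does not discuss is a template function with an auto ref parameter, which binds lvalues by reference and still accepts rvalues. The sketch below only shows how the argument is passed; the helper name passedByRef is hypothetical, and whether this changes the timings above has not been measured here.

```d
import std.stdio : writeln;

struct S { int[5] values; }

// Hypothetical helper: with `auto ref`, the compiler instantiates a
// by-reference version for lvalues and a by-value version for rvalues.
bool passedByRef(T)(auto ref const T s) {
    return __traits(isRef, s);
}

void main() {
    S s;
    writeln("lvalue by ref: ", passedByRef(s));   // named variable
    writeln("rvalue by ref: ", passedByRef(S())); // temporary
}
```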
May 28 2015
"Momo" <_ _.de> writes:
Perhaps you can give me another detailed answer.
I get a slowdown for all parts (ref, copy and move) if I use 
uninitialized floats. I got these results from the following code:

by ref:  2369
by copy: 2335
by move: 2341

Code:

struct vec2f {
     float x;
     float y;
}

But if I initialize them to 0, I get these results:

by ref:  49
by copy: 22
by move: 25

Why?
May 29 2015
Ali Çehreli <acehreli yahoo.com> writes:
On 05/29/2015 06:55 AM, Momo wrote:
 Perhaps you can give me another detailed answer.
 I get a slowdown for all parts (ref, copy and move) if I use
 uninitialized floats.
Floating point variables are initialized to .nan of their types (e.g. float.nan). Apparently, the CPU is slow when using those special values:

http://stackoverflow.com/questions/3606054/how-slow-is-nan-arithmetic-in-the-intel-x64-fpu

Ali
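
The default initialization can be observed directly. This is a minimal sketch; vec2f matches the struct from the question, while vec2f_zero is a hypothetical variant with explicit zero initializers.

```d
import std.math : isNaN;
import std.stdio : writeln;

struct vec2f {
    float x; // no initializer: defaults to float.nan
    float y;
}

// Hypothetical variant with explicit initializers.
struct vec2f_zero {
    float x = 0;
    float y = 0;
}

void main() {
    vec2f a;
    vec2f_zero b;
    assert(isNaN(a.x) && isNaN(a.y)); // default-initialized floats are NaN
    assert(b.x == 0 && b.y == 0);     // explicit initializers win
    writeln("a.x = ", a.x, ", b.x = ", b.x);
}
```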
May 29 2015
"thedeemon" <dlang thedeemon.com> writes:
On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:
 I'm currently investigating the speed difference between 
 references and copies. It seems that copies suffer an immense 
 slowdown once the struct reaches a size of >= 20 bytes.
This is processor-specific; on different CPU models you might get different results. Here's what I see running your program with 4 and 5 ints in the struct:

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
16u

C:\prog\D>copyref.exe
by ref: 18
by copy: 85
by move: 84

C:\prog\D>copyref.exe
by ref: 18
by copy: 72
by move: 72

C:\prog\D>copyref.exe
by ref: 16
by copy: 72
by move: 72

C:\prog\D>dmd copyref.d -ofcopyref.exe -release -O -inline
20u

C:\prog\D>copyref.exe
by ref: 23
by copy: 98
by move: 91

C:\prog\D>copyref.exe
by ref: 20
by copy: 91
by move: 102

C:\prog\D>copyref.exe
by ref: 23
by copy: 91
by move: 91

I see these numbers on an old Core 2 Quad and very similar ones on a Core i3. So your findings are not reproducible.
May 29 2015
"thedeemon" <dlang thedeemon.com> writes:
On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:

Ah, actually it's more complicated, as it depends on inlining a 
lot.
Indeed, without -O and -inline I was able to get by_ref to be 
slightly slower than by_copy for a struct of 4 ints. But when 
inlining is turned on, the numbers change in different directions. 
And for 5 ints the influence of inlining is quite different:

4 ints:             5 ints:
-release
by ref: 53          by ref: 53
by copy: 57         by copy: 137
by move: 54         by move: 137

-release -O
by ref: 38          by ref: 34
by copy: 54         by copy: 137
by move: 49         by move: 137

-release -O -inline
by ref: 15          by ref: 20
by copy: 72         by copy: 91
by move: 72         by move: 91
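
Because the benchmarked functions are empty, the inliner can change what is actually being measured. One way to keep the calls in place is sketched below. It assumes a compiler that supports pragma(inline, false) and the std.datetime.stopwatch module; older releases such as the one used in this thread may have neither. The helper name measure is hypothetical.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

struct S {
    int[5] values; // 20 bytes; use int[4] for the 16 byte case
}

// Assumption: the compiler honors pragma(inline, false), so these
// calls stay real calls even when -inline is passed.
pragma(inline, false)
void by_ref(ref const S s) {}

pragma(inline, false)
void by_copy(const S s) {}

enum size_t Loops = 10_000_000;

// Hypothetical helper: times Loops calls of fn, returns milliseconds.
long measure(alias fn)() {
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. Loops) {
        S s = S();
        fn(s);
    }
    sw.stop();
    return sw.peek.total!"msecs";
}

void main() {
    writeln("by ref:  ", measure!by_ref());
    writeln("by copy: ", measure!by_copy());
}
```

The absolute numbers will still depend on the CPU and the compiler flags, as the tables above show.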
May 29 2015
"thedeemon" <dlang thedeemon.com> writes:
On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:

The numbers above were measured on a Core 2 Quad; here are the 
results for a Core i3:

4 ints          5 ints
-release
by ref: 67      by ref: 66
by copy: 44     by copy: 142
by move: 45     by move: 137

-release -O
by ref: 29      by ref: 29
by copy: 41     by copy: 141
by move: 40     by move: 142

-release -O -inline
by ref: 16      by ref: 20
by copy: 83     by copy: 104
by move: 83     by move: 104
May 29 2015
"Momo" <_ _.de> writes:
On Friday, 29 May 2015 at 07:51:31 UTC, thedeemon wrote:
 On Thursday, 28 May 2015 at 21:23:11 UTC, Momo wrote:

 Ah, actually it's more complicated, as it depends on inlining a 
 lot.
Yes. And real functions are more complex, so inlining is not a reliable option.
 Indeed, without -O and -inline I was able to get by_ref to be 
 slightly slower than by_copy for a struct of 4 ints. But when 
 inlining is turned on, the numbers change in different directions. 
 And for 5 ints the influence of inlining is quite different:

 4 ints:             5 ints:
 -release
 by ref: 53          by ref: 53
 by copy: 57         by copy: 137
 by move: 54         by move: 137

 -release -O
 by ref: 38          by ref: 34
 by copy: 54         by copy: 137
 by move: 49         by move: 137

 -release -O -inline
 by ref: 15          by ref: 20
 by copy: 72         by copy: 91
 by move: 72         by move: 91
So as you can see, it is 2-3 times slower. Is there an alternative?
May 29 2015