digitalmars.D - I feel outraged -

Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:
- that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary CPU architectures.

Justin
Oct 15 2009
downs <default_357-line yahoo.de> writes:
Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

- with this weird way of writing posts? The subject should tell us about the content, not your emotional state! :p Also I have no idea what you mean. Should delegate _values_ be heap allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack is relatively likely to be in the CPU cache. A random pointer dereferencing .. isn't. Also, do you really want to heap even more work on the ailing GC?
Oct 15 2009
Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:
downs Wrote:

 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

- with this weird way of writing posts? The subject should tell us about the content, not your emotional state! :p

Re subject line: fair call, you are right. Emotions aside, at least this time I got a response.
 
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC"). When the tough gets going, the going have to get tough. (Meaning to start thinking outside of the square.)
Oct 15 2009
bearophile <bearophileHUGS lycos.com> writes:
Justin Johansson:

 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC").

This is a nice situation: with just 10-20 minutes of experimental tests (that later you have to show us) you can show us you are right, or wrong. Bye, bearophile
Oct 15 2009
Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:
bearophile Wrote:

 Justin Johansson:
 
 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC").

This is a nice situation: with just 10-20 minutes of experimental tests (that later you have to show us) you can show us you are right, or wrong. Bye, bearophile

Steve, bearophile, et al., Yes, timezone (being Australia) is a severe disadvantage for "reality-time" discussion. Best I retire with this as conjecture for the moment. Buona Notte (23.30) Justin
Oct 15 2009
Justin Johansson <no spam.com> writes:
bearophile Wrote:

 Justin Johansson:
 
 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC").

This is a nice situation: with just 10-20 minutes of experimental tests (that later you have to show us) you can show us you are right, or wrong. Bye, bearophile

Fresh light of morning ... I'm really glad to have brought this up as I would not have bothered to revisit a performance issue that I had in porting some C++ to D (and this was not looking good for D at first sight).

As it turns out though, my initial fear about the .sizeof delegate was unfounded as the performance bottleneck was in a loop inside of a method taking what was effectively a callback parameter. The C++ design was basically implementing a classical visitor pattern over a collection. In porting to D I had a choice of either doing a one-for-one translation of the C++ classes or redesigning using D delegates.

This morning I boiled the problem down to the simplest possible with the two forms and benchmarked these. From the code I have reproduced below, clearly the issue is nothing to do with the cost of instantiating a delegate or visitor object. That turns out to be a one time cost and irrelevant given where the iteration actually occurs.

So while this isn't the proof or disproof that bearophile asked for, this turns out to be a clear demonstration of the performance-enhancing power of D delegates over an otherwise ingrained C++ thinking approach. I'm impressed. (Hope that statement isn't too emotional :-)

#! /opt/dmd1/linux/bin/rdmd -I/opt/dmd1/src/phobos -L-L/opt/dmd1/linux/lib -release

import std.perf, std.stdio;

class SetOfIntegers {
  private int from, to;

  this( int from, int to) {
    this.from = from;
    this.to = to;
  }

  int forEach( int delegate( int x) apply) {
    for (auto i = from; i <= to; ++i) {
      apply( i);
    }
    return 0;
  }

  int forEach( Visitor v) {
    for (auto i = from; i <= to; ++i) {
      v.visit( i);
    }
    return 0;
  }
}

class Visitor {
  abstract int visit( int x);
}

class MyVisitor: Visitor {
  int visit( int x) {
    return 0;
  }
}

void test1( SetOfIntegers s, Visitor v) {
  auto pc = new PerformanceCounter();
  pc.start();
  scope(exit) {
    pc.stop();
    writefln( "Using D style delegate callback: %d msec", pc.milliseconds());
  }
  s.forEach( &v.visit);
}

void test2( SetOfIntegers s, Visitor v) {
  auto pc = new PerformanceCounter();
  pc.start();
  scope(exit) {
    pc.stop();
    writefln( "Using C++ style virtual callback: %d msec", pc.milliseconds());
  }
  s.forEach( v);
}

void main() {
  writefln( "Delegates vs virtual function callbacks");
  writefln();
  SetOfIntegers s = new SetOfIntegers( 1, 10000000);
  Visitor v = new MyVisitor();
  for (auto i = 0; i < 10; ++i) {
    test1( s, v);
    test2( s, v);
    writefln();
  }
  writefln();
}

$ ./perf.d
Delegates vs virtual function callbacks

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback: 120 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback: 121 msec
Using C++ style virtual callback: 147 msec

Sweet.

cheers
Justin Johansson
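For readers who want to repeat the comparison today, here is a minimal sketch of the same delegate-versus-virtual-call benchmark. It assumes a current D2 compiler and Phobos (std.datetime.stopwatch) rather than the D1 std.perf module used above. The names SummingVisitor and timeIt are hypothetical, and the visitor accumulates a sum only so that the calls have an observable effect. No timings are shown because they depend on compiler, flags and machine.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

abstract class Visitor
{
    abstract int visit(int x);
}

// Hypothetical visitor: sums its inputs so the calls are not dead code.
final class SummingVisitor : Visitor
{
    long sum;
    override int visit(int x) { sum += x; return 0; }
}

struct IntRange
{
    int stop;

    // Callback passed as a two-word delegate value.
    void forEachDeleg(scope int delegate(int) apply)
    {
        foreach (i; 0 .. stop)
            apply(i);
    }

    // Callback passed as an object; each call goes through the vtable.
    void forEachObj(Visitor v)
    {
        foreach (i; 0 .. stop)
            v.visit(i);
    }
}

// Hypothetical helper: runs `work` once and returns the elapsed time.
Duration timeIt(scope void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes);
    work();
    return sw.peek();
}

void main()
{
    auto range = IntRange(10_000_000);
    auto v = new SummingVisitor;

    foreach (run; 0 .. 3)
    {
        auto viaDelegate = timeIt({ range.forEachDeleg(&v.visit); });
        auto viaVirtual = timeIt({ range.forEachObj(v); });
        writefln("delegate: %s ms, virtual: %s ms",
            viaDelegate.total!"msecs", viaVirtual.total!"msecs");
    }
}
```

As the rest of the thread shows, results from a micro-benchmark like this can flip between compilers and optimization switches, so treat any single run as a hint rather than a verdict.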
Oct 15 2009
bearophile <bearophileHUGS lycos.com> writes:
Justin Johansson:

 this turns out to be
 a clear demonstration of the performance-enhancing power of D delegates over an
 otherwise ingrained C++ thinking approach.

I have changed your benchmark a little, you may want to look at its timings too (I have taken timings with it with DMD and LDC, and the results differ):

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: CLOCKS_PER_SEC, clock;
} else {
    import std.c.stdio: printf;
    import std.c.time: CLOCKS_PER_SEC, clock;
}

double myclock() {
    return clock() / cast(double)CLOCKS_PER_SEC;
}

abstract class Visitor {
//interface Visitor { // try this too
    abstract int visit(int x);
}

final class MyVisitor: Visitor {
    int visit(int x) { return 0; }
}

struct IntRange {
    int stop;

    int forEachDeleg(int delegate(int x) apply) {
        for (int i; i < stop; i++)
            apply(i);
        return 0;
    }

    int forEachObj(Visitor v) {
        for (int i; i < stop; i++)
            v.visit(i);
        return 0;
    }
}

void testD(IntRange s, Visitor v) {
    auto start = myclock();
    s.forEachDeleg(&v.visit);
    auto stop = myclock();
    printf("Using D style delegate callback: %d ms\n", cast(int)((stop - start) * 1000));
}

void testCpp(IntRange s, Visitor v) {
    auto start = myclock();
    s.forEachObj(v);
    auto stop = myclock();
    printf("Using C++ style virtual callback: %d ms\n", cast(int)((stop - start) * 1000));
}

void main() {
    auto s = IntRange(400_000_000);
    Visitor v = new MyVisitor();

    for (int i; i < 5; i++) {
        testD(s, v);
        testCpp(s, v);
        printf("\n");
    }
}

(I suggest you use the _ inside big number literals in D; they avoid a few bugs.)

(A few days ago I think I found that interfaces aren't implemented efficiently in LDC. Lindquist has answered that he will improve the situation.)

Bye,
bearophile
Oct 15 2009
Justin Johansson <no spam.com> writes:
Thanks muchly (also the _ tip)

Just ran your code with these results (D1/phobos/linux):

#! /opt/dmd1/linux/bin/rdmd -I/opt/dmd1/src/phobos -L-L/opt/dmd1/linux/lib -release -O

Also added -O switch this time though have no idea what level of optimization that does. (btw. In this test code, the -release switch doesn't do anything does it as that's just for conditional compilation?)

A. abstract class Visitor version :-

Using D style delegate callback: 2720 ms
Using C++ style virtual callback: 2249 ms

Using D style delegate callback: 2560 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2170 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2099 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback: 2640 ms
Using C++ style virtual callback: 2250 ms

B. interface Visitor version :-

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2519 ms
Using C++ style virtual callback: 2510 ms

Using D style delegate callback: 2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback: 2510 ms
Using C++ style virtual callback: 2500 ms

The results are not clear cut at all this time. So what's going on?

ciao
justin
Oct 16 2009
bearophile <bearophileHUGS lycos.com> writes:
Justin Johansson:

 Also added -O switch this time though have no idea what level of optimization
that does.
 (btw. In this test code, the -release switch doesn't do anything does it
 as that's just for conditional compilation?)

In DMD:
-O means full optimizations minus the inlining (and keeping asserts, bounds tests, contracts and maybe more).
-release means no asserts (but it keeps assert(0)), no bounds tests and no contracts.
-inline means to perform inlining.
So generally when you care for performance you compile in DMD with: -O -release -inline
(But sometimes inlining makes the performance a little worse, because there's more pressure on the small code half of L1 cache.)
In this program -release doesn't change the timings, probably because there's nothing to remove (bounds tests, etc).

In LDC:
-O is equivalent to -O2, which means an average optimization.
-O3 means more optimization and includes two successive inlining passes (so foreach over an opApply is often fully simplified, but only a few delegates/function pointers are inlined).
-O4 and -O5 currently mean -O3; in future (I hope soon!) -O4 will perform all the optimizations of -O3 plus link-time optimization and _Dmain interning (that's already doable, but only manually).
If you add -inline I think (but I am not sure) it performs a third inlining pass.
There is -release too, which does the same as in DMD, plus flags for finer-grained releasing (for example to disable just asserts but not array bounds) that are not available in DMD.

 The results are not clear cut at all this time.  So what's going on?

I don't know. I have a certain experience of benchmarks now, and I know they are tricky. I usually like to help people understand they don't understand what's going on in their life, because they often have just an illusion of understanding things :-)

You may use something like obj2asm (or a disassembler) to see the asm produced in both cases, to understand a little better. If you don't have ways to do it, I can show you the resulting asm myself.

Bye,
bearophile
Oct 16 2009
Justin Johansson <no spam.com> writes:
bearophile Wrote:

 Justin Johansson:
 
 The results are not clear cut at all this time.  So what's going on?

I don't know. I have a certain experience of benchmarks now, and I know they are tricky. I usually like to help people understand they don't understand what's going on in their life, because they often have just an illusion of understanding things :-)

"because they often have just an illusion of understanding things :-)" So true.
 You may use something like obj2asm (or a disassembler) to see the asm produces
in both cases, to understand a little better. If you don't have ways to do it,
I can show you the resulting asm myself.
 Bye,
 bearophile

No worries; I'm fine with grokking asm. Thanks very much for your time and encouragement. ciao, justin
Oct 16 2009
Jeremie Pelletier <jeremiep gmail.com> writes:
Justin Johansson wrote:
 downs Wrote:
 
 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.

 Justin


Re subject line: fair call, you are right. Emotions aside, at least this time I got a response.
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC").

I don't see why delegates should be allocated on the heap, if so then dynamic arrays would have to too, because they're the same size. It wouldn't be efficient because even if dereferences 'may' be faster than stack pushes, having arrays or delegates in the heap would double the number of dereferences needed, double the chances of memory not being in the cache and double the code to create and access them.
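The size argument above can be checked directly. The following is a minimal sketch, assuming a current D2 compiler; Adder, Callback and Box are hypothetical names, and Box merely stands in for the proposed off-stack delegate. It shows that a delegate and a dynamic array are both two machine words, and that a heap-boxed delegate shrinks the handle to one word only by putting an extra pointer hop in front of every call.

```d
import std.stdio : writefln;

alias Callback = int delegate(int);

struct Adder
{
    int offset;
    int add(int x) { return x + offset; }
}

// Hypothetical heap box standing in for an "off-stack" delegate.
struct Box
{
    Callback dg;
}

void main()
{
    // A delegate and a dynamic array are both two machine words
    // (8 bytes on a 32-bit target, 16 bytes on a 64-bit one).
    static assert(Callback.sizeof == (int[]).sizeof);
    static assert((int[]).sizeof == 2 * size_t.sizeof);

    auto a = Adder(10);
    Callback byValue = &a.add;        // two words, held by value
    Box* boxed = new Box(byValue);    // GC allocation; the handle is one word

    static assert(typeof(boxed).sizeof == (void*).sizeof);

    assert(byValue(1) == 11);   // load the two words, then call
    assert(boxed.dg(1) == 11);  // load the box pointer first, then the same two words

    writefln("delegate: %s bytes, int[]: %s bytes, Box*: %s bytes",
        Callback.sizeof, (int[]).sizeof, boxed.sizeof);
}
```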
 When the tough gets going, the going have to get tough.
 (Meaning to start thinking outside of the square.)

And here I was trying to think outside of the tesseract :o)
Oct 15 2009
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 15 Oct 2009 07:45:02 -0400, Justin Johansson  
<procode adam.com.after-dot-com-add-dot-au> wrote:

 downs Wrote:

 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer dereference in
 contemporary CPU architectures.

 Justin

- with this weird way of writing posts? The subject should tell us about the content, not your emotional state! :p

Re subject line: fair call, you are right. Emotions aside, at least this time I got a response.

You got a response because I'm actually awake and at a computer :) I don't think you should expect much earlier than 7am eastern from the US participants (regarding your 3 am post about a manifesto, followed by an assumed lack of interest at 5 am). But I have to agree with downs. Although I look at "non-descriptive" posts, it has nothing to do with my likelihood of reading *or* responding. Attributing response to changing such a non-essential piece of a post is like thinking you made it rain by dancing.
 Also I have no idea what you mean. Should delegate _values_ be heap  
 allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The  
 stack is relatively likely to be in the CPU cache. A random pointer  
 dereferencing .. isn't. Also, do you really want to heap even more work  
 on the ailing GC?

I will be bold and say yes to off-stack allocation, whether that be in the general heap or other (and probably other to avoid an "ailing GC").

When the majority of delegates survive exactly one function call, I think you might be very much wrong. You only save on allocation vs. stack when you pass it through many function calls. In fact, using such a delegate will probably be more penalized if the memory location is not local (and stack usually is close to the cache), not to mention putting it off stack means an additional pointer dereference.
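A related point, sketched under the assumption of a current D2 compiler (the thread itself is about D1): when a delegate only needs to survive one call, a `scope` delegate parameter tells the compiler the callee will not keep it, so the captured variables can stay in the caller's stack frame. The function name sumWith is hypothetical. The `@nogc` attribute on main is used here only as a compile-time check that no GC allocation happens.

```d
import core.stdc.stdio : printf;

// Hypothetical helper. `scope` promises that `visit` is not stored
// anywhere that outlives this call.
int sumWith(scope int delegate(int) @nogc visit, int n) @nogc
{
    int total = 0;
    foreach (i; 0 .. n)
        total += visit(i);
    return total;
}

void main() @nogc
{
    int factor = 3;  // captured below; lives in main's stack frame

    // The literal captures `factor`. Because the parameter is `scope`,
    // no heap closure is needed, which is why this compiles under @nogc.
    int result = sumWith((int x) => x * factor, 4);

    assert(result == 18);  // 0*3 + 1*3 + 2*3 + 3*3
    printf("%d\n", result);
}
```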
 When the tough gets going, the going have to get tough.
 (Meaning to start thinking outside of the square.)

The going isn't tough yet :) delegates work just fine for me. -Steve
Oct 15 2009
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 15 Oct 2009 07:15:45 -0400, Justin Johansson  
<procode adam.com.after-dot-com-add-dot-au> wrote:

 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer dereference  
 in contemporary
 CPU architectures.

How do you propose to fix it? I think it is the minimal approach. You need 4 bytes for the function pointer, and 4 bytes for the instance data. -Steve
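The two fields described above are visible from D code through the delegate properties .ptr (the instance or context pointer) and .funcptr (the function pointer). A minimal sketch, assuming a current D2 compiler; Counter and Callback are hypothetical names used only for illustration:

```d
import std.stdio : writefln;

alias Callback = int delegate(int);

class Counter
{
    int total;
    int add(int x) { total += x; return total; }
}

void main()
{
    auto c = new Counter;
    Callback dg = &c.add;

    // A delegate value is exactly two pointers wide:
    // 4 + 4 bytes on a 32-bit target, 8 + 8 on a 64-bit one.
    static assert(Callback.sizeof == 2 * (void*).sizeof);

    // .ptr holds the instance data, .funcptr the code address.
    assert(dg.ptr is cast(void*) c);
    assert(dg.funcptr !is null);
    assert(dg(5) == 5);

    // Rebuilding a delegate from the two fields gives an equivalent callable,
    // which shows there is no further hidden state.
    Callback copy;
    copy.ptr = dg.ptr;
    copy.funcptr = dg.funcptr;
    assert(copy(1) == 6);

    writefln("delegate: %s bytes, pointer: %s bytes",
        Callback.sizeof, (void*).sizeof);
}
```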
Oct 15 2009
downs <default_357-line yahoo.de> writes:
Two discoveries were made from this benchmark.

1) There is no appreciable speed difference between delegates and functors. I
re-ran the benchmark several times; sometimes one was faster, sometimes the
other - no clear advantage was discernible. The visible differences can be
blamed on experimental error. Feel free to rerun it on a pure benchmarking
machine.
2) The GC is slooooow (factor of 40!). No surprise there.

The code:

gentoo-pc ~ $ cat test.d; gdc-build test.d -o test_c -O3 -frelease -march=nocona && ./test_c
module test;

import std.stdio;

struct Functor {
  void delegate() dg;
  void opCall() { dg(); }
}

void bench(I, C)(string name, I iters, C callable) {
  auto start = sec();
  // sorry
  for (I l = 0; l < iters; ++l)
    static if (is(typeof(callable.opCall)))
      callable.opCall();
    else
      callable();
  auto taken = sec() - start;
  writefln(name, ": ", taken, "s, ",
    ((taken / iters) * 1000_000), " µs per call"
  );
}

struct _test3 {
  void test() { }
  void opCall() {
    auto dg = new Functor;
    dg.dg = &test;
    dg.opCall();
  }
}

import tools.time;
void main() {
  auto dg1 = (){ }, dg2 = new Functor;
  dg2.dg = dg1;
  // spin up processor
  writefln("Warm-up");
  for (int k = 0; k < 1024*1024*256; ++k) { dg1(); (*dg2)(); }
  writefln("Begin benchmark");
  const ITERS = cast(long) (1024*1024*1024) * 4;
  bench("Method 1", ITERS, dg1);
  bench("Method 2", ITERS, dg2);
  _test3 test3; // Done this way to allow inlining
  bench("Method 3", ITERS / 256, test3);
}
gdc -J. test.d tools/time.d tools/log.d tools/compat.d tools/base.d tools/smart_import.d tools/ctfe.d tools/tests.d tools/functional.d -o test_c -O3 -frelease -march=nocona
Warm-up
Begin benchmark
Method 1: 20.5247s, 0.00477877 µs per call
Method 2: 19.6544s, 0.00457615 µs per call
Method 3: 2.86392s, 0.170703 µs per call
Oct 15 2009
downs <default_357-line yahoo.de> writes:
On consideration, this wasn't a test of the two methods at all, but a test of
the compiler's ability to inline. Disregard it.
Oct 15 2009
Don <nospam nospam.com> writes:
Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

Not so. On 286 and earlier, stack pushes were more expensive. They're the same on 386 and later (including Core2, K7, K8, K10), but you have a chance of a cache miss with a pointer deref. In my C++ experience I got a 25% speedup of my entire app by replacing heap pointers with stack delegates!
Oct 15 2009