digitalmars.D - I feel outraged -

Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:

- that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary CPU architectures.

Justin

Oct 15 2009

downs <default_357-line yahoo.de> writes:

Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

- with this weird way of writing posts? The subject should tell us about the
content, not your emotional state! :p

Also I have no idea what you mean. Should delegate _values_ be heap allocated?!
That'd be insanity. Also, I'm fairly sure you're wrong. The stack is relatively
likely to be in the CPU cache. A random pointer dereferencing .. isn't. Also,
do you really want to heap even more work on the ailing GC?

Oct 15 2009

Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:

downs Wrote:

 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

 
 - with this weird way of writing posts? The subject should tell us about the
content, not your emotional state! :p

Re subject line: fair call, you are right.  Emotions aside, at least this time
I got a response.

 
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

I will be bold and say yes to off-stack allocation,
whether that be in the general heap or other (and probably other to avoid an
"ailing GC").

When the tough gets going, the going have to get tough.
(Meaning to start thinking outside of the square.)

Oct 15 2009

bearophile <bearophileHUGS lycos.com> writes:

Justin Johansson:

 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

 
 I will be bold and say yes to off-stack allocation,
 whether that be in the general heap or other (and probably other to avoid an
"ailing GC").

This is a nice situation: with just 10-20 minutes of experimental tests (that
later you have to show us) you can show us you are right, or wrong.

Bye,
bearophile

Oct 15 2009

Justin Johansson <procode adam.com.after-dot-com-add-dot-au> writes:

bearophile Wrote:

 Justin Johansson:
 
 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

 
 I will be bold and say yes to off-stack allocation,
 whether that be in the general heap or other (and probably other to avoid an
"ailing GC").

 
 This is a nice situation: with just 10-20 minutes of experimental tests (that
later you have to show us) you can show us you are right, or wrong.
 
 Bye,
 bearophile

Steve, bearophile, et al.,

Yes, timezone (being Australia) is a severe disadvantage for "reality-time"
discussion.
Best I retire with this as conjecture for the moment.

Buona Notte (23.30)
Justin

Oct 15 2009

Justin Johansson <no spam.com> writes:

bearophile Wrote:

 Justin Johansson:
 
 downs:
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

 
 I will be bold and say yes to off-stack allocation,
 whether that be in the general heap or other (and probably other to avoid an
"ailing GC").

 
 This is a nice situation: with just 10-20 minutes of experimental tests (that
later you have to show us) you can show us you are right, or wrong.
 
 Bye,
 bearophile

Fresh light of morning ...

I'm really glad to have brought this up, as otherwise I would not have bothered
to revisit a performance issue that I had in porting some C++ to D (and this
was not looking good for D at first sight).

As it turns out though, my initial fear about the .sizeof of a delegate was
unfounded, as the performance bottleneck was in a loop inside a method taking
what was effectively a callback parameter.

The C++ design was basically implementing a classical visitor pattern over a
collection.

In porting to D I had a choice of either doing a one-for-one translation of the
C++ classes or redesigning using D delegates.

This morning I boiled the problem down to the simplest possible case in the two
forms and benchmarked these.  From the code I have reproduced below, clearly the
issue has nothing to do with the cost of instantiating a delegate or visitor
object.  That turns out to be a one-time cost and irrelevant given where the
iteration actually occurs.

So while this isn't the proof or disproof that bearophile asked for, this turns
out to be a clear demonstration of the performance-enhancing power of D
delegates over an otherwise ingrained C++ thinking approach.

I'm impressed.  (Hope that statement isn't too emotional  :-)



-release

import std.perf, std.stdio;


class SetOfIntegers
{
   private int from, to;
   
   this( int from, int to) {
      this.from = from;
      this.to = to;
   }

   int forEach( int delegate( int x) apply) {
      for (auto i = from; i <= to; ++i) {
         apply( i);
      }

      return 0;
   }

   int forEach( Visitor v) {
      for (auto i = from; i <= to; ++i) {
         v.visit( i);
      }

      return 0;
   }
}



class Visitor
{
   abstract int visit( int x);
}



class MyVisitor: Visitor
{
   int visit( int x) { return 0; }
}



void test1( SetOfIntegers s, Visitor v)
{
    auto pc = new PerformanceCounter();
    pc.start();

    scope(exit) {
       pc.stop();
       writefln( "Using D style delegate callback:  %d msec",
                 pc.milliseconds());
    }

    s.forEach( &v.visit);
}


void test2( SetOfIntegers s, Visitor v)
{
    auto pc = new PerformanceCounter();
    pc.start();

    scope(exit) {
       pc.stop();
       writefln( "Using C++ style virtual callback: %d msec",
                 pc.milliseconds());
    }

    s.forEach( v);
}


void main() {
   writefln( "Delegates vs virtual function callbacks");
   writefln();

   SetOfIntegers s = new SetOfIntegers( 1, 10000000);
   Visitor v = new MyVisitor();

   for (auto i = 0; i < 10; ++i) {
      test1( s, v);
      test2( s, v);
      writefln();
   }

   writefln();
}


$ ./perf.d
Delegates vs virtual function callbacks

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback:  120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback:  120 msec
Using C++ style virtual callback: 145 msec

Using D style delegate callback:  120 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 147 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 146 msec

Using D style delegate callback:  121 msec
Using C++ style virtual callback: 147 msec


Sweet.


cheers
Justin Johansson

Oct 15 2009

bearophile <bearophileHUGS lycos.com> writes:

Justin Johansson:

 this turns out to be
 a clear demonstration of the performance-enhancing power of D delegates over an
 otherwise ingrained C++ thinking approach.

I have changed your benchmark a little, you may want to look at its timings too
(I have taken timings with it with DMD and LDC, and the results differ):

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: CLOCKS_PER_SEC, clock;
} else {
    import std.c.stdio: printf;
    import std.c.time: CLOCKS_PER_SEC, clock;
}

double myclock() {
    return clock() / cast(double)CLOCKS_PER_SEC;
}

abstract class Visitor {
//interface Visitor { // try this too
    abstract int visit(int x); 
}

final class MyVisitor: Visitor {
    int visit(int x) { return 0; }
}

struct IntRange {
    int stop;

    int forEachDeleg(int delegate(int x) apply) {
        for (int i; i < stop; i++)
            apply(i);
        return 0;
    }

    int forEachObj(Visitor v) {
        for (int i; i < stop; i++)
            v.visit(i);
        return 0;
    }
}

void testD(IntRange s, Visitor v) {
     auto start = myclock();
     s.forEachDeleg(&v.visit);
     auto stop = myclock();
     printf("Using D style delegate callback:  %d ms\n",
            cast(int)((stop - start) * 1000));
}

void testCpp(IntRange s, Visitor v) {
     auto start = myclock();
     s.forEachObj(v);
     auto stop = myclock();
     printf("Using C++ style virtual callback: %d ms\n",
            cast(int)((stop - start) * 1000));
}

void main() {
    auto s = IntRange(400_000_000);
    Visitor v = new MyVisitor();

    for (int i; i < 5; i++) {
        testD(s, v);
        testCpp(s, v);
        printf("\n");
    }
}

(I suggest you use _ inside big number literals in D; it avoids a few
bugs).
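
A tiny sketch of that tip, assuming nothing beyond the core language (the variable names are made up for illustration). Underscores inside an integer literal only group the digits and do not change the value:

```d
void main() {
    // Both spellings denote the same int; the underscores only group digits.
    static assert(10_000_000 == 10000000);

    // Easy to misread: forty million or four hundred million?
    int hard = 400000000;
    // The grouped form makes the magnitude obvious at a glance.
    int easy = 400_000_000;

    assert(hard == easy);
}
```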

(A few days ago I think I found that interfaces aren't implemented
efficiently in LDC. Lindquist has answered that he will improve the situation.)

Bye,
bearophile

Oct 15 2009

Justin Johansson <no spam.com> writes:

bearophile Wrote:

 Justin Johansson:
 
 this turns out to be
 a clear demonstration of the performance-enhancing power of D delegates over an
 otherwise ingrained C++ thinking approach.

 
 I have changed your benchmark a little, you may want to look at its timings
too (I have taken timings with it with DMD and LDC, and the results differ):
 
 version (Tango) {
     import tango.stdc.stdio: printf;
     import tango.stdc.time: CLOCKS_PER_SEC, clock;
 } else {
     import std.c.stdio: printf;
     import std.c.time: CLOCKS_PER_SEC, clock;
 }
 
 double myclock() {
     return clock() / cast(double)CLOCKS_PER_SEC;
 }
 
 abstract class Visitor {
 //interface Visitor { // try this too
     abstract int visit(int x); 
 }
 
 final class MyVisitor: Visitor {
     int visit(int x) { return 0; }
 }
 
 struct IntRange {
     int stop;
 
     int forEachDeleg(int delegate(int x) apply) {
         for (int i; i < stop; i++)
             apply(i);
         return 0;
     }
 
     int forEachObj(Visitor v) {
         for (int i; i < stop; i++)
             v.visit(i);
         return 0;
     }
 }
 
 void testD(IntRange s, Visitor v) {
      auto start = myclock();
      s.forEachDeleg(&v.visit);
      auto stop = myclock();
      printf("Using D style delegate callback:  %d ms\n", cast(int)((stop -
start) * 1000));
 }
 
 void testCpp(IntRange s, Visitor v) {
      auto start = myclock();
      s.forEachObj(v);
      auto stop = myclock();
      printf("Using C++ style virtual callback: %d ms\n", cast(int)((stop -
start) * 1000));
 }
 
 void main() {
     auto s = IntRange(400_000_000);
     Visitor v = new MyVisitor();
 
     for (int i; i < 5; i++) {
         testD(s, v);
         testCpp(s, v);
         printf("\n");
     }
 }
 
 (I suggest you to use the _ inside big number literals in D, they avoid few
bugs).
 
 (Few days ago I think to have found that interfaces aren't implemented
efficiently in LDC. Lindquist has answered he will improve the situation.)
 
 Bye,
 bearophile

Thanks muchly (also the _ tip)

Just ran your code with these results (D1/phobos/linux):


-release -O

Also added the -O switch this time, though I have no idea what level of
optimization that gives.
(Btw, in this test code the -release switch doesn't do anything, does it,
as that's just for conditional compilation?)


A. abstract class Visitor version :-

Using D style delegate callback:  2720 ms
Using C++ style virtual callback: 2249 ms

Using D style delegate callback:  2560 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback:  2170 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback:  2099 ms
Using C++ style virtual callback: 2259 ms

Using D style delegate callback:  2640 ms
Using C++ style virtual callback: 2250 ms


B. interface Visitor version :-

Using D style delegate callback:  2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback:  2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback:  2519 ms
Using C++ style virtual callback: 2510 ms

Using D style delegate callback:  2509 ms
Using C++ style virtual callback: 2500 ms

Using D style delegate callback:  2510 ms
Using C++ style virtual callback: 2500 ms

The results are not clear cut at all this time.  So what's going on?

ciao
justin

Oct 16 2009

bearophile <bearophileHUGS lycos.com> writes:

Justin Johansson:

 Also added -O switch this time though have no idea what level of optimization
that does.
 (btw. In this test code, the -release switch doesn't do anything does it
 as that's just for conditional compilation?)

In DMD:
-O means full optimization minus inlining (and it keeps asserts, bounds
tests, contracts and maybe more).
-release means no asserts (but it keeps assert(0)), no bounds tests and no
contracts.
-inline means to perform inlining.
So generally when you care about performance you compile in DMD with: -O -release
-inline
(But sometimes inlining makes the performance a little worse, because there's
more pressure on the small code half of the L1 cache).
In this program -release probably doesn't change the timings because there's
nothing to remove (bounds tests, etc).

In LDC:
-O equals -O2, which means an average level of optimization.
-O3 means more optimization and includes two successive inlining passes (so
a foreach over an opApply is often fully simplified, but only a few
delegates/function pointers are inlined).
-O4 and -O5 currently mean -O3; in future (I hope soon!) -O4 will perform all
the optimizations of -O3 plus link-time optimization and _Dmain interning
(that's already doable, but only manually).
If you add -inline I think (but I am not sure) it performs a third inlining
pass.
There is -release too, which works as in DMD, plus flags for finer-grained
control (for example to disable just asserts but not array bounds checks) that
are not available in DMD.


>The results are not clear cut at all this time.  So what's going on?<

I don't know. I have a certain experience of benchmarks now, and I know they
are tricky.

I usually like to help people understand they don't understand what's going on
in their life, because they often have just an illusion of understanding things
:-)

You may use something like obj2asm (or a disassembler) to see the asm produced
in both cases, to understand a little better. If you don't have a way to do it,
I can show you the resulting asm myself.

Bye,
bearophile

Oct 16 2009

Justin Johansson <no spam.com> writes:

bearophile Wrote:

 Justin Johansson:
 
The results are not clear cut at all this time.  So what's going on?<

 
 I don't know. I have a certain experience of benchmarks now, and I know they
are tricky.
 
 I usually like to help people understand they don't understand what's going on
in their life, because they often have just an illusion of understanding things
:-)

"because they often have just an illusion of understanding things :-)"

So true.

 You may use something like obj2asm (or a disassembler) to see the asm produces
in both cases, to understand a little better. If you don't have ways to do it,
I can show you the resulting asm myself.
 Bye,
 bearophile

No worries; I'm fine with grokking asm.
Thanks very much for your time and encouragement.

ciao,
justin

Oct 16 2009

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 15 Oct 2009 07:45:02 -0400, Justin Johansson  
<procode adam.com.after-dot-com-add-dot-au> wrote:

 downs Wrote:

 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer  

 dereference in contemporary
 CPU architectures.

 Justin

 - with this weird way of writing posts? The subject should tell us  
 about the content, not your emotional state! :p

 Re subject line: fair call, you are right.  Emotions aside, at least  
 this time I got a response.

You got a response because I'm actually awake and at a computer :)  I  
don't think you should expect much earlier than 7am eastern from the US  
participants (regarding your 3 am post about a manifesto, followed by an  
assumed lack of interest at 5 am).

But I have to agree with downs.  Although I look at "non-descriptive"  
posts, it has nothing to do with my likelihood of reading *or*  
responding.  Attributing response to changing such a non-essential piece  
of a post is like thinking you made it rain by dancing.

 Also I have no idea what you mean. Should delegate _values_ be heap  
 allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The  
 stack is relatively likely to be in the CPU cache. A random pointer  
 dereferencing .. isn't. Also, do you really want to heap even more work  
 on the ailing GC?

 I will be bold and say yes to off-stack allocation,
 whether that be in the general heap or other (and probably other to  
 avoid an "ailing GC").

When the majority of delegates survive exactly one function call, I think  
you might be very much wrong.  You only save on allocation vs. stack when  
you pass it through many function calls.  In fact, using such a delegate  
will probably be more penalized if the memory location is not local (and  
stack usually is close to the cache), not to mention putting it off stack  
means an additional pointer dereference.

 When the tough gets going, the going have to get tough.
 (Meaning to start thinking outside of the square.)

The going isn't tough yet :)  delegates work just fine for me.

-Steve

Oct 15 2009

Jeremie Pelletier <jeremiep gmail.com> writes:

Justin Johansson wrote:
 downs Wrote:
 
 Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.

 Justin

 - with this weird way of writing posts? The subject should tell us about the
content, not your emotional state! :p

 
 Re subject line: fair call, you are right.  Emotions aside, at least this time
I got a response.
 
 Also I have no idea what you mean. Should delegate _values_ be heap
allocated?! That'd be insanity. Also, I'm fairly sure you're wrong. The stack
is relatively likely to be in the CPU cache. A random pointer dereferencing ..
isn't. Also, do you really want to heap even more work on the ailing GC?

 
 I will be bold and say yes to off-stack allocation,
 whether that be in the general heap or other (and probably other to avoid an
"ailing GC").

I don't see why delegates should be allocated on the heap, if so then 
dynamic arrays would have to too, because they're the same size.

It wouldn't be efficient because even if dereferences 'may' be faster 
than stack pushes, having arrays or delegates in the heap would double 
the number of dereferences needed, double the chances of memory not 
being in the cache and double the code to create and access them.
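
A rough sketch of what I mean, assuming a current D2 compiler. The Boxed struct and the Greeter class are made-up names, there only to stand in for "a delegate stored off the stack":

```d
import std.stdio;

// Hypothetical wrapper standing in for "a delegate stored off the stack".
struct Boxed {
    void delegate() dg;
}

// Hypothetical class, used only to have a method to bind.
class Greeter {
    int calls;
    void hello() { ++calls; }
}

void main() {
    // Delegates and dynamic arrays are both two-word values.
    static assert((void delegate()).sizeof == (int[]).sizeof);

    auto g = new Greeter;
    void delegate() direct = &g.hello;   // the two-word pair is a plain local
    Boxed* boxed = new Boxed(direct);    // the same pair copied to the GC heap

    direct();      // read code and context pointers from the local, then call
    boxed.dg();    // follow the box pointer first, then do the same two reads

    writeln(g.calls);   // prints 2
}
```

The boxed call goes through one more pointer before it reaches the same two words, which is the extra dereference (and the extra chance of a cache miss) I am talking about.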

 When the tough gets going, the going have to get tough.
 (Meaning to start thinking outside of the square.)

And here I was trying to think outside of the tesseract :o)

Oct 15 2009

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 15 Oct 2009 07:15:45 -0400, Justin Johansson  
<procode adam.com.after-dot-com-add-dot-au> wrote:

 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).

 AFAIK, stack pushes are still more expensive than a pointer dereference  
 in contemporary
 CPU architectures.

How do you propose to fix it?  I think it is the minimal approach.  You  
need 4 bytes for the function pointer, and 4 bytes for the instance data.
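
To make that concrete, here is a minimal sketch, assuming a current D2 compiler. The Counter class is made up for illustration; .ptr and .funcptr are the delegate's built-in properties for the context and the code pointer:

```d
import std.stdio;

// Hypothetical class, used only to have a method to bind.
class Counter {
    int total;
    int add(int x) { total += x; return total; }
}

void main() {
    auto c = new Counter;
    int delegate(int) dg = &c.add;   // binds the instance and the method

    // Two machine words: 8 bytes on a 32-bit target, 16 on a 64-bit one.
    static assert(typeof(dg).sizeof == 2 * (void*).sizeof);

    assert(dg.ptr is cast(void*) c);   // context pointer: the instance data
    assert(dg.funcptr !is null);       // code pointer: the bound method

    dg(5);
    writeln("delegate size: ", dg.sizeof, " bytes, total = ", c.total);
}
```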

-Steve

Oct 15 2009

downs <default_357-line yahoo.de> writes:

Two discoveries were made from this benchmark.

1) There is no appreciable speed difference between delegates and functors. I
re-ran the benchmark several times; sometimes one was faster, sometimes the
other - no clear advantage was discernible. The visible differences can be
blamed on experimental error. Feel free to rerun it on a pure benchmarking
machine.
2) The GC is slooooow (factor of 40!). No surprise there.

The code:

gentoo-pc ~ $ cat test.d; gdc-build test.d -o test_c -O3 -frelease -march=nocona && ./test_c
module test;

import std.stdio;

struct Functor {
  void delegate() dg;
  void opCall() { dg(); }
}

void bench(I, C)(string name, I iters, C callable) {
  auto start = sec();
  // sorry
  for (I l = 0; l < iters; ++l)
    static if (is(typeof(callable.opCall)))
      callable.opCall();
    else
      callable();
  auto taken = sec() - start;
  writefln(name, ": ", taken, "s, ",
    ((taken / iters) * 1000_000), " µs per call"
  );
}

struct _test3 {
  void test() { }
  void opCall() {
    auto dg = new Functor;
    dg.dg = &test;
    dg.opCall();
  }
}

import tools.time;
void main() {
  auto dg1 = (){ }, dg2 = new Functor;
  dg2.dg = dg1;
  // spin up processor
  writefln("Warm-up");
  for (int k = 0; k < 1024*1024*256; ++k) { dg1(); (*dg2)(); }
  writefln("Begin benchmark");
  const ITERS = cast(long) (1024*1024*1024) * 4;
  bench("Method 1", ITERS, dg1);
  bench("Method 2", ITERS, dg2);
  _test3 test3; // Done this way to allow inlining
  bench("Method 3", ITERS / 256, test3);
}
gdc -J. test.d tools/time.d tools/log.d tools/compat.d tools/base.d
tools/smart_import.d tools/ctfe.d tools/tests.d tools/functional.d -o test_c
-O3 -frelease -march=nocona
Warm-up
Begin benchmark
Method 1: 20.5247s, 0.00477877 µs per call
Method 2: 19.6544s, 0.00457615 µs per call
Method 3: 2.86392s, 0.170703 µs per call

Oct 15 2009

downs <default_357-line yahoo.de> writes:

On consideration, this wasn't a test of the two methods at all, but a test of
the compiler's ability to inline. Disregard it.

Oct 15 2009

Don <nospam nospam.com> writes:

Justin Johansson wrote:
 - that the .sizeof a delegate is 8 bytes (on a 32-bit machine).
 
 AFAIK, stack pushes are still more expensive than a pointer dereference in
contemporary
 CPU architectures.
 
 Justin

Not so.  On 286 and earlier, stack pushes were more expensive. They're 
the same on 386 and later (including Core2, K7, K8, K10), but you have a 
chance of a cache miss with a pointer deref. In my C++ experience I got 
a 25% speedup of my entire app by replacing heap pointers with stack 
delegates!

Oct 15 2009
