
digitalmars.D.learn - Skynet 1M Fiber microbenchmark in D

reply Per Nordlöw <per.nordlow gmail.com> writes:
I'm curious about Fiber/coroutine performance in D compared to 
other languages such as Rust.

How should the following test be implemented in D?

Creates an actor (goroutine, whatever), which spawns 10 new 
actors, each of them spawns 10 more actors, etc. until one 
million actors are created on the final level. Then, each of them 
returns back its ordinal number (from 0 to 999999), which are 
summed on the previous level and sent back upstream, until 
reaching the root actor. (The answer should be 499999500000).

See also: https://github.com/atemerev/skynet
Oct 18
parent reply Per Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
Oct 18
next sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
```d
class MyFiber : Fiber
{
    this(int arguments, string go, float here)
    {
        super(&run);
        // Save arguments somewhere
    }

    void run()
    {
        // Use arguments here.
    }
}
```

--
  Biotronic
Oct 18
parent reply Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 11:04:10 UTC, Biotronic wrote:
 On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw 
 wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), 
 which are summed on the previous level and sent back 
 upstream, until reaching the root actor. (The answer should 
 be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
```d
class MyFiber : Fiber
{
    this(int arguments, string go, float here)
    {
        super(&run);
        // Save arguments somewhere
    }

    void run()
    {
        // Use arguments here.
    }
}
```

--
  Biotronic
Thanks, I figured that out myself after a while ;)

Another thing: how should the synchronization between the fibers figure out when the total number of fibers has reached one million? Via an atomic counter passed by reference to the constructor, or are there better ways? Because I do need an atomic reference counter here, right?

And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?
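A shared counter along those lines could be sketched with core.atomic; note that `CountedFiber` and `liveFibers` are hypothetical names, not anything from druntime:

```d
import core.atomic : atomicOp, atomicLoad;
import core.thread : Fiber;

shared size_t liveFibers; // hypothetical global counter of created fibers

class CountedFiber : Fiber
{
    this()
    {
        super(&run);
        // Lock-free increment; safe even if fibers are created
        // from multiple worker threads.
        atomicOp!"+="(liveFibers, 1);
    }

    void run() { /* actual work goes here */ }
}

void main()
{
    foreach (i; 0 .. 10)
        new CountedFiber;
    assert(atomicLoad(liveFibers) == 10);
}
```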
Oct 18
next sibling parent drug <drug2004 bk.ru> writes:
On 18.10.2017 14:34, Nordlöw wrote:
 And how do I parallelize this over multiple worker threads? AFAICT 
 fibers are by default all spawned in the same main thread, right?
Probably this will work: every fiber subtracts 1 from its argument, then divides the remainder by the number of child fibers and spawns fibers with that value. So the last fibers get 1 as their argument, subtract 1 from it, get zero and terminate.
Oct 18
prev sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
 Another thing...how should the synchronization between the 
 fibers figure out when the total number of fibers have reached 
 one million?...via an atomic counter fed by reference to the 
 constructor...or are there better ways? Because I do need an 
 atomic reference counter here, right?
This is how I did it:

```d
import core.thread : Fiber;

class MyFiber : Fiber
{
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index)
    {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run()
    {
        if (_depth == 6)
        { // 10^6 == 1 million, so stop here.
            _value = _index;
            return;
        }
        _value = 0;
        foreach (i; 0..10) // Line 23
        {
            auto e = new MyFiber(_depth+1, _index * 10 + i);
            e.call();
            _value += e._value;
        }
    }
}

unittest
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : StopWatch, AutoStart;

    auto sw = StopWatch(AutoStart.yes);
    auto a = new MyFiber(0, 0);
    a.call();
    sw.stop();
    assert(a._value == 499999500000);
    writeln(a._value, " after ", sw.peek);
}
```
 And how do I parallelize this over multiple worker threads? 
 AFAICT fibers are by default all spawned in the same main 
 thread, right?
True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and run on whichever thread happens to invoke their call() method.

I experimented a little with parallelism, and the easiest change is to replace line 23 with this:

```d
foreach (i; taskPool.parallel(10.iota, 1)) {
```

It seems to make very little difference in run time, though. I tried a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.

--
  Biotronic
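For reference, a self-contained sketch of what taskPool.parallel does - not the benchmark itself, just the parallel foreach plus the atomic accumulation that a shared result then requires:

```d
import std.parallelism : taskPool;
import std.range : iota;
import core.atomic : atomicOp, atomicLoad;

void main()
{
    shared ulong total;
    // Iterations are distributed over the worker threads in chunks of
    // 10_000, so the shared accumulator must be updated atomically.
    foreach (i; taskPool.parallel(iota(0UL, 1_000_000UL), 10_000))
        atomicOp!"+="(total, i);
    assert(atomicLoad(total) == 499_999_500_000); // same sum as the benchmark
}
```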
Oct 18
next sibling parent reply Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
 It seems to make very little difference in terms of run time, 
 though. I tried using a mix of these approaches - parallel at 
 low depth, basically just to fill up the cores, and serial 
 closer to the leaves. The difference is still negligible, so I 
 assume the losses are elsewhere.

 --
   Biotronic
Thanks! Further, are we forced to use the GC for Fiber allocation, or can a subclass of Fiber implement its own allocation strategy?
Oct 18
next sibling parent Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 12:32:31 UTC, Nordlöw wrote:
 Further, are we forced to use the GC for Fiber allocation or 
 can a sub-class of Fibers implement its own allocation strategy?
Afraid it's set in stone. It doesn't actually use the GC for allocating the stack memory, though, instead opting for VirtualAlloc on Windows and mmap, valloc or malloc on other platforms. Of course it's possible to change the implementation in core.thread, but that's probably more work than you want. Copying the entire Fiber class and changing it also runs into problems, since it depends on private parts of core.thread.

--
  Biotronic
Oct 18
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Wednesday, 18 October 2017 at 12:32:31 UTC, Nordlöw wrote:
 Further, are we forced to use the GC for Fiber allocation or 
 can a sub-class of Fibers implement its own allocation strategy?
You could use std.typecons.scoped!Fiber, though it'll easily overflow your stack. Unfortunately, scoped doesn't seem to work with refCounted at the moment.
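A minimal sketch of the scoped approach, with the stack-overflow caveat above in mind:

```d
import core.thread : Fiber;
import std.typecons : scoped;

void main()
{
    int result;
    void work() { result = 42; }

    // scoped places the Fiber *object* on this function's stack instead
    // of the GC heap; the fiber's own execution stack is still allocated
    // by core.thread (mmap/VirtualAlloc/etc.).
    auto f = scoped!Fiber(&work);
    f.call();
    assert(result == 42);
}
```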
Oct 20
prev sibling parent reply ikod <geller.garry gmail.com> writes:
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
 On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
 Another thing...how should the synchronization between the 
 fibers figure out when the total number of fibers have reached 
 one million?...via an atomic counter fed by reference to the 
 constructor...or are there better ways? Because I do need an 
 atomic reference counter here, right?
This is how I did it:

```d
import core.thread : Fiber;

class MyFiber : Fiber
{
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index)
    {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run()
    {
        if (_depth == 6)
        { // 10^6 == 1 million, so stop here.
            _value = _index;
            return;
        }
        _value = 0;
        foreach (i; 0..10) // Line 23
        {
            auto e = new MyFiber(_depth+1, _index * 10 + i);
            e.call();
            _value += e._value;
        }
    }
}

unittest
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : StopWatch, AutoStart;

    auto sw = StopWatch(AutoStart.yes);
    auto a = new MyFiber(0, 0);
    a.call();
    sw.stop();
    assert(a._value == 499999500000);
    writeln(a._value, " after ", sw.peek);
}
```
 And how do I parallelize this over multiple worker threads? 
 AFAICT fibers are by default all spawned in the same main 
 thread, right?
True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and run on whichever thread happens to invoke their call() method.

I experimented a little with parallelism, and the easiest change is to replace line 23 with this:

```d
foreach (i; taskPool.parallel(10.iota, 1)) {
```

It seems to make very little difference in run time, though. I tried a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.

--
  Biotronic
I ran this under linux perf, and here is the top of 'perf report':

```
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  .............................................
     7.34%  t        [kernel.kallsyms]   [k] clear_page
     6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
     5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
     3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
     3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
     3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
     2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
     2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
     2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()
```

So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).
Oct 18
parent drug <drug2004 bk.ru> writes:
On 18.10.2017 16:37, ikod wrote:
 
 I ran this under linux perf, and here is the top of 'perf report':

 # Overhead  Command  Shared Object       Symbol
 # ........  .......  ..................  .............................................
      7.34%  t        [kernel.kallsyms]   [k] clear_page
      6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
      5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
      3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
      3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
      3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
      2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
      2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
      2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()

 So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).
It depends on how long the test took. If it was short, then of course creating the fibers accounts for a large share. Ideally we should find a test duration at which the memory-management percentage stops changing - that would give more reliable numbers.
Oct 18
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
Someone should really implement something along these lines.
```d
size_t cumsum(Fiber!int fib, size_t val)
{
    size_t sum = val;
    foreach (i; 0 .. 10)
    {
        val = fib.yield(sum); // yield sum, get next val
        sum += val;
    }
    return sum;
}

auto fib = fiber!func;
while (fib.state != Fiber.State.hold)
    writeln(fib.call(random()));
```
There is https://dlang.org/library/std/concurrency/generator.this.html, but it's not too well designed, relying on a runtime type cast in Fiber.yield which will fail if the yielded value isn't exactly of the type specified as the template parameter. https://github.com/dlang/phobos/pull/1910#r16965194
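For comparison, the existing std.concurrency.Generator mentioned above can already be used like this; a minimal sketch (the squares example is illustrative, not from the thread):

```d
import std.concurrency : Generator, yield;

void main()
{
    // Generator wraps a fiber whose yielded values all share one fixed
    // type; yielding a differently-typed value fails at runtime via the
    // cast mentioned above.
    auto squares = new Generator!int({
        foreach (i; 0 .. 5)
            yield(i * i);
    });

    int[] got;
    foreach (v; squares) // Generator is an input range
        got ~= v;
    assert(got == [0, 1, 4, 9, 16]);
}
```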
Oct 20