
digitalmars.D.learn - Skynet 1M Fiber microbenchmark in D

reply Per Nordlöw <per.nordlow gmail.com> writes:
I'm curious about Fiber/coroutine performance in D compared to 
other languages such as Rust.

How should the following test be implemented in D?

Creates an actor (goroutine, whatever), which spawns 10 new 
actors, each of them spawns 10 more actors, etc. until one 
million actors are created on the final level. Then, each of them 
returns back its ordinal number (from 0 to 999999), which are 
summed on the previous level and sent back upstream, until 
reaching the root actor. (The answer should be 499999500000).

See also: https://github.com/atemerev/skynet
Oct 18
parent reply Per Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
Oct 18
next sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
```d
class MyFiber : Fiber
{
    this(int arguments, string go, float here)
    {
        super(&run);
        // Save arguments somewhere
    }

    void run()
    {
        // Use arguments here.
    }
}
```

--
  Biotronic
Oct 18
parent reply Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 11:04:10 UTC, Biotronic wrote:
 On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw 
 wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), 
 which are summed on the previous level and sent back 
 upstream, until reaching the root actor. (The answer should 
 be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
```d
class MyFiber : Fiber
{
    this(int arguments, string go, float here)
    {
        super(&run);
        // Save arguments somewhere
    }

    void run()
    {
        // Use arguments here.
    }
}
```

--
  Biotronic
Thanks, I figured that out myself after a while ;)

Another thing: how should the synchronization between the fibers figure out when the total number of fibers has reached one million? Via an atomic counter passed by reference to the constructor, or are there better ways? Because I do need an atomic reference counter here, right?

And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?
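A shared counter along those lines could be sketched with core.atomic; note that `CountedFiber` and `liveFibers` are hypothetical names, not anything from druntime:

```d
import core.atomic : atomicOp, atomicLoad;
import core.thread : Fiber;

shared size_t liveFibers; // hypothetical global counter of created fibers

class CountedFiber : Fiber
{
    this()
    {
        super(&run);
        // Lock-free increment; safe even if fibers are created
        // from multiple worker threads.
        atomicOp!"+="(liveFibers, 1);
    }

    void run() { /* actual work goes here */ }
}

void main()
{
    foreach (i; 0 .. 10)
        new CountedFiber;
    assert(atomicLoad(liveFibers) == 10);
}
```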
Oct 18
next sibling parent drug <drug2004 bk.ru> writes:
On 18.10.2017 14:34, Nordlöw wrote:
 And how do I parallelize this over multiple worker threads? AFAICT 
 fibers are by default all spawned in the same main thread, right?
Probably this will work: every fiber subtracts 1 from its argument, then divides the remainder by the number of child fibers and spawns fibers with that value. So the last fibers get 1 as their argument, subtract 1 from it, get zero and terminate.
Oct 18
prev sibling parent reply Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
 Another thing...how should the synchronization between the 
 fibers figure out when the total number of fibers have reached 
 one million?...via an atomic counter fed by reference to the 
 constructor...or are there better ways? Because I do need an 
 atomic reference counter here, right?
This is how I did it:

```d
import core.thread : Fiber;

class MyFiber : Fiber
{
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index)
    {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run()
    {
        if (_depth == 6)
        { // 10^6 == 1 million, so stop here.
            _value = _index;
            return;
        }
        _value = 0;
        foreach (i; 0..10) // Line 23
        {
            auto e = new MyFiber(_depth+1, _index * 10 + i);
            e.call();
            _value += e._value;
        }
    }
}

unittest
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : StopWatch, AutoStart;

    auto sw = StopWatch(AutoStart.yes);
    auto a = new MyFiber(0, 0);
    a.call();
    sw.stop();
    assert(a._value == 499999500000);
    writeln(a._value, " after ", sw.peek);
}
```
 And how do I parallelize this over multiple worker threads? 
 AFAICT fibers are by default all spawned in the same main 
 thread, right?
True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and run on whichever thread happens to invoke their call() method.

I experimented a little with parallelism, and the easiest change is to replace line 23 with this:

```d
foreach (i; taskPool.parallel(10.iota, 1)) {
```

It seems to make very little difference in run time, though. I tried a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.

--
  Biotronic
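For reference, a self-contained sketch of what taskPool.parallel does - not the benchmark itself, just the parallel foreach plus the atomic accumulation that a shared result then requires:

```d
import std.parallelism : taskPool;
import std.range : iota;
import core.atomic : atomicOp, atomicLoad;

void main()
{
    shared ulong total;
    // Iterations are distributed over the worker threads in chunks of
    // 10_000, so the shared accumulator must be updated atomically.
    foreach (i; taskPool.parallel(iota(0UL, 1_000_000UL), 10_000))
        atomicOp!"+="(total, i);
    assert(atomicLoad(total) == 499_999_500_000); // same sum as the benchmark
}
```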
Oct 18
next sibling parent reply Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
 It seems to make very little difference in terms of run time, 
 though. I tried using a mix of these approaches - parallel at 
 low depth, basically just to fill up the cores, and serial 
 closer to the leaves. The difference is still negligible, so I 
 assume the losses are elsewhere.

 --
   Biotronic
Thanks! Further, are we forced to use the GC for Fiber allocation, or can a subclass of Fiber implement its own allocation strategy?
Oct 18
next sibling parent Biotronic <simen.kjaras gmail.com> writes:
On Wednesday, 18 October 2017 at 12:32:31 UTC, Nordlöw wrote:
 Further, are we forced to use the GC for Fiber allocation or 
 can a sub-class of Fibers implement its own allocation strategy?
Afraid it's set in stone. It doesn't actually use the GC for allocating the stack memory, though, instead opting for VirtualAlloc on Windows and mmap, valloc or malloc on other platforms. Of course it's possible to change the implementation in core.thread, but that's probably more work than you want. Copying the entire Fiber class and changing it also runs into problems, since it depends on private parts of core.thread.

--
  Biotronic
Oct 18
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Wednesday, 18 October 2017 at 12:32:31 UTC, Nordlöw wrote:
 Further, are we forced to use the GC for Fiber allocation or 
 can a sub-class of Fibers implement its own allocation strategy?
You could use std.typecons.scoped!Fiber, though it'll easily overflow your stack. Unfortunately, scoped doesn't seem to work with refCounted at the moment.
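A minimal sketch of the scoped approach, with the stack-overflow caveat above in mind:

```d
import core.thread : Fiber;
import std.typecons : scoped;

void main()
{
    int result;
    void work() { result = 42; }

    // scoped places the Fiber *object* on this function's stack instead
    // of the GC heap; the fiber's own execution stack is still allocated
    // by core.thread (mmap/VirtualAlloc/etc.).
    auto f = scoped!Fiber(&work);
    f.call();
    assert(result == 42);
}
```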
Oct 20
prev sibling parent reply ikod <geller.garry gmail.com> writes:
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
 On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
 Another thing...how should the synchronization between the 
 fibers figure out when the total number of fibers have reached 
 one million?...via an atomic counter fed by reference to the 
 constructor...or are there better ways? Because I do need an 
 atomic reference counter here, right?
This is how I did it:

```d
import core.thread : Fiber;

class MyFiber : Fiber
{
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index)
    {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run()
    {
        if (_depth == 6)
        { // 10^6 == 1 million, so stop here.
            _value = _index;
            return;
        }
        _value = 0;
        foreach (i; 0..10) // Line 23
        {
            auto e = new MyFiber(_depth+1, _index * 10 + i);
            e.call();
            _value += e._value;
        }
    }
}

unittest
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : StopWatch, AutoStart;

    auto sw = StopWatch(AutoStart.yes);
    auto a = new MyFiber(0, 0);
    a.call();
    sw.stop();
    assert(a._value == 499999500000);
    writeln(a._value, " after ", sw.peek);
}
```
 And how do I parallelize this over multiple worker threads? 
 AFAICT fibers are by default all spawned in the same main 
 thread, right?
True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and run on whichever thread happens to invoke their call() method.

I experimented a little with parallelism, and the easiest change is to replace line 23 with this:

```d
foreach (i; taskPool.parallel(10.iota, 1)) {
```

It seems to make very little difference in run time, though. I tried a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.

--
  Biotronic
I ran this under linux perf, and here is the top of 'perf report':

```
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  .............................................
     7.34%  t        [kernel.kallsyms]   [k] clear_page
     6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
     5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
     3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
     3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
     3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
     2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
     2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
     2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()
```

So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).
Oct 18
parent drug <drug2004 bk.ru> writes:
On 18.10.2017 16:37, ikod wrote:
 
 I ran this under linux perf, and here is the top of 'perf report':

 # Overhead  Command  Shared Object       Symbol
 # ........  .......  ..................  .............................................
      7.34%  t        [kernel.kallsyms]   [k] clear_page
      6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
      5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
      3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
      3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
      3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
      2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
      2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
      2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()

 So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).
It depends on how long the test took. If it was short, then of course creating the fibers accounts for a large share. Ideally we should find a test duration at which the memory-management percentage stops changing - that would give more reliable numbers.
Oct 18
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
 On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw 
 wrote:
 Creates an actor (goroutine, whatever), which spawns 10 new 
 actors, each of them spawns 10 more actors, etc. until one 
 million actors are created on the final level. Then, each of 
 them returns back its ordinal number (from 0 to 999999), which 
 are summed on the previous level and sent back upstream, until 
 reaching the root actor. (The answer should be 499999500000).

 See also: https://github.com/atemerev/skynet
 If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
Someone should really implement something along these lines.
```d
size_t cumsum(Fiber!int fib, size_t val)
{
    size_t sum = val;
    foreach (i; 0 .. 10)
    {
        val = fib.yield(sum); // yield sum, get next val
        sum += val;
    }
    return sum;
}

auto fib = fiber!func;
while (fib.state != Fiber.State.hold)
    writeln(fib.call(random()));
```
There is https://dlang.org/library/std/concurrency/generator.this.html, but it's not too well designed, relying on a runtime type cast in Fiber.yield which will fail if the yielded value isn't exactly of the type specified as the template parameter. https://github.com/dlang/phobos/pull/1910#r16965194
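For comparison, the existing std.concurrency.Generator mentioned above can already be used like this; a minimal sketch (the squares example is illustrative, not from the thread):

```d
import std.concurrency : Generator, yield;

void main()
{
    // Generator wraps a fiber whose yielded values all share one fixed
    // type; yielding a differently-typed value fails at runtime via the
    // cast mentioned above.
    auto squares = new Generator!int({
        foreach (i; 0 .. 5)
            yield(i * i);
    });

    int[] got;
    foreach (v; squares) // Generator is an input range
        got ~= v;
    assert(got == [0, 1, 4, 9, 16]);
}
```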
Oct 20