digitalmars.D.learn - Performance issue with fiber

hanabi1224 (116/116) Jul 21 2021 Hi, I'm new to D lang and encounter some performance issues with

seany (4/8) Jul 21 2021 Following.
Stefan Koch (9/13) Jul 24 2021 The Fiber scheduler will spawn a new fiber for every job.

jfondren (43/58) Jul 24 2021 I get 8->3 seconds using vibe's fiber scheduler, which still

russhy (3/8) Jul 26 2021 -brelease is a typo issue, i don't think that produce defired

jfondren (8/19) Jul 26 2021 No, it builds a release build with -brelease. Try it yourself,

hanabi1224 (7/10) Jul 26 2021 Why fiber pool? Isn't fiber a lightweight logical thread which is

Denis Feklushkin (7/12) Jul 27 2021 Spawning fiber is expensive (but not so expensive as spawning

hanabi1224 (11/12) Jul 28 2021 Sorry but I cannot agree with the logic behind this statement,

=?UTF-8?Q?Ali_=c3=87ehreli?= (13/18) Jul 28 2021 I assume the opposite because normally, the number of times a thread or

hanabi1224 (12/16) Jul 28 2021 You are right, but that's not my point. whether fiber spawning is

James Blachly (4/10) Jul 28 2021 Agree with sibling comment by hanabi1224: spawning fiber (in other

Mathias LANG (20/23) Jul 28 2021 I took a quick look, and the first problem I saw was that you

drug (11/36) Jul 28 2021 I profiled the provided example (not `FiberScheduler`) using perf. Both

hanabi1224 (6/20) Jul 28 2021 Very interesting findings! After making the Fiber fix, I also

Daniel Kozak (33/55) Jul 30 2021 I have rewrite it to be same as dart version

hanabi1224 (9/10) Jul 30 2021 Thanks! There're both generator version and fiber version on the

hanabi1224 (9/14) Jul 28 2021 Thanks for pointing it out! Looks like I was benchmarking thread

hanabi1224 <harlowmoo gmail.com> writes:

Hi, I'm new to D lang and encounter some performance issues with 
fiber, not sure if there's something obviously wrong with my code.

Basically, I found D lang's logical thread communication is quite 
like Elixir's (process),
I have programs written in both D and Elixir but the D version 
(release build) is like >5 times slower.

The program is to find first N primes by creating another 
coroutine for every prime found. And the idea is from the 
concurrent prime sieve example on golang's homepage.

Elixir code: 
[1.ex](https://github.com/hanabi1224/Programming-Language-Benchmarks/blob/main/bench/algorithm/coro-prime-sieve/1.ex)

D code: 
[1.d](https://github.com/hanabi1224/Programming-Language-Benchmarks/blob/main/bench/algorithm/coro-prime-sieve/1.d)

Perf result of calculating first 2000 primes on my laptop.

Elixir: 1.33s user 0.25s system 108% cpu 1.451 total

D(dmd): 10.41s user 34.48s system 361% cpu 12.405 total

D(ldc): 8.45s user 33.06s system 356% cpu 11.638 total


Also attaching code inline:

D
```d
import std;
import core.stdc.stdlib: exit;

__gshared Tid mainTid;
__gshared bool terminated = false;

const int mailBoxSize = 1;

void main(string[] args) {
     auto n = args.length > 1 ? args[1].to!int() : 10;
     auto scheduler = new FiberScheduler;
     scheduler.start({
         mainTid = thisTid();
         setMaxMailboxSize(mainTid, n, OnCrowding.throwException);
         auto filterTid = spawnLinked(&filter, n);
         setMaxMailboxSize(filterTid, mailBoxSize, 
OnCrowding.block);
         auto generatorTid = spawnLinked(&generate, filterTid);
         for(auto i=0;i<n;i++){
             auto prime = receiveOnly!int;
             writeln(prime);
         }
         terminated = true;
         exit(0);
     });
}

void generate(Tid tid) {
     for (auto i=2;!terminated;i++) {
         tid.send(i);
     }
}

void filter(int nLeft) {
     auto prime = receiveOnly!int;
     mainTid.send(prime);
     if (nLeft > 0) {
         filterInner(prime, nLeft);
     }
}

void filterInner(int prime, int nLeft) {
     auto nextTid = spawnLinked(&filter, nLeft-1);
     setMaxMailboxSize(nextTid, mailBoxSize, OnCrowding.block);
     while(!terminated){
         auto d = receiveOnly!int;
         if (d % prime != 0) {
             nextTid.send(d);
         }
     }
}
```
Elixir
```elixir
defmodule App do
   def main(args) do
     n = String.to_integer(Enum.at(args,0,"27"), 10)
     generate(n)
   end

   def generate(n) do
     mainPid = self()
     pid = spawn_link(fn -> filter(mainPid, n) end)
     generateLoop(pid, 2)
   end

   def generateLoop(pid, n) do
     send(pid, {:n, n})
     receive do
       :gen -> generateLoop(pid, n + 1)
     end
   end

   def filter(mainPid, nLeft) do
     receive do
       {:n, n} -> filterInner(mainPid, n, nLeft)
     end
   end

   def filterInner(mainPid, prime, nLeft) do
     send(mainPid, :gen)

     if nLeft > 1 do
       pid = spawn_link(fn -> filter(mainPid, nLeft-1) end)
       recieveAndSendToFilter(mainPid, self(), pid, prime)
     else
       System.halt(0)
     end
   end

   def recieveAndSendToFilter(mainPid, rxPid, txPid, prime) do
     receive do
       {:n, n} -> recieveAndSendToFilterInner(mainPid, rxPid, 
txPid, prime, n)
     end
   end
   def recieveAndSendToFilterInner(mainPid, rxPid, txPid, prime, 
n) do
     if Integer.mod(n, prime) != 0 do
       send(txPid, {:n, n})
     else
       send(mainPid, :gen)
     end
     recieveAndSendToFilter(mainPid, rxPid, txPid, prime)
   end
end
```

Jul 21 2021

seany <seany uni-bonn.de> writes:

On Wednesday, 21 July 2021 at 22:51:38 UTC, hanabi1224 wrote:
 Hi, I'm new to D lang and encounter some performance issues 
 with fiber, not sure if there's something obviously wrong with 
 my code.

 [...]

Following.

I am also in need of more information to increase speed of D 
binaries using parallel code.

Jul 21 2021

Stefan Koch <uplink.coder gmail.com> writes:

On Wednesday, 21 July 2021 at 22:51:38 UTC, hanabi1224 wrote:
 Hi, I'm new to D lang and encounter some performance issues 
 with fiber, not sure if there's something obviously wrong with 
 my code.

There is your problem.
     auto scheduler = new FiberScheduler;

The Fiber scheduler will spawn a new fiber for every job.
It will not use a fiber pool. Spawning a new fiber is expensive 
because of the stack allocation for it.
Also if I recall correctly it will run single-threaded but I am 
not 100% sure on that.
Just have a look at the running processes ... if you just see one 
than you are single threaded.

Jul 24 2021

jfondren <julian.fondren gmail.com> writes:

On Saturday, 24 July 2021 at 09:17:47 UTC, Stefan Koch wrote:
 On Wednesday, 21 July 2021 at 22:51:38 UTC, hanabi1224 wrote:
 Hi, I'm new to D lang and encounter some performance issues 
 with fiber, not sure if there's something obviously wrong with 
 my code.

 There is your problem.
     auto scheduler = new FiberScheduler;

 The Fiber scheduler will spawn a new fiber for every job.
 It will not use a fiber pool. Spawning a new fiber is expensive 
 because of the stack allocation for it.
 Also if I recall correctly it will run single-threaded but I am 
 not 100% sure on that.
 Just have a look at the running processes ... if you just see 
 one than you are single threaded.

I get 8->3 seconds using vibe's fiber scheduler, which still 
isn't competitive with Elixir.

```d
--- primes.d	2021-07-24 21:37:46.633053839 -0500
+++ primesv1.d	2021-07-24 21:35:50.843053425 -0500
   -1,16 +1,19   
  /++ dub.sdl:
+    dependency "vibe-core" version="~>1.16.0"
   +/
-import std;
-import core.stdc.stdlib : exit;
+import std.stdio, std.conv;
+import vibe.core.concurrency;

  __gshared Tid mainTid;
  __gshared bool terminated = false;

  const int mailBoxSize = 1;

+extern(C) void _exit(int status);
+
  void main(string[] args) {
      auto n = args.length > 1 ? args[1].to!int() : 10;
-    auto scheduler = new FiberScheduler;
+    setConcurrencyPrimitive(ConcurrencyPrimitive.workerTask);
      scheduler.start({
          mainTid = thisTid();
          setMaxMailboxSize(mainTid, n, OnCrowding.throwException);
   -22,7 +25,7   
              writeln(prime);
          }
          terminated = true;
-        exit(0);
+        _exit(0);
      });
  }
```

build:

```
dub build --compiler=ldc -brelease --single primesv1.d
```

I think this is just a very goroutine-friendly test that relies 
on constantly spawning fibers and abusing message-passing rather 
than architecting out the concurrent parts of your program and 
how they should communicate. std.parallel's more appropriate in D.

Jul 24 2021

russhy <russhy gmail.com> writes:

 ```

 build:

 ```
 dub build --compiler=ldc -brelease --single primesv1.d
 ```


-brelease is a typo issue, i don't think that produce defired 
effect, most likely it defaulted to debug build

it should be -b release

Jul 26 2021

jfondren <julian.fondren gmail.com> writes:

On Monday, 26 July 2021 at 15:27:48 UTC, russhy wrote:
 ```

 build:

 ```
 dub build --compiler=ldc -brelease --single primesv1.d
 ```


 -brelease is a typo issue, i don't think that produce defired 
 effect, most likely it defaulted to debug build

 it should be -b release

No, it builds a release build with -brelease. Try it yourself, 
the first line of output tells you what the build type is. 
Instead of interpreting -abc as -a -b -c like getopt, dub 
interprets it as -a bc

--arch/-a works the same way, and although I don't see this usage 
in the official documentation, I didn't make it up:

https://duckduckgo.com/?q=dub+%22brelease%22&norw=1

Jul 26 2021

hanabi1224 <harlowmoo gmail.com> writes:

Thank you for your response! I've got some questions tho.

On Saturday, 24 July 2021 at 09:17:47 UTC, Stefan Koch wrote:

 It will not use a fiber pool.

Why fiber pool? Isn't fiber a lightweight logical thread which is 
already implemented with thread pool internally?

 Spawning a new fiber is expensive because of the stack 
 allocation for it.

Actually, I've benchmarked many stackful coroutine 
implementations in different langs but they're all much faster, 
and BTW, AFAIK go's goroutine is stackful as well (4KB IIRC)

Jul 26 2021

Denis Feklushkin <denis.feklushin gmail.com> writes:

On Monday, 26 July 2021 at 12:09:07 UTC, hanabi1224 wrote:
 Thank you for your response! I've got some questions tho.

 On Saturday, 24 July 2021 at 09:17:47 UTC, Stefan Koch wrote:

 It will not use a fiber pool.

 Why fiber pool? Isn't fiber a lightweight logical thread which 
 is already implemented with thread pool internally?

Spawning fiber is expensive (but not so expensive as spawning 
thread, of course), but switching is fast.

Thus, you can spawn and pause "workers" fibers for avaiting of 
jobs.
(Probably, this behaviour is already implemented in number of 
libraries and it isn't actually need to implement another one.)

Jul 27 2021

hanabi1224 <harlowmoo gmail.com> writes:

On Wednesday, 28 July 2021 at 01:12:16 UTC, Denis Feklushkin 
wrote:
 Spawning fiber is expensive

Sorry but I cannot agree with the logic behind this statement, 
the whole point of using fiber is that, spwaning system thread is 
expensive, thus ppl create lightweight thread 'fiber', then if a 
'fiber' is still considered expensive, should ppl invent 'giber' 
on top of 'fiber', then 'hiber' on top of 'giber'? That will be 
infinite.

If you mean 'each fiber has its own stack' by 'expensive', to be 
fair, golang's goroutine is stackful (4KB stack for each 
goroutine), it's like >50X faster with the same test case.

Jul 28 2021

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 7/28/21 1:15 AM, hanabi1224 wrote:

 On Wednesday, 28 July 2021 at 01:12:16 UTC, Denis Feklushkin wrote:
 Spawning fiber is expensive

 Sorry but I cannot agree with the logic behind this statement, the whole
 point of using fiber is that, spwaning system thread is expensive, thus
 ppl create lightweight thread 'fiber'

I assume the opposite because normally, the number of times a thread or 
fiber is spawned is nothing compared to the number of times they are 
context-switched. So, spawning can be expensive and nobody would realize 
as long as switching is cheap.

There are other reasons why fibers are faster than threads all related 
to context switching:

- CPU cache efficiency

- Translation lookaside buffer (TLB) efficiency

- Holding on to the entirety of the time slice given by the OS

Ali

P.S. The little I know on these topics is included in this presentation:

   https://dconf.org/2016/talks/cehreli.html

Jul 28 2021

hanabi1224 <harlowmoo gmail.com> writes:

On Wednesday, 28 July 2021 at 16:31:49 UTC, Ali Çehreli wrote:
 I assume the opposite because normally, the number of times a 
 thread or fiber is spawned is nothing compared to the number of 
 times they are context-switched. So, spawning can be expensive 
 and nobody would realize as long as switching is cheap.

You are right, but that's not my point. whether fiber spawning is 
expensive should be compared to thread, and it should be much 
less expensive, ppl can expect to create much more fibers at the 
same time than system thread, even if it's stackful (that should 
mostly contribute to heap memory usage, fiber stack size should 
not be a perf bottleneck before running out of mem). And when 
analyzing perf issues with fiber, it's not a valid reason to me 
that 'fiber is expensive' because fiber itself is the solution to 
the expensiveness of thread, and non of the other fiber 
implementations in other languages/runtime have the same issue 
with the same test case.

Jul 28 2021

James Blachly <james.blachly gmail.com> writes:

On 7/27/21 9:12 PM, Denis Feklushkin wrote:
 Spawning fiber is expensive (but not so expensive as spawning thread, of 
 course), but switching is fast.
 
 Thus, you can spawn and pause "workers" fibers for avaiting of jobs.
 (Probably, this behaviour is already implemented in number of libraries 
 and it isn't actually need to implement another one.)

Agree with sibling comment by hanabi1224: spawning fiber (in other 
language runtimes) is INCREDIBLY fast, even though they have stack.

Something is wrong here.

Jul 28 2021

Mathias LANG <geod24 gmail.com> writes:

On Wednesday, 21 July 2021 at 22:51:38 UTC, hanabi1224 wrote:
 Hi, I'm new to D lang and encounter some performance issues 
 with fiber, not sure if there's something obviously wrong with 
 my code.

I took a quick look, and the first problem I saw was that you 
were using `spawnLinked` but not replacing the scheduler.
`std.concurrency` uses a global `scheduler` variable to do its 
job.
Hence doing:
```diff
- auto scheduler = new FiberScheduler();
+ scheduler = new FiberScheduler();
```

Will ensure that `spawnLinked` works as expected.
There are a few other things to consider, w.r.t. fibers:
- Our Fibers are 4 pages (on Linux) by default;
- We have an extra guard page, because we are a native language, 
so we can't do the same trick as Go to auto-grow the stack;
- Spawning fibers *is* expensive and other languages reuse fibers 
(Yes, Go recycle them);

The `FiberScheduler` implementation is unfortunately pretty bad: 
it [does not re-use 
fibers](https://github.com/dlang/phobos/blob/b48cca57e8ad2dc56872499836bfa1e70e390abb/std/concurr
ncy.d#L1578-L1599). I believe this is the core of the issue.

Jul 28 2021

drug <drug2004 bk.ru> writes:

28.07.2021 17:39, Mathias LANG пишет:
 On Wednesday, 21 July 2021 at 22:51:38 UTC, hanabi1224 wrote:
 Hi, I'm new to D lang and encounter some performance issues with 
 fiber, not sure if there's something obviously wrong with my code.

 
 I took a quick look, and the first problem I saw was that you were using 
 `spawnLinked` but not replacing the scheduler.
 `std.concurrency` uses a global `scheduler` variable to do its job.
 Hence doing:
 ```diff
 - auto scheduler = new FiberScheduler();
 + scheduler = new FiberScheduler();
 ```
 
 Will ensure that `spawnLinked` works as expected.
 There are a few other things to consider, w.r.t. fibers:
 - Our Fibers are 4 pages (on Linux) by default;
 - We have an extra guard page, because we are a native language, so we 
 can't do the same trick as Go to auto-grow the stack;
 - Spawning fibers *is* expensive and other languages reuse fibers (Yes, 
 Go recycle them);
 
 The `FiberScheduler` implementation is unfortunately pretty bad: it 
 [does not re-use 
 fibers](https://github.com/dlang/phobos/blob/b48cca57e8ad2dc56872499836bfa1e70e390abb/std/concurr
ncy.d#L1578-L1599). 
 I believe this is the core of the issue.

I profiled the provided example (not `FiberScheduler`) using perf. Both 
dmd and ldc2 gave the same result - `void filterInner(int, int)` took 
~90% of the run time. The time was divided between:
	`int std.concurrency.receiveOnly!(int).receiveOnly()` - 58%
	`void std.concurrency.send!(int).send(std.concurrency.Tid, int)` - 31%

So most of the time is messages passing.

Between the fibers creating took very few time. Perf output contains 
information only of `void std.concurrency.FiberScheduler.create(void 
delegate()).wrap()` which took less than 0.5%. But I wouldn't say that I 
did the profiling ideally so take it with a grain of salt.

Jul 28 2021

hanabi1224 <harlowmoo gmail.com> writes:

On Wednesday, 28 July 2021 at 16:26:49 UTC, drug wrote:
 I profiled the provided example (not `FiberScheduler`) using 
 perf. Both dmd and ldc2 gave the same result - `void 
 filterInner(int, int)` took ~90% of the run time. The time was 
 divided between:
 	`int std.concurrency.receiveOnly!(int).receiveOnly()` - 58%
 	`void std.concurrency.send!(int).send(std.concurrency.Tid, 
 int)` - 31%

 So most of the time is messages passing.

 Between the fibers creating took very few time. Perf output 
 contains information only of `void 
 std.concurrency.FiberScheduler.create(void delegate()).wrap()` 
 which took less than 0.5%. But I wouldn't say that I did the 
 profiling ideally so take it with a grain of salt.

Very interesting findings! After making the Fiber fix, I also 
made profiling with valgrind, the result shows MessageBox related 
staff contributes to ~13.7% of total cycles, swapContex related 
staff add up to a larger percentage (My rough estimation is 
50%), I'd like to share the result svg but did not figure out 

how to upload here.

Jul 28 2021

Daniel Kozak <kozzi11 gmail.com> writes:

On Wed, Jul 28, 2021 at 11:41 PM hanabi1224 via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 On Wednesday, 28 July 2021 at 16:26:49 UTC, drug wrote:
 I profiled the provided example (not `FiberScheduler`) using
 perf. Both dmd and ldc2 gave the same result - `void
 filterInner(int, int)` took ~90% of the run time. The time was
 divided between:
       `int std.concurrency.receiveOnly!(int).receiveOnly()` - 58%
       `void std.concurrency.send!(int).send(std.concurrency.Tid,
 int)` - 31%

 So most of the time is messages passing.

 Between the fibers creating took very few time. Perf output
 contains information only of `void
 std.concurrency.FiberScheduler.create(void delegate()).wrap()`
 which took less than 0.5%. But I wouldn't say that I did the
 profiling ideally so take it with a grain of salt.

 Very interesting findings! After making the Fiber fix, I also
 made profiling with valgrind, the result shows MessageBox related
 staff contributes to ~13.7% of total cycles, swapContex related
 staff add up to a larger percentage (My rough estimation is
  >50%), I'd like to share the result svg but did not figure out
 how to upload here.

I have rewrite it to be same as dart version

import std;

void main(string[] args) {
    auto n = args.length > 1 ? args[1].to!int() : 5;

    auto r = new Generator!int(
    {
        for(auto i = 2;;i++)
            yield(i);
    });

    for(auto i=0;i<n;i++)
    {

        auto prime = r.front;
        writeln(prime);
        r = filter(r, prime);

    }

}

Generator!int filter(Generator!int input, int prime)
{

    return new Generator!int(
    {
        while (input.empty is false)
        {
            input.popFront();
            auto i = input.front;
            if (i % prime != 0)
            {
                yield(i);
            }
        }
    });
}

Jul 30 2021

hanabi1224 <harlowmoo gmail.com> writes:

On Friday, 30 July 2021 at 14:41:06 UTC, Daniel Kozak wrote:
 I have rewrite it to be same as dart version

Thanks! There're both generator version and fiber version on the 
site(if possible), the 2 versions are not really comparable to 
each other (generator solutions should be much faster). There's 
another dart implementation with Isolate 
[here](https://github.com/hanabi1224/Programming-Language-Benchmarks/blob/main/bench/algorithm/coro-
rime-sieve/2.dart), it's unlisted because of very bad performance. (Isolate is
the closest thing in dart to thread or fiber but it's much much more expensive
to even spawn)

I'd like to list D's generator solution but please note that it's 

the fiber one is still a separate issue.

Jul 30 2021

hanabi1224 <harlowmoo gmail.com> writes:

On Wednesday, 28 July 2021 at 14:39:29 UTC, Mathias LANG wrote:
 Hence doing:
 ```diff
 - auto scheduler = new FiberScheduler();
 + scheduler = new FiberScheduler();
 ```

Thanks for pointing it out! Looks like I was benchmarking thread 
instead of fiber. I just made the change you suggest but the 
result is very similar, that being said, using system thread or 
fiber does not make any obvious difference in this test case, 
this fact itself seems problematic, fiber should be much faster 
than system thread in this test case (as I have proved for many 
other langs with the same case, I published results 
[here](https://programming-language-benchmarks.vercel.app/probl
m/coro-prime-sieve) but note that not all of them are implemented with stackful
coroutine), unless there's some defect in D's current fiber implementation.

Jul 28 2021

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Performance issue with fiber