digitalmars.D - Lets talk about fibers

reply "Liran Zvibel" <liran weka.io> writes:
Hi,

We discussed (not) moving fibers between threads at DConf last 
week, and later it was discussed in the announce group. I think 
this matter is important enough to get a thread of its own.

Software fibers/coroutines were created to make asynchronous 
programming using a Reactor (or another "event loop i/o 
scheduler") more seamless.

For those unaware of the Reactor Pattern, I advise reading [ 
http://en.wikipedia.org/wiki/Reactor_pattern ; 
http://www.dre.vanderbilt.edu/~schmidt/PDF/reactor-siemens.pdf ], 
and for some perspective on how other languages have addressed 
this I recommend watching Guido van Rossum's talk about asyncio 
and Python: https://www.youtube.com/watch?v=aurOB4qYuFM

The Reactor pattern is a long-time widely accepted way to achieve 
low latency async io operations, that fortunately became famous 
thanks to the Web and the C10k requirement/problem. Using the 
Reactor is the most efficient way to leverage current CPU 
architectures to perform lots of IO for many reasons outside of 
this scope.
Another very important quality of a reactor-based approach is 
that, since all event handlers are serialized on a single IO 
scheduler ("the reactor") on each thread, a correctly designed 
program spares programmers from thinking about concurrency and 
worrying about race conditions.

Another thing to note: when using the reactor pattern you have to 
make sure that no event handler ever blocks. Since this is a 
non-preemptive model, once an event handler blocks the other 
event handlers cannot run, which starves them and the clients on 
the other side of the network.
Reactor implementations usually detect and report when an event 
handler takes too long to give up control (the threshold is 
application dependent, but should be in the usec range on 
current hw).
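
A minimal sketch of the slow-handler detection described above, assuming a toy loop rather than any real reactor library. `Handler`, `runOnce` and the 5 ms budget are made-up names and values for illustration only (a production budget would be far tighter, as noted above).

```d
import core.thread : Thread;
import core.time : Duration, MonoTime, msecs;
import std.stdio : writefln;

alias Handler = void delegate();

/// Runs every handler to completion, one after another, and returns the
/// indices of the handlers that held the loop longer than `budget`.
size_t[] runOnce(Handler[] handlers, Duration budget)
{
    size_t[] slow;
    foreach (i, h; handlers)
    {
        immutable start = MonoTime.currTime;
        h();    // non-preemptive: no other handler runs until this returns
        immutable took = MonoTime.currTime - start;
        if (took > budget)
            slow ~= i;
    }
    return slow;
}

void main()
{
    Handler[] handlers;
    handlers ~= delegate void() { /* well-behaved: returns at once */ };
    handlers ~= delegate void() { Thread.sleep(20.msecs); }; // blocks the loop

    foreach (i; runOnce(handlers, 5.msecs))
        writefln("handler %s exceeded its time budget", i);
}
```

While the second handler sleeps, nothing else registered on that loop makes progress, which is exactly the starvation described above.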

The downside of the reactor pattern is (or used to be) that the 
programmer has to manually keep the state/context of each event 
handler. Each "logical" operation is composed of many I/O 
transactions (some network protocol to keep track of, maybe 
accessing a networked DB for some data, reading/writing 
local/remote files, etc.), so the reactor also keeps a context 
for each callback and IO event, and the programmer has to update 
that context and keep registering new event handlers manually for 
every extra I/O transaction, in many cases changing the callback 
registration as well.
This downside means that it's more difficult to program for a 
Reactor model, but since programmers don't have to think about 
races and concurrency issues (and then debug them...), in our 
experience it is still more efficient to program than 
general-purpose threads if you care about correctness/coherency.
One way to mitigate this complexity was the Proactor pattern -- 
implementing higher-level async IO services over the reactor, 
thus sparing the programmer a lot of the low-level context 
headaches.

Up until now I did not say anything about Fibers/coroutines.

What Fibers bring to the table, is the ability to program within 
the reactor model without having to manually keep a context that 
is separate for the program logic, and without the requirement to 
manually re/register callbacks for different IO events.
D's Fibers allowed us to create an async io library with support 
for network/file/disk operations and higher level conditions 
(waiters, barriers, etc) that allows the programmer to write code 
as-if it runs in its own thread (almost, sometimes fibers are 
explicitly "spawned" -- added to the reactor, and 
fiber-conditions are slightly different than spawning and joining 
threads) without paying the huge correctness/coherence and 
performance penalties of the threading model.
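
To make the contrast with manual contexts concrete, here is a minimal sketch using core.thread.Fiber. It is not the Weka.IO library: `handleRequest`, `makeFiber`, `runAll` and the `events` log are hypothetical names, and each Fiber.yield() merely stands in for "wait for the next I/O event" (no real I/O happens).

```d
import core.thread : Fiber;
import std.stdio : writeln;

string[] events;   // module-level variables are thread-local in D

/// Straight-line handler: the local state lives on the fiber's stack
/// instead of in a hand-maintained context object.
void handleRequest(string name)
{
    events ~= name ~ ": read request";
    Fiber.yield();                     // would wait for the DB reply
    events ~= name ~ ": got DB reply";
    Fiber.yield();                     // would wait for a writable socket
    events ~= name ~ ": sent response";
}

/// Separate function so that every fiber captures its own `name`.
Fiber makeFiber(string name)
{
    return new Fiber({ handleRequest(name); });
}

/// Toy single-threaded scheduler: resume each unfinished fiber in turn.
void runAll(Fiber[] fibers)
{
    bool progressed = true;
    while (progressed)
    {
        progressed = false;
        foreach (f; fibers)
        {
            if (f.state == Fiber.State.TERM)
                continue;
            f.call();
            progressed = true;
        }
    }
}

void main()
{
    Fiber[] fibers = [makeFiber("A"), makeFiber("B")];
    runAll(fibers);
    foreach (line; events)
        writeln(line);
}
```

The two requests interleave at the yield points, yet neither handler contains a callback registration or an explicit state machine.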

There are two main reasons why it does not make sense to move 
fibers between threads:

1. You'll start having concurrency issues. Let's assume we have a 
main fiber that received some request, and it spawns 3 fibers 
looking into different DBs to get some info and update an array 
with the data. The array will probably be on the stack of the 
first fiber. If fibers don't move between threads, there is 
nothing to worry about (as expected by the model). If you start 
moving fibers across threads you have to start guarding this 
array now, to make sure it's still coherent.
This is a simple example, but it shows that you're "losing" one 
of the biggest selling points of the whole reactor-based model.
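
The array example can be sketched as follows, assuming everything runs on one thread. `lookup` and `gather` are hypothetical names, and the "DB lookups" are simulated with a single Fiber.yield() each.

```d
import core.thread : Fiber;
import std.stdio : writeln;

/// Hypothetical stand-in for a DB lookup: yields once, as if waiting for
/// the reply, then stores its result in the slot it was given.
Fiber lookup(int* slot, int value)
{
    return new Fiber({
        Fiber.yield();     // "waiting for the database"
        *slot = value;     // plain store: no lock, no atomic
    });
}

/// Spawns three child fibers that fill an array living on the caller's stack.
int[3] gather()
{
    int[3] results;
    Fiber[3] children;
    foreach (i; 0 .. 3)
        children[i] = lookup(&results[i], (i + 1) * 10);

    // Resume the children round-robin until all of them have terminated.
    bool pending = true;
    while (pending)
    {
        pending = false;
        foreach (c; children)
        {
            if (c.state != Fiber.State.TERM)
            {
                c.call();
                pending = true;
            }
        }
    }
    return results;
}

void main()
{
    int[3] answer;
    auto parent = new Fiber({ answer = gather(); }); // the "main fiber"
    parent.call();
    writeln(answer);
}
```

The unguarded stores are safe only because just one of these fibers executes at any moment. If the children could be resumed on other threads, `results` would need synchronization.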

2. Fibers and reactor-based IO work well (read: make sense) when 
you have lots of concurrent, very small transactions (similar to 
the Web C10k problem or a storage machine). In this case, if one 
of the threads has more capacity than the rest, then the IO 
scheduler ("reactor") will just make sure to spawn new fibers 
accepting new transactions in that thread. If your situation does 
not allow balancing by placing new requests in the right place, 
then you probably should not use the reactor model, but a 
different one that suits your application better.
Currently we can spawn another reactor to take more load, but the 
load is balanced statically at a system-wide level. On previous 
projects we had several reactors running on different threads and 
providing very different functionality (with different handlers, 
naturally).
We never got to a situation that moving a fiber between threads 
made any sense.

As we see, there is nothing to gain and lots to lose by moving 
fibers between threads.

Now, if we want to make sure fibers are well supported in D there 
are several other things we should do:

1. Implement a good async IO library that supports fiber-based 
programming. I don't know Vibe.d very well (i.e. at all), maybe 
we (Weka.IO) can help review it and suggest ways to make it into 
a general async IO library (we have over 15 years of experience 
developing with the reactor model in many environments).

2. Adding better compiler support. The one problem with fibers is 
that upon creation you have to know the stack size for that 
fiber. Different functions will create different stack depths. It 
is very convenient to use the stack to hold all objects (recall 
Walter's first day talk, for example), and it can be used as very 
convenient way to "garbage collect" all resources added during 
the run of that fiber, but currently we don't leverage it to the 
max since we don't have a good way to know/limit the amount of 
memory used this way.
If the compiler could analyze stack usage by functions 
(recursively) and give us hints regarding the upper bounds of 
stack usage, we would be able to use the stack more aggressively 
and utilize memory much better.
Also -- I think such static analysis will be a big selling point 
for D for systems like ours.

I think now everything is written down, and we can move the 
discussion here.

Liran.
Jun 03 2015
next sibling parent reply "Joakim" <dlang joakim.fea.st> writes:
On Wednesday, 3 June 2015 at 18:34:34 UTC, Liran Zvibel wrote:
 There are two main reasons why it does not make sense to move 
 fibers between threads:

 1. You'll start having concurrency issues. Lets assume we have 
 a main fiber that received some request, and it spawns 3 fibers 
 looking into different DBs to get some info and update an array 
 with the data. The array will probably be on the stack of the 
 first fiber. If fibers don't move between threads, there is 
 nothing to worry about (as expected by the model). If you start 
 moving fibers across threads you have to start guarding this 
 array now, to make sure it's still coherent.
 This is a simple example, but basically shows that you're 
 "losing" one of the biggest selling point of the whole reactor 
 based model.

 2. Fibers and reactor based IO make work well (read: make 
 sense) when you have a situation where you have lots of 
 concurrent very small transactions (similar to the Web C10k 
 problem or a storage machine). In this case, if one of the 
 threads has more capacity than the rest, then the IO scheduler 
 ("reactor") will just make sure to spawn new fibers accepting 
 new transactions in that fiber. If you don't have a situation 
 that balancing can be done via placing new requests in the 
 right place, then probably you should not use the reactor 
 model, but a different one that suits your application better.
 Currently we can spawn another reactor to take more load, but 
 the load is balanced statically at a system-wide level. On 
 previous projects we had several reactors running on different 
 threads and providing very different functionality (with 
 different handlers, naturally).
 We never got to a situation that moving a fiber between threads 
 made any sense.

 As we see, there is nothing to gain and lots to lose by moving 
 fibers between threads.
Your entire argument seems based on fibers moving between threads breaking your reactor IO model. If there was an option to disable fibers moving or if you had to explicitly ask for a fiber to move, your argument is moot. I have no dog in this fight, just pointing out that your argument is very specific to your use.
Jun 03 2015
next sibling parent reply "Liran Zvibel" <liran weka.io> writes:
On Thursday, 4 June 2015 at 01:51:25 UTC, Joakim wrote:
 Your entire argument seems based on fibers moving between 
 threads
 breaking your reactor IO model.  If there was an option to
 disable fibers moving or if you had to explicitly ask for a 
 fiber
 to move, your argument is moot.

 I have no dog in this fight, just pointing out that your 
 argument
 is very specific to your use.
This is not "my" reactor IO model, this is the model that was popularized by ACE in the '90s (and since this is how I got to know it this is how I call it), and later became the asyncio programming model. This model was important enough for Guido van Rossum to spend a lot of his time adding it to Python, and Google created a whole programming language around it [and I can give more references to that model if you like].

My point is that moving fibers between threads is difficult to implement and makes the model WEAKER. So you work hard, and get less (or just never use that feature you worked hard on as it breaks the model).

The main problem with adding flexibility is that initially it always sounds like a "good idea". I just want to stress the point that in this case it's actually not such a good idea.

If you can come up with another programming model that leverages fibers (and is popular), and moving fibers between threads makes sense in that model, then I think the discussion should be how much stronger that other model is with fibers being able to move, and whether it's worth the effort.

Since I think you won't come up with a very good case for moving them between threads in that other popular programming model, and since it's difficult to implement, and since it already makes one popular programming model weaker -- I suggest not doing it.

Currently asyncio is supported well by D (Vibe.d and Weka.IO are using it) without this ability.

At the end of my post I suggested using the resources freed by not moving fibers differently, and just endorsing the asyncio programming model rather than adding generic "flexibility" features.
Jun 04 2015
next sibling parent reply "Ola Fosheim Grøstad" writes:
On Thursday, 4 June 2015 at 07:24:48 UTC, Liran Zvibel wrote:
 Since I think you won't come up with a very good case to moving 
 them between threads on that other popular programming model,
INCOMING WORKLOAD ("__" denotes yield+delay):

a____aaaaaaa
b____bbbbbb
c____cccccccc
d____dddddd
e____eeeeeee

SCHEDULING WITHOUT MIGRATION:

CORE 1: aaaaaaaa
CORE 2: bcdef___bbbbbbccccccccddddddeeeeeee

SCHEDULING WITH MIGRATION:

CORE 1: aaaaaaaacccccccceeeeeee
CORE 2: bcdef___bbbbbbdddddd

And this isn't even a worst case scenario. Please note that it is common to start a task by looking up global caches first. So this is a common pattern:

1. look up caches
2. wait for response
3. process
Jun 04 2015
parent reply "Liran Zvibel" <liran weka.io> writes:
On Thursday, 4 June 2015 at 08:43:31 UTC, Ola Fosheim Grøstad 
wrote:
 On Thursday, 4 June 2015 at 07:24:48 UTC, Liran Zvibel wrote:
 Since I think you won't come up with a very good case to 
 moving them between threads on that other popular programming 
 model,
 INCOMING WORKLOAD ("__" denotes yield+delay):

 a____aaaaaaa
 b____bbbbbb
 c____cccccccc
 d____dddddd
 e____eeeeeee

 SCHEDULING WITHOUT MIGRATION:

 CORE 1: aaaaaaaa
 CORE 2: bcdef___bbbbbbccccccccddddddeeeeeee

 SCHEDULING WITH MIGRATION:

 CORE 1: aaaaaaaacccccccceeeeeee
 CORE 2: bcdef___bbbbbbdddddd

 And this isn't even a worst case scenario. Please note that it 
 is common to start a task by looking up global caches first. So 
 this is a common pattern:

 1. look up caches
 2. wait for response
 3. process
Fibers are good when you get tons of new work constantly. If you just have a few things that run forever, you're most probably better off with threads. It's true that you can misuse fibers and then complain that things don't work well for you, but I don't think that should be supported by the language.

If you assume that new jobs always come in (and then you schedule new jobs to the more-empty fibers), there is no need to balance old jobs (that will finish very soon anyway).

If you have a blocking operation it should not be in fibers anyway. We have a deferToThread mechanism with a thread pool that waits for such functions (if we want to do something that takes some time, or use an external library). Fibers should never ever block. If your fiber is blocking you're violating the model.

Fibers aren't some magic to solve every CS problem possible. There is a defined class of problems that work well for fibers, and there fibers should be utilized (and even then with great discipline). If your problem is not one of these -- use another form of concurrency/parallelism. One of my main arguments against Go is "If your only tool is a hammer, then every problem looks like a nail" -- D should not go that route.

Looking at your example -- a good scheduler should have distributed a-e evenly across both cores to begin with. Then a good fibers programmer should yield() after each unit of work, so aaaaaaa won't be a valid state. Finally, the blocking code should have run outside the fibers IO scheduler, with that fiber just waiting in suspended mode until it's runnable again, allowing other fibers to execute.
Jun 04 2015
parent "Ola Fosheim Grøstad" writes:
On Thursday, 4 June 2015 at 13:42:41 UTC, Liran Zvibel wrote:
 If you assume that new jobs always come in (and then you 
 schedule new jobs to the more-empty fibers), there is no need 
 to balance old jobs (That will finish very soon anyway).
That assumes that the tasks don't do much work but just wait and wait and wait.
 If you have a blocking operation it should not be in fibers 
 anyways.
 We have a deferToThread mechanism with a thread pool that waits 
 for such functions (if we want to do something that takes some 
 time, or use external library).
 Fibers should never ever block. If your fiber is blocking 
 you're violating the model.

 Fibers aren't some magic to solve every CS problem possible.
Actually, co-routines have been basic concurrency building blocks since the 50s, and from a CS perspective the degree of parallelism is an implementation detail.
 Looking at your example -- a good scheduler should have 
 distributed a-e evenly across both cores to begin with.
Nah, because that would require an a priori estimate.
 Then a good fibers programmer should yield() after each unit of 
 work, so aaaaaaa won't be a valid state.
Won't work when you call external libraries. Here is a likely pattern for an image scaling service:

1. check cache
2. request data if not found
3. process, save in cache and return

1____________2____________33333333

You can't just break up workload 3, you would run out of memory.
Jun 04 2015
prev sibling parent Ivan Timokhin <timokhin.iv gmail.com> writes:
On Thu, Jun 04, 2015 at 07:24:47AM +0000, Liran Zvibel wrote:
 If you can come up with another programming model that leverages
 fibers (and is popular), and moving fibers between threads makes
 sense in that model, then I think the discussion should be how
 stronger that other model is with fibers being able to move, and
 whether it's worth the effort.
This might be relevant: https://channel9.msdn.com/Events/GoingNative/2013/Bringing-await-to-Cpp Specifically slide 12 (~12:30 in the video), where he discusses implementation.
Jun 04 2015
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/15 9:51 PM, Joakim wrote:

 Your entire argument seems based on fibers moving between threads
 breaking your reactor IO model.  If there was an option to
 disable fibers moving or if you had to explicitly ask for a fiber
 to move, your argument is moot.

 I have no dog in this fight, just pointing out that your argument
 is very specific to your use.
I plead complete ignorance and inexperience with fibers and thread scheduling. But I think the sanest approach here is to NOT support moving fibers, and then add support if it becomes necessary. We can make the scheduler something that's parameterized, or hell, just edit your own runtime if you need it! It may also be that fibers that move can't be statically checked to see if they will break on moving. That may simply just be on you, like casting. I think for the most part, the safest default is to have a fiber scheduler that cannot possibly create races. Let's build from there. -Steve
Jun 04 2015
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, 4 June 2015 at 13:16:48 UTC, Steven Schveighoffer 
wrote:
 On 6/3/15 9:51 PM, Joakim wrote:

 Your entire argument seems based on fibers moving between 
 threads
 breaking your reactor IO model.  If there was an option to
 disable fibers moving or if you had to explicitly ask for a 
 fiber
 to move, your argument is moot.

 I have no dog in this fight, just pointing out that your 
 argument
 is very specific to your use.
I plead complete ignorance and inexperience with fibers and thread scheduling. But I think the sanest approach here is to NOT support moving fibers, and then add support if it becomes necessary. We can make the scheduler something that's parameterized, or hell, just edit your own runtime if you need it! It may also be that fibers that move can't be statically checked to see if they will break on moving. That may simply just be on you, like casting. I think for the most part, the safest default is to have a fiber scheduler that cannot possibly create races. Let's build from there.
One thing that needs to be considered that deadalnix pointed out at dconf is that we _do_ have shared(Fiber), and we have to deal with that in some manner, even if we don't want to support moving fibers across threads (even if that simply means disallowing shared(Fiber)). - Jonathan M Davis
Jun 04 2015
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
I mostly agree with what you wrote, but I'd like to point out 
that it's probably safe to move some kinds of fibers across 
threads:

If the fiber's main function is pure and its parameters have no 
mutable indirection (i.e. if the function is strongly pure), 
there should be no way to get data races.

Therefore I believe we could theoretically support moving such 
fibers. But currently I see no way how most fibers can be made 
pure, after all you want to do IO in them. Of course, we could 
forego the purity requirement, but then the compiler can no 
longer support us.
Jun 04 2015
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Jun-2015 21:34, Liran Zvibel wrote:
 Hi,
[snip]
 There are two main reasons why it does not make sense to move fibers
 between threads:
For me language being TLS by default is enough to not even try this madness. If we allow moves a typical fiber will see different "globals" depending on where it is scheduled next. For instance, if a thread local connection is used (inside of some pool presumably) then:

    Socket socket;
    first_part = socket.read(...); // assume this yields
    second_part = socket.read(...); // then this may use different socket

-- Dmitry Olshansky
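
The "TLS by default" point can be seen without fibers at all. The sketch below is only an illustration with made-up names (`connectionId`, `processWide`, `modifyInOtherThread`): a module-level variable gets one copy per thread, so code that resumes on another thread would read that thread's copy, while `__gshared` is the explicit opt-in for a single process-wide instance.

```d
import core.thread : Thread;
import std.stdio : writeln;

int connectionId = 1;           // thread-local: every thread has its own copy
__gshared int processWide = 1;  // one instance for the whole process

/// Modifies both variables from a second thread, then reports what the
/// calling thread sees afterwards.
int[2] modifyInOtherThread()
{
    auto t = new Thread({
        connectionId = 42;      // touches only the worker thread's copy
        processWide = 42;       // touches the single shared instance
    });
    t.start();
    t.join();
    return [connectionId, processWide];
}

void main()
{
    writeln(modifyInOtherThread());
}
```

The calling thread still sees its own `connectionId` unchanged, while `processWide` reflects the worker's write.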
Jun 04 2015
parent Dan Olson <gorox comcast.net> writes:
Dmitry Olshansky <dmitry.olsh gmail.com> writes:

 On 03-Jun-2015 21:34, Liran Zvibel wrote:
 Hi,
[snip]
 There are two main reasons why it does not make sense to move fibers
 between threads:
For me language being TLS by default is enough to not even try this madness. If we allow moves a typical fiber will see different "globals" depending on where it is scheduled next.
Opposite problem too, with LLVM's TLS optimizations, the Fiber may keep accessing same "global" even when yield() resumes on a different thread.

    int someTls;   // optimizer caches address

    auto fib = new Fiber({
        for (;;)
        {
            printf("%d fiber before yield\n", someTls);
            ++someTls;      // thread A's var
            Fiber.yield();
            ++someTls;      // resumed thread B, but still A's var
            printf("%d fiber after yield\n", someTls);
        }
    });
Jun 04 2015
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, 3 June 2015 at 18:34:34 UTC, Liran Zvibel wrote:
 As we see, there is nothing to gain and lots to lose by moving 
 fibers between threads.
Given that it sounds like LLVM _can't_ implement moving fibers (or if it can, it'll really hurt performance), I think that we need a really compelling reason to allow it. And I haven't heard one from anyone thus far. Initially, at dconf, Walter asserted that we needed to make fibers moveable across threads, but I haven't really heard anyone give a reason why we need to. deadalnix talked about load balancing that way, but you gave good reasons as to why that didn't make sense, and that argument is the closest that I've seen to a reason why it would make sense to move fibers across threads.

Now, like Steven, I've never used a fiber in my life (I really should look into them one of these days), so I'm ill-suited for making a decision on this, but it sounds to me like we should start by having it be illegal to move fibers across threads and then add the ability later if someone comes up with a good enough reason. Certainly, it sounds questionable that it even _can_ be implemented, and costly if it can.

Another approach would be to make it so that shared(Fiber) could be moved across threads but that Fiber can't be (or at least, it's undefined behavior if you do, since the compiler will assume that you won't), and if the 3 major backends can all support moving fibers across threads (even in an inefficient fashion), then we can just implement that support for shared(Fiber) and say that folks are free to shoot themselves in the foot using that if they so desire, and let Fiber be more restrictive and not have it take the performance hit incurred by allowing fibers to be passed across threads. But if LLVM really can't support moving fibers across threads, then I think that the clear answer is that we shouldn't allow it at all (in which case, shared(Fiber) should probably be outright disallowed).

- Jonathan M Davis
Jun 04 2015
parent reply "Ola Fosheim Grøstad" writes:
On Thursday, 4 June 2015 at 22:28:52 UTC, Jonathan M Davis wrote:
 anyone give a reason why we need to. deadalnix talked about 
 load balancing that way, but you gave good reasons as to why 
 that didn't make sense,
What good reasons? By the time you get a response from your shared memcache or database, the x86 level 1 cache, and possibly level 2, is cold. And the level 3 cache is shared, so there is no cache penalty for switching cores. Add to this that cores share primary caches in pairs, so if you don't pair tasks that address the same memory you lose up to 10-20% performance, in addition to unused capacity and increased latency.

Smart scheduling matters, both at the OS level and at the application level. That's not a controversial statement (only in these forums…)!

The only good reason for not switching is that you lack resources/know-how. But then you probably should not make it a language feature in the first place...?

There is no reason to pretend that synthetic performance benchmarks don't carry weight when people pick a language for production. That's just wishful thinking.
Jun 05 2015
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/15 7:29 AM, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com> wrote:
 On Thursday, 4 June 2015 at 22:28:52 UTC, Jonathan M Davis wrote:
 anyone give a reason why we need to. deadalnix talked about load
 balancing that way, but you gave good reasons as to why that didn't
 make sense,
What good reasons? By the time you get response from your shared memcache or database the x86 cache level 1 and possibly 2 is cold. And cache level 3 is shared, so there is no cache penalty for switching cores. Add to this that two-and-two cores share primary caches so if you don't pair tasks that address the same memory you loose up to 10-20% performance in addition to unused capacity and increased latency.
I think I'll go with Liran's experience over your hypothetical anecdotes. You seem to have a lot of academic knowledge, but I'd rather see what actually happens. If you have that data, please share. -Steve
Jun 05 2015
parent reply "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 13:20:27 UTC, Steven Schveighoffer 
wrote:
 I think I'll go with Liran's experience over your hypothetical 
 anecdotes. You seem to have a lot of academic knowledge, but 
 I'd rather see what actually happens. If you have that data, 
 please share.
There is absolutely no reason to go personal. I address weak arguments when I see them. Liran claimed there were no benefits to migrating fibers. That's not true. He is speaking for his particular use case, and that is fine. It is easy to create a benchmark where locking fibers to a thread is beneficial. But it is completely orthogonal to my most likely D use case, which is in low-latency web-services.

There will be no data that benefits D until D makes itself look like a serious contender and does well in aggressive external benchmarking. You don't get the luxury to choose what workload D's performance is benchmarked with!

D is an underdog compared to C++/Rust/Go. That means you need to get that 10-20% performance edge in benchmarks to make D look attractive.

If you want D to succeed you need to figure out what D's main selling point is and make it a compiler-based feature. If it is a library-only solution, then any language can steal your thunder...
Jun 05 2015
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Friday, 5 June 2015 at 14:17:35 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 5 June 2015 at 13:20:27 UTC, Steven Schveighoffer 
 wrote:
 I think I'll go with Liran's experience over your hypothetical 
 anecdotes. You seem to have a lot of academic knowledge, but 
 I'd rather see what actually happens. If you have that data, 
 please share.
There is absolutely no reason to go personal. I address weak arguments when I see them. Liran claimed there were no benefits to migrating fibers. That's not true. He is speaking for his particular use case, that is fine. It is easy to create a benchmark where locking fibers to a thread is beneficial. But it is completely orthogonal to my most likely D use case which is in low-latency web-services. There will be no data that benefits D until D is a making itself look like a serious contender and do it well in aggressive external benchmarking. You don't get the luxury to choose what workload D's performance is benchmarked with! D is an underdog compared to C++/Rust/Go. That means you need to get that 10-20% performance edge in benchmarks to make D look attractive.
I agree, but I dare doubt that a slight performance edge will make the difference. There are loads of factors (knowledge base, infrastructure, complacency, C++-Guruism, marketing etc.) why D is an underdog.
 If you want D to succeed you need to figure out what is D's 
 main selling point and make it a compiler-based feature. If it 
 is a library only solution, then any language can steal your 
 thunder...
The "problem" D has is that it has loads of selling points. Rust and Go were designed with very specific goals in mind, thus it's easy to sell them: "You want X? We have X!". D has been developed over the years by a community, not a committee. D is more like "You want X? Yeah, we have X, actually a slightly improved version of X, we call it EX, and Y and Z on top of that. And A B C too! And templates!" - "Sorry, man! Too complicated for me! Can I just have a for-loop, please? Milk, no sugar, thanks."

I know, as usual I simplify things and exaggerate! He he he. But programming languages are like everything else: just because something is good doesn't mean that people will buy it.

As regards compiler-based features, as soon as features are compiler-based people will complain "Why is it built-in? That should be handled by a library! I want more freedom!" I know for sure.
Jun 05 2015
parent reply "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 14:51:05 UTC, Chris wrote:
 I agree, but I dare doubt that a slight performance edge will 
 make the difference. There are load of factors (knowledge base, 
 infrastructure, complacency, C++-Guruism, marketing etc.) why D 
 is an underdog.
But everybody loves the underdog when it catches up to the pack and beats the pack on the finish line. ;^) I now follow Pony because of this self-provided benchmark: http://ponylang.org/benchmarks_all.pdf They are communicating a focus for a domain, a good understanding of their area, and it makes me want to give it a spin even at this early stage where I obviously can't actually use it. I am not saying Pony is good, but it makes a good case for itself IMO.
 no sugar, thanks." I know, as usual I simplify things and 
 exaggerate! He he he. But programming languages are like 
 everything else, only because something is good doesn't mean 
 that people will buy it.
Sure, but it is also important to make people take notice. People take notice of benchmark leaders. And too often benchmarks measure throughput while latency is just as important. End users don't notice peak throughput (which is measurable as a blip on the cloud server instance-count logs), they notice reduced latency. So to me latency is the most important aspect of a web-service (+ programmer productivity). I don't find Go exciting, but they show concern for latency (concurrent GC etc). Communicating that concern is good, even before they reach whatever goals they have.
 As regard compiler-based features, as soon as features are 
 compiler-based people will complain "Why is it built-in? That 
 should be handled by a library! I want more freedom!" I know 
 for sure.
Heh, not if it is getting you an edge, but if it is a second-class citizen addition, yes, then I agree. Cheers!
Jun 05 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Friday, 5 June 2015 at 17:28:39 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 5 June 2015 at 14:51:05 UTC, Chris wrote:
 I agree, but I dare doubt that a slight performance edge will 
 make the difference. There are load of factors (knowledge 
 base, infrastructure, complacency, C++-Guruism, marketing 
 etc.) why D is an underdog.
But everybody loves the underdog when it catches up to the pack and beats the pack on the finish line. ;^) I now follow Pony because of this self-provided benchmark: http://ponylang.org/benchmarks_all.pdf They are communicating a focus for a domain, a good understanding of their area, and it makes me want to give it a spin even at this early stage where I obviously can't actually use it. I am not saying Pony is good, but it makes a good case for itself IMO.
 no sugar, thanks." I know, as usual I simplify things and 
 exaggerate! He he he. But programming languages are like 
 everything else, only because something is good doesn't mean 
 that people will buy it.
Sure, but it is also important to make people take notice. People take notice of benchmark leaders. And too often benchmarks measure throughput while latency is just as important. End users don't notice peak throughput (which is measurable as a blip on the cloud server instance-count logs), they notice reduced latency. So to me latency is the most important aspect of a web-service (+ programmer productivity). I don't find Go exciting, but they show concern for latency (concurrent GC etc). Communicating that concern is good, even before they reach whatever goals they have.
 As regard compiler-based features, as soon as features are 
 compiler-based people will complain "Why is it built-in? That 
 should be handled by a library! I want more freedom!" I know 
 for sure.
Heh, not if it is getting you an edge, but if it is a second citizen addition. Yes, then I agree. Cheers!
Thanks for showing me Pony. Languages like Nim and Pony keep popping up which shows a) how important native compilation is and b) that there are still loads of issues in standard languages (C/C++/Python/Java/C#). But D is already there, it's already usable, and new languages often re-invent D.
Jun 05 2015
parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Friday, 5 June 2015 at 18:25:26 UTC, Chris wrote:
 On Friday, 5 June 2015 at 17:28:39 UTC, Ola Fosheim Grøstad 
 wrote:
 On Friday, 5 June 2015 at 14:51:05 UTC, Chris wrote:
 I agree, but I dare doubt that a slight performance edge will 
 make the difference. There are load of factors (knowledge 
 base, infrastructure, complacency, C++-Guruism, marketing 
 etc.) why D is an underdog.
But everybody loves the underdog when it catches up to the pack and beats the pack on the finish line. ;^) I now follow Pony because of this self-provided benchmark: http://ponylang.org/benchmarks_all.pdf They are communicating a focus for a domain, a good understanding of their area, and it makes me want to give it a spin even at this early stage where I obviously can't actually use it. I am not saying Pony is good, but it makes a good case for itself IMO.
 no sugar, thanks." I know, as usual I simplify things and 
 exaggerate! He he he. But programming languages are like 
 everything else, only because something is good doesn't mean 
 that people will buy it.
Sure, but it is also important to make people take notice. People take notice of benchmark leaders. And too often benchmarks measure throughput while latency is just as important. End users don't notice peak throughput (which is measurable as a blip on the cloud server instance-count logs), they notice reduced latency. So to me latency is the most important aspect of a web-service (+ programmer productivity). I don't find Go exciting, but they show concern for latency (concurrent GC etc). Communicating that concern is good, even before they reach whatever goals they have.
 As regard compiler-based features, as soon as features are 
 compiler-based people will complain "Why is it built-in? That 
 should be handled by a library! I want more freedom!" I know 
 for sure.
Heh, not if it is getting you an edge, but if it is a second citizen addition. Yes, then I agree. Cheers!
Thanks for showing me Pony. Languages like Nim and Pony keep popping up which shows a) how important native compilation is and [...]
Which is why, after all those years, the OpenJDK will eventually support AOT compilation to native code for Java 10, with some work being done in JEP 220[0], and .NET does AOT native code on Windows Phone 8 (MDIL), with static compilation with the Visual C++ backend coming with .NET Native. And Android also went native with the Dalvik re-write. The best approach is anyway to have a JIT/AOT capable toolchain and use them according to the deployment target.
[0] Which means Oracle finally accepted why almost all commercial JVM vendors do offer such a feature. I read somewhere that JIT only was a kind of Sun political issue.
Jun 08 2015
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/15 10:17 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
 On Friday, 5 June 2015 at 13:20:27 UTC, Steven Schveighoffer wrote:
 I think I'll go with Liran's experience over your hypothetical
 anecdotes. You seem to have a lot of academic knowledge, but I'd
 rather see what actually happens. If you have that data, please share.
There is absolutely no reason to go personal.
I didn't, actually. Your arguments seem well crafted and persuasive, but I've seen so many arguments based on theory that don't always pan out. I like to see hard data. That's what Liran's experience provides. Perhaps you have it too? Please share if you do. -Steve
Jun 05 2015
parent "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 19:21:32 UTC, Steven Schveighoffer 
wrote:
 I didn't, actually. Your arguments seem well crafted and 
 persuasive, but I've seen so many arguments based on theory 
 that don't always pan out. I like to see hard data. That's what 
 Liran's experience provides. Perhaps you have it too? Please 
 share if you do.
I have absolutely no idea what you are talking about. Experience is data? Huh? If you talk about benchmarking, you do this by defining a baseline to measure up against and running a wide set of demanding workloads with increasing load until the system performance collapses, then you analyze the outcome for each workload. One usually picks a best-of-breed "competitor" as the baseline. E.g. Nginx gained traction by benchmarking against Apache. If you are talking about multi-threading/fibers/event-based systems, you read the technical optimization manuals from CPU vendors for each processor generation; they provide what you need to know when designing scheduling heuristics. The problem is how to give the scheduler meta information. In event systems that is explicit; in D you could provide information through "yield" either by profiling, analysis, or explicitly... but getting to event-based performance isn't all that easy...
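A minimal sketch of the "ramp the load until it collapses" procedure, in Python. Everything here is made up for illustration: `find_collapse`, `make_toy_system`, the capacities and the latency budget are hypothetical names and numbers, and the toy latency curve is not a measurement of any real system. A real benchmark would replace `measure` with something that drives an actual server.

```python
from typing import Callable, Iterable, Optional


def find_collapse(measure: Callable[[int], float],
                  loads: Iterable[int],
                  latency_limit: float) -> Optional[int]:
    """Return the first load level whose latency exceeds latency_limit, or None."""
    for load in loads:
        if measure(load) > latency_limit:
            return load
    return None


def make_toy_system(capacity: int) -> Callable[[int], float]:
    """Hypothetical stand-in for a real system under test."""
    def measure(load: int) -> float:
        if load >= capacity:
            return float("inf")          # saturated: the queue grows without bound
        return 1.0 / (capacity - load)   # latency rises as load nears capacity
    return measure


baseline = make_toy_system(capacity=100)   # the best-of-breed "competitor"
candidate = make_toy_system(capacity=120)  # the system being evaluated

loads = range(10, 200, 10)   # increasing load levels
limit = 0.5                  # assumed latency budget, arbitrary units

baseline_collapse = find_collapse(baseline, loads, limit)
candidate_collapse = find_collapse(candidate, loads, limit)
print("baseline collapses at", baseline_collapse)
print("candidate collapses at", candidate_collapse)
```

The same ramp would be repeated per workload, and the interesting output is the whole latency curve against the baseline, not just the collapse point.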
Jun 06 2015
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jun-2015 14:29, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
 On Thursday, 4 June 2015 at 22:28:52 UTC, Jonathan M Davis wrote:
 anyone give a reason why we need to. deadalnix talked about load
 balancing that way, but you gave good reasons as to why that didn't
 make sense,
What good reasons? By the time you get response from your shared memcache or database the x86 cache level 1 and possibly 2 is cold.
Cache arguments are hard to get right w/o experiment. That "possibly" may be enough compared to certainly cold. However, I'll answer theoretically to an equally theoretical argument. If there is affinity and we assume that the OS schedules threads on the same cores*, then each core has its cache loaded with (some of) the stacks of its fibers. If we assume sharing fibers across all cores, then each core will have to cache the stacks of all of the fibers, which is wasteful. So fiber affinity => that much less burden on each core's cache, making it that much hotter. * You seem to assume the same. A fine assumption, given that the OS usually tries to keep the same cores working on the same threads, for similar reasons I believe.
  Add to this that
 two-and-two cores share primary caches so if you don't pair tasks that
 address the same memory you lose up to 10-20% performance in addition
 to unused capacity and increased latency. Smart scheduling matters, both
 at the OS level and at the application level. That's not a controversial
 statement (only in these forums…)!
Moving fibers across threads has no effect on any of the above, even if there is some truth to it. There is simply no way to control what core executes which thread to begin with; this assignment is OS territory.
 The only good reason for not switching is that you lack
 resources/know-how.
Reasons were presented, but there is nothing in your answer that at least acknowledges that.
 But then you probably should not make it a language
 feature in the first place...?
Then it's a good chance for you to prove your design by experimentation. That is, if we all accept the concurrency issues with moving fibers that violate some language guarantees. -- Dmitry Olshansky
Jun 05 2015
next sibling parent reply "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 13:44:16 UTC, Dmitry Olshansky wrote:
 If there is affinity and we assume that OS schedules threads on 
 the same cores*  then each core has its cache loaded with 
 (some of) stacks of its fibers. If we assume sharing fibers 
 across all cores, then each core will have to cache stacks for 
 all of fibers which is wasteful.
If you cannot control affinity then you can't take advantage of hyper-threading either? I need to think of this in terms of _smart_ scheduling and adaptive load balancing.
 Moving fibers across threads have no effect on all of the above 
 even if there is some truth.
In order to get benefits from hyper-threading you need to pay close attention to how you schedule, or you should turn it off.
 There is simply no way to control what core executes which 
 thread to begin with, this assignment is the OS territory.
If your OS is does not support hyper-threading level control you should turn it off...
 The only good reason for not switching is that you lack
 resources/know-how.
Reasons were presented, but there is nothing in your answer that at least acknowledges that.
No, there were no performance related reasons, only TLS (which is a questionable feature to begin with).
 Then it's a good chance for you to prove your design by 
 experimentation. That if we all accept concurrency issues with 
 moving fibers that violate some language guarantees.
There is nothing to prove. You either perform worse or better than a carefully scheduled event-based solution in C++. You either perform worse or better than Go 1.5 in scheduling and GC. However, doing well in externally designed and executed benchmarks on _language_ _features_ is good marketing (even if that 10-20% edge does not matter in real world applications). Right now, neither concurrency nor GC are really D language features, they are more like library/runtime features. That makes it difficult to excel in those areas. In languages like Go, Erlang and Pony concurrency is a language feature.
Jun 05 2015
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jun-2015 17:04, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
 On Friday, 5 June 2015 at 13:44:16 UTC, Dmitry Olshansky wrote:
 If there is affinity and we assume that OS schedules threads on the
 same cores*  then each core has its cache loaded with (some of)
 stacks of its fibers. If we assume sharing fibers across all cores,
 then each core will have to cache stacks for all of fibers which is
 wasteful.
If you cannot control affinity then you can't take advantage of hyper-threading either?
You choose to ignore the point about duplicating the same memory in each core's cache. To me it seems like throwing random CPU technologies won't help make your argument stronger. However, I stand corrected - there are sys-calls to confine a thread to a specific subset of cores. The point about cache stands as is, since it only assumed that each thread prefers to run on the same core, as opposed to always running on the same core.
 I need to think of this in terms of _smart_
 scheduling and adaptive load balancing.
Can't help you there, especially w/o definition of the first. Adaptive load-balancing is quite possible with fibers sticking to a thread and is a question of application design.
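To make "load balancing with fibers sticking to a thread" concrete, here is a small Python sketch under loud assumptions: generators stand in for fibers, each `Scheduler` stands in for one thread's run queue, and everything runs on a single thread so the outcome is deterministic. The names `Scheduler`, `spawn` and `work` are hypothetical and not taken from any D library. The only balancing decision is made at creation time, and a fiber is never moved afterwards.

```python
from collections import deque


class Scheduler:
    """Run queue of one (simulated) thread; fibers placed here stay here."""

    def __init__(self, name):
        self.name = name
        self.ready = deque()

    def load(self):
        return len(self.ready)

    def run_once(self):
        """Resume every ready fiber once and drop the ones that finish."""
        for _ in range(len(self.ready)):
            fiber = self.ready.popleft()
            try:
                next(fiber)
                self.ready.append(fiber)
            except StopIteration:
                pass


def spawn(schedulers, fiber):
    """Place a new fiber on the currently least-loaded scheduler."""
    target = min(schedulers, key=Scheduler.load)
    target.ready.append(fiber)
    return target.name


def work(steps):
    """A fiber that yields control `steps` times before finishing."""
    for _ in range(steps):
        yield


schedulers = [Scheduler("A"), Scheduler("B")]
placements = [spawn(schedulers, work(n)) for n in (5, 1, 1, 1)]

# Two scheduling rounds: the short fibers finish, the long one on A keeps going.
for _ in range(2):
    for s in schedulers:
        s.run_once()

late_placement = spawn(schedulers, work(1))
print(placements, late_placement)
```

After the two rounds only the long-running fiber is left on A, so the late arrival lands on B: the placement adapts to the observed load without any fiber migrating.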
 Moving fibers across threads have no effect on all of the above even
 if there is some truth.
In order to get benefits from hyper-threading you need to pay close attention to how you schedule, or you should turn it off.
I bet it still helps some workloads and hurts others without "me" scheduling anything. There are some things OS can do just fine.
 There is simply no way to control what core executes which thread to
 begin with, this assignment is the OS territory.
If your OS is does not support hyper-threading level control you should turn it off...
Not sure if this is English, but I stand corrected in that one may set thread affinity for each thread manually. What I argued for is that default is mostly the same and the point stands as is.
 The only good reason for not switching is that you lack
 resources/know-how.
Reasons were presented, but there is nothing in your answer that at least acknowledges that.
No, there were no performance related reasons,
I haven't said performance. Fast and incorrect is cheap.
 only TLS (which is a
 questionable feature to begin with).
Aye, no implicit data-races by default is questionable design. What questions do you have? -- Dmitry Olshansky
Jun 05 2015
parent "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 15:06:04 UTC, Dmitry Olshansky wrote:
 You choose to ignore the point about duplicating the same 
 memory in each core's cache. To me it seems like throwing
Not sure what you mean by this. 3rd level cache is shared. Die-level cache is shared. Primary caches are small and are shared between pairs of hyper-threaded cores. If a task has been suspended for 100ms you can just assume that primary cache is cold.
 Adaptive load-balancing is quite possible with fibers sticking 
 to a thread and is a question of application design.
Then you should not have fibers at all, since an event-based solution is even faster (but more work). Coroutines are a convenience feature, not a performance feature. You need control over workload scheduling to prevent 3rd level cache pollution. Random fine-grained scheduling is not good for memory-intensive workloads because you push data out of the caches prematurely.
 I bet it still helps some workloads and hurts others without 
 "me" scheduling anything.
Hyperthreading requires two cores to run specific workloads at the same time. If not you are better off just halting that extra core. The idea with hyperthreading is that one thread fills in holes in the pipeline when the other thread is stalled.
 Not sure if this is English,
When people pick on typos the debate is essentially over... EOD
Jun 05 2015
prev sibling parent reply Dan Olson <gorox comcast.net> writes:
"Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:

 No, there were no performance related reasons, only TLS (which is a
 questionable feature to begin with).
On TLS and migrating Fibers - these were posted elsewhere, and I want to make sure that when you read about the TLS Fiber problem here, it is understood to be something that could be solved by a compiler solution. David has a good overview of the problem here: https://github.com/ldc-developers/ldc/issues/666 And a Boost discussion to show D is not alone here: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/coroutine_thread.html
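For readers who have not run into it, the user-visible half of the problem can be shown with a toy in Python, assuming a generator as a stand-in for a fiber and `threading.local` as a stand-in for TLS. This is only an analogy: it shows that thread-local state silently changes under a fiber that is resumed on another thread, and it does not reproduce the compiled-code side that the linked tickets discuss.

```python
import threading

tls = threading.local()


def fiber():
    # Runs on whichever thread performs the first resume.
    tls.owner = threading.current_thread().name
    before = tls.owner
    yield                                   # suspend; may be resumed elsewhere
    # After migration this reads the *new* thread's slot, which was never set.
    after = getattr(tls, "owner", None)
    yield before, after


def resume(gen, out):
    out.append(next(gen))


f = fiber()
seen = []

t1 = threading.Thread(target=resume, args=(f, seen), name="T1")
t1.start()
t1.join()

t2 = threading.Thread(target=resume, args=(f, seen), name="T2")
t2.start()
t2.join()

print(seen[1])
```

The fiber wrote its thread-local value on T1 and finds nothing there once it continues on T2, even though from the fiber's point of view it is one straight-line function.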
Jun 05 2015
parent "Ola Fosheim Grøstad" writes:
On Friday, 5 June 2015 at 15:18:59 UTC, Dan Olson wrote:
 On TLS and migrating Fibers - these were posted elsewhere, and want to
 make sure that when you read TLS Fiber problem here, it is understood to
 be something that could be solved by a compiler solution.
What I meant is that I don't have a use case for TLS in my own programs. I think TLS is primarily useful for runtime-level issues like thread local allocators. I either read from global immutables or use lock-free datastructures for sharing...
Jun 05 2015
prev sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 05/06/15 16:44, Dmitry Olshansky wrote:
 * You seem to assume the same. Fine assumption given that OS usually
 tries to keep the same cores working on the same threads, for the
 similar reasons I believe.
I see that people already raised the point that the OS does allow you to pin a thread to specific cores, so let's skip repeating that.
AFAIK, the reason the kernel tries to keep threads running on the same core they ran on before is that moving them requires so much locking, synchronous assembly instructions and barriers, resulting in huge costs for migrating threads between cores. Which turns out to be relevant to this discussion, because that will, likely, also be required in order to move fibers between threads.
A while back, a friend and I ran an (incomplete) research project where we tried reverting to the long discarded "one thread per socket" model. It actually performed really well (much much better than the "common wisdom" would have it perform), provided you did two things:
1. Use a thread pool. Do not actually spawn a new thread each time a new incoming connection arrives, and
2. pin that thread to a core, don't let it migrate.
Since we are talking about several tens of thousands of threads, each random fluctuation in the load resulted in the kernel's scheduler wishing to migrate them, resulting in losing thousands of percent worth of performance. Once we locked the threads into place, we were, more or less, on par with micro-threading in terms of overall performance the server could take.
Shachar
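The two conditions (a fixed pool, each thread pinned to one core) can be sketched in a few lines of Python. Assumptions, stated loudly: this is Linux-only as far as the pinning goes (`os.sched_setaffinity` with pid 0 applies to the calling thread there, and the code falls back to no pinning elsewhere), a plain job queue stands in for incoming connections, and `pinned_worker` and `run_pool` are hypothetical names. It is not the code from the experiment described above.

```python
import os
import queue
import threading


def pinned_worker(cpu, jobs, results):
    """Pin the calling thread to one CPU, then serve jobs until told to stop."""
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {cpu})   # pid 0: the calling thread on Linux
        except OSError:
            pass                             # CPU not available; run unpinned
    while True:
        job = jobs.get()
        if job is None:                      # sentinel: shut this worker down
            break
        results.put(job())


def run_pool(tasks, n_workers=4):
    """Run callables on a fixed pool of pinned threads; return sorted results."""
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))   # CPUs we are allowed to use
    else:
        cpus = [0]
    jobs = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=pinned_worker,
                         args=(cpus[i % len(cpus)], jobs, results))
        for i in range(n_workers)
    ]
    for w in workers:
        w.start()
    for task in tasks:
        jobs.put(task)
    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()
    return sorted(results.get() for _ in tasks)


squares = run_pool([lambda i=i: i * i for i in range(8)], n_workers=4)
print(squares)
```

The threads are created once and reused, and each one sets its affinity before touching any work, so the kernel never has a reason to move it.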
Jun 06 2015
parent "Ola Fosheim Grøstad" writes:
On Saturday, 6 June 2015 at 18:49:30 UTC, Shachar Shemesh wrote:
 Since we are talking about several tens of thousands of 
 threads, each random fluctuation in the load resulted in the
Using an unlikely workload that the kernel has not been designed and optimized for is in general a bad idea. Especially on a generic scheduler that has no knowledge of the nature of the workload and therefore is (or should be) designed to avoid worst case starvation scenarios.
Jun 07 2015
prev sibling parent reply "Dicebot" <public dicebot.lv> writes:
For the record : I am fully with Liran on this case.
Jun 04 2015
parent "Paolo Invernizzi" <paolo.invernizzi no.address> writes:
On Friday, 5 June 2015 at 06:03:13 UTC, Dicebot wrote:
 For the record : I am fully with Liran on this case.
+1 also for me. At work we are using fibers when appropriate, and I see no advantages in moving them. /P
Jun 04 2015