
digitalmars.D - Wait-free thread communication

reply Jin <nin-jin ya.ru> writes:
Idea: no mutex, no CAS, only one direction queues without any 
locks.

My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive

---
writers =512
readers =1
std.concurency milliseconds=1313
jin.go milliseconds=113
---

Realization:

Queue! - a buffered, statically typed channel. One and only one thread 
can push data to a queue, and one and only one can take data from 
it. A thread blocks when it takes from an empty queue or pushes to 
a full queue.

Queues! - list of queues with round robin balancing

go! - starts a new thread; in the future I want to implement something 
like goroutines, but I do not understand how yet.
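Since the post describes `go!` only in words, here is a hedged sketch of what a thread-per-call `go!` might look like (hypothetical code, not the repo's actual implementation; swapping `new Thread` for a fiber scheduler would be the goroutine-like step mentioned above):

```d
import core.atomic;
import core.thread;

// Hypothetical `go!`: run a function in its own native thread and
// return the Thread handle so the caller can join it.
Thread go( alias fn, Args... )( Args args )
{
    auto t = new Thread( { fn( args ); } );
    t.start();
    return t;
}

unittest
{
    shared int hits;
    static void job( shared int* counter )
    {
        atomicOp!"+="( *counter, 1 );
    }
    go!job( &hits ).join();
    assert( atomicLoad( hits ) == 1 );
}
```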

I need some good advice, I am a newbie in D :-)

Problems:

For blocking a thread I use a loop with Thread.sleep - this is a bad 
decision IMHO.

On my system I can create only up to 1000 threads without errors. 
Fibers from core.thread or Tasks from std.parallelism could potentially 
resolve this problem, but how do I integrate them gracefully?

Could the API of jin-go be better?

Which dub modules can help me?
Jan 08
next sibling parent reply Nordlöw <per.nordlow gmail.com> writes:
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
 Idea: no mutex, no CAS, only one direction queues without any 
 locks.

 My prototype (https://github.com/nin-jin/go.d) is up to 10x 
 faster than std.concurrency.send/receive
Very interesting. D has built-in unittests. You should add stress unittests to assure that your logic is correct. You can start by searching for the keyword `unittest` in the std.concurrency module for advice on how to do this.
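For example, a stress unittest along these lines (shown here against std.concurrency itself as a stand-in for the queue under test) checks that no messages are lost or reordered:

```d
import std.concurrency;

enum N = 10_000;

void writer( Tid owner )
{
    // flood the owner with a known sequence
    foreach( i ; 0 .. N )
        owner.send( i );
}

unittest
{
    auto tid = spawn( &writer, thisTid );
    // the reader must see every message, in order
    foreach( i ; 0 .. N )
        assert( receiveOnly!int == i );
}
```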
Jan 08
parent Jin <nin-jin ya.ru> writes:
On Friday, 8 January 2016 at 18:07:49 UTC, Nordlöw wrote:
 On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
 Idea: no mutex, no CAS, only one direction queues without any 
 locks.

 My prototype (https://github.com/nin-jin/go.d) is up to 10x 
 faster than std.concurrency.send/receive
Very interesting. D has built-in unittests. You should add stress unittests to assure that your logic is correct. You can start by searching for the keyword `unittest` in the std.concurrency module for advice on how to do this.
I just added unit tests. But how do I idiomatically implement benchmarks (they must be compiled in release mode)? Currently I implement a main function in app.d and run it with "dub --build=release", but then nobody can import my module.
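One idiomatic option is to keep the package a library and add a separate dub configuration for the benchmark executable. A sketch (the file paths and configuration names here are assumptions, not taken from the repo):

```json
{
    "name": "jin-go",
    "configurations": [
        {
            "name": "library",
            "targetType": "library"
        },
        {
            "name": "benchmark",
            "targetType": "executable",
            "mainSourceFile": "bench/app.d",
            "buildOptions": ["releaseMode", "optimize", "inline"]
        }
    ]
}
```

With this layout, `dub build` (first configuration) produces the importable library, while `dub run --config=benchmark --build=release` builds and runs the benchmark.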
Jan 08
prev sibling next sibling parent reply thedeemon <dlang thedeemon.com> writes:
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
 Idea: no mutex, no CAS, only one direction queues without any 
 locks.

 My prototype (https://github.com/nin-jin/go.d) is up to 10x 
 faster than std.concurrency.send/receive
Sorry, that's not looking any good. You call it wait-free when in fact it's just the opposite: if a queue buffer is full on push, it just waits in Thread.sleep, which is
1) not wait-free at all
2) very heavy (a call to the kernel, a context switch)

And when the buffer is not full/empty, it works as a simple single-threaded queue, which means it's fine for use with fibers inside one thread but will not work correctly in a multithreaded setting.

Your benchmarks show a time difference in your favor just because you compare very different things: your queue is benchmarked in a single thread with fibers, while std.concurrency is measured with multiple threads communicating with each other. Doing many switches between threads is much slower than switching between fibers in one thread, hence the time difference. It's not that your queue is any good, it's just that you measure it wrong.
Jan 09
parent reply Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
 Your benchmarks show time difference in your favor just because 
 you compare very different things: your queue is benchmarked in 
 single thread with fibers while std.concurrency is measured 
 with multiple threads communicating with each other. Doing many 
 switches between threads is much slower than switching between 
 fibers in one thread, hence the time difference. It's not that 
 your queue is any good, it's just you measure it wrong.
No, jin.go creates a new native thread on every call. And this is the problem :-) We cannot create thousands of threads without errors. std.concurrency uses a mutex to synchronize access to the message queue, and the cost of synchronization is proportional to the number of threads.

On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
 You call it wait-free when in fact it's just the opposite: if a 
 queue buffer is full on push  it just waits in Thread.sleep 
 which is
 1) not wait-free at all
 2) very heavy (call to kernel, context switch)
 And when buffer is not full/empty it works as a simple 
 single-threaded queue, which means it's fine for using with 
 fibers inside one thread but will not work correctly in 
 multithreaded setting.
Taking data from an empty queue and pushing data to a full queue is logically impossible for queues. We have 3 strategies for these cases:

1. Throwing a runtime exception, like std.concurrency.send on a full message box by default.
2. Blocking the thread, like std.concurrency.receive on an empty message box.
3. Skipping the action.

jin-go uses the blocking strategy by default, and you can check queue.empty and queue.full if you want to implement another strategy.
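For comparison, std.concurrency already exposes exactly these three strategies for a full mailbox through its OnCrowding enum, which could serve as a model for the jin-go API. A small runnable sketch:

```d
import std.concurrency;

void main()
{
    // worker receives one int and echoes it back, incremented
    auto tid = spawn( ( Tid owner ) { owner.send( receiveOnly!int + 1 ); },
                      thisTid );

    // the three crowding policies, matching the strategies above:
    setMaxMailboxSize( tid, 64, OnCrowding.throwException ); // 1. throw
    setMaxMailboxSize( tid, 64, OnCrowding.block );          // 2. block (jin-go's default)
    setMaxMailboxSize( tid, 64, OnCrowding.ignore );         // 3. skip the action

    tid.send( 41 );
    assert( receiveOnly!int == 42 );
}
```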
Jan 09
parent reply thedeemon <dlang thedeemon.com> writes:
On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:

 No, jin.go creates new native thread on every call. And this is 
 problem :-) We can not create thousand of threads without 
 errors.
Ah, sorry, I misread the source. FiberScheduler got me distracted. Why is it there?
Jan 09
parent Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 13:55:16 UTC, thedeemon wrote:
 On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:

 No, jin.go creates new native thread on every call. And this 
 is problem :-) We can not create thousand of threads without 
 errors.
Ah, sorry, I misread the source. FiberScheduler got me distracted. Why is it there?
This is an artifact of my experiments with fibers :-)
Jan 09
prev sibling next sibling parent reply Andy Smith <andyrsmith googlemail.com> writes:
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
 Idea: no mutex, no CAS, only one direction queues without any 
 locks.

 My prototype (https://github.com/nin-jin/go.d) is up to 10x 
 faster than std.concurrency.send/receive

 ---
 writers =512
 readers =1
 std.concurency milliseconds=1313
 jin.go milliseconds=113
 ---

 Realization:

 Queue! - buffered static typed channel. One and only one thread 
 can push data to queue, and one and only one can take data from 
 queue. Thread blocks when he takes from empty queue and pushes 
 to fulfilled queue.

 Queues! - list of queues with round robin balancing

 go! - starts new thread, in future i want to implement 
 something like goroutine, but i do not understand how yet.

 I need some good advice, i am newbie in D :-)

 Problems:

 For blocking thread i use loop with Thread.sleep - this is bad 
 decision IMHO.

 On my system i can create only up to 1000 threads without 
 errors. Fibers from core.thread or Tasks from std.parallelism 
 potentially can resolve this problem, but how to integrate 
 them gracefully.

 API of jin-go can be better?

 Which dub modules can help me?
I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working, but how are you ensuring no compiler/CPU reordering is occurring? Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...

Cheers, A.
Jan 09
next sibling parent reply Ola Fosheim Grøstad writes:
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
 I'm a little worried you have no volatile writes or fences 
 around your code when you 'publish' an event using head/tail 
 etc. It looks like it's working but how are you ensuring no 
 compiler/CPU reordering is ocurring. Does x86_64 actually allow 
 you to get away with this? I know its memory model is stricter 
 than others...
But not on ARM, so he should use atomic acquire-release semantics on indices for push/pull. I suggest porting over spsc_queue from Boost.
Jan 09
parent reply David Nadlinger <code klickverbot.at> writes:
On Saturday, 9 January 2016 at 15:16:43 UTC, Ola Fosheim Grøstad 
wrote:
 On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
 I'm a little worried you have no volatile writes or fences 
 around your code when you 'publish' an event using head/tail 
 etc. It looks like it's working but how are you ensuring no 
 compiler/CPU reordering is ocurring. Does x86_64 actually 
 allow you to get away with this? I know its memory model is 
 stricter than others...
But not on ARM, so he should use atomic acquire-release semantics on indices for push/pull.
Not only that. It's a problem on x86 as well, because advanced optimizers like those in GDC or LDC will happily assume that the members are not written to by another thread (because there would be a race otherwise) and cache the loads or even eliminate some stores. Any guarantees the processor would offer are irrelevant if the compiler has already reordered your operations.

It should be quite easy to see such effects. Just compile a simple test case on GDC or LDC with optimizations on.
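A minimal illustration of this failure mode (my own example, not code from the thread): with a plain non-shared flag, an optimizing compiler is entitled to hoist the load out of the loop and spin forever; marking the flag shared and using atomicLoad/atomicStore restores cross-thread visibility.

```d
import core.atomic;
import core.thread;

shared bool done;

void worker()
{
    // with a plain bool and no atomicLoad, LDC/GDC at -O may hoist
    // this load out of the loop and never observe the store below
    while( !atomicLoad( done ) )
        Thread.yield();
}

void main()
{
    auto t = new Thread( &worker );
    t.start();
    atomicStore( done, true ); // publishes the flag to the worker
    t.join();
    assert( atomicLoad( done ) );
}
```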
 I suggest porting over spsc_queue from Boost.
That would certainly be a starting point, although a high-performance single-producer single-consumer queue is trivial to implement once you understand atomics. Then again, I couldn't convince Andrei that a well-defined memory model (in which it even makes sense to talk about atomics in the first place) is important for D yet. Right now, you basically have to hope that everything works as in C++ – which is not a bad bet on GDC and LDC, and the weaker optimizer in DMD hides a lot of potential issues there – and stay away from things like `consume` loads that would depend on language semantics. — David
Jan 09
parent reply Ola Fosheim Grøstad writes:
On Saturday, 9 January 2016 at 17:42:41 UTC, David Nadlinger 
wrote:
 Not only that. It's a problem on x86 as well because advanced 
 optimizers like those in GDC or LDC will happily assume that 
 the members are not written to by another thread (because there 
 would be a race otherwise) and cache the loads or even 
 eliminate some stores. Any guarantees the processor would offer 
 are irrelevant if the compiler has already reordered your 
 operations. It should be quite easy to see such effects. Just 
 compile a simple test case on GDC or LDC with optimizations on.
Yes, well, he had a sleep() in there that might prevent the compiler from moving instructions etc, but even if so... it is a bad thing to rely on. Explicit atomics document what is going on (even when not technically needed), and should always be used where you want atomic operations, I think.
 That would certainly be a starting point, although a 
 high-performance single-producer single-consumer queue is 
 trivial to implement once you understand atomics.
Yes. But my experience from writing custom multi-single queues is that it can end up harder than it looks to get it working and efficient. Don't be surprised if it takes 5-10x more time than anticipated to get it 100% right... So why not start with something that is correct? You still have to spend quite a bit of time to verify that the code is identical... but that is much easier than to "formally" prove that your own implementation is correct. (Intuition is often wrong in this area...)
Jan 09
parent David Nadlinger <code klickverbot.at> writes:
On Saturday, 9 January 2016 at 22:44:53 UTC, Ola Fosheim Grøstad 
wrote:
 Yes. But my experience from writing custom multi-single queues 
 is that it can end up harder than it looks to get it working 
 and efficient. […] (Intuition is often wrong in this area...)
I wholeheartedly agree with that statement. However, if one doesn't understand atomics well enough to correctly implement even a simple single-single queue, I'm not sure whether one would be able to correctly port an existing implementation either. Then again, the primitives might be similar enough between C++11 and D so that a literal translation is possible without understanding what is going on. — David
Jan 10
prev sibling next sibling parent Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
 I'm a little worried you have no volatile writes or fences 
 around your code when you 'publish' an event using head/tail 
 etc. It looks like it's working but how are you ensuring no 
 compiler/CPU reordering is ocurring. Does x86_64 actually allow 
 you to get away with this? I know its memory model is stricter 
 than others...
This works fine in my tests, but how do I enforce a memory barrier in D?
Jan 09
prev sibling parent reply Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
 I'm a little worried you have no volatile writes or fences 
 around your code when you 'publish' an event using head/tail 
 etc. It looks like it's working but how are you ensuring no 
 compiler/CPU reordering is ocurring. Does x86_64 actually allow 
 you to get away with this? I know its memory model is stricter 
 than others...
I just added an atomic fence to push and take:

    this.messages[ this.tail ] = value;
    atomicFence;
    this.tail = ( this.tail + 1 ) % this.size;
Jan 09
next sibling parent reply Ola Fosheim Grøstad writes:
On Saturday, 9 January 2016 at 15:51:51 UTC, Jin wrote:
 On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
 I'm a little worried you have no volatile writes or fences 
 around your code when you 'publish' an event using head/tail 
 etc. It looks like it's working but how are you ensuring no 
 compiler/CPU reordering is ocurring. Does x86_64 actually 
 allow you to get away with this? I know its memory model is 
 stricter than others...
 I just added an atomic fence to push and take:

     this.messages[ this.tail ] = value;
     atomicFence;
     this.tail = ( this.tail + 1 ) % this.size;
You need it in the tests. I haven't used atomics in D, but you have

    atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const shared T val)
    atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref shared T val, V1 newval)

The compiler should then be able to ignore it if the CPU handles it well enough, assuming the D implementation works. Something along the lines of:

    immutable i = atomicLoad!(MemoryOrder.raw)(tail);
    immutable n = (i+1)%size;
    if( n == atomicLoad!(MemoryOrder.acq)(head) ) return QUEUE_FULL;
    buffer[i] = ...;
    atomicStore!(MemoryOrder.rel)(tail, n);

Or something like that.
Jan 09
parent reply Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad 
wrote:
 You need it in the tests.
If memory writes were reordered, the current tests would fail. Memory reads can happen in any order.

On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad wrote:
 I haven't used atomics in D, but you have

   atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const 
 shared T val)
   atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref 
 shared T val, V1 newval)
I do not understand how to use them correctly, so I simply use atomicFence. Performance does not degrade.
Jan 09
next sibling parent reply Ola Fosheim Grøstad writes:
On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:
 On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim 
 Grøstad wrote:
 You need it in the tests.
If memory writes were reordered, the current tests would fail. Memory reads can happen in any order.
They have to be atomic if you want your code to be portable.
Jan 09
parent reply Jin <nin-jin ya.ru> writes:
On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim Grøstad 
wrote:
 They have to be atomic if you want your code to be portable.
I do not want that yet :-)
Jan 09
parent reply Andy Smith <andyrsmith googlemail.com> writes:
On Saturday, 9 January 2016 at 17:24:47 UTC, Jin wrote:
 On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim 
 Grøstad wrote:
 They have to be atomic if you want your code to be portable.
 I do not want that yet :-)
Okay, I've just refreshed my mental x86/64 memory model :-) What you've done should be okay from a CPU-reordering perspective. On x86/64, loads are not reordered with other loads, and stores are not reordered with other stores. So from a CPU perspective you should be okay on x86/64.

There's still a grey area for me around possible *compiler* re-orderings... there's nothing to stop the compiler reordering those two lines with respect to each other, because from an 'as-if-sequential' perspective those two operations are commutative. Unfortunately I'm not well versed enough in the internals of the three main compiler versions to give an opinion on whether this will be a problem or not :-(

If your version is working for you on your platform/compiler, I'd recommend maybe being a bit conservative and wrapping your code in appropriate version blocks until you know the answer to these questions for the other platforms/compilers (ARM, GDC, LDC etc.). And put an appropriate explanation why that's the case in your README.md. I think that's just a basic courtesy to other users, so that you don't let them walk over a minefield without fair warning :-)

Cheers, A.
Jan 09
parent reply David Nadlinger <code klickverbot.at> writes:
On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
 Unfortunately I'm not well versed enough on the internals of 
 the three main compiler versions to give an opinion on whether 
 this will be a problem or not :-(
It might work on DMD, but I know from experience (one-to-many queues, i.e. single-consumer-multi-producer or the other way round) that the LDC optimizer will ruthlessly break code written in that "I know that it would work in assembly, let's just hope the compiler doesn't touch it" style. — David
Jan 09
parent Andy Smith <andyrsmith googlemail.com> writes:
On Saturday, 9 January 2016 at 17:49:46 UTC, David Nadlinger 
wrote:
 On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
 Unfortunately I'm not well versed enough on the internals of 
 the three main compiler versions to give an opinion on whether 
 this will be a problem or not :-(
It might work on DMD, but I know from experience (one-to-many queues, i.e. single-consumer-multi-producer or the other way round) that the LDC optimizer will ruthlessly break code written in that "I know that it would work in assembly, let's just hope the compiler doesn't touch it" style. — David
Ah Cheers David... I think our messages crossed :-) Cheers, A.
Jan 09
prev sibling parent David Nadlinger <code klickverbot.at> writes:
On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:
 So i simple use atomicFence. Performance does not degrade.
Either you are not calling it in the way you think you are, then, or your code is otherwise very unoptimized. You can definitely measure the impact of a full mfence on otherwise tight lock-free code. — David
Jan 09
prev sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 01/09/2016 04:51 PM, Jin wrote:
 I just add atomic fence for push and take:
 
         this.messages[ this.tail ] = value;
         atomicFence;
         this.tail = ( this.tail + 1 ) % this.size;
Don't do this, memory fences are expensive. This is what you need for a spsc queue. https://github.com/MartinNowak/lock-free/commit/233739262c14e00866a60f4a6a86c1b979ac968b
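For reference, the shape of that acquire/release approach looks roughly like this (a simplified sketch, not Martin's actual code; the real implementation also pads the indices onto separate cache lines, and in real use the instance must be made visible to both threads, which is glossed over here):

```d
import core.atomic;

// Simplified SPSC ring buffer: exactly one producer thread calls
// tryPut and exactly one consumer thread calls tryGet, so no locks
// or full fences are needed, only acquire/release on the indices.
struct SpscQueue( T, size_t size = 1024 )
{
    private T[ size ] buffer;
    private shared size_t head; // consumer position
    private shared size_t tail; // producer position

    bool tryPut( T value ) // producer thread only
    {
        immutable t = atomicLoad!( MemoryOrder.raw )( tail );
        immutable n = ( t + 1 ) % size;
        if( n == atomicLoad!( MemoryOrder.acq )( head ) )
            return false; // full (one slot is kept free)
        buffer[ t ] = value;
        atomicStore!( MemoryOrder.rel )( tail, n ); // publish the write
        return true;
    }

    bool tryGet( out T value ) // consumer thread only
    {
        immutable h = atomicLoad!( MemoryOrder.raw )( head );
        if( h == atomicLoad!( MemoryOrder.acq )( tail ) )
            return false; // empty
        value = buffer[ h ];
        atomicStore!( MemoryOrder.rel )( head, ( h + 1 ) % size ); // free the slot
        return true;
    }
}

unittest
{
    SpscQueue!( int, 4 ) q; // size 4 means capacity 3
    int v;
    assert( !q.tryGet( v ) );
    assert( q.tryPut( 1 ) && q.tryPut( 2 ) && q.tryPut( 3 ) );
    assert( !q.tryPut( 4 ) );
    assert( q.tryGet( v ) && v == 1 );
}
```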
Jan 10
prev sibling parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 01/08/2016 05:58 PM, Jin wrote:
 Idea: no mutex, no CAS, only one direction queues without any locks.
 
 My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than
 std.concurrency.send/receive
Yes, single-reader single-writer queues are the fastest way for inter-thread communication. You might have a look at my [lock-free](http://code.dlang.org/packages/lock-free) package for a correct implementation.
 Problems:
 
 For blocking thread i use loop with Thread.sleep - this is bad decision
 IMHO.
Have a look at this exponential backoff implementation for my GC spinlock PR. https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34 In general you need some sort of configurable or adaptive backoff or you'll waste too much time context switching. -Martin
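A sketch of an adaptive backoff in the spirit described above (my own illustration, not the code from the linked PR): spin first, which is cheapest when the wait is short, then yield, then sleep for exponentially growing, capped intervals.

```d
import core.thread : Thread;
import core.time : usecs;
import std.algorithm.comparison : min;

// Call repeatedly while the awaited condition is false; escalates
// from busy spinning to yielding to exponentially longer sleeps.
void backoff( ref size_t attempt )
{
    if( attempt < 16 )
    {
        // busy spin: no system call; on x86 a pause hint would go here
    }
    else if( attempt < 32 )
    {
        Thread.yield(); // give up the time slice, stay runnable
    }
    else
    {
        // exponential sleep, capped at 1024 microseconds
        Thread.sleep( usecs( 1L << min( attempt - 32, 10 ) ) );
    }
    ++attempt;
}

// usage: size_t tries; while( queue.full ) backoff( tries );
```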
Jan 10
parent reply Jin <nin-jin ya.ru> writes:
On Sunday, 10 January 2016 at 21:25:34 UTC, Martin Nowak wrote:
 For blocking thread i use loop with Thread.sleep - this is bad 
 decision IMHO.
Have a look at this exponential backoff implementation for my GC spinlock PR. https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34 In general you need some sort of configurable or adaptive backoff or you'll waste too much time context switching.
I am using Waiter for this https://github.com/nin-jin/go.d/blob/master/source/jin/go.d#L171
Jan 13
parent reply Jin <nin-jin ya.ru> writes:
I just used these queues to implement a Go-like API for concurrency 
(coroutines+channels): 
http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
Mar 27
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 27-Mar-2016 21:23, Jin wrote:
 I just use this queues to implement Go like api for concurrency
 (coroutines+channels):
 http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
If nothing changed implementation-wise this is just data-racy queues :) -- Dmitry Olshansky
Mar 28
parent reply Jin <nin-jin ya.ru> writes:
On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:
 On 27-Mar-2016 21:23, Jin wrote:
 I just use this queues to implement Go like api for concurrency
 (coroutines+channels):
 http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
If nothing changed implementation-wise this is just data-racy queues :)
Why?
Mar 28
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28-Mar-2016 20:03, Jin wrote:
 On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:
 On 27-Mar-2016 21:23, Jin wrote:
 I just use this queues to implement Go like api for concurrency
 (coroutines+channels):
 http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
If nothing changed implementation-wise this is just data-racy queues :)
Why?
All I see is a ring buffer with a hand-wavy atomicFence on one of the mutating operations. popFront is not protected at all. Also, force-yielding a thread is not a sane synchronization technique.

Overall, I suggest not labeling this as "wait free" code, as it's waaay far from what it takes to get that. -- Dmitry Olshansky
Mar 28
parent Jin <nin-jin ya.ru> writes:
On Monday, 28 March 2016 at 17:16:14 UTC, Dmitry Olshansky wrote:
 If nothing changed implementation-wise this is just data-racy 
 queues :)
Why?
All I see is a ring buffer with hand-wavy atomicFence on one of mutating operations. popFront is not protected at all.
popFront does not need protection. It is atomic for its single consumer.
 Also force yielding a thread is not a sane synchronization 
 technique.
Here the fibers are used.
 Over all - I suggest to not label this as "wait free" code as 
 it's waaay far from what it takes to get that.
Each operation (clear, front, popFront, full, put) has a fixed number of steps if you check for clear before accessing front and check for full before put. If you do not check this, you will be blocked, of course. What do you expect? An exception? Ignoring the operation?
Mar 28