
digitalmars.D - Poor memory allocation performance with a lot of threads on 36 core

reply Witek <witold.baryluk+dlang gmail.com> writes:
Hi,

I was playing with std.parallelism, and implemented a parallel 
NQueens code to count number of solutions to the classical 
NQueens problem. I do limited parallel recursion using 
taskPool.parallel foreach, and then switch at some level to 
serial algorithm. I use return values / atomic variables to count 
number of iterations / solutions, and then propagate that up 
using return values and aggregate. (I might switch later to the 
taskPool.reduce, with custom opAdd on some struct, or use 
taskPool.workerLocalStorage which I think is awesome concept).
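For readers who want a concrete picture, here is a minimal sketch of the 
approach described above - not the actual code from this post; isSafe, the 
exact array handling, and the one parallel level are my own illustrative 
choices:

```d
import core.atomic : atomicOp;
import std.array : array;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

shared long solutions;

// A previously placed queen conflicts with the new one on a diagonal
// iff their row distance equals their column distance. Rows are unique
// by construction, since each row is taken from the "available" set.
bool isSafe(const int[] partial, int row)
{
    foreach (col, placed; partial)
    {
        immutable dist = cast(int)(partial.length - col);
        if (placed - row == dist || row - placed == dist)
            return false;
    }
    return true;
}

// Serial part: copy-and-append partialSolution, copy-minus-one
// availableRows -- exactly the allocation pattern discussed in this thread.
long countSerial(int n, int[] partial, int[] available)
{
    if (partial.length == n)
        return 1;
    long total = 0;
    foreach (i, row; available)
        if (isSafe(partial, row))
            total += countSerial(n, partial ~ row,
                                 available[0 .. i] ~ available[i + 1 .. $]);
    return total;
}

void main()
{
    enum n = 8;
    auto rows = iota(n).array;
    // Parallel over the first recursion level only; deeper levels are serial.
    foreach (row; taskPool.parallel(rows))
        atomicOp!"+="(solutions,
                      countSerial(n, [row], rows[0 .. row] ~ rows[row + 1 .. $]));
    writeln(solutions); // 92 solutions for n = 8
}
```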

Anyhow, everything was good on 1 or 2 threads, or maybe a few 
more, on my laptop with an old Dual Core CPU. I was able to speed 
it up by exactly a factor of 2x.

I wanted to try it out on a bigger machine, so I used an Amazon 
AWS EC2 c4.8xlarge instance with 18 cores and 36 
hyperthreads/vCPUs, and the results were pretty terrible.

I was hardly able to utilize more than 3% of each CPU, and the 
program actually ran significantly slower, which was surprising 
given that there was no synchronization, false sharing, or overly 
fine-grained parallelism in the program.

strace -f was showing tons of futex operations, but I know that 
the std.parallelism foreach implementation doesn't use 
synchronization in the main loop, just some atomic variable reads 
in case another thread did a break or threw an exception. It 
couldn't be the bottleneck.

I traced the problem to the array allocations in my 
implementation - because I only want to iterate over remaining 
possible rows, I keep int[] partialSolution and int[] 
availableRows, with invariant that intersection is empty, and the 
union contains integers 0 to N-1. (N = size of the problem).

Because partialSolution grows by 1 on each level of recursion, I 
just copy it and append a single element, then pass it 
recursively (either to a different thread or just a function 
call, depending on how deep in the recursion we already are). It 
could be solved by using a singly-linked list with a common tail 
for all partialSolutions in different branches, but then the list 
would be reversed and I would lose random access, which is a 
little bit annoying (but shouldn't hurt performance, as I am 
probably going to scan the entire partialSolution array anyway). 
And I would still need to allocate a new list node (which, 
granted, could be sped up using a thread-local free-list).
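The shared-tail list idea from the paragraph above could look roughly like 
this (an illustrative sketch; note that each push still performs one GC 
allocation per recursion level, as the post points out):

```d
// Reversed singly-linked list: the head is the newest queen and the tail
// is shared by all sibling branches of the recursion tree.
struct Node
{
    int row;
    Node* tail; // common tail, shared rather than copied
}

Node* push(Node* tail, int row)
{
    return new Node(row, tail); // one GC allocation per level
}

void example()
{
    Node* s1 = push(null, 3);
    auto s2a = push(s1, 5); // two branches extending the same partial solution
    auto s2b = push(s1, 6);
    assert(s2a.tail is s2b.tail); // the tail storage is shared
}
```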

An even bigger problem is availableRows, from which I need to 
remove some elements, which equates to allocating a new array and 
copying all elements but one. This cannot be done using a list. 
And a COW tree would be too expensive, and would still require 
some allocations and possibly rebalancing, etc.

I found that this is indeed the problem, because if I allocate 
int[64] x = void; on the stack, and then use 
std.internal.scopebuffer's ScopeBuffer!int(&x) newAvailableRows; 
(which is safe, because I wait for threads to finish before I 
exit the scope, and threads only read from these arrays before 
making their own copies), I am able to run my nqueens 
implementation at full system utilization (using 36 hyperthreads 
at 100% each, for a speed-up of about 21x), and it is able to 
solve N=18 in 7 seconds (compared to about 2.5 minutes), with the 
parallel part going up to level 8 of the recursion (to improve 
load balancing).
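For reference, the ScopeBuffer pattern described here looks roughly like 
this (a sketch of my own; std.internal.scopebuffer is an internal, 
undocumented Phobos module, so its API may change):

```d
import std.internal.scopebuffer : ScopeBuffer;

void recurse(scope const int[] rows)
{
    // placeholder for the next level of the search
}

void withoutRow(const int[] available, int removedRow)
{
    int[64] x = void;                    // stack storage, as in the post
    auto newAvailableRows = ScopeBuffer!int(x);
    scope(exit) newAvailableRows.free(); // only frees if it outgrew x

    foreach (row; available)
        if (row != removedRow)
            newAvailableRows.put(row);

    // Assumes removedRow occurs exactly once in available.
    assert(newAvailableRows.length == available.length - 1);

    // The slice points into the stack buffer, so it must not outlive this
    // scope -- the post waits for child tasks before returning.
    recurse(newAvailableRows[]);
}
```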

So, the question is, why is the D / DMD allocator so slow under 
heavy multithreading? The working set is pretty small (a few 
megabytes at most), so I do not think this is an issue with GC 
scanning itself. Can I plug in tcmalloc / jemalloc to be used as 
the underlying allocator, instead of using glibc? Or is the D 
runtime using mmap/sbrk/etc. directly?

Thanks.
Feb 18 2016
next sibling parent reply Dicebot <public dicebot.lv> writes:
On 02/18/2016 02:00 PM, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under heavy
 multithreading? The working set is pretty small (few megabytes at most),
 so I do not think this is an issue with GC scanning itself.  Can I
 plug-in tcmalloc / jemalloc, to be used as the underlying allocator,
 instead of using glibc? Or is D runtime using mmap/srbk/etc directly?
DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
Feb 18 2016
parent Witek <witold.baryluk+dlang gmail.com> writes:
On Thursday, 18 February 2016 at 13:10:28 UTC, Dicebot wrote:
 On 02/18/2016 02:00 PM, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be 
 used as the underlying allocator, instead of using glibc? Or 
 is D runtime using mmap/srbk/etc directly?
DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
The working set is pretty small, and I do not think GC collection is triggered at all. I think the allocations themselves are slow / not scalable to multiple threads. I am going to check some GC stats, or disable GC collections completely, add some explicit 'scope(exit) delete ...;' where necessary, and see if anything changes.
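Separating the cost of collections from the cost of allocation itself can 
be done directly from the program; a minimal sketch of that experiment:

```d
import core.memory : GC;

void main()
{
    GC.disable();            // allocations still work; collections are deferred
    scope(exit) GC.enable();

    // Stand-in for the allocation-heavy workload. If throughput is still
    // poor with collections off, the bottleneck is the allocation path
    // itself (e.g. lock contention), not GC scanning.
    int[][] a;
    foreach (i; 0 .. 1000)
        a ~= new int[](64);
    assert(a.length == 1000);
}
```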
Feb 18 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be used 
 as the underlying allocator, instead of using glibc? Or is D 
 runtime using mmap/srbk/etc directly?

 Thanks.
Currently, all memory allocations use a global GC lock[1]. As such, presently high-parallelism programs need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
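A minimal sketch of what side-stepping the GC with std.experimental.allocator 
can look like, here with Mallocator so allocations go through C malloc and 
never take the GC lock (illustrative, not code from the thread):

```d
import std.experimental.allocator : dispose, expandArray, makeArray;
import std.experimental.allocator.mallocator : Mallocator;

void example()
{
    // An int[] backed by C malloc; the GC never sees this memory.
    int[] rows = Mallocator.instance.makeArray!int(8);
    scope(exit) Mallocator.instance.dispose(rows);

    // Growing replaces "~=": expandArray appends `delta` elements.
    Mallocator.instance.expandArray(rows, 8);
    assert(rows.length == 16);

    // Caveat: the GC does not scan this array, so don't keep the only
    // reference to a GC-managed object inside it.
}
```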
Feb 18 2016
parent reply Witek <witold.baryluk+dlang gmail.com> writes:
On Thursday, 18 February 2016 at 13:49:45 UTC, Vladimir Panteleev 
wrote:
 On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be 
 used as the underlying allocator, instead of using glibc? Or 
 is D runtime using mmap/srbk/etc directly?

 Thanks.
Currently, all memory allocations use a global GC lock[1]. As such, presently high-parallelism programs need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
Yeah, I was just using stuff like:

    int[] newPartialSolution = partialSolution ~ row;
    int[] newAvailableSolution = availableSolution[0 .. $-1].dup;
    // Move last element if needed in newAvailableSolution.

It was pretty hard to find out, because the allocation was hidden behind "~". Yes, -vgc helped here, but still, I was not expecting such terrible performance. I will try using std.experimental.allocator, but it doesn't play well with "~", and I would need to manually do expandArray and array operations, which is a pain. It would be nice to encode the allocator used in the type, potentially by wrapping the array in a custom struct/class. "As of this time, std.experimental.allocator is not integrated with D's built-in operators that allocate memory, such as new, array literals, or array concatenation operators."
Feb 18 2016
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:
 It was pretty hard to find out, because it was hidden behind 
 "~". Yes, -vgc helped here, but still, I was not expecting so 
 terrible performance.
Aside from -vgc, you can also use @nogc to forbid GC use in code thus annotated.
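For illustration (my own sketch, not from the thread): under @nogc the 
hidden allocation behind "~" becomes a compile-time error, and a stack 
buffer can take its place in the serial part:

```d
// Assumes partial.length < 64, as in the ScopeBuffer trick above.
@nogc void serialStep(scope const int[] partial, int row)
{
    // int[] next = partial ~ row; // would not compile under @nogc: "~" allocates
    int[64] buf = void;
    buf[0 .. partial.length] = partial[];
    buf[partial.length] = row;
    const next = buf[0 .. partial.length + 1];
    assert(next[$ - 1] == row);
    // ... recurse with `next`, which must not escape this frame ...
}

void main()
{
    serialStep([0, 2], 4);
}
```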
Feb 18 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:
 I will try using std.experimental.allocator, but this doesn't 
 play well with "~", and I would need to manually do 
 expandArray, and array operations, which is a pain. It would be 
 nice to encode allocator used in the type, potentially by 
 wrapping array into custom struct/class.
You can try std.container (https://dlang.org/phobos/std_container.html) - it uses the C heap for memory management and has overloads for the append operators.
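Sketched usage (module layout as in current Phobos; Array's payload is 
malloc-backed and reference counted):

```d
import std.container.array : Array;

void example()
{
    auto rows = Array!int(0, 1, 2, 3);
    rows ~= 4;                 // append without touching the GC heap
    assert(rows.length == 5 && rows[4] == 4);

    rows.removeBack();         // shrink, again without GC involvement
    assert(rows.length == 4);
}
```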
Feb 19 2016
prev sibling next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under heavy
 multithreading?
It's a global GC, possibly with a little per-thread pool.

As part of the abortive Amber language project, I was looking into ways to craft a per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared, and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet.

I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.
Feb 18 2016
next sibling parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d
wrote:
 […]

 I would like to look into D's GC and parallelism more. I've started
 on mark/sweep parallelism but haven't made any worthwhile progress.
 I'll take this as my next task. It's more awkward because it requires
 changes to the runtime interface, which means modifying both DMD and
 the runtime.
The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based codes (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb.
--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 18 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Fri, 19 Feb 2016 07:01:56 +0000, Russel Winder via Digitalmars-d wrote:

 On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d wrote:
 […]
 
 I would like to look into D's GC and parallelism more. I've started on
 mark/sweep parallelism but haven't made any worthwhile progress. I'll
 take this as my next task. It's more awkward because it requires
 changes to the runtime interface, which means modifying both DMD and
 the runtime.
The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based codes (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb.
Looks like they're using write barriers. I like write barriers, but I'm not sure they'd be so popular with the D crowd. But they might be using them only during a collection, which would be interesting. That article doesn't have anywhere near enough information for me to implement a GC in a similar style, but it's enough for me to start searching for more writeups. Thanks!
Feb 19 2016
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:
 On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading?
It's a global GC, possibly with a little per-thread pool. As part of the abortive Amber language project, I was looking into ways to craft per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet.
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
 I would like to look into D's GC and parallelism more. I've 
 started on mark/sweep parallelism but haven't made any 
 worthwhile progress. I'll take this as my next task. It's more 
 awkward because it requires changes to the runtime interface, 
 which means modifying both DMD and the runtime.
Sociomantic has a concurrent GC for D1, and I think that they've ported it to D2 (if not, I expect that they're in the process of doing so), and that may or may not end up in druntime at some point, but the key problem that I recall is that it relies on fork to do what it does, which works fantastically on *nix systems, but doesn't work on Windows, and I'm not sure that anyone has figured out how to do the same thing on Windows yet (the key thing is that it needs to be able to take a snapshot of the memory).

And other performance improvements have been made to the GC, but the fact that we're a system language that allows you ultimately to do most anything really limits what we can do in comparison to a language sitting in a VM.

- Jonathan M Davis
Feb 19 2016
next sibling parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis 
wrote:
 On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright 
 wrote:
 [...]
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
Can't we keep track of when this happens? We could have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.
 [...]

 - Jonathan M Davis
Feb 19 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Fri, 19 Feb 2016 21:45:31 +0000, Yuxuan Shui wrote:

 On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis wrote:
 On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:
 [...]
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
Can't we keep track of when this happens? We could have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.
Yes. Some casts already invoke runtime functions. I believe array casts do this, as do class casts. I thought I'd made that point somewhere, but apparently I didn't hit the post button. It's always safe (albeit inefficient) to treat all data as shared, so you can promote things to the shared pool liberally and never return them to the local pool (until that memory got collected, at least).
Feb 19 2016
prev sibling parent rsw0x <anonymous anonymous.com> writes:
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis 
wrote:
 but the fact that we're a system language that allows you 
 ultimately to do most anything really limits what we can do in 
 comparison to a language sitting in VM.

 - Jonathan M Davis
Some small language changes could greatly improve the ability to implement an efficient GC in D - e.g., no interior pointers by default. The overhead caused by having to find object roots far outweighs the cost of carrying a base pointer + offset.
Feb 19 2016
prev sibling next sibling parent Bottled Gin <gin bottle.com> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:

 Anyhow, everything was good on 1 or 2 threads, or maybe few 
 more, on my laptop with old Dual Core CPU. I was able to speed 
 it up exactly by a factor of 2x.

 I wanted to try it out on bigger machine, so used Amazone AWS 
 EC2 c4.8xlarge instance with 18 cores, 36 hyperthreads/vCPUs, 
 and results were pretty terrible.
In my experience, multicore on a VPS does not work the same way as multicore on a local dedicated machine. But my experience is from a Linode instance back in 2014, when Linode only offered Xen virtualization. It will be interesting to see how a Linode pans out now that they offer KVM. I would encourage you to try with a 2-core Amazon AWS instance and see if you get the 2x performance scaling you got with a dedicated machine. - Puneet
Feb 19 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be used 
 as the underlying allocator, instead of using glibc? Or is D 
 runtime using mmap/srbk/etc directly?

 Thanks.
As others have noted, this is due to heavy contention in the GC.

There is a pending PR (https://github.com/D-Programming-Language/druntime/pull/1447) to replace the recursive mutex with a spinlock; that should improve the numbers a bit, but doesn't solve the problem. Since version 2.070 we also suspend threads in parallel, which heavily reduces the pause times with many threads (https://github.com/D-Programming-Language/druntime/pull/1110).

The real fix (using thread-local allocator caches) has a very high priority in our backlog (https://trello.com/c/K7HrSnwo/28-thread-cache-for-gc), but isn't yet fully implemented. You can see the latest state here: https://github.com/MartinNowak/druntime/commits/gcCache. I still need to add a queue on each thread cache to sync metadata.

So for the time being, use at least 2.070.0, and try to replace GC allocations with malloc.
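"Replace GC allocations with malloc" can be as small as a helper like this 
(an illustrative sketch; the caller owns the memory and the GC will not 
scan it):

```d
import core.stdc.stdlib : free, malloc;

int[] mallocArray(size_t n)
{
    auto p = cast(int*) malloc(n * int.sizeof);
    assert(p !is null, "out of memory");
    return p[0 .. n];
}

void example()
{
    auto rows = mallocArray(8);
    scope(exit) free(rows.ptr);

    foreach (i, ref r; rows)
        r = cast(int) i;
    assert(rows[7] == 7);
    // Caveat: never store the only reference to a GC object in this array,
    // since the collector does not scan malloc'd memory.
}
```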
Feb 19 2016