digitalmars.D - Poor memory allocation performance with a lot of threads on 36 core
- Witek (63/63) Feb 18 2016 Hi,
- Dicebot (7/12) Feb 18 2016 DMD/druntime use stop-the-world GC implementation. That means every time
- Witek (7/22) Feb 18 2016 The working set is pretty small, and I do not think GC collection
- Vladimir Panteleev (9/16) Feb 18 2016 Currently, all memory allocations use a global GC lock[1]. As
- Witek (17/34) Feb 18 2016 Yeah, I was just using stuff like:
- Vladimir Panteleev (3/6) Feb 18 2016 Aside -vgc, you can also use @nogc to forbid GC use in code thus
- Kagamin (4/9) Feb 19 2016 You can try std.container
- Chris Wright (13/15) Feb 18 2016 It's a global GC, possibly with a little per-thread pool.
- Russel Winder via Digitalmars-d (19/29) Feb 18 2016 The OpenJDK folk have more or less given up on traditional mark/sweep
- Chris Wright (7/25) Feb 19 2016 Looks like they're using write barriers. I like write barriers, but I'm
- Jonathan M Davis (20/37) Feb 19 2016 Unfortunately, given how easy it is to cast between mutable,
- Yuxuan Shui (5/16) Feb 19 2016 Can't we keep track of when this thing happens? We can have the
- Chris Wright (7/21) Feb 19 2016 Yes. Some casts already invoke runtime functions. I believe array casts
- rsw0x (7/11) Feb 19 2016 some small language changes could greatly improve the ability to
- Bottled Gin (9/15) Feb 19 2016 In my experience multicore on a VPS does not work the same way as
- Martin Nowak (17/24) Feb 19 2016 As others have noted, this is due to heavy contention in the GC.
Hi, I was playing with std.parallelism, and implemented a parallel NQueens code to count the number of solutions to the classical NQueens problem. I do limited parallel recursion using a taskPool.parallel foreach, and then switch at some level to the serial algorithm. I use return values / atomic variables to count the number of iterations / solutions, and then propagate that up using return values and aggregate. (I might switch later to taskPool.reduce, with a custom opAdd on some struct, or use taskPool.workerLocalStorage, which I think is an awesome concept.)

Anyhow, everything was good on 1 or 2 threads, or maybe a few more, on my laptop with an old dual-core CPU. I was able to speed it up exactly by a factor of 2x. I wanted to try it out on a bigger machine, so I used an Amazon AWS EC2 c4.8xlarge instance with 18 cores, 36 hyperthreads/vCPUs, and the results were pretty terrible. I was hardly able to utilize more than 3% of each CPU; the program actually ran significantly slower, which was surprising, as there was no synchronization, false sharing or too-fine-grained parallelism in the program. strace -f was showing tons of futex operations, but I know that the std.parallelism parallel foreach implementation doesn't use synchronization in the main loop, just some atomic variable reads in case another thread did a break or threw an exception. It couldn't be the bottleneck.

I traced the problem to the array allocations in my implementation - because I only want to iterate over the remaining possible rows, I keep int[] partialSolution and int[] availableRows, with the invariant that their intersection is empty, and their union contains the integers 0 to N-1 (N = size of the problem). Because partialSolution grows by 1 on each level of recursion, I just copy it and append a single element, then pass it recursively (either to a different thread or just a function call, depending on how deep in the recursion we already are).
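The shape of the program described above can be sketched as follows. This is a minimal sketch, not the poster's code: it parallelizes only the top level with taskPool.parallel and recurses serially below it, but it uses bitmasks instead of the int[] arrays discussed, purely to keep it short (the names countSolutions etc. are illustrative):

```d
import std.parallelism : taskPool;
import std.range : iota;
import core.atomic : atomicOp, atomicLoad;
import std.stdio : writeln;

// Serial recursion: count completions of a partial placement, tracking
// occupied columns and diagonals as bitmasks.
ulong countSolutions(int n, int row, uint cols, uint diag1, uint diag2)
{
    if (row == n) return 1;
    ulong total = 0;
    foreach (col; 0 .. n)
    {
        immutable c  = 1u << col;
        immutable d1 = 1u << (row + col);
        immutable d2 = 1u << (row - col + n - 1);
        if ((cols & c) || (diag1 & d1) || (diag2 & d2)) continue;
        total += countSolutions(n, row + 1, cols | c, diag1 | d1, diag2 | d2);
    }
    return total;
}

void main()
{
    enum n = 8;
    shared ulong solutions;
    // Parallelize only the first-row placements; everything below is a
    // plain function call, so there is no per-node task overhead.
    foreach (col; taskPool.parallel(iota(n), 1))
    {
        immutable sub = countSolutions(n, 1, 1u << col,
                                       1u << col, 1u << (n - 1 - col));
        atomicOp!"+="(solutions, sub); // aggregate across worker threads
    }
    writeln(atomicLoad(solutions)); // 92 for n = 8
}
```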
It could be solved by using a singly-linked list and sharing a common tail between the partialSolutions in different branches, but then the list would be reversed and I would lose random access, which is a little bit annoying (but shouldn't hurt performance, as I am probably going to scan the entire partialSolution array anyway). And I would still need to allocate a new list node (which, granted, could be sped up using a thread-local free list). An even bigger problem is availableRows, from which I need to remove some elements, which equates to allocating a new array and copying all elements but one. This cannot be done using a list. And a COW tree would be too expensive, and would still require some allocations and possibly rebalancing, etc.

I found that this is indeed the problem, because if I allocate int[64] x = void; on the stack, and then use std.internal.ScopeBuffer!int(&x) newAvailableRows; (which is safe, because I wait for threads to finish before I exit the scope, and threads only read from these arrays before making their own copies), I am able to run my nqueens implementation at full system utilization (using 36 hyperthreads at 100% each, for a speed-up of about 21x), and it is able to solve N=18 in 7 seconds (compared to about 2.5 minutes), with the parallel part going up to level 8 of the recursion (to improve load balancing).

So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly? Thanks.
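The stack-buffer trick described here looks roughly like the sketch below. Note that std.internal.scopebuffer is not a public Phobos module, so this is a sketch of the pattern rather than guaranteed API; ScopeBuffer starts in a fixed stack array and only falls back to malloc/realloc (not the GC) if it outgrows it, so the hot path never takes the global GC lock. removeRow is a hypothetical helper, not the poster's code:

```d
import std.internal.scopebuffer : ScopeBuffer;

// Copy every element of availableRows except `row` into the scope buffer
// and return a slice over its contents.
int[] removeRow(const(int)[] availableRows, int row, ref ScopeBuffer!int buf)
{
    foreach (r; availableRows)
        if (r != row)
            buf.put(r);
    return buf[];
}

void main()
{
    int[64] storage = void;              // lives on this frame's stack
    auto buf = ScopeBuffer!int(storage);
    scope(exit) buf.free();              // releases heap memory if it ever grew

    auto rest = removeRow([0, 1, 2, 3], 2, buf);
    assert(rest == [0, 1, 3]);
    // NB: `rest` points into buf/storage; it must not outlive this scope,
    // which is exactly the "wait for threads before leaving the scope"
    // discipline described in the post.
}
```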
Feb 18 2016
On 02/18/2016 02:00 PM, Witek wrote:So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly?DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
Feb 18 2016
On Thursday, 18 February 2016 at 13:10:28 UTC, Dicebot wrote:On 02/18/2016 02:00 PM, Witek wrote:The working set is pretty small, and I do not think a GC collection is triggered at all. I think the allocations themselves are slow / not scalable to multiple threads. I am going to check some GC stats, or disable GC collections completely, add some explicit 'scope(exit) delete ...;' where necessary, and see if anything changes.So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly?DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
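One way to test the hypothesis that collections (rather than allocation itself) are the cost, as proposed here, is to turn the collector off around the hot region (core.memory.GC is the druntime API). If the slowdown persists with the collector disabled, the contention is in allocation:

```d
import core.memory : GC;

void main()
{
    GC.disable();                // no collection cycles will run...
    scope(exit) GC.enable();     // ...until re-enabled on scope exit

    int[] a;
    foreach (i; 0 .. 1_000_000)
        a ~= i;                  // note: each append still takes the
                                 // global GC allocation lock, so this
                                 // does not fix lock contention
}
```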
Feb 18 2016
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly? Thanks.Currently, all memory allocations use a global GC lock[1]. As such, high-parallelism programs presently need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
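For reference, going through std.experimental.allocator backed by the C heap, so no GC lock is involved, looks roughly like this sketch:

```d
import std.experimental.allocator : makeArray, dispose;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    alias alloc = Mallocator.instance;

    // C-heap allocation; the GC lock is never taken
    auto a = alloc.makeArray!int(8);
    scope(exit) alloc.dispose(a);   // manual lifetime, unlike GC arrays

    a[0] = 42;
    assert(a.length == 8 && a[0] == 42);
    // caveat: the GC does not scan this memory, so it must not hold
    // the only reference to GC-managed objects
}
```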
Feb 18 2016
On Thursday, 18 February 2016 at 13:49:45 UTC, Vladimir Panteleev wrote:On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:Yeah, I was just using stuff like: int[] newPartialSolution = partialSolution ~ row; int[] newAvailableSolution = availableSolution[0 .. $-1].dup; // Move the last element if needed in newAvailableSolution. It was pretty hard to find out, because it was hidden behind "~". Yes, -vgc helped here, but still, I was not expecting such terrible performance. I will try using std.experimental.allocator, but it doesn't play well with "~", and I would need to manually do expandArray and array operations, which is a pain. It would be nice to encode the allocator used in the type, potentially by wrapping the array in a custom struct/class. "As of this time, std.experimental.allocator is not integrated with D's built-in operators that allocate memory, such as new, array literals, or array concatenation operators."So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly? Thanks.Currently, all memory allocations use a global GC lock[1]. As such, high-parallelism programs presently need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
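For what it's worth, the manual expandArray dance being lamented here looks roughly like the sketch below - appending one element without "~" by growing a Mallocator-backed array. append is a hypothetical helper, not a library function:

```d
import std.experimental.allocator : makeArray, expandArray, dispose;
import std.experimental.allocator.mallocator : Mallocator;

// Grow `a` by one slot and store `value` in it; replaces `a ~ value`.
int[] append(int[] a, int value)
{
    alias alloc = Mallocator.instance;
    if (a is null)
        a = alloc.makeArray!int(1);
    else if (!alloc.expandArray(a, 1))  // grows in place or reallocates
        assert(0, "out of memory");
    a[$ - 1] = value;
    return a;
}

void main()
{
    int[] a;
    a = append(a, 1);
    a = append(a, 2);
    assert(a == [1, 2]);
    Mallocator.instance.dispose(a);  // manual lifetime management
}
```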
Feb 18 2016
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:It was pretty hard to find out, because it was hidden behind "~". Yes, -vgc helped here, but still, I was not expecting such terrible performance.Aside from -vgc, you can also use @nogc to forbid GC use in code thus annotated.
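Annotating the hot path with @nogc turns every hidden GC allocation into a compile-time error, which makes allocations hidden behind "~" easy to find. The function below is illustrative only:

```d
// @nogc: the compiler rejects any operation in this function that could
// allocate from the GC heap.
@nogc void hotPath(int[] partialSolution, int row)
{
    // int[] next = partialSolution ~ row; // would not compile: "~" allocates
    // auto copy = partialSolution.dup;    // .dup would not compile either
    // GC-free alternatives (stack buffers, malloc) compile fine here
}
```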
Feb 18 2016
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:I will try using std.experimental.allocator, but it doesn't play well with "~", and I would need to manually do expandArray and array operations, which is a pain. It would be nice to encode the allocator used in the type, potentially by wrapping the array in a custom struct/class.You can try std.container https://dlang.org/phobos/std_container.html - it uses the C heap for memory management and has overloads for the append operators.
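A minimal sketch of that suggestion - std.container.Array allocates from the C heap, so appends don't go through the GC lock, and it overloads "~=":

```d
import std.container.array : Array;

void main()
{
    Array!int a;   // malloc-backed, reference-counted container
    a ~= 1;        // opOpAssign!"~" appends without GC allocation
    a ~= 2;
    assert(a.length == 2 && a[1] == 2);
    // memory is released deterministically when `a` goes out of scope
}
```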
Feb 19 2016
On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:So, the question is, why is D / DMD allocator so slow under heavy multithreading?It's a global GC, possibly with a little per-thread pool. As part of the abortive Amber language project, I was looking into ways to craft per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet. I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.
Feb 18 2016
On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d wrote:[…] I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based code (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb. -- Russel. Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Feb 18 2016
On Fri, 19 Feb 2016 07:01:56 +0000, Russel Winder via Digitalmars-d wrote:On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d wrote:Looks like they're using write barriers. I like write barriers, but I'm not sure they'd be so popular with the D crowd. But they might be using them only during a collection, which would be interesting. That article doesn't have anywhere near enough information for me to implement a GC in a similar style, but it's enough for me to start searching for more writeups. Thanks![…] I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based code (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb.
Feb 19 2016
On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.So, the question is, why is D / DMD allocator so slow under heavy multithreading?It's a global GC, possibly with a little per-thread pool. As part of the abortive Amber language project, I was looking into ways to craft per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet.I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.Sociomantic has a concurrent GC for D1, and I think that they've ported it to D2 (if not, I expect that they're in the process of doing so), and that may or may not end up in druntime at some point, but the key problem that I recall is that it relies on fork to do what it does, which works fantastically on *nix systems, but doesn't work on Windows, and I'm not sure that anyone has figured out how to do the same thing on Windows yet (the key thing is that it needs to be able to take a snapshot of the memory). 
And other performance improvements have been made to the GC, but the fact that we're a systems language that ultimately allows you to do most anything really limits what we can do in comparison to a language sitting in a VM. - Jonathan M Davis
Feb 19 2016
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis wrote:On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:Can't we keep track of when this happens? We can have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.[...]Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.[...] - Jonathan M Davis
Feb 19 2016
On Fri, 19 Feb 2016 21:45:31 +0000, Yuxuan Shui wrote:On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis wrote:Yes. Some casts already invoke runtime functions. I believe array casts do this, as do class casts. I thought I'd made that point somewhere, but apparently I didn't hit the post button. It's always safe (albeit inefficient) to treat all data as shared, so you can promote things to the shared pool liberally and never return them to the local pool (until that memory gets collected, at least).On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:Can't we keep track of when this happens? We can have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.[...]Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
Feb 19 2016
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis wrote:but the fact that we're a systems language that ultimately allows you to do most anything really limits what we can do in comparison to a language sitting in a VM. - Jonathan M Davissome small language changes could greatly improve the ability to implement an efficient GC in D, e.g. no interior pointers by default. The overhead caused by having to find object roots far outweighs the cost of carrying a base pointer + offset.
Feb 19 2016
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:Anyhow, everything was good on 1 or 2 threads, or maybe a few more, on my laptop with an old dual-core CPU. I was able to speed it up exactly by a factor of 2x. I wanted to try it out on a bigger machine, so I used an Amazon AWS EC2 c4.8xlarge instance with 18 cores, 36 hyperthreads/vCPUs, and the results were pretty terrible.In my experience multicore on a VPS does not work the same way as multicore on a local dedicated machine. But my experience is from a Linode instance back in 2014, when Linode only offered Xen virtualization. It will be interesting to see how a Linode pans out now that they offer KVM. I would encourage you to try with a 2-core Amazon AWS instance and see if you get the 2x performance scaling you got on a dedicated machine. - Puneet
Feb 19 2016
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:So, the question is, why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc to be used as the underlying allocator, instead of using glibc? Or is the D runtime using mmap/sbrk/etc. directly? Thanks.As others have noted, this is due to heavy contention in the GC. There is a pending PR (https://github.com/D-Programming-Language/druntime/pull/1447) to replace the recursive mutex with a spinlock; that should improve the numbers a bit but doesn't solve the problem. Since version 2.070 we also suspend threads in parallel, which heavily reduces the pause times with many threads: https://github.com/D-Programming-Language/druntime/pull/1110. The real fix (using thread-local allocator caches) has a very high priority in our backlog (https://trello.com/c/K7HrSnwo/28-thread-cache-for-gc), but isn't yet fully implemented. You can see the latest state here: https://github.com/MartinNowak/druntime/commits/gcCache. I still need to add a queue on each thread cache to sync metadata. So for the time being, use at least 2.070.0, and try to replace GC allocations with malloc.
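For the plain int[] slices in the original program, "replace GC allocations with malloc" can be as small as the sketch below (mallocArray is a hypothetical helper, not a library function):

```d
import core.stdc.stdlib : malloc, free;

// Allocate an int slice from the C heap; no GC lock is taken.
int[] mallocArray(size_t n)
{
    auto p = cast(int*) malloc(n * int.sizeof);
    assert(p !is null, "out of memory");
    return p[0 .. n];
}

void main()
{
    auto a = mallocArray(4);
    scope(exit) free(a.ptr);  // manual lifetime management
    a[] = 7;                  // usable like any slice
    assert(a[3] == 7);
    // caveat: the GC does not scan this memory, so it must not hold
    // the only reference to GC-managed objects
}
```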
Feb 19 2016