
digitalmars.D - Poor memory allocation performance with a lot of threads on 36 core

reply Witek <witold.baryluk+dlang gmail.com> writes:
Hi,

I was playing with std.parallelism, and implemented a parallel 
NQueens code to count number of solutions to the classical 
NQueens problem. I do limited parallel recursion using 
taskPool.parallel foreach, and then switch at some level to 
serial algorithm. I use return values / atomic variables to count 
number of iterations / solutions, and then propagate that up 
using return values and aggregate. (I might switch later to the 
taskPool.reduce, with custom opAdd on some struct, or use 
taskPool.workerLocalStorage which I think is awesome concept).
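For readers who want a concrete picture, here is a minimal sketch of the 
approach described above - not the actual code from this post; isSafe, the 
exact array handling, and the one parallel level are my own illustrative 
choices:

```d
import core.atomic : atomicOp;
import std.array : array;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

shared long solutions;

// A previously placed queen conflicts with the new one on a diagonal
// iff their row distance equals their column distance. Rows are unique
// by construction, since each row is taken from the "available" set.
bool isSafe(const int[] partial, int row)
{
    foreach (col, placed; partial)
    {
        immutable dist = cast(int)(partial.length - col);
        if (placed - row == dist || row - placed == dist)
            return false;
    }
    return true;
}

// Serial part: copy-and-append partialSolution, copy-minus-one
// availableRows -- exactly the allocation pattern discussed in this thread.
long countSerial(int n, int[] partial, int[] available)
{
    if (partial.length == n)
        return 1;
    long total = 0;
    foreach (i, row; available)
        if (isSafe(partial, row))
            total += countSerial(n, partial ~ row,
                                 available[0 .. i] ~ available[i + 1 .. $]);
    return total;
}

void main()
{
    enum n = 8;
    auto rows = iota(n).array;
    // Parallel over the first recursion level only; deeper levels are serial.
    foreach (row; taskPool.parallel(rows))
        atomicOp!"+="(solutions,
                      countSerial(n, [row], rows[0 .. row] ~ rows[row + 1 .. $]));
    writeln(solutions); // 92 solutions for n = 8
}
```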

Anyhow, everything was good on 1 or 2 threads, or maybe a few 
more, on my laptop with an old Dual Core CPU. I was able to speed 
it up by exactly a factor of 2x.

I wanted to try it out on a bigger machine, so I used an Amazon 
AWS EC2 c4.8xlarge instance with 18 cores and 36 
hyperthreads/vCPUs, and the results were pretty terrible.

I was hardly able to utilize more than 3% of each CPU, and the 
program actually ran significantly slower, which was surprising 
given that there was no synchronization, false sharing, or overly 
fine-grained parallelism in the program.

strace -f was showing tons of futex operations, but I know that 
the std.parallelism foreach implementation doesn't use 
synchronization in the main loop, just some atomic variable reads 
in case another thread did a break or threw an exception. It 
couldn't be the bottleneck.

I traced the problem to the array allocations in my 
implementation - because I only want to iterate over remaining 
possible rows, I keep int[] partialSolution and int[] 
availableRows, with invariant that intersection is empty, and the 
union contains integers 0 to N-1. (N = size of the problem).

Because partialSolution grows by 1 on each level of recursion, I 
just copy it and append a single element, then pass it 
recursively (either to a different thread or just a function 
call, depending on how deep in the recursion we already are). It 
could be solved by using a singly-linked list with a common tail 
for all partialSolutions in different branches, but then the list 
would be reversed and I would lose random access, which is a 
little bit annoying (but shouldn't hurt performance, as I am 
probably going to scan the entire partialSolution array anyway). 
And I would still need to allocate a new list node (which, 
granted, could be sped up using a thread-local free-list).
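The shared-tail list idea from the paragraph above could look roughly like 
this (an illustrative sketch; note that each push still performs one GC 
allocation per recursion level, as the post points out):

```d
// Reversed singly-linked list: the head is the newest queen and the tail
// is shared by all sibling branches of the recursion tree.
struct Node
{
    int row;
    Node* tail; // common tail, shared rather than copied
}

Node* push(Node* tail, int row)
{
    return new Node(row, tail); // one GC allocation per level
}

void example()
{
    Node* s1 = push(null, 3);
    auto s2a = push(s1, 5); // two branches extending the same partial solution
    auto s2b = push(s1, 6);
    assert(s2a.tail is s2b.tail); // the tail storage is shared
}
```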

An even bigger problem is availableRows, from which I need to 
remove some elements, which equates to allocating a new array and 
copying all elements but one. This cannot be done using a list. 
And a COW tree would be too expensive, and would still require 
some allocations and possibly rebalancing, etc.

I found that this is indeed the problem, because if I allocate 
int[64] x = void; on the stack, and then use 
std.internal.scopebuffer's ScopeBuffer!int(&x) newAvailableRows; 
(which is safe, because I wait for threads to finish before I 
exit the scope, and threads only read from these arrays before 
making their own copies), I am able to run my nqueens 
implementation at full system utilization (using 36 hyperthreads 
at 100% each, for a speed-up of about 21x), and it is able to 
solve N=18 in 7 seconds (compared to about 2.5 minutes), with the 
parallel part going up to level 8 of the recursion (to improve 
load balancing).
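For reference, the ScopeBuffer pattern described here looks roughly like 
this (a sketch of my own; std.internal.scopebuffer is an internal, 
undocumented Phobos module, so its API may change):

```d
import std.internal.scopebuffer : ScopeBuffer;

void recurse(scope const int[] rows)
{
    // placeholder for the next level of the search
}

void withoutRow(const int[] available, int removedRow)
{
    int[64] x = void;                    // stack storage, as in the post
    auto newAvailableRows = ScopeBuffer!int(x);
    scope(exit) newAvailableRows.free(); // only frees if it outgrew x

    foreach (row; available)
        if (row != removedRow)
            newAvailableRows.put(row);

    // Assumes removedRow occurs exactly once in available.
    assert(newAvailableRows.length == available.length - 1);

    // The slice points into the stack buffer, so it must not outlive this
    // scope -- the post waits for child tasks before returning.
    recurse(newAvailableRows[]);
}
```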

So, the question is, why is the D / DMD allocator so slow under 
heavy multithreading? The working set is pretty small (a few 
megabytes at most), so I do not think this is an issue with GC 
scanning itself. Can I plug in tcmalloc / jemalloc to be used as 
the underlying allocator, instead of using glibc? Or is the D 
runtime using mmap/sbrk/etc. directly?

Thanks.
Feb 18 2016
next sibling parent reply Dicebot <public dicebot.lv> writes:
On 02/18/2016 02:00 PM, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under heavy
 multithreading? The working set is pretty small (few megabytes at most),
 so I do not think this is an issue with GC scanning itself.  Can I
 plug-in tcmalloc / jemalloc, to be used as the underlying allocator,
 instead of using glibc? Or is D runtime using mmap/srbk/etc directly?
DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
Feb 18 2016
parent Witek <witold.baryluk+dlang gmail.com> writes:
On Thursday, 18 February 2016 at 13:10:28 UTC, Dicebot wrote:
 On 02/18/2016 02:00 PM, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be 
 used as the underlying allocator, instead of using glibc? Or 
 is D runtime using mmap/srbk/etc directly?
DMD/druntime use a stop-the-world GC implementation. That means every time anything needs to be allocated, there is a possibility of a collection cycle which will pause execution of all threads. It doesn't matter whether there is much garbage - the very pause is the problem. Using allocations not controlled by the GC (even plain malloc) should change the situation notably.
The working set is pretty small, and I do not think GC collection is triggered at all. I think the allocations themselves are slow / not scalable to multiple threads. I am going to check some GC stats, or disable GC collections completely, add some explicit 'scope(exit) delete ...;' where necessary, and see if anything changes.
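Separating the cost of collections from the cost of allocation itself can 
be done directly from the program; a minimal sketch of that experiment:

```d
import core.memory : GC;

void main()
{
    GC.disable();            // allocations still work; collections are deferred
    scope(exit) GC.enable();

    // Stand-in for the allocation-heavy workload. If throughput is still
    // poor with collections off, the bottleneck is the allocation path
    // itself (e.g. lock contention), not GC scanning.
    int[][] a;
    foreach (i; 0 .. 1000)
        a ~= new int[](64);
    assert(a.length == 1000);
}
```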
Feb 18 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be used 
 as the underlying allocator, instead of using glibc? Or is D 
 runtime using mmap/srbk/etc directly?

 Thanks.
Currently, all memory allocations use a global GC lock[1]. As such, presently high-parallelism programs need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
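A minimal sketch of what side-stepping the GC with std.experimental.allocator 
can look like, here with Mallocator so allocations go through C malloc and 
never take the GC lock (illustrative, not code from the thread):

```d
import std.experimental.allocator : dispose, expandArray, makeArray;
import std.experimental.allocator.mallocator : Mallocator;

void example()
{
    // An int[] backed by C malloc; the GC never sees this memory.
    int[] rows = Mallocator.instance.makeArray!int(8);
    scope(exit) Mallocator.instance.dispose(rows);

    // Growing replaces "~=": expandArray appends `delta` elements.
    Mallocator.instance.expandArray(rows, 8);
    assert(rows.length == 16);

    // Caveat: the GC does not scan this array, so don't keep the only
    // reference to a GC-managed object inside it.
}
```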
Feb 18 2016
parent reply Witek <witold.baryluk+dlang gmail.com> writes:
On Thursday, 18 February 2016 at 13:49:45 UTC, Vladimir Panteleev 
wrote:
 On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be 
 used as the underlying allocator, instead of using glibc? Or 
 is D runtime using mmap/srbk/etc directly?

 Thanks.
Currently, all memory allocations use a global GC lock[1]. As such, presently high-parallelism programs need to avoid allocating memory via the GC. You can avoid this problem by using a different allocation / memory management strategy. You may want to have a look at std.experimental.allocator. [1]: https://github.com/D-Programming-Language/druntime/blob/30f8c1af39eb17d8ebec1f5fd401eb5cfd6b36da/src/gc/gc.d#L348-L370
Yeah, I was just using stuff like:

    int[] newPartialSolution = partialSolution ~ row;
    int[] newAvailableSolution = availableSolution[0 .. $-1].dup;
    // Move last element if needed in newAvailableSolution.

It was pretty hard to find out, because the allocation was hidden behind "~". Yes, -vgc helped here, but still, I was not expecting such terrible performance. I will try using std.experimental.allocator, but it doesn't play well with "~", and I would need to manually do expandArray and array operations, which is a pain. It would be nice to encode the allocator used in the type, potentially by wrapping the array in a custom struct/class. "As of this time, std.experimental.allocator is not integrated with D's built-in operators that allocate memory, such as new, array literals, or array concatenation operators."
Feb 18 2016
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:
 It was pretty hard to find out, because it was hidden behind 
 "~". Yes, -vgc helped here, but still, I was not expecting so 
 terrible performance.
Aside from -vgc, you can also use @nogc to forbid GC use in code thus annotated.
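For illustration (my own sketch, not from the thread): under @nogc the 
hidden allocation behind "~" becomes a compile-time error, and a stack 
buffer can take its place in the serial part:

```d
// Assumes partial.length < 64, as in the ScopeBuffer trick above.
@nogc void serialStep(scope const int[] partial, int row)
{
    // int[] next = partial ~ row; // would not compile under @nogc: "~" allocates
    int[64] buf = void;
    buf[0 .. partial.length] = partial[];
    buf[partial.length] = row;
    const next = buf[0 .. partial.length + 1];
    assert(next[$ - 1] == row);
    // ... recurse with `next`, which must not escape this frame ...
}

void main()
{
    serialStep([0, 2], 4);
}
```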
Feb 18 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Thursday, 18 February 2016 at 13:55:02 UTC, Witek wrote:
 I will try using std.experimental.allocator, but this doesn't 
 play well with "~", and I would need to manually do 
 expandArray, and array operations, which is a pain. It would be 
 nice to encode allocator used in the type, potentially by 
 wrapping array into custom struct/class.
You can try std.container (https://dlang.org/phobos/std_container.html) - it uses the C heap for memory management and has overloads for the append operators.
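Sketched usage (module layout as in current Phobos; Array's payload is 
malloc-backed and reference counted):

```d
import std.container.array : Array;

void example()
{
    auto rows = Array!int(0, 1, 2, 3);
    rows ~= 4;                 // append without touching the GC heap
    assert(rows.length == 5 && rows[4] == 4);

    rows.removeBack();         // shrink, again without GC involvement
    assert(rows.length == 4);
}
```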
Feb 19 2016
prev sibling next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under heavy
 multithreading?
It's a global GC, possibly with a little per-thread pool.

As part of the abortive Amber language project, I was looking into ways to craft a per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared, and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet.

I would like to look into D's GC and parallelism more. I've started on mark/sweep parallelism but haven't made any worthwhile progress. I'll take this as my next task. It's more awkward because it requires changes to the runtime interface, which means modifying both DMD and the runtime.
Feb 18 2016
next sibling parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d
wrote:
 […]

 I would like to look into D's GC and parallelism more. I've started
 on mark/sweep parallelism but haven't made any worthwhile progress.
 I'll take this as my next task. It's more awkward because it requires
 changes to the runtime interface, which means modifying both DMD and
 the runtime.
The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based codes (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb.
--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 18 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Fri, 19 Feb 2016 07:01:56 +0000, Russel Winder via Digitalmars-d wrote:

 On Thu, 2016-02-18 at 17:27 +0000, Chris Wright via Digitalmars-d wrote:
 […]
 
 I would like to look into D's GC and parallelism more. I've started on
 mark/sweep parallelism but haven't made any worthwhile progress. I'll
 take this as my next task. It's more awkward because it requires
 changes to the runtime interface, which means modifying both DMD and
 the runtime.
The OpenJDK folk have more or less given up on traditional mark/sweep and even concurrent mark/sweep for JVM-based codes (mostly because there is still a need for a global stop-the-world collection at some points): the G1 algorithm that is now the default has a very different approach and appears to be a very significant improvement. See for example: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All and all the Oracle and OpenJDK blurb.
Looks like they're using write barriers. I like write barriers, but I'm not sure they'd be so popular with the D crowd. But they might be using them only during a collection, which would be interesting. That article doesn't have anywhere near enough information for me to implement a GC in a similar style, but it's enough for me to start searching for more writeups. Thanks!
Feb 19 2016
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:
 On Thu, 18 Feb 2016 13:00:12 +0000, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading?
It's a global GC, possibly with a little per-thread pool. As part of the abortive Amber language project, I was looking into ways to craft per-thread GC. You need to tell the runtime whether a variable is marked as shared or __gshared and that's pretty much sufficient -- you can only refer to unshared variables from one thread, which means you can do a local collection stopping only one thread's execution. You can have one heap for each thread and one cross-thread heap. This work hasn't happened in D yet.
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
 I would like to look into D's GC and parallelism more. I've 
 started on mark/sweep parallelism but haven't made any 
 worthwhile progress. I'll take this as my next task. It's more 
 awkward because it requires changes to the runtime interface, 
 which means modifying both DMD and the runtime.
Sociomantic has a concurrent GC for D1, and I think that they've ported it to D2 (if not, I expect that they're in the process of doing so), and that may or may not end up in druntime at some point, but the key problem that I recall is that it relies on fork to do what it does, which works fantastically on *nix systems, but doesn't work on Windows, and I'm not sure that anyone has figured out how to do the same thing on Windows yet (the key thing is that it needs to be able to take a snapshot of the memory).

And other performance improvements have been made to the GC, but the fact that we're a system language that allows you ultimately to do most anything really limits what we can do in comparison to a language sitting in a VM.

- Jonathan M Davis
Feb 19 2016
next sibling parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis 
wrote:
 On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright 
 wrote:
 [...]
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
Can't we keep track of when this happens? We could have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.
 [...]

 - Jonathan M Davis
Feb 19 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Fri, 19 Feb 2016 21:45:31 +0000, Yuxuan Shui wrote:

 On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis wrote:
 On Thursday, 18 February 2016 at 17:27:13 UTC, Chris Wright wrote:
 [...]
Unfortunately, given how easy it is to cast between mutable, const, immutable, shared (and it's quite common to construct something as mutable and then cast it to immutable or shared) and how it's pretty easy to pass objects across threads, it becomes _very_ problematic to have a per-thread memory pool in D, even if theoretically it's a great idea.
Can't we keep track of when this happens? We could have the compiler insert calls to move data in/out of the per-thread pool when this kind of thing happens.
Yes. Some casts already invoke runtime functions. I believe array casts do this, as do class casts. I thought I'd made that point somewhere, but apparently I didn't hit the post button. It's always safe (albeit inefficient) to treat all data as shared, so you can promote things to the shared pool liberally and never return them to the local pool (until that memory got collected, at least).
Feb 19 2016
prev sibling parent rsw0x <anonymous anonymous.com> writes:
On Friday, 19 February 2016 at 08:29:00 UTC, Jonathan M Davis 
wrote:
 but the fact that we're a system language that allows you 
 ultimately to do most anything really limits what we can do in 
 comparison to a language sitting in VM.

 - Jonathan M Davis
Some small language changes could greatly improve the ability to implement an efficient GC in D - e.g., no interior pointers by default. The overhead caused by having to find object roots far outweighs the cost of carrying a base pointer + offset.
Feb 19 2016
prev sibling next sibling parent Bottled Gin <gin bottle.com> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:

 Anyhow, everything was good on 1 or 2 threads, or maybe few 
 more, on my laptop with old Dual Core CPU. I was able to speed 
 it up exactly by a factor of 2x.

 I wanted to try it out on bigger machine, so used Amazone AWS 
 EC2 c4.8xlarge instance with 18 cores, 36 hyperthreads/vCPUs, 
 and results were pretty terrible.
In my experience, multicore on a VPS does not work the same way as multicore on a local dedicated machine. But my experience is from a Linode instance back in 2014, when Linode only offered Xen virtualization. It will be interesting to see how a Linode pans out now that they offer KVM. I would encourage you to try with a 2-core Amazon AWS instance and see if you get the 2x performance scaling you got with a dedicated machine. - Puneet
Feb 19 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Thursday, 18 February 2016 at 13:00:12 UTC, Witek wrote:
 So, the question is, why is D / DMD allocator so slow under 
 heavy multithreading? The working set is pretty small (few 
 megabytes at most), so I do not think this is an issue with GC 
 scanning itself.  Can I plug-in tcmalloc / jemalloc, to be used 
 as the underlying allocator, instead of using glibc? Or is D 
 runtime using mmap/srbk/etc directly?

 Thanks.
As others have noted, this is due to heavy contention in the GC.

There is a pending PR (https://github.com/D-Programming-Language/druntime/pull/1447) to replace the recursive mutex with a spinlock; that should improve the numbers a bit, but doesn't solve the problem. Since version 2.070 we also suspend threads in parallel, which heavily reduces the pause times with many threads (https://github.com/D-Programming-Language/druntime/pull/1110).

The real fix (using thread-local allocator caches) has a very high priority in our backlog (https://trello.com/c/K7HrSnwo/28-thread-cache-for-gc), but isn't yet fully implemented. You can see the latest state here: https://github.com/MartinNowak/druntime/commits/gcCache. I still need to add a queue on each thread cache to sync metadata.

So for the time being, use at least 2.070.0, and try to replace GC allocations with malloc.
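"Replace GC allocations with malloc" can be as small as a helper like this 
(an illustrative sketch; the caller owns the memory and the GC will not 
scan it):

```d
import core.stdc.stdlib : free, malloc;

int[] mallocArray(size_t n)
{
    auto p = cast(int*) malloc(n * int.sizeof);
    assert(p !is null, "out of memory");
    return p[0 .. n];
}

void example()
{
    auto rows = mallocArray(8);
    scope(exit) free(rows.ptr);

    foreach (i, ref r; rows)
        r = cast(int) i;
    assert(rows[7] == 7);
    // Caveat: never store the only reference to a GC object in this array,
    // since the collector does not scan malloc'd memory.
}
```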
Feb 19 2016