www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - The GC and performance, but not what you expect

reply "Atila Neves" <atila.neves gmail.com> writes:
So after the last day of DConf was over, back at the Aloft Brian
Schott taught me and Jonathan Crapuchettes about perf
(https://perf.wiki.kernel.org/index.php/Main_Page). If you're
profiling binaries on Linux, this thing is a must have and I have
no idea how I'd never heard about it before.

Anyway, I needed a program to test it out with, and since the
last time I had something that didn't perform as well as I wanted
it to and the dmd profiler didn't help me much was my MQTT
broker, I used it on that.

Now, that broker beat the other language implementations on my
shootout on throughput by a country mile
(http://atilanevesoncode.wordpress.com/2014/01/08/adding-java-and-c-to-the-mqtt-benchmarks-or-how-i-learned-to-stop-worrying-and-love-the-garbage-collector/)
but it lagged behind the winner (Java) in throughput-constrained
latency. I put perf to work on the D/vibe.d and Java versions and
unsurprisingly it told me that the CPU was idle a lot more of the
time for D than it was for Java. Digging in deeper, the main
hotspot according to perf was pthread_mutex_lock. That surprised
me since the program is single-threaded (and in fact runs slower
when multiple threads are used). vibe.d turned out to be
responsible for part of that. I posted on the vibe.d forum and
Sonke promply removed all the pesky locking. Results improved,
but still 6% behind Java.

I carried on and removed unnecessary array allocations from my
code. Before perf I had no idea these were contributing to
slowing the program down. With that change, Java beats D by 3%.

I've run out of optimisations to make (I think), and
pthread_mutex_lock is still top of the list, called by
_D4core4sync5mutex5Mutex4lockMFNeZv, which in turn is called by
_D2gc2gc2GC6mallocMFmkPmZPv.

The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.

Atila
May 29 2014
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking? -- /Jacob Carlborg
May 29 2014
parent reply Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking?
properly, but collections is not atomar and you have to prevent threads that are created during collection from collecting. so you still need to lock
May 29 2014
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2014-05-29 13:01, Robert Schadek via Digitalmars-d wrote:

 properly, but collections is not atomar and you have to prevent threads
 that are created during collection from collecting. so you still need to
 lock
The GC already stops the world, can a thread then be created during a collection? -- /Jacob Carlborg
May 29 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2014 4:01 AM, Robert Schadek via Digitalmars-d wrote:
 On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking?
properly, but collections is not atomar and you have to prevent threads that are created during collection from collecting. so you still need to lock
If it's single threaded, and the single thread is doing the collecting, who is starting up a new thread?
May 29 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the 
 collecting, who is starting up a new thread?
class Foo { ~this() { auto t = new Thread({ auto a = new char[100]; }); t.start(); } }
May 29 2014
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 29 May 2014 16:01:16 -0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the collecting,  
 who is starting up a new thread?
class Foo { ~this() { auto t = new Thread({ auto a = new char[100]; }); t.start(); } }
It's my understanding that dtors are run without the lock held. Is that no longer true? -Steve
May 29 2014
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 29 May 2014 20:01:16 +0000
schrieb "Sean Kelly" <sean invisibleduck.org>:

 On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the 
 collecting, who is starting up a new thread?
class Foo { ~this() { auto t = new Thread({ auto a = new char[100]; }); t.start(); } }
Nice try, but destructors called by the GC are currently effectively nogc. So don't try that at home. -- Marco
May 29 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.
When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
Jun 10 2014
next sibling parent reply "Joakim" <dlang joakim.airpost.net> writes:
On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.
When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
More than two years ago, possibly longer: https://issues.dlang.org/show_bug.cgi?id=7349
Jun 10 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:
 On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.
When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
More than two years ago, possibly longer: https://issues.dlang.org/show_bug.cgi?id=7349
That's a different issue, which is that the current GC isn't necessarily exception safe and throwing from a dtor violates this. I don't see any mention there about allocating in a dtor.
Jun 10 2014
parent "Joakim" <dlang joakim.airpost.net> writes:
On Tuesday, 10 June 2014 at 17:58:56 UTC, Sean Kelly wrote:
 On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:
 On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.
When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
More than two years ago, possibly longer: https://issues.dlang.org/show_bug.cgi?id=7349
That's a different issue, which is that the current GC isn't necessarily exception safe and throwing from a dtor violates this. I don't see any mention there about allocating in a dtor.
I believe the mechanism is that the assert is triggered, it tries to allocate, then you get the invalid memory operation error. Exactly why it allocates, I'm not sure.
Jun 10 2014
prev sibling parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 10.06.2014 17:15, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.
When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
I don't think that ever worked, at least for the last couple of years. Threads run during sweep, and an allocation during the collection might pass the gc-lock because it allows recursive locking in the same thread, but if the allocation hits an empty free-list, it will try to run another collection which will cause the InvalidMemoryOperationError. If it happens to allocate from a free-list, it will likely cause memory corruption, because sweeping won't know it has been allocated (mark and freebits not updated), and put it back into a free list.
Jun 10 2014
prev sibling next sibling parent reply "Wanderer" <no-reply no-reply.org> writes:
It will be hard to beat Java's garbage collector. They use 
generational, sequential memory approach, while D's memory 
manager (correct me if I'm wrong) uses more common approach and 
has to deal with possible memory fragmentation. Allocating a new 
object is extremely cheap in Java, literally a few machine 
instructions, and there is no locking because each thread has its 
own eden memory and they never overlap.

Unless D introduces the possibility to move objects around in 
heap (which is not possible with the support of pointers, I 
guess) and eliminate the fragmentation, it will be hard to make 
GC as fast. Just 3% of difference look great enough to me. :-)
May 29 2014
next sibling parent reply Shammah Chancellor <anonymous coward.com> writes:
On 2014-05-29 11:03:11 +0000, Wanderer said:

 It will be hard to beat Java's garbage collector. They use 
 generational, sequential memory approach, while D's memory manager 
 (correct me if I'm wrong) uses more common approach and has to deal 
 with possible memory fragmentation. Allocating a new object is 
 extremely cheap in Java, literally a few machine instructions, and 
 there is no locking because each thread has its own eden memory and 
 they never overlap.
 
 Unless D introduces the possibility to move objects around in heap 
 (which is not possible with the support of pointers, I guess) and 
 eliminate the fragmentation, it will be hard to make GC as fast. Just 
 3% of difference look great enough to me. :-)
I was under the impression that the D gc does move objects around on the heap and then update the pointers.
May 29 2014
parent Jacob Carlborg <doob me.com> writes:
On 2014-05-29 14:18, Shammah Chancellor wrote:

 I was under the impression that the D gc does move objects around on the
 heap and then update the pointers.
It does not. It would require barriers (read or write, don't remember which) and my people here are against that. -- /Jacob Carlborg
May 29 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/29/14, 4:03 AM, Wanderer wrote:
 It will be hard to beat Java's garbage collector. They use generational,
 sequential memory approach, while D's memory manager (correct me if I'm
 wrong) uses more common approach and has to deal with possible memory
 fragmentation. Allocating a new object is extremely cheap in Java,
 literally a few machine instructions, and there is no locking because
 each thread has its own eden memory and they never overlap.

 Unless D introduces the possibility to move objects around in heap
 (which is not possible with the support of pointers, I guess) and
 eliminate the fragmentation, it will be hard to make GC as fast. Just 3%
 of difference look great enough to me. :-)
Thing is the allocation patterns in D and Java are mightily different. Java does need a good GC because it allocates everywhere for everything. I think we need to come up with our own measurements and statistics when it comes about collecting D's litter. Andrei
May 29 2014
parent "Brad Anderson" <eco gnuk.net> writes:
On Friday, 30 May 2014 at 05:45:52 UTC, Andrei Alexandrescu wrote:
 Thing is the allocation patterns in D and Java are mightily 
 different. Java does need a good GC because it allocates 
 everywhere for everything. I think we need to come up with our 
 own measurements and statistics when it comes about collecting 
 D's litter.

 Andrei
For those that haven't seen this pull request Etienne Cimon has started some work on adding statistics to the GC: https://github.com/D-Programming-Language/druntime/pull/803
May 30 2014
prev sibling next sibling parent reply "safety0ff" <safety0ff.dev gmail.com> writes:
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that
 one coming.
I would have seen it coming if the app. was multi-threaded. Hopefully a re-write of the GC would include something akin to Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.
May 29 2014
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 29 May 2014 12:35:46 +0000
schrieb "safety0ff" <safety0ff.dev gmail.com>:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that
 one coming.
I would have seen it coming if the app. was multi-threaded. Hopefully a re-write of the GC would include something akin to Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.
Orvid has been inspired by TCMalloc. If he can realize his idea of a GC it should have a similar thread local allocation heap with no locking. -- Marco
May 29 2014
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 29 May 2014 06:09:17 -0400, Atila Neves <atila.neves gmail.com>  
wrote:


 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
Just checked. It used to be a long time ago that single threaded apps did not take the lock. It appears not to be the case any more, the lock is always taken. The explanation in the comment is that finalizers may spawn a new thread. See https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410 Surely we can figure this out... More and more, it's looking like we are going to start needing thread-local pools for thread-local allocations. To work around, you might be able to pre-allocate all your data before starting, and make judicious re-use of buffers. But of course, this may make the code so much more ugly, which it shouldn't have to be. -Steve
May 29 2014
next sibling parent reply =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a new
 thread. See
 
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410 The strange thing is, finalizers may not even be executed. :-/ I would simply change the spec saying "finalizers may not start a new thread." Getting back to Atila's issue, I would simply comment-out that lock in druntime, beat Java, and then put the lock back in. :o) Ali
May 29 2014
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 29 May 2014 14:42:19 -0400, Ali =C3=87ehreli <acehreli yahoo.com=
 wrote:
 On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

  > The explanation in the comment is that finalizers may spawn a new
  > thread. See
  >  =
 https://github.com/D-Programming-Language/druntime/blob/master/src/gc/=
gc.d#L410 =
 The strange thing is, finalizers may not even be executed.
They should be executed if the memory is collected.
 :-/ I would simply change the spec saying "finalizers may not start a =
=
 new thread."
Probably a good idea :)
 Getting back to Atila's issue, I would simply comment-out that lock in=
=
 druntime, beat Java, and then put the lock back in. :o)
Probably the sanest option, but then you can't really say that D beats = Java :) -Steve
May 29 2014
prev sibling parent "Atila Neves" <atila.neves gmail.com> writes:
I tried modifying the runtime but unfortunately vibe.d doesn't
compile with the latest dmd. I don't feel like fixing it or
figuring out how to compile gdc either. As mentioned later, I
wouldn't be able to claim to beat Java, but I would be able to
see how much faster it would go without the lock, though. Oh well.

Also, there was a bit too much focus on Java on this thread. The
fact is that the C implementation is faster than D as well, and
just as fast as Java.

Atila

On Thursday, 29 May 2014 at 18:42:20 UTC, Ali Çehreli wrote:
 On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a
new
 thread. See
 
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410 The strange thing is, finalizers may not even be executed. :-/ I would simply change the spec saying "finalizers may not start a new thread." Getting back to Atila's issue, I would simply comment-out that lock in druntime, beat Java, and then put the lock back in. :o) Ali
May 30 2014
prev sibling parent "Martin Nowak" <code dawg.eu> writes:
On Thursday, 29 May 2014 at 13:06:20 UTC, Steven Schveighoffer
wrote:
 More and more, it's looking like we are going to start needing 
 thread-local pools for thread-local allocations.
This is the best solution, get rid of the locking for most allocations even in multi-threaded programs.
May 29 2014
prev sibling next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 If you're profiling binaries on Linux, this thing is a must 
 have and I have no idea how I'd never heard about it before.
On the topic of perf, I found a stupid trick the other day and wrote it down on the wiki: http://wiki.dlang.org/Perf
May 29 2014
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 29 May 2014 18:00:13 +0000
schrieb "Brian Schott" <briancschott gmail.com>:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 If you're profiling binaries on Linux, this thing is a must 
 have and I have no idea how I'd never heard about it before.
On the topic of perf, I found a stupid trick the other day and wrote it down on the wiki: http://wiki.dlang.org/Perf
But does that beat simply attaching the debugger to the process and pausing it? -- Marco
May 29 2014
prev sibling parent =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
 On the topic of perf, I found a stupid trick the other day and 
 wrote it down on the wiki: http://wiki.dlang.org/Perf
Great tip! Now I want bash completion for it. But I can't seem to get it by reusing _kill in bash-completion: https://superuser.com/questions/760613/reusing-kill-completion-in-bashrc-fails Ideas anyone?
May 30 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that one coming.
This is sadly unavoidable with the current GC. Simply put, it's possible for a finalizer to spawn a thread, and so the GC always has to lock before collecting, even if the program is single-threaded at the beginning of the collection cycle.
May 29 2014
prev sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
A lock should not be more than a CAS operation that always succeeds in a single threaded application, but OS overhead might be considerable. Adam Sakareassen has recently announced a rewrite of the GC that has lock-free allocations: http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com Unfortunately, it isn't available yet...
May 30 2014
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 30 May 2014 10:29:58 +0200
schrieb Rainer Schuetze <r.sagitario gmx.de>:

 
 
 On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
A lock should not be more than a CAS operation that always succeeds in a single threaded application, but OS overhead might be considerable. Adam Sakareassen has recently announced a rewrite of the GC that has lock-free allocations: http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com Unfortunately, it isn't available yet...
I recently tried to write a single-type memory allocator and thought I'd come out faster than TCMalloc due to its simplicity. But as soon as I added a single CAS I was already over the time that TCMalloc needs. That way I learned that CAS is not as cheap as it looks and the fastest allocators work thread local as long as possible. -- Marco
May 30 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
 simplicity. But as soon as I added a single CAS I was already
 over the time that TCMalloc needs. That way I learned that CAS
 is not as cheap as it looks and the fastest allocators work
 thread local as long as possible.
22 cycles latency if on a valid cacheline? + overhead of going to memory Did you try to add explicit prefetch, maybe that would help? Prefetch is expensive on Ivy Brigde (43 cycles throughput, 0.5 cycles on Haswell). You need instructions to fill the pipeline between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM and do a lot of testing on different CPUs. Explicit prefetching, lock free strategies etc are tricky to get right. Get it wrong and it is worse than the naive implementation.
May 30 2014
next sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
This is also one of many good reasons to not make allocators 
library based, but do it in the compiler and backend. That also 
allows you do allocate multiple objects in a single CAS. (i.e. 
the compiler collects multiple allocs calls and replace it with 
one)
May 30 2014
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 30 May 2014 15:54:57 +0000
schrieb "Ola Fosheim Gr=C3=B8stad"
<ola.fosheim.grostad+dlang gmail.com>:

 On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
 simplicity. But as soon as I added a single CAS I was already
 over the time that TCMalloc needs. That way I learned that CAS
 is not as cheap as it looks and the fastest allocators work
 thread local as long as possible.
=20 22 cycles latency if on a valid cacheline? + overhead of going to memory =20 Did you try to add explicit prefetch, maybe that would help? =20 Prefetch is expensive on Ivy Brigde (43 cycles throughput, 0.5=20 cycles on Haswell). You need instructions to fill the pipeline=20 between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM=20 and do a lot of testing on different CPUs. Explicit prefetching,=20 lock free strategies etc are tricky to get right. Get it wrong=20 and it is worse than the naive implementation.
I'm on a Core 2 Duo. But this doesn't sound like I want to try it. core.atomic is as low as I wanted to go. Anyway I deleted that code when I realized just how fast allocation is with TCMalloc already. And that's a general purpose allocator. --=20 Marco
May 30 2014
parent reply "Kagamin" <spam here.lot> writes:
The design of TCMalloc seems to be compatible with Boehm GC, so D 
GC can probably use same ideas.
May 31 2014
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 31 May 2014 08:52:34 +0000
schrieb "Kagamin" <spam here.lot>:

 The design of TCMalloc seems to be compatible with Boehm GC, so D 
 GC can probably use same ideas.
Did I mention that Orvid planned on doing that for his replacement GC? It looks like there are now a lot of proof of concept GC modifications by several people. I think that's ok for the benefit of exploring possibly incompatible changes and trade-offs. A too early joint effort could mean that some ideas cannot be explored to their full extent with the typical myths about how good or bad some idea would have turned out. -- Marco
Jun 01 2014
prev sibling parent reply Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:
Hi,

Yes my little GC project is coming along.  It allocates much faster in 
multi threaded applications when all threads are competing for the lock. 
  D's current GC is very slow when there is contention for the GC-lock 
which is acquired for every malloc.  The other hidden lock usage is when 
a dynamic array increases in size.  A significant problem is that the 
thread holding the lock might be pre-empted by the OS, causing all 
threads to wait until it is running again.  My lock-free allocator fixes 
all this trouble.

(Note: On Windows the current GC uses EnterCriticalSection from 
kernel32.lib)

In single threaded applications, D's GC performs much better.  It still 
has the overhead of maintaining a free list while allocating. The free 
list is built when requesting pools from the OS.  This list is stored as 
a linked list using the fresh memory cells.  A cell is zerod when it is 
allocated.

My implementation is slightly faster at allocations because it doesn't 
need to zero the memory before before. It only maintains a such a free 
list for recycled memory.

I'll try to make some code available soon for people to try if they are 
interested.  There are still some jobs I need to do.  I'm just 
implementing the utility functions now (sizeOf, addrOf) etc.   Allocated 
memory is currently never returned to the OS, just recycled for new 
allocations.

Cheers
Adam








On 30/05/2014 6:29 PM, Rainer Schuetze via Digitalmars-d wrote:
 On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.
A lock should not be more than a CAS operation that always succeeds in a single threaded application, but OS overhead might be considerable. Adam Sakareassen has recently announced a rewrite of the GC that has lock-free allocations: http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com Unfortunately, it isn't available yet...
Jun 10 2014
next sibling parent reply =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
 Yes my little GC project is coming along.  It allocates much 
 faster in multi threaded applications when all threads are 
 competing for the lock.
Have you benchmarked your lock-free GC against D's builtin with different allocation sizes? I've seen comments on the need for that... Anyhow very exciting times - I'm eagerly awaiting your next GC news update!
Jun 10 2014
parent reply Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 11/06/2014 6:39 AM, "Nordlöw" via Digitalmars-d wrote:
 Have you benchmarked your lock-free GC against D's builtin with
 different allocation sizes? I've seen comments on the need for
 that...
Only a little. In scripts where I deliberately introduce contention my allocator is quicker. It's much closer when there is little contention. This little test script uses 12 threads to do 100,000 allocations each. It does a run for the sizes: (16, 32, 64, 512, 512, 64, 32, 16). (script attached below) With DMD's built in GC 407 milli seconds (16 bytes) 409 milli seconds (32 bytes) 432 milli seconds (64 bytes) 638 milli seconds (512 bytes) -collect- 633 milli seconds (512 bytes) 422 milli seconds (64 bytes) 418 milli seconds (32 bytes) 394 milli seconds (16 bytes) My GC 22 milli seconds 18 milli seconds 27 milli seconds 293 milli seconds -collect- 194 milli seconds 133 milli seconds 60 milli seconds 62 milli seconds The timing of the collect is not included because it's not fair to benchmark as my GC does not minimise yet. This means my GC could easily run out of memory sooner. (a job for later) Some interesting observations are: 512 size allocations don't come from thread specific arenas, so there is more contention for the CAS operation. The allocations after the collection would be mostly from the free list so they are slower than the initial fresh allocations. (List maintenance, and zeroing of memory required. This is the same as the built-in GC.) The only real difference in the speeds after the collect is lock contention. I have a 6 core cpu (hyper threaded to 12). So contention levels may be different on machines with less cores. Not sure how this will change things. It might cause less contention, but blocking threads may slow things down on the built-in GC. Tests are on Windows using 32 bit code. Cheers Adam ---- import std.stdio; import std.datetime; import std.concurrency; import core.memory; import core.thread; struct LinkedList16{ int value = 0; LinkedList16* next; } struct LinkedList32{ int value = 0; LinkedList32* next; byte[16] padding; } struct LinkedList64{ int value = 0; LinkedList64* next; byte[32] padding; } struct LinkedList512{ int value = 0; LinkedList512* next; byte[256] padding; } shared int threadCount = 0; void main(){ core.memory.GC.disable(); doTest!LinkedList16; doTest!LinkedList32; doTest!LinkedList64; doTest!LinkedList512; GC.collect(); doTest!LinkedList512; doTest!LinkedList64; doTest!LinkedList32; doTest!LinkedList16; } void doTest(T)(){ auto start = Clock.currSystemTick(); foreach(i; 0 .. 12){ auto tid = spawn(&doSomething!T, thisTid); threadCount++; } while(threadCount >0){}; auto ln = Clock.currSystemTick() - start; writeln(ln.msecs, " milli seconds"); } void doSomething(T)(Tid tid){ auto top = new T; auto recent = top; //Create a linked list foreach(i; 1 .. 100_000){ auto newList = new T; newList.value = i % 128; recent.next = newList; recent = newList; } //Sum the values. (Just spends some time walking the memory). recent = top; long total=0; while(recent !is null){ total += recent.value; recent = recent.next; } //writeln("Total : ", total ); threadCount--; }
Jun 10 2014
parent =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
 Only a little.  In scripts where I deliberately introduce 
 contention my allocator is quicker. It's much closer when there 
 is little contention.
Very interesting... It would be nice to see single-threaded benchmarks aswell for completeness. Thx-
Jun 11 2014
prev sibling parent Etienne <etcimon gmail.com> writes:
On 2014-06-10 5:53 AM, Adam Sakareassen via Digitalmars-d wrote:
 Hi,

 Yes my little GC project is coming along.  It allocates much faster in
 multi threaded applications when all threads are competing for the lock.
   D's current GC is very slow when there is contention for the GC-lock
 which is acquired for every malloc.  The other hidden lock usage is when
 a dynamic array increases in size.  A significant problem is that the
 thread holding the lock might be pre-empted by the OS, causing all
 threads to wait until it is running again.  My lock-free allocator fixes
 all this trouble.
Could you share a fork containing the source code to this? I'm very curious as to how you've implemented it.
Jun 12 2014