digitalmars.D - The GC and performance, but not what you expect
- Atila Neves (34/34) May 29 2014 So after the last day of DConf was over, back at the Aloft Brian
- Jacob Carlborg (5/9) May 29 2014 Doesn't the runtime know how many threads currently exist? If it's only
- Robert Schadek via Digitalmars-d (4/11) May 29 2014 properly, but collections is not atomar and you have to prevent threads
- Jacob Carlborg (5/8) May 29 2014 The GC already stops the world, can a thread then be created during a
- Walter Bright (3/17) May 29 2014 If it's single threaded, and the single thread is doing the collecting, ...
- Sean Kelly (7/9) May 29 2014 class Foo {
- Steven Schveighoffer (5/15) May 29 2014 It's my understanding that dtors are run without the lock held. Is that ...
- Marco Leise (6/17) May 29 2014 Nice try, but destructors called by the GC are currently
- Sean Kelly (4/6) Jun 10 2014 When did that happen? Some effort was made at one point to
- Joakim (3/10) Jun 10 2014 More than two years ago, possibly longer:
- Sean Kelly (4/15) Jun 10 2014 That's a different issue, which is that the current GC isn't
- Joakim (4/21) Jun 10 2014 I believe the mechanism is that the assert is triggered, it tries
- Rainer Schuetze (9/15) Jun 10 2014 I don't think that ever worked, at least for the last couple of years.
- Wanderer (11/11) May 29 2014 It will be hard to beat Java's garbage collector. They use
- Shammah Chancellor (3/15) May 29 2014 I was under the impression that the D gc does move objects around on
- Jacob Carlborg (5/7) May 29 2014 It does not. It would require barriers (read or write, don't remember
- Andrei Alexandrescu (6/16) May 29 2014 Thing is the allocation patterns in D and Java are mightily different.
- Brad Anderson (4/10) May 30 2014 For those that haven't seen this pull request Etienne Cimon has
- safety0ff (4/9) May 29 2014 I would have seen it coming if the app. was multi-threaded.
- Marco Leise (7/19) May 29 2014 Orvid has been inspired by TCMalloc. If he can realize his
- Steven Schveighoffer (15/19) May 29 2014 Just checked. It used to be a long time ago that single threaded apps di...
- Ali Çehreli (7/10) May 29 2014 https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc...
- Steven Schveighoffer (9/20) May 29 2014 They should be executed if the memory is collected.
- Atila Neves (10/22) May 30 2014 I tried modifying the runtime but unfortunately vibe.d doesn't
- Martin Nowak (4/6) May 29 2014 This is the best solution, get rid of the locking for most
- Brian Schott (3/5) May 29 2014 On the topic of perf, I found a stupid trick the other day and
- Marco Leise (6/12) May 29 2014 But does that beat simply attaching the debugger to the
- "Nordlöw" (5/7) May 30 2014 Great tip!
- Sean Kelly (5/9) May 29 2014 This is sadly unavoidable with the current GC. Simply put, it's
- Rainer Schuetze (7/11) May 30 2014 A lock should not be more than a CAS operation that always succeeds in a...
- Marco Leise (10/27) May 30 2014 I recently tried to write a single-type memory allocator and
- "Ola Fosheim Grøstad" (10/14) May 30 2014 22 cycles latency if on a valid cacheline?
- "Ola Fosheim Grøstad" (5/5) May 30 2014 This is also one of many good reasons to not make allocators
- Marco Leise (9/26) May 30 2014 Am Fri, 30 May 2014 15:54:57 +0000
- Kagamin (2/2) May 31 2014 The design of TCMalloc seems to be compatible with Boehm GC, so D
- Marco Leise (12/14) Jun 01 2014 Did I mention that Orvid planned on doing that for his
- Adam Sakareassen via Digitalmars-d (27/39) Jun 10 2014 Hi,
- "Nordlöw" (5/8) Jun 10 2014 Have you benchmarked your lock-free GC against D's builtin with
- Adam Sakareassen via Digitalmars-d (113/116) Jun 10 2014 Only a little. In scripts where I deliberately introduce contention my
- "Nordlöw" (4/7) Jun 11 2014 Very interesting...
- Etienne (3/12) Jun 12 2014 Could you share a fork containing the source code to this? I'm very
So after the last day of DConf was over, back at the Aloft, Brian Schott taught me and Jonathan Crapuchettes about perf (https://perf.wiki.kernel.org/index.php/Main_Page). If you're profiling binaries on Linux, this thing is a must-have and I have no idea how I'd never heard about it before.

Anyway, I needed a program to test it out with, and since the last time I had something that didn't perform as well as I wanted it to and the dmd profiler didn't help me much was my MQTT broker, I used it on that. Now, that broker beat the other language implementations in my shootout on throughput by a country mile (http://atilanevesoncode.wordpress.com/2014/01/08/adding-java-and-c-to-the-mqtt-benchmarks-or-how-i-learned-to-stop-worrying-and-love-the-garbage-collector/) but it lagged behind the winner (Java) in throughput-constrained latency.

I put perf to work on the D/vibe.d and Java versions and, unsurprisingly, it told me that the CPU was idle a lot more of the time for D than it was for Java. Digging in deeper, the main hotspot according to perf was pthread_mutex_lock. That surprised me, since the program is single-threaded (and in fact runs slower when multiple threads are used). vibe.d turned out to be responsible for part of that: I posted on the vibe.d forum and Sönke promptly removed all the pesky locking. Results improved, but were still 6% behind Java. I carried on and removed unnecessary array allocations from my code; before perf I had no idea these were contributing to slowing the program down. With that change, Java beats D by 3%.

I've run out of optimisations to make (I think), and pthread_mutex_lock is still top of the list, called by _D4core4sync5mutex5Mutex4lockMFNeZv, which in turn is called by _D2gc2gc2GC6mallocMFmkPmZPv. The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you, but I definitely didn't see that one coming.

Atila
May 29 2014
On 2014-05-29 12:09, Atila Neves wrote:

> The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you but I definitely didn't see that one coming.

Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking?

-- 
/Jacob Carlborg
May 29 2014
On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:

> Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking?

Properly, but collection is not atomic, and you have to prevent threads that are created during a collection from collecting. So you still need to lock.
May 29 2014
On 2014-05-29 13:01, Robert Schadek via Digitalmars-d wrote:

> collection is not atomic and you have to prevent threads that are created during a collection from collecting. so you still need to lock

The GC already stops the world; can a thread then be created during a collection?

-- 
/Jacob Carlborg
May 29 2014
On 5/29/2014 4:01 AM, Robert Schadek via Digitalmars-d wrote:

>> Doesn't the runtime know how many threads currently exist? If it's only one, could the GC avoid the locking?
>
> collection is not atomic and you have to prevent threads that are created during a collection from collecting. so you still need to lock

If it's single threaded, and the single thread is doing the collecting, who is starting up a new thread?
May 29 2014
On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:

> If it's single threaded, and the single thread is doing the collecting, who is starting up a new thread?

class Foo {
    ~this() {
        auto t = new Thread({ auto a = new char[100]; });
        t.start();
    }
}
May 29 2014
On Thu, 29 May 2014 16:01:16 -0400, Sean Kelly <sean invisibleduck.org> wrote:

> class Foo {
>     ~this() {
>         auto t = new Thread({ auto a = new char[100]; });
>         t.start();
>     }
> }

It's my understanding that dtors are run without the lock held. Is that no longer true?

-Steve
May 29 2014
Am Thu, 29 May 2014 20:01:16 +0000
schrieb "Sean Kelly" <sean invisibleduck.org>:

> class Foo {
>     ~this() {
>         auto t = new Thread({ auto a = new char[100]; });
>         t.start();
>     }
> }

Nice try, but destructors called by the GC are currently effectively nogc. So don't try that at home.

-- 
Marco
May 29 2014
On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:

> Nice try, but destructors called by the GC are currently effectively nogc. So don't try that at home.

When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...
Jun 10 2014
On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:

> When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...

More than two years ago, possibly longer:
https://issues.dlang.org/show_bug.cgi?id=7349
Jun 10 2014
On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:

> More than two years ago, possibly longer:
> https://issues.dlang.org/show_bug.cgi?id=7349

That's a different issue, which is that the current GC isn't necessarily exception safe and throwing from a dtor violates this. I don't see any mention there about allocating in a dtor.
Jun 10 2014
On Tuesday, 10 June 2014 at 17:58:56 UTC, Sean Kelly wrote:

> That's a different issue, which is that the current GC isn't necessarily exception safe and throwing from a dtor violates this. I don't see any mention there about allocating in a dtor.

I believe the mechanism is that the assert is triggered, it tries to allocate, then you get the invalid memory operation error. Exactly why it allocates, I'm not sure.
Jun 10 2014
On 10.06.2014 17:15, Sean Kelly wrote:

> When did that happen? Some effort was made at one point to ensure that allocations worked from dtors. Not that I'm in favor, but...

I don't think that ever worked, at least for the last couple of years. Threads run during sweep, and an allocation during the collection might pass the gc-lock because it allows recursive locking in the same thread. But if the allocation hits an empty free-list, it will try to run another collection, which will cause the InvalidMemoryOperationError. If it happens to allocate from a free-list, it will likely cause memory corruption, because sweeping won't know it has been allocated (mark and free bits not updated) and will put it back into a free list.
Jun 10 2014
It will be hard to beat Java's garbage collector. They use a generational, sequential memory approach, while D's memory manager (correct me if I'm wrong) uses a more common approach and has to deal with possible memory fragmentation. Allocating a new object is extremely cheap in Java, literally a few machine instructions, and there is no locking because each thread has its own eden memory and they never overlap. Unless D introduces the possibility to move objects around in the heap (which is not possible with the support of pointers, I guess) and eliminates the fragmentation, it will be hard to make the GC as fast. Just 3% of difference looks great enough to me. :-)
May 29 2014
On 2014-05-29 11:03:11 +0000, Wanderer said:

> It will be hard to beat Java's garbage collector. They use a generational, sequential memory approach, while D's memory manager (correct me if I'm wrong) uses a more common approach and has to deal with possible memory fragmentation. [...]

I was under the impression that the D gc does move objects around on the heap and then update the pointers.
May 29 2014
On 2014-05-29 14:18, Shammah Chancellor wrote:

> I was under the impression that the D gc does move objects around on the heap and then update the pointers.

It does not. It would require barriers (read or write, don't remember which) and many people here are against that.

-- 
/Jacob Carlborg
May 29 2014
On 5/29/14, 4:03 AM, Wanderer wrote:

> It will be hard to beat Java's garbage collector. [...] Just 3% of difference looks great enough to me. :-)

Thing is, the allocation patterns in D and Java are mightily different. Java does need a good GC because it allocates everywhere for everything. I think we need to come up with our own measurements and statistics when it comes to collecting D's litter.

Andrei
May 29 2014
On Friday, 30 May 2014 at 05:45:52 UTC, Andrei Alexandrescu wrote:

> I think we need to come up with our own measurements and statistics when it comes to collecting D's litter.

For those that haven't seen this pull request, Etienne Cimon has started some work on adding statistics to the GC: https://github.com/D-Programming-Language/druntime/pull/803
May 30 2014
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:

> The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you but I definitely didn't see that one coming.

I would have seen it coming if the app was multi-threaded. Hopefully a rewrite of the GC would include something akin to Boehm GC's THREAD_LOCAL_ALLOC / Google's TCMalloc.
May 29 2014
Am Thu, 29 May 2014 12:35:46 +0000
schrieb "safety0ff" <safety0ff.dev gmail.com>:

> Hopefully a rewrite of the GC would include something akin to Boehm GC's THREAD_LOCAL_ALLOC / Google's TCMalloc.

Orvid has been inspired by TCMalloc. If he can realize his idea of a GC, it should have a similar thread-local allocation heap with no locking.

-- 
Marco
May 29 2014
On Thu, 29 May 2014 06:09:17 -0400, Atila Neves <atila.neves gmail.com> wrote:

> The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you but I definitely didn't see that one coming.

Just checked. A long time ago, single-threaded apps did not take the lock. That appears not to be the case any more; the lock is always taken. The explanation in the comment is that finalizers may spawn a new thread. See https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410

Surely we can figure this out... More and more, it's looking like we are going to need thread-local pools for thread-local allocations.

To work around it, you might be able to pre-allocate all your data before starting, and make judicious re-use of buffers. But of course, this may make the code much uglier, which it shouldn't have to be.

-Steve
May 29 2014
On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

> The explanation in the comment is that finalizers may spawn a new thread. See https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410

The strange thing is, finalizers may not even be executed. :-/ I would simply change the spec to say "finalizers may not start a new thread."

Getting back to Atila's issue, I would simply comment out that lock in druntime, beat Java, and then put the lock back in. :o)

Ali
May 29 2014
On Thu, 29 May 2014 14:42:19 -0400, Ali Çehreli <acehreli yahoo.com> wrote:

> On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:
>> The explanation in the comment is that finalizers may spawn a new thread. See https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410
>
> The strange thing is, finalizers may not even be executed.

They should be executed if the memory is collected.

> :-/ I would simply change the spec to say "finalizers may not start a new thread."

Probably a good idea :)

> Getting back to Atila's issue, I would simply comment out that lock in druntime, beat Java, and then put the lock back in. :o)

Probably the sanest option, but then you can't really say that D beats Java :)

-Steve
May 29 2014
I tried modifying the runtime, but unfortunately vibe.d doesn't compile with the latest dmd, and I don't feel like fixing it or figuring out how to compile gdc either. As mentioned later, I wouldn't be able to claim to beat Java, but I would be able to see how much faster it would go without the lock. Oh well.

Also, there has been a bit too much focus on Java in this thread. The fact is that the C implementation is faster than D as well, and just as fast as Java.

Atila

On Thursday, 29 May 2014 at 18:42:20 UTC, Ali Çehreli wrote:

> Getting back to Atila's issue, I would simply comment out that lock in druntime, beat Java, and then put the lock back in. :o)
May 30 2014
On Thursday, 29 May 2014 at 13:06:20 UTC, Steven Schveighoffer wrote:

> More and more, it's looking like we are going to start needing thread-local pools for thread-local allocations.

This is the best solution: get rid of the locking for most allocations, even in multi-threaded programs.
May 29 2014
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:

> If you're profiling binaries on Linux, this thing is a must have and I have no idea how I'd never heard about it before.

On the topic of perf, I found a stupid trick the other day and wrote it down on the wiki: http://wiki.dlang.org/Perf
May 29 2014
Am Thu, 29 May 2014 18:00:13 +0000
schrieb "Brian Schott" <briancschott gmail.com>:

> On the topic of perf, I found a stupid trick the other day and wrote it down on the wiki: http://wiki.dlang.org/Perf

But does that beat simply attaching the debugger to the process and pausing it?

-- 
Marco
May 29 2014
> On the topic of perf, I found a stupid trick the other day and wrote it down on the wiki: http://wiki.dlang.org/Perf

Great tip! Now I want bash completion for it. But I can't seem to get it by reusing _kill in bash-completion: https://superuser.com/questions/760613/reusing-kill-completion-in-bashrc-fails

Ideas, anyone?
May 30 2014
On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:

> The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you but I definitely didn't see that one coming.

This is sadly unavoidable with the current GC. Simply put, it's possible for a finalizer to spawn a thread, so the GC always has to lock before collecting, even if the program is single-threaded at the beginning of the collection cycle.
May 29 2014
On 29.05.2014 12:09, Atila Neves wrote:

> The GC is preventing me from beating Java, but not because of collections. It's the locking it does to allocate instead! I don't know about the rest of you but I definitely didn't see that one coming.

A lock should not be more than a CAS operation that always succeeds in a single-threaded application, but OS overhead might be considerable.

Adam Sakareassen has recently announced a rewrite of the GC that has lock-free allocations: http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com

Unfortunately, it isn't available yet...
May 30 2014
Am Fri, 30 May 2014 10:29:58 +0200
schrieb Rainer Schuetze <r.sagitario gmx.de>:

> A lock should not be more than a CAS operation that always succeeds in a single-threaded application, but OS overhead might be considerable.

I recently tried to write a single-type memory allocator and thought I'd come out faster than TCMalloc due to its simplicity. But as soon as I added a single CAS, I was already over the time that TCMalloc needs. That way I learned that CAS is not as cheap as it looks, and the fastest allocators work thread-local as long as possible.

-- 
Marco
May 30 2014
On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:

> But as soon as I added a single CAS I was already over the time that TCMalloc needs. That way I learned that CAS is not as cheap as it looks and the fastest allocators work thread-local as long as possible.

22 cycles latency if on a valid cacheline? Plus the overhead of going to memory.

Did you try to add explicit prefetch? Maybe that would help.

Prefetch is expensive on Ivy Bridge (43 cycles throughput, 0.5 cycles on Haswell). You need instructions to fill the pipeline between PREFETCH and LOCK CMPXCHG, so you probably need to go ASM and do a lot of testing on different CPUs. Explicit prefetching, lock-free strategies etc. are tricky to get right. Get it wrong and it is worse than the naive implementation.
May 30 2014
This is also one of many good reasons to not make allocators library based, but do it in the compiler and backend. That also allows you to allocate multiple objects in a single CAS (i.e. the compiler collects multiple alloc calls and replaces them with one).
May 30 2014
Am Fri, 30 May 2014 15:54:57 +0000
schrieb "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com>:

> 22 cycles latency if on a valid cacheline? + overhead of going to memory
>
> Did you try to add explicit prefetch, maybe that would help?
>
> Prefetch is expensive on Ivy Bridge (43 cycles throughput, 0.5 cycles on Haswell). You need instructions to fill the pipeline between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM and do a lot of testing on different CPUs. Explicit prefetching, lock free strategies etc are tricky to get right. Get it wrong and it is worse than the naive implementation.

I'm on a Core 2 Duo. But this doesn't sound like I want to try it. core.atomic is as low as I wanted to go. Anyway, I deleted that code when I realized just how fast allocation is with TCMalloc already. And that's a general purpose allocator.

-- 
Marco
May 30 2014
The design of TCMalloc seems to be compatible with Boehm GC, so D GC can probably use same ideas.
May 31 2014
Am Sat, 31 May 2014 08:52:34 +0000
schrieb "Kagamin" <spam here.lot>:

> The design of TCMalloc seems to be compatible with Boehm GC, so D GC can probably use same ideas.

Did I mention that Orvid planned on doing that for his replacement GC? It looks like there are now a lot of proof-of-concept GC modifications by several people. I think that's OK, for the benefit of exploring possibly incompatible changes and trade-offs. A too-early joint effort could mean that some ideas cannot be explored to their full extent, with the typical myths about how good or bad some idea would have turned out.

-- 
Marco
Jun 01 2014
Hi,

Yes, my little GC project is coming along. It allocates much faster in multi-threaded applications when all threads are competing for the lock. D's current GC is very slow when there is contention for the GC lock, which is acquired for every malloc. The other hidden lock usage is when a dynamic array increases in size. A significant problem is that the thread holding the lock might be pre-empted by the OS, causing all threads to wait until it is running again. My lock-free allocator fixes all this trouble. (Note: on Windows the current GC uses EnterCriticalSection from kernel32.lib.)

In single-threaded applications, D's GC performs much better. It still has the overhead of maintaining a free list while allocating. The free list is built when requesting pools from the OS; the list is stored as a linked list using the fresh memory cells, and a cell is zeroed when it is allocated. My implementation is slightly faster at allocations because it doesn't need to zero the memory beforehand; it only maintains such a free list for recycled memory.

I'll try to make some code available soon for people to try if they are interested. There are still some jobs I need to do; I'm just implementing the utility functions (sizeOf, addrOf, etc.) now. Allocated memory is currently never returned to the OS, just recycled for new allocations.

Cheers
Adam

On 30/05/2014 6:29 PM, Rainer Schuetze via Digitalmars-d wrote:

> A lock should not be more than a CAS operation that always succeeds in a single threaded application, but OS overhead might be considerable.
>
> Adam Sakareassen has recently announced a rewrite of the GC that has lock-free allocations: http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com
>
> Unfortunately, it isn't available yet...
Jun 10 2014
> Yes my little GC project is coming along. It allocates much faster in multi threaded applications when all threads are competing for the lock.

Have you benchmarked your lock-free GC against D's builtin with different allocation sizes? I've seen comments on the need for that...

Anyhow, very exciting times. I'm eagerly awaiting your next GC news update!
Jun 10 2014
On 11/06/2014 6:39 AM, "Nordlöw" via Digitalmars-d wrote:

> Have you benchmarked your lock-free GC against D's builtin with different allocation sizes? I've seen comments on the need for that...

Only a little. In scripts where I deliberately introduce contention, my allocator is quicker. It's much closer when there is little contention.

This little test script uses 12 threads to do 100,000 allocations each. It does a run for each of the sizes (16, 32, 64, 512, 512, 64, 32, 16). (Script attached below.)

With DMD's built-in GC:

407 milliseconds (16 bytes)
409 milliseconds (32 bytes)
432 milliseconds (64 bytes)
638 milliseconds (512 bytes)
-collect-
633 milliseconds (512 bytes)
422 milliseconds (64 bytes)
418 milliseconds (32 bytes)
394 milliseconds (16 bytes)

My GC:

22 milliseconds
18 milliseconds
27 milliseconds
293 milliseconds
-collect-
194 milliseconds
133 milliseconds
60 milliseconds
62 milliseconds

The timing of the collect is not included because it's not fair to benchmark, as my GC does not minimise yet. This means my GC could easily run out of memory sooner. (A job for later.)

Some interesting observations: 512-byte allocations don't come from thread-specific arenas, so there is more contention for the CAS operation. The allocations after the collection come mostly from the free list, so they are slower than the initial fresh allocations (list maintenance and zeroing of memory are required; this is the same as the built-in GC). The only real difference in the speeds after the collect is lock contention.

I have a 6-core CPU (hyper-threaded to 12), so contention levels may be different on machines with fewer cores. Not sure how this will change things; it might cause less contention, but blocking threads may slow things down with the built-in GC. Tests are on Windows using 32-bit code.

Cheers
Adam

----

import std.stdio;
import std.datetime;
import std.concurrency;
import core.memory;
import core.thread;

struct LinkedList16 {
    int value = 0;
    LinkedList16* next;
}

struct LinkedList32 {
    int value = 0;
    LinkedList32* next;
    byte[16] padding;
}

struct LinkedList64 {
    int value = 0;
    LinkedList64* next;
    byte[32] padding;
}

struct LinkedList512 {
    int value = 0;
    LinkedList512* next;
    byte[256] padding;
}

shared int threadCount = 0;

void main() {
    core.memory.GC.disable();
    doTest!LinkedList16;
    doTest!LinkedList32;
    doTest!LinkedList64;
    doTest!LinkedList512;
    GC.collect();
    doTest!LinkedList512;
    doTest!LinkedList64;
    doTest!LinkedList32;
    doTest!LinkedList16;
}

void doTest(T)() {
    auto start = Clock.currSystemTick();
    foreach (i; 0 .. 12) {
        auto tid = spawn(&doSomething!T, thisTid);
        threadCount++;
    }
    while (threadCount > 0) {}
    auto ln = Clock.currSystemTick() - start;
    writeln(ln.msecs, " milliseconds");
}

void doSomething(T)(Tid tid) {
    auto top = new T;
    auto recent = top;
    // Create a linked list.
    foreach (i; 1 .. 100_000) {
        auto newList = new T;
        newList.value = i % 128;
        recent.next = newList;
        recent = newList;
    }
    // Sum the values (just spends some time walking the memory).
    recent = top;
    long total = 0;
    while (recent !is null) {
        total += recent.value;
        recent = recent.next;
    }
    //writeln("Total : ", total);
    threadCount--;
}
Jun 10 2014
> Only a little. In scripts where I deliberately introduce contention my allocator is quicker. It's much closer when there is little contention.

Very interesting... It would be nice to see single-threaded benchmarks as well, for completeness.

Thx
Jun 11 2014
On 2014-06-10 5:53 AM, Adam Sakareassen via Digitalmars-d wrote:

> Yes my little GC project is coming along. It allocates much faster in multi threaded applications when all threads are competing for the lock. [...] My lock-free allocator fixes all this trouble.

Could you share a fork containing the source code to this? I'm very curious as to how you've implemented it.
Jun 12 2014