digitalmars.D - The GC and performance, but not what you expect

Atila Neves (34/34) May 29 2014 So after the last day of DConf was over, back at the Aloft Brian

Jacob Carlborg (5/9) May 29 2014 Doesn't the runtime know how many threads currently exist? If it's only

Robert Schadek via Digitalmars-d (4/11) May 29 2014 properly, but collections is not atomar and you have to prevent threads

Jacob Carlborg (5/8) May 29 2014 The GC already stops the world, can a thread then be created during a
Walter Bright (3/17) May 29 2014 If it's single threaded, and the single thread is doing the collecting, ...

Sean Kelly (7/9) May 29 2014 class Foo {

Steven Schveighoffer (5/15) May 29 2014 It's my understanding that dtors are run without the lock held. Is that ...
Marco Leise (6/17) May 29 2014 Nice try, but destructors called by the GC are currently

Sean Kelly (4/6) Jun 10 2014 When did that happen? Some effort was made at one point to

Joakim (3/10) Jun 10 2014 More than two years ago, possibly longer:

Sean Kelly (4/15) Jun 10 2014 That's a different issue, which is that the current GC isn't

Joakim (4/21) Jun 10 2014 I believe the mechanism is that the assert is triggered, it tries

Rainer Schuetze (9/15) Jun 10 2014 I don't think that ever worked, at least for the last couple of years.

Wanderer (11/11) May 29 2014 It will be hard to beat Java's garbage collector. They use

Shammah Chancellor (3/15) May 29 2014 I was under the impression that the D gc does move objects around on

Jacob Carlborg (5/7) May 29 2014 It does not. It would require barriers (read or write, don't remember

Andrei Alexandrescu (6/16) May 29 2014 Thing is the allocation patterns in D and Java are mightily different.

Brad Anderson (4/10) May 30 2014 For those that haven't seen this pull request Etienne Cimon has

safety0ff (4/9) May 29 2014 I would have seen it coming if the app. was multi-threaded.

Marco Leise (7/19) May 29 2014 Orvid has been inspired by TCMalloc. If he can realize his

Steven Schveighoffer (15/19) May 29 2014 Just checked. It used to be a long time ago that single threaded apps di...

=?UTF-8?B?QWxpIMOHZWhyZWxp?= (7/10) May 29 2014 https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc...

Steven Schveighoffer (9/20) May 29 2014 They should be executed if the memory is collected.
Atila Neves (10/22) May 30 2014 I tried modifying the runtime but unfortunately vibe.d doesn't

Martin Nowak (4/6) May 29 2014 This is the best solution, get rid of the locking for most

Brian Schott (3/5) May 29 2014 On the topic of perf, I found a stupid trick the other day and

Marco Leise (6/12) May 29 2014 But does that beat simply attaching the debugger to the
=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (5/7) May 30 2014 Great tip!

Sean Kelly (5/9) May 29 2014 This is sadly unavoidable with the current GC. Simply put, it's
Rainer Schuetze (7/11) May 30 2014 A lock should not be more than a CAS operation that always succeeds in a...

Marco Leise (10/27) May 30 2014 I recently tried to write a single-type memory allocator and

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (10/14) May 30 2014 22 cycles latency if on a valid cacheline?

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (5/5) May 30 2014 This is also one of many good reasons to not make allocators
Marco Leise (9/26) May 30 2014 Am Fri, 30 May 2014 15:54:57 +0000

Kagamin (2/2) May 31 2014 The design of TCMalloc seems to be compatible with Boehm GC, so D

Marco Leise (12/14) Jun 01 2014 Did I mention that Orvid planned on doing that for his

Adam Sakareassen via Digitalmars-d (27/39) Jun 10 2014 Hi,

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (5/8) Jun 10 2014 Have you benchmarked your lock-free GC against D's builtin with

Adam Sakareassen via Digitalmars-d (113/116) Jun 10 2014 Only a little. In scripts where I deliberately introduce contention my

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (4/7) Jun 11 2014 Very interesting...

Etienne (3/12) Jun 12 2014 Could you share a fork containing the source code to this? I'm very

"Atila Neves" <atila.neves gmail.com> writes:

So after the last day of DConf was over, back at the Aloft Brian
Schott taught me and Jonathan Crapuchettes about perf
(https://perf.wiki.kernel.org/index.php/Main_Page). If you're
profiling binaries on Linux, this thing is a must have and I have
no idea how I'd never heard about it before.

Anyway, I needed a program to test it out with, and since the
last time I had something that didn't perform as well as I wanted
it to and the dmd profiler didn't help me much was my MQTT
broker, I used it on that.

Now, that broker beat the other language implementations on my
shootout on throughput by a country mile
(http://atilanevesoncode.wordpress.com/2014/01/08/adding-java-and-c-to-the-mqtt-benchmarks-or-how-i-learned-to-stop-worrying-and-love-the-garbage-collector/)
but it lagged behind the winner (Java) in throughput-constrained
latency. I put perf to work on the D/vibe.d and Java versions and
unsurprisingly it told me that the CPU was idle a lot more of the
time for D than it was for Java. Digging in deeper, the main
hotspot according to perf was pthread_mutex_lock. That surprised
me since the program is single-threaded (and in fact runs slower
when multiple threads are used). vibe.d turned out to be
responsible for part of that. I posted on the vibe.d forum and
Sonke promply removed all the pesky locking. Results improved,
but still 6% behind Java.

I carried on and removed unnecessary array allocations from my
code. Before perf I had no idea these were contributing to
slowing the program down. With that change, Java beats D by 3%.

I've run out of optimisations to make (I think), and
pthread_mutex_lock is still top of the list, called by
_D4core4sync5mutex5Mutex4lockMFNeZv, which in turn is called by
_D2gc2gc2GC6mallocMFmkPmZPv.

The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.

Atila

May 29 2014

Jacob Carlborg <doob me.com> writes:

On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

Doesn't the runtime know how many threads currently exist? If it's only 
one, could the GC avoid the locking?

-- 
/Jacob Carlborg

May 29 2014

Robert Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

 Doesn't the runtime know how many threads currently exist? If it's
 only one, could the GC avoid the locking?

properly, but collections is not atomar and you have to prevent threads
that are created during collection from collecting. so you still need to
lock

May 29 2014

Jacob Carlborg <doob me.com> writes:

On 2014-05-29 13:01, Robert Schadek via Digitalmars-d wrote:

 properly, but collections is not atomar and you have to prevent threads
 that are created during collection from collecting. so you still need to
 lock

The GC already stops the world, can a thread then be created during a 
collection?

-- 
/Jacob Carlborg

May 29 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/29/2014 4:01 AM, Robert Schadek via Digitalmars-d wrote:
 On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

 Doesn't the runtime know how many threads currently exist? If it's
 only one, could the GC avoid the locking?

 properly, but collections is not atomar and you have to prevent threads
 that are created during collection from collecting. so you still need to
 lock


If it's single threaded, and the single thread is doing the collecting, who is 
starting up a new thread?

May 29 2014

"Sean Kelly" <sean invisibleduck.org> writes:

On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the 
 collecting, who is starting up a new thread?

class Foo {
      ~this() {
          auto t = new Thread({ auto a = new char[100]; });
          t.start();
      }
}

May 29 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 29 May 2014 16:01:16 -0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the collecting,  
 who is starting up a new thread?

 class Foo {
       ~this() {
           auto t = new Thread({ auto a = new char[100]; });
           t.start();
       }
 }

It's my understanding that dtors are run without the lock held. Is that no  
longer true?

-Steve

May 29 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 29 May 2014 20:01:16 +0000
schrieb "Sean Kelly" <sean invisibleduck.org>:

 On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 If it's single threaded, and the single thread is doing the 
 collecting, who is starting up a new thread?

 
 class Foo {
       ~this() {
           auto t = new Thread({ auto a = new char[100]; });
           t.start();
       }
 }

Nice try, but destructors called by the GC are currently
effectively  nogc. So don't try that at home.

-- 
Marco

May 29 2014

"Sean Kelly" <sean invisibleduck.org> writes:

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.

When did that happen?  Some effort was made at one point to 
ensure that allocations worked from dtors.  Not that I'm in 
favor, but...

Jun 10 2014

"Joakim" <dlang joakim.airpost.net> writes:

On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.

 When did that happen?  Some effort was made at one point to 
 ensure that allocations worked from dtors.  Not that I'm in 
 favor, but...

More than two years ago, possibly longer:

https://issues.dlang.org/show_bug.cgi?id=7349

Jun 10 2014

"Sean Kelly" <sean invisibleduck.org> writes:

On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:
 On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.

 When did that happen?  Some effort was made at one point to 
 ensure that allocations worked from dtors.  Not that I'm in 
 favor, but...

 More than two years ago, possibly longer:

 https://issues.dlang.org/show_bug.cgi?id=7349

That's a different issue, which is that the current GC isn't
necessarily exception safe and throwing from a dtor violates
this.  I don't see any mention there about allocating in a dtor.

Jun 10 2014

"Joakim" <dlang joakim.airpost.net> writes:

On Tuesday, 10 June 2014 at 17:58:56 UTC, Sean Kelly wrote:
 On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:
 On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.

 When did that happen?  Some effort was made at one point to 
 ensure that allocations worked from dtors.  Not that I'm in 
 favor, but...

 More than two years ago, possibly longer:

 https://issues.dlang.org/show_bug.cgi?id=7349

 That's a different issue, which is that the current GC isn't
 necessarily exception safe and throwing from a dtor violates
 this.  I don't see any mention there about allocating in a dtor.

I believe the mechanism is that the assert is triggered, it tries 
to allocate, then you get the invalid memory operation error.  
Exactly why it allocates, I'm not sure.

Jun 10 2014

Rainer Schuetze <r.sagitario gmx.de> writes:

On 10.06.2014 17:15, Sean Kelly wrote:
 On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:
 Nice try, but destructors called by the GC are currently
 effectively  nogc. So don't try that at home.

 When did that happen?  Some effort was made at one point to ensure that
 allocations worked from dtors.  Not that I'm in favor, but...

I don't think that ever worked, at least for the last couple of years.

Threads run during sweep, and an allocation during the collection might 
pass the gc-lock because it allows recursive locking in the same thread, 
but if the allocation hits an empty free-list, it will try to run 
another collection which will cause the InvalidMemoryOperationError.

If it happens to allocate from a free-list, it will likely cause memory 
corruption, because sweeping won't know it has been allocated (mark and 
freebits not updated), and put it back into a free list.

Jun 10 2014

"Wanderer" <no-reply no-reply.org> writes:

It will be hard to beat Java's garbage collector. They use 
generational, sequential memory approach, while D's memory 
manager (correct me if I'm wrong) uses more common approach and 
has to deal with possible memory fragmentation. Allocating a new 
object is extremely cheap in Java, literally a few machine 
instructions, and there is no locking because each thread has its 
own eden memory and they never overlap.

Unless D introduces the possibility to move objects around in 
heap (which is not possible with the support of pointers, I 
guess) and eliminate the fragmentation, it will be hard to make 
GC as fast. Just 3% of difference look great enough to me. :-)

May 29 2014

Shammah Chancellor <anonymous coward.com> writes:

On 2014-05-29 11:03:11 +0000, Wanderer said:

 It will be hard to beat Java's garbage collector. They use 
 generational, sequential memory approach, while D's memory manager 
 (correct me if I'm wrong) uses more common approach and has to deal 
 with possible memory fragmentation. Allocating a new object is 
 extremely cheap in Java, literally a few machine instructions, and 
 there is no locking because each thread has its own eden memory and 
 they never overlap.
 
 Unless D introduces the possibility to move objects around in heap 
 (which is not possible with the support of pointers, I guess) and 
 eliminate the fragmentation, it will be hard to make GC as fast. Just 
 3% of difference look great enough to me. :-)

I was under the impression that the D gc does move objects around on 
the heap and then update the pointers.

May 29 2014

Jacob Carlborg <doob me.com> writes:

On 2014-05-29 14:18, Shammah Chancellor wrote:

 I was under the impression that the D gc does move objects around on the
 heap and then update the pointers.

It does not. It would require barriers (read or write, don't remember 
which) and my people here are against that.

-- 
/Jacob Carlborg

May 29 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/29/14, 4:03 AM, Wanderer wrote:
 It will be hard to beat Java's garbage collector. They use generational,
 sequential memory approach, while D's memory manager (correct me if I'm
 wrong) uses more common approach and has to deal with possible memory
 fragmentation. Allocating a new object is extremely cheap in Java,
 literally a few machine instructions, and there is no locking because
 each thread has its own eden memory and they never overlap.

 Unless D introduces the possibility to move objects around in heap
 (which is not possible with the support of pointers, I guess) and
 eliminate the fragmentation, it will be hard to make GC as fast. Just 3%
 of difference look great enough to me. :-)

Thing is the allocation patterns in D and Java are mightily different. 
Java does need a good GC because it allocates everywhere for everything. 
I think we need to come up with our own measurements and statistics when 
it comes about collecting D's litter.

Andrei

May 29 2014

"Brad Anderson" <eco gnuk.net> writes:

On Friday, 30 May 2014 at 05:45:52 UTC, Andrei Alexandrescu wrote:
 Thing is the allocation patterns in D and Java are mightily 
 different. Java does need a good GC because it allocates 
 everywhere for everything. I think we need to come up with our 
 own measurements and statistics when it comes about collecting 
 D's litter.

 Andrei

For those that haven't seen this pull request Etienne Cimon has 
started some work on adding statistics to the GC:

https://github.com/D-Programming-Language/druntime/pull/803

May 30 2014

"safety0ff" <safety0ff.dev gmail.com> writes:

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that
 one coming.

I would have seen it coming if the app. was multi-threaded.

Hopefully a re-write of the GC would include something akin to 
Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.

May 29 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 29 May 2014 12:35:46 +0000
schrieb "safety0ff" <safety0ff.dev gmail.com>:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that
 one coming.

 
 I would have seen it coming if the app. was multi-threaded.
 
 Hopefully a re-write of the GC would include something akin to 
 Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.

Orvid has been inspired by TCMalloc. If he can realize his
idea of a GC it should have a similar thread local allocation
heap with no locking.

-- 
Marco

May 29 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 29 May 2014 06:09:17 -0400, Atila Neves <atila.neves gmail.com>  
wrote:


 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

Just checked. It used to be a long time ago that single threaded apps did  
not take the lock. It appears not to be the case any more, the lock is  
always taken.

The explanation in the comment is that finalizers may spawn a new thread.  
See  
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410

Surely we can figure this out...

More and more, it's looking like we are going to start needing  
thread-local pools for thread-local allocations.

To work around, you might be able to pre-allocate all your data before  
starting, and make judicious re-use of buffers. But of course, this may  
make the code so much more ugly, which it shouldn't have to be.

-Steve

May 29 2014

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a new
 thread. See
 

https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410 


The strange thing is, finalizers may not even be executed. :-/ I would 
simply change the spec saying "finalizers may not start a new thread."

Getting back to Atila's issue, I would simply comment-out that lock in 
druntime, beat Java, and then put the lock back in. :o)

Ali

May 29 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 29 May 2014 14:42:19 -0400, Ali =C3=87ehreli <acehreli yahoo.com=
 wrote:

 On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

  > The explanation in the comment is that finalizers may spawn a new
  > thread. See
  >  =

 https://github.com/D-Programming-Language/druntime/blob/master/src/gc/=

gc.d#L410  =

 The strange thing is, finalizers may not even be executed.

They should be executed if the memory is collected.

 :-/ I would simply change the spec saying "finalizers may not start a =

 =

 new thread."

Probably a good idea :)

 Getting back to Atila's issue, I would simply comment-out that lock in=

  =

 druntime, beat Java, and then put the lock back in. :o)

Probably the sanest option, but then you can't really say that D beats  =

Java :)

-Steve

May 29 2014

"Atila Neves" <atila.neves gmail.com> writes:

I tried modifying the runtime but unfortunately vibe.d doesn't
compile with the latest dmd. I don't feel like fixing it or
figuring out how to compile gdc either. As mentioned later, I
wouldn't be able to claim to beat Java, but I would be able to
see how much faster it would go without the lock, though. Oh well.

Also, there was a bit too much focus on Java on this thread. The
fact is that the C implementation is faster than D as well, and
just as fast as Java.

Atila

On Thursday, 29 May 2014 at 18:42:20 UTC, Ali Çehreli wrote:
 On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a

 new
 thread. See
 

 https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410


 The strange thing is, finalizers may not even be executed. :-/ 
 I would simply change the spec saying "finalizers may not start 
 a new thread."

 Getting back to Atila's issue, I would simply comment-out that 
 lock in druntime, beat Java, and then put the lock back in. :o)

 Ali

May 30 2014

"Martin Nowak" <code dawg.eu> writes:

On Thursday, 29 May 2014 at 13:06:20 UTC, Steven Schveighoffer
wrote:
 More and more, it's looking like we are going to start needing 
 thread-local pools for thread-local allocations.

This is the best solution, get rid of the locking for most
allocations even in multi-threaded programs.

May 29 2014

"Brian Schott" <briancschott gmail.com> writes:

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 If you're profiling binaries on Linux, this thing is a must 
 have and I have no idea how I'd never heard about it before.

On the topic of perf, I found a stupid trick the other day and 
wrote it down on the wiki: http://wiki.dlang.org/Perf

May 29 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 29 May 2014 18:00:13 +0000
schrieb "Brian Schott" <briancschott gmail.com>:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 If you're profiling binaries on Linux, this thing is a must 
 have and I have no idea how I'd never heard about it before.

 
 On the topic of perf, I found a stupid trick the other day and 
 wrote it down on the wiki: http://wiki.dlang.org/Perf

But does that beat simply attaching the debugger to the
process and pausing it?

-- 
Marco

May 29 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

 On the topic of perf, I found a stupid trick the other day and 
 wrote it down on the wiki: http://wiki.dlang.org/Perf

Great tip!

Now I want bash completion for it. But I can't seem to get it by 
reusing _kill in bash-completion:

https://superuser.com/questions/760613/reusing-kill-completion-in-bashrc-fails

Ideas anyone?

May 30 2014

"Sean Kelly" <sean invisibleduck.org> writes:

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see 
 that one coming.

This is sadly unavoidable with the current GC.  Simply put, it's
possible for a finalizer to spawn a thread, and so the GC always
has to lock before collecting, even if the program is
single-threaded at the beginning of the collection cycle.

May 29 2014

Rainer Schuetze <r.sagitario gmx.de> writes:

On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

A lock should not be more than a CAS operation that always succeeds in a 
single threaded application, but OS overhead might be considerable.

Adam Sakareassen has recently announced a rewrite of the GC that has 
lock-free allocations: 
http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com

Unfortunately, it isn't available yet...

May 30 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Fri, 30 May 2014 10:29:58 +0200
schrieb Rainer Schuetze <r.sagitario gmx.de>:

 
 
 On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

 
 A lock should not be more than a CAS operation that always succeeds in a 
 single threaded application, but OS overhead might be considerable.
 
 Adam Sakareassen has recently announced a rewrite of the GC that has 
 lock-free allocations: 
 http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com
 
 Unfortunately, it isn't available yet...

I recently tried to write a single-type memory allocator and
thought I'd come out faster than TCMalloc due to its
simplicity. But as soon as I added a single CAS I was already
over the time that TCMalloc needs. That way I learned that CAS
is not as cheap as it looks and the fastest allocators work
thread local as long as possible.

-- 
Marco

May 30 2014

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:

On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
 simplicity. But as soon as I added a single CAS I was already
 over the time that TCMalloc needs. That way I learned that CAS
 is not as cheap as it looks and the fastest allocators work
 thread local as long as possible.

22 cycles latency if on a valid cacheline?
+ overhead of going to memory

Did you try to add explicit prefetch, maybe that would help?

Prefetch is expensive on Ivy Brigde (43 cycles throughput, 0.5 
cycles on Haswell). You need instructions to fill the pipeline 
between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM 
and do a lot of testing on different CPUs. Explicit prefetching, 
lock free strategies etc are tricky to get right. Get it wrong 
and it is worse than the naive implementation.

May 30 2014

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:

This is also one of many good reasons to not make allocators 
library based, but do it in the compiler and backend. That also 
allows you do allocate multiple objects in a single CAS. (i.e. 
the compiler collects multiple allocs calls and replace it with 
one)

May 30 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Fri, 30 May 2014 15:54:57 +0000
schrieb "Ola Fosheim Gr=C3=B8stad"
<ola.fosheim.grostad+dlang gmail.com>:

 On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
 simplicity. But as soon as I added a single CAS I was already
 over the time that TCMalloc needs. That way I learned that CAS
 is not as cheap as it looks and the fastest allocators work
 thread local as long as possible.

=20
 22 cycles latency if on a valid cacheline?
 + overhead of going to memory
=20
 Did you try to add explicit prefetch, maybe that would help?
=20
 Prefetch is expensive on Ivy Brigde (43 cycles throughput, 0.5=20
 cycles on Haswell). You need instructions to fill the pipeline=20
 between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM=20
 and do a lot of testing on different CPUs. Explicit prefetching,=20
 lock free strategies etc are tricky to get right. Get it wrong=20
 and it is worse than the naive implementation.

I'm on a Core 2 Duo. But this doesn't sound like I want to try
it. core.atomic is as low as I wanted to go. Anyway I deleted
that code when I realized just how fast allocation is with
TCMalloc already. And that's a general purpose allocator.

--=20
Marco

May 30 2014

"Kagamin" <spam here.lot> writes:

The design of TCMalloc seems to be compatible with Boehm GC, so D 
GC can probably use same ideas.

May 31 2014

Marco Leise <Marco.Leise gmx.de> writes:

Am Sat, 31 May 2014 08:52:34 +0000
schrieb "Kagamin" <spam here.lot>:

 The design of TCMalloc seems to be compatible with Boehm GC, so D 
 GC can probably use same ideas.

Did I mention that Orvid planned on doing that for his
replacement GC?
It looks like there are now a lot of proof of concept GC
modifications by several people. I think that's ok for the
benefit of exploring possibly incompatible changes and
trade-offs. A too early joint effort could mean that some
ideas cannot be explored to their full extent with the typical
myths about how good or bad some idea would have turned out.

-- 
Marco

Jun 01 2014

Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:

Hi,

Yes my little GC project is coming along.  It allocates much faster in 
multi threaded applications when all threads are competing for the lock. 
  D's current GC is very slow when there is contention for the GC-lock 
which is acquired for every malloc.  The other hidden lock usage is when 
a dynamic array increases in size.  A significant problem is that the 
thread holding the lock might be pre-empted by the OS, causing all 
threads to wait until it is running again.  My lock-free allocator fixes 
all this trouble.

(Note: On Windows the current GC uses EnterCriticalSection from 
kernel32.lib)

In single threaded applications, D's GC performs much better.  It still 
has the overhead of maintaining a free list while allocating. The free 
list is built when requesting pools from the OS.  This list is stored as 
a linked list using the fresh memory cells.  A cell is zerod when it is 
allocated.

My implementation is slightly faster at allocations because it doesn't 
need to zero the memory before before. It only maintains a such a free 
list for recycled memory.

I'll try to make some code available soon for people to try if they are 
interested.  There are still some jobs I need to do.  I'm just 
implementing the utility functions now (sizeOf, addrOf) etc.   Allocated 
memory is currently never returned to the OS, just recycled for new 
allocations.

Cheers
Adam








On 30/05/2014 6:29 PM, Rainer Schuetze via Digitalmars-d wrote:
 On 29.05.2014 12:09, Atila Neves wrote:
 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

 A lock should not be more than a CAS operation that always succeeds in a
 single threaded application, but OS overhead might be considerable.

 Adam Sakareassen has recently announced a rewrite of the GC that has
 lock-free allocations:
 http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmars-d puremagic.com


 Unfortunately, it isn't available yet...

Jun 10 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

 Yes my little GC project is coming along.  It allocates much 
 faster in multi threaded applications when all threads are 
 competing for the lock.

Have you benchmarked your lock-free GC against D's builtin with
different allocation sizes? I've seen comments on the need for
that...

Anyhow very exciting times - I'm eagerly awaiting your next GC
news update!

Jun 10 2014

Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 11/06/2014 6:39 AM, "Nordlöw" via Digitalmars-d wrote:
 Have you benchmarked your lock-free GC against D's builtin with
 different allocation sizes? I've seen comments on the need for
 that...

Only a little.  In scripts where I deliberately introduce contention my 
allocator is quicker. It's much closer when there is little contention. 
  This little test script uses 12 threads to do 100,000 allocations 
each.  It does a run for the sizes:
(16, 32, 64, 512, 512, 64, 32, 16).  (script attached below)

With DMD's built in GC
407 milli seconds (16 bytes)
409 milli seconds (32 bytes)
432 milli seconds (64 bytes)
638 milli seconds (512 bytes)
-collect-
633 milli seconds (512 bytes)
422 milli seconds (64 bytes)
418 milli seconds (32 bytes)
394 milli seconds (16 bytes)

My GC
22 milli seconds
18 milli seconds
27 milli seconds
293 milli seconds
-collect-
194 milli seconds
133 milli seconds
60 milli seconds
62 milli seconds

The timing of the collect is not included because it's not fair to 
benchmark as my GC does not minimise yet.  This means my GC could easily 
run out of memory sooner.  (a job for later)

Some interesting observations are:

512 size allocations don't come from thread specific arenas, so there is 
more contention for the CAS operation.

The allocations after the collection would be mostly from the free list 
so they are slower than the initial fresh allocations.  (List 
maintenance, and zeroing of memory required.  This is the same as the 
built-in GC.)  The only real difference in the speeds after the collect 
is lock contention.

I have a 6 core cpu (hyper threaded to 12).  So contention levels may be 
different on machines with less cores. Not sure how this will change 
things.  It might cause less contention, but blocking threads may slow 
things down on the built-in GC.   Tests are on Windows using 32 bit code.

Cheers
Adam


----

import std.stdio;
import std.datetime;
import std.concurrency;
import core.memory;
import core.thread;

struct LinkedList16{
	int value = 0;
	LinkedList16* next;
}

struct LinkedList32{
	int value = 0;
	LinkedList32* next;
	byte[16] padding;
}

struct LinkedList64{
	int value = 0;
	LinkedList64* next;
	byte[32] padding;
}

struct LinkedList512{
	int value = 0;
	LinkedList512* next;
	byte[256] padding;
}

shared int threadCount = 0;
void main(){
	core.memory.GC.disable();
	doTest!LinkedList16;
	doTest!LinkedList32;
	doTest!LinkedList64;
	doTest!LinkedList512;
	GC.collect();
	doTest!LinkedList512;
	doTest!LinkedList64;
	doTest!LinkedList32;
	doTest!LinkedList16;
	
}

void doTest(T)(){
	auto start = Clock.currSystemTick();
	foreach(i; 0 .. 12){
		auto tid = spawn(&doSomething!T, thisTid);
		threadCount++;
	}
	while(threadCount >0){};

	auto ln = Clock.currSystemTick() - start;
	writeln(ln.msecs, " milli seconds");
}

void doSomething(T)(Tid tid){
	auto top = new T;
	auto recent = top;

	//Create a linked list
	foreach(i; 1 .. 100_000){
		auto newList = new T;
		newList.value = i % 128;
		
		recent.next = newList;
		recent = newList;
	}

	//Sum the values.  (Just spends some time walking the memory).
	recent = top;
	long total=0;
	while(recent !is null){
		total += recent.value;
		recent = recent.next;
	}
	//writeln("Total : ", total );
	threadCount--;
}

Jun 10 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

 Only a little.  In scripts where I deliberately introduce 
 contention my allocator is quicker. It's much closer when there 
 is little contention.

Very interesting...

It would be nice to see single-threaded benchmarks aswell for 
completeness.

Thx-

Jun 11 2014

Etienne <etcimon gmail.com> writes:

On 2014-06-10 5:53 AM, Adam Sakareassen via Digitalmars-d wrote:
 Hi,

 Yes my little GC project is coming along.  It allocates much faster in
 multi threaded applications when all threads are competing for the lock.
   D's current GC is very slow when there is contention for the GC-lock
 which is acquired for every malloc.  The other hidden lock usage is when
 a dynamic array increases in size.  A significant problem is that the
 thread holding the lock might be pre-empted by the OS, causing all
 threads to wait until it is running again.  My lock-free allocator fixes
 all this trouble.

Could you share a fork containing the source code to this? I'm very 
curious as to how you've implemented it.

Jun 12 2014

D Programming

C/C++ Programming

Other

digitalmars.D - The GC and performance, but not what you expect