
digitalmars.D - GC experiments. Writing my own GC.

reply Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:
Hi all,

As a learning exercise I've just been doing some experimenting with 
rewriting the garbage collection code, and thought I might share some of 
the initial results.  I only program as a hobby these days, and I'm 
certainly no expert, but I thought some people might find it interesting.

My interest started because I wrote an LR(1) file parser in D.  I then 
multi-threaded the application so multiple files could be parsed 
simultaneously (disk IO was all on one thread).  To my surprise, 
throughput dropped significantly; I could process the files a lot 
faster using only one thread.  It turns out the delays were due to my 
liberal use of the “new” operator.  Rather than block-allocate the 
memory I thought I would just hack the GC, so I cleared out gc.d in the 
runtime and started again.  The basic plan was to make it more 
multi-thread friendly.  I have already learnt quite a lot, and there 
are a number of things I would do differently if I tried another 
rewrite.  However, so far it has been a good learning experience.

Currently, allocations are all working and the mark phase is running. 
It does not yet sweep and free the memory.

All memory allocations are entirely lock free (using CAS instructions), 
so a pre-empted thread will never block another.  For allocations of 
less than 128 bytes, each thread is allocated memory from its own 
memory pool to avoid false sharing in the CPU's cache.  The collector 
component runs on a background thread using a mark-and-sweep algorithm 
that is basically the same as the existing one.  Currently the 
thread wakes up every 100ms and decides whether a collection should be 
performed.  An emergency collection runs in the foreground if a 
memory allocation fails during that period.
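The lock-free allocation path can be illustrated with a CAS-based bump allocator.  This is only a minimal sketch under stated assumptions (the names `Pool` and `allocate` are hypothetical), not the actual code from this experiment:

```d
import core.atomic;

// Hypothetical fixed-size pool; a real GC would chain pools and
// fall back to requesting more memory from the OS.
struct Pool
{
    ubyte[] memory;       // backing storage for this pool
    shared size_t offset; // current bump pointer, updated with CAS

    // Lock-free allocation: retry the CAS until we win the race.
    void* allocate(size_t size)
    {
        for (;;)
        {
            size_t old = atomicLoad(offset);
            size_t next = old + size;
            if (next > memory.length)
                return null; // pool exhausted; a collection is needed
            // Only one racing thread succeeds; losers retry with the
            // updated offset, so a pre-empted thread never blocks others.
            if (cas(&offset, old, next))
                return memory.ptr + old;
        }
    }
}

void main()
{
    auto pool = Pool(new ubyte[1024]);
    auto a = pool.allocate(128);
    auto b = pool.allocate(128);
    assert(a !is null && b !is null && a !is b);
}
```

A real per-thread pool would skip the CAS entirely on the fast path, which is presumably where the sub-128-byte pools gain most of their speed.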

The mark phase needs to stop the world; the sweeping portion of the 
collection runs in the background.  This is similar to the current 
implementation in that the world is restarted after the mark phase. 
However, the thread doing the collection will not hand the requested 
memory to the calling thread until after the sweep has completed, which 
means that single-threaded applications always wait for the full 
garbage collection cycle.
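The collection cycle described above might be sketched as follows.  The phase functions (`stopTheWorld`, `markAll`, `sweepAll`) are hypothetical stubs standing in for the real work:

```d
import core.atomic;
import core.thread;
import core.time;

shared bool running = true;
shared bool needCollection;

// Stubs standing in for the real phases (hypothetical names).
void stopTheWorld()  { /* suspend mutator threads */ }
void markAll()       { /* trace reachable objects */ }
void startTheWorld() { /* resume mutator threads */ }
void sweepAll()      { /* reclaim unmarked memory */ }

// Background collector: wake every 100ms; the world is stopped only
// for the mark phase, and sweeping overlaps with resumed mutators.
void collectorLoop()
{
    while (atomicLoad(running))
    {
        Thread.sleep(100.msecs);
        if (!cas(&needCollection, true, false))
            continue;    // nothing requested this tick
        stopTheWorld();
        markAll();
        startTheWorld(); // mutators resume here...
        sweepAll();      // ...while the sweep runs in the background
    }
}

void main()
{
    atomicStore(needCollection, true);
    auto t = new Thread(&collectorLoop);
    t.start();
    Thread.sleep(250.msecs);
    atomicStore(running, false);
    t.join();
    assert(!atomicLoad(needCollection)); // the tick consumed the request
}
```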

So far allocation speed seems to have improved.  I can't test collection 
speed as it's not complete.  As a test I wrote a simple function that 
allocates a linked list of 2 million items.  This function is then 
spawned by 20 threads.  This test script is shown below.  Timing for 
allocation (with GC disabled) is as follows.  (Using DMD 2.065)

Existing GC code:  15700ms (average)
My GC code:   500ms (Average)

When performing the same amount of allocations on a single thread, the 
new code is still slightly faster than the old.

What this demonstrates is that the locking mechanisms in the current GC 
code are a huge overhead for multi-threaded applications that perform a 
lot of memory allocations (i.e. use the “new” operator or dynamic 
arrays).
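As an aside, until the allocator improves, an application can sidestep some of that lock contention itself by backing many objects with a single GC allocation.  A sketch using `std.conv.emplace` (the `allocateSlab` helper is hypothetical, not part of the GC rewrite):

```d
import std.conv : emplace;

class Node
{
    long value;
    Node next;
}

// One GC allocation backs many Node instances, so only one trip
// through the allocator's lock is paid per slab rather than per node.
Node[] allocateSlab(size_t n)
{
    enum size = __traits(classInstanceSize, Node);
    auto slab = new void[size * n]; // single allocation for all nodes
    auto nodes = new Node[n];
    foreach (i; 0 .. n)
        nodes[i] = emplace!Node(slab[i * size .. (i + 1) * size]);
    // The returned references point into the slab, which keeps it alive.
    return nodes;
}

void main()
{
    auto nodes = allocateSlab(1000);
    foreach (i, node; nodes)
        node.value = i;
    assert(nodes[999].value == 999);
}
```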

It would be nice to see the default GC and memory allocator improved. 
There is certainly room for improvement on the allocator end which may 
mask some of the performance issues associated with garbage collection.

In the future I think D needs to look at making collection precise.  It 
would not be too hard to adjust the mark and sweep GC to be nearly 
precise.  The language needs to support precise GC before things like 
moving garbage collection become feasible.
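Conceptually, a nearly precise mark phase needs a per-type map saying which words of an object may hold pointers, so the marker only follows real references.  A toy sketch (hypothetical layout, not druntime's actual type-info format):

```d
// Hypothetical per-type metadata for precise scanning: one bit per
// word of the object, set where that word may hold a pointer.
struct PointerMap
{
    size_t wordCount; // size of the object in words
    ulong bits;       // bit i set => word i may be a pointer
}

// The marker consults the map instead of conservatively treating
// every word as a potential pointer.
bool mayBePointer(const PointerMap m, size_t wordIndex)
{
    return wordIndex < m.wordCount && ((m.bits >> wordIndex) & 1) != 0;
}

void main()
{
    // For a type like `class LinkedList { long value; LinkedList next; }`
    // (ignoring the class header), only the `next` slot is a pointer.
    auto list = PointerMap(2, 0b10);
    assert(!mayBePointer(list, 0)); // `value` is never scanned
    assert(mayBePointer(list, 1));  // `next` is always followed
}
```

With such maps in place, the collector knows every live pointer's exact location, which is the precondition for moving objects.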

Anyway, I just thought I'd share the results of my experimenting.  I 
would be happy to make the code available in a few weeks' time.  Perhaps 
someone might find it useful.  I need to get it finished and tested 
first. :-)

Cheers!
Adam

------
//Test script that generated these results:
import std.stdio;
import std.datetime;
import std.concurrency;
import core.atomic;
import core.memory;

class LinkedList{
	long value = 0;
	LinkedList next;
}

shared int threadCount = 0;

void main(){
	core.memory.GC.disable();
	auto start = Clock.currSystemTick();

	foreach(i; 0 .. 20){
		spawn(&doSomething, thisTid);
		atomicOp!"+="(threadCount, 1);
	}

	//Busy-wait until every worker has finished.
	while(atomicLoad(threadCount) > 0){}

	auto ln = Clock.currSystemTick() - start;
	writeln(ln.msecs, "ms");
}

void doSomething(Tid tid){
	auto top = new LinkedList;
	auto recent = top;

	//Create the linked list of 2 million items.
	foreach(i; 1 .. 2_000_000){
		auto newList = new LinkedList;
		newList.value = i;
		recent.next = newList;
		recent = newList;
	}

	//Sum the values.  (Just spends some time walking the memory.)
	recent = top;
	long total = 0;
	while(recent !is null){
		total += recent.value;
		recent = recent.next;
	}
	writeln("Total : ", total);
	atomicOp!"-="(threadCount, 1);
}
May 12 2014
next sibling parent "FrankLike" <1150015857 qq.com> writes:
 Existing GC code:  15700ms (average)
 My GC code:   500ms (Average)
Congratulations! Can you share your good work with us? Or an exe? A dll? Thank you. Frank
May 12 2014
prev sibling next sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
This sounds pretty cool.

On 13.05.2014 06:42, Adam Sakareassen via Digitalmars-d wrote:
 Hi all,
[...]
 All memory allocations are entirely lock free (using CAS instructions).
   So a pre-empted thread will never block another.   For allocations of
 less than 128 bytes, each thread is allocated memory from its own
 memory pool to avoid false sharing on the CPU's cache.
There was a GSoC project a few years ago which was planning to do something similar, but then switched to go for implementing precise scanning. I always wanted to add lock-free memory allocations, too. [...]
 The mark phase needs to stop the world.  The sweeping portion of the
 collection will run in the background.   This is similar to the current
 implementation as the world is restarted after the mark phase, however
 the thread doing the collection will not allocate the requested memory
 to the calling thread until after the sweep has completed.  This means
 that single threaded applications always wait for the full garbage
 collection cycle.
Does this mean that destructors/finalizers are called from a different thread? It was never guaranteed to be run on the same thread as the allocation of the object so far, but effectively that happens for single-threaded applications. It could help to continue doing that, some resources have to be released from the same thread they have been acquired (e.g. UI handles). For multi-threaded applications, I'm fine with "guaranteeing" that the destructor is never run on the same thread, though some standard mechanism should exist to forward to another thread.
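Such a forwarding mechanism could be sketched with std.concurrency; everything here (`Finalize`, `ownerLoop`, `releaseResource`) is hypothetical, as no such standard facility exists today:

```d
import core.atomic;
import std.concurrency;

shared int released;

// Hypothetical finalization request forwarded to the owning thread.
// A function pointer carries no unshared aliasing, so it is sendable.
struct Finalize { void function() run; }

void releaseResource()
{
    // Stand-in for freeing a thread-affine resource (e.g. a UI handle).
    atomicStore(released, 1);
}

void ownerLoop(Tid parent)
{
    // The owning thread runs forwarded finalizers between its normal
    // messages, so resources are released on the thread that owns them.
    receive((Finalize f) { f.run(); });
    send(parent, true); // acknowledge completion
}

void main()
{
    auto owner = spawn(&ownerLoop, thisTid);
    // A collector thread (here: main) forwards the finalizer to the
    // owner's mailbox instead of calling it directly.
    send(owner, Finalize(&releaseResource));
    receiveOnly!bool();
    assert(atomicLoad(released) == 1);
}
```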
 So far allocation speed seems to have improved.  I can't test collection
 speed as it's not complete.  As a test I wrote a simple function that
 allocates a linked list of 2 million items.  This function is then
 spawned by 20 threads.  This test script is shown below.  Timing for
 allocation (with GC disabled) is as follows.  (Using DMD 2.065)

 Existing GC code:  15700ms (average)
 My GC code:   500ms (Average)
Very promising numbers :-)
 When performing the same amount of allocations on a single thread, the
 new code is still slightly faster than the old.

 What this demonstrates is that the locking mechanisms in the current GC
 code is a huge overhead for multi threaded applications that perform a
 lot of memory allocations.  (ie.  Use the “new” operator or dynamic
 arrays.)

 It would be nice to see the default GC and memory allocator improved.
 There is certainly room for improvement on the allocator end which may
 mask some of the performance issues associated with garbage collection.

 In the future I think D needs to look at making collection precise.  It
 would not be too hard to adjust the mark and sweep GC to be nearly
 precise.  The language needs to support precise GC before things like
 moving garbage collection become feasible.
You might check how compatible your implementation is with this: https://github.com/rainers/druntime/tree/gcx_precise2
May 13 2014
parent Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:
Destructors will almost certainly be run on the background thread, 
although those run at shutdown will likely be on the main thread.

I hope the overall performance results are good.  I really need to see 
the timing of the collections to know.  But in any case it does indicate 
that some work on removing lock contention is worthwhile.

On 13/05/2014 5:04 PM, Rainer Schuetze via Digitalmars-d wrote:
 This sounds pretty cool.

 On 13.05.2014 06:42, Adam Sakareassen via Digitalmars-d wrote:
 Hi all,
[...]
 All memory allocations are entirely lock free (using CAS instructions).
   So a pre-empted thread will never block another.   For allocations of
 less than 128 bytes, each thread is allocated memory from its own
 memory pool to avoid false sharing on the CPU's cache.
There was a GSoC project a few years ago which was planning to do something similar, but then switched to go for implementing precise scanning. I always wanted to add lock-free memory allocations, too. [...]
 The mark phase needs to stop the world.  The sweeping portion of the
 collection will run in the background.   This is similar to the current
 implementation as the world is restarted after the mark phase, however
 the thread doing the collection will not allocate the requested memory
 to the calling thread until after the sweep has completed.  This means
 that single threaded applications always wait for the full garbage
 collection cycle.
Does this mean that destructors/finalizers are called from a different thread? It was never guaranteed to be run on the same thread as the allocation of the object so far, but effectively that happens for single-threaded applications. It could help to continue doing that, some resources have to be released from the same thread they have been acquired (e.g. UI handles). For multi-threaded applications, I'm fine with "guaranteeing" that the destructor is never run on the same thread, though some standard mechanism should exist to forward to another thread.
 So far allocation speed seems to have improved.  I can't test collection
 speed as it's not complete.  As a test I wrote a simple function that
 allocates a linked list of 2 million items.  This function is then
 spawned by 20 threads.  This test script is shown below.  Timing for
 allocation (with GC disabled) is as follows.  (Using DMD 2.065)

 Existing GC code:  15700ms (average)
 My GC code:   500ms (Average)
Very promising numbers :-)
 When performing the same amount of allocations on a single thread, the
 new code is still slightly faster than the old.

 What this demonstrates is that the locking mechanisms in the current GC
 code is a huge overhead for multi threaded applications that perform a
 lot of memory allocations.  (ie.  Use the “new” operator or dynamic
 arrays.)

 It would be nice to see the default GC and memory allocator improved.
 There is certainly room for improvement on the allocator end which may
 mask some of the performance issues associated with garbage collection.

 In the future I think D needs to look at making collection precise.  It
 would not be too hard to adjust the mark and sweep GC to be nearly
 precise.  The language needs to support precise GC before things like
 moving garbage collection become feasible.
You might check how compatible your implementation is with this: https://github.com/rainers/druntime/tree/gcx_precise2
May 13 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/12/14, 9:42 PM, Adam Sakareassen via Digitalmars-d wrote:
 Hi all,

 As a learning exercise I've just been doing some experimenting with
 rewriting the garbage collection code, and thought I might share some of
 the initial results.  I only program as a hobby these days, and I'm
 certainly no expert, but I thought some people might find it interesting.
[snip] This all is _very_ encouraging, please do share. Better yet, please contribute patches (one bite-size at a time) to dmd!! -- Andrei
May 13 2014
parent Adam Sakareassen via Digitalmars-d <digitalmars-d puremagic.com> writes:
Thank-you.

Once I am happy with the result I will certainly share.  The internal 
structures of the GC have been redesigned, so I can't simply patch the 
existing file.  Once it is fully operational I will make it available 
for testing.

The good performance numbers are for a very specific test case, so it 
would be nice to have some more generalised testing once we get there. 
I would welcome the work being included in DMD if others thought it 
suited the generalised case.



On 13/05/2014 5:52 PM, Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/12/14, 9:42 PM, Adam Sakareassen via Digitalmars-d wrote:
 Hi all,

 As a learning exercise I've just been doing some experimenting with
 rewriting the garbage collection code, and thought I might share some of
 the initial results.  I only program as a hobby these days, and I'm
 certainly no expert, but I thought some people might find it interesting.
[snip] This all is _very_ encouraging, please do share. Better yet, please contribute patches (one bite-size at a time) to dmd!! -- Andrei
May 13 2014
prev sibling parent reply "Nordlöw" <per.nordlow gmail.com> writes:
 Existing GC code:  15700ms (average)
 My GC code:   500ms (Average)
This sounds almost too good to be true! What remains to be fixed before a pull request can be made? /Per
May 30 2014
parent "Wanderer" <no-reply no-reply.org> writes:
On Friday, 30 May 2014 at 10:45:14 UTC, Nordlöw wrote:
 Existing GC code:  15700ms (average)
 My GC code:   500ms (Average)
This sounds almost too good to be true!
 For allocations of less than 128 bytes, each thread
 is allocated memory from its own memory pool to avoid
 false sharing on the CPU's cache.
Probably this is the reason why results are so nice :-)
May 30 2014