
digitalmars.D - Concurrent GC (for Windows)

reply Rainer Schuetze <r.sagitario gmx.de> writes:
Hi,

more GC talk: the last couple of days, I've been experimenting with 
implementing a concurrent GC on Windows inspired by Leandro's CDGC. 
Here's a report on my experiments:

http://rainers.github.io/visuald/druntime/concurrentgc.html

tl;dr: there is a working(?) partially concurrent GC here: 
https://github.com/rainers/druntime/tree/concurrent_gc2
but it eats a whole lot of memory.

Rainer
Jun 03 2014
next sibling parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
On 03.06.2014 09:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.

 Rainer
This sounds quite promising. Did you think about using a write barrier instead of doing the check per page? The way you propose it, all writes, whether relevant to the GC or not, will cause the GC to reconsider the page. If we implemented a write barrier in the front end that is only called when a write through a pointer to a pointer happens, the GC could rescan only the memory blocks whose pointers actually changed. The GC might also be able to remember the old pointer value before the write is executed, resulting in a consistent snapshot of the heap, which would eliminate the need to rescan after the background thread has finished.

Kind regards,
Benjamin Thaut
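[A rough Python sketch of the proposed barrier, for illustration only. All names here are hypothetical; in D this would be compiler-emitted code calling into druntime, not a plain function.]

```python
# Hedged sketch of the proposed write barrier (hypothetical names).
# The front end would emit a call to write_barrier() for every store
# through a pointer to a pointer; the GC then rescans only the recorded
# blocks instead of reconsidering every dirty page.

dirty_blocks = set()   # blocks whose pointer fields changed during marking
snapshot_log = []      # saved old values, for a consistent heap snapshot

def write_barrier(block, field, new_ptr):
    dirty_blocks.add(id(block))                            # remember the block for rescanning
    snapshot_log.append((id(block), field, block[field]))  # save the old pointer before the write
    block[field] = new_ptr

# the mutator overwrites a pointer field while the collector is marking
obj = {"next": "old_target"}
write_barrier(obj, "next", "new_target")
```

Because the old pointer value is logged before being overwritten, the marker can see a snapshot of the heap as it was when marking started, which is what would eliminate the final rescan.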
Jun 03 2014
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 04.06.2014 07:44, Benjamin Thaut wrote:
 On 03.06.2014 09:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.

 Rainer
This sounds quite promising. Did you think about using a write barrier instead of doing the check per page? The way you propose it, all writes, whether relevant to the GC or not, will cause the GC to reconsider the page. If we implemented a write barrier in the front end that is only called when a write through a pointer to a pointer happens, the GC could rescan only the memory blocks whose pointers actually changed. The GC might also be able to remember the old pointer value before the write is executed, resulting in a consistent snapshot of the heap, which would eliminate the need to rescan after the background thread has finished.
Yes, write barriers are a good addition to this, but they are way more difficult to implement and will meet more resistance from Walter. It's still something I'd like to look into.
Jun 04 2014
parent Benjamin Thaut <code benjamin-thaut.de> writes:
On 04.06.2014 22:26, Rainer Schuetze wrote:
 Yes, write barriers are a good addition to this, but they are way more
 difficult to implement and will meet more resistance from Walter. It's
 still something I'd like to look into.
I assume they are hard to implement because the backend no longer has type information when generating the writes? I remember that Walter stated a few weeks ago that he was considering implementing write barriers.
Jun 04 2014
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 03/06/14 09:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.
How do other GCs handle these memory problems? I'm thinking of Java. -- /Jacob Carlborg
Jun 03 2014
next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 4 June 2014 at 06:51:04 UTC, Jacob Carlborg wrote:
 On 03/06/14 09:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting 
 with
 implementing a concurrent GC on Windows inspired by Leandro's 
 CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.
How do other GCs handle these memory problems? I'm thinking of a virtual machine?
Jun 04 2014
parent Jacob Carlborg <doob me.com> writes:
On 2014-06-04 09:14, Paulo Pinto wrote:


My question still remains :) -- /Jacob Carlborg
Jun 04 2014
prev sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 04.06.2014 08:51, Jacob Carlborg wrote:
 On 03/06/14 09:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.
How does other GC's handle these memory problems? I'm thinking of Java
All more sophisticated GCs have write or read barriers. That makes it much easier to keep track of modifications during concurrent collection.

One reason for the huge memory footprint is that the application is allowed to continue allocating during collection, but it does this by requesting more memory from the OS. This needs to be throttled somehow to avoid going too far.

Most of the remaining pause time is spent sweeping garbage. I'm thinking about deferring sweeping into allocations by running finalizers just before reusing the memory for another object. This would throttle allocations a bit while at the same time reducing pauses.
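[The deferred-sweeping idea described above could be sketched like this. This is a hedged Python model with hypothetical names, not actual druntime code: dead blocks found by the marker are queued, and malloc finalizes and recycles one just before handing memory out.]

```python
# Hedged sketch of lazy sweeping (hypothetical names): instead of sweeping
# all garbage during the pause, dead blocks are queued and recycled lazily.
from collections import deque

class Block:
    def __init__(self, size, finalizer=None):
        self.size = size
        self.finalizer = finalizer
        self.finalized = False

class LazySweepHeap:
    def __init__(self):
        self.pending = deque()   # unmarked (dead) blocks left over from marking

    def end_collection(self, unmarked):
        # no sweep pause: just queue the garbage and restart the world
        self.pending.extend(unmarked)

    def malloc(self, size, finalizer=None):
        # sweep lazily: finalize and recycle one dead block per allocation
        if self.pending:
            dead = self.pending.popleft()
            if dead.finalizer and not dead.finalized:
                dead.finalizer()         # destructor runs outside stop-the-world
            dead.size = size             # reuse the memory for the new object
            dead.finalizer = finalizer
            dead.finalized = False
            return dead
        return Block(size, finalizer)    # nothing to recycle: fresh allocation

# usage: marking found one dead block; the next malloc finalizes and reuses it
log = []
heap = LazySweepHeap()
heap.end_collection([Block(32, finalizer=lambda: log.append("finalized"))])
recycled = heap.malloc(16)
```

This spreads the finalization cost over many allocations, so each allocation pays a small, bounded tax instead of the world stopping for one long sweep.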
Jun 04 2014
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2014-06-04 22:37, Rainer Schuetze wrote:

 All more sophisticated GCs have write or read barriers. That makes it
 much easier to keep track of modifications during concurrent collection.
Right, now I start to remember. -- /Jacob Carlborg
Jun 05 2014
prev sibling parent reply Martin Nowak <code dawg.eu> writes:
On 06/04/2014 10:37 PM, Rainer Schuetze wrote:
 Most of the remaining pause time is sweeping garbage. I think about
 deferring sweeping into allocations by running finalizers just before
 reusing the memory for another object. This can throttle allocations a
 bit while at the same time reduce pauses.
That's pausing the thread which triggered the collection, right? Other threads don't need to wait for sweeping. Dividing the cost to run finalizers among many allocations is a charming idea.
Jun 15 2014
parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 15.06.2014 23:30, Martin Nowak wrote:
 On 06/04/2014 10:37 PM, Rainer Schuetze wrote:
 Most of the remaining pause time is sweeping garbage. I think about
 deferring sweeping into allocations by running finalizers just before
 reusing the memory for another object. This can throttle allocations a
 bit while at the same time reduce pauses.
That's pausing the thread which triggered the collection, right? Other threads don't need to wait for sweeping.
Yes, the world is not stopped during sweeping. The GC lock still blocks any thread that tries to allocate, though.
 Dividing the cost to run finalizers among many allocations is a charming
 idea.
Another nice property: if memory is recycled during malloc, you can run the finalizers without the GC lock held, just before returning, so there are fewer restrictions on what can happen in a destructor.
Jun 15 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Jun-2014 11:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.
I'm not sure how it would perform, but what about the following idea for CoW snapshotting:

1. Don't use a separate process and shared view; use a background thread.
2. Allocate heap space with MapViewOfFile, but private, not shared.

During collection, create a snapshot (while the world is stopped) as follows:

1. Make a new read-write view of the heap with MapViewOfFile; this is what the background thread will use to scan memory.
2. Save the base address of the original view, and unmap it.
3. Call MapViewOfFileEx with MAP_COPY and the old base address (that gives us back the same heap, but in CoW mode).

... then move on with the application and let the background thread do the marking. Once marking is done, stop the world again and:

1. VirtualQuery (or QueryWorkingSetEx) over the application's CoW view of the heap. The ranges of pages that have flipped to PAGE_READWRITE are the ones that were modified and duplicated. If new allocations are served from this view, it will include newly allocated objects as well.
2. Copy these page ranges over to the normal read-write view.
3. Close the CoW mapping (thereby freeing the duplicates) and remap it again as a read-write view at the same address with MapViewOfFileEx.

I wasn't able to get QueryWorkingSetEx to behave, but I believe there must be a better way than calling VirtualQuery on every page range to death.

See a sketch of the idea here:
https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e
 Rainer
-- Dmitry Olshansky
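[The core property of the scheme above (mutator writes made after the snapshot stay invisible to the scanning view) can be demonstrated with a POSIX/Python analogue: mmap's ACCESS_COPY gives MAP_PRIVATE copy-on-write semantics, standing in for MapViewOfFileEx with MAP_COPY. This is an illustration of the idea, not the Windows implementation.]

```python
# Hedged analogue of the CoW snapshot: writes through a copy-on-write view
# go to private page copies and never reach the backing mapping, so a
# second, read-only view keeps seeing the state at snapshot time.
import mmap
import tempfile

PAGE = 4096
backing = tempfile.TemporaryFile()               # stands in for the file mapping
backing.write(b"live-pointer".ljust(PAGE, b"\0"))
backing.flush()

# the background marker scans through a plain read view: the snapshot
snapshot = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_READ)

# the application's heap is remapped copy-on-write (like MAP_COPY):
# its writes create private page copies instead of touching the backing
heap = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_COPY)
heap[0:7] = b"mutated"
```

After the write, the CoW view holds the mutated page while the snapshot view is untouched; finding and merging back exactly those privately copied pages is what the VirtualQuery/QueryWorkingSetEx step does on Windows.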
Jun 11 2014
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 11.06.2014 18:59, Dmitry Olshansky wrote:
 On 03-Jun-2014 11:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html

 tl;dr: there is a working(?) partially concurrent GC here:
 https://github.com/rainers/druntime/tree/concurrent_gc2
 but it eats a whole lot of memory.
I'm not sure how it would perform, but what about the following idea for CoW snapshotting:

1. Don't use a separate process and shared view; use a background thread.
2. Allocate heap space with MapViewOfFile, but private, not shared.

During collection, create a snapshot (while the world is stopped) as follows:

1. Make a new read-write view of the heap with MapViewOfFile; this is what the background thread will use to scan memory.
2. Save the base address of the original view, and unmap it.
3. Call MapViewOfFileEx with MAP_COPY and the old base address (that gives us back the same heap, but in CoW mode).

... then move on with the application and let the background thread do the marking. Once marking is done, stop the world again and:

1. VirtualQuery (or QueryWorkingSetEx) over the application's CoW view of the heap. The ranges of pages that have flipped to PAGE_READWRITE are the ones that were modified and duplicated. If new allocations are served from this view, it will include newly allocated objects as well.
2. Copy these page ranges over to the normal read-write view.
3. Close the CoW mapping (thereby freeing the duplicates) and remap it again as a read-write view at the same address with MapViewOfFileEx.

I wasn't able to get QueryWorkingSetEx to behave, but I believe there must be a better way than calling VirtualQuery on every page range to death.

See a sketch of the idea here:
https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e
Cool stuff! I remember trying something similar, but IIRC forcing the same address with MapViewOfFile somehow failed (maybe that was across processes). I tried your version on both Win32 and Win64 successfully, though.

I implemented the QueryWorkingSetEx version like this (you need a converted psapi.lib for Win32):

enum PAGES = 512; // SIZE / 4096;
PSAPI_WORKING_SET_EX_INFORMATION[PAGES] info;
foreach(i, ref inf; info)
    inf.VirtualAddress = heap + i * 4096;
if (!QueryWorkingSetEx(GetCurrentProcess(), info.ptr, info.sizeof))
    throw new Exception(format("Could not query info (%d).\n", GetLastError()));

foreach(i, ref inf; info)
    writefln("flags page %d: %x", i, inf.VirtualAttributes);

and you can check the "shared" field to detect copied pages. This function is not supported on XP, though.

A short benchmark shows that VirtualQuery needs 55/42 ms for your test on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17 ms for both. If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is copied), I get 80-90 ms more.

The numbers are not great, but I guess the usual memory usage and number of modified pages will be much lower. I'll see if I can integrate this into the concurrent implementation.
Jun 11 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 12-Jun-2014 10:34, Rainer Schuetze wrote:
 On 11.06.2014 18:59, Dmitry Olshansky wrote:
 On 03-Jun-2014 11:35, Rainer Schuetze wrote:
 Hi,

 more GC talk: the last couple of days, I've been experimenting with
 implementing a concurrent GC on Windows inspired by Leandro's CDGC.
 Here's a report on my experiments:

 http://rainers.github.io/visuald/druntime/concurrentgc.html
[snip]
 See the sketch of the idea here :
 https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e
Cool stuff! I remember trying something similar, but IIRC forcing the same address with MapViewOfFile somehow failed (maybe this was across processes). I tried your version on both Win32 and Win64 successfully, though.
 I implemented the QueryWorkingSetEx version like this (you need a
 converted psapi.lib for Win32):
Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.
 enum PAGES = 512; //SIZE / 4096;
 PSAPI_WORKING_SET_EX_INFORMATION[PAGES] info;
 foreach(i, ref inf; info)
      inf.VirtualAddress = heap + i * 4096;
 if (!QueryWorkingSetEx(GetCurrentProcess(), info.ptr, info.sizeof))
      throw new Exception(format("Could not query info (%d).\n",
 GetLastError()));

 foreach(i, ref inf; info)
      writefln("flags page %d: %x", i, inf.VirtualAttributes);


 and you can check the "shared" field to get copied pages.
 This function
 is not supported on XP, though.
I wouldn't worry about it; it's not like XP users are growing in numbers. Also, it looks like only the 64-bit version is good to go, as on 32-bit it would reduce the usable memory by half.
 A short benchmark shows that VirtualQuery needs 55/42 ms for your test
 on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17
 ms for both.
Seems in line with my measurements. Strictly speaking, 1/2 of the pages, interleaved, should give an estimate of the worst case. Together with remapping (freeing the duplicated pages) it doesn't go beyond 250 ms on 640 MB of heap.
 If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
 copied), I get 80-90 ms more.
Aye... this is a lot. For me it also turns out that unmapping the CoW view in the last step takes most of the time. It might help to split the full heap into multiple views. Also, using VirtualProtect in the first step to turn the mapping into a CoW one is faster than unmap/map (by a factor of 2).

One thing that may help is saving a pointer to the end of the used heap at the moment of the scan, then remapping only this portion as CoW.

The last issue I see is the adjustment of pointers: in a GC, the snapshot view is mapped at a new address, so pointers would need a fixup during scanning.
 The numbers are not great, but I guess the usual memory usage and number
 of modified pages will be much lower. I'll see if I can integrate this
 into the concurrent implementation.
Wish you luck, I'm still not sure if it will help. -- Dmitry Olshansky
Jun 12 2014
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 13.06.2014 02:38, Dmitry Olshansky wrote:
 On 12-Jun-2014 10:34, Rainer Schuetze wrote:
 I implemented the QueryWorkingSetEx version like this (you need a
 converted psapi.lib for Win32):
Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.
Grab coffimplib.exe.
 This function
 is not supported on XP, though.
I wouldn't worry about it; it's not like XP users are growing in numbers. Also, it looks like only the 64-bit version is good to go, as on 32-bit it would reduce the usable memory by half.
There could also be a fallback to VirtualQuery if QueryWorkingSetEx doesn't exist.
 A short benchmark shows that VirtualQuery needs 55/42 ms for your test
 on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17
 ms for both.
Seems in line with my measurements. Strictly speaking, 1/2 of the pages, interleaved, should give an estimate of the worst case. Together with remapping (freeing the duplicated pages) it doesn't go beyond 250 ms on 640 MB of heap.
 If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
 copied), I get 80-90 ms more.
Aye... this is a lot. For me it also turns out that unmapping the CoW view in the last step takes most of the time.
Maybe the memory needs to be actually flushed to the file if no mapping exists. If that is the case, we could avoid it by creating another temporary mapping.
 It might help to split the full heap into multiple views.
The current GC uses pools of max 32 MB, so that already exists.
 Also using VirtualProtect during the first step - turning a mapping into
 CoW one is faster then unmap/map (by factor of 2).

 One thing that may help is saving a pointer to the end of used heap at
 the moment of scan, then remaping only this portion as COW.
The pool architecture already does this if scanning just ignores new pools during collection. Another optimization is to segregate the heap into memory with references and memory with plain data (NO_SCAN) at page/pool granularity. NO_SCAN pages won't need CoW. My hope is that this reduces the amount of memory that needs to be duplicated considerably.
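[The NO_SCAN segregation could be modeled like this. A hedged Python sketch with hypothetical names: pointer-free blocks are routed to pools that are neither scanned nor remapped copy-on-write.]

```python
# Hedged sketch (hypothetical names): segregating scan and NO_SCAN memory
# at pool granularity, so plain-data pools need neither scanning nor CoW.
NO_SCAN = 0x2   # attribute: block is guaranteed to contain no pointers

class Heap:
    def __init__(self):
        self.scan_pools = []    # may hold pointers: scanned, remapped CoW
        self.noscan_pools = []  # plain data (byte/float arrays): skipped

    def alloc(self, size, attrs=0):
        block = bytearray(size)
        pools = self.noscan_pools if attrs & NO_SCAN else self.scan_pools
        pools.append(block)
        return block

    def snapshot_targets(self):
        # only pointer-bearing pools need the copy-on-write snapshot
        return self.scan_pools

heap = Heap()
small_obj = heap.alloc(64)            # object that may hold references
payload = heap.alloc(1024, NO_SCAN)   # e.g. a float[] payload
```

Since large allocations are often pointer-free data, keeping them out of the CoW snapshot should cut both the number of duplicated pages and the scanning work.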
 Last issue I see is adjustment of pointers - in a GC, the mapped view is
 mapped at new address so it would need a fixup them during scanning.
Agreed, that's a slight additional cost for scanning, but I don't think it will be too difficult to implement.
 The numbers are not great, but I guess the usual memory usage and number
 of modified pages will be much lower. I'll see if I can integrate this
 into the concurrent implementation.
Wish you luck, I'm still not sure if it will help.
Jun 13 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 13-Jun-2014 12:22, Rainer Schuetze wrote:
 On 13.06.2014 02:38, Dmitry Olshansky wrote:
 On 12-Jun-2014 10:34, Rainer Schuetze wrote:
 I implemented the QueryWorkingSetEx version like this (you need a
 converted psapi.lib for Win32):
Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.
Grab coffimplib.exe.
Cool, thanks.
 If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
 copied), I get 80-90 ms more.
Aye... this is a lot. For me it also turns out that unmapping the CoW view in the last step takes most of the time.
Maybe the memory needs to be actually flushed to the file if no mapping exists. If that is the case, we could avoid that if we create another temporary mapping.
I thought that's what I do by keeping the original view mapped. The time taken is roughly proportional to the number of pages modified in the CoW view; since these are private copies, I think it might take this long to recycle them. It's ~1 ms if no changes are made. -- Dmitry Olshansky
Jun 13 2014
parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 13.06.2014 12:08, Dmitry Olshansky wrote:
 On 13-Jun-2014 12:22, Rainer Schuetze wrote:
 If I add the actual copy into heap2 (i.e. every fourth page of 512
 MB is
 copied), I get 80-90 ms more.
Aye... this is a lot. For me it also turns out that unmapping the CoW view in the last step takes most of the time.
Maybe the memory needs to be actually flushed to the file if no mapping exists. If that is the case, we could avoid that if we create another temporary mapping.
I thought that's what I do by keeping the original view mapped. The time taken is roughly proportional to the number of pages modified in the CoW view; since these are private copies, I think it might take this long to recycle them. It's ~1 ms if no changes are made.
I added CoW to the concurrent GC ( https://github.com/rainers/druntime/tree/concurrent_gc2 ) and it seems more stable with respect to pause times than the version that reads write watches.

I also see considerable time spent unmapping the CoW view, about as much as copying the data.

One issue with calling OS functions while the world is stopped: if these functions require a lock on some synchronization object, and that object is held by another thread that is suspended, we create a deadlock. Do you know if MapViewOfFileEx and UnmapViewOfFile are lock-free? The same can happen with GetWriteWatch and ResetWriteWatch, though.
Jun 14 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14-Jun-2014 20:34, Rainer Schuetze wrote:
 On 13.06.2014 12:08, Dmitry Olshansky wrote:
 On 13-Jun-2014 12:22, Rainer Schuetze wrote:
 If I add the actual copy into heap2 (i.e. every fourth page of 512
 MB is
 copied), I get 80-90 ms more.
Aye... this is a lot. For me it also turns out that unmapping the CoW view in the last step takes most of the time.
Maybe the memory needs to be actually flushed to the file if no mapping exists. If that is the case, we could avoid that if we create another temporary mapping.
I thought that's what I do by keeping the original view mapped. The time taken is roughly proportional to the number of pages modified in the CoW view; since these are private copies, I think it might take this long to recycle them. It's ~1 ms if no changes are made.
I added COW to the concurrent GC ( https://github.com/rainers/druntime/tree/concurrent_gc2 ) and it seems more stable with respect to pause times than the version with reading write watches.
How do I run benchmarks and what are the newest results for you?
 I also see considerable time spent during unmapping the CoW view, about
 as much as copying the data.

 One issue with calling OS functions while the world is stopped: if these
 function require a lock on some synchronization object, and this object
 is locked by another thread that is suspended, we create a dead lock.

 Do you know if MapViewOfFileEx and UnmapViewOfFile are lock free?
No idea. It may also depend on the OS version; given the fuss about Win7 being the first with a lock-free thread scheduler, there may be giant locks in memory mapping, at least in some versions.
 The same can happen with GetWriteWatch and ResetWriteWatch, though.
-- Dmitry Olshansky
Jun 14 2014
parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 14.06.2014 20:17, Dmitry Olshansky wrote:
 I added COW to the concurrent GC (
 https://github.com/rainers/druntime/tree/concurrent_gc2 ) and it seems
 more stable with respect to pause times than the version with reading
 write watches.
How do I run benchmarks and what are the newest results for you?
The benchmarks have been recently merged into druntime. Compile and run runbench.d in the benchmark folder. I have updated the report and added new benchmarks at http://rainers.github.io/visuald/druntime/concurrentgc.html
Jun 15 2014
prev sibling parent Martin Nowak <code dawg.eu> writes:
On 06/12/2014 08:34 AM, Rainer Schuetze wrote:
 If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
 copied), I get 80-90 ms more.

 The numbers are not great, but I guess the usual memory usage and number
 of modified pages will be much lower. I'll see if I can integrate this
 into the concurrent implementation.
Is there a chance that we can pause mutator threads when they trigger too many page copies?
Jun 15 2014