www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - O(N) GC: The patch

reply dsimcha <dsimcha yahoo.com> writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5623

I've found a way to speed up the GC massively on large heaps without 
excessive ripple effects.  Technically it's still O(N), but with about a 
hundred fold smaller constant in the case of large heaps with most stuff 
not scanned.  Now, I think the O(N) (where N is the total size of the 
heap) term has such a small constant that it's for almost all practcal 
purposes the GC is O(S) (where S is the size of the scanned portion of 
the heap).  It also no longer has any O(N^2) pathological case (which I 
had discovered while reading the code).

So far all unittests for Phobos, dstats and 
std.parallelism/parallelfuture pass with this enabled.  Please test some 
other code so we can wring out the corner cases in time for the next 
release.

Basically all I did was diverge the Pool struct slightly into large and 
small object sub-varieties.  The large object sub-variety is used to 
allocate objects of at least a page.  It only stores gcbits at page-size 
offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest 
B_PAGE bin so that they can be found in O(1).

I also added a field to the Pool struct so that the number of free pages 
in a pool can be tracked in O(1).  This should drastically lessen the 
time it takes to perform large allocations on large heaps.  Right now a 
free memory region is found by a linear search through the pools in the 
case of large allocations.  Unfortunately, I don't see any easy way to 
fix this.  This patch at least allows short circuiting a large number of 
pools, if there isn't enough free space in the whole pool, let alone 
contiguous space.

Here are the benchmarks with this patch enabled.

Collected a 10 megabyte heap in 0 milliseconds.
Collected a 50 megabyte heap in 0 milliseconds.
Collected a 250 megabyte heap in 1 milliseconds.
Collected a 500 megabyte heap in 0 milliseconds.
Collected a 1000 megabyte heap in 1 milliseconds.
Collected a 5000 megabyte heap in 3 milliseconds.
Collected a 10000 megabyte heap in 6 milliseconds.
Collected a 30000 megabyte heap in 16 milliseconds.
Collected a 50000 megabyte heap in 26 milliseconds.
Feb 20 2011
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 20 Feb 2011 15:19:24 -0500, dsimcha <dsimcha yahoo.com> wrote:

 http://d.puremagic.com/issues/show_bug.cgi?id=5623

 I've found a way to speed up the GC massively on large heaps without  
 excessive ripple effects.  Technically it's still O(N), but with about a  
 hundred fold smaller constant in the case of large heaps with most stuff  
 not scanned.  Now, I think the O(N) (where N is the total size of the  
 heap) term has such a small constant that it's for almost all practcal  
 purposes the GC is O(S) (where S is the size of the scanned portion of  
 the heap).  It also no longer has any O(N^2) pathological case (which I  
 had discovered while reading the code).

 So far all unittests for Phobos, dstats and  
 std.parallelism/parallelfuture pass with this enabled.  Please test some  
 other code so we can wring out the corner cases in time for the next  
 release.

 Basically all I did was diverge the Pool struct slightly into large and  
 small object sub-varieties.  The large object sub-variety is used to  
 allocate objects of at least a page.  It only stores gcbits at page-size  
 offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest  
 B_PAGE bin so that they can be found in O(1).

 I also added a field to the Pool struct so that the number of free pages  
 in a pool can be tracked in O(1).  This should drastically lessen the  
 time it takes to perform large allocations on large heaps.  Right now a  
 free memory region is found by a linear search through the pools in the  
 case of large allocations.  Unfortunately, I don't see any easy way to  
 fix this.  This patch at least allows short circuiting a large number of  
 pools, if there isn't enough free space in the whole pool, let alone  
 contiguous space.

 Here are the benchmarks with this patch enabled.

 Collected a 10 megabyte heap in 0 milliseconds.
 Collected a 50 megabyte heap in 0 milliseconds.
 Collected a 250 megabyte heap in 1 milliseconds.
 Collected a 500 megabyte heap in 0 milliseconds.
 Collected a 1000 megabyte heap in 1 milliseconds.
 Collected a 5000 megabyte heap in 3 milliseconds.
 Collected a 10000 megabyte heap in 6 milliseconds.
 Collected a 30000 megabyte heap in 16 milliseconds.
 Collected a 50000 megabyte heap in 26 milliseconds.

Those numbers look really promising! I will examine your patch and see if I can think of anything in the array appending that would be affected by it. -Steve
Feb 20 2011
prev sibling parent reply Jason House <jason.james.house gmail.com> writes:
Sounds promising. How does it effect other cases? Some typical GC-heavy
benchmark? Lots of smaller no scan objects that are just under your
optimization threshold?

dsimcha Wrote:

 http://d.puremagic.com/issues/show_bug.cgi?id=5623
 
 I've found a way to speed up the GC massively on large heaps without 
 excessive ripple effects.  Technically it's still O(N), but with about a 
 hundred fold smaller constant in the case of large heaps with most stuff 
 not scanned.  Now, I think the O(N) (where N is the total size of the 
 heap) term has such a small constant that it's for almost all practcal 
 purposes the GC is O(S) (where S is the size of the scanned portion of 
 the heap).  It also no longer has any O(N^2) pathological case (which I 
 had discovered while reading the code).
 
 So far all unittests for Phobos, dstats and 
 std.parallelism/parallelfuture pass with this enabled.  Please test some 
 other code so we can wring out the corner cases in time for the next 
 release.
 
 Basically all I did was diverge the Pool struct slightly into large and 
 small object sub-varieties.  The large object sub-variety is used to 
 allocate objects of at least a page.  It only stores gcbits at page-size 
 offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest 
 B_PAGE bin so that they can be found in O(1).
 
 I also added a field to the Pool struct so that the number of free pages 
 in a pool can be tracked in O(1).  This should drastically lessen the 
 time it takes to perform large allocations on large heaps.  Right now a 
 free memory region is found by a linear search through the pools in the 
 case of large allocations.  Unfortunately, I don't see any easy way to 
 fix this.  This patch at least allows short circuiting a large number of 
 pools, if there isn't enough free space in the whole pool, let alone 
 contiguous space.
 
 Here are the benchmarks with this patch enabled.
 
 Collected a 10 megabyte heap in 0 milliseconds.
 Collected a 50 megabyte heap in 0 milliseconds.
 Collected a 250 megabyte heap in 1 milliseconds.
 Collected a 500 megabyte heap in 0 milliseconds.
 Collected a 1000 megabyte heap in 1 milliseconds.
 Collected a 5000 megabyte heap in 3 milliseconds.
 Collected a 10000 megabyte heap in 6 milliseconds.
 Collected a 30000 megabyte heap in 16 milliseconds.
 Collected a 50000 megabyte heap in 26 milliseconds.

Feb 20 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jason House:

 How does it effect other cases?

I have asked for the timings for a small benchmark, and the results are good (see the issue in Bugzilla). Bye, bearophile
Feb 20 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 2/20/2011 8:05 PM, Jason House wrote:
 Sounds promising. How does it effect other cases? Some typical GC-heavy
benchmark? Lots of smaller no scan objects that are just under your
optimization threshold?

I posted some new benchmarks that are more like realistic workloads to the original bug report. The first benchmark I posted was admittedly a corner case. These new ones are more like typical scientific computing/large object allocation heavy cases. Also note that the effect of the patch will be magnified in a multithreaded program because more efficient GC/allocation means that there will be less bottlenecking on malloc() and less time with the world stopped. As a reminder, the report is at: http://d.puremagic.com/issues/show_bug.cgi?id=5623
Feb 21 2011