digitalmars.D - O(N) GC: The patch
- dsimcha <dsimcha yahoo.com> Feb 20 2011
- "Steven Schveighoffer" <schveiguy yahoo.com> Feb 20 2011
- Jason House <jason.james.house gmail.com> Feb 20 2011
- bearophile <bearophileHUGS lycos.com> Feb 20 2011
- dsimcha <dsimcha yahoo.com> Feb 21 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5623 I've found a way to speed up the GC massively on large heaps without excessive ripple effects. Technically it's still O(N), but with about a hundred fold smaller constant in the case of large heaps with most stuff not scanned. Now, I think the O(N) (where N is the total size of the heap) term has such a small constant that it's for almost all practcal purposes the GC is O(S) (where S is the size of the scanned portion of the heap). It also no longer has any O(N^2) pathological case (which I had discovered while reading the code). So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled. Please test some other code so we can wring out the corner cases in time for the next release. Basically all I did was diverge the Pool struct slightly into large and small object sub-varieties. The large object sub-variety is used to allocate objects of at least a page. It only stores gcbits at page-size offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that they can be found in O(1). I also added a field to the Pool struct so that the number of free pages in a pool can be tracked in O(1). This should drastically lessen the time it takes to perform large allocations on large heaps. Right now a free memory region is found by a linear search through the pools in the case of large allocations. Unfortunately, I don't see any easy way to fix this. This patch at least allows short circuiting a large number of pools, if there isn't enough free space in the whole pool, let alone contiguous space. Here are the benchmarks with this patch enabled. Collected a 10 megabyte heap in 0 milliseconds. Collected a 50 megabyte heap in 0 milliseconds. Collected a 250 megabyte heap in 1 milliseconds. Collected a 500 megabyte heap in 0 milliseconds. Collected a 1000 megabyte heap in 1 milliseconds. Collected a 5000 megabyte heap in 3 milliseconds. Collected a 10000 megabyte heap in 6 milliseconds. Collected a 30000 megabyte heap in 16 milliseconds. Collected a 50000 megabyte heap in 26 milliseconds.
Feb 20 2011
On Sun, 20 Feb 2011 15:19:24 -0500, dsimcha <dsimcha yahoo.com> wrote:http://d.puremagic.com/issues/show_bug.cgi?id=5623 I've found a way to speed up the GC massively on large heaps without excessive ripple effects. Technically it's still O(N), but with about a hundred fold smaller constant in the case of large heaps with most stuff not scanned. Now, I think the O(N) (where N is the total size of the heap) term has such a small constant that it's for almost all practcal purposes the GC is O(S) (where S is the size of the scanned portion of the heap). It also no longer has any O(N^2) pathological case (which I had discovered while reading the code). So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled. Please test some other code so we can wring out the corner cases in time for the next release. Basically all I did was diverge the Pool struct slightly into large and small object sub-varieties. The large object sub-variety is used to allocate objects of at least a page. It only stores gcbits at page-size offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that they can be found in O(1). I also added a field to the Pool struct so that the number of free pages in a pool can be tracked in O(1). This should drastically lessen the time it takes to perform large allocations on large heaps. Right now a free memory region is found by a linear search through the pools in the case of large allocations. Unfortunately, I don't see any easy way to fix this. This patch at least allows short circuiting a large number of pools, if there isn't enough free space in the whole pool, let alone contiguous space. Here are the benchmarks with this patch enabled. Collected a 10 megabyte heap in 0 milliseconds. Collected a 50 megabyte heap in 0 milliseconds. Collected a 250 megabyte heap in 1 milliseconds. Collected a 500 megabyte heap in 0 milliseconds. Collected a 1000 megabyte heap in 1 milliseconds. Collected a 5000 megabyte heap in 3 milliseconds. Collected a 10000 megabyte heap in 6 milliseconds. Collected a 30000 megabyte heap in 16 milliseconds. Collected a 50000 megabyte heap in 26 milliseconds.
Those numbers look really promising! I will examine your patch and see if I can think of anything in the array appending that would be affected by it. -Steve
Feb 20 2011
Sounds promising. How does it effect other cases? Some typical GC-heavy benchmark? Lots of smaller no scan objects that are just under your optimization threshold? dsimcha Wrote:http://d.puremagic.com/issues/show_bug.cgi?id=5623 I've found a way to speed up the GC massively on large heaps without excessive ripple effects. Technically it's still O(N), but with about a hundred fold smaller constant in the case of large heaps with most stuff not scanned. Now, I think the O(N) (where N is the total size of the heap) term has such a small constant that it's for almost all practcal purposes the GC is O(S) (where S is the size of the scanned portion of the heap). It also no longer has any O(N^2) pathological case (which I had discovered while reading the code). So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled. Please test some other code so we can wring out the corner cases in time for the next release. Basically all I did was diverge the Pool struct slightly into large and small object sub-varieties. The large object sub-variety is used to allocate objects of at least a page. It only stores gcbits at page-size offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that they can be found in O(1). I also added a field to the Pool struct so that the number of free pages in a pool can be tracked in O(1). This should drastically lessen the time it takes to perform large allocations on large heaps. Right now a free memory region is found by a linear search through the pools in the case of large allocations. Unfortunately, I don't see any easy way to fix this. This patch at least allows short circuiting a large number of pools, if there isn't enough free space in the whole pool, let alone contiguous space. Here are the benchmarks with this patch enabled. Collected a 10 megabyte heap in 0 milliseconds. Collected a 50 megabyte heap in 0 milliseconds. Collected a 250 megabyte heap in 1 milliseconds. Collected a 500 megabyte heap in 0 milliseconds. Collected a 1000 megabyte heap in 1 milliseconds. Collected a 5000 megabyte heap in 3 milliseconds. Collected a 10000 megabyte heap in 6 milliseconds. Collected a 30000 megabyte heap in 16 milliseconds. Collected a 50000 megabyte heap in 26 milliseconds.
Feb 20 2011
Jason House:How does it effect other cases?
I have asked for the timings for a small benchmark, and the results are good (see the issue in Bugzilla). Bye, bearophile
Feb 20 2011
On 2/20/2011 8:05 PM, Jason House wrote:Sounds promising. How does it effect other cases? Some typical GC-heavy benchmark? Lots of smaller no scan objects that are just under your optimization threshold?
I posted some new benchmarks that are more like realistic workloads to the original bug report. The first benchmark I posted was admittedly a corner case. These new ones are more like typical scientific computing/large object allocation heavy cases. Also note that the effect of the patch will be magnified in a multithreaded program because more efficient GC/allocation means that there will be less bottlenecking on malloc() and less time with the world stopped. As a reminder, the report is at: http://d.puremagic.com/issues/show_bug.cgi?id=5623
Feb 21 2011









"Steven Schveighoffer" <schveiguy yahoo.com> 