
digitalmars.D - O(N) Garbage collection?

reply dsimcha <dsimcha yahoo.com> writes:
I've been trying out D's new 64-bit compiler and a serious barrier to 
using it effectively seems to be abysmal garbage collection performance 
with large heaps. It seems like the time for a garbage collection to run 
scales linearly with the size of the heap *even if most of the heap is 
marked as NO_SCAN*.  I'm running a program with a heap size of ~6GB, 
almost all of which is strings (DNA sequences), which are not scanned by 
the GC.  It's spending most of its time in GC, based on pausing it every 
once in a while in GDB and seeing what's at the top of the stack.

Here's a test program and the results for a few runs.

import std.stdio, std.datetime, core.memory, std.conv;

void main(string[] args) {
     if(args.length < 2) {
         stderr.writeln("Need size.");
         return;
     }

     immutable mul = to!size_t(args[1]);
     auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

     auto sw = StopWatch(autoStart);
     GC.collect();
     immutable msec = sw.peek.msecs;
     writefln("Collected a %s megabyte heap in %s milliseconds.",
         mul, msec);
}

Outputs for various sizes:

Collected a 10 megabyte heap in 1 milliseconds.
Collected a 50 megabyte heap in 4 milliseconds.
Collected a 200 megabyte heap in 16 milliseconds.
Collected a 500 megabyte heap in 41 milliseconds.
Collected a 1000 megabyte heap in 80 milliseconds.
Collected a 5000 megabyte heap in 397 milliseconds.
Collected a 10000 megabyte heap in 801 milliseconds.
Collected a 30000 megabyte heap in 2454 milliseconds.
Collected a 50000 megabyte heap in 4096 milliseconds.

Note that these tests were run on a server with over 100 GB of physical 
RAM, so a shortage of physical memory isn't the problem.  Shouldn't GC 
be O(1) with respect to the size of the unscanned portion of the heap?
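As a sanity check on the linearity claim, the table above divides out to a nearly constant cost per megabyte (a quick sketch in Python, using the numbers exactly as printed):

```python
# (heap size in MB, collection time in ms) pairs from the output above.
data = [(10, 1), (50, 4), (200, 16), (500, 41), (1000, 80),
        (5000, 397), (10000, 801), (30000, 2454), (50000, 4096)]

# If collection were O(1) in heap size, ms/MB would shrink as the heap
# grows; instead it stays near 0.08 ms/MB across four orders of magnitude.
ratios = [ms / mb for mb, ms in data]
print(min(ratios), max(ratios))
```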
Feb 18 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
BTW, here are the timings if I remove the GC.BlkAttr.NO_SCAN, meaning 
that everything gets scanned.  Said timings aren't drastically 
different.  Something is seriously wrong here.

Collected a 10 megabyte heap in 1 milliseconds.
Collected a 50 megabyte heap in 3 milliseconds.
Collected a 200 megabyte heap in 15 milliseconds.
Collected a 500 megabyte heap in 38 milliseconds.
Collected a 1000 megabyte heap in 77 milliseconds.
Collected a 5000 megabyte heap in 696 milliseconds.
Collected a 10000 megabyte heap in 1394 milliseconds.
Collected a 30000 megabyte heap in 2885 milliseconds.
Collected a 50000 megabyte heap in 4343 milliseconds.

On 2/19/2011 12:03 AM, dsimcha wrote:
 I've been trying out D's new 64-bit compiler and a serious barrier to
 using it effectively seems to be abysmal garbage collection performance
 with large heaps. It seems like the time for a garbage collection to run
 scales linearly with the size of the heap *even if most of the heap is
 marked as NO_SCAN*. I'm running a program with a heap size of ~6GB,
 almost all of which is strings (DNA sequences), which are not scanned by
 the GC. It's spending most of its time in GC, based on pausing it every
 once in a while in GDB and seeing what's at the top of the stack.

 Here's a test program and the results for a few runs.

 import std.stdio, std.datetime, core.memory, std.conv;

 void main(string[] args) {
      if(args.length < 2) {
          stderr.writeln("Need size.");
          return;
      }

      immutable mul = to!size_t(args[1]);
      auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

      auto sw = StopWatch(autoStart);
      GC.collect();
      immutable msec = sw.peek.msecs;
      writefln("Collected a %s megabyte heap in %s milliseconds.",
          mul, msec);
 }

 Outputs for various sizes:

 Collected a 10 megabyte heap in 1 milliseconds.
 Collected a 50 megabyte heap in 4 milliseconds.
 Collected a 200 megabyte heap in 16 milliseconds.
 Collected a 500 megabyte heap in 41 milliseconds.
 Collected a 1000 megabyte heap in 80 milliseconds.
 Collected a 5000 megabyte heap in 397 milliseconds.
 Collected a 10000 megabyte heap in 801 milliseconds.
 Collected a 30000 megabyte heap in 2454 milliseconds.
 Collected a 50000 megabyte heap in 4096 milliseconds.

 Note that these tests were run on a server with over 100 GB of physical
 RAM, so a shortage of physical memory isn't the problem. Shouldn't GC be
 O(1) with respect to the size of the unscanned portion of the heap?

Feb 19 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 BTW, here are the timings if I remove the GC.BlkAttr.NO_SCAN, meaning 
 that everything gets scanned.  Said timings aren't drastically 
 different.  Something is seriously wrong here.

Languages like Clojure, Scala, Boo, etc. start their development on a virtual machine that already provides a refined GC, a standard library, a good back-end, etc.; all things that require a very large amount of work to build well and tune. The D language tries to re-create everything, even refusing the LLVM back-end (LDC docet), so there is a lot of painful work to do to create a decent enough GC, etc. The current engineering quality of the Java GC will maybe be out of reach of the D language forever...

Bye,
bearophile
Feb 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
You may be right, but:

1.  Reinventing the wheel has its advantages in that you get to step 
back and reevaluate things and remove all the cruft that built up on the 
existing wheel.

2.  I'm guessing this is a silly bug somewhere rather than a deep design 
flaw, and that it can easily be solved by someone with a good mental 
model of the whole implementation of the D GC (Sean or Walter) by using 
a better data structure.  I've found several things in the GC source 
code that look like linear searches, but I don't understand the code 
well enough to know what to do about them.

On 2/19/2011 10:17 AM, bearophile wrote:
 dsimcha:

 BTW, here are the timings if I remove the GC.BlkAttr.NO_SCAN, meaning
 that everything gets scanned.  Said timings aren't drastically
 different.  Something is seriously wrong here.

Languages like Clojure, Scala, Boo, etc, start their development on a virtual machine where there is a refined GC, a standard library, a good back-end, etc, all things that require a very large amount of work to be built well and tuned. D language tries to re-create everything, even refusing the LLVM back-end (LDC docet) so there is a lot of painful work to do, to create a decent enough GC, etc. The current engineering quality of the Java GC will be out of reach of D language maybe forever... Bye, bearophile

Feb 19 2011
prev sibling next sibling parent reply Ulrik Mikaelsson <ulrik.mikaelsson gmail.com> writes:
Just a thought; I guess the references to the non-GC-scanned strings
are held in GC-scanned memory, right? Are the number of such
references also increased linearly?

I'm not a GC-expert, but if so, wouldn't that pretty much force the GC
to do at least one follow-up of every reference, before realizing it's
pointing to non-GC memory? That COULD explain the linear increase.

That said, I too feel some improvements in the border between GC and
other resource-managements methods are needed. The prospect of good
GC/non-GC combinations was what drew me here in the first place. I
would welcome some clear language and runtime support for
non-GC memory, such as frameworks for ref-counting and
tree-allocation, and well-defined semantics of object lifetime both in
GC and non-GC mode. I've mostly sorted out the kinks of this myself (I
think), but it's proving to be _very_ difficult, mostly undocumented,
and often appears to be an afterthought rather than by design.

Regards
/ Ulrik

2011/2/19 dsimcha <dsimcha yahoo.com>:
 I've been trying out D's new 64-bit compiler and a serious barrier to using
 it effectively seems to be abysmal garbage collection performance with large
 heaps. It seems like the time for a garbage collection to run scales
 linearly with the size of the heap *even if most of the heap is marked as
 NO_SCAN*. I'm running a program with a heap size of ~6GB, almost all of
 which is strings (DNA sequences), which are not scanned by the GC. It's
 spending most of its time in GC, based on pausing it every once in a while
 in GDB and seeing what's at the top of the stack.

 Here's a test program and the results for a few runs.

 import std.stdio, std.datetime, core.memory, std.conv;

 void main(string[] args) {
     if(args.length < 2) {
         stderr.writeln("Need size.");
         return;
     }

     immutable mul = to!size_t(args[1]);
     auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

     auto sw = StopWatch(autoStart);
     GC.collect();
     immutable msec = sw.peek.msecs;
     writefln("Collected a %s megabyte heap in %s milliseconds.",
         mul, msec);
 }

 Outputs for various sizes:

 Collected a 10 megabyte heap in 1 milliseconds.
 Collected a 50 megabyte heap in 4 milliseconds.
 Collected a 200 megabyte heap in 16 milliseconds.
 Collected a 500 megabyte heap in 41 milliseconds.
 Collected a 1000 megabyte heap in 80 milliseconds.
 Collected a 5000 megabyte heap in 397 milliseconds.
 Collected a 10000 megabyte heap in 801 milliseconds.
 Collected a 30000 megabyte heap in 2454 milliseconds.
 Collected a 50000 megabyte heap in 4096 milliseconds.

 Note that these tests were run on a server with over 100 GB of physical RAM,
 so a shortage of physical memory isn't the problem. Shouldn't GC be O(1)
 with respect to the size of the unscanned portion of the heap?

Feb 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/19/2011 12:50 PM, Ulrik Mikaelsson wrote:
 Just a thought; I guess the references to the non-GC-scanned strings
 are held in GC-scanned memory, right? Are the number of such
 references also increased linearly?

Well, first of all, the benchmark I posted seems to indicate otherwise. Second of all, I was running this program before on yeast DNA and it was ridiculously fast. Then I tried to do the same thing on human DNA and it became slow as molasses.

Roughly speaking, w/o getting into the biology much, I've got one string for each gene. Yeast have about 1/3 as many genes as humans, but the genes are on average about 100 times smaller. Therefore, the difference should be at most a small constant factor, and in actuality it's a huge constant factor.

Note: I know I could make the program in question a lot more space efficient, and that's what I ended up doing. It works now. It's just that it was originally written for yeast, where space efficiency is obviously not a concern, and I would have liked to just try a one-off calculation on the human genome without having to rewrite portions of it.
Feb 19 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 2/19/2011 2:46 PM, Trass3r wrote:
 Note: I know I could make the program in question a lot more space
 efficient, and that's what I ended up doing. It works now. It's just
 that it was originally written for yeast, where space efficiency is
 obviously not a concern, and I would have liked to just try a one-off
 calculation on the human genome without having to rewrite portions of it.

What kind of research do you do?

Bioinformatics. The program in question is basically looking at what DNA sequence features are most predictive of gene expression. I found some extreme weirdness in yeast, so I decided to see whether I could find the same thing in a species as distantly related as humans.
Feb 19 2011
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"dsimcha" <dsimcha yahoo.com> wrote in message 
news:ijp61d$1bu1$1 digitalmars.com...
 On 2/19/2011 12:50 PM, Ulrik Mikaelsson wrote:
 Just a thought; I guess the references to the non-GC-scanned strings
 are held in GC-scanned memory, right? Are the number of such
 references also increased linearly?

Well, first of all, the benchmark I posted seems to indicate otherwise. Second of all, I was running this program before on yeast DNA and it was ridiculously fast. Then I tried to do the same thing on human DNA and it became slow as molasses.

Roughly speaking, w/o getting into the biology much, I've got one string for each gene. Yeast have about 1/3 as many genes as humans, but the genes are on average about 100 times smaller. Therefore, the difference should be at most a small constant factor, and in actuality it's a huge constant factor.

Out of curiosity, roughly how many, umm "characters" (I forget the technical term for each T, G, etc), are in each yeast gene, and how many genes do they have? (Humans have, umm, was it 26? My last biology class was ages ago.)
 Note:  I know I could make the program in question a lot more space 
 efficient, and that's what I ended up doing.  It works now.  It's just 
 that it was originally written for yeast, where space efficiency is 
 obviously not a concern, and I would have liked to just try a one-off 
 calculation on the human genome without having to rewrite portions of it.

Feb 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 2/19/2011 10:21 PM, Nick Sabalausky wrote:
 Out of curiosity, roughly how many, umm "characters" (I forget the technical
 term for each T, G, etc), are in each yeast gene, and how many genes do they
 have? (Humans have, umm, was it 26? My last biology class was ages ago.)

It varies massively, but you can compute the averages yourself. There are about 6,000 yeast genes and about 12 million nucleotides (the technical term for "characters"). Humans have about 20k to 25k genes, and a total of about 3 billion nucleotides, though a lot of this is intergenic regions (stuff that isn't genes).
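Doing that arithmetic with the rough figures quoted above (a quick Python sketch; the human gene count is taken as the midpoint of the 20k-25k estimate):

```python
yeast_genes, yeast_nt = 6_000, 12_000_000
human_genes, human_nt = 22_500, 3_000_000_000  # midpoint of 20k-25k genes

print(yeast_nt // yeast_genes)   # ~2,000 nucleotides per yeast gene

# Dividing the whole genome by the gene count overstates human gene size,
# since much of the 3 billion nucleotides is intergenic.
print(human_nt // human_genes)   # ~133,000 nucleotides per "gene" by that naive division
```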
Feb 19 2011
parent "Nick Sabalausky" <a a.a> writes:
"dsimcha" <dsimcha yahoo.com> wrote in message 
news:ijq2bl$2opj$1 digitalmars.com...
 On 2/19/2011 10:21 PM, Nick Sabalausky wrote:
 Out of curiosity, roughly how many, umm "characters" (I forget the 
 technical
 term for each T, G, etc), are in each yeast gene, and how many genes do 
 they
 have? (Humans have, umm, was it 26? My last biology class was ages ago.)

It varies massively, but you can compute the averages yourself. There are about 6,000 yeast genes and about 12 million nucleotides (the technical term for "characters"). Humans have about 20k to 25k genes, and a total of about 3 billion nucleotides, though a lot of this is intergenic regions (stuff that isn't genes).

Ahh, I was confusing "gene" with "chromosome" (and probably still got the exact number wrong ;) ). Interesting stuff. And that certainly explains the computation-practicality difference of yeast vs human: approx 12MB vs 3GB, assuming ASCII/UTF-8.
Feb 19 2011
prev sibling next sibling parent Trass3r <un known.com> writes:
 Note:  I know I could make the program in question a lot more space  
 efficient, and that's what I ended up doing.  It works now.  It's just  
 that it was originally written for yeast, where space efficiency is  
 obviously not a concern, and I would have liked to just try a one-off  
 calculation on the human genome without having to rewrite portions of it.

What kind of research do you do?
Feb 19 2011
prev sibling next sibling parent reply retard <re tard.com.invalid> writes:
Sat, 19 Feb 2011 14:32:27 -0500, dsimcha wrote:

 On 2/19/2011 12:50 PM, Ulrik Mikaelsson wrote:
 Just a thought; I guess the references to the non-GC-scanned strings
 are held in GC-scanned memory, right? Are the number of such references
 also increased linearly?

Well, first of all, the benchmark I posted seems to indicate otherwise. Second of all, I was running this program before on yeast DNA and it was ridiculously fast. Then I tried to do the same thing on human DNA and it became slow as molasses.

Roughly speaking, w/o getting into the biology much, I've got one string for each gene. Yeast have about 1/3 as many genes as humans, but the genes are on average about 100 times smaller. Therefore, the difference should be at most a small constant factor, and in actuality it's a huge constant factor.

Note: I know I could make the program in question a lot more space efficient, and that's what I ended up doing. It works now. It's just that it was originally written for yeast, where space efficiency is obviously not a concern, and I would have liked to just try a one-off calculation on the human genome without having to rewrite portions of it.

Probably one reason for this behavior is the lack of testing. My desktop only has 24 GB of DDR3. I have another machine with 16 GB of DDR2, but I don't know how to combine the address spaces via clustering, and that would also horribly drag down GC performance. Even the JVM is badly tuned for larger systems; they might use the Azul Java runtimes instead...
Feb 19 2011
parent "Nick Sabalausky" <a a.a> writes:
"retard" <re tard.com.invalid> wrote in message 
news:ijp7pa$1d34$1 digitalmars.com...
 Sat, 19 Feb 2011 14:32:27 -0500, dsimcha wrote:

 On 2/19/2011 12:50 PM, Ulrik Mikaelsson wrote:
 Just a thought; I guess the references to the non-GC-scanned strings
 are held in GC-scanned memory, right? Are the number of such references
 also increased linearly?

Well, first of all, the benchmark I posted seems to indicate otherwise. Second of all, I was running this program before on yeast DNA and it was ridiculously fast. Then I tried to do the same thing on human DNA and it became slow as molasses.

Roughly speaking, w/o getting into the biology much, I've got one string for each gene. Yeast have about 1/3 as many genes as humans, but the genes are on average about 100 times smaller. Therefore, the difference should be at most a small constant factor, and in actuality it's a huge constant factor.

Note: I know I could make the program in question a lot more space efficient, and that's what I ended up doing. It works now. It's just that it was originally written for yeast, where space efficiency is obviously not a concern, and I would have liked to just try a one-off calculation on the human genome without having to rewrite portions of it.

Probably one reason for this behavior is the lack of testing. My desktop only has 24 GB of DDR3. I have another machine with 16 GB of DDR2, but don't know how to combine the address spaces via clustering. This would also horribly drag down GC performance. Even JVM is badly tuned for larger systems, they might use the Azul Java runtimes instead..

*Only* 24GB of DDR3, huh? :) Makes me feel like a pauper: I recently upgraded from 1GB to 2GB of DDR1 ;) (It actually had been 2GB a few years ago, but I cannibalized half of it to build my Linux box.)

Out of curiosity, what are you running on that? (Multiple instances of Crysis? High-definition voxels?)
Feb 19 2011
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 19 Feb 2011 00:03:27 -0500, dsimcha <dsimcha yahoo.com> wrote:

 I've been trying out D's new 64-bit compiler and a serious barrier to  
 using it effectively seems to be abysmal garbage collection performance  
 with large heaps. It seems like the time for a garbage collection to run  
 scales linearly with the size of the heap *even if most of the heap is  
 marked as NO_SCAN*.  I'm running a program with a heap size of ~6GB,  
 almost all of which is strings (DNA sequences), which are not scanned by  
 the GC.  It's spending most of its time in GC, based on pausing it every  
 once in a while in GDB and seeing what's at the top of the stack.

 Here's a test program and the results for a few runs.

 import std.stdio, std.datetime, core.memory, std.conv;

 void main(string[] args) {
      if(args.length < 2) {
          stderr.writeln("Need size.");
          return;
      }

      immutable mul = to!size_t(args[1]);
      auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

      auto sw = StopWatch(autoStart);
      GC.collect();
      immutable msec = sw.peek.msecs;
      writefln("Collected a %s megabyte heap in %s milliseconds.",
          mul, msec);
 }

 Outputs for various sizes:

 Collected a 10 megabyte heap in 1 milliseconds.
 Collected a 50 megabyte heap in 4 milliseconds.
 Collected a 200 megabyte heap in 16 milliseconds.
 Collected a 500 megabyte heap in 41 milliseconds.
 Collected a 1000 megabyte heap in 80 milliseconds.
 Collected a 5000 megabyte heap in 397 milliseconds.
 Collected a 10000 megabyte heap in 801 milliseconds.
 Collected a 30000 megabyte heap in 2454 milliseconds.
 Collected a 50000 megabyte heap in 4096 milliseconds.

 Note that these tests were run on a server with over 100 GB of physical  
 RAM, so a shortage of physical memory isn't the problem.  Shouldn't GC  
 be O(1) with respect to the size of the unscanned portion of the heap?

Having recently constructed the GC model in my head (and it's rapidly deteriorating from there, believe me), here is a stab at what I see could be a bottleneck. The way the GC works is you have this giant loop (in pseudocode):

bool changed = true;
while(changed) {
    changed = false;
    foreach(memblock; heap) {
        if(memblock.marked && memblock.containsPointers) {
            foreach(pointer; memblock) {
                auto memblock2 = heap.findBlock(pointer);
                if(memblock2 && !memblock2.marked) {
                    memblock2.mark();
                    changed = true;
                }
            }
        }
    }
}

So you can see two things. First, every iteration of the outer while loop goes through *all* memory blocks, even if they do not contain pointers. This has a non-negligible cost. Second, there looks like the potential for each pass of the while loop to mark only one, or at least a very small number, of blocks, so the algorithm's worst case degenerates into O(n^2). This may not be happening, but it made me a little uneasy.

The part in my mind that has already deteriorated is whether marked blocks which have already been scanned are scanned again. I would guess not, but if that's not the case, that would be a really easy thing to fix.

Also note that the findBlock function is a binary search, I think. So you are talking O(lg(n)) there, not O(1).

Like I said, I may not remember exactly how it works.

-steve
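For what it's worth, the textbook fix for both problems is a worklist (tri-color) mark that scans each reachable block exactly once, making marking O(blocks + pointers) rather than O(passes x blocks). A toy comparison, sketched in Python; the four-block chain heap here is invented for illustration and stands in for the real pool structures:

```python
from collections import deque

# Toy heap: block id -> ids it points to. The chain is listed in an order
# that is adversarial for the rescan-everything loop (dicts iterate in
# insertion order in Python 3.7+).
heap = {3: [], 2: [3], 1: [2], 0: [1]}
roots = [0]

def mark_worklist(heap, roots):
    """Each reachable block is scanned exactly once: O(blocks + pointers)."""
    marked = set(roots)
    work = deque(roots)
    scans = 0
    while work:
        block = work.popleft()
        scans += 1
        for target in heap[block]:
            if target not in marked:
                marked.add(target)
                work.append(target)
    return marked, scans

def mark_fixedpoint(heap, roots):
    """The loop from the post: rescans every block on every pass."""
    marked = set(roots)
    scans = 0
    changed = True
    while changed:
        changed = False
        for block in heap:           # every block, every pass
            scans += 1
            if block in marked:
                for target in heap[block]:
                    if target not in marked:
                        marked.add(target)
                        changed = True
    return marked, scans

marked_w, scans_w = mark_worklist(heap, roots)
marked_f, scans_f = mark_fixedpoint(heap, roots)
print(scans_w, scans_f)  # 4 16: same marking result, very different work
```

On a chain of n blocks laid out this way, the fixed-point loop does n passes over n blocks (n^2 scans) while the worklist does n scans total.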
Feb 19 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Yeh, I rebuilt the same model in my head over the past few hours (like 
you, I had a good mental model of the GC at one point but have slowly 
forgotten it).  Unfortunately it looks like there's no easy fix.  It 
also seems like gcbits are allocated as 1 bit for every 16 bytes of heap 
space, no matter what, and that a scan touches all of these, no matter 
what.  This is another place for O(N) to creep in.  About the only thing 
that would fix these is a rewrite, since addressing these problems would 
have ripple effects everywhere in the GC.
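To put a number on that: with one mark bit per 16 bytes, the metadata the collector touches grows linearly with the heap no matter what NO_SCAN says (rough arithmetic, Python sketch, assuming the 1-bit-per-16-bytes figure above):

```python
heap_bytes = 50_000 * 1_048_576   # the 50,000 MB test case from the benchmark
mark_bits = heap_bytes // 16      # one gcbit per 16 bytes of heap
mark_megabytes = mark_bits // 8 // 1_048_576
print(mark_megabytes)             # ~390 MB of mark bits touched per collection
```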

The obvious but ugly workaround is to use the C heap for large, 
long-lived data structures, especially ones that live until the program 
terminates (because then you don't have to worry about freeing them). 
This sucks horribly, though, because it forces you to think about memory 
management in exactly the kind of way the GC is meant to automate, and 
prevents the use of standard library functions and builtin array 
appending, etc. for building said data structures.

Maybe what we need to remedy this is a function called GC.forget(). 
This function would allow you to build your data structures as regular 
GC objects.  Then you'd call forget on the head pointer, and the GC 
would remove all information about the data structure (or move it to a 
completely unscanned, etc. forgottenData data structure, in which case 
you could call remind() to bring it back to normal GC memory), 
transitively if it's something more than just a simple array.  The data 
would now act as if it was allocated by the C heap.

In a way this is poor man's generational GC.  It's poor man's in the 
sense that it requires the user to manually specify which objects are 
long-lived.  On the other hand, it could be better than generational GC 
in some ways because you wouldn't need tons of compiler infrastructure 
or pay tons of performance penalties to update the GC data structures on 
simple pointer assignment.

Then again, I don't imagine GC.forget being particularly easy to 
implement either, except in the simplest case where you have a huge 
array that takes up a whole pool.  You'd have to do things like split 
pools in half if forget() was called on an object in the middle of a pool.
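A toy model of the proposed forget()/remind() semantics might look like this (entirely hypothetical API, sketched in Python; a real implementation would have to split and move pool metadata as described above):

```python
class ToyGC:
    """Toy model: collection cost is proportional to the number of
    blocks the collector still knows about."""
    def __init__(self):
        self.visible = {}    # address -> size; participates in collections
        self.forgotten = {}  # address -> size; behaves like C-heap memory

    def malloc(self, addr, size):
        self.visible[addr] = size

    def forget(self, addr):
        # Remove the block from collector-visible metadata entirely.
        self.forgotten[addr] = self.visible.pop(addr)

    def remind(self, addr):
        # Bring the block back under normal GC management.
        self.visible[addr] = self.forgotten.pop(addr)

    def collect_work(self):
        return len(self.visible)

tgc = ToyGC()
tgc.malloc(0x1000, 6_000_000_000)  # the big long-lived DNA array
tgc.malloc(0x2000, 64)             # an ordinary short-lived allocation
tgc.forget(0x1000)
print(tgc.collect_work())  # 1: the big array no longer adds per-collection cost
```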

On 2/19/2011 5:34 PM, Steven Schveighoffer wrote:
 On Sat, 19 Feb 2011 00:03:27 -0500, dsimcha <dsimcha yahoo.com> wrote:

 I've been trying out D's new 64-bit compiler and a serious barrier to
 using it effectively seems to be abysmal garbage collection
 performance with large heaps. It seems like the time for a garbage
 collection to run scales linearly with the size of the heap *even if
 most of the heap is marked as NO_SCAN*. I'm running a program with a
 heap size of ~6GB, almost all of which is strings (DNA sequences),
 which are not scanned by the GC. It's spending most of its time in GC,
 based on pausing it every once in a while in GDB and seeing what's at
 the top of the stack.

 Here's a test program and the results for a few runs.

 import std.stdio, std.datetime, core.memory, std.conv;

 void main(string[] args) {
     if(args.length < 2) {
         stderr.writeln("Need size.");
         return;
     }

     immutable mul = to!size_t(args[1]);
     auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

     auto sw = StopWatch(autoStart);
     GC.collect();
     immutable msec = sw.peek.msecs;
     writefln("Collected a %s megabyte heap in %s milliseconds.",
         mul, msec);
 }

 Outputs for various sizes:

 Collected a 10 megabyte heap in 1 milliseconds.
 Collected a 50 megabyte heap in 4 milliseconds.
 Collected a 200 megabyte heap in 16 milliseconds.
 Collected a 500 megabyte heap in 41 milliseconds.
 Collected a 1000 megabyte heap in 80 milliseconds.
 Collected a 5000 megabyte heap in 397 milliseconds.
 Collected a 10000 megabyte heap in 801 milliseconds.
 Collected a 30000 megabyte heap in 2454 milliseconds.
 Collected a 50000 megabyte heap in 4096 milliseconds.

 Note that these tests were run on a server with over 100 GB of
 physical RAM, so a shortage of physical memory isn't the problem.
 Shouldn't GC be O(1) with respect to the size of the unscanned portion
 of the heap?

Having recently constructed the GC model in my head (and it's rapidly deteriorating from there, believe me), here is a stab at what I see could be a bottleneck. The way the GC works is you have this giant loop (in pseudocode):

bool changed = true;
while(changed) {
    changed = false;
    foreach(memblock; heap) {
        if(memblock.marked && memblock.containsPointers) {
            foreach(pointer; memblock) {
                auto memblock2 = heap.findBlock(pointer);
                if(memblock2 && !memblock2.marked) {
                    memblock2.mark();
                    changed = true;
                }
            }
        }
    }
}

So you can see two things. First, every iteration of the outer while loop goes through *all* memory blocks, even if they do not contain pointers. This has a non-negligible cost. Second, there looks like the potential for each pass of the while loop to mark only one, or at least a very small number, of blocks, so the algorithm's worst case degenerates into O(n^2). This may not be happening, but it made me a little uneasy.

The part in my mind that has already deteriorated is whether marked blocks which have already been scanned are scanned again. I would guess not, but if that's not the case, that would be a really easy thing to fix.

Also note that the findBlock function is a binary search, I think. So you are talking O(lg(n)) there, not O(1).

Like I said, I may not remember exactly how it works.

-steve

Feb 19 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
...and actually, forget() would only work for arrays of primitives, 
because if the object has pointers, you can change what these point to 
after calling forget() and Bad Things Can Happen.

On 2/19/2011 6:17 PM, dsimcha wrote:
 Yeh, I rebuilt the same model in my head over the past few hours (like
 you, I had a good mental model of the GC at one point but have slowly
 forgotten it). Unfortunately it looks like there's no easy fix. It also
 seems like gcbits are allocated as 1 bit for every 16 bytes of heap
 space, no matter what, and that a scan touches all of these, no matter
 what. This is another place for O(N) to creep in. About the only thing
 that would fix these is a rewrite, since addressing these problems would
 have ripple effects everywhere in the GC.

 The obvious but ugly workaround is to use the C heap for large,
 long-lived data structures, especially ones that live until the program
 terminates (because then you don't have to worry about freeing them).
 This sucks horribly, though, because it forces you to think about memory
 management in exactly the kind of way the GC is meant to automate, and
 prevents the use of standard library functions and builtin array
 appending, etc. for building said data structures.

 Maybe what we need to remedy this is a function called GC.forget(). This
 function would allow you to build your data structures as regular GC
 objects. Then you'd call forget on the head pointer, and the GC would
 remove all information about the data structure (or move it to a
 completely unscanned, etc. forgottenData data structure, in which case
 you could call remind() to bring it back to normal GC memory),
 transitively if it's something more than just a simple array. The data
 would now act as if it was allocated by the C heap.

 In a way this is poor man's generational GC. It's poor man's in the
 sense that it requires the user to manually specify which objects are
 long-lived. On the other hand, it could be better than generational GC
 in some ways because you wouldn't need tons of compiler infrastructure
 or pay tons of performance penalties to update the GC data structures on
 simple pointer assignment.

 Then again, I don't imagine GC.forget being particularly easy to
 implement either, except in the simplest case where you have a huge
 array that takes up a whole pool. You'd have to do things like split
 pools in half if forget() was called on an object in the middle of a pool.

 On 2/19/2011 5:34 PM, Steven Schveighoffer wrote:
 On Sat, 19 Feb 2011 00:03:27 -0500, dsimcha <dsimcha yahoo.com> wrote:

 I've been trying out D's new 64-bit compiler and a serious barrier to
 using it effectively seems to be abysmal garbage collection
 performance with large heaps. It seems like the time for a garbage
 collection to run scales linearly with the size of the heap *even if
 most of the heap is marked as NO_SCAN*. I'm running a program with a
 heap size of ~6GB, almost all of which is strings (DNA sequences),
 which are not scanned by the GC. It's spending most of its time in GC,
 based on pausing it every once in a while in GDB and seeing what's at
 the top of the stack.

 Here's a test program and the results for a few runs.

 import std.stdio, std.datetime, core.memory, std.conv;

 void main(string[] args) {
     if(args.length < 2) {
         stderr.writeln("Need size.");
         return;
     }

     immutable mul = to!size_t(args[1]);
     auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

     auto sw = StopWatch(autoStart);
     GC.collect();
     immutable msec = sw.peek.msecs;
     writefln("Collected a %s megabyte heap in %s milliseconds.",
         mul, msec);
 }

 Outputs for various sizes:

 Collected a 10 megabyte heap in 1 milliseconds.
 Collected a 50 megabyte heap in 4 milliseconds.
 Collected a 200 megabyte heap in 16 milliseconds.
 Collected a 500 megabyte heap in 41 milliseconds.
 Collected a 1000 megabyte heap in 80 milliseconds.
 Collected a 5000 megabyte heap in 397 milliseconds.
 Collected a 10000 megabyte heap in 801 milliseconds.
 Collected a 30000 megabyte heap in 2454 milliseconds.
 Collected a 50000 megabyte heap in 4096 milliseconds.

 Note that these tests were run on a server with over 100 GB of
 physical RAM, so a shortage of physical memory isn't the problem.
 Shouldn't GC be O(1) with respect to the size of the unscanned portion
 of the heap?
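
The timings above work out to a nearly constant cost per gigabyte, which is exactly what "linear in heap size" predicts. A quick sketch (in Python, not part of the original benchmark) makes that explicit:

```python
# Timings copied from the benchmark output above: collection time grows
# linearly with heap size, at roughly 80 ms per GB of unscanned heap.
sizes_mb = [10, 50, 200, 500, 1000, 5000, 10000, 30000, 50000]
times_ms = [1, 4, 16, 41, 80, 397, 801, 2454, 4096]

# Milliseconds per gigabyte for each run; a flat sequence means O(heap size).
rates = [t / s * 1000 for s, t in zip(sizes_mb, times_ms)]
print([round(r, 1) for r in rates])
```

Every run lands in a narrow band around 80 ms/GB, across four orders of magnitude of heap size.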

Having recently constructed the GC model in my head (and it's rapidly deteriorating from there, believe me), here is a stab at what I see could be a bottleneck.

The way the GC works, you have this giant loop (in pseudocode):

bool changed = true;
while(changed) {
    changed = false;
    foreach(memblock; heap) {
        if(memblock.marked && memblock.containsPointers) {
            foreach(pointer; memblock) {
                auto memblock2 = heap.findBlock(pointer);
                if(memblock2 && !memblock2.marked) {
                    memblock2.mark();
                    changed = true;
                }
            }
        }
    }
}

So you can see two things. First, every iteration of the outer while loop goes through *all* memory blocks, even those that do not contain pointers. This has a non-negligible cost. Second, there looks like the potential for each pass of the while loop to mark only one block, or at least a very small number, so the algorithm's worst case degenerates into O(n^2). This may not be happening, but it made me a little uneasy.

The part in my mind that has already deteriorated is whether marked blocks that have already been scanned are scanned again. I would guess not, but if that's not the case, it would be a really easy thing to fix.

Also note that the findBlock function is, I think, a binary search. So you are talking O(lg(n)) there, not O(1).

Like I said, I may not remember exactly how it works.

-steve
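
The quadratic worst case described in that post is easy to demonstrate with a toy model. Here is a sketch in Python (the block/heap structures are made up for illustration; this is not druntime code): a reverse-ordered chain of blocks forces the naive fixpoint loop to mark only one new block per pass, while a worklist-based marker visits each reachable block exactly once.

```python
def naive_mark(blocks, roots):
    """Fixpoint loop over the whole heap, like the pseudocode above.
    blocks[i] is the list of block indices that block i points to."""
    marked = set(roots)
    visits = 0
    changed = True
    while changed:
        changed = False
        for b in range(len(blocks)):          # scans *every* block each pass
            visits += 1
            if b in marked:
                for target in blocks[b]:
                    if target not in marked:
                        marked.add(target)
                        changed = True
    return marked, visits

def worklist_mark(blocks, roots):
    """Each reachable block is pushed and scanned exactly once."""
    marked = set(roots)
    stack = list(roots)
    visits = 0
    while stack:
        b = stack.pop()
        visits += 1
        for target in blocks[b]:
            if target not in marked:
                marked.add(target)
                stack.append(target)
    return marked, visits

# Worst case for the naive loop: block i points to block i-1, and the only
# root is the last block, so each full pass discovers just one new block.
n = 200
chain = [[] if i == 0 else [i - 1] for i in range(n)]
m1, v1 = naive_mark(chain, [n - 1])
m2, v2 = worklist_mark(chain, [n - 1])
assert m1 == m2          # both find the same reachable set
print(v1, v2)            # naive: ~n^2 block visits; worklist: n
```

Both reach the same fixpoint, but the naive loop does n passes of n visits each, i.e. O(n^2), where the worklist does n visits total.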


Feb 19 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 Yeh, I rebuilt the same model in my head over the past few hours (like 
 you, I had a good mental model of the GC at one point but have slowly 
 forgotten it).

For serious D programmers, having such a model in the head is so important that I'd like a short page with such explanations & pseudocode on the D site :-)

Bye,
bearophile
Feb 19 2011
parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:ijpkh8$232r$1 digitalmars.com...
 dsimcha:

 Yeh, I rebuilt the same model in my head over the past few hours (like
 you, I had a good mental model of the GC at one point but have slowly
 forgotten it).

For serious D programmers having such model in the head is so important that I'd like a short page with such explanations & pseudocode in the D site :-)

I seem to remember finding out at one point (the hard way) that arrays of primitives are scanned for pointers unless you explicitly tell the GC not to do so (for each array).

A long time ago, this caused me major headaches when I tried to write a batch-processing tool for image files. I never actually verified this, but my understanding was that the file and image data contained false pointers that prevented the data of completed images from being collected. So the program just happily kept munching away at more and more memory (and presumably, more false pointers) with each image, until after only a fairly small number of files it ground to a sputtering halt under the weight of its own memory footprint.
Feb 19 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 19 Feb 2011 22:39:10 -0500, Nick Sabalausky <a a.a> wrote:

 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:ijpkh8$232r$1 digitalmars.com...
 dsimcha:

 Yeh, I rebuilt the same model in my head over the past few hours (like
 you, I had a good mental model of the GC at one point but have slowly
 forgotten it).

For serious D programmers having such model in the head is so important that I'd like a short page with such explanations & pseudocode in the D site :-)

I seem to remember finding out at one point (the hard way) that arrays of primitives are scanned for pointers unless you explicitly tell the GC not to do so (for each array).

A long time ago, this caused me major headaches when I tried to write a batch-processing tool for image files. I never actually verified this, but my understanding was that the file and image data contained false pointers that prevented the data of completed images from being collected. So the program just happily kept munching away at more and more memory (and presumably, more false pointers) with each image, until after only a fairly small number of files it ground to a sputtering halt under the weight of its own memory footprint.

This is not correct. Arrays of value-elements (builtins, or custom structs with no pointers in them; the compiler records this in the TypeInfo) are marked NO_SCAN when the array is constructed.

Here are the two problems I know of:

1. You can have a struct with sparse pointers in it; an array of such structs would not be marked NO_SCAN. This means even the non-pointer parts are scanned as pointers. For example:

struct Bad
{
    void* pointer;
    int[1024] data;
}

The data member is scanned in addition to the pointer member. This can create false hits on the heap, keeping those blocks alive. IMO, this problem is not very significant -- those types of aggregates are uncommon.

2. A large value-element array is kept in memory because of false pointers. There are several kinds of memory that are *always* considered to contain pointers, so you always have potential false pointers lying around, even if you never allocate any heap data. For example, stack data is always scanned conservatively, as are global and TLS data. I think even the proposed "precise" scanning patches for D's GC in bugzilla do not address stack or global space.

This is the bigger problem. If you, for example, allocate a 200MB array in a 4GB address space, that's 1/20th of the total address space, so the chance that any given false pointer in the stack, TLS, or globals "points" at that data is 1/20. If you don't explicitly delete such arrays and stop pointing at them, that memory is now leaked. If you keep allocating more of them, you become even more likely to be hit by false pointers.

This second issue, I think, is much more likely to happen. I proposed at one point that we should have a library construct that aids in freeing such chunks of memory deterministically, to try and avoid this problem. I don't think it was well received, but I can't really remember.

-Steve
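
The 1/20 figure in that post is the chance for a *single* word of conservatively scanned memory; with many such words, the cumulative pinning probability is much higher. A sketch of the arithmetic (the uniform-random-word model and the word count are idealizing assumptions; real stack contents are not uniformly distributed):

```python
# Model: each word of conservatively scanned memory (stack, globals, TLS)
# is treated as a uniformly random value in the address space. Ask how
# likely at least one of them lands inside a large NO_SCAN block,
# falsely pinning it.
GB = 1 << 30
MB = 1 << 20

def pin_probability(block_bytes, address_space_bytes, n_words):
    p_hit = block_bytes / address_space_bytes   # one word's chance of hitting
    return 1 - (1 - p_hit) ** n_words           # chance that *any* word hits

# A 200 MB array in a 4 GB address space: ~1/20 per word, as in the post,
# but near-certain pinning once thousands of words are scanned as roots.
print(pin_probability(200 * MB, 4 * GB, 1))
print(pin_probability(200 * MB, 4 * GB, 10_000))
```

Under this model, a large allocation in a 32-bit address space is effectively guaranteed to be pinned by some false root, which matches the observed "leaks" of big value arrays.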
Feb 20 2011
prev sibling next sibling parent retard <re tard.com.invalid> writes:
Sat, 19 Feb 2011 22:15:52 -0500, Nick Sabalausky wrote:

 "retard" <re tard.com.invalid> wrote in message
 news:ijp7pa$1d34$1 digitalmars.com...
 Sat, 19 Feb 2011 14:32:27 -0500, dsimcha wrote:

 On 2/19/2011 12:50 PM, Ulrik Mikaelsson wrote:
 Just a thought; I guess the references to the non-GC-scanned strings
 are held in GC-scanned memory, right? Are the number of such
 references also increased linearly?

Well, first of all, the benchmark I posted seems to indicate otherwise. Second of all, I was running this program before on yeast DNA and it was ridiculously fast. Then I tried to do the same thing on human DNA and it became slow as molasses.

Roughly speaking, without getting into the biology much, I've got one string for each gene. Yeast have about 1/3 as many genes as humans, but the genes are on average about 100 times smaller. Therefore, the difference should be at most a smallish constant factor, but in actuality it's a huge one.

Note: I know I could make the program in question a lot more space-efficient, and that's what I ended up doing. It works now. It's just that it was originally written for yeast, where space efficiency is obviously not a concern, and I would have liked to just try a one-off calculation on the human genome without having to rewrite portions of it.
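
The constant factor implied by those numbers is easy to pin down (the absolute yeast gene count below is an illustrative assumption; only the ratios come from the post):

```python
# Rough arithmetic from the post: humans have ~3x as many genes as yeast,
# and human genes average ~100x larger, so total sequence data grows by
# about 3 * 100 = 300x between the yeast and human runs.
yeast_genes = 6_000              # illustrative assumption, not from the post
human_genes = 3 * yeast_genes
size_ratio = 100                 # avg human gene length / avg yeast gene length
data_ratio = (human_genes / yeast_genes) * size_ratio
print(data_ratio)
```

So a collector whose cost were O(1) in the NO_SCAN portion of the heap should see far less than a 300x slowdown; the observed behavior was much worse than the data ratio alone explains.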

Probably one reason for this behavior is the lack of testing on large heaps. My desktop only has 24 GB of DDR3. I have another machine with 16 GB of DDR2, but I don't know how to combine the address spaces via clustering, and that would horribly drag down GC performance anyway. Even the JVM is badly tuned for larger systems; people might use the Azul Java runtimes instead...

*Only* 24GB of DDR3, huh? :) Makes me feel like a pauper: I recently upgraded from 1GB to 2GB of DDR1 ;) (It actually had been 2GB a few years ago, but I cannibalized half of it to build my Linux box.)

Out of curiosity, what are you running on that? (Multiple instances of Crysis? High-definition voxels?)

The largest processes are virtual machines, application servers, a web server, an IDE environment, and several compiler instances running in parallel. The web browser also seems to have use for a gigabyte or two these days. As I recall, the memory was cheaper when I bought it than it is now. It's also cheaper than DDR2 or DDR (per gigabyte).
Feb 19 2011
prev sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Nick Sabalausky <a a.a> wrote:

 Ahh, I was confusing "gene" with "chromosome" (and probably still got the
 exact number wrong ;) ).

That you did. Humans have 23 chromosome pairs. But now you know!

--
Simen
Feb 20 2011