
digitalmars.D - A GC Memory Usage Experiment: `--DRT-gcopt=heapSizeFactor` is Magic

reply FeepingCreature <feepingcreature gmail.com> writes:
We have a process that balloons up to 5GB in production. That's a 
bit much, so I started looking into ways to rein it in.

tl;dr: Add `--DRT-gcopt=heapSizeFactor:0.25` if you don't care 
about CPU use, but want to keep RAM usage low.
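
For reference, here is a minimal sketch of two ways to pass the 
option: on the command line (druntime strips `--DRT-*` arguments 
before `main` sees them), or baked into the binary via druntime's 
`rt_options`. The binary name is made up; the snippet is 
illustrative, not taken from our actual setup.

```
// Option 1: pass it when launching the program:
//     ./myserver --DRT-gcopt=heapSizeFactor:0.25
//
// Option 2: compile the default into the binary via druntime's
// rt_options (module scope, anywhere in the program):
extern(C) __gshared string[] rt_options =
    ["gcopt=heapSizeFactor:0.25"];

void main()
{
    // ... normal program ...
}
```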



This is a process that heavily uses `std.json` decoded into an 
object hierarchy in a multithreaded setup to load data from a 
central store.
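
To make the workload concrete, a tiny sketch of the pattern (the 
JSON shape and names are invented): each `parseJSON` call builds 
a tree of `JSONValue` nodes on the GC heap, which becomes garbage 
as soon as it is copied into the object hierarchy.

```
import std.json;

struct Record { string name; double value; }

Record toRecord(JSONValue v)
{
    // the intermediate JSONValue tree is discarded after this copy
    return Record(v["name"].str, v["value"].floating);
}

void main()
{
    auto r = toRecord(parseJSON(`{"name": "example", "value": 1.5}`));
    assert(r.name == "example");
}
```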

I let it run until it had loaded all its data, then recorded the 
RES value from top.

Each configuration was run three times and averaged. Note that 
actual numbers are from memory because I lost the data in a freak 
LibreOffice crash (first time I've had Restore Document outright 
_fail_ on me), but I don't think they're more than ±100MB off.

`heaptrack-d` (thanks Tim Schendekehl, your heaptrack branch 
keeps on being useful) says that the used memory is about 1:2 
split between internal allocations and std.json detritus. It also 
shows the memory usage sawtoothing up and down by ~2.5, several 
times during startup, as expected for a heavily GC-reliant 
process.



I have a standing theory (the "zombie stack hypothesis") that the 
D GC can leak memory through dead references that are falsely 
kept alive because they are never overwritten on the stack by 
subsequent calls. I.e.:

```
import core.thread : Thread;
import core.time : seconds;

void main() {
   void foo() {
     Object obj = new Object;
   }
   foo();
   // foo returns, but obj is still "live" because its slot sits
   // right above main's stack frame
   void bar() {
     // somehow a gap arises in bar's stack frame?
     void* ptr = void;
     Thread.sleep(600.seconds);
   }
   bar();
}
```

Now `obj` is dead but will live for at least 10 minutes, because 
its pointer value will never be erased.

It is unclear how much this actually happens in practice. 
However, I also tried DMD's `-gx` flag (stack stomping) to 
ostensibly suppress this effect.

All builds targeted 64-bit x86, using DMD 2.100.2 and LDC 1.30.0.



- DMD stock: 3.8GB
- DMD `-gx`: 3.4GB
- LDC stock: 3.1GB
- LDC `--DRT-gcopt=heapSizeFactor:0.25`: 800MB!!
- LDC with `"--DRT-gcopt=heapSizeFactor:0.25 gc:precise"`: 800MB



DMD stock loses by a massive margin, even compared to LDC stock. 
It's unclear what is going on there: one hypothesis is that LDC 
makes denser use of the stack than DMD, which would explain its 
advantage, but that would predict that DMD with `-gx` should come 
out roughly *equal* to LDC. Instead, LDC (without `-gx`!) still 
beats DMD `-gx` by a good margin.

It's important to note that these values have significant noise. 
Due to the nature of sparse GC runs, the result may be sensitive 
to where exactly in the sawtooth pattern the measurement was 
taken. However, as we averaged over multiple runs, LDC still 
seems to have an advantage here that neither noise nor the zombie 
stack hypothesis can fully explain.

Now to the big one: `heapSizeFactor` is **massive**. For some 
reason, running the GC vastly more often yields a more than 2x 
reduction. This is surprising because the default 
`heapSizeFactor` is already 2: if 3.1GB corresponds to a factor 
of 2, the live set should be around 1.5GB, and shrinking the 
factor should not get us below that. Yet we land at 800MB.
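
One way to sanity-check that live-set estimate is to ask the GC 
itself. A minimal sketch using druntime's `GC.stats` (reading 
`usedSize` as the baseline that `heapSizeFactor` multiplies is my 
interpretation):

```
import core.memory : GC;
import std.stdio : writefln;

void main()
{
    // Collect first, then report what the GC considers live vs.
    // held-but-free.
    GC.collect();
    auto s = GC.stats();
    writefln("used: %s MiB, free: %s MiB",
        s.usedSize / (1024 * 1024), s.freeSize / (1024 * 1024));
}
```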

It is possible that *something* about simply running the GC more 
often helps it clean up dead values more effectively. The zombie 
stack hypothesis has *some* opinions on this: maybe if a thread 
happens to be idle when the GC runs, its shallow stack helps the 
GC discover that references that would usually be seen as 
fake-alive are really dead? Am I using this hypothesis for 
everything because that's all I got? MAYBE!!
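
If that speculation holds, one cheap experiment would be to 
trigger collections explicitly at points where worker stacks are 
known to be shallow. A sketch (the idle loop and its period are 
hypothetical, not something we have measured):

```
import core.memory : GC;
import core.thread : Thread;
import core.time : seconds;

void idleCollector()
{
    while (true)
    {
        // The stack is as shallow as it gets right here, so a collection
        // should see fewer stale stack slots than one that runs mid-request.
        GC.collect();
        GC.minimize(); // also return unused pools to the OS
        Thread.sleep(60.seconds);
    }
}
```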

**Caveat**: The LDC run with `heapSizeFactor` was also 2x-3x 
slower than without it. This is okay in our case because the 
process in question spends the great majority of its lifetime 
sitting at ~5% CPU anyway.

Interestingly, the precise GC provided no benefit. My 
understanding is that precise garbage collection only considers 
fields in heap-allocated areas that are actually pointers, rather 
than merely classifying whole allocations as "may contain 
pointers" or "pointer-free." If so, the reason it provided no 
advantage may be that we're running on 64-bit: it's much less 
likely than on 32-bit that a random non-pointer value would alias 
a pointer. If the zombie stack hypothesis holds up, the major 
benefit would come from precise stack scanning rather than 
precise heap scanning: because memory is cleared on allocation by 
default, undead values on the heap are inherently much less 
likely. However, precise stack scanning is not implemented in any 
D compiler.

(Zombie heap leaks would arise only when a data structure like an 
array or hashmap is downsized in place without clearing the 
now-free fields.)
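
For illustration, a minimal sketch of that failure mode (the 
cache and its eviction policy are invented; whether the spare 
capacity is actually scanned depends on the allocation's GC 
attributes):

```
// Shrinking in place keeps the old slots, and any references in them,
// inside the same GC allocation.
Object[] cache;

void evictAllButFirst(size_t keep)
{
    cache[keep .. $] = null; // clear the soon-to-be-unused slots first
    cache.length = keep;     // then shrink; the evicted objects can be collected
}
```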

There is an open question as to what the tradeoff is for 
different values of `heapSizeFactor`. Ostensibly, values smaller 
than 1 should make no difference (being approximately "run the GC 
on every allocation"), but that doesn't seem to be how it works. 
There is also some internal smoothing of the actual target value 
at work.

In any case, we will keep `--DRT-gcopt=heapSizeFactor` in mind as 
our front-line response for processes using excessive amounts of 
memory.
Dec 09 2022
parent kinke <noone nowhere.com> writes:
A while back, I played around with the GC params too, for the 
compiler frontend itself (`-lowmem`). `maxPoolSize` might be 
interesting too. 
https://github.com/ldc-developers/ldc/pull/2916#issuecomment-443433594
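
For completeness, a sketch of combining several GC options in one 
`gcopt` string (the values are placeholders, not tuned 
recommendations):

```
// Several GC options can go in a single gcopt string.
extern(C) __gshared string[] rt_options =
    ["gcopt=heapSizeFactor:0.25 maxPoolSize:64"];
```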
Dec 09 2022