
digitalmars.D - Why does this simple test program leak 500MB of RAM?

reply FeepingCreature <feepingcreature gmail.com> writes:
Consider this program:

```
module test;

import core.memory : GC;
import core.sync.semaphore;
import core.thread;
import std;

enum keys = [
     "foo", "bar", "baz", "whee",
     "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
     "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
     "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];

void spamGC(Semaphore sema) {
     void recursionTest(int depth) {
         if (depth < 1000) {
             recursionTest(depth + 1);
             if (depth % 300 == 0)
                 recursionTest(depth + 1);
         }
         string[string] assocArray;
         static foreach (a; keys)
         {
             assocArray[a] = a;
         }
         // ?????
         assocArray.clear;
         assocArray = null;
     }
     recursionTest(0);
     sema.notify;
     Thread.sleep(3600.seconds);
}

void main() {
     Thread[] threads;
     auto sema = new Semaphore(0);
     enum threadCount = 100;
     threadCount.iota.each!((_) {
         auto thread = new Thread({ spamGC(sema); });
         thread.start;
         threads ~= thread;
     });
     // this clears the leak up!
     // threads.each!((thread) { thread.join; });
     threadCount.iota.each!((i) {
         sema.wait;
     });
     writefln!"Done.";
     100.iota.each!((_) {
         GC.collect;
         GC.minimize;
     });
     writefln!"Collected.";

     // Now look at resident memory for the process.
     Thread.sleep(3600.seconds);
}
```

We've had problems with excessive memory leaks in JSON decoding 
in production. So I've put together this test program.

The `spamGC` recursive call represents recursive JSON decoding.

The `Thread.sleep` call represents a thread idling in a 
threadpool.

After completion, the program sits at ~600MB resident. Looking at the 
code, it should be obvious that this memory is **entirely dead**.

If I alloca 10KB of stack at the beginning of the recursion, the 
resident RAM drops to ~180MB. I don't think this completely 
solves the issue, but it sure is interesting.
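
A rough sketch of what that experiment looks like, as a drop-in change to 
`spamGC` above (not the exact code; zeroing the scratch block with `memset` 
is part of my guess at the mechanism):

```
import core.stdc.stdlib : alloca;
import core.stdc.string : memset;

void spamGC(Semaphore sema) {
     // Grab ~10KB of stack before the recursion starts and wipe it, so that
     // region of the stack holds no stale pointer-looking values.
     enum purgeSize = 10 * 1024;
     void* scratch = alloca(purgeSize);
     memset(scratch, 0, purgeSize);

     void recursionTest(int depth) { /* ... exactly as above ... */ }
     recursionTest(0);
     sema.notify;
     Thread.sleep(3600.seconds);
}
```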

`-gx` does nothing.
Jun 19 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/19/23 6:15 AM, FeepingCreature wrote:
 Consider this program:
 
 ```
 module test;
 
 import core.memory : GC;
 import core.sync.semaphore;
 import core.thread;
 import std;
 
 enum keys = [
      "foo", "bar", "baz", "whee",
      "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
      "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
      "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];
 
 void spamGC(Semaphore sema) {
      void recursionTest(int depth) {
          if (depth < 1000) {
              recursionTest(depth + 1);
              if (depth % 300 == 0)
                  recursionTest(depth + 1);
          }
          string[string] assocArray;
          static foreach (a; keys)
          {
              assocArray[a] = a;
          }
          // ?????
          assocArray.clear;
          assocArray = null;
      }
      recursionTest(0);
      sema.notify;
      Thread.sleep(3600.seconds);
 }
 
 void main() {
      Thread[] threads;
      auto sema = new Semaphore(0);
      enum threadCount = 100;
      threadCount.iota.each!((_) {
          auto thread = new Thread({ spamGC(sema); });
          thread.start;
          threads ~= thread;
      });
      // this clears the leak up!
      // threads.each!((thread) { thread.join; });
      threadCount.iota.each!((i) {
          sema.wait;
      });
      writefln!"Done.";
      100.iota.each!((_) {
          GC.collect;
          GC.minimize;
      });
      writefln!"Collected.";
 
      // Now look at resident memory for the process.
      Thread.sleep(3600.seconds);
 }
 ```
 
 We've had problems with excessive memory leaks in JSON decoding in 
 production. So I've put together this test program.
 
 The `spamGC` recursive call represents recursive JSON decoding.
 
 The `Thread.sleep` call represents a thread idling in a threadpool.
 
 After completion, the program sits at ~600MB resident. Looking at the 
 code, it should be obvious that this memory is **entirely dead**.
The `clear` should remove all references to the bucket elements (it basically nulls out the bucket elements without deallocating the elements or the bucket array). The `assocArray = null` could be considered a dead store by the const-folder, so it's possible that isn't actually being performed.
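
To illustrate the distinction, a minimal sketch of the AA semantics (assuming 
I have them right):

```
unittest
{
    string[string] aa = ["foo": "foo", "bar": "bar"];
    auto other = aa;            // second reference to the same AA

    aa.clear;                   // removes the entries in place...
    assert(aa.length == 0);
    assert(other.length == 0);  // ...so the other reference sees an empty AA too

    aa = null;                  // only drops this local reference
    assert(other !is null);     // the (now empty) AA structure stays reachable via `other`
}
```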
 
 If I alloca 10KB of stack at the beginning of the recursion, the 
 resident RAM drops to ~180MB. I don't think this completely solves 
 the issue, but it sure is interesting.
 
That is interesting...

So I'm looking at the druntime code. If I understand this correctly, the GC doesn't try running a collection when allocating small blocks until 2.5GB of memory is consumed? That can't be right...

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/os.d#L215-L252

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1949-L1967

But maybe that's only if some other condition is not met.

Can you use GC profiling to determine how many collections run during the affected time? I am curious whether the problem is the GC not collecting things it should collect, or the GC not collecting at all.

This seems to play a factor, but I don't understand it:

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1871-L1892

-Steve
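
For reference, a rough way to get that number from inside the program 
(sketch; it assumes a druntime recent enough to expose `GC.profileStats`):

```
import core.memory : GC;
import std.stdio : writefln;

// Print how many collections the GC has run so far (cumulative counters).
void reportCollections(string label)
{
    auto p = GC.profileStats;
    writefln!"%s: %s collections, total pause %s"(
        label, p.numCollections, p.totalPauseTime);
}

// e.g. call reportCollections("before") / reportCollections("after")
// around the spamGC phase of the test program.
```

Alternatively, running the program with the `--DRT-gcopt=profile:1` runtime 
switch prints a GC summary at exit.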
Jun 19 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 15:59:39 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 6:15 AM, FeepingCreature wrote:
 If I alloca 10KB of stack at the beginning of the recursion, 
 the resident RAM drops to ~180MB. I don't think this 
 completely solves the issue, but it sure is interesting.
 
 That is interesting...

Some additional info I probably should have mentioned.

Note that I'm manually calling the GC at the end. At that point, everything should be dead. heaptrack shows that we get something like 8 big GC runs over the program's runtime: as expected, because it's overprovisioning a "healthy" amount.

My standing theory is that it's the zombie stack issue again: see https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for what I think is happening.

I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end, though - which admittedly is a lot better than 600MB.
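
The fiber half of that experiment looks roughly like this (sketch; the name 
`spamGCInFiber` and the 1MB stack size are mine, and the recursion body is 
elided):

```
import core.thread : Fiber;

void spamGCInFiber(Semaphore sema) {
     // Run the recursion on the fiber's own stack instead of the thread
     // stack, so its frames never touch the stack that stays live while
     // the thread sleeps.
     auto fib = new Fiber({ /* recursionTest(0) as in the original spamGC */ },
                          1024 * 1024);
     fib.call();                            // runs to completion
     assert(fib.state == Fiber.State.TERM);

     // ... then zero out the fiber's stack (the step described above) ...
     sema.notify;
     Thread.sleep(3600.seconds);
}
```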
Jun 19 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/19/23 12:40 PM, FeepingCreature wrote:

 Some additional info I probably should have mentioned.
 
 Note that I'm manually calling the GC at the end. At that point, 
 everything should be dead. heaptrack shows that we get something like 8 
 big GC runs over the program runtime: as expected, because it's 
 overprovisioning a "healthy" amount.
 
 My standing theory is that it's the zombie stack issue again: see 
 https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for 
 what I think is happening.
 
 I'm currently experimenting with setting up a fiber, then zeroing out 
 its stack completely after the function has run in it. Results are 
 mixed, but promising. I still end up with ~100MB live at the end though 
 - which admittedly is a lot better than 600MB.
Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so as well -- an entire allocated pool has to be unallocated for it to be released to the OS.

I know there have always been mysterious "reasons" for the GC not to collect things that are stack-referenced, but I tend to first try to rule out looking at the wrong information, and/or other simple explanations, before getting there.

Having somewhat recently looked at how GC scanning works for threads, I'm reasonably certain that unused stack isn't being scanned (even if it's allocated). On a 64-bit machine, the likelihood of accidentally keeping things referenced on the stack is low.

I would also point out that your specific case allocates *most* of the memory secondary-referenced from the stack. That is, you might refer to the AA structure from the stack, but not the bucket elements (where the bulk is allocated). And using `aa.clear` should make that even less likely. All of the allocated buckets should be definitively garbage, with no possibility of zombie references.

-Steve
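
A rough way to look at both numbers side by side on Linux (sketch; the 
`/proc` parsing is ad hoc and the field name is Linux-specific):

```
import core.memory : GC;
import std.algorithm.searching : startsWith;
import std.file : readText;
import std.stdio : writefln;
import std.string : lineSplitter, strip;

// Compare what the GC thinks it holds with what the OS says is resident.
void reportMemory()
{
    auto s = GC.stats;
    writefln!"GC used: %s bytes, GC free (held but unused): %s bytes"(
        s.usedSize, s.freeSize);

    // VmRSS is the process's resident set size.
    foreach (line; readText("/proc/self/status").lineSplitter)
        if (line.startsWith("VmRSS:"))
            writefln!"OS resident: %s"(line["VmRSS:".length .. $].strip);
}
```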
Jun 19 2023
next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 12:40 PM, FeepingCreature wrote:

 Some additional info I probably should have mentioned.
 
 Note that I'm manually calling the GC at the end. At that 
 point, everything should be dead. heaptrack shows that we get 
 something like 8 big GC runs over the program runtime: as 
 expected, because it's overprovisioning a "healthy" amount.
 
 My standing theory is that it's the zombie stack issue again: 
 see 
 https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for what I
think is happening.
 
 I'm currently experimenting with setting up a fiber, then 
 zeroing out its stack completely after the function has run in 
 it. Results are mixed, but promising. I still end up with 
 ~100MB live at the end though - which admittedly is a lot 
 better than 600MB.
 Are you looking at GC memory usage or process usage? Because the GC 
 doesn't generally give back memory to the OS even when it could. And 
 fragmentation can stop it from doing so as well -- an entire allocated 
 pool has to be unallocated for it to be released to the OS.

 I know there have always been mysterious "reasons" for the GC not to 
 collect things that are stack-referenced, but I tend to first try to 
 rule out looking at the wrong information, and/or other simple 
 explanations, before getting there.
I'm looking at process usage. But the thing is, I *can* make it go down to sub-100MB by playing with the stack. I'll post code for that version when I get to work.
Jun 19 2023
prev sibling next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 12:40 PM, FeepingCreature wrote:
 I'm currently experimenting with setting up a fiber, then 
 zeroing out its stack completely after the function has run in 
 it. Results are mixed, but promising. I still end up with 
 ~100MB live at the end though - which admittedly is a lot 
 better than 600MB.
 Are you looking at GC memory usage or process usage? Because the GC 
 doesn't generally give back memory to the OS even when it could. And 
 fragmentation can stop it from doing so as well -- an entire allocated 
 pool has to be unallocated for it to be released to the OS.

 I know there have always been mysterious "reasons" for the GC not to 
 collect things that are stack-referenced, but I tend to first try to 
 rule out looking at the wrong information, and/or other simple 
 explanations, before getting there.
Okay, here's the "full stack cleanup" version: https://gist.github.com/FeepingCreature/3baf6bd62dc64441fe73f46e3df9b482

It spawns a fiber, runs the recursion in the fiber, then uses madvise and druntime abuse to set the entire fiber stack to zero.

RSS averaged over five runs:

With fiber, without madvise: ~177MB
With fiber, with madvise: ~75MB

But this is extremely weird. Thinking about it, the `madvise` call really shouldn't matter! The fiber has exited, so its stack shouldn't be a GC root. Even if it is, there should be no way that any references to our string[string]s live in any possibly-live stack area in there. And yet, explicitly zeroing it out repeatably improves GC performance.

Conversely, just running the recursion in a fiber shouldn't do anything either: either that stack area is scanned or it isn't. Well, empirically, "fiber and cleanup makes RSS go down", which is the only thing of actual importance, I guess.
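
For readers who don't want to open the gist: the wipe step is conceptually 
the following (sketch only; how the fiber's stack bounds are obtained is the 
"druntime abuse" part and is elided here, so `stackLo`/`stackLen` are assumed 
to be known and page-aligned):

```
import core.stdc.string : memset;

// Linux madvise; MADV_DONTNEED lets the kernel discard the backing pages.
extern (C) int madvise(void* addr, size_t length, int advice) nothrow @nogc;
enum MADV_DONTNEED = 4; // Linux value

void wipeFiberStack(void* stackLo, size_t stackLen)
{
    memset(stackLo, 0, stackLen);              // erase any stale pointer-looking values
    madvise(stackLo, stackLen, MADV_DONTNEED); // resident memory for this range drops
}
```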
Jun 19 2023
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 Are you looking at GC memory usage or process usage? Because 
 the GC doesn't generally give back memory to the OS even when 
 it could. And fragmentation can stop it from doing so also -- 
 an entire allocated pool has to be unallocated for it to 
 release it to the OS.
Okay, I can confirm it's some sort of fragmentation issue. Printing GC stats:

```
writefln!"Done. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
100.iota.each!((_) {
    GC.collect;
    GC.minimize;
});
writefln!"Collected. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
```

Without alloca purging:

```
Done. (155347088 used, 527275888 free)
Collected. (495808 used, 682127168 free)
```

So all but ~500KB of GC memory is free, but the process sits at 796MB resident.

With alloca purging:

```
Done. (35950960 used, 646672016 free)
Collected. (26528 used, 36673632 free)
```

We got rid of another ~450KB, but resident memory dropped to 164MB!

I think minuscule stack leaks must be keeping large GC pools alive here.

I kind of wish there was a way to say "1. All allocations in this thread should go into a dedicated pool X" and "2. Okay, we're done with the work; all remaining references to these allocations are a bug; scan for them, and crash (with a message) if you find any."
Jun 30 2023
parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:
 I think minuscule stack leaks must be keeping large GC pools 
 alive here.
Just throwing this in here in case it is of significance: does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference?

https://dlang.org/spec/garbage.html#precise_gc

-- Bastiaan.
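
For reference, the same GC options can also be baked into the binary via 
`rt_options` (documented on that same page); for example, precise scanning 
plus a profile report at exit:

```
// Compiled-in defaults for the runtime; a command-line --DRT-gcopt still overrides them.
extern (C) __gshared string[] rt_options = ["gcopt=gc:precise profile:1"];
```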
Jun 30 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 30 June 2023 at 18:49:58 UTC, Bastiaan Veelo wrote:
 On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:
 I think minuscule stack leaks must be keeping large GC pools 
 alive here.
Just throwing this in here in case it is of significance: Does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference? https://dlang.org/spec/garbage.html#precise_gc -- Bastiaan.
It's never made any difference any time I've tried it. I'd expect precise scanning to only be relevant on 32-bit, since it doesn't precisely scan the stack.

edit: Heh. The precise GC manages to collect *less* garbage in the test above, leaving the program standing at 670MB resident. Explain that one. :P
Jul 01 2023
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 7/1/23 5:24 AM, FeepingCreature wrote:

 edit: Heh. The precise GC manages to collect *less* garbage in the test 
 above, leaving the program standing at 670MB resident.
 
 Explain that one. :P
Probably a coincidental adjustment in GC bookkeeping allocations.

If I had a nickel for every time someone posted a "why does the GC behave this way!" message, I'd probably have, like, $2-3.

-Steve
Jul 02 2023