
digitalmars.D - Why does this simple test program leak 500MB of RAM?

reply FeepingCreature <feepingcreature gmail.com> writes:
Consider this program:

```
module test;

import core.memory : GC;
import core.sync.semaphore;
import core.thread;
import std;

enum keys = [
     "foo", "bar", "baz", "whee",
     "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
     "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
     "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];

void spamGC(Semaphore sema) {
     void recursionTest(int depth) {
         if (depth < 1000) {
             recursionTest(depth + 1);
             if (depth % 300 == 0)
                 recursionTest(depth + 1);
         }
         string[string] assocArray;
         static foreach (a; keys)
         {
             assocArray[a] = a;
         }
         // ?????
         assocArray.clear;
         assocArray = null;
     }
     recursionTest(0);
     sema.notify;
     Thread.sleep(3600.seconds);
}

void main() {
     Thread[] threads;
     auto sema = new Semaphore(0);
     enum threadCount = 100;
     threadCount.iota.each!((_) {
         auto thread = new Thread({ spamGC(sema); });
         thread.start;
         threads ~= thread;
     });
     // this clears the leak up!
     // threads.each!((thread) { thread.join; });
     threadCount.iota.each!((i) {
         sema.wait;
     });
     writefln!"Done.";
     100.iota.each!((_) {
         GC.collect;
         GC.minimize;
     });
     writefln!"Collected.";

     // Now look at resident memory for the process.
     Thread.sleep(3600.seconds);
}
```

We've had problems with excessive memory leaks in JSON decoding 
in production. So I've put together this test program.

The `spamGC` recursive call represents recursive JSON decoding.

The `Thread.sleep` call represents a thread idling in a 
threadpool.

After completion, the program sits at ~600MB resident. Looking at the 
code, it should be obvious that this memory is **entirely dead**.

If I alloca 10KB of stack at the beginning of the recursion, the 
resident RAM drops to ~180MB. I don't think this completely 
solves the issue, but it sure is interesting.
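
A rough sketch of what that experiment looks like, as a drop-in change to 
`spamGC` above (not the exact code; zeroing the scratch block with `memset` 
is part of my guess at the mechanism):

```
import core.stdc.stdlib : alloca;
import core.stdc.string : memset;

void spamGC(Semaphore sema) {
     // Grab ~10KB of stack before the recursion starts and wipe it, so that
     // region of the stack holds no stale pointer-looking values.
     enum purgeSize = 10 * 1024;
     void* scratch = alloca(purgeSize);
     memset(scratch, 0, purgeSize);

     void recursionTest(int depth) { /* ... exactly as above ... */ }
     recursionTest(0);
     sema.notify;
     Thread.sleep(3600.seconds);
}
```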

`-gx` does nothing.
Jun 19 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/19/23 6:15 AM, FeepingCreature wrote:
 Consider this program:
 
 ```
 module test;
 
 import core.memory : GC;
 import core.sync.semaphore;
 import core.thread;
 import std;
 
 enum keys = [
      "foo", "bar", "baz", "whee",
      "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
      "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
      "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];
 
 void spamGC(Semaphore sema) {
      void recursionTest(int depth) {
          if (depth < 1000) {
              recursionTest(depth + 1);
              if (depth % 300 == 0)
                  recursionTest(depth + 1);
          }
          string[string] assocArray;
          static foreach (a; keys)
          {
              assocArray[a] = a;
          }
          // ?????
          assocArray.clear;
          assocArray = null;
      }
      recursionTest(0);
      sema.notify;
      Thread.sleep(3600.seconds);
 }
 
 void main() {
      Thread[] threads;
      auto sema = new Semaphore(0);
      enum threadCount = 100;
      threadCount.iota.each!((_) {
          auto thread = new Thread({ spamGC(sema); });
          thread.start;
          threads ~= thread;
      });
      // this clears the leak up!
      // threads.each!((thread) { thread.join; });
      threadCount.iota.each!((i) {
          sema.wait;
      });
      writefln!"Done.";
      100.iota.each!((_) {
          GC.collect;
          GC.minimize;
      });
      writefln!"Collected.";
 
      // Now look at resident memory for the process.
      Thread.sleep(3600.seconds);
 }
 ```
 
 We've had problems with excessive memory leaks in JSON decoding in 
 production. So I've put together this test program.
 
 The `spamGC` recursive call represents recursive JSON decoding.
 
 The `Thread.sleep` call represents a thread idling in a threadpool.
 
 After completion, the program sits at ~600MB resident. Looking at the 
 code, it should be obvious that this memory is **entirely dead**.
The `clear` should remove all references to the bucket elements (it basically nulls out the bucket elements without deallocating the elements or the bucket array). The `assocArray = null` could be considered a dead store by the const-folder, so it's possible that isn't actually being performed.
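
To illustrate the distinction, a minimal sketch of the AA semantics (assuming 
I have them right):

```
unittest
{
    string[string] aa = ["foo": "foo", "bar": "bar"];
    auto other = aa;            // second reference to the same AA

    aa.clear;                   // removes the entries in place...
    assert(aa.length == 0);
    assert(other.length == 0);  // ...so the other reference sees an empty AA too

    aa = null;                  // only drops this local reference
    assert(other !is null);     // the (now empty) AA structure stays reachable via `other`
}
```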
 
 If I alloca 10KB of stack at the beginning of the recursion, the 
 resident RAM drops to ~180MB. I don't think this completely solves 
 the issue, but it sure is interesting.
 
That is interesting...

So I'm looking at the druntime code. If I understand this correctly, the GC doesn't try running a collection when allocating small blocks until 2.5GB of memory is consumed? That can't be right...

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/os.d#L215-L252

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1949-L1967

But maybe that's only if some other condition is not met.

Can you use GC profiling to determine how many collections run during the affected time? I am curious whether the problem is the GC not collecting things it should collect, or the GC not collecting at all.

This seems to play a factor, but I don't understand it:

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1871-L1892

-Steve
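
For reference, a rough way to get that number from inside the program 
(sketch; it assumes a druntime recent enough to expose `GC.profileStats`):

```
import core.memory : GC;
import std.stdio : writefln;

// Print how many collections the GC has run so far (cumulative counters).
void reportCollections(string label)
{
    auto p = GC.profileStats;
    writefln!"%s: %s collections, total pause %s"(
        label, p.numCollections, p.totalPauseTime);
}

// e.g. call reportCollections("before") / reportCollections("after")
// around the spamGC phase of the test program.
```

Alternatively, running the program with the `--DRT-gcopt=profile:1` runtime 
switch prints a GC summary at exit.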
Jun 19 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 15:59:39 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 6:15 AM, FeepingCreature wrote:
 If I alloca 10KB of stack at the beginning of the recursion, 
 the resident RAM drops to ~180MB. I don't think this 
 completely solves the issue, but it sure is interesting.
 
 That is interesting...

Some additional info I probably should have mentioned.

Note that I'm manually calling the GC at the end. At that point, everything should be dead. heaptrack shows that we get something like 8 big GC runs over the program's runtime: as expected, because it's overprovisioning a "healthy" amount.

My standing theory is that it's the zombie stack issue again: see https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for what I think is happening.

I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end, though - which admittedly is a lot better than 600MB.
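
The fiber half of that experiment looks roughly like this (sketch; the name 
`spamGCInFiber` and the 1MB stack size are mine, and the recursion body is 
elided):

```
import core.thread : Fiber;

void spamGCInFiber(Semaphore sema) {
     // Run the recursion on the fiber's own stack instead of the thread
     // stack, so its frames never touch the stack that stays live while
     // the thread sleeps.
     auto fib = new Fiber({ /* recursionTest(0) as in the original spamGC */ },
                          1024 * 1024);
     fib.call();                            // runs to completion
     assert(fib.state == Fiber.State.TERM);

     // ... then zero out the fiber's stack (the step described above) ...
     sema.notify;
     Thread.sleep(3600.seconds);
}
```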
Jun 19 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/19/23 12:40 PM, FeepingCreature wrote:

 Some additional info I probably should have mentioned.
 
 Note that I'm manually calling the GC at the end. At that point, 
 everything should be dead. heaptrack shows that we get something like 8 
 big GC runs over the program runtime: as expected, because it's 
 overprovisioning a "healthy" amount.
 
 My standing theory is that it's the zombie stack issue again: see 
 https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for 
 what I think is happening.
 
 I'm currently experimenting with setting up a fiber, then zeroing out 
 its stack completely after the function has run in it. Results are 
 mixed, but promising. I still end up with ~100MB live at the end though 
 - which admittedly is a lot better than 600MB.
Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so as well -- an entire allocated pool has to be unallocated for it to be released to the OS.

I know there have always been mysterious "reasons" for the GC not to collect things that are stack-referenced, but I tend to first try to rule out looking at the wrong information, and/or other simple explanations, before getting there.

Having somewhat recently looked at how GC scanning works for threads, I'm reasonably certain that unused stack isn't being scanned (even if it's allocated). On a 64-bit machine, the likelihood of accidentally keeping things referenced on the stack is low.

I would also point out that your specific case allocates *most* of the memory secondary-referenced from the stack. That is, you might refer to the AA structure from the stack, but not the bucket elements (where the bulk is allocated). And using `aa.clear` should make that even less likely. All of the allocated buckets should be definitively garbage, with no possibility of zombie references.

-Steve
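
A rough way to look at both numbers side by side on Linux (sketch; the 
`/proc` parsing is ad hoc and the field name is Linux-specific):

```
import core.memory : GC;
import std.algorithm.searching : startsWith;
import std.file : readText;
import std.stdio : writefln;
import std.string : lineSplitter, strip;

// Compare what the GC thinks it holds with what the OS says is resident.
void reportMemory()
{
    auto s = GC.stats;
    writefln!"GC used: %s bytes, GC free (held but unused): %s bytes"(
        s.usedSize, s.freeSize);

    // VmRSS is the process's resident set size.
    foreach (line; readText("/proc/self/status").lineSplitter)
        if (line.startsWith("VmRSS:"))
            writefln!"OS resident: %s"(line["VmRSS:".length .. $].strip);
}
```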
Jun 19 2023
next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 12:40 PM, FeepingCreature wrote:

 Some additional info I probably should have mentioned.
 
 Note that I'm manually calling the GC at the end. At that 
 point, everything should be dead. heaptrack shows that we get 
 something like 8 big GC runs over the program runtime: as 
 expected, because it's overprovisioning a "healthy" amount.
 
 My standing theory is that it's the zombie stack issue again: 
 see 
 https://forum.dlang.org/post/befrzndhowlwnvlqcoxx forum.dlang.org for what I
think is happening.
 
 I'm currently experimenting with setting up a fiber, then 
 zeroing out its stack completely after the function has run in 
 it. Results are mixed, but promising. I still end up with 
 ~100MB live at the end though - which admittedly is a lot 
 better than 600MB.
 Are you looking at GC memory usage or process usage? Because the GC 
 doesn't generally give back memory to the OS even when it could. And 
 fragmentation can stop it from doing so as well -- an entire allocated 
 pool has to be unallocated for it to be released to the OS.

 I know there have always been mysterious "reasons" for the GC not to 
 collect things that are stack-referenced, but I tend to first try to 
 rule out looking at the wrong information, and/or other simple 
 explanations, before getting there.
I'm looking at process usage. But the thing is, I *can* make it go down to sub-100MB by playing with the stack. I'll post code for that version when I get to work.
Jun 19 2023
prev sibling next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 On 6/19/23 12:40 PM, FeepingCreature wrote:
 I'm currently experimenting with setting up a fiber, then 
 zeroing out its stack completely after the function has run in 
 it. Results are mixed, but promising. I still end up with 
 ~100MB live at the end though - which admittedly is a lot 
 better than 600MB.
 Are you looking at GC memory usage or process usage? Because the GC 
 doesn't generally give back memory to the OS even when it could. And 
 fragmentation can stop it from doing so as well -- an entire allocated 
 pool has to be unallocated for it to be released to the OS.

 I know there have always been mysterious "reasons" for the GC not to 
 collect things that are stack-referenced, but I tend to first try to 
 rule out looking at the wrong information, and/or other simple 
 explanations, before getting there.
Okay, here's the "full stack cleanup" version: https://gist.github.com/FeepingCreature/3baf6bd62dc64441fe73f46e3df9b482

It spawns a fiber, runs the recursion in the fiber, then uses madvise and druntime abuse to set the entire fiber stack to zero.

RSS averaged over five runs:

With fiber, without madvise: ~177MB
With fiber, with madvise: ~75MB

But this is extremely weird. Thinking about it, the `madvise` call really shouldn't matter! The fiber has exited, so its stack shouldn't be a GC root. Even if it is, there should be no way that any references to our string[string]s live in any possibly-live stack area in there. And yet, explicitly zeroing it out repeatably improves GC performance.

Conversely, just running the recursion in a fiber shouldn't do anything either: either that stack area is scanned or it isn't. Well, empirically, "fiber and cleanup makes RSS go down", which is the only thing of actual importance, I guess.
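
For readers who don't want to open the gist: the wipe step is conceptually 
the following (sketch only; how the fiber's stack bounds are obtained is the 
"druntime abuse" part and is elided here, so `stackLo`/`stackLen` are assumed 
to be known and page-aligned):

```
import core.stdc.string : memset;

// Linux madvise; MADV_DONTNEED lets the kernel discard the backing pages.
extern (C) int madvise(void* addr, size_t length, int advice) nothrow @nogc;
enum MADV_DONTNEED = 4; // Linux value

void wipeFiberStack(void* stackLo, size_t stackLen)
{
    memset(stackLo, 0, stackLen);              // erase any stale pointer-looking values
    madvise(stackLo, stackLen, MADV_DONTNEED); // resident memory for this range drops
}
```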
Jun 19 2023
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer 
wrote:
 Are you looking at GC memory usage or process usage? Because 
 the GC doesn't generally give back memory to the OS even when 
 it could. And fragmentation can stop it from doing so also -- 
 an entire allocated pool has to be unallocated for it to 
 release it to the OS.
Okay, I can confirm it's some sort of fragmentation issue. Printing GC stats:

```
writefln!"Done. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
100.iota.each!((_) {
    GC.collect;
    GC.minimize;
});
writefln!"Collected. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
```

Without alloca purging:

```
Done. (155347088 used, 527275888 free)
Collected. (495808 used, 682127168 free)
```

So all but ~500KB of GC memory is free, but the process sits at 796MB resident.

With alloca purging:

```
Done. (35950960 used, 646672016 free)
Collected. (26528 used, 36673632 free)
```

We got rid of another ~450KB, but resident memory dropped to 164MB!

I think minuscule stack leaks must be keeping large GC pools alive here.

I kind of wish there was a way to say "1. All allocations in this thread should go into a dedicated pool X" and "2. Okay, we're done with the work; all remaining references to these allocations are a bug; scan for them, and crash (with a message) if you find any."
Jun 30 2023
parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:
 I think minuscule stack leaks must be keeping large GC pools 
 alive here.
Just throwing this in here in case it is of significance: does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference?

https://dlang.org/spec/garbage.html#precise_gc

-- Bastiaan.
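
For reference, the same GC options can also be baked into the binary via 
`rt_options` (documented on that same page); for example, precise scanning 
plus a profile report at exit:

```
// Compiled-in defaults for the runtime; a command-line --DRT-gcopt still overrides them.
extern (C) __gshared string[] rt_options = ["gcopt=gc:precise profile:1"];
```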
Jun 30 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 30 June 2023 at 18:49:58 UTC, Bastiaan Veelo wrote:
 On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:
 I think minuscule stack leaks must be keeping large GC pools 
 alive here.
Just throwing this in here in case it is of significance: Does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference? https://dlang.org/spec/garbage.html#precise_gc -- Bastiaan.
It's never made any difference any time I've tried it. I'd expect precise scanning to only be relevant on 32-bit, since it doesn't precisely scan the stack.

edit: Heh. The precise GC manages to collect *less* garbage in the test above, leaving the program standing at 670MB resident. Explain that one. :P
Jul 01 2023
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 7/1/23 5:24 AM, FeepingCreature wrote:

 edit: Heh. The precise GC manages to collect *less* garbage in the test 
 above, leaving the program standing at 670MB resident.
 
 Explain that one. :P
Probably a coincidental adjustment in GC bookkeeping allocations.

If I had a nickel for every time someone posted a "why does the GC behave this way!" message, I'd probably have, like, $2-3.

-Steve
Jul 02 2023