digitalmars.D - Why does this simple test program leak 500MB of RAM?
- FeepingCreature (Jun 19 2023): Consider this program:
- Steven Schveighoffer (Jun 19 2023): The `clear` should remove all references to the bucket elements
- FeepingCreature (Jun 19 2023): Some additional info I probably should have mentioned.
- Steven Schveighoffer (Jun 19 2023): Are you looking at GC memory usage or process usage?
- FeepingCreature (Jun 19 2023): I'm looking at process usage. But I mean, the thing is, I *can*
- FeepingCreature (Jun 19 2023): Okay, here's the "full stack cleanup" version:
- FeepingCreature (Jun 30 2023): Okay, I can confirm it's some sort of fragmentation issue.
- Bastiaan Veelo (Jun 30 2023): Just throwing this in here in case it is of significance
- FeepingCreature (Jul 01 2023): It's never made any difference any time I tried.
- Steven Schveighoffer (Jul 02 2023): Probably a coincidental adjustment in GC bookkeeping allocations.
Consider this program:

```
module test;

import core.memory : GC;
import core.sync.semaphore;
import core.thread;
import std;

enum keys = [
    "foo", "bar", "baz", "whee",
    "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
    "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
    "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];

void spamGC(Semaphore sema) {
    void recursionTest(int depth) {
        if (depth < 1000) {
            recursionTest(depth + 1);
            if (depth % 300 == 0)
                recursionTest(depth + 1);
        }
        string[string] assocArray;
        static foreach (a; keys) {
            assocArray[a] = a;
        }
        // ?????
        assocArray.clear;
        assocArray = null;
    }
    recursionTest(0);
    sema.notify;
    Thread.sleep(3600.seconds);
}

void main() {
    Thread[] threads;
    auto sema = new Semaphore(0);
    enum threadCount = 100;
    threadCount.iota.each!((_) {
        auto thread = new Thread({ spamGC(sema); });
        thread.start;
        threads ~= thread;
    });
    // this clears the leak up!
    // threads.each!((thread) { thread.join; });
    threadCount.iota.each!((i) { sema.wait; });
    writefln!"Done.";
    100.iota.each!((_) { GC.collect; GC.minimize; });
    writefln!"Collected.";
    // Now look at resident memory for the process.
    Thread.sleep(3600.seconds);
}
```

We've had problems with excessive memory leaks in JSON decoding in production, so I've put together this test program. The `spamGC` recursive call represents recursive JSON decoding. The `Thread.sleep` call represents a thread idling in a threadpool.

After completion, the program sits at ~600MB resident. I believe, from looking at it, it should be obvious that this memory is **entirely dead**.

If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.

`-gx` does nothing.
Jun 19 2023
On 6/19/23 6:15 AM, FeepingCreature wrote:

> Consider this program:
>
> [...]
>
> After completion, the program sits at ~600MB resident. I believe, from looking at it, it should be obvious that this memory is **entirely dead**.

The `clear` should remove all references to the bucket elements (it basically nulls out the bucket elements without deallocating the elements or the bucket array).

The `assocArray = null` could be considered a dead store by the const-folder, so it's possible that isn't actually being performed.

> If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.

That is interesting...

So I'm looking at the druntime code. If I understand this correctly, the GC doesn't try running a collection when allocating small blocks until 2.5GB of memory is consumed? That can't be right...

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/os.d#L215-L252
https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1949-L1967

But maybe that's only if some other condition is not met.

Can you use GC profiling to determine how many collections run during the affected time? I am curious whether it is a problem of the GC not collecting things it should collect vs. not collecting at all.

This seems to play a factor, but I don't understand it.

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1871-L1892

-Steve
Jun 19 2023
On Monday, 19 June 2023 at 15:59:39 UTC, Steven Schveighoffer wrote:

>> If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.
>
> That is interesting...

Some additional info I probably should have mentioned.

Note that I'm manually calling the GC at the end. At that point, everything should be dead. heaptrack shows that we get something like 8 big GC runs over the program runtime: as expected, because it's overprovisioning a "healthy" amount.

My standing theory is that it's the zombie stack issue again: see https://forum.dlang.org/post/befrzndhowlwnvlqcoxx@forum.dlang.org for what I think is happening.

I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end though, which admittedly is a lot better than 600MB.
Jun 19 2023
On 6/19/23 12:40 PM, FeepingCreature wrote:

> Some additional info I probably should have mentioned. [...] My standing theory is that it's the zombie stack issue again [...] I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end though, which admittedly is a lot better than 600MB.

Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.

I know there have always been mysterious "reasons" for the GC to not collect things that are stack-referenced, but I tend to first try to rule out looking at the wrong information, and/or other simple explanations, before getting there.

Having somewhat recently looked at how the GC scanning works for threads, I'm reasonably certain that unused stack isn't being scanned (even if it's allocated). On a 64-bit machine, the likelihood of accidentally keeping things referenced on the stack is low.

I would also point out that your specific case allocates *most* of the memory secondarily referenced from the stack. That is, you might refer to the AA structure from the stack, but not the bucket elements (where the bulk is allocated). And using `aa.clear` should make that even less possible. All of the allocated buckets should be definitively garbage with no possibility of zombie references.

-Steve
Jun 19 2023
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. [...]

I'm looking at process usage. But the thing is, I *can* make it go down to sub-100MB by playing with the stack. I'll post code for that version when I get to work.
Jun 19 2023
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS. [...]

Okay, here's the "full stack cleanup" version:

https://gist.github.com/FeepingCreature/3baf6bd62dc64441fe73f46e3df9b482

It spawns a fiber, runs the recursion in the fiber, then uses madvise and druntime abuse to set the entire fiber stack to zero.

RSS averaged over five runs:

- With fiber, without madvise: ~177MB
- With fiber, with madvise: ~75MB

But this is extremely weird. Thinking about it, the `madvise` call really shouldn't matter! The fiber has exited; its stack shouldn't be a GC root. Even if it is, there should be no way that any references to our `string[string]`s live in any possibly-live stack area in there. And yet, explicitly zeroing it out repeatably improves GC behavior. Conversely, just running the recursion in a fiber shouldn't do anything either: either that stack area is scanned or it isn't.

Well, empirically, "fiber and cleanup makes RSS go down", which is the only actual thing of importance, I guess.
Jun 19 2023
On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.

Okay, I can confirm it's some sort of fragmentation issue. Printing GC stats:

```
writefln!"Done. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
100.iota.each!((_) { GC.collect; GC.minimize; });
writefln!"Collected. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
```

Without alloca purging:

```
Done. (155347088 used, 527275888 free)
Collected. (495808 used, 682127168 free)
```

So all but 500KB of GC memory is free, but the process uses 796MB resident.

With alloca purging:

```
Done. (35950960 used, 646672016 free)
Collected. (26528 used, 36673632 free)
```

We got rid of another 450KB, but resident memory dropped to 164MB!

I think minuscule stack leaks must be keeping large GC pools alive here.

I kind of wish there was a way to say "1. All allocations in this thread should go into a dedicated pool X" and "2. Okay, we're done with the work; all remaining references to these allocations are a bug: scan for them, crash (with a message) if you find any."
Jun 30 2023
On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:

> I think minuscule stack leaks must be keeping large GC pools alive here.

Just throwing this in here in case it is of significance: does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference?

https://dlang.org/spec/garbage.html#precise_gc

-- Bastiaan.
Jun 30 2023
On Friday, 30 June 2023 at 18:49:58 UTC, Bastiaan Veelo wrote:

> Just throwing this in here in case it is of significance: does adding `"--DRT-gcopt=gc:precise"` as a command line option make any difference?

It's never made any difference any time I tried. I'd expect precise scanning to only be relevant on 32-bit, since it doesn't precisely scan the stack.

edit: Heh. The precise GC manages to collect *less* garbage in the test above, leaving the program standing at 670MB resident. Explain that one. :P
Jul 01 2023
On 7/1/23 5:24 AM, FeepingCreature wrote:

> edit: Heh. The precise GC manages to collect *less* garbage in the test above, leaving the program standing at 670MB resident. Explain that one. :P

Probably a coincidental adjustment in GC bookkeeping allocations.

If I had a nickel for every time someone posted a "why does the GC behave this way!" message, I'd have like probably $2-3.

-Steve
Jul 02 2023