digitalmars.D - Problem with GC and address/leak sanitizer
- Luís Marques (115/115) Feb 15 I have a program where the GC *seems* to be overwriting memory
- Steven Schveighoffer (28/30) Feb 15 Do I understand that this corruption is happening only with
- Walter Bright (1/1) Feb 17 Is this an issue with the new GC, or the old one?
- Steven Schveighoffer (3/4) Feb 17 Old gc
- Walter Bright (3/7) Feb 17 I'm a little surprised to see a problem crop up after 25 years of contin...
- Richard (Rikki) Andrew Cattermole (3/13) Feb 17 This has already been solved, see Johan's comments.
- Walter Bright (2/3) Feb 19 Excellent! Carry on...
- Johan (6/14) Feb 16 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that
- Luís Marques (11/14) Feb 16 The fake stack allocator is enabled. If I disable it via
- Luís Marques (2/5) Feb 16 I meant =0, of course.
- Johan (20/31) Feb 16 FakeStack allocates (!) space for stack variables, and points to
- Luís Marques (2/4) Feb 16 Sure, I'll have a look. Thanks.
- Luís Marques (12/20) Feb 17 I don't think this broke with the D 2.100. For instance, LDC
- Johan (5/26) Feb 18 It is likely related to LLVM version. Did you already check that?
- Luís Marques (5/7) Mar 06 I'm only now checking that, I ran into a lot of yak shaving tasks.
- Luís Marques (4/6) Mar 07 LDC 1.40.0 (git commit 9296fd6fcc) reproduces the problem with
- Luís Marques (37/40) Mar 27 I finally found some time today to analyze this properly, rather
- Johan (43/71) Mar 28 Hi Luis,
I have a program where the GC *seems* to be overwriting memory still in use and corrupting data.

Here's the code. It's massively reduced from the original program. It's hard to reduce it further because minor changes can prevent the problem from triggering. I'll explain below the important parts.

```d
import std.stdio;

struct S {
    int check;
    S* next;
    int[4] data;
}

int main(string[] args) {
    void*[] allocs;
    enum bad_iter = 268;

    for (int n = 0; n < bad_iter+1; n++) {
        allocs.length = 0;
        auto x = " ";
        x ~= ' ';

        int[10][] ts;
        for(int i = 0; i < 21; i++) {
            ts.length++;
        }

        S head;
        S* s = &head;

        if (n == bad_iter) {
            n = bad_iter; // convenient line to set a breakpoint only for the last iteration
        }

        for(int i = 0; i < 8; i++) {
            auto ns = new S;
            ns.check = 1; // set test value here
            s.next = ns;
            s = ns;
        }

        s = head.next; // get the first S allocated this iteration

        if (s.check != 1) { // check test value here
            writefln("check=%d", s.check);
            return -1;
        }

        new int[10];
        allocs ~= null;
        new size_t[3];
    }

    return 0;
}
```

The important part is the following. On each iteration we create 8 instances of S. For each S value, we set its `check` field to 1. Then we check the value of that field (for the first instance of S). When compiled with the address sanitizer, we observe it's been corrupted and it's no longer 1.

Am I doing something incorrectly in the code? AFAIK I'm respecting the rules required by the GC. Maybe there's a silly bug I overlooked?

Tested with LDC 1.40.0 on x86_64 Linux:

```
$ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
check=-337690816
$
```

By setting a watchpoint on the address of the field, I see that the code that writes to `check` is part of the GC implementation.
Here's the backtrace:

```
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx15recoverNextPageMFNbEQCmQC QCeQCeQCcQCn4BinsZb + 348
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMF bmKmkxC8TypeInfoZPv + 776
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCsQCqQCkQCkQCiQCtQBy12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQFaQEyQEsQEsQEqQFb10mallocTimelS_DQGiQGgQGaQGaQFyQGj10numMallocslTmTkTmTxQDlZQF MFNbKmKkKmKxQEeZQDx + 89
libdruntime-ldc-shared.so.110`_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZ QDd6memory8BlkInfo_ + 83
libdruntime-ldc-shared.so.110`gc_qalloc + 28
app`_D4core8lifetime__T11_d_newitemTTS3app1SZQwFNaNbNeZPQt at lifetime.d:2837:5
0x00007fffffffe438) at app.d:28:13
libdruntime-ldc-shared.so.110`_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv + 77
libdruntime-ldc-shared.so.110`_d_run_main2 + 407
libdruntime-ldc-shared.so.110`_d_run_main + 141
argv=0x00007fffffffe728) at entrypoint.d:42:17
libc.so.6`__libc_start_call_main(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728) at libc_start_call_main.h:58:16
libc.so.6`__libc_start_main_impl(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe718) at libc-start.c:360:3
```

There is a subsequent write to that memory location in the leak sanitizer and LSan complains: `==4056526==LeakSanitizer has encountered a fatal error.` (though usually this message isn't flushed).

I assume the original problem was caused by the GC and ASan/LSan are just subsequent victims, but it's hard to be sure. Apparently, LSan is automatically enabled for Linux when ASan is used. Although the ASan documentation says that LSan "can be enabled using `ASAN_OPTIONS=detect_leaks=1` on macOS", setting that to 0 didn't seem to disable it, so I couldn't test with ASan but not LSan.

Any ideas of what might be going on?
Feb 15
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
> I have a program where the GC *seems* to be overwriting memory still in use and corrupting data.

Do I understand that this corruption is happening only with address sanitizer turned on?

I don't see any red flags here, though I'm assuming a lot of these weird random things you are doing (like appending a space to a string every loop) are essential to making the thing fail? It's possible these are tickling GC patterns that cause problems, or it's possible it's tickling bugs in code generation that might prevent the GC from seeing memory!

Writing to the "check" field might be because a gc cycle ran, and that item was incorrectly collected, and now the gc is writing to it because it thinks that data is fair game to use. The writing is probably not the problem, the problem is the previous collection of that data.

I have learned a lot of tricks when implementing the new GC, and when faced with problems like this, it's super-difficult to figure out how to properly find the problem. One technique I used is to fork after scanning, but before collection, and put that forked process to sleep. If the failure happens, then I can gdb into the forked process and see what state the GC was in, including the entire graph of memory, and I could see how a piece of memory is or isn't referenced. This is a tedious process, and requires a lot of knowledge and patience.

If this is indeed a problem with the GC, it's going to be tough to track down. If it's a problem with the codegen, then probably also difficult, but this function is small enough, that maybe someone can look at the assembly and verify that it's doing the right thing? I don't know.
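For readers who want to try the fork-and-sleep technique described above, here is a minimal sketch of what it could look like on a POSIX system. It is only an illustration, not the code that was actually used: `forkSnapshot` is a hypothetical helper name, and in a real debugging session the call would sit inside the GC, after marking and before sweeping, rather than in `main`.

```d
import core.sys.posix.signal : kill, SIGKILL;
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, pause;

/// Hypothetical helper: forks a frozen copy of the current process.
/// Returns the child's pid in the parent (or -1 if fork failed).
/// It never returns in the child, which just sleeps so a debugger can attach.
pid_t forkSnapshot()
{
    const pid = fork();
    if (pid == 0)
    {
        // Child: keeps the memory image as it was at the fork point.
        for (;;)
            pause();
    }
    return pid;
}

void main()
{
    import core.stdc.stdio : printf;

    // Assumed call site; in the scenario above this would be between the
    // GC's scan and its collection phase.
    const child = forkSnapshot();
    if (child < 0)
        return;
    printf("snapshot is pid %d; inspect with: gdb -p %d\n",
           cast(int) child, cast(int) child);

    // Demo cleanup only. While hunting a bug you would leave the child
    // asleep until the failure shows up in the parent.
    kill(child, SIGKILL);
    int status;
    waitpid(child, &status, 0);
}
```

The child shares no memory with the parent after the fork, so whatever the parent's collector does next cannot disturb the snapshot.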
Feb 15
Is this an issue with the new GC, or the old one?
Feb 17
On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
> Is this an issue with the new GC, or the old one?

Old gc

-Steve
Feb 17
On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
> On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
>> Is this an issue with the new GC, or the old one?
>
> Old gc

I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?
Feb 17
On 18/02/2025 6:22 PM, Walter Bright wrote:
> On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
>> On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
>>> Is this an issue with the new GC, or the old one?
>>
>> Old gc
>
> I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?

This has already been solved, see Johan's comments.

ASAN with a fake stack isn't operating properly.
Feb 17
On 2/17/2025 9:26 PM, Richard (Rikki) Andrew Cattermole wrote:
> This has already been solved, see Johan's comments.

Excellent! Carry on...
Feb 19
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
> Tested with LDC 1.40.0 on x86_64 Linux:
>
> ```
> $ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
> check=-337690816
> $
> ```

Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)

(And also test with a little bit older LDC, with different LLVM version, to see if it is a new issue or not)

-Johan
Feb 16
On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
> Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)

The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

Thanks!

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn#garbage-collection
Feb 16
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
> The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

I meant =0, of course.
Feb 16
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
>> Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)
>
> The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.
>
> According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

FakeStack allocates (!) space for stack variables, and points to that "fake stack" memory with a pointer in actual CPU stack memory. This means that the stack variables are now no longer in memory that is scanned by the GC.

The fix for that, of course, is to include all FakeStacks in the GC scanning [1a][1b]. This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]

You are very welcome to help investigate why it is no longer working! [3] is an interesting test case of how code should work.

-Johan

[1a] https://github.com/ldc-developers/druntime/compare/d6b328be91db63aff979f584b0d1def0f746d730...1d938e0b7f668b099f9fa694b135c82ef13dec59
[1b] https://github.com/ldc-developers/ldc/pull/3888
[2] https://github.com/ldc-developers/ldc/blob/d3f065816ec7d420f370e4c95c6000eb78187e25/tests/sanitizers/asan_fakestack_GC.d#L3
[3] https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/asan/TestCases/Posix/gc-test.cpp
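To make the FakeStack explanation above concrete, here is a toy model in plain D. It does not use ASan or the real druntime scanner at all: `isReachable` and `contains` are hypothetical names, the "real stack" is just an array of words, and the fake frame is a `malloc`ed block standing in for memory the GC does not scan. The sketch only shows why a conservative scan of the real stack alone misses the pointer, and why following the stack word into the fake frame finds it again.

```d
import core.stdc.stdlib : free, malloc;

/// True if any word in `words` equals `value`.
bool contains(const size_t[] words, size_t value)
{
    foreach (w; words)
        if (w == value)
            return true;
    return false;
}

/// Toy conservative scan (hypothetical, not druntime's API).
/// `target` is found if it appears on the real stack, or inside a fake
/// frame that some real-stack word points into.
bool isReachable(size_t target, const size_t[] realStack,
                 const size_t[][] fakeFrames)
{
    foreach (word; realStack)
    {
        if (word == target)
            return true;
        foreach (frame; fakeFrames)
        {
            const beg = cast(size_t) frame.ptr;
            const end = beg + frame.length * size_t.sizeof;
            if (word >= beg && word < end && contains(frame, target))
                return true;
        }
    }
    return false;
}

void main()
{
    import std.stdio : writeln;

    auto obj = new int;                 // GC object that must stay alive
    const target = cast(size_t) obj;

    // The local variable holding `obj` lives in a fake frame outside
    // GC-scanned memory (modelled here with malloc).
    auto raw = cast(size_t*) malloc(3 * size_t.sizeof);
    scope (exit) free(raw);
    size_t[] fakeFrame = raw[0 .. 3];
    fakeFrame[] = 0;
    fakeFrame[1] = target;

    // The real stack only holds a pointer to the fake frame.
    size_t[] realStack = [cast(size_t) fakeFrame.ptr];

    writeln(isReachable(target, realStack, null));        // false
    writeln(isReachable(target, realStack, [fakeFrame])); // true
}
```

In the first call the object looks unreferenced, which is the situation in which a collector would free and reuse memory that is still in use.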
Feb 16
On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
> You are very welcome to help investigate why it is no longer working!

Sure, I'll have a look. Thanks.
Feb 16
On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
>> This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]
>>
>> You are very welcome to help investigate why it is no longer working!
>
> Sure, I'll have a look. Thanks.

I don't think this broke with the D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.

```
==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
```

I'll need some time to dig through the IR, the GC, etc. If you are going to look at this, please let me know, to avoid duplicate efforts.
Feb 17
On Monday, 17 February 2025 at 21:56:29 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
>> On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
>>> This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]
>>>
>>> You are very welcome to help investigate why it is no longer working!
>>
>> Sure, I'll have a look. Thanks.
>
> I don't think this broke with the D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.
>
> ```
> ==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
> ```
>
> I'll need some time to dig through the IR, the GC, etc.

It is likely related to LLVM version. Did you already check that? Possibly a subtle change in API.

> If you are going to look at this, please let me know, to avoid duplicate efforts.

Not soon, no.

-Johan
Feb 18
On Tuesday, 18 February 2025 at 18:05:41 UTC, Johan wrote:
> It is likely related to LLVM version. Did you already check that? Possibly a subtle change in API.

I'm only now checking that, I ran into a lot of yak shaving tasks.

BTW, did LDC stop emitting warnings/errors since 1.35? E.g. for the `;` vs `{}`

https://godbolt.org/z/K95E99a5z
Mar 06
On Tuesday, 18 February 2025 at 18:05:41 UTC, Johan wrote:
> It is likely related to LLVM version. Did you already check that? Possibly a subtle change in API.

LDC 1.40.0 (git commit 9296fd6fcc) reproduces the problem with both LLVM 15.0 and 19.1, so for now it seems that this isn't related to something having changed on the LLVM side.
Mar 07
On Friday, 7 March 2025 at 14:51:22 UTC, Luís Marques wrote:
> LDC 1.40.0 (git commit 9296fd6fcc) reproduces the problem with both LLVM 15.0 and 19.1, so for now it seems that this isn't related to something having changed on the LLVM side.

I finally found some time today to analyze this properly, rather than just trying different LDC/LLVM versions. I haven't yet double-checked everything, but based on my current understanding, I believe I've identified the issue, along with several deficiencies:

1. **The main issue:** The distributed `libdruntime-ldc-shared.so` appears to be compiled without `version=SupportSanitizers`, so the fake stack scanning isn't enabled. I conclude this by observing that the distributed runtime's disassembly doesn't include `scanStackForASanFakeStack`, unlike when this option is enabled. The necessary code exists in the runtime's source, but it's simply not being compiled in.

2. **Fake stack support should work when properly enabled:** If you build LDC with `-DRT_SUPPORT_SANITIZERS=True`, the fake stack is scanned as expected. From what I can see (by checking for the proper GC scan calls and addresses), everything is being done correctly. The test `tests/sanitizers/asan_fakestack_GC.d` still fails, but not because of broken fake stack scanning; instead, the test itself is broken. An assertion fails in `test_non_null_does_not_trigger_collection` because memory from a previous iteration is being freed, not because it's freeing memory incorrectly for the current iteration.

3. **Redundant function calls:** The runtime function `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple times for the same fake stack frame, using identical memory ranges each time. While technically correct, this is unnecessary and inefficient. Based on the comment `"This is the pointer we are catching with"`, it seems there's a misunderstanding regarding how the fake stack should be located.

4. **Misleading comment in the test:** The comment `// Large enough so it will be on fakestack heap (not inlined in local stack frame)` in `asan_fakestack_GC.d` appears to be incorrect. The size of the allocation shouldn't determine this: if its address escapes, it must go into the heap-allocated fake stack regardless.
Mar 27
Hi Luis,

Thanks for spending some time on it. Here some inline remarks from my side.

On Friday, 28 March 2025 at 00:35:50 UTC, Luís Marques wrote:
> 1. **The main issue:** The distributed `libdruntime-ldc-shared.so` appears to be compiled without `version=SupportSanitizers`, so the fake stack scanning isn't enabled.

This is intentional, because there is some overhead associated with it. If this is the main issue then that's "OK" and has always been the case. Perhaps FakeStack is now enabled by default. We should either disable that default, or start shipping including the small overhead in druntime.

> 2. **Fake stack support should work when properly enabled:** If you build LDC with `-DRT_SUPPORT_SANITIZERS=True`, the fake stack is scanned as expected.

Great.

> From what I can see (by checking for the proper GC scan calls and addresses), everything is being done correctly. The test `tests/sanitizers/asan_fakestack_GC.d` still fails, but not because of broken fake stack scanning; instead, the test itself is broken. An assertion fails in `test_non_null_does_not_trigger_collection` because memory from a previous iteration is being freed, not because it's freeing memory incorrectly for the current iteration.

The idea of the test is to test that enabling fakestack does not change collection behavior. The test is clunky, I'll admit. If it is indeed failing already on the fakestack-disabled case, then it needs to be adjusted for sure.

> 3. **Redundant function calls:** The runtime function `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple times for the same fake stack frame, using identical memory ranges each time. While technically correct, this is unnecessary and inefficient. Based on the comment `"This is the pointer we are catching with"`, it seems there's a misunderstanding regarding how the fake stack should be located.

I'm not sure how this can be related to specifically fakestack implementation issue. If you mean that `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple times, then that is not a problem of our fakestack support code, but of `scanAllTypeImpl` itself because it would also call the normal `scan` method multiple times with same memory range. That would point towards an issue with duplicates in the `ThreadBase.sm_cbeg` list. Is that what you meant? (it is literally what you typed, but the sentences following it do not make sense with that interpretation...)

LLVM does not give us many options to find the fake stack. At the time I think I used the only public API method available, which indeed results in multiple times getting the same fakestack. We can poke into ASan internals but I am afraid it will quickly bitrot/break. Perhaps by now ASan's public interface has been extended, otherwise we should ask for it (or submit PRs to LLVM)...

I'm quite sure we are doing the right thing of locating the fakestack, it's done similarly here: https://www.aovivo.ilutas.com.br/node-v17.9.1/deps/v8/src/heap/base/stack.cc

> 4. **Misleading comment in the test:** The comment `// Large enough so it will be on fakestack heap (not inlined in local stack frame)` in `asan_fakestack_GC.d` appears to be incorrect. The size of the allocation shouldn't determine this: if its address escapes, it must go into the heap-allocated fake stack regardless.

I vaguely remember there being some optimization with small memory sizes, but I cannot find it and cannot trigger it anymore. I think it may be related to this comment of mine: https://johanengelen.github.io/posts/2017-12-25-ldc-and-addresssanitizer/#fn:4

Hmm... I no longer remember. Regardless, you are right that to detect any stackmemory use after return, that memory cannot sit on the regular processor stack.

cheers,
Johan
Mar 28