digitalmars.D - Problem with GC and address/leak sanitizer

Luís Marques <luis luismarques.eu> writes:
I have a program where the GC *seems* to be overwriting memory 
still in use and corrupting data.

Here's the code, massively reduced from the original program. It's 
hard to reduce it further because minor changes can prevent the 
problem from triggering. I'll explain the important parts below.

```d
import std.stdio;

struct S {
     int check;
     S* next;
     int[4] data;
}

int main(string[] args) {
     void*[] allocs;
     enum bad_iter = 268;
     for (int n = 0; n < bad_iter+1; n++) {
         allocs.length = 0;
         auto x = "                   ";
         x ~= ' ';

         int[10][] ts;
         for(int i = 0; i < 21; i++) {
             ts.length++;
         }

         S head;
         S* s = &head;
         if (n == bad_iter) {
             n = bad_iter; // convenient line to set a breakpoint only for the last iteration
         }
         for(int i = 0; i < 8; i++) {
             auto ns = new S;
             ns.check = 1; // set test value here
             s.next = ns;
             s = ns;
         }
         s = head.next; // get the first S allocated this iteration
         if (s.check != 1) { // check test value here
             writefln("check=%d", s.check);
             return -1;
         }

         new int[10];
         allocs ~= null;
         new size_t[3];
     }
     return 0;
}
```

The important part is the following. On each iteration we create 
8 instances of S. For each S value, we set its `check` field to 
1. Then we check the value of that field (for the first instance 
of S). When compiled with the address sanitizer, we observe it's 
been corrupted and it's no longer 1.

Am I doing something incorrectly in the code? AFAIK I'm 
respecting the rules required by the GC. Maybe there's a silly 
bug I overlooked?

Tested with LDC 1.40.0 on x86_64 Linux:


```
$ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
BUG
check=-337690816
$
```

By setting a watchpoint on the address of the field, I see that 
the code that writes to `check` is part of the GC implementation. 
Here's the backtrace:

```


libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx15recoverNextPageMFNbEQCmQC
QCeQCeQCcQCn4BinsZb + 348

libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMF
bmKmkxC8TypeInfoZPv + 776

libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCsQCqQCkQCkQCiQCtQBy12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQFaQEyQEsQEsQEqQFb10mallocTimelS_DQGiQGgQGaQGaQFyQGj10numMallocslTmTkTmTxQDlZQF
MFNbKmKkKmKxQEeZQDx + 89

libdruntime-ldc-shared.so.110`_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZ
QDd6memory8BlkInfo_ + 83

libdruntime-ldc-shared.so.110`gc_qalloc + 28

app`_D4core8lifetime__T11_d_newitemTTS3app1SZQwFNaNbNeZPQt at lifetime.d:2837:5

0x00007fffffffe438) at app.d:28:13

libdruntime-ldc-shared.so.110`_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
+ 77

libdruntime-ldc-shared.so.110`_d_run_main2 + 407

libdruntime-ldc-shared.so.110`_d_run_main + 141

argv=0x00007fffffffe728) at entrypoint.d:42:17

libc.so.6`__libc_start_call_main(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728) at libc_start_call_main.h:58:16

libc.so.6`__libc_start_main_impl(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe718) at libc-start.c:360:3

```

There is a subsequent write to that memory location in the leak 
sanitizer and LSan complains:

`==4056526==LeakSanitizer has encountered a fatal error.`  
(though usually this message isn't flushed)

I assume the original problem was caused by the GC and ASan/LSan 
are just subsequent victims, but it's hard to be sure. 
Apparently, LSan is automatically enabled for Linux when ASan is 
used. Although the ASan documentation says that LSan "can be 
enabled using `ASAN_OPTIONS=detect_leaks=1` on macOS", setting 
that to 0 didn't seem to disable it, so I couldn't test with ASan 
but not LSan.

Any ideas of what might be going on?
Feb 15
Steven Schveighoffer <schveiguy gmail.com> writes:
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
 I have a program where the GC *seems* to be overwriting memory 
 still in use and corrupting data.

Do I understand that this corruption is happening only with address sanitizer turned on?

I don't see any red flags here, though I'm assuming a lot of these weird random things you are doing (like appending a space to a string every loop) are essential to making the thing fail? It's possible these are tickling GC patterns that cause problems, or it's possible it's tickling bugs in code generation that might prevent the GC from seeing memory!

Writing to the "check" field might be because a GC cycle ran, and that item was incorrectly collected, and now the GC is writing to it because it thinks that data is fair game to use. The writing is probably not the problem; the problem is the previous collection of that data.

I have learned a lot of tricks when implementing the new GC, and when faced with problems like this, it's super-difficult to figure out how to properly find the problem. One technique I used is to fork after scanning, but before collection, and put that forked process to sleep. If the failure happens, then I can gdb into the forked process and see what state the GC was in, including the entire graph of memory, and I could see how a piece of memory is or isn't referenced. This is a tedious process, and requires a lot of knowledge and patience.

If this is indeed a problem with the GC, it's going to be tough to track down. If it's a problem with the codegen, then probably also difficult, but this function is small enough that maybe someone can look at the assembly and verify that it's doing the right thing? I don't know.
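
For readers who want to try the fork-and-freeze technique described above, here is a rough standalone sketch. It is not the code used in the GC work mentioned in this post: the function name `snapshotForDebugger` is made up for illustration, the call site in `main` only stands in for "after scanning, before collection", and a real investigation would place the call inside druntime and let the child sleep much longer.

```d
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, sleep, _exit;
import std.stdio : writeln;

/// Hypothetical helper: forks a copy of the current process and parks it.
/// Returns the child's PID in the parent (or -1 if fork failed).
/// Never returns in the child.
pid_t snapshotForDebugger(uint lingerSeconds)
{
    pid_t child = fork();
    if (child == 0)
    {
        // Child: its memory is a copy-on-write image of the parent at the
        // moment of the fork, so the heap graph can be inspected at leisure.
        sleep(lingerSeconds); // window in which a debugger can attach
        _exit(0);             // leave without running atexit handlers
    }
    return child;
}

void main()
{
    // Assumption: imagine the GC has just finished marking at this point.
    const child = snapshotForDebugger(1);
    if (child < 0)
    {
        writeln("fork failed");
        return;
    }
    writeln("snapshot process ", child, " (attach with: gdb -p ", child, ")");

    // ... the parent would now go on to the sweep phase ...

    int status;
    waitpid(child, &status, 0); // reap the child so the demo exits cleanly
}
```

The one-second linger keeps the demo short. When chasing a real bug, the child would typically sleep for minutes so that it is still around when the parent later trips over the corrupted data.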
Feb 15
Walter Bright <newshound2 digitalmars.com> writes:
Is this an issue with the new GC, or the old one?
Feb 17
Steven Schveighoffer <schveiguy gmail.com> writes:
On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?

Old gc

-Steve
Feb 17
Walter Bright <newshound2 digitalmars.com> writes:
On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
 On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?
 Old gc

I'm a little surprised to see a problem crop up after 25 years of continuous use. Perhaps try an older release of the compiler?
Feb 17
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 18/02/2025 6:22 PM, Walter Bright wrote:
 On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
 On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?
 Old gc
 I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?

This has already been solved; see Johan's comments. ASan with a fake stack isn't operating properly.
Feb 17
Walter Bright <newshound2 digitalmars.com> writes:
On 2/17/2025 9:26 PM, Richard (Rikki) Andrew Cattermole wrote:
 This has already been solved, see Johan's comments.

Excellent! Carry on...
Feb 19
Johan <j j.nl> writes:
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
 Tested with LDC 1.40.0 on x86_64 Linux:

 ```
 $ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app 
 check=-337690816
 $
 ```

Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled (`detect_stack_use_after_return=false`)?

Also, please test with a slightly older LDC, built against a different LLVM version, to see whether this is a new issue or not.

-Johan
Feb 16
Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that 
 FakeStack is not enabled?  
 (`detect_stack_use_after_return=false`)

The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (Was it supposed to work, to be disabled by default, etc.?)

Thanks!

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn#garbage-collection
Feb 16
Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
 The fake stack allocator is enabled. If I disable it via 
 `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no 
 longer reproduces.
I meant =0, of course.
Feb 16
Johan <j j.nl> writes:
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that 
 FakeStack is not enabled?  
 (`detect_stack_use_after_return=false`)
 The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces. According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

FakeStack allocates (!) space for stack variables, and points to that "fake stack" memory with a pointer in actual CPU stack memory. This means that the stack variables are no longer in memory that is scanned by the GC. The fix for that, of course, is to include all FakeStacks in the GC scanning [1a][1b].

This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]

You are very welcome to help investigate why it is no longer working! [3] is an interesting test case of how code should work.

-Johan

[1a] https://github.com/ldc-developers/druntime/compare/d6b328be91db63aff979f584b0d1def0f746d730...1d938e0b7f668b099f9fa694b135c82ef13dec59
[1b] https://github.com/ldc-developers/ldc/pull/3888
[2] https://github.com/ldc-developers/ldc/blob/d3f065816ec7d420f370e4c95c6000eb78187e25/tests/sanitizers/asan_fakestack_GC.d#L3
[3] https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/asan/TestCases/Posix/gc-test.cpp
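
The effect described in this post can be modelled with a toy program. This is only a sketch under simplifying assumptions, not druntime's scanner: `rangeReferences`, `Outcome`, and `simulate` are invented names, and plain static arrays stand in for the GC block, the ASan fake frame, and the CPU stack. The fake frame is memory the GC does not manage, so a pointer to it on the real stack leads the scanner nowhere unless the frame itself is scanned as an extra root range.

```d
import std.stdio : writeln;

/// Conservative check: does any word in `words` point into [start, start + size)?
bool rangeReferences(const(void*)[] words, const(void)* start, size_t size)
{
    const lo = cast(size_t) start;
    const hi = lo + size;
    foreach (w; words)
    {
        const p = cast(size_t) w;
        if (p >= lo && p < hi)
            return true;
    }
    return false;
}

struct Outcome
{
    bool realStackOnly; // object found when only the real stack is scanned
    bool withFakeFrame; // object found when the fake frame is scanned too
}

Outcome simulate()
{
    static ubyte[64] gcObject; // stand-in for a GC-allocated block
    static void*[4] fakeFrame; // stand-in for an ASan fake stack frame
    void*[4] realStack;        // stand-in for the scanned CPU stack

    fakeFrame[0] = gcObject.ptr;  // the local pointer variable lives in the fake frame
    realStack[0] = fakeFrame.ptr; // the real stack only points at the fake frame

    Outcome o;
    o.realStackOnly = rangeReferences(realStack[], gcObject.ptr, gcObject.length);
    o.withFakeFrame = o.realStackOnly
        || rangeReferences(fakeFrame[], gcObject.ptr, gcObject.length);
    return o;
}

void main()
{
    const o = simulate();
    writeln("scan real stack only:         object reachable = ", o.realStackOnly);
    writeln("scan real stack + fake frame: object reachable = ", o.withFakeFrame);
}
```

In the first case the object looks unreferenced and would be collected while still in use, which matches the corruption reported at the top of the thread.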
Feb 16
Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 You are very welcome to help investigate why it is no longer 
 working!
Sure, I'll have a look. Thanks.
Feb 16
Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 This used to work, but somehow does not work anymore since LDC 
 2.100 (I perhaps have forgotten about this and just noticed 
 it). [2]

 You are very welcome to help investigate why it is no longer 
 working!
 Sure, I'll have a look. Thanks.

I don't think this broke with D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program, but they do output AddressSanitizer CHECK failures.

```
==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
```

I'll need some time to dig through the IR, the GC, etc. If you are going to look at this, please let me know, to avoid duplicate efforts.
Feb 17
Johan <j j.nl> writes:
On Monday, 17 February 2025 at 21:56:29 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 This used to work, but somehow does not work anymore since 
 LDC 2.100 (I perhaps have forgotten about this and just 
 noticed it). [2]

 You are very welcome to help investigate why it is no longer 
 working!
 Sure, I'll have a look. Thanks.
 I don't think this broke with the D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.
 ```
 ==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
 ```
 I'll need some time to dig through the IR, the GC, etc.

It is likely related to the LLVM version. Did you already check that? Possibly a subtle change in API.

 If you are going to look at this, please let me know, to avoid 
 duplicate efforts.

Not soon, no.

-Johan
Feb 18
Luís Marques <luis luismarques.eu> writes:
On Tuesday, 18 February 2025 at 18:05:41 UTC, Johan wrote:
 It is likely related to LLVM version. Did you already check 
 that? Possibly a subtle change in API.

I'm only now checking that; I ran into a lot of yak-shaving tasks.

BTW, did LDC stop emitting warnings/errors since 1.35? E.g. for the `;` vs `{}`: https://godbolt.org/z/K95E99a5z
Mar 06
Luís Marques <luis luismarques.eu> writes:
On Tuesday, 18 February 2025 at 18:05:41 UTC, Johan wrote:
 It is likely related to LLVM version. Did you already check 
 that? Possibly a subtle change in API.
LDC 1.40.0 (git commit 9296fd6fcc) reproduces the problem with both LLVM 15.0 and 19.1, so for now it seems that this isn't related to something having changed on the LLVM side.
Mar 07
Luís Marques <luis luismarques.eu> writes:
On Friday, 7 March 2025 at 14:51:22 UTC, Luís Marques wrote:
 LDC 1.40.0 (git commit 9296fd6fcc) reproduces the problem with 
 both LLVM 15.0 and 19.1, so for now it seems that this isn't 
 related to something having changed on the LLVM side.

I finally found some time today to analyze this properly, rather than just trying different LDC/LLVM versions. I haven't yet double-checked everything, but based on my current understanding, I believe I've identified the issue, along with several deficiencies:

1. **The main issue:** The distributed `libdruntime-ldc-shared.so` appears to be compiled without `version=SupportSanitizers`, so the fake stack scanning isn't enabled. I conclude this by observing that the distributed runtime's disassembly doesn't include `scanStackForASanFakeStack`, unlike when this option is enabled. The necessary code exists in the runtime's source, but it's simply not being compiled in.

2. **Fake stack support should work when properly enabled:** If you build LDC with `-DRT_SUPPORT_SANITIZERS=True`, the fake stack is scanned as expected. From what I can see (by checking for the proper GC scan calls and addresses), everything is being done correctly. The test `tests/sanitizers/asan_fakestack_GC.d` still fails, but not because of broken fake stack scanning; instead, the test itself is broken. An assertion fails in `test_non_null_does_not_trigger_collection` because memory from a previous iteration is being freed, not because it's freeing memory incorrectly for the current iteration.

3. **Redundant function calls:** The runtime function `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple times for the same fake stack frame, using identical memory ranges each time. While technically correct, this is unnecessary and inefficient. Based on the comment `"This is the pointer we are catching with"`, it seems there's a misunderstanding regarding how the fake stack should be located.

4. **Misleading comment in the test:** The comment `// Large enough so it will be on fakestack heap (not inlined in local stack frame)` in `asan_fakestack_GC.d` appears to be incorrect. The size of the allocation shouldn't determine this: if its address escapes, it must go into the heap-allocated fake stack regardless.
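
As an illustration of point 3, one way to avoid rescanning identical ranges is to remember the ranges already handled during a scan pass. The following is a minimal sketch, not druntime code: `Range`, `scanDistinct`, and the 16-entry buffer size are assumptions made for the example. A fixed-size buffer is used because code running inside the collector cannot allocate from the GC.

```d
import std.stdio : writeln;

/// Hypothetical description of one fake-frame memory range.
struct Range
{
    size_t begin;
    size_t end;
}

/// Calls `scan` for each candidate range, skipping ranges already seen in
/// this pass. Returns the number of ranges actually scanned.
size_t scanDistinct(const(Range)[] candidates, scope void delegate(Range) scan)
{
    Range[16] seen; // fixed-size memo, no allocation
    size_t seenCount, scanned;

    outer: foreach (r; candidates)
    {
        foreach (s; seen[0 .. seenCount])
            if (s == r)
                continue outer; // identical range was already scanned

        // If the memo is full the range is still scanned, just not remembered,
        // so correctness is kept and only the deduplication degrades.
        if (seenCount < seen.length)
            seen[seenCount++] = r;

        scan(r);
        ++scanned;
    }
    return scanned;
}

void main()
{
    Range[] visited;
    const n = scanDistinct(
        [Range(0x1000, 0x1040), Range(0x1000, 0x1040),
         Range(0x2000, 0x2080), Range(0x1000, 0x1040)],
        (Range r) { visited ~= r; });
    writeln(n, " distinct ranges scanned: ", visited);
}
```

Whether this is worth doing depends on how often the duplicates occur in practice; scanning a range twice is harmless apart from the wasted time.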
Mar 27
Johan <j j.nl> writes:
Hi Luis,
   Thanks for spending some time on it.
Here are some inline remarks from my side.

On Friday, 28 March 2025 at 00:35:50 UTC, Luís Marques wrote:
 1. **The main issue:** The distributed 
 `libdruntime-ldc-shared.so` appears to be compiled without 
 `version=SupportSanitizers`, so the fake stack scanning isn't 
 enabled.
This is intentional, because there is some overhead associated with it. If this is the main issue then that's "OK" and has always been the case. Perhaps FakeStack is now enabled by default. We should either disable that default, or start shipping druntime with this support compiled in and accept the small overhead.
 2. **Fake stack support should work when properly enabled:** If 
 you build LDC with `-DRT_SUPPORT_SANITIZERS=True`, the fake 
 stack is scanned as expected.
Great.
 From what I can see (by checking for the proper GC scan calls 
 and addresses), everything is being done correctly. The test 
 `tests/sanitizers/asan_fakestack_GC.d` still fails, but not 
 because of broken fake stack scanning—instead, the test itself 
 is broken. An assertion fails in 
 `test_non_null_does_not_trigger_collection` because memory from 
 a previous iteration is being freed, not because it's freeing 
 memory incorrectly for the current iteration.
The idea of the test is to test that enabling fakestack does not change collection behavior. The test is clunky, I'll admit. If it is indeed failing already on the fakestack-disabled case, then it needs to be adjusted for sure.
 3. **Redundant function calls:** The runtime function 
 `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple 
 times for the same fake stack frame, using identical memory 
 ranges each time. While technically correct, this is 
 unnecessary and inefficient. Based on the comment `"This is the 
 pointer we are catching with"`, it seems there's a 
 misunderstanding regarding how the fake stack should be located.
I'm not sure how this can be related to specifically fakestack implementation issue. If you mean that `scanAllTypeImpl` calls `scanStackForASanFakeStack` multiple times, then that is not a problem of our fakestack support code, but of `scanAllTypeImpl` itself because it would also call the normal `scan` method multiple times with same memory range. That would point towards an issue with duplicates in the `ThreadBase.sm_cbeg` list. Is that what you meant? (it is literally what you typed, but the sentences following it do not make sense with that interpretation...)

LLVM does not give us many options to find the fake stack. At the time I think I used the only public API method available, which indeed results in multiple times getting the same fakestack. We can poke into ASan internals but I am afraid it will quickly bitrot/break. Perhaps by now ASan's public interface has been extended, otherwise we should ask for it (or submit PRs to LLVM)...

I'm quite sure we are doing the right thing of locating the fakestack, it's done similarly here: https://www.aovivo.ilutas.com.br/node-v17.9.1/deps/v8/src/heap/base/stack.cc
 4. **Misleading comment in the test:** The comment `// Large 
 enough so it will be on fakestack heap (not inlined in local 
 stack frame)` in `asan_fakestack_GC.d` appears to be incorrect. 
 The size of the allocation shouldn't determine this—if its 
 address escapes, it must go into the heap-allocated fake stack 
 regardless.
I vaguely remember there being some optimization with small memory sizes, but I cannot find it and cannot trigger it anymore. I think it may be related to this comment of mine: https://johanengelen.github.io/posts/2017-12-25-ldc-and-addresssanitizer/#fn:4

Hmm... I no longer remember. Regardless, you are right that to detect any stack memory use after return, that memory cannot sit on the regular processor stack.

cheers,
  Johan
Mar 28