
digitalmars.D - GC hangs on spinlock in ConservativeGC.runLocked

reply klasbo <klasbo gmail.com> writes:
This is in a vibe.d application, on Ubuntu server 22.04. 
Presumably something isn't unlocking the gcLock, but I have no 
idea what. See GDB backtrace below.

Ideas? And is there some way to get stack traces for any other 
vibe-tasks in a running process? Could this be caused by a 
non-clean exit of a vibe task?

It's hard to know if this is me, vibe.d, or druntime...

```

(clock_id=clock_id entry=0, flags=flags entry=0, 
req=0x7f7d0e6763d0, rem=0x7f7d0e6763e0)
     at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78

rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25

core.thread.osthread.Thread.sleep(core.time.Duration) ()

core.internal.spinlock.SpinLock.yield(ulong) shared ()

shared ()

core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC
mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)),
core.internal.gc.impl.conservative.gc.mallocTime,
core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong,
const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
()

core.internal.gc.impl.conservative.gc.ConservativeGC.malloc(ulong, uint,
const(TypeInfo)) ()



()




core.internal.gc.impl.conservative.gc.Gcx.fullcollect(bool, bool, 
bool) ()

core.internal.gc.impl.conservative.gc.Gcx.bigAlloc(ulong, ref 
ulong, uint, const(TypeInfo)) ()

core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC
mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)),
core.internal.gc.impl.conservative.gc.mallocTime,
core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong,
const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
()

core.internal.gc.impl.conservative.gc.ConservativeGC.qalloc(ulong, uint, scope
const(TypeInfo)) ()


_D3std5array__T8AppenderTAyaZQo13ensureAddableMFmZ9__lambda9MFNaNbNeZS4core6memory8BlkInfo_ (__capture=0x7f7d0e676a90, __HID4=0x7f7d0e676a60)
     at 
/usr/local/dlang/dmd-2.100.2/linux/bin64/../../src/phobos/std/array.d:3576
[rest of trace omitted]
```
Feb 20 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 2/20/23 8:01 AM, klasbo wrote:
 This is in a vibe.d application, on Ubuntu server 22.04. Presumably 
 something isn't unlocking the gcLock, but I have no idea what. See GDB 
 backtrace below.
 
 Ideas? And is there some way to get stack traces for any other 
 vibe-tasks in a running process? Could this be caused by a non-clean 
 exit of a vibe task?
 
This bug was fixed in 2.102.0: 
https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo

The actual bug: https://issues.dlang.org/show_bug.cgi?id=22616

-Steve
Feb 20 2023
parent reply klasbo <klasbo gmail.com> writes:
On Monday, 20 February 2023 at 16:31:18 UTC, Steven Schveighoffer 
wrote:
 This bug was fixed in 2.102.0: 
 https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo

 The actual bug: https://issues.dlang.org/show_bug.cgi?id=22616

 -Steve
I was using 2.100, so that would explain it then.

I don't quite understand the internals of the GC well enough to 
"get" what happens here. Is it that an assert in sweep() 
triggers, and this[1] scope(failure) in fullcollect() would 
re-trigger a GC collection when it tries to allocate the trace 
info (which is now fixed since trace info is nogc)?

Or the more important (for me) question: Is this part of "normal" 
GC control flow (why would assert(freedPages < usedPages) 
trigger? This is beyond the limit of my GC understanding!), or is 
there still/also something broken on my end that I have to look 
for?

[1] 
https://github.com/dlang/dmd/blob/3457018e34ab2333de22a6285622d3190268d5e0/druntime/src/core/internal/gc/impl/conservative/gc.d#L3147

Thank you for the answer, and (to all involved) the work in 
fixing it!
Feb 20 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 2/20/23 4:28 PM, klasbo wrote:
 I don't quite understand the internals of the GC well enough to "get" 
 what happens here. Is it that an assert in sweep() triggers, and this[1] 
 scope(failure) in fullcollect() would re-trigger a GC collection when it 
 tries to allocate the trace info (which is now fixed when trace info is 
 nogc)?
So it was quite a subtle bug.

The throw handler will look at the `info` member of the Throwable 
(the trace info), and if it is null, will attempt to allocate a 
trace info. If you threw inside the GC (i.e. while the GC is 
collecting), this means you *cannot* allocate. The trace info 
allocator sneakily was using the GC, and so it had a case where 
it checked the `inFinalizer` flag, and if that was true, it would 
simply not allocate the traceinfo (leaving it at null).

In the GC collection routine, there is a mechanism that catches 
`Exception`, and handles it properly by throwing a FinalizerError 
(this has its `info` set to `SuppressTraceInfo`, which prevents 
further traceinfo allocations). But `Error` is *not* caught. It 
leaks all the way out to a `scope(failure)` statement. Now, 
inside this `scope(failure)` statement, the `inFinalizer` flag is 
set to false. What does this do?

a. catches the Error (which, remember, still has the null trace 
info)
b. sets the flag to false
c. *rethrows the Error*. This sees a null traceinfo, and tries to 
allocate
d. allocating tries to take the spinlock, which is still locked, 
and deadlocks.

The whole reason we don't allow allocating inside the finalizer 
is because of this deadlock!

The solution is to allocate the trace info with C `malloc`, and 
free it with `free`. The trace info was always simply a list of 
stack frame addresses, and so trivial to allocate/free.
 Or the more important (for me) question: Is this part of "normal" GC 
 control flow (why would assert(freedPages < usedPages) trigger? This is 
 beyond the limit of my GC understanding!), or is there still/also 
 something broken on my end that I have to look for?
The error itself is something that isn't addressed with this fix. 
So I should clarify that the *deadlock* was fixed, not your 
original root cause.

Please try with the latest compiler, and elaborate further if you 
still can't figure it out and/or file a bug!

-Steve
Feb 20 2023
parent klasbo <klasbo gmail.com> writes:
On Monday, 20 February 2023 at 22:11:47 UTC, Steven Schveighoffer 
wrote:
 On 2/20/23 4:28 PM, klasbo wrote:
 Or the more important (for me) question: Is this part of 
 "normal" GC control flow (why would assert(freedPages < 
 usedPages) trigger? This is beyond the limit of my GC 
 understanding!), or is there still/also something broken on my 
 end that I have to look for?
 The error itself is something that isn't addressed with this 
 fix. So I should clarify that the *deadlock* was fixed, not your 
 original root cause.

 Please try with the latest compiler, and elaborate further if 
 you still can't figure it out and/or file a bug!

 -Steve
So at least I will now get a proper error (presumably 
InvalidMemoryOperationError?) and its stack trace when/if-ever 
this happens again. Or if I am (un)lucky, this will be "handled" 
by vibe and only show up in any logs I enable.

Thanks again.
Feb 20 2023