digitalmars.D.learn - How to debug (potential) GC bugs?

Matthias Klumpp <matthias tenstral.net> writes:
Hello!
I am working together with others on the D-based 
appstream-generator[1] project, which is generating software 
metadata for "software centers" and other package-manager 
functionality on Linux distributions, and is used by default on 
Debian, Ubuntu and Arch Linux.

For Ubuntu, some modifications to the code were needed, and 
apparently for them the code is currently crashing in the GC 
collection thread: http://paste.debian.net/840490/

The project is running a lot of stuff in parallel and is using 
the GC (if the extraction is a few seconds slower due to the GC 
being active, it doesn't matter much).

We also link against a lot of 3rd-party libraries and use a big 
amount of existing C code in the project.

So, I would like to know the following things:

1) Is there any caveat when linking to C libraries and using the 
GC in a project? So far, it seems to be working well, but there 
have been a few cases where I was suspicious about the GC 
actually doing something to malloc'ed stuff or C structs present 
in the bindings.

2) How can one debug issues like the one mentioned above 
properly? Since it seems to happen in the GC and doesn't give me 
information on where to start searching for the issue, I am a bit 
lost.

3) The tool seems to leak memory somewhere and OOMs pretty 
quickly on some machines. All the stuff using C code frees 
resources properly though, and using Valgrind on the project is a 
pain due to large amounts of data being mmapped. I worked around 
this a while back, but then the GC interfered with Valgrind, 
making its output less useful. Is there any information on how 
to find memory leaks, or e.g. large structs the GC cannot free 
because something is still holding a needless reference to them?

Unfortunately I can't reproduce the crash from 2) myself; it only 
seems to happen on Ubuntu (but Ubuntu is using some different 
code paths too).

Any insights would be highly appreciated!
Cheers,
    Matthias

[1]: https://github.com/ximion/appstream-generator
Sep 25 2016
Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 25 Sep 2016 16:23:11 +0000,
Matthias Klumpp <matthias tenstral.net> wrote:

 So, I would like to know the following things:
 
 1) Is there any caveat when linking to C libraries and using the 
 GC in a project? So far, it seems to be working well, but there 
 have been a few cases where I was suspicious about the GC 
 actually doing something to malloc'ed stuff or C structs present 
 in the bindings.
If you pass callbacks into the C code, make sure they never throw. Stack unwinding and exception handling generally don't work across language boundaries.

A tracing garbage collector starts with the assumption that all the memory it allocated is no longer reachable, and then scans the known memory for any pointers to allocations that falsify this assumption. What you malloc'ed is unknown to the GC and won't be scanned. Should you ever have pointers to GC memory in your malloc'ed stuff, you need to call GC.addRange() so that those pointers keep the allocations alive. Otherwise you will get a use-after-free error: data corruption or access violations. A simple case would be a string that you constructed in D and store in C as a pointer. The GC can automatically scan the stack and any globals/statics on the D side, but that's about it.

I know of no tools similar to Valgrind specifically designed to debug the D GC. You can plug into the GC API and keep track of the allocation sizes, i.e. write a proxy GC.

-- Marco
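As a rough illustration of the GC.addRange() advice above, here is a minimal sketch. The `Record` struct and the `createRecord`/`destroyRecord` helpers are made up for this example and are not taken from appstream-generator; only `GC.addRange`, `GC.removeRange` and `toStringz` are real library calls.

```d
import core.memory : GC;
import core.stdc.stdlib : free, malloc;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical C-style record that lives in malloc'ed memory.
struct Record
{
    const(char)* name; // will point into GC-allocated memory
}

Record* createRecord(string name)
{
    auto rec = cast(Record*) malloc(Record.sizeof);
    assert(rec !is null, "malloc failed");
    rec.name = null; // initialize before the GC may scan the block
    // Register the block so the GC scans it for pointers during collections.
    GC.addRange(rec, Record.sizeof);
    rec.name = name.toStringz; // may allocate a NUL-terminated GC copy
    return rec;
}

void destroyRecord(Record* rec)
{
    // Unregister before freeing, so the GC never scans freed memory.
    GC.removeRange(rec);
    free(rec);
}

unittest
{
    auto rec = createRecord("appstream".idup);
    GC.collect(); // the string survives because the range is registered
    assert(strlen(rec.name) == 9);
    destroyRecord(rec);
}
```

The example can be run with something like `dmd -unittest -main -run example.d`. Without the addRange call, nothing the GC can see would reference the copied string once `createRecord` returns, so a later collection would be free to reclaim it.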
Sep 26 2016
Guillaume Piolat <first.last gmail.com> writes:
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp 
wrote:
 Hello!
 I am working together with others on the D-based 
 appstream-generator[1] project, which is generating software 
 metadata for "software centers" and other package-manager 
 functionality on Linux distributions, and is used by default on 
 Debian, Ubuntu and Arch Linux.

 For Ubuntu, some modifications on the code were needed, and 
 apparently for them the code is currently crashing in the GC 
 collection thread: http://paste.debian.net/840490/

 The project is running a lot of stuff in parallel and is using 
 the GC (if the extraction is a few seconds slower due to the GC 
 being active, it doesn't matter much).

 We also link against a lot of 3rd-party libraries and use a big 
 amount of existing C code in the project.

 So, I would like to know the following things:
 1) Is there any caveat when linking to C libraries and using 
 the GC in a project? So far, it seems to be working well, but 
 there have been a few cases where I was suspicious about the GC 
 actually doing something to malloc'ed stuff or C structs 
 present in the bindings.
There is no way the GC scans memory allocated with malloc (unless you tell it to) or used in the bindings. A caveat is that if you are called from C (not your case), you must initialize the runtime, and attach/detach threads. The GC could well stop threads that are currently in the C code if they were registered to the runtime.
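For the "called from C" case mentioned above, a minimal sketch could look as follows. The `mylib_*` function names are hypothetical; `rt_init`, `rt_term`, `thread_attachThis` and `thread_detachThis` are the druntime entry points, and the sketch assumes the worker function is invoked on a thread that the C side created itself.

```d
import core.runtime : rt_init, rt_term;
import core.thread : thread_attachThis, thread_detachThis;

// Hypothetical entry points exported to a C host application.
extern (C) int mylib_init()
{
    return rt_init(); // 1 on success, 0 on failure
}

extern (C) void mylib_shutdown()
{
    rt_term();
}

// Hypothetical callback that runs on a thread created by C code.
extern (C) void mylib_worker(void* userData)
{
    thread_attachThis();              // make the thread known to the GC
    scope (exit) thread_detachThis(); // unregister before returning to C
    try
    {
        auto scratch = new ubyte[](1024); // GC allocation is safe now
        scratch[0] = 1;
    }
    catch (Exception e)
    {
        // Swallow the error: exceptions must not propagate into C code.
    }
}
```

The try/catch reflects the earlier advice in this thread that callbacks handed to C should never throw.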
 2) How can one debug issues like the one mentioned above 
 properly? Since it seems to happen in the GC and doesn't give 
 me information on where to start searching for the issue, I am 
 a bit lost.
There can be multiple reasons:
- The GC is collecting some object that is unreachable from its POV while you are actually still using it.
- The GC is calling destructors that should not be called by the GC and that perform illegal operations. Usually this is solved by using deterministic destruction instead and never relying on a destructor being called by the GC.
- The GC tries to stop threads that don't exist anymore or are not interruptible.

My advice is to have a fully deterministic tree of objects, like a C++ program, and to Google for "GC-proof resource class" in case you are using classes.
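One way to read the "deterministic tree of objects" advice is C++-style RAII with structs. The following is only a sketch under that assumption; `CBuffer` is a made-up type that owns a malloc'ed block and releases it at scope exit instead of waiting for the GC.

```d
import core.stdc.stdlib : free, malloc;

// Hypothetical owner of a malloc'ed buffer with scope-bound lifetime.
struct CBuffer
{
    private void* ptr;
    private size_t len;

    @disable this(this); // non-copyable: exactly one owner

    this(size_t size) @nogc nothrow
    {
        ptr = malloc(size);
        len = ptr is null ? 0 : size;
    }

    ~this() @nogc nothrow
    {
        free(ptr); // free(null) is a no-op
        ptr = null;
    }

    size_t length() const @nogc nothrow
    {
        return len;
    }
}

unittest
{
    {
        auto buf = CBuffer(64);
        assert(buf.length == 64);
    } // the destructor runs here, not during a GC collection
}
```

Note that if such a struct is stored as a field of a GC-allocated class, its destructor can again end up being run by the GC, which is why the destructor above touches nothing but C memory.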
Sep 27 2016
Kapps <opantm2+spam gmail.com> writes:
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp 
wrote:
 Hello!
 I am working together with others on the D-based 
 appstream-generator[1] project, which is generating software 
 metadata for "software centers" and other package-manager 
 functionality on Linux distributions, and is used by default on 
 Debian, Ubuntu and Arch Linux.

 For Ubuntu, some modifications on the code were needed, and 
 apparently for them the code is currently crashing in the GC 
 collection thread: http://paste.debian.net/840490/

 The project is running a lot of stuff in parallel and is using 
 the GC (if the extraction is a few seconds slower due to the GC 
 being active, it doesn't matter much).

 We also link against a lot of 3rd-party libraries and use a big 
 amount of existing C code in the project.

 So, I would like to know the following things:

 1) Is there any caveat when linking to C libraries and using 
 the GC in a project? So far, it seems to be working well, but 
 there have been a few cases where I was suspicious about the GC 
 actually doing something to malloc'ed stuff or C structs 
 present in the bindings.

 2) How can one debug issues like the one mentioned above 
 properly? Since it seems to happen in the GC and doesn't give 
 me information on where to start searching for the issue, I am 
 a bit lost.

 3) The tool seems to leak memory somewhere and OOMs pretty 
 quickly on some machines. All the stuff using C code frees 
 resources properly though, and using Valgrind on the project is 
 a pain due to large amounts of data being mmapped. I worked 
 around this a while back, but then the GC interfered with 
 Valgrind, making information less useful. Is there any 
 information on how to find memory leaks, or e.g. large structs 
 the GC cannot free because something is still having a needless 
 reference on it?

 Unfortunately I can't reproduce the crash from 2) myself, it 
 only seems to happen at Ubuntu (but Ubuntu is using some 
 different codepaths too).

 Any insights would be highly appreciated!
 Cheers,
    Matthias

 [1]: https://github.com/ximion/appstream-generator
First, make sure any C threads calling D code are attached to the runtime with thread_attachThis (from core.thread). Otherwise the GC will not suspend those threads during a collection, which will cause crashes. I'd guess this is your issue.

Second, tell the GC about non-GC memory that has pointers to GC memory by using GC.addRange / GC.addRoot as needed. Make sure to remove them again once the non-GC memory is deallocated, otherwise you'll get memory leaks. The GC is also conservative, not precise, so false positives are possible. If you're building 64-bit programs, this shouldn't be much of an issue though.

Finally, make sure you're not doing any GC allocations in dtors.
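A small sketch of the GC.addRoot side of this advice, for the case where a D object is handed to a C library as an opaque user-data pointer. The `Job` class and the three helper functions are invented for illustration; the GC calls follow the pattern shown in the core.memory documentation.

```d
import core.memory : GC;

// Hypothetical D object that a C library holds on to via void*.
final class Job
{
    int id;
    this(int id) { this.id = id; }
}

// Pin the object before giving its address to C code.
void* toCHandle(Job job)
{
    auto p = cast(void*) job;
    GC.addRoot(p);                     // keep it alive while only C references it
    GC.setAttr(p, GC.BlkAttr.NO_MOVE); // guard against a moving collector
    return p;
}

Job fromCHandle(void* p)
{
    return cast(Job) p;
}

// Call this once the C side no longer uses the handle.
void releaseCHandle(void* p)
{
    GC.removeRoot(p);
    GC.clrAttr(p, GC.BlkAttr.NO_MOVE);
}

unittest
{
    void* handle = toCHandle(new Job(42));
    GC.collect(); // the object survives: it is registered as a root
    assert(fromCHandle(handle).id == 42);
    releaseCHandle(handle);
}
```

Forgetting `releaseCHandle` is exactly the kind of leak described above: the root keeps the object, and everything it references, alive indefinitely.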
Sep 27 2016
Kagamin <spam here.lot> writes:
Does it crash only in rt_finalize2? It calls the class 
destructor, and the destructor must not allocate or touch GC in 
any way because the GC doesn't yet support allocation during 
collection.
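One way to catch allocating destructors early, sketched here as a suggestion rather than something proposed in this thread, is to mark destructors `@nogc nothrow` so that the compiler rejects GC allocations inside them. The `Extractor` class is hypothetical.

```d
// Hypothetical class whose finalizer may be run by the GC.
final class Extractor
{
    private int[] scratch;

    this()
    {
        scratch = new int[](16); // allocating in the constructor is fine
    }

    ~this() @nogc nothrow
    {
        // Only non-GC cleanup belongs here. A statement such as
        // `scratch ~= 1;` or `new int[](4)` would not compile in @nogc code.
    }
}

// The compiler rejects GC allocations inside @nogc code...
static assert(!__traits(compiles, () @nogc { auto a = new int[](4); }));
// ...while stack-only work is accepted.
static assert(__traits(compiles, () @nogc { int[4] a; a[0] = 1; }));

unittest
{
    auto e = new Extractor;
    destroy(e); // runs the destructor deterministically
}
```

This only guards against allocations; it does not stop a destructor from dereferencing other GC-managed objects that may already have been finalized.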
Sep 29 2016
Matthias Klumpp <matthias tenstral.net> writes:
On Thursday, 29 September 2016 at 09:56:34 UTC, Kagamin wrote:
 Does it crash only in rt_finalize2? It calls the class 
 destructor, and the destructor must not allocate or touch GC in 
 any way because the GC doesn't yet support allocation during 
 collection.
Thank you all for the good advice! I do none of those things in my code though...

Unfortunately, to have deterministic memory management I would essentially need to develop GC-less and would lose classes. This means many nice features of D wouldn't be available, e.g. I couldn't use interfaces (AFAIK they don't work on structs) or constraints.

Strangely, after switching from the GDC compiler to the LDC compiler, all crashes observed on Ubuntu are gone. So, this problem is:
 A) A compiler / DRuntime bug, or
 B) A bug in my code (not) triggered by a certain compiler / DRuntime

For the excessive memory usage, I have no idea yet - the GC not freeing its memory pool on exit is quite bad for Valgrinding the code. Memory consumption has improved recently since I stopped re-opening an LMDB database twice in the same process from multiple threads, which is not supported by LMDB. I haven't done longer runs yet, so I am not sure if that really was the problem (seems unlikely, but you never know...).

Cheers,
Matthias
Sep 30 2016
Kagamin <spam here.lot> writes:
On Saturday, 1 October 2016 at 00:06:05 UTC, Matthias Klumpp 
wrote:
 I do none of those things in my code though...
`grep "~this" *.d` gives nothing? It can be a struct with a destructor stored in a class. Can you observe the error? Try to set a breakpoint at onInvalidMemoryOperationError (https://github.com/dlang/druntime/blob/master/src/core/exception.d#L559) and see what stack leads to it.
 Unfortunately for having deterministic memory management, I 
 would essentially need to develop GC-less, and would loose 
 classes. This means many nice features of D aren't available, 
 e.g. I couldn't use interfaces (AFAIK they don't work on 
 structs) or constraints.
Not necessarily. You only need to dispose of the resources in time, like in C#. But if you don't have destructors, you have nothing to dispose of.
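A sketch of what C#-style disposal might look like in D while still keeping classes and interfaces. The `Disposable` interface and the `Database` class are hypothetical names, and the malloc'ed handle merely stands in for a real C resource.

```d
import core.stdc.stdlib : free, malloc;

// Hypothetical counterpart to C#'s IDisposable.
interface Disposable
{
    void dispose();
}

// Hypothetical wrapper around a C-side resource.
final class Database : Disposable
{
    private void* handle;

    this()
    {
        handle = malloc(16); // stand-in for opening a real C resource
    }

    // Idempotent and free of GC use, so it is safe to call at any time.
    void dispose() @nogc nothrow
    {
        free(handle); // free(null) is a no-op
        handle = null;
    }

    bool isOpen() const @nogc nothrow
    {
        return handle !is null;
    }

    // Fallback only: the regular path is an explicit dispose().
    ~this() @nogc nothrow
    {
        dispose();
    }
}

unittest
{
    auto db = new Database;
    scope (exit) db.dispose(); // release at scope exit, not at collection time
    assert(db.isOpen);
}
```

The object itself stays GC-managed; only the external resource is released deterministically, so the finalizer has nothing left to do when the GC eventually runs it.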
 Strangely after switching from the GDC compiler to the LDC 
 compiler, all crashes observed at Ubuntu are gone.
That doesn't sound good.
Oct 03 2016
Kagamin <spam here.lot> writes:
If it's heap corruption, the GC has a debugging option, 
-debug=SENTINEL, for buffer overrun checks. Also, that particular 
stack trace shows that the object being destroyed is allocated in 
bin 512, i.e. its size is between 256 and 512 bytes.
Oct 03 2016
Martin Nowak <code dawg.eu> writes:
On Saturday, 1 October 2016 at 00:06:05 UTC, Matthias Klumpp 
wrote:
 So, this problem is:
  A) A compiler / DRuntime bug, or
  B) A bug in my code (not) triggered by a certain compiler / 
 DRuntime
We actually did change druntime recently to no longer fail when using GC.free from a finalizer (will get ignored now). Maybe that's what fixed it for you w/ a newer version, but at a quick glance I haven't seen any freeing code in destructors.
Oct 07 2016
Kagamin <spam here.lot> writes:
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp 
wrote:
 For Ubuntu, some modifications on the code were needed, and 
 apparently for them the code is currently crashing in the GC 
 collection thread: http://paste.debian.net/840490/
Oh, wait, what do you mean by crashing?
Oct 03 2016
Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp 
wrote:
 Hello!
 I am working together with others on the D-based 
 appstream-generator[1] project, which is generating software 
 metadata for "software centers" and other package-manager 
 functionality on Linux distributions, and is used by default on 
 Debian, Ubuntu and Arch Linux.

 [...]
Probably related issue: https://issues.dlang.org/show_bug.cgi?id=15939
Oct 04 2016
Martin Nowak <code dawg.eu> writes:
On Tuesday, 4 October 2016 at 08:14:37 UTC, Ilya Yaroshenko wrote:
 Probably related issue: 
 https://issues.dlang.org/show_bug.cgi?id=15939
It crashes in a finalizer, so it is likely not related to the deadlock bug.
Oct 07 2016
Johannes Pfau <nospam example.com> writes:
On Sun, 25 Sep 2016 16:23:11 +0000,
Matthias Klumpp <matthias tenstral.net> wrote:

 Hello!
 I am working together with others on the D-based 
 appstream-generator[1] project, which is generating software 
 metadata for "software centers" and other package-manager 
 functionality on Linux distributions, and is used by default on 
 Debian, Ubuntu and Arch Linux.
 
 For Ubuntu, some modifications on the code were needed, and 
 apparently for them the code is currently crashing in the GC 
 collection thread: http://paste.debian.net/840490/
 
 The project is running a lot of stuff in parallel and is using 
 the GC (if the extraction is a few seconds slower due to the GC 
 being active, it doesn't matter much).
 
 [...]
 
 2) How can one debug issues like the one mentioned above 
 properly? Since it seems to happen in the GC and doesn't give me 
 information on where to start searching for the issue, I am a bit 
 lost.
 
Can you get the GDC & LDC phobos versions? We added shared library support in 2.068, which replaced much of the GDC-specific backported GC/TLS code with the standard upstream implementation. So using a recent 2.068 GDC could help. Judging from the stack trace you're probably using a 2.067 phobos:
https://github.com/D-Programming-GDC/GDC/blob/722cf5670d927ef6182bf1b72765a64ca0fde693/libphobos/libdruntime/rt/lifetime.d#L1423

Here's some advice for debugging such a problem. The memory layout is usually deterministic when restarting the app in gdb with the run command. So you can do this:

  gdb app
  # run
  # SIGSEGV in ....
  # bt

Then get the value of p when the app crashed (in the posted stack trace, 0x7fdfae368000):

  # break rt_finalize2 if p == 0x7fdfae368000
  # run

This should now break whenever the object is collected, so you can check whether it is collected twice. You can also use next to step until you get the classinfo in c and then print the classinfo contents:

  # print c

You can also use write breakpoints (watchpoints) to find data corruption. Find the value of pc:

  # break lifetime.d:1418 if p == 0x7fdfae368000
  # run
  # print ppv
  # watch -l pc
  # or watch * (value of ppv)

Then disable the old breakpoint and run from the start:

  # disable 1
  # run

This should now break when data is written to the location.

(The commands might not be 100% correct ;-)
Oct 07 2016