
digitalmars.D - Known reasons why D crashes without any message?

Thorsten Sommer <vektoren gmail.com> writes:
Dear Community,

My student assistant and I have been working on an artificial 
intelligence library in D for a while. This library is part of my 
PhD thesis and is used to perform several experiments that push 
the state of the art.

(Yes, after the thesis is published, the entire library, 
including the novel algorithms, will be open-sourced on GitHub.)

Right now, we are done with the development and ready to start 
experiments. So far, almost everything runs fine in our unit 
tests.

Besides the unit tests, the main program is now able to start up 
but crashes after a while without any message at all. No stack 
trace, no exception, nothing. Obviously, this makes it hard to 
debug anything...

To give a rough impression of what the code uses (maybe this 
information will help narrow down the possibilities):

- External dependencies: fluent-asserts, requests and our own 
library quantum-random for physical randomness
- Heavy meta-programming, e.g. with templates, across 9,000 lines 
of code
- The code was designed to be OOP: composition, inheritance, 
delegation, polymorphism...
- We run many instances of an external Go program with a maze 
simulation (the task for the AI) by using pipeProcess()
- We use parallel foreach loops for scaling (we have issues with 
that too; I may open another thread for it)
- We send thousands of HTTP requests using the requests library
- The entire simulation runs in Docker containers on huge servers 
(144 CPU cores, ~470 GB RAM). The base image uses DMD 2.076.0 + 
Ubuntu Server 16.04

Are there any well-known circumstances, bugs, etc. where an 
abrupt interruption of a D program without any message is 
possible? My expectation was that I would receive at least a 
stack trace. For debugging, I disabled parallelism entirely in 
order to eliminate effects like exceptions being hidden in threads, 
missing/wrong variable sharing, etc.

I would be grateful for any ideas, as I am currently stuck and no 
longer know how and where to continue debugging.


Best regards
Thorsten
Sep 13 2017
Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Wednesday, 13 September 2017 at 10:20:48 UTC, Thorsten Sommer 
wrote:
 Are there any well-known circumstances, bugs, etc. where an 
 abrupt interruption of a D program without any message is 
 possible?
A stack overflow is one. Why not run the program under a debugger?
Sep 13 2017
rikki cattermole <rikki cattermole.co.nz> writes:
1) You really need to switch to LDC; even for small neural networks, it 
makes a MASSIVE difference!
2) In release mode, who knows what'll happen. Maybe add some logging 
(versioned/debug, of course) to help figure out where things are going wrong.
3) Wrap it up with try/catch and write out the message yourself. You 
want Error, not Exception, FYI.

Not terribly helpful, but a good place to begin anyway.
Of course, if somebody is calling the C exit function, it may very well 
bypass D's exception handling altogether.
Sep 13 2017
Moritz Maxeiner <moritz ucworks.org> writes:
On Wednesday, 13 September 2017 at 10:20:48 UTC, Thorsten Sommer 
wrote:
 [...]

 Besides the unit tests, the main program is now able to start up 
 but crashes after a while without any message at all. No stack 
 trace, no exception, nothing. Obviously, this makes it hard to 
 debug anything...

 [...]

 Are there any well-known circumstances, bugs, etc. where an 
 abrupt interruption of a D program without any message is 
 possible? My expectation was that I would receive at least a 
 stack trace. For debugging, I disabled parallelism entirely in 
 order to eliminate effects like exceptions being hidden in 
 threads, missing/wrong variable sharing, etc.

 [...]
Things D generally depends on the platform to deal with (such as null pointer dereferences) won't yield you a message from the D side.

What is the exit code of the program? If it's of the form `128+n` with `n == SIGXYZ`, you know more about why it crashed [1]. If the exit code is 139, e.g., you know some code tried to access memory via an invalid reference (as SIGSEGV == 11 on Linux x64), which often means you dereferenced a null pointer somewhere.

[1] http://www.tldp.org/LDP/abs/html/exitcodes.html
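Here is a minimal sketch of that check. It assumes a shell such as bash or dash, which reports a process killed by signal n as status 128+n. `describe_exit` is a hypothetical helper, not part of any tool mentioned in this thread. Statuses above 128 are only *usually* signals, because a program can also call exit() with such a value.

```shell
#!/bin/sh
# Hypothetical helper: translate a shell exit status into a readable reason.
# A process killed by signal n is reported by the shell as status 128+n.
# Only the classic signals 1-31 (statuses 129-159) are decoded here.
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ] && [ "$status" -lt 160 ]; then
        sig=$((status - 128))
        # `kill -l N` prints the signal name for number N (e.g. SEGV for 11).
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}
```

Usage: `./app; describe_exit $?`, where `./app` stands in for the real program. A status of 139 prints `killed by signal 11 (SIGSEGV)`. A status of 137 (SIGKILL, signal 9) is what you would typically see if the kernel's OOM killer ended the process.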
Sep 13 2017
qznc <qznc web.de> writes:
On Wednesday, 13 September 2017 at 10:20:48 UTC, Thorsten Sommer 
wrote:
 Right now, we are done with the development and ready to start 
 experiments. So far, almost everything runs fine in our unit 
 tests.

 Besides the unit tests, the main program is now able to start up 
 but crashes after a while without any message at all. No stack 
 trace, no exception, nothing. Obviously, this makes it hard to 
 debug anything...
I assume you see a nonzero return code, because you say it "crashes". Which one? Most likely it is a segmentation fault (invalid memory access, stack overflow, null pointer dereference, etc.).

Use a debugger. Compile with debug info and run the program under gdb. It should stop right where it crashes and can show you a stack trace. If necessary, inspect the values of variables. If gdb does not stop on its own, someone is calling exit to terminate prematurely; set a breakpoint at exit to get a stack trace.

If you cannot use gdb on your server and you cannot trigger the crash on your desktop, maybe you can let it dump core on the server? Then use gdb to inspect the dump.

Did you try to annotate your code with @safe? It helps to avoid errors leading to segmentation faults.
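Here is a minimal sketch of qznc's suggestion as a non-interactive wrapper. It assumes gdb is installed and the program was compiled with debug info (e.g. `dmd -g`); `debug_run` is a hypothetical name. The `handle` line rests on an assumption worth checking for your druntime version: on Linux, its GC has traditionally used SIGUSR1/SIGUSR2 to pause threads, and without that line gdb would stop on every collection.

```shell
#!/bin/sh
# Hypothetical wrapper: run a program under gdb in batch mode and print a
# backtrace wherever it stops (a crash signal, or the breakpoint on exit()).
#   handle ...            : don't stop on the GC's thread-suspend signals
#   breakpoint pending on : allow `break exit` before libc is loaded
#   break exit            : also catches code that calls C exit() directly
debug_run() {
    gdb -q -batch \
        -ex 'handle SIGUSR1 SIGUSR2 nostop noprint' \
        -ex 'set breakpoint pending on' \
        -ex 'break exit' \
        -ex 'run' \
        -ex 'bt' \
        --args "$@"
}
```

Usage: `debug_run ./app --some-flag`. In batch mode, gdb kills the program after the last command, so this wrapper is for diagnosis only, not for production runs.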
Sep 14 2017
Ali Çehreli <acehreli yahoo.com> writes:
On 09/13/2017 03:20 AM, Thorsten Sommer wrote:

 No stack trace, no exception, nothing.
Maybe the OOM Killer, if running on Linux.

Ali
Sep 14 2017
Thorsten Sommer <vektoren gmail.com> writes:
Thank you very much for the different approaches. Vladimir, I 
installed GDB today and am trying to gain new insights with it. 
Rikki, we are aware of the advantages of LDC. But first of all we 
want the program to run with DMD. After that, we will switch 
to LDC.

I have already introduced try-catch blocks on "Throwable" around 
all program parts, which unfortunately did not help. We also use 
logging, but that did not help either.

Moritz, thank you for the idea of checking the exit code. I have 
adjusted the Dockerfile accordingly: our code causes at least 
one segmentation fault. I hope to be able to identify the 
location with GDB.

Qznc, we just put your suggestion into practice. We hope to find 
out more with GDB now. We installed GDB in the Docker container and 
automated the launch. It should work; the test is running while 
I am writing this.

Ali, thanks for the tip about the OOM Killer. I did not know about 
it before. At the moment, the segmentation fault occurs before we 
even come close to a memory limit. However, I will keep this in 
mind for further work and testing.

Thank you all so much. We will now work with GDB and hopefully 
solve the problem.
Sep 14 2017
Daniel Kozak via Digitalmars-d <digitalmars-d puremagic.com> writes:
http://vibed.org/docs#handling-segmentation-faults
this should help

Sep 14 2017
Johan Engelen <j j.nl> writes:
On Friday, 15 September 2017 at 06:17:33 UTC, Thorsten Sommer 
wrote:
 Thank you very much for the different approaches. Vladimir, I 
 installed GDB today and am trying to gain new insights with it. 
 Rikki, we are aware of the advantages of LDC. But first of all 
 we want the program to run with DMD. After that, we will 
 switch to LDC.
Latest LDC (1.4.0) gives you AddressSanitizer, which can catch bad memory accesses and report them in a nice way. Use `-fsanitize=address` when compiling.

Caveat: it doesn't catch memory bugs involving GC-(de)allocated memory yet (only _very_ bad ones). But it does catch malloc'ed memory bugs and stack bugs.

https://github.com/google/sanitizers/wiki/AddressSanitizer

- Johan
Sep 16 2017
Swoorup Joshi <swoorupjoshi gmail.com> writes:
On Wednesday, 13 September 2017 at 10:20:48 UTC, Thorsten Sommer 
wrote:
 Dear Community,

 My student assistant and I have been working on an artificial 
 intelligence library in D for a while. This library is part of 
 my PhD thesis and is used to perform several experiments that 
 push the state of the art.

 [...]
I had the same issue trying to use the std.experimental.xml library:
* Ran an example
* Crashed in some POSIX C library call writing to a file
* Gave up, now looking at another programming language (Rust)
Sep 14 2017
Suliman <evermind live.ru> writes:
On Friday, 15 September 2017 at 06:22:01 UTC, Swoorup Joshi wrote:
 I had the same issue trying to use the std.experimental.xml 
 library.
What did you expect from an unofficial alpha package?
Sep 15 2017
Swoorup Joshi <swoorupjoshi gmail.com> writes:
On Friday, 15 September 2017 at 12:58:19 UTC, Suliman wrote:
 What did you expect from an unofficial alpha package?
That the experimental XML library is now abandoned by its author? Not much hope there.
Sep 16 2017
Daniel Kozak <kozzi11 gmail.com> writes:
https://github.com/dlang-community/discussions/issues/23#issuecomment-318331816


https://github.com/Kozzi11/experimental.xml

Sep 17 2017
apz28 <home home.com> writes:
On Friday, 15 September 2017 at 06:22:01 UTC, Swoorup Joshi wrote:
 I had the same issue trying to use the std.experimental.xml 
 library.
Try this XML package: https://github.com/apz28/dlang-xml

Cheers,
apz28
Sep 15 2017
Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 15 September 2017 at 06:22:01 UTC, Swoorup Joshi wrote:
 I had the same issue trying to use the std.experimental.xml 
 library.
my dom.d works :P
Sep 16 2017
Neia Neutuladh <neia ikeran.org> writes:
On Wednesday, 13 September 2017 at 10:20:48 UTC, Thorsten Sommer 
wrote:
 Besides the unit tests, the main program is now able to start up 
 but crashes after a while without any message at all. No stack 
 trace, no exception, nothing. Obviously, this makes it hard to 
 debug anything...
You mention you're using Docker. https://github.com/moby/moby/issues/11740 has some info on how to generate core files inside a Docker container. You should be able to load that up in gdb and see exactly what's going on.
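Here is a minimal sketch of that setup, with two caveats. First, `kernel.core_pattern` is a host-wide setting shared with all containers; on Ubuntu hosts it often pipes cores to apport instead of writing a file. Second, `enable_core_dumps`, `core_backtrace` and the image name in the comments are hypothetical.

```shell
#!/bin/sh
# Host side (comment only): lift the container's hard limit on core size.
#   docker run --ulimit core=-1 my-image
# Check where the kernel writes cores: cat /proc/sys/kernel/core_pattern

# Hypothetical helper for the container entrypoint: raise the soft core-size
# limit as far as the hard limit allows, before starting the program.
enable_core_dumps() {
    ulimit -S -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
}

# Hypothetical helper: print a backtrace from a core file with gdb.
#   $1 = the program binary, $2 = the core file it produced
core_backtrace() {
    gdb -q -batch -ex 'bt' "$1" "$2"
}
```

Call `enable_core_dumps` in the entrypoint before `exec ./app`. After a crash, `core_backtrace ./app /path/to/core` shows where it happened.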
Sep 15 2017
Thorsten Sommer <vektoren gmail.com> writes:
Thank you all. In the meantime, I found the cause: at one point in 
the code, null was used as a key in a map, i.e. an associative array.

It is really great that D has such a great community.
Sep 16 2017