
digitalmars.D.ldc - Memory corruption with -O3, but not -O2 (and not with DMD)

reply James Blachly <james.blachly gmail.com> writes:
Hi all,

First, as always, thanks for LDC2, without which we couldn't write
high-performance D software for our lab.

I've run into a problem wherein, after ~60,000 iterations of a loop, we 
get memory corruption, but only when building with LDC2 and -O3; there 
are no problems AFAICT with -O2, or when building optimized versions 
with DMD. -enable-inlining does not make a difference. None of this 
rules out a pointer or memory error on my part, but all seems well 
except with LDC2 -O3.

Debugging has been difficult because -O3 optimizes away a lot and thus 
lldb is not able to show me the tracking debugging variables I need to 
isolate the problematic code. Disassembling, I've found the register 
storing the pointer to the corrupt string; interestingly, the correct 
string appears just slightly lower on the heap (maybe 32 bytes IIRC).

The manifestation is slightly nondeterministic -- adding tracking 
variables and code makes the problem intermittent.

Indeed, I placed some simple guards (e.g.: if (pre_string != 
post_string) throw new Exception() ) near the place where the corrupt 
memory is manifest and sometimes the guard is triggered, while other 
times it is _not_, but the bad string shows up just a few statements 
later (inside a function).

I'm at my wits' end, so help or next steps are greatly appreciated. I can 
provide disassembly of whatever combination of -O/-O2/-O3 and 
triggering/nontriggering code blocks if it would be helpful.

If this should move to github, let me know.

Kind regards
Aug 17 2019
next sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
Dear James,

As mentioned by kinke elsewhere, this is pretty much impossible to track 
down from our end without more information.

Is the corruption deterministic across multiple runs of one particular 
executable? Putting a memory watchpoint on the address that gets 
corrupted might provide some extra clues as to where it comes from.

Having a look at the LLVM IR (-output-ll) might also be illuminating; I 
personally find it easier to read than assembly. In particular, you 
could use the LLVM `opt` tool to apply the -O3 passes one by one, 
compiling to object code, linking and testing every step of the way, and 
compare the IR before/after the pass that first introduces the crash. 
(The `bugpoint` tool has some support for this, but you might be quicker 
doing this manually.)
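
A minimal sketch of that pass-by-pass search, assuming LLVM's `-opt-bisect-limit` option for `opt` (which stops running optimization passes after the Nth) and assuming the failure reproduces reliably enough for a test script to detect. The names `app.bc`, `app`, `try_limit` and `./run_test.sh` are hypothetical placeholders. Plain `opt -O3` also does not include LDC's own D-specific passes, so it only approximates `ldc2 -O3`.

```shell
#!/bin/sh
# first_bad MAX CMD...: print the smallest N in 1..MAX for which "CMD N" fails.
# Assumes CMD succeeds for every N below some threshold, fails from that
# threshold on, and that "CMD MAX" fails.
first_bad() {
    max=$1; shift
    lo=1; hi=$max
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$@" "$mid" >/dev/null 2>&1; then
            lo=$(( mid + 1 ))   # still good: the culprit runs later
        else
            hi=$mid             # already bad: the culprit is mid or earlier
        fi
    done
    echo "$lo"
}

# Hypothetical build-and-test step: optimize with at most $1 passes,
# compile to object code, link, and run the reproducer.
# File names and ./run_test.sh are placeholders, not part of any real project.
try_limit() {
    opt -O3 -opt-bisect-limit="$1" app.bc -o app.opt.bc &&
    llc -filetype=obj app.opt.bc -o app.o &&
    ldc2 app.o -of=app &&
    ./run_test.sh ./app
}

# Usage, after `ldc2 -c -output-bc app.d` has produced app.bc:
#   first_bad 5000 try_limit
```

Once the first failing limit N is known, running `opt` with limits N-1 and N plus `-S` and diffing the two textual IR files should show what that pass changed. Since the corruption is nondeterministic, the test script probably needs to run the binary several times before declaring a limit good.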

Best regards,
David


Aug 20 2019
parent James Blachly <james.blachly gmail.com> writes:
Thank you, David. We will open up the repo soon, whether or not we get the bug nailed down. The corruption has been nondeterministic, which of course makes it all the more frustrating. I have not used the opt tool, nor had I considered examining the IR instead of the assembly -- thank you for that.
Aug 21 2019
prev sibling parent reply Kagamin <spam here.lot> writes:
On Saturday, 17 August 2019 at 20:33:08 UTC, James Blachly wrote:
 Am at my wits' end, so help or next steps are greatly 
 appreciated. I can provide disassembly of whatever combination 
 of -O/-O2/-O3 and triggering/nontriggering code blocks if it 
 would be helpful.
Did you check that only this block is responsible for it? Compile all code with -O2 and only this particular block with -O3 and see if it still happens.
Aug 20 2019
parent reply James Blachly <james.blachly gmail.com> writes:
I had never considered the possibility of optimizing different parts at different levels :-O I guess you just break it out into a separate .d/.o file and link?
Aug 21 2019
parent Kagamin <spam here.lot> writes:
On Thursday, 22 August 2019 at 00:34:12 UTC, James Blachly wrote:
 I had never considered the possibility of optimizing different 
 parts at different levels :-O  I guess you just break it out 
 into a separate .d/.o file and link?
You can start at file granularity.
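
A rough sketch of file-granularity mixing, assuming the suspect code already lives in its own module and the build is run from the project root. `build_mixed`, the `LDC` variable and the output name `app` are hypothetical; only the `-c`, `-O2`, `-O3` and `-of=` switches are taken from ldc2 itself.

```shell
#!/bin/sh
# Compiler to use; override LDC to point at a specific ldc2 binary.
LDC=${LDC:-ldc2}

# build_mixed SUSPECT.d OTHER.d...: compile SUSPECT.d at -O3 and every other
# module at -O2, then link all objects into ./app.
# Assumes source paths contain no spaces.
build_mixed() {
    suspect=$1; shift
    "$LDC" -c -O3 -of="${suspect%.d}.o" "$suspect" || return 1
    objs="${suspect%.d}.o"
    for src in "$@"; do
        "$LDC" -c -O2 -of="${src%.d}.o" "$src" || return 1
        objs="$objs ${src%.d}.o"
    done
    # $objs is left unquoted on purpose so it splits into separate arguments.
    "$LDC" -of=app $objs
}

# Usage (module names are placeholders):
#   build_mixed source/parser.d source/main.d source/util.d
```

If the corruption follows the -O3 module, the search can then narrow within that file; if it disappears, swapping which module gets -O3 is the next experiment. Templates instantiated from other modules are compiled into whichever module uses them, so the split is not perfectly clean.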
Aug 23 2019