digitalmars.D.bugs - [Issue 1282] New: Very strange GC problem, memory corruption
- d-bugmail puremagic.com (58/58) Jun 20 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (9/9) Jun 20 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (9/9) Jun 20 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (12/12) Jun 20 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (33/33) Jun 21 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (23/23) Jun 21 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
- d-bugmail puremagic.com (27/27) Jun 21 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1282
http://d.puremagic.com/issues/show_bug.cgi?id=1282

           Summary: Very strange GC problem, memory corruption
           Product: D
           Version: 1.016
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Keywords: wrong-code
          Severity: critical
          Priority: P2
         Component: Phobos
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: deewiant gmail.com

For most of the past 12 hours I've tried, desperately, to track down this bug and to find a minimal test case, without luck. Using Jascha Wetzel's excellent Ddbg, I've only managed to narrow my problems down to Tango's gc.basic.gcx and/or gc.basic.gcalloc modules, but beyond that, I don't really know.

The latest iteration, which I present here, causes an Access Violation in memmove.

All of the following affect whether the bug shows up, and how it shows up:

- compiler flags, both those used to compile the GC and those for the source itself
- the number of object files in the compilation: at one point, I had to have up to around 100 empty dummy modules compiled in just to keep the bug showing up
- the memory footprint of various structs used in the program, and sizes of arrays not even used at any point
- the precise x-y dimensions of the file (_bef.b98 in the .zip I link to below) which is loaded into an associative array

The last two factors especially lead me to believe that this bug might not be reproducible, which is why I'm filing it now, while it still reliably crashes on my machine, instead of coming back tomorrow and finding that nothing happens any longer.

Indeed, in this particular testcase, I've already lost the crash which originally led me to find this bug. I can still get it in my main project, though, so I might follow up on this tomorrow.
Various ways in which the bug has manifested itself:

- IBM PurifyPlus reported numerous VirtualFree() calls to invalid memory
- Access Violation in gc.gcx.Gcx.__invariant
- Access Violation in gc.gcx.Gcx.mark
- Access Violation in gc.gcx.Gc.fullCollect
- Access Violation in gc.gcx.Gc.mallocNoSync (or one of the NoSync methods, can't remember for sure which one)
- Access Violation in _memmove (the one I'm currently getting)
- a class reference suddenly becoming uint.max, in the middle of code which doesn't even know about the class existing (doesn't import the relevant modules)

It used to be the case that uncommenting line 68 in utils.d caused the bug to disappear, but not so with this memmove crash. I'll try to catch those other bugs tomorrow.

The code is Tango dependent, but since it's a GC bug I filed it under Phobos, here. I'll try to port the code to Phobos later this week. I'm using the SVN trunk of Tango (revision 2345), compiled with -g (replaced DFLAGS in lib\gc\basic\win32.mak).

Anybody who can shed some light, please do! I looked at Bug 72, and I'm willing to believe this has something to do with that.

Source package, with precompiled .exe (it's a Windows only issue anyway):
http://rapidshare.com/files/38390338/evil_bug.zip.html

--
Jun 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1282

Since Tango does have non-trivial changes to the gc code, the investigation should probably start there, unless you can finish the de-tangoization to show that the bug lies in Phobos's gc code.

Yes, having two different core libraries like this sucks. I really hope that at some point the deepest parts of Tango and Phobos can be unified. That's a discussion that shouldn't be had in the context of this bug report though.

--
Jun 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1282

Tango contains no significant changes to the real functional aspects of the GC (core memory allocation and scanning), so I suspect this bug likely exists in Phobos as well. However, since reproducibility is difficult at best, I imagine it's easier to work with the case provided. I'll see if I can figure out anything using Tango, and if so, I'll post a fix here so it can be applied to Phobos as well.

--
Jun 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1282

braddr puremagic.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|bugzilla digitalmars.com    |sean f4.ca

Thanks Sean.. that'd be a big help since I doubt Walter has played with Tango yet and would have to spend cycles on that before getting to the meat of the problem. Assigning to you for now, feel free to bump it back via the 'Reassign issue to default assignee' option if you can't find it.

--
Jun 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1282

deewiant gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

I found the problem, and it's my code: I'm deleting a pointer to a struct which hasn't been allocated with new. It's a remnant from when I used a class instead of a struct. It boils down to the following:

    struct S {}

    S* ps;

    void main() {
        S s;
        ps = &s;
        delete ps;
        // assign ps to something else and keep doing stuff...
    }

http://www.digitalmars.com/d/expression.html#DeleteExpression says: "If the garbage collector was not used to allocate the memory for the instance, undefined behavior will result."

I can't reproduce the problem after removing the delete, so I'll assume that was it: the GC allocates memory for a char[] array on top of deleted-but-not-newed memory, which happens to be on top of a class reference in a struct. This would explain why the class reference becomes 0xffffffff instead of, say, null: char.init is 0xff. Those Access Violations would just be harder-to-find symptoms of the same.

Of course, if I'm unlucky, something else causes the problem and the delete is just one of those unrelated, yet relevant, lines, but I don't think so.

My bad! Nothing to see here! Unless there's a way for the GC to stop this from ever happening accidentally? Couldn't it know which areas in memory it has allocated?

--
Jun 21 2007
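The 0xffffffff observation in the post above can be checked in isolation. The following is a minimal sketch, not code from the original report: it assumes a present-day D2 compiler (the thread itself concerns D 1.016 with Tango), and the variable names are made up for illustration. It only shows that four default-initialized chars, read back as a 32-bit value, give uint.max; it says nothing about how the memory came to overlap.

```d
import core.stdc.string : memcpy;
import std.stdio : writefln;

void main()
{
    // D default-initializes char to 0xFF, an invalid UTF-8 code unit.
    static assert(char.init == 0xFF);

    // Four default-initialized chars, standing in for the first bytes
    // of a freshly allocated char[] (hypothetical stand-in).
    char[4] fresh;

    // Read the same four bytes back as a 32-bit value, the way a
    // 32-bit class reference sharing that memory would be read.
    uint overlaid;
    memcpy(&overlaid, fresh.ptr, overlaid.sizeof);

    assert(overlaid == uint.max);
    writefln("0x%08x", overlaid);
}
```

A null reference would need those bytes to be zero, which is why an all-ones value points toward char data rather than a cleared pointer.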
http://d.puremagic.com/issues/show_bug.cgi?id=1282

The GC does know whether an address is in a pool it manages. For deleting a typical memory block (i.e. not a class), _d_delmemory() is called. The code for this is very simple:

    if (*p)
    {
        gc_free(*p);
        *p = null;
    }

So on to gc_free() we go... the routine is freeNoSync, and this is the first thing it does:

    pool = gcx.findPool(p);
    if (!pool)
        return;

With this in mind, I don't understand how removing the delete could have solved your problems, no matter what the spec says. The only effect should have been that ps is set to null. To be sure, I ran your second (short) sample program through a debugger and confirmed that this is indeed what's happening. So perhaps something more specific to your program was causing the actual problem, and this was merely a contributing factor?

In any case, keep me informed if the problem resurfaces.

--
Jun 21 2007
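The guard described in the post above can be illustrated with a toy model. This is only a sketch under stated assumptions: Pool and ToyGc are hypothetical names invented here, addresses are plain integers, and nothing below is taken from the actual gcx source. It is written for a present-day D2 compiler.

```d
import std.stdio : writeln;

// Hypothetical stand-in for a GC pool: a half-open address range [base, top).
struct Pool
{
    size_t base;
    size_t top;
}

// Hypothetical toy collector that only tracks which ranges it owns.
struct ToyGc
{
    Pool[] pools;

    // Returns the pool containing addr, or null if no pool owns it.
    Pool* findPool(size_t addr)
    {
        foreach (i; 0 .. pools.length)
            if (addr >= pools[i].base && addr < pools[i].top)
                return &pools[i];
        return null;
    }

    // Mirrors the early return quoted above: an address outside
    // every pool is ignored. Returns true only if it was owned.
    bool free(size_t addr)
    {
        if (findPool(addr) is null)
            return false; // not ours, nothing is touched
        // A real collector would release the block here.
        return true;
    }
}

void main()
{
    auto gc = ToyGc([Pool(0x1000, 0x2000), Pool(0x8000, 0x9000)]);

    assert(!gc.free(0x0500)); // e.g. a stack address: ignored
    assert(gc.free(0x1800));  // inside the first pool: accepted
    writeln("ok");
}
```

In this model a pointer to a stack variable falls through the guard untouched, which matches the behaviour reported for the short sample program.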
http://d.puremagic.com/issues/show_bug.cgi?id=1282

Pseudocode, closer to the original situation:

    struct S { /* contains pointers and classes */ }

    S* ps;
    S[] ss;

    void main() {
        ss ~= S();

        for (;;) {
            for (size_t i = ss.length; i-- > 0;) {
                ps = &ss[i];

                // do stuff, use ps, mess with the heap
                // heap allocation, dynamic array usage aplenty

                if (condition) {
                    delete ps;
                    ss = ss[0..i] ~ ss[i+1..$];
                    continue;
                }
            }
        }
    }

I know next to nothing about the inner workings of the GC, but maybe the above can clarify why something went wrong? If not, feel free to take a detailed look at what happens when the delete in the main loop is executed (in ccbi.d in the evil_bug.zip I uploaded). Maybe you can find a deeper problem.

--
Jun 21 2007
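One difference between the pseudocode above and the earlier short sample is that ps here points at an element of ss, a dynamic array, rather than at a stack variable. The sketch below is not a diagnosis of the reported crash. It assumes a present-day D2 compiler and uses core.memory.GC.addrOf from current druntime, which postdates the D 1.016 and Tango setup in this thread, to show that the two kinds of address are treated differently by an ownership lookup. It also shows the element being dropped by slicing alone, with no delete. The struct layout and values are made up for illustration.

```d
import core.memory : GC;

struct S
{
    int value;
}

void main()
{
    S onStack;
    S[] ss;
    ss ~= S(1);
    ss ~= S(2);
    ss ~= S(3);

    // A stack address is not owned by the collector, so the lookup fails.
    assert(GC.addrOf(&onStack) is null);

    // An array element lives inside a GC-managed block, so the lookup succeeds.
    S* ps = &ss[1];
    assert(GC.addrOf(ps) !is null);

    // Drop element i by slicing and concatenating; the old block is
    // left for the collector instead of being deleted by hand.
    size_t i = 1;
    ss = ss[0 .. i] ~ ss[i + 1 .. $];
    assert(ss.length == 2 && ss[0].value == 1 && ss[1].value == 3);
}
```

Whether the original D1 runtime acted on such an interior pointer is not established in this thread; the sketch only shows that a pool-membership check alone would not reject it.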