digitalmars.D.bugs - [Issue 1282] New: Very strange GC problem, memory corruption

d-bugmail puremagic.com writes:

           Summary: Very strange GC problem, memory corruption
           Product: D
           Version: 1.016
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Keywords: wrong-code
          Severity: critical
          Priority: P2
         Component: Phobos
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: deewiant gmail.com

For most of the past 12 hours I've tried, desperately, to track down this bug
and to find a minimal test case, without luck.

Using Jascha Wetzel's excellent Ddbg, I've only managed to narrow my problems
down to Tango's gc.basic.gcx and/or gc.basic.gcalloc modules, but beyond that,
I don't really know. The latest iteration, which I present here, causes an
Access Violation in memmove. All of the following affect whether the bug shows
up, and how it shows up:

- compiler flags, both those used to compile the GC and those for the source
- the number of object files in the compilation: at one point, I had to have up
to around 100 empty dummy modules compiled in just to keep the bug showing up
- the memory footprint of various structs used in the program, and sizes of
arrays not even used at any point
- the precise x-y dimensions of the file (_bef.b98 in the .zip I link to below)
which is loaded into an associative array

The last two factors in particular lead me to believe that this bug might not
be reproducible, which is also why I'm filing it now, while it still reliably
crashes on my machine, instead of coming back tomorrow and finding that nothing
happens any longer.

Indeed, in this particular testcase, I've already lost the crash which
originally led me to find this bug. I can still get it in my main project,
though, so I might follow up on this tomorrow.

Various ways in which the bug has manifested itself:

- IBM PurifyPlus reported numerous VirtualFree() calls to invalid memory
- Access Violation in gc.gcx.Gcx.__invariant
- Access Violation in gc.gcx.Gcx.mark
- Access Violation in gc.gcx.Gc.fullCollect
- Access Violation in gc.gcx.Gc.mallocNoSync (or one of the NoSync methods,
can't remember for sure which one)
- Access Violation in _memmove (the one I'm currently getting)
- a class reference suddenly becoming uint.max, in the middle of code which
doesn't even know about the class existing (doesn't import the relevant

It used to be the case that uncommenting line 68 in utils.d caused the bug to
disappear, but not so with this memmove crash. I'll try to catch those other
bugs tomorrow.

The code is Tango dependent, but since it's a GC bug I filed it under Phobos,
here. I'll try to port the code to Phobos later this week.

I'm using the SVN trunk of Tango (revision 2345), compiled with -g (replaced
DFLAGS in lib\gc\basic\win32.mak).

Anybody who can shed some light, please do! I looked at Bug 72, and I'm willing
to believe this has something to do with that.

Source package, with precompiled .exe (it's a Windows only issue anyway):

Jun 20 2007
d-bugmail puremagic.com writes:

------- Comment #1 from braddr puremagic.com  2007-06-20 15:35 -------
Since Tango does have non-trivial changes to the gc code, the investigation
should probably start there, unless you can finish the de-tangoization to show
that the bug lies in Phobos's gc code.

Yes, having two different core libraries like this sucks.  I really hope that
at some point the deepest parts of Tango and Phobos can be unified.  That's a
discussion that shouldn't be had in the context of this bug report though.

Jun 20 2007
d-bugmail puremagic.com writes:

------- Comment #2 from sean f4.ca  2007-06-20 16:23 -------
Tango contains no significant changes to the real functional aspects of the GC
(core memory allocation and scanning), so I suspect this bug likely exists in
Phobos as well.  However, since reproducibility is difficult at best, I imagine
it's easier to work with the case provided.  I'll see if I can figure out
anything using Tango, and if so, I'll post a fix here so it can be applied to
Phobos as well.

Jun 20 2007
d-bugmail puremagic.com writes:

braddr puremagic.com changed:

           What    |Removed                     |Added
         AssignedTo|bugzilla digitalmars.com    |sean f4.ca

------- Comment #3 from braddr puremagic.com  2007-06-20 16:54 -------
Thanks Sean.. that'd be a big help since I doubt Walter has played with Tango
yet and would have to spend cycles on that before getting to the meat of the

Assigning to you for now, feel free to bump it back via the 'Reassign issue to
default assignee' option if you can't find it.

Jun 20 2007
d-bugmail puremagic.com writes:

deewiant gmail.com changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

------- Comment #4 from deewiant gmail.com  2007-06-21 03:29 -------
I found the problem, and it's my code: I'm deleting a pointer to a struct which
hasn't been allocated with new. It's a remnant from when I used a class instead
of a struct. It boils down to the following:

struct S {}
S* ps;

void main() {
        S s;
        ps = &s;
        delete ps; // undefined behavior: s lives on the stack, not the GC heap

        // assign ps to something else and keep doing stuff...
}

http://www.digitalmars.com/d/expression.html#DeleteExpression says: "If the
garbage collector was not used to allocate the memory for the instance,
undefined behavior will result."

I can't reproduce the problem after removing the delete, so I'll assume that
was it: the GC allocates memory for a char[] array on top of
deleted-but-not-newed memory, which happens to be on top of a class reference
in a struct. This would explain why the class reference becomes 0xffffffff
instead of, say, null: char.init is 0xff.
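
The 0xffffffff observation can be checked in isolation. The following is a minimal sketch, not taken from the original project: readAsUint is a hypothetical helper that reinterprets the first four bytes of a char[] as a uint, which is roughly what happens when a default-initialized char[] ends up overlapping a 32-bit class reference.

```d
// Hypothetical helper: reinterpret the first four chars of a buffer as a uint.
uint readAsUint(char[] bytes)
{
    assert(bytes.length >= uint.sizeof);
    return *cast(uint*) bytes.ptr;
}

void main()
{
    char[4] buf;                  // default-initialized, so every element is char.init

    assert(char.init == 0xFF);    // char.init is 0xFF, not 0
    assert(buf[0] == char.init);

    // Four 0xFF bytes read back as one 32-bit value give uint.max (0xFFFFFFFF).
    assert(readAsUint(buf[]) == uint.max);
}
```

This only demonstrates the bit pattern; it says nothing about where the GC actually placed the array in the crashing program.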

Those Access Violations would just be harder-to-find symptoms of the same.

Of course, if I'm unlucky, something else causes the problem and the delete is
just one of those unrelated, yet relevant, lines, but I don't think so.

My bad! Nothing to see here!

Unless there's a way for the GC to stop this from ever happening accidentally?
Couldn't it know which areas in memory it has allocated?

Jun 21 2007
d-bugmail puremagic.com writes:

------- Comment #5 from sean f4.ca  2007-06-21 13:44 -------
The GC does know whether an address is in a pool it manages.  For deleting a
typical memory block (ie. not a class), _d_delmemory() is called.  The code for
this is very simple:

    if (*p)
    {
        gc_free(*p);
        *p = null;
    }

So on to gc_free() we go...  the routine is freeNoSync, and this is the first
thing it does:

        pool = gcx.findPool(p);
        if (!pool)      // not one of ours
            return;     // ignore

With this in mind, I don't understand how removing the delete could have solved
your problems, no matter what the spec says.  The only effect should have been
that ps is set to null.  To be sure, I ran your second (short) sample program
through a debugger and confirmed that this is indeed what's happening.  So
perhaps something more specific to your program was causing the actual problem,
and this was merely a contributing factor?  In any case, keep me informed if
the problem resurfaces.
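
For illustration, the pool lookup described above can be modelled as a simple address-range test. This is a simplified sketch under stated assumptions: the Pool struct, its base and top fields, and this findPool signature are hypothetical stand-ins, not the actual gcx code.

```d
// Hypothetical, simplified model of a GC pool: a half-open address range.
struct Pool
{
    void* base;   // first byte owned by the pool
    void* top;    // one past the last byte owned by the pool
}

// Returns the pool that contains p, or null if no pool owns that address.
Pool* findPool(Pool[] pools, void* p)
{
    foreach (ref pool; pools)
    {
        if (p >= pool.base && p < pool.top)
            return &pool;
    }
    return null;
}

void main()
{
    ubyte[64] managed;   // stands in for memory owned by a pool
    int onStack;         // memory that no pool owns

    Pool[] pools = [Pool(managed.ptr, managed.ptr + managed.length)];

    assert(findPool(pools, &managed[10]) !is null);  // inside a pool
    assert(findPool(pools, &onStack) is null);       // a free() would ignore this
}
```

Under this model, freeing the address of a stack variable finds no pool and returns early, which matches the behaviour observed for the short sample program.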

Jun 21 2007
d-bugmail puremagic.com writes:

------- Comment #6 from deewiant gmail.com  2007-06-21 14:45 -------
Pseudocode, closer to the original situation:

struct S { /* contains pointers and classes */ }
S* ps;
S[] ss;

void main() {
        ss ~= S();
        for (;;) {
                for (size_t i = ss.length; i-- > 0;) {
                        ps = &ss[i];

                        // do stuff, use ps, mess with the heap
                        // heap allocation, dynamic array usage aplenty

                        if (condition) {
                                delete ps;
                                ss = ss[0..i] ~ ss[i+1..$];
                        }
                }
        }
}

I know next to nothing about the inner workings of the GC, but maybe the above
can clarify why something went wrong?

If not, feel free to take a detailed look at what happens when the delete in
the main loop is executed (in ccbi.d in the evil_bug.zip I uploaded). Maybe you
can find a deeper problem.
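
For comparison, here is a minimal sketch of the removal step without the delete, which is the change reported in comment #4 to make the crash go away. The removeAt helper and the id field are hypothetical and only serve the example; the real S contains pointers and class references.

```d
struct S { int id; }   // hypothetical stand-in for the real struct

// Returns a new array without element i; the old storage is left to the GC.
S[] removeAt(S[] items, size_t i)
{
    return items[0 .. i] ~ items[i + 1 .. $];
}

void main()
{
    S[] ss = [S(1), S(2), S(3)];

    ss = removeAt(ss, 1);          // drop the middle element, no delete involved

    assert(ss.length == 2);
    assert(ss[0].id == 1 && ss[1].id == 3);
}
```

The concatenation allocates a fresh array, so nothing is freed manually and no pointer into the old array is handed to the GC for deletion.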

Jun 21 2007