digitalmars.D.bugs - [Issue 5488] New: Spawned threads hang in a way that suggests allocation or gc issue

d-bugmail puremagic.com (36/36) Jan 25 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488

d-bugmail puremagic.com (18/18) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
d-bugmail puremagic.com (9/9) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
d-bugmail puremagic.com (15/15) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
d-bugmail puremagic.com (93/93) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
d-bugmail puremagic.com (7/7) Feb 02 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
d-bugmail puremagic.com (14/14) Apr 22 2012 http://d.puremagic.com/issues/show_bug.cgi?id=5488

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488

           Summary: Spawned threads hang in a way that suggests allocation
                    or gc issue
           Product: D
           Version: D2
          Platform: x86
        OS/Version: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: adam_conner_sax yahoo.com



20:05:49 PST ---
Created an attachment (id=882)
code to demonstrate the issue described above

The attached program hangs more often than not during the second set of spawns
(using dmd 2.051 on OSX).  The thread functions do nothing but allocate a large
array and then exit.  In one case the array is an Array!double (from
std.container) and in the other it is a built-in double[].  In the second case,
a large enough array will cause the program to hang.  
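
(The attachment is not reproduced in this archive. What follows is only a
rough sketch of the kind of program described above, not the attached code.
The names nThreads and multiplier are taken from a later comment in this
thread; their values here, and the function names allocContainer,
allocBuiltin and runRound, are assumptions made for illustration.)

```d
import std.concurrency : Tid, receiveOnly, send, spawn, thisTid;
import std.container : Array;

enum nThreads = 4;         // assumed value, kept small for the sketch
enum multiplier = 10_000;  // assumed value; scales the allocation size

// Worker: allocate an Array!double of n elements, report its length, exit.
void allocContainer(size_t n, Tid owner)
{
    Array!double a;
    a.length = n;
    send(owner, a.length);
}

// Worker: allocate a built-in double[] of n elements, report its length, exit.
void allocBuiltin(size_t n, Tid owner)
{
    auto a = new double[](n);
    send(owner, a.length);
}

// Spawn nThreads workers, wait for one reply from each, and return the
// total number of elements they allocated.
size_t runRound(void function(size_t, Tid) worker)
{
    foreach (i; 0 .. nThreads)
    {
        size_t n = multiplier * (i + 1);
        spawn(worker, n, thisTid);
    }
    size_t total;
    foreach (i; 0 .. nThreads)
        total += receiveOnly!size_t();
    return total;
}

void main()
{
    import std.stdio : writeln;
    writeln("Array!double round: ", runRound(&allocContainer));
    writeln("double[] round:     ", runRound(&allocBuiltin));
}
```

Raising nThreads or multiplier increases GC pressure, which is presumably
what makes the hang more likely in the real test case.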

Sean Kelly has already done some investigating, quoting from his responses:

1) This one is weird, and doesn't appear related to 4307.  One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll.  Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons.  My
best guess is that pthread_mutex in OSX is trying to give ownership of the lock
to thread B, and since thread B is suspended it effectively blocks thread A
from acquiring it to resume execution after the GC cycle. 

2) After some testing, it looks like I was right.  I have a fix for this, but
it's far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll only
acquires the second.  I'll try to come up with something better.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Jan 25 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488


Sean Kelly <sean invisibleduck.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |sean invisibleduck.org
         AssignedTo|nobody puremagic.com        |sean invisibleduck.org



---
This one is weird, and doesn't appear related to 4307.  One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll. Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons.  My
best guess is that pthread_mutex in OSX is trying to give ownership of the lock
to thread B, and since thread B is suspended it effectively blocks thread A
from acquiring it to resume execution after the GC cycle.


Jan 31 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488




---
After some testing, it looks like I was right.  I have a fix for this, but it's
far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll only
acquires the second.  I'll try to come up with something better.
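
(For readers following along: the lock ordering described above might look
roughly like the sketch below. This is not the actual druntime change; the
names outerLock, listLock, withThreadList and resumeAllSketch are hypothetical,
and the sketch only shows which locks each path takes.)

```d
import core.sync.mutex : Mutex;

__gshared Mutex outerLock; // hypothetical: taken first by every path except resume
__gshared Mutex listLock;  // hypothetical: the lock guarding the thread list

shared static this()
{
    outerLock = new Mutex;
    listLock = new Mutex;
}

// Ordinary thread-list operations acquire both locks, always in this order.
void withThreadList(scope void delegate() op)
{
    outerLock.lock();
    scope (exit) outerLock.unlock();
    listLock.lock();
    scope (exit) listLock.unlock();
    op();
}

// The resume path acquires only the second lock.
void resumeAllSketch(scope void delegate() op)
{
    listLock.lock();
    scope (exit) listLock.unlock();
    op();
}

void main()
{
    int calls;
    withThreadList({ ++calls; });
    resumeAllSketch({ ++calls; });
    assert(calls == 2);
}
```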


Jan 31 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488




---
Okay, I decided to use a Mutex instead of a built-in object monitor for locking
the thread list, because this allows me to lock in thread_suspendAll() and hold
the lock until thread_resumeAll() completes.  This also allows me to remove
some busy waits I'd added to Thread.add() to avoid adding a thread or context
while a GC cycle was in progress.  Much neater and in theory it solves
everything.
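
(A minimal sketch of the idea, not the actual core.thread code: a
synchronized block releases its monitor when the block exits, whereas an
explicit core.sync.mutex.Mutex can be locked in one function and unlocked in
another. The names listLock, threadList, suspendAllSketch, resumeAllSketch
and addThread are hypothetical stand-ins.)

```d
import core.sync.mutex : Mutex;

__gshared Mutex listLock;   // hypothetical: guards threadList
__gshared int[] threadList; // stand-in for the runtime's list of threads

shared static this()
{
    listLock = new Mutex;
}

// Start of a collection: take the lock and return while still holding it.
void suspendAllSketch()
{
    listLock.lock();
    // ... every thread in threadList would be suspended here ...
}

// End of a collection: resume the threads, then release the lock taken above.
void resumeAllSketch()
{
    // ... every thread in threadList would be resumed here ...
    listLock.unlock();
}

// Adding a thread simply blocks on the lock while a collection is running,
// so no busy wait is needed.
void addThread(int id)
{
    listLock.lock();
    scope (exit) listLock.unlock();
    threadList ~= id;
}

void main()
{
    addThread(1);
    suspendAllSketch(); // lock acquired here ...
    resumeAllSketch();  // ... and released here
    addThread(2);
    assert(threadList == [1, 2]);
}
```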

That said, I'm still seeing a rare deadlock in the attached app.  This one
appears to be different, however, and the near-complete lack of usable debug
info in DMD binaries on OSX makes it hard to track down.  I'll add some
printfs and hope that turns up something.


Jan 31 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488




---
I've confirmed that this new deadlock isn't because the GC thread was blocked
acquiring the global thread list mutex, so I've fixed the issue this ticket was
created for.  However, it's starting to look like the Mach thread_suspend()
call doesn't play well with Posix mutexes.  What I think is happening is that a
thread is blocked on the GC mutex when a collection occurs.  The collection
completes and the mutex is released, but the thread being given the lock is
slow to resume and is missing the signal meant to notify it that the lock is
free.

This is all conjecture based on stack traces (which I'll include below) and
some printfs to confirm that core.thread isn't involved, but it seems
reasonable.  If true though, it could mean that the mechanism used to stop and
restart the world during a GC run on OSX is fundamentally unsound.  I'll see
about confirming the cause and go from there.



0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt









D3std11concurrency38__T6_spawnTkTkTS3std11concurrency3TidZ6_spawnFbPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid
()

D3std11concurrency37__T5spawnTkTkTS3std11concurrency3TidZ5spawnFPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid
()






(gdb) thread 2
[Switching to thread 2 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 3
[Switching to thread 3 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 4
[Switching to thread 4 (process 114)]
0xffff07b6 in __memcpy ()
(gdb) thread 5
[Switching to thread 5 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt








D3std11concurrency36__T4ListTS3std11concurrency7MessageZ4List3putMFS3std11concurrency7MessageZv
()

D3std11concurrency10MessageBox3putMFKS3std11concurrency7MessageZv ()

D3std11concurrency33__T5_sendTS3std11concurrency3TidZ5_sendFE3std11concurrency7MsgTypeS3std11concurrency3TidS3std11concurrency3TidZv
()







(gdb) thread 6
[Switching to thread 6 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
















Jan 31 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488




---
Okay, all issues related to this appear to have been fixed and changes checked
in.


Feb 02 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=5488


SomeDude <lovelydear mailmetrash.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lovelydear mailmetrash.com



PDT ---
Is the problem solved on Mac OS X?
This test runs on Win32 with 2.059 as long as the process takes less than
1.3 GB of RAM on my machine, i.e., no problem with nThreads = 40, but it hangs
if nThreads = 50 (probably because it can't allocate any more RAM). If
multiplier is reduced to 10_000, it runs fine with nThreads = 100.


Apr 22 2012
