digitalmars.D.bugs - [Issue 5488] New: Spawned threads hang in a way that suggests allocation or gc issue
http://d.puremagic.com/issues/show_bug.cgi?id=5488

           Summary: Spawned threads hang in a way that suggests allocation
                    or gc issue
           Product: D
           Version: D2
          Platform: x86
        OS/Version: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: adam_conner_sax@yahoo.com

--- Comment #0 from Adam Conner-Sax <adam_conner_sax@yahoo.com> 2011-01-25 20:05:49 PST ---
Created an attachment (id=882)
 code to demonstrate the issue described above

The attached program hangs more often than not during the second set of
spawns (using dmd 2.051 on OS X). The thread functions do nothing but
allocate a large array and then exit. In one case the array is an
Array!double (from std.container); in the other it is a built-in double[].
In the second case, a large enough array will cause the program to hang.

Sean Kelly has already done some investigating; quoting from his responses:

1) This one is weird, and doesn't appear related to 4307. One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll. Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons.
My best guess is that pthread_mutex on OS X is trying to give ownership of
the lock to thread B, and since thread B is suspended, it effectively blocks
thread A from acquiring it to resume execution after the GC cycle.

2) After some testing, it looks like I was right. I have a fix for this, but
it's far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll
only acquires the second. I'll try to come up with something better.
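The attachment itself isn't reproduced in the archive, so below is a minimal
sketch of the kind of test described above. The knob names nThreads and
multiplier are borrowed from comment #6 further down this thread; the values
and the rest of the structure are guesses, not the attached source:

import core.thread : thread_joinAll;
import std.concurrency : spawn;
import std.container : Array;

enum nThreads = 50;          // knob names from comment #6; values guessed
enum multiplier = 100_000;

// First kind of worker: allocate a large Array!double, then exit.
void allocContainer()
{
    Array!double a;
    a.length = multiplier;
}

// Second kind of worker: allocate a large built-in double[], then exit.
// The reported hang occurs during this second set of spawns, once the
// array is large enough.
void allocBuiltin()
{
    auto a = new double[multiplier];
}

void main()
{
    // First set of spawns: std.container arrays.
    foreach (i; 0 .. nThreads)
        spawn(&allocContainer);
    thread_joinAll();

    // Second set of spawns: built-in arrays.
    foreach (i; 0 .. nThreads)
        spawn(&allocBuiltin);
    thread_joinAll();        // per the report, often never returns
}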
http://d.puremagic.com/issues/show_bug.cgi?id=5488

Sean Kelly <sean@invisibleduck.org> changed:

           What    |Removed              |Added
----------------------------------------------------------------------------
            Status |NEW                  |ASSIGNED
                CC |                     |sean@invisibleduck.org
        AssignedTo |nobody@puremagic.com |sean@invisibleduck.org

--- Comment #1 from Sean Kelly <sean@invisibleduck.org> 2011-01-31 11:16:39 PST ---
This one is weird, and doesn't appear related to 4307. One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll. Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons.
My best guess is that pthread_mutex on OS X is trying to give ownership of
the lock to thread B, and since thread B is suspended, it effectively blocks
thread A from acquiring it to resume execution after the GC cycle.
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Comment #2 from Sean Kelly <sean@invisibleduck.org> 2011-01-31 11:16:51 PST ---
After some testing, it looks like I was right. I have a fix for this, but
it's far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll
only acquires the second. I'll try to come up with something better.
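To make the scheme concrete, a sketch with invented names (listLock1,
listLock2) of how the two locks could fit together. The actual diff isn't
shown in this ticket, so the details, in particular suspendAll holding the
first lock across the stopped-world window, are a guess at the intent:

import core.sync.mutex : Mutex;

__gshared Mutex listLock1, listLock2;   // invented names, illustration only

shared static this()
{
    listLock1 = new Mutex;
    listLock2 = new Mutex;
}

// Everything except resumeAll takes both locks, in order.
void addThreadSketch()
{
    listLock1.lock();
    scope (exit) listLock1.unlock();
    listLock2.lock();
    scope (exit) listLock2.unlock();
    // ... mutate the global thread list ...
}

// Guessed reading: suspendAll keeps the first lock for the whole
// stopped-world window, so every other thread is parked on listLock1 and
// listLock2 stays free for resumeAll -- the kernel can never hand
// listLock2 to a suspended thread.
void suspendAllSketch()
{
    listLock1.lock();
    listLock2.lock();
    // ... suspend all registered threads ...
    listLock2.unlock();       // listLock1 stays held until resumeAll
}

void resumeAllSketch()
{
    listLock2.lock();         // only the second lock, per comment #2
    // ... resume all registered threads ...
    listLock2.unlock();
    listLock1.unlock();       // same (GC) thread releases what it took
}

void main()
{
    suspendAllSketch();
    resumeAllSketch();
    addThreadSketch();
}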
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Comment #3 from Sean Kelly <sean@invisibleduck.org> 2011-01-31 11:28:35 PST ---
Okay, I decided to use a Mutex instead of a built-in object monitor for
locking the thread list, because this allows me to lock in
thread_suspendAll() and hold the lock until thread_resumeAll() completes.
This also allows me to remove some busy waits I'd added to Thread.add() to
avoid adding a thread or context while a GC cycle was in progress. Much
neater, and in theory it solves everything.

That said, I'm still seeing a rare occasional deadlock in the attached app.
This one appears to be different, however, and the near-complete lack of
usable debug info in DMD binaries on OS X is complicating figuring this one
out. I'll add some printfs and hope that turns up something.
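The reason an explicit Mutex enables this where a built-in monitor does not:
a synchronized block scopes lock and unlock to a single lexical block, while
core.sync.mutex.Mutex can be locked in one function and unlocked in another.
A sketch of the shape described above, with invented names rather than the
actual druntime code:

import core.sync.mutex : Mutex;

__gshared Mutex threadListLock;   // invented name

shared static this() { threadListLock = new Mutex; }

// Taken in suspendAll and held until resumeAll completes, so Thread.add()
// can simply block on the same lock instead of busy-waiting for the GC
// cycle to end.
void suspendAllSketch()
{
    threadListLock.lock();
    // ... iterate the thread list, suspending each thread ...
}

void resumeAllSketch()
{
    // ... iterate the thread list, resuming each thread ...
    threadListLock.unlock();
}

void addThreadSketch()
{
    threadListLock.lock();
    scope (exit) threadListLock.unlock();
    // ... append to the thread list; blocks while the world is stopped ...
}

void main()
{
    suspendAllSketch();
    resumeAllSketch();
    addThreadSketch();
}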
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Comment #4 from Sean Kelly <sean@invisibleduck.org> 2011-01-31 12:02:19 PST ---
I've confirmed that this new deadlock isn't because the GC thread was blocked
acquiring the global thread list mutex, so I've fixed the issue this ticket
was created for. However, it's starting to look like the Mach
thread_suspend() call doesn't play well with POSIX mutexes. What I think is
happening is that a thread is blocked on the GC mutex when a collection
occurs. The collection completes and the mutex is released, but the thread
being given the lock is slow to resume and is missing the signal meant to
notify it that the lock is free. This is all conjecture based on stack traces
(which I'll include below) and some printfs to confirm that core.thread isn't
involved, but it seems reasonable. If true, though, it could mean that the
mechanism used to stop and restart the world during a GC run on OS X is
fundamentally unsound. I'll see about confirming the cause and go from there.

0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
#0  0x984d6142 in semaphore_wait_signal_trap ()
#1  0x984dbc46 in pthread_mutex_lock ()
#2  0x0000d44e in _d_monitor_lock ()
#3  0x00010659 in _d_monitorenter ()
#4  0x0001a789 in D2gc3gcx2GC6callocMFkkPkZPv ()
#5  0x00019a78 in gc_calloc ()
#6  0x0001d5f8 in _aaGetX ()
#7  0x0001d52c in _aaGet ()
#8  0x0000a4ad in D3std11concurrency38__T6_spawnTkTkTS3std11concurrency3TidZ6_spawnFbPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid ()
#9  0x0000a3cc in D3std11concurrency37__T5spawnTkTkTS3std11concurrency3TidZ5spawnFPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid ()
#10 0x000026af in _Dmain ()
#11 0x0001eb77 in D2rt6dmain24mainUiPPaZi7runMainMFZv ()
#12 0x0001eafe in D2rt6dmain24mainUiPPaZi7tryExecMFMDFZvZv ()
#13 0x0001ebbf in D2rt6dmain24mainUiPPaZi6runAllMFZv ()
#14 0x0001eafe in D2rt6dmain24mainUiPPaZi7tryExecMFMDFZvZv ()
#15 0x0001ea8f in main ()
(gdb) thread 2
[Switching to thread 2 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 3
[Switching to thread 3 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 4
[Switching to thread 4 (process 114)]
0xffff07b6 in __memcpy ()
(gdb) thread 5
[Switching to thread 5 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
#0  0x984d6142 in semaphore_wait_signal_trap ()
#1  0x984dbc46 in pthread_mutex_lock ()
#2  0x0000d44e in _d_monitor_lock ()
#3  0x00010659 in _d_monitorenter ()
#4  0x0001a53c in D2gc3gcx2GC6mallocMFkkPkZPv ()
#5  0x000199f3 in gc_qalloc ()
#6  0x0001f7c0 in _d_newarrayiT ()
#7  0x000272cd in D3std11concurrency36__T4ListTS3std11concurrency7MessageZ4List3putMFS3std11concurrency7MessageZv ()
#8  0x00026efe in D3std11concurrency10MessageBox3putMFKS3std11concurrency7MessageZv ()
#9  0x000276a9 in D3std11concurrency33__T5_sendTS3std11concurrency3TidZ5_sendFE3std11concurrency7MsgTypeS3std11concurrency3TidS3std11concurrency3TidZv ()
#10 0x00026b87 in D3std11concurrency12_staticDtor2FZv ()
#11 0x00026a63 in _D3std11concurrency9__moddtorFZv ()
#12 0x00010596 in _moduleTlsDtor ()
#13 0x000105bf in rt_moduleTlsDtor ()
#14 0x00014bab in thread_entryPoint ()
#15 0x9850385d in _pthread_start ()
#16 0x985036e2 in thread_start ()
(gdb) thread 6
[Switching to thread 6 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
#0  0x984d6142 in semaphore_wait_signal_trap ()
#1  0x984dbc46 in pthread_mutex_lock ()
#2  0x0000d44e in _d_monitor_lock ()
#3  0x00010659 in _d_monitorenter ()
#4  0x0001a53c in D2gc3gcx2GC6mallocMFkkPkZPv ()
#5  0x00019980 in gc_malloc ()
#6  0x0001ed7f in _d_newclass ()
#7  0x0002e3c6 in D3std8datetime3UTC12_staticCtor6FZv ()
#8  0x0002e31f in _D13datetime.11619__modctorFZv ()
#9  0x000104f4 in _moduleTlsCtor ()
#10 0x000105af in rt_moduleTlsCtor ()
#11 0x00014b8c in thread_entryPoint ()
#12 0x9850385d in _pthread_start ()
#13 0x985036e2 in thread_start ()
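The conjecture is testable in isolation. Below is a sketch of such an
experiment, OS X only: thread_suspend, thread_resume, and
pthread_mach_thread_np are the standard Mach/pthread calls, declared by
hand, but the test logic and its names are illustrative assumptions, not
anything from this ticket.

import core.atomic : atomicLoad, atomicStore;
import core.sys.posix.pthread;
import core.thread : Thread;
import core.time : msecs;

// Standard Mach calls, declared by hand (OS X only).
extern (C) nothrow @nogc
{
    alias mach_port_t = uint;
    mach_port_t pthread_mach_thread_np(pthread_t);
    int thread_suspend(mach_port_t);
    int thread_resume(mach_port_t);
}

__gshared pthread_mutex_t gcLock;   // stand-in for the GC's monitor
shared bool acquired;

void main()
{
    pthread_mutex_init(&gcLock, null);
    pthread_mutex_lock(&gcLock);          // "a collection is in progress"

    auto t = new Thread({
        pthread_mutex_lock(&gcLock);      // blocks, like threads 5/6 above
        atomicStore(acquired, true);
        pthread_mutex_unlock(&gcLock);
    });
    t.start();
    Thread.sleep(100.msecs);              // let the worker block on the mutex

    auto port = pthread_mach_thread_np(t.id);
    thread_suspend(port);                 // "stop the world"
    pthread_mutex_unlock(&gcLock);        // collection ends; lock released
    Thread.sleep(100.msecs);              // handoff happens (or is lost) here
    thread_resume(port);                  // "restart the world"

    // If the conjecture holds, this join can hang: the wakeup intended for
    // the worker arrived while it was suspended and was missed.
    t.join();
    assert(atomicLoad(acquired));
}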
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Comment #5 from Sean Kelly <sean@invisibleduck.org> 2011-02-02 12:10:21 PST ---
Okay, all issues related to this appear to have been fixed and the changes
checked in.
http://d.puremagic.com/issues/show_bug.cgi?id=5488

SomeDude <lovelydear@mailmetrash.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC |        |lovelydear@mailmetrash.com

--- Comment #6 from SomeDude <lovelydear@mailmetrash.com> 2012-04-22 15:46:58 PDT ---
Is the problem solved on Mac OS X?

This test runs on Win32 with 2.059 as long as the process takes less than
1.3 GB of RAM on my machine, i.e. no problem with nThreads = 40, but it
hangs if nThreads = 50 (probably because it can't allocate any more RAM).
If multiplier is reduced to 10_000, it runs fine with nThreads = 100.