
digitalmars.D - phobos std.threads & signal problem

After switching YAWR/tools over to a pause/resume-based threadpool to reduce
idle CPU load, I've repeatedly run into odd program hangs, resulting in stack
traces like the following:

#0  0xb7fb2410 in __kernel_vsyscall ()
#1  0xb7ed97ce in sem_wait GLIBC_2.0 () from /lib/libpthread.so.0
#2  0x08140eed in _D3std6thread6Thread8pauseAllFZv ()
    at ../../../gcc-4.1.2/libphobos/std/thread.d:879
#3  0x0815a903 in _D3gcx3Gcx11fullcollectMFPvZk (this=0x81ab020,
    stackTop=0xbff5de88) at ../../../gcc-4.1.2/libphobos/internal/gc/gcx.d:1604

Let me give some context for the critical frame #2:

 static void pauseAll()
 [...]
 synchronized (threadLock)
 [...]
 // bla bla pause all threads using SIGUSR1 and increment a counter for each
 // thread
 [...]
 // Wait for each paused thread to acknowledge
 while (npause--)
 {
   flagSuspend.wait();
 }
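
The loop above waits once per counted thread, so it only finishes if every one of those threads acknowledges. A minimal sketch of that failure mode, assuming a current D compiler and druntime's core.sync.semaphore (which postdates this post); missingAcks and its parameters are hypothetical, not Phobos code, and the timeout exists only so that the sketch terminates:

```d
import core.stdc.stdio : printf;
import core.sync.semaphore : Semaphore;
import core.time : msecs;

// Hypothetical model of the acknowledgement loop: `expected` threads were
// counted, but only `delivered` handlers ran and posted the semaphore.
// Returns how many waits gave up; the real loop has no timeout and would hang.
uint missingAcks(uint expected, uint delivered)
{
    auto flagSuspend = new Semaphore(0);
    foreach (i; 0 .. delivered)
        flagSuspend.notify(); // stands in for a signal handler acknowledging

    uint missing = 0;
    uint npause = expected;
    while (npause--)
    {
        if (!flagSuspend.wait(50.msecs)) // false means the wait timed out
            ++missing;
    }
    return missing;
}

void main()
{
    printf("all acks arrive: %u missing\n", missingAcks(3, 3));
    printf("one ack is lost: %u missing\n", missingAcks(3, 2));
}
```

With an untimed wait(), as in the excerpt, a single missing acknowledgement is enough to block the caller forever, which matches frame #1 of the backtrace sitting in sem_wait.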

From a cursory look at std/thread.d, it seems that although there is a threadLock object, pause and resume do not _use_ it, which limits its usefulness. Since this branch of pauseAll doesn't call them, I don't see why they need to bypass the lock, and that might be related to the problem. More on this later.

If one looks at the backtrace, it seems like the signal handler for SIGUSR1 isn't being called often enough, which is strange, to say the least. (That would leave the flagSuspend semaphore loop hanging, which is what's happening here.)

A theory: std.thread's pauseAll does not check whether the threads it tries to stop are already paused. What if a paused thread gets another SIGUSR1? One might think that the signals would be queued. One might be right. *MOST* of the time. Consider, if you will, the following code.

 import std.stdio;

 extern(C) {
         uint sleep(uint secs);
         void function(int) signal(int signum, void function(int) handler);
         const SIGUSR1=10;
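         // pthread_t is an unsigned long on Linux/glibc; size_t has the same
         // width there, whereas int would truncate it on 64-bit systems
         alias size_t pthread_t;
         int pthread_kill(pthread_t thread, int sig);
         pthread_t pthread_self();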
 }

 int count;
 extern(C) void test(int x) {
         writefln("\tThis is test the ", count++, "th; blocking for 2s");
         sleep(2);
         writefln("\tExiting");
 }

 import std.thread;
 void main() {
         signal(SIGUSR1, &test);
         auto start=pthread_self();
         (new Thread({
                 sleep(2); writefln("1: attempting to usr1 main");
                 pthread_kill(start, SIGUSR1); writefln("1: done"); return 0;
         })).start();
         (new Thread({
                 sleep(3); writefln("2: attempting to usr1 main");
                 pthread_kill(start, SIGUSR1); writefln("2: done"); return 0;
         })).start();
         sleep(1);
         writefln("0: attempting to self-usr1");
         pthread_kill(start, SIGUSR1);
         writefln("0: done, count ", count);
 }

So what we have here, is three different threads all sending signals to the main thread. Of course, what we expect to see is this:

 0: attempting to self-usr1
         This is test the 0th; blocking for 2s
 1: attempting to usr1 main
 1: done
 2: attempting to usr1 main
         Exiting
         This is test the 1th; blocking for 2s
 2: done
         Exiting
         This is test the 2th; blocking for 2s
         Exiting
 0: done, count 3

HOWEVER! The following behavior has also been observed:

 0: attempting to self-usr1
         This is test the 0th; blocking for 2s
 1: attempting to usr1 main
 1: done
 2: attempting to usr1 main
 2: done
         Exiting
         This is test the 1th; blocking for 2s
         Exiting
 0: done, count 2

Allow me to repeat the critical bit: *0: done, count 2*

This might help to explain why the above pauseAll hangs; if more than one thread at a time tries to send SIGUSR1 to the same thread, then the semaphore's count will end up skewed, and it will hang forever.

The solution? I have no idea. Synchronize pause and resume with the same lock as pauseAll/resumeAll, maybe?

DISCLAIMER: All of this is pure conjecture. Being kind of a newb wrt. signals, I might be wrong about the reason for this behavior. But then again, there _has_ to be a reason for std.threads hanging on that semaphore. What do you think?
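
The dropped signal can also be reproduced without any timing dependence. Linux does not queue standard signals such as SIGUSR1 (POSIX leaves this implementation-defined): while one instance is pending, further instances are merged into it. A minimal sketch, assuming a current D compiler and druntime's core.sys.posix.signal bindings (which postdate this post); countDelivered and onUsr1 are hypothetical names used only for illustration:

```d
import core.stdc.signal : raise;
import core.stdc.stdio : printf;
import core.sys.posix.signal;

__gshared int handled; // written only by the handler below

extern (C) void onUsr1(int) nothrow @nogc
{
    ++handled; // no I/O here: printf/writefln are not async-signal-safe
}

// Blocks SIGUSR1, raises it `sends` times, unblocks it, and returns how many
// times the handler actually ran.
int countDelivered(int sends)
{
    handled = 0;

    sigaction_t sa;
    sa.sa_handler = &onUsr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, null);

    sigset_t blocked, previous;
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &blocked, &previous);

    foreach (i; 0 .. sends)
        raise(SIGUSR1); // stays pending while blocked; repeats are merged

    // Restoring the old mask delivers whatever is pending before returning.
    pthread_sigmask(SIG_SETMASK, &previous, null);
    return handled;
}

void main()
{
    printf("sent 3, handled %d\n", countDelivered(3));
}
```

On Linux this should report a single handled signal for three sent. The test program above hits the same situation less reliably: while the handler sleeps, SIGUSR1 is blocked in the main thread (assuming glibc's BSD-style signal() semantics), so a second and third signal arriving in that window collapse into one.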
Dec 18 2007