
digitalmars.D - I think race condition exists in tango & phobos gc code

reply redsea <redsea 163.com> writes:
I have a program written in D that runs 24/7. I found that it blocks once or
twice a week (without using any CPU). Whenever I use strace to check whether it
is blocked in a system call, it continues running (strange?).

I can also resume it with kill -SIGUSR2, so I think this situation may be
associated with the GC. But why strace? I checked the strace code and found that
it causes SIGSTOP to be sent, and I found that SIGSTOP cannot be blocked by a
signal mask.

Then I checked the library, and I think the problem may be caused by the
following order of execution:

   thread A:                                  thread B:

   fullcollect
      thread_suspendAll
          suspend
                                              thread_suspendHandler
                                                 sem_post( &suspendCount );

          ret from sem_wait( &suspendCount );
      do collect

      thread_resumeAll
          !! this signal would be lost
          pthread_kill( t.m_addr, SIGUSR2 )

                                                 sigsuspend( &sigres );

Thread B would block because the SIGUSR2 is lost.

Then I checked the Phobos code, and the code is similar.

Now I'm trying to use a semaphore to do the resume, and I will check whether my
program runs correctly.


Any suggestions?
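
A rough C sketch of the semaphore-based resume idea, for illustration only. It is not the Tango or Phobos code. The names (`suspend_handler`, `resume_go`, `suspend_resume_once` and so on) are hypothetical, and SIGUSR1 is assumed as the suspend signal. The point it tries to show is that a semaphore post is counted, so it still takes effect if the resume happens before the suspended thread starts waiting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stddef.h>

static sem_t worker_ready;  /* worker -> collector: thread is running */
static sem_t suspend_ack;   /* worker -> collector: thread is parked */
static sem_t resume_go;     /* collector -> worker: continue */
static sem_t worker_exit;   /* collector -> worker: leave the work loop */
static volatile sig_atomic_t parked = 0;

/* sem_wait() that retries when it is interrupted by a signal. */
static void wait_retry(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR) { }
}

/* Hypothetical suspend handler. Caveat: POSIX lists sem_post() as
   async-signal-safe but not sem_wait(), so parking here relies on
   platform behaviour (an assumption of this sketch). */
static void suspend_handler(int sig)
{
    int saved_errno = errno;
    (void)sig;
    parked++;
    sem_post(&suspend_ack);   /* tell the collector we are stopped */
    wait_retry(&resume_go);   /* park here instead of in sigsuspend() */
    errno = saved_errno;
}

static void *worker(void *arg)
{
    (void)arg;
    sem_post(&worker_ready);
    wait_retry(&worker_exit); /* stands in for real work */
    return NULL;
}

/* Suspends and resumes one worker; returns how many times it was parked. */
int suspend_resume_once(void)
{
    pthread_t t;
    struct sigaction sa;
    int rc;

    parked = 0;
    sem_init(&worker_ready, 0, 0);
    sem_init(&suspend_ack, 0, 0);
    sem_init(&resume_go, 0, 0);
    sem_init(&worker_exit, 0, 0);

    sa.sa_handler = suspend_handler;
    sa.sa_flags = 0;
    sigfillset(&sa.sa_mask);  /* block other signals while the handler runs */
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    wait_retry(&worker_ready);

    rc = pthread_kill(t, SIGUSR1);  /* suspend */
    assert(rc == 0);
    wait_retry(&suspend_ack);       /* worker is now parked */
    /* ... a collection would run here ... */
    sem_post(&resume_go);           /* resume: the post is counted, not lost */

    sem_post(&worker_exit);
    pthread_join(t, NULL);
    sem_destroy(&worker_ready);
    sem_destroy(&suspend_ack);
    sem_destroy(&resume_go);
    sem_destroy(&worker_exit);
    (void)rc;
    return (int)parked;
}
```

Calling `suspend_resume_once()` from a small `main` (built with `-pthread`) should return once the worker has been parked and released. Whether `sem_wait` is acceptable inside a signal handler on a given platform is the main design question with this approach.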
Sep 07 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
redsea wrote:
 I have a program written in D that runs 24/7. I found that it blocks once or
 twice a week (without using any CPU). Whenever I use strace to check whether it
 is blocked in a system call, it continues running (strange?).
 
 I can also resume it with kill -SIGUSR2, so I think this situation may be
 associated with the GC. But why strace? I checked the strace code and found that
 it causes SIGSTOP to be sent, and I found that SIGSTOP cannot be blocked by a
 signal mask.
 
 Then I checked the library, and I think the problem may be caused by the
 following order of execution:
 
    thread A:                                  thread B:

    fullcollect
       thread_suspendAll
           suspend
                                               thread_suspendHandler
                                                  sem_post( &suspendCount );

           ret from sem_wait( &suspendCount );
       do collect

       thread_resumeAll
           !! this signal would be lost
           pthread_kill( t.m_addr, SIGUSR2 )

                                                  sigsuspend( &sigres );
 
 Thread B would block because the SIGUSR2 is lost.

SIGUSR2 shouldn't be lost. Tango sets sa_mask for the signal handlers to tell the OS to block all signals while the handler is processing. The call to sigsuspend is supposed to manually change that for the signals requested.

 Then I checked the Phobos code, and the code is similar.
 
 Now I'm trying to use a semaphore to do the resume, and I will check whether my
 program runs correctly.

Thanks, please do. If it really is a problem I'd be happy to change it.

Sean
Sep 08 2008
next sibling parent redsea <redsea 163.com> writes:
Sean Kelly Wrote:

 SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers 
 to tell the OS to block all signals while the handler is processing. 
 The call to sigsuspend is supposed to manually change that for the 
 signals requested.
 
 Then I checked the Phobos code, and the code is similar.
 
 Now I'm trying to use a semaphore to do the resume, and I will check whether my
 program runs correctly.
 
 Thanks, please do. If it really is a problem I'd be happy to change it.

I wrote a small program that calls kill and sigsuspend in the order I mentioned before, and the signal is not lost. So the real cause must lie deeper. The semaphore version is finished, but I have to wait for the administrator to test and upload the program. I will do more checking. Thanks for your opinions.
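
A minimal C sketch of a test along those lines, not the original test program. It assumes a single thread that blocks SIGUSR2 with `pthread_sigmask` (standing in for what sa_mask does inside the suspend handler), sends itself the signal, and only then calls sigsuspend. The names `on_resume` and `resume_signal_survives` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t resume_seen = 0;

/* Hypothetical resume handler: only records that SIGUSR2 was delivered. */
static void on_resume(int sig)
{
    (void)sig;
    resume_seen = 1;
}

/* Returns 1 if a SIGUSR2 sent before sigsuspend() is still delivered. */
int resume_signal_survives(void)
{
    struct sigaction sa;
    sigset_t block, old, pending, wait_mask;
    int rc;

    resume_seen = 0;
    sa.sa_handler = on_resume;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR2, &sa, NULL);
    assert(rc == 0);

    /* Block SIGUSR2, standing in for sa_mask inside the suspend handler. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    rc = pthread_sigmask(SIG_BLOCK, &block, &old);
    assert(rc == 0);

    /* The "resume" signal arrives before this thread reaches sigsuspend(). */
    rc = pthread_kill(pthread_self(), SIGUSR2);
    assert(rc == 0);

    /* A blocked signal is not discarded; it stays pending. */
    rc = sigpending(&pending);
    assert(rc == 0);
    if (sigismember(&pending, SIGUSR2) != 1) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return 0;
    }

    /* sigsuspend() atomically installs a mask without SIGUSR2 and waits,
       so the pending signal is delivered and the handler runs. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR2);
    sigsuspend(&wait_mask);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
    (void)rc;
    return resume_seen ? 1 : 0;
}
```

If `resume_signal_survives()` returns 1 when called from `main`, the signal sent before sigsuspend was held pending and delivered, which matches the result described above.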
Sep 09 2008
prev sibling parent redsea <redsea 163.com> writes:
Sean Kelly Wrote:

 
 SIGUSR2 shouldn't be lost.  Tango sets sa_mask for the signal handlers 
 to tell the OS to block all signals while the handler is processing. 
 The call to sigsuspend is supposed to manually change that for the 
 signals requested.

I was wrong. The program actually has two components, a client and a server, and both are multithreaded. It was reported to me that both components had the same problem. After checking, I found that the client is correct and runs stably, so the bug must have nothing to do with Tango. Sorry!
Sep 10 2008