digitalmars.D - Multithreading woes on Linux

Juan Jose Comellas <jcomellas gmail.com> writes:

It seems that there is a problem in the code generated by DMD or the code in
Phobos when using multithreading on Linux. I've been trying several ways of
rewriting my programs to avoid this problem, but I've had no success so
far. The crashes always happen inside the garbage collector. The line
reported by gdb is:

#0  0x0806a978 in _D3gcx3Gcx4markFPvPvZv () at gcx.d:1318
1318                byte *p = cast(byte *)(*p1);

It looks like the pointer that's being dereferenced by the GC is invalid.
I've added checks before this line to see if it was a NULL pointer and it's
not. Surprisingly (or not), my program crashes almost immediately if Phobos
and the GC are compiled with optimizations. If I only leave "-g" as the
DFLAGS in the makefiles I get these crashes much less frequently.  

In the test program I'm using I have two threads. The crash is happening on
thread 1. The full backtrace I get for the crash is attached to this post.

I'm trying to write a simplified sample program and I'll post it once I have
it ready. Walter, if you have a minute, I'd appreciate you looking into
this.
Apr 23 2006
Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Juan Jose Comellas wrote on 2006-04-23:
 It seems that there is a problem in the code generated by DMD or the code in
 Phobos when using multithreading on Linux. I've been trying several ways of
 rewriting my programs to avoid this problem, but I've had no success so
 far. The crashes always happen inside the garbage collector. The line
 reported by gdb is:

 #0  0x0806a978 in _D3gcx3Gcx4markFPvPvZv () at gcx.d:1318
 1318                byte *p = cast(byte *)(*p1);

Might be related to http://d.puremagic.com/bugzilla/show_bug.cgi?id=72

A potential workaround:
1) edit dmd/src/phobos/internal/gc/linux.mak
   remove -release from DFLAGS:
   DFLAGS=-O -inline -I../..
2) recompile libphobos.a
3) replace your current libphobos.a with the one found at
   dmd/src/phobos/libphobos.a

Thomas
Apr 23 2006
Dave <Dave_member pathlink.com> writes:
I just ran into this - the fix in std/thread.d:

     extern (C) static void pauseHandler(int sig)
     {
         int result;

         // Save all registers on the stack so they'll be scanned by the GC
         asm
         {
             pusha   ;
         }

         assert(sig == SIGUSR1);
         // Move sem_post to after t.stackTop = getESP();
         //sem_post(&flagSuspend);

         sigset_t sigmask;
         result = sigfillset(&sigmask);
         assert(result == 0);
         result = sigdelset(&sigmask, SIGUSR2);
         assert(result == 0);

         Thread t = getThis();
         t.stackTop = getESP();
         t.flags &= ~1;
         sem_post(&flagSuspend); // HERE
         while (1)
         {
             sigsuspend(&sigmask);   // suspend until SIGUSR2
             if (t.flags & 1)        // ensure it was resumeHandler()
                 break;
         }

         // Restore all registers
         asm
         {
             popa    ;
         }
     }

The problem is that t.stackTop is not yet valid when it is passed into
gcx.mark(): pauseAll() returns (and lets the GC commence) before
stackTop has been set for all of the paused threads, so the GC can read
a stackTop that has not been updated yet.
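
The ordering this fix relies on can be sketched in plain C. This is only an illustration under stated assumptions, not the Phobos code: it uses ordinary pthreads and two POSIX semaphores instead of the SIGUSR1/SIGUSR2 handlers, and struct worker, wait_on, worker_main and pause_and_check are hypothetical names made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread record the collector reads. */
struct worker {
    void *stack_top;    /* written by the worker, read by the collector */
    sem_t suspended;    /* worker -> collector: "I am paused" */
    sem_t resume;       /* collector -> worker: "carry on" */
};

/* sem_wait() can fail with EINTR; retry only in that case. */
static void wait_on(sem_t *s)
{
    while (sem_wait(s) != 0 && errno == EINTR)
        ;
}

/* Worker side: publish stack_top first, then report "paused". */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    int marker = 0;             /* a local, so &marker lies in this thread's stack */

    w->stack_top = &marker;     /* 1. publish the stack top */
    sem_post(&w->suspended);    /* 2. only then let the collector proceed */
    wait_on(&w->resume);        /* stay parked until the collector is done */
    w->stack_top = NULL;        /* this frame is about to go away */
    return NULL;
}

/* Collector side. Returns 1 if stack_top was set by the time the worker
 * reported itself paused, 0 if not, -1 if setup failed. */
static int pause_and_check(void)
{
    struct worker w;
    pthread_t tid;
    int ok, rc;

    w.stack_top = NULL;
    if (sem_init(&w.suspended, 0, 0) != 0)
        return -1;
    if (sem_init(&w.resume, 0, 0) != 0) {
        sem_destroy(&w.suspended);
        return -1;
    }
    if (pthread_create(&tid, NULL, worker_main, &w) != 0) {
        sem_destroy(&w.resume);
        sem_destroy(&w.suspended);
        return -1;
    }

    wait_on(&w.suspended);          /* returns only after step 2 above */
    ok = (w.stack_top != NULL);     /* so step 1 has already happened */

    sem_post(&w.resume);            /* let the worker continue */
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    sem_destroy(&w.resume);
    sem_destroy(&w.suspended);
    return ok;
}
```

If steps 1 and 2 in worker_main were swapped, the collector could wake up and read stack_top before the worker had stored it, which is the same window the original placement of sem_post() left open.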

Please give it a try and if it also solves your problem then it will be 
a confirmed fix.

- Dave

Juan Jose Comellas wrote:
 ------------------------------------------------------------------------
 
 (gdb) thread apply all bt
 
 Thread 2 (process 8953):
 #0  0x5557db9d in sem_post GLIBC_2.0 () from /lib/tls/libpthread.so.0
 #1  0x08062f27 in _D3std6thread6Thread12pauseHandlerUiZv () at std/thread.d:940
 #2  <signal handler called>
 #3  0x5557e83e in send () from /lib/tls/libpthread.so.0
 #4  0x08050a61 in _D5mango2io6Socket6Socket4sendFAvE5mango2io6Socket6Socket5FlagsZi () at /home/jcomellas/devel/d/mango_test/mango/io/Socket.d:1423
 #5  0x08050290 in _D5mango2io6Socket6Socket6writerFAvZk () at /home/jcomellas/devel/d/mango_test/mango/io/Socket.d:879
 #6  0x0804cbde in _D5mango2io7Conduit7Conduit5writeFAvZk () at /home/jcomellas/devel/d/mango_test/mango/io/Conduit.d:198
 #7  0x0805821f in _D8selector16clientThreadFuncFZv () at selector.d:363
 #8  0x0805816e in _D8selector21dummyClientThreadFuncFPvZi () at selector.d:327
 #9  0x080628c5 in _D3std6thread6Thread3runFZi () at std/thread.d:609
 #10 0x08062d50 in _D3std6thread6Thread11threadstartUPvZPv () at std/thread.d:845
 #11 0x55579ced in start_thread () from /lib/tls/libpthread.so.0
 #12 0x5567ddde in clone () from /lib/tls/libc.so.6
 
 Thread 1 (process 8949):
 #0  0x0806a978 in _D3gcx3Gcx4markFPvPvZv () at gcx.d:1318
 #1  0x0806ad05 in _D3gcx3Gcx11fullcollectFPvZk () at gcx.d:1462
 #2  0x0806aab5 in _D3gcx3Gcx16fullcollectshellFZk () at gcx.d:1382
 #3  0x080692de in _D3gcx2GC12mallocNoSyncFkZPv () at gcx.d:275
 #4  0x080691c1 in _D3gcx2GC6mallocFkZPv () at gcx.d:228
 #5  0x080684db in _d_newclass () at gc.d:127
 #6  0x08053df7 in _D5mango2io8selector12PollSelector12PollSelector11selectedSetFZC5mango2io8selector5model9ISelector13ISelectionSet () at /home/jcomellas/devel/d/mango_test/mango/io/selector/PollSelector.d:353
 #7  0x08057d69 in _D8selector12testSelectorFC5mango2io8selector5model9ISelector9ISelectorZv () at selector.d:142
 #8  0x08057c24 in _Dmain () at selector.d:66
 #9  0x0805a38a in main () at internal/dmain2.d:94
 

Apr 23 2006
Juan Jose Comellas <jcomellas gmail.com> writes:
Great fix! This solved all the problems I've found so far when working with
multiple threads on Linux. I'm going to start running more complex test
cases with several hundred threads to see if I can find any additional
problems.

Thank you very much for this.

Walter, please add this fix to Phobos. Should I create an entry in D's
bugzilla?


Apr 23 2006
Justin C Calvarese <technocrat7 gmail.com> writes:
Juan Jose Comellas wrote:
 Great fix! This solved all the problems I've found so far when working with
 multiple threads on Linux. I'm going to start running more complex test
 cases with several hundred threads to see if I can find any additional
 problems.
 
 Thank you very much for this.
 
 Walter, please add this fix to Phobos. Should I create an entry in D's
 bugzilla?

I think this is exactly what bugzilla is for. I think you should go ahead
and add it.

-- 
jcc7
Apr 23 2006
pmoore <pmoore_member pathlink.com> writes:
Slightly off topic:

Why does this function do a pusha and popa? Surely they are 16-bit pushes and
pops? Wouldn't you want pushad and popad instead? Note, though, that individual
pushes and pops would probably be better with the 64-bit future in mind, as
pushad and popad become invalid instructions in x86_64.


Apr 24 2006