
digitalmars.D - D1 garbage collector + threads + malloc = garbage?

reply Bane <branimir.milosavljevic gmail.com> writes:
Bug with GC fullCollect() in a multithreaded environment. Grauzone explained it
here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=99610

Test case that freezes the app (tested on DMD 1.053):


import std.c.stdlib;
import std.stdio;
import std.gc;
import std.thread;
import std.c.time;


class Foo {
  void* p;
  this(){
    // Constructor and destructor serialize on the same classinfo monitor.
    // The GC runs ~this() on the collecting thread, so this lock can end up
    // contended between the collector and a suspended or blocked thread.
    synchronized(this.classinfo) {
      writefln("malloc begin");
      p = std.c.stdlib.malloc(1024*1024);
      writefln("malloc finished");
    }
  }

  ~this(){
    synchronized(this.classinfo){
      writefln("free begin");
      std.c.stdlib.free(p);
      writefln("free finished");
    }
  }
}

class MyThread : Thread {
  int run(){
    // Allocate garbage as fast as possible to provoke the race.
    while(true){
      new Foo;
      msleep(1);
    }
    return 0;
  }
}

void main(){
  // Ten mutator threads, plus the main thread triggering collections.
  for(int i=0; i<10; i++){
    (new MyThread).start;
  }

  while(true){
    writefln("collect begin");
    std.gc.fullCollect;
    writefln("collect finished");
    msleep(1000);
  }
}



Can it be fixed, or at least mentioned somewhere in the docs as a known
limitation of the D1 GC? It would save a lot of time for some people (too late
for me :).
Dec 11 2009
parent reply Leandro Lucarella <llucax gmail.com> writes:
Bane, on December 11 at 06:00 you wrote:
 Bug with GC fullCollect() in a multithreaded environment. Grauzone explained it
 here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=99610
 
 [... full test case snipped; it is quoted in full above ...]
 
 Can it be fixed, or at least mentioned somewhere in the docs as a known
 limitation of the D1 GC? It would save a lot of time for some people (too
 late for me :).
The patch is very simple (as pointed out by Grauzone); I guess it should make
it into D1 (I hope):

------------------------------------------------------------
diff --git a/phobos/internal/gc/gcx.d b/phobos/internal/gc/gcx.d
index b0d2821..7b1a7a3 100644
--- a/phobos/internal/gc/gcx.d
+++ b/phobos/internal/gc/gcx.d
@@ -2033,8 +2033,6 @@ struct Gcx
             }
         }
 
-        Thread.resumeAll();
-
         // Free up everything not marked
         debug(COLLECT_PRINTF) printf("\tfree'ing\n");
         size_t freedpages = 0;
@@ -2194,6 +2192,8 @@ struct Gcx
         debug(COLLECT_PRINTF) printf("recovered pages = %d\n", recoveredpages);
         debug(COLLECT_PRINTF) printf("\tfree'd %u bytes, %u pages from %u pools\
+        Thread.resumeAll();
+
         return freedpages + recoveredpages;
     }
------------------------------------------------------------

I wanted to reproduce it to make a bug report and attach the patch, but
I couldn't. I replaced msleep() with usleep() (and multiplied the values by
1000) because I'm on Linux, but that's all I changed. I have an old AMD
Athlon 1.8GHz. How do you reproduce it?

I want to be sure the bug is present before the patch and fixed after the
patch before submitting the patch via Bugzilla.

Thanks!

-- 
Leandro Lucarella (AKA luca)
http://llucax.com.ar/
Dec 11 2009
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Leandro Lucarella Wrote:

 Bane, on December 11 at 06:00 you wrote:
 Bug with GC fullCollect() in multithreaded environment. Grauzone explained it
here http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=99610
To be perfectly safe, it's a tiny bit trickier than that. You have to always lock on allocations instead of just locking the collection cycle in single-threaded apps. I think the reason for this is that once the lock is acquired it must be held until the current thread exits the GC code, and the "synchronized" statement only allows you to lock for the length of the collect() call (because the start and end must be within the same scope). However, this /may/ actually be a separate bug with the D 1.0 GC. I really can't recall for sure if this was an issue with collecting when all the other threads were suspended.
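To see the scoping issue concretely, here is a minimal sketch (gcMalloc,
gcLock, tryAlloc and fullCollect are hypothetical names for illustration, not
the actual Phobos internals): a lock taken with "synchronized" around the
collection alone is released at the end of that block, before the allocating
thread has left the GC code.

void* gcMalloc(size_t size)
{
    void* p = tryAlloc(size);
    if (p is null)
    {
        synchronized (gcLock)   // the lock covers only this block...
        {
            fullCollect();
        }                       // ...and is forcibly released here,
        p = tryAlloc(size);     // so the retry runs unprotected
    }
    return p;
}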
Dec 11 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sean Kelly wrote:
 Leandro Lucarella Wrote:
 
 Bane, on December 11 at 06:00 you wrote:
 Bug with GC fullCollect() in multithreaded environment. Grauzone
 explained it here
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=99610
 
 To be perfectly safe, it's a tiny bit trickier than that. You have to
 always lock on allocations instead of just locking the collection cycle in
 single-threaded apps. [...] I really can't recall for sure if this was an
 issue with collecting when all the other threads were suspended.
Yes, the lock on the GC must still be held until the collection work is completed, but the other threads can be unfrozen once the mark phase is complete.
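In outline, that ordering would look like this (a rough sketch with
hypothetical gcLock/mark/sweep names, not the real gcx.d code; pauseAll and
resumeAll are the D1 std.thread calls):

void collect()
{
    synchronized (gcLock)      // held for the entire collection
    {
        Thread.pauseAll();     // stop the world
        mark();                // scan roots, mark reachable memory
        Thread.resumeAll();    // threads may run again after marking...
        sweep();               // ...but the GC lock stays held while
    }                          // finalizers run and pages are freed
}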
Dec 11 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Sean Kelly wrote:
 [...]
 
 Yes, the lock on the GC must still be held until the collection work is
 completed, but the other threads can be unfrozen once the mark phase is
 complete.
Hm... so if a dtor creates a new thread, then that thread could enter and "lock" the GC while the first thread is still completing its allocation. Couldn't this cause problems?
Dec 12 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sean Kelly wrote:
 Walter Bright Wrote:
 [...]
 
 Hm... so if a dtor creates a new thread, then that thread could enter and
 "lock" the GC while the first thread is still completing its allocation.
 Couldn't this cause problems?
No, because the thread doing the GC collection has locked the GC. Any other thread attempting to alloc will wait on that lock. There is only one GC lock.
Dec 12 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Sean Kelly wrote:
 [...]
 
 No, because the thread doing the GC collection has locked the GC. Any
 other thread attempting to alloc will wait on that lock. There is only one
 GC lock.
Yeah, but it releases the lock when the collection completes, and then it retries the allocation to actually obtain memory. If the lock were held until the thread exited the GC code completely there wouldn't be a problem. I've thought about switching to using a Mutex in D2 so the lock could be acquired at the collection and released at the end of GC.malloc(), but this isn't an option in D1.
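A rough illustration of that idea (a sketch only; gcLock, tryAlloc and
fullCollect are hypothetical, and the Mutex is presumably D2 druntime's
core.sync.mutex.Mutex, whose explicit lock()/unlock() calls are not tied to
a single scope the way "synchronized" is):

import core.sync.mutex;

__gshared Mutex gcLock;         // hypothetical single GC lock

void* gcMalloc(size_t size)
{
    gcLock.lock();              // explicit lock call: no scope restriction
    void* p = tryAlloc(size);   // tryAlloc/fullCollect are assumed helpers
    if (p is null)
    {
        fullCollect();          // the collection runs under the same lock
        p = tryAlloc(size);     // retry without ever dropping the lock
    }
    gcLock.unlock();            // released only at the end of malloc
    return p;
}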
Dec 12 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sean Kelly wrote:
 Walter Bright Wrote:
 No, because the thread doing the GC collection has locked the GC.
 Any other thread attempting to alloc will wait on that lock. There
 is only one GC lock.
 Yeah, but it releases the lock when the collection completes, and then it
 retries the allocation to actually obtain memory. If the lock were held
 until the thread exited the GC code completely there wouldn't be a problem.
I don't understand why there is a problem with this. When the collection is complete, release the lock, then take the lock again for the allocation; if the allocation fails, go for another collection.
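As a sketch, that sequence (same hypothetical names as the earlier sketches;
a real allocator would also grow the heap rather than loop forever):

void* gcMalloc(size_t size)
{
    while (true)
    {
        synchronized (gcLock)
        {
            void* p = tryAlloc(size);
            if (p !is null)
                return p;
            fullCollect();      // collect while holding the lock...
        }                       // ...release it when the collection completes,
        // then loop: take the lock again, retry the allocation, and go for
        // another collection if it still fails
    }
}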
Dec 12 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Sean Kelly wrote:
 [...]
 
 I don't understand why there is a problem with this. When the collection
 is complete, release the lock, then take the lock again for the
 allocation; if the allocation fails, go for another collection.
Ah, I didn't realize the allocation call backed up that far after the collection. It's been a while since I looked at the code.
Dec 12 2009
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Leandro Lucarella wrote:
 I want to be sure the bug is present before the patch and fixed after the
 patch before submitting the patch via Bugzilla.
It's a good fix, I checked in a patch for it.
Dec 11 2009
next sibling parent reply Leandro Lucarella <llucax gmail.com> writes:
Walter Bright, on December 11 at 20:51 you wrote:
 Leandro Lucarella wrote:
I want to be sure the bug is present before the patch and fixed after the
patch before submitting the patch via Bugzilla.
It's a good fix, I checked in a patch for it.
Great, thanks!

-- 
Leandro Lucarella (AKA luca)
http://llucax.com.ar/
Dec 12 2009
parent reply Leandro Lucarella <llucax gmail.com> writes:
Leandro Lucarella, on December 12 at 11:18 you wrote:
 Walter Bright, on December 11 at 20:51 you wrote:
 Leandro Lucarella wrote:
I want to be sure the bug is present before the patch and fixed after the
patch before submitting the patch via Bugzilla.
It's a good fix, I checked in a patch for it.
Great, thanks!
If you could add a small description of what each commit is about, besides
pointing to the NG discussion/Bugzilla number, as you are doing with DMD now,
it would be really great! =)

-- 
Leandro Lucarella (AKA luca)
http://llucax.com.ar/
Dec 12 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Leandro Lucarella wrote:
 If you could add a small description of what the commit is about besides
 pointing the NG discussion/Bugzilla number, as you are doing with DMD now,
 it would be really great! =)
Just click on the link! (At least, that's how the changelog works.) Also, I added a comment in the source code that explains the why.
Dec 12 2009
parent Leandro Lucarella <llucax gmail.com> writes:
Walter Bright, on December 12 at 10:29 you wrote:
 Leandro Lucarella wrote:
If you could add a small description of what the commit is about besides
pointing the NG discussion/Bugzilla number, as you are doing with DMD now,
it would be really great! =)
Just click on the link! (At least, that's how the changelog works.) Also, I added a comment in the source code that explains the why.
There are no links in the commit messages, and my RSS feed gets only the
commit message, not the diff, so to actually see the diff I have to open the
RSS "link" to go to the full changeset information. So, if there are 10
changes, I have to open 10 new web pages to see what the changes are about.

It would be nice to know what the changes are about *before* opening any new
web pages, so I can click only on the changes I'm interested in seeing fully.

Thanks.

-- 
Leandro Lucarella (AKA luca)
http://llucax.com.ar/
Dec 13 2009
prev sibling parent Bane <branimir.milosavljevic gmail.com> writes:
Walter Bright Wrote:

 Leandro Lucarella wrote:
 I want to be sure the bug is present before the patch and fixed after the
 patch before submitting the patch via Bugzilla.
 It's a good fix, I checked in a patch for it.
Thanx :) I tested it with the patched Phobos on WinXP/CentOS and it works.
Dec 12 2009
prev sibling next sibling parent Bane <branimir.milosavljevic gmail.com> writes:
Leandro Lucarella Wrote:

 Bane, on December 11 at 06:00 you wrote:
 [... full test case and patch snipped; both are quoted in full above ...]
 
 I wanted to reproduce it to make a bug report and attach the patch, but
 I couldn't. I replaced msleep() with usleep() (and multiplied the values
 by 1000) because I'm on Linux, but that's all I changed. I have an old
 AMD Athlon 1.8GHz. How do you reproduce it?
The home computer I tested it on is WinXP with a Phenom quad core, but the bug
is confirmed on CentOS 5.2 + Intel dual core and on CentOS 5.2 + Intel Xeon.

The bug has been present for a long time, that's for sure. It took me a while
to track it down because I thought it was just me abusing the keyboard, not
a bug in Phobos. The worst thing about this bug is that there is no output or
exception when it happens - just that dreaded deadlock :)

One thing I noted: if you reduce the memory requested in that malloc call from
1 MB to 1 KB or less (or increase the sleep time between calls), the chances
of the bug occurring drop drastically, so in most applications this will
probably never happen.
Dec 12 2009
prev sibling parent Bane <branimir.milosavljevic gmail.com> writes:
 I wanted to reproduce it to make a bug report to attach the patch but
 I couldn't. I replaced msleep() with usleep() (and multiplied the values
 by 1000) because I'm in Linux, but that's all I changed. I have an old AMD
 Athlon 1.8GHz. How do you reproduce it?
 
 I want to be sure the bug is present before the patch and fixed after the
 patch before submiting the patch via Bugzilla.
 
 Thanks!
I forgot: on Linux, just remove the msleep lines completely. It will deadlock soon after the program starts. It seems that Linux is more resistant to this bug than Windows.
Dec 12 2009