
digitalmars.D - A different, precise TLS garbage collector?

reply Etienne <etcimon gmail.com> writes:
I always wondered why we would use the shared keyword on GC allocations 
if only the stack can be optimized for TLS Storage.

After thinking about how shared objects should work with the GC, it's 
become obvious that the GC should be optimized for local data. Anything 
shared would have to be manually managed, because the biggest slowdown 
of all is stopping the world to facilitate concurrency.

With a precise GC on the way, it becomes easy to filter out allocations 
of shared objects: simply proxy them through malloc and get rid of 
the locks. Make the GC thread-local, and you can expect it to scale with 
the number of processors.
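
A minimal sketch of that routing, assuming the precise GC can tell a 
shared allocation apart; TLHeap, allocate and isSharedAllocation are 
invented names here, not druntime APIs:

import core.stdc.stdlib : malloc;

struct TLHeap                        // stand-in for a per-thread GC heap
{
    static TLHeap instance;          // static data is thread-local by default in D
    void* alloc(size_t size) { return malloc(size); } // placeholder for a real GC allocation
}

// Shared allocations bypass the GC and its lock entirely; everything else
// goes to the calling thread's own heap, so no global lock is ever taken.
void* allocate(bool isSharedAllocation, size_t size)
{
    return isSharedAllocation ? malloc(size)
                              : TLHeap.instance.alloc(size);
}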

Any thread-local data already has to be duplicated into a shared object 
before it can be used from another thread, and that lifetime is easy to 
manage manually.

// Pseudocode: dupShared/dupLocal are hypothetical deep-copy helpers between
// the thread-local and shared heaps, and receive is sketched (the real
// std.concurrency API would be closer to receiveOnly!T).
SomeTLS variable = new SomeTLS("Data");                       // thread-local GC allocation
shared SomeTLS variable2 = cast(shared) variable.dupShared(); // manually managed copy
Tid tid = spawn(&doSomething, variable2);
variable = receive!variable2(tid).dupLocal();                 // copy the result back locally
delete variable2;                                             // explicit deallocation

Programming with a syntax that makes use of shared objects, and forces 
manual management on those, seems to make "stop the world" a thing of 
the past. Any thoughts?
Nov 16 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
We'll have to change the way "immutable" is treated for 
allocations.  Which I think is a good thing.  Just because 
something can be shared doesn't mean that I intend to share it.
Nov 16 2014
parent reply Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 10:20, Sean Kelly wrote:
 We'll have to change the way "immutable" is treated for allocations.
 Which I think is a good thing.  Just because something can be shared
 doesn't mean that I intend to share it.
Exactly. I'm not sure how DMD currently handles immutable, but it should automatically be mangled into the global namespace, in the application's data segment.

If this seems feasible to everyone, I wouldn't mind forking the precise GC into a thread-local library, without any "stop the world" slowdown. A laptop with 4 cores running a multi-threaded application would (theoretically) go through the mark/collect process 4 times faster, and allocate unbelievably faster due to the absence of locks :)

The only problem is having to manually allocate shared objects, which seems fine because most of the time they'd be deallocated in a shared static ~this anyway.
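
For the manual-management part, here is a small sketch of the pattern 
hinted at above, with GlobalCache as a made-up type: the shared object 
never touches the GC heap and is released in shared static ~this.

import core.stdc.stdlib : malloc, free;

struct GlobalCache { ubyte[64] data; }

__gshared GlobalCache* cache;        // visible to all threads, never GC-managed

shared static this()
{
    cache = cast(GlobalCache*) malloc(GlobalCache.sizeof);
    *cache = GlobalCache.init;
}

shared static ~this()
{
    free(cache);                     // manual deallocation at program shutdown
    cache = null;
}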
Nov 16 2014
parent Etienne Cimon <etcimon gmail.com> writes:
This GC model also seems to work fine for locally-allocated __gshared 
objects. Since they're registered locally but available globally, 
they'll be collected once the thread that created them is gone.

Also, when an object is cast(shared) before being sent to another 
thread, it's usually still in scope once the other thread returns.

So the chances that existing code would break with a thread-local GC 
seem very thin.
Nov 16 2014
prev sibling next sibling parent reply "Xinok" <xinok live.com> writes:
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
 I always wondered why we would use the shared keyword on GC 
 allocations if only the stack can be optimized for TLS Storage.

 After thinking about how shared objects should work with the 
 GC, it's become obvious that the GC should be optimized for 
 local data. Anything shared would have to be manually managed, 
 because the biggest slowdown of all is stopping the world to 
 facilitate concurrency.

 With a precise GC on the way, it's become easy to filter out 
 allocations from shared objects. Simply proxy them through 
 malloc and get rid of the locks. Make the GC thread-local, 
 and you can expect it to scale with the number of processors.

 Any thread-local data should already have to be duplicated into 
 a shared object to be used from another thread, and the 
 lifetime is easy to manage manually.

 SomeTLS variable = new SomeTLS("Data");
 shared SomeTLS variable2 = cast(shared) variable.dupShared();
 Tid tid = spawn(&doSomething, variable2);
 variable = receive!variable2(tid).dupLocal();
 delete variable2;

 Programming with a syntax that makes use of shared objects, and 
 forces manual management on those, seems to make "stop the 
 world" a thing of the past. Any thoughts?
How about immutable data, which is implicitly shareable? Granted, you could destroy/free the data asynchronously, but you would still need to check all threads for references to that data.
Nov 16 2014
parent Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 10:21, Xinok wrote:
 How about immutable data which is implicitly shareable? Granted you can
 destroy/free the data asynchronously, but you would still need to check
 all threads for references to that data.
Immutable data would proxy through malloc and would not be scanned, since it can only contain immutable indirections that can be neither deleted nor scanned. It is also shared by every thread without any locking.

Currently, immutable data is global in storage but may be local in access rights, I think? I would have assumed it would automatically end up in the process's .rdata segment.
Nov 16 2014
prev sibling next sibling parent reply "Ola Fosheim Grøstad" writes:
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
 After thinking about how shared objects should work with the 
 GC, it's become obvious that the GC should be optimized for 
 local data. Anything shared would have to be manually managed, 
 because the biggest slowdown of all is stopping the world to 
 facilitate concurrency.
If you go for thread-local garbage collection, then there is no reason not to be more general and support per-data-structure garbage collection as well. That's more useful: it can be used for collecting cycles in graphs. Just let the application initiate collection when there are no references pointing into it.

But keep in mind that you also have to account for fibers that move between threads.
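
As a rough illustration of the per-data-structure idea (names invented 
here, not an existing API): if a whole graph lives in its own arena, 
cycles inside it need no tracing at all, and the application releases 
everything in one call once it knows nothing points into the structure 
anymore.

import core.stdc.stdlib : malloc, free;

struct Node { Node*[8] edges; size_t edgeCount; }

struct GraphArena
{
    void*[] blocks;                  // every node allocated from this arena

    Node* makeNode()
    {
        auto p = cast(Node*) malloc(Node.sizeof);
        *p = Node.init;
        blocks ~= cast(void*) p;
        return p;
    }

    // Application-initiated collection: cycles between nodes don't matter,
    // the whole arena is released at once.
    void collect()
    {
        foreach (b; blocks) free(b);
        blocks = null;
    }
}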
Nov 16 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim Grøstad 
wrote:
 But keep in mind that you also have to account for fibers that 
 move between threads.
Yes. There are a lot of little "gotchas" with thread-local allocation.
Nov 16 2014
parent reply "Etienne" <etcimon gmail.com> writes:
On Sunday, 16 November 2014 at 17:40:30 UTC, Sean Kelly wrote:
 On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim 
 Grøstad wrote:
 But keep in mind that you also have to account for fibers that 
 move between threads.
Yes. There are a lot of little "gotchas" with thread-local allocation.
I can't even think of a situation where this would be necessary. It sounds like all I would need is to take the precise GC and store one instance per thread in the thread data; I'd probably only need the rtinfo to see whether an allocation is shared, and proxy it towards malloc if so. Am I missing something?
Nov 16 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Sunday, 16 November 2014 at 19:13:27 UTC, Etienne wrote:
 I can't even think of a situation when this would be necessary. 
 It sounds like all I would need is to take the precise GC and 
 store each instance in the thread data, I'll probably only need 
 the rtinfo to see if it's shared during allocation to proxy 
 towards malloc. Am I missing something?
There is a reason why "elegant" GC languages pick one primary type of concurrency.

If you say that all code runs on a fiber and that there is no such thing as thread-local, then you can tie the local GC partition to the fiber and collect it on any thread.

If you say that functions called from a fiber sometimes call into global statespace, sometimes into thread statespace and sometimes into fiber statespace… then you need to figure out ownership for every allocation. Does the allocated object belong to a global database, a thread-local database, or a fiber cache which is flushed automatically when moving to a new thread? Or is it an extension of the fiber statespace that should be transparent to threads?
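
To make that classification concrete, a tiny sketch (all names invented 
for illustration) of the decision every allocation would need before it 
can be placed on the right heap:

enum Ownership { global, threadLocal, fiberLocal }

// The allocator would have to answer this for every request, e.g. from the
// precise type info (shared/immutable) and from the current execution context.
Ownership classify(bool isSharedOrImmutable, bool onMigratingFiber)
{
    if (isSharedOrImmutable) return Ownership.global;      // visible to all threads
    if (onMigratingFiber)    return Ownership.fiberLocal;  // must travel with its fiber
    return Ownership.threadLocal;                          // safe on a per-thread heap
}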
Nov 16 2014
parent reply Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 19:32, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com> wrote:
 Does the allocated object belong to a global database, a thread local
 database or a fiber cache which is flushed automatically when moving to
 a new thread? Or is it an extension of the fiber statespace that should
 be transparent to threads?
I'm not sure what this means. Wouldn't the fiber stacks be saved in thread-local space when they yield? In turn, they'd become part of the thread-local stack space, I guess.

Overall, I'd put all the GC allocations through malloc the same way it is done right now. I don't see anything that needs to be done other than making multiple thread-local GC instances and removing the locks. I'm sure I'll find obstacles, but I don't see them right now; do you know of any that I should look out for?
Nov 16 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 17 November 2014 at 00:44:13 UTC, Etienne Cimon wrote:
 I'm not sure what this means, wouldn't the fiber stacks be 
 saved on the thread-local space when they yield? In turn, they 
 become part of the thread-local stack space I guess.
If you want performant, low-latency fibers you want load balancing, so they should not be affiliated with a thread but live in a pool. That said, fibers aren't a low-level construct like threads, so I am not sure whether they belong in system-level programming anyway.
 Overall, I'd put all the GC allocations through malloc the same 
 way it is right now. I don't see anything that needs to be done 
 other than make multiple thread-local GC instances and remove 
 the locks. I'm sure I'll find obstacles but I don't see them 
 right now, do you know of any that I should look out for?
Not if you work hard to ensure referential integrity.

I personally would find an optional GC more useful, where the programmer takes responsibility for collecting when the situation is right (specifying root pointers, the stack, etc.). I think the language should limit itself to generating the information that enables precise collection, then leave the rest to the programmer…
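
Something close to that workflow can already be approximated with 
druntime's core.memory.GC: suppress automatic collections and let the 
application decide when to collect, registering non-GC memory as a scan 
range where needed. A minimal sketch:

import core.memory : GC;
import core.stdc.stdlib : malloc, free;

void main()
{
    GC.disable();                    // suppress collections triggered by allocations

    enum size = 1024;
    void* buf = malloc(size);        // memory the GC knows nothing about...
    GC.addRange(buf, size);          // ...until it is registered as a range to scan
    scope (exit) { GC.removeRange(buf); free(buf); }

    // ... build up data structures ...

    GC.collect();                    // programmer-initiated collection
    GC.enable();
}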
Nov 16 2014
prev sibling parent reply "Kagamin" <spam here.lot> writes:
Previous thread: 
http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org
Nov 17 2014
parent Etienne <etcimon gmail.com> writes:
On 2014-11-17 9:45 AM, Kagamin wrote:
 Previous thread:
 http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org
Looks somewhat similar, but the idea of a shared GC would defeat the purpose and end up complicating things. After another review of the problem, I've come up with some new observations:

- The shared data needs to be manually managed for a thread-local GC in order to scale with the number of CPU cores
- Anything instantiated as `new shared X` will have to proxy into a shared allocator interface or malloc
- All __gshared instances containing mutable indirections will cause undefined behavior
- All thread-local instances moved between threads using cast(shared) and carrying indirections will cause undefined behavior
- Immutable object values must not be allocated on the GC, and should be defined only in a shared static this constructor to ensure the values are available to all threads at all times

The only necessity is shared/non-shared type information during allocation and deallocation. The __gshared and cast(shared) issues are certainly the most daunting. This is why this GC would have to be optional, through a version(ThreadLocalGC).
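
A sketch of that opt-in switch, using invented placeholder names 
(tlAlloc, sharedAlloc, allocate): the thread-local path would only be 
compiled in when building with -version=ThreadLocalGC.

import core.stdc.stdlib : malloc;

void* tlAlloc(size_t size)     { return malloc(size); } // stand-in for the per-thread heap
void* sharedAlloc(size_t size) { return malloc(size); } // stand-in for the shared allocator/malloc

void* allocate(bool isShared, size_t size)
{
    version (ThreadLocalGC)
    {
        // `new shared X` proxies to the shared allocator; everything else
        // stays on the calling thread's own heap, with no global lock.
        return isShared ? sharedAlloc(size) : tlAlloc(size);
    }
    else
    {
        return sharedAlloc(size);    // default build: today's global GC behaviour
    }
}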
Nov 17 2014