
digitalmars.D - A different, precise TLS garbage collector?

reply Etienne <etcimon gmail.com> writes:
I always wondered why we would use the shared keyword on GC allocations 
if only the stack can be optimized for TLS Storage.

After thinking about how shared objects should work with the GC, it's 
become obvious that the GC should be optimized for local data. Anything 
shared would have to be manually managed, because the biggest slowdown 
of all is stopping the world to facilitate concurrency.

With a precise GC on the way, it becomes easy to filter out allocations 
of shared objects: simply proxy them through malloc and get rid of 
the locks. Make the GC thread-local, and you can expect it to scale with 
the number of processors.
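
A minimal sketch of that routing, assuming the precise GC can tell a 
shared allocation apart; TLHeap, allocate and isSharedAllocation are 
invented names here, not druntime APIs:

import core.stdc.stdlib : malloc;

struct TLHeap                        // stand-in for a per-thread GC heap
{
    static TLHeap instance;          // static data is thread-local by default in D
    void* alloc(size_t size) { return malloc(size); } // placeholder for a real GC allocation
}

// Shared allocations bypass the GC and its lock entirely; everything else
// goes to the calling thread's own heap, so no global lock is ever taken.
void* allocate(bool isSharedAllocation, size_t size)
{
    return isSharedAllocation ? malloc(size)
                              : TLHeap.instance.alloc(size);
}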

Any thread-local data already has to be duplicated into a shared object 
before it can be used from another thread, and that lifetime is easy to 
manage manually.

// Pseudocode: dupShared/dupLocal are hypothetical deep-copy helpers between
// the thread-local and shared heaps, and receive is sketched (the real
// std.concurrency API would be closer to receiveOnly!T).
SomeTLS variable = new SomeTLS("Data");                       // thread-local GC allocation
shared SomeTLS variable2 = cast(shared) variable.dupShared(); // manually managed copy
Tid tid = spawn(&doSomething, variable2);
variable = receive!variable2(tid).dupLocal();                 // copy the result back locally
delete variable2;                                             // explicit deallocation

Programming with a syntax that makes use of shared objects, and forces 
manual management on those, seems to make "stop the world" a thing of 
the past. Any thoughts?
Nov 16 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
We'll have to change the way "immutable" is treated for 
allocations.  Which I think is a good thing.  Just because 
something can be shared doesn't mean that I intend to share it.
Nov 16 2014
parent reply Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 10:20, Sean Kelly wrote:
 We'll have to change the way "immutable" is treated for allocations.
 Which I think is a good thing.  Just because something can be shared
 doesn't mean that I intend to share it.
Exactly. I'm not sure how DMD currently handles immutable, but it should automatically be mangled into the global namespace, in the application's data segment.

If this seems feasible to everyone, I wouldn't mind forking the precise GC into a thread-local library, without any "stop the world" slowdown. A laptop with 4 cores running a multi-threaded application would (theoretically) go through the mark/collect process 4 times faster, and allocate unbelievably faster due to the absence of locks :)

The only problem is having to manually allocate shared objects, which seems fine because most of the time they'd be deallocated in a shared static ~this anyway.
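
For the manual-management part, here is a small sketch of the pattern 
hinted at above, with GlobalCache as a made-up type: the shared object 
never touches the GC heap and is released in shared static ~this.

import core.stdc.stdlib : malloc, free;

struct GlobalCache { ubyte[64] data; }

__gshared GlobalCache* cache;        // visible to all threads, never GC-managed

shared static this()
{
    cache = cast(GlobalCache*) malloc(GlobalCache.sizeof);
    *cache = GlobalCache.init;
}

shared static ~this()
{
    free(cache);                     // manual deallocation at program shutdown
    cache = null;
}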
Nov 16 2014
parent Etienne Cimon <etcimon gmail.com> writes:
This GC model also seems to work fine for locally-allocated __gshared 
objects. Since they're registered locally but available globally, 
they'll be collected once the thread that created them is gone.

Also, when an object is cast(shared) before being sent to another 
thread, it's usually still in scope once the other thread returns.

So the chances that existing code would break with a thread-local GC 
seem very thin.
Nov 16 2014
prev sibling next sibling parent reply "Xinok" <xinok live.com> writes:
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
 I always wondered why we would use the shared keyword on GC 
 allocations if only the stack can be optimized for TLS Storage.

 After thinking about how shared objects should work with the 
 GC, it's become obvious that the GC should be optimized for 
 local data. Anything shared would have to be manually managed, 
 because the biggest slowdown of all is stopping the world to 
 facilitate concurrency.

 With a precise GC on the way, it's become easy to filter out 
 allocations from shared objects. Simply proxy them through 
 malloc and get rid of the locks. Make the GC thread-local, 
 and you can expect it to scale with the number of processors.

 Any thread-local data should already have to be duplicated into 
 a shared object to be used from another thread, and the 
 lifetime is easy to manage manually.

 SomeTLS variable = new SomeTLS("Data");
 shared SomeTLS variable2 = cast(shared) variable.dupShared();
 Tid tid = spawn(&doSomething, variable2);
 variable = receive!variable2(tid).dupLocal();
 delete variable2;

 Programming with a syntax that makes use of shared objects, and 
 forces manual management on those, seems to make "stop the 
 world" a thing of the past. Any thoughts?
How about immutable data, which is implicitly shareable? Granted, you could destroy/free the data asynchronously, but you would still need to check all threads for references to that data.
Nov 16 2014
parent Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 10:21, Xinok wrote:
 How about immutable data which is implicitly shareable? Granted you can
 destroy/free the data asynchronously, but you would still need to check
 all threads for references to that data.
Immutable data would proxy through malloc and would not be scanned, since it can only contain immutable indirections that can be neither deleted nor scanned. It is also shared by every thread without any locking.

Currently, immutable data is global in storage but may be local in access rights, I think? I would have assumed it would automatically end up in the process's .rdata segment.
Nov 16 2014
prev sibling next sibling parent reply "Ola Fosheim Grøstad" writes:
On Sunday, 16 November 2014 at 13:58:19 UTC, Etienne wrote:
 After thinking about how shared objects should work with the 
 GC, it's become obvious that the GC should be optimized for 
 local data. Anything shared would have to be manually managed, 
 because the biggest slowdown of all is stopping the world to 
 facilitate concurrency.
If you go for thread-local garbage collection, then there is no reason not to be more general and support per-data-structure garbage collection as well. That's more useful: it can be used for collecting cycles in graphs. Just let the application initiate collection when there are no references pointing into it.

But keep in mind that you also have to account for fibers that move between threads.
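
As a rough illustration of the per-data-structure idea (names invented 
here, not an existing API): if a whole graph lives in its own arena, 
cycles inside it need no tracing at all, and the application releases 
everything in one call once it knows nothing points into the structure 
anymore.

import core.stdc.stdlib : malloc, free;

struct Node { Node*[8] edges; size_t edgeCount; }

struct GraphArena
{
    void*[] blocks;                  // every node allocated from this arena

    Node* makeNode()
    {
        auto p = cast(Node*) malloc(Node.sizeof);
        *p = Node.init;
        blocks ~= cast(void*) p;
        return p;
    }

    // Application-initiated collection: cycles between nodes don't matter,
    // the whole arena is released at once.
    void collect()
    {
        foreach (b; blocks) free(b);
        blocks = null;
    }
}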
Nov 16 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim Grøstad 
wrote:
 But keep in mind that you also have to account for fibers that 
 move between threads.
Yes. There are a lot of little "gotchas" with thread-local allocation.
Nov 16 2014
parent reply "Etienne" <etcimon gmail.com> writes:
On Sunday, 16 November 2014 at 17:40:30 UTC, Sean Kelly wrote:
 On Sunday, 16 November 2014 at 17:38:54 UTC, Ola Fosheim 
 Grøstad wrote:
 But keep in mind that you also have to account for fibers that 
 move between threads.
Yes. There are a lot of little "gotchas" with thread-local allocation.
I can't even think of a situation where this would be necessary. It sounds like all I would need is to take the precise GC and store one instance per thread in the thread data; I'd probably only need the rtinfo to see whether an allocation is shared, and proxy it towards malloc if so. Am I missing something?
Nov 16 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Sunday, 16 November 2014 at 19:13:27 UTC, Etienne wrote:
 I can't even think of a situation when this would be necessary. 
 It sounds like all I would need is to take the precise GC and 
 store each instance in the thread data, I'll probably only need 
 the rtinfo to see if it's shared during allocation to proxy 
 towards malloc. Am I missing something?
There is a reason why "elegant" GC languages pick one primary type of concurrency.

If you say that all code runs on a fiber and that there is no such thing as thread-local, then you can tie the local GC partition to the fiber and collect it on any thread.

If you say that functions called from a fiber sometimes call into global statespace, sometimes into thread statespace and sometimes into fiber statespace… then you need to figure out ownership for every allocation. Does the allocated object belong to a global database, a thread-local database, or a fiber cache which is flushed automatically when moving to a new thread? Or is it an extension of the fiber statespace that should be transparent to threads?
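
To make that classification concrete, a tiny sketch (all names invented 
for illustration) of the decision every allocation would need before it 
can be placed on the right heap:

enum Ownership { global, threadLocal, fiberLocal }

// The allocator would have to answer this for every request, e.g. from the
// precise type info (shared/immutable) and from the current execution context.
Ownership classify(bool isSharedOrImmutable, bool onMigratingFiber)
{
    if (isSharedOrImmutable) return Ownership.global;      // visible to all threads
    if (onMigratingFiber)    return Ownership.fiberLocal;  // must travel with its fiber
    return Ownership.threadLocal;                          // safe on a per-thread heap
}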
Nov 16 2014
parent reply Etienne Cimon <etcimon gmail.com> writes:
On 2014-11-16 19:32, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com> wrote:
 Does the allocated object belong to a global database, a thread local
 database or a fiber cache which is flushed automatically when moving to
 a new thread? Or is it an extension of the fiber statespace that should
 be transparent to threads?
I'm not sure what this means. Wouldn't the fiber stacks be saved in thread-local space when they yield? In turn, they'd become part of the thread-local stack space, I guess.

Overall, I'd put all the GC allocations through malloc the same way it is done right now. I don't see anything that needs to be done other than making multiple thread-local GC instances and removing the locks. I'm sure I'll find obstacles, but I don't see them right now; do you know of any that I should look out for?
Nov 16 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 17 November 2014 at 00:44:13 UTC, Etienne Cimon wrote:
 I'm not sure what this means, wouldn't the fiber stacks be 
 saved on the thread-local space when they yield? In turn, they 
 become part of the thread-local stack space I guess.
If you want performant, low-latency fibers you want load balancing, so they should not be affiliated with a thread but live in a pool. That said, fibers aren't a low-level construct like threads, so I am not sure whether they belong in system-level programming anyway.
 Overall, I'd put all the GC allocations through malloc the same 
 way it is right now. I don't see anything that needs to be done 
 other than make multiple thread-local GC instances and remove 
 the locks. I'm sure I'll find obstacles but I don't see them 
 right now, do you know of any that I should look out for?
Not if you work hard to ensure referential integrity.

I personally would find an optional GC more useful, where the programmer takes responsibility for collecting when the situation is right (specifying root pointers, the stack, etc.). I think the language should limit itself to generating the information that enables precise collection, then leave the rest to the programmer…
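
Something close to that workflow can already be approximated with 
druntime's core.memory.GC: suppress automatic collections and let the 
application decide when to collect, registering non-GC memory as a scan 
range where needed. A minimal sketch:

import core.memory : GC;
import core.stdc.stdlib : malloc, free;

void main()
{
    GC.disable();                    // suppress collections triggered by allocations

    enum size = 1024;
    void* buf = malloc(size);        // memory the GC knows nothing about...
    GC.addRange(buf, size);          // ...until it is registered as a range to scan
    scope (exit) { GC.removeRange(buf); free(buf); }

    // ... build up data structures ...

    GC.collect();                    // programmer-initiated collection
    GC.enable();
}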
Nov 16 2014
prev sibling parent reply "Kagamin" <spam here.lot> writes:
Previous thread: 
http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org
Nov 17 2014
parent Etienne <etcimon gmail.com> writes:
On 2014-11-17 9:45 AM, Kagamin wrote:
 Previous thread:
 http://forum.dlang.org/post/dnxgbumzenupviqymhrg forum.dlang.org
Looks somewhat similar, but the idea of a shared GC would defeat the purpose and end up complicating things. After another review of the problem, I've come up with some new observations:

- The shared data needs to be manually managed for a thread-local GC in order to scale with the number of CPU cores
- Anything instantiated as `new shared X` will have to proxy into a shared allocator interface or malloc
- All __gshared instances containing mutable indirections will cause undefined behavior
- All thread-local instances moved between threads using cast(shared) and carrying indirections will cause undefined behavior
- Immutable object values must not be allocated on the GC, and should be defined only in a shared static this constructor to ensure the values are available to all threads at all times

The only necessity is shared/non-shared type information during allocation and deallocation. The __gshared and cast(shared) issues are certainly the most daunting. This is why this GC would have to be optional, through a version(ThreadLocalGC).
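
A sketch of that opt-in switch, using invented placeholder names 
(tlAlloc, sharedAlloc, allocate): the thread-local path would only be 
compiled in when building with -version=ThreadLocalGC.

import core.stdc.stdlib : malloc;

void* tlAlloc(size_t size)     { return malloc(size); } // stand-in for the per-thread heap
void* sharedAlloc(size_t size) { return malloc(size); } // stand-in for the shared allocator/malloc

void* allocate(bool isShared, size_t size)
{
    version (ThreadLocalGC)
    {
        // `new shared X` proxies to the shared allocator; everything else
        // stays on the calling thread's own heap, with no global lock.
        return isShared ? sharedAlloc(size) : tlAlloc(size);
    }
    else
    {
        return sharedAlloc(size);    // default build: today's global GC behaviour
    }
}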
Nov 17 2014