digitalmars.D - [RFC]Proposal for better garbage collection

Benjamin Thaut (117/117) Feb 22 2012 As I'm not satisfied with the current GC D has and don't see the

H. S. Teoh (53/141) Feb 22 2012 Have you seen this?

Benjamin Thaut (34/173) Feb 22 2012 Yes I know about dgc it is better but still not on par with for example

H. S. Teoh (41/147) Feb 22 2012 I agree. Better compiler support would definitely be beneficial.
deadalnix (28/30) Feb 23 2012 This is the problem with your proposal. It doesn't consider pro and cons...

H. S. Teoh (19/34) Feb 23 2012 Yep. In one of my replies I considered the possibility of storing a

Martin Nowak (2/5) Feb 22 2012 COWing whole pages is about the last thing you need with a high mutation...
Jacob Carlborg (4/40) Feb 23 2012 Doesn't the "faster" example introduces an implicit scope?
Timon Gehr (2/6) Feb 25 2012 Those are the same thing. '{ }' is not what introduces a scope.

Dmitry Olshansky (19/136) Feb 22 2012 I think walking up the stack to collect this info again and again (the...

Dmitry Olshansky (3/14) Feb 22 2012 BTW check out ideas on GC
H. S. Teoh (26/36) Feb 22 2012 Yeah, this is not a good way to go. Better would be for each GC come

H. S. Teoh (26/35) Feb 22 2012 [...]

Benjamin Thaut (22/55) Feb 22 2012 But where would you know from which scope variables are still (or
Robert Jacques (2/35) Feb 23 2012 The break even point between bit-fields and pointers is 512 bytes. Altho...

Sean Kelly (20/29) Feb 22 2012 call into the runtime and passes the new value of the reference / =
H. S. Teoh (13/26) Feb 22 2012 [...]
Jonathan M Davis (4/7) Feb 22 2012 As I understand it, this is mitigated considerably on 64-bit platforms d...
H. S. Teoh (13/20) Feb 22 2012 [...]
Martin Nowak (5/13) Feb 22 2012 If we'd want per-thread mark/sweeping then shared memory must never own

Sean Kelly (7/22) Feb 23 2012 he

deadalnix (5/5) Feb 23 2012 You didn't mention what is the most important IMO.
Manu (20/115) Feb 23 2012 You say "every function needs a stack frame". Can you comment on this wi...
H. S. Teoh (10/13) Feb 23 2012 [...]

deadalnix (6/16) Feb 24 2012 This will work, but has serious drawbacks.

Artur Skawina (19/29) Feb 23 2012 No, unless you consider a segfault to be really nasty. :)
H. S. Teoh (8/34) Feb 23 2012 Well, segfaults are nasty, but there are nastier things. :)

Benjamin Thaut <code benjamin-thaut.de> writes:

As I'm not satisfied with the current GC D has and don't see the 
situation improving in the future without significant changes to the 
compiler I wrote the following document that points out all the possible 
issues with garbage collection I could think of and possible solutions 
for them. This document is just a draft and only a proposal, critics and 
comments are welcome.

Kind Regards
Benjamin Thaut


1) Preface

All modern Languages with fast and efficient GCs have build in support 
for garbage collection within the compiler / virtual machine. Currently 
the D language does have as much GC support as C++ has but fully relies 
on the GC with a lot of language features and the standard library. As a 
result the GC is slow, non parallel and always needs to stop all threads 
before collection. This document suggests to build in better support for 
garbage collection into the language so that creating a fast and 
efficient GC becomes possible. A list of features that would be required 
is described in this document.

2) Tracking references on the stack:

The D compiler always needs to emit a full stack frame so that the GC 
can walk up the stack at any time in the program. The stack frame of 
every function generated by the D compiler starts which a bitfield 
(usually the size of a machine register) where each bit indicates that 
these bytes are a pointer / reference. The bitfield needs to be large 
enough to cover the whole stack frame of the function.

For example on x86: 11001...
1 = bytes 0 to 4 are a pointer
1 = bytes 4 to 8 are a pointer
00 = bytes 8 to 16 are not a pointer
1 = bytes 16 to 20 are a pointer

The last bit indicates whether the bitfield is continued in the 
following bytes or not. 1 means continued 0 means finished.
Every scope generated by the D compiler would need additional code at 
the start and end of the scope. When the scope is entered the bitfield 
would be patched to represent the new variables inside the scope and 
when the scope is left the bitfield is patched again to remove the 
changes that were made on entering the scope.
Every time a function gets called that did not get generated by the D 
compiler ( C / C++ etc functions) the compiler generates a call into the 
runtime and passes the current stack pointer and stack base pointer to it.

void _d_externalCallStart(void* stackptr, void* baseptr);

Every time such a function returns the compiler   generates a call into 
the the runtime too.

void _d_externalCallEnd(void* stackptr, void* baseptr);

Every time a functions that can get called from other languages 
(extern(C) etc) are executed the end callback is inserted at the start 
of the functions and the start callback is inserted at the end of the 
function.
Using these callbacks the GC can mark certain parts of the stack as 
"non-D" and ignore them when scanning for bit fields and 
references/pointers. It can just skip parts of the stack that are 
"non-D" and therefore does not need a full stack frame within these 
"non-D" sections.
All these features are required so that the GC can precisely scan 
pointers/references on the stack and change them as necessary.
Remaining issues: The D compiler can freely move around value types on 
the stack. With such move operations it would be necessary to fix up all 
the bit fields. I needs to be investigated if this is doable.

3) Tracking references on the heap

For every class / struct a mixin template which is defined inside the 
runtime gets instantiated. This template can then use introspection to 
generate the necessary information to allow the GC to scan the pointers 
within that struct / class precisely.

4) Thread local / global memory

A callback into the runtime needs to happen in the following cases:
- a __gshared variable is assigned
- a reference / pointer is casted to immutable
- a reference / pointer is casted to shared

void _d_castToGlobalMem(void* ptr);

This can be used by the GC to keep thread local pools of memory and move 
a memory block to a global memory pool as soon as it is needed there.

5) pointer / reference changed callback

Every time a pointer / reference is changed the D compiler emits a call 
into the runtime and passes the new value of the reference / pointer 
with it.

void _d_pointerChanged(void *ptr);

This can be used when writing a generational GC to have separate pools 
for young and old generations. Every time the young generation needs to 
be collected it can be avoided to scan the old generations pool because 
it is sufficient to only check the pointers that have changed since the 
last time the young generation was collected. With the above mentioned 
callback it is easily possible to track these references.

Remaining issues:
-If the GC interrupts a thread right before any of the above mentioned 
callbacks happen it will cause a invalid state for the GC and the GC 
might access invalid pointers. It has to be investigated if this leads 
to invalid behavior. It can be fixed by not interrupting a thread but 
pause it the next time it calls any of callbacks, or other functions 
that can be interrupted by the GC. This in turn could cause a thread to 
be non pausable because it is stuck in a polling loop. The compiler 
could identify loops without any GC interruptible function and manually 
insert one.
-When pointers/references are passed inside processor registers the GC 
cannot know if these values are actually pointers/references or 
represent other values. If threads are paused at defined points in the 
code as mentioned before this issues would be fixed because the state at 
these points is known and can be handled accordingly.

6) Different interface for the GC

The current interface to the GC would have to change because the "this 
block of memory might contain a pointer" approach wouldn't work anymore. 
For example a block of memory and a delegate which iterates over all 
pointers within the memory block could be used for user allocated memory 
blocks. There should be a separate allocator function provided by the GC 
that allocates memory that does not get moved around so it can be used 
to pass it to non garbage collected code.

7) Compiler Options

Each of the above mentioned groups of features should be exposed as 
compiler options so that you can turn them on/off depending on which 
type of GC you use. Default on/off states for these features are set 
within a config file depending on which type of GC currently ships per 
default with druntime.

8) Conclusion

Garbage Collection brings a lot of advantages for the programmer using 
the language but is not free and shouldn't be treated as free. Full 
support for garbage collection is required to build a fast and efficient 
GC. This additional support requires additional features within the 
compiler but should result in a overall better performing language.

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
 As I'm not satisfied with the current GC D has and don't see the
 situation improving in the future without significant changes to the
 compiler I wrote the following document that points out all the
 possible issues with garbage collection I could think of and
 possible solutions for them. This document is just a draft and only
 a proposal, critics and comments are welcome.

Have you seen this?

	http://www.llucax.com.ar/proj/dgc/index.html


[...]
 2) Tracking references on the stack:
 
 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

This adds a lot of overhead to the runtime stack, esp. if you have deep
recursion. It's also not necessarily faster, since the GC now has to
parse a bitfield (a variable-length encoded bitfield, no less), instead
of just scanning words directly, which can be optimized by CPU-specific
microcode depending on the target platform.


[...]
 Every scope generated by the D compiler would need additional code
 at the start and end of the scope. When the scope is entered the
 bitfield would be patched to represent the new variables inside the
 scope and when the scope is left the bitfield is patched again to
 remove the changes that were made on entering the scope.

This would introduce quite a lot of overhead per scope. It will also
lead to strange things like:

	if (x) y();	// faster
	if (x) { y(); }	// slower

which will encourage people to omit {} after if, which makes code more
fragile and hard to read.


 Every time a function gets called that did not get generated by the
 D compiler ( C / C++ etc functions) the compiler generates a call
 into the runtime and passes the current stack pointer and stack base
 pointer to it.
 
 void _d_externalCallStart(void* stackptr, void* baseptr);
 
 Every time such a function returns the compiler   generates a call
 into the the runtime too.
 
 void _d_externalCallEnd(void* stackptr, void* baseptr);
 
 Every time a functions that can get called from other languages
 (extern(C) etc) are executed the end callback is inserted at the
 start of the functions and the start callback is inserted at the end
 of the function.
 Using these callbacks the GC can mark certain parts of the stack as
 "non-D" and ignore them when scanning for bit fields and
 references/pointers. It can just skip parts of the stack that are
 "non-D" and therefore does not need a full stack frame within these
 "non-D" sections.

This may not be a bad idea, though it does introduce some overhead when
you cross language boundaries.


 All these features are required so that the GC can precisely scan
 pointers/references on the stack and change them as necessary.
 Remaining issues: The D compiler can freely move around value types
 on the stack. With such move operations it would be necessary to fix
 up all the bit fields. I needs to be investigated if this is doable.

This can only make the GC slower, especially if it needs to update
variable-length encoded bitfields. Of course, you may be able to offset
this by making it possible to do real-time GC, (reduced throughput but
less waiting time for collection cycles) but that's a very complex
problem.


 3) Tracking references on the heap
 
 For every class / struct a mixin template which is defined inside
 the runtime gets instantiated. This template can then use
 introspection to generate the necessary information to allow the GC
 to scan the pointers within that struct / class precisely.

So basically you're proposing a compacting precise-scanning GC instead
of the current conservative GC. There are pros and cons in either
approach; it'd be nice if you could compare them.


[...]
 5) pointer / reference changed callback
 
 Every time a pointer / reference is changed the D compiler emits a
 call into the runtime and passes the new value of the reference /
 pointer with it.

This introduces a LOT of overhead, especially in a language like D which
manipulates a lot of pointers quite often (esp. if you use slices a
lot).


[...]
 -If the GC interrupts a thread right before any of the above mentioned
 callbacks happen it will cause a invalid state for the GC and the GC
 might access invalid pointers. It has to be investigated if this leads
 to invalid behavior. It can be fixed by not interrupting a thread but
 pause it the next time it calls any of callbacks, or other functions
 that can be interrupted by the GC.

This adds a lot of intermittent pauses in program execution.

The link I posted at the top has a GC implementation that doesn't
introduce *any* of this overhead (the GC runs concurrently with the
program), with no pause during a collection cycle (garbage is
incrementally collected when allocating new memory).


[...]
 6) Different interface for the GC
 
 The current interface to the GC would have to change because the
 "this block of memory might contain a pointer" approach wouldn't
 work anymore. For example a block of memory and a delegate which
 iterates over all pointers within the memory block could be used for
 user allocated memory blocks. There should be a separate allocator
 function provided by the GC that allocates memory that does not get
 moved around so it can be used to pass it to non garbage collected
 code.

It would be nice if there was a way for GCs to be pluggable, especially
in the compiler. Currently, we can only swap GCs that implement the same
interface as the existing one, but to switch to a different GC model
like you're proposing would require a lot of compiler support.


 7) Compiler Options
 
 Each of the above mentioned groups of features should be exposed as
 compiler options so that you can turn them on/off depending on which
 type of GC you use. Default on/off states for these features are set
 within a config file depending on which type of GC currently ships
 per default with druntime.

This would be nice.


 8) Conclusion
 
 Garbage Collection brings a lot of advantages for the programmer
 using the language but is not free and shouldn't be treated as free.

I don't know, but after reading this:

	http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

I think there might be a possibility of (almost) free GC.


 Full support for garbage collection is required to build a fast and
 efficient GC. This additional support requires additional features
 within the compiler but should result in a overall better performing
 language.

I agree in principle, although for specific GC proposals, we'd need to
evaluate the pros and cons to determine what is improved and what
degrades. Unfortunately, GC is a complex problem, and different GCs work
better with different apps. I wouldn't be so quick to make claims about
the performance of GCs. It depends on what the app does.


T

-- 
EMACS = Extremely Massive And Cumbersome System

Feb 22 2012

Benjamin Thaut <code benjamin-thaut.de> writes:

Am 22.02.2012 20:40, schrieb H. S. Teoh:
 On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
 As I'm not satisfied with the current GC D has and don't see the
 situation improving in the future without significant changes to the
 compiler I wrote the following document that points out all the
 possible issues with garbage collection I could think of and
 possible solutions for them. This document is just a draft and only
 a proposal, critics and comments are welcome.

 Have you seen this?

 	http://www.llucax.com.ar/proj/dgc/index.html

Yes I know about dgc it is better but still not on par with for example 
the GC that is shipped with the .NET 4.0
All I'm saying is that without propper support from the compiler we are 
not going to get GCs as good as in other modern languages.

 [...]
 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

 This adds a lot of overhead to the runtime stack, esp. if you have deep
 recursion. It's also not necessarily faster, since the GC now has to
 parse a bitfield (a variable-length encoded bitfield, no less), instead
 of just scanning words directly, which can be optimized by CPU-specific
 microcode depending on the target platform.

If you have a better idea for percise stack scanning I'm open for 
suggestions.

 [...]
 Every scope generated by the D compiler would need additional code
 at the start and end of the scope. When the scope is entered the
 bitfield would be patched to represent the new variables inside the
 scope and when the scope is left the bitfield is patched again to
 remove the changes that were made on entering the scope.

 This would introduce quite a lot of overhead per scope. It will also
 lead to strange things like:

 	if (x) y();	// faster
 	if (x) { y(); }	// slower

 which will encourage people to omit {} after if, which makes code more
 fragile and hard to read.

Scopeds that don't have variables declared inside them don't need the 
bitfield patching. so that argument is completely pointless. Scopes that 
contain varaibles that are not pointers or refrences also don't need the 
patching.

 Every time a function gets called that did not get generated by the
 D compiler ( C / C++ etc functions) the compiler generates a call
 into the runtime and passes the current stack pointer and stack base
 pointer to it.

 void _d_externalCallStart(void* stackptr, void* baseptr);

 Every time such a function returns the compiler   generates a call
 into the the runtime too.

 void _d_externalCallEnd(void* stackptr, void* baseptr);

 Every time a functions that can get called from other languages
 (extern(C) etc) are executed the end callback is inserted at the
 start of the functions and the start callback is inserted at the end
 of the function.
 Using these callbacks the GC can mark certain parts of the stack as
 "non-D" and ignore them when scanning for bit fields and
 references/pointers. It can just skip parts of the stack that are
 "non-D" and therefore does not need a full stack frame within these
 "non-D" sections.

 This may not be a bad idea, though it does introduce some overhead when
 you cross language boundaries.


 All these features are required so that the GC can precisely scan
 pointers/references on the stack and change them as necessary.
 Remaining issues: The D compiler can freely move around value types
 on the stack. With such move operations it would be necessary to fix
 up all the bit fields. I needs to be investigated if this is doable.

 This can only make the GC slower, especially if it needs to update
 variable-length encoded bitfields. Of course, you may be able to offset
 this by making it possible to do real-time GC, (reduced throughput but
 less waiting time for collection cycles) but that's a very complex
 problem.


 3) Tracking references on the heap

 For every class / struct a mixin template which is defined inside
 the runtime gets instantiated. This template can then use
 introspection to generate the necessary information to allow the GC
 to scan the pointers within that struct / class precisely.

 So basically you're proposing a compacting precise-scanning GC instead
 of the current conservative GC. There are pros and cons in either
 approach; it'd be nice if you could compare them.

I'm proposing a compacting percise-scanning generantional gc that has 
thread local pools and can scan these thread local pools without 
stopping the other threads. Also it will be able to collect young 
generations without the need to scan the old generations.

 [...]
 5) pointer / reference changed callback

 Every time a pointer / reference is changed the D compiler emits a
 call into the runtime and passes the new value of the reference /
 pointer with it.

 This introduces a LOT of overhead, especially in a language like D which
 manipulates a lot of pointers quite often (esp. if you use slices a
 lot).

I did not make this up, I know a smalltalk implementation that actually 
does this and is pretty efficient.

 [...]
 -If the GC interrupts a thread right before any of the above mentioned
 callbacks happen it will cause a invalid state for the GC and the GC
 might access invalid pointers. It has to be investigated if this leads
 to invalid behavior. It can be fixed by not interrupting a thread but
 pause it the next time it calls any of callbacks, or other functions
 that can be interrupted by the GC.

 This adds a lot of intermittent pauses in program execution.

Why should there be pauses, there is just a additional check in every 
callback to the gc there already is. When the gc wants to collect he 
sets the pause flag to true and waits until all required threads paused 
themselfs.

 The link I posted at the top has a GC implementation that doesn't
 introduce *any* of this overhead (the GC runs concurrently with the
 program), with no pause during a collection cycle (garbage is
 incrementally collected when allocating new memory).

Any non percise scanning algorithms will not be able to deal with memory 
fragmentation and will also have uneccessary overhead for scanning 
regions of memory that don't contain any pointers. Also they can leak 
memory because some int value has the same value as a pointer and 
therefore the gc does not free that block of memory.

 [...]
 6) Different interface for the GC

 The current interface to the GC would have to change because the
 "this block of memory might contain a pointer" approach wouldn't
 work anymore. For example a block of memory and a delegate which
 iterates over all pointers within the memory block could be used for
 user allocated memory blocks. There should be a separate allocator
 function provided by the GC that allocates memory that does not get
 moved around so it can be used to pass it to non garbage collected
 code.

 It would be nice if there was a way for GCs to be pluggable, especially
 in the compiler. Currently, we can only swap GCs that implement the same
 interface as the existing one, but to switch to a different GC model
 like you're proposing would require a lot of compiler support.


 7) Compiler Options

 Each of the above mentioned groups of features should be exposed as
 compiler options so that you can turn them on/off depending on which
 type of GC you use. Default on/off states for these features are set
 within a config file depending on which type of GC currently ships
 per default with druntime.

 This would be nice.


 8) Conclusion

 Garbage Collection brings a lot of advantages for the programmer
 using the language but is not free and shouldn't be treated as free.

 I don't know, but after reading this:

 	http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

 I think there might be a possibility of (almost) free GC.

There is no free GC. The only question is which trade offs you want to 
make. Modern implementations like the .NET 4.0 garbage collector show 
that all the things mentioned here are possible and are faster then 
primitve implementations.

 Full support for garbage collection is required to build a fast and
 efficient GC. This additional support requires additional features
 within the compiler but should result in a overall better performing
 language.

 I agree in principle, although for specific GC proposals, we'd need to
 evaluate the pros and cons to determine what is improved and what
 degrades. Unfortunately, GC is a complex problem, and different GCs work
 better with different apps. I wouldn't be so quick to make claims about
 the performance of GCs. It depends on what the app does.


 T

Fully agree on this

I want to add that I did not make all this up. Most of the mentioned 
features here are actually used in a Smalltalk implementation that 
compiles Smalltalk to C for faster execution.

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 22, 2012 at 08:53:45PM +0100, Benjamin Thaut wrote:
 Am 22.02.2012 20:40, schrieb H. S. Teoh:
On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
As I'm not satisfied with the current GC D has and don't see the
situation improving in the future without significant changes to the
compiler I wrote the following document that points out all the
possible issues with garbage collection I could think of and
possible solutions for them. This document is just a draft and only
a proposal, critics and comments are welcome.

Have you seen this?

	http://www.llucax.com.ar/proj/dgc/index.html

 
 Yes I know about dgc it is better but still not on par with for
 example the GC that is shipped with the .NET 4.0
 All I'm saying is that without propper support from the compiler we
 are not going to get GCs as good as in other modern languages.

I agree. Better compiler support would definitely be beneficial.


[...]
Every scope generated by the D compiler would need additional code
at the start and end of the scope. When the scope is entered the
bitfield would be patched to represent the new variables inside the
scope and when the scope is left the bitfield is patched again to
remove the changes that were made on entering the scope.

This would introduce quite a lot of overhead per scope. It will also
lead to strange things like:

	if (x) y();	// faster
	if (x) { y(); }	// slower

which will encourage people to omit {} after if, which makes code more
fragile and hard to read.

 
 Scopeds that don't have variables declared inside them don't need
 the bitfield patching. so that argument is completely pointless.
 Scopes that contain varaibles that are not pointers or refrences
 also don't need the patching.

That wasn't clear from your description. It makes more sense now.


[...]
5) pointer / reference changed callback

Every time a pointer / reference is changed the D compiler emits a
call into the runtime and passes the new value of the reference /
pointer with it.

This introduces a LOT of overhead, especially in a language like D which
manipulates a lot of pointers quite often (esp. if you use slices a
lot).

 
 I did not make this up, I know a smalltalk implementation that
 actually does this and is pretty efficient.

OK.


[...]
-If the GC interrupts a thread right before any of the above
mentioned callbacks happen it will cause a invalid state for the GC
and the GC might access invalid pointers. It has to be investigated
if this leads to invalid behavior. It can be fixed by not
interrupting a thread but pause it the next time it calls any of
callbacks, or other functions that can be interrupted by the GC.

This adds a lot of intermittent pauses in program execution.

 
 Why should there be pauses, there is just a additional check in
 every callback to the gc there already is. When the gc wants to
 collect he sets the pause flag to true and waits until all required
 threads paused themselfs.

Whereas with a scheme like dgc there is no need for threads to pause at
all.


The link I posted at the top has a GC implementation that doesn't
introduce *any* of this overhead (the GC runs concurrently with the
program), with no pause during a collection cycle (garbage is
incrementally collected when allocating new memory).

 
 Any non percise scanning algorithms will not be able to deal with
 memory fragmentation

There are ways to deal with this. Though, granted, they're imperfect.


 and will also have uneccessary overhead for scanning regions of memory
 that don't contain any pointers.

True. But if the scanning is running in its own thread anyway, and no
other thread needs to wait for it, then this doesn't really matter, does
it?


 Also they can leak memory because some int value has the same value as
 a pointer and therefore the gc does not free that block of memory.

Yes, this is definitely a problem. It's not easy to fix this in a
language like D, though, without adding some overhead.


[...]
8) Conclusion

Garbage Collection brings a lot of advantages for the programmer
using the language but is not free and shouldn't be treated as free.

I don't know, but after reading this:

	http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

I think there might be a possibility of (almost) free GC.

 
 There is no free GC. The only question is which trade offs you want
 to make. Modern implementations like the .NET 4.0 garbage collector
 show that all the things mentioned here are possible and are faster
 then primitve implementations.

True. But then again, D's GC is only a simple implementation. Just
because an advanced implementation of a GC beats D's GC doesn't
necessarily mean that that particular implementation's GC model is the
best.

But I'm not trying to argue against your proposal. I'm just saying we
should evaluate different GC models to see which one works best. But for
that we need a way to easily plug in different GC models, which requires
compiler support.


[...]
I agree in principle, although for specific GC proposals, we'd need
to evaluate the pros and cons to determine what is improved and what
degrades. Unfortunately, GC is a complex problem, and different GCs
work better with different apps. I wouldn't be so quick to make
claims about the performance of GCs. It depends on what the app does.


T

 
 Fully agree on this
 
 I want to add that I did not make all this up. Most of the mentioned
 features here are actually used in a Smalltalk implementation that
 compiles Smalltalk to C for faster execution.

The thing is, Smalltalk is different enough from D that it's hard to
draw conclusions about the performance of the GC when used in D just by
looking at its performance in Smalltalk. In Smalltalk, the programmer
doesn't have direct access to things like pointers and byte
representations of stuff. This allows the compiler to make optimizations
that would you couldn't safely do with a D program. It also means
programmers may implement the same algorithms differently in D than in
Smalltalk. So the memory usage patterns of a D program are quite
different from a Smalltalk program, and this will affect how a
particular GC behaves when applied to D.

Without actual testing we have no way to know for sure.

But regardless, we need better compiler support. No argument about that.
:)


T

-- 
Lottery: tax on the stupid. -- Slashdotter

Feb 22 2012

deadalnix <deadalnix gmail.com> writes:

Le 22/02/2012 20:53, Benjamin Thaut a �crit :
 If you have a better idea for percise stack scanning I'm open for
 suggestions.


This is the problem with your proposal. It doesn't consider pro and cons 
and actual data. It doesn't consider the alternatives. You go straight 
to � How can we do that ? � without condidering � should we do that ? �

What would be the impact of being precise on the heap but not on the stack ?

1/ It would add some false positives. The future being 64bits, False 
positive will be way less present than on 32bits machines. I did 
searched for numbers on that, but couldn't found them. Considering this 
is only on the stack, this may be neglectible (or not, but it 
definitively require data).

2/ Data pointed by the stack are not movable.Again, what is the impact 
of that. How much data could be promoted from young generation to old 
one (considering we have young and old gen). How much data couldn't be 
compacted ? What would the overhead on allocators ?

This definitively lack the required data and/or analysis of pro and cons.

Additionnaly, the stack is made like a linked list. Each function 
calling another one register the return address. With this information, 
we can have data about what is on the stack except for the very last 
function called with no runtime overhead. This is another alternative.

But you have to consider that, even with a mask, you are not sure that 
what is marked as a pointer is a pointer. A memory location can 
represent different thing during a function execution. So thoses values 
can only be considered as probable pointers, or we disable some compiler 
optimizations. As we cannot be sure, the point 2/ stay valid.

Granted the overhead of the operation, it ay not worth it. To know that, 
we need actual data on how much data is the stack is actually pointer, 
and how much false positive we get. As the future is 64bits, I'm not 
sure it is interesting for us.

Feb 23 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 23, 2012 at 06:01:48PM +0100, deadalnix wrote:
[...]
 Additionnaly, the stack is made like a linked list. Each function
 calling another one register the return address. With this
 information, we can have data about what is on the stack except for
 the very last function called with no runtime overhead. This is
 another alternative.

Yep. In one of my replies I considered the possibility of storing a
function ID on the stack, but that may not be necessary if the GC has
access to compile-time static info about each function, so just by
seeing the return address it knows which function it is, and can figure
out where the pointers are. (Of course there are other issues that need
to be addressed... but we can't decide on that without actual data.)


 But you have to consider that, even with a mask, you are not sure
 that what is marked as a pointer is a pointer. A memory location can
 represent different thing during a function execution. So thoses
 values can only be considered as probable pointers, or we disable
 some compiler optimizations. As we cannot be sure, the point 2/ stay
 valid.

I believe his proposal was for the function to manually update these
bits as it runs. It does introduce a lot of overhead. And like you said,
without actual hard data to show whether or not this overhead is
justified (offset by improved GC performance), how do we know that we
should do this at all? How do we know we aren't making it worse?


 Granted the overhead of the operation, it ay not worth it. To know
 that, we need actual data on how much data is the stack is actually
 pointer, and how much false positive we get. As the future is 64bits,
 I'm not sure it is interesting for us.

Actually, I believe David Simcha *is* considering the possibility of precise
scanning. But the proof is in the actual benchmarks. We don't know if it
will help or not unless we have real data to back it up.


T

-- 
Real Programmers use "cat > a.out".

Feb 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

 I don't know, but after reading this:

 	http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

 I think there might be a possibility of (almost) free GC.

COWing whole pages is about the last thing you need with a high mutation  
rate.

Feb 22 2012

Jacob Carlborg <doob me.com> writes:

On 2012-02-22 20:40, H. S. Teoh wrote:
 On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
 As I'm not satisfied with the current GC D has and don't see the
 situation improving in the future without significant changes to the
 compiler I wrote the following document that points out all the
 possible issues with garbage collection I could think of and
 possible solutions for them. This document is just a draft and only
 a proposal, critics and comments are welcome.

 Have you seen this?

 	http://www.llucax.com.ar/proj/dgc/index.html


 [...]
 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

 This adds a lot of overhead to the runtime stack, esp. if you have deep
 recursion. It's also not necessarily faster, since the GC now has to
 parse a bitfield (a variable-length encoded bitfield, no less), instead
 of just scanning words directly, which can be optimized by CPU-specific
 microcode depending on the target platform.


 [...]
 Every scope generated by the D compiler would need additional code
 at the start and end of the scope. When the scope is entered the
 bitfield would be patched to represent the new variables inside the
 scope and when the scope is left the bitfield is patched again to
 remove the changes that were made on entering the scope.

 This would introduce quite a lot of overhead per scope. It will also
 lead to strange things like:

 	if (x) y();	// faster
 	if (x) { y(); }	// slower

 which will encourage people to omit {} after if, which makes code more
 fragile and hard to read.

Doesn't the "faster" example introduces an implicit scope?

-- 
/Jacob Carlborg

Feb 23 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/22/2012 08:40 PM, H. S. Teoh wrote:
 This would introduce quite a lot of overhead per scope. It will also
 lead to strange things like:

 	if (x) y();	// faster
 	if (x) { y(); }	// slower

Those are the same thing. '{ }' is not what introduces a scope.

Feb 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 22.02.2012 22:56, Benjamin Thaut wrote:
 As I'm not satisfied with the current GC D has and don't see the
 situation improving in the future without significant changes to the
 compiler I wrote the following document that points out all the possible
 issues with garbage collection I could think of and possible solutions
 for them. This document is just a draft and only a proposal, critics and
 comments are welcome.

 Kind Regards
 Benjamin Thaut


 1) Preface

 All modern Languages with fast and efficient GCs have build in support
 for garbage collection within the compiler / virtual machine. Currently
 the D language does have as much GC support as C++ has but fully relies
 on the GC with a lot of language features and the standard library. As a
 result the GC is slow, non parallel and always needs to stop all threads
 before collection. This document suggests to build in better support for
 garbage collection into the language so that creating a fast and
 efficient GC becomes possible. A list of features that would be required
 is described in this document.

 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the GC
 can walk up the stack at any time in the program.

  I think walking up the stack to collect this info again and again (the 
stack has a lot of "heavy frames" on the bottom, right?) sounds like a 
tremendously slow way of getting necessary memory ranges. I'm no expert 
though.

The stack frame of
 every function generated by the D compiler starts which a bitfield
 (usually the size of a machine register) where each bit indicates that
 these bytes are a pointer / reference. The bitfield needs to be large
 enough to cover the whole stack frame of the function.

 For example on x86: 11001...
 1 = bytes 0 to 4 are a pointer
 1 = bytes 4 to 8 are a pointer
 00 = bytes 8 to 16 are not a pointer
 1 = bytes 16 to 20 are a pointer

 The last bit indicates whether the bitfield is continued in the
 following bytes or not. 1 means continued 0 means finished.
 Every scope generated by the D compiler would need additional code at
 the start and end of the scope. When the scope is entered the bitfield
 would be patched to represent the new variables inside the scope and
 when the scope is left the bitfield is patched again to remove the
 changes that were made on entering the scope.

Again I'm no expert, but what happens when GC starts collecting a thread 
stack amid this patching operation?

 Every time a function gets called that did not get generated by the D
 compiler ( C / C++ etc functions) the compiler generates a call into the
 runtime and passes the current stack pointer and stack base pointer to it.

 void _d_externalCallStart(void* stackptr, void* baseptr);

 Every time such a function returns the compiler generates a call into
 the the runtime too.

 void _d_externalCallEnd(void* stackptr, void* baseptr);

Why would you need these? And if you start calling callback on every 
operation, you may just as well pass direct ranges of memory to GC 
without stack walk.
+ you can call D function from C one that in turn calls D one, think 
extern(C) and callbacks.

 Every time a functions that can get called from other languages
 (extern(C) etc) are executed the end callback is inserted at the start
 of the functions and the start callback is inserted at the end of the
 function.
 Using these callbacks the GC can mark certain parts of the stack as
 "non-D" and ignore them when scanning for bit fields and
 references/pointers. It can just skip parts of the stack that are
 "non-D" and therefore does not need a full stack frame within these
 "non-D" sections.
 All these features are required so that the GC can precisely scan
 pointers/references on the stack and change them as necessary.
 Remaining issues: The D compiler can freely move around value types on
 the stack. With such move operations it would be necessary to fix up all
 the bit fields. I needs to be investigated if this is doable.

 3) Tracking references on the heap

 For every class / struct a mixin template which is defined inside the
 runtime gets instantiated. This template can then use introspection to
 generate the necessary information to allow the GC to scan the pointers
 within that struct / class precisely.

 4) Thread local / global memory

 A callback into the runtime needs to happen in the following cases:
 - a __gshared variable is assigned
 - a reference / pointer is casted to immutable
 - a reference / pointer is casted to shared

 void _d_castToGlobalMem(void* ptr);

 This can be used by the GC to keep thread local pools of memory and move
 a memory block to a global memory pool as soon as it is needed there.

 5) pointer / reference changed callback

 Every time a pointer / reference is changed the D compiler emits a call
 into the runtime and passes the new value of the reference / pointer
 with it.

Bye-bye any speed of p++ ? I mean I'm horrified, and I bet I'm not alone.

 void _d_pointerChanged(void *ptr);

 This can be used when writing a generational GC to have separate pools
 for young and old generations. Every time the young generation needs to
 be collected it can be avoided to scan the old generations pool because
 it is sufficient to only check the pointers that have changed since the
 last time the young generation was collected. With the above mentioned
 callback it is easily possible to track these references.

 Remaining issues:
 -If the GC interrupts a thread right before any of the above mentioned
 callbacks happen it will cause a invalid state for the GC and the GC
 might access invalid pointers. It has to be investigated if this leads
 to invalid behavior. It can be fixed by not interrupting a thread but
 pause it the next time it calls any of callbacks, or other functions
 that can be interrupted by the GC. This in turn could cause a thread to
 be non pausable because it is stuck in a polling loop. The compiler
 could identify loops without any GC interruptible function and manually
 insert one.
 -When pointers/references are passed inside processor registers the GC
 cannot know if these values are actually pointers/references or
 represent other values. If threads are paused at defined points in the
 code as mentioned before this issues would be fixed because the state at
 these points is known and can be handled accordingly.

 6) Different interface for the GC

 The current interface to the GC would have to change because the "this
 block of memory might contain a pointer" approach wouldn't work anymore.
 For example a block of memory and a delegate which iterates over all
 pointers within the memory block could be used for user allocated memory
 blocks. There should be a separate allocator function provided by the GC
 that allocates memory that does not get moved around so it can be used
 to pass it to non garbage collected code.

 7) Compiler Options

 Each of the above mentioned groups of features should be exposed as
 compiler options so that you can turn them on/off depending on which
 type of GC you use. Default on/off states for these features are set
 within a config file depending on which type of GC currently ships per
 default with druntime.

Combinatorial explosion of sets of options that doesn't necessary allow 
a particular GC?
 8) Conclusion

 Garbage Collection brings a lot of advantages for the programmer using
 the language but is not free and shouldn't be treated as free. Full
 support for garbage collection is required to build a fast and efficient
 GC. This additional support requires additional features within the
 compiler but should result in a overall better performing language.

Well, that was critic ;)

-- 
Dmitry Olshansky

Feb 22 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 23.02.2012 0:03, Dmitry Olshansky wrote:
 On 22.02.2012 22:56, Benjamin Thaut wrote:
 As I'm not satisfied with the current GC D has and don't see the
 situation improving in the future without significant changes to the
 compiler I wrote the following document that points out all the possible
 issues with garbage collection I could think of and possible solutions
 for them. This document is just a draft and only a proposal, critics and
 comments are welcome.

 Kind Regards
 Benjamin Thaut


BTW check out ideas on GC 
http://prowiki.org/wiki4d/wiki.cgi?GSOC_2012_Ideas

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 23, 2012 at 12:03:59AM +0400, Dmitry Olshansky wrote:
[...]
7) Compiler Options

Each of the above mentioned groups of features should be exposed as
compiler options so that you can turn them on/off depending on which
type of GC you use. Default on/off states for these features are set
within a config file depending on which type of GC currently ships
per default with druntime.

 
 Combinatorial explosion of sets of options that doesn't necessary
 allow a particular GC?

Yeah, this is not a good way to go. Better would be for each GC come
with a GC description file, that describes what hooks/info it needs from
the compiler.  Then the compiler can read this description file (passed
as a *single* compile flag) and do the right thing for that particular
GC. It can even automatically link in that particular GC without needing
you to specify anything further.

The GC description file can contain info like:

- Where/when to insert calls to GC functions (e.g., start collect cycle,
  start mark cycle)
- Which function to use for allocating memory
- Any additional info required:
  - e.g., map of each function's pointer local variables so that the GC
    knows where the roots are;
  - Whether or not hooks are needed for function entry/exit, and which
    GC function to map them to;
  - Any additional GC-specific info to insert into functions / structs /
    etc..
  - Whether or not pointer reads/writes need to include GC-specific code
    (the description can include code to insert, if needed).
- Path to GC source(s) to be compiled into the program.


T

-- 
If Java had true garbage collection, most programs would delete
themselves upon execution. -- Robert Sewell

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
[...]
 2) Tracking references on the stack:
 
 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

[...]

I was thinking about this a bit more, and I had an idea: why bother with
storing bitfields on the stack? Any function's local pointer variables
are known at compile-time. So store a function ID (probably a pointer)
that maps to some static storage where this information is stored. Then
we always only need 1 word of extra storage on the stack frame, and the
GC can follow the pointer to get the info it needs. A recursively called
function won't incur the cost of duplicated copies of bitfields, its ID
points to same place. You can even have two different functions share
the same ID if they have pointer variables in exactly the same places.

The static storage can then be an array of relative stack offsets to the
function's pointer variables, so the GC can easily use this info to find
roots. No need to complicate the GC with manipulating bitfields, it's
just an int[].

If you want to get fancy, have the compiler reorder local variables so
that pointers are clustered together in blocks, then in the static
storage you can just encode pointer blocks by offset+length. (Although
this may not help much with (pointer,length) pairs on the stack, like
slices.) Or the compiler can reorder variables to maximize ID merges.

The same thing can be done for scopes, since their local variables are
also all known at compile-time.


T

-- 
Spaghetti code may be tangly, but lasagna code is just cheesy.

Feb 22 2012

Benjamin Thaut <code benjamin-thaut.de> writes:

Am 22.02.2012 22:32, schrieb H. S. Teoh:
 On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
 [...]
 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

 [...]

 I was thinking about this a bit more, and I had an idea: why bother with
 storing bitfields on the stack? Any function's local pointer variables
 are known at compile-time. So store a function ID (probably a pointer)
 that maps to some static storage where this information is stored. Then
 we always only need 1 word of extra storage on the stack frame, and the
 GC can follow the pointer to get the info it needs. A recursively called
 function won't incur the cost of duplicated copies of bitfields, its ID
 points to same place. You can even have two different functions share
 the same ID if they have pointer variables in exactly the same places.

 The static storage can then be an array of relative stack offsets to the
 function's pointer variables, so the GC can easily use this info to find
 roots. No need to complicate the GC with manipulating bitfields, it's
 just an int[].

 If you want to get fancy, have the compiler reorder local variables so
 that pointers are clustered together in blocks, then in the static
 storage you can just encode pointer blocks by offset+length. (Although
 this may not help much with (pointer,length) pairs on the stack, like
 slices.) Or the compiler can reorder variables to maximize ID merges.

 The same thing can be done for scopes, since their local variables are
 also all known at compile-time.


 T

But where would you know from which scope variables are still (or 
already) valid and which are not?

void func()
{
   void* ptr = gc.alloc(...);
   //Ptr2, Ptr3 not valid yet
   void* ptr2 = gc.alloc(...);
   //ptr3 not valid yet
   {
     void* ptr3 = ptr1;
   }
   //ptr 3 not valid anymore
}

Also as the bitfiel is stored on the stack it will most likely ba 
already in the cache. Whereas with your approach scanning 1 stackframe 
would very likely also cause 1 cache miss because of the additional 
indirection. So if you are scanning 30 stack frames it will cause 30 
cache misses.

-- 
Kind Regards
Benjamin Thaut

Feb 22 2012

"Robert Jacques" <sandford jhu.edu> writes:

On Wed, 22 Feb 2012 15:32:37 -0600, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:
 On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
 [...]
 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the
 GC can walk up the stack at any time in the program. The stack frame
 of every function generated by the D compiler starts which a
 bitfield (usually the size of a machine register) where each bit
 indicates that these bytes are a pointer / reference. The bitfield
 needs to be large enough to cover the whole stack frame of the
 function.

 [...]

 I was thinking about this a bit more, and I had an idea: why bother with
 storing bitfields on the stack? Any function's local pointer variables
 are known at compile-time. So store a function ID (probably a pointer)
 that maps to some static storage where this information is stored. Then
 we always only need 1 word of extra storage on the stack frame, and the
 GC can follow the pointer to get the info it needs. A recursively called
 function won't incur the cost of duplicated copies of bitfields, its ID
 points to same place. You can even have two different functions share
 the same ID if they have pointer variables in exactly the same places.

 The static storage can then be an array of relative stack offsets to the
 function's pointer variables, so the GC can easily use this info to find
 roots. No need to complicate the GC with manipulating bitfields, it's
 just an int[].

 If you want to get fancy, have the compiler reorder local variables so
 that pointers are clustered together in blocks, then in the static
 storage you can just encode pointer blocks by offset+length. (Although
 this may not help much with (pointer,length) pairs on the stack, like
 slices.) Or the compiler can reorder variables to maximize ID merges.

 The same thing can be done for scopes, since their local variables are
 also all known at compile-time.


 T

The break even point between bit-fields and pointers is 512 bytes. Although, if
one is thinking about on stack storage this probably doesn't matter since for
alignment purposes you'll always end up using at least 1 word if not 2.
However, a lot of functions use less then 512 (or 1024) bytes of of stack
space. I'd think it would be much more space efficient to have a separate
bitfield for the stack. Cache efficiency should be about the same as a on stack
representation, and scanning would, in theory, be quicker. IIRC, a separate
bit-field was the approach used by at least one precise C GC.

Feb 23 2012

Sean Kelly <sean invisibleduck.org> writes:

On Feb 22, 2012, at 10:56 AM, Benjamin Thaut wrote:
=20
 5) pointer / reference changed callback
=20
 Every time a pointer / reference is changed the D compiler emits a =

call into the runtime and passes the new value of the reference / =
pointer with it.
=20
 void _d_pointerChanged(void *ptr);

D can call assembler, C routines like memset(), plain old opaque C =
library code, etc.  What should the D compiler do in light of all the =
sources of memory changes that it can't monitor?

 6) Different interface for the GC
=20
 The current interface to the GC would have to change because the "this =

block of memory might contain a pointer" approach wouldn't work anymore. =
For example a block of memory and a delegate which iterates over all =
pointers within the memory block could be used for user allocated memory =
blocks. There should be a separate allocator function provided by the GC =
that allocates memory that does not get moved around so it can be used =
to pass it to non garbage collected code.

I posted a suggested new GC interface do the runtime mailing list 6 or =
so months ago.  In short, I do think the current interface is lacking.  =
Also, CDGC does support precise scanning and runs with Druntime.  The =
big problem there is that CDGC is based on the Tango GC (where =
Druntime's GC originated) and someone needs to review all the GC changes =
since the Druntime project was created to see what may need to be merged =
into CDGC.  I started on this once, but it turned out to be more work =
than I had time for.=

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 22, 2012 at 03:05:38PM -0800, Sean Kelly wrote:
 On Feb 22, 2012, at 10:56 AM, Benjamin Thaut wrote:
 
 5) pointer / reference changed callback
 
 Every time a pointer / reference is changed the D compiler emits a
 call into the runtime and passes the new value of the reference /
 pointer with it.
 
 void _d_pointerChanged(void *ptr);

 
 D can call assembler, C routines like memset(), plain old opaque C
 library code, etc.  What should the D compiler do in light of all the
 sources of memory changes that it can't monitor?

[...]

Yeah, the GC should be capable of dealing with non- safe code. Otherwise
it would just be too limited to be used for large non-trivial D
projects.

But still, some benchmarks do appear to be showing signs of a large
performance hit on the GC when there happens to be many integers that
look like valid pointers. This may be beyond the programmer's control,
since it could be the OS that gives the GC an address segment that just
happens to span integer values very commonly used throughout the app.


T

-- 
What doesn't kill me makes me stranger.

Feb 22 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, February 22, 2012 16:45:29 H. S. Teoh wrote:
 But still, some benchmarks do appear to be showing signs of a large
 performance hit on the GC when there happens to be many integers that
 look like valid pointers.

As I understand it, this is mitigated considerably on 64-bit platforms due to 
the large pointer size.

- Jonathan M Davis

Feb 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 22, 2012 at 07:48:38PM -0500, Jonathan M Davis wrote:
 On Wednesday, February 22, 2012 16:45:29 H. S. Teoh wrote:
 But still, some benchmarks do appear to be showing signs of a large
 performance hit on the GC when there happens to be many integers that
 look like valid pointers.

 
 As I understand it, this is mitigated considerably on 64-bit platforms
 due to the large pointer size.

[...]

True. You'd have to deliberately want to break the GC in order for
coincidental integer values to cause a significant problem.

But this problem could be a major issue if D was to become a significant
player on handhelds, which, if I understand correctly, are still largely
running 32-bit CPUs.

It's also a major problem on 16-bit platforms, but the only use of those
that I can conceive of are toy applications so it's probably not worth
the consideration. :)


T

-- 
That's not a bug; that's a feature!

Feb 22 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

 4) Thread local / global memory

 A callback into the runtime needs to happen in the following cases:
 - a __gshared variable is assigned
 - a reference / pointer is casted to immutable
 - a reference / pointer is casted to shared

 void _d_castToGlobalMem(void* ptr);

 This can be used by the GC to keep thread local pools of memory and move  
 a memory block to a global memory pool as soon as it is needed there.

If we'd want per-thread mark/sweeping then shared memory must never own
unshared memory. I think this could be done using a separate allocator for
shared/immutable data. For casts this would require a transitive move of  
the
data or it'd need to be prohibited.

Feb 22 2012

Sean Kelly <sean invisibleduck.org> writes:

On Feb 22, 2012, at 8:07 PM, "Martin Nowak" <dawg dawgfoto.de> wrote:

 4) Thread local / global memory
=20
 A callback into the runtime needs to happen in the following cases:
 - a __gshared variable is assigned
 - a reference / pointer is casted to immutable
 - a reference / pointer is casted to shared
=20
 void _d_castToGlobalMem(void* ptr);
=20
 This can be used by the GC to keep thread local pools of memory and move a=


 memory block to a global memory pool as soon as it is needed there.
=20

 If we'd want per-thread mark/sweeping then shared memory must never own
 unshared memory. I think this could be done using a separate allocator for=

 shared/immutable data. For casts this would require a transitive move of t=

he
 data or it'd need to be prohibited.

Casting to/from shared needs a bit more logic anyway, so the proper thread f=
inalizes unshared objects. Casting away shared clears the block's owner and v=
ice versa. Nice perk to this is casting away shared could then detect a shar=
ing violation if a block already has another owner.=20=

Feb 23 2012

deadalnix <deadalnix gmail.com> writes:

You didn't mention what is the most important IMO.

In D, most data are thread local. Shared data are either shared or 
immutable.

Both thread local data and immutable data lead to very interesting GC 
optimisations. This is where we need language support.

Feb 23 2012

Manu <turkeyman gmail.com> writes:

On 22 February 2012 20:56, Benjamin Thaut <code benjamin-thaut.de> wrote:

 2) Tracking references on the stack:

 The D compiler always needs to emit a full stack frame so that the GC can
 walk up the stack at any time in the program.


You say "every function needs a stack frame". Can you comment on this with
respect to leaf functions? No leaf function should ever generate a stack
frame, infact, typical leaf functions will never touch memory at all. This
is VERY IMPORTANT for critical loops. I have never worked on a project
where I did not depend on the performance of leaf functions to do the hard
work in the most critical parts of my application.
Obviously such functions would not be making allocations, and shouldn't be
interacting with the GC any way, so why is having a stack frame important?


 The stack frame of every function generated by the D compiler starts which
 a bitfield (*usually the size of a machine register*)...

Oh really? And how do we define that type? ;) (*cough* reference to
size_t/ptrdiff_t thread)



 For example on x86: 11001...
 1 = bytes 0 to 4 are a pointer
 1 = bytes 4 to 8 are a pointer
 00 = bytes 8 to 16 are not a pointer
 1 = bytes 16 to 20 are a pointer

 The last bit indicates whether the bitfield is continued in the following
 bytes or not. 1 means continued 0 means finished.
 Every scope generated by the D compiler would need additional code at the
 start and end of the scope. When the scope is entered the bitfield would be
 patched to represent the new variables inside the scope and when the scope
 is left the bitfield is patched again to remove the changes that were made
 on entering the scope.
 Every time a function gets called that did not get generated by the D
 compiler ( C / C++ etc functions) the compiler generates a call into the
 runtime and passes the current stack pointer and stack base pointer to it.

 void _d_externalCallStart(void* stackptr, void* baseptr);

 Every time such a function returns the compiler   generates a call into
 the the runtime too.

 void _d_externalCallEnd(void* stackptr, void* baseptr);

Can you comment on what those functions will actually do? It definitely
sounds very worrying to me to be turning every function call into THREE
calls.
Calling into extern code is certainly not rare... almost everything of any
use is an extern C lib. What about interaction with the OS? Trivial libs
like zlib? etc...

I wonder if there are alternative ways to detect a foreign stack. And I'm
not sure why it even matters, you can't depend on the extern ABI, how do
you unwind the stack reliably in the first place?


 Every time a functions that can get called from other languages (extern(C)
 etc) are executed the end callback is inserted at the start of the
 functions and the start callback is inserted at the end of the function.
 Using these callbacks the GC can mark certain parts of the stack as
 "non-D" and ignore them when scanning for bit fields and
 references/pointers. It can just skip parts of the stack that are "non-D"
 and therefore does not need a full stack frame within these "non-D"
 sections.
 All these features are required so that the GC can precisely scan
 pointers/references on the stack and change them as necessary.
 Remaining issues: The D compiler can freely move around value types on the
 stack. With such move operations it would be necessary to fix up all the
 bit fields. I needs to be investigated if this is doable.

 3) Tracking references on the heap

 For every class / struct a mixin template which is defined inside the
 runtime gets instantiated. This template can then use introspection to
 generate the necessary information to allow the GC to scan the pointers
 within that struct / class precisely.

 4) Thread local / global memory

 A callback into the runtime needs to happen in the following cases:
 - a __gshared variable is assigned
 - a reference / pointer is casted to immutable
 - a reference / pointer is casted to shared

 void _d_castToGlobalMem(void* ptr);

 This can be used by the GC to keep thread local pools of memory and move a
 memory block to a global memory pool as soon as it is needed there.

 5) pointer / reference changed callback

 Every time a pointer / reference is changed the D compiler emits a call
 into the runtime and passes the new value of the reference / pointer with
 it.

 void _d_pointerChanged(void *ptr);

 This can be used when writing a generational GC to have separate pools for
 young and old generations. Every time the young generation needs to be
 collected it can be avoided to scan the old generations pool because it is
 sufficient to only check the pointers that have changed since the last time
 the young generation was collected. With the above mentioned callback it is
 easily possible to track these references.

 Remaining issues:
 -If the GC interrupts a thread right before any of the above mentioned
 callbacks happen it will cause a invalid state for the GC and the GC might
 access invalid pointers. It has to be investigated if this leads to invalid
 behavior. It can be fixed by not interrupting a thread but pause it the
 next time it calls any of callbacks, or other functions that can be
 interrupted by the GC. This in turn could cause a thread to be non pausable
 because it is stuck in a polling loop. The compiler could identify loops
 without any GC interruptible function and manually insert one.
 -When pointers/references are passed inside processor registers the GC
 cannot know if these values are actually pointers/references or represent
 other values. If threads are paused at defined points in the code as
 mentioned before this issues would be fixed because the state at these
 points is known and can be handled accordingly.

 6) Different interface for the GC

 The current interface to the GC would have to change because the "this
 block of memory might contain a pointer" approach wouldn't work anymore.
 For example a block of memory and a delegate which iterates over all
 pointers within the memory block could be used for user allocated memory
 blocks. There should be a separate allocator function provided by the GC
 that allocates memory that does not get moved around so it can be used to
 pass it to non garbage collected code.

 7) Compiler Options

 Each of the above mentioned groups of features should be exposed as
 compiler options so that you can turn them on/off depending on which type
 of GC you use. Default on/off states for these features are set within a
 config file depending on which type of GC currently ships per default with
 druntime.

 8) Conclusion

 Garbage Collection brings a lot of advantages for the programmer using the
 language but is not free and shouldn't be treated as free. Full support for
 garbage collection is required to build a fast and efficient GC. This
 additional support requires additional features within the compiler but
 should result in a overall better performing language.

Feb 23 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 23, 2012 at 01:51:31PM +0200, Manu wrote:
[...]
 I wonder if there are alternative ways to detect a foreign stack. And
 I'm not sure why it even matters, you can't depend on the extern ABI,
 how do you unwind the stack reliably in the first place?

[...]

This is a bit off-topic, but what happens in the current implementation
if you pass a D callback to a C function, and then throw an exception
from the callback? Does it work? Or does it do something really nasty?


T

-- 
Elegant or ugly code as well as fine or rude sentences have something in
common: they don't depend on the language. -- Luca De Vitis

Feb 23 2012

deadalnix <deadalnix gmail.com> writes:

Le 23/02/2012 20:58, H. S. Teoh a �crit :
 On Thu, Feb 23, 2012 at 01:51:31PM +0200, Manu wrote:
 [...]
 I wonder if there are alternative ways to detect a foreign stack. And
 I'm not sure why it even matters, you can't depend on the extern ABI,
 how do you unwind the stack reliably in the first place?

 [...]

 This is a bit off-topic, but what happens in the current implementation
 if you pass a D callback to a C function, and then throw an exception
 from the callback? Does it work? Or does it do something really nasty?


 T

This will work, but has serious drawbacks.

First of all, you are not sure the C function will release all resources 
(free, fclose, etc . . .). So you cannot be sure of the state of your 
program.

This is something that you don't want to do.

Feb 24 2012

Artur Skawina <art.08.09 gmail.com> writes:

On 02/23/12 20:58, H. S. Teoh wrote:
 On Thu, Feb 23, 2012 at 01:51:31PM +0200, Manu wrote:
 [...]
 I wonder if there are alternative ways to detect a foreign stack. And
 I'm not sure why it even matters, you can't depend on the extern ABI,
 how do you unwind the stack reliably in the first place?

 [...]
 
 This is a bit off-topic, but what happens in the current implementation
 if you pass a D callback to a C function, and then throw an exception
 from the callback? Does it work? Or does it do something really nasty?

No, unless you consider a segfault to be really nasty. :)
Actually, it mostly works - i just tried it in a gtk app, and it works as
long as you catch the exception and only look at the error msg. If you don't
catch it (or try writeln(e) etc), then the result is something like:

action.Action!(int).Action.registerNS.MissingActionEx action.d(54): Action
"GUI" missing symbol 'int AppWindowClosed()'
----------------
./gtkapp() [0x8054366]
/usr/lib/i686/sse2/libgtk-x11-2.0.so.0(+0x153052) [0xf7375052]
/usr/lib/i686/sse2/libgobject-2.0.so.0(g_closure_invoke+0x19b) [0xf71e25fd]
/usr/lib/i686/sse2/libgobject-2.0.so.0(+0x1ddc8) [0xf71f2dc8]
/usr/lib/i686/sse2/libgobject-2.0.so.0(g_signal_emit_valist+0x59a) [0xf71fa37f]
/usr/lib/i686/sse2/libgobject-2.0.so.0(g_signal_emit+0x34) [0xf71fa61f]
/usr/lib/i686/sse2/libgtk-x11-2.0.so.0(+0x2a17f3) [0xf74c37f3]
/usr/lib/i686/sse2/libgtk-x11-2.0.so.0(gtk_main_do_event+0x8e6) [0xf7373856]
Segmentation fault

So something appears to get confused while walking the stack; another thing to 
investigate later, i guess...

artur

Feb 23 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 23, 2012 at 09:18:22PM +0100, Artur Skawina wrote:
 On 02/23/12 20:58, H. S. Teoh wrote:

[...]
 This is a bit off-topic, but what happens in the current
 implementation if you pass a D callback to a C function, and then
 throw an exception from the callback? Does it work? Or does it do
 something really nasty?

 
 No, unless you consider a segfault to be really nasty. :)

Well, segfaults are nasty, but there are nastier things. :)


 Actually, it mostly works - i just tried it in a gtk app, and it works
 as long as you catch the exception and only look at the error msg. If
 you don't catch it (or try writeln(e) etc), then the result is
 something like:
 
 action.Action!(int).Action.registerNS.MissingActionEx action.d(54): Action
"GUI" missing symbol 'int AppWindowClosed()'
 ----------------
 ./gtkapp() [0x8054366]
 /usr/lib/i686/sse2/libgtk-x11-2.0.so.0(+0x153052) [0xf7375052]
 /usr/lib/i686/sse2/libgobject-2.0.so.0(g_closure_invoke+0x19b) [0xf71e25fd]
 /usr/lib/i686/sse2/libgobject-2.0.so.0(+0x1ddc8) [0xf71f2dc8]
 /usr/lib/i686/sse2/libgobject-2.0.so.0(g_signal_emit_valist+0x59a) [0xf71fa37f]
 /usr/lib/i686/sse2/libgobject-2.0.so.0(g_signal_emit+0x34) [0xf71fa61f]
 /usr/lib/i686/sse2/libgtk-x11-2.0.so.0(+0x2a17f3) [0xf74c37f3]
 /usr/lib/i686/sse2/libgtk-x11-2.0.so.0(gtk_main_do_event+0x8e6) [0xf7373856]
 Segmentation fault
 
 So something appears to get confused while walking the stack; another
 thing to investigate later, i guess...

[...]

Looks like it got confused at the cross-language boundary.


T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com

Feb 23 2012

D Programming

C/C++ Programming

Other

digitalmars.D - [RFC]Proposal for better garbage collection