
digitalmars.D - TempAlloc and druntime GC

reply dsimcha <dsimcha yahoo.com> writes:
For those of you who aren't familiar with TempAlloc, it's a region-based
memory allocator that's essentially a second stack for allocating memory in a
last in, first out manner.  It is based on an idea proposed on this newsgroup
a few months back by Andrei under the name SuperStack, and was later
implemented by me.  The goal is to provide fast thread-local pointer-bumping
memory allocation for objects whose lifetime is bound to a scope, while
allowing the whole heap, instead of the relatively limited call stack space,
to be used.  According to some benchmarks I've done, TempAlloc can be anywhere
between 30% and 10s of times faster than the druntime GC allocator, depending
on several factors, the biggest being lock contention for the druntime
allocator.
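The mechanism is simple enough to sketch. Below is a minimal C illustration of the core idea (pointer-bump allocation from a large region, freed in LIFO order); the names, the single-chunk limitation, and the error handling are illustrative only, not TempAlloc's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the TempAlloc idea: a large region carved from
 * the heap, handed out by pointer bumping, released last in, first out.
 * Real code would chain multiple chunks instead of failing when full. */
enum { REGION_SIZE = 4 * 1024 * 1024 };   /* TempAlloc uses 4 MB chunks */

static char  *region;      /* start of the current chunk  */
static size_t offset;      /* bump pointer into the chunk */

static void *temp_alloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (offset + n > REGION_SIZE)
        return NULL;                       /* real code grabs a new chunk */
    void *p = region + offset;
    offset += n;
    return p;
}

static void temp_free(size_t n)            /* LIFO: pop the last block */
{
    offset -= (n + 15) & ~(size_t)15;
}
```

Allocation is a couple of arithmetic operations with no lock, which is where the speedup over a contended GC allocator comes from.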

TempAlloc's biggest weakness is that nothing allocated by TempAlloc is scanned
by the GC, making it impossible to safely store the only reference to
GC-allocated objects inside TempAlloc-allocated space.  I've been using
TempAlloc in my own personal projects for a while now, and have not found this
to be a significant limitation in practice.  Usually, if I'm using TempAlloc,
the whole point is to avoid generating GC heap activity within the function
I'm working in, meaning that I don't create any new GC-allocated objects in
that function.  If the objects were created before calling that function,
there's another reference to them somewhere.  However, for people who are not
as familiar with how TempAlloc is implemented, or with how GC is implemented,
it may be easy to screw up, making TempAlloc a shoe with a gun built in and
probably unsuitable for inclusion in Phobos or Tango.

I've been thinking about how to deal with this issue and make TempAlloc safe
enough for inclusion in the standard library.  Since it works by allocating
large (currently 4 MB) regions from the C heap, and then sub-allocating these
regions to its clients, simply scanning these whole chunks is not a
reasonable option, as it would lead to way too many false pointers.  Using the
GC's addRegion/removeRegion interface at every TempAlloc allocation is also
not a reasonable option, as this operation requires a lock and would negate
any performance gains from using TempAlloc instead of the GC heap.  The only
alternative I see is to make the GC aware of TempAlloc at a very low level.
I've looked at the GC code and this doesn't look terribly hard to do.  Using
some bit twiddling tricks, I can do the bookkeeping to determine which
TempAlloc allocated objects should be scanned by the GC w/o any additional
space overhead.

The downside to this is that TempAlloc would be very tightly coupled to
druntime. Furthermore, if Sean and the other druntime maintainers decided they
didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
some feedback before I bother doing anything.  Do you think that TempAlloc is
universally useful enough to belong in the standard library?  If so, do you
believe that it is only suitable for inclusion if it can be made safe w/
respect to storing the only reference to GC-allocated objects?  If so, is it
worth tightly coupling it to the GC, and should I start working on this and
submit a patch to druntime?
Jan 18 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
Jan 19 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
Can you elaborate on this idea? I don't understand it, so I can't comment on whether it would work.
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
It seems like you want to create some kind of more abstract interface to solve the more general problem of interfacing stuff with the GC, rather than just integrating TempAlloc directly at a low level.

One way this could be done (not necessarily the best way, but the best I've thought of so far) is to make it possible to register class objects conforming to some specific range interface (an explicit runtime interface w/ virtual functions, not an implicit compile-time interface, since we care about the ABI here) with the GC.  These objects would return a pointer at each iteration, and then the GC would mark whatever was pointed to by that pointer.  This would allow me to use some hacks to keep TempAlloc efficient, while keeping these hacks well-encapsulated and out of the core GC code.

In TempAlloc's case specifically, every time a new thread's TempAlloc state was initialized, it would register one of these objects with the GC.  This object would contain the thread's ID internally, and be set to return empty immediately if the thread was already dead.  (TempAlloc-allocated data is absolutely NOT meant to be shared between threads.)  Otherwise, it would scan through the currently allocated elements and provide the GC with the pointers that are located there, through the range interface.

I'm not sure if this would also work for DLL stuff since I know nothing about that problem.  However, if this would solve the other cases you have in mind, I think it's an elegant solution.
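The proposed range interface can be sketched in C with a struct of function pointers standing in for D's virtual interface (all names here are hypothetical; the real thing would be a D class interface in druntime):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the "range interface" idea: the GC drains any registered
 * range object, marking one pointer per iteration until it is empty. */
typedef struct PtrRange {
    int   (*empty)(struct PtrRange *self);
    void *(*front)(struct PtrRange *self);
    void  (*pop)(struct PtrRange *self);
} PtrRange;

/* The GC's mark loop would consume a registered range like this: */
static size_t gc_scan_range(PtrRange *r)
{
    size_t marked = 0;
    while (!r->empty(r)) {
        void *p = r->front(r);   /* a real GC would mark the block at p */
        (void)p;
        marked++;
        r->pop(r);
    }
    return marked;
}

/* One concrete range: iterates a plain array of pointers, the shape
 * TempAlloc's internal bookkeeping array already has. */
typedef struct {
    PtrRange base;               /* must be first: allows the casts below */
    void   **items;
    size_t   len, i;
} ArrayRange;

static int   arr_empty(PtrRange *r) { ArrayRange *a = (ArrayRange *)r; return a->i >= a->len; }
static void *arr_front(PtrRange *r) { ArrayRange *a = (ArrayRange *)r; return a->items[a->i]; }
static void  arr_pop  (PtrRange *r) { ArrayRange *a = (ArrayRange *)r; a->i++; }
```

A dead thread's range would simply report `empty` immediately, matching the behavior described above.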
Jan 19 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Sean Kelly (sean invisibleduck.org)'s article
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
Can you elaborate on this idea? I don't understand it, so I can't comment on whether it would work.
The data range of every execution context in druntime is tracked by the GC via a linked-list of structs that look roughly like this:

struct Context
{
    void*    begin;
    void*    end;
    Context* next;
}

The stack of each kernel thread has one, as does each user thread (Fiber).  Currently, this is all internal, but I could expose an API along these lines:

void attachContext( Context* );
void detachContext( Context* );

Each TempAlloc range would have its own Context struct which it would attach on creation and detach upon destruction.  You're still paying for a mutex call to attach and detach the struct, but the begin and end pointers can be modified at will, which gets rid of the problem with addRange, etc.  I suppose I could add this to the interface of GC as well, instead of sticking yet more logic in core.thread.
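A minimal C sketch of that Context list (the lock is elided; in druntime the attach/detach would happen under the GC's mutex, while begin/end can then move freely):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the Context scheme: the GC keeps a singly linked list of
 * (begin, end) ranges to scan; attach/detach splice nodes in and out. */
typedef struct Context {
    void           *begin;
    void           *end;
    struct Context *next;
} Context;

static Context *context_list;   /* head of the GC's scan list */

static void attachContext(Context *c)   /* would take the GC lock */
{
    c->next = context_list;
    context_list = c;
}

static void detachContext(Context *c)   /* would take the GC lock */
{
    for (Context **p = &context_list; *p; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            return;
        }
    }
}
```

Because the GC only dereferences begin/end while the world is stopped, the owning thread can bump them on every allocation without touching the lock, which is the whole point versus addRange/removeRange.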
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
It seems like you want to create some kind of more abstract interface to solve the more general problem of interfacing stuff with the GC, rather than just integrating TempAlloc directly at a low level. One way this could be done (not necessarily the best way, but the best I've thought of so far) is to make it possible to register class objects conforming to some specific range interface (an explicit runtime interface w/ virtual functions, not an implicit compile-time interface, since we care about the ABI here) with the GC. These objects would return a pointer at each iteration, and then the GC would mark whatever was pointed to by that pointer. This would allow me to use some hacks to keep TempAlloc efficient, while keeping these hacks well-encapsulated and out of the core GC code.
That's basically how the Context stuff works above.
 In TempAlloc's case specifically, every time a new thread's TempAlloc state was
 initialized, it would register one of these objects with the GC.  This object
 would contain the thread's ID internally, and be set to return empty immediately
 if the thread was already dead.  (TempAlloc-allocated data is absolutely NOT
 meant to be shared between threads.)  Otherwise, it would scan through the
 currently allocated elements and provide the GC with the pointers that are
 located there, through the range interface.
Hm... this may be more tricky than I had outlined above. If I were to do this I'd probably expose it from GC rather than allow access to the existing context mechanism. However... I /could/ possibly add an onExit hook for the Thread object. I'll have to think about whether that could cause any problems. Sean
Jan 19 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 The data range of every execution context in druntime is tracked by the GC
 via a linked-list of structs that look roughly like this:
 struct Context
 {
     void*        begin;
     void*        end;
     Context* next;
 }
 The stack of each kernel thread has one, as does each user thread (Fiber).
 Currently, this is all internal, but I could expose an API along these lines:
 void attachContext( Context* );
 void detachContext( Context* );
 Each ThreadAlloc range would have its own Context struct which it would attach
 on creation and detach upon destruction.  You're still paying for a mutex call
 to attach and detach the struct, but the begin and end pointers can be modified
 at will, which gets rid of the problem with addRange, etc.
 I suppose I could add this to the interface of GC as well, instead of sticking
 yet more logic in core.thread.
The problem with this is that I'd ideally like to control whether each TempAlloc-allocated object is scanned by the GC individually, rather than scanning the entire used portion of the TempAlloc stack.  Having to create and store a Context struct for every object would be incredibly inefficient.  Scanning the entire used portion would be incredibly conservative.

At an implementation level, TempAlloc already uses an array of void*s internally to track what it's allocated (only one per object, in an array, not a linked list) so that it can be freed properly.  Since TempAlloc allocates using 16-byte alignment, I was planning to avoid any extra overhead by storing the scan/noscan bits in the low-order bits of these pointers.

One more note:  It would also be greatly appreciated if you would expose a version of GC.malloc that returns null on failure instead of throwing an exception.  For performance, TempAlloc treats out of memory as non-recoverable.  This is necessary because, since scope statements are often used to manage freeing, the performance hit from having TempAlloc-related functions be throwing would be too large.  Also, making TempAlloc functions non-throwing by catching OutOfMemoryError and aborting is in itself expensive.

Furthermore, I just realized (literally while I was writing this post) that C's malloc (which I use only b/c it doesn't throw) doesn't always return 16-byte aligned pointers like druntime's.  Ideally, I'd like to stop using the C heap and start using the druntime heap, but I can't do that until there is a non-throwing implementation of GC.malloc.
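The low-bit tagging trick mentioned above relies on 16-byte alignment leaving the bottom four bits of every allocation pointer zero. A minimal C sketch (the names are illustrative; a real implementation would store these tagged words in TempAlloc's internal bookkeeping array):

```c
#include <assert.h>
#include <stdint.h>

/* With 16-byte alignment, the low 4 bits of every allocation pointer
 * are free; one of them can record whether the GC should scan the
 * block, at zero extra space cost. */
enum { SCAN_BIT = 1 };

static uintptr_t tag(void *p, int wantScan)
{
    assert(((uintptr_t)p & 15) == 0);       /* must be 16-byte aligned */
    return (uintptr_t)p | (wantScan ? SCAN_BIT : 0);
}

static void *untag(uintptr_t word)
{
    return (void *)(word & ~(uintptr_t)15); /* strip all tag bits */
}

static int shouldScan(uintptr_t word)
{
    return (int)(word & SCAN_BIT);
}
```

The GC-side scan loop would untag each word, and mark the block only when the bit is set.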
Jan 19 2009
next sibling parent Daniel Keep <daniel.keep.lists gmail.com> writes:
dsimcha wrote:
 [snip]
 The problem with this is that I'd ideally like to control whether each
 TempAlloc-allocated object is scanned by the GC individually, rather than
 scanning the entire used portion of the TempAlloc stack.  Having to create and
 store a context struct for every object would be incredibly inefficient.
 Scanning the entire used portion would be incredibly conservative.
 
 At an implementation level, TempAlloc already uses an array of void*s
 internally to track what it's allocated (only one per object, in an array, not
 a linked list) so that it can be freed properly.  Since TempAlloc allocates
 using 16-byte alignment, I was planning to avoid any extra overhead by storing
 the scan/noscan bits in the low order bits of these pointers.
 [snip]
Just a random thought here; what if you had this:

struct Context
{
    void*    begin;
    void*    end;
    Context* next;
}

struct FatContext
{
    void*    begin;
    void*    end;
    Context* next;
    uint*    bitmap;
}

void attachFatContext( FatContext* ctx )
{
    // Tag the low bit of next to mark this node as "fat".
    assert( (cast(size_t) ctx.next & 3) == 0 );
    ctx.next = cast(Context*)(cast(size_t) ctx.next | 1);
    attachContext( cast(Context*) ctx );
}

FatContext* isFatContext( Context* ctx )
{
    return (cast(size_t) ctx.next & 1) ? cast(FatContext*) ctx : null;
}

Context* nextContext( Context* ctx )
{
    return cast(Context*)(cast(size_t) ctx.next & ~cast(size_t) 3);
}

Now, we assume that bitmap is a pointer to the first element of an array of uints that's (end-begin)/16 bits long.  You'd only need one bit per 16 bytes, and it'd let you change the scan/noscan of any area of memory without having to talk to the GC every time.

-- Daniel
Jan 19 2009
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 One more note:  It would also be greatly appreciated if you would expose a
 version of GC.malloc that returns null on failure instead of throwing an
 exception.  For performance, TempAlloc treats out of memory as non-recoverable.
 This is necessary because, since scope statements are often used to manage
 freeing, the performance hit from having TempAlloc-related functions be
 throwing would be too large.  Also, making TempAlloc functions non-throwing by
 catching outOfMemoryErrors and aborting is in itself expensive.
 Furthermore, I just realized (literally while I was writing this post) that C's
 malloc (which I use only b/c it doesn't throw) doesn't always return 16-byte
 aligned pointers like druntime's.  Ideally, I'd like to stop using the C heap
 and start using the druntime heap, but I can't do that until there is a
 non-throwing implementation of GC.malloc.
Never mind about this.  I realized that the reason the try/catch blocks were costing so much performance is that I was writing them inline and in some cases putting them around a whole bunch of code.  This was probably killing a lot of compiler optimizations.  When I instead did it the non-stupid way and wrote a wrapper for malloc, it no longer caused a significant performance penalty.
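The wrapper approach described above can be sketched in C: since out of memory is treated as non-recoverable, the failure check lives in one small, non-inlined function and the hot allocation paths never carry handler code. (In the D version, this wrapper is where the try/catch around GC.malloc would go; the name is illustrative.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Centralized allocation wrapper: callers never handle failure,
 * keeping TempAlloc's hot paths free of try/catch or error checks. */
static void *temp_malloc(size_t n)
{
    void *p = malloc(n);
    if (!p) {
        fputs("TempAlloc: out of memory\n", stderr);
        abort();               /* non-recoverable by design */
    }
    return p;
}
```

Keeping the handler out of line also restores the optimizations that inline try/catch blocks were defeating.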
Jan 19 2009