
digitalmars.D - TempAlloc and druntime GC

reply dsimcha <dsimcha yahoo.com> writes:
For those of you who aren't familiar with TempAlloc, it's a region-based
memory allocator that's essentially a second stack for allocating memory in a
last in, first out manner.  It is based on an idea proposed on this newsgroup
a few months back by Andrei under the name SuperStack, and was later
implemented by me.  The goal is to provide fast thread-local pointer-bumping
memory allocation for objects whose lifetime is bound to a scope, while
allowing the whole heap, instead of the relatively limited call stack space,
to be used.  According to some benchmarks I've done, TempAlloc can be anywhere
between 30% and 10s of times faster than the druntime GC allocator, depending
on several factors, the biggest being lock contention for the druntime
allocator.
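The mechanism is simple enough to sketch. Below is a minimal C illustration of the core idea (pointer-bump allocation from a large region, freed in LIFO order); the names, the single-chunk limitation, and the error handling are illustrative only, not TempAlloc's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the TempAlloc idea: a large region carved from
 * the heap, handed out by pointer bumping, released last in, first out.
 * Real code would chain multiple chunks instead of failing when full. */
enum { REGION_SIZE = 4 * 1024 * 1024 };   /* TempAlloc uses 4 MB chunks */

static char  *region;      /* start of the current chunk  */
static size_t offset;      /* bump pointer into the chunk */

static void *temp_alloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (offset + n > REGION_SIZE)
        return NULL;                       /* real code grabs a new chunk */
    void *p = region + offset;
    offset += n;
    return p;
}

static void temp_free(size_t n)            /* LIFO: pop the last block */
{
    offset -= (n + 15) & ~(size_t)15;
}
```

Allocation is a couple of arithmetic operations with no lock, which is where the speedup over a contended GC allocator comes from.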

TempAlloc's biggest weakness is that nothing allocated by TempAlloc is scanned
by the GC, making it impossible to safely store the only reference to
GC-allocated objects inside TempAlloc-allocated space.  I've been using
TempAlloc in my own personal projects for a while now, and have not found this
to be a significant limitation in practice.  Usually, if I'm using TempAlloc,
the whole point is to avoid generating GC heap activity within the function
I'm working in, meaning that I don't create any new GC-allocated objects in
that function.  If the objects were created before calling that function,
there's another reference to them somewhere.  However, for people who are not
as familiar with how TempAlloc is implemented, or with how GC is implemented,
it may be easy to screw up, making TempAlloc a shoe with a gun built in and
probably unsuitable for inclusion in Phobos or Tango.

I've been thinking about how to deal with this issue and make TempAlloc safe
enough for inclusion in the standard library.  Since it works by allocating
large (currently 4 MB) regions from the C heap, and then sub-allocating these
regions to its clients, simply scanning these whole chunks is not a
reasonable option, as it would lead to way too many false pointers.  Using the
GC's addRegion/removeRegion interface at every TempAlloc allocation is also
not a reasonable option, as this operation requires a lock and would negate
any performance gains from using TempAlloc instead of the GC heap.  The only
alternative I see is to make the GC aware of TempAlloc at a very low level.
I've looked at the GC code and this doesn't look terribly hard to do.  Using
some bit twiddling tricks, I can do the bookkeeping to determine which
TempAlloc allocated objects should be scanned by the GC w/o any additional
space overhead.

The downside to this is that TempAlloc would be very tightly coupled to
druntime. Furthermore, if Sean and the other druntime maintainers decided they
didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
some feedback before I bother doing anything.  Do you think that TempAlloc is
universally useful enough to belong in the standard library?  If so, do you
believe that it is only suitable for inclusion if it can be made safe w/
respect to storing the only reference to GC-allocated objects?  If so, is it
worth tightly coupling it to the GC, and should I start working on this and
submit a patch to druntime?
Jan 18 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
Jan 19 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
Can you elaborate on this idea? I don't understand it, so I can't comment on whether it would work.
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
It seems like you want to create some kind of more abstract interface to solve the more general problem of interfacing stuff with the GC, rather than just integrating TempAlloc directly at a low level.

One way this could be done (not necessarily the best way, but the best I've thought of so far) is to make it possible to register class objects conforming to some specific range interface (an explicit runtime interface w/ virtual functions, not an implicit compile-time interface, since we care about the ABI here) with the GC.  These objects would return a pointer at each iteration, and then the GC would mark whatever was pointed to by that pointer.  This would allow me to use some hacks to keep TempAlloc efficient, while keeping these hacks well-encapsulated and out of the core GC code.

In TempAlloc's case specifically, every time a new thread's TempAlloc state was initialized, it would register one of these objects with the GC.  This object would contain the thread's ID internally, and be set to return empty immediately if the thread was already dead.  (TempAlloc-allocated data is absolutely NOT meant to be shared between threads.)  Otherwise, it would scan through the currently allocated elements and provide the GC with the pointers that are located there, through the range interface.

I'm not sure if this would also work for DLL stuff since I know nothing about that problem.  However, if this would solve the other cases you have in mind, I think it's an elegant solution.
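The proposed range interface can be sketched in C with a struct of function pointers standing in for D's virtual interface (all names here are hypothetical; the real thing would be a D class interface in druntime):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the "range interface" idea: the GC drains any registered
 * range object, marking one pointer per iteration until it is empty. */
typedef struct PtrRange {
    int   (*empty)(struct PtrRange *self);
    void *(*front)(struct PtrRange *self);
    void  (*pop)(struct PtrRange *self);
} PtrRange;

/* The GC's mark loop would consume a registered range like this: */
static size_t gc_scan_range(PtrRange *r)
{
    size_t marked = 0;
    while (!r->empty(r)) {
        void *p = r->front(r);   /* a real GC would mark the block at p */
        (void)p;
        marked++;
        r->pop(r);
    }
    return marked;
}

/* One concrete range: iterates a plain array of pointers, the shape
 * TempAlloc's internal bookkeeping array already has. */
typedef struct {
    PtrRange base;               /* must be first: allows the casts below */
    void   **items;
    size_t   len, i;
} ArrayRange;

static int   arr_empty(PtrRange *r) { ArrayRange *a = (ArrayRange *)r; return a->i >= a->len; }
static void *arr_front(PtrRange *r) { ArrayRange *a = (ArrayRange *)r; return a->items[a->i]; }
static void  arr_pop  (PtrRange *r) { ArrayRange *a = (ArrayRange *)r; a->i++; }
```

A dead thread's range would simply report `empty` immediately, matching the behavior described above.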
Jan 19 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Sean Kelly (sean invisibleduck.org)'s article
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been thinking about how to deal with this issue and make TempAlloc safe
 enough for inclusion in the standard library.  Since it works by allocating
 large (currently 4 MB) regions from the C heap, and then sub-allocating these
 regions to its clients, simply scanning these whole chunks is simply not a
 reasonable option, as it would lead to way too many false pointers.  Using the
 GC's addRegion/removeRegion interface at every TempAlloc allocation is also
 not a reasonable option, as this operation requires a lock and would negate
 any performance gains from using TempAlloc instead of the GC heap.  The only
 alternative I see is to make the GC aware of TempAlloc at a very low level.
 I've looked at the GC code and this doesn't look terribly hard to do.  Using
 some bit twiddling tricks, I can do the bookkeeping to determine which
 TempAlloc allocated objects should be scanned by the GC w/o any additional
 space overhead.
The way I'd do this at the druntime level is generate a Thread.Context struct for each distinct block used by TempAlloc. It shouldn't be too difficult to create an API to provide this functionality. I'd have to do something roughly along these lines anyway to get thread support working for DLLs in Win32.
Can you elaborate on this idea? I don't understand it, so I can't comment on whether it would work.
The data range of every execution context in druntime is tracked by the GC via a linked-list of structs that look roughly like this:

struct Context
{
    void*    begin;
    void*    end;
    Context* next;
}

The stack of each kernel thread has one, as does each user thread (Fiber).  Currently, this is all internal, but I could expose an API along these lines:

void attachContext( Context* );
void detachContext( Context* );

Each TempAlloc range would have its own Context struct which it would attach on creation and detach upon destruction.  You're still paying for a mutex call to attach and detach the struct, but the begin and end pointers can be modified at will, which gets rid of the problem with addRange, etc.  I suppose I could add this to the interface of GC as well, instead of sticking yet more logic in core.thread.
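A minimal C sketch of that Context list (the lock is elided; in druntime the attach/detach would happen under the GC's mutex, while begin/end can then move freely):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the Context scheme: the GC keeps a singly linked list of
 * (begin, end) ranges to scan; attach/detach splice nodes in and out. */
typedef struct Context {
    void           *begin;
    void           *end;
    struct Context *next;
} Context;

static Context *context_list;   /* head of the GC's scan list */

static void attachContext(Context *c)   /* would take the GC lock */
{
    c->next = context_list;
    context_list = c;
}

static void detachContext(Context *c)   /* would take the GC lock */
{
    for (Context **p = &context_list; *p; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            return;
        }
    }
}
```

Because the GC only dereferences begin/end while the world is stopped, the owning thread can bump them on every allocation without touching the lock, which is the whole point versus addRange/removeRange.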
 The downside to this is that TempAlloc would be very tightly coupled to
 druntime. Furthermore, if Sean and the other druntime maintainers decided they
 didn't like my patch, my effort will be wasted.  Therefore, I'd like to get
 some feedback before I bother doing anything.  Do you think that TempAlloc is
 universally useful enough to belong in the standard library?  If so, do you
 believe that it is only suitable for inclusion if it can be made safe w/
 respect to storing the only reference to GC-allocated objects?  If so, is it
 worth tightly coupling it to the GC, and should I start working on this and
 submit a patch to druntime?
For any new runtime features, the easiest thing is just to submit a ticket or to contact me and tell me what you need. At first blush, I don't see any problem with providing the functions you'd need to make this work. It may eventually make sense to put TempAlloc, or something like it, into core.memory anyway (there's a reason I didn't name that module core.gc). Sean
It seems like you want to create some kind of more abstract interface to solve the more general problem of interfacing stuff with the GC, rather than just integrating TempAlloc directly at a low level. One way this could be done (not necessarily the best way, but the best I've thought of so far) is to make it possible to register class objects conforming to some specific range interface (an explicit runtime interface w/ virtual functions, not an implicit compile-time interface, since we care about the ABI here) with the GC. These objects would return a pointer at each iteration, and then the GC would mark whatever was pointed to by that pointer. This would allow me to use some hacks to keep TempAlloc efficient, while keeping these hacks well-encapsulated and out of the core GC code.
That's basically how the Context stuff works above.
 In TempAlloc's case specifically, every time a new thread's TempAlloc state was
 initialized, it would register one of these objects with the GC.  This object
 would contain the thread's ID internally, and be set to return empty immediately
 if the thread was already dead.  (TempAlloc-allocated data is absolutely NOT
 meant to be shared between threads.)  Otherwise, it would scan through the
 currently allocated elements and provide the GC with the pointers that are
 located there, through the range interface.
Hm... this may be more tricky than I had outlined above. If I were to do this I'd probably expose it from GC rather than allow access to the existing context mechanism. However... I /could/ possibly add an onExit hook for the Thread object. I'll have to think about whether that could cause any problems. Sean
Jan 19 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 The data range of every execution context in druntime is tracked by the GC
 via a linked-list of structs that look roughly like this:
 struct Context
 {
     void*        begin;
     void*        end;
     Context* next;
 }
 The stack of each kernel thread has one, as does each user thread (Fiber).
 Currently, this is all internal, but I could expose an API along these lines:
 void attachContext( Context* );
 void detachContext( Context* );
 Each ThreadAlloc range would have its own Context struct which it would attach
 on creation and detach upon destruction.  You're still paying for a mutex call
 to attach and detach the struct, but the begin and end pointers can be modified
 at will, which gets rid of the problem with addRange, etc.
 I suppose I could add this to the interface of GC as well, instead of sticking
 yet more logic in core.thread.
The problem with this is that I'd ideally like to control whether each TempAlloc-allocated object is scanned by the GC individually, rather than scanning the entire used portion of the TempAlloc stack.  Having to create and store a Context struct for every object would be incredibly inefficient.  Scanning the entire used portion would be incredibly conservative.

At an implementation level, TempAlloc already uses an array of void*s internally to track what it's allocated (only one per object, in an array, not a linked list) so that it can be freed properly.  Since TempAlloc allocates using 16-byte alignment, I was planning to avoid any extra overhead by storing the scan/noscan bits in the low-order bits of these pointers.

One more note:  It would also be greatly appreciated if you would expose a version of GC.malloc that returns null on failure instead of throwing an exception.  For performance, TempAlloc treats out of memory as non-recoverable.  This is necessary because, since scope statements are often used to manage freeing, the performance hit from having TempAlloc-related functions be throwing would be too large.  Also, making TempAlloc functions non-throwing by catching OutOfMemoryError and aborting is in itself expensive.

Furthermore, I just realized (literally while I was writing this post) that C's malloc (which I use only b/c it doesn't throw) doesn't always return 16-byte aligned pointers like druntime's.  Ideally, I'd like to stop using the C heap and start using the druntime heap, but I can't do that until there is a non-throwing implementation of GC.malloc.
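The low-bit tagging trick mentioned above relies on 16-byte alignment leaving the bottom four bits of every allocation pointer zero. A minimal C sketch (the names are illustrative; a real implementation would store these tagged words in TempAlloc's internal bookkeeping array):

```c
#include <assert.h>
#include <stdint.h>

/* With 16-byte alignment, the low 4 bits of every allocation pointer
 * are free; one of them can record whether the GC should scan the
 * block, at zero extra space cost. */
enum { SCAN_BIT = 1 };

static uintptr_t tag(void *p, int wantScan)
{
    assert(((uintptr_t)p & 15) == 0);       /* must be 16-byte aligned */
    return (uintptr_t)p | (wantScan ? SCAN_BIT : 0);
}

static void *untag(uintptr_t word)
{
    return (void *)(word & ~(uintptr_t)15); /* strip all tag bits */
}

static int shouldScan(uintptr_t word)
{
    return (int)(word & SCAN_BIT);
}
```

The GC-side scan loop would untag each word, and mark the block only when the bit is set.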
Jan 19 2009
next sibling parent Daniel Keep <daniel.keep.lists gmail.com> writes:
dsimcha wrote:
 [snip]
 The problem with this is that I'd ideally like to control whether each
 TempAlloc-allocated object is scanned by the GC individually, rather than
 scanning the entire used portion of the TempAlloc stack.  Having to create and
 store a context struct for every object would be incredibly inefficient.
 Scanning the entire used portion would be incredibly conservative.
 
 At an implementation level, TempAlloc already uses an array of void*s
 internally to track what it's allocated (only one per object, in an array, not
 a linked list) so that it can be freed properly.  Since TempAlloc allocates
 using 16-byte alignment, I was planning to avoid any extra overhead by storing
 the scan/noscan bits in the low order bits of these pointers.
 [snip]
Just a random thought here; what if you had this:

struct Context
{
    void*    begin;
    void*    end;
    Context* next;
}

struct FatContext
{
    void*    begin;
    void*    end;
    Context* next;
    uint*    bitmap;
}

void attachFatContext( FatContext* ctx )
{
    // Tag the low bit of next to mark this node as "fat".
    assert( (cast(size_t) ctx.next & 3) == 0 );
    ctx.next = cast(Context*)(cast(size_t) ctx.next | 1);
    attachContext( cast(Context*) ctx );
}

FatContext* isFatContext( Context* ctx )
{
    return (cast(size_t) ctx.next & 1) ? cast(FatContext*) ctx : null;
}

Context* nextContext( Context* ctx )
{
    return cast(Context*)(cast(size_t) ctx.next & ~cast(size_t) 3);
}

Now, we assume that bitmap is a pointer to the first element of an array of uints that's (end-begin)/16 bits long.  You'd only need one bit per 16 bytes, and it'd let you change the scan/noscan of any area of memory without having to talk to the GC every time.

-- Daniel
Jan 19 2009
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 One more note:  It would also be greatly appreciated if you would expose a
 version of GC.malloc that returns null on failure instead of throwing an
 exception.  For performance, TempAlloc treats out of memory as non-recoverable.
 This is necessary because, since scope statements are often used to manage
 freeing, the performance hit from having TempAlloc-related functions be
 throwing would be too large.  Also, making TempAlloc functions non-throwing by
 catching outOfMemoryErrors and aborting is in itself expensive.
 Furthermore, I just realized (literally while I was writing this post) that C's
 malloc (which I use only b/c it doesn't throw) doesn't always return 16-byte
 aligned pointers like druntime's.  Ideally, I'd like to stop using the C heap
 and start using the druntime heap, but I can't do that until there is a
 non-throwing implementation of GC.malloc.
Never mind about this.  I realized that the reason the try/catch blocks were costing so much performance is that I was writing them inline and in some cases putting them around a whole bunch of code.  This was probably killing a lot of compiler optimizations.  When I instead did it the non-stupid way and wrote a wrapper for malloc, it no longer caused a significant performance penalty.
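The wrapper approach described above can be sketched in C: since out of memory is treated as non-recoverable, the failure check lives in one small, non-inlined function and the hot allocation paths never carry handler code. (In the D version, this wrapper is where the try/catch around GC.malloc would go; the name is illustrative.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Centralized allocation wrapper: callers never handle failure,
 * keeping TempAlloc's hot paths free of try/catch or error checks. */
static void *temp_malloc(size_t n)
{
    void *p = malloc(n);
    if (!p) {
        fputs("TempAlloc: out of memory\n", stderr);
        abort();               /* non-recoverable by design */
    }
    return p;
}
```

Keeping the handler out of line also restores the optimizations that inline try/catch blocks were defeating.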
Jan 19 2009