
digitalmars.D - Memory Management in D: Request for Comment

dsimcha <dsimcha@yahoo.com> writes:
During my occasional forays into the dark side of Python and Java, I am often
amazed at the extent to which memory management in these languages "just
works".  D should be like this for all but the most low-level programming
tasks, and was intended to be.  It seems like most of the other regulars here
have carved out niches that don't involve improving memory management.

My attempts at adding precise heap scanning to the GC got me thinking about
other ways to improve memory management in D.  I love most aspects of D, but
the constant memory management issues make it feel like much less of a
high-level language than it should.  I'm thinking of making this my
niche around here, as I already know more about the problem than I ever wanted
to know and I'm sick of having memory management not work properly.  Here are
some things that I'd like comments on:

1.  The Hans Boehm GC uses a blacklisting scheme whereby it avoids
allocating memory pages that currently have false pointers pointing into them.
 (If a page is not allocated but looks like it has a pointer into it, then we
can assume this is a false pointer.)  If I dip into the GC code again and
implement something like this, we'll be one step closer to making D memory
management "just work" and making false pointers a thing of the past.
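To make the blacklisting idea concrete, here's a rough sketch in C (all names are made up for illustration; the real Boehm implementation is considerably more involved): during marking, any word that points into an unallocated page is treated as a false pointer and that page is blacklisted, and the allocator then refuses to hand out blacklisted pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES 64

/* One flag per page: set when a value that looks like a pointer into an
   unallocated page is seen during a scan (i.e. a false pointer). */
static bool blacklisted[NPAGES];

static size_t page_index(uintptr_t addr, uintptr_t heap_base) {
    return (addr - heap_base) / PAGE_SIZE;
}

/* During marking: a word points into page p, but p is not allocated,
   so the word must be a false pointer -- blacklist the page. */
void note_false_pointer(uintptr_t addr, uintptr_t heap_base) {
    blacklisted[page_index(addr, heap_base)] = true;
}

/* The allocator skips blacklisted pages so new objects never land where
   stale integers would pin them. */
int find_free_page(const bool *allocated) {
    for (int i = 0; i < NPAGES; i++)
        if (!allocated[i] && !blacklisted[i])
            return i;
    return -1; /* no safe page available */
}
```

The effect is that false pointers can only pin memory that was already allocated when they appeared; they can never pin future allocations.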

2.  I've mentioned this a few times here before, but I wrote a second
stack-based memory allocator called TempAlloc, which allows stack-based memory
allocation that is not necessarily bound to function calls and automatically
falls back on heap allocation for very large objects, rather than killing your
program with a "stack overflow" error message.  I've also written some
implementations of common data structures (so far arrays, hash tables and
sets; I'll probably add binary trees at some point) that are optimized for it.
 (see http://svn.dsource.org/projects/dstats/docs/alloc.html).
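The core TempAlloc behavior described above (region allocation that isn't tied to function calls, with automatic heap fallback for oversized requests) looks roughly like this C sketch; `temp_alloc`/`in_region` and the fixed region size are hypothetical stand-ins, not the actual dstats API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define REGION_SIZE 1024

/* Bump-allocate from a fixed buffer; fall back to malloc for requests
   that don't fit, instead of overflowing. */
static unsigned char region[REGION_SIZE];
static size_t region_top = 0;

void *temp_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
    if (region_top + n <= REGION_SIZE) { /* fast path: bump the pointer */
        void *p = &region[region_top];
        region_top += n;
        return p;
    }
    return malloc(n);                    /* too big: fall back to the heap */
}

/* Region allocations are freed LIFO; heap fallbacks need free(). */
int in_region(const void *p) {
    return (const unsigned char *)p >= region &&
           (const unsigned char *)p < region + REGION_SIZE;
}
```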

The biggest problem with TempAlloc is that it is not scanned by the GC,
meaning that you can't store heap-allocated data in it unless you have another
reference somewhere else.  I don't know how to remedy this, partly because the
stacks are thread-local and I don't know how to remove a range from the GC
upon a thread terminating, even if I hack the GC to give it the features it
needs to properly scan TempAlloc.  Advice would be appreciated.

Other than GC scanning, is there anything else you would like to see added to
TempAlloc?  Do you think it's general enough to be included in core.memory, or
is it too niche?

3.  This one is an order of magnitude less likely than the other two to
actually get implemented, at least by me, but how about thread-local
allocators so you can call malloc() without taking a lock?  I vaguely remember
Sean saying he was working on that a while back, but I never heard anything
about it again.  It's probably best to wait for shared to be implemented for
this so that unshared objects can also be collected w/o stopping the world,
but we should start at least discussing this now.


4.  I submitted a patch a while back to allow the GC to ignore interior
pointers for specific objects.
(http://d.puremagic.com/issues/show_bug.cgi?id=2927)  This would be useful if
you have, for example, a large array that never escapes a class and the class
always maintains a pointer to the head of the array as long as it's alive.
This way, when the class dies, the array dies too even if there are false
pointers to its interior.  Few people have commented on this.  Is there any
reason why it's not a good idea?  Yes, it's somewhat unsafe if you're not
careful, but when the alternative is horrible memory leaks, sometimes
unsafeness is a necessary evil.
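The proposed attribute amounts to a per-block rule in the mark phase: for a flagged block, only a pointer to the block's base keeps it alive. A minimal C model of that rule (the struct and names here are hypothetical, not the patch's actual interface):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-block "no interior pointers" attribute. A pointer only keeps a
   flagged block alive if it points exactly at the block's base. */
typedef struct {
    uintptr_t base;
    uintptr_t size;
    bool no_interior;
} Block;

bool pointer_keeps_alive(uintptr_t p, const Block *b) {
    if (b->no_interior)
        return p == b->base;                       /* head pointer only */
    return p >= b->base && p < b->base + b->size;  /* any interior word */
}
```

Under this rule a random integer that happens to land in the middle of a flagged array can no longer pin it; only the class's head pointer does.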
Nov 02 2009
Jason House <jason.james.house@gmail.com> writes:
Please add weak references! I can email you a druntime compatible
implementation.

Nov 03 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from Jason House (jason.james.house gmail.com)'s article
 Please add weak references! I can email you a druntime compatible
implementation.

It would probably be better to submit it as a patch to Bugzilla if it's
reasonably well-tested and working already. On the other hand, I am curious to
see how this was done. Did it involve hacking gcx.d or was there a more
graceful way?
Nov 03 2009
Jason House <jason.james.house@gmail.com> writes:
dsimcha Wrote:

 Probably be better to submit it as a patch to Bugzilla if it's reasonably
 well-tested and working already. On the other hand, I am curious to see how
 this was done. Did it involve hacking gcx.d or was there a more graceful way?

I did hack gcx.d, but only to make the global mutex public. The implementation
is relatively simple. My ticket submission failed when I tried to submit it, so
I emailed Sean directly. That was months ago, and I never heard back. I'll
resubmit a ticket... Theoretically, a weak ref might naturally tie into your
precise scanning changes. Admittedly, it's the dereferencing that is tricky to
get right without hacking gcx.d.
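For readers unfamiliar with the trick: a weak reference to a conservatively scanned heap works by storing the target address in a disguised form the GC won't recognize as a pointer, and only revealing it while holding the collector's mutex so the object can't be freed mid-check. A conceptual C sketch, where `target_live` is a hypothetical stand-in for querying the GC's liveness state and the lock calls are shown as comments:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for asking the collector whether the target is still live;
   a real implementation would consult the GC under its global mutex. */
static int target_live = 1;

typedef struct { uintptr_t hidden; } WeakRef;

/* Store the address bitwise-complemented so a conservative scan does
   not see a valid pointer and the reference does not keep the target
   alive. */
void weak_set(WeakRef *w, void *target) {
    w->hidden = ~(uintptr_t)target;
}

void *weak_get(WeakRef *w) {
    /* gc_lock();  -- must hold the GC's mutex so the object can't go away */
    void *p = target_live ? (void *)~w->hidden : NULL;
    /* gc_unlock(); */
    return p;   /* NULL once the target has been collected */
}
```

This matches the point above: the storage side is simple, but the dereference is the part that needs cooperation from (or hacks to) gcx.d.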
Nov 03 2009
downs <default_357-line@yahoo.de> writes:
I submitted a patch a while back for constant-time removeRange. Can we dredge
that one up and/or implement something similar? It's rather useful for things
like stackthreads that need to add and remove lots of ranges :)

(Search the NG for "Feature req.+Patch: O(1) removeRange")
Nov 03 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from downs (default_357-line yahoo.de)'s article
 I submitted a patch a while back for constant-time removeRange. Can we dredge
 that one up and/or implement something similar? It's rather useful for things
 like stackthreads that need to add and remove lots of ranges :)
 (Search the NG for "Feature req.+Patch: O(1) removeRange")

I guess removeRange is currently O(N)? I haven't looked. Anyhow, I think the bigger problem in practice, i.e. until N is unrealistically large, is that removeRange requires taking a lock. O(1) would be nice, though. What does StackThreads (I assume this is equivalent to fibers) do that it requires so many add/remove ranges?
Nov 03 2009
downs <default_357-line@yahoo.de> writes:
dsimcha wrote:
 I guess removeRange is currently O(N)?  I haven't looked.  Anyhow, I think the
 bigger problem in practice, i.e. until N is unrealistically large, is that
 removeRange requires taking a lock.  O(1) would be nice, though.  What does
 StackThreads (I assume this is equivalent to fibers) do that it requires so
 many add/remove ranges?

The virtual stacks I use are being allocated via mmap. Every stack is a range. With every context switch, the relevant ranges need to be modified - the range that is being switched to needs to be removed (because it becomes the "main" stack which has special GC handling), and the stack range that's being switched from needs to be added.
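The per-switch bookkeeping described above can be modeled very simply: every virtual stack is a registered GC range except the one currently running, which gets the "main stack" treatment. A C sketch, with range registration modeled as a boolean per fiber (in the real code these would be the GC's addRange/removeRange calls):

```c
#include <assert.h>
#include <stdbool.h>

#define NFIBERS 4

/* registered[i] == true means fiber i's stack is a GC-scanned range.
   The currently running fiber is NOT a range: its live stack is scanned
   by the GC's normal stack-scanning machinery instead. */
static bool registered[NFIBERS];
static int current = 0;

void switch_fiber(int next) {
    registered[current] = true;   /* old stack becomes a scannable range */
    registered[next] = false;     /* new stack is scanned as the live stack */
    current = next;
}
```

Two range operations per context switch is exactly why the lock inside add/removeRange (and its O(N) cost) hurts here.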
Nov 03 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from downs (default_357-line yahoo.de)'s article
 The virtual stacks I use are being allocated via mmap. Every stack is a range.
 With every context switch, the relevant ranges need to be modified - the range
 that is being switched to needs to be removed (because it becomes the "main"
 stack which has special GC handling), and the stack range that's being
 switched from needs to be added.

IDK, that sounds a bit expensive even with O(1) removeRange.  Are you using
some home-grown impl. of StackThreads/fibers, Tango's, or D2 druntime's?  If
it's Tango's or druntime's, I haven't looked, but I would think the GC
understands stuff about them.  If it's your own, have you considered switching
and maybe submitting whatever extra features your impl. has to Tango and/or
druntime?
Nov 03 2009
downs <default_357-line@yahoo.de> writes:
dsimcha wrote:
 IDK, that sounds a bit expensive even with O(1) removeRange.  Are you using
 some home-grown impl. of StackThreads/fibers, Tango's, or D2 druntime's?  If
 it's Tango's or druntime's, I haven't looked, but I would think the GC
 understands stuff about them.  If it's your own, have you considered
 switching and maybe submitting whatever extra features your impl. has to
 Tango and/or druntime?

Well .... D1 Phobos :) I have some philosophical disagreements with the Tango project, much as I admit their technical superiority, and D2 is still too beta. But at least I don't use tools.stackthreads that much these days, so it's fine :)
Nov 04 2009
Leandro Lucarella <llucax@gmail.com> writes:
dsimcha, on November 3 at 05:46, you wrote to me:
 1.  In the Hans Boehm GC, they use a blacklisting scheme whereby they avoid
 allocating memory pages that currently have false pointers pointing into them.
  (If a page is not allocated but looks like it has a pointer into it, then we
 can assume this is a false pointer.)  If I dip into the GC code again and
 implement something like this, we'll be one step closer to making D memory
 management "just work" and making false pointers a thing of the past.

The Boehm GC has a lot to learn from. I'm planning to add concurrent collection
in a slightly different way than the Boehm GC, which has at least a couple of
ways of doing so. I know it might look like vaporware at this point, but I have
to do it eventually; I need to finish it to finally get my degree ;). I've just
been a little busy lately.
 3.  This one is an order of magnitude less likely than the other two to
 actually get implemented, at least by me, but how about thread-local
 allocators so you can call malloc() without taking a lock?  I vaguely remember
 Sean saying he was working on that a while back, but I never heard anything
 about it again.  It's probably best to wait for shared to be implemented for
 this so that unshared objects can also be collected w/o stopping the world,
 but we should start at least discussing this now.

I think avoiding the lock might be possible using atomic ops. That should be
easier than using thread-local storage, but I'm not entirely sure it's really
feasible.
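The simplest version of the atomic-ops approach is a shared bump allocator where each thread claims space with a single compare-and-swap instead of taking a mutex. A C sketch (a real allocator would of course need free lists, size classes, etc.; this only shows the lock-free claim):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define ARENA_SIZE 4096

static unsigned char arena[ARENA_SIZE];
static _Atomic size_t arena_top = 0;

/* Claim n bytes from the shared arena without any lock: one CAS on the
   bump offset, retried if another thread won the race. */
void *cas_alloc(size_t n) {
    size_t old = atomic_load(&arena_top);
    for (;;) {
        if (old + n > ARENA_SIZE)
            return NULL;  /* arena exhausted */
        /* Try to claim [old, old+n); on failure, old is reloaded and we
           retry with the other thread's updated offset. */
        if (atomic_compare_exchange_weak(&arena_top, &old, old + n))
            return &arena[old];
    }
}
```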
 4.  I submitted a patch a while back to allow the GC to ignore interior
 pointers for specific objects.
 (http://d.puremagic.com/issues/show_bug.cgi?id=2927)  This would be useful if
 you have, for example, a large array that never escapes a class and the class
 always maintains a pointer to the head of the array as long as it's alive.
 This way, when the class dies, the array dies too even if there are false
 pointers to its interior.  Few people have commented on this.  Is there any
 reason why it's not a good idea?  Yes, it's somewhat unsafe if you're not
 careful, but when the alternative is horrible memory leaks, sometimes
 unsafeness is a necessary evil.

I think I commented on it, and I think it's a good idea to have. It's no more
unsafe than having NO_SCAN or casting a pointer to a size_t.

--
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
Nov 03 2009
BCS <none@anon.com> writes:
Hello dsimcha,

 3.  This one is an order of magnitude less likely than the other two
 to actually get implemented, at least by me, but how about
 thread-local allocators so you can call malloc() without taking a
 lock?  I vaguely remember Sean saying he was working on that a while
 back, but I never heard anything about it again.  It's probably best
 to wait for shared to be implemented for this so that unshared objects
 can also be collected w/o stopping the world, but we should start at
 least discussing this now.

I think there are general-purpose malloc implementations that are lock-free. On
the same note, I've been toying with an idea for a memory heap structure that
would be totally lock-free (think CAS) but has some downsides: it won't handle
a non-contiguous memory space, and expanding the memory could get very
expensive. Both of these could be non-issues in some cases; short-lived
job-specific heaps, memory-mapped (shared?) data(base) files,
resource-limited systems, etc. I've been meaning to get around to implementing
it but never had time. I'd be willing to collaborate or, if even that doesn't
get me moving, just hand over the whole project in exchange for a "coauthor"
line in the comments.
Nov 04 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from BCS (none anon.com)'s article
 Hello dsimcha,

 I've been toying with an idea for a memory heap structure that would be
 totally lock-free (think CAS) but has some downsides: it won't handle a
 non-contiguous memory space, and expanding the memory could get very
 expensive.

But x86 atomic ops, while not as slow as locks, are pretty slow. Some benchmarks I did a while back indicated that, on Windows, atomic increment is only about twice as fast as an uncontested synchronized { var++; }. I think using thread-local storage and having each thread-local heap interact with the global heap only occasionally, for very large transactions, is a much better bet than using atomic CAS. I'm also a little bit confused about what you're suggesting, specifically about the non-continuous memory space. Does this mean that things would have to be freed LIFO? If so, what use cases would you be covering that TempAlloc doesn't already cover? If not, what does it mean?
Nov 04 2009
BCS <none@anon.com> writes:
Hello dsimcha,

 But x86 atomic ops, while not as slow as locks, are pretty slow.  Some
 benchmarks I did a while back indicated that, on Windows, atomic increment is
 only about twice as fast as an uncontested synchronized { var++; }.

IIRC a mutex/lock has to do a CAS at some point, so if you can get a function
to work with a single CAS, an uncontested action will at worst be as good as an
uncontested mutex/lock. Once contention starts happening, it depends on how
much your retry costs. Another thing to take into account is the overhead vs.
granularity tradeoff you get with locks, which can sometimes be skipped with
direct use of CAS.
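The point is easy to see in code: even the simplest lock's acquire path is itself a CAS, so a single-CAS operation can't do worse in the uncontended case. A C sketch contrasting the two:

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal spinlock: note that acquiring it costs one CAS even when
   nobody else holds it, plus the release store afterwards. */
static _Atomic int lockword = 0;
static int counter = 0;

void locked_increment(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lockword, &expected, 1))
        expected = 0;            /* spin until we win the CAS */
    counter++;                   /* critical section */
    atomic_store(&lockword, 0);  /* release */
}

/* Lock-free version: the whole operation IS the one atomic op. */
static _Atomic int atomic_counter = 0;

void lockfree_increment(void) {
    atomic_fetch_add(&atomic_counter, 1);
}
```

Under contention the comparison changes: the lock-free version retries the cheap atomic op, while the lock version spins (or blocks) and then still pays for the critical section serially.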
 I think using thread-local storage and having each
 thread-local heap interact with the global heap only occasionally, for
 very large transactions, is a much better bet than using atomic CAS.

I got started thinking about this when I came across a comment that some particular malloc didn't have issues with freeing memory from a different thread than it was malloced from.
 
 I'm also a little bit confused about what you're suggesting,
 specifically about the non-continuous memory space.  Does this mean
 that things would have to be freed LIFO?  If so, what use cases would
 you be covering that TempAlloc doesn't already cover?  If not, what
 does it mean?

IIRC, when the current GC runs out of space, it uses mmap (on Linux) to get
more address space mapped to something. This wouldn't work for my idea because
you can't count on mmap returning a block of RAM at any given location; it is
free to leave gaping holes in the address space, and I wouldn't be able to
handle that.
Nov 04 2009
Sean Kelly <sean@invisibleduck.org> writes:
dsimcha Wrote:
 
 3.  This one is an order of magnitude less likely than the other two to
 actually get implemented, at least by me, but how about thread-local
 allocators so you can call malloc() without taking a lock?  I vaguely remember
 Sean saying he was working on that a while back, but I never heard anything
 about it again.  It's probably best to wait for shared to be implemented for
 this so that unshared objects can also be collected w/o stopping the world,
 but we should start at least discussing this now.

I actually started working on this a while back and then set it aside to take care of some other things. I'm planning to revisit it once the concurrency model is reasonably sorted out, since such a GC is pretty much useless until then anyway.
Nov 04 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 I actually started working on this a while back and then set it aside to take
 care of some other things. I'm planning to revisit it once the concurrency
 model is reasonably sorted out, since such a GC is pretty much useless until
 then anyway.

Does this mean that this has officially, for all practical purposes, been put
off until D3?
Nov 04 2009
Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> writes:
dsimcha wrote:
 Does this mean that this has officially, for all practical purposes, been put
 off until D3?

Performance improvements are fortunately not tied to the language spec.

Andrei
Nov 04 2009
dsimcha <dsimcha@yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 Performance improvements are fortunately not tied to the language spec.

 Andrei
But the concurrency model is.
Nov 04 2009
Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> writes:
dsimcha wrote:
 But the concurrency model is.

Sorry, I misunderstood a bit. Anyway, the process is a continuum, so you
shouldn't expect that the concurrency model will be available only on the day
of D2's release. Also, there will be many releases of D2 following the "gold"
release. It's not like once we deploy D2 we just start working on D3.

Andrei
Nov 04 2009