
digitalmars.D.learn - Fast Memory Allocation Strategy

reply "Craig Black" <cblack ara.com> writes:
I've been looking at D's approach to memory allocation and I have an idea 
that I think will cause memory allocation to be blazingly fast.

I don't know what to call it, and there probably is a name for it.  It's 
kind of similar to D's mark and release approach but slightly more involved.

Objects are organized into buffers according to their size.  So if an object 
is 10 words long, it will go into the 10-word buffer.  If an object is 5 
words long, it goes into the 5-word buffer, etc.  Thus all objects in each 
buffer are of equal size.

The buffers must be resizable.  When an object is requested from a buffer 
that is full, the buffer is enlarged to accommodate more objects.  When the 
buffer is resized, the GC must move all the objects it contains.  When an 
object is freed, the gap in the buffer is filled with the last object in the 
buffer, which requires the GC to move that block of memory.  Thus there is 
no memory fragmentation within each buffer.

The problem with this approach is that it requires interaction with the GC 
that I don't think D currently provides, so Walter may have to get involved 
for this to be implemented.

All you smart D folks, let me know what you think.

-Craig 
Mar 16 2005
parent reply Ilya Minkov <minkov cs.tum.edu> writes:
That's how Doug Lea's memory manager (i.e. malloc/free implementation) 
works, and it is known to be one of the least fragmenting memory 
managers. And it's fast! The GNU C library's malloc is derived from it, 
and this is one of the many reasons for good Linux performance.

The difference is that it naturally cannot move anything, but that 
hardly has any negative consequence, since the program will likely 
allocate more data of the same size later. Even otherwise, there are 
different strategies: if we are speaking of very small objects, the 
deficiency can safely be ignored; for multi-kilobyte ones, one can 
consider assigning the space to a different pool or returning it to the 
OS. Basically, since we have virtually mapped memory, it is much more 
efficient to unmap a piece of address space than to move the data and go 
back to correct the pointers.

I stumbled over the subject in an article on Gamasutra, as I wanted 
to learn whether I should expect to run into memory management trouble 
on a (now abandoned but historically gorgeous) game console with 16 
megabytes of main RAM, and how to avoid or solve it.

I might be wrong, but I seem to remember the Phobos GC also uses a sort of 
size-discriminated pool allocator, and I would guess the Boehm collector does 
as well - fast memory allocation is one of its main selling points.

I suggest you read an article on it, which says that with such an 
allocator fragmentation basically ceases to bother anyone - usually even 
on a game console which doesn't have virtual memory enabled for 
performance reasons.

	"The Memory Fragmentation Problem: Solved?"
http://www.cs.utexas.edu/users/wilson/papers/fragsolved.pdf

This document is also available from a few other locations on the internet.

Generally, precise garbage collection is a hard problem - and moving 
data around is a particularly hard sub-problem. It would force us to 
have a typed stack - not impossible, at least as long as one could 
exclude foreign-language (e.g. C) calls - but a runtime overhead, 
I think. What we are effectively left with is a collector with 
conservative roots but precise object scanning - and such a collector 
cannot detect each and every pointer, so it cannot patch them all if a 
piece of address space moves.

-eye


Craig Black wrote:
<snip>

Mar 16 2005
parent reply "Craig Black" <cblack ara.com> writes:
Thank you for your response.  You seem to be quite informed about memory 
allocation.  I was searching for a name for my approach.  Perhaps I can 
steal the term "size-discriminated pool allocator" from you?  You don't have 
a copyright on that, do you? :)

 Generally, precise garbage collection is a hard problem - and moving data 
 around is a particularly hard sub-problem. It would force us to have a typed 
 stack - not impossible, at least as long as one could exclude 
 foreign-language (e.g. C) calls - but a runtime overhead, I think. What we 
 are effectively left with is a collector with conservative roots but 
 precise object scanning - and such a collector cannot detect each and every 
 pointer, so it cannot patch them all if a piece of address space moves.

I disagree.  Moving memory around is quite possible and could be implemented 
efficiently, given that we are dealing with GC references rather than 
pointers.  That is, after all, what GC is all about, right?  Not having to 
deal with those nasty, hateful little pointers?

Having to move memory around once in a while is quite worth the effort in 
this case.  A slight drawback is that when the buffer is resized, this may 
take a few clock cycles, especially if the buffer is quite large.  However, 
this should happen very rarely, especially for large buffers.  I believe that 
with the right GC algorithm, we can outperform non-GC.

-Craig
Mar 17 2005
parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Craig Black wrote:
 Thank you for your response.  You seem to be quite informed about 
 memory allocation.  I was searching for a name for my approach.

I think in fact I know too little on almost any subject, including memory 
allocation. However, if I were sure I knew a lot, I wouldn't come here to 
learn quite as much.
 Perhaps I can steal the term "size-discriminated pool allocator" from
 you?  You don't have a copyright on that, do you? :)

How could I? It is simply a bunch of technical terms thrown together. I'm 
not even sure it's correct. I didn't even make one long word out of it - 
which would be in fine German tradition.
 I disagree.  Moving memory around is quite possible and could be 
 implemented efficiently, given that we are dealing with GC references
  rather than pointers.  That is, after all, what GC is all about, 
 right?  Not having to deal with those nasty, hateful little pointers?
  Having to move memory around once in a while is quite worth the 
 effort in this case.  A slight drawback is when the buffer is 
 resized, this may take a few clock cycles, especially if the buffer 
 is quite large.  However, this should very rarely happen, especially
  for large buffers.

We are dealing with pointers. What do you mean or imagine under a GC 
reference? What should it look like? Take multiple words of space? Be 
double-indirect? Currently they are just pointers, which is a direct 
advantage for execution performance, and I would be interested in any clue 
how to do it better. Among the problems to consider is that the D runtime 
environment is not encapsulated as, say, Java's is - D is intended to 
interact with low-level code directly, or in fact with anything vaguely 
resembling C at the interface.

It would in fact be interesting to see an efficient and safe way to resettle 
pointers. Now we are paying a few percent of CPU performance for the address 
indirection. Many embedded systems thus don't enable it at all - which, 
however, makes the memory fragmentation problem somewhat more real, 
especially for garbage-collected languages. Having a good way to reposition 
pointers would thus be a helpful thing there. If, however, we have a system 
where we already have to pay an overhead in memory and CPU, perhaps for 
security or other reasons, I would rather make use of it than layer anything 
else on top of it.

Resizing the buffer is, on the contrary, perfectly unproblematic - new 
chunks need not be adjacent, and the buffer need not be contiguous. As long 
as the buffer chunks are of the same size for many differently sized types, 
neither large nor small, there should be no fragmentation.
 I believe that with the right GC algorithm, we can outperform non-GC.

I believe one cannot outperform non-GC allocation directly - we can rather 
outperform it by reducing the costs of semi-manual bookkeeping. Many C++ 
programs use shared_ptr<> to do their memory management, which is both 
syntactically wordy and quite inefficient because of the frequent count 
manipulation and checking. Even worse is relying on an undocumented 
reference-counted implementation of some standard containers, when their 
specified semantics is by-value.

Besides, we are slowly entering a multithreaded world, where using reference 
counting is a *crime*, while GC works much better and is on its way to 
further improvement.

-eye
Mar 17 2005
parent reply "Craig Black" <cblack ara.com> writes:
 We are dealing with pointers. What do you mean or imagine under a GC
 reference? What should it look like? Take multiple words of space? Be
 double-indirect? Currently they are just pointers, which is a direct
 advantage for execution performance, and I would be interested in any
 clue how to do it better. Among the problems to consider is that the D
 runtime environment is not encapsulated as, say, Java's is - D is intended
 to interact with low-level code directly, or in fact with anything vaguely
 resembling C at the interface.

No doubt about it, my approach would very much discourage the use of 
pointers. I'm glad you have inquired about GC references. Again, in my 
approach, all objects are allocated in object pools or buffers according to 
their size, up to a certain point. Let's say the cutoff is 1024 words (it 
could also be 4096 or whatever). All objects equal to or larger than 1024 
words go in the last buffer. Thus, the last buffer must be handled 
differently, but this shouldn't be a problem for performance because very 
few objects will be so large.

These buffers are stored as an array of buffers so that they can be quickly 
indexed. Since all objects are allocated in these buffers, we can break the 
GC reference up into two indices. The first index denotes the particular 
buffer, the second denotes the specific object in the buffer. These indices 
could be stored in 32 bits, or 64 bits for computers with a lot of memory.
 It would in fact be interesting to see an efficient and safe way to
 resettle pointers. Now we are paying a few percent of CPU performance
 for the address indirection.

True. There is more indirection involved than with simple pointers, but the 
access is still O(1). However, allocation and deallocation are cheaper by an 
order of magnitude.
 Many embedded systems thus don't enable it at all - which however makes
 the memory fragmentation problem somewhat more real, especially for
 garbage-collected languages. Having a good way to reposition pointers
 would thus be a helpful thing there. If, however, we have a system where
 we already have to pay an overhead in memory and CPU, perhaps for
 security or other reasons, I would rather make use of it than layer
 anything else on top of it.

 Resizing the buffer is, on the contrary, perfectly unproblematic - new
 chunks need not be adjacent, and the buffer need not be contiguous. As
 long as the buffer chunks are of the same size for many differently
 sized types, neither large nor small, there should be no fragmentation.

Exactly. A relatively simple solution. No fragmentation within the buffers.
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

I disagree. If the GC has the capability to compact memory, then I believe non-GC can be outdone. However, this has yet to be seen.
 we can rather outperform it by reducing the costs of semi-manual
 bookkeeping. Many C++ programs use shared_ptr<> to do their memory
 management, which is both syntactically wordy and quite inefficient
 because of the frequent count manipulation and checking. Even worse is
 relying on an undocumented reference-counted implementation of some
 standard containers, when their specified semantics is by-value. Besides,
 we are slowly entering a multithreaded world, where using reference
 counting is a *crime*,

What are the problems with reference counting and multithreading? It would 
seem to me that as long as you "synchronize" the appropriate methods, there 
should be no problem. I don't use shared-object templates in C++, but I 
personally prefer reference counting to single references. It seems more 
powerful.
 while GC works much better and is on the way to further improvement.

I hope so. -Craig
Mar 17 2005
next sibling parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Craig Black wrote:
 No doubt about it, my approach would very much discourage the use of 
 pointers. 

!?! They are even syntactically the same.
 I'm glad you have inquired about GC references.  Again, in my 
 approach, all objects are allocated in object pools or buffers, according to 
 their size, up to a certain point.  Let's say the cutoff is 1024 words 
 (it could also be 4096 or whatever).  All objects equal to or larger than 
 1024 words go in the last buffer.  Thus, the last buffer must be handled 
 differently, but this shouldn't be a problem for performance because very 
 few objects will be so large.
 
 These buffers are stored as an array of buffers so that they can be quickly 
 indexed. Since all objects are allocated in these buffers, we can break 
 the GC reference up into two indices.  The first index denotes the 
 particular buffer, the second denotes the specific object in the buffer. 
 These indices could be stored in 32 bits, or 64 bits for computers with a 
 lot of memory.

Ok, so you give up a minor amount of performance for indirection. You can 
have, for example, 64-kilobyte pools, and a table of up to 65536 pool 
pointers on a 32-bit memory architecture. So far so good. There is even no 
huge problem accommodating normal pointers - let them be simple 
one-word-pooled garbage-collected objects.

However, this leaves one question open: when scanning, your starting point 
is the stack and the globals. The stack does not contain any type 
information. When gliding over an element of the stack - of which you think 
it might be a GC reference, but which is finally nothing more or less than a 
machine word - you have to unambiguously decide whether it is a pointer or 
not. There are many contexts where one could supply type information without 
a run-time disadvantage, but I don't know a way to do that for the stack.

The current policy is: we assume everything is a pointer. If we are right, 
then if it points into some actually allocated region of memory, we mark 
that region as currently used and thus know we cannot deallocate it. There 
are naturally some memory regions which get marked although they aren't 
actually pointed to by a pointer - it was just an integer, or just about any 
collection of bits - but this happens rarely and hardly makes any 
significant difference.

But can we move a marked region? No! Because we would need to go back and 
change everything that we thought was a pointer to it - and what if that 
something was just a random collection of bits? Your proposal has the same 
problem, if I understood it correctly.
 It would in fact be interesting to see an efficient and safe way to
 resettle pointers. Now we are paying a few percent of CPU performance
 for the address indirection.

 True. There is more indirection involved than with simple pointers, but
 the access is still O(1). However, allocation and deallocation are
 cheaper by an order of magnitude.

Would memory management really become cheaper? Allocation already comes from 
size-typed pools. Deallocation with free involves searching for the pointer 
in a map to determine the size of the memory region to free, which is 
O(log n); alternatively, the size of the region can be written just before 
the pointed-to spot in the allocated region. The second variation is O(1), 
but it requires that the allocator always allocate one word more than 
requested and return a pointer to the following byte.

There is one problem in C which forces this whole design: a pointer can 
designate any sort of memory region - a struct, an array, a struct with an 
array field at the end... anything! C++ and D enforce a stronger design: 
anything allocated with new must be deallocated with delete, or in D's case 
also by the GC. Furthermore, one cannot do many sorts of pointer conversions 
legally. Thus we can assume all pointers are well-typed at the point where 
deallocation happens. This has one of the following consequences:

(1) The type is of statically known size. Thus the size need not be stored 
with the object, but can be hardcoded at the deallocation point and given to 
the actual deallocator.

(2) The type is a class which can be substituted by a descendant when 
handled by pointer, which in C++ means it has virtual member functions 
declared. Because the size is determined by which descendant we are holding, 
it can be hidden per-class in the classinfo or vtable, taken from there, and 
given to the deallocator.

(3) The type is an array. We use the over-allocation technique to store the 
length, which is given to the deallocator.

The deallocator, having the size, can quickly select the pool to which the 
reclaimed memory goes. A pool contains a list of the free blocks within it. 
I see how this could be optimized further when we can compact, though.
I don't know whether actual C++ implementations employ such a technique, but 
I think they could, and thus C++ gives us every sort of freedom to achieve 
almost optimal memory management. As for D, I would not be so sure about the 
first point, because when the GC scans, it need not know the type - it cares 
more about the size. Definitely, this is one direction in which one should 
think and probably get some improvement.
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

 I disagree. If the GC has the capability to compact memory, then I believe
 non-GC can be outdone. However, this has yet to be seen.

As I have shown above, only marginally. Deallocation usually happens in the 
GC, but it doesn't really take time. A much larger amount of time goes into 
determining which memory regions are alive. Currently, the GC recursively 
goes through the whole heap, starting with the stack, and considers every 
word it encounters a potential pointer. Checking whether a word points into 
an allocated block of memory takes a map lookup, which is logarithmic in the 
number of allocated blocks, so a complete scan is O(n log n) in the amount 
of active memory. Walter knows of enough ways to improve this; he just 
hasn't come around to implementing it yet, since there have always been more 
important issues.
 What are the problems with reference counting and multithreading?  It would 
 seem to me that as long as you "synchronize" the appropriate methods, there 
 should be no problem.  I don't use shared-object templates in C++, but I 
 personally prefer reference counting to single references.  It seems more 
 powerful.

"Synchronize" is the problem. You have to suspend all threads while you 
increment a counter - unless you can prove that your thread is the only one 
which could be manipulating the counter at the moment. This might not be 
trivial when we have multiple processors. And thus a simple small operation 
- incrementing an integer - starts to take a disproportionately huge amount 
of time. It all just bogs down.

-eye
Mar 18 2005
parent reply David Medlock <amedlock nospam.org> writes:
Ilya Minkov wrote:
<snip>
 -eye

Good points. Pretty funny we both posted such similar responses in different 
groups (the other is on digitalmars.d) to the same thread. Hehe!

-David
Mar 18 2005
parent Ilya Minkov <minkov cs.tum.edu> writes:
David Medlock wrote:
 Good points.
 
 Pretty funny we both posted such similar responses in different 
 groups (the other is on digitalmars.d) to the same thread.

Really? The other newsgroup is too fat for me to read. Of course, most of 
what I say has already been stated by me or someone else, and comes mostly 
from knowledge I gained through scientific papers posted on the newsgroup 
back when I was an active participant years ago. In fact, I imagine Walter 
must be getting very tired of similar proposals coming over and over, just 
to be discarded or put off after a few rounds of collective thinking and 
gathering the arguments all over again.

If I may do that sort of stupid thing and throw in a question for 
consideration: is there a sensible way to separate foreign-language 
execution from D, stack-wise? I think I have an idea which is worth thinking 
about. Imagine you have a function, alongside which its stack layout 
information is stored. (*1) Then, when the GC kicks in, you just figure out 
which function is running now and read its stack layout - from which you can 
exactly determine the pointers. Go to the entry side of the stack frame and 
find the return address there. Search for that function. And so on. However, 
C functions interfere. And a C function can call a D function can call a C 
function... This might mean we would need to introduce some restrictions 
into the language. What kind of restrictions?

(*1) I know, there is no static stack layout per se, but one could 
over-allocate so that all locals fit, and separate alloca from the execution 
stack - which would be a performance advantage anyway, because the alloca 
"heapstack" need not be scanned for pointers unless explicitly made a GC 
root. Additionally, for all functions called, the point at which the return 
address is stored must be the same. I also wonder how symbolic debuggers 
work. Probably not an option, since they slow the whole thing down.
Another option is to use a mechanism like stack unrolling, which means 
around each stack frame there would be handlers - something that D has been 
avoiding even for exceptions, to maintain a performance edge over C++.

Nothing here is up for immediate implementation, since for now we only need 
to concern ourselves with crafting a minimal set of language features which 
won't force us to break compatibility afterwards. To cut all the weak 
branches now.

-eye
Mar 21 2005
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

 I disagree. If the GC has the capability to compact memory, then I believe
 non-GC can be outdone. However, this has yet to be seen.

This is like asking whether auto transmission or stick shift is faster. It 
really depends on the circumstances. (I know, I know, mostly stick shift is 
faster, but that was not the point, right?)

Whether a particular GC algorithm outperforms some code depends first of all 
on the purpose of the code itself, and on the particular programmer. Want to 
find a GC outperforming Bjarne Stroustrup doing a simple linked list? OTOH, 
want to find a GC _not_ outperforming a newbie doing a simple linked list?

"This code is faster than C." "This code is faster than ASM."

Besides, a GC for web applications would probably be different from one for 
a text editor, for example.

-------------

Besides, the real advantage of GC is not speed. It is reliability of code, 
ease of programming, and separation of disparate issues (here, memory 
management and program logic).
Mar 25 2005