
digitalmars.D.learn - Fast Memory Allocation Strategy

reply "Craig Black" <cblack ara.com> writes:
I've been looking at D's approach to memory allocation and I have an idea 
that I think will cause memory allocation to be blazingly fast.

I don't know what to call it, and there probably is a name for it.  It's 
kind of similar to D's mark and release approach but slightly more involved.

Objects are organized into buffers according to their size.  So if an object 
is 10 words long, it will go into the 10-word buffer.  If an object is 5 
words long, it goes into the 5-word buffer, etc.  Thus all objects in each 
buffer are of equal size.

The buffers must be resizable.  When an object is requested from a buffer 
that is full, the buffer is enlarged to accommodate more objects.  When the 
buffer is resized, the GC must move all the objects it contains.  When an 
object is freed, the gap in the buffer is filled with the last object in the 
buffer, which requires the GC to move that block of memory.  Thus there is 
no memory fragmentation within each buffer.

The problem with this approach is that it requires interaction with the GC 
that I don't think D currently provides, so Walter may have to get involved 
for this to be implemented.

All you smart D folks, let me know what you think.

-Craig 
Mar 16 2005
parent reply Ilya Minkov <minkov cs.tum.edu> writes:
That's how Doug Lea's memory manager (i.e. malloc/free implementation) 
works, and it is known to be one of the least fragmenting memory 
managers. And it's fast! The GNU C library's malloc is derived from it, 
and this is one of the many reasons for good Linux performance.

The difference is that it naturally cannot move anything, but that 
hardly has any negative consequence, since the program will likely 
allocate more data of the same size later. Even otherwise, there are 
different strategies: if we are speaking of very small objects, the 
deficiency can safely be ignored; for multi-kilobyte ones, one can 
consider assigning the space to a different pool or returning it to the 
OS. Basically, since we have virtually mapped memory, it is much more 
efficient to unmap a piece of address space than to move the data and go 
back to correct the pointers.

I stumbled over the subject in an article on Gamasutra, as I wanted 
to learn whether I should expect to run into memory management trouble 
on a (now abandoned but historically gorgeous) game console with 16 
megabytes of main RAM, and how to avoid or solve it.

I might be wrong, but I seem to remember the Phobos GC also uses a sort of 
size-discriminated pool allocator, and I would guess the Boehm collector does 
as well - fast memory allocation is one of its main selling points.

I suggest you read an article on it, which says that with such an 
allocator fragmentation basically ceases to bother anyone - usually even 
on a game console which doesn't have virtual memory enabled for 
performance reasons.

	"The Memory Fragmentation Problem: Solved?"
http://www.cs.utexas.edu/users/wilson/papers/fragsolved.pdf

This document is also available from a few other locations on the internet.

Generally, precise garbage collection is a hard problem - and moving 
data around is a particularly hard sub-problem. It would force us to 
have a typed stack - not impossible, at least as long as one could 
exclude foreign-language (e.g. C) calls - but a runtime overhead, 
I think. What we are effectively left with is a collector with 
conservative roots but precise object scanning - and such a collector 
cannot detect each and every pointer, so it cannot patch them all if a 
piece of address space moves.

-eye


Craig Black wrote:
<snip>

Mar 16 2005
parent reply "Craig Black" <cblack ara.com> writes:
Thank you for your response.  You seem to be quite informed about memory 
allocation.  I was searching for a name for my approach.  Perhaps I can 
steal the term "size-discriminated pool allocator" from you?  You don't have 
a copyright on that, do you? :)

 Generally, precise garbage collection is a hard problem - and moving data 
 around is a particularly hard sub-problem. It would force us to have a typed 
 stack - not impossible, at least as long as one could exclude 
 foreign-language (e.g. C) calls - but a runtime overhead, I think. What we 
 are effectively left with is a collector with conservative roots but 
 precise object scanning - and such a collector cannot detect each and every 
 pointer, so it cannot patch them all if a piece of address space moves.

I disagree.  Moving memory around is quite possible and could be implemented 
efficiently, given that we are dealing with GC references rather than 
pointers.  That is, after all, what GC is all about, right?  Not having to 
deal with those nasty, hateful little pointers?

Having to move memory around once in a while is quite worth the effort in 
this case.  A slight drawback is that when the buffer is resized, this may 
take a few clock cycles, especially if the buffer is quite large.  However, 
this should happen very rarely, especially for large buffers.  I believe that 
with the right GC algorithm, we can outperform non-GC.

-Craig
Mar 17 2005
parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Craig Black wrote:
 Thank you for your response.  You seem to be quite informed about 
 memory allocation.  I was searching for a name for my approach.

I think in fact I know too little on almost any subject, including memory 
allocation. However, if I were sure I knew a lot, I wouldn't come here to 
learn quite as much.
 Perhaps I can steal the term "size-discriminated pool allocator" from
 you?  You don't have a copyright on that, do you? :)

How could I? It is simply a bunch of technical terms thrown together. I'm 
not even sure it's correct. I didn't even make one long word out of it - 
which would be in fine German tradition.
 I disagree.  Moving memory around is quite possible and could be 
 implemented efficiently, given that we are dealing with GC references
  rather than pointers.  That is, after all, what GC is all about, 
 right?  Not having to deal with those nasty, hateful little pointers?
  Having to move memory around once in a while is quite worth the 
 effort in this case.  A slight drawback is when the buffer is 
 resized, this may take a few clock cycles, especially if the buffer 
 is quite large.  However, this should very rarely happen, especially
  for large buffers.

We are dealing with pointers. What do you mean or imagine under a GC 
reference? What should it look like? Take multiple words of space? Be 
double-indirect? Currently they are just pointers, which is a direct 
advantage for execution performance, and I would be interested in any clue 
how to do it better. Among the problems to consider is that the D runtime 
environment is not encapsulated as, say, Java's is - D is intended to 
interact with low-level code directly, or in fact with anything vaguely 
resembling C at the interface.

It would in fact be interesting to see an efficient and safe way to resettle 
pointers. Now we are paying a few percent of CPU performance for the address 
indirection. Many embedded systems thus don't enable it at all - which, 
however, makes the memory fragmentation problem somewhat more real, 
especially for garbage-collected languages. Having a good way to reposition 
pointers would thus be a helpful thing there. If, however, we have a system 
where we already have to pay an overhead in memory and CPU, perhaps for 
security or other reasons, I would rather make use of it than layer anything 
else on top of it.

Resizing the buffer is, on the contrary, perfectly unproblematic - new 
chunks need not be adjacent, and the buffer need not be contiguous. As long 
as the buffer chunks are of the same size for many differently sized types, 
neither large nor small, there should be no fragmentation.
 I believe that with the right GC algorithm, we can outperform non-GC.

I believe one cannot outperform non-GC allocation directly - we can rather 
outperform it by reducing the costs of semi-manual bookkeeping. Many C++ 
programs use shared_ptr<> to do their memory management, which is both 
syntactically wordy and quite inefficient because of the frequent count 
manipulation and checking. Even worse is relying on an undocumented 
reference-counted implementation of some standard containers, when their 
specified semantics is by-value.

Besides, we are slowly entering a multithreaded world, where using reference 
counting is a *crime*, while GC works much better and is on its way to 
further improvement.

-eye
Mar 17 2005
parent reply "Craig Black" <cblack ara.com> writes:
 We are dealing with pointers. What do you mean or imagine under a GC
 reference? What should it look like? Take multiple words of space? Be
 double-indirect? Currently they are just pointers, which is a direct
 advantage for execution performance, and I would be interested in any
 clue how to do it better. Among the problems to consider is that the D
 runtime environment is not encapsulated as, say, Java's is - D is intended
 to interact with low-level code directly, or in fact with anything vaguely
 resembling C at the interface.

No doubt about it, my approach would very much discourage the use of 
pointers. I'm glad you have inquired about GC references. Again, in my 
approach, all objects are allocated in object pools or buffers according to 
their size, up to a certain point. Let's say the cutoff is 1024 words (it 
could also be 4096 or whatever). All objects equal to or larger than 1024 
words go in the last buffer. Thus, the last buffer must be handled 
differently, but this shouldn't be a problem for performance because very 
few objects will be so large.

These buffers are stored as an array of buffers so that they can be quickly 
indexed. Since all objects are allocated in these buffers, we can break the 
GC reference up into two indices. The first index denotes the particular 
buffer, the second denotes the specific object in the buffer. These indices 
could be stored in 32 bits, or 64 bits for computers with a lot of memory.
 It would in fact be interesting to see an efficient and safe way to
 resettle pointers. Now we are paying a few percent of CPU performance
 for the address indirection.

True. There is more indirection involved than with simple pointers, but the 
access is still O(1). However, allocation and deallocation are cheaper by an 
order of magnitude.
 Many embedded systems thus don't enable it at all - which however makes
 the memory fragmentation problem somewhat more real, especially for
 garbage-collected languages. Having a good way to reposition pointers
 would thus be a helpful thing there. If, however, we have a system where
 we already have to pay an overhead in memory and CPU, perhaps for
 security or other reasons, I would rather make use of it than layer
 anything else on top of it.

 Resizing the buffer is, on the contrary, perfectly unproblematic - new
 chunks need not be adjacent, and the buffer need not be contiguous. As
 long as the buffer chunks are of the same size for many differently
 sized types, neither large nor small, there should be no fragmentation.

Exactly. A relatively simple solution. No fragmentation within the buffers.
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

I disagree. If the GC has the capability to compact memory, then I believe non-GC can be outdone. However, this has yet to be seen.
 we can rather outperform it by reducing the costs of semi-manual
 bookkeeping. Many C++ programs use shared_ptr<> to do their memory
 management, which is both syntactically wordy and quite inefficient
 because of the frequent count manipulation and checking. Even worse is
 relying on an undocumented reference-counted implementation of some
 standard containers, when their specified semantics is by-value. Besides,
 we are slowly entering a multithreaded world, where using reference
 counting is a *crime*,

What are the problems with reference counting and multithreading? It would 
seem to me that as long as you "synchronize" the appropriate methods, there 
should be no problem. I don't use shared-object templates in C++, but I 
personally prefer reference counting to single references. It seems more 
powerful.
 while GC works much better and is on the way to further improvement.

I hope so. -Craig
Mar 17 2005
next sibling parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Craig Black wrote:
 No doubt about it, my approach would very much discourage the use of 
 pointers. 

!?! They are even syntactically the same.
 I'm glad you have inquired about GC references.  Again, in my 
 approach, all objects are allocated in object pools or buffers, according to 
 their size, up to a certain point.  Let's say the cutoff is 1024 words 
 (it could also be 4096 or whatever).  All objects equal to or larger than 
 1024 words go in the last buffer.  Thus, the last buffer must be handled 
 differently, but this shouldn't be a problem for performance because very 
 few objects will be so large.
 
 These buffers are stored as an array of buffers so that they can be quickly 
 indexed. Since all objects are allocated in these buffers, we can break 
 the GC reference up into two indices.  The first index denotes the 
 particular buffer, the second denotes the specific object in the buffer. 
 These indices could be stored in 32 bits, or 64 bits for computers with a 
 lot of memory.

Ok, so you give up a minor amount of performance for indirection. You can 
have, for example, 64-kilobyte pools, and a table of up to 65536 pool 
pointers on a 32-bit memory architecture. So far so good. There is even no 
huge problem accommodating normal pointers - let them be simple 
one-word-pooled garbage-collected objects.

However, this leaves one question open: when scanning, your starting point 
is the stack and the globals. The stack does not contain any type 
information. When gliding over an element of the stack - of which you think 
it might be a GC reference, but which is finally nothing more or less than a 
machine word - you have to unambiguously decide whether it is a pointer or 
not. There are many contexts where one could supply type information without 
a run-time disadvantage, but I don't know a way to do that for the stack.

The current policy is: we assume everything is a pointer. If we are right, 
then if it points into some actually allocated region of memory, we mark 
that region as currently used and thus know we cannot deallocate it. There 
are naturally some memory regions which get marked although they aren't 
actually pointed to by a pointer - it was just an integer, or just about any 
collection of bits - but this happens rarely and hardly makes any 
significant difference.

But can we move a marked region? No! Because we would need to go back and 
change everything that we thought was a pointer to it - and what if that 
something was just a random collection of bits? Your proposal has the same 
problem, if I understood it correctly.
 It would in fact be interesting to see an efficient and safe way to
 resettle pointers. Now we are paying a few percent of CPU performance
 for the address indirection.

 True. There is more indirection involved than with simple pointers, but
 the access is still O(1). However, allocation and deallocation are
 cheaper by an order of magnitude.

Would memory management really become cheaper? Allocation already comes from 
size-typed pools. Deallocation with free involves searching for the pointer 
in a map to determine the size of the memory region to free, which is 
O(log n); alternatively, the size of the region can be written just before 
the pointed-to spot in the allocated region. The second variation is O(1), 
but it requires that the allocator always allocate one word more than 
requested and return a pointer to the following byte.

There is one problem in C which forces this whole design: a pointer can 
designate any sort of memory region - a struct, an array, a struct with an 
array field at the end... anything! C++ and D enforce a stronger design: 
anything allocated with new must be deallocated with delete, or in D's case 
also by the GC. Furthermore, one cannot do many sorts of pointer conversions 
legally. Thus we can assume all pointers are well-typed at the point where 
deallocation happens. This has one of the following consequences:

(1) The type is of statically known size. Thus the size need not be stored 
with the object, but can be hardcoded at the deallocation point and given to 
the actual deallocator.

(2) The type is a class which can be substituted by a descendant when 
handled by pointer, which in C++ means it has virtual member functions 
declared. Because the size is determined by which descendant we are holding, 
it can be hidden per-class in the classinfo or vtable, taken from there, and 
given to the deallocator.

(3) The type is an array. We use the over-allocation technique to store the 
length, which is given to the deallocator.

The deallocator, having the size, can quickly select the pool to which the 
reclaimed memory goes. A pool contains a list of the free blocks within it. 
I see how this could be optimized further when we can compact, though.
I don't know whether actual C++ implementations employ such a technique, but 
I think they could, and thus C++ gives us every sort of freedom to achieve 
almost optimal memory management. As for D, I would not be so sure about the 
first point, because when the GC scans, it need not know the type - it cares 
more about the size. Definitely, this is one direction in which one should 
think and probably get some improvement.
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

 I disagree. If the GC has the capability to compact memory, then I believe
 non-GC can be outdone. However, this has yet to be seen.

As I have shown above, only marginally. Deallocation usually happens in the 
GC, but it doesn't really take time. A much larger amount of time goes into 
determining which memory regions are alive. Currently, the GC recursively 
goes through the whole heap, starting with the stack, and considers every 
word it encounters a potential pointer. Checking whether a word points into 
an allocated block of memory takes a map lookup, which is logarithmic in the 
number of allocated blocks, so a complete scan is O(n log n) in the amount 
of active memory. Walter knows of enough ways to improve this; he just 
hasn't come around to implementing it yet, since there have always been more 
important issues.
 What are the problems with reference counting and multithreading?  It would 
 seem to me that as long as you "synchronize" the appropriate methods, there 
 should be no problem.  I don't use shared-object templates in C++, but I 
 personally prefer reference counting to single references.  It seems more 
 powerful.

"Synchronize" is the problem. You have to suspend all threads while you 
increment a counter - unless you can prove that your thread is the only one 
which could be manipulating the counter at the moment. This might not be 
trivial when we have multiple processors. And thus a simple small operation 
- incrementing an integer - starts to take a disproportionately huge amount 
of time. It all just bogs down.

-eye
Mar 18 2005
parent reply David Medlock <amedlock nospam.org> writes:
Ilya Minkov wrote:
<snip>
 -eye

Good points. Pretty funny we both posted such similar responses in different 
groups (the other is on digitalmars.d) to the same thread. Hehe!

-David
Mar 18 2005
parent Ilya Minkov <minkov cs.tum.edu> writes:
David Medlock wrote:
 Good points.
 
 Pretty funny we both posted such similar responses in different 
 groups (the other is on digitalmars.d) to the same thread.

Really? The other newsgroup is too fat for me to read. Of course, most of 
what I say has already been stated by me or someone else, and comes mostly 
from knowledge I gained through scientific papers posted on the newsgroup 
back when I was an active participant years ago. In fact, I imagine Walter 
must be getting very tired of similar proposals coming over and over, just 
to be discarded or put off after a few rounds of collective thinking and 
gathering the arguments all over again.

If I may do that sort of stupid thing and throw in a question for 
consideration: is there a sensible way to separate foreign-language 
execution from D, stack-wise? I think I have an idea which is worth thinking 
about. Imagine you have a function, alongside which its stack layout 
information is stored. (*1) Then, when the GC kicks in, you just figure out 
which function is running now and read its stack layout - from which you can 
exactly determine the pointers. Go to the entry side of the stack frame and 
find the return address there. Search for that function. And so on. However, 
C functions interfere. And a C function can call a D function can call a C 
function... This might mean we would need to introduce some restrictions 
into the language. What kind of restrictions?

(*1) I know, there is no static stack layout per se, but one could 
over-allocate so that all locals fit, and separate alloca from the execution 
stack - which would be a performance advantage anyway, because the alloca 
"heapstack" need not be scanned for pointers unless explicitly made a GC 
root. Additionally, for all functions called, the point at which the return 
address is stored must be the same. I also wonder how symbolic debuggers 
work. Probably not an option, since they slow the whole thing down.
Another option is to use a mechanism like stack unrolling, which means 
around each stack frame there would be handlers - something that D has been 
avoiding even for exceptions, to maintain a performance edge over C++.

Nothing here is up for immediate implementation, since for now we only need 
to concern ourselves with crafting a minimal set of language features which 
won't force us to break compatibility afterwards. To cut all the weak 
branches now.

-eye
Mar 21 2005
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
 I believe that with the right GC algorithm, we can outperform non-GC.

 I believe one cannot outperform non-GC allocation directly -

 I disagree. If the GC has the capability to compact memory, then I believe
 non-GC can be outdone. However, this has yet to be seen.

This is like asking whether auto transmission or stick shift is faster. It 
really depends on the circumstances. (I know, I know, mostly stick shift is 
faster, but that was not the point, right?)

Whether a particular GC algorithm outperforms some code depends first of all 
on the purpose of the code itself, and on the particular programmer. Want to 
find a GC outperforming Bjarne Stroustrup doing a simple linked list? OTOH, 
want to find a GC _not_ outperforming a newbie doing a simple linked list?

"This code is faster than C." "This code is faster than ASM."

Besides, a GC for web applications would probably be different from one for 
a text editor, for example.

-------------

Besides, the real advantage of GC is not speed. It is reliability of code, 
ease of programming, and separation of disparate issues (here, memory 
management and program logic).
Mar 25 2005