digitalmars.D - Interior pointers and fast GC

Chris Wright (150/150) Jan 13 2017 Interior pointers are a barrier to performant garbage collection.

Rainer Schuetze (19/29) Jan 14 2017 [...]

Nick Treleaven (11/14) Jan 21 2017 Makes me wonder about a GC'd language where each pointer is

Shachar Shemesh (6/17) Jan 22 2017 You are neglecting to account for cache locality performance cost. An

deadalnix (13/15) Jan 21 2017 1. Split the heap in chunk of size n being a power of 2, say 4M.

Araq (8/20) Jan 21 2017 It's an O(1) that requires a hash table lookup in general because

Chris Wright (24/30) Jan 21 2017 Overexplaining because I'm tired.

Araq (6/20) Jan 21 2017 Sure, O(log #pools) + O(1) is "much better" than O(1).

Chris Wright (16/37) Jan 22 2017 Much better than what's available today, unless I misunderstood the

deadalnix (7/14) Jan 22 2017 Huge allocs are always handled specifically by allocators. The

Chris Wright <dhasenan gmail.com> writes:

Interior pointers are a barrier to performant garbage collection.

Here, I'm brainstorming about the problem and not really coming to any 
conclusions.

The world today
===============
D's garbage collector is significantly less performant than Java's, and 
this is partly the result of D's features. Interior pointers are costly.

What's the deal with interior pointers?
---------------------------------------
A heap object (for this discussion) is a series of bytes on the GC heap.

An interior pointer is a pointer to some memory location within a heap 
object, but not to the beginning of the heap object.

A garbage collector that never has to deal with internal pointers has 
several options for storing metadata related to a heap object:

 * In a segment of data immediately before the heap object
 * In a hashmap
 * In a binary search tree, a prefix tree, or the like

The first strategy is O(1) and has good data locality relative to the 
data pointed at. Eclipse OMR assumes you use this strategy (though it 
supports other methods).

The second strategy is O(1) in the expected case. By default, it has poor 
data locality -- though you can improve that with careful management of 
virtual memory.

The third strategy is O(log N) in the expected case and has similar data 
locality to the hashtable.

Obviously, we prefer the first strategy. Unfortunately, given an interior 
pointer, you can't identify the base of its heap object in constant time. 
I believe D's garbage collector uses a binary search tree.

D allows unrestricted interior pointers. This makes it difficult to 
experiment with other garbage collectors, except for those designed for D 
or C++.


Addressof
---------
D lets you take the address of anything:

  class Paddock {
      uint goats, sheep;
  }
  auto paddock = new Paddock;
  auto goatPtr = &paddock.goats;

The creation of interior pointers this way can be detected at compile 
time, if that benefits us.


Arrays
------
A D array is a struct containing a pointer and a length.

D encourages array slicing:

  auto str = "hello world!";
  foreach (word; str.split(' ')) {
      // This is an interior pointer
      assert(word.ptr >= str.ptr);
      assert(word.ptr < str.ptr + str.length);
  }

Array slicing is splendid and worthwhile, and eliminating it would be a 
huge cost.

We can at least potentially detect non-pathological cases of this.


Closures
--------
Closures can create interior pointers, depending on the compiler's 
implementation:

  class Paddock
  {
      uint goats, sheep;
      void trackCattle() {
          cattleEnter.connect((flock) {
              if (flock.isGoats) goats += flock.size;
              if (flock.isSheep) sheep += flock.size;
          });
      }
  }

The compiler might include a copy of the `this` pointer in the closure, 
or it might include direct pointers to `goats` and `sheep`. Both produce 
equivalent results aside from the creation of interior pointers.

When the closure is defined in a struct method, it is impossible to 
ensure that the contents of the closure do not include an interior 
pointer -- the struct itself might be a member of another heap object.


What options do we have?
========================
If we want to reduce interior pointers, we have a number of options. We 
would likely need several to get reasonable results.

Forbid user-generated interior pointers
---------------------------------------
We might simply forbid taking the address of a member of a heap object, 
turning it into a compile-time error, or merely leaving it as undefined 
behavior.

This is unlikely to be feasible. D's lifeblood is array slicing, and that 
produces interior pointers left and right.


Augment arrays
--------------
We can augment arrays to eliminate interior pointers for common 
operations:

  struct Array(T)
  {
      private T* baseptr;
      size_t offset;
      size_t length;
      T* ptr() { return baseptr + offset; }
  }

It is possible to construct an array from a pointer, and any pointer 
might be an interior pointer, but this strategy would at least reduce the 
number of interior pointers we have floating around.


Change struct this pointer to pointer+offset
--------------------------------------------
We convert struct `this` pointers to a pointer plus an offset in order to 
eliminate interior pointers from closures involving struct members. This 
probably requires adding a hidden field to each struct, which breaks C 
compatibility. It also requires a lot of bookkeeping, and I'm not sure 
that's always possible. Uninitialized struct values break this.


Mark allocations that might contain interior pointers
-----------------------------------------------------
Since we can detect many things that might create interior pointers at 
compile time, we can explicitly mark them as such. Conversely, we can 
prove that many things do not contain interior pointers, and we could 
mark those.

When we scan a heap object that is marked as not containing interior 
pointers, we can use a fast path to locate the metadata associated with 
that heap object. Otherwise, we revert to the current, slow path.

This will probably be useless unless we take steps to reduce the number 
of interior pointers. By default, if an aggregate contains a pointer or 
an array, it contains a possible interior pointer. If we augment arrays 
to avoid interior pointers, this might be useful.

This is always an optimization because we can tell immediately (O(1) 
time, reading values that are already in memory) whether to use the slow 
or fast path.


Mark potential interior pointers
--------------------------------
By managing virtual memory explicitly, we can use the high order bit of a 
pointer into the GC heap to indicate whether that is potentially an 
interior pointer. We can potentially memory map multiple preselected 
virtual address ranges to the same physical range (I'm uncertain about 
this). However, on systems where that is not possible, every pointer read 
requires invoking a runtime function.

This is not an option on 32-bit systems, where we have limited address 
space to use. It would reduce us to ~1.5GB of GC heap (since the OS 
typically reserves 0.5GB of address space). ASLR adds more work.

If implemented, this would give us an O(1) means for determining whether 
to use the slow case or not, assuming we also managed to identify the 
construction of all potential interior pointers.

This is also problematic with C/C++ interop and inline assembly. You can 
trivially construct interior pointers in either, and they would be 
unmarked. If pointer reads require invoking a runtime function, that is a 
barrier with passing pointers into C/C++ code.


Hashtable with search tree fallback
-----------------------------------
We can use a hashtable of allocations first and fall back to a binary 
search tree if necessary.

This has us maintaining two search data structures when we'd prefer zero. 
However, it's the most easily implemented version.

The utility of this is proportional to the relative frequency of interior 
pointers. If they are rare, this strategy pays off. If they are common, 
this strategy adds overhead for little gain. I anticipate that this would 
be slower with current DMD.

Jan 13 2017

Rainer Schuetze <r.sagitario gmx.de> writes:

On 14.01.2017 05:37, Chris Wright wrote:
 Interior pointers are a barrier to performant garbage collection.

[...]
  * In a binary search tree, a prefix tree, or the like

[...]
 The third strategy is O(log N) in the expected case and has similar data
 locality to the hashtable.

 Obviously, we prefer the first strategy. Unfortunately, given an interior
 pointer, you can't identify the base of its heap object in constant time.
 I believe D's garbage collector uses a binary search tree.

 D allows unrestricted interior pointers. This makes it difficult to
 experiment with other garbage collectors, except for those designed for D
 or C++.

The garbage collected memory is organized as pools of growing size. For 
each pool there are lookup tables for each page that has fixed sized 
memory blocks. That way finding the base of an allocation is a matter of 
finding the pool (currently O(log #pools); by N I assume you mean the 
number of allocations) but the rest is constant time.

IIRC an application that uses 1 GB of memory has about 20 pools, which 
means the time to figure out the base address and the size of an 
allocation is more or less constant.

In addition, you need to lookup the pool anyway to figure out if the 
pointer points to non-managed memory (stack, global data, malloc'd memory).

So, I don't think interior pointers are a performance problem of D's GC. 
Other GCs might not be able to deal with these, though.

The much larger problem with trying more sophisticated GCs is the lack 
of a fast write barrier. If we get one with compiler assisted 
(automatic) reference counting, we might aswell try this for other types 
of GC.

Jan 14 2017

Nick Treleaven <nick geany.org> writes:

On Saturday, 14 January 2017 at 15:30:42 UTC, Rainer Schuetze 
wrote:
 In addition, you need to lookup the pool anyway to figure out 
 if the pointer points to non-managed memory (stack, global 
 data, malloc'd memory).

Makes me wonder about a GC'd language where each pointer is 
actually a member of a struct which also has a base allocation 
pointer. The base pointer could either only be set for managed 
allocations, or the low bit of the base address could be used to 
indicate such instead*. This would make for faster GC scans, but 
would also cause slowdowns when copying pointers. Pointer 
arithmetic would be just as efficient though.

(*) With this scheme, pointer arithmetic can be  safe in assert 
mode.

Jan 21 2017

Shachar Shemesh <shachar weka.io> writes:

On 21/01/17 17:30, Nick Treleaven wrote:
 On Saturday, 14 January 2017 at 15:30:42 UTC, Rainer Schuetze wrote:
 In addition, you need to lookup the pool anyway to figure out if the
 pointer points to non-managed memory (stack, global data, malloc'd
 memory).

 Makes me wonder about a GC'd language where each pointer is actually a
 member of a struct which also has a base allocation pointer. The base
 pointer could either only be set for managed allocations, or the low bit
 of the base address could be used to indicate such instead*. This would
 make for faster GC scans, but would also cause slowdowns when copying
 pointers. Pointer arithmetic would be just as efficient though.

 (*) With this scheme, pointer arithmetic can be  safe in assert mode.

You are neglecting to account for cache locality performance cost. An 
array of pointers would, under this plan, store half as many pointers in 
a single cache line. This means handling this array would be, give or 
take, half as fast. That's a huge price to pay.

Shachar

Jan 22 2017

deadalnix <deadalnix gmail.com> writes:

On Saturday, 14 January 2017 at 04:37:01 UTC, Chris Wright wrote:
 Unfortunately, given an interior pointer, you can't identify 
 the base of its heap object in constant time.

1. Split the heap in chunk of size n being a power of 2, say 4M. 
Align them 4M.
2. Find the chunk an alloc is part of in O(1) bu masking the 
lower bits (22 bits to mask in our 4M case).
3. Have a table of page descriptor in the chunk header. Lookup 
the page the alloc is in in O(1).
4a. If the alloc is large (flag in the page descriptor), find the 
base pointer in O(1).
4b. if the alloc is small, compute the index of the item in the 
page from the size class in the page descriptor (on addition, one 
multiply and one shift) in O(1).

Start on false premise, end up nowhere.

Jan 21 2017

Araq <rumpf_a web.de> writes:

On Saturday, 21 January 2017 at 17:42:46 UTC, deadalnix wrote:
 1. Split the heap in chunk of size n being a power of 2, say 
 4M. Align them 4M.
 2. Find the chunk an alloc is part of in O(1) bu masking the 
 lower bits (22 bits to mask in our 4M case).
 3. Have a table of page descriptor in the chunk header. Lookup 
 the page the alloc is in in O(1).
 4a. If the alloc is large (flag in the page descriptor), find 
 the base pointer in O(1).
 4b. if the alloc is small, compute the index of the item in the 
 page from the size class in the page descriptor (on addition, 
 one multiply and one shift) in O(1).

 Start on false premise, end up nowhere.

It's an O(1) that requires a hash table lookup in general because 
allocations can exceed the chunk size and so you cannot just mask 
the pointer and look at the chunk header because it might not be 
a chunk header at all. Know any production GCs that use hash 
table lookups for pointer assignments? Me neither. Ok ok, maybe 
Go does, it's the only language with GC that embraces interior 
pointers as stupid as that is.

Jan 21 2017

Chris Wright <dhasenan gmail.com> writes:

On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
 It's an O(1) that requires a hash table lookup in general because
 allocations can exceed the chunk size and so you cannot just mask the
 pointer and look at the chunk header because it might not be a chunk
 header at all. Know any production GCs that use hash table lookups for
 pointer assignments? Me neither. Ok ok, maybe Go does, it's the only
 language with GC that embraces interior pointers as stupid as that is.

Overexplaining because I'm tired.

If you have fixed-size allocations, you get much better time: O(log 
#pools) to find the pool, O(1) to find the offset in that pool. The cost 
is you multiply the number of pools in existence by the number of 
allocation sizes you decide to support.

This obviously helps for structs and classes.

It also helps for small arrays: a small array is a fixed-size allocation 
that can potentially be reallocated.

For instance, I decide to support allocations of 16, 32, and 256 bytes. 
The string "A British tar is a soaring soul" fits into a 32-byte 
allocation, so I put it in the 32-byte pool.

If I append the string " as free as a mountain bird" to it, that no 
longer fits, so I copy it into the 256 byte pool and append there.

If you then take a slice from it, "a soaring soul", I can find the 
relevant pool by examining every pool (or by traversing a search tree). 
The pool says it's a 32-byte pool, so floor-32 of the pointer is the base 
pointer.

Large heap objects (anything above the maximum supported size, which will 
be about one OS page) don't afford this convenience -- they aren't of any 
particular size, so you still need to search through every allocation.

So the performance of D's GC is rather strongly dependent on how large 
arrays tend to get.

Unless I'm misunderstanding the GC yet again.

Jan 21 2017

Araq <rumpf_a web.de> writes:

On Sunday, 22 January 2017 at 06:28:35 UTC, Chris Wright wrote:
 On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
 It's an O(1) that requires a hash table lookup in general 
 because allocations can exceed the chunk size and so you 
 cannot just mask the pointer and look at the chunk header 
 because it might not be a chunk header at all. Know any 
 production GCs that use hash table lookups for pointer 
 assignments? Me neither. Ok ok, maybe Go does, it's the only 
 language with GC that embraces interior pointers as stupid as 
 that is.

 Overexplaining because I'm tired.

 If you have fixed-size allocations, you get much better time: 
 O(log #pools) to find the pool, O(1) to find the offset in that 
 pool. The cost is you multiply the number of pools in existence 
 by the number of allocation sizes you decide to support.

Sure, O(log #pools) + O(1) is "much better" than O(1).

To shorten the discussion: You're looking for a variant of "card 
marking". No overhead for interior pointers if you do it right. I 
think, never worked out the details though.

http://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works

Jan 21 2017

Chris Wright <dhasenan gmail.com> writes:

On Sun, 22 Jan 2017 06:44:47 +0000, Araq wrote:

 On Sunday, 22 January 2017 at 06:28:35 UTC, Chris Wright wrote:
 On Sun, 22 Jan 2017 05:02:43 +0000, Araq wrote:
 It's an O(1) that requires a hash table lookup in general because
 allocations can exceed the chunk size and so you cannot just mask the
 pointer and look at the chunk header because it might not be a chunk
 header at all. Know any production GCs that use hash table lookups for
 pointer assignments? Me neither. Ok ok, maybe Go does, it's the only
 language with GC that embraces interior pointers as stupid as that is.

 Overexplaining because I'm tired.

 If you have fixed-size allocations, you get much better time: O(log
 #pools) to find the pool, O(1) to find the offset in that pool. The
 cost is you multiply the number of pools in existence by the number of
 allocation sizes you decide to support.

 Sure, O(log #pools) + O(1) is "much better" than O(1).

Much better than what's available today, unless I misunderstood the 
current GC again.

A tree lookup with < 100 elements in the tree won't be much slower than a 

control virtual memory more strictly (and have virtual memory to burn -- 
so not on 32-bit), you can put a better constant on the hashtable lookup.

 To shorten the discussion: You're looking for a variant of "card
 marking". No overhead for interior pointers if you do it right. I think,
 never worked out the details though.

Card marking is about write barriers. Write barriers are about 
incremental collection, caching part of your past GC run information so 
you can update it using only those pages of memory that have changed 
since last time. It's low granularity because it's essentially free to 
read an entire page of memory once you've read one byte. It's also low 
granularity because one common strategy is to mprotect(3) the pages as 
read-only and install a fault handler, which is an expensive way of doing 
things that you want to repeat fewer times if possible.

It's not what I'm looking for.

Jan 22 2017

deadalnix <deadalnix gmail.com> writes:

On Sunday, 22 January 2017 at 05:02:43 UTC, Araq wrote:
 It's an O(1) that requires a hash table lookup in general 
 because allocations can exceed the chunk size and so you cannot 
 just mask the pointer and look at the chunk header because it 
 might not be a chunk header at all. Know any production GCs 
 that use hash table lookups for pointer assignments? Me 
 neither. Ok ok, maybe Go does, it's the only language with GC 
 that embraces interior pointers as stupid as that is.

Huge allocs are always handled specifically by allocators. The 
usual technique is via a radix tree. But it doesn't really matter 
for the discussion at hand: huge alloc are not numerous. If you 
have 4G of RAM, by definition, you need to have less than a 1000 
of them with he above mentioned scheme. The whole lookup 
datastructure can fit in cache.

Jan 22 2017

D Programming

C/C++ Programming

Other

digitalmars.D - Interior pointers and fast GC