www.digitalmars.com         C & C++   DMDScript  

D - Copying/heap compacting GC

reply Stewart Gordon <smjg_1998 yahoo.com> writes:
http://www.digitalmars.com/d/garbage.html


slower ones. Generational, copying collectors eliminate much of the 
inefficiency of early mark and sweep algorithms.

reduce the number of pages actively referenced by a program, which means 
that memory accesses are more likely to be cache hits and less swapping."

The fact that this is given in the context of D suggests that these 
kinds of GC are valid implementations for D.

The trouble is that copying and heap compaction naturally change the 
memory addresses to which pointers point.  So before this can work, the 
"pointers being pointers and not pointers being not pointers" needs to 
be considered further.

The isuses I can see are:

- Unions.  Obviously, if it doesn't know whether something is a pointer 
or not, it can't know whether to reassign it when the suspected referent 
is moved.

- Crocky APIs.  std.c.windows.windows defines handles as pointers. 
Maybe they are, just not into memory that an application is going to 
manipulate directly.  Are they?  If not, they could clash with pointers 
into the GC heap, and a GC pass would then screw up the handle. 
Something else that's cast into a pointer type but blatantly isn't is 
numeric resource IDs.

- Pointers into the GC heap held by foreign code.  Obviously you need a 
pointer on the D side as well.  But if the GC moves the object, while 
the D side pointer'll be updated, the foreign one surely won't.

- std.gc.addRange.  Would it be at all straightforward to know what's a 
pointer and what isn't within an added range?  (Or is this only meant 
for adding arrays of pointers?  This isn't crystal clear.)

Maybe someone has ideas....

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.
Mar 24 2004
next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
On Wed, 24 Mar 2004 09:16:20 +0000, Stewart Gordon
<smjg_1998 yahoo.com> wrote:

http://www.digitalmars.com/d/garbage.html


slower ones. Generational, copying collectors eliminate much of the 
inefficiency of early mark and sweep algorithms.

reduce the number of pages actively referenced by a program, which means 
that memory accesses are more likely to be cache hits and less swapping."

The fact that this is given in the context of D suggests that these 
kinds of GC are valid implementations for D.

The trouble is that copying and heap compaction naturally change the 
memory addresses to which pointers point.  So before this can work, the 
"pointers being pointers and not pointers being not pointers" needs to 
be considered further.
I worry about this, too, though I suspect most any GC algorithm can be adapted to work with "ambiguous pointers". The price is GC complexity and probably some performance hit. I'm not a GC expert though so take my answers with a huge grain of salt.
The isuses I can see are:

- Unions.  Obviously, if it doesn't know whether something is a pointer 
or not, it can't know whether to reassign it when the suspected referent 
is moved.
There are "mostly-copying" collectors which, I think, know about data blocks that can be moved and data blocks that can't. That means the GC knows about the union pieces of a data structure as well as the "normal" pointer pieces. Basically it makes the GC and runtime type info requirements more complex but since unions probably don't show up all that much it still means the collector copies and compacts most of the time.
- Crocky APIs.  std.c.windows.windows defines handles as pointers. 
Maybe they are, just not into memory that an application is going to 
manipulate directly.  Are they?  If not, they could clash with pointers 
into the GC heap, and a GC pass would then screw up the handle. 
Something else that's cast into a pointer type but blatantly isn't is 
numeric resource IDs.
I assume the GC can tell the difference between pointers into the GC heap and pointers elsewhere.
- Pointers into the GC heap held by foreign code.  Obviously you need a 
pointer on the D side as well.  But if the GC moves the object, while 
the D side pointer'll be updated, the foreign one surely won't.
I guess the GC would need an API to mark a data block as being unmovable.
- std.gc.addRange.  Would it be at all straightforward to know what's a 
pointer and what isn't within an added range?  (Or is this only meant 
for adding arrays of pointers?  This isn't crystal clear.)
I assume that addRange should work with arbitrary data structures and not just arrays of pointers. In any case the addRange'd data would be unmovable and the things pointed to by the data would be unmovable.
Maybe someone has ideas....

Stewart.
Mar 24 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Ben Hinkle wrote:
 On Wed, 24 Mar 2004 09:16:20 +0000, Stewart Gordon 
 <smjg_1998 yahoo.com> wrote:
<snip>
 - Crocky APIs.  std.c.windows.windows defines handles as pointers.
  Maybe they are, just not into memory that an application is going
 to manipulate directly.  Are they?  If not, they could clash with
 pointers into the GC heap, and a GC pass would then screw up the
 handle. Something else that's cast into a pointer type but
 blatantly isn't is numeric resource IDs.
I assume the GC can tell the difference between pointers into the GC heap and pointers elsewhere.
What has this to do with pointers that aren't pointers at all?
 - Pointers into the GC heap held by foreign code.  Obviously you
 need a pointer on the D side as well.  But if the GC moves the
 object, while the D side pointer'll be updated, the foreign one
 surely won't.
I guess the GC would need an API to mark a data block as being unmovable.
<snip> Guess you're right. makeImmovable and makeMovable? Or unmakeImmovable? Maybe setImmovable and clearImmovable? And should the set and clear calls have to be numerically balanced in order to reinstate movability? Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Mar 25 2004
prev sibling next sibling parent reply Stephan Wienczny <wienczny web.de> writes:
Do you know a copying gc algorithm we could use?
What should be used incremental or not?

Stephan
Mar 24 2004
parent Cloud9 Virtual <cloud9virtual yahoo.com> writes:
Stephan Wienczny wrote:
 
 Do you know a copying gc algorithm we could use?
 What should be used incremental or not?
 
 Stephan
For a definitive treatment of all publicly available garbage collecting algorithms get this: http://www.amazon.com/exec/obidos/tg/detail/-/0471941484/qid=1080203603/sr=1-9/ref=sr_1_9/102-8193337-4992900?v=glance&s=books There are a couple of proprietary/patented strategies out there that are not in the book; I can provide pointers if anyone wants them. BTW there is nothing wrong with the stock mark&sweep algorithm if you don't care about real-time collection.
Mar 25 2004
prev sibling parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c3rjl6$1nds$1 digitaldaemon.com>, Stewart Gordon says...
http://www.digitalmars.com/d/garbage.html


slower ones. Generational, copying collectors eliminate much of the 
inefficiency of early mark and sweep algorithms.

reduce the number of pages actively referenced by a program, which means 
that memory accesses are more likely to be cache hits and less swapping."

The fact that this is given in the context of D suggests that these 
kinds of GC are valid implementations for D.

The trouble is that copying and heap compaction naturally change the 
memory addresses to which pointers point.  So before this can work, the 
"pointers being pointers and not pointers being not pointers" needs to 
be considered further.

The isuses I can see are:

- Unions.  Obviously, if it doesn't know whether something is a pointer 
or not, it can't know whether to reassign it when the suspected referent 
is moved.
Two potential solutions: If D does not specify the exact layout of a union type, then D could track which field of the union you're using right now. This is often done by the programmer, but D could do it internally. It could be an additional debug-mode-only assertion() like array indices. It would die if you tried to read read a different union field than you recently stored to. More likely: in the few cases that you have unions, just pin down the disputed memory and move everything else. Areas of memory allocated via malloc() and added to the GC via addRange() would also be included in the "immobile" category. This is harder because it requires a "mostly copyin" collector, but I don't think it's fatal. Some of the simplicity of a moving collector is lost.
- Crocky APIs.  std.c.windows.windows defines handles as pointers. 
Maybe they are, just not into memory that an application is going to 
manipulate directly.  Are they?  If not, they could clash with pointers 
into the GC heap, and a GC pass would then screw up the handle. 
Something else that's cast into a pointer type but blatantly isn't is 
numeric resource IDs.
If you work with these you could allocate them via "malloc()" or similar, and they would act like a giant union as above.
- Pointers into the GC heap held by foreign code.  Obviously you need a 
pointer on the D side as well.  But if the GC moves the object, while 
the D side pointer'll be updated, the foreign one surely won't.
Umm... pin them down by putting them in a union as above.... did I just invent a new kinda "locked memory" pointer?
- std.gc.addRange.  Would it be at all straightforward to know what's a 
pointer and what isn't within an added range?  (Or is this only meant 
for adding arrays of pointers?  This isn't crystal clear.)

Maybe someone has ideas....

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.
I think the thing is to ask yourself, WHY is a moving collector better than a non-moving collector? To get the whole story, you can read one of the "gc surveys". A famous one: (ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps). But the gist as I remember it, is two fold: A. Structures like trees and linked lists are iterated "in order", both in the program and in the GC. This means that consecutive list nodes and the data they point to (if any) get moved into the new locations IN ORDER. This is great for cache efficiency because memory access is almost array-like if you do it sequentially. All the nodes of a list get moved into consequetive spots. --> Essentially the garbage collector becomes a heap data structure optimizer. B. Solves most internal fragmentation problems. Non-moving collectors sometimes end up in states where most of the free memory is divided into small pieces that are small. If this happens, the memory can be exhausted. There is plenty of space, but not enough in any one place. Copying collectors are normally immune to this effect, which can be a boon for real-time systems that could otherwise die from this effect "one in a million times". So.... if you can get a 90 % copying collector (ie "conservative" compaction), it might be good enough. A few unions stuck here and there will not be fatal to this plan, as long as the "fully understood" D objects are movable. Kevin
May 04 2005
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Kevin Bealer wrote:
 In article <c3rjl6$1nds$1 digitaldaemon.com>, Stewart Gordon says...
<snip>
- Unions.  Obviously, if it doesn't know whether something is a pointer 
or not, it can't know whether to reassign it when the suspected referent 
is moved.
Two potential solutions: If D does not specify the exact layout of a union type, then D could track which field of the union you're using right now.
Structs and unions are designed to be compatible with the C ABI. As such, there's no room for that extra piece of information.
 This is often done by the programmer, but D could do it internally.  It could
be an
 additional debug-mode-only assertion() like array indices.  It would die if you
 tried to read read a different union field than you recently stored to.
 
 More likely: in the few cases that you have unions, just pin down the disputed
 memory and move everything else.  Areas of memory allocated via malloc() and
 added to the GC via addRange() would also be included in the "immobile"
 category.
The difficulty is that the GC would need to know in advance whether a given piece of heap memory is pointed to from a union. The essence of a copying GC is that it collects garbage by copying everything to newly allocated memory and then deallocating the old memory. And so if it's copied a certain block and then it stumbles on a pointer to it in a union, it'll have to change back all the pointers it's changed.
 This is harder because it requires a "mostly copyin" collector, but I don't
 think it's fatal.  Some of the simplicity of a moving collector is lost.
 
- Crocky APIs.  std.c.windows.windows defines handles as pointers. 
Maybe they are, just not into memory that an application is going to 
manipulate directly.  Are they?  If not, they could clash with pointers 
into the GC heap, and a GC pass would then screw up the handle. 
Something else that's cast into a pointer type but blatantly isn't is 
numeric resource IDs.
If you work with these you could allocate them via "malloc()" or similar, and they would act like a giant union as above.
You mean I should store all my resource IDs in malloc'd memory? I'm not sure that that's always possible, nor how it would help.
- Pointers into the GC heap held by foreign code.  Obviously you need a 
pointer on the D side as well.  But if the GC moves the object, while 
the D side pointer'll be updated, the foreign one surely won't.
Umm... pin them down by putting them in a union as above.... did I just invent a new kinda "locked memory" pointer?
As I just said, having unions as the way of pinning is going to cause difficulties. Better to add a specific pinning feature to D. <snip>
 I think the thing is to ask yourself, WHY is a moving collector better than a
 non-moving collector?  To get the whole story, you can read one of the "gc
 surveys".  A famous one: (ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps).
 
 But the gist as I remember it, is two fold:
 
 A. Structures like trees and linked lists are iterated "in order", both in the
 program and in the GC.  This means that consecutive list nodes and the data
they
 point to (if any) get moved into the new locations IN ORDER.  This is great for
 cache efficiency because memory access is almost array-like if you do it
 sequentially.  All the nodes of a list get moved into consequetive spots.
Assuming that there aren't other ways into the structures that may have been picked up first.
 --> Essentially the garbage collector becomes a heap data structure optimizer.
 
 B. Solves most internal fragmentation problems.  Non-moving collectors
sometimes
 end up in states where most of the free memory is divided into small pieces
that
 are small.  If this happens, the memory can be exhausted.  There is plenty of
 space, but not enough in any one place.  Copying collectors are normally immune
 to this effect, which can be a boon for real-time systems that could otherwise
 die from this effect "one in a million times".
<snip> I guess that just about every heap allocation system would have a cluster size. So if your allocated objects aren't too big, then this won't be too much of a problem. OTOH if you are working with large arrays, and these are competing for space with smaller things, then there is indeed this issue. And as long as you don't use up so much memory that the copying GC has no room to manoeuvre.... Stewart. -- My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
May 06 2005