www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Proposal: alternative GC idea that negates the need for T[new]

I thought of a way the GC could be implemented that negates the need for the 
T[new] syntax.

This would also prevent the ~= operator from corrupting data after a slice, 
which I consider to be an error-prone design.

The problem as I see it is defined as this:

The current GC allocates blocks of fixed sizes.  The block sizes come in 
powers of 2 (minimum 16 bytes).  If you request a size that is less than one 
of these sizes, you get the whole block anyways.  If every append operation 
reallocated a new block, then code like:

char[] str;
for(int i = 0; i < 10000; i++)
  str ~= 'a';

would be very inefficient.  Each append would allocate a new block!

So the current solution is to check if str is at the beginning of a block. 
If it is, and the current block can hold the new length, the array just 
extends into the block.

However, this now becomes a problem when you use that wonderful slicing 
operator:

char[] str = "abc".dup;
char[] slice = str[0..1];
slice ~= "aa";  // slice is now aaa, str is now aaa

I see this as a very difficult problem to find.  It is so rare that you may 
never see it happen, but when it does, you will be scratching your head 
looking for a reason.

So my solution is to keep track of how much data is used in a block.  This 
is easily remembered as the amount of data requested.  So if you request a 
buffer of 10 bytes, a block of 16 bytes is allocated, with a stored 
requested length of 10.

When an append operation is done to an array, the runtime checks to see if 
the end of the array is at the end of the block.  If it is, and the new 
length requested fits within the block, the 'request length' is incremented 
to the new size, and the buffer pointer does not change.  This works for 
slices that end on the last valid byte as well.  This will not corrupt the 
data, because if you then append to the original array, that array no longer 
ends on the last valid byte, and so a new block is allocated.

A run through:

// This line allocates a block of 16 bytes, sets the valid length to 3
char[] str = "abc".dup

char[] slice = str[0..1];

// since slice is only 1 byte, it does not end on the last valid
// byte of the block, and so slice now allocates another block.
// str is intact.
slice ~= "aa"; // "aaa"

slice = str[1..$]; // "bc"

// since slice now ends on the last valid byte, the valid length is
// incremented to 8 to accomodate the new data.  The pointer
// of slice is not changed. str is still intact.
slice ~= "hello"; // "bchello"

// since str's length is 3, it does not end on the last valid byte
// of the block because the valid length was set to 8.  so str is
// assigned a new block.  slice is still intact.
str ~= "def"; // "abcdef"

This also allows the array building code I stated at the beginning to be as 
efficient as the current implementation.

I think this can be implemented with minimal storage overhead.  For blocks 
of 16 bytes, the length can be stored as 4-bit values (3% overhead).  For 
blocks of 32-256 bytes, the lengths are stored as bytes (max 3% overhead, 
min .4%), for blocks of 512-2^16 bytes, can be stored as a short, and the 
overhead is very negligible at that point.

The runtime overhead is also just a constant, so it does not affect the run 
time complexity of an allocation call.

One final thing, the ~ operator would no longer be required to make a copy 
of the first argument.  And you no longer have to dup if you are unsure of 
the origin of a slice.

The only drawback I see is that you can't 'shrink' a buffer.  However, I 
think if you are re-using a buffer, there are ways to write classes to 
handle that functionality.

-Steve 
Dec 06 2007