digitalmars.D - Slices and GC

"BLM" <blm768 gmail.com> writes:
Recently I've been working on some projects that involve parsing 
binary files. I've mainly been using std.file.read() to get the 
whole file as a huge array and then extracting slices. I had 
initially assumed that the GC would free any chunks of the array 
that didn't end up being referenced by these slices, but after 
reading some more, it looks like the whole array is kept in 
memory even if only a few elements are actually referenced. Is 
this actually the case? If so, might the language be extended to 
handle this situation?
Apr 05 2012
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thursday, 5 April 2012 at 15:00:04 UTC, BLM wrote:
 Recently I've been working on some projects that involve 
 parsing binary files. I've mainly been using std.file.read() to 
 get the whole file as a huge array and then extracting slices. 
 I had initially assumed that the GC would free any chunks of 
 the array that didn't end up being referenced by these slices, 
 but after reading some more, it looks like the whole array is 
 kept in memory even if only a few elements are actually 
 referenced. Is this actually the case? If so, might the 
 language be extended to handle this situation?

The GC can't really know which parts of the array you're using. For example, your only reference to the array might be a pointer, and you might be traversing the array in either direction, only keeping count of the remaining bytes until the array boundary.

Consider .dup-ing the slices you're going to need, or using std.mmfile to map the file into memory - in that case, the OS won't load the unnecessary parts of the file into memory in the first place.
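
A minimal sketch of both suggestions, not a definitive implementation. The helper names extractCopy and extractMapped and the scratch file name are made up for illustration; std.file.read, .dup and std.mmfile.MmFile are the real library pieces.

```d
import std.file : read, remove, write;
import std.mmfile : MmFile;

/// Copies only the bytes we need, so the copy does not pin the big buffer.
ubyte[] extractCopy(const(ubyte)[] whole, size_t start, size_t end)
{
    // .dup allocates a new, independent array holding just this range.
    return whole[start .. end].dup;
}

/// Reads a range through a memory mapping; the OS pages data in on demand.
ubyte[] extractMapped(string path, size_t start, size_t end)
{
    auto mm = new MmFile(path);   // read-only mapping of the file
    scope (exit) destroy(mm);     // unmap promptly instead of waiting for the GC
    auto view = cast(const(ubyte)[]) mm[start .. end];
    return view.dup;              // copy out before the mapping goes away
}

void main()
{
    enum path = "slices_gc_demo.bin"; // hypothetical scratch file
    ubyte[] data = [10, 20, 30, 40, 50, 60];
    write(path, data);
    scope (exit) remove(path);

    auto whole = cast(ubyte[]) read(path);
    auto header = extractCopy(whole, 1, 3);
    whole = null; // drop the last reference; the GC may now collect the buffer

    assert(header == [20, 30]);
    assert(extractMapped(path, 4, 6) == [50, 60]);
}
```

The copy costs an allocation per extracted range, which is the overhead trade-off: small copies in exchange for letting the large buffer go.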
Apr 05 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05.04.2012 20:35, BLM wrote:
 On Thursday, 5 April 2012 at 15:30:45 UTC, Vladimir Panteleev wrote:

 The GC can't really know which parts of the array you're using. For
 example, your only reference to the array might be a pointer, and you
 might be traversing the array in either direction, only keeping count
 of the remaining bytes until the array boundary.

 Consider .dup-ing the slices you're going to need, or using std.mmfile
 to map the file into memory - in that case, the OS won't load the
 unnecessary parts of the file into memory in the first place.

 I had considered using .dup, but I wanted to minimize overhead. I should
 probably look into std.mmfile or pull the data out in smaller chunks that
 the GC can handle individually.

Another idea is to copy the interesting parts of the original chunk out to a separate storage array. This array will contain your sliced-out data, just packed more tightly. If you have an upper bound on the percentage of useful bytes, you can size the array up front and get away without extra allocations. The tricky part is reallocating this storage array: slices that point into the old block are left referring to stale data, and they keep the GC from deallocating it. A workaround is to use pure index-based "slices" that work on this block only.

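
One way the index-based "slices" could look, as a rough sketch only. Span and Packed are hypothetical names invented for this example, and the 1024-byte buffer just stands in for whatever std.file.read returned.

```d
/// An index-based "slice": an offset and a length into Packed.storage.
/// It holds no pointer, so it stays valid when storage is reallocated.
struct Span
{
    size_t offset;
    size_t length;
}

/// Hypothetical packed store for the interesting parts of a larger buffer.
struct Packed
{
    ubyte[] storage;

    /// Copies `part` to the end of storage and returns its Span.
    Span add(const(ubyte)[] part)
    {
        auto span = Span(storage.length, part.length);
        storage ~= part; // may reallocate; existing Spans are unaffected
        return span;
    }

    /// Resolves a Span to a real slice; don't hold it across calls to add().
    inout(ubyte)[] opIndex(Span s) inout
    {
        return storage[s.offset .. s.offset + s.length];
    }
}

void main()
{
    auto file = new ubyte[](1024); // stands in for the whole file
    foreach (i, ref x; file)
        x = cast(ubyte) i;         // file[i] == i % 256

    Packed packed;
    packed.storage.reserve(64);    // known upper bound on useful bytes
    auto first = packed.add(file[100 .. 104]);
    auto second = packed.add(file[900 .. 902]);
    file = null;                   // the big buffer is no longer referenced

    assert(packed[first] == [100, 101, 102, 103]);
    assert(packed[second] == [132, 133]);
    assert(packed.storage.length == 6);
}
```

Because a Span is resolved against the current storage on every access, growing the array never leaves a reference to the old block behind.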
 If the GC can distinguish between pointers and slices, it should
 theoretically be able to prune an array that is only referenced by
 slices, but I'm not sure how well that would fit into the current GC
 system.

-- Dmitry Olshansky
Apr 05 2012
"BLM" <blm768 gmail.com> writes:
On Thursday, 5 April 2012 at 15:30:45 UTC, Vladimir Panteleev 
wrote:

 The GC can't really know which parts of the array you're using. 
 For example, your only reference to the array might be a 
 pointer, and you might be traversing the array in either 
 direction, only keeping count of the remaining bytes until the 
 array boundary.

 Consider .dup-ing the slices you're going to need, or using 
 std.mmfile to map the file into memory - in that case, the OS 
 won't load the unnecessary parts of the file into memory in the 
 first place.

I had considered using .dup, but I wanted to minimize overhead. I should probably look into std.mmfile or pull the data out in smaller chunks that the GC can handle individually. If the GC can distinguish between pointers and slices, it should theoretically be able to prune an array that is only referenced by slices, but I'm not sure how well that would fit into the current GC system.
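
Pulling the data out in smaller chunks could look roughly like this. It is a sketch under assumptions: readChunk and the scratch file name are invented here, while File.seek and File.rawRead are the std.stdio calls doing the work.

```d
import std.file : remove, write;
import std.stdio : File;

/// Reads up to `length` bytes starting at `offset` into its own small
/// GC allocation, so each chunk can be collected independently.
ubyte[] readChunk(ref File f, long offset, size_t length)
{
    f.seek(offset);
    auto buf = new ubyte[](length);
    return f.rawRead(buf); // returns the part of buf that was actually filled
}

void main()
{
    enum path = "chunk_demo.bin"; // hypothetical scratch file
    ubyte[] data = [0, 1, 2, 3, 4, 5, 6, 7];
    write(path, data);
    scope (exit) remove(path);

    auto f = File(path, "rb");
    scope (exit) f.close();

    assert(readChunk(f, 2, 3) == [2, 3, 4]);
    assert(readChunk(f, 6, 4) == [6, 7]); // short read at end of file
}
```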
Apr 05 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, April 05, 2012 17:00:03 BLM wrote:
 Recently I've been working on some projects that involve parsing
 binary files. I've mainly been using std.file.read() to get the
 whole file as a huge array and then extracting slices. I had
 initially assumed that the GC would free any chunks of the array
 that didn't end up being referenced by these slices, but after
 reading some more, it looks like the whole array is kept in
 memory even if only a few elements are actually referenced. Is
 this actually the case? If so, might the language be extended to
 handle this situation?

http://dlang.org/d-array-article.html

- Jonathan M Davis
Apr 05 2012