
digitalmars.D.learn - How to free memory allocated via double[][] using dmd-2.0.12?

Markus Dittrich <markusle gmail.com> writes:
Hi,

For a data processing application I need to read a large number
of data sets from disk. Due to their size, they have to be read and 
processed sequentially, i.e. in pseudocode

int main()
{
    while (some condition)
    {
        double[][] myLargeDataset = read_data();
        process_data(myLargeDataset);
        // free all memory here, otherwise the next cycle will
        // run out of memory
    }

    return 0;
}

Now, the "problem" is the fact that each single data set saturates
system memory, and hence I need to make sure that all memory
is freed after each process_data step is complete. Unfortunately,
using dmd-2.012 I have not been able to achieve this. Whatever
I do (including nothing, i.e., letting the GC do its job), the resulting
binary keeps accumulating memory and crashes shortly after.
I've tried deleting the array, setting the array lengths to 0, and manually
forcing the GC to collect, to no avail. Hence, is there something I am
doing terribly wrong, or is this a bug in dmd?

Thanks much,
Markus
Apr 08 2008
Regan Heath <regan netmail.co.nz> writes:
Markus Dittrich wrote:
 [...]
 I've tried deleting the array, setting the array lengths to 0, and manually
 forcing the GC to collect, to no avail. Hence, is there something I am 
 doing terribly wrong or is this a bug in dmd?

Did you try just setting the array reference to null? This should make
the contents of the array unreachable, and therefore it should be
collected when you next allocate (and run short on memory), i.e.:

int main()
{
    double[][] myLargeDataset;
    while (some condition)
    {
        myLargeDataset = read_data();
        process_data(myLargeDataset);
        myLargeDataset = null;
    }

    return 0;
}

Regan
Apr 08 2008
Markus Dittrich <markusle gmail.com> writes:
Regan Heath Wrote:

 Did you try just setting the array reference to null.  This should make 
 the contents of the array unreachable and therefore it should be 
 collected when you next allocate (and run short on memory).
 
 i.e.
 
 int main()
 {
      double[][] myLargeDataset;
      while (some condition)
      {
 	 myLargeDataset = read_data();
           process_data(myLargeDataset);
           myLargeDataset = null;
       }
 
     return 0;
 }
 
 Regan

Thanks for the hint. I just tried this again to make sure, and also tried
plopping an std.gc.fullCollect() right after to force the GC to collect.
In both cases I can watch memory consumption grow continuously, with the
system running out of memory eventually. Maybe it's a GC bug?

Markus
Apr 08 2008
BCS <BCS pathlink.com> writes:
Markus Dittrich wrote:
 [...]

One "hack" would be to have read_data() allocate a big buffer and then
slice the parts of the double[][] out of it. This has the advantage that
you can just keep track of the buffer, and on the next pass reuse it in
its entirety; you never have to delete it.

double[][] read_data()
{
    static byte[] buff;
    if (buff.ptr is null)
        buff = new byte[huge];
    byte[] left = buff;

    T[] Alloca(T)(int i)
    {
        T[] ret = (cast(T*)left.ptr)[0 .. i];
        left = left[i * T.sizeof .. $];
        return ret;
    }

    // code uses Alloca!(double) and Alloca!(double[]) for
    // allocations. Don't use .length or ~=
}
Apr 08 2008
Markus Dittrich <markusle gmail.com> writes:
BCS Wrote:

 [...]

One "hack" would be to have read_data() allocate a big buffer and then slice the parts of the double[][] out of it. [...]

Thanks much for your response! I could certainly roll my own buffer
management. Unfortunately, the "real" app is more complicated than the
"proof of concept" code I posted, and doing so would require a bit more
work. After all, the main reason for using D for this type of thing was
that I didn't want to deal with manual memory management ;) From the
posts I gather that I am not doing anything fundamentally wrong, so I'll
probably file a bug for this later.
Apr 08 2008
BCS <BCS pathlink.com> writes:
Markus Dittrich wrote:
 Unfortunately, the "real" app is more complicated 
 than the "proof of concept" code I posted and doing so would require a bit
 more work.
 

life would be so much nicer if real life didn't get in the way :b
Apr 08 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Markus Dittrich wrote:
 [...]

Markus, you do not show us what either read_data or process_data do.
It is possible that one of those is somehow holding on to references to
the data. This would prevent the GC from collecting the memory.

Another problem is if you allocate the memory initially as void[]: then
the GC will scan it for pointers, and in a big float buffer you'll get a
lot of false hits. To prevent that, allocate the buffer initially as
byte[] (or double[] -- just not void).

Anyway, if you want a speedy fix, you'll need to distill this down into
something that is actually reproducible by Walter.

--bb
Apr 08 2008
Markus Dittrich <markusle gmail.com> writes:
Bill Baxter Wrote:

 
 Markus, you do not show us what either read_data or process_data do. [...]

Hi Bill,

You're of course absolutely correct! Below is a proof-of-concept code
that still exhibits the issue I was describing. The parse code needs to
handle row-centric ASCII data with a variable number of columns. The
file "data_random.dat" contains a single row of random integers. After a
few iterations the code runs out of memory on my machine, and no
deleting seems to help.

import std.stream;
import std.stdio;
import std.string;
import std.contracts;
import std.gc;

public double[][] parse(BufferedFile inputFile)
{
    double[][] array;
    foreach (char[] line; inputFile)
    {
        double[] temp;
        foreach (string item; std.string.split(assumeUnique(line)))
        {
            temp ~= std.string.atof(item);
        }
        array ~= temp;
    }

    /* rewind for next round */
    inputFile.seekSet(0);

    return array;
}

int main()
{
    BufferedFile inputFile = new BufferedFile("data_random.dat");
    while (1)
    {
        double[][] foo = parse(inputFile);
    }

    return 1;
}

Thanks much,
Markus
Apr 08 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Markus Dittrich wrote:
 [...]

 Hi Bill, You're of course absolutely correct! Below is a proof of concept code that still exhibits the issue I was describing. [...]

Ok. You should add that to the bug report. However, that test program
works fine for me on Windows. I tried it with DMD/Phobos 1.028,
DMD/Tango/Tangobos 1.028, and DMD/Phobos 2.012.

--bb
Apr 08 2008
BCS <BCS pathlink.com> writes:
Markus Dittrich wrote:
 Bill Baxter Wrote:
 
 
Markus, you do not show us what either read_data or process_data do. [...]

Hi Bill, You're of course absolutely correct! Below is a proof of concept code that still exhibits the issue I was describing. The parse code needs to handle row centric ascii data with a variable number of columns. The file "data_random.dat" contains a single row of random integers. After a few iterations the code runs out of memory on my machine and no deleting seems to help.

Does it change things if you drop the ~= in favor of extending the array? What about if you preallocate the array with the correct size to begin with? (I know this might not be doable in the general case.)
Apr 08 2008