Hi,
For a data processing application I need to read a large number
of data sets from disk. Due to their size, they have to be read and
processed sequentially, i.e., in pseudocode:
int main()
{
    while (some condition)
    {
        double[][] myLargeDataset = read_data();
        process_data(myLargeDataset);
        // free all memory here otherwise next cycle will
        // run out of memory
    }
    return 0;
}
Now, the "problem" is the fact that each single data-set saturates
system memory and hence I need to make sure that all memory
is freed after each process_data step is complete. Unfortunately,
using dmd-2.012 I have not been able to achieve this. Whatever
I do (including nothing, i.e., letting the GC do its job), the resulting
binary keeps accumulating memory and crashes shortly afterwards.
I've tried deleting the array, setting the array lengths to 0, and manually
forcing the GC to collect, to no avail. Hence, is there something I am
doing terribly wrong or is this a bug in dmd?
Thanks much,
Markus
Did you try just setting the array reference to null? This should make
the contents of the array unreachable and therefore it should be
collected when you next allocate (and run short on memory).
i.e.
int main()
{
    double[][] myLargeDataset;
    while (some condition)
    {
        myLargeDataset = read_data();
        process_data(myLargeDataset);
        myLargeDataset = null;
    }
    return 0;
}
Regan
Apr 08 2008
Markus Dittrich <markusle gmail.com> writes:
Thanks for the hint. I tried this again just to make sure, and also
tried adding a std.gc.fullCollect() call right after it to force the GC to
collect. In both cases I can watch memory consumption grow continuously
until the system eventually runs out of memory. Maybe it's a GC bug?
Markus
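For readers on a current compiler, the attempt described above can be sketched roughly as follows. This is only an illustration under assumptions: it uses core.memory, which replaced the std.gc module mentioned in this thread, and readData and processData are hypothetical stand-ins for the read_data and process_data functions that were never posted. The collector is conservative, so GC.collect() is a request rather than a guarantee that the block is released.

```d
import core.memory : GC;

// Hypothetical stand-in for the thread's read_data(): builds a jagged array.
double[][] readData(size_t rows, size_t cols)
{
    auto data = new double[][](rows, cols);
    foreach (r, row; data)
        foreach (c, ref x; row)
            x = r + c;
    return data;
}

// Hypothetical stand-in for process_data(): sums every element.
double processData(const double[][] data)
{
    double sum = 0;
    foreach (row; data)
        foreach (x; row)
            sum += x;
    return sum;
}

void main()
{
    foreach (pass; 0 .. 3)
    {
        double[][] dataset = readData(100, 100);
        processData(dataset);

        dataset = null; // drop the only reference held by this function
        GC.collect();   // request a full collection
        GC.minimize();  // ask the GC to hand unused pages back to the OS
    }
}
```

If memory still grows with a loop like this, something else is keeping the data reachable, or the collector is being misled by false pointers.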
One "hack" would be to have read_data() allocate a big buffer and then
slice the parts of the double[][] out of it. This would have the
advantage that you can just keep track of the buffer and on the next
pass reuse it in its entirety; you never have to delete it.
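double[][] read_data()
{
    // Allocated once and reused on every call.
    static byte[] buff;
    if (buff.ptr is null) buff = new byte[huge];
    // Unused remainder of the buffer for this call.
    byte[] left = buff;
    T[] Alloca(T)(int i)
    {
        // Carve i elements of T off the front of the remainder.
        T[] ret = (cast(T*)left.ptr)[0..i];
        left = left[i*T.sizeof..$];
        return ret;
    }
    /// code uses Alloca!(double) and Alloca!(double[]) for
    /// allocations. Don't use .length or ~=
}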
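A self-contained version of the same idea might look like the sketch below. The Arena struct, its alloc and reset members, and the readData signature are all hypothetical names chosen for illustration, and the buffer size is an arbitrary assumption. The backing store is allocated as ulong[] so that slices stay 8-byte aligned and the block holds no pointers as far as the GC is concerned; the row slices stored inside it only point back into the same buffer, which the arena itself keeps alive.

```d
// Hypothetical bump allocator over one reusable buffer.
struct Arena
{
    private void[] storage; // allocated once, reused on every pass
    private size_t used;    // bytes handed out in the current pass

    this(size_t bytes)
    {
        // Pointer-free element type: aligned, and not scanned by the GC.
        storage = new ulong[]((bytes + 7) / 8);
    }

    // Carve n elements of T out of the buffer without a new GC allocation.
    // The returned memory is NOT initialised; the caller must fill it.
    T[] alloc(T)(size_t n)
    {
        immutable start = (used + T.alignof - 1) / T.alignof * T.alignof;
        immutable end = start + n * T.sizeof;
        assert(end <= storage.length, "arena exhausted");
        used = end;
        return cast(T[]) storage[start .. end];
    }

    // Make the whole buffer available again for the next dataset.
    void reset() { used = 0; }
}

// Hypothetical reader: rows and cols would come from the file in practice.
double[][] readData(ref Arena arena, size_t rows, size_t cols)
{
    auto data = arena.alloc!(double[])(rows);
    foreach (r, ref row; data)
    {
        row = arena.alloc!double(cols);
        row[] = cast(double) r; // placeholder for values parsed from disk
    }
    return data;
}

void main()
{
    auto arena = Arena(1 << 20); // 1 MiB, sized once for the largest dataset
    foreach (pass; 0 .. 3)
    {
        arena.reset(); // previous dataset becomes invalid here
        auto data = readData(arena, 100, 100);
        assert(data.length == 100 && data[99][0] == 99);
    }
}
```

As with the original hack, slices handed out by the arena must not be grown with .length or ~=, and they must not be used after the next reset().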
Thanks much for your response! I could certainly roll my own buffer
management. Unfortunately, the "real" app is more complicated
than the "proof of concept" code I posted and doing so would require a bit
more work. After all, the main reason for using D for this type of thing
was the fact that I didn't want to deal with manual memory management ;)
From the posts I gather that I am not doing anything fundamentally
wrong, and I'll probably file a bug for this later.
> Unfortunately, the "real" app is more complicated
> than the "proof of concept" code I posted and doing so would require a bit
> more work.
life would be so much nicer if real life didn't get in the way :b
Apr 08 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Markus, you do not show us what either read_data or process_data do.
It is possible that one of those is somehow holding on to references to
the data. This would prevent the GC from collecting the memory.
Another problem is if you allocate the memory initially as void[] then
the GC will scan it for pointers, and in a big float buffer you'll get a
lot of false hits. To prevent that, allocate the buffer initially as
byte[] (or double[] --- just not void).
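The difference between the allocation styles can be sketched as follows. This assumes a current compiler with core.memory (not the std.gc module used elsewhere in this thread); allocNoScan is a hypothetical helper name, and the sizes are arbitrary.

```d
import core.memory : GC;

// Hypothetical helper: room for n doubles in a block that is explicitly
// marked so the collector does not scan it for pointers.
double[] allocNoScan(size_t n)
{
    auto p = cast(double*) GC.malloc(n * double.sizeof, GC.BlkAttr.NO_SCAN);
    return p[0 .. n];
}

void main()
{
    // Typed allocation: double holds no pointers, so the runtime can
    // skip this block when looking for references.
    double[] typed = new double[](1_000_000);

    // Untyped allocation: void[] may hold anything, so every word in it
    // is treated as a potential pointer.
    void[] untyped = new void[](1_000_000 * double.sizeof);

    // Explicit request for an unscanned block.
    double[] raw = allocNoScan(1_000_000);
    raw[] = 0.0; // GC.malloc does not initialise the memory
    assert(GC.query(raw.ptr).attr & GC.BlkAttr.NO_SCAN);
}
```

With floating-point data, arbitrary bit patterns in a scanned block can look like addresses of live memory, which is how the false hits described above arise.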
Anyway, if you want a speedy fix, you'll need to distill this down into
something that is actually reproducible by Walter.
--bb
Apr 08 2008
Markus Dittirich <markusle gmail.com> writes:
Hi Bill,
You're of course absolutely correct! Below is proof-of-concept code
that still exhibits the issue I was describing. The parse code needs
to handle row-centric ASCII data with a variable number of columns.
The file "data_random.dat" contains a single row of random integers.
After a few iterations the code runs out of memory on my machine
and no amount of deleting seems to help.
import std.stream;
import std.stdio;
import std.string;   // needed for std.string.split and std.string.atof
import std.contracts;
import std.gc;

public double[][] parse(BufferedFile inputFile)
{
    double[][] array;
    foreach(char[] line; inputFile)
    {
        double[] temp;
        foreach(string item; std.string.split(assumeUnique(line)))
        {
            temp ~= std.string.atof(item);
        }
        array ~= temp;
    }
    /* rewind for next round */
    inputFile.seekSet(0);
    return array;
}

int main()
{
    BufferedFile inputFile = new BufferedFile("data_random.dat");
    while(1)
    {
        double[][] foo = parse(inputFile);
    }
    return 1;
}
Thanks much,
Markus
Apr 08 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Ok. You should add that to the bug report.
However, that test program works fine for me on Windows.
I tried it with
DMD/Phobos 1.028,
DMD/Tango/Tangobos 1.028, and
DMD/Phobos 2.012.
--bb
Does it change things if you drop the ~= in favor of extending the
array? What about if you preallocate the array with the correct size to
begin with? (I know this might not be doable in the general case.)
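To make the preallocation question concrete, here is a minimal sketch written for a current compiler, using std.array.split and std.conv.to rather than the std.string.atof call in the posted code. parseRow and parseRows are hypothetical names, and the sketch assumes all lines are already in memory so the row count is known up front. Whether this changes the memory growth reported above is not something the thread establishes; it only shows the allocation pattern being asked about.

```d
import std.array : split;
import std.conv : to;

// Parse one whitespace-separated row into doubles with a single allocation.
double[] parseRow(const(char)[] line)
{
    auto items = line.split();             // column count is known after this
    auto row = new double[](items.length); // sized once, never appended to
    foreach (i, item; items)
        row[i] = item.to!double;
    return row;
}

// Parse all rows; the outer array is also sized once up front.
double[][] parseRows(const(char[])[] lines)
{
    auto rows = new double[][](lines.length);
    foreach (i, line; lines)
        rows[i] = parseRow(line);
    return rows;
}

void main()
{
    auto rows = parseRows(["1 2 3", "4.5 6"]);
    assert(rows == [[1.0, 2.0, 3.0], [4.5, 6.0]]);
}
```

Repeated ~= can reallocate a growing array several times and leave the abandoned copies for the collector, whereas sizing each array once avoids those intermediate blocks entirely.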