digitalmars.D - Memory allocation in D

Marius Muja (53/53) Nov 28 2007 Hi,

BCS (6/22) Nov 28 2007 If you are using windows and the D new version allocates 3GB of ram then...
Sean Kelly (13/40) Nov 29 2007 With D as it is now, this will call GC.malloc(2049), which will allocate...

Sean Kelly (5/5) Nov 29 2007 This thread inspired me to make a number of changes to the GC and such

Craig Black (3/8) Nov 29 2007 Cool! Maybe you could show some benchmarks when you get done?

Sean Kelly (21/29) Nov 30 2007 Unfortunately, I think I'm not going to commit these changes. I had

Robert Fraser (10/22) Nov 30 2007 I'm sure you've probably read it, but I'll post it anyway:
Marius Muja (12/21) Nov 30 2007 That was my problem also. I was using array sizes that were powers of

Sean Kelly (8/30) Nov 30 2007 Weird. D's GC is basically a "big bag of pages" allocator, where groups...

Marius Muja (4/10) Nov 30 2007 That seems to be the case. And because in my test program I was always

Jason House (5/41) Nov 30 2007 If object sizes are a power of 2, is that because of special padding of ...

Sean Kelly (10/48) Nov 30 2007 Only arrays get the +1 byte added to their size during allocations. If

Sean Kelly (8/22) Nov 30 2007 So I went and looked up the clause and I'm not clear whether it requires...

James Dennett (4/24) Dec 19 2007 C and C++ only require that the address be valid for pointer

Sean Kelly (7/28) Dec 19 2007 Thanks. This was what I inferred, but I had heard (or misheard) that it...

Oskar Linde (12/20) Dec 01 2007 The +1 allocation for arrays is AFAIK there to make sure that it is

Sean Kelly (5/14) Dec 01 2007 Okay that makes sense. The Boehm GC tied the +1 byte to whether

=?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= (22/42) Nov 29 2007 -----BEGIN PGP SIGNED MESSAGE-----

=?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= (18/19) Nov 29 2007 -----BEGIN PGP SIGNED MESSAGE-----

Marius Muja <mariusm cs.ubc.ca> writes:

Hi,

I'm working on a program that is computationally and memory intensive. I 
have noticed that if I allocate the memory using D's new operator I can 
allocate only about half as much memory compared to C's malloc (and also 
the memory allocation is much slower). Does anybody know why D's memory 
allocator uses twice as much memory compared to malloc?

I used the following test program to test this:

version (Tango){
import tango.core.Memory;
import tango.stdc.stdlib;
import tango.io.Stdout;
}
else
{
import std.stdio;
import std.c.stdlib;
}

const int ALLOC_SIZE = 1024*1024;

void test_d_alloc()
{
	float[] a = new float[ALLOC_SIZE];
}

void test_malloc()
{
	float* ptr = cast(float*)malloc(ALLOC_SIZE*float.sizeof);
	if (ptr is null) throw new Exception("Out of memory");
}

void main()
{
	version (Tango) {
		GC.disable();
	} else {
		std.gc.disable();
	}
	int i=0;
	while(true) {
		version (Tango) {
			Stdout.formatln("{}",++i);
		} else {
			writefln(++i);
		}
// 		test_d_alloc();
		test_malloc();
	}
}


When the above program uses the test_d_alloc() function it executes the 
loop about half as many times compared to when it's using the 
test_malloc() function (until it terminates with an out of memory 
error). Also when the allocation is done in smaller increments 
(ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after 
allocating the maximum amount of memory (3G) instead of terminating with 
a nice OutOfMemory error.

Marius

Nov 28 2007

BCS <ao pathlink.com> writes:

Reply to Marius,

 Hi,
 
 I'm working on a program that is computationally and memory intensive.
 I have noticed that if I allocate the memory using D's new operator I
 can allocate only about half as much memory compared to C's malloc
 (and also the memory allocation is much slower). Does anybody know why
 D's memory allocator uses twice as much memory compared to malloc?
 

[...]
 When the above program uses the test_d_alloc() function it executes
 the loop about half as many times compared to when it's using the
 test_malloc() function (until it terminates with an out of memory
 error). Also when the allocation is done in smaller increments
 (ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after
 allocating the maximum amount of memory (3G) instead of terminating
 with a nice OutOfMemory error.
 

If you are using windows and the D new version allocates 3GB of ram then 
it's using everything it can get. No matter what, you have 75% of the address 
space, I would suspect that the malloc isn't actual allocating as much as 
it should. Or am I reading something wrong.

Nov 28 2007

Sean Kelly <sean f4.ca> writes:

Marius Muja wrote:
 Hi,
 
 I'm working on a program that is computationally and memory intensive. I 
 have noticed that if I allocate the memory using D's new operator I can 
 allocate only about half as much memory compared to C's malloc (and also 
 the memory allocation is much slower). Does anybody know why D's memory 
 allocator uses twice as much memory compared to malloc?
 
 I used the following test program to test this:
 
 version (Tango){
 import tango.core.Memory;
 import tango.stdc.stdlib;
 import tango.io.Stdout;
 }
 else
 {
 import std.stdio;
 import std.c.stdlib;
 }
 
 const int ALLOC_SIZE = 1024*1024;
 
 void test_d_alloc()
 {
     float[] a = new float[ALLOC_SIZE];
 }

With D as it is now, this will call GC.malloc(2049), which will allocate 
one page of memory per call.  The D runtime adds 1 to every allocation 
with "new", according to Walter, this was to prevent certain confusing 
errors.  It will allocate a page per call because the D GC allocates 
from pools devoted to fixed-size blocks which grow in powers of two up 
to 4096 bytes (one page on most systems).  Allocations beyond 4096 bytes 
are "big" allocations, and a multiple of 4096 bytes will be allocated to 
fit the requested size.  For these allocations, when the memory is 
released I believe the pages go back into a free pool.  I'm not sure 
offhand if empty pages used for smaller blocks do or not.

You might want to try calling GC.malloc() instead and compare results.


Sean

Nov 29 2007

Sean Kelly <sean f4.ca> writes:

This thread inspired me to make a number of changes to the GC and such 
in Tango.  They should improve efficiency in terms of eliminating wasted 
space and avoiding collections in some situations.  They'll probably 
take a few days, so stay tuned.


Sean

Nov 29 2007

"Craig Black" <cblack ara.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message 
news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such in 
 Tango.  They should improve efficiency in terms of eliminating wasted 
 space and avoiding collections in some situations.  They'll probably take 
 a few days, so stay tuned.


 Sean

Cool!  Maybe you could show some benchmarks when you get done?

Nov 29 2007

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message 
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such in 
 Tango.  They should improve efficiency in terms of eliminating wasted 
 space and avoiding collections in some situations.  They'll probably take 
 a few days, so stay tuned.

 
 Cool!  Maybe you could show some benchmarks when you get done? 

Unfortunately, I think I'm not going to commit these changes.  I had 
some clever ideas for how to replace the "allocate an extra byte" 
functionality with an approach that would have resulted in less wasted 
space, but it's become more complicated than the added efficiency is worth.

Basically, I was hoping to keep an extra free page at the end of the 
allocated memory region used by the process.  This assumes that all 
memory used by the process is contiguous however, or at least that the 
intervening pages are valid.  But neither assumption is reasonable, and 
while some sort of fancy manual fix may have worked with some memory 
management APIs, it wouldn't have worked with others.

For what it's worth, my concern with the current behavior is that its 
worst-case scenario coincides with typical usage.  It is extremely 
common for a programmer to use array sizes that are powers of two, and 
this is the situation which results in the maximum unused space per 
block (because the runtime adds 1 to the size of all array allocations 
passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
don't suppose someone can offer a suggestion for how to address this, 
other than "just always allocate exactly what they request and if they 
get a page fault then too bad for them?"


Sean

Nov 30 2007

Robert Fraser <fraserofthenight gmail.com> writes:

Sean Kelly wrote:
 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"
 
 
 Sean

I'm sure you've probably read it, but I'll post it anyway:

http://www.cs.purdue.edu/homes/grr/ISMM98-grr-mps-cef.pdf

In particular:
"... [Objects] slightly larger than a page boundary lead to
significant internal fragmentation in the page-based large
object allocator as well. The high internal fragmentation
associated with both large small objects and small large
objects suggest that one should create an intermediate class
of medium objects."

Nov 30 2007

Marius Muja <mariusm cs.ubc.ca> writes:

Sean Kelly wrote:

 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"

That was my problem also. I was using array sizes that were powers of 
two and because of the extra 1 there were many memory pages wasted. For 
example in one of my tests I was allocating float arrays of size 1024 
(4096 bytes) which were taking 2 pages instead of 1 because of the extra 
1 added, and that's where I was observing that with D's new operator I 
can only allocate half as much memory than with malloc.
Oddly the same problem occurs if I allocate much larger arrays of 
exactly 1024*1024 floats (4MB size). Seems that instead of 4MB, D's 
runtime allocated 8MB for each of these arrays. Is D's runtime using 4MB 
pages for large memory allocations?


Marius

Nov 30 2007

Sean Kelly <sean f4.ca> writes:

Marius Muja wrote:
 Sean Kelly wrote:
 
 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"

 
 That was my problem also. I was using array sizes that were powers of 
 two and because of the extra 1 there were many memory pages wasted. For 
 example in one of my tests I was allocating float arrays of size 1024 
 (4096 bytes) which were taking 2 pages instead of 1 because of the extra 
 1 added, and that's where I was observing that with D's new operator I 
 can only allocate half as much memory than with malloc.
 Oddly the same problem occurs if I allocate much larger arrays of 
 exactly 1024*1024 floats (4MB size). Seems that instead of 4MB, D's 
 runtime allocated 8MB for each of these arrays. Is D's runtime using 4MB 
 pages for large memory allocations?

Weird.  D's GC is basically a "big bag of pages" allocator, where groups 
of pages are divided into pools.  The maximum size a pool will be on its 
own is 8 megs, with very large single allocations getting their own pool 
that is as big as necessary.  If I had to guess, I'd say the 8MB you're 
seeing is an artifact of this allocation scheme, and I'd suspect that 
much of the 8MB is committed but marked as free pages in your program.


Sean

Nov 30 2007

Marius Muja <mariusm cs.ubc.ca> writes:

 Weird.  D's GC is basically a "big bag of pages" allocator, where groups 
 of pages are divided into pools.  The maximum size a pool will be on its 
 own is 8 megs, with very large single allocations getting their own pool 
 that is as big as necessary.  If I had to guess, I'd say the 8MB you're 
 seeing is an artifact of this allocation scheme, and I'd suspect that 
 much of the 8MB is committed but marked as free pages in your program.

That seems to be the case. And because in my test program I was always 
allocating 4MB-sized arrays, the extra pages that were committed but 
marked as free couldn't be used.

Marius

Nov 30 2007

Jason House <jason.james.house gmail.com> writes:

Sean Kelly wrote:

 Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such
 in
 Tango.  They should improve efficiency in terms of eliminating wasted
 space and avoiding collections in some situations.  They'll probably
 take a few days, so stay tuned.

 
 Cool!  Maybe you could show some benchmarks when you get done?

 
 Unfortunately, I think I'm not going to commit these changes.  I had
 some clever ideas for how to replace the "allocate an extra byte"
 functionality with an approach that would have resulted in less wasted
 space, but it's become more complicated than the added efficiency is
 worth.
 
 Basically, I was hoping to keep an extra free page at the end of the
 allocated memory region used by the process.  This assumes that all
 memory used by the process is contiguous however, or at least that the
 intervening pages are valid.  But neither assumption is reasonable, and
 while some sort of fancy manual fix may have worked with some memory
 management APIs, it wouldn't have worked with others.
 
 For what it's worth, my concern with the current behavior is that its
 worst-case scenario coincides with typical usage.  It is extremely
 common for a programmer to use array sizes that are powers of two, and
 this is the situation which results in the maximum unused space per
 block (because the runtime adds 1 to the size of all array allocations
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I
 don't suppose someone can offer a suggestion for how to address this,
 other than "just always allocate exactly what they request and if they
 get a page fault then too bad for them?"
 
 
 Sean

If object sizes are a power of 2, is that because of special padding of a
struct/object to make it align well?  Maybe some clever trick could be done
to use wasted space, if it exists, for special tracking data... Otherwise,
allocating a bit more seems the most sensible.

Nov 30 2007

Sean Kelly <sean f4.ca> writes:

Jason House wrote:
 Sean Kelly wrote:
 
 Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such
 in
 Tango.  They should improve efficiency in terms of eliminating wasted
 space and avoiding collections in some situations.  They'll probably
 take a few days, so stay tuned.

 Cool!  Maybe you could show some benchmarks when you get done?

 Unfortunately, I think I'm not going to commit these changes.  I had
 some clever ideas for how to replace the "allocate an extra byte"
 functionality with an approach that would have resulted in less wasted
 space, but it's become more complicated than the added efficiency is
 worth.

 Basically, I was hoping to keep an extra free page at the end of the
 allocated memory region used by the process.  This assumes that all
 memory used by the process is contiguous however, or at least that the
 intervening pages are valid.  But neither assumption is reasonable, and
 while some sort of fancy manual fix may have worked with some memory
 management APIs, it wouldn't have worked with others.

 For what it's worth, my concern with the current behavior is that its
 worst-case scenario coincides with typical usage.  It is extremely
 common for a programmer to use array sizes that are powers of two, and
 this is the situation which results in the maximum unused space per
 block (because the runtime adds 1 to the size of all array allocations
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I
 don't suppose someone can offer a suggestion for how to address this,
 other than "just always allocate exactly what they request and if they
 get a page fault then too bad for them?"

 
 If object sizes are a power of 2, is that because of special padding of a
 struct/object to make it align well?  Maybe some clever trick could be done
 to use wasted space, if it exists, for special tracking data... Otherwise,
 allocating a bit more seems the most sensible.

Only arrays get the +1 byte added to their size during allocations.  If 
an object occupies 16 bytes then it will end up in a 16 byte bucket. 
The +1 issue for arrays exists to make the program a bit more forgiving 
for "past the end" errors.  Interestingly, the ANSI C standard actually 
requires this behavior in its allocators.  I have yet to decide whether 
or not this is an artifact of C using terminators to mark the end of a 
block, since using terminators seems more likely to result in such bugs 
than the length scheme D employs.


Sean

Nov 30 2007

Sean Kelly <sean f4.ca> writes:

Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding of a
 struct/object to make it align well?  Maybe some clever trick could be 
 done
 to use wasted space, if it exists, for special tracking data... 
 Otherwise,
 allocating a bit more seems the most sensible.

 
 Only arrays get the +1 byte added to their size during allocations.  If 
 an object occupies 16 bytes then it will end up in a 16 byte bucket. The 
 +1 issue for arrays exists to make the program a bit more forgiving for 
 "past the end" errors.  Interestingly, the ANSI C standard actually 
 requires this behavior in its allocators.

So I went and looked up the clause and I'm not clear whether it requires 
the location to be physically available or merely fore the address to be 
valid for pointer mathematics.  I also realized that I don't need to 
worry about malloc and vmalloc because they're already ANSI C complaint, 
so it's just potentially a matter of sorting out VirtualAlloc and mmap. 
  I'm going to do a bit more research and see what comes out of it.


Sean

Nov 30 2007

James Dennett <jdennett acm.org> writes:

Sean Kelly wrote:
 Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding
 of a
 struct/object to make it align well?  Maybe some clever trick could
 be done
 to use wasted space, if it exists, for special tracking data...
 Otherwise,
 allocating a bit more seems the most sensible.

 Only arrays get the +1 byte added to their size during allocations. 
 If an object occupies 16 bytes then it will end up in a 16 byte
 bucket. The +1 issue for arrays exists to make the program a bit more
 forgiving for "past the end" errors.  Interestingly, the ANSI C
 standard actually requires this behavior in its allocators.

 
 So I went and looked up the clause and I'm not clear whether it requires
 the location to be physically available or merely fore the address to be
 valid for pointer mathematics. 

C and C++ only require that the address be valid for pointer
arithmetic; it need not be dereferencable.

-- James

Dec 19 2007

Sean Kelly <sean f4.ca> writes:

James Dennett wrote:
 Sean Kelly wrote:
 Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding
 of a
 struct/object to make it align well?  Maybe some clever trick could
 be done
 to use wasted space, if it exists, for special tracking data...
 Otherwise,
 allocating a bit more seems the most sensible.

 Only arrays get the +1 byte added to their size during allocations. 
 If an object occupies 16 bytes then it will end up in a 16 byte
 bucket. The +1 issue for arrays exists to make the program a bit more
 forgiving for "past the end" errors.  Interestingly, the ANSI C
 standard actually requires this behavior in its allocators.

 So I went and looked up the clause and I'm not clear whether it requires
 the location to be physically available or merely fore the address to be
 valid for pointer mathematics. 

 
 C and C++ only require that the address be valid for pointer
 arithmetic; it need not be dereferencable.

Thanks.  This was what I inferred, but I had heard (or misheard) that it 
was the stronger guarantee and got confused when I didn't see this in 
the spec.  Unfortunately, as others have said, the "past the end" issue 
can cause other problems with garbage collection, so it seems that in D 
the extra byte of free space should typically be reserved anyway.


Sean

Dec 19 2007

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Sean Kelly wrote:

 Only arrays get the +1 byte added to their size during allocations.  If 
 an object occupies 16 bytes then it will end up in a 16 byte bucket. The 
 +1 issue for arrays exists to make the program a bit more forgiving for 
 "past the end" errors.  Interestingly, the ANSI C standard actually 
 requires this behavior in its allocators.  I have yet to decide whether 
 or not this is an artifact of C using terminators to mark the end of a 
 block, since using terminators seems more likely to result in such bugs 
 than the length scheme D employs.

The +1 allocation for arrays is AFAIK there to make sure that it is 
valid to have a pointer point one past the end. You want this to a) 
allow efficient iterators, and b) to allow [$..$] slices.

The problems that would arise if the GC didn't allocate that extra byte are

* appending to a [$..$] slice would in some cases be confused by the gc 
as appending to a [0..0] slice of the following allocated array, and 
thereby corrupting that memory.

* all pointers past the end prevents the consecutive memory area to be 
reclaimed by the GC.

-- 
Oskar

Dec 01 2007

Sean Kelly <sean f4.ca> writes:

Oskar Linde wrote:
 
 The problems that would arise if the GC didn't allocate that extra byte are
 
 * appending to a [$..$] slice would in some cases be confused by the gc 
 as appending to a [0..0] slice of the following allocated array, and 
 thereby corrupting that memory.
 
 * all pointers past the end prevents the consecutive memory area to be 
 reclaimed by the GC.

Okay that makes sense.  The Boehm GC tied the +1 byte to whether 
internal pointers were allowed so I was wondering if it was something 
along these lines.  Thanks.


Sean

Dec 01 2007

=?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= <jeberger free.fr> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marius Muja wrote:
 Hi,
 
 I'm working on a program that is computationally and memory intensive. I
 have noticed that if I allocate the memory using D's new operator I can
 allocate only about half as much memory compared to C's malloc (and also
 the memory allocation is much slower). Does anybody know why D's memory
 allocator uses twice as much memory compared to malloc?
 
 I used the following test program to test this:
 
 ... snip ...
 
 When the above program uses the test_d_alloc() function it executes the
 loop about half as many times compared to when it's using the
 test_malloc() function (until it terminates with an out of memory
 error). Also when the allocation is done in smaller increments
 (ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after
 allocating the maximum amount of memory (3G) instead of terminating with
 a nice OutOfMemory error.
 

	Since you are using Linux, I would like to point out that a call to
malloc may succeed even if there is no memory available at the time.
The program will only fail when it actually tries to access the
memory. Since the D new fills the array with zeroes, it will fail
immediately upon reaching the memory limit. The standard malloc
however may appear to work much longer unless you actually try to
use the memory.

		Jerome
- --
+------------------------- Jerome M. BERGER ---------------------+
|    mailto:jeberger free.fr      | ICQ:    238062172            |
|    http://jeberger.free.fr/     | Jabber: jeberger jabber.fr   |
+---------------------------------+------------------------------+
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHTxnhd0kWM4JG3k8RAuW/AKCBk6hVjvhe/2qLEVdnF1cWvFxEZQCgg0Yu
aVOJuBt6UGLXGiSoQ31EX0I=
=7X3r
-----END PGP SIGNATURE-----

Nov 29 2007

=?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= <jeberger free.fr> writes:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

J�r�me M. Berger wrote:
 Since the D new fills the array with zeroes, 

                                       ^^^^^^

	Sorry, this should be "nans" of course, but the main point remains
valid...

		Jerome
- --
+------------------------- Jerome M. BERGER ---------------------+
|    mailto:jeberger free.fr      | ICQ:    238062172            |
|    http://jeberger.free.fr/     | Jabber: jeberger jabber.fr   |
+---------------------------------+------------------------------+
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHTxwRd0kWM4JG3k8RAtv6AKC+DFmdebG3rE4p+MgwRztO+7YlugCePP2X
y5rUTtv77h6fmH8AI/zy0Oc=
=CqfB
-----END PGP SIGNATURE-----

Nov 29 2007

D Programming

C/C++ Programming

Other

digitalmars.D - Memory allocation in D