www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Memory allocation in D

reply Marius Muja <mariusm cs.ubc.ca> writes:
Hi,

I'm working on a program that is computationally and memory intensive. I 
have noticed that if I allocate the memory using D's new operator I can 
allocate only about half as much memory compared to C's malloc (and also 
the memory allocation is much slower). Does anybody know why D's memory 
allocator uses twice as much memory compared to malloc?

I used the following test program to test this:

version (Tango){
import tango.core.Memory;
import tango.stdc.stdlib;
import tango.io.Stdout;
}
else
{
import std.stdio;
import std.c.stdlib;
}

const int ALLOC_SIZE = 1024*1024;

void test_d_alloc()
{
	float[] a = new float[ALLOC_SIZE];
}

void test_malloc()
{
	float* ptr = cast(float*)malloc(ALLOC_SIZE*float.sizeof);
	if (ptr is null) throw new Exception("Out of memory");
}

void main()
{
	version (Tango) {
		GC.disable();
	} else {
		std.gc.disable();
	}
	int i=0;
	while(true) {
		version (Tango) {
			Stdout.formatln("{}",++i);
		} else {
			writefln(++i);
		}
// 		test_d_alloc();
		test_malloc();
	}
}


When the above program uses the test_d_alloc() function it executes the 
loop about half as many times compared to when it's using the 
test_malloc() function (until it terminates with an out of memory 
error). Also when the allocation is done in smaller increments 
(ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after 
allocating the maximum amount of memory (3G) instead of terminating with 
a nice OutOfMemory error.

Marius
Nov 28 2007
next sibling parent BCS <ao pathlink.com> writes:
Reply to Marius,

 Hi,
 
 I'm working on a program that is computationally and memory intensive.
 I have noticed that if I allocate the memory using D's new operator I
 can allocate only about half as much memory compared to C's malloc
 (and also the memory allocation is much slower). Does anybody know why
 D's memory allocator uses twice as much memory compared to malloc?
 

 When the above program uses the test_d_alloc() function it executes
 the loop about half as many times compared to when it's using the
 test_malloc() function (until it terminates with an out of memory
 error). Also when the allocation is done in smaller increments
 (ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after
 allocating the maximum amount of memory (3G) instead of terminating
 with a nice OutOfMemory error.
 

If you are using windows and the D new version allocates 3GB of ram then it's using everything it can get. No matter what, you have 75% of the address space, I would suspect that the malloc isn't actual allocating as much as it should. Or am I reading something wrong.
Nov 28 2007
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
Marius Muja wrote:
 Hi,
 
 I'm working on a program that is computationally and memory intensive. I 
 have noticed that if I allocate the memory using D's new operator I can 
 allocate only about half as much memory compared to C's malloc (and also 
 the memory allocation is much slower). Does anybody know why D's memory 
 allocator uses twice as much memory compared to malloc?
 
 I used the following test program to test this:
 
 version (Tango){
 import tango.core.Memory;
 import tango.stdc.stdlib;
 import tango.io.Stdout;
 }
 else
 {
 import std.stdio;
 import std.c.stdlib;
 }
 
 const int ALLOC_SIZE = 1024*1024;
 
 void test_d_alloc()
 {
     float[] a = new float[ALLOC_SIZE];
 }

With D as it is now, this will call GC.malloc(2049), which will allocate one page of memory per call. The D runtime adds 1 to every allocation with "new", according to Walter, this was to prevent certain confusing errors. It will allocate a page per call because the D GC allocates from pools devoted to fixed-size blocks which grow in powers of two up to 4096 bytes (one page on most systems). Allocations beyond 4096 bytes are "big" allocations, and a multiple of 4096 bytes will be allocated to fit the requested size. For these allocations, when the memory is released I believe the pages go back into a free pool. I'm not sure offhand if empty pages used for smaller blocks do or not. You might want to try calling GC.malloc() instead and compare results. Sean
Nov 29 2007
parent reply Sean Kelly <sean f4.ca> writes:
This thread inspired me to make a number of changes to the GC and such 
in Tango.  They should improve efficiency in terms of eliminating wasted 
space and avoiding collections in some situations.  They'll probably 
take a few days, so stay tuned.


Sean
Nov 29 2007
parent reply "Craig Black" <cblack ara.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such in 
 Tango.  They should improve efficiency in terms of eliminating wasted 
 space and avoiding collections in some situations.  They'll probably take 
 a few days, so stay tuned.


 Sean

Cool! Maybe you could show some benchmarks when you get done?
Nov 29 2007
parent reply Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message 
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such in 
 Tango.  They should improve efficiency in terms of eliminating wasted 
 space and avoiding collections in some situations.  They'll probably take 
 a few days, so stay tuned.

Cool! Maybe you could show some benchmarks when you get done?

Unfortunately, I think I'm not going to commit these changes. I had some clever ideas for how to replace the "allocate an extra byte" functionality with an approach that would have resulted in less wasted space, but it's become more complicated than the added efficiency is worth. Basically, I was hoping to keep an extra free page at the end of the allocated memory region used by the process. This assumes that all memory used by the process is contiguous however, or at least that the intervening pages are valid. But neither assumption is reasonable, and while some sort of fancy manual fix may have worked with some memory management APIs, it wouldn't have worked with others. For what it's worth, my concern with the current behavior is that its worst-case scenario coincides with typical usage. It is extremely common for a programmer to use array sizes that are powers of two, and this is the situation which results in the maximum unused space per block (because the runtime adds 1 to the size of all array allocations passed to the GC). Totally stinks, and I'm running out of ideas. I don't suppose someone can offer a suggestion for how to address this, other than "just always allocate exactly what they request and if they get a page fault then too bad for them?" Sean
Nov 30 2007
next sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
Sean Kelly wrote:
 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"
 
 
 Sean

I'm sure you've probably read it, but I'll post it anyway: http://www.cs.purdue.edu/homes/grr/ISMM98-grr-mps-cef.pdf In particular: "... [Objects] slightly larger than a page boundary lead to significant internal fragmentation in the page-based large object allocator as well. The high internal fragmentation associated with both large small objects and small large objects suggest that one should create an intermediate class of medium objects."
Nov 30 2007
prev sibling next sibling parent reply Marius Muja <mariusm cs.ubc.ca> writes:
Sean Kelly wrote:

 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"

That was my problem also. I was using array sizes that were powers of two and because of the extra 1 there were many memory pages wasted. For example in one of my tests I was allocating float arrays of size 1024 (4096 bytes) which were taking 2 pages instead of 1 because of the extra 1 added, and that's where I was observing that with D's new operator I can only allocate half as much memory than with malloc. Oddly the same problem occurs if I allocate much larger arrays of exactly 1024*1024 floats (4MB size). Seems that instead of 4MB, D's runtime allocated 8MB for each of these arrays. Is D's runtime using 4MB pages for large memory allocations? Marius
Nov 30 2007
parent reply Sean Kelly <sean f4.ca> writes:
Marius Muja wrote:
 Sean Kelly wrote:
 
 For what it's worth, my concern with the current behavior is that its 
 worst-case scenario coincides with typical usage.  It is extremely 
 common for a programmer to use array sizes that are powers of two, and 
 this is the situation which results in the maximum unused space per 
 block (because the runtime adds 1 to the size of all array allocations 
 passed to the GC).  Totally stinks, and I'm running out of ideas.  I 
 don't suppose someone can offer a suggestion for how to address this, 
 other than "just always allocate exactly what they request and if they 
 get a page fault then too bad for them?"

That was my problem also. I was using array sizes that were powers of two and because of the extra 1 there were many memory pages wasted. For example in one of my tests I was allocating float arrays of size 1024 (4096 bytes) which were taking 2 pages instead of 1 because of the extra 1 added, and that's where I was observing that with D's new operator I can only allocate half as much memory than with malloc. Oddly the same problem occurs if I allocate much larger arrays of exactly 1024*1024 floats (4MB size). Seems that instead of 4MB, D's runtime allocated 8MB for each of these arrays. Is D's runtime using 4MB pages for large memory allocations?

Weird. D's GC is basically a "big bag of pages" allocator, where groups of pages are divided into pools. The maximum size a pool will be on its own is 8 megs, with very large single allocations getting their own pool that is as big as necessary. If I had to guess, I'd say the 8MB you're seeing is an artifact of this allocation scheme, and I'd suspect that much of the 8MB is committed but marked as free pages in your program. Sean
Nov 30 2007
parent Marius Muja <mariusm cs.ubc.ca> writes:
 Weird.  D's GC is basically a "big bag of pages" allocator, where groups 
 of pages are divided into pools.  The maximum size a pool will be on its 
 own is 8 megs, with very large single allocations getting their own pool 
 that is as big as necessary.  If I had to guess, I'd say the 8MB you're 
 seeing is an artifact of this allocation scheme, and I'd suspect that 
 much of the 8MB is committed but marked as free pages in your program.

That seems to be the case. And because in my test program I was always allocating 4MB-sized arrays, the extra pages that were committed but marked as free couldn't be used. Marius
Nov 30 2007
prev sibling parent reply Jason House <jason.james.house gmail.com> writes:
Sean Kelly wrote:

 Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such
 in
 Tango.  They should improve efficiency in terms of eliminating wasted
 space and avoiding collections in some situations.  They'll probably
 take a few days, so stay tuned.

Cool! Maybe you could show some benchmarks when you get done?

Unfortunately, I think I'm not going to commit these changes. I had some clever ideas for how to replace the "allocate an extra byte" functionality with an approach that would have resulted in less wasted space, but it's become more complicated than the added efficiency is worth. Basically, I was hoping to keep an extra free page at the end of the allocated memory region used by the process. This assumes that all memory used by the process is contiguous however, or at least that the intervening pages are valid. But neither assumption is reasonable, and while some sort of fancy manual fix may have worked with some memory management APIs, it wouldn't have worked with others. For what it's worth, my concern with the current behavior is that its worst-case scenario coincides with typical usage. It is extremely common for a programmer to use array sizes that are powers of two, and this is the situation which results in the maximum unused space per block (because the runtime adds 1 to the size of all array allocations passed to the GC). Totally stinks, and I'm running out of ideas. I don't suppose someone can offer a suggestion for how to address this, other than "just always allocate exactly what they request and if they get a page fault then too bad for them?" Sean

If object sizes are a power of 2, is that because of special padding of a struct/object to make it align well? Maybe some clever trick could be done to use wasted space, if it exists, for special tracking data... Otherwise, allocating a bit more seems the most sensible.
Nov 30 2007
parent reply Sean Kelly <sean f4.ca> writes:
Jason House wrote:
 Sean Kelly wrote:
 
 Craig Black wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 news:fin4fl$1h7v$1 digitalmars.com...
 This thread inspired me to make a number of changes to the GC and such
 in
 Tango.  They should improve efficiency in terms of eliminating wasted
 space and avoiding collections in some situations.  They'll probably
 take a few days, so stay tuned.


some clever ideas for how to replace the "allocate an extra byte" functionality with an approach that would have resulted in less wasted space, but it's become more complicated than the added efficiency is worth. Basically, I was hoping to keep an extra free page at the end of the allocated memory region used by the process. This assumes that all memory used by the process is contiguous however, or at least that the intervening pages are valid. But neither assumption is reasonable, and while some sort of fancy manual fix may have worked with some memory management APIs, it wouldn't have worked with others. For what it's worth, my concern with the current behavior is that its worst-case scenario coincides with typical usage. It is extremely common for a programmer to use array sizes that are powers of two, and this is the situation which results in the maximum unused space per block (because the runtime adds 1 to the size of all array allocations passed to the GC). Totally stinks, and I'm running out of ideas. I don't suppose someone can offer a suggestion for how to address this, other than "just always allocate exactly what they request and if they get a page fault then too bad for them?"

If object sizes are a power of 2, is that because of special padding of a struct/object to make it align well? Maybe some clever trick could be done to use wasted space, if it exists, for special tracking data... Otherwise, allocating a bit more seems the most sensible.

Only arrays get the +1 byte added to their size during allocations. If an object occupies 16 bytes then it will end up in a 16 byte bucket. The +1 issue for arrays exists to make the program a bit more forgiving for "past the end" errors. Interestingly, the ANSI C standard actually requires this behavior in its allocators. I have yet to decide whether or not this is an artifact of C using terminators to mark the end of a block, since using terminators seems more likely to result in such bugs than the length scheme D employs. Sean
Nov 30 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding of a
 struct/object to make it align well?  Maybe some clever trick could be 
 done
 to use wasted space, if it exists, for special tracking data... 
 Otherwise,
 allocating a bit more seems the most sensible.

Only arrays get the +1 byte added to their size during allocations. If an object occupies 16 bytes then it will end up in a 16 byte bucket. The +1 issue for arrays exists to make the program a bit more forgiving for "past the end" errors. Interestingly, the ANSI C standard actually requires this behavior in its allocators.

So I went and looked up the clause and I'm not clear whether it requires the location to be physically available or merely fore the address to be valid for pointer mathematics. I also realized that I don't need to worry about malloc and vmalloc because they're already ANSI C complaint, so it's just potentially a matter of sorting out VirtualAlloc and mmap. I'm going to do a bit more research and see what comes out of it. Sean
Nov 30 2007
parent reply James Dennett <jdennett acm.org> writes:
Sean Kelly wrote:
 Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding
 of a
 struct/object to make it align well?  Maybe some clever trick could
 be done
 to use wasted space, if it exists, for special tracking data...
 Otherwise,
 allocating a bit more seems the most sensible.

Only arrays get the +1 byte added to their size during allocations. If an object occupies 16 bytes then it will end up in a 16 byte bucket. The +1 issue for arrays exists to make the program a bit more forgiving for "past the end" errors. Interestingly, the ANSI C standard actually requires this behavior in its allocators.

So I went and looked up the clause and I'm not clear whether it requires the location to be physically available or merely fore the address to be valid for pointer mathematics.

C and C++ only require that the address be valid for pointer arithmetic; it need not be dereferencable. -- James
Dec 19 2007
parent Sean Kelly <sean f4.ca> writes:
James Dennett wrote:
 Sean Kelly wrote:
 Sean Kelly wrote:
 Jason House wrote:
 If object sizes are a power of 2, is that because of special padding
 of a
 struct/object to make it align well?  Maybe some clever trick could
 be done
 to use wasted space, if it exists, for special tracking data...
 Otherwise,
 allocating a bit more seems the most sensible.

If an object occupies 16 bytes then it will end up in a 16 byte bucket. The +1 issue for arrays exists to make the program a bit more forgiving for "past the end" errors. Interestingly, the ANSI C standard actually requires this behavior in its allocators.

the location to be physically available or merely fore the address to be valid for pointer mathematics.

C and C++ only require that the address be valid for pointer arithmetic; it need not be dereferencable.

Thanks. This was what I inferred, but I had heard (or misheard) that it was the stronger guarantee and got confused when I didn't see this in the spec. Unfortunately, as others have said, the "past the end" issue can cause other problems with garbage collection, so it seems that in D the extra byte of free space should typically be reserved anyway. Sean
Dec 19 2007
prev sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:

 Only arrays get the +1 byte added to their size during allocations.  If 
 an object occupies 16 bytes then it will end up in a 16 byte bucket. The 
 +1 issue for arrays exists to make the program a bit more forgiving for 
 "past the end" errors.  Interestingly, the ANSI C standard actually 
 requires this behavior in its allocators.  I have yet to decide whether 
 or not this is an artifact of C using terminators to mark the end of a 
 block, since using terminators seems more likely to result in such bugs 
 than the length scheme D employs.

The +1 allocation for arrays is AFAIK there to make sure that it is valid to have a pointer point one past the end. You want this to a) allow efficient iterators, and b) to allow [$..$] slices. The problems that would arise if the GC didn't allocate that extra byte are * appending to a [$..$] slice would in some cases be confused by the gc as appending to a [0..0] slice of the following allocated array, and thereby corrupting that memory. * all pointers past the end prevents the consecutive memory area to be reclaimed by the GC. -- Oskar
Dec 01 2007
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 
 The problems that would arise if the GC didn't allocate that extra byte are
 
 * appending to a [$..$] slice would in some cases be confused by the gc 
 as appending to a [0..0] slice of the following allocated array, and 
 thereby corrupting that memory.
 
 * all pointers past the end prevents the consecutive memory area to be 
 reclaimed by the GC.

Okay that makes sense. The Boehm GC tied the +1 byte to whether internal pointers were allowed so I was wondering if it was something along these lines. Thanks. Sean
Dec 01 2007
prev sibling parent reply =?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= <jeberger free.fr> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marius Muja wrote:
 Hi,
 
 I'm working on a program that is computationally and memory intensive. I
 have noticed that if I allocate the memory using D's new operator I can
 allocate only about half as much memory compared to C's malloc (and also
 the memory allocation is much slower). Does anybody know why D's memory
 allocator uses twice as much memory compared to malloc?
 
 I used the following test program to test this:
 
 ... snip ...
 
 When the above program uses the test_d_alloc() function it executes the
 loop about half as many times compared to when it's using the
 test_malloc() function (until it terminates with an out of memory
 error). Also when the allocation is done in smaller increments
 (ALLOC_SIZE is smaller) the test_d_alloc() version segfaults after
 allocating the maximum amount of memory (3G) instead of terminating with
 a nice OutOfMemory error.
 

malloc may succeed even if there is no memory available at the time. The program will only fail when it actually tries to access the memory. Since the D new fills the array with zeroes, it will fail immediately upon reaching the memory limit. The standard malloc however may appear to work much longer unless you actually try to use the memory. Jerome - -- +------------------------- Jerome M. BERGER ---------------------+ | mailto:jeberger free.fr | ICQ: 238062172 | | http://jeberger.free.fr/ | Jabber: jeberger jabber.fr | +---------------------------------+------------------------------+ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) iD8DBQFHTxnhd0kWM4JG3k8RAuW/AKCBk6hVjvhe/2qLEVdnF1cWvFxEZQCgg0Yu aVOJuBt6UGLXGiSoQ31EX0I= =7X3r -----END PGP SIGNATURE-----
Nov 29 2007
parent =?ISO-8859-1?Q?=22J=E9r=F4me_M=2E_Berger=22?= <jeberger free.fr> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jérôme M. Berger wrote:
 Since the D new fills the array with zeroes, 

Sorry, this should be "nans" of course, but the main point remains valid... Jerome - -- +------------------------- Jerome M. BERGER ---------------------+ | mailto:jeberger free.fr | ICQ: 238062172 | | http://jeberger.free.fr/ | Jabber: jeberger jabber.fr | +---------------------------------+------------------------------+ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) iD8DBQFHTxwRd0kWM4JG3k8RAtv6AKC+DFmdebG3rE4p+MgwRztO+7YlugCePP2X y5rUTtv77h6fmH8AI/zy0Oc= =CqfB -----END PGP SIGNATURE-----
Nov 29 2007