digitalmars.D - How to save RAM in D programs (on zero initialized buffers): Reloaded
- Marco Leise (61/61) Feb 07 2012 Hi, this is me again with some "size matters" topic. This time, it's not
- Nick Sabalausky (3/3) Feb 07 2012 Is void initialization not good enough?
- Marco Leise (9/12) Feb 07 2012 That gives me a) no buffer, who's pointer is b) not initialized to null.
- Michel Fortin (10/28) Feb 07 2012 What would be nice is a GC that would just track system-allocated
- Martin Nowak (10/31) Feb 07 2012 A similar idea popped up to solve non-deterministic shared library
- Jacob Carlborg (4/28) Feb 07 2012 What about GC.addRoot ?
- Michel Fortin (8/38) Feb 08 2012 This is perfect if you want the GC to scan a memory range for pointers.
- Martin Nowak (3/18) Feb 07 2012 You usually don't need initialized memory from the GC.
- Kapps (5/8) Feb 07 2012 This example would be uninitializedArray!(ubyte[])(1024 * 1024).
- Iain Buclaw (16/77) Feb 07 2012 What about these functions?
- Marco Leise (6/18) Feb 07 2012 Yeah I know these and used them, only to find out the program I am porti...
- F i L (3/3) Feb 07 2012 Can't you just write a custom allocator using calloc for
- Timon Gehr (3/6) Feb 07 2012 The solution with the best performance and least memory requirements
- Artur Skawina (3/4) Feb 07 2012 That won't help - the compiler will defeat the optimization by "initiali...
- Manfred Nowak (14/15) Feb 07 2012 The tests only cover a very small fraction of an unknown data
- Marco Leise (11/26) Feb 07 2012 I meant caveats with calloc vs. other allocators. I don't want to discus...
- Manfred Nowak (21/25) Feb 08 2012 [...]
- Marco Leise (25/50) Feb 08 2012 That sounds a bit vague. If I understand you correctly you would impleme...
- Manfred Nowak (4/5) Feb 08 2012 Andrei has written a paper on allocation:
- Marco Leise (18/23) Feb 09 2012 Oh ok, that's farther than I ever digged into memory management. My
- Kagamin (3/3) Feb 09 2012 I guess, calloc will reuse blocks too, so if you run the
- Marco Leise (11/14) Feb 09 2012 You don't understand how it works. calloc gives you exactly 0 KB of
- Jose Armando Garcia (7/69) Feb 07 2012 Special? What do you mean by special? Most OS use Virtual Memory so
- Marco Leise (9/14) Feb 07 2012 Yes, special. Like, say, you fork a process on Unix. Instead of copying ...
Hi, this is me again with some "size matters" topic. This time, it's not the executable size, no! Instead I want to discuss a runtime memory footprint and speed issue that affects everyone, and how to improve the situation dramatically.

In D we allocate memory through the GC, which initializes it according to the type's .init, giving us a safe default. In most cases this results in the memory block being zeroed out, as in the case of allocating ubyte[] buffers. Let's assume we have a program that allocates some buffers in advance that it may not use fully. This often happens when the input data is much smaller than the anticipated case. So our memory manager should handle this situation well:

  o  zero out a memory block
  o  we probably don't need all of it

So here is a small benchmark that allocates 512 * 1 MB, first using the typical method: new ubyte[1024 * 1024]. The output is:

        ** new ubyte[1024 * 1024]
           resource usage: +526840 KB
           user time: +0.098s | sys. time: +0.368s

As expected we have a physical memory usage increase of ~512 MB and spent a considerable amount of time in the system finding free memory blocks and in our program initializing the data to zero. Can we use the GC more directly? Let's try GC.calloc:

        ** GC.calloc(1024 * 1024)
           resource usage: +525104 KB
           user time: +0.089s | sys. time: +0.370s

Again, 512 MB and about the same time. Nothing gained, but my RAM is starting to fill up. By the way, how does a good old call to 'malloc' compare? That gives us a block of garbage-'initialized' data - a situation we left behind for good in D! So here we go with another test:

        ** malloc(1024 * 1024)
           resource usage: +2048 KB
           user time: +0.000s | sys. time: +0.002s

Oh nice! May I say... these 512 calls were for free? 2 MB and 0.002 seconds ain't worth talking about. The operating system didn't actually allocate the memory, it merely gave us a virtual memory range to use. Only when we write to the memory will physical memory be bound.
That's perfect for a generously sized buffer, right? Well... we still want it zeroed out, so let's initialize this data to zero with ptr[0 .. 1024 * 1024] = 0:

        ** malloc(1024 * 1024) + zero out
           resource usage: +526336 KB
           user time: +0.053s | sys. time: +0.366s

... and we are back at square one, except that the user time is notably lower. What we need is a facility that gives us lazily allocated, zeroed-out memory. And guess what, it's not too much to ask for. Here is 'calloc' to the rescue:

        ** calloc(1, 1024 * 1024)
           resource usage: +2048 KB
           user time: +0.001s | sys. time: +0.001s

How does it work? The operating system fakes the memory allocation and just gives us 131072 references to a special read-only memory page of zeroes. The semantics are copy-on-write: we start with a view of zeroed-out memory and get the real thing once we write into it. (Sorry if I'm telling some of you nothing new, but I just found this out today ;) )

The question I have is: should we go and improve druntime with that knowledge? I'm not aware of any caveats - are there any? Thanks for reading; the test program for Linux is in the attachment (I used GDC to compile).

-- Marco
Feb 07 2012
Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;
Feb 07 2012
On 07.02.2012 21:11, Nick Sabalausky <a a.a> wrote:Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, whose pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled', already-used memory. It has to be zeroed out manually, unless there were a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
On Tue, 07 Feb 2012 21:37:03 +0100, Michel Fortin <michel.fortin michelf.com> wrote:On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:A similar idea popped up to solve non-deterministic shared library unloading and/or mapped memory freeing. I don't think though that putting additional pressure on the GC is the right approach to resource management. The interface would be simple for non-overlapping ranges. void GC.trackRange(void* p, size_t sz, void delegate() finalizer)On 07.02.2012 21:11, Nick Sabalausky <a a.a> wrote:What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, whose pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
On 2012-02-07 21:37, Michel Fortin wrote:On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:What about GC.addRoot ? -- /Jacob CarlborgAm 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
On 2012-02-08 07:30:29 +0000, Jacob Carlborg <doob me.com> said:On 2012-02-07 21:37, Michel Fortin wrote:This is perfect if you want the GC to scan a memory range for pointers. But the GC can't track pointers pointing to the given memory range and notify you when that range is no longer referenced. -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:What about GC.addRoot ?Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 08 2012
On Tue, 07 Feb 2012 21:24:40 +0100, Marco Leise <Marco.Leise gmx.de> wrote:Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:You usually don't need initialized memory from the GC. The calloc behavior is still pretty interesting.Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
On 07/02/2012 2:11 PM, Nick Sabalausky wrote:Is void initialization not good enough? IIRC it's something like: ubyte[] buf = void;This example would be uninitializedArray!(ubyte[])(1024 * 1024). I would guess that it gives significantly better performance. There's also minimallyInitializedArray!(ubyte[])(1024 * 1024), which only initializes the bare minimum needed to be safe. They're both in std.array.
Feb 07 2012
On 7 February 2012 19:39, Marco Leise <Marco.Leise gmx.de> wrote:Hi, this is me again with some "size matters" topic. [...] The question I have is, should we go and improve druntime with that knowledge? I'm not aware of any caveats, are there any?What about these functions?

import std.array;

byte[][ALLOCS] b, c;

writeln("** uninitializedArray!(ubyte[])(1024*1024)");
foreach(i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)");
foreach(i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

Regards
--
Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Feb 07 2012
On 07.02.2012 22:15, Iain Buclaw <ibuclaw ubuntu.com> wrote:o zero out a memory block <---------- !!!What about these functions? import std.array; byte[][ALLOCS] b, c; writeln("** uninitializedArray!(ubyte[])(1024*1024)"); foreach(i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024); prev = print_ressources(&prev); writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)"); foreach(i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024 * 1024); prev = print_ressources(&prev); RegardsYeah, I know these and used them, only to find out the program I am porting relies on the data being zeroed out (it is a certain type of compression utility). It's a shame to see the C version get away with a low memory footprint _and_ zero-initialized buffers. I'll just copy the 'Array' class from the original then, which uses calloc and free.
Feb 07 2012
Can't you just write a custom allocator using calloc for performance critical structures (http://dlang.org/memory.html#newdelete), or do what Iain said.
Feb 07 2012
On 02/08/2012 12:01 AM, F i L wrote:Can't you just write a custom allocator using calloc for performance critical structures (http://dlang.org/memory.html#newdelete), or do what Iain said.The solution with the best performance and least memory requirements obviously must be the default.
Feb 07 2012
On 02/08/12 00:01, F i L wrote:Can't you just write a custom allocator using calloc for performance critical structures (http://dlang.org/memory.html#newdelete), or do what Iain said.That won't help - the compiler will defeat the optimization by "initializing" the area... artur
Feb 07 2012
Artur Skawina wrote:That won't help - the compiler will defeat the optimization by "initializing" the area...I see. Timon Gehr wrote:The solution with the best performance and least memory requirements obviously must be the default.No argument here. Only, if calloc is an all-around better allocation method than malloc, why is malloc even used?
Feb 07 2012
F i L wrote:Only, if calloc is an all-around better allocation method than malloc, why is malloc even used?Note-to-self: google before asking stupid questions...
Feb 07 2012
Marco Leise wrote:I'm not aware of any caveats, are there any?The tests only cover a very small fraction of an unknown data structure's life: the allocation phase. Of course one can want to make a bad design run faster, especially if one needs to allocate 0.5 TB of main memory and can allocate it in 1 second instead of 98 seconds. But neglecting 1) the time for building a usable data structure out of 0.5 TB of main memory, 2) the time for querying this data structure for some element, 3) the time for inserting some element into this data structure, 4) the time for deleting some element from this data structure and 5) the amortized times of the actions in 3) and 4) is at least not in accordance with engineering principles. -manfred
Feb 07 2012
On 08.02.2012 04:37, Manfred Nowak <svv1999 hotmail.com> wrote:Marco Leise wrote:I meant caveats with calloc vs. other allocators. I don't want to discuss the design of compression algorithms that allocate a fixed-size buffer in advance. I am porting one of those to D, and of the many data structures it allocates with calloc, some may not be fully filled when the program exits. Yet the user notices a start-up delay in the D version for the memory allocation and initialization, as well as an instant jump to the maximum memory use. The C++ version starts instantly and maps memory pages on demand. The algorithm relies on some of the buffers being zeroed out, so there is really no alternative to calloc.I'm not aware of any caveats, are there any?The tests only cover a very small fraction of an unknown data structure: the allocation phase. Of course one can want to make a bad design running faster. Especially if one need to allocate 0.5 TB main memory and can allocate this in 1 second instead of 98 seconds. But neglecting 1) the time for building a usable data structure of 0.5 TB main memory, 2) the time for querying this data structure for some element, 3) the time for inserting some element into this data structure, 4) the time for deleting some element from this data structure and 5) the amortized times of the actions in 3) and 4) is at least not in accordance with engineering principles. -manfred
Feb 07 2012
Marco Leise wrote:In D we allocate memory through the GC[...]Let's assume, we have a program that allocates some buffers in advance, that it may not use fully.[...]there is really no alternative to calloc.1) calloc implements a strategy for allocating memory. If this strategy is useful for parts of a program, then the GC should be informed by the code and be capable of adapting its behavior to the requested strategy. This seems necessary because there is not only the null pattern to initialize allocated memory. 2) According to your OP the request for allocating memory can be partitioned into three disjoint classes: a) memory that is guaranteed to be used, b) memory that is guaranteed not to be used and c) memory that might be used. For the case that a + b exceeds available memory, should the coder predeclare a strategy, i.e. should the GC signal out of memory on request of the memory or should the GC wait until that sum reaches some deadline? -manfred
Feb 08 2012
On 08.02.2012 14:30, Manfred Nowak <svv1999 hotmail.com> wrote:Marco Leise wrote:In D we allocate memory through the GC[...]Let's assume, we have a program that allocates some buffers in advance, that it may not use fully.[...]there is really no alternative to calloc.1) calloc implements a strategy for allocating memory. If this strategy is useful for parts of a program, then the GC should be informed by the code and be capable of adapting its behavior to the requested strategy.That sounds a bit vague. If I understand you correctly, you would implement this as a hint flag, like

    GC.allocHint = AllocationHint.lazyZero;
    auto arr = new ubyte[…];

This seems necessary because there is not only the null pattern to initialize allocated memory. 2) According to your OP the request for allocating memory can be partitioned into three disjoint classes: a) memory that is guaranteed to be used, b) memory that is guaranteed not to be used and c) memory that might be used. For the case that a + b exceeds available memory, should the coder predeclare a strategy, i.e. should the GC signal out of memory on request of the memory or should the GC wait until that sum reaches some deadline? -manfredHere I can't follow you. The request to allocate memory contains memory regions that are guaranteed not to be used? Why would I request them then? Also, what is your definition of available memory? Available RAM, RAM + swap, or available virtual memory going beyond what the operating system can commit, resulting in random killing of applications (http://linux-mm.org/OOM_Killer)?

I don't see where the GC enters the picture, since all this 'may use memory, but don't commit to it' is handled solely by the OS. I could only imagine a function in the GC that always allocates a new chunk of memory and does that through calloc - or the hammer method - all allocations use calloc and some of the manual memset(..., 0, ...) is removed.

-- Marco
Feb 08 2012
Marco Leise wrote:That sounds a bit vague.Andrei has written a paper on allocation: http://erdani.com/publications/cuj-2005-12.pdf -manfred
Feb 08 2012
On 09.02.2012 03:56, Manfred Nowak <svv1999 hotmail.com> wrote:Marco Leise wrote:That sounds a bit vague.Andrei has written a paper on allocation: http://erdani.com/publications/cuj-2005-12.pdf -manfredOh ok, that's farther than I ever dug into memory management. My current problem with the compression utility has been solved at a small scale with a manually memory-managed 'ZeroInitializedAndAlignedBuffer' struct that uses calloc - right in the spirit of the original source code. The startup time was reduced by ~0.7 seconds and is now relatively close to the original as well; at least there is no longer that perceived delay. Techniques like calloc make it possible to use the algorithm in other places like batch processing, compressed FUSE file systems on Linux or libraries, where new instances may be spawned in quick succession. All technical details aside, I would wish for some solution to "I spawn a new instance of some huge complex data structure, but I probably won't need all of it, so use calloc". I don't know what the situation with other D programs is, but I would be interested in some larger-scale experiments with calloc and app startup time. It may be worse in some cases, but what if it improved the general situation - without writing a memory management library? I don't know the GC internals enough to tell if calloc is already used somewhere (in place of malloc and memset).
Feb 09 2012
I guess, calloc will reuse blocks too, so if you run the compressing function twice, it will reuse the memory block used and freed previously and zero it out honestly.
Feb 09 2012
On 09.02.2012 11:55, Kagamin <spam here.lot> wrote:I guess, calloc will reuse blocks too, so if you run the compressing function twice, it will reuse the memory block used and freed previously and zero it out honestly.You don't understand how it works. calloc gives you exactly 0 KB of physical memory. There is nothing to zero out :) There is one page, a block of 4096 bytes somewhere in the kernel, that is all zeroes and read-only. If you allocate memory you get a bunch of references to it. Or in other words, a zillion views of the same 4096 bytes repeating over and over. Only once you write to it does what you describe happen: the zero page is 'copied' into some (probably previously freed) page. This is equivalent to zeroing out the target page. The main difference here is that the zeroing out happens with a 'lazy' keyword attached to it.
Feb 09 2012
On Tue, Feb 7, 2012 at 5:39 PM, Marco Leise <Marco.Leise gmx.de> wrote:Hi, this is me again with some "size matters" topic. [...] How does it work? The operating system fakes the memory allocation and just gives us 131072 references to a special read-only memory page of zeroes. The semantic is copy-on-write. So we start with a view on zeroed out memory and get the real thing once we write into it.Special? What do you mean by special? Most OSes use virtual memory, so sure, they can say "here is a page" and yet not have that page backed by physical memory. To my knowledge, they can only do this if the allocated memory points to an unmapped page. I doubt this is the case in non-trivial programs.(Sorry, if I tell some of you nothing new, but I just found this out today ;) ) The question I have is, should we go and improve druntime with that knowledge? I'm not aware of any caveats, are there any? Thanks for reading and the test program for Linux is in the attachment (I used GDC to compile). -- Marco
Feb 07 2012
On 08.02.2012 04:40, Jose Armando Garcia <jsancio gmail.com> wrote:Special? What do you mean by special? Most OS use Virtual Memory so sure they can say here is a page and yet not have that page backed by physical memory. To my knowledge, they can only do this if the allocated memory points to an unmapped page. I doubt this is the case in non-trivial programs.Yes, special. Like, say, you fork a process on Unix. Instead of copying all the used memory, the processes share the same pages in read-only, copy-on-write mode. As soon as one of the processes writes to a memory page, it is duplicated. The same is true for the 'zero page'. Just that there is one for the whole system and it can have a zillion references to it: http://en.wikipedia.org/wiki/Copy-on-write <- Look for calloc. So a page can even be mapped and read without increasing the memory footprint, as I understand it.
Feb 07 2012