
digitalmars.D - How to save RAM in D programs (on zero initialized buffers): Reloaded

reply "Marco Leise" <Marco.Leise gmx.de> writes:

Hi, this is me again with some "size matters" topic. This time, it's not
the executable size, no! Instead I want to discuss a runtime memory
footprint and speed issue that affects everyone, and how to improve the
situation dramatically.

In D we allocate memory through the GC, and it is initialized according to
the type's .init value, which gives us a safe default. In most cases this
results in the memory block being zeroed out, as when allocating ubyte[]
buffers. Let's assume we have a program that allocates some buffers in
advance that it may not use fully. This often happens when the input data
is much smaller than the anticipated case. So our memory manager should
handle this situation well:
    o  zero out a memory block
    o  we probably don't need all of it

So here is a small benchmark that allocates 512 * 1 MB, first using the
typical method: new ubyte[1024 * 1024]. The output is:

	** new ubyte[1024 * 1024]
	   ressource usage: +526840 KB
	   user time: +0.098s | sys. time: +0.368s

As expected, physical memory usage increases by ~512 MB, and we spend a
considerable amount of time in the kernel finding free memory blocks and
in our program initializing the data to zero. Can we use the GC more
directly? Let's try GC.calloc:

	** GC.calloc(1024 * 1024)
	   ressource usage: +525104 KB
	   user time: +0.089s | sys. time: +0.370s

Again, 512 MB and about the same time. Nothing gained, but my RAM is
starting to fill up. By the way, how does good old C 'malloc' compare?
It gives us a block of garbage-'initialized' data - a situation we left
behind for good in D! So here we go with another test:

	** malloc(1024 * 1024)
	   ressource usage: +2048 KB
	   user time: +0.000s | sys. time: +0.002s

Oh nice! May I say... these 512 calls were practically free? 2 MB and 0.002
seconds ain't worth talking about. The operating system didn't actually
allocate the memory, it merely gave us a virtual address range to use.
Only when we write to the memory will physical pages be committed. That's
perfect for a generously sized buffer, right? Well... we still want it
zeroed out, so let's initialize the data with ptr[0 .. 1024 * 1024] = 0:

	** malloc(1024 * 1024) + zero out
	   ressource usage: +526336 KB
	   user time: +0.053s | sys. time: +0.366s

... and we are back at square one, except that the user time is notably
lower. What we need is a facility that gives us lazily allocated,
zeroed-out memory. And guess what, it's not too much to ask for.
Here is 'calloc' to the rescue:

	** calloc(1, 1024 * 1024)
	   ressource usage: +2048 KB
	   user time: +0.001s | sys. time: +0.001s

How does it work? The operating system fakes the memory allocation and
just gives us 131072 references (512 MB in 4 KB pages) to a special
read-only memory page of zeroes. The semantics are copy-on-write: we start
with a view of zeroed-out memory and get the real thing once we write into
it. (Sorry if I'm telling some of you nothing new, but I only found this
out today ;) )
The question I have is: should we go and improve druntime with that
knowledge? I'm not aware of any caveats - are there any?
Thanks for reading. The test program for Linux is in the attachment (I
used GDC to compile).

-- Marco
Attachment: memory_benchmark.d

import core.memory, std.stdio, core.sys.posix.sys.time, core.sys.posix.stdlib;

enum ALLOCS = 512;

void main()
{
	ubyte[][ALLOCS] x;
	void*[ALLOCS] y, w, z;
	ubyte*[ALLOCS] q;
	GC.disable();
	auto prev = print_ressources();

	writeln("** new ubyte[1024 * 1024]");
	foreach (i; 0 .. ALLOCS) x[i] = new ubyte[1024 * 1024];
	prev = print_ressources(&prev);

	writeln("** GC.calloc(1024 * 1024)");
	foreach (i; 0 .. ALLOCS) y[i] = GC.calloc(1024 * 1024);
	prev = print_ressources(&prev);

	writeln("** malloc(1024 * 1024)");
	foreach (i; 0 .. ALLOCS) z[i] = malloc(1024 * 1024);
	prev = print_ressources(&prev);

	writeln("** malloc(1024 * 1024) + zero out");
	foreach (i; 0 .. ALLOCS)
	{
		q[i] = cast(ubyte*) malloc(1024 * 1024);
		q[i][0 .. 1024 * 1024] = 0;
	}
	prev = print_ressources(&prev);

	writeln("** calloc(1, 1024 * 1024)");
	foreach (i; 0 .. ALLOCS) w[i] = calloc(1, 1024 * 1024);
	prev = print_ressources(&prev);
}

rusage print_ressources(rusage* prev = null)
{
	rusage res;
	getrusage(rusage_who.SELF, &res);
	if (prev !is null)
	{
		writefln("   ressource usage: +%s KB", res.ru_maxrss - prev.ru_maxrss);
		writefln("   user time: +%.3fs | sys. time: +%.3fs",
		         time_diff(prev.ru_utime, res.ru_utime),
		         time_diff(prev.ru_stime, res.ru_stime));
	}
	return res;
}

double time_diff(ref timeval a, ref timeval b)
{
	return b.tv_sec - a.tv_sec + 0.000001 * (b.tv_usec - a.tv_usec);
}

extern(C) int getrusage(rusage_who who, rusage* usage);

enum rusage_who : int { SELF = 0, CHILDREN = -1, BOTH = -2, THREAD = 1 }

struct rusage
{
	timeval ru_utime;        /** user time used */
	timeval ru_stime;        /** system time used */
	long    ru_maxrss;       /** maximum resident set size */
	long    ru_ixrss;        /** integral shared memory size */
	long    ru_idrss;        /** integral unshared data size */
	long    ru_isrss;        /** integral unshared stack size */
	long    ru_minflt;       /** page reclaims */
	long    ru_majflt;       /** page faults */
	long    ru_nswap;        /** swaps */
	long    ru_inblock;      /** block input operations */
	long    ru_oublock;      /** block output operations */
	long    ru_msgsnd;       /** messages sent */
	long    ru_msgrcv;       /** messages received */
	long    ru_nsignals;     /** signals received */
	long    ru_nvcsw;        /** voluntary context switches */
	long    ru_nivcsw;       /** involuntary " */
}

Feb 07 2012
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
Is void initialization not good enough?

IIRC it's something like:

ubyte[] buf = void;
Feb 07 2012
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:
 
 Is void initialization not good enough?
 
 IIRC it's something like:
 
 ubyte[] buf = void;

That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.

What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-02-07 21:37, Michel Fortin wrote:
 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;

That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.

What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.

What about GC.addRoot?

-- 
/Jacob Carlborg
Feb 07 2012
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2012-02-08 07:30:29 +0000, Jacob Carlborg <doob me.com> said:

 On 2012-02-07 21:37, Michel Fortin wrote:
 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:
 
 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:
 
 Is void initialization not good enough?
 
 IIRC it's something like:
 
 ubyte[] buf = void;

That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.

What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.

What about GC.addRoot ?

This is perfect if you want the GC to scan a memory range for pointers. But the GC can't track pointers pointing to the given memory range and notify you when that range is no longer referenced.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 08 2012
prev sibling parent Kapps <Kapps NotValidEmail.com> writes:
On 07/02/2012 2:11 PM, Nick Sabalausky wrote:
 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;

This example would be uninitializedArray!(ubyte[])(1024 * 1024). I would guess that it gives significantly better performance. There's also minimallyInitializedArray!(ubyte[])(1024 * 1024), which initializes only the bare data needed to be safe. They're both in std.array.
Feb 07 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;

That gives me a) no buffer, whose pointer is b) not initialized to null. What I want instead is a defined pointer to a valid array that is initialized to zero.

Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled', already-used memory. It has to be zeroed out manually, unless there were a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 07 Feb 2012 21:24:40 +0100, Marco Leise <Marco.Leise gmx.de> wrote:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;

That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.

You usually don't need initialized memory from the GC. The calloc behavior is still pretty interesting.
Feb 07 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 7 February 2012 19:39, Marco Leise <Marco.Leise gmx.de> wrote:
 [...]
What about these functions?

import std.array;

ubyte[][ALLOCS] b, c;

writeln("** uninitializedArray!(ubyte[])(1024*1024)");
foreach (i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)");
foreach (i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Feb 07 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 07 Feb 2012 21:37:03 +0100, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?
  IIRC it's something like:
  ubyte[] buf = void;

That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.

What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.

A similar idea popped up to solve non-deterministic shared library unloading and/or mapped memory freeing. I don't think, though, that putting additional pressure on the GC is the right approach to resource management. The interface would be simple for non-overlapping ranges:

void GC.trackRange(void* p, size_t sz, void delegate() finalizer)
Feb 07 2012
prev sibling next sibling parent reply "F i L" <witte2008 gmail.com> writes:
Can't you just write a custom allocator using calloc for 
performance critical structures 
(http://dlang.org/memory.html#newdelete), or do what Iain said.
Feb 07 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/08/2012 12:01 AM, F i L wrote:
 Can't you just write a custom allocator using calloc for performance
 critical structures (http://dlang.org/memory.html#newdelete), or do what
 Iain said.

The solution with the best performance and least memory requirements obviously must be the default.
Feb 07 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 02/08/12 00:01, F i L wrote:
 Can't you just write a custom allocator using calloc for performance critical
structures (http://dlang.org/memory.html#newdelete), or do what Iain said.

That won't help - the compiler will defeat the optimization by "initializing" the area...

artur
Feb 07 2012
prev sibling next sibling parent "F i L" <witte2008 gmail.com> writes:
Artur Skawina wrote:
 That won't help - the compiler will defeat the optimization by 
 "initializing" the area...

I see.

Timon Gehr wrote:
 The solution with the best performance and least memory 
 requirements obviously must be the default.

No argument here. Only, if calloc is an all-around better allocation method than malloc, why is malloc even used?
Feb 07 2012
prev sibling next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 I'm not aware of any caveats, are there any?

The tests only cover a very small fraction of an unknown data structure: the allocation phase. Of course one can want to make a bad design run faster, especially if one needs to allocate 0.5 TB of main memory and can allocate it in 1 second instead of 98 seconds. But neglecting
1) the time for building a usable data structure of 0.5 TB main memory,
2) the time for querying this data structure for some element,
3) the time for inserting some element into this data structure,
4) the time for deleting some element from this data structure and
5) the amortized times of the actions in 3) and 4)
is at least not in accordance with engineering principles.

-manfred
Feb 07 2012
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 In D we allocate memory through the GC

 Let's assume, we have a program that allocates some buffers in
 advance, that it may not use fully.

 there is really no alternative to calloc.

1) calloc implements a strategy for allocating memory. If this strategy is useful for parts of a program, then the GC should be informed by the code and be capable of adapting its behavior to the requested strategy. This seems necessary because the null pattern is not the only way to initialize allocated memory.

2) According to your OP, the request for allocating memory can be disjointly partitioned into three:
a) memory that is guaranteed to be used,
b) memory that is guaranteed not to be used and
c) memory that might be used.

For the case that a + b exceeds available memory, should the coder predeclare a strategy, i.e. should the GC signal out of memory on request of the memory, or should the GC wait until that sum reaches some deadline?

-manfred
Feb 08 2012
parent Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 That sounds a bit vague.

Andrei has written a paper on allocation:
http://erdani.com/publications/cuj-2005-12.pdf

-manfred
Feb 08 2012
prev sibling next sibling parent "F i L" <witte2008 gmail.com> writes:
F i L wrote:
 Only, if calloc is an all-around better allocation method than 
 malloc, why is malloc even used?

Note-to-self: google before asking stupid questions...
Feb 07 2012
prev sibling next sibling parent Jose Armando Garcia <jsancio gmail.com> writes:
On Tue, Feb 7, 2012 at 5:39 PM, Marco Leise <Marco.Leise gmx.de> wrote:
 [...]
 How does it work? The operating system fakes the memory allocation and
 just gives us 131072 references to a special read-only memory page of
 zeroes. The semantic is copy-on-write. So we start with a view on zeroed
 out memory and get the real thing once we write into it.

Special? What do you mean by special? Most OSes use virtual memory, so sure, they can hand out a page that is not yet backed by physical memory. To my knowledge, they can only do this if the allocated memory points to an unmapped page. I doubt this is the case in non-trivial programs.
Feb 07 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 07.02.2012, 22:15 Uhr, schrieb Iain Buclaw <ibuclaw ubuntu.com>:

   o  zero out a memory block     <---------- !!!


 What about these functions?


 import std.array;

 byte[][ALLOCS] a, b;

 writeln("** uninitializedArray!(ubyte[])(1024*1024)");
 foreach(i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024);
 prev = print_ressources(&prev);

 writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)");
 foreach(i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024  
 * 1024);
 prev = print_ressources(&prev);



 Regards

Yeah, I know these and used them, only to find out that the program I am porting relies on the data being zeroed out (it is a certain type of compression utility). It's a shame to see the C version get away with a low memory footprint _and_ zero-initialized buffers. I'll just copy the 'Array' class from the original then, which uses calloc and free.
Feb 07 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 04:40 Uhr, schrieb Jose Armando Garcia <jsancio gmail.com>:

 Special? What do you mean by special? Most OS use Virtual Memory so
 sure they can say here is a page and yet not have that page backed by
 physical memory. To my knowledge, they can only do this if the
 allocated memory points to an unmapped page. I doubt this is the case
 in non-trivial programs.

Yes, special. Like, say, you fork a process on Unix. Instead of copying all the used memory, the processes share the same pages in read-only, copy-on-write mode. As soon as one of the processes writes to a memory page, it is duplicated. The same is true for the 'zero page', just that there is one for the whole system and it can have a zillion references to it:

http://en.wikipedia.org/wiki/Copy-on-write  <- look for calloc

So the page can even be mapped and read without increasing the memory footprint, as I understand it.
Feb 07 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 04:37 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 I'm not aware of any caveats, are there any?

The tests only cover a very small fraction of an unknown data structure: the allocation phase. Of course one can want to make a bad design running faster. Especially if one need to allocate 0.5 TB main memory and can allocate this in 1 second instead of 98 seconds. But neglecting 1) the time for building a usable data structure of 0.5 TB main memory, 2) the time for quering this data structure for some element, 3) the time for inserting some element into this data structure, 4) the time for deleting some element from this data structure and 5) the amortized times of the actions in 3) and 4) is at least not in accordance with engineering principles. -manfred

I meant caveats with calloc vs. other allocators. I don't want to discuss the design of compression algorithms that allocate a fixed-size buffer in advance. I am porting one of those to D, and of the many data structures it allocates with calloc, there are some that may not be fully filled when the program exits. Yet the user notices a start-up delay in the D version for the memory allocation and initialization, as well as an instant jump to the maximum memory use. The C++ version starts instantly and maps memory pages on demand. The algorithm relies on some of the buffers being zeroed out, so there is really no alternative to calloc.
Feb 07 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 14:30 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 In D we allocate memory through the GC

 Let's assume, we have a program that allocates some buffers in
 advance, that it may not use fully.

 there is really no alternative to calloc.

1) calloc implements a strategie for allocating memory. If this strategy is usefull for parts of a program, then the GC should be informed from the code and be capable to adapt its behavior to the requested strategy.

That sounds a bit vague. If I understand you correctly, you would implement this as a hint flag, like

GC.allocHint = AllocationHint.lazyZero;
auto arr = new ubyte[…];
 This seems necessary because there is not only the null pattern to
 initialize allocated memory.

 2)
 According to your OP the request for allocating memory can be
 disjoint partitioned into three:
 a) memory that is guaranteed to be used
 b) memory that is guaranteed not to be used and
 c) memory that might be used.

 For the case that a + b exceeds available memory, should the coder
 predeclare a strategy, i.e. should the GC signal out of memory on
 request of the memory or should the GC wait until that sum reaches
 some deadline?

 -manfred

Here I can't follow you. The request to allocate memory contains memory regions that are guaranteed not to be used? Why would I request them then? Also, what is your definition of available memory? Available RAM, RAM + swap, or available virtual memory going beyond what the operating system can commit, resulting in random killing of applications (http://linux-mm.org/OOM_Killer)?

I don't see where the GC enters the picture, since all this 'may use memory, but don't commit on it' is handled solely by the OS. I could only imagine a function in the GC that always allocates a new chunk of memory and does that through calloc - or the hammer method: all allocations use calloc and some of the manual memset(..., 0, ...) is removed.

-- Marco
Feb 08 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 03:56 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 That sounds a bit vague.

Andrei has written a paper on allocation: http://erdani.com/publications/cuj-2005-12.pdf -manfred

Oh ok, that's farther than I ever dug into memory management. My current problem with the compression utility has been solved at a small scale with a manually managed 'ZeroInitializedAndAlignedBuffer' struct that uses calloc - right in the spirit of the original source code. The startup time was reduced by ~0.7 seconds and is now relatively close to the original; at least there is no longer that perceived delay. Techniques like calloc make it possible to use the algorithm in other places like batch processing, compressed FUSE file systems on Linux, or libraries, where new instances may be spawned in quick succession.

All technical details aside, I would wish for some solution to "I spawn a new instance of some huge complex data structure, but I probably won't need all of it, so use calloc". I don't know what the situation with other D programs is, but I would be interested in some larger-scale experiments with calloc and app startup time. It may be worse in some cases, but what if it improved the general situation - without writing a memory management library? I don't know the GC internals well enough to tell whether calloc is already used somewhere (in place of malloc and memset).
Feb 09 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
I guess, calloc will reuse blocks too, so if you run the 
compressing function twice, it will reuse the memory block used 
and freed previously and zero it out honestly.
Feb 09 2012
prev sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 11:55 Uhr, schrieb Kagamin <spam here.lot>:

 I guess, calloc will reuse blocks too, so if you run the compressing  
 function twice, it will reuse the memory block used and freed previously  
 and zero it out honestly.

You don't understand how it works. calloc gives you exactly 0 KB of physical memory. There is nothing to zero out :) There is one page, a block of 4096 bytes somewhere in the kernel, that is all zeroes and read-only. If you allocate memory, you get a bunch of references to it - or in other words, a zillion views on the same 4096 bytes repeating over and over. Only once you need to write to it will what you describe happen: the zero page is 'copied' into some (probably previously freed) page. This is equivalent to zeroing out the target page. The main difference is that the zeroing out happens with a 'lazy' keyword attached to it.
Feb 09 2012