digitalmars.D - why allocation of large amount of small objects so slow (x10) in D?

nobody (33/33) May 21 2009 $ g++ alloc.cpp -o alloc

Leandro Lucarella (14/54) May 21 2009 You mean why GCC C++ compiler is faster than DM D compiler. You are
dsimcha (6/39) May 21 2009 You've probably hit a corner case for the garbage collector. When you a...

nobody (19/24) May 21 2009 Thanks. Disble gc improve the speed by 5x:

Sean Kelly (17/47) May 21 2009 Try:

Robert Fraser (102/142) May 21 2009 I use this a structure for arena-based memory allocation (attached).

Robert Fraser (3/164) May 21 2009 Oops, that needs another module. Okay, both are attached in a

bearophile (17/17) May 22 2009 nobody, you can see how the GC interacts with your code also with the fo...

nobody <no where.com> writes:

$ g++     alloc.cpp   -o alloc
$ time ./alloc
real    0m1.946s
user    0m1.688s
sys     0m0.256s

$ dmd -O -release allocd.d
$ time ./allocd
real    0m22.734s
user    0m22.353s
sys     0m0.360s

$ cat alloc.cpp
#include <vector>

typedef std::vector<int> intvec;
typedef intvec* intvecp;

int main() {
  int i, n = 20000000;
  intvecp* iva;
  iva = new intvecp[n];
  for (i = n; i-- > 0; ) {
    iva[i] = new intvec();
  }

  return 0;
}

$ cat allocd.d
int main() {
  int i, n = 20000000;
  Object[] oa;
  oa = new Object[n];
  for (i = n; i-- > 0; ) {
    oa[i] = new Object();
  }

  return 0;
}

May 21 2009

Leandro Lucarella <llucax gmail.com> writes:

nobody, el 22 de mayo a las 00:08 me escribiste:
 $ g++     alloc.cpp   -o alloc
 $ time ./alloc
 real    0m1.946s
 user    0m1.688s
 sys     0m0.256s
 
 $ dmd -O -release allocd.d
 $ time ./allocd
 real    0m22.734s
 user    0m22.353s
 sys     0m0.360s
 
 $ cat alloc.cpp
 #include <vector>
 
 typedef std::vector<int> intvec;
 typedef intvec* intvecp;
 
 int main() {
   int i, n = 20000000;
   intvecp* iva;
   iva = new intvecp[n];
   for (i = n; i-- > 0; ) {
     iva[i] = new intvec();
   }
 
   return 0;
 }
 
 $ cat allocd.d
 int main() {
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }
 
   return 0;
 }

You mean why GCC C++ compiler is faster than DM D compiler. You are
comparing 2 different compiler implementations for 2 different languages.
It's really hard to get any conclusion from that.

It would be much more meaningfull if you compare g++ vs gdc or dmd vs dmc
or ldc vs llvm-g++ (they at least use the same backend).

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Oiganmen ñatos de corazón, es más posible que un potus florezca en
primavera a que un ángel pase con una remera.
	-- Peperino Pómoro

May 21 2009

dsimcha <dsimcha yahoo.com> writes:

== Quote from nobody (no where.com)'s article
 $ g++     alloc.cpp   -o alloc
 $ time ./alloc
 real    0m1.946s
 user    0m1.688s
 sys     0m0.256s
 $ dmd -O -release allocd.d
 $ time ./allocd
 real    0m22.734s
 user    0m22.353s
 sys     0m0.360s
 $ cat alloc.cpp
 #include <vector>
 typedef std::vector<int> intvec;
 typedef intvec* intvecp;
 int main() {
   int i, n = 20000000;
   intvecp* iva;
   iva = new intvecp[n];
   for (i = n; i-- > 0; ) {
     iva[i] = new intvec();
   }
   return 0;
 }
 $ cat allocd.d
 int main() {
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }
   return 0;
 }

You've probably hit a corner case for the garbage collector.  When you allocate
20,000,000 Object instances and hold onto the references, the garbage collector
probably runs about a zillion times and never finds anything to delete.  If this
is a bottleneck, you should temporarily disable the garbage collector in these
situations, until you are done with all these allocations, then re-enable it.

May 21 2009

nobody <no where.com> writes:

 You've probably hit a corner case for the garbage collector.  When you allocate
 20,000,000 Object instances and hold onto the references, the garbage collector
 probably runs about a zillion times and never finds anything to delete.  If
this
 is a bottleneck, you should temporarily disable the garbage collector in these
 situations, until you are done with all these allocations, then re-enable it.

Thanks.  Disble gc improve the speed by 5x:

$ dmd -O -release allocd.d
$ time ./allocd
real    0m4.080s
user    0m3.644s
sys     0m0.420s


$ cat allocd.d
import core.memory;

int main() {
  core.memory.GC.disable();
  int i, n = 20000000;
  Object[] oa;
  oa = new Object[n];
  for (i = n; i-- > 0; ) {
    oa[i] = new Object();
  }
  core.memory.GC.enable();

  return 0;
}

May 21 2009

Sean Kelly <sean invisibleduck.org> writes:

nobody wrote:
 You've probably hit a corner case for the garbage collector.  When you allocate
 20,000,000 Object instances and hold onto the references, the garbage collector
 probably runs about a zillion times and never finds anything to delete.  If
this
 is a bottleneck, you should temporarily disable the garbage collector in these
 situations, until you are done with all these allocations, then re-enable it.

 
 Thanks.  Disble gc improve the speed by 5x:
 
 $ dmd -O -release allocd.d
 $ time ./allocd
 real    0m4.080s
 user    0m3.644s
 sys     0m0.420s
 
 
 $ cat allocd.d
 import core.memory;
 
 int main() {
   core.memory.GC.disable();
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }
   core.memory.GC.enable();
 
   return 0;
 }

Try:

import core.memory;

int main() {
   core.memory.GC.reserve(80000000);
   core.memory.GC.disable();
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }
   core.memory.GC.enable();

   return 0;
}

If the call to reserve fails you could try breaking it up into two calls 
for half the space each.

May 21 2009

Robert Fraser <fraserofthenight gmail.com> writes:

nobody wrote:
 $ g++     alloc.cpp   -o alloc
 $ time ./alloc
 real    0m1.946s
 user    0m1.688s
 sys     0m0.256s
 
 $ dmd -O -release allocd.d
 $ time ./allocd
 real    0m22.734s
 user    0m22.353s
 sys     0m0.360s
 
 $ cat alloc.cpp
 #include <vector>
 
 typedef std::vector<int> intvec;
 typedef intvec* intvecp;
 
 int main() {
   int i, n = 20000000;
   intvecp* iva;
   iva = new intvecp[n];
   for (i = n; i-- > 0; ) {
     iva[i] = new intvec();
   }
 
   return 0;
 }
 
 $ cat allocd.d
 int main() {
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }
 
   return 0;
 }

I use this a structure for arena-based memory allocation (attached).

Example of use:

import candy.util.MemPool

MemStack!() stack;

class MyObject
{
	mixin MemPoolNew!(stack);
}

int main()
{
	stack.push();
	int i, n = 20000000;
	MyObject[] oa;
	oa = new MyObject[n];
	for (i = n; i-- > 0; )
	{
		oa[i] = new MyObject();
	}
	stack.pop();
	return 0;
}

The push() and pop() allows memory to be allocated and deallocated as 
large blocks. However, you shouldn't need to deallocate manually -- it's 
GCed memory, so ideally the GC should free it when it's no longer 
referenced. That being said, I've run some tests, and the GC will free 
it *eventually*, but it allocates 4-6x as much memory as it needs before 
it starts freeing it, even when GC.collect() is called manually.

--------------------

/**
  * Provides a pool of GCed memory to allocate things from a block.
  * This maintains cache coherency for related types (i.e. tree nodes).
  * It doesn't garuntee any ordering, though, the array struct should be
  * used for that. Also, everything has to be freed at once, freeing one
  * portion of this has no effect.
  *
  * Based on a similar concept posted by bearophile at:
  * 
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=88227
  */
public struct MemPool(size_t BLOCK_SIZE = 1 << 14)
{
     private void* next; // Next available block
     private void* end;  // End of the current block
     private void*[] blocks;

     public void* alloc(size_t sz)
     {
     	sz = ((sz + 7) & ~7); // Next multiple of 8 if this isn't a 
multiple of 8
     	if (this.next + sz >= this.end)
     	{
     		void* blk = GC.calloc(BLOCK_SIZE);
     		this.blocks.length = this.blocks.length + 1;
     		this.blocks[$ - 1] = blk;
     		this.next = blk;
     		this.end = blk + BLOCK_SIZE;
         }

         void* ret = this.next;
         this.next += sz;
         return ret;
     }

     public void free()
     {
     	foreach(blk; this.blocks)
     		GC.free(blk);
     	this.blocks = null;
     	this.blocks.length = 0;
     	this.next = null;
     	this.end = null;
     }
}

/**
  * Wrapper for MemPool that allocates the given struct
  */
public struct StructPool(T)
{
	private MemPool!() pool;
	public T* alloc()  { return cast(T*) pool.alloc(T.sizeof); }
}

public struct MemStack(size_t BLOCK_SIZE = 1 << 14)
{
	private Stack!(MemPool!(BLOCK_SIZE)*, 16, true, true) stack;
	public static const size_t MAX_ALLOC = BLOCK_SIZE;
	
	public void* alloc(size_t sz) { return stack.peek().alloc(sz);          }
	public void  push()           { stack.push(new MemPool!(BLOCK_SIZE));   }
	public void  pop()            { stack.pop().free();                     }
}

/**
  * Placement new mixin for allocating from a memory pool. Benchmarks 
show this
  * as faster than the D new in real usage (i.e. the parser runs about 1.2x
  * faster using this).
  */
public template MemPoolNew(alias Pool)
{
	version(NoMemPool) { } else
	{
		public final new(uint sz)    { return Pool.alloc(sz); }
		public final delete(void *p) {                        }
	}
}

May 21 2009

Robert Fraser <fraserofthenight gmail.com> writes:

Robert Fraser wrote:
 nobody wrote:
 $ g++     alloc.cpp   -o alloc
 $ time ./alloc
 real    0m1.946s
 user    0m1.688s
 sys     0m0.256s

 $ dmd -O -release allocd.d
 $ time ./allocd
 real    0m22.734s
 user    0m22.353s
 sys     0m0.360s

 $ cat alloc.cpp
 #include <vector>

 typedef std::vector<int> intvec;
 typedef intvec* intvecp;

 int main() {
   int i, n = 20000000;
   intvecp* iva;
   iva = new intvecp[n];
   for (i = n; i-- > 0; ) {
     iva[i] = new intvec();
   }

   return 0;
 }

 $ cat allocd.d
 int main() {
   int i, n = 20000000;
   Object[] oa;
   oa = new Object[n];
   for (i = n; i-- > 0; ) {
     oa[i] = new Object();
   }

   return 0;
 }

 
 I use this a structure for arena-based memory allocation (attached).
 
 Example of use:
 
 import candy.util.MemPool
 
 MemStack!() stack;
 
 class MyObject
 {
     mixin MemPoolNew!(stack);
 }
 
 int main()
 {
     stack.push();
     int i, n = 20000000;
     MyObject[] oa;
     oa = new MyObject[n];
     for (i = n; i-- > 0; )
     {
         oa[i] = new MyObject();
     }
     stack.pop();
     return 0;
 }
 
 The push() and pop() allows memory to be allocated and deallocated as 
 large blocks. However, you shouldn't need to deallocate manually -- it's 
 GCed memory, so ideally the GC should free it when it's no longer 
 referenced. That being said, I've run some tests, and the GC will free 
 it *eventually*, but it allocates 4-6x as much memory as it needs before 
 it starts freeing it, even when GC.collect() is called manually.
 
 --------------------
 
 /**
  * Provides a pool of GCed memory to allocate things from a block.
  * This maintains cache coherency for related types (i.e. tree nodes).
  * It doesn't garuntee any ordering, though, the array struct should be
  * used for that. Also, everything has to be freed at once, freeing one
  * portion of this has no effect.
  *
  * Based on a similar concept posted by bearophile at:
  * 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmar
.D&article_id=88227 
 
  */
 public struct MemPool(size_t BLOCK_SIZE = 1 << 14)
 {
     private void* next; // Next available block
     private void* end;  // End of the current block
     private void*[] blocks;
 
     public void* alloc(size_t sz)
     {
         sz = ((sz + 7) & ~7); // Next multiple of 8 if this isn't a 
 multiple of 8
         if (this.next + sz >= this.end)
         {
             void* blk = GC.calloc(BLOCK_SIZE);
             this.blocks.length = this.blocks.length + 1;
             this.blocks[$ - 1] = blk;
             this.next = blk;
             this.end = blk + BLOCK_SIZE;
         }
 
         void* ret = this.next;
         this.next += sz;
         return ret;
     }
 
     public void free()
     {
         foreach(blk; this.blocks)
             GC.free(blk);
         this.blocks = null;
         this.blocks.length = 0;
         this.next = null;
         this.end = null;
     }
 }
 
 /**
  * Wrapper for MemPool that allocates the given struct
  */
 public struct StructPool(T)
 {
     private MemPool!() pool;
     public T* alloc()  { return cast(T*) pool.alloc(T.sizeof); }
 }
 
 public struct MemStack(size_t BLOCK_SIZE = 1 << 14)
 {
     private Stack!(MemPool!(BLOCK_SIZE)*, 16, true, true) stack;
     public static const size_t MAX_ALLOC = BLOCK_SIZE;
     
     public void* alloc(size_t sz) { return 
 stack.peek().alloc(sz);          }
     public void  push()           { stack.push(new 
 MemPool!(BLOCK_SIZE));   }
     public void  pop()            { 
 stack.pop().free();                     }
 }
 
 /**
  * Placement new mixin for allocating from a memory pool. Benchmarks 
 show this
  * as faster than the D new in real usage (i.e. the parser runs about 1.2x
  * faster using this).
  */
 public template MemPoolNew(alias Pool)
 {
     version(NoMemPool) { } else
     {
         public final new(uint sz)    { return Pool.alloc(sz); }
         public final delete(void *p) {                        }
     }
 }

Oops, that needs another module. Okay, both are attached in a 
compile-able form.

May 21 2009

bearophile <bearophileHUGS lycos.com> writes:

nobody, you can see how the GC interacts with your code also with the following
small D1 program:

import std.c.stdlib: malloc;
struct S { int i; }
void main() {
    const N = 20_000_000;

    static if (true) // change this to false
        S** sa = cast(S**)calloc(N, (S*).sizeof);
    else
        auto sa = new S*[N];

    for (int i = N; i-- > 0; )
        sa[i] = new S;
}

(Here I have assumed that null == 0, this isn't true on all CPUs).
If you change the static if to false you will see the program become (on
Windows) about 2.5 times slower (if you use malloc it gets even faster, but
it's not a fair comparison, because new clears the array).
If you add a disable() before the array creation (and the static if is false),
the code gets significantly faster.

Bye,
bearophile

May 22 2009

D Programming

C/C++ Programming

Other

digitalmars.D - why allocation of large amount of small objects so slow (x10) in D?