www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - GC performances

reply bearophile <bearophileHUGS lycos.com> writes:
This Shootout page shows an interesting comparison between Java6 and D:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

As you can see there are two D versions, here I have written the slow one (the
fast D version uses manual memory management (MMM), the slow version leaves it
to the built-in GC). I think my version is equal to the faster solution, that
run on Java (it uses automatic GC), you can see both versions here:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

The interesting thing is that despite the code being essentially the same, the
D version run 9.1 times slower than the Java one.
Some notes:
- This shootout test is designed to stress the GC.
- From my experience on this demo, I think that speed difference (between that
Java and the slow D version) is caused mostly by the GC, so the D GC is quite
worse than the Java6 one.
- There the Java6 GC receives some hints from the programmer, those "-server
-Xms64m" (but not -Xbatch, it disables background compilation) on the command
line, I presume the D GC may someday accept similar hints.
- You may take a look at he memory used too: Java6 with those GC hints: 40.3 MB
(the -Xms64m flag sets the initial Java heap size to 64 MB), D MMM: 4.7 MB, D
with GC: 17.6 MB.
- The Java code shows how most people writes code, because manually managing
memory is okay in a little benchmark like this, but if you have a very large
system it's not easy to manage every memory usage by hand.
- There are even faster ways to solve this problem using D, I have seen that
using a free list the code becomes even faster than the MMM one. And there are
ways to speed that up too a lot.
- Today the source code of Java6 can be read, it's open source, so some of the
ideas used by GC can be borrowed to the D GC (but it may require some work).

Bye,
bearophile
Nov 26 2007
next sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
Another interesting thing here is that the Java version beats all other 
languages too, including the ones with manual memory management. But it 
does cost a lot of memory.

I would guess that this is an ideal case for a copying garbage 
collector? I wonder how much of an improvement such a GC will have in 
normal applications, and how hard it will be to implement in D. Another 
concern is what changes are needed in client code to work nicely with a 
copying GC.

But this example does show how relative these microbenchmarks are. After 
all, it costs the Java version more than ten times the memory to 
outperform D, how will that affect the speed of a complex application?

btw. I tried implementing this with a freelist, as suggested by the 
article at the digitalmars site, but it was slower than manual memory 
management. Might be because the code for allocation wasn't inlined for 
some reason.

Likely not the intent of the benchmark, but an array based 
implementation could be faster here.
Nov 26 2007
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Lutger Wrote:

 Another interesting thing here is that the Java version beats all other 
 languages too, including the ones with manual memory management.

Because the MMM is done in a vey naive way. See at the bottom of the page for faster versions: http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all#alt
 But this example does show how relative these microbenchmarks are. After 
 all, it costs the Java version more than ten times the memory to 
 outperform D, how will that affect the speed of a complex application?

At the moment people use Java on big-iron servers, not D. Programming efforts often cost less than the RAM, it seems.
 btw. I tried implementing this with a freelist, as suggested by the 
 article at the digitalmars site, but it was slower than manual memory 
 management. Might be because the code for allocation wasn't inlined for 
 some reason. 

Try beating this on the last DMD: static import std.gc; import std.c.stdlib: malloc; import std.conv: toInt; struct TyTreeNode { TyTreeNode* left, right; int item; } static TyTreeNode* freelist; TyTreeNode* newTreeNode(int item, TyTreeNode* left=null, TyTreeNode* right=null) { TyTreeNode* t = freelist; freelist = freelist.left; t.left = left; t.right = right; t.item = item; return t; } int itemCheck(TyTreeNode* tree) { if (tree.left is null) return tree.item; else return tree.item + itemCheck(tree.left) - itemCheck(tree.right); } TyTreeNode* makeTree(int item, uint depth) { if (depth > 0) return newTreeNode(item, makeTree(2*item-1, depth-1), makeTree(2*item, depth-1)); else return newTreeNode(item); } void deleteTree(TyTreeNode* tree) { if (tree.left !is null) { deleteTree(tree.left); deleteTree(tree.right); } tree.left = freelist; freelist = tree; } void main(string[] args) { std.gc.disable(); const minDepth = 4; int n = (args.length > 1) ? toInt(args[1]) : 2; int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n; const int nnodes = ((1 << 16) - 1) / TyTreeNode.sizeof; const int block_len = nnodes > 0 ? nnodes : 1; uint m = n > 6 ? (1<<(n+2)) : 255; while (m > 0) { uint toAllocate = m > block_len ? block_len : m; m -= toAllocate; auto pool = cast(TyTreeNode*)malloc(TyTreeNode.sizeof * toAllocate); pool[toAllocate-1].left = freelist; freelist = pool; for (uint i = 0; i < toAllocate-1; i++) pool[i].left = &(pool[i+1]); } auto stretchTree = makeTree(0, maxDepth + 1); printf("stretch tree of depth %u\t check: %li\n", maxDepth + 1, itemCheck(stretchTree)); deleteTree(stretchTree); auto longLivedTree = makeTree(0, maxDepth); for (int depth = minDepth; depth <= maxDepth; depth += 2) { int iterations = 1 << (maxDepth - depth + minDepth); int check = 0; for (int i = 1; i <= iterations; i++) { auto tempTree = makeTree(i, depth); check += itemCheck(tempTree); deleteTree(tempTree); tempTree = makeTree(-i, depth); check += itemCheck(tempTree); deleteTree(tempTree); } printf("%li\t trees of depth %u\t check: %li\n", iterations*2, depth, check); } printf("int lived tree of depth %u\t check: %li\n", maxDepth, itemCheck(longLivedTree) ); }
 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

It seems slower. Bye, bearophile
Nov 26 2007
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 Programming efforts often cost less than the RAM, it seems.

Read that as: Programming efforts often cost more than the RAM, it seems.
Nov 26 2007
prev sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
bearophile wrote:
 Lutger Wrote:

 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

It seems slower. Bye, bearophile

I've hacked up an array based implementation without recursion and allocating whole trees at once. Although it produces the same results, it's probably flawed somewhere or the profiling is wrong I think. Anyway it runs much faster, perhaps you care to take a look. I'm not so good at these things: import std.stdio, std.conv; import std.math; int left(int i) { return i * 2; } int right(int i) { return ++i * 2; } int parent(int i) { return i / 2; } struct Tree { static Tree bottumUpTree(int item, int depth) { Tree result; result.depth = depth; if (depth > 0) { auto len = pow(depth, cast(real)1); result.children.length = cast(uint)len; } else result.children.length = 1; result.children[0] = item; for (int i = 1; i < result.children.length; i+=2) { result.children[i] = (result.children[parent(i)] * 2) - 1; result.children[i + 1] = result.children[parent(i)] * 2; } return result; } int depth; int check() { int result = children[0]; auto copy = children.dup; for (int i = children.length - ((depth * 2) -1); i < children.length; i+=2) { auto p = parent(i); copy[p] += children[i] - children[i+1]; } for (int i = children.length - ((depth * 2) - 2); i > 0; i-=2) { auto p = parent(i); copy[p] += children[i-1] - children[i]; } result+= copy[1] - copy[2]; return result; } void deleteTree() { delete children; } int[] children; private int item; } void main(string[] args) { const minDepth = 4; int n = (args.length > 1) ? toInt(args[1]) : 2; int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n; auto t = Tree.bottumUpTree(0, maxDepth + 1); writefln("stretch tree of depth ", maxDepth + 1, "\t check: ", t.check()); t.deleteTree(); auto longLivedTree = Tree.bottumUpTree(0, maxDepth); for (int depth = minDepth; depth <= maxDepth; depth += 2) { int check; int iterations = 1 << (maxDepth - depth + minDepth); for (int i = 1; i <= iterations; i++) { auto t1 = Tree.bottumUpTree(i, depth); check += t1.check(); t1.deleteTree(); auto t2 = Tree.bottumUpTree(-i, depth); check += t2.check(); t2.deleteTree(); } writefln(iterations * 2, "\t trees of depth ", depth, "\t check: ", check); } writefln("long lived tree of depth ", maxDepth, "\t check: ", longLivedTree.check); longLivedTree.deleteTree(); }
Nov 26 2007
next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Lutger wrote:
...
 int right(int i) { return ++i * 2; }

Oops, that's wrong of course, but not used in this benchmark.
Nov 26 2007
prev sibling parent reply "Craig Black" <cblack ara.com> writes:
"Lutger" <lutger.blijdestijn gmail.com> wrote in message 
news:fiekma$otu$1 digitalmars.com...
 bearophile wrote:
 Lutger Wrote:

 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

It seems slower. Bye, bearophile

I've hacked up an array based implementation without recursion and allocating whole trees at once. Although it produces the same results, it's probably flawed somewhere or the profiling is wrong I think. Anyway it runs much faster, perhaps you care to take a look. I'm not so good at these things: import std.stdio, std.conv; import std.math; int left(int i) { return i * 2; } int right(int i) { return ++i * 2; } int parent(int i) { return i / 2; } struct Tree { static Tree bottumUpTree(int item, int depth) { Tree result; result.depth = depth; if (depth > 0) { auto len = pow(depth, cast(real)1); result.children.length = cast(uint)len; } else result.children.length = 1; result.children[0] = item; for (int i = 1; i < result.children.length; i+=2) { result.children[i] = (result.children[parent(i)] * 2) - 1; result.children[i + 1] = result.children[parent(i)] * 2; } return result; } int depth; int check() { int result = children[0]; auto copy = children.dup; for (int i = children.length - ((depth * 2) -1); i < children.length; i+=2) { auto p = parent(i); copy[p] += children[i] - children[i+1]; } for (int i = children.length - ((depth * 2) - 2); i > 0; i-=2) { auto p = parent(i); copy[p] += children[i-1] - children[i]; } result+= copy[1] - copy[2]; return result; } void deleteTree() { delete children; } int[] children; private int item; } void main(string[] args) { const minDepth = 4; int n = (args.length > 1) ? toInt(args[1]) : 2; int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n; auto t = Tree.bottumUpTree(0, maxDepth + 1); writefln("stretch tree of depth ", maxDepth + 1, "\t check: ", t.check()); t.deleteTree(); auto longLivedTree = Tree.bottumUpTree(0, maxDepth); for (int depth = minDepth; depth <= maxDepth; depth += 2) { int check; int iterations = 1 << (maxDepth - depth + minDepth); for (int i = 1; i <= iterations; i++) { auto t1 = Tree.bottumUpTree(i, depth); check += t1.check(); t1.deleteTree(); auto t2 = Tree.bottumUpTree(-i, depth); check += t2.check(); t2.deleteTree(); } writefln(iterations * 2, "\t trees of depth ", depth, "\t check: ", check); } writefln("long lived tree of depth ", maxDepth, "\t check: ", longLivedTree.check); longLivedTree.deleteTree(); }

Here's an array-based C++ implementation from the shootout. It performs the best but was disqualified because it allocates everything in one big chunk and then uses placement new to allocate each node. /* The Computer Language Shootout * http://shootout.alioth.debian.org/ * Contributed by Paul Kitchin */ #include <iostream> #include <sstream> class Tree { struct Node { Node(int value, std::size_t depth, std::size_t index, Node * nodes, std::size_t max) : value(value) { if (index * 2 + 2 < max) { new (nodes + index * 2 + 1) Node(2 * value - 1, depth - 1, index * 2 + 1, nodes, max); new (nodes + index * 2 + 2) Node(2 * value, depth - 1, index * 2 + 2, nodes, max); } } int check(std::size_t index, Node * nodes, std::size_t max) const { if (index * 2 + 2 < max) { return nodes[index * 2 + 1].check(index * 2 + 1, nodes, max) + value - nodes[index * 2 + 2].check(index * 2 + 2, nodes, max); } return value; } int value; }; public: Tree(int value, std::size_t depth) : size((2 << depth) - 1), nodes(static_cast< Node * >(operator new(size * sizeof(Node)))) { new (nodes) Node(value, depth, 0, nodes, size); } ~Tree() { operator delete(nodes); } int check() const { return nodes->check(0, nodes, size); } private: std::size_t size; Node * nodes; }; int main(int argc, char * * argv) { std::size_t user_depth = 10; if (argc == 2) { std::istringstream convertor(argv[1]); if (!(convertor >> user_depth) || !convertor.eof()) { std::cerr << "Usage: " << argv[0] << " [n]\n"; std::cerr << "\tn must be an integer\n"; return 1; } } std::size_t minimum_depth = 4; std::size_t maximum_depth = std::max(minimum_depth + 2, user_depth); { Tree tree(0, maximum_depth + 1); std::cout << "stretch tree of depth " << (maximum_depth + 1) << "\t check: " << tree.check() << '\n'; } Tree long_lived_tree(0, maximum_depth); for (std::size_t depth = minimum_depth; depth <= maximum_depth; depth += 2) { int iterations = 1 << (maximum_depth - depth + minimum_depth); int check = 0; for (int iteration = 1; iteration <= iterations; ++iteration) { Tree first(iteration, depth); Tree second(-iteration, depth); check += first.check() + second.check(); } std::cout << (2 * iterations) << "\t trees of depth " << depth << "\t check: " << check << '\n'; } std::cout << "long lived tree of depth " << maximum_depth << "\t check: " << long_lived_tree.check() << '\n'; }
Nov 26 2007
parent reply bearophile <bearophileHUGS lycos.com> writes:
Craig Black:

 Here's an array-based C++ implementation from the shootout.  It performs the 
 best but was disqualified because it allocates everything in one big chunk 
 and then uses placement new to allocate each node.

Allocating huge chunks is slow on Win, look at my better version here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62373 Bye, bearophile
Nov 26 2007
parent Lutger <lutger.blijdestijn gmail.com> writes:
bearophile wrote:
 Craig Black:
 
 Here's an array-based C++ implementation from the shootout.  It performs the 
 best but was disqualified because it allocates everything in one big chunk 
 and then uses placement new to allocate each node.

Allocating huge chunks is slow on Win, look at my better version here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62373 Bye, bearophile

The code I posted uses an approach similar to the C++ version. There are some mistakes in there, after fixing those it runs about 6 times as fast as the code you posted. (dmd 1.022, windows xp, 20 iterations), and could be optimized further. But it's not the same benchmark.
Nov 26 2007
prev sibling parent reply BCS <ao pathlink.com> writes:
Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.
 
 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.
 
 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?
 
 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.
 
 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.
 

I wonder how fast it would be to just malloc a huge block of ram and use this for your memeory allocator: void* hugeBlock; //void* end; T* Allocate(T)() { T* ret = cast(T*) hugeBlock; hugeBlock = cast(void*) (&ret[1]); //if(end < hugeBlock) *(cast(int*)null) = 0; return ret; } as long as you don't run out of ram, it will be real fast. :)
Nov 26 2007
parent reply "Craig Black" <cblack ara.com> writes:
"BCS" <ao pathlink.com> wrote in message 
news:ce0a33432639b8c9fe730ad8498a news.digitalmars.com...
 Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.

 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.

 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?

 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.

 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.

I wonder how fast it would be to just malloc a huge block of ram and use this for your memeory allocator: void* hugeBlock; //void* end; T* Allocate(T)() { T* ret = cast(T*) hugeBlock; hugeBlock = cast(void*) (&ret[1]); //if(end < hugeBlock) *(cast(int*)null) = 0; return ret; } as long as you don't run out of ram, it will be real fast. :)

Yes this is very fast but causes huge amounts of fragmentation and thus inefficient use of memory.
Nov 27 2007
parent Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 "BCS" <ao pathlink.com> wrote in message 
 news:ce0a33432639b8c9fe730ad8498a news.digitalmars.com...
 Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.

 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.

 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?

 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.

 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.

this for your memeory allocator: void* hugeBlock; //void* end; T* Allocate(T)() { T* ret = cast(T*) hugeBlock; hugeBlock = cast(void*) (&ret[1]); //if(end < hugeBlock) *(cast(int*)null) = 0; return ret; } as long as you don't run out of ram, it will be real fast. :)

Yes this is very fast but causes huge amounts of fragmentation and thus inefficient use of memory.

It's not even necessarily that fast. In the research I've seen, custom allocators are almost always outperformed by the default allocator for general use. Sean
Nov 27 2007
prev sibling next sibling parent Kenny TM~ <kennytm gmail..com> writes:
bearophile wrote:
 This Shootout page shows an interesting comparison between Java6 and D:
 
 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all
 
 As you can see there are two D versions, here I have written the slow one (the
fast D version uses manual memory management (MMM), the slow version leaves it
to the built-in GC). I think my version is equal to the faster solution, that
run on Java (it uses automatic GC), you can see both versions here:
 
 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2
 
 The interesting thing is that despite the code being essentially the same, the
D version run 9.1 times slower than the Java one.
 Some notes:
 - This shootout test is designed to stress the GC.
 - From my experience on this demo, I think that speed difference (between that
Java and the slow D version) is caused mostly by the GC, so the D GC is quite
worse than the Java6 one.
 - There the Java6 GC receives some hints from the programmer, those "-server
-Xms64m" (but not -Xbatch, it disables background compilation) on the command
line, I presume the D GC may someday accept similar hints.
 - You may take a look at he memory used too: Java6 with those GC hints: 40.3
MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D MMM: 4.7 MB,
D with GC: 17.6 MB.
 - The Java code shows how most people writes code, because manually managing
memory is okay in a little benchmark like this, but if you have a very large
system it's not easy to manage every memory usage by hand.
 - There are even faster ways to solve this problem using D, I have seen that
using a free list the code becomes even faster than the MMM one. And there are
ways to speed that up too a lot.
 - Today the source code of Java6 can be read, it's open source, so some of the
ideas used by GC can be borrowed to the D GC (but it may require some work).
 
 Bye,
 bearophile

Aside, if the (your) code is compiled with Tango instead the code runs 23% faster. Of course still not comparable with the 9x performance of Java. Time elapsed / seconds (parameter = 16) ---------------------------- D2+Phobos: 13.3828 & 13.5043 D1+Phobos: 13.5441 & 13.4825 D1+Tango : 10.1788 & 9.98871 -- Kenny.
Nov 26 2007
prev sibling parent reply "Craig Black" <cblack ara.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:fie98d$8ed$1 digitalmars.com...
 This Shootout page shows an interesting comparison between Java6 and D:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

 As you can see there are two D versions, here I have written the slow one 
 (the fast D version uses manual memory management (MMM), the slow version 
 leaves it to the built-in GC). I think my version is equal to the faster 
 solution, that run on Java (it uses automatic GC), you can see both 
 versions here:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

 The interesting thing is that despite the code being essentially the same, 
 the D version run 9.1 times slower than the Java one.
 Some notes:
 - This shootout test is designed to stress the GC.
 - From my experience on this demo, I think that speed difference (between 
 that Java and the slow D version) is caused mostly by the GC, so the D GC 
 is quite worse than the Java6 one.
 - There the Java6 GC receives some hints from the programmer, those 
 "-server -Xms64m" (but not -Xbatch, it disables background compilation) on 
 the command line, I presume the D GC may someday accept similar hints.
 - You may take a look at he memory used too: Java6 with those GC hints: 
 40.3 MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D 
 MMM: 4.7 MB, D with GC: 17.6 MB.
 - The Java code shows how most people writes code, because manually 
 managing memory is okay in a little benchmark like this, but if you have a 
 very large system it's not easy to manage every memory usage by hand.
 - There are even faster ways to solve this problem using D, I have seen 
 that using a free list the code becomes even faster than the MMM one. And 
 there are ways to speed that up too a lot.
 - Today the source code of Java6 can be read, it's open source, so some of 
 the ideas used by GC can be borrowed to the D GC (but it may require some 
 work).

 Bye,
 bearophile

The modern streamlined Java GC may win against an archaic manual memory allocator. However, I don't think it could possibly outperform a modern manual allocator like nedmalloc. I think that D should include nedmalloc in the standard library. This would put D at the top of the list in this benchmark.
Nov 26 2007
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Craig Black:

 The modern streamlined Java GC may win against an archaic manual memory 
 allocator.  However, I don't think it could possibly outperform a modern 
 manual allocator like nedmalloc.  I think that D should include nedmalloc in 
 the standard library.  This would put D at the top of the list in this 
 benchmark.

There is this widely used one too, by Doug Lea: http://gee.cs.oswego.edu/pub/misc/malloc.c http://gee.cs.oswego.edu/pub/misc/malloc.h Bye, bearophile
Nov 26 2007
parent "Craig Black" <cblack ara.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:fieqlv$12lr$1 digitalmars.com...
 Craig Black:

 The modern streamlined Java GC may win against an archaic manual memory
 allocator.  However, I don't think it could possibly outperform a modern
 manual allocator like nedmalloc.  I think that D should include nedmalloc 
 in
 the standard library.  This would put D at the top of the list in this
 benchmark.

There is this widely used one too, by Doug Lea: http://gee.cs.oswego.edu/pub/misc/malloc.c http://gee.cs.oswego.edu/pub/misc/malloc.h Bye, bearophile

nedmalloc is based on Doug Lea's allocator I think. It improves on it and significantly outperforms it, especially when multiple threads are used.
Nov 26 2007
prev sibling parent reply janderson <askme me.com> writes:
Craig Black wrote:
 "bearophile" <bearophileHUGS lycos.com> wrote in message 
 news:fie98d$8ed$1 digitalmars.com...
 This Shootout page shows an interesting comparison between Java6 and D:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

 As you can see there are two D versions, here I have written the slow one 
 (the fast D version uses manual memory management (MMM), the slow version 
 leaves it to the built-in GC). I think my version is equal to the faster 
 solution, that run on Java (it uses automatic GC), you can see both 
 versions here:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

 The interesting thing is that despite the code being essentially the same, 
 the D version run 9.1 times slower than the Java one.
 Some notes:
 - This shootout test is designed to stress the GC.
 - From my experience on this demo, I think that speed difference (between 
 that Java and the slow D version) is caused mostly by the GC, so the D GC 
 is quite worse than the Java6 one.
 - There the Java6 GC receives some hints from the programmer, those 
 "-server -Xms64m" (but not -Xbatch, it disables background compilation) on 
 the command line, I presume the D GC may someday accept similar hints.
 - You may take a look at he memory used too: Java6 with those GC hints: 
 40.3 MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D 
 MMM: 4.7 MB, D with GC: 17.6 MB.
 - The Java code shows how most people writes code, because manually 
 managing memory is okay in a little benchmark like this, but if you have a 
 very large system it's not easy to manage every memory usage by hand.
 - There are even faster ways to solve this problem using D, I have seen 
 that using a free list the code becomes even faster than the MMM one. And 
 there are ways to speed that up too a lot.
 - Today the source code of Java6 can be read, it's open source, so some of 
 the ideas used by GC can be borrowed to the D GC (but it may require some 
 work).

 Bye,
 bearophile

The modern streamlined Java GC may win against an archaic manual memory allocator. However, I don't think it could possibly outperform a modern manual allocator like nedmalloc. I think that D should include nedmalloc in the standard library. This would put D at the top of the list in this benchmark.

I agree. Someone requested this ages ago. I wonder why hasn't this been done yet? -Joel
Nov 26 2007
parent Bill Baxter <dnewsgroup billbaxter.com> writes:
janderson wrote:
 Craig Black wrote:
 "bearophile" <bearophileHUGS lycos.com> wrote in message 
 news:fie98d$8ed$1 digitalmars.com...
 This Shootout page shows an interesting comparison between Java6 and D:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=
inarytrees&lang=all 


 As you can see there are two D versions, here I have written the slow 
 one (the fast D version uses manual memory management (MMM), the slow 
 version leaves it to the built-in GC). I think my version is equal to 
 the faster solution, that run on Java (it uses automatic GC), you can 
 see both versions here:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binaryt
ees&lang=dlang&id=0 

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytr
es&lang=javaxx&id=2 


 The interesting thing is that despite the code being essentially the 
 same, the D version run 9.1 times slower than the Java one.
 Some notes:
 - This shootout test is designed to stress the GC.
 - From my experience on this demo, I think that speed difference 
 (between that Java and the slow D version) is caused mostly by the 
 GC, so the D GC is quite worse than the Java6 one.
 - There the Java6 GC receives some hints from the programmer, those 
 "-server -Xms64m" (but not -Xbatch, it disables background 
 compilation) on the command line, I presume the D GC may someday 
 accept similar hints.
 - You may take a look at he memory used too: Java6 with those GC 
 hints: 40.3 MB (the -Xms64m flag sets the initial Java heap size to 
 64 MB), D MMM: 4.7 MB, D with GC: 17.6 MB.
 - The Java code shows how most people writes code, because manually 
 managing memory is okay in a little benchmark like this, but if you 
 have a very large system it's not easy to manage every memory usage 
 by hand.
 - There are even faster ways to solve this problem using D, I have 
 seen that using a free list the code becomes even faster than the MMM 
 one. And there are ways to speed that up too a lot.
 - Today the source code of Java6 can be read, it's open source, so 
 some of the ideas used by GC can be borrowed to the D GC (but it may 
 require some work).

 Bye,
 bearophile

The modern streamlined Java GC may win against an archaic manual memory allocator. However, I don't think it could possibly outperform a modern manual allocator like nedmalloc. I think that D should include nedmalloc in the standard library. This would put D at the top of the list in this benchmark.

I agree. Someone requested this ages ago. I wonder why hasn't this been done yet?

Priorities? --bb
Nov 26 2007