digitalmars.D - GC performances

bearophile (16/16) Nov 26 2007 This Shootout page shows an interesting comparison between Java6 and D:

Lutger (17/17) Nov 26 2007 Another interesting thing here is that the Java version beats all other

bearophile (80/91) Nov 26 2007 Because the MMM is done in a vey naive way. See at the bottom of the pag...

bearophile (3/4) Nov 26 2007 Read that as:
Lutger (85/93) Nov 26 2007 I've hacked up an array based implementation without recursion and

Lutger (3/4) Nov 26 2007 Oops, that's wrong of course, but not used in this benchmark.
Craig Black (97/190) Nov 26 2007 Here's an array-based C++ implementation from the shootout. It performs...

bearophile (5/8) Nov 26 2007 Allocating huge chunks is slow on Win, look at my better version here:

Lutger (5/16) Nov 26 2007 The code I posted uses an approach similar to the C++ version. There are...

BCS (13/35) Nov 26 2007 I wonder how fast it would be to just malloc a huge block of ram and use...

Craig Black (4/38) Nov 27 2007 Yes this is very fast but causes huge amounts of fragmentation and thus

Sean Kelly (5/49) Nov 27 2007 It's not even necessarily that fast. In the research I've seen, custom

Kenny TM~ (9/30) Nov 26 2007 Aside, if the (your) code is compiled with Tango instead the code runs
Craig Black (7/40) Nov 26 2007 The modern streamlined Java GC may win against an archaic manual memory

bearophile (6/11) Nov 26 2007 There is this widely used one too, by Doug Lea:

Craig Black (4/16) Nov 26 2007 nedmalloc is based on Doug Lea's allocator I think. It improves on it a...

janderson (4/51) Nov 26 2007 I agree. Someone requested this ages ago. I wonder why hasn't this

Bill Baxter (3/58) Nov 26 2007 Priorities?

bearophile <bearophileHUGS lycos.com> writes:

This Shootout page shows an interesting comparison between Java6 and D:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

As you can see there are two D versions, here I have written the slow one (the
fast D version uses manual memory management (MMM), the slow version leaves it
to the built-in GC). I think my version is equal to the faster solution, that
run on Java (it uses automatic GC), you can see both versions here:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

The interesting thing is that despite the code being essentially the same, the
D version run 9.1 times slower than the Java one.
Some notes:
- This shootout test is designed to stress the GC.
- From my experience on this demo, I think that speed difference (between that
Java and the slow D version) is caused mostly by the GC, so the D GC is quite
worse than the Java6 one.
- There the Java6 GC receives some hints from the programmer, those "-server
-Xms64m" (but not -Xbatch, it disables background compilation) on the command
line, I presume the D GC may someday accept similar hints.
- You may take a look at he memory used too: Java6 with those GC hints: 40.3 MB
(the -Xms64m flag sets the initial Java heap size to 64 MB), D MMM: 4.7 MB, D
with GC: 17.6 MB.
- The Java code shows how most people writes code, because manually managing
memory is okay in a little benchmark like this, but if you have a very large
system it's not easy to manage every memory usage by hand.
- There are even faster ways to solve this problem using D, I have seen that
using a free list the code becomes even faster than the MMM one. And there are
ways to speed that up too a lot.
- Today the source code of Java6 can be read, it's open source, so some of the
ideas used by GC can be borrowed to the D GC (but it may require some work).

Bye,
bearophile

Nov 26 2007

Lutger <lutger.blijdestijn gmail.com> writes:

Another interesting thing here is that the Java version beats all other 
languages too, including the ones with manual memory management. But it 
does cost a lot of memory.

I would guess that this is an ideal case for a copying garbage 
collector? I wonder how much of an improvement such a GC will have in 
normal applications, and how hard it will be to implement in D. Another 
concern is what changes are needed in client code to work nicely with a 
copying GC.

But this example does show how relative these microbenchmarks are. After 
all, it costs the Java version more than ten times the memory to 
outperform D, how will that affect the speed of a complex application?

btw. I tried implementing this with a freelist, as suggested by the 
article at the digitalmars site, but it was slower than manual memory 
management. Might be because the code for allocation wasn't inlined for 
some reason.

Likely not the intent of the benchmark, but an array based 
implementation could be faster here.

Nov 26 2007

bearophile <bearophileHUGS lycos.com> writes:

Lutger Wrote:

 Another interesting thing here is that the Java version beats all other 
 languages too, including the ones with manual memory management.

Because the MMM is done in a vey naive way. See at the bottom of the page for
faster versions:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all#alt




 But this example does show how relative these microbenchmarks are. After 
 all, it costs the Java version more than ten times the memory to 
 outperform D, how will that affect the speed of a complex application?

At the moment people use Java on big-iron servers, not D. Programming efforts
often cost less than the RAM, it seems.


 btw. I tried implementing this with a freelist, as suggested by the 
 article at the digitalmars site, but it was slower than manual memory 
 management. Might be because the code for allocation wasn't inlined for 
 some reason. 

Try beating this on the last DMD:

static import std.gc;
import std.c.stdlib: malloc;
import std.conv: toInt;

struct TyTreeNode {
    TyTreeNode* left, right;
    int item;
}

static TyTreeNode* freelist;

TyTreeNode* newTreeNode(int item, TyTreeNode* left=null, TyTreeNode*
right=null) {
    TyTreeNode* t = freelist;
    freelist = freelist.left;
    t.left = left;
    t.right = right;
    t.item = item;
    return t;
}

int itemCheck(TyTreeNode* tree) {
    if (tree.left is null)
        return tree.item;
    else
        return tree.item + itemCheck(tree.left) - itemCheck(tree.right);
}

TyTreeNode* makeTree(int item, uint depth) {
    if (depth > 0)
        return newTreeNode(item, makeTree(2*item-1, depth-1), makeTree(2*item,
depth-1));
    else
        return newTreeNode(item);
}

void deleteTree(TyTreeNode* tree) {
    if (tree.left !is null) {
        deleteTree(tree.left);
        deleteTree(tree.right);
    }

    tree.left = freelist;
    freelist = tree;
}

void main(string[] args) {
    std.gc.disable();
    const minDepth = 4;
    int n = (args.length > 1) ? toInt(args[1]) : 2;
    int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n;

    const int nnodes = ((1 << 16) - 1) / TyTreeNode.sizeof;
    const int block_len = nnodes > 0 ? nnodes : 1;

    uint m = n > 6 ? (1<<(n+2)) : 255;
    while (m > 0) {
        uint toAllocate = m > block_len ? block_len : m;
        m -= toAllocate;
        auto pool = cast(TyTreeNode*)malloc(TyTreeNode.sizeof * toAllocate);
        pool[toAllocate-1].left = freelist;
        freelist = pool;
        for (uint i = 0; i < toAllocate-1; i++)
            pool[i].left = &(pool[i+1]);
    }

    auto stretchTree = makeTree(0, maxDepth + 1);
    printf("stretch tree of depth %u\t check: %li\n", maxDepth + 1,
itemCheck(stretchTree));
    deleteTree(stretchTree);

    auto longLivedTree = makeTree(0, maxDepth);
    for (int depth = minDepth; depth <= maxDepth; depth += 2) {
        int iterations = 1 << (maxDepth - depth + minDepth);
        int check = 0;

        for (int i = 1; i <= iterations; i++) {
            auto tempTree = makeTree(i, depth);
            check += itemCheck(tempTree);
            deleteTree(tempTree);

            tempTree = makeTree(-i, depth);
            check += itemCheck(tempTree);
            deleteTree(tempTree);
        }

        printf("%li\t trees of depth %u\t check: %li\n", iterations*2, depth,
check);
    }

    printf("int lived tree of depth %u\t check: %li\n", maxDepth,
itemCheck(longLivedTree) );
}



 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

It seems slower.

Bye,
bearophile

Nov 26 2007

bearophile <bearophileHUGS lycos.com> writes:

bearophile:
 Programming efforts often cost less than the RAM, it seems.

Read that as:
Programming efforts often cost more than the RAM, it seems.

Nov 26 2007

Lutger <lutger.blijdestijn gmail.com> writes:

bearophile wrote:
 Lutger Wrote:

...
 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

 
 It seems slower.
 
 Bye,
 bearophile

I've hacked up an array based implementation without recursion and 
allocating whole trees at once. Although it produces the same results, 
it's probably flawed somewhere or the profiling is wrong I think. 
Anyway it runs much faster, perhaps you care to take a look. I'm not so 
good at these things:


import std.stdio, std.conv;
import std.math;

int left(int i) { return i * 2; }

int right(int i) { return ++i * 2; }

int parent(int i) { return i / 2; }

struct Tree
{
     static Tree bottumUpTree(int item, int depth)
     {
         Tree result;
         result.depth = depth;
         if (depth > 0)
         {
             auto len = pow(depth, cast(real)1);
             result.children.length = cast(uint)len;

         }
         else
             result.children.length = 1;
         result.children[0] = item;
         for (int i = 1; i < result.children.length; i+=2)
         {
             result.children[i] = (result.children[parent(i)] * 2) - 1;
             result.children[i + 1] = result.children[parent(i)] * 2;
         }
         return result;
     }

     int depth;

     int check()
     {
         int result = children[0];
         auto copy = children.dup;

         for (int i = children.length - ((depth * 2) -1); i < 
children.length; i+=2)
         {
             auto p = parent(i);
             copy[p] += children[i] - children[i+1];
         }

         for (int i = children.length - ((depth * 2) - 2); i > 0; i-=2)
         {
             auto p = parent(i);
             copy[p] += children[i-1] - children[i];
         }

         result+= copy[1] - copy[2];

         return result;
     }
     void deleteTree() { delete children; }
     int[] children;

     private int item;
}


void main(string[] args) {
     const minDepth = 4;
     int n = (args.length > 1) ? toInt(args[1]) : 2;
     int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n;

     auto t = Tree.bottumUpTree(0, maxDepth + 1);
     writefln("stretch tree of depth ", maxDepth + 1, "\t check: ", 
t.check());
     t.deleteTree();

     auto longLivedTree = Tree.bottumUpTree(0, maxDepth);

     for (int depth = minDepth; depth <= maxDepth; depth += 2)
     {
         int check;
         int iterations = 1 << (maxDepth - depth + minDepth);


         for (int i = 1; i <= iterations; i++)
         {
             auto t1 = Tree.bottumUpTree(i, depth);
             check += t1.check();
             t1.deleteTree();
             auto t2 = Tree.bottumUpTree(-i, depth);
             check += t2.check();
             t2.deleteTree();
         }

         writefln(iterations * 2, "\t trees of depth ", depth, "\t 
check: ", check);
     }

     writefln("long lived tree of depth ", maxDepth, "\t check: ", 
longLivedTree.check);
     longLivedTree.deleteTree();
}

Nov 26 2007

Lutger <lutger.blijdestijn gmail.com> writes:

Lutger wrote:
...
 int right(int i) { return ++i * 2; }

Oops, that's wrong of course, but not used in this benchmark.

Nov 26 2007

"Craig Black" <cblack ara.com> writes:

"Lutger" <lutger.blijdestijn gmail.com> wrote in message 
news:fiekma$otu$1 digitalmars.com...
 bearophile wrote:
 Lutger Wrote:

 ...
 Likely not the intent of the benchmark, but an array based 
 implementation could be faster here.

 It seems slower.

 Bye,
 bearophile

 I've hacked up an array based implementation without recursion and 
 allocating whole trees at once. Although it produces the same results, 
 it's probably flawed somewhere or the profiling is wrong I think. Anyway 
 it runs much faster, perhaps you care to take a look. I'm not so good at 
 these things:


 import std.stdio, std.conv;
 import std.math;

 int left(int i) { return i * 2; }

 int right(int i) { return ++i * 2; }

 int parent(int i) { return i / 2; }

 struct Tree
 {
     static Tree bottumUpTree(int item, int depth)
     {
         Tree result;
         result.depth = depth;
         if (depth > 0)
         {
             auto len = pow(depth, cast(real)1);
             result.children.length = cast(uint)len;

         }
         else
             result.children.length = 1;
         result.children[0] = item;
         for (int i = 1; i < result.children.length; i+=2)
         {
             result.children[i] = (result.children[parent(i)] * 2) - 1;
             result.children[i + 1] = result.children[parent(i)] * 2;
         }
         return result;
     }

     int depth;

     int check()
     {
         int result = children[0];
         auto copy = children.dup;

         for (int i = children.length - ((depth * 2) -1); i < 
 children.length; i+=2)
         {
             auto p = parent(i);
             copy[p] += children[i] - children[i+1];
         }

         for (int i = children.length - ((depth * 2) - 2); i > 0; i-=2)
         {
             auto p = parent(i);
             copy[p] += children[i-1] - children[i];
         }

         result+= copy[1] - copy[2];

         return result;
     }
     void deleteTree() { delete children; }
     int[] children;

     private int item;
 }


 void main(string[] args) {
     const minDepth = 4;
     int n = (args.length > 1) ? toInt(args[1]) : 2;
     int maxDepth = (minDepth + 2) > n ? minDepth + 2 : n;

     auto t = Tree.bottumUpTree(0, maxDepth + 1);
     writefln("stretch tree of depth ", maxDepth + 1, "\t check: ", 
 t.check());
     t.deleteTree();

     auto longLivedTree = Tree.bottumUpTree(0, maxDepth);

     for (int depth = minDepth; depth <= maxDepth; depth += 2)
     {
         int check;
         int iterations = 1 << (maxDepth - depth + minDepth);


         for (int i = 1; i <= iterations; i++)
         {
             auto t1 = Tree.bottumUpTree(i, depth);
             check += t1.check();
             t1.deleteTree();
             auto t2 = Tree.bottumUpTree(-i, depth);
             check += t2.check();
             t2.deleteTree();
         }

         writefln(iterations * 2, "\t trees of depth ", depth, "\t check: 
 ", check);
     }

     writefln("long lived tree of depth ", maxDepth, "\t check: ", 
 longLivedTree.check);
     longLivedTree.deleteTree();
 }

Here's an array-based C++ implementation from the shootout.  It performs the 
best but was disqualified because it allocates everything in one big chunk 
and then uses placement new to allocate each node.
/* The Computer Language Shootout
 * http://shootout.alioth.debian.org/
 * Contributed by Paul Kitchin
 */


#include <iostream>
#include <sstream>

class Tree
{

      struct Node
      {

         Node(int value, std::size_t depth, std::size_t index, Node * nodes, 
std::size_t max)
            :
            value(value)
         {
            if (index * 2 + 2 < max)
            {
               new (nodes + index * 2 + 1) Node(2 * value - 1, depth - 1, 
index * 2 + 1, nodes, max);
               new (nodes + index * 2 + 2) Node(2 * value, depth - 1, index 
* 2 + 2, nodes, max);
            }
         }

         int check(std::size_t index, Node * nodes, std::size_t max) const
         {
            if (index * 2 + 2 < max)
            {
               return nodes[index * 2 + 1].check(index * 2 + 1, nodes, max) 
+ value - nodes[index * 2 + 2].check(index * 2 + 2, nodes, max);
            }
            return value;
         }

         int value;

      };

   public:

      Tree(int value, std::size_t depth)
         :
         size((2 << depth) - 1),
         nodes(static_cast< Node * >(operator new(size * sizeof(Node))))
      {
         new (nodes) Node(value, depth, 0, nodes, size);
      }

      ~Tree()
      {
         operator delete(nodes);
      }

      int check() const
      {
         return nodes->check(0, nodes, size);
      }

   private:

      std::size_t size;
      Node * nodes;

};

int main(int argc, char * * argv)
{
   std::size_t user_depth = 10;
   if (argc == 2)
   {
      std::istringstream convertor(argv[1]);
      if (!(convertor >> user_depth) || !convertor.eof())
      {
         std::cerr << "Usage: " << argv[0] << " [n]\n";
         std::cerr << "\tn must be an integer\n";
         return 1;
      }
   }
   std::size_t minimum_depth = 4;
   std::size_t maximum_depth = std::max(minimum_depth + 2, user_depth);
   {
      Tree tree(0, maximum_depth + 1);
      std::cout << "stretch tree of depth " << (maximum_depth + 1) << "\t 
check: " << tree.check() << '\n';
   }
   Tree long_lived_tree(0, maximum_depth);
   for (std::size_t depth = minimum_depth; depth <= maximum_depth; depth += 
2)
   {
      int iterations = 1 << (maximum_depth - depth + minimum_depth);
      int check = 0;
      for (int iteration = 1; iteration <= iterations; ++iteration)
      {
         Tree first(iteration, depth);
         Tree second(-iteration, depth);
         check += first.check() + second.check();
      }
      std::cout << (2 * iterations) << "\t trees of depth " << depth << "\t 
check: " << check << '\n';
   }
   std::cout << "long lived tree of depth " << maximum_depth << "\t check: " 
<< long_lived_tree.check() << '\n';
}

Nov 26 2007

bearophile <bearophileHUGS lycos.com> writes:

Craig Black:

 Here's an array-based C++ implementation from the shootout.  It performs the 
 best but was disqualified because it allocates everything in one big chunk 
 and then uses placement new to allocate each node.

Allocating huge chunks is slow on Win, look at my better version here:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62373

Bye,
bearophile

Nov 26 2007

Lutger <lutger.blijdestijn gmail.com> writes:

bearophile wrote:
 Craig Black:
 
 Here's an array-based C++ implementation from the shootout.  It performs the 
 best but was disqualified because it allocates everything in one big chunk 
 and then uses placement new to allocate each node.

 
 Allocating huge chunks is slow on Win, look at my better version here:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62373
 
 Bye,
 bearophile

The code I posted uses an approach similar to the C++ version. There are 
some mistakes in there, after fixing those it runs about 6 times as fast 
as the code you posted. (dmd 1.022, windows xp, 20 iterations), and 
could be optimized further. But it's not the same benchmark.

Nov 26 2007

BCS <ao pathlink.com> writes:

Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.
 
 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.
 
 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?
 
 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.
 
 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.
 

I wonder how fast it would be to just malloc a huge block of ram and use 
this for your memeory allocator:

void* hugeBlock;
//void* end;

T* Allocate(T)()
{
  T* ret = cast(T*) hugeBlock;
  hugeBlock = cast(void*) (&ret[1]);
  //if(end < hugeBlock) *(cast(int*)null) = 0; 
  return ret;
}

as long as you don't run out of ram, it will be real fast. :)

Nov 26 2007

"Craig Black" <cblack ara.com> writes:

"BCS" <ao pathlink.com> wrote in message 
news:ce0a33432639b8c9fe730ad8498a news.digitalmars.com...
 Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.

 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.

 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?

 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.

 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.

 I wonder how fast it would be to just malloc a huge block of ram and use 
 this for your memeory allocator:

 void* hugeBlock;
 //void* end;

 T* Allocate(T)()
 {
  T* ret = cast(T*) hugeBlock;
  hugeBlock = cast(void*) (&ret[1]);
  //if(end < hugeBlock) *(cast(int*)null) = 0; return ret;
 }

 as long as you don't run out of ram, it will be real fast. :)

Yes this is very fast but causes huge amounts of fragmentation and thus 
inefficient use of memory.

Nov 27 2007

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 "BCS" <ao pathlink.com> wrote in message 
 news:ce0a33432639b8c9fe730ad8498a news.digitalmars.com...
 Reply to Lutger,

 Another interesting thing here is that the Java version beats all
 other languages too, including the ones with manual memory management.
 But it does cost a lot of memory.

 I would guess that this is an ideal case for a copying garbage
 collector? I wonder how much of an improvement such a GC will have in
 normal applications, and how hard it will be to implement in D.
 Another concern is what changes are needed in client code to work
 nicely with a copying GC.

 But this example does show how relative these microbenchmarks are.
 After all, it costs the Java version more than ten times the memory to
 outperform D, how will that affect the speed of a complex application?

 btw. I tried implementing this with a freelist, as suggested by the
 article at the digitalmars site, but it was slower than manual memory
 management. Might be because the code for allocation wasn't inlined
 for some reason.

 Likely not the intent of the benchmark, but an array based
 implementation could be faster here.

 I wonder how fast it would be to just malloc a huge block of ram and use 
 this for your memeory allocator:

 void* hugeBlock;
 //void* end;

 T* Allocate(T)()
 {
  T* ret = cast(T*) hugeBlock;
  hugeBlock = cast(void*) (&ret[1]);
  //if(end < hugeBlock) *(cast(int*)null) = 0; return ret;
 }

 as long as you don't run out of ram, it will be real fast. :)

 
 Yes this is very fast but causes huge amounts of fragmentation and thus 
 inefficient use of memory.

It's not even necessarily that fast.  In the research I've seen, custom 
allocators are almost always outperformed by the default allocator for 
general use.


Sean

Nov 27 2007

Kenny TM~ <kennytm gmail..com> writes:

bearophile wrote:
This Shootout page shows an interesting comparison between Java6 and D:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

The interesting thing is that despite the code being essentially the same, the
D version run 9.1 times slower than the Java one.
Some notes:
- This shootout test is designed to stress the GC.
- From my experience on this demo, I think that speed difference (between that
Java and the slow D version) is caused mostly by the GC, so the D GC is quite
worse than the Java6 one.
- There the Java6 GC receives some hints from the programmer, those "-server
-Xms64m" (but not -Xbatch, it disables background compilation) on the command
line, I presume the D GC may someday accept similar hints.
- You may take a look at he memory used too: Java6 with those GC hints: 40.3
MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D MMM: 4.7 MB,
D with GC: 17.6 MB.
- The Java code shows how most people writes code, because manually managing
memory is okay in a little benchmark like this, but if you have a very large
system it's not easy to manage every memory usage by hand.
- There are even faster ways to solve this problem using D, I have seen that
using a free list the code becomes even faster than the MMM one. And there are
ways to speed that up too a lot.
- Today the source code of Java6 can be read, it's open source, so some of the
ideas used by GC can be borrowed to the D GC (but it may require some work).

Bye,
bearophile

Aside, if the (your) code is compiled with Tango instead the code runs
23% faster. Of course still not comparable with the 9x performance of Java.

Time elapsed / seconds (parameter = 16)
----------------------------
D2+Phobos: 13.3828 & 13.5043
D1+Phobos: 13.5441 & 13.4825
D1+Tango : 10.1788 & 9.98871

-- Kenny.

Nov 26 2007

"Craig Black" <cblack ara.com> writes:

"bearophile" <bearophileHUGS lycos.com> wrote in message
news:fie98d$8ed$1 digitalmars.com...
This Shootout page shows an interesting comparison between Java6 and D:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

As you can see there are two D versions, here I have written the slow one
(the fast D version uses manual memory management (MMM), the slow version
leaves it to the built-in GC). I think my version is equal to the faster
solution, that run on Java (it uses automatic GC), you can see both
versions here:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

The interesting thing is that despite the code being essentially the same,
the D version run 9.1 times slower than the Java one.
Some notes:
- This shootout test is designed to stress the GC.
- From my experience on this demo, I think that speed difference (between
that Java and the slow D version) is caused mostly by the GC, so the D GC
is quite worse than the Java6 one.
- There the Java6 GC receives some hints from the programmer, those
"-server -Xms64m" (but not -Xbatch, it disables background compilation) on
the command line, I presume the D GC may someday accept similar hints.
- You may take a look at he memory used too: Java6 with those GC hints:
40.3 MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D
MMM: 4.7 MB, D with GC: 17.6 MB.
- The Java code shows how most people writes code, because manually
managing memory is okay in a little benchmark like this, but if you have a
very large system it's not easy to manage every memory usage by hand.
- There are even faster ways to solve this problem using D, I have seen
that using a free list the code becomes even faster than the MMM one. And
there are ways to speed that up too a lot.
- Today the source code of Java6 can be read, it's open source, so some of
the ideas used by GC can be borrowed to the D GC (but it may require some
work).

Bye,
bearophile

The modern streamlined Java GC may win against an archaic manual memory
allocator. However, I don't think it could possibly outperform a modern
manual allocator like nedmalloc. I think that D should include nedmalloc in
the standard library. This would put D at the top of the list in this
benchmark.

Nov 26 2007

bearophile <bearophileHUGS lycos.com> writes:

Craig Black:

 The modern streamlined Java GC may win against an archaic manual memory 
 allocator.  However, I don't think it could possibly outperform a modern 
 manual allocator like nedmalloc.  I think that D should include nedmalloc in 
 the standard library.  This would put D at the top of the list in this 
 benchmark.

There is this widely used one too, by Doug Lea:
http://gee.cs.oswego.edu/pub/misc/malloc.c
http://gee.cs.oswego.edu/pub/misc/malloc.h

Bye,
bearophile

Nov 26 2007

"Craig Black" <cblack ara.com> writes:

"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:fieqlv$12lr$1 digitalmars.com...
 Craig Black:

 The modern streamlined Java GC may win against an archaic manual memory
 allocator.  However, I don't think it could possibly outperform a modern
 manual allocator like nedmalloc.  I think that D should include nedmalloc 
 in
 the standard library.  This would put D at the top of the list in this
 benchmark.

 There is this widely used one too, by Doug Lea:
 http://gee.cs.oswego.edu/pub/misc/malloc.c
 http://gee.cs.oswego.edu/pub/misc/malloc.h

 Bye,
 bearophile

nedmalloc is based on Doug Lea's allocator I think.  It improves on it and 
significantly outperforms it, especially when multiple threads are used.

Nov 26 2007

janderson <askme me.com> writes:

Craig Black wrote:
"bearophile" <bearophileHUGS lycos.com> wrote in message
news:fie98d$8ed$1 digitalmars.com...
This Shootout page shows an interesting comparison between Java6 and D:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all

As you can see there are two D versions, here I have written the slow one
(the fast D version uses manual memory management (MMM), the slow version
leaves it to the built-in GC). I think my version is equal to the faster
solution, that run on Java (it uses automatic GC), you can see both
versions here:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=dlang&id=0
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

The interesting thing is that despite the code being essentially the same,
the D version run 9.1 times slower than the Java one.
Some notes:
- This shootout test is designed to stress the GC.
- From my experience on this demo, I think that speed difference (between
that Java and the slow D version) is caused mostly by the GC, so the D GC
is quite worse than the Java6 one.
- There the Java6 GC receives some hints from the programmer, those
"-server -Xms64m" (but not -Xbatch, it disables background compilation) on
the command line, I presume the D GC may someday accept similar hints.
- You may take a look at he memory used too: Java6 with those GC hints:
40.3 MB (the -Xms64m flag sets the initial Java heap size to 64 MB), D
MMM: 4.7 MB, D with GC: 17.6 MB.
- The Java code shows how most people writes code, because manually
managing memory is okay in a little benchmark like this, but if you have a
very large system it's not easy to manage every memory usage by hand.
- There are even faster ways to solve this problem using D, I have seen
that using a free list the code becomes even faster than the MMM one. And
there are ways to speed that up too a lot.
- Today the source code of Java6 can be read, it's open source, so some of
the ideas used by GC can be borrowed to the D GC (but it may require some
work).

Bye,
bearophile

I agree. Someone requested this ages ago. I wonder why hasn't this
been done yet?

-Joel

Nov 26 2007

Bill Baxter <dnewsgroup billbaxter.com> writes:

janderson wrote:
 Craig Black wrote:
 "bearophile" <bearophileHUGS lycos.com> wrote in message 
 news:fie98d$8ed$1 digitalmars.com...
 This Shootout page shows an interesting comparison between Java6 and D:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=
inarytrees&lang=all 


 As you can see there are two D versions, here I have written the slow 
 one (the fast D version uses manual memory management (MMM), the slow 
 version leaves it to the built-in GC). I think my version is equal to 
 the faster solution, that run on Java (it uses automatic GC), you can 
 see both versions here:

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binaryt
ees&lang=dlang&id=0 

 http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytr
es&lang=javaxx&id=2 


 The interesting thing is that despite the code being essentially the 
 same, the D version run 9.1 times slower than the Java one.
 Some notes:
 - This shootout test is designed to stress the GC.
 - From my experience on this demo, I think that speed difference 
 (between that Java and the slow D version) is caused mostly by the 
 GC, so the D GC is quite worse than the Java6 one.
 - There the Java6 GC receives some hints from the programmer, those 
 "-server -Xms64m" (but not -Xbatch, it disables background 
 compilation) on the command line, I presume the D GC may someday 
 accept similar hints.
 - You may take a look at he memory used too: Java6 with those GC 
 hints: 40.3 MB (the -Xms64m flag sets the initial Java heap size to 
 64 MB), D MMM: 4.7 MB, D with GC: 17.6 MB.
 - The Java code shows how most people writes code, because manually 
 managing memory is okay in a little benchmark like this, but if you 
 have a very large system it's not easy to manage every memory usage 
 by hand.
 - There are even faster ways to solve this problem using D, I have 
 seen that using a free list the code becomes even faster than the MMM 
 one. And there are ways to speed that up too a lot.
 - Today the source code of Java6 can be read, it's open source, so 
 some of the ideas used by GC can be borrowed to the D GC (but it may 
 require some work).

 Bye,
 bearophile

 The modern streamlined Java GC may win against an archaic manual 
 memory allocator.  However, I don't think it could possibly outperform 
 a modern manual allocator like nedmalloc.  I think that D should 
 include nedmalloc in the standard library.  This would put D at the 
 top of the list in this benchmark.

 
 I agree.  Someone requested this ages ago.  I wonder why hasn't this 
 been done yet?

Priorities?

--bb

Nov 26 2007

D Programming

C/C++ Programming

Other

digitalmars.D - GC performances