
digitalmars.D - A benchmark, mostly GC

reply bearophile <bearophileHUGS lycos.com> writes:
The program used here as a benchmark is slightly reduced from a Rosetta Code
task; it counts how many ways there are to make change for 100_000 euro (the
'amount' argument represents cents of euro) using the common coins.

The result is:
992198221207406412424859964272600001

The timings, best of 3, seconds:
  DMD:    22.5
  Python:  9.3
  Java:    2.9

DMD 2.057beta, -O -release -inline
Java SE 1.7.0_01-b08 (used without -server)
Python 2.6.6
32 bit Windows system.

The D version is about 7.7 times slower than the Java client version. I have
seen that most of the running time in the D code is spent in this line, which
performs the BigInt sum:

table[pos] += table[pos - 1];

From some tests I think most of the run time of the D version is spent in the
GC, so this is mostly a GC benchmark (the sum routines of D BigInts are
probably good enough here). I think the low D GC performance in this benchmark
is not caused by the imprecise nature of the D GC (during the run the amount
of memory used stays constant at 7.7 MB), but mostly by the amount of garbage
produced each second. So maybe it's the young generation of the Java GC that
is helping a lot here. But even the reference-counting GC of Python is more
efficient here.
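A side note on where the garbage comes from: in all three languages the big
integers are immutable values, so every `+=` in the inner loop allocates a
fresh result object and turns the old one into garbage. A minimal
illustration of the same effect in Python (CPython semantics assumed):

```python
# Arbitrary-precision integers are immutable: addition builds a new
# object, and the old one becomes garbage once nothing references it.
a = 10 ** 40          # a big integer, well outside any small-int cache
old = a
a = a + 1             # allocates a fresh integer object
assert a is not old   # the original object was not modified in place
```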

I have not timed the GC performance with the recent experiments done by
dsimcha on the D GC, discussed here; a timing with those changes is welcome:
https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2

--------------------------------

// D2 version
import std.stdio, std.bigint;

BigInt countChanges(in int amount, in int[] coins) {
    immutable int n = cast(int)coins.length; // length is a size_t
    int cycle = 0;
    foreach (int c; coins)
        if (c <= amount && c >= cycle)
            cycle = c + 1;
    cycle *= n;
    auto table = new BigInt[cycle];
    table[0 .. n] = BigInt(1); // (table[0 .. n] = 1; doesn't work here)

    int pos = n;
    for (int s = 1; s <= amount; s++) {
        for (int i = 0; i < n; i++) {
            if (i == 0 && pos >= cycle)
                pos = 0;
            if (coins[i] <= s) {
                immutable int q = pos - (coins[i] * n);
                table[pos] = (q >= 0) ? table[q] : table[q + cycle];
            }
            if (i != 0)
                table[pos] += table[pos - 1]; // most time spent here
            pos++;
        }
    }

    return table[pos - 1];
}

void main() {
    const int[] coins = [200, 100, 50, 20, 10, 5, 2, 1];
    writeln(countChanges(10000000, coins));
}

--------------------------------

// Java version
import java.util.Arrays;
import java.math.BigInteger;

final class Coins {
    private static BigInteger countChanges(int amount, int[] coins){
        final int n = coins.length;
        int cycle = 0;
        for (int c : coins)
            if (c <= amount && c >= cycle)
                cycle = c + 1;
        cycle *= n;
        BigInteger[] table = new BigInteger[cycle];
        Arrays.fill(table, 0, n, BigInteger.ONE);
        Arrays.fill(table, n, cycle, BigInteger.ZERO);

        int pos = n;
        for (int s = 1; s <= amount; s++) {
            for (int i = 0; i < n; i++) {
                if (i == 0 && pos >= cycle)
                    pos = 0;
                if (coins[i] <= s) {
                    final int q = pos - (coins[i] * n);
                    table[pos] = (q >= 0) ? table[q] : table[q + cycle];
                }
                if (i != 0)
                    table[pos] = table[pos].add(table[pos - 1]);
                pos++;
            }
        }

        return table[pos - 1];
    }

    public static void main(String[] args) {
        final int[] coins = {200, 100, 50, 20, 10, 5, 2, 1};
        System.out.println(countChanges(10000000, coins));
    }
}

--------------------------------


# Python 2 version
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def count_changes(amount_cents, coins):
    n = len(coins)
    cycle = max([c+1 for c in coins if c <= amount_cents]) * n
    table = [0] * cycle
    for i in xrange(n):
        table[i] = 1

    pos = n
    for s in xrange(1, amount_cents + 1):
        for i in xrange(n):
            if i == 0 and pos >= cycle:
                pos = 0
            if coins[i] <= s:
                q = pos - coins[i] * n
                table[pos] = table[q] if (q >= 0) else table[q + cycle]
            if i:
                table[pos] += table[pos - 1]
            pos += 1
    return table[pos - 1]

def main():
    coins = [200, 100, 50, 20, 10, 5, 2, 1]
    print count_changes(10000000, coins)

main()

--------------------------------
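As a quick sanity check of the algorithm (independent of the cyclic-window
optimization used in the versions above), the same counts can be computed
with the textbook coin-change DP. This is just an illustrative sketch;
`count_changes_simple` is a made-up name and it keeps the full table, so it
is only practical for small amounts:

```python
def count_changes_simple(amount, coins):
    # ways[s] = number of ways to make s cents with the coins seen so far
    ways = [1] + [0] * amount
    for c in coins:
        for s in range(c, amount + 1):
            ways[s] += ways[s - c]
    return ways[amount]

# Classic cross-check: 200 cents with coins 1,2,5,10,20,50,100,200
# can be made in 73682 ways.
print(count_changes_simple(200, [1, 2, 5, 10, 20, 50, 100, 200]))
```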

Bye,
bearophile
Dec 11 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 12/11/2011 8:48 AM, bearophile wrote:
 [the full benchmark post is quoted here; snipped, see above]
My optimizations make very little difference on this benchmark, but for good
reason: it's not a very good GC benchmark. I ran it with my GC profiling code
enabled and it only spends around 10% of its execution time in GC. We need to
figure out what else makes this benchmark so slow.

Furthermore, because this benchmark uses so little GC time, my improvements
are mostly buried in noise. Here are some sample runs on 2.057 beta with and
without my GC improvements. However, I make no claim that these results are
statistically significant: when I ran the benchmark a few times the variance
was quite high, and I'm too lazy to run it a zillion times and compute a mean
and variance for each stage, etc.

Without my improvements:

992198221207406412424859964272600001
Total GC prep time:  62 milliseconds
Total mark time:  456 milliseconds
Total sweep time:  1357 milliseconds
Total page recovery time:  439 milliseconds
Grand total GC time:  2314 milliseconds

Process returned 0 (0x0)   execution time : 20.776 s

With my improvements:

992198221207406412424859964272600001
Total GC prep time:  16 milliseconds
Total mark time:  393 milliseconds
Total sweep time:  1049 milliseconds
Total page recovery time:  297 milliseconds
Grand total GC time:  1755 milliseconds

Process returned 0 (0x0)   execution time : 19.287 s
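The "around 10%" figure can be checked against the totals above:

```python
# GC share of wall-clock time, taken from the run without improvements.
gc_ms = 62 + 456 + 1357 + 439      # prep + mark + sweep + page recovery
total_s = 20.776
share = gc_ms / 1000.0 / total_s
print("%.1f%%" % (share * 100))    # roughly 11% of execution time
```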
Dec 11 2011
next sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
It would be really great to implement a generational GC similar
to the C4 mentioned in this presentation:
http://www.infoq.com/presentations/Understanding-Java-Garbage-Collection
There is a research paper about C4 as well:
http://dl.acm.org/citation.cfm?id=1993491
Dec 11 2011
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/12/2011 16:43, Dejan Lekic a écrit :
 It would be really great to implement a generational GC similar to the
 C4 mentioned in this presentation:
 http://www.infoq.com/presentations/Understanding-Java-Garbage-Collection
 . There is a research-paper about C4 as well:
 http://dl.acm.org/citation.cfm?id=1993491 .
This GC (like all compacting GCs) requires that the GC knows what is a
reference/pointer and what isn't. This is possible in D by using compile-time
reflection to build runtime reflection, but the standard lib is far from
that. It would also bloat the executable with reflection data, something some
people don't want.

By the way, this shows that, depending on their needs, people will want
different GCs, and that the library should let us choose the one we want, for
example with a compiler switch (gc-module=xxx).
Dec 11 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/11/2011 07:02 PM, deadalnix wrote:
 Le 11/12/2011 16:43, Dejan Lekic a écrit :
 [snip]

 This GC (like all compacting GCs) requires that the GC knows what is
 a reference/pointer and what isn't. This is possible in D by using
 compile-time reflection to build runtime reflection, but the standard
 lib is far from that. It would also bloat the executable with
 reflection data, something some people don't want.
The information has to be generated by the compiler, not by the library.
 By the way, this show up the fact that, depending on needs, people will
 require different GC and that the lib should allow us to choose the one
 we want. For exemple with a compiler switch (gc-module=xxx for example).
Dec 11 2011
parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/12/2011 21:55, Timon Gehr a écrit :
 On 12/11/2011 07:02 PM, deadalnix wrote:
 [snip]

 The information has to be generated by the compiler, not by the
 library.
This is not the way chosen by D. And the way chosen by D is, IMO, superior. D
chose to have compile-time reflection, but no runtime reflection (the
compiler won't generate it). Given that, and D's great capability for generic
programming, the best solution is to generate the information at compile time
using reflection in a lib, to get runtime reflection. Many applications do
not want the bloat created by runtime reflection. It is up to the user to
compile with such a lib or not.
Dec 11 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/11/2011 10:13 PM, deadalnix wrote:
 Le 11/12/2011 21:55, Timon Gehr a écrit :
 [snip]

 The information has to be generated by the compiler, not by the
 library.

 This is not the way chosen by D. [...] Many applications do not want
 the bloat created by runtime reflection. It is up to the user to
 compile with such a lib or not.
We are talking about supporting precise GC, not about custom runtime reflection. There is no way to get precise GC right without compiler support.
Dec 11 2011
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/12/2011 22:26, Timon Gehr a écrit :
 [snip]

 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.
It is clearly possible by marking as not movable whatever is possibly a
pointer on the stack and in registers, plus some runtime reflection used by
the GC.
Dec 11 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/11/2011 11:27 PM, deadalnix wrote:
 [snip]

 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.

 It is clearly possible by marking as not movable whatever is possibly
 a pointer on the stack and in registers, plus some runtime reflection
 used by the GC.
In *precise GC* speak, 'possibly a pointer' is a banned term. You understand
that? Also, compile-time reflection is IMPOSSIBLE if you don't have the
source code; it cannot work in a meaningful way. GC support must be built-in.
Every attempt to make such a fundamental feature depend on the standard
library is comical.
Dec 11 2011
parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/12/2011 23:31, Timon Gehr a écrit :
 [snip]

 It is clearly possible by marking as not movable whatever is possibly
 a pointer on the stack and in registers, plus some runtime reflection
 used by the GC.

 In *precise GC* speak, 'possibly a pointer' is a banned term. You
 understand that? Also, compile-time reflection is IMPOSSIBLE if you
 don't have the source code. [...] GC support must be built-in.
Well, you are the only one here talking about a precise GC. The post you
responded to was talking about a compacting GC, which isn't the same. So
please don't go ad hominem. I could do the same but it would end up nowhere.
Dec 11 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/11/2011 11:46 PM, deadalnix wrote:
 [snip]

 In *precise GC* speak, 'possibly a pointer' is a banned term. [...]

 Well, you are the only one here talking about a precise GC. The post
 you responded to was talking about a compacting GC, which isn't the
 same.
The post was talking about C4, which I believed was precise. Anyway, a
compacting GC that is mostly precise is also very hard to get working using
static reflection features. The user would have to explicitly mark every
module with some mixin.
 So please don't go ad hominem. I could do the same but it would end
 up nowhere.
I didn't, but you just did. Let's stop.
Dec 11 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 12/11/2011 4:26 PM, Timon Gehr wrote:
 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.
FWIW my original precise heap scanning patch generated pointer offset information using CTFE and templates. The code to do this is still in Bugzilla and only took a couple hours to write.
Dec 11 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/11/2011 11:37 PM, dsimcha wrote:
 On 12/11/2011 4:26 PM, Timon Gehr wrote:
 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.
FWIW my original precise heap scanning patch generated pointer offset information using CTFE and templates. The code to do this is still in Bugzilla and only took a couple hours to write.
But it is not precise for the stack, right? How much work is left to the programmer to generate the information?
Dec 11 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 12/11/2011 9:41 PM, Timon Gehr wrote:
 On 12/11/2011 11:37 PM, dsimcha wrote:
 On 12/11/2011 4:26 PM, Timon Gehr wrote:
 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.
FWIW my original precise heap scanning patch generated pointer offset information using CTFE and templates. The code to do this is still in Bugzilla and only took a couple hours to write.
But it is not precise for the stack, right? How much work is left to the programmer to generate the information?
It wasn't precise on the stack, but for unrelated reasons. As far as work
left to the programmer: I had created templates for new (which I thought at
the time might get integrated into the compiler). To use the precise heap
scanning, all you had to do was:

class C {
    void* ptr;
    size_t integer;
}

void main() {
    auto instance = newTemplate!C();
}
Dec 11 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/12/2011 03:56 AM, dsimcha wrote:
 On 12/11/2011 9:41 PM, Timon Gehr wrote:
 On 12/11/2011 11:37 PM, dsimcha wrote:
 On 12/11/2011 4:26 PM, Timon Gehr wrote:
 We are talking about supporting precise GC, not about custom runtime
 reflection. There is no way to get precise GC right without compiler
 support.
FWIW my original precise heap scanning patch generated pointer offset information using CTFE and templates. The code to do this is still in Bugzilla and only took a couple hours to write.
But it is not precise for the stack, right? How much work is left to the programmer to generate the information?
 It wasn't precise on the stack, but for unrelated reasons. As far as
 work left to the programmer: I had created templates for new (which I
 thought at the time might get integrated into the compiler).
That is compiler support. ;)
 To use the precise heap scanning, all you had to do was:

 class C {
     void* ptr;
     size_t integer;
 }

 void main() {
     auto instance = newTemplate!C();
 }
Ok, nice. I template my allocations anyway. (even though it creates some bloat)
Dec 12 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 11 Dec 2011 10:43:14 -0500, Dejan Lekic <dejan.lekic gmail.com> wrote:
 It would be really great to implement a generational GC similar
 to the C4 mentioned in this presentation:
 http://www.infoq.com/presentations/Understanding-Java-Garbage-Collection
 . There is a research-paper about C4 as well:
 http://dl.acm.org/citation.cfm?id=1993491 .
C4 is really an impressive GC operating system/runtime for Java. But, like
many advanced GCs, it is really impressive because of the restrictions of the
JVM memory model.

First, most of the really awesome features of C4 (like scaling to 50 GB of
RAM with 10 ms pauses) are entirely due to the fact that C4 is an operating
system. C4 works through a bunch of memory interrupts that are not available
to user-mode code on Windows, Linux, OS X, etc. Impressive work I'd like to
see integrated into said OSes, but that's unlikely to happen soon.

Second, being a systems language means that D cannot implement a lot of GC
algorithms, including copying, generational and the good concurrent
collectors. In D we are really limited in terms of GC strategies; for
example, the concurrent D garbage collector (CDGC) leverages *nix-specific OS
features to work, and the general concurrency strategy it uses is IIRC slower
than the Java concurrent collectors. D could also go with thread-local GCs,
which are very competitive, but that would introduce regions into D's memory
model. Also, the presence of D's structs and the use of stack allocation
greatly increases the average lifetime of objects in general and decreases
the need for generational collectors.
Dec 11 2011
parent reply Brad Anderson <eco gnuk.net> writes:
On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford jhu.edu> wrote:
 Second, being a systems language means that D can not implement a lot
 of GC algorithms including copying, generational and the good
 concurrent collectors.

What about being a systems language prevents generational? The page on
garbage collection on the D website says that while it doesn't use a
generational GC it will some day, and gives tips on what to avoid so you
don't fall victim to the behavior of a moving GC.
Dec 11 2011
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 12 Dec 2011 01:06:14 -0500, Brad Anderson <eco gnuk.net> wrote:

 On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford jhu.edu> wrote:
 Second, being a systems language means that D can not implement a lot
 of GC algorithms including copying, generational and the good
 concurrent collectors.

 What about being a systems language prevents generational? The page
 on garbage collection on the D website says while it doesn't use a
 generational GC it will some day and gives tips on what to avoid so
 you don't fall victim to the behavior of a moving GC.

Regarding moving collectors: D supports semi-precise collection, therefore D
can support some types of moving collectors, i.e. compactors, which is what
the website is talking about. Copying collectors, on the other hand, require
full precision to work; you must be able to fully evacuate a region of
memory. D doesn't support full precision, for a couple of performance reasons
(unions, the call stack, C/C++ interop) and technical reasons (the inline
assembler).

Regarding generational collectors: both generational and concurrent
collectors require that every pointer assignment is known to the compiler,
which then instruments the assignment to flag mark bits, etc. For
generational collectors, you need this information to know which
objects/memory pages to search for roots into the young generation. Without
this information, you have to search the entire heap, i.e. do a full
collection. Again, both performance and technical reasons come into play
here. Instrumentation represents a performance cost which, even if it pays
for itself, looks bad in newsgroup postings. Indeed, concurrent collectors
are mostly about trading throughput for latency. So, like Java, you'd want to
use version statements to select your GC style, but you'd also have to make
sure your entire codebase was compiled with the same flags; with 3rd-party
DLLs and objects this can become non-trivial. From a technical perspective,
complete pointer assignment instrumentation is a non-starter because the D
compiler doesn't have complete access to all the code; both C/C++ and
assembler code can modify pointers and are not subject to instrumentation.
Now, if we supported smarter (or required manual) pointer instrumentation of
asm code, we could use these types of collectors.

* Note that the above doesn't take into account the kinds of virtual memory
tricks C4 can do, which may open these algorithms up to D and other systems
programming languages.
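The pointer-assignment instrumentation described above is usually called a
write barrier. Here is a hedged sketch of the card-marking variant (all
names and sizes are illustrative, not from any real runtime): every
reference store marks the "card" covering the written address as dirty, so a
minor collection scans only the dirty cards for old-to-young pointers
instead of the whole old generation.

```python
CARD_SIZE = 512      # bytes of heap covered by one card (illustrative)
dirty_cards = set()  # cards written to since the last minor collection
heap = {}            # toy heap: address -> stored reference

def write_barrier(obj_addr, field_offset, new_ref):
    # Store new_ref at obj_addr + field_offset and remember which card
    # was touched, so the GC can later scan just the dirty cards.
    addr = obj_addr + field_offset
    dirty_cards.add(addr // CARD_SIZE)
    heap[addr] = new_ref

write_barrier(4096, 24, "young_obj")
print(sorted(dirty_cards))   # card 8 covers addresses 4096..4607
```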
Dec 12 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/12/2011 04:17 PM, Robert Jacques wrote:
 On Mon, 12 Dec 2011 01:06:14 -0500, Brad Anderson <eco gnuk.net> wrote:

 On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford jhu.edu>
 wrote:
 Second, being a systems language means that D can not implement a lot
 of GC algorithms including copying, generational and the good
 concurrent collectors.

 What about being a systems language prevents generational? [...]

 Regarding moving collectors. D supports semi-precise collection.
 Therefore D can support some types of moving collectors, i.e.
 compactors, which is what the website is talking about. Copying
 collectors, on the other hand, require full precision to work; you
 must be able to fully evacuate a region of memory. D doesn't support
 full precision, for a couple of performance reasons (unions,
The compiler could insert small code fragments that track whether or not a union contains a pointer.
 the call stack,
What is the problem with the call stack? Couldn't the compiler just generate reference offset information for all the function frames, and then the GC generate a backtrace to identify the references?
 C/C++ interop)
There should be multiple options for GC. If C/C++ interop is unimportant, a better GC that does not support it well is still handy.
 and technical reasons (the inline
 assembler).
Inline assembler does not always move around GC heap references. I think that in the cases where it does, reference stores could be annotated manually.
 Regarding generational collectors.
 Both generational and concurrent collectors require that every pointer
 assignment is known to the compiler, which then instruments the
 assignment to flag mark bits, etc. For generational collectors, you need
 this information to know which objects/memory pages to search for roots
 into the young generation. Without this information, you have to search
 the entire heap, i.e. do a full collection. Again, both performance and
 technical reasons come into play here. Instrumentation represents a
 performance cost, which even if it pays for itself, looks bad in
 newsgroups posting. Indeed, concurrent collectors are mostly about
 trading throughput for latency. So, like JAVA, you'd want to use version
 statements to select your GC style, but you'd also have to make sure
 your entire codebase was compiled with the same flags; with 3rd party
 DLLs and objects, this can become non-trivial. From a technical
 perspective, complete pointer assignment instrumentation is a
 non-starter because the D compiler doesn't have complete access to all
 the code; both C/C++ and assembler code can modify pointers and are not
 subject to instrumentation.  Now, if we supported C/C++ through

 or required manual pointer instrumentation of asm code, we could use
 these types of collectors.

 * Note that the above doesn't take into account the types of virtual
 memory tricks C4 can do, which may open these algorithms up to D and
 other system programming languages.
I think we'll definitely need a generational/concurrent collector eventually. Could some of the problems be worked around by having more than one GC implementation in the same executable?
Dec 12 2011
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 12 December 2011 17:44, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 12/12/2011 04:17 PM, Robert Jacques wrote:
 [...]
Totally off topic... I have a question. If I pass a pointer to a C library... how does the GC know when it's safe to collect?
Dec 12 2011
parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Monday, 12 December 2011 at 16:00:04 UTC, Manu wrote:
 Totally off topic... I have a question.

 If I pass a pointer to a C library... how does the GC know when 
 it's safe
 to collect?
It doesn't; you should add it to the GC with GC.addRoot or GC.addRange unless you're absolutely sure the pointer is never taken off the stack (the call stack is shared between the C and D code). Once you're sure the memory is no longer referenced on the C heap, you can remove the root or range with GC.removeRoot and GC.removeRange.
Dec 12 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, December 12, 2011 17:59:54 Manu wrote:
 Totally off topic... I have a question.
 
 If I pass a pointer to a C library... how does the GC know when it's safe
 to collect?
It assumes that it's safe to collect it. If you pass a pointer to a C library and it keeps it, then you need to make sure that your D code maintains a reference to that data so that the GC doesn't collect it. - Jonathan M Davis
Dec 12 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 12 Dec 2011 10:44:27 -0500, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 12/12/2011 04:17 PM, Robert Jacques wrote:
 On Mon, 12 Dec 2011 01:06:14 -0500, Brad Anderson <eco gnuk.net> wrote:

 On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford jhu.edu>
 wrote:
 Second, being a systems language means that D can not implement a lot of
GC algorithms including copying, generational and the good concurrent collectors. What about being a systems language prevents generational? The page on garbage collection on the D website says while it doesn't use a generational GC it will some day and gives tips on what to avoid so you don't fall victim to the behavior of a moving GC.
Regarding moving collectors. D supports semi-precise collection. Therefore D can support some types of moving collectors, i.e. compactors, which is what the website is talking about. Copying collectors, on the other hand, require full precision to work; you must be able to fully evacuate a region of memory. D doesn't support full precision, for a couple of performance (unions,
The compiler could insert small code fragments that track whether or not an union contains a pointer.
 the call stack,
What is the problem with the call stack? Can't the compiler just generate reference offset information for all the function frames and then the GC generates a backtrace to identify the references?
 C/C++ interop)
There should be multiple options for GC. If C/C++ interop is unimportant, a better GC that does not support it well is still handy.
 and technical reasons (the inline
 assembler).
inline assembler does not always move around GC heap references. I think that in the cases it does, reference stores could be annotated manually.
Solving these problems is a little more complicated than this, but the existence of solutions wasn't the problem; the performance of those solutions was.

Unions can be instrumented, but the instrumentation must be harmonized with the GC, as pointer bitmasks must be changed. And changeable bitmasks present their own overhead problem (i.e. a 1/32 or 1/64 memory overhead). Personally, I'm in favor of a precise heap, so I'd like this, but it does require a compiler, not runtime, change.

The call stack is more complicated, because you have to view the function frame as a bunch of unions: space in the function frame is heavily reused to minimize cache effects and the chance of a stack overflow. So, before every function call you'd have to flush the current reference offsets of the function frame. *Oops* I was remembering why dual stacks don't work, and the same logic applies to function frames. A dual stack approach would use one stack for references and one stack for values. However, pointers to unions/structs/classes on the dual stack don't work, because the values and pointers in the objects are not together anymore. Similarly, pointers to unions on a precise/framed stack are troublesome because the assignment wouldn't know where the meta-data was. Hmm... the GC could have a pointer bitmask for each stack, which the stack frame flushes prior to each function call.

C/C++ interop is never unimportant; at a minimum all OS calls work through C/C++ interop. So while it may not be used directly by most D programs, under the hood the libraries we use all depend on it. We'd have to be able to annotate C functions with their marshaling requirements and change how we handle them based on the GC we're using. Things get even more icky when one considers registering D callbacks or C functions calling into D DLLs.

As for the inline assembler, we'd have to annotate changes to both the GC heap and the call stack, and since this would have to be done manually, it would become a source of nasty memory bugs. Technically, the performance of certain fundamental operations is an implementation detail, but there's a certain expectation that 'system' languages are lean and mean, either optionally or by default, and pointer tracking eliminates that option. Furthermore, pointer tracking interferes with language interop in rather nasty ways and may even necessitate changes to the ABI. But that doesn't mean a D dialect couldn't do a nice modern GC; indeed the .NET D compiler already did, and IIRC LDC/LLVM has a JIT backend.
 Regarding generational collectors.
 Both generational and concurrent collectors require that every pointer
 assignment is known to the compiler, which then instruments the
 assignment to flag mark bits, etc. For generational collectors, you need
 this information to know which objects/memory pages to search for roots
 into the young generation. Without this information, you have to search
 the entire heap, i.e. do a full collection. Again, both performance and
 technical reasons come into play here. Instrumentation represents a
 performance cost, which even if it pays for itself, looks bad in
 newsgroups posting. Indeed, concurrent collectors are mostly about
 trading throughput for latency. So, like JAVA, you'd want to use version
 statements to select your GC style, but you'd also have to make sure
 your entire codebase was compiled with the same flags; with 3rd party
 DLLs and objects, this can become non-trivial. From a technical
 perspective, complete pointer assignment instrumentation is a
 non-starter because the D compiler doesn't have complete access to all
 the code; both C/C++ and assembler code can modify pointers and are not
 subject to instrumentation.  Now, if we supported C/C++ through

 or required manual pointer instrumentation of asm code, we could use
 these types of collectors.

 * Note that the above doesn't take into account the types of virtual
 memory tricks C4 can do, which may open these algorithms up to D and
 other system programming languages.
I think we'll definitely need a generational/concurrent collector eventually.
Thread-local collectors are easy to do in D, much easier in fact than in the majority of other languages, and have performance similar to the current set of generational/concurrent collectors. And generational collectors are mainly there to account for the fact that Java doesn't have structs. So while I do think D needs a modern collector, I don't think that a generational/concurrent collector is absolutely required.
 Could some of the problems be worked around by having more
 than one GC implementation in the same executable?
Yes and no. GCs that require instrumentation to function would require entirely separate codebases. So supporting two different GCs would require fat objects of some kind (i.e. N different executables).
Dec 12 2011
parent reply Paulo Pinto <pjmlp progtools.org> writes:
Robert Jacques Wrote:
 Second, being a systems language means that D can not implement a lot of
GC algorithms including copying, generational and the good concurrent collectors.
I disagree, as there are other systems languages with better GC algorithms than D, because they offer safer features than D, the lack of inline assembler being one of them. And regardless of what you may think about these languages' suitability for doing systems programming, there are research operating systems written in them with lots of papers to read from. Something that I am yet to see from D. Yet when reading how Singularity was implemented, there are lots of parallels and some possibilities to improve D's GC still.
Dec 12 2011
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 13 Dec 2011 02:22:00 -0500, Paulo Pinto <pjmlp progtools.org> wrote:

 Robert Jacques Wrote:
 Second, being a systems language means that D can not implement a lot of
GC algorithms including copying, generational and the good concurrent collectors.
I disagree, as there are other system languages with better GC algorithms as D, because they offer more safe features than D, lack of inline assembler being one of them. And regardless of what you may think about these languages suitability for doing systems programming, there are research operating systems written in them with lots of papers to read from. Something that I am yet to see from D. Yet when reading how Singularity was implemented, there are lots of parallels some possibilities to improve D's GC still.
From the Singularity Wikipedia article:

 The lowest-level x86 interrupt dispatch code is written in assembly language and C. Once this code has done its job, it invokes the kernel, whose runtime abstraction layer is written in C++ and runs in protected mode. There is also some C code to handle debugging. The computer's BIOS is invoked during the 16-bit real mode bootstrap stage; once in 32-bit mode, Singularity never [...] installation, Common Intermediate Language (CIL) opcodes are compiled into x86 opcodes using the Bartok compiler.

From the BitC website:

 A less obvious issue is the absence of first-class union value types in the managed subset of the Common Language Runtime (CLR) or the corresponding parts of the Java Virtual Machine (JVM). These are absolutely necessary for low-level [...] these low-level objects (thereby abandoning the foundation for checking), or one must find a more appropriate language. [...] to introduce some fairly severe language warts to support precondition and postcondition checking, but the language does not attempt to address the [...]

[...] access is written in ASM/C/C++, like pretty much every single other operating system. (It's still an impressive microkernel.) Now BitC and Coyotos are another interesting language/OS pair, though they currently use a C/C++ conservative garbage collector. At the end of the day, I'm really excited about the growth in the systems programming arena and I'd love to see the combination of the ideals in C4, L4, Singularity and Coyotos into some new OS and/or language. But that doesn't really change the limitations of running on top of Windows or iOS.
Dec 13 2011
parent reply Paulo Pinto <pjmlp progtools.org> writes:
No need to reference Wikipedia articles, I am well aware of the implementation
specifics.

Still Singularity was just an example. I can pinpoint several other operating
systems done in GC enabled systems languages, where besides the given language,
only assembly was used, Native Oberon is such an OS for example.

Many times C and C++ get used in such operating systems because of available
tools that speed up development, instead of spending time rewriting them.

I think we will eventually get there as the trend nowadays is that everything
is slowly done in GC enabled languages anyway. Even Microsoft and Apple are
extending their systems languages (Objective-C and C++) to offer some form of
automatic memory management.

--
Paulo


Robert Jacques Wrote:

 [...]
Dec 13 2011
parent "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 14 Dec 2011 02:38:17 -0500, Paulo Pinto <pjmlp progtools.org> wrote:

 No need to reference Wikipedia articles, I am well aware of the implementation
specifics.

 Still Singularity was just an example. I can pinpoint several other operating
systems done in GC enabled systems languages, where besides the given language,
only assembly was used, Native Oberon is such an OS for example.

 Many times C and C++ get used in such operating systems because of available
tools that speed up development, instead of spending time rewriting them.

 I think we will eventually get there as the trend nowadays is that everything
is slowly done in GC enabled languages anyway. Even Microsoft and Apple are
extending their systems languages (Objective-C and C+) to offer some form of
automatic memory management.

 --
 Paulo
From Oberon's FAQ:

 I know that the Native Oberon garbage collector is mark-and-sweep, not copying, but does it ever move objects, for instance to compact memory if it becomes excessively fragmented? A: No, objects are never moved by the Mark/Sweep collector. To avoid fragmentation with a program that continuously allocates memory, call the garbage collector (Oberon.Collect) from time to time to free unused memory. Avoid calling it too much though, because it does take some milliseconds to run, depending on the number of reachable objects on the heap.

I feel like we are having a bit of a semantic issue. My position is not that GC can't or shouldn't be used by or in systems programming (indeed it should); it's that, generally speaking, the types of garbage collectors available to high-performance systems languages are not the same as those available to more [...]

P.S. It seems my GC education is a bit out of date, as there has been work on precise garbage collection for C in recent years (i.e. Magpie). I didn't see any references to more advanced garbage collection algorithms, but give graduate students hard problems and generally solutions fall out.
Dec 14 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/11/11 9:23 AM, dsimcha wrote:
 My optimizations make very little difference on this benchmark, but for
 good reason: It's not a very good GC benchmark. I ran it with my GC
 profiling code enabled and it only spends around 10% of its execution
 time in GC. We need to figure out why else this benchmark may be so slow.
I'll venture an opinion (without having profiled the code). The code uses BigInt, which makes eager allocations and custom work for each operation. But a good portion of the loop is spent using small numbers, so using the small integer optimization and native operations helps a lot. I think we need to add reference counting and small int optimization to BigInt to make it competitive.

Andrei
Dec 11 2011
parent reply Don <nospam nospam.com> writes:
On 11.12.2011 17:33, Andrei Alexandrescu wrote:
 On 12/11/11 9:23 AM, dsimcha wrote:
 My optimizations make very little difference on this benchmark, but for
 good reason: It's not a very good GC benchmark. I ran it with my GC
 profiling code enabled and it only spends around 10% of its execution
 time in GC. We need to figure out why else this benchmark may be so slow.
I'll venture an opinion (without having profiled the code). The code uses BigInt, which makes eager allocations and custom work for each operation. But a good portion of the loop is spent using small numbers, so using the small integer optimization and native operations helps a lot. I think we need to add reference counting and small int optimization to BigInt to make it competitive. Andrei
Yeah. Also important to note that BigInt was written assuming that something like TempAlloc would become available. Its GC usage is terrible without it.

OTOH this looks like a case where it'd be much faster to use fixed-length integers rather than BigInt (I think that's true of nearly everything on RosettaCode) -- strict upper bounds are known on the lengths of the integers; they aren't arbitrary precision. I think these are toy examples, though; they're completely unrepresentative of real-world code.
Dec 12 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Don:

 OTOH looks like this is a case where it'd be much faster to use fixed 
 length integers rather than BigInt
There's a fixed length integers version too: But it's slower than the Java code still, maybe because of the DMD back-end :-)
 I think these are toy examples, though, they're completely 
 unrepresentative of real-world code.
It's a toy example, but I have seen several times the low performance of D BigInts compared to Python flexible ints, so I think this toy example shows a more general pattern. Bye, bearophile
Dec 12 2011
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Don:

 OTOH looks like this is a case where it'd be much faster to use fixed 
 length integers rather than BigInt (I think that's true of nearly 
 everything in RosettaCode) -- strict upper bounds are known on the 
 length of the integers, they aren't arbitrary precision.
Two problems with this idea:

1) The Python version of this coins program has found a bug in the C version that uses 128-bit integers. When you use fixed-sized numbers you risk overflows. If the overflows are silent (like the ones in the built-in D integral numbers) you don't know if your code is giving bogus results; you need to work and think to be sure there are no overflows. This work and thinking requires time that I'd often like to use for something else. This is why multiprecision numbers (or non-silent overflows) are handy. In this program the upper bounds are known only if you compute them first with Python or with multiprecision numbers, like C GMP :-)

2) What if I want coins for a larger number of euros, like 200_000? Maybe the result or some intermediate value doesn't fit in 128 bits, so the C program becomes useless again, while the Python code is still useful. Multiprecision numbers sometimes are more useful.

Bye,
bearophile
Dec 12 2011
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 My optimizations make very little difference on this benchmark, but for 
 good reason:  It's not a very good GC benchmark.  I ran it with my GC 
 profiling code enabled and it only spends around 10% of its execution 
 time in GC.  We need to figure out why else this benchmark may be so slow.
Thank you for your timings. At the moment I don't know why this D program is slow. Bye, bearophile
Dec 12 2011
prev sibling parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 11.12.2011, 14:48 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:

 This program used here as a benchmark is a bit reduced from a  
 rosettacode task, it finds how many ways are there to make change for  
 100_000 euro (the argument 'amount' represents cents of euro) using the  
 common coins.

 The result is:
 992198221207406412424859964272600001

 The timings, best of 3, seconds:
   DMD:    22.5
   Python:  9.3
   Java:    2.9

 DMD 2.057beta, -O -release -inline
 Java SE 1.7.0_01-b08 (used without -server)
Is -server still doing anything? I thought that behavior was default now.
 Python 2.6.6
 32 bit Windows system.
Since I'm on a 64-bit machine I changed all int to ptrdiff_t first, for compatibility. And I am using DMD 2.056.

dmd -inline -O -release
gives me 21.680s (pretty similar)

dmd -L-O1 -L-znodlopen -L-znorelro -L--no-copy-dt-needed-entries -L--relax -L--sort-common -L--gc-sections -L-lrt -L--as-needed -L--strip-all -inline -O -release -noboundscheck
gives me 18.674s (black magic or something, but noteworthy; shaves off 3 seconds for me)

gdc -Wall -frelease -fno-assert -fno-bounds-check -ffunction-sections -fdata-sections -flto -march=native -O3 -Wl,--strip-all -Wl,-O1 -Wl,-znodlopen -Wl,-znorelro -Wl,--no-copy-dt-needed-entries -Wl,--relax -Wl,--sort-common -Wl,--gc-sections -Wl,-lrt -Wl,--as-needed
gives me 14.846s

-> The choice of compiler and parameters can have an unexpected impact on the runtime performance. :)

But let's take a look at the non-inlined dmd version to do some (o)profiling (attached file). Looking at the call graphs, it looks to me like a total of ~63 % of the time is spent in memory management routines while the rest goes to BigInt. I don't have a set up here to quickly check out 2.057. The numbers may differ significantly there, but that alone won't close the gap to Java (which I actually find impressive here; how do they do this?)
Dec 18 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Marco Leise:

 Looking at the call graphs, it looks to me like a total of ~63 % of the  
 time is spent in memory management routines while the rest goes to BigInt.
But dsimcha said:
 My optimizations make very little difference on this benchmark, but for
 good reason:  It's not a very good GC benchmark.  I ran it with my GC
 profiling code enabled and it only spends around 10% of its execution
 time in GC.  We need to figure out why else this benchmark may be so slow.
How is this possible? Bye, bearophile
Dec 18 2011
parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 18.12.2011, 23:15 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:

 Marco Leise:

 Looking at the call graphs, it looks to me like a total of ~63 % of the
 time is spent in memory management routines while the rest goes to
 BigInt.
But dsimcha said:
 My optimizations make very little difference on this benchmark, but for
 good reason:  It's not a very good GC benchmark.  I ran it with my GC
 profiling code enabled and it only spends around 10% of its execution
 time in GC.  We need to figure out why else this benchmark may be so  
 slow.
How is this possible? Bye, bearophile
I could imagine these differences:

I tested the stock 2.056 version - but I'll check 2.057 when I write the Gentoo ebuild.

The profiling method: oprofile is a sampling profiler, while dsimcha probably instrumented the code.

The scope of the profiled functions: dsimcha talks about the GC, while I talk about memory management functions in general. Allocating an array or setting its size is functionality I accounted to memory management. I know too little about the GC to know which functions do the garbage collection cycles, and I won't just add up all functions having GC in their name, so a conservative guess could confirm what dsimcha said.

If you want to, you can take a look at the report file. The unindented lines contain the percentage for that function excluding sub-calls. The binary was compiled with dmd -O -release -noboundscheck -g, in 64-bit mode.
Dec 18 2011