digitalmars.D - More embarrassing microbenchmars for D's GC.

Leandro Lucarella (52/52) Jun 09 2008 Content-Disposition: inline

bearophile (44/47) Jun 09 2008 Note that Psyco can't compile code outside functions (global code is slo...

Leandro Lucarella (13/20) Jun 10 2008 Thanks for the info, I've imported psyco but didn't play with it really ...

bearophile (4/5) Jun 10 2008 If you have a bit of free time you may time that code of mine (that's eq...

Sean Kelly (10/34) Jun 09 2008 I think it may have more to do with the allocation strategy in the GC. ...

Leandro Lucarella (20/34) Jun 10 2008 Content-Disposition: inline

Sean Kelly (9/37) Jun 10 2008 Oh okay. I was replying to the post that suggested that said the malloc
Marius Muja (6/12) Jun 10 2008 The graph may look exponential but that's often misleading (it may very

Leandro Lucarella (13/24) Jun 11 2008 You and Sean are right, the graph somewhere in between linear and

Walter Bright (5/33) Jun 09 2008 There are so many factors that can influence this, it's hard to say

Leandro Lucarella (14/46) Jun 10 2008 Content-Disposition: inline

Walter Bright (3/41) Jun 10 2008 Break up your code into more functions to get better info from the profi...

Leandro Lucarella (14/44) Jun 11 2008 Doesn't seems to be running. But it's weird you can tell for sure being

Rounin (11/11) Sep 15 2010 I found this posting after investigating a huge slowdown in a D program ...

Steven Schveighoffer (10/25) Sep 15 2010 gdc is very old. You may want to try the thing on the latest dmd. No

Rounin (8/8) Sep 16 2010 Thank you for that advice. I'm using GDC because it's available from Ubu...

bearophile (13/22) Sep 16 2010 You may try to disable/enable the GC in the Python code too (on 32 bit s...

Rounin (9/21) Sep 16 2010 This could be the case, as I started noticing the slowdown as I was writ...

dsimcha (15/18) Sep 16 2010 How about using a library defined associative array? As I've mentioned ...

Sean Kelly (2/15) Sep 16 2010 The precise scanning patch would probably help tremendously here.

dsimcha (4/19) Sep 16 2010 Yes, it would but even so, I don't think that a good AA implementation f...

Steven Schveighoffer (21/35) Sep 16 2010 I've heard that dmd is a bit tricky to set up on a 64-bit system (mine i...

Justin Johansson (7/44) Sep 16 2010 The redux of this discussion is surely that no matter how "good" the

Andrei Alexandrescu (14/22) Sep 16 2010 I have noted this trigger happiness of the GC since way back during my

Rounin (68/69) Sep 16 2010 Thank you for those kind words! Hopefully, it'll become a prioritized is...

Leandro Lucarella (16/16) Jun 10 2008 Adding info about Python+Psyco:
Michael Neumann (10/34) Jun 25 2008 Python used reference counting (~ Python 1.x) in the past, but I think

Jarrett Billingsley (5/12) Jun 25 2008 Maybe they just didn't change the API for backwards compatibility, but a...
bearophile (7/9) Jun 25 2008 Python 2.5.2 uses a reference counting GC plus logic to break reference ...

Jarrett Billingsley (3/9) Jun 25 2008 And there it is :)

Leandro Lucarella <llucax gmail.com> writes:

Content-Disposition: inline
Content-Transfer-Encoding: 8bit

I've done a very simple microbenchmark testing a single linked list.

What the test does is fill the list with long elements, and then iterate
the list and increment the associated value. This test mostly test the
allocation speed.

I've done plain a C version, a C++ version (using STL double-linked list,
so it has some disvantage here), a D version without using the GC, a D
version using the GC, and a Python version.

Here are the result (Celeron ~2GHz, 1M elements, average from 3 runs):

       C         C++       D no gc   D gc       D gc dis  Python
Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816
Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271
Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
debian patches).

You can see Python is almost twice faster than D doing allocation (and the
python example has some intentional overhead to make it as close as D code
as possible, using more Pythonic code could yield better results).

But there are a few other results I can't explain:
1) Why is D gc (disabled or not) version ~25% slower than the D version
   that uses malloc when iterating the list? It shouldn't be any GC
   activity in that part. Could be some GC locallity issue that yields
   more cache misses?
2) Why is D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>


	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)

Tests attached.

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
RENUNCIO PARA IR A REZARLE A SAN CAYETANO
	-- Crónica TV

Jun 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Leandro Lucarella:
 You can see Python is almost twice faster than D doing allocation (and the
 python example has some intentional overhead to make it as close as D code
 as possible, using more Pythonic code could yield better results).

Note that Psyco can't compile code outside functions (global code is slower
even if you don't use Psyco).

Try putting your code into a main() and add a:
from psyco.classes import __metaclass__
At the top.

The code:

import sys
from datetime import datetime

import psyco
psyco.full()
from psyco.classes import __metaclass__

class Node:
        def __init__(self):
                self.data = 0
                self.next = None

def main():
    start = datetime.now()
    list = Node()
    n = list
    for i in xrange(long(sys.argv[1])):
            n.data = i
            n.next = Node()
            n = n.next
    n.next = None
    end = datetime.now()
    delta = datetime.now() - start
    elapsed1 = delta.seconds * 1000000 + delta.microseconds

    start = datetime.now()
    p = list
    while p:
            p.data += 1
            p = p.next
    delta = datetime.now() - start
    elapsed2 = delta.seconds * 1000000 + delta.microseconds

    print "Fill:  %lu usec" % elapsed1
    print "Inc:   %lu usec" % elapsed2
    print "Total: %lu usec" % (elapsed1 + elapsed2)

main()

-----------------------

Original: 9.32 s
Just moving code into a function: 8.39 s
the same, using Psyco: 1.94 s

Bye,
bearophile

Jun 09 2008

Leandro Lucarella <llucax gmail.com> writes:

bearophile, el  9 de junio a las 17:03 me escribiste:
 Leandro Lucarella:
 You can see Python is almost twice faster than D doing allocation (and the
 python example has some intentional overhead to make it as close as D code
 as possible, using more Pythonic code could yield better results).

 
 Note that Psyco can't compile code outside functions (global code is slower
even if you don't use Psyco).
 

Thanks for the info, I've imported psyco but didn't play with it really =)

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Sus discipulos se miraron sin entended hasta que uno preguntose:
Peperino, soy Daniel Q. de Olivos tengo 54 años y aún soy virgen. A lo
que Peperino respondiole: Si sos ganso, ganso ser. Y lo frotó, y lo
curó y lo sanó. A lo que todos dijeron: ¡¡¡Peperino se la come,
Peperino se la come!!!
	-- Peperino Pómoro

Jun 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Leandro Lucarella:
 Thanks for the info, I've imported psyco but didn't play with it really =)

If you have a bit of free time you may time that code of mine (that's equal to
yours + psyco) on your PC, so you can show us those timings too compared to the
other ones...

Bye,
bearophile

Jun 10 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is D gc (disabled or not) version ~25% slower than the D version
    that uses malloc when iterating the list? It shouldn't be any GC
    activity in that part. Could be some GC locallity issue that yields
    more cache misses?

I think it may have more to do with the allocation strategy in the GC.  It
obtains
memory in chunks from the OS, and each chunk is typically a max of 8MB.  So
for a test like this the D GC will end up hitting the OS quite a few times
asking
for more memory.  If I had to guess I'd say that malloc has a more efficient
strategy here.  If you're interested, try running the same test using Tango
with and without a call tol tango.core.Memory.GC.reserve() for the amount of
memory you expect the app to use before the loop.

 2) Why is D malloc version ~33% slower than the C version? I took a look
    at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)

No idea.  I'd expect them to be roughly equivalent.


Sean

Jun 09 2008

Leandro Lucarella <llucax gmail.com> writes:

Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Sean Kelly, el  9 de junio a las 22:53 me escribiste:
 == Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is D gc (disabled or not) version ~25% slower than the D version
    that uses malloc when iterating the list? It shouldn't be any GC
    activity in that part. Could be some GC locallity issue that yields
    more cache misses?

 
 I think it may have more to do with the allocation strategy in the GC.  It
obtains
 memory in chunks from the OS, and each chunk is typically a max of 8MB.  So
 for a test like this the D GC will end up hitting the OS quite a few times
asking
 for more memory.  If I had to guess I'd say that malloc has a more efficient
 strategy here.  If you're interested, try running the same test using Tango
 with and without a call tol tango.core.Memory.GC.reserve() for the amount of
 memory you expect the app to use before the loop.

Really? I though it was more collecion-related. I don't know exactly how
the D GC is implemented, but I thought that, as you say, it allocates a
chunk of memory, with each new it gave you a piece of that memory, and
when it ran out of memory, it runs a collection cycle. When that cycle
can't recover unused memory, allocates more.

Considering that the version with GC disabled is almost as fast as the
malloc version, and that the time grows exponentially (graph attached)
with the number of nodes, it looks like the problem is the GC
collection cycles to me. Is the GC algorithm exponential?

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Me encanta el éxito; por eso prefiero el estado de progreso constante,
con la meta al frente y no atrás.
	-- Ricardo Vaporeso. Punta del Este, Enero de 1918.

Jun 10 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Leandro Lucarella (llucax gmail.com)'s article
 --TB36FDmn/VVEgNH/
 Content-Type: text/plain; charset=utf-8
 Content-Disposition: inline
 Content-Transfer-Encoding: 8bit
 Sean Kelly, el  9 de junio a las 22:53 me escribiste:
 == Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is D gc (disabled or not) version ~25% slower than the D version
    that uses malloc when iterating the list? It shouldn't be any GC
    activity in that part. Could be some GC locallity issue that yields
    more cache misses?

 I think it may have more to do with the allocation strategy in the GC.  It
obtains
 memory in chunks from the OS, and each chunk is typically a max of 8MB.  So
 for a test like this the D GC will end up hitting the OS quite a few times
asking
 for more memory.  If I had to guess I'd say that malloc has a more efficient
 strategy here.  If you're interested, try running the same test using Tango
 with and without a call tol tango.core.Memory.GC.reserve() for the amount of
 memory you expect the app to use before the loop.

 Really? I though it was more collecion-related. I don't know exactly how
 the D GC is implemented, but I thought that, as you say, it allocates a
 chunk of memory, with each new it gave you a piece of that memory, and
 when it ran out of memory, it runs a collection cycle. When that cycle
 can't recover unused memory, allocates more.
 Considering that the version with GC disabled is almost as fast as the
 malloc version, and that the time grows exponentially (graph attached)
 with the number of nodes, it looks like the problem is the GC
 collection cycles to me. Is the GC algorithm exponential?

Oh okay.  I was replying to the post that suggested that said the malloc
version was still much faster than D with the GC disabled.  But if they're
comparable then all the better.  This is what I would expect anyway.

As for the GC algorithm being exponential... all nodes of an AA contain
pointers, so they will all be scanned by the GC.  Given this and the chance
of false positives, I would expect performance to decrease in a superlinear
manner.  But an exponential slow-down still seems a bit extreme.


Sean

Jun 10 2008

Marius Muja <mariusm cs.ubc.ca> writes:

Leandro Lucarella wrote:
 
 Considering that the version with GC disabled is almost as fast as the
 malloc version, and that the time grows exponentially (graph attached)
 with the number of nodes, it looks like the problem is the GC
 collection cycles to me. Is the GC algorithm exponential?
 

The graph may look exponential but that's often misleading (it may very 
well be a polynomial). To tell for sure plot the data on a graph with a 
logarithmic Y axis and then on a graph with both axes logarithmic. If 
the plot is a straight line on the semilogy graph then it's exponential. 
If it's a straight line on the loglog graph then it's polinomial.

Jun 10 2008

Leandro Lucarella <llucax gmail.com> writes:

Marius Muja, el 10 de junio a las 21:38 me escribiste:
 Leandro Lucarella wrote:
Considering that the version with GC disabled is almost as fast as the
malloc version, and that the time grows exponentially (graph attached)
with the number of nodes, it looks like the problem is the GC
collection cycles to me. Is the GC algorithm exponential?

 
 The graph may look exponential but that's often misleading (it may very well
be 
 a polynomial). To tell for sure plot the data on a graph with a logarithmic Y 
 axis and then on a graph with both axes logarithmic. If the plot is a straight 
 line on the semilogy graph then it's exponential. If it's a straight line on
the 
 loglog graph then it's polinomial.

You and Sean are right, the graph somewhere in between linear and
exponential, but is really far from linear, which I think is the worst
case in GC algorithms...

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Si pudiera acercarme un poco más, hacia vos
Te diría que me tiemblan los dos pies, cuando me mirás
Si supieras todo lo que me costó, llegar
Hoy sabrías que me cuesta respirar, cuando me mirás

Jun 11 2008

Walter Bright <newshound1 digitalmars.com> writes:

Leandro Lucarella wrote:
 But there are a few other results I can't explain:
 1) Why is D gc (disabled or not) version ~25% slower than the D version
    that uses malloc when iterating the list? It shouldn't be any GC
    activity in that part. Could be some GC locallity issue that yields
    more cache misses?

There are so many factors that can influence this, it's hard to say 
without spending a lot of time carefully examining it.

 2) Why is D malloc version ~33% slower than the C version? I took a look
    at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 
 
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)
 
 Tests attached.

Without running a profiler, there's no way to be sure about just where 
in the code the time is being consumed.

Jun 09 2008

Leandro Lucarella <llucax gmail.com> writes:

Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Walter Bright, el  9 de junio a las 17:12 me escribiste:
 Leandro Lucarella wrote:
But there are a few other results I can't explain:
1) Why is D gc (disabled or not) version ~25% slower than the D version
   that uses malloc when iterating the list? It shouldn't be any GC
   activity in that part. Could be some GC locallity issue that yields
   more cache misses?

 
 There are so many factors that can influence this, it's hard to say without 
 spending a lot of time carefully examining it.

But there is no GC activity there right?

2) Why is D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)
Tests attached.

 
 Without running a profiler, there's no way to be sure about just where in the 
 code the time is being consumed.

Attached is the output of the gprof program (for the list-test-d-gc with
the GC running). I don't see any helpful information for the point 2), but
it shows clearly that most of the time is spent in garbage collection.

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Si pensas que el alma no se ve
el alma sí se ve en los ojos

Jun 10 2008

Walter Bright <newshound1 digitalmars.com> writes:

Leandro Lucarella wrote:
 Walter Bright, el  9 de junio a las 17:12 me escribiste:
 Leandro Lucarella wrote:
 But there are a few other results I can't explain:
 1) Why is D gc (disabled or not) version ~25% slower than the D version
   that uses malloc when iterating the list? It shouldn't be any GC
   activity in that part. Could be some GC locallity issue that yields
   more cache misses?

 There are so many factors that can influence this, it's hard to say without 
 spending a lot of time carefully examining it.

 
 But there is no GC activity there right?

Instrument it to be sure.

 
 2) Why is D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)
 Tests attached.

 Without running a profiler, there's no way to be sure about just where in the 
 code the time is being consumed.

 
 Attached is the output of the gprof program (for the list-test-d-gc with
 the GC running). I don't see any helpful information for the point 2), but
 it shows clearly that most of the time is spent in garbage collection.

Break up your code into more functions to get better info from the profiler.

Jun 10 2008

Leandro Lucarella <llucax gmail.com> writes:

Walter Bright, el 10 de junio a las 14:13 me escribiste:
But there is no GC activity there right?

 
 Instrument it to be sure.

Doesn't seems to be running. But it's weird you can tell for sure being
the author :S

2) Why is D malloc version ~33% slower than the C version? I took a look
  at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)
Tests attached.

Without running a profiler, there's no way to be sure about just where in the 
code the time is being consumed.

Attached is the output of the gprof program (for the list-test-d-gc with
the GC running). I don't see any helpful information for the point 2), but
it shows clearly that most of the time is spent in garbage collection.

 
 Break up your code into more functions to get better info from the profiler.

Is this useful (kcachegrind screenshoots, using callgrind with
--dump-instr=yes and --trace-jump=yes):
http://llucax.com.ar/~luca/kcachegrind-d.png
http://llucax.com.ar/~luca/kcachegrind-c.png

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
De tan fina la condesa, por no cagarse, reza.
	-- Ricardo Vaporeso

Jun 11 2008

Rounin <davidjo student.matnat.uio.no> writes:

I found this posting after investigating a huge slowdown in a D program I'd made
using gprof, which showed the culprit to be a function called
_D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time. That turned
out to be the garbage collector.

The program takes 1.74 seconds to run with two particular input files in Python.
The garbage collected D version takes 5.27 seconds.
The non-garbage collected version takes only 0.68 seconds.

Here's a more detailed report:

http://rounin.livejournal.com/21815.html

Obviously, the compiled program should run at least as fast as Python, so the
garbage collector could use some improvement.

Sep 15 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 15 Sep 2010 15:53:38 -0400, Rounin <davidjo student.matnat.uio.no>  
wrote:

 I found this posting after investigating a huge slowdown in a D program  
 I'd made
 using gprof, which showed the culprit to be a function called
 _D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time.  
 That turned
 out to be the garbage collector.

 The program takes 1.74 seconds to run with two particular input files in  
 Python.
 The garbage collected D version takes 5.27 seconds.
 The non-garbage collected version takes only 0.68 seconds.

 Here's a more detailed report:

 http://rounin.livejournal.com/21815.html

 Obviously, the compiled program should run at least as fast as Python,  
 so the
 garbage collector could use some improvement.

gdc is very old.  You may want to try the thing on the latest dmd.  No  
guarantee this is going to help though.

Another thing you may want to try is dcollections HashMap instead of  
builtin associative arrays, as they use a free-list style allocator that  
helps avoid GC collections.

I agree the GC is a major issue with performance, I hope after D2 is a bit  
more stable, we start working on improving it.

-Steve

Sep 15 2010

Rounin <davidjo student.matnat.uio.no> writes:

Thank you for that advice. I'm using GDC because it's available from Ubuntu
Linux's package system, whereas DMD currently is not. (And the .deb posted on
digitalmars.com is only for i386.) Hopefully, D will gain more popularity and
more
up-to-date tools will be made available.

By the way, today I re-compiled the program with a "std.gc.enable;" right before
the final "return 0" statement, and it still runs in 0.68 seconds. So either
those
objects aren't marked as needing garbage collection, or it's really just an
issue
of keeping the garbage collector from activating too quickly.

Sep 16 2010

bearophile <bearophileHUGS lycos.com> writes:

Rounin:

 Thank you for that advice. I'm using GDC because it's available from Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb posted on
 digitalmars.com is only for i386.) Hopefully, D will gain more popularity and
more
 up-to-date tools will be made available.

On a 32 bit system I find LDC to produce efficient programs, when possible (the
GC is similar, in D2 it's essentially the same).


 By the way, today I re-compiled the program with a "std.gc.enable;" right
before
 the final "return 0" statement, and it still runs in 0.68 seconds.

You may try to disable/enable the GC in the Python code too (on 32 bit systems
there's the very good Psyco too).

Your benchmark is not portable, so I can't help you find where the performance
problem is. When you perform a benchmark it's better to give all source code
and all data too, to allow others to reproduce the results and look for
performance problems.

Keep in mind that D associative arrays are usually slower than Python dicts.
Probably you build data structures like associative arrays, and this slows down
the GC. If you disable&enable the GC around that build phase, the program is
probably fast (so I suggest you to narrow as much as possible the width of the
disable/enable span, so you may see where the GC problem is). If you put a
exit(0) at the end of the program (to kill final collection) the D program may
save more time.

In Python 2.7 they have added a GC optimization that I may be used in D too:
http://bugs.python.org/issue4074
The garbage collector now performs better for one common usage pattern: when
many objects are being allocated without deallocating any of them. This would
previously take quadratic time for garbage collection, but now the number of
full garbage collections is reduced as the number of objects on the heap grows.
The new logic only performs a full garbage collection pass when the middle
generation has been collected 10 times and when the number of survivor objects
from the middle generation exceeds 10% of the number of objects in the oldest
generation. (Suggested by Martin von L�wis and implemented by Antoine Pitrou;
issue 4074.)<


while D's splitlines() wasn't quite working.<
int space = std.regexp.find(line, r"\s");<

What do you mean? If there's a bug in splitlines() or split() it's better to
add it to Bugzilla, possibly with inlined string to split (no external file to
read). splitlines() or split() are simple functions of a module, written in D,
so if there's a problem it's usually not too much hard to fix it, they are not
built-in methods written in C as in CPython.


if(!(path in oldpaths) && !(checksum in oldsums))
In D2 this may be written (unfortunately there is no "and" keyword, Walter
doesn't get its usefulness yet):
if (path !in oldpaths && checksum !in oldsums)

Bye,
bearophile

Sep 16 2010

Rounin <davidjo student.matnat.uio.no> writes:

Keep in mind that D associative arrays are usually slower than Python dicts.
Probably you build data structures like associative arrays, and this slows down
the GC. If you disable&enable the GC around that build phase, the program is
probably fast (so I suggest you to narrow as much as possible the width of the
disable/enable span, so you may see where the GC problem is). If you put a
exit(0) at the end of the program (to kill final collection) the D program may
save more time.

This could be the case, as I started noticing the slowdown as I was writing
either
the second or third block, which is where I build a ton of data structures.

Complicating the code by avoiding associative arrays isn't really an option,
though, as the whole point of writing the program was comparing D to Python in
terms of simplicity. Turning off the garbage collector during time-critical bits
is simple enough, though, as long as one knows that it has to be done.

What do you mean? If there's a bug in splitlines() or split() it's better to add
it to Bugzilla, possibly with inlined string to split (no external file to
read).
splitlines() or split() are simple functions of a module, written in D, so if
there's a problem it's usually not too much hard to fix it, they are not
built-in
methods written in C as in CPython.

There's no bug in split() as far as I know, but during development of the
program,
splitlines() would split the data on other things than newlines. It could be a
bug
in D, but it could just be a temporary glitch in the code, so I'm not filing a
bug.

Sep 16 2010

dsimcha <dsimcha yahoo.com> writes:

== Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.

How about using a library defined associative array?  As I've mentioned several
times here before, D's current builtin associative array implementation
interacts
horribly with the GC.  This is true both in terms of speed and memory usage.
Neither D's GC nor its AA implementation is that bad per se, but the AA
implementation seems to expose all the weaknesses in the GC, such that the
result
when using large AAs is laughably bad.  I think eventually we need to completely
redesign and rewrite it, or at least put a sealed implementation in
std.container
that's more geared towards huge AAs.

My RandAA can be several times faster than the builtin AA when dealing with huge
arrays (purely due to memory management overhead), and some tweaking by others
in
the community has produced an even faster version.  Furthermore, it doesn't
create
tons of heap fragmentation and false pointers for the GC.  I've solved out of
memory errors with the builtin AA in some small projects just by swapping in
RandAA for it.  (See http://dsource.org/projects/aa)

Sep 16 2010

Sean Kelly <sean invisibleduck.org> writes:

dsimcha Wrote:

 == Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.

 
 How about using a library defined associative array?  As I've mentioned several
 times here before, D's current builtin associative array implementation
interacts
 horribly with the GC.  This is true both in terms of speed and memory usage.
 Neither D's GC nor its AA implementation is that bad per se, but the AA
 implementation seems to expose all the weaknesses in the GC, such that the
result
 when using large AAs is laughably bad.  I think eventually we need to
completely
 redesign and rewrite it, or at least put a sealed implementation in
std.container
 that's more geared towards huge AAs.

The precise scanning patch would probably help tremendously here.

Sep 16 2010

dsimcha <dsimcha yahoo.com> writes:

== Quote from Sean Kelly (sean invisibleduck.org)'s article
 dsimcha Wrote:
 == Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.

 How about using a library defined associative array?  As I've mentioned several
 times here before, D's current builtin associative array implementation
interacts
 horribly with the GC.  This is true both in terms of speed and memory usage.
 Neither D's GC nor its AA implementation is that bad per se, but the AA
 implementation seems to expose all the weaknesses in the GC, such that the
result
 when using large AAs is laughably bad.  I think eventually we need to
completely
 redesign and rewrite it, or at least put a sealed implementation in
std.container
 that's more geared towards huge AAs.

 The precise scanning patch would probably help tremendously here.

Yes, it would but even so, I don't think that a good AA implementation for a GC
like D's should require an allocation on **every single insert**.  This makes
AAs
basically unusable in parallel code.

Sep 16 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 16 Sep 2010 06:41:14 -0400, Rounin <davidjo student.matnat.uio.no>  
wrote:

 Thank you for that advice. I'm using GDC because it's available from  
 Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb  
 posted on
 digitalmars.com is only for i386.) Hopefully, D will gain more  
 popularity and more
 up-to-date tools will be made available.

I've heard that dmd is a bit tricky to set up on a 64-bit system (mine is  
32-bit), but it will give more credence to your position to use the latest  
available tools.

 By the way, today I re-compiled the program with a "std.gc.enable;"  
 right before
 the final "return 0" statement, and it still runs in 0.68 seconds. So  
 either those
 objects aren't marked as needing garbage collection, or it's really just  
 an issue
 of keeping the garbage collector from activating too quickly.

The GC runs are not scheduled.  Essentially, you are probably encountering  
a syndrome where the GC runs more than it should.  This usually occurs  
when incrementally allocating large amounts of memory without dropping any  
of it.

The GC colleciton runs when allocating memory cannot find any free  
memory.  Most likely this is what's happening:

allocate -> No free memory available -> run GC collection, get nothing  
back -> allocate more memory from OS -> allocate...

Of course, the allocations will succeed for a bit after getting OS memory,  
but the GC collection is run way too often (as your profiling suggests).

Disabling it at the beginning, then enabling it at the end means it's  
going to run at most once.  Essentially, you take out the "run GC  
collection" step, everything else is the same.  Given that 87% of the run  
time is spent in the collection cycle, it's pretty obvious why this  
reduced your runtime ;)

-Steve

Sep 16 2010

Justin Johansson <no spam.com> writes:

On 16/09/2010 9:53 PM, Steven Schveighoffer wrote:
 On Thu, 16 Sep 2010 06:41:14 -0400, Rounin
 <davidjo student.matnat.uio.no> wrote:

 Thank you for that advice. I'm using GDC because it's available from
 Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb
 posted on
 digitalmars.com is only for i386.) Hopefully, D will gain more
 popularity and more
 up-to-date tools will be made available.

 I've heard that dmd is a bit tricky to set up on a 64-bit system (mine
 is 32-bit), but it will give more credence to your position to use the
 latest available tools.

 By the way, today I re-compiled the program with a "std.gc.enable;"
 right before
 the final "return 0" statement, and it still runs in 0.68 seconds. So
 either those
 objects aren't marked as needing garbage collection, or it's really
 just an issue
 of keeping the garbage collector from activating too quickly.

 The GC runs are not scheduled. Essentially, you are probably
 encountering a syndrome where the GC runs more than it should. This
 usually occurs when incrementally allocating large amounts of memory
 without dropping any of it.

 The GC colleciton runs when allocating memory cannot find any free
 memory. Most likely this is what's happening:

 allocate -> No free memory available -> run GC collection, get nothing
 back -> allocate more memory from OS -> allocate...

 Of course, the allocations will succeed for a bit after getting OS
 memory, but the GC collection is run way too often (as your profiling
 suggests).

 Disabling it at the beginning, then enabling it at the end means it's
 going to run at most once. Essentially, you take out the "run GC
 collection" step, everything else is the same. Given that 87% of the run
 time is spent in the collection cycle, it's pretty obvious why this
 reduced your runtime ;)

 -Steve

The redux of this discussion is surely that no matter how "good" the
semantics of a PL are, the proof of the pudding is in the GC.  Given
the more-or-less well-founded value vs reference semantics in D, it
might be useful to condider region-based memory management idioms and
integrate these idioms into the language.  Perhaps we might see this
in D99 as a measure to circumvent traditional GC issues.

Sep 16 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/16/10 5:41 CDT, Rounin wrote:
 Thank you for that advice. I'm using GDC because it's available from Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb posted on
 digitalmars.com is only for i386.) Hopefully, D will gain more popularity and
more
 up-to-date tools will be made available.

 By the way, today I re-compiled the program with a "std.gc.enable;" right
before
 the final "return 0" statement, and it still runs in 0.68 seconds. So either
those
 objects aren't marked as needing garbage collection, or it's really just an
issue
 of keeping the garbage collector from activating too quickly.

I have noted this trigger happiness of the GC since way back during my 
doctoral work, and conveyed it several times to Walter and Sean. They 
haven't gotten around to it, but I carry most of the blame because it 
was a pressing matter to me, not to them.

There are two issues: one is the GC jumps in too early, and the other is 
it jumps in too often. A program of mine had a pattern in which it would 
need a lot of memory for a long time, during which it would do the 
occasional small string allocation etc. There would still be a lot of 
collection attempts going on; the GC failed to notice that repeated 
collections do not recover much memory.

Thanks Rounin for the comparison. I think it's a great test bed for D's 
collector. If you could share the data too, that would be great.


Andrei

Sep 16 2010

Rounin <davidjo student.matnat.uio.no> writes:

Thank you for those kind words! Hopefully, it'll become a prioritized issue
whenever more pressing things are resolved.

In the meantime, here are (some slightly anonymized versions of) the MD5 files
in
question. They're about 9-10 megabytes each, so no wonder it allocates a bit of
memory!

http://folk.uio.no/davidjo/a.md5
http://folk.uio.no/davidjo/b.md5
begin 644 comparechecksums.d
M(R$O=7-R+V)I;B]D;60 +7)U; H*:6UP;W)T('-T9"YS=&1I;RP <W1D+F9I
M;&4L('-T9"YS=')I;F<L('-T9"YR96=E>'`L('-T9"YC+G-T9&EO+"!S=&0N
M9V,[" II;G0 ;6%I;BAS=')I;F=;72!A<F=V*2![" H)+R\ 1&ES86)L92!T
M:&4 9V%R8F%G92!C;VQL96-T;W( =&\ :V5E<"!I="!F<F]M('-L;W=I;F< 
M=&AE('!R;V=R86T*"2\O(&1O=VXN" ES=&0N9V,N9&ES86)L93L*" EI9B`H
M87)G=BYL96YG=&  (3T ,RD >PH)"7=R:71E9FQN*")5<V%G93H )7, ;VQD
M8VAE8VMS=6US(&YE=V-H96-K<W5M<R(L(&%R9W9;,%TI.PH)"7)E='5R;B`Q
M.PH)?0H*"2\O(%)E860 =&AE(&-H96-K<W5M(&9I;&5S+ H*"7-T<FEN9R!O
M;&1F:6QE(#T <W1R:7`H8V%S="AS=')I;F<I<F5A9"AA<F=V6S%=*2D[" ES
M=')I;F< ;F5W9FEL92`]('-T<FEP*&-A<W0H<W1R:6YG*7)E860H87)G=ELR
M72DI.PH*"7-T<FEN9UM=(&]L9&QI;F5S(#T <W1D+G)E9V5X<"YS<&QI="AO
M;&1F:6QE+")<;B(I+G-O<G0[" ES=')I;F=;72!N97=L:6YE<R`]('-T9"YR
M96=E>'`N<W!L:70H;F5W9FEL92PB7&XB*2YS;W)T.PH*"2\O(%!A<G-E('1H
M92!D871A+ H*"2\O($UA<"!P871H<R!T;R!C:&5C:W-U;7, 86YD('9I8V4 
M=F5R<V$N" ES=')I;F=;<W1R:6YG72!O;&1P871H<SL*"7-T<FEN9UMS=')I
M;F==(&YE=W!A=&AS.PH)<W1R:6YG6W-T<FEN9UT ;VQD<W5M<SL*"7-T<FEN
M9UMS=')I;F==(&YE=W-U;7,[" H)9F]R96%C:"AS=')I;F< ;&EN93L ;VQD
M;&EN97,I('L*"0EL:6YE(#T <W1R:7`H;&EN92D[" D):6YT('-P86-E(#T 
M<W1D+G)E9V5X<"YF:6YD*&QI;F4L('(B7',B*3L*"0ES=')I;F< 8VAE8VMS
M=6T /2!S=')I<"AL:6YE6S`N+G-P86-E72D[" D)<W1R:6YG('!A=&  /2!S
M=')I<"AL:6YE6W-P86-E+BXD72D[" D);VQD<&%T:'-;<&%T:%T /2!C:&5C
M:W-U;3L*"0EO;&1S=6US6V-H96-K<W5M72`]('!A=& [" E]" H)9F]R96%C
M:"AS=')I;F< ;&EN93L ;F5W;&EN97,I('L*"0EL:6YE(#T <W1R:7`H;&EN
M92D[" D):6YT('-P86-E(#T <W1D+G)E9V5X<"YF:6YD*&QI;F4L('(B7',B
M*3L*"0ES=')I;F< 8VAE8VMS=6T /2!S=')I<"AL:6YE6S`N+G-P86-E72D[
M" D)<W1R:6YG('!A=&  /2!S=')I<"AL:6YE6W-P86-E+BXD72D[" D);F5W
M<&%T:'-;<&%T:%T /2!C:&5C:W-U;3L*"0EN97=S=6US6V-H96-K<W5M72`]
M('!A=& [" E]" H)+R\ 0V]R<F5L871E('1H92!D871A+ H*"7-T<FEN9UM=
M(&YE=V9I;&5S.PH)<W1R:6YG6UT 9&5L971E9&9I;&5S.PH)<W1R:6YG6UT 
M;6]V961F:6QE<SL*"7-T<FEN9UM=(&UO9&EF:65D9FEL97,[" H)+R\ 3&]O
M:R!F;W( ;F5W(&9I;&5S+ H)9F]R96%C:"AS=')I;F< <&%T:"P <W1R:6YG
M(&-H96-K<W5M.R!N97=P871H<RD >PH)"6EF*"$H<&%T:"!I;B!O;&1P871H
M<RD )B8 (2AC:&5C:W-U;2!I;B!O;&1S=6US*2D*"0D);F5W9FEL97, ?CT 
M<&%T:#L*"7T*" DO+R!,;V]K(&9O<B!D96QE=&5D+"!M;V1I9FEE9"!O<B!M
M;W9E9"!F:6QE<RX*"69O<F5A8V H<W1R:6YG('!A=& L('-T<FEN9R!C:&5C
M:W-U;3L ;VQD<&%T:',I('L*"0EI9B A*'!A=&  :6X ;F5W<&%T:',I("8F
M("$H8VAE8VMS=6T :6X ;F5W<W5M<RDI" D)"61E;&5T961F:6QE<R!^/2!P
M871H.PH)"65L<V4 :68H(2AP871H(&EN(&YE=W!A=&AS*2`F)B`H8VAE8VMS
M=6T :6X ;F5W<W5M<RDI('L*"0D)<W1R:6YG(&YE=W!A=&  /2!N97=S=6US
M6V-H96-K<W5M73L*"0D);6]V961F:6QE<R!^/2!P871H('X (B#BAI( (B!^
M(&YE=W!A=& [" D)?0H)"65L<V4 :68H*'!A=&  :6X ;F5W<&%T:',I("8F
M("$H8VAE8VMS=6T :6X ;F5W<W5M<RDI" D)"6UO9&EF:65D9FEL97, ?CT 
M<&%T:#L*"7T*" DO+R!0<F5S96YT('1H92!D871A+ H*"6YE=V9I;&5S(#T 
M;F5W9FEL97,N<V]R=#L*"61E;&5T961F:6QE<R`](&1E;&5T961F:6QE<RYS
M;W)T.PH);6]V961F:6QE<R`](&UO=F5D9FEL97,N<V]R=#L*"6UO9&EF:65D
M9FEL97, /2!M;V1I9FEE9&9I;&5S+G-O<G0[" H)=W)I=&5F;&X[" H)=W)I
M=&5F;&XH(DUO9&EF:65D(&9I;&5S.EQN(BD[" EF;W)E86-H*'-T<FEN9R!P
M871H.R!M;V1I9FEE9&9I;&5S*0H)"7!R:6YT9B B)2XJ<UQN(BQP871H*3L*
M" EW<FET969L;CL*"7=R:71E9FQN.PH*"7=R:71E9FQN*")-;W9E9"!F:6QE
M<SI<;B(I.PH)9F]R96%C:"AS=')I;F< <&%T:#L ;6]V961F:6QE<RD*"0EP
M<FEN=&8H(B4N*G-<;B(L<&%T:"D[" H)=W)I=&5F;&X[" EW<FET969L;CL*
M" EW<FET969L;B B1&5L971E9"!F:6QE<SI<;B(I.PH)9F]R96%C:"AS=')I
M;F< <&%T:#L 9&5L971E9&9I;&5S*0H)"7!R:6YT9B B)2XJ<UQN(BQP871H
M*3L*" EW<FET969L;CL*"7=R:71E9FQN.PH*"7=R:71E9FQN*").97< 9FEL
M97,Z7&XB*3L*"69O<F5A8V H<W1R:6YG('!A=& [(&YE=V9I;&5S*0H)"7!R
M:6YT9B B)2XJ<UQN(BQP871H*3L*" EW<FET969L;CL*" DO+R!292UE;F%B
M;&EN9R!T:&4 1T, <V5E;7, =&\ 9&\ ;F\ :&%R;2P 8G5T('1H96X 86=A
M:6XL('=E)W)E(&5X:71I;F<*"2\O('1H92!P<F]G<F%M(&%N>7=A>2X*"2\O


`
end

Sep 16 2010

Leandro Lucarella <llucax gmail.com> writes:

Adding info about Python+Psyco:
Here are the result (Celeron ~2GHz, 1M elements, average from 3 runs):

       C        C++      D no gc  D gc       D gc dis  Python     Py+Psyco
Fill   .136136  .142705  .131238  22.628577  0.242637  12.952816  2.564722
Inc    .025134  .038768  .037456   0.050480  0.051545   3.765271  0.087375
Total  .161270  .181473  .168694  22.679057  0.294182  16.718087  2.652097

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
debian patches).

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
EL "PITUFO ENRIQUE" LLEGO A LA BAILANTA
	-- Crónica TV

Jun 10 2008

Michael Neumann <mneumann ntecs.de> writes:

Leandro Lucarella wrote:
 I've done a very simple microbenchmark testing a single linked list.
 
 What the test does is fill the list with long elements, and then iterate
 the list and increment the associated value. This test mostly test the
 allocation speed.
 
 I've done plain a C version, a C++ version (using STL double-linked list,
 so it has some disvantage here), a D version without using the GC, a D
 version using the GC, and a Python version.
 
 Here are the result (Celeron ~2GHz, 1M elements, average from 3 runs):
 
        C         C++       D no gc   D gc       D gc dis  Python
 Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816
 Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271
 Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087
 
 Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
 -frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
 debian patches).
 
 You can see Python is almost twice faster than D doing allocation (and the
 python example has some intentional overhead to make it as close as D code
 as possible, using more Pythonic code could yield better results).

Python used reference counting (~ Python 1.x) in the past, but I think
they switched to a tracing garbage collector in 2.x. Using reference
counting would explain the speedup IMHO. I think any mark and sweep
garbage collector would be somewhat slow in the case of a linked list,
as it has to traverse all elements to determine which memory is live and
which can be freed (so it's O(n)). You'd be probably better off using a
dynamic array here.

Regards,

   Michael

Jun 25 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Michael Neumann" <mneumann ntecs.de> wrote in message 
news:4862675F.3020603 ntecs.de...

 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x. Using reference
 counting would explain the speedup IMHO. I think any mark and sweep
 garbage collector would be somewhat slow in the case of a linked list,
 as it has to traverse all elements to determine which memory is live and
 which can be freed (so it's O(n)). You'd be probably better off using a
 dynamic array here.

Maybe they just didn't change the API for backwards compatibility, but all 
the Python 2.x source I've seen used reference counting macros.  Maybe it's 
a hybrid memory manager?  Or maybe the macros are now no-ops ;)

Jun 25 2008

bearophile <bearophileHUGS lycos.com> writes:

Michael Neumann:
 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x.

Python 2.5.2 uses a reference counting GC plus logic to break reference cycles,
part of the code:
http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup

You may want to take a look at this page of the results too (Psyco uses the
same GC, it's not another language):
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=72520

Bye,
bearophile

Jun 25 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g3tqs0$57g$1 digitalmars.com...
 Michael Neumann:
 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x.

 Python 2.5.2 uses a reference counting GC plus logic to break reference 
 cycles, part of the code:
 http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup

And there it is :)

Jun 25 2008

D Programming

C/C++ Programming

Other

digitalmars.D - More embarrassing microbenchmars for D's GC.