
digitalmars.D - rt_finalize WTFs?

reply dsimcha <dsimcha yahoo.com> writes:
I'm at my traditional pastime of trying to speed up D's garbage 
collector again, and I've stumbled on the fact that rt_finalize is 
taking up a ridiculous share of the time (~30% of total runtime) on a 
benchmark where huge numbers of classes **that don't have destructors** 
are being created and collected.  Here's the code to this function, from 
lifetime.d:

extern (C) void rt_finalize(void* p, bool det = true)
{
     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

     if (p) // not necessary if called from gc
     {
         ClassInfo** pc = cast(ClassInfo**)p;

         if (*pc)
         {
             ClassInfo c = **pc;
             byte[]    w = c.init;

             try
             {
                 if (det || collectHandler is null || 
collectHandler(cast(Object)p))
                 {
                     do
                     {
                         if (c.destructor)
                         {
                             fp_t fp = cast(fp_t)c.destructor;
                             (*fp)(cast(Object)p); // call destructor
                         }
                         c = c.base;
                     } while (c);
                 }
                 if ((cast(void**)p)[1]) // if monitor is not null
                     _d_monitordelete(cast(Object)p, det);
                 (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
             }
             catch (Throwable e)
             {
                 onFinalizeError(**pc, e);
             }
             finally  // WTF?
             {
                 *pc = null; // zero vptr
             }
         }
     }
}

Getting rid of the stuff I've marked with //WTF? comments (namely the 
finally block and the re-initializing of the memory occupied by the 
finalized object) speeds things up by ~15% on the benchmark in question. 
  Why do we care what state the blob of memory is left in after we 
finalize it?  I can kind of see that we want to clear things if 
delete/clear was called manually and we want to leave the object in a 
state that doesn't look valid.  However, this has significant 
performance costs and IIRC is already done in clear() and delete is 
supposed to be deprecated.  Furthermore, I'd like to get rid of the 
finally block entirely, since I assume its presence and the effect on 
the generated code is causing the slowdown, not the body, which just 
assigns a pointer.

Is there any good reason to keep this code around?
Dec 04 2011
next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha yahoo.com> wrote:

 [...]
 Is there any good reason to keep this code around?

Not for the try block. With errors being not recoverable you don't need to care about zeroing the vtbl, or you could just copy the code into the catch handler. This seems to cause less spilled variables.

Most expensive is the call to a memcpy PLT; replace it with something inlineable. Zeroing is not much faster than copying init[] for small classes. At least zeroing should be worth it, unless the GC would not scan the memory otherwise.

gcbench/tree1 => 41.8s
=> https://gist.github.com/1432117
=> gcbench/tree1 => 33.4s

Please add useful benchmarks to druntime.

martin
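As a concrete reading of the "inlineable" point, a plain word-at-a-time loop is the kind of thing the compiler can inline instead of issuing a call through the memcpy PLT (a sketch only; the helper name is made up and druntime's actual code differs):

```d
// Hypothetical helper: zero a finalized object's memory with an inlinable
// loop rather than copying c.init[] via a memcpy call through the PLT.
void zeroObjectMemory(void* p, size_t len)
{
    auto words = cast(size_t*)p;
    immutable nWords = len / size_t.sizeof;
    foreach (i; 0 .. nWords)
        words[i] = 0;                    // word-sized stores, trivially inlinable
    foreach (i; nWords * size_t.sizeof .. len)
        (cast(ubyte*)p)[i] = 0;          // remaining tail bytes
}
```

Whether this beats copying init[] depends on object size; as noted above, for small classes the difference is marginal.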
Dec 04 2011
parent dsimcha <dsimcha yahoo.com> writes:
Thanks for the benchmark.  I ended up deciding to just create a second 
function, rt_finalize_gc, that gets rid of a whole bunch of cruft that 
isn't necessary in the GC case.  I think it's worth the small amount of 
code duplication it creates.  Here are the results of my efforts so far: 
  https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . 
I've got one other good idea that I think will shave a few seconds off 
the Tree1 benchmark if I don't run into any unforeseen obstacles in 
implementing it.
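Sketched against stand-ins rather than real druntime internals (collectHandler, _d_monitordelete, and the det logic are deliberately left out, and the function name is hypothetical), the GC-only path boils down to something like:

```d
alias fp_t = void function(Object);  // same shape as lifetime.d's fp_t

// Sketch of a GC-only finalize: no null check (the GC never passes null),
// no re-init of memory that is about to be reclaimed, no finally block.
void finalizeFromGC(void* p)
{
    auto pc = cast(ClassInfo**)p;
    if (*pc is null)
        return;

    auto c = **pc;
    do
    {
        if (c.destructor)
            (cast(fp_t)c.destructor)(cast(Object)p); // call destructor
        c = c.base;
    } while (c !is null);

    *pc = null; // zero vptr so dangling references fail loudly
}

// tiny demo class so the sketch can be exercised
class Demo
{
    __gshared int dtorCalls;
    ~this() { ++dtorCalls; }
}
```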

On 12/4/2011 10:07 PM, Martin Nowak wrote:
 [...]

Dec 04 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 I've stumbled on the fact that rt_finalize is 
 taking up a ridiculous share of the time (~30% of total runtime) on a 
 benchmark where huge numbers of classes **that don't have destructors** 
 are being created and collected.

DMD or LDC/GDC compiler?
 extern (C) void rt_finalize(void* p, bool det = true)
 {
      debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
 
      if (p) // not necessary if called from gc
      {

That if(p) seems fit to become a static if on a bool template argument (but it needs to become D code or two C functions):

void rt_finalize(bool byGC)(void* p, bool det = true) { ...

Bye,
bearophile
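Filled out, that sketch might look like this (names hypothetical; the destructor walk is elided):

```d
// Turn the runtime null check into a compile-time branch: the GC-path
// instantiation drops the test entirely, the deterministic path keeps it.
void finalizeImpl(bool byGC)(void* p, bool det = true)
{
    static if (!byGC)
    {
        if (p is null)   // only explicit deletion may pass null
            return;
    }
    // ... destructor walk, monitor handling, etc. as in rt_finalize ...
}

// the two entry points the runtime would then expose
void finalizeDet(void* p, bool det = true) { finalizeImpl!false(p, det); }
void finalizeGC(void* p)                   { finalizeImpl!true(p, false); }
```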
Dec 04 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, December 04, 2011 23:41:08 dsimcha wrote:
 Thanks for the benchmark.  I ended up deciding to just create a second
 function, rt_finalize_gc, that gets rid of a whole bunch of cruft that
 isn't necessary in the GC case.  I think it's worth the small amount of
 code duplication it creates.  Here are the results of my efforts so far:
   https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 .
 I've got one other good idea that I think will shave a few seconds off
 the Tree1 benchmark if I don't run into any unforeseen obstacles in
 implementing it.

It's best to avoid code duplication in general, but I'd argue that if only a small amount of code duplication is required to get a big gain from the GC, it's well worth it. In general, anything we can do to improve the GC is worthwhile, since it's _the_ place where D risks being inefficient in comparison to languages such as C++.

- Jonathan M Davis
Dec 04 2011
prev sibling next sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
Last time I looked at it, the try/catch/finally block was rather 
expensive because it always invokes the exception handler unwinding 
mechanism, even if no exception occurs.
Try moving the try/catch block out of the loops that call rt_finalize. 
(maybe just remove it; onFinalizeError just rethrows, and it seems no one 
is using the information that the exception is thrown from inside the 
finalizer)

On 05.12.2011 02:46, dsimcha wrote:
 [...]

Dec 05 2011
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Mon, 05 Dec 2011 10:14:00 +0200, Rainer Schuetze <r.sagitario gmx.de>  
wrote:

 Try moving the try/catch block out of the loops that call rt_finalize.

This sounds like a good idea. Just make sure that all code paths that lead to rt_finalize (e.g. delete) can recover from an exception.
 (maybe just remove it, onFinalizeError just rethrows, and it seems noone  
 is using the information that the exception is thrown from inside the  
 finalizer)

The point of onFinalizeError is that it throws an *error* (it is unrecoverable). Currently, the GC cannot recover from an exception thrown by a finalizer, which is why onFinalizeError exists.

-- 
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
Dec 05 2011
prev sibling next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 05 Dec 2011 09:14:00 +0100, Rainer Schuetze <r.sagitario gmx.de>  
wrote:

 Last time I looked at it, the try/catch/finally block was rather  
 expensive because it always invokes the exception handler unwinding  
 mechanism, even if no exception occurs.
 Try moving the try/catch block out of the loops that call rt_finalize.  
 (maybe just remove it, onFinalizeError just rethrows, and it seems noone  
 is using the information that the exception is thrown from inside the  
 finalizer)

It still affects register assignment. Installing the exception handler at sweep scope instead seems promising and can save lots of register saves.

I appreciate the recursion during mark, wanted to do this myself sometime ago but expected a little more gain.

Some more ideas:
 - Do a major refactoring of the GC code, making it less reluctant
   to changes. Adding sanity checks or unit tests would be great.
   This probably reveals some obfuscated performance issues.
 - Add more realistic GC benchmarks; this just requires adding to
   druntime/test/gcbench using the new runbench. The tree1 benchmark
   mainly uses homogeneous classes, so it is very synthetic.
 - There is one binary search pool lookup for every scanned address
   in range. Should be a lot to gain here, but it's difficult. It
   needs a multilevel mixture of bitset/hashtab.
 - Reduce the GC roots range. I will have to work on this for
   shared library support anyhow.

martin
 On 05.12.2011 02:46, dsimcha wrote:
 [...]

Dec 05 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Martin Nowak (dawg dawgfoto.de)'s article
 I appreciate the recursion during mark, wanted to do this myself
 sometime ago but expected a little more gain.

The reason the gain wasn't huge is that on the benchmark I have that involves a deep heap graph, sweeping time dominates marking time. The performance gain for the mark phase only (which is important b/c this is when the world needs to be stopped) is ~20-30%.
 Some more ideas:
   - Do a major refactoring of the GC code, making it less reluctant
     to changes. Adding sanity checks or unit tests would be great.
     This probably reveals some obfuscated performance issues.

Not just obfuscated ones. I've wanted to fix an obvious perf bug for two years and haven't done it because the necessary refactoring would be unbelievably messy and I'm too afraid I'll break something. Basically, malloc() sets the bytes between the size you requested and the size of the block actually allocated to zero to prevent false pointers. This is reasonable. The problem is that it does so **while holding the GC's lock**. Fixing it for just the case when malloc() is called by the user is also easy. The problem is fixing it when malloc() gets called from realloc(), calloc(), etc.
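The fix described could be shaped like this (a sketch; every name here is made up, and the real allocator is far more involved):

```d
import core.sync.mutex : Mutex;

__gshared Mutex gcLock;

// Take a block from a free list while holding the lock, but zero the slack
// bytes (requested .. blockSize) only after releasing it, so other threads
// are not serialized behind the memset.
ubyte* allocZeroingSlackOutsideLock(ref ubyte*[] freeList,
                                    size_t requested, size_t blockSize)
{
    ubyte* p;
    synchronized (gcLock)
    {
        p = freeList[$ - 1];
        freeList.length -= 1;
    }
    p[requested .. blockSize] = 0;   // false-pointer zeroing, lock released
    return p;
}
```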
   - Add more realistic GC benchmarks, just requires adding to
     druntime/test/gcbench using the new runbench. The tree1 mainly
     uses homogeneous classes, so this is very synthesized.

I'll crowdsource this. I can't think of any good benchmarks that are < a few hundred lines w/ no dependencies but aren't pretty synthetic.
   - There is one binary search pool lookup for every scanned address in
 range.
     Should be a lot to gain here, but it's difficult. It needs a multilevel
     mixture of bitset/hashtab.

I understand the problem, but please elaborate on the proposed solution. You've basically got a bunch of pools, each of which represents a range of memory addresses, not a single address (so a basic hashtable is out). You need to know which range some pointer fits in. How would you beat binary search/O(log N) for this?
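For reference, the O(log N) baseline under discussion looks roughly like druntime's pool lookup (a simplified sketch; field names assumed):

```d
struct Pool { void* baseAddr; void* topAddr; }  // covers [baseAddr, topAddr)

// Binary search over pools sorted by address. One lookup like this runs
// for every scanned address in range, which is why it is so hot.
Pool* findPool(Pool*[] pooltable, void* p)
{
    size_t low = 0, high = pooltable.length;
    while (low < high)
    {
        immutable mid = (low + high) / 2;
        auto pool = pooltable[mid];
        if (p < pool.baseAddr)
            high = mid;
        else if (p >= pool.topAddr)
            low = mid + 1;
        else
            return pool;
    }
    return null;
}
```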
   - Reduce the GC roots range. I will have to work on this for
     shared library support anyhow.

Please clarify what you mean by "reduce" the roots range. Thanks for the feedback/suggestions.
Dec 05 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Martin Nowak (dawg dawgfoto.de)'s article
 More promising is to put pool addresses ranges in a trie.

 addr[7]          [...      .     ...]
                       /     |    \
 addr[6]     [...   .   ...]    [...   .   ...]
                 /   |    \         /   |   \
 addr[5]     pool:8             [...   .  ...]
                                /   |   \
 addr[4]                  pool:8 [....] pool:5

the the 32-bit trie for lower.

Why do you expect this to be faster than a binary search? I'm not saying it won't be, just that it's not a home run that deserves a high priority as an optimization. You still have a whole bunch of indirections, probably more than you would ever have for binary search.
Dec 05 2011
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-12-05 23:45, Iain Buclaw wrote:
 On 5 December 2011 22:34, Martin Nowak<dawg dawgfoto.de>  wrote:
 We currently add something from etext to _end (rt.memory) as static root.
 This probably contains read-only sections, data from other languages
 and (unlikely) other unrelated sections.
 I think some help from the compiler will be necessary to support
 shared libraries anyhow, so we should be able to get a precise range.

Indeed. The current implementation kicking around is to scan /proc/self/maps on init (this is true for the runtime that ships with GDC at least). Which is not the most pleasant of things to do - not to mention only supports Unix-flavoured systems, but it works well enough for when linking against shared D libraries.

/proc doesn't exist on Mac OS X.

-- 
/Jacob Carlborg
Dec 05 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Mon, 05 Dec 2011 23:07:09 +0200, dsimcha <dsimcha yahoo.com> wrote:

 I understand the problem, but please elaborate on the proposed  
 solution.  You've
 basically got a bunch of pools, each of which represents a range of  
 memory
 addresses, not a single address (so a basic hashtable is out).  You need  
 to know
 which range some pointer fits in.  How would you beat binary  
 search/O(log N) for this?

A tree, with a few bits of the address space per level. It becomes bound to the size of the address space, not the number of pools.

-- 
Best regards,
 Vladimir                            mailto:vladimir thecybershadow.net
Dec 05 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Martin Nowak (dawg dawgfoto.de)'s article
 I appreciate the recursion during mark, wanted to do this myself
 sometime ago but expected a little more gain.

The reason the gain wasn't huge is because on the benchmark I have that involves a deep heap graph, sweeping time dominates marking time. The performance gain for the mark phase only (which is important b/c this is when the world needs to be stopped) is ~20-30%.
 Some more ideas:
   - Do a major refactoring of the GC code, making it less reluctant
     to changes. Adding sanity checks or unit tests would be great.
     This probably reveals some obfuscated performance issues.

Not just obfuscated ones. I've wanted to fix an obvious perf bug for two years and haven't done it because the necessary refactoring would be unbelievably messy and I'm too afraid I'll break something. Basically, malloc() sets the bytes between the size you requested and the size of the block actually allocated to zero to prevent false pointers. This is reasonable. The problem is that it does so **while holding the GC's lock**. Fixing it for just the case when malloc() is called by the user is also easy. The problem is fixing it when malloc() gets called from realloc(), calloc(), etc.
   - Add more realistic GC benchmarks, just requires adding to
     druntime/test/gcbench using the new runbench. The tree1 mainly
     uses homogeneous classes, so this is very synthesized.

I'll crowdsource this. I can't think of any good benchmarks that are < a few hundred lines w/ no dependencies but aren't pretty synthetic.

But it will be difficult to fake internal pointers if they are not recorded. If we were able to enable recording in the runtime, people having performance issues with the GC could then send a file that can be used for optimizations.
   - There is one binary search pool lookup for every scanned address in
 range.
     Should be a lot to gain here, but it's difficult. It needs a  
 multilevel
     mixture of bitset/hashtab.

I understand the problem, but please elaborate on the proposed solution. You've basically got a bunch of pools, each of which represents a range of memory addresses, not a single address (so a basic hashtable is out). You need to know which range some pointer fits in. How would you beat binary search/O(log N) for this?

at least PAGESIZE. One would need to use addresses relative to the minimum address but that still needs 256K entries per GByte, so the maintenance overhead is likely too big.

More promising is to put pool address ranges in a trie.

addr[7]          [...      .     ...]
                      /     |    \
addr[6]     [...   .   ...]    [...   .   ...]
                /   |    \         /   |   \
addr[5]     pool:8             [...   .  ...]
                               /   |   \
addr[4]                  pool:8 [....] pool:5

Lookup is bounded constant time, with at maximum 8 (4) memory accesses. Maintenance is only required when adding/removing pools, a little sophisticated though.

-----
alias Node Trie;

struct Node
{
    union
    {
        Type       _type;
        Pool*      _pool;
        Node[16]*  _nodes16;  // uses upper 3 bits per level
        ...
        Node[256]* _nodes256; // uses full 8 bits per level
    }

    // use lower 3 bits to encode type, given guaranteed 8-byte alignment of malloc
    enum NTypeBits = 3;
    enum TypeMask  = (1 << NTypeBits) - 1;
    enum PtrMask   = ~TypeMask;
    enum Type { Empty, Pool, Nodes16, Nodes256 }
    static assert(Type.max <= TypeMask);

    Pool* getPool(void* p)
    {
        return getPoolImpl((cast(ubyte*)&p) + size_t.sizeof - 1);
    }

    Pool* getPoolImpl(ubyte* p)
    {
        Node* node = void;
        final switch (type)
        {
        case Type.Empty:    return null;
        case Type.Pool:     return pool;
        case Type.Nodes16:  node = nodes16[*p >> 5]; break;
        case Type.Nodes256: node = nodes256[*p]; break;
        }
        if (node.type == Type.Pool)
            return node.pool;
        else
            return node.getPoolImpl(p - 1);
    }

    void insertPool(Pool* npool, uint level=0)
    {
        final switch (type)
        {
        case Type.Empty:
            pool = npool;
        case Type.Pool:
            Pool* opool = pool;
            if (opool != npool)
            {
                nodes16 = new Node[16];
                foreach (i; 0 .. 16)
                {
                    // non-trivial code here
                    // Needs to figure out into which subnode opool and npool
                    // should be inserted. They can use pool.minAddr and pool.maxAddr
                    nodes16[i].insertPool(opool, level + 1);
                    nodes16[i].insertPool(npool, level + 1);
                }
            }
        case Type.Nodes16:
            // count empty nodes and decide whether
            // to switch to more child nodes.
        }
    }

    void removePool(Pool* npool, uint level=0)
    {
        // reverse of above
    }

    @property Type type() { return cast(Type)(_type & TypeMask); }
    @property void type(Type t) { assert(t <= TypeMask); _type |= t; }

    @property Pool* pool() { return cast(Pool*)(cast(size_t)_pool & PtrMask); }
    @property void pool(Pool* p) { _pool = p; type = Type.Pool; }

    ... for nodesN
}
-----
   - Reduce the GC roots range. I will have to work on this for
     shared library support anyhow.

Please clarify what you mean by "reduce" the roots range.

We currently add something from etext to _end (rt.memory) as static root. This probably contains read-only sections, data from other languages and (unlikely) other unrelated sections. I think some help from the compiler will be necessary to support shared libraries anyhow, so we should be able to get a precise range.
 Thanks for the feedback/suggestions.

Dec 05 2011
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 5 December 2011 22:34, Martin Nowak <dawg dawgfoto.de> wrote:
 We currently add something from etext to _end (rt.memory) as static root.
 This probably contains read-only sections, data from other languages
 and (unlikely) other unrelated sections.
 I think some help from the compiler will be necessary to support
 shared libraries anyhow, so we should be able to get a precise range.

Indeed. The current implementation kicking around is to scan /proc/self/maps on init (this is true for the runtime that ships with GDC at least). Which is not the most pleasant of things to do - not to mention only supports Unix-flavoured systems, but it works well enough for when linking against shared D libraries.

-- 
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Dec 05 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
 [...]
Dec 05 2011
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 06 Dec 2011 00:16:01 +0100, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Martin Nowak (dawg dawgfoto.de)'s article
 [...]
Why do you expect this to be faster than a binary search? I'm not saying it won't be, just that it's not a home run that deserves a high priority as an optimization. You still have a whole bunch of indirections, probably more than you would ever have for binary search.

It's the tightest loop in the garbage collector. It will end with few array accesses, and the locality of the memory being pointed to is paired by tree locality, thus you'd have good chances of finding them in the cache.

What this gives you is that it scales very well to many pools/regions. Thus you can better tweak the pool granularity against pool sizes.

Also:

    for (p < pend) if (*p in lastPool)

can be:

    for (p < pend) if (*p in lastNode)

It is definitely no home run, but I'll try it out when I find some time.
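The lastPool/lastNode idea can be sketched as a one-entry cache in front of whatever search structure is used (hypothetical names; a linear scan stands in for the binary search or trie descent):

```d
struct Pool { void* base; void* top; }  // covers [base, top)

Pool* lastPool;  // remembers the pool the previous pointer landed in

Pool* lookup(Pool*[] pools, void* p)
{
    if (lastPool !is null && p >= lastPool.base && p < lastPool.top)
        return lastPool;                     // hit: no search at all
    foreach (pool; pools)                    // stand-in for the real search
        if (p >= pool.base && p < pool.top)
            return lastPool = pool;
    return null;
}
```

Scanned words are often consecutive and point into the same pool, which is what makes such a cache pay off.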
Dec 05 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 04 Dec 2011 20:46:27 -0500, dsimcha <dsimcha yahoo.com> wrote:

 [...]
                  (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
 [...]
              finally  // WTF?
              {
                  *pc = null; // zero vptr
              }
 [...]

I think it might be a good idea to move those two operations to the clear function. Currently, clear(obj) simply calls rt_finalize(obj).

It does seem rather strange to have that finally there; why do we care, on an exception/error, that we zero the vptr? I'd at least move that code up to where the first WTF is. But I think we do need the reinitialization of the object and the zeroing of the vptr for clear, since that's part of clear's charter.

I'd support any effort to speed up the GC; it's definitely the worst offender for performance. Looks like there are quite a few good ideas in this thread.

-Steve
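A minimal model of that split might look like the following sketch; `finalizeOnly` and `clearAndReinit` are hypothetical names, and the real versions would live in lifetime.d rather than user code:

```d
// Hypothetical split: rt_finalize keeps only the destructor walk, and
// clear() layers the re-initialization and vptr zeroing on top of it.
void finalizeOnly(Object o)
{
    // Walk the class hierarchy, calling each destructor (and nothing else).
    for (auto c = typeid(o); c !is null; c = c.base)
        if (c.destructor !is null)
            (cast(void function(Object)) cast(void*) c.destructor)(o);
}

void clearAndReinit(Object o)
{
    finalizeOnly(o);

    // Re-initialize the memory so the blob looks freshly constructed
    // (this is the `w = c.init` copy from the snippet above; modern
    // druntime spells it `initializer`)...
    auto init = typeid(o).initializer;
    auto p = cast(byte*) cast(void*) o;
    foreach (i, b; cast(const(byte)[]) init)
        p[i] = b;

    // ...then zero the vptr so stale references fail fast.
    *cast(void**) cast(void*) o = null;
}
```

With this split, the GC's hot path pays only for the destructor walk, while `clear` keeps its full "leave the object looking invalid" semantics.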
Dec 05 2011
prev sibling next sibling parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC? -- Robert http://octarineparrot.com/
Dec 05 2011
next sibling parent reply Trass3r <un known.com> writes:
 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC?

So true, it's been rotting in that branch.
Dec 05 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 12/5/2011 6:39 PM, Trass3r wrote:
 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC?

So true, it's been rotting in that branch.

IIRC CDGC includes two major enhancements:

1. The snapshot GC for Linux. (Does this work on OSX/FreeBSD/anything Posix, or just Linux? I'm a bit skeptical about whether a snapshot GC is really that great an idea, given its propensity to waste memory on long collect cycles with a lot of mutation.)

2. I think there was some precise heap scanning-related stuff in it. I originally tried to implement precise heap scanning a couple of years ago, but it went nowhere for reasons too complicated to explain here. Given this experience, I'm not inclined to try again until the compiler has extensions for generating pointer offset information.
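For anyone unfamiliar with the approach, the fork-based snapshot trick can be shown in miniature (POSIX only; `heap` and `liveAtSnapshot` are illustrative stand-ins, not CDGC code): the child "marks" a copy-on-write snapshot of the heap while the parent keeps mutating, which is exactly where the wasted-memory concern comes from.

```d
import core.sys.posix.sys.wait : WEXITSTATUS, waitpid;
import core.sys.posix.unistd : _exit, fork;

__gshared int[] heap;   // stand-in for the GC-managed heap

// The child sees the heap exactly as it was at fork() time, even though
// the parent mutates it immediately afterwards.  The live count comes
// back through the exit status (a toy protocol, good for 0..127).
size_t liveAtSnapshot()
{
    auto pid = fork();
    if (pid == 0)
    {
        size_t live;
        foreach (x; heap)            // the child's copy-on-write view
            if (x != 0)
                ++live;
        _exit(cast(int)(live & 0x7F));   // skip druntime teardown in the child
    }
    heap[] = 0;                      // parent mutates right away
    int status;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status);
}
```

Every page the parent touches during the child's scan gets physically copied, which is the memory blow-up on long collect cycles with heavy mutation.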
Dec 05 2011
prev sibling next sibling parent Trass3r <un known.com> writes:
 IIRC CDGC includes two major enhancements:

 1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything  
 Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC  
 is really that great an idea given its propensity to waste memory on  
 long collect cycles with a lot of mutation.)

I guess, at least it uses fork IIRC.
 2.  I think there was some precise heap scanning-related stuff in it.  I  
 originally tried to implement precise heap scanning a couple years ago,  
 but it went nowhere for reasons too complicated to explain here.  Given  
 this experience, I'm not inclined to try again until the compiler has  
 extensions for generating pointer offset information.

What's the status of that anyway? Every patch in that bugzilla entry is marked obsolete and there's no real final answer.
Dec 05 2011
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
I have an updated and win32-compilable version of CDGC in a branch. It's missing some features from the current GC though (it's based on the Tango GC, which has remained relatively static for the past years while druntime's GC has improved).

Sent from my iPhone

On Dec 5, 2011, at 3:39 PM, Trass3r <un known.com> wrote:

 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional passtime of trying to speed up D's garbage
 collector again

Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC?

 So true, it's been rotting in that branch.

Dec 06 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Dec 4, 2011, at 5:46 PM, dsimcha wrote:

 I'm at my traditional passtime of trying to speed up D's garbage
 collector again, and I've stumbled on the fact that rt_finalize is
 taking up a ridiculous share of the time (~30% of total runtime) on a
 benchmark where huge numbers of classes **that don't have destructors**
 are being created and collected.  Here's the code to this function, from
 lifetime.d:

 extern (C) void rt_finalize(void* p, bool det = true)
 {
     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

     if (p) // not necessary if called from gc
     {
         ClassInfo** pc = cast(ClassInfo**)p;

         if (*pc)
         {
             ClassInfo c = **pc;
             byte[]    w = c.init;

             try
             {
                 if (det || collectHandler is null || collectHandler(cast(Object)p))
                 {
                     do
                     {
                         if (c.destructor)
                         {
                             fp_t fp = cast(fp_t)c.destructor;
                             (*fp)(cast(Object)p); // call destructor
                         }
                         c = c.base;
                     } while (c);
                 }
                 if ((cast(void**)p)[1]) // if monitor is not null
                     _d_monitordelete(cast(Object)p, det);
                 (cast(byte*) p)[0 .. w.length] = w[];  // WTF?
             }
             catch (Throwable e)
             {
                 onFinalizeError(**pc, e);
             }
             finally  // WTF?
             {
                 *pc = null; // zero vptr
             }
         }
     }
 }

 Getting rid of the stuff I've marked with //WTF? comments (namely the
 finally block and the re-initializing of the memory occupied by the
 finalized object) speeds things up by ~15% on the benchmark in question.
 Why do we care what state the blob of memory is left in after we
 finalize it?  I can kind of see that we want to clear things if
 delete/clear was called manually and we want to leave the object in a
 state that doesn't look valid.  However, this has significant
 performance costs and IIRC is already done in clear() and delete is
 supposed to be deprecated.  Furthermore, I'd like to get rid of the
 finally block entirely, since I assume its presence and the effect on
 the generated code is causing the slowdown, not the body, which just
 assigns a pointer.

Now that having multiple in-flight exceptions is actually legal, we can probably just throw out the try/catch entirely. The try/catch and call to onFinalizeError is a holdover from when it was effectively illegal to throw from a finalizer.
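Dropping the EH machinery might reduce the function to something like the sketch below. This is a standalone model with stubbed runtime hooks (`collectHandler`, `monitorDelete`, and `finalizeNoEH` are stand-ins, not the actual lifetime.d symbols); a throwing destructor would simply propagate, chaining with any in-flight exception:

```d
__gshared bool function(Object) collectHandler = null;  // stub for the runtime hook

void monitorDelete(Object o, bool det) { }  // stub standing in for _d_monitordelete

void finalizeNoEH(void* p, bool det = true)
{
    if (p is null)
        return;
    auto pc = cast(ClassInfo**) p;
    if (*pc is null)
        return;  // already finalized: the vptr was zeroed

    if (det || collectHandler is null || collectHandler(cast(Object) p))
    {
        // Walk the class hierarchy and run each destructor; no try/catch,
        // no finally, no onFinalizeError detour.
        for (auto c = **pc; c !is null; c = c.base)
            if (c.destructor !is null)
                (cast(void function(Object)) cast(void*) c.destructor)(cast(Object) p);
    }
    if ((cast(void**) p)[1] !is null)   // monitor slot is the second word
        monitorDelete(cast(Object) p, det);
    *pc = null;   // zero vptr; skipped if a destructor threw
}
```

Note one behavioral difference of this sketch: when a destructor throws, the vptr is left intact, so the object is not marked as finalized.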
Dec 07 2011
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

 I'm at my traditional passtime of trying to speed up D's garbage 
 collector again,

Are you going to create a pull request soon? So people will be able to try it with their D programs and spot potential troubles... Bye, bearophile
Dec 09 2011