www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Faster Virtual Method Dispatch

reply "Craig Black" <cblack ara.com> writes:
Most programmers do not realize the overhead involved with calling a method 
virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
lookup and a invoking a function pointer.  However, we must also consider 
that the CPU has no way of performing branch prediction with vtable lookups, 
so the pipeline is trashed.  Thus, virtual method calls incur a large 
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
optimizations is replacing the vtable lookup with if statements for the most 
commonly called virtual methods followed by the vtable lookup, so that 
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but 
knowing the little that I do know I have come up with a simple approach that 
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the 
methods directly rather than using fuction pointers would allow the method 
dispatch to take advantage of pipelining.  Algorithmically, however, this is 
no longer O(1), but O(log n), which is bad.  So I would assume that it would 
only be faster if n (or the number of methods in the vtable) is lower than a 
given threshold.  This threshold would be determined by testing and would 
perhaps be different on different processors.  Thus, in order to ensure that 
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization 
strategies.  I don't know how well such an algorithm would work in practice. 
Comments are obviously welcome.

-Craig 
Apr 23 2006
next sibling parent reply kris <foo bar.com> writes:
Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 
 

I'm no expert either, Craig, yet it seems that there's more to it? An indirect address lookup (via a vtable) doesn't trash the pipeline per se; I forget the cycles involved but there's very little overhead at all if the vtable is a "popular" one, or otherwise happens to be in the cache lines. None of the other in-flight instructions have to be flushed at that time. If the vtable is not cached, then a bubble will appear in the pipeline while main-memory is addressed. But, again (as I understand it), this doesn't invalidate any in-flight instructions; just feeds the core less efficiently. On the other hand, a branch misprediction must invalidate a large number of in-flight instruction; usually all of them (unless multi-path speculative execution is occuring). The P4 is particularly maimed by this, as compared to its peers, because of its particularly long pipeline (which, in turn, was needed to reached the frequencies Intel marketeers felt they needed to drive AMD off the map). Obviously I don't know enough about this, but it might be interesting to research how one might introduce some locality-of-reference bias on the more common calls made. For instance, some fancy runtime might rewrite "common" calls to a dynamically constructed "global vtable". That would perhaps avoid the misprediction concerns, whilst managing to retain the globally popular set of lookups within cache? Clearly MS compiler-engineers have noticed something beneficial. On the other hand, it seems a bit like flogging a dead-horse to introduce something that will likely cause at least one branch misprediction? I suppose the idea is that a direct-call (versus an indirect one) will enable speculative execution within the impending function-call itself? Then again, heavyweight speculative execution is starting to lose favour against a grouping of simple symmetrical cores on one die (a la Cell, Kilocore, Niagara, etc). The idea there is that it doesn't matter if a bubble appears, or a core stalls; if you partition an app to take advantage of more than one execution thread, then it will (overall) execute faster anyway. The silicon saved via simplicity is applied to other areas such as massive on-die bandwidth. Of course, that doesn't help single-threaded applications, but the industry is changing, albeit slowly :) Wither the Transputer? <g> 2 Cents
Apr 23 2006
parent BCS <BCS_member pathlink.com> writes:
In article <e2gkms$rur$1 digitaldaemon.com>, kris says...
Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 
 

I'm no expert either, Craig, yet it seems that there's more to it? An indirect address lookup (via a vtable) doesn't trash the pipeline per se; I forget the cycles involved but there's very little overhead at all if the vtable is a "popular" one, or otherwise happens to be in the

I'm not following you train of thought completely but... I don't think that the cache hit is the issue, I think that the fact that the function call is to an interdict address is the issue. The CPU might not be able to safely predict where it's supposed to go until it get all the way to the function call. As a result the instruction queue gets completely drained.
cache lines. None of the other in-flight instructions have to be flushed 
at that time. If the vtable is not cached, then a bubble will appear in 
the pipeline while main-memory is addressed. But, again (as I understand 
it), this doesn't invalidate any in-flight instructions; just feeds the 
core less efficiently.

On the other hand, a branch misprediction must invalidate a large number 
of in-flight instruction; usually all of them (unless multi-path 
speculative execution is occuring). The P4 is particularly maimed by 
this, as compared to its peers, because of its particularly long 
pipeline (which, in turn, was needed to reached the frequencies Intel 
marketeers felt they needed to drive AMD off the map).

Obviously I don't know enough about this, but it might be interesting to 
research how one might introduce some locality-of-reference bias on the 
more common calls made. For instance, some fancy runtime might rewrite 
"common" calls to a dynamically constructed "global vtable". That would 
perhaps avoid the misprediction concerns, whilst managing to retain the 
globally popular set of lookups within cache?

Clearly MS compiler-engineers have noticed something beneficial. On the 
other hand, it seems a bit like flogging a dead-horse to introduce 
something that will likely cause at least one branch misprediction? I 
suppose the idea is that a direct-call (versus an indirect one) will 
enable speculative execution within the impending function-call itself?

Then again, heavyweight speculative execution is starting to lose favour 
against a grouping of simple symmetrical cores on one die (a la Cell, 
Kilocore, Niagara, etc). The idea there is that it doesn't matter if a 
bubble appears, or a core stalls; if you partition an app to take 
advantage of more than one execution thread, then it will (overall) 
execute faster anyway. The silicon saved via simplicity is applied to 
other areas such as massive on-die bandwidth. Of course, that doesn't 
help single-threaded applications, but the industry is changing, albeit 
slowly :)

Wither the Transputer? <g>

2 Cents

Apr 23 2006
prev sibling next sibling parent reply BCS <BCS_member pathlink.com> writes:
In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...
Most programmers do not realize the overhead involved with calling a method 
virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
lookup and a invoking a function pointer.  However, we must also consider 
that the CPU has no way of performing branch prediction with vtable lookups, 
so the pipeline is trashed.  Thus, virtual method calls incur a large 
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
optimizations is replacing the vtable lookup with if statements for the most 
commonly called virtual methods followed by the vtable lookup, so that 
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but 
knowing the little that I do know I have come up with a simple approach that 
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the 
methods directly rather than using fuction pointers would allow the method 
dispatch to take advantage of pipelining.  Algorithmically, however, this is 
no longer O(1), but O(log n), which is bad.  So I would assume that it would 
only be faster if n (or the number of methods in the vtable) is lower than a 

IIRC what is being searched for isn't which entry in this vtbl, but which vtbl to use. Or maybe I am miss understanding you.
given threshold.  This threshold would be determined by testing and would 
perhaps be different on different processors.  Thus, in order to ensure that 
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization 
strategies.  I don't know how well such an algorithm would work in practice. 
Comments are obviously welcome.

-Craig 

Maybe some sort of run-and-review optimization could be done. This would entail having the compiler build a heavily instrumented version of the program and then run it a bunch to find out which classes are most commonly used at a given call and then on the next build let the compiler optimize each call based on the measured data. If this is class ABC call ABC.foo else if XYZ call XYZ.foo else use the vtbl
Apr 23 2006
parent "Craig Black" <cblack ara.com> writes:
"BCS" <BCS_member pathlink.com> wrote in message 
news:e2go6m$13u7$1 digitaldaemon.com...
 In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...
Most programmers do not realize the overhead involved with calling a 
method
virtually.  A virtual method call is an O(1) operation.  Just  a vtable
lookup and a invoking a function pointer.  However, we must also consider
that the CPU has no way of performing branch prediction with vtable 
lookups,
so the pipeline is trashed.  Thus, virtual method calls incur a large
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided
optimizations is replacing the vtable lookup with if statements for the 
most
commonly called virtual methods followed by the vtable lookup, so that
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, 
but
knowing the little that I do know I have come up with a simple approach 
that
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the
methods directly rather than using fuction pointers would allow the method
dispatch to take advantage of pipelining.  Algorithmically, however, this 
is
no longer O(1), but O(log n), which is bad.  So I would assume that it 
would
only be faster if n (or the number of methods in the vtable) is lower than 
a

IIRC what is being searched for isn't which entry in this vtbl, but which vtbl to use. Or maybe I am miss understanding you.

Nope you are right. Which vtable. My bad.
given threshold.  This threshold would be determined by testing and would
perhaps be different on different processors.  Thus, in order to ensure 
that
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization
strategies.  I don't know how well such an algorithm would work in 
practice.
Comments are obviously welcome.

-Craig

Maybe some sort of run-and-review optimization could be done. This would entail having the compiler build a heavily instrumented version of the program and then run it a bunch to find out which classes are most commonly used at a given call and then on the next build let the compiler optimize each call based on the measured data. If this is class ABC call ABC.foo else if XYZ call XYZ.foo else use the vtbl

Yes, like MSVC's profile guided optimization. A more general optimization would be nice too, but you could probably be even faster and more efficient with a profile-guided one. One real good thing about profile guided optimizations is that the compiler can spend more time on optimizing performance-critical code sections and less time on code that is not pertinent to performance. -Craig
Apr 23 2006
prev sibling next sibling parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
I'm far from being an expert .. but I thought vtable lookups involve two 
  or three jump/call instructions at the assembly level; at least that's 
the impression I got from stepping through code in the VC++ debugger.

Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 
 

Apr 23 2006
parent "Craig Black" <cblack ara.com> writes:
"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
news:e2gr8j$18r2$1 digitaldaemon.com...
 I'm far from being an expert .. but I thought vtable lookups involve two 
 or three jump/call instructions at the assembly level; at least that's the 
 impression I got from stepping through code in the VC++ debugger.

From one non-expert to another, I have no idea how many assembly language instructions are involved. However, I have been informed by some very experienced programmers that virtual methods don't mesh well with pipelining. This is the main performance issue with calling a virtual method. -Craig
Apr 23 2006
prev sibling parent reply Dave <Dave_member pathlink.com> writes:
Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 

This is a little OT because it involves removing the need for a lot of VMD altogether before the assembly code is even generated instead of a way to optimize the VMD. But, I think it would be good to take a look at 'auto-finalization' where the compiler front-end determines that a virtual method is not needed so just does a direct call. Not only can more direct calls be made, but potentially the vtable gets smaller, the lookups will be less frequent and there will be no need for any branch-prediction for direct calls, getting rid of much of the need for VMD optimizations for many programs in the first place. Maybe the semantic processing differences of 'import' over #include would make this more straight-forward to do for D compilers over C++. - Dave
Apr 23 2006
parent reply kris <foo bar.com> writes:
Dave wrote:
 Craig Black wrote:
 
 Most programmers do not realize the overhead involved with calling a 
 method virtually.  A virtual method call is an O(1) operation.  Just  
 a vtable lookup and a invoking a function pointer.  However, we must 
 also consider that the CPU has no way of performing branch prediction 
 with vtable lookups, so the pipeline is trashed.  Thus, virtual method 
 calls incur a large performance pentalty.

 Interestly, I found that one of the MSVC++ 2005 compiler's profile 
 guided optimizations is replacing the vtable lookup with if statements 
 for the most commonly called virtual methods followed by the vtable 
 lookup, so that branch prediction can be performed in the majority of 
 cases.

 I do not consider myself an expert in the area of compiler 
 optimization, but knowing the little that I do know I have come up 
 with a simple approach that may improve performance.  This approach is 
 described below.

 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the 
 method dispatch to take advantage of pipelining.  Algorithmically, 
 however, this is no longer O(1), but O(log n), which is bad.  So I 
 would assume that it would only be faster if n (or the number of 
 methods in the vtable) is lower than a given threshold.  This 
 threshold would be determined by testing and would perhaps be 
 different on different processors.  Thus, in order to ensure that this 
 approach is always faster, a relatively low number should be used.

 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in 
 practice. Comments are obviously welcome.

 -Craig

This is a little OT because it involves removing the need for a lot of VMD altogether before the assembly code is even generated instead of a way to optimize the VMD. But, I think it would be good to take a look at 'auto-finalization' where the compiler front-end determines that a virtual method is not needed so just does a direct call. Not only can more direct calls be made, but potentially the vtable gets smaller, the lookups will be less frequent and there will be no need for any branch-prediction for direct calls, getting rid of much of the need for VMD optimizations for many programs in the first place. Maybe the semantic processing differences of 'import' over #include would make this more straight-forward to do for D compilers over C++. - Dave

Yes, I recall seeing Walter write something along these lines somewhere in the D documentation: being able to compile the entire program, versus linking things in from a library, permits much more aggressive finalization of methods as you describe. That's definately a benefit.
Apr 23 2006
parent reply Dan <Dan_member pathlink.com> writes:
Actually, having "if statements" is absolutely TERRIBLE for trying to improve
branch prediction.  By it's very nature, if statements are branches, so you are
causing at least one branch misprediction anyways.

Furthermore, pointers are not O(1) algorithms.  What the computer is doing is
performing a:

LEA (reg) [(v_pointer)]; 
J(cc)|JMP [(reg)];

My understanding is that the v_pointer address is known at compile time, however
it implies:

that the v_table's cache has to be loaded every time a function is called
that extra lea buggers instruction pairing tagging on several cycles
the potential that the v_table is a cache miss (highly unlikely a page miss)

Interestingly however, if one were to prepend some Self Modifying Code to the
program that instead loaded the vtable into a set of addresses attached to the
vtable, then one could achieve the benefits of static functions and most of the
benefits of virtual functions at the cost of having an extra void*.sizeof for
every function call, and the SMC code overhead.

The problem is that *some* functions would thereafter be virtual so that a
single pointer could be used to point to any of a set of functions - a case not
easily rendered with static functions (but still possible remembering that all
code is just data)

That said, I think the overhead involved in using virtual functions is quite
reasonable and comparable to the overhead in using 'calling conventions' versus
a internally-optimized system with primarily register arguments.  It's 7-8
clocks every function call with a little bit of cache overhead and the potential
for a cache miss.

Ultimately though, I think the GC causes the highest level of performance
troubles on the benchmarks.  I love it to death though and wish someone would
get their hands dirty making it smaller and sexier.

Thanks!
Apr 24 2006
parent reply "Craig Black" <cblack ara.com> writes:
"Dan" <Dan_member pathlink.com> wrote in message 
news:e2im9o$29gi$1 digitaldaemon.com...
 Actually, having "if statements" is absolutely TERRIBLE for trying to 
 improve
 branch prediction.  By it's very nature, if statements are branches, so 
 you are
 causing at least one branch misprediction anyways.

 Furthermore, pointers are not O(1) algorithms.  What the computer is doing 
 is
 performing a:

 LEA (reg) [(v_pointer)];
 J(cc)|JMP [(reg)];

Maybe you are just way over my head but I'm not quite following how this is anything bug O(1).
 My understanding is that the v_pointer address is known at compile time, 
 however
 it implies:

 that the v_table's cache has to be loaded every time a function is called
 that extra lea buggers instruction pairing tagging on several cycles
 the potential that the v_table is a cache miss (highly unlikely a page 
 miss)

 Interestingly however, if one were to prepend some Self Modifying Code to 
 the
 program that instead loaded the vtable into a set of addresses attached to 
 the
 vtable, then one could achieve the benefits of static functions and most 
 of the
 benefits of virtual functions at the cost of having an extra void*.sizeof 
 for
 every function call, and the SMC code overhead.]

OK, now you are definitely over my head. :)
 The problem is that *some* functions would thereafter be virtual so that a
 single pointer could be used to point to any of a set of functions - a 
 case not
 easily rendered with static functions (but still possible remembering that 
 all
 code is just data)

 That said, I think the overhead involved in using virtual functions is 
 quite
 reasonable and comparable to the overhead in using 'calling conventions' 
 versus
 a internally-optimized system with primarily register arguments.  It's 7-8
 clocks every function call with a little bit of cache overhead and the 
 potential
 for a cache miss.

 Ultimately though, I think the GC causes the highest level of performance
 troubles on the benchmarks.  I love it to death though and wish someone 
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free. -Craig
Apr 24 2006
parent reply BCS <BCS_member pathlink.com> writes:
In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
"Dan" <Dan_member pathlink.com> wrote in message 
news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of performance
 troubles on the benchmarks.  I love it to death though and wish someone 
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.

A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and Iím still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.
Apr 24 2006
next sibling parent reply "Craig Black" <cblack ara.com> writes:
"BCS" <BCS_member pathlink.com> wrote in message 
news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
"Dan" <Dan_member pathlink.com> wrote in message
news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish someone
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.

A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.

I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd. The benefit of GC is not performance, but convenience. If you want the convenience of not having to free memory manually then use GC. However if you want performance, use malloc and free (new and delete). That is why GC should be completely optional, not just on a per class basis. If I want to completely eliminate GC from my application, I should be able to do so and still be able to make use of any standard library capabilities. -Craig
Apr 24 2006
next sibling parent reply kris <foo bar.com> writes:
Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 
In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...

"Dan" <Dan_member pathlink.com> wrote in message
news:e2im9o$29gi$1 digitaldaemon.com...

Ultimately though, I think the GC causes the highest level of 
performance
troubles on the benchmarks.  I love it to death though and wish someone
would
get their hands dirty making it smaller and sexier.

Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.

A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.

I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd. The benefit of GC is not performance, but convenience. If you want the convenience of not having to free memory manually then use GC. However if you want performance, use malloc and free (new and delete). That is why GC should be completely optional, not just on a per class basis. If I want to completely eliminate GC from my application, I should be able to do so and still be able to make use of any standard library capabilities. -Craig

The nice thing about D, as a language, is that it supports both approaches. The bad thing, for your needs, is that phobos is tightly bound to the GC. My particular gripes are with the GC activity caused via utf conversions, std.string and so on. There's a surprising, perhaps troublesome, amount of GC activity generated from within phobos itself. This is one of the reasons alternative libraries exist.
Apr 24 2006
parent reply "Craig Black" <cblack ara.com> writes:
 The nice thing about D, as a language, is that it supports both 
 approaches. The bad thing, for your needs, is that phobos is tightly bound 
 to the GC. My particular gripes are with the GC activity caused via utf 
 conversions, std.string and so on. There's a surprising, perhaps 
 troublesome, amount of GC activity generated from within phobos itself. 
 This is one of the reasons alternative libraries exist.

Correct. But since the language core relies on Phobos I don't think that there is a reasonable way to completely decouple D from GC. To me this situation is quite ugly. It would be nice if there was a way to replace Phobos completely. After writing standard libraries that don't rely on GC, we could easily produce D non-GC vs. D GC benchmarks. Some steps would have to be taken in order to keep some source compatibility between the two modes, so that it would't turn into two completely different languages. Manual memory management code could be versioned out when compiled in GC mode. -Craig
Apr 24 2006
parent reply kris <foo bar.com> writes:
Craig Black wrote:
The nice thing about D, as a language, is that it supports both 
approaches. The bad thing, for your needs, is that phobos is tightly bound 
to the GC. My particular gripes are with the GC activity caused via utf 
conversions, std.string and so on. There's a surprising, perhaps 
troublesome, amount of GC activity generated from within phobos itself. 
This is one of the reasons alternative libraries exist.

Correct. But since the language core relies on Phobos I don't think that there is a reasonable way to completely decouple D from GC. To me this situation is quite ugly. It would be nice if there was a way to replace Phobos completely.

That's really quite easy, Craig; start with Ares, and then apply whatever you need on top. For example, Ares and Mango go together really well, and both make a point about not allocating memory where there's a sensible alternative. In fact, the Mango HttpServer does not touch the GC even once per service request, once it warms up -- all hail to the groovy array-slice, and behold the power of the trout-slap :) (kudos to IRC #d) There again, I suspect fear of the GC is somewhat overplayed; although it certainly troubles me when it is abused. Thus, it doesn't bother me at all that Ares kinda' needs a GC also. Perhaps it's more a question of libraries, and code in general, treating it with a little respect :) After all, using any kind of malloc() should perhaps be a question-mark for high-performance code - Kris
Apr 24 2006
parent Sean Kelly <sean f4.ca> writes:
kris wrote:
 
 There again, I suspect fear of the GC is somewhat overplayed; although 
 it certainly troubles me when it is abused. Thus, it doesn't bother me 
 at all that Ares kinda' needs a GC also. Perhaps it's more a question of 
 libraries, and code in general, treating it with a little respect :)

At the very least, a GC is required if you want to do much of anything useful with dynamic arrays. For apps with a huge footprint, I may follow this route and define a custom allocator for all object types.
 After all, using any kind of malloc() should perhaps be a question-mark 
 for high-performance code

True enough. Sean
Apr 25 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
 "Dan" <Dan_member pathlink.com> wrote in message
 news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish someone
 would
 get their hands dirty making it smaller and sexier.

misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.

A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.

I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd.

There's more to it than that. One of the big benefits of GC is that fewer allocations are necessary - and a sure way to better allocation performance is to do less of them. The reason fewer allocations are necessary is that the normal behavior in C/C++, when faced with two pointers to the same object, is to copy the object. That way, there's no grief with worrying about freeing the object that someone else points to. To see this effect at work, see www.digitalmars.com/d/cppstrings.html. This problem is bad enough that the notion of shared_ptr<> has appeared. shared_ptr<> implements reference counting to eliminate the need for extra copies. But it has its own overhead problems: 1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening. 2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented. The inc/dec must be done in an exception safe manner, i.e. they need to be unwindable. Even worse, for multithreaded apps these inc/dec's need to be synchronized. And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.
Apr 24 2006
next sibling parent reply "Craig Black" <cblack ara.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:e2k12i$19bi$1 digitaldaemon.com...
 Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
 "Dan" <Dan_member pathlink.com> wrote in message
 news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish 
 someone
 would
 get their hands dirty making it smaller and sexier.

really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.

A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.

I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd.

There's more to it than that. One of the big benefits of GC is that fewer allocations are necessary - and a sure way to better allocation performance is to do less of them. The reason fewer allocations are necessary is that the normal behavior in C/C++, when faced with two pointers to the same object, is to copy the object. That way, there's no grief with worrying about freeing the object that someone else points to. To see this effect at work, see www.digitalmars.com/d/cppstrings.html. This problem is bad enough that the notion of shared_ptr<> has appeared. shared_ptr<> implements reference counting to eliminate the need for extra copies. But it has its own overhead problems: 1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening. 2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented. The inc/dec must be done in an exception safe manner, i.e. they need to be unwindable. Even worse, for multithreaded apps these inc/dec's need to be synchronized. And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.

Hmmm. I have glanced at the word count example many times but was not aware of why it was faster in D until now. Is it really the GC paradigm that allows this performance? I think I'm beginning to finally see some method to the madness. Perhaps I need to rethink my stance on GC a little. It would seem then that GC can come out on top if there is a lot of sharing of allocated memory e.g. multiple references to each object. However, if you are certain that there will be only a single reference to each object, manual memory management should always outperform GC. Am I on the right track here? -Craig
Apr 24 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Craig Black wrote:
 Hmmm.  I have glanced at the word count example many times but was not aware 
 of why it was faster in D until now.  Is it really the GC paradigm that 
 allows this performance?

Yes. It turns out to be very difficult for C++ to match it.
 I think I'm beginning to finally see some method 
 to the madness.  Perhaps I need to rethink my stance on GC a little.
 
 It would seem then that GC can come out on top if there is a lot of sharing 
 of allocated memory e.g. multiple references to each object.  However, if 
 you are certain that there will be only a single reference to each object, 
 manual memory management should always outperform GC.  Am I on the right 
 track here?

Even in that case, gc can win. For example, for explicit memory allocation in the presence of exception handling, one has to carefully set up handlers that will delete any temporaries if an exception is thrown: S *p = new S(); p->a = foo(); bar(p); delete p; What if foo() throws an exception? We've got a memory leak in p, so we've got to add code to fix that. Dealing with that is an endless source of bugs and tedium in C++. With gc, nothing needs to be added. Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted. Compacting existing allocated memory means that allocating a new object is as easy as bumping a pointer, whereas the usual malloc algorithm requires searching a free list for a best match. So, the obvious question then becomes, if gc is faster, why is Java slower than C++? (I know, lots of Java people will claim it really isn't slower <g>.) The reason is that Java's design requires a lot more objects to be allocated than in C++ - for example, you cannot aggregate arrays or other objects inside other objects, they have to be allocated separately. More objects allocated means lower performance.
Apr 24 2006
next sibling parent reply "Craig Black" <cblack ara.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:e2kffd$1rl5$1 digitaldaemon.com...
 Craig Black wrote:
 Hmmm.  I have glanced at the word count example many times but was not 
 aware of why it was faster in D until now.  Is it really the GC paradigm 
 that allows this performance?

Yes. It turns out to be very difficult for C++ to match it.
 I think I'm beginning to finally see some method to the madness.  Perhaps 
 I need to rethink my stance on GC a little.

 It would seem then that GC can come out on top if there is a lot of 
 sharing of allocated memory e.g. multiple references to each object. 
 However, if you are certain that there will be only a single reference to 
 each object, manual memory management should always outperform GC.  Am I 
 on the right track here?

Even in that case, gc can win. For example, for explicit memory allocation in the presence of exception handling, one has to carefully set up handlers that will delete any temporaries if an exception is thrown: S *p = new S(); p->a = foo(); bar(p); delete p; What if foo() throws an exception? We've got a memory leak in p, so we've got to add code to fix that. Dealing with that is an endless source of bugs and tedium in C++. With gc, nothing needs to be added.

I don't see a performance issue here. Nor do think this would be terribly difficult to solve in D, especially with the fancy-shmancy new scope() features, as long as scope(exit) works properly with exceptions. This would, however, pose a problem for C++.
 Another advantage of gc (that isn't implemented, but could be, in D's gc) 
 is that allocated memory can be compacted. Compacting existing allocated 
 memory means that allocating a new object is as easy as bumping a pointer, 
 whereas the usual malloc algorithm requires searching a free list for a 
 best match.

I can see the obvious performance improvement here, but I am skeptical of whether it can actually be done with D, given it's liberal syntax with raw pointers. Seems to me that if the GC started moving pointers around, the compiler would have to enforce more safety rules with regards to pointers.
 So, the obvious question then becomes, if gc is faster, why is Java slower 
 than C++? (I know, lots of Java people will claim it really isn't slower 
 <g>.) The reason is that Java's design requires a lot more objects to be 
 allocated than in C++ - for example, you cannot aggregate arrays or other 
 objects inside other objects, they have to be allocated separately. More 
 objects allocated means lower performance.

Well, I think you can do this with C# since it supports stack objects. Even still, C# is typically slower than Java (not by much though). I confess that I don't know all the technical details with VM's, but I have run my own benchmarks. VM-based code is just way slower than native code. Can't really tell you exactly why though because I don't really know or care to know what goes on inside a VM. (Except for the purpose of explaining to people why it is so much slower.) -Craig
Apr 25 2006
parent Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 
 Well, I think you can do this with C# since it supports stack objects. Even 
 still, C# is typically slower than Java (not by much though).  I confess 
 that I don't know all the technical details with VM's, but I have run my own 
 benchmarks.  VM-based code is just way slower than native code.  Can't 
 really tell you exactly why though because I don't really know or care to 
 know what goes on inside a VM. (Except for the purpose of explaining to 
 people why it is so much slower.)

One reason is that Java compilers perform basically no compile-time optimization, and the JIT does very little as well. C/C++, on the other hand, can see huge performance improvements with optimization turned on. C# does a bit better in this regard, but not tremendously so. However, some VM languages (such as the Lisp variants) seem to perform quite well compared to natively compiled code, though I don't have the experience to say why. Sean
Apr 25 2006
prev sibling parent "Craig Black" <cblack ara.com> writes:
 Another advantage of gc (that isn't implemented, but could be, in D's gc) 
 is that allocated memory can be compacted. Compacting existing allocated 
 memory means that allocating a new object is as easy as bumping a pointer, 
 whereas the usual malloc algorithm requires searching a free list for a 
 best match.

It is also worth mentioning that compacting can be performed without scanning the stack. Manual memory management can still be used along with a compacting allocator. An compacting allocator would certainly outperform CRT malloc and free. There is a tradeoff here because you can no longer use petal-to-the-metal algorithms that make use of raw pointers and pointer arithmetic, or if you do so, you would have to turn the allocator off temporarily to ensure safety. -Craig
Apr 25 2006
prev sibling parent Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 
 Hmmm.  I have glanced at the word count example many times but was not aware 
 of why it was faster in D until now.  Is it really the GC paradigm that 
 allows this performance?  I think I'm beginning to finally see some method 
 to the madness.  Perhaps I need to rethink my stance on GC a little.

For what it's worth, the performance improvement may be attributed to two factors: the C++ map class is a balanced binary tree while D uses a hash table, and C++ indeed copies strings around while D passes references.
 It would seem then that GC can come out on top if there is a lot of sharing 
 of allocated memory e.g. multiple references to each object.  However, if 
 you are certain that there will be only a single reference to each object, 
 manual memory management should always outperform GC.  Am I on the right 
 track here?

D allows manual memory management as well, so you aren't forced to use the GC if you don't want it in most cases. However, the lack of copy ctors does essentially eliminate the possibility of using smart pointers. One downside of a GC is that it needs to store bookkeeping data, which may amount to as much as 50% memory overhead. This can be problematic for very large applications. Sean
Apr 25 2006
prev sibling parent reply Mike Capp <mike.capp gmail.com> writes:
In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
1) for every allocation, shared_ptr<> allocates an *extra* block of 
memory to hold its accounting information. So you've got, out of the 
box, twice as many allocations/deallocations happening.

Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)
2) for every copy of the shared_ptr<>, like passing one to a function, 
returning one, etc., the reference count has to be incremented, then 
decremented.

Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.
And even with all that falderol, shared_ptr<> still can't do slicing, 
which are an important efficiency improvement.

Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing. [And in a subsequent posting]
 Another advantage of gc (that isn't implemented, but could be,
 in D's gc) is that allocated memory can be compacted.

Has there ever been a language that successfully combined compacting GC with raw pointers? How would this work? And it's going to be tough to drop raw pointers without breaking ease-of-use for C libraries. cheers Mike
Apr 25 2006
next sibling parent reply "Craig Black" <cblack ara.com> writes:
"Mike Capp" <mike.capp gmail.com> wrote in message 
news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
1) for every allocation, shared_ptr<> allocates an *extra* block of
memory to hold its accounting information. So you've got, out of the
box, twice as many allocations/deallocations happening.

Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)

What's intrusive refcounting? Please use small words. My brain is almost full right now. :)
2) for every copy of the shared_ptr<>, like passing one to a function,
returning one, etc., the reference count has to be incremented, then
decremented.

Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.
And even with all that falderol, shared_ptr<> still can't do slicing,
which are an important efficiency improvement.

Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.

If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something. -Craig
Apr 25 2006
parent reply Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 "Mike Capp" <mike.capp gmail.com> wrote in message 
 news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of
 memory to hold its accounting information. So you've got, out of the
 box, twice as many allocations/deallocations happening.

dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)

What's intrusive refcounting? Please use small words. My brain is almost full right now. :)

Maintaining the reference count within the class being pointed to instead of within the shared pointer code.
 2) for every copy of the shared_ptr<>, like passing one to a function,
 returning one, etc., the reference count has to be incremented, then
 decremented.

smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.
 And even with all that falderol, shared_ptr<> still can't do slicing,
 which are an important efficiency improvement.

containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.

If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.

I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-) Sean
Apr 25 2006
parent reply "Craig Black" <cblack ara.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 "Mike Capp" <mike.capp gmail.com> wrote in message 
 news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of
 memory to hold its accounting information. So you've got, out of the
 box, twice as many allocations/deallocations happening.

you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)

What's intrusive refcounting? Please use small words. My brain is almost full right now. :)

Maintaining the reference count within the class being pointed to instead of within the shared pointer code.
 2) for every copy of the shared_ptr<>, like passing one to a function,
 returning one, etc., the reference count has to be incremented, then
 decremented.

smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.
 And even with all that falderol, shared_ptr<> still can't do slicing,
 which are an important efficiency improvement.

class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.

If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.

I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)

That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow. -Craig
Apr 25 2006
parent reply Sean Kelly <sean f4.ca> writes:
In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says...
"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 If you think this is doable, then you should write a C++ version of 
 wordcount based on this principal.  If you can get comparable performance 
 to the D version then you may be on to something.

I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)

That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow.

I merely meant that that's not the point of the wordcount example. However, I'll try and find the time to recode it using slices and hashtables in the next few days. IIRC that brings performance in line with the D version, so I suppose that means you can "have your cake and eat it too." It's worth mentioning, however, that slicing with explicit memory management doesn't have quite the utility it does with garbage collection. It works great for apps where the entire buffer can be loaded en masse, as with the wordcount example, but in other cases it's a bit more complicated to manage the lifetime of the referenced data. Thus the GC version is still a clear win in terms of ease of use, whether or not performance can be matched with specialized code elsewhere. Sean
Apr 25 2006
parent reply "Craig Black" <cblack ara.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lpjj$hvv$1 digitaldaemon.com...
 In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says...
"Sean Kelly" <sean f4.ca> wrote in message
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 If you think this is doable, then you should write a C++ version of
 wordcount based on this principal.  If you can get comparable 
 performance
 to the D version then you may be on to something.

I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)

That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow.

I merely meant that that's not the point of the wordcount example. However, I'll try and find the time to recode it using slices and hashtables in the next few days. IIRC that brings performance in line with the D version, so I suppose that means you can "have your cake and eat it too." It's worth mentioning,

Great! I look forward to seeing the results.
 however, that slicing with explicit memory management doesn't have quite 
 the
 utility it does with garbage collection.  It works great for apps where 
 the
 entire buffer can be loaded en masse, as with the wordcount example, but 
 in
 other cases it's a bit more complicated to manage the lifetime of the 
 referenced
 data.  Thus the GC version is still a clear win in terms of ease of use, 
 whether
 or not performance can be matched with specialized code elsewhere.

Different programmers have different problems that require different solutions. I doubt that we will ever find one approach to make everyone happy about everything. That's why I think GC should be completely optional. Ideally, turning of GC could be as easy as specifying a compiler switch. But maybe I'm wrong. Maybe GC can be faster than manual memory management, but I doubt it. Either way, we need to optimize this bottleneck somehow. Although D currenty delivers competitive performance, and is in some ways is faster than C++, I still believe we have a few more steps to take before we can say with confidence that D provides performance superior to C++. IMO, the fastest way to get extreme performance from D is (1) make GC optional and (2) provide a safe, easy way to take advantage of concurrency on multi-core systems. (There is already work being done on #2 for C++, so we should try to beat them to the punch, and provide a superior solution for D.) This would make D very attractive for performance-demanding applications. I am especially thinking of the game industry, with which I am indirectly involved. This is the kind of stuff that they are very concerned about. If we do not provide a solution to these issues, I fear it will be a big deterrent for many, especially since migrating large projects to D from C++ requires a large investment. -Craig
Apr 25 2006
parent reply Sean Kelly <sean f4.ca> writes:
I ran some quick tests this morning using the word count example code 
and here are my results so far:


D version (wc):

Execution time: 0.179 s
Execution time: 0.183 s
Execution time: 0.149 s

C++ version 1 (wccpp2):

Execution time: 0.216 s
Execution time: 0.263 s
Execution time: 0.183 s

C++ version 2 (wccpp3):

Execution time: 0.221 s
Execution time: 0.168 s
Execution time: 0.185 s

C++ version 3 (wccpp4):

Execution time: 0.191 s
Execution time: 0.243 s
Execution time: 0.142 s


The apps were compiled according to the word count description and timed 
using ptime (I had some applications open during the tests so the 
results may be a tad inaccurate).  wccpp2 is straight from the web page, 
wccpp3 replaces the C++ IO code with a C version quite similar to the D 
code in std.file, and wccpp4 replaces std::string with slice<char>.

You'll notice that there is practically no difference between the 
std::string and slice tests (wccpp3 and wccpp4), and I think this is to 
be expected.  Typical C++ string implementations have a "small string 
optimization" to avoid dynamic allocations for strings of less than 16 
characters.  And since there are likely very few words in the sample 
text that are larger than this, there are few if any dynamic allocations 
being performed by std::string.  Thus I think the real notable 
performance difference between C++ and D is that C++ uses a balanced 
binary tree while D uses a hash table.

I tried replacing std::map with a hash table implementation (technically 
an unordered_map implemented according to C++ TR1 that I did for a 
project a while back), but DMC choked on it and I don't have any more 
time to spend on this right now.  When I get a chance, I'll try to 
verify that the unordered_map code conforms to the C++ spec (it compiles 
cleanly at WL:4 in Visual Studio 2005) and will file a bug report for 
DMC if it does.  For reference, the bulk of the errors I saw were 
related to symbol lookup--a template class using static and non-static 
data from a parent class without fully qualifying the names--adding some 
'using' statements took care of these.  The final one was a bit more 
subtle and offered me almost no useful information to go on, so that one 
may take some investigation.


Sean
Apr 27 2006
parent reply Sean Kelly <sean f4.ca> writes:
Quick follow-up.  I decided to re-run the tests using the full text of 
the King James Bible (http://www.gutenberg.org/dirs/etext90/kjv10.txt) 
in an attempt to get better timing resolution, as it's 4,342 KB vs. 
Alice in Wonderland's 160 KB.  Here are my results:


D version (wc):

Execution time: 1.409 s
Execution time: 1.473 s
Execution time: 1.295 s

C++ version 1 (wccpp2):

Execution time: 1.779 s
Execution time: 1.762 s
Execution time: 1.792 s

C++ version 2 (wccpp3):

Execution time: 1.758 s
Execution time: 1.309 s
Execution time: 1.297 s

C++ version 3 (wccpp4):

Execution time: 0.693 s
Execution time: 0.767 s
Execution time: 0.704 s


Surprisingly, the slice code does substantially better than both the 
other C++ versions and the D code.  Perhaps this evening I'll re-run the 
C++ tests under Visual Studio to give the hash table a test drive.


Sean
Apr 27 2006
next sibling parent reply "Craig Black" <cblack ara.com> writes:
 Surprisingly, the slice code does substantially better than both the other 
 C++ versions and the D code.  Perhaps this evening I'll re-run the C++ 
 tests under Visual Studio to give the hash table a test drive.

Wow! It's over twice as fast! How difficult would it be to use the same approach for wordcount in D by bypassing the GC? So what do you think we should do about GC to make D faster? Do you favor making GC optional or writing a more optimized, compacting GC? I personally think that even optimized GC won't come close to code that has been optimized with manual memory management. -Craig
Apr 27 2006
next sibling parent Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 Surprisingly, the slice code does substantially better than both the other 
 C++ versions and the D code.  Perhaps this evening I'll re-run the C++ 
 tests under Visual Studio to give the hash table a test drive.

Wow! It's over twice as fast! How difficult would it be to use the same approach for wordcount in D by bypassing the GC?

The C++ slice code should mirror what is already happening in D--as far as I know, the D code should allocate no memory at all for strings, only for the AA. I suspect the problem may be GC scans of the rather large (4 megabyte) data area where the book contents are stored. If the D code were modified to allocate this memory via malloc instead of new, the D program may speed up substantially. I'll see about dropping the C++ file code into the D app this evening.
 So what do you think we should do about GC to make D faster?  Do you favor 
 making GC optional or writing a more optimized, compacting GC?  I personally 
 think that even optimized GC won't come close to code that has been 
 optimized with manual memory management.

I'm not sure either approach could be said to be "better" than the other insofar as performance is concerned as it really comes down to application design. I'll admit that, until D, I'd always dismissed garbage collection for one reason or another. But after using D for a while I suddenly realized how much time and code I used to invest in managing data ownership issues. I won't go so far as to say that data ownership is not important, but being freed from the yoke of smart pointer code and such I've found both that I'm more productive and that the result is more elegant than its C++ equivalent. My only remaining issue with GC is that it makes it too easy to ignore what's actually going on at a low level. This probably isn't an issue for experienced programmers who can't help but think about these things, but in I still feel that learning programmers should be forced to deal with this stuff simply so they're aware of it. Sean
Apr 27 2006
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 
 So what do you think we should do about GC to make D faster?  Do you favor 
 making GC optional or writing a more optimized, compacting GC?  I personally 
 think that even optimized GC won't come close to code that has been 
 optimized with manual memory management.

By the way, I do think that GC performance can be improved significantly over where it is now, so don't take my other reply as an admission that GC is slow. Also, I suspect that by the time D hits 1.0, GC performance will be much improved over what it is now. This can be achieved a number of different ways, but the most obvious would be for the GC to ignore memory blocks that don't contain pointers. Sean
Apr 27 2006
parent "Craig Black" <cblack ara.com> writes:
 By the way, I do think that GC performance can be improved significantly
 over where it is now, so don't take my other reply as an admission that
 GC is slow.  Also, I suspect that by the time D hits 1.0, GC performance
 will be much improved over what it is now.

I hope that you are right.
This can be achieved a
 number of different ways, but the most obvious would be for the GC to
 ignore memory blocks that don't contain pointers.

I didn't know such selective scanning was possible with GC. Would this optimization be easy to implement? If such an optimization were implemented, then I suppose it would be wise to keep heap allocation local to small blocks of memory so that the GC would not have much to scan. -Craig
Apr 28 2006
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Last follow-up.  I re-ran all tests with a bit less stuff going on in 
the background, and I ran the C++ tests with MSVC as well, including the 
hash table version (though I have no DMC run for this particular app). 
Here are the results.


D version 1 (wc) : the original example
D version 2 (wc2): using malloc for the buffer

C++ version 1 (wccpp2): the original example
C++ version 2 (wccpp3): replaced C++ stream code with C routines
C++ version 3 (wccpp4): replaced std::string with slice<char>
C++ version 4 (wccpp5): replaced std::map with unordered_map


D version 1 (wc):

Execution time: 0.380 s
Execution time: 0.423 s
Execution time: 0.446 s

D version 2 (wc2):

Execution time: 0.659 s
Execution time: 0.424 s
Execution time: 0.377 s

----------

DMC version 1 (wccpp2):

Execution time: 1.656 s
Execution time: 1.652 s
Execution time: 1.656 s

DMC version 2 (wccpp3):

Execution time: 1.230 s
Execution time: 1.257 s
Execution time: 1.290 s

DMC version 3 (wccpp4):

Execution time: 0.674 s
Execution time: 0.729 s
Execution time: 0.697 s

----------

MSVC version 1 (wccpp2):

Execution time: 1.093 s
Execution time: 1.120 s
Execution time: 1.042 s

MSCV version 2 (wccpp3):

Execution time: 0.797 s
Execution time: 0.833 s
Execution time: 0.799 s

MSVC version 3 (wccpp4):

Execution time: 0.761 s
Execution time: 0.744 s
Execution time: 0.726 s

MSVC version 4 (wccpp5):

Execution time: 0.438 s
Execution time: 0.436 s
Execution time: 0.406 s


The most amazing thing here is the dramatic performance difference 
between these D runs and the previous ones.  I won't speculate on why, 
bu the exact same app appears to be running 3 times as fast as it did 
this morning.  Beyond that, we can see that the C++ app using slices and 
a hash table performs similarly to the D implementation, though it's 
difficult to draw any solid conclusions without a test run under DMC. 
Also, I should note for the record that the hash table implementation 
doesn't rehash automatically (it wasn't a feature I needed at the time), 
so I preallocated 4096 buckets before running the word count.  There are 
793875 words in the KJ bible so this results in somewhat long chains, 
but I saw roughly similar results with double the number of buckets so I 
think the bulk of processing at this speed may be taken up by IO.  The 
only other notable difference is that the C++ code uses standard C calls 
for the file operations while the D code uses Windows calls.  I can't 
see this resulting in a noticeable speed difference, but I figured it 
was worth mentioning.

Finally, as these tests were quite informal I don't suggest reading too 
much into them.  However I think a reasonable conclusion would be that D 
is faster than C++ for typical user applications (assuming a use pattern 
similar to the word count application) and while C++ can be quite 
competitive, making it so requires the use of hand-written code over 
standard library code.  C++ will fare a bit better in 2009 when 
unordered_map is an official part of the standard, but that still leaves 
std::string as the suggested way to store string data.  This may not be 
an issue if you're dealing with English text where most words are under 
16 characters, but it could be for other data sets where allocations 
will occur.


Sean
Apr 27 2006
parent reply "Craig Black" <cblack ara.com> writes:
Please allow me to ask a stupid question.  Why are there three benchmarks
for each version?  Are you testing on different computers or something?

The most amazing thing here is the dramatic performance difference
between these D runs and the previous ones.  I won't speculate on why,
bu the exact same app appears to be running 3 times as fast as it did
this morning.

That is very strange ... so it seems that these benchmarks, being inconsistent, were inconclusive. Sounds like we need more testing. Anyway, D came out on top that time so that is good news!
Finally, as these tests were quite informal I don't suggest reading too
much into them.

Agreed. Benchmarking one small program doesn't really say much.
However I think a reasonable conclusion would be that D
is faster than C++ for typical user applications

I would agree, but not just based on a single benchmark. -Craig -Craig
Apr 28 2006
parent Sean Kelly <sean f4.ca> writes:
Craig Black wrote:
 Please allow me to ask a stupid question.  Why are there three benchmarks
 for each version?  Are you testing on different computers or something?

I wanted to reduce the chance that random system noise would affect a result. I suppose I could have simply posted the mean for each test, but it was easier to cut & paste than to get out a calculator. Sean
Apr 29 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Mike Capp wrote:
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of 
 memory to hold its accounting information. So you've got, out of the 
 box, twice as many allocations/deallocations happening.

dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)

True, but shared_ptr<> is billed as the C++ answer to gc, and shared_ptr<> is not intrusive.
 2) for every copy of the shared_ptr<>, like passing one to a function, 
 returning one, etc., the reference count has to be incremented, then 
 decremented.

smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.

But that means just going back to the old way of being very, very careful about who "owns" the pointer and who doesn't. Refcounting is supposed to eliminate that nuisance.
 And even with all that falderol, shared_ptr<> still can't do slicing, 
 which are an important efficiency improvement.

containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.

Sure you could do that, but then slices and shared_ptr's are not interchangeable. In D, you can build a data structure that contains slices, references to the heap, and references to static data willy-nilly. In C++, because each of those would be a different type, you'd have to build a wall between each. I think that's one of the key benefits of D slicing - the user of the slice need not know or care that its a slice. I don't have to have two versions of toupper(), one for shared_ptr and one for slices.
 Another advantage of gc (that isn't implemented, but could be,
 in D's gc) is that allocated memory can be compacted.

pointers? How would this work? And it's going to be tough to drop raw pointers without breaking ease-of-use for C libraries.

Oh, it's quite doable <g>.
Apr 25 2006
parent reply Mike Capp <mike.capp gmail.com> writes:
In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...
True, but shared_ptr<> is billed as the C++ answer to gc, and 
shared_ptr<> is not intrusive.

Shrug. I'm finding myself increasingly unconcerned these days with what's "billed as the C++ answer". I mean, iostreams?
 Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case 


But that means just going back to the old way of being very, very 
careful about who "owns" the pointer and who doesn't. Refcounting is 
supposed to eliminate that nuisance.

I'm not sure that's true. Smartpointer refcounting is supposed to make ownership rules explicit in code where they can be a) seen and b) automated, instead of hidden in documentation. In practice, I still make daft array-overrun and fencepost errors from time to time, but I've never had a problem with refcounting and ownership. As a first approximation: member variables, globals and new'ed locals are owners, and everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend to be glaringly obvious.
Sure you could do that, but then slices and shared_ptr's are not 
interchangeable.

True, but again, I'm not especially interested in shared_ptr, and this isn't a fundamental problem - you can write intrusive interchangeable string and slice classes in C++.
 Has there ever been a language that successfully combined 
 compacting GC with raw pointers? How would this work? 

Oh, it's quite doable <g>.

I find your ideas intriguing, and would like to subscribe to your newsletter. Can you elaborate, or is this a trade secret? cheers Mike
Apr 26 2006
parent Walter Bright <newshound digitalmars.com> writes:
Mike Capp wrote:
 In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...
 True, but shared_ptr<> is billed as the C++ answer to gc, and 
 shared_ptr<> is not intrusive.

Shrug. I'm finding myself increasingly unconcerned these days with what's "billed as the C++ answer". I mean, iostreams?

I know what you mean.
 Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case 

careful about who "owns" the pointer and who doesn't. Refcounting is supposed to eliminate that nuisance.

rules explicit in code where they can be a) seen and b) automated, instead of hidden in documentation.

I can see that point, but I can see that once you stray into such optimizations, there's no help from the compiler, and you'll need to be very careful. A corollary to that is extra time is spent, and bugs are likely.
 In practice, I still make daft array-overrun and fencepost errors from time to
 time, but I've never had a problem with refcounting and ownership. As a first
 approximation: member variables, globals and new'ed locals are owners, and
 everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend
 to be glaringly obvious.

Andrei Alexandrescu has proposed a series of modifications to C++ to enable the compiler to enforce such rules, but nobody else seemed much interested in it.
 Sure you could do that, but then slices and shared_ptr's are not 
 interchangeable.

True, but again, I'm not especially interested in shared_ptr, and this isn't a fundamental problem - you can write intrusive interchangeable string and slice classes in C++.

Perhaps, but I've never seen anyone pull it off.
 Has there ever been a language that successfully combined 
 compacting GC with raw pointers? How would this work? 


I find your ideas intriguing, and would like to subscribe to your newsletter. Can you elaborate, or is this a trade secret?

I'd rather not, until I get it working.
Apr 26 2006
prev sibling next sibling parent Sean Kelly <sean f4.ca> writes:
BCS wrote:
 
 A bad GC will definitely make for bad performance. And any GC will be slower
 than an optimum malloc/free solution. However (and Iím still undecided on this
 one) I think the argument is that in any non trivial project, manually freeing
 all of the memory without any memory leaks or axing something to soon would
 require more overhead than using GC. The trade-off is not in the freeing time
 but in the can-I-free-this check.

I believe this is indeed the common argument. In C++, smart pointers have become a very popular way to manage memory, and while smart pointers are technically a garbage collection strategy, I think they aren't considered as such here. In any case, the cost comes from updating the reference count, not from the malloc/free, as the updates must typically be performed atomically. That said, an accurate performance comparison between GC and manual memory management is extremely difficult because implementation details and use strategies have a tremendous impact on results. For anyone interested, there's been an extensive argu- um, dialog on the topic recently on comp.lang.c++.moderated titled "Can GC be benifical": http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/84253d37f970dd2b/# Sean
Apr 24 2006
prev sibling parent reply James Dunne <james.jdunne gmail.com> writes:
BCS wrote:
 
 [snip] I think the argument is that in any non trivial project, manually
freeing
 all of the memory without any memory leaks or axing something to soon would
 require more overhead than using GC. The trade-off is not in the freeing time
 but in the can-I-free-this check.
 

Not true. A hierachical memory management system is great! It sure beats reference counting, manual memory management, and conservative GC. What you do is allocate some memory, but also provide a "parent" memory pointer to relate the new memory to. When all is said and done you have a nice tree of related memory pointers (hopefully with just one root, provided you designed correctly). To free some memory, simply walk all the branches of that node, freeing their memory, then free the original memory. You can even get predictable destructors for objects this way. I'm currently developing a project in C which uses this method. I have it allocating lots of small blocks of memory to perform a job. The total comes to 32KB. When the program is finished I do one "free" call to the root structure and I get back ALL of my memory. Not a single byte is leaked. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 26 2006
parent reply xs0 <xs0 xs0.com> writes:
 Not true.  A hierachical memory management system is great!  It sure 
 beats reference counting, manual memory management, and conservative GC.
 
 What you do is allocate some memory, but also provide a "parent" memory 
 pointer to relate the new memory to.  When all is said and done you have 
 a nice tree of related memory pointers (hopefully with just one root, 
 provided you designed correctly).
 
 To free some memory, simply walk all the branches of that node, freeing 
 their memory, then free the original memory.  You can even get 
 predictable destructors for objects this way.
 
 I'm currently developing a project in C which uses this method.  I have 
 it allocating lots of small blocks of memory to perform a job.  The 
 total comes to 32KB.  When the program is finished I do one "free" call 
 to the root structure and I get back ALL of my memory.  Not a single 
 byte is leaked.

But how can you tell _when_ to delete the root? AFAIK, the main benefit of GC is not that you don't have to manually free the memory, but that you don't have to know when it is safe to do so. I don't see how organizing memory blocks into a tree solves that... xs0
Apr 26 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
xs0 wrote:
 
 Not true.  A hierachical memory management system is great!  It sure 
 beats reference counting, manual memory management, and conservative GC.

 What you do is allocate some memory, but also provide a "parent" 
 memory pointer to relate the new memory to.  When all is said and done 
 you have a nice tree of related memory pointers (hopefully with just 
 one root, provided you designed correctly).

 To free some memory, simply walk all the branches of that node, 
 freeing their memory, then free the original memory.  You can even get 
 predictable destructors for objects this way.

 I'm currently developing a project in C which uses this method.  I 
 have it allocating lots of small blocks of memory to perform a job.  
 The total comes to 32KB.  When the program is finished I do one "free" 
 call to the root structure and I get back ALL of my memory.  Not a 
 single byte is leaked.

But how can you tell _when_ to delete the root? AFAIK, the main benefit of GC is not that you don't have to manually free the memory, but that you don't have to know when it is safe to do so. I don't see how organizing memory blocks into a tree solves that... xs0

That's a different problem and depends on the nature of the application. If you're in a server environment, you free memory when the client request has been completely processed. If you're in an interactive user-interface environment, you free memory when you close documents, close windows, etc. When the data is no longer needed, you'll know when to free and what to free. I think you thought I was implying you should only have one memory root which all things are related to and that is the only thing that should ever be freed. The first part is right, but not the second. You should keep all the memory related to each other, but it's not that you have to free it in an "all or nothing" circumstance. Then again, that depends on your application type. In my example application I have lots of memory free operations on the individual nodes when I know I won't need that memory anymore, such as for temporary strings. If I forget to include an explicit free call to that temporary memory, it's okay since I know that memory has been related to another parent memory block that _will_ be freed. Can you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own... -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 26 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
James Dunne wrote:
 Can you provide me a few legitimate examples of when you won't know what 
 to free and when to free it?  I'm having a tough time coming up with 
 examples on my own...

One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly heirarcical format.
Apr 26 2006
next sibling parent reply James Dunne <james.jdunne gmail.com> writes:
Walter Bright wrote:
 James Dunne wrote:
 
 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...

One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly heirarcical format.

Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph. Funny you mention compiler, because that is the example application I was referring to. :) -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 26 2006
next sibling parent reply James Dunne <james.jdunne gmail.com> writes:
James Dunne wrote:
 Walter Bright wrote:
 
 James Dunne wrote:

 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...

One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly heirarcical format.

Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph. Funny you mention compiler, because that is the example application I was referring to. :)

I should also mention that my compiler implementation makes use of context structures, so every memory block that the compiler will allocate is somehow parented to that context structure. So when the compile job is complete, free the context structure and all the memory goes away. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 26 2006
parent reply Dave <Dave_member pathlink.com> writes:
James Dunne wrote:
 
 I should also mention that my compiler implementation makes use of 
 context structures, so every memory block that the compiler will 
 allocate is somehow parented to that context structure.  So when the 
 compile job is complete, free the context structure and all the memory 
 goes away.
 

What kinda compiler?
Apr 26 2006
parent James Dunne <james.jdunne gmail.com> writes:
Dave wrote:
 James Dunne wrote:
 
 I should also mention that my compiler implementation makes use of 
 context structures, so every memory block that the compiler will 
 allocate is somehow parented to that context structure.  So when the 
 compile job is complete, free the context structure and all the memory 
 goes away.

What kinda compiler?

My own research language named Iris. I'd like to keep it under the radar for now. You can read about it in my monthly rants on http://iris-design-log.blogspot.com/. -- Regards, James Dunne
Apr 26 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
James Dunne wrote:

 Your example is having two trees _point_ to data within each other.  The 
 original data is still _owned_ by the original tree which the data 
 resides in.  You wouldn't be swapping ownership of the data like you're 
 talking about.  In this system, when memory is allocated it is assigned 
 an owner and that ownership does not ever change.  So the system 
 resembles a tree (one parent only), not a directed graph.

But when this happens in the general case, I can't delete *any* of the trees.
Apr 26 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
Walter Bright wrote:
 James Dunne wrote:
 
 Your example is having two trees _point_ to data within each other.  
 The original data is still _owned_ by the original tree which the data 
 resides in.  You wouldn't be swapping ownership of the data like 
 you're talking about.  In this system, when memory is allocated it is 
 assigned an owner and that ownership does not ever change.  So the 
 system resembles a tree (one parent only), not a directed graph.

But when this happens in the general case, I can't delete *any* of the trees.

When are you trying to delete one of these trees? To which 'tree' are you referring, the data structure tree or the hierarchical memory manager's tree? -- Regards, James Dunne
Apr 26 2006
parent Walter Bright <newshound digitalmars.com> writes:
James Dunne wrote:
 Walter Bright wrote:
 James Dunne wrote:

 Your example is having two trees _point_ to data within each other.  
 The original data is still _owned_ by the original tree which the 
 data resides in.  You wouldn't be swapping ownership of the data like 
 you're talking about.  In this system, when memory is allocated it is 
 assigned an owner and that ownership does not ever change.  So the 
 system resembles a tree (one parent only), not a directed graph.

But when this happens in the general case, I can't delete *any* of the trees.

When are you trying to delete one of these trees?

I could wait until the program finishes before deleting anything, but that isn't memory management.
 To which 'tree' are you referring, the data structure tree or the 
 hierarchical memory manager's tree?

The nested symbol table of the local scope of a function, for example.
Apr 26 2006
prev sibling parent reply James Dunne <james.jdunne gmail.com> writes:
Walter Bright wrote:
 James Dunne wrote:
 
 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...

One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly heirarcical format.

If you parent the memory you're creating in the tree data structures to a structure that is higher up the chain than both tree data structures, then the data doesn't get lost simply because you freed the data structure that held references to that memory. -- Regards, James Dunne
Apr 26 2006
next sibling parent Walter Bright <newshound digitalmars.com> writes:
James Dunne wrote:
 Walter Bright wrote:
 I.e. the problem crops up when the data in memory just doesn't fit 
 into a strictly heirarcical format.

a structure that is higher up the chain than both tree data structures, then the data doesn't get lost simply because you freed the data structure that held references to that memory.

I've designed data structures around the memory management requirement for keeping a sole 'parent' around for it that can free it. But I'd rather design data structures around how they are used instead. For example, if I have an associative array of strings, I can stuff any old string into it - even static strings - without having to worry about who 'owns' those strings. It's quite liberating <g>.
Apr 26 2006
prev sibling parent reply "Craig Black" <cblack ara.com> writes:
Good idea!  I suppose that every time you allocate memory you must always 
specify a parent node in the hierarchy.  This requires a little more thought 
than GC, because you have to consider how to organize the hierarchy so that 
it will be easy to free memory relating to a given task once the task is 
done.  Such a system is would be very fast, and with just a small amount of 
preparation, memory leaks are completely eliminated.  It could potentially 
be faster than traditional memory management, because you don't necessarily 
have to call free on every block of memory.  Rather, one free could free the 
entire hierarchy, or a large sub-hierarchy.  Have you experienced any 
drawbacks?  Does this approach work well with exceptions? threads?  Also, 
what kind of data structure do you use for this hierarchy?

-Craig 
Apr 26 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
Craig Black wrote:
 Good idea!  I suppose that every time you allocate memory you must always 
 specify a parent node in the hierarchy.  This requires a little more thought 
 than GC, because you have to consider how to organize the hierarchy so that 
 it will be easy to free memory relating to a given task once the task is 
 done.  Such a system is would be very fast, and with just a small amount of 
 preparation, memory leaks are completely eliminated.  It could potentially 
 be faster than traditional memory management, because you don't necessarily 
 have to call free on every block of memory.  Rather, one free could free the 
 entire hierarchy, or a large sub-hierarchy.  Have you experienced any 
 drawbacks?  Does this approach work well with exceptions? threads?  Also, 
 what kind of data structure do you use for this hierarchy?
 
 -Craig 
 
 

Holy geez... Yes it is a good idea, but alas I cannot take credit for it. You point out all the benefits that I've gotten from it so far quite accurately. :) I've only used it with my compiler implementation written in C, so I'll discuss my experience relating to that, if that's okay with you... Drawbacks? Only that you have to give it that much more thought about what to parent to what... There's no serious runtime penalty - it just adds a pointer to the parent's list of children, but that cost is tacked on to allocating memory in general. Exceptions? I'm coding in C, and I wrote my own pseudo-exception model, which makes use of that memory model so I don't even leak memory from exceptions... a good thing! Threads? Haven't gotten there yet, but I plan to design my threads to be completely independent of each other. It's easy to do when you're lexing/parsing/analyzing different source files at the same time. Only need to synchronize when each pass is completed. I'll look into concurrency issues in the allocator code when I get to it. The data structure is a simple list of child pointers and a single parent pointer for each memory block. Linked lists. The code still relies, of course, on malloc and free to actually get the memory in the first place. -- Regards, James Dunne
Apr 26 2006
parent reply "Craig Black" <cblack ara.com> writes:
"James Dunne" <james.jdunne gmail.com> wrote in message 
news:e2pjk0$96l$1 digitaldaemon.com...
 Craig Black wrote:
 Good idea!  I suppose that every time you allocate memory you must always 
 specify a parent node in the hierarchy.  This requires a little more 
 thought than GC, because you have to consider how to organize the 
 hierarchy so that it will be easy to free memory relating to a given task 
 once the task is done.  Such a system is would be very fast, and with 
 just a small amount of preparation, memory leaks are completely 
 eliminated.  It could potentially be faster than traditional memory 
 management, because you don't necessarily have to call free on every 
 block of memory.  Rather, one free could free the entire hierarchy, or a 
 large sub-hierarchy.  Have you experienced any drawbacks?  Does this 
 approach work well with exceptions? threads?  Also, what kind of data 
 structure do you use for this hierarchy?

 -Craig

Holy geez... Yes it is a good idea, but alas I cannot take credit for it. You point out all the benefits that I've gotten from it so far quite accurately. :) I've only used it with my compiler implementation written in C, so I'll discuss my experience relating to that, if that's okay with you... Drawbacks? Only that you have to give it that much more thought about what to parent to what... There's no serious runtime penalty - it just adds a pointer to the parent's list of children, but that cost is tacked on to allocating memory in general. Exceptions? I'm coding in C, and I wrote my own pseudo-exception model, which makes use of that memory model so I don't even leak memory from exceptions... a good thing! Threads? Haven't gotten there yet, but I plan to design my threads to be completely independent of each other. It's easy to do when you're lexing/parsing/analyzing different source files at the same time. Only need to synchronize when each pass is completed. I'll look into concurrency issues in the allocator code when I get to it. The data structure is a simple list of child pointers and a single parent pointer for each memory block. Linked lists. The code still relies, of course, on malloc and free to actually get the memory in the first place.

Of coarse it does. Along those lines ... I've been thinking of an optimization for this. Couldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it? This way, you could allocate large blocks of memory, and could cut down on your calls to malloc and free by an order of magnitude. Obviously you would have to choose what nodes in the hierarchy become allocators such that the least number of CRT calls would be made. -Craig
Apr 27 2006
parent reply "Craig Black" <cblack ara.com> writes:
Couldn't you make come of your parent nodes an allocators, so that all of 
their >children get their memory from it?

I was in a hurry. Let me revise. Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?
Apr 27 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
Craig Black wrote:
Couldn't you make come of your parent nodes an allocators, so that all of 
their >children get their memory from it?

I was in a hurry. Let me revise. Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?

Hehe. That's okay. I understand text through typos. I'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling? I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 27 2006
parent reply "Craig Black" <cblack ara.com> writes:
 I'm not entirely familiar with the concept of allocators as you describe. 
 Is this similar to memory pooling?  I always skipped over that portion of 
 the C++ STL; way too much complexity for my small brain =P.

Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks. Then when you want an allocation unit, you request it from the allocator. The allocator simply hands out the memory, an marks it as used. The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead. In this way, you would only call malloc and free for large chunks of memory. You could also inform each allocator how much memory to reserve when they are created, since you may have a good idea of how much memory a given task will require. Did that help? -Craig
Apr 27 2006
parent James Dunne <james.jdunne gmail.com> writes:
Craig Black wrote:
I'm not entirely familiar with the concept of allocators as you describe. 
Is this similar to memory pooling?  I always skipped over that portion of 
the C++ STL; way too much complexity for my small brain =P.

Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks. Then when you want an allocation unit, you request it from the allocator. The allocator simply hands out the memory, an marks it as used. The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead. In this way, you would only call malloc and free for large chunks of memory. You could also inform each allocator how much memory to reserve when they are created, since you may have a good idea of how much memory a given task will require. Did that help? -Craig

Yes I understand. Thank you. Very interesting concept. I don't think in my current situation that calling malloc is a definite bottleneck. When I see a performance concern later on, I'll definitely look into it. -- Regards, James Dunne
Apr 27 2006