digitalmars.D - Faster Virtual Method Dispatch

Craig Black (25/25) Apr 23 2006 Most programmers do not realize the overhead involved with calling a met...

kris (38/70) Apr 23 2006 I'm no expert either, Craig, yet it seems that there's more to it? An

BCS (6/76) Apr 23 2006 I'm not following you train of thought completely but...

BCS (14/39) Apr 23 2006 IIRC what is being searched for isn't which entry in this vtbl, but whic...

Craig Black (10/70) Apr 23 2006 Nope you are right. Which vtable. My bad.

Hasan Aljudy (4/36) Apr 23 2006 I'm far from being an expert .. but I thought vtable lookups involve two...

Craig Black (8/11) Apr 23 2006 From one non-expert to another, I have no idea how many assembly languag...

Dave (14/45) Apr 23 2006 This is a little OT because it involves removing the need for a lot of

kris (5/59) Apr 23 2006 Yes, I recall seeing Walter write something along these lines somewhere

Dan (30/30) Apr 24 2006 Actually, having "if statements" is absolutely TERRIBLE for trying to im...

Craig Black (13/57) Apr 24 2006 Maybe you are just way over my head but I'm not quite following how this...

BCS (7/21) Apr 24 2006 A bad GC will definitely make for bad performance. And any GC will be sl...

Craig Black (17/50) Apr 24 2006 I have heard similar. "The code required to manually free everything

kris (7/67) Apr 24 2006 The nice thing about D, as a language, is that it supports both

Craig Black (10/16) Apr 24 2006 Correct. But since the language core relies on Phobos I don't think tha...

kris (15/27) Apr 24 2006 That's really quite easy, Craig; start with Ares, and then apply

Sean Kelly (6/13) Apr 25 2006 At the very least, a GC is required if you want to do much of anything

Walter Bright (21/63) Apr 24 2006 There's more to it than that. One of the big benefits of GC is that

Craig Black (12/79) Apr 24 2006 Hmmm. I have glanced at the word count example many times but was not a...

Walter Bright (23/34) Apr 24 2006 Even in that case, gc can win. For example, for explicit memory

Craig Black (18/52) Apr 25 2006 I don't see a performance issue here. Nor do think this would be terrib...

Sean Kelly (9/17) Apr 25 2006 One reason is that Java compilers perform basically no compile-time

Craig Black (8/13) Apr 25 2006 It is also worth mentioning that compacting can be performed without

Sean Kelly (11/21) Apr 25 2006 For what it's worth, the performance improvement may be attributed to

Mike Capp (19/29) Apr 25 2006 Not true for intrusive refcounting. (This isn't always feasible if you'r...

Craig Black (8/35) Apr 25 2006 What's intrusive refcounting? Please use small words. My brain is almos...

Sean Kelly (8/44) Apr 25 2006 Maintaining the reference count within the class being pointed to

Craig Black (8/53) Apr 25 2006 That's not the point. It demonstrates that the same performance can be

Sean Kelly (12/29) Apr 25 2006 I merely meant that that's not the point of the wordcount example. Howe...

Craig Black (25/68) Apr 25 2006 Great! I look forward to seeing the results.

Sean Kelly (45/45) Apr 27 2006 I ran some quick tests this morning using the word count example code

Sean Kelly (24/24) Apr 27 2006 Quick follow-up. I decided to re-run the tests using the full text of

Craig Black (7/10) Apr 27 2006 Wow! It's over twice as fast! How difficult would it be to use the sam...

Sean Kelly (23/33) Apr 27 2006 The C++ slice code should mirror what is already happening in D--as far
Sean Kelly (8/13) Apr 27 2006 By the way, I do think that GC performance can be improved significantly...

Craig Black (6/13) Apr 28 2006 I didn't know such selective scanning was possible with GC. Would this

Sean Kelly (76/76) Apr 27 2006 Last follow-up. I re-ran all tests with a bit less stuff going on in

Craig Black (9/17) Apr 28 2006 That is very strange ... so it seems that these benchmarks, being

Sean Kelly (5/7) Apr 29 2006 I wanted to reduce the chance that random system noise would affect a

Walter Bright (15/41) Apr 25 2006 True, but shared_ptr<> is billed as the C++ answer to gc, and

Mike Capp (18/30) Apr 26 2006 Shrug. I'm finding myself increasingly unconcerned these days with what'...

Walter Bright (11/43) Apr 26 2006 I can see that point, but I can see that once you stray into such

Sean Kelly (14/21) Apr 24 2006 I believe this is indeed the common argument. In C++, smart pointers
James Dunne (23/29) Apr 26 2006 Not true. A hierachical memory management system is great! It sure

xs0 (5/22) Apr 26 2006 But how can you tell _when_ to delete the root? AFAIK, the main benefit

James Dunne (29/56) Apr 26 2006 That's a different problem and depends on the nature of the application....

Walter Bright (11/14) Apr 26 2006 One example that has plagued me in the past is having two distinct tree

James Dunne (17/37) Apr 26 2006 Your example is having two trees _point_ to data within each other. The...

James Dunne (14/49) Apr 26 2006 I should also mention that my compiler implementation makes use of

Dave (2/9) Apr 26 2006 What kinda compiler?

James Dunne (7/18) Apr 26 2006 My own research language named Iris. I'd like to keep it under the

Walter Bright (3/9) Apr 26 2006 But when this happens in the general case, I can't delete *any* of the

James Dunne (7/19) Apr 26 2006 When are you trying to delete one of these trees?

Walter Bright (4/21) Apr 26 2006 I could wait until the program finishes before deleting anything, but

James Dunne (8/28) Apr 26 2006 If you parent the memory you're creating in the tree data structures to

Walter Bright (7/14) Apr 26 2006 I've designed data structures around the memory management requirement
Craig Black (12/12) Apr 26 2006 Good idea! I suppose that every time you allocate memory you must alway...

James Dunne (25/40) Apr 26 2006 Holy geez... Yes it is a good idea, but alas I cannot take credit for

Craig Black (10/45) Apr 27 2006 Of coarse it does. Along those lines ... I've been thinking of an

Craig Black (3/5) Apr 27 2006 I was in a hurry. Let me revise.

James Dunne (13/23) Apr 27 2006 Hehe. That's okay. I understand text through typos.

Craig Black (13/16) Apr 27 2006 Instead of calling malloc for each allocation unit, you could write an

James Dunne (8/32) Apr 27 2006 Yes I understand. Thank you. Very interesting concept.

"Craig Black" <cblack ara.com> writes:

Most programmers do not realize the overhead involved with calling a method 
virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
lookup and a invoking a function pointer.  However, we must also consider 
that the CPU has no way of performing branch prediction with vtable lookups, 
so the pipeline is trashed.  Thus, virtual method calls incur a large 
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
optimizations is replacing the vtable lookup with if statements for the most 
commonly called virtual methods followed by the vtable lookup, so that 
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but 
knowing the little that I do know I have come up with a simple approach that 
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the 
methods directly rather than using fuction pointers would allow the method 
dispatch to take advantage of pipelining.  Algorithmically, however, this is 
no longer O(1), but O(log n), which is bad.  So I would assume that it would 
only be faster if n (or the number of methods in the vtable) is lower than a 
given threshold.  This threshold would be determined by testing and would 
perhaps be different on different processors.  Thus, in order to ensure that 
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization 
strategies.  I don't know how well such an algorithm would work in practice. 
Comments are obviously welcome.

-Craig

Apr 23 2006

kris <foo bar.com> writes:

Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 
 


I'm no expert either, Craig, yet it seems that there's more to it? An 
indirect address lookup (via a vtable) doesn't trash the pipeline per 
se; I forget the cycles involved but there's very little overhead at all 
if the vtable is a "popular" one, or otherwise happens to be in the 
cache lines. None of the other in-flight instructions have to be flushed 
at that time. If the vtable is not cached, then a bubble will appear in 
the pipeline while main-memory is addressed. But, again (as I understand 
it), this doesn't invalidate any in-flight instructions; just feeds the 
core less efficiently.

On the other hand, a branch misprediction must invalidate a large number 
of in-flight instruction; usually all of them (unless multi-path 
speculative execution is occuring). The P4 is particularly maimed by 
this, as compared to its peers, because of its particularly long 
pipeline (which, in turn, was needed to reached the frequencies Intel 
marketeers felt they needed to drive AMD off the map).

Obviously I don't know enough about this, but it might be interesting to 
research how one might introduce some locality-of-reference bias on the 
more common calls made. For instance, some fancy runtime might rewrite 
"common" calls to a dynamically constructed "global vtable". That would 
perhaps avoid the misprediction concerns, whilst managing to retain the 
globally popular set of lookups within cache?

Clearly MS compiler-engineers have noticed something beneficial. On the 
other hand, it seems a bit like flogging a dead-horse to introduce 
something that will likely cause at least one branch misprediction? I 
suppose the idea is that a direct-call (versus an indirect one) will 
enable speculative execution within the impending function-call itself?

Then again, heavyweight speculative execution is starting to lose favour 
against a grouping of simple symmetrical cores on one die (a la Cell, 
Kilocore, Niagara, etc). The idea there is that it doesn't matter if a 
bubble appears, or a core stalls; if you partition an app to take 
advantage of more than one execution thread, then it will (overall) 
execute faster anyway. The silicon saved via simplicity is applied to 
other areas such as massive on-die bandwidth. Of course, that doesn't 
help single-threaded applications, but the industry is changing, albeit 
slowly :)

Wither the Transputer? <g>

2 Cents

Apr 23 2006

BCS <BCS_member pathlink.com> writes:

In article <e2gkms$rur$1 digitaldaemon.com>, kris says...
Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 
 


I'm no expert either, Craig, yet it seems that there's more to it? An 
indirect address lookup (via a vtable) doesn't trash the pipeline per 
se; I forget the cycles involved but there's very little overhead at all 
if the vtable is a "popular" one, or otherwise happens to be in the 

I'm not following you train of thought completely but...
I don't think that the cache hit is the issue, I think that the fact that the
function call is to an interdict address is the issue. The CPU might not be able
to safely predict where it's supposed to go until it get all the way to the
function call. As a result the instruction queue gets completely drained.

cache lines. None of the other in-flight instructions have to be flushed 
at that time. If the vtable is not cached, then a bubble will appear in 
the pipeline while main-memory is addressed. But, again (as I understand 
it), this doesn't invalidate any in-flight instructions; just feeds the 
core less efficiently.

On the other hand, a branch misprediction must invalidate a large number 
of in-flight instruction; usually all of them (unless multi-path 
speculative execution is occuring). The P4 is particularly maimed by 
this, as compared to its peers, because of its particularly long 
pipeline (which, in turn, was needed to reached the frequencies Intel 
marketeers felt they needed to drive AMD off the map).

Obviously I don't know enough about this, but it might be interesting to 
research how one might introduce some locality-of-reference bias on the 
more common calls made. For instance, some fancy runtime might rewrite 
"common" calls to a dynamically constructed "global vtable". That would 
perhaps avoid the misprediction concerns, whilst managing to retain the 
globally popular set of lookups within cache?

Clearly MS compiler-engineers have noticed something beneficial. On the 
other hand, it seems a bit like flogging a dead-horse to introduce 
something that will likely cause at least one branch misprediction? I 
suppose the idea is that a direct-call (versus an indirect one) will 
enable speculative execution within the impending function-call itself?

Then again, heavyweight speculative execution is starting to lose favour 
against a grouping of simple symmetrical cores on one die (a la Cell, 
Kilocore, Niagara, etc). The idea there is that it doesn't matter if a 
bubble appears, or a core stalls; if you partition an app to take 
advantage of more than one execution thread, then it will (overall) 
execute faster anyway. The silicon saved via simplicity is applied to 
other areas such as massive on-die bandwidth. Of course, that doesn't 
help single-threaded applications, but the industry is changing, albeit 
slowly :)

Wither the Transputer? <g>

2 Cents

Apr 23 2006

BCS <BCS_member pathlink.com> writes:

In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...
Most programmers do not realize the overhead involved with calling a method 
virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
lookup and a invoking a function pointer.  However, we must also consider 
that the CPU has no way of performing branch prediction with vtable lookups, 
so the pipeline is trashed.  Thus, virtual method calls incur a large 
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
optimizations is replacing the vtable lookup with if statements for the most 
commonly called virtual methods followed by the vtable lookup, so that 
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but 
knowing the little that I do know I have come up with a simple approach that 
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the 
methods directly rather than using fuction pointers would allow the method 
dispatch to take advantage of pipelining.  Algorithmically, however, this is 
no longer O(1), but O(log n), which is bad.  So I would assume that it would 
only be faster if n (or the number of methods in the vtable) is lower than a 

IIRC what is being searched for isn't which entry in this vtbl, but which vtbl
to use. Or maybe I am miss understanding you.

given threshold.  This threshold would be determined by testing and would 
perhaps be different on different processors.  Thus, in order to ensure that 
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization 
strategies.  I don't know how well such an algorithm would work in practice. 
Comments are obviously welcome.

-Craig 


Maybe some sort of run-and-review optimization could be done. This would entail
having the compiler build a heavily instrumented version of the program and then
run it a bunch to find out which classes are most commonly used at a given call
and then on the next build let the compiler optimize each call based on the
measured data. 

If this is class ABC 
call ABC.foo
else if XYZ 
call XYZ.foo
else
use the vtbl

Apr 23 2006

"Craig Black" <cblack ara.com> writes:

"BCS" <BCS_member pathlink.com> wrote in message 
news:e2go6m$13u7$1 digitaldaemon.com...
 In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...
Most programmers do not realize the overhead involved with calling a 
method
virtually.  A virtual method call is an O(1) operation.  Just  a vtable
lookup and a invoking a function pointer.  However, we must also consider
that the CPU has no way of performing branch prediction with vtable 
lookups,
so the pipeline is trashed.  Thus, virtual method calls incur a large
performance pentalty.

Interestly, I found that one of the MSVC++ 2005 compiler's profile guided
optimizations is replacing the vtable lookup with if statements for the 
most
commonly called virtual methods followed by the vtable lookup, so that
branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, 
but
knowing the little that I do know I have come up with a simple approach 
that
may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the
methods directly rather than using fuction pointers would allow the method
dispatch to take advantage of pipelining.  Algorithmically, however, this 
is
no longer O(1), but O(log n), which is bad.  So I would assume that it 
would
only be faster if n (or the number of methods in the vtable) is lower than 
a

 IIRC what is being searched for isn't which entry in this vtbl, but which 
 vtbl
 to use. Or maybe I am miss understanding you.

Nope you are right.  Which vtable.  My bad.

given threshold.  This threshold would be determined by testing and would
perhaps be different on different processors.  Thus, in order to ensure 
that
this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization
strategies.  I don't know how well such an algorithm would work in 
practice.
Comments are obviously welcome.

-Craig


 Maybe some sort of run-and-review optimization could be done. This would 
 entail
 having the compiler build a heavily instrumented version of the program 
 and then
 run it a bunch to find out which classes are most commonly used at a given 
 call
 and then on the next build let the compiler optimize each call based on 
 the
 measured data.

 If this is class ABC
 call ABC.foo
 else if XYZ
 call XYZ.foo
 else
 use the vtbl

Yes, like MSVC's profile guided optimization.  A more general optimization 
would be nice too, but you could probably be even faster and more efficient 
with a profile-guided one.  One real good thing about profile guided 
optimizations is that the compiler can spend more time on optimizing 
performance-critical code sections and less time on code that is not 
pertinent to performance.

-Craig

Apr 23 2006

Hasan Aljudy <hasan.aljudy gmail.com> writes:

I'm far from being an expert .. but I thought vtable lookups involve two 
  or three jump/call instructions at the assembly level; at least that's 
the impression I got from stepping through code in the VC++ debugger.

Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig

Apr 23 2006

"Craig Black" <cblack ara.com> writes:

"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
news:e2gr8j$18r2$1 digitaldaemon.com...
 I'm far from being an expert .. but I thought vtable lookups involve two 
 or three jump/call instructions at the assembly level; at least that's the 
 impression I got from stepping through code in the VC++ debugger.

From one non-expert to another, I have no idea how many assembly language 
instructions are involved.  However, I have been informed by some very 
experienced programmers that virtual methods don't mesh well with 
pipelining.  This is the main performance issue with calling a virtual 
method.

-Craig

Apr 23 2006

Dave <Dave_member pathlink.com> writes:

Craig Black wrote:
 Most programmers do not realize the overhead involved with calling a method 
 virtually.  A virtual method call is an O(1) operation.  Just  a vtable 
 lookup and a invoking a function pointer.  However, we must also consider 
 that the CPU has no way of performing branch prediction with vtable lookups, 
 so the pipeline is trashed.  Thus, virtual method calls incur a large 
 performance pentalty.
 
 Interestly, I found that one of the MSVC++ 2005 compiler's profile guided 
 optimizations is replacing the vtable lookup with if statements for the most 
 commonly called virtual methods followed by the vtable lookup, so that 
 branch prediction can be performed in the majority of cases.
 
 I do not consider myself an expert in the area of compiler optimization, but 
 knowing the little that I do know I have come up with a simple approach that 
 may improve performance.  This approach is described below.
 
 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the method 
 dispatch to take advantage of pipelining.  Algorithmically, however, this is 
 no longer O(1), but O(log n), which is bad.  So I would assume that it would 
 only be faster if n (or the number of methods in the vtable) is lower than a 
 given threshold.  This threshold would be determined by testing and would 
 perhaps be different on different processors.  Thus, in order to ensure that 
 this approach is always faster, a relatively low number should be used.
 
 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in practice. 
 Comments are obviously welcome.
 
 -Craig 
 

This is a little OT because it involves removing the need for a lot of 
VMD altogether before the assembly code is even generated instead of a 
way to optimize the VMD.

But, I think it would be good to take a look at 'auto-finalization' 
where the compiler front-end determines that a virtual method is not 
needed so just does a direct call.

Not only can more direct calls be made, but potentially the vtable gets 
smaller, the lookups will be less frequent and there will be no need for 
any branch-prediction for direct calls, getting rid of much of the need 
for VMD optimizations for many programs in the first place.

Maybe the semantic processing differences of 'import' over #include 
would make this more straight-forward to do for D compilers over C++.

- Dave

Apr 23 2006

kris <foo bar.com> writes:

Dave wrote:
 Craig Black wrote:
 
 Most programmers do not realize the overhead involved with calling a 
 method virtually.  A virtual method call is an O(1) operation.  Just  
 a vtable lookup and a invoking a function pointer.  However, we must 
 also consider that the CPU has no way of performing branch prediction 
 with vtable lookups, so the pipeline is trashed.  Thus, virtual method 
 calls incur a large performance pentalty.

 Interestly, I found that one of the MSVC++ 2005 compiler's profile 
 guided optimizations is replacing the vtable lookup with if statements 
 for the most commonly called virtual methods followed by the vtable 
 lookup, so that branch prediction can be performed in the majority of 
 cases.

 I do not consider myself an expert in the area of compiler 
 optimization, but knowing the little that I do know I have come up 
 with a simple approach that may improve performance.  This approach is 
 described below.

 Performing a binary search using if statements, and then invoking the 
 methods directly rather than using fuction pointers would allow the 
 method dispatch to take advantage of pipelining.  Algorithmically, 
 however, this is no longer O(1), but O(log n), which is bad.  So I 
 would assume that it would only be faster if n (or the number of 
 methods in the vtable) is lower than a given threshold.  This 
 threshold would be determined by testing and would perhaps be 
 different on different processors.  Thus, in order to ensure that this 
 approach is always faster, a relatively low number should be used.

 Like I said I'm not an expert when it comes to low-level optimization 
 strategies.  I don't know how well such an algorithm would work in 
 practice. Comments are obviously welcome.

 -Craig

 
 
 This is a little OT because it involves removing the need for a lot of 
 VMD altogether before the assembly code is even generated instead of a 
 way to optimize the VMD.
 
 But, I think it would be good to take a look at 'auto-finalization' 
 where the compiler front-end determines that a virtual method is not 
 needed so just does a direct call.
 
 Not only can more direct calls be made, but potentially the vtable gets 
 smaller, the lookups will be less frequent and there will be no need for 
 any branch-prediction for direct calls, getting rid of much of the need 
 for VMD optimizations for many programs in the first place.
 
 Maybe the semantic processing differences of 'import' over #include 
 would make this more straight-forward to do for D compilers over C++.
 
 - Dave

Yes, I recall seeing Walter write something along these lines somewhere 
in the D documentation: being able to compile the entire program, versus 
linking things in from a library, permits much more aggressive 
finalization of methods as you describe. That's definately a benefit.

Apr 23 2006

Dan <Dan_member pathlink.com> writes:

Actually, having "if statements" is absolutely TERRIBLE for trying to improve
branch prediction.  By it's very nature, if statements are branches, so you are
causing at least one branch misprediction anyways.

Furthermore, pointers are not O(1) algorithms.  What the computer is doing is
performing a:

LEA (reg) [(v_pointer)]; 
J(cc)|JMP [(reg)];

My understanding is that the v_pointer address is known at compile time, however
it implies:

that the v_table's cache has to be loaded every time a function is called
that extra lea buggers instruction pairing tagging on several cycles
the potential that the v_table is a cache miss (highly unlikely a page miss)

Interestingly however, if one were to prepend some Self Modifying Code to the
program that instead loaded the vtable into a set of addresses attached to the
vtable, then one could achieve the benefits of static functions and most of the
benefits of virtual functions at the cost of having an extra void*.sizeof for
every function call, and the SMC code overhead.

The problem is that *some* functions would thereafter be virtual so that a
single pointer could be used to point to any of a set of functions - a case not
easily rendered with static functions (but still possible remembering that all
code is just data)

That said, I think the overhead involved in using virtual functions is quite
reasonable and comparable to the overhead in using 'calling conventions' versus
a internally-optimized system with primarily register arguments.  It's 7-8
clocks every function call with a little bit of cache overhead and the potential
for a cache miss.

Ultimately though, I think the GC causes the highest level of performance
troubles on the benchmarks.  I love it to death though and wish someone would
get their hands dirty making it smaller and sexier.

Thanks!

Apr 24 2006

"Craig Black" <cblack ara.com> writes:

"Dan" <Dan_member pathlink.com> wrote in message 
news:e2im9o$29gi$1 digitaldaemon.com...
 Actually, having "if statements" is absolutely TERRIBLE for trying to 
 improve
 branch prediction.  By it's very nature, if statements are branches, so 
 you are
 causing at least one branch misprediction anyways.

 Furthermore, pointers are not O(1) algorithms.  What the computer is doing 
 is
 performing a:

 LEA (reg) [(v_pointer)];
 J(cc)|JMP [(reg)];

Maybe you are just way over my head but I'm not quite following how this is 
anything bug O(1).

 My understanding is that the v_pointer address is known at compile time, 
 however
 it implies:

 that the v_table's cache has to be loaded every time a function is called
 that extra lea buggers instruction pairing tagging on several cycles
 the potential that the v_table is a cache miss (highly unlikely a page 
 miss)

 Interestingly however, if one were to prepend some Self Modifying Code to 
 the
 program that instead loaded the vtable into a set of addresses attached to 
 the
 vtable, then one could achieve the benefits of static functions and most 
 of the
 benefits of virtual functions at the cost of having an extra void*.sizeof 
 for
 every function call, and the SMC code overhead.]

OK, now you are definitely over my head. :)

 The problem is that *some* functions would thereafter be virtual so that a
 single pointer could be used to point to any of a set of functions - a 
 case not
 easily rendered with static functions (but still possible remembering that 
 all
 code is just data)

 That said, I think the overhead involved in using virtual functions is 
 quite
 reasonable and comparable to the overhead in using 'calling conventions' 
 versus
 a internally-optimized system with primarily register arguments.  It's 7-8
 clocks every function call with a little bit of cache overhead and the 
 potential
 for a cache miss.

 Ultimately though, I think the GC causes the highest level of performance
 troubles on the benchmarks.  I love it to death though and wish someone 
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :)  The notion that the GC is optional in D is really 
misleading.  Everything uses it.  While you could certainly make it faster 
at the expense of more complexity, I don't think that GC that scans the 
stack testing for non-null pointers will ever provide the performance of C's 
malloc and free.  GC is simply performing much more work.  I have heard many 
arguments to the contrary, but I still don't see how it is possible that GC 
could compete with malloc and free.

-Craig

Apr 24 2006

BCS <BCS_member pathlink.com> writes:

In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
"Dan" <Dan_member pathlink.com> wrote in message 
news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of performance
 troubles on the benchmarks.  I love it to death though and wish someone 
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :)  The notion that the GC is optional in D is really 
misleading.  Everything uses it.  While you could certainly make it faster 
at the expense of more complexity, I don't think that GC that scans the 
stack testing for non-null pointers will ever provide the performance of C's 
malloc and free.  GC is simply performing much more work.  I have heard many 
arguments to the contrary, but I still don't see how it is possible that GC 
could compete with malloc and free.


A bad GC will definitely make for bad performance. And any GC will be slower
than an optimum malloc/free solution. However (and I�m still undecided on this
one) I think the argument is that in any non trivial project, manually freeing
all of the memory without any memory leaks or axing something to soon would
require more overhead than using GC. The trade-off is not in the freeing time
but in the can-I-free-this check.

Apr 24 2006

"Craig Black" <cblack ara.com> writes:

"BCS" <BCS_member pathlink.com> wrote in message 
news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
"Dan" <Dan_member pathlink.com> wrote in message
news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish someone
 would
 get their hands dirty making it smaller and sexier.

Or (truly) optional. :)  The notion that the GC is optional in D is really
misleading.  Everything uses it.  While you could certainly make it faster
at the expense of more complexity, I don't think that GC that scans the
stack testing for non-null pointers will ever provide the performance of 
C's
malloc and free.  GC is simply performing much more work.  I have heard 
many
arguments to the contrary, but I still don't see how it is possible that 
GC
could compete with malloc and free.


 A bad GC will definitely make for bad performance. And any GC will be 
 slower
 than an optimum malloc/free solution. However (and I'm still undecided on 
 this
 one) I think the argument is that in any non trivial project, manually 
 freeing
 all of the memory without any memory leaks or axing something to soon 
 would
 require more overhead than using GC. The trade-off is not in the freeing 
 time
 but in the can-I-free-this check.

I have heard similar.  "The code required to manually free everything 
outweighs the overhead of automatic GC."  Unless you are doing something 
rediculous in your code when freeing heap objects, this statement is false. 
Manual freeing does not require gobs of code or eat up lots of processor 
cycles.  Typically it involves testing a pointer to see if it is null, 
calling free() if it is not, and then setting it to null.  How could this 
ever be slower than scanning the entire stack for non-null pointers?  GC is 
simply doing way more work.  IMO this statement is simply obsurd.

The benefit of GC is not performance, but convenience.  If you want the 
convenience of not having to free memory manually then use GC.  However if 
you want performance, use malloc and free (new and delete).

That is why GC should be completely optional, not just on a per class basis. 
If I want to completely eliminate GC from my application, I should be able 
to do so and still be able to make use of any standard library capabilities.

-Craig

Apr 24 2006

kris <foo bar.com> writes:

Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 
In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...

"Dan" <Dan_member pathlink.com> wrote in message
news:e2im9o$29gi$1 digitaldaemon.com...

Ultimately though, I think the GC causes the highest level of 
performance
troubles on the benchmarks.  I love it to death though and wish someone
would
get their hands dirty making it smaller and sexier.

Or (truly) optional. :)  The notion that the GC is optional in D is really
misleading.  Everything uses it.  While you could certainly make it faster
at the expense of more complexity, I don't think that GC that scans the
stack testing for non-null pointers will ever provide the performance of 
C's
malloc and free.  GC is simply performing much more work.  I have heard 
many
arguments to the contrary, but I still don't see how it is possible that 
GC
could compete with malloc and free.


A bad GC will definitely make for bad performance. And any GC will be 
slower
than an optimum malloc/free solution. However (and I'm still undecided on 
this
one) I think the argument is that in any non trivial project, manually 
freeing
all of the memory without any memory leaks or axing something to soon 
would
require more overhead than using GC. The trade-off is not in the freeing 
time
but in the can-I-free-this check.

 
 
 I have heard similar.  "The code required to manually free everything 
 outweighs the overhead of automatic GC."  Unless you are doing something 
 rediculous in your code when freeing heap objects, this statement is false. 
 Manual freeing does not require gobs of code or eat up lots of processor 
 cycles.  Typically it involves testing a pointer to see if it is null, 
 calling free() if it is not, and then setting it to null.  How could this 
 ever be slower than scanning the entire stack for non-null pointers?  GC is 
 simply doing way more work.  IMO this statement is simply obsurd.
 
 The benefit of GC is not performance, but convenience.  If you want the 
 convenience of not having to free memory manually then use GC.  However if 
 you want performance, use malloc and free (new and delete).
 
 That is why GC should be completely optional, not just on a per class basis. 
 If I want to completely eliminate GC from my application, I should be able 
 to do so and still be able to make use of any standard library capabilities.
 
 -Craig 
 
 


The nice thing about D, as a language, is that it supports both 
approaches. The bad thing, for your needs, is that phobos is tightly 
bound to the GC. My particular gripes are with the GC activity caused 
via utf conversions, std.string and so on. There's a surprising, perhaps 
troublesome, amount of GC activity generated from within phobos itself. 
This is one of the reasons alternative libraries exist.

Apr 24 2006

"Craig Black" <cblack ara.com> writes:

 The nice thing about D, as a language, is that it supports both 
 approaches. The bad thing, for your needs, is that phobos is tightly bound 
 to the GC. My particular gripes are with the GC activity caused via utf 
 conversions, std.string and so on. There's a surprising, perhaps 
 troublesome, amount of GC activity generated from within phobos itself. 
 This is one of the reasons alternative libraries exist.

Correct.  But since the language core relies on Phobos I don't think that 
there is a reasonable way to completely decouple D from GC.  To me this 
situation is quite ugly.  It would be nice if there was a way to replace 
Phobos completely.  After writing standard libraries that don't rely on GC, 
we could easily produce D non-GC vs. D GC benchmarks.

Some steps would have to be taken in order to keep some source compatibility 
between the two modes, so that it would't turn into two completely different 
languages.  Manual memory management code could be versioned out when 
compiled in GC mode.

-Craig

Apr 24 2006

kris <foo bar.com> writes:

Craig Black wrote:
The nice thing about D, as a language, is that it supports both 
approaches. The bad thing, for your needs, is that phobos is tightly bound 
to the GC. My particular gripes are with the GC activity caused via utf 
conversions, std.string and so on. There's a surprising, perhaps 
troublesome, amount of GC activity generated from within phobos itself. 
This is one of the reasons alternative libraries exist.

 
 
 Correct.  But since the language core relies on Phobos I don't think that 
 there is a reasonable way to completely decouple D from GC.  To me this 
 situation is quite ugly.  It would be nice if there was a way to replace 
 Phobos completely.


That's really quite easy, Craig; start with Ares, and then apply 
whatever you need on top. For example, Ares and Mango go together really 
well, and both make a point about not allocating memory where there's a 
sensible alternative. In fact, the Mango HttpServer does not touch the 
GC even once per service request, once it warms up -- all hail to the 
groovy array-slice, and behold the power of the trout-slap :)

(kudos to IRC #d)

There again, I suspect fear of the GC is somewhat overplayed; although 
it certainly troubles me when it is abused. Thus, it doesn't bother me 
at all that Ares kinda' needs a GC also. Perhaps it's more a question of 
libraries, and code in general, treating it with a little respect :)

After all, using any kind of malloc() should perhaps be a question-mark 
for high-performance code

- Kris

Apr 24 2006

Sean Kelly <sean f4.ca> writes:

kris wrote:
 
 There again, I suspect fear of the GC is somewhat overplayed; although 
 it certainly troubles me when it is abused. Thus, it doesn't bother me 
 at all that Ares kinda' needs a GC also. Perhaps it's more a question of 
 libraries, and code in general, treating it with a little respect :)

At the very least, a GC is required if you want to do much of anything 
useful with dynamic arrays.  For apps with a huge footprint, I may 
follow this route and define a custom allocator for all object types.

 After all, using any kind of malloc() should perhaps be a question-mark 
 for high-performance code

True enough.


Sean

Apr 25 2006

Walter Bright <newshound digitalmars.com> writes:

Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
 "Dan" <Dan_member pathlink.com> wrote in message
 news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish someone
 would
 get their hands dirty making it smaller and sexier.

 Or (truly) optional. :)  The notion that the GC is optional in D is really
 misleading.  Everything uses it.  While you could certainly make it faster
 at the expense of more complexity, I don't think that GC that scans the
 stack testing for non-null pointers will ever provide the performance of 
 C's
 malloc and free.  GC is simply performing much more work.  I have heard 
 many
 arguments to the contrary, but I still don't see how it is possible that 
 GC
 could compete with malloc and free.

 A bad GC will definitely make for bad performance. And any GC will be 
 slower
 than an optimum malloc/free solution. However (and I'm still undecided on 
 this
 one) I think the argument is that in any non trivial project, manually 
 freeing
 all of the memory without any memory leaks or axing something to soon 
 would
 require more overhead than using GC. The trade-off is not in the freeing 
 time
 but in the can-I-free-this check.

 
 I have heard similar.  "The code required to manually free everything 
 outweighs the overhead of automatic GC."  Unless you are doing something 
 rediculous in your code when freeing heap objects, this statement is false. 
 Manual freeing does not require gobs of code or eat up lots of processor 
 cycles.  Typically it involves testing a pointer to see if it is null, 
 calling free() if it is not, and then setting it to null.  How could this 
 ever be slower than scanning the entire stack for non-null pointers?  GC is 
 simply doing way more work.  IMO this statement is simply obsurd.

There's more to it than that. One of the big benefits of GC is that 
fewer allocations are necessary - and a sure way to better allocation 
performance is to do less of them. The reason fewer allocations are 
necessary is that the normal behavior in C/C++, when faced with two 
pointers to the same object, is to copy the object. That way, there's no 
grief with worrying about freeing the object that someone else points to.

To see this effect at work, see www.digitalmars.com/d/cppstrings.html.

This problem is bad enough that the notion of shared_ptr<> has appeared. 
shared_ptr<> implements reference counting to eliminate the need for 
extra copies. But it has its own overhead problems:

1) for every allocation, shared_ptr<> allocates an *extra* block of 
memory to hold its accounting information. So you've got, out of the 
box, twice as many allocations/deallocations happening.

2) for every copy of the shared_ptr<>, like passing one to a function, 
returning one, etc., the reference count has to be incremented, then 
decremented. The inc/dec must be done in an exception safe manner, i.e. 
they need to be unwindable. Even worse, for multithreaded apps these 
inc/dec's need to be synchronized.

And even with all that falderol, shared_ptr<> still can't do slicing, 
which are an important efficiency improvement.

Apr 24 2006

"Craig Black" <cblack ara.com> writes:

"Walter Bright" <newshound digitalmars.com> wrote in message 
news:e2k12i$19bi$1 digitaldaemon.com...
 Craig Black wrote:
 "BCS" <BCS_member pathlink.com> wrote in message 
 news:e2isuc$2jmd$1 digitaldaemon.com...
 In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...
 "Dan" <Dan_member pathlink.com> wrote in message
 news:e2im9o$29gi$1 digitaldaemon.com...
 Ultimately though, I think the GC causes the highest level of 
 performance
 troubles on the benchmarks.  I love it to death though and wish 
 someone
 would
 get their hands dirty making it smaller and sexier.

 Or (truly) optional. :)  The notion that the GC is optional in D is 
 really
 misleading.  Everything uses it.  While you could certainly make it 
 faster
 at the expense of more complexity, I don't think that GC that scans the
 stack testing for non-null pointers will ever provide the performance 
 of C's
 malloc and free.  GC is simply performing much more work.  I have heard 
 many
 arguments to the contrary, but I still don't see how it is possible 
 that GC
 could compete with malloc and free.

 A bad GC will definitely make for bad performance. And any GC will be 
 slower
 than an optimum malloc/free solution. However (and I'm still undecided 
 on this
 one) I think the argument is that in any non trivial project, manually 
 freeing
 all of the memory without any memory leaks or axing something to soon 
 would
 require more overhead than using GC. The trade-off is not in the freeing 
 time
 but in the can-I-free-this check.

 I have heard similar.  "The code required to manually free everything 
 outweighs the overhead of automatic GC."  Unless you are doing something 
 rediculous in your code when freeing heap objects, this statement is 
 false. Manual freeing does not require gobs of code or eat up lots of 
 processor cycles.  Typically it involves testing a pointer to see if it 
 is null, calling free() if it is not, and then setting it to null.  How 
 could this ever be slower than scanning the entire stack for non-null 
 pointers?  GC is simply doing way more work.  IMO this statement is 
 simply obsurd.

 There's more to it than that. One of the big benefits of GC is that fewer 
 allocations are necessary - and a sure way to better allocation 
 performance is to do less of them. The reason fewer allocations are 
 necessary is that the normal behavior in C/C++, when faced with two 
 pointers to the same object, is to copy the object. That way, there's no 
 grief with worrying about freeing the object that someone else points to.

 To see this effect at work, see www.digitalmars.com/d/cppstrings.html.

 This problem is bad enough that the notion of shared_ptr<> has appeared. 
 shared_ptr<> implements reference counting to eliminate the need for extra 
 copies. But it has its own overhead problems:

 1) for every allocation, shared_ptr<> allocates an *extra* block of memory 
 to hold its accounting information. So you've got, out of the box, twice 
 as many allocations/deallocations happening.

 2) for every copy of the shared_ptr<>, like passing one to a function, 
 returning one, etc., the reference count has to be incremented, then 
 decremented. The inc/dec must be done in an exception safe manner, i.e. 
 they need to be unwindable. Even worse, for multithreaded apps these 
 inc/dec's need to be synchronized.

 And even with all that falderol, shared_ptr<> still can't do slicing, 
 which are an important efficiency improvement.

Hmmm.  I have glanced at the word count example many times but was not aware 
of why it was faster in D until now.  Is it really the GC paradigm that 
allows this performance?  I think I'm beginning to finally see some method 
to the madness.  Perhaps I need to rethink my stance on GC a little.

It would seem then that GC can come out on top if there is a lot of sharing 
of allocated memory e.g. multiple references to each object.  However, if 
you are certain that there will be only a single reference to each object, 
manual memory management should always outperform GC.  Am I on the right 
track here?

-Craig

Apr 24 2006

Walter Bright <newshound digitalmars.com> writes:

Craig Black wrote:
 Hmmm.  I have glanced at the word count example many times but was not aware 
 of why it was faster in D until now.  Is it really the GC paradigm that 
 allows this performance?

Yes. It turns out to be very difficult for C++ to match it.

 I think I'm beginning to finally see some method 
 to the madness.  Perhaps I need to rethink my stance on GC a little.
 
 It would seem then that GC can come out on top if there is a lot of sharing 
 of allocated memory e.g. multiple references to each object.  However, if 
 you are certain that there will be only a single reference to each object, 
 manual memory management should always outperform GC.  Am I on the right 
 track here?

Even in that case, gc can win. For example, for explicit memory 
allocation in the presence of exception handling, one has to carefully 
set up handlers that will delete any temporaries if an exception is thrown:

	S *p = new S();
	p->a = foo();
	bar(p);
	delete p;

What if foo() throws an exception? We've got a memory leak in p, so 
we've got to add code to fix that. Dealing with that is an endless 
source of bugs and tedium in C++. With gc, nothing needs to be added.

Another advantage of gc (that isn't implemented, but could be, in D's 
gc) is that allocated memory can be compacted. Compacting existing 
allocated memory means that allocating a new object is as easy as 
bumping a pointer, whereas the usual malloc algorithm requires searching 
a free list for a best match.

So, the obvious question then becomes, if gc is faster, why is Java 
slower than C++? (I know, lots of Java people will claim it really isn't 
slower <g>.) The reason is that Java's design requires a lot more 
objects to be allocated than in C++ - for example, you cannot aggregate 
arrays or other objects inside other objects, they have to be allocated 
separately. More objects allocated means lower performance.

Apr 24 2006

"Craig Black" <cblack ara.com> writes:

"Walter Bright" <newshound digitalmars.com> wrote in message 
news:e2kffd$1rl5$1 digitaldaemon.com...
 Craig Black wrote:
 Hmmm.  I have glanced at the word count example many times but was not 
 aware of why it was faster in D until now.  Is it really the GC paradigm 
 that allows this performance?

 Yes. It turns out to be very difficult for C++ to match it.

 I think I'm beginning to finally see some method to the madness.  Perhaps 
 I need to rethink my stance on GC a little.

 It would seem then that GC can come out on top if there is a lot of 
 sharing of allocated memory e.g. multiple references to each object. 
 However, if you are certain that there will be only a single reference to 
 each object, manual memory management should always outperform GC.  Am I 
 on the right track here?

 Even in that case, gc can win. For example, for explicit memory allocation 
 in the presence of exception handling, one has to carefully set up 
 handlers that will delete any temporaries if an exception is thrown:

 S *p = new S();
 p->a = foo();
 bar(p);
 delete p;

 What if foo() throws an exception? We've got a memory leak in p, so we've 
 got to add code to fix that. Dealing with that is an endless source of 
 bugs and tedium in C++. With gc, nothing needs to be added.

I don't see a performance issue here.  Nor do think this would be terribly 
difficult to solve in D, especially with the fancy-shmancy new scope() 
features, as long as scope(exit) works properly with exceptions.  This 
would, however, pose a problem for C++.

 Another advantage of gc (that isn't implemented, but could be, in D's gc) 
 is that allocated memory can be compacted. Compacting existing allocated 
 memory means that allocating a new object is as easy as bumping a pointer, 
 whereas the usual malloc algorithm requires searching a free list for a 
 best match.

I can see the obvious performance improvement here, but I am skeptical of 
whether it can actually be done with D, given it's liberal syntax with raw 
pointers.  Seems to me that if the GC started moving pointers around, the 
compiler would have to enforce more safety rules with regards to pointers.

 So, the obvious question then becomes, if gc is faster, why is Java slower 
 than C++? (I know, lots of Java people will claim it really isn't slower 
 <g>.) The reason is that Java's design requires a lot more objects to be 
 allocated than in C++ - for example, you cannot aggregate arrays or other 
 objects inside other objects, they have to be allocated separately. More 
 objects allocated means lower performance.



that I don't know all the technical details with VM's, but I have run my own 
benchmarks.  VM-based code is just way slower than native code.  Can't 
really tell you exactly why though because I don't really know or care to 
know what goes on inside a VM. (Except for the purpose of explaining to 
people why it is so much slower.)

-Craig

Apr 25 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 


 that I don't know all the technical details with VM's, but I have run my own 
 benchmarks.  VM-based code is just way slower than native code.  Can't 
 really tell you exactly why though because I don't really know or care to 
 know what goes on inside a VM. (Except for the purpose of explaining to 
 people why it is so much slower.)

One reason is that Java compilers perform basically no compile-time 
optimization, and the JIT does very little as well.  C/C++, on the other 
hand, can see huge performance improvements with optimization turned on. 

However, some VM languages (such as the Lisp variants) seem to perform 
quite well compared to natively compiled code, though I don't have the 
experience to say why.


Sean

Apr 25 2006

"Craig Black" <cblack ara.com> writes:

 Another advantage of gc (that isn't implemented, but could be, in D's gc) 
 is that allocated memory can be compacted. Compacting existing allocated 
 memory means that allocating a new object is as easy as bumping a pointer, 
 whereas the usual malloc algorithm requires searching a free list for a 
 best match.

It is also worth mentioning that compacting can be performed without 
scanning the stack.  Manual memory management can still be used along with a 
compacting allocator.  An compacting allocator would certainly outperform 
CRT malloc and free.  There is a tradeoff here because you can no longer use 
petal-to-the-metal algorithms that make use of raw pointers and pointer 
arithmetic, or if you do so, you would have to turn the allocator off 
temporarily to ensure safety.

-Craig

Apr 25 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 
 Hmmm.  I have glanced at the word count example many times but was not aware 
 of why it was faster in D until now.  Is it really the GC paradigm that 
 allows this performance?  I think I'm beginning to finally see some method 
 to the madness.  Perhaps I need to rethink my stance on GC a little.

For what it's worth, the performance improvement may be attributed to 
two factors: the C++ map class is a balanced binary tree while D uses a 
hash table, and C++ indeed copies strings around while D passes references.

 It would seem then that GC can come out on top if there is a lot of sharing 
 of allocated memory e.g. multiple references to each object.  However, if 
 you are certain that there will be only a single reference to each object, 
 manual memory management should always outperform GC.  Am I on the right 
 track here?

D allows manual memory management as well, so you aren't forced to use 
the GC if you don't want it in most cases.  However, the lack of copy 
ctors does essentially eliminate the possibility of using smart 
pointers.  One downside of a GC is that it needs to store bookkeeping 
data, which may amount to as much as 50% memory overhead.  This can be 
problematic for very large applications.


Sean

Apr 25 2006

Mike Capp <mike.capp gmail.com> writes:

In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
1) for every allocation, shared_ptr<> allocates an *extra* block of 
memory to hold its accounting information. So you've got, out of the 
box, twice as many allocations/deallocations happening.

Not true for intrusive refcounting. (This isn't always feasible if you're
dealing with third-party classes, which is why shared_ptr isn't intrusive, but
if you _can_ use it it's very close to ideal.)

2) for every copy of the shared_ptr<>, like passing one to a function, 
returning one, etc., the reference count has to be incremented, then 
decremented.

Again, not true for intrusive refcounting. You can assign an intrusive
smartpointer from a raw pointer or reference, so most of the time you can use
those to pass/return. Refcount twiddling is only needed when changing an
"owning" pointer, plus the rare "safety" case where you have a raw pointer but
are calling code which might drop the last "real" owning ref. 

And even with all that falderol, shared_ptr<> still can't do slicing, 
which are an important efficiency improvement.

Agreed, but couldn't you do something very similar with a simple ptr class
containing a pointer to the (refcounted) base string plus a couple of range end
indices? That's worked well for me in the past, but there may be benefits of D
slices that I'm missing.

[And in a subsequent posting]

 Another advantage of gc (that isn't implemented, but could be,
 in D's gc) is that allocated memory can be compacted.

Has there ever been a language that successfully combined compacting GC with raw
pointers? How would this work? And it's going to be tough to drop raw pointers
without breaking ease-of-use for C libraries.

cheers
Mike

Apr 25 2006

"Craig Black" <cblack ara.com> writes:

"Mike Capp" <mike.capp gmail.com> wrote in message 
news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
1) for every allocation, shared_ptr<> allocates an *extra* block of
memory to hold its accounting information. So you've got, out of the
box, twice as many allocations/deallocations happening.

 Not true for intrusive refcounting. (This isn't always feasible if you're
 dealing with third-party classes, which is why shared_ptr isn't intrusive, 
 but
 if you _can_ use it it's very close to ideal.)

What's intrusive refcounting?  Please use small words. My brain is almost 
full right now. :)

2) for every copy of the shared_ptr<>, like passing one to a function,
returning one, etc., the reference count has to be incremented, then
decremented.

 Again, not true for intrusive refcounting. You can assign an intrusive
 smartpointer from a raw pointer or reference, so most of the time you can 
 use
 those to pass/return. Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case where you have a raw pointer 
 but
 are calling code which might drop the last "real" owning ref.

And even with all that falderol, shared_ptr<> still can't do slicing,
which are an important efficiency improvement.

 Agreed, but couldn't you do something very similar with a simple ptr class
 containing a pointer to the (refcounted) base string plus a couple of 
 range end
 indices? That's worked well for me in the past, but there may be benefits 
 of D
 slices that I'm missing.

If you think this is doable, then you should write a C++ version of 
wordcount based on this principal.  If you can get comparable performance to 
the D version then you may be on to something.

-Craig

Apr 25 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 "Mike Capp" <mike.capp gmail.com> wrote in message 
 news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of
 memory to hold its accounting information. So you've got, out of the
 box, twice as many allocations/deallocations happening.

 Not true for intrusive refcounting. (This isn't always feasible if you're
 dealing with third-party classes, which is why shared_ptr isn't intrusive, 
 but
 if you _can_ use it it's very close to ideal.)

 
 What's intrusive refcounting?  Please use small words. My brain is almost 
 full right now. :)

Maintaining the reference count within the class being pointed to 
instead of within the shared pointer code.

 2) for every copy of the shared_ptr<>, like passing one to a function,
 returning one, etc., the reference count has to be incremented, then
 decremented.

 Again, not true for intrusive refcounting. You can assign an intrusive
 smartpointer from a raw pointer or reference, so most of the time you can 
 use
 those to pass/return. Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case where you have a raw pointer 
 but
 are calling code which might drop the last "real" owning ref.

 And even with all that falderol, shared_ptr<> still can't do slicing,
 which are an important efficiency improvement.

 Agreed, but couldn't you do something very similar with a simple ptr class
 containing a pointer to the (refcounted) base string plus a couple of 
 range end
 indices? That's worked well for me in the past, but there may be benefits 
 of D
 slices that I'm missing.

 
 If you think this is doable, then you should write a C++ version of 
 wordcount based on this principal.  If you can get comparable performance to 
 the D version then you may be on to something.

I'd meant to do this and never got around to it.  However, the wordcount 
example is meant to compare performance using typical programming 
techniques for each language, and writing your own specialized 
containers isn't typical :-)


Sean

Apr 25 2006

"Craig Black" <cblack ara.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 "Mike Capp" <mike.capp gmail.com> wrote in message 
 news:e2l4g1$2n5s$1 digitaldaemon.com...
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of
 memory to hold its accounting information. So you've got, out of the
 box, twice as many allocations/deallocations happening.

 Not true for intrusive refcounting. (This isn't always feasible if 
 you're
 dealing with third-party classes, which is why shared_ptr isn't 
 intrusive, but
 if you _can_ use it it's very close to ideal.)

 What's intrusive refcounting?  Please use small words. My brain is almost 
 full right now. :)

 Maintaining the reference count within the class being pointed to instead 
 of within the shared pointer code.

 2) for every copy of the shared_ptr<>, like passing one to a function,
 returning one, etc., the reference count has to be incremented, then
 decremented.

 Again, not true for intrusive refcounting. You can assign an intrusive
 smartpointer from a raw pointer or reference, so most of the time you 
 can use
 those to pass/return. Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case where you have a raw 
 pointer but
 are calling code which might drop the last "real" owning ref.

 And even with all that falderol, shared_ptr<> still can't do slicing,
 which are an important efficiency improvement.

 Agreed, but couldn't you do something very similar with a simple ptr 
 class
 containing a pointer to the (refcounted) base string plus a couple of 
 range end
 indices? That's worked well for me in the past, but there may be 
 benefits of D
 slices that I'm missing.

 If you think this is doable, then you should write a C++ version of 
 wordcount based on this principal.  If you can get comparable performance 
 to the D version then you may be on to something.

 I'd meant to do this and never got around to it.  However, the wordcount 
 example is meant to compare performance using typical programming 
 techniques for each language, and writing your own specialized containers 
 isn't typical :-)

That's not the point.  It demonstrates that the same performance can be 
achieved without GC, and all the baggage that accompanies it.  So 
performance-wise it would be like having your cake and eating it too.  If 
this was possible, perhaps D could be streamlined to take advantage of this 
somehow.

-Craig

Apr 25 2006

Sean Kelly <sean f4.ca> writes:

In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says...
"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 If you think this is doable, then you should write a C++ version of 
 wordcount based on this principal.  If you can get comparable performance 
 to the D version then you may be on to something.

 I'd meant to do this and never got around to it.  However, the wordcount 
 example is meant to compare performance using typical programming 
 techniques for each language, and writing your own specialized containers 
 isn't typical :-)

That's not the point.  It demonstrates that the same performance can be 
achieved without GC, and all the baggage that accompanies it.  So 
performance-wise it would be like having your cake and eating it too.  If 
this was possible, perhaps D could be streamlined to take advantage of this 
somehow.

I merely meant that that's not the point of the wordcount example.  However,
I'll try and find the time to recode it using slices and hashtables in the next
few days.  IIRC that brings performance in line with the D version, so I suppose
that means you can "have your cake and eat it too."  It's worth mentioning,
however, that slicing with explicit memory management doesn't have quite the
utility it does with garbage collection.  It works great for apps where the
entire buffer can be loaded en masse, as with the wordcount example, but in
other cases it's a bit more complicated to manage the lifetime of the referenced
data.  Thus the GC version is still a clear win in terms of ease of use, whether
or not performance can be matched with specialized code elsewhere.


Sean

Apr 25 2006

"Craig Black" <cblack ara.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message 
news:e2lpjj$hvv$1 digitaldaemon.com...
 In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says...
"Sean Kelly" <sean f4.ca> wrote in message
news:e2lgsh$5tf$1 digitaldaemon.com...
 Craig Black wrote:
 If you think this is doable, then you should write a C++ version of
 wordcount based on this principal.  If you can get comparable 
 performance
 to the D version then you may be on to something.

 I'd meant to do this and never got around to it.  However, the wordcount
 example is meant to compare performance using typical programming
 techniques for each language, and writing your own specialized 
 containers
 isn't typical :-)

That's not the point.  It demonstrates that the same performance can be
achieved without GC, and all the baggage that accompanies it.  So
performance-wise it would be like having your cake and eating it too.  If
this was possible, perhaps D could be streamlined to take advantage of 
this
somehow.

 I merely meant that that's not the point of the wordcount example. 
 However,
 I'll try and find the time to recode it using slices and hashtables in the 
 next
 few days.  IIRC that brings performance in line with the D version, so I 
 suppose
 that means you can "have your cake and eat it too."  It's worth 
 mentioning,

Great!  I look forward to seeing the results.

 however, that slicing with explicit memory management doesn't have quite 
 the
 utility it does with garbage collection.  It works great for apps where 
 the
 entire buffer can be loaded en masse, as with the wordcount example, but 
 in
 other cases it's a bit more complicated to manage the lifetime of the 
 referenced
 data.  Thus the GC version is still a clear win in terms of ease of use, 
 whether
 or not performance can be matched with specialized code elsewhere.

Different programmers have different problems that require different 
solutions.  I doubt that we will ever find one approach to make everyone 
happy about everything.  That's why I think GC should be completely 
optional.  Ideally, turning of GC could be as easy as specifying a compiler 
switch.

But maybe I'm wrong.  Maybe GC can be faster than manual memory management, 
but I doubt it.  Either way, we need to optimize this bottleneck somehow.

Although D currenty delivers competitive performance, and is in some ways is 
faster than C++, I still believe we have a few more steps to take before we 
can say with confidence that D provides performance superior to C++.

IMO, the fastest way to get extreme performance from D is (1) make GC 
optional and (2) provide a safe, easy way to take advantage of concurrency 
on multi-core systems.


them to the punch, and provide a superior solution for D.)

This would make D very attractive for performance-demanding applications.  I 
am especially thinking of the game industry, with which I am indirectly 
involved.  This is the kind of stuff that they are very concerned about.  If 
we do not provide a solution to these issues, I fear it will be a big 
deterrent for many, especially since migrating large projects to D from C++ 
requires a large investment.

-Craig

Apr 25 2006

Sean Kelly <sean f4.ca> writes:

I ran some quick tests this morning using the word count example code 
and here are my results so far:


D version (wc):

Execution time: 0.179 s
Execution time: 0.183 s
Execution time: 0.149 s

C++ version 1 (wccpp2):

Execution time: 0.216 s
Execution time: 0.263 s
Execution time: 0.183 s

C++ version 2 (wccpp3):

Execution time: 0.221 s
Execution time: 0.168 s
Execution time: 0.185 s

C++ version 3 (wccpp4):

Execution time: 0.191 s
Execution time: 0.243 s
Execution time: 0.142 s


The apps were compiled according to the word count description and timed 
using ptime (I had some applications open during the tests so the 
results may be a tad inaccurate).  wccpp2 is straight from the web page, 
wccpp3 replaces the C++ IO code with a C version quite similar to the D 
code in std.file, and wccpp4 replaces std::string with slice<char>.

You'll notice that there is practically no difference between the 
std::string and slice tests (wccpp3 and wccpp4), and I think this is to 
be expected.  Typical C++ string implementations have a "small string 
optimization" to avoid dynamic allocations for strings of less than 16 
characters.  And since there are likely very few words in the sample 
text that are larger than this, there are few if any dynamic allocations 
being performed by std::string.  Thus I think the real notable 
performance difference between C++ and D is that C++ uses a balanced 
binary tree while D uses a hash table.

I tried replacing std::map with a hash table implementation (technically 
an unordered_map implemented according to C++ TR1 that I did for a 
project a while back), but DMC choked on it and I don't have any more 
time to spend on this right now.  When I get a chance, I'll try to 
verify that the unordered_map code conforms to the C++ spec (it compiles 
cleanly at WL:4 in Visual Studio 2005) and will file a bug report for 
DMC if it does.  For reference, the bulk of the errors I saw were 
related to symbol lookup--a template class using static and non-static 
data from a parent class without fully qualifying the names--adding some 
'using' statements took care of these.  The final one was a bit more 
subtle and offered me almost no useful information to go on, so that one 
may take some investigation.


Sean

Apr 27 2006

Sean Kelly <sean f4.ca> writes:

Quick follow-up.  I decided to re-run the tests using the full text of 
the King James Bible (http://www.gutenberg.org/dirs/etext90/kjv10.txt) 
in an attempt to get better timing resolution, as it's 4,342 KB vs. 
Alice in Wonderland's 160 KB.  Here are my results:


D version (wc):

Execution time: 1.409 s
Execution time: 1.473 s
Execution time: 1.295 s

C++ version 1 (wccpp2):

Execution time: 1.779 s
Execution time: 1.762 s
Execution time: 1.792 s

C++ version 2 (wccpp3):

Execution time: 1.758 s
Execution time: 1.309 s
Execution time: 1.297 s

C++ version 3 (wccpp4):

Execution time: 0.693 s
Execution time: 0.767 s
Execution time: 0.704 s


Surprisingly, the slice code does substantially better than both the 
other C++ versions and the D code.  Perhaps this evening I'll re-run the 
C++ tests under Visual Studio to give the hash table a test drive.


Sean

Apr 27 2006

"Craig Black" <cblack ara.com> writes:

 Surprisingly, the slice code does substantially better than both the other 
 C++ versions and the D code.  Perhaps this evening I'll re-run the C++ 
 tests under Visual Studio to give the hash table a test drive.

Wow!  It's over twice as fast!  How difficult would it be to use the same 
approach for wordcount in D by bypassing the GC?

So what do you think we should do about GC to make D faster?  Do you favor 
making GC optional or writing a more optimized, compacting GC?  I personally 
think that even optimized GC won't come close to code that has been 
optimized with manual memory management.

-Craig

Apr 27 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 Surprisingly, the slice code does substantially better than both the other 
 C++ versions and the D code.  Perhaps this evening I'll re-run the C++ 
 tests under Visual Studio to give the hash table a test drive.

 
 Wow!  It's over twice as fast!  How difficult would it be to use the same 
 approach for wordcount in D by bypassing the GC?

The C++ slice code should mirror what is already happening in D--as far 
as I know, the D code should allocate no memory at all for strings, only 
for the AA.  I suspect the problem may be GC scans of the rather large 
(4 megabyte) data area where the book contents are stored.  If the D 
code were modified to allocate this memory via malloc instead of new, 
the D program may speed up substantially.  I'll see about dropping the 
C++ file code into the D app this evening.

 So what do you think we should do about GC to make D faster?  Do you favor 
 making GC optional or writing a more optimized, compacting GC?  I personally 
 think that even optimized GC won't come close to code that has been 
 optimized with manual memory management.

I'm not sure either approach could be said to be "better" than the other 
insofar as performance is concerned as it really comes down to 
application design.  I'll admit that, until D, I'd always dismissed 
garbage collection for one reason or another.  But after using D for a 
while I suddenly realized how much time and code I used to invest in 
managing data ownership issues.  I won't go so far as to say that data 
ownership is not important, but being freed from the yoke of smart 
pointer code and such I've found both that I'm more productive and that 
the result is more elegant than its C++ equivalent.  My only remaining 
issue with GC is that it makes it too easy to ignore what's actually 
going on at a low level.  This probably isn't an issue for experienced 
programmers who can't help but think about these things, but in I still 
feel that learning programmers should be forced to deal with this stuff 
simply so they're aware of it.


Sean

Apr 27 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 
 So what do you think we should do about GC to make D faster?  Do you favor 
 making GC optional or writing a more optimized, compacting GC?  I personally 
 think that even optimized GC won't come close to code that has been 
 optimized with manual memory management.

By the way, I do think that GC performance can be improved significantly 
over where it is now, so don't take my other reply as an admission that 
GC is slow.  Also, I suspect that by the time D hits 1.0, GC performance 
will be much improved over what it is now.  This can be achieved a 
number of different ways, but the most obvious would be for the GC to 
ignore memory blocks that don't contain pointers.


Sean

Apr 27 2006

"Craig Black" <cblack ara.com> writes:

 By the way, I do think that GC performance can be improved significantly
 over where it is now, so don't take my other reply as an admission that
 GC is slow.  Also, I suspect that by the time D hits 1.0, GC performance
 will be much improved over what it is now.

I hope that you are right.

This can be achieved a
 number of different ways, but the most obvious would be for the GC to
 ignore memory blocks that don't contain pointers.

I didn't know such selective scanning was possible with GC.  Would this
optimization be easy to implement?  If such an optimization were
implemented, then I suppose it would be wise to keep heap allocation local
to small blocks of memory so that the GC would not have much to scan.

-Craig

Apr 28 2006

Sean Kelly <sean f4.ca> writes:

Last follow-up.  I re-ran all tests with a bit less stuff going on in 
the background, and I ran the C++ tests with MSVC as well, including the 
hash table version (though I have no DMC run for this particular app). 
Here are the results.


D version 1 (wc) : the original example
D version 2 (wc2): using malloc for the buffer

C++ version 1 (wccpp2): the original example
C++ version 2 (wccpp3): replaced C++ stream code with C routines
C++ version 3 (wccpp4): replaced std::string with slice<char>
C++ version 4 (wccpp5): replaced std::map with unordered_map


D version 1 (wc):

Execution time: 0.380 s
Execution time: 0.423 s
Execution time: 0.446 s

D version 2 (wc2):

Execution time: 0.659 s
Execution time: 0.424 s
Execution time: 0.377 s

----------

DMC version 1 (wccpp2):

Execution time: 1.656 s
Execution time: 1.652 s
Execution time: 1.656 s

DMC version 2 (wccpp3):

Execution time: 1.230 s
Execution time: 1.257 s
Execution time: 1.290 s

DMC version 3 (wccpp4):

Execution time: 0.674 s
Execution time: 0.729 s
Execution time: 0.697 s

----------

MSVC version 1 (wccpp2):

Execution time: 1.093 s
Execution time: 1.120 s
Execution time: 1.042 s

MSCV version 2 (wccpp3):

Execution time: 0.797 s
Execution time: 0.833 s
Execution time: 0.799 s

MSVC version 3 (wccpp4):

Execution time: 0.761 s
Execution time: 0.744 s
Execution time: 0.726 s

MSVC version 4 (wccpp5):

Execution time: 0.438 s
Execution time: 0.436 s
Execution time: 0.406 s


The most amazing thing here is the dramatic performance difference 
between these D runs and the previous ones.  I won't speculate on why, 
bu the exact same app appears to be running 3 times as fast as it did 
this morning.  Beyond that, we can see that the C++ app using slices and 
a hash table performs similarly to the D implementation, though it's 
difficult to draw any solid conclusions without a test run under DMC. 
Also, I should note for the record that the hash table implementation 
doesn't rehash automatically (it wasn't a feature I needed at the time), 
so I preallocated 4096 buckets before running the word count.  There are 
793875 words in the KJ bible so this results in somewhat long chains, 
but I saw roughly similar results with double the number of buckets so I 
think the bulk of processing at this speed may be taken up by IO.  The 
only other notable difference is that the C++ code uses standard C calls 
for the file operations while the D code uses Windows calls.  I can't 
see this resulting in a noticeable speed difference, but I figured it 
was worth mentioning.

Finally, as these tests were quite informal I don't suggest reading too 
much into them.  However I think a reasonable conclusion would be that D 
is faster than C++ for typical user applications (assuming a use pattern 
similar to the word count application) and while C++ can be quite 
competitive, making it so requires the use of hand-written code over 
standard library code.  C++ will fare a bit better in 2009 when 
unordered_map is an official part of the standard, but that still leaves 
std::string as the suggested way to store string data.  This may not be 
an issue if you're dealing with English text where most words are under 
16 characters, but it could be for other data sets where allocations 
will occur.


Sean

Apr 27 2006

"Craig Black" <cblack ara.com> writes:

Please allow me to ask a stupid question.  Why are there three benchmarks
for each version?  Are you testing on different computers or something?

The most amazing thing here is the dramatic performance difference
between these D runs and the previous ones.  I won't speculate on why,
bu the exact same app appears to be running 3 times as fast as it did
this morning.

That is very strange ... so it seems that these benchmarks, being
inconsistent, were inconclusive.  Sounds like we need more testing.  Anyway,
D came out on top that time so that is good news!

Finally, as these tests were quite informal I don't suggest reading too
much into them.

Agreed.  Benchmarking one small program doesn't really say much.

However I think a reasonable conclusion would be that D
is faster than C++ for typical user applications

I would agree, but not just based on a single benchmark.

-Craig




-Craig

Apr 28 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 Please allow me to ask a stupid question.  Why are there three benchmarks
 for each version?  Are you testing on different computers or something?

I wanted to reduce the chance that random system noise would affect a 
result.  I suppose I could have simply posted the mean for each test, 
but it was easier to cut & paste than to get out a calculator.


Sean

Apr 29 2006

Walter Bright <newshound digitalmars.com> writes:

Mike Capp wrote:
 In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...
 1) for every allocation, shared_ptr<> allocates an *extra* block of 
 memory to hold its accounting information. So you've got, out of the 
 box, twice as many allocations/deallocations happening.

 Not true for intrusive refcounting. (This isn't always feasible if you're
 dealing with third-party classes, which is why shared_ptr isn't intrusive, but
 if you _can_ use it it's very close to ideal.)

True, but shared_ptr<> is billed as the C++ answer to gc, and 
shared_ptr<> is not intrusive.

 2) for every copy of the shared_ptr<>, like passing one to a function, 
 returning one, etc., the reference count has to be incremented, then 
 decremented.

 Again, not true for intrusive refcounting. You can assign an intrusive
 smartpointer from a raw pointer or reference, so most of the time you can use
 those to pass/return. Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case where you have a raw pointer but
 are calling code which might drop the last "real" owning ref. 

But that means just going back to the old way of being very, very 
careful about who "owns" the pointer and who doesn't. Refcounting is 
supposed to eliminate that nuisance.


 And even with all that falderol, shared_ptr<> still can't do slicing, 
 which are an important efficiency improvement.

 Agreed, but couldn't you do something very similar with a simple ptr class
 containing a pointer to the (refcounted) base string plus a couple of range end
 indices? That's worked well for me in the past, but there may be benefits of D
 slices that I'm missing.

Sure you could do that, but then slices and shared_ptr's are not 
interchangeable. In D, you can build a data structure that contains 
slices, references to the heap, and references to static data 
willy-nilly. In C++, because each of those would be a different type, 
you'd have to build a wall between each.

I think that's one of the key benefits of D slicing - the user of the 
slice need not know or care that its a slice. I don't have to have two 
versions of toupper(), one for shared_ptr and one for slices.


 Another advantage of gc (that isn't implemented, but could be,
 in D's gc) is that allocated memory can be compacted.

 Has there ever been a language that successfully combined compacting GC with
raw
 pointers? How would this work? And it's going to be tough to drop raw pointers
 without breaking ease-of-use for C libraries.

Oh, it's quite doable <g>.

Apr 25 2006

Mike Capp <mike.capp gmail.com> writes:

In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...
True, but shared_ptr<> is billed as the C++ answer to gc, and 
shared_ptr<> is not intrusive.

Shrug. I'm finding myself increasingly unconcerned these days with what's
"billed as the C++ answer". I mean, iostreams?

 Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case 


But that means just going back to the old way of being very, very 
careful about who "owns" the pointer and who doesn't. Refcounting is 
supposed to eliminate that nuisance.

I'm not sure that's true. Smartpointer refcounting is supposed to make ownership
rules explicit in code where they can be a) seen and b) automated, instead of
hidden in documentation.

In practice, I still make daft array-overrun and fencepost errors from time to
time, but I've never had a problem with refcounting and ownership. As a first
approximation: member variables, globals and new'ed locals are owners, and
everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend
to be glaringly obvious.

Sure you could do that, but then slices and shared_ptr's are not 
interchangeable.

True, but again, I'm not especially interested in shared_ptr, and this isn't a
fundamental problem - you can write intrusive interchangeable string and slice
classes in C++. 

 Has there ever been a language that successfully combined 
 compacting GC with raw pointers? How would this work? 

Oh, it's quite doable <g>.

I find your ideas intriguing, and would like to subscribe to your newsletter.

Can you elaborate, or is this a trade secret?

cheers
Mike

Apr 26 2006

Walter Bright <newshound digitalmars.com> writes:

Mike Capp wrote:
 In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...
 True, but shared_ptr<> is billed as the C++ answer to gc, and 
 shared_ptr<> is not intrusive.

 
 Shrug. I'm finding myself increasingly unconcerned these days with what's
 "billed as the C++ answer". I mean, iostreams?

I know what you mean.


 Refcount twiddling is only needed when changing an
 "owning" pointer, plus the rare "safety" case 

 But that means just going back to the old way of being very, very 
 careful about who "owns" the pointer and who doesn't. Refcounting is 
 supposed to eliminate that nuisance.

 I'm not sure that's true. Smartpointer refcounting is supposed to make
ownership
 rules explicit in code where they can be a) seen and b) automated, instead of
 hidden in documentation.

I can see that point, but I can see that once you stray into such 
optimizations, there's no help from the compiler, and you'll need to be 
very careful. A corollary to that is extra time is spent, and bugs are 
likely.

 In practice, I still make daft array-overrun and fencepost errors from time to
 time, but I've never had a problem with refcounting and ownership. As a first
 approximation: member variables, globals and new'ed locals are owners, and
 everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend
 to be glaringly obvious.

Andrei Alexandrescu has proposed a series of modifications to C++ to 
enable the compiler to enforce such rules, but nobody else seemed much 
interested in it.

 Sure you could do that, but then slices and shared_ptr's are not 
 interchangeable.

 
 True, but again, I'm not especially interested in shared_ptr, and this isn't a
 fundamental problem - you can write intrusive interchangeable string and slice
 classes in C++.

Perhaps, but I've never seen anyone pull it off.


 Has there ever been a language that successfully combined 
 compacting GC with raw pointers? How would this work? 

 Oh, it's quite doable <g>.

 
 I find your ideas intriguing, and would like to subscribe to your newsletter.
 
 Can you elaborate, or is this a trade secret?

I'd rather not, until I get it working.

Apr 26 2006

Sean Kelly <sean f4.ca> writes:

BCS wrote:
 
 A bad GC will definitely make for bad performance. And any GC will be slower
 than an optimum malloc/free solution. However (and I�m still undecided on this
 one) I think the argument is that in any non trivial project, manually freeing
 all of the memory without any memory leaks or axing something to soon would
 require more overhead than using GC. The trade-off is not in the freeing time
 but in the can-I-free-this check.

I believe this is indeed the common argument.  In C++, smart pointers 
have become a very popular way to manage memory, and while smart 
pointers are technically a garbage collection strategy, I think they 
aren't considered as such here.  In any case, the cost comes from 
updating the reference count, not from the malloc/free, as the updates 
must typically be performed atomically.  That said, an accurate 
performance comparison between GC and manual memory management is 
extremely difficult because implementation details and use strategies 
have a tremendous impact on results.  For anyone interested, there's 
been an extensive argu- um, dialog on the topic recently on 
comp.lang.c++.moderated titled "Can GC be benifical":




Sean

Apr 24 2006

James Dunne <james.jdunne gmail.com> writes:

BCS wrote:
 
 [snip] I think the argument is that in any non trivial project, manually
freeing
 all of the memory without any memory leaks or axing something to soon would
 require more overhead than using GC. The trade-off is not in the freeing time
 but in the can-I-free-this check.
 

Not true.  A hierachical memory management system is great!  It sure 
beats reference counting, manual memory management, and conservative GC.

What you do is allocate some memory, but also provide a "parent" memory 
pointer to relate the new memory to.  When all is said and done you have 
a nice tree of related memory pointers (hopefully with just one root, 
provided you designed correctly).

To free some memory, simply walk all the branches of that node, freeing 
their memory, then free the original memory.  You can even get 
predictable destructors for objects this way.

I'm currently developing a project in C which uses this method.  I have 
it allocating lots of small blocks of memory to perform a job.  The 
total comes to 32KB.  When the program is finished I do one "free" call 
to the root structure and I get back ALL of my memory.  Not a single 
byte is leaked.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 26 2006

xs0 <xs0 xs0.com> writes:

 Not true.  A hierachical memory management system is great!  It sure 
 beats reference counting, manual memory management, and conservative GC.
 
 What you do is allocate some memory, but also provide a "parent" memory 
 pointer to relate the new memory to.  When all is said and done you have 
 a nice tree of related memory pointers (hopefully with just one root, 
 provided you designed correctly).
 
 To free some memory, simply walk all the branches of that node, freeing 
 their memory, then free the original memory.  You can even get 
 predictable destructors for objects this way.
 
 I'm currently developing a project in C which uses this method.  I have 
 it allocating lots of small blocks of memory to perform a job.  The 
 total comes to 32KB.  When the program is finished I do one "free" call 
 to the root structure and I get back ALL of my memory.  Not a single 
 byte is leaked.

But how can you tell _when_ to delete the root? AFAIK, the main benefit 
of GC is not that you don't have to manually free the memory, but that 
you don't have to know when it is safe to do so. I don't see how 
organizing memory blocks into a tree solves that...


xs0

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

xs0 wrote:
 
 Not true.  A hierachical memory management system is great!  It sure 
 beats reference counting, manual memory management, and conservative GC.

 What you do is allocate some memory, but also provide a "parent" 
 memory pointer to relate the new memory to.  When all is said and done 
 you have a nice tree of related memory pointers (hopefully with just 
 one root, provided you designed correctly).

 To free some memory, simply walk all the branches of that node, 
 freeing their memory, then free the original memory.  You can even get 
 predictable destructors for objects this way.

 I'm currently developing a project in C which uses this method.  I 
 have it allocating lots of small blocks of memory to perform a job.  
 The total comes to 32KB.  When the program is finished I do one "free" 
 call to the root structure and I get back ALL of my memory.  Not a 
 single byte is leaked.

 
 
 But how can you tell _when_ to delete the root? AFAIK, the main benefit 
 of GC is not that you don't have to manually free the memory, but that 
 you don't have to know when it is safe to do so. I don't see how 
 organizing memory blocks into a tree solves that...
 
 
 xs0

That's a different problem and depends on the nature of the application. 
  If you're in a server environment, you free memory when the client 
request has been completely processed.  If you're in an interactive 
user-interface environment, you free memory when you close documents, 
close windows, etc.  When the data is no longer needed, you'll know when 
to free and what to free.

I think you thought I was implying you should only have one memory root 
which all things are related to and that is the only thing that should 
ever be freed.  The first part is right, but not the second.  You should 
keep all the memory related to each other, but it's not that you have to 
free it in an "all or nothing" circumstance.  Then again, that depends 
on your application type.

In my example application I have lots of memory free operations on the 
individual nodes when I know I won't need that memory anymore, such as 
for temporary strings.  If I forget to include an explicit free call to 
that temporary memory, it's okay since I know that memory has been 
related to another parent memory block that _will_ be freed.

Can you provide me a few legitimate examples of when you won't know what 
to free and when to free it?  I'm having a tough time coming up with 
examples on my own...

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 26 2006

Walter Bright <newshound digitalmars.com> writes:

James Dunne wrote:
 Can you provide me a few legitimate examples of when you won't know what 
 to free and when to free it?  I'm having a tough time coming up with 
 examples on my own...

One example that has plagued me in the past is having two distinct tree 
data structures, and it becomes necessary for a node in one tree to 
refer to data in the other.

Or when you have a third data structure, and you need to point to data 
in one or the other or both of the first two trees.

A typical example of that is symbol tables inside the compiler. In the 
DMD front end (which is garbage collected) the data has all kinds of 
cross-connections in it.

I.e. the problem crops up when the data in memory just doesn't fit into 
a strictly heirarcical format.

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

Walter Bright wrote:
 James Dunne wrote:
 
 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...

 
 
 One example that has plagued me in the past is having two distinct tree 
 data structures, and it becomes necessary for a node in one tree to 
 refer to data in the other.
 
 Or when you have a third data structure, and you need to point to data 
 in one or the other or both of the first two trees.
 
 A typical example of that is symbol tables inside the compiler. In the 
 DMD front end (which is garbage collected) the data has all kinds of 
 cross-connections in it.
 
 I.e. the problem crops up when the data in memory just doesn't fit into 
 a strictly heirarcical format.

Your example is having two trees _point_ to data within each other.  The 
original data is still _owned_ by the original tree which the data 
resides in.  You wouldn't be swapping ownership of the data like you're 
talking about.  In this system, when memory is allocated it is assigned 
an owner and that ownership does not ever change.  So the system 
resembles a tree (one parent only), not a directed graph.

Funny you mention compiler, because that is the example application I 
was referring to. :)

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

James Dunne wrote:
 Walter Bright wrote:
 
 James Dunne wrote:

 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...



 One example that has plagued me in the past is having two distinct 
 tree data structures, and it becomes necessary for a node in one tree 
 to refer to data in the other.

 Or when you have a third data structure, and you need to point to data 
 in one or the other or both of the first two trees.

 A typical example of that is symbol tables inside the compiler. In the 
 DMD front end (which is garbage collected) the data has all kinds of 
 cross-connections in it.

 I.e. the problem crops up when the data in memory just doesn't fit 
 into a strictly heirarcical format.

 
 
 Your example is having two trees _point_ to data within each other.  The 
 original data is still _owned_ by the original tree which the data 
 resides in.  You wouldn't be swapping ownership of the data like you're 
 talking about.  In this system, when memory is allocated it is assigned 
 an owner and that ownership does not ever change.  So the system 
 resembles a tree (one parent only), not a directed graph.
 
 Funny you mention compiler, because that is the example application I 
 was referring to. :)
 

I should also mention that my compiler implementation makes use of 
context structures, so every memory block that the compiler will 
allocate is somehow parented to that context structure.  So when the 
compile job is complete, free the context structure and all the memory 
goes away.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 26 2006

Dave <Dave_member pathlink.com> writes:

James Dunne wrote:
 
 I should also mention that my compiler implementation makes use of 
 context structures, so every memory block that the compiler will 
 allocate is somehow parented to that context structure.  So when the 
 compile job is complete, free the context structure and all the memory 
 goes away.
 

What kinda compiler?

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

Dave wrote:
 James Dunne wrote:
 
 I should also mention that my compiler implementation makes use of 
 context structures, so every memory block that the compiler will 
 allocate is somehow parented to that context structure.  So when the 
 compile job is complete, free the context structure and all the memory 
 goes away.

 
 What kinda compiler?

My own research language named Iris.  I'd like to keep it under the 
radar for now.  You can read about it in my monthly rants on 
http://iris-design-log.blogspot.com/.

-- 
Regards,
James Dunne

Apr 26 2006

Walter Bright <newshound digitalmars.com> writes:

James Dunne wrote:

 Your example is having two trees _point_ to data within each other.  The 
 original data is still _owned_ by the original tree which the data 
 resides in.  You wouldn't be swapping ownership of the data like you're 
 talking about.  In this system, when memory is allocated it is assigned 
 an owner and that ownership does not ever change.  So the system 
 resembles a tree (one parent only), not a directed graph.

But when this happens in the general case, I can't delete *any* of the 
trees.

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

Walter Bright wrote:
 James Dunne wrote:
 
 Your example is having two trees _point_ to data within each other.  
 The original data is still _owned_ by the original tree which the data 
 resides in.  You wouldn't be swapping ownership of the data like 
 you're talking about.  In this system, when memory is allocated it is 
 assigned an owner and that ownership does not ever change.  So the 
 system resembles a tree (one parent only), not a directed graph.

 
 
 But when this happens in the general case, I can't delete *any* of the 
 trees.

When are you trying to delete one of these trees?

To which 'tree' are you referring, the data structure tree or the 
hierarchical memory manager's tree?

-- 
Regards,
James Dunne

Apr 26 2006

Walter Bright <newshound digitalmars.com> writes:

James Dunne wrote:
 Walter Bright wrote:
 James Dunne wrote:

 Your example is having two trees _point_ to data within each other.  
 The original data is still _owned_ by the original tree which the 
 data resides in.  You wouldn't be swapping ownership of the data like 
 you're talking about.  In this system, when memory is allocated it is 
 assigned an owner and that ownership does not ever change.  So the 
 system resembles a tree (one parent only), not a directed graph.


 But when this happens in the general case, I can't delete *any* of the 
 trees.

 
 When are you trying to delete one of these trees?

I could wait until the program finishes before deleting anything, but 
that isn't memory management.

 To which 'tree' are you referring, the data structure tree or the 
 hierarchical memory manager's tree?

The nested symbol table of the local scope of a function, for example.

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

Walter Bright wrote:
 James Dunne wrote:
 
 Can you provide me a few legitimate examples of when you won't know 
 what to free and when to free it?  I'm having a tough time coming up 
 with examples on my own...

 
 
 One example that has plagued me in the past is having two distinct tree 
 data structures, and it becomes necessary for a node in one tree to 
 refer to data in the other.
 
 Or when you have a third data structure, and you need to point to data 
 in one or the other or both of the first two trees.
 
 A typical example of that is symbol tables inside the compiler. In the 
 DMD front end (which is garbage collected) the data has all kinds of 
 cross-connections in it.
 
 I.e. the problem crops up when the data in memory just doesn't fit into 
 a strictly heirarcical format.

If you parent the memory you're creating in the tree data structures to 
a structure that is higher up the chain than both tree data structures, 
then the data doesn't get lost simply because you freed the data 
structure that held references to that memory.

-- 
Regards,
James Dunne

Apr 26 2006

Walter Bright <newshound digitalmars.com> writes:

James Dunne wrote:
 Walter Bright wrote:
 I.e. the problem crops up when the data in memory just doesn't fit 
 into a strictly heirarcical format.

 If you parent the memory you're creating in the tree data structures to 
 a structure that is higher up the chain than both tree data structures, 
 then the data doesn't get lost simply because you freed the data 
 structure that held references to that memory.

I've designed data structures around the memory management requirement 
for keeping a sole 'parent' around for it that can free it. But I'd 
rather design data structures around how they are used instead.

For example, if I have an associative array of strings, I can stuff any 
old string into it - even static strings - without having to worry about 
who 'owns' those strings. It's quite liberating <g>.

Apr 26 2006

"Craig Black" <cblack ara.com> writes:

Good idea!  I suppose that every time you allocate memory you must always 
specify a parent node in the hierarchy.  This requires a little more thought 
than GC, because you have to consider how to organize the hierarchy so that 
it will be easy to free memory relating to a given task once the task is 
done.  Such a system is would be very fast, and with just a small amount of 
preparation, memory leaks are completely eliminated.  It could potentially 
be faster than traditional memory management, because you don't necessarily 
have to call free on every block of memory.  Rather, one free could free the 
entire hierarchy, or a large sub-hierarchy.  Have you experienced any 
drawbacks?  Does this approach work well with exceptions? threads?  Also, 
what kind of data structure do you use for this hierarchy?

-Craig

Apr 26 2006

James Dunne <james.jdunne gmail.com> writes:

Craig Black wrote:
 Good idea!  I suppose that every time you allocate memory you must always 
 specify a parent node in the hierarchy.  This requires a little more thought 
 than GC, because you have to consider how to organize the hierarchy so that 
 it will be easy to free memory relating to a given task once the task is 
 done.  Such a system is would be very fast, and with just a small amount of 
 preparation, memory leaks are completely eliminated.  It could potentially 
 be faster than traditional memory management, because you don't necessarily 
 have to call free on every block of memory.  Rather, one free could free the 
 entire hierarchy, or a large sub-hierarchy.  Have you experienced any 
 drawbacks?  Does this approach work well with exceptions? threads?  Also, 
 what kind of data structure do you use for this hierarchy?
 
 -Craig 
 
 

Holy geez... Yes it is a good idea, but alas I cannot take credit for 
it.  You point out all the benefits that I've gotten from it so far 
quite accurately. :)

I've only used it with my compiler implementation written in C, so I'll 
discuss my experience relating to that, if that's okay with you...

Drawbacks?  Only that you have to give it that much more thought about 
what to parent to what...  There's no serious runtime penalty - it just 
adds a pointer to the parent's list of children, but that cost is tacked 
on to allocating memory in general.

Exceptions?  I'm coding in C, and I wrote my own pseudo-exception model, 
which makes use of that memory model so I don't even leak memory from 
exceptions... a good thing!

Threads?  Haven't gotten there yet, but I plan to design my threads to 
be completely independent of each other.  It's easy to do when you're 
lexing/parsing/analyzing different source files at the same time.  Only 
need to synchronize when each pass is completed.  I'll look into 
concurrency issues in the allocator code when I get to it.

The data structure is a simple list of child pointers and a single 
parent pointer for each memory block.  Linked lists.  The code still 
relies, of course, on malloc and free to actually get the memory in the 
first place.

-- 
Regards,
James Dunne

Apr 26 2006

"Craig Black" <cblack ara.com> writes:

"James Dunne" <james.jdunne gmail.com> wrote in message 
news:e2pjk0$96l$1 digitaldaemon.com...
 Craig Black wrote:
 Good idea!  I suppose that every time you allocate memory you must always 
 specify a parent node in the hierarchy.  This requires a little more 
 thought than GC, because you have to consider how to organize the 
 hierarchy so that it will be easy to free memory relating to a given task 
 once the task is done.  Such a system is would be very fast, and with 
 just a small amount of preparation, memory leaks are completely 
 eliminated.  It could potentially be faster than traditional memory 
 management, because you don't necessarily have to call free on every 
 block of memory.  Rather, one free could free the entire hierarchy, or a 
 large sub-hierarchy.  Have you experienced any drawbacks?  Does this 
 approach work well with exceptions? threads?  Also, what kind of data 
 structure do you use for this hierarchy?

 -Craig

 Holy geez... Yes it is a good idea, but alas I cannot take credit for it. 
 You point out all the benefits that I've gotten from it so far quite 
 accurately. :)

 I've only used it with my compiler implementation written in C, so I'll 
 discuss my experience relating to that, if that's okay with you...

 Drawbacks?  Only that you have to give it that much more thought about 
 what to parent to what...  There's no serious runtime penalty - it just 
 adds a pointer to the parent's list of children, but that cost is tacked 
 on to allocating memory in general.

 Exceptions?  I'm coding in C, and I wrote my own pseudo-exception model, 
 which makes use of that memory model so I don't even leak memory from 
 exceptions... a good thing!

 Threads?  Haven't gotten there yet, but I plan to design my threads to be 
 completely independent of each other.  It's easy to do when you're 
 lexing/parsing/analyzing different source files at the same time.  Only 
 need to synchronize when each pass is completed.  I'll look into 
 concurrency issues in the allocator code when I get to it.

 The data structure is a simple list of child pointers and a single parent 
 pointer for each memory block.  Linked lists.  The code still relies, of 
 course, on malloc and free to actually get the memory in the first place.

Of coarse it does.  Along those lines ... I've been thinking of an 
optimization for this.  Couldn't you make come of your parent nodes an 
allocators, so that all of their children get their memory from it?  This 
way, you could allocate large blocks of memory, and could cut down on your 
calls to malloc and free by an order of magnitude.  Obviously you would have 
to choose what nodes in the hierarchy become allocators such that the least 
number of CRT calls would be made.

-Craig

Apr 27 2006

"Craig Black" <cblack ara.com> writes:

Couldn't you make come of your parent nodes an allocators, so that all of 
their >children get their memory from it?

I was in a hurry.  Let me revise.

Couldn't you make some of your parent nodes allocators, so that all of their 
children get their memory from their parent allocator nodes?

Apr 27 2006

James Dunne <james.jdunne gmail.com> writes:

Craig Black wrote:
Couldn't you make come of your parent nodes an allocators, so that all of 
their >children get their memory from it?

 
 
 I was in a hurry.  Let me revise.
 
 Couldn't you make some of your parent nodes allocators, so that all of their 
 children get their memory from their parent allocator nodes?
 
 

Hehe.  That's okay.  I understand text through typos.

I'm not entirely familiar with the concept of allocators as you 
describe.  Is this similar to memory pooling?  I always skipped over 
that portion of the C++ STL; way too much complexity for my small brain =P.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 27 2006

"Craig Black" <cblack ara.com> writes:

 I'm not entirely familiar with the concept of allocators as you describe. 
 Is this similar to memory pooling?  I always skipped over that portion of 
 the C++ STL; way too much complexity for my small brain =P.

Instead of calling malloc for each allocation unit, you could write an 
allocator class that would allocate memory in large chunks.  Then when you 
want an allocation unit, you request it from the allocator.  The allocator 
simply hands out the memory, an marks it as used.

The hierarchical memory system you describe lends itself to using such 
custom allocators, because you would never have to call free for each unit. 
You could call free at the allocator level instead.  In this way, you would 
only call malloc and free for large chunks of memory.

You could also inform each allocator how much memory to reserve when they 
are created, since you may have a good idea of how much memory a given task 
will require.

Did that help?

-Craig

Apr 27 2006

James Dunne <james.jdunne gmail.com> writes:

Craig Black wrote:
I'm not entirely familiar with the concept of allocators as you describe. 
Is this similar to memory pooling?  I always skipped over that portion of 
the C++ STL; way too much complexity for my small brain =P.

 
 
 Instead of calling malloc for each allocation unit, you could write an 
 allocator class that would allocate memory in large chunks.  Then when you 
 want an allocation unit, you request it from the allocator.  The allocator 
 simply hands out the memory, an marks it as used.
 
 The hierarchical memory system you describe lends itself to using such 
 custom allocators, because you would never have to call free for each unit. 
 You could call free at the allocator level instead.  In this way, you would 
 only call malloc and free for large chunks of memory.
 
 You could also inform each allocator how much memory to reserve when they 
 are created, since you may have a good idea of how much memory a given task 
 will require.
 
 Did that help?
 
 -Craig 
 
 

Yes I understand.  Thank you.  Very interesting concept.

I don't think in my current situation that calling malloc is a definite 
bottleneck.  When I see a performance concern later on, I'll definitely 
look into it.

-- 
Regards,
James Dunne

Apr 27 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Faster Virtual Method Dispatch