
digitalmars.D - Inline Functions

reply Mason Green (Zzzzrrr) <mason.green gmail.com> writes:
Hello,

I'm looking for ways to optimize Blaze, the D port of Box2D, and running into
some frustrations.  In fact, the Java port of the same library
(http://www.jbox2d.org/v2demos/) is currently running circles around Blaze,
performance-wise.

I have a sneaking suspicion that this is the result of the many thousands of
vector math operations that are performed each cycle during my stress test. 

Is there a way to force inline function calls?  I'm compiling my code with
'-release -O -inline', but this seems not to have much of an effect on
performance.  When I remove -inline there doesn't seem to be much of a
difference in execution speed. 

FYI, I'm using DMD v1.035 on Windows x32.

Thanks,
Mason
Feb 24 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Mason Green:

>I'm looking for ways to optimize Blaze, the D port of Box2D, and running into
some frustrations.  In fact, the same Java port
(http://www.jbox2d.org/v2demos/) is currently running circles around Blaze,
performance-wise.<

on dotnet) is getting closer to well-compiled C++ code (and it's much simpler to write than C++).

A JavaVM like HotSpot is more refined than the backend of DMD: its GC is much more refined and more efficient, it's much better at inlining virtual methods, its data structures are usually better performance-tuned, etc. The D language is newer than Java, and it has enjoyed far less money, developers and users.

Have you profiled your D code? What has the profiling told you? Have you seen where you allocate memory, to move such allocations away from inner loops, or just reduce their number?

Bye,
bearophile
Feb 24 2009
parent reply Mason Green <mason.green gmail.com> writes:
bearophile:

Thanks for the reply.
 
> A JavaVM like HotSpot is more refined than the backend of DMD, its GC is much
> more refined and more efficient, it's much better at inlining virtual methods,
> its data structures are usually better performance-tuned, etc. The D language
> is newer than Java, and it has enjoyed far less money, developers and users.

Very well put! But do you know if there is a way to force inlining where I want it? Someone mentioned to me that template mixins may work...? I would rather not inline all the code by hand, as I would like to trust the compiler.

> Have you profiled your D code? What has the profiling told you? Have you seen
> where you allocate memory, to move such allocations away from inner loops, or
> just reduce their number?

No, I have not profiled the D code other than using an FPS counter... :-) To be honest, I'm fairly light on experience when it comes to profiling. Do you have any suggestions on how to make it happen?

Bye,
Mason
Feb 24 2009
next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 24 Feb 2009 22:08:26 +0300, Mason Green <mason.green gmail.com>  
wrote:

> bearophile:
>
> Thanks for the reply.
>
>> A JavaVM like HotSpot is more refined than the backend of DMD, its GC
>> is much more refined and more efficient, it's much better at inlining
>> virtual methods, its data structures are usually better
>> performance-tuned, etc. The D language is newer than Java, and it has
>> enjoyed far less money, developers and users.
>
> Very well put! But do you know if there is a way to force inlining where I want it? Someone mentioned to me that template mixins may work...? I would rather not inline all the code by hand, as I would like to trust the compiler.
>
>> Have you profiled your D code? What has the profiling told you? Have
>> you seen where you allocate memory, to move such allocations away from
>> inner loops, or just reduce their number?
>
> No, I have not profiled the D code other than using an FPS counter... :-) To be honest, I'm fairly light on experience when it comes to profiling. Do you have any suggestions on how to make it happen?
>
> Bye,
> Mason

DMD has profiling built in. Just recompile your code with the -profile flag, run it once, and analyze the output.
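For anyone who hasn't tried it, here is a minimal sketch of running the built-in profiler on a toy vector-math loop. The file name proftest.d and the Vec2/dot/stress names are made up for illustration, and the report layout may differ between DMD versions; as far as I know, the instrumented program writes its report to trace.log in the working directory when it exits.

```d
// Build with profiling instrumentation and run once (file name is an assumption):
//   dmd -profile proftest.d
//   proftest            (./proftest on Linux)
// The run should leave a trace.log report in the working directory.

struct Vec2
{
    float x, y;
}

// Hypothetical hot function: the kind of small helper a physics step calls constantly.
float dot(Vec2 a, Vec2 b)
{
    return a.x * b.x + a.y * b.y;
}

float stress(int iterations)
{
    float sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += dot(Vec2(1, 2), Vec2(3, 4));   // 1*3 + 2*4 = 11 per call
    return sum;
}

void main()
{
    // 1000 calls of 11 each; small enough to stay exact in a float.
    assert(stress(1000) == 11000);
}
```

In the report, the functions with the highest call counts and times are the first candidates to look at before worrying about inlining.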
Feb 24 2009
prev sibling next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Mason Green wrote:

> bearophile:
>
> Thanks for the reply.
>
>> A JavaVM like HotSpot is more refined than the backend of DMD, its GC is
>> much more refined and more efficient, it's much better at inlining virtual methods, its data structures are usually better performance-tuned, etc. The D language is newer than Java, and it has enjoyed far less money, developers and users.
>
> Very well put! But do you know if there is a way to force inlining where
> I want it? Someone mentioned to me that template mixins may work...? I would rather not inline all the code by hand, as I would like to trust the compiler.

You could use mixins, but that won't lead to pretty code. It's useful to know which kinds of code can get inlined by dmd. I don't have much knowledge of this, but the most common things that won't get inlined are loops, delegates and virtual functions, IIRC.

>> Have you profiled your D code? What has the profiling told you? Have you
>> seen where you allocate memory, to move such allocations away from inner loops, or just reduce their number?
>
> No, I have not profiled the D code other than using an FPS counter... :-)
> To be honest, I'm fairly light on experience when it comes to profiling. Do you have any suggestions on how to make it happen?

dmd's built-in profiler can be useful. Some time ago I wrote a small utility to help make its output more readable: http://www.dsource.org/projects/scrapple/wiki/PtraceUtility
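On the point about virtual functions, a rough sketch of how small math helpers can be kept non-virtual is below. The Vec2 and RigidBody types are invented for the example and are not taken from Blaze. Class methods in D are virtual by default, while struct methods and methods marked final are not, so the latter are at least eligible for the inliner; whether a given compiler actually inlines them still has to be checked.

```d
struct Vec2
{
    float x, y;

    // Struct methods are never virtual, so there is no vtable lookup in the way.
    Vec2 add(Vec2 o)
    {
        return Vec2(x + o.x, y + o.y);
    }
}

class RigidBody
{
    Vec2 pos;

    // Class methods are virtual by default; final removes the virtual dispatch.
    final void move(Vec2 delta)
    {
        pos = pos.add(delta);
    }
}

void main()
{
    auto b = new RigidBody;
    b.pos = Vec2(1, 2);
    b.move(Vec2(3, 4));
    assert(b.pos.x == 4 && b.pos.y == 6);
}
```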
Feb 24 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
I seem to remember from a previous discussion about  optimizing a
ray-tracer that DMD will not inline functions that take reference
parameters.   Can anyone else confirm this?

--bb
Feb 24 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Bill Baxter (wbaxter gmail.com)'s article
> I seem to remember from a previous discussion about optimizing a
> ray-tracer that DMD will not inline functions that take reference
> parameters.  Can anyone else confirm this?
> --bb

Here's a test program I wrote and the relevant parts of the disassembly. It was compiled w/ -O -inline -release. I think you're right, strange as it seems. I wonder why ref is never inlined.

    void main() {
        uint foo;
        inc(foo);
    }

    void inc(ref uint num) {
        num++;
    }

    __Dmain PROC NEAR
    ;  COMDEF __Dmain
            push    eax
            lea     eax, [esp]
            mov     dword ptr [esp], 0
            call    _D4test3incFKkZv
            xor     eax, eax
            pop     ecx
            ret
    __Dmain ENDP

    _text$__Dmain ENDS

    _text$_D4test3incFKkZv SEGMENT DWORD PUBLIC 'CODE'

    _D4test3incFKkZv PROC NEAR
    ;  COMDEF _D4test3incFKkZv
            inc     dword ptr [eax]
            ret
    _D4test3incFKkZv ENDP
Feb 24 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

>I think you're right, strange as it seems.  I wonder why ref is never inlined.<

Do you want something like a forced_inline attribute in D? :-)

Bye,
bearophile
Feb 24 2009
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> dsimcha:
>> I think you're right, strange as it seems.  I wonder why ref is never inlined.
>
> Do you want something like a forced_inline attribute in D? :-)
>
> Bye,
> bearophile

No, actually, I like the idea of leaving these micro-optimizations to the compiler. It's just that I can't figure out what's special about functions that take ref parameters. Maybe there is a good reason for this behavior, but if there is, I can't think of it.

Also, if you really, really, _really_ want to force a function to be inlined, you can probably simulate this with templates or mixins or something. IMHO wanting to absolutely insist that something be inlined is too much of an edge case to have pretty syntax and special language constructs for.
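As a rough illustration of the mixin route, the sketch below pastes the body of an increment directly at the use site with a string mixin, so no call to a separate function is emitted. The incCode helper and the run function are hypothetical names for this example, and it is only one of several ways to simulate forced inlining.

```d
// Hypothetical helper: returns D source text that increments the named variable.
// It is evaluated at compile time when used inside mixin(...).
string incCode(string name)
{
    return name ~ "++;";
}

uint run()
{
    uint foo = 0;
    // Each mixin expands to "foo++;" in place, so there is nothing left to inline.
    mixin(incCode("foo"));
    mixin(incCode("foo"));
    return foo;
}

void main()
{
    assert(run() == 2);
}
```

The cost is readability: the "function body" lives in a string, which is why it won't lead to pretty code.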
Feb 24 2009
prev sibling parent grauzone <none example.net> writes:
Both LDC and GDC inline the function. (LDC actually reduces your code to 
nothing, so I had to change it a bit to see if the call was really 
inlined.)
Feb 24 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Tue, 24 Feb 2009 14:08:26 -0500, Mason Green wrote:

>> Have you profiled your D code? What has the profiling told you? Have you seen
>> where you allocate memory, to move such allocations away from inner loops, or
>> just reduce their number?
>
> No, I have not profiled the D code other than using an FPS counter... :-) To be honest, I'm fairly light on experience when it comes to profiling. Do you have any suggestions on how to make it happen?

The available material seems lacking, so I've started a series of posts on profiling. Here's the first one:

http://snakecoder.wordpress.com/2009/02/26/profiling-with-dmd-on-windows/

I already have some material for the second one, profiling Blaze. ;-)
Feb 26 2009
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sergey Gromov wrote:
> http://snakecoder.wordpress.com/2009/02/26/profiling-with-dmd-on-windows/
>
> I already have some material for the second one, profiling Blaze.  ;-)

http://www.reddit.com/r/d_language/comments/80lpm/profiling_with_digital_mars_d_compiler_on_windows/
Feb 26 2009
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 26 Feb 2009 14:43:11 -0800, Walter Bright wrote:

> Sergey Gromov wrote:
>> http://snakecoder.wordpress.com/2009/02/26/profiling-with-dmd-on-windows/
>>
>> I already have some material for the second one, profiling Blaze.  ;-)
>
> http://www.reddit.com/r/d_language/comments/80lpm/profiling_with_digital_mars_d_compiler_on_windows/

Heh, thanks! I hope my opus is really worth mentioning.
Feb 26 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sergey Gromov wrote:
> Thu, 26 Feb 2009 14:43:11 -0800, Walter Bright wrote:
>
>> Sergey Gromov wrote:
>>> http://snakecoder.wordpress.com/2009/02/26/profiling-with-dmd-on-windows/
>>>
>>> I already have some material for the second one, profiling Blaze.  ;-)
>>
>> http://www.reddit.com/r/d_language/comments/80lpm/profiling_with_digital_mars_d_compiler_on_windows/
>
> Heh, thanks! I hope my opus is really worth mentioning.

I think it is.
Feb 26 2009
parent TomD <t_demmer nospam.web.de> writes:
Walter Bright Wrote:

> Sergey Gromov wrote:
> [...]
>> Heh, thanks!  I hope my opus is really worth mentioning.
>
> I think it is.

Shouldn't things like these be included under the "Tech Tips" on digitalmars.com or so?

Ciao,
TomD
Feb 27 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 26 Feb 2009 19:42:20 +0300, Sergey Gromov wrote:

> Tue, 24 Feb 2009 14:08:26 -0500, Mason Green wrote:
>
>>> Have you profiled your D code? What has the profiling told you? Have you seen
>>> where you allocate memory, to move such allocations away from inner loops, or
>>> just reduce their number?
>>
>> No, I have not profiled the D code other than using an FPS counter... :-) To be honest, I'm fairly light on experience when it comes to profiling. Do you have any suggestions on how to make it happen?
>
> The material seems lacking so I've started a series of posts on profiling. Here's the first one:
>
> http://snakecoder.wordpress.com/2009/02/26/profiling-with-dmd-on-windows/
>
> I already have some material for the second one, profiling Blaze. ;-)

And the second post:

http://snakecoder.wordpress.com/2009/03/02/profiling-with-dmd-on-windows-getting-hands-dirty/

This one is more practical.
Mar 01 2009
parent reply Mason Green <mason.green gmail.com> writes:
Excellent, I've implemented your optimizations and left a more detailed comment
on the blog.  I've also made a number of optimizations to the physics engine
over the weekend, and the performance increase is phenomenal!

http://svn.dsource.org/projects/blaze/downloads/blazeDemos.zip

Much appreciated!!!!

Sergey Gromov Wrote:

> And the second post:
>
> http://snakecoder.wordpress.com/2009/03/02/profiling-with-dmd-on-windows-getting-hands-dirty/
>
> This one is more practical.
Mar 02 2009
parent Sergey Gromov <snake.scaly gmail.com> writes:
Mon, 02 Mar 2009 07:12:41 -0500, Mason Green wrote:

> Excellent, I've implemented your optimizations and left a more
> detailed comment on the blog.  I've also made a number of
> optimizations to the physics engine over the weekend, and the
> performance increase is phenomenal!
>
> http://svn.dsource.org/projects/blaze/downloads/blazeDemos.zip
>
> Much appreciated!!!!

You're welcome! I've checked out trunk rev. 423, and it's much faster now. Good job!
Mar 02 2009
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Mason Green (Zzzzrrr) wrote:
> When I remove -inline there doesn't seem to
> be much of a difference in execution speed.

Try running obj2asm to see if the functions you want inlined are actually inlined or not.
Feb 24 2009
parent reply Tomas Lindquist Olsen <tomas.l.olsen gmail.com> writes:
On Wed, Feb 25, 2009 at 8:42 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
> Mason Green (Zzzzrrr) wrote:
>> When I remove -inline there doesn't seem to
>> be much of a difference in execution speed.
>
> Try running obj2asm to see if the functions you want inlined are actually inlined or not.

Perhaps a verbose mode could be added to dmd that prints the pretty-printed declaration when a function is inlined. Then it would be a simple grep to make sure:

    dmd -vi foo.d | grep 'foo\.inc'

Telling people to inspect the obj2asm output seems to be popular, but it's hardly user-friendly.
Feb 25 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Tomas Lindquist Olsen wrote:
> Perhaps a verbose mode could be added to dmd that prints the pretty-
> printed declaration when a function is inlined. Then it would be a
> simple grep to make sure:
>
>     dmd -vi foo.d | grep 'foo\.inc'
>
> Telling people to inspect the obj2asm output seems to be popular, but
> it's hardly user-friendly.

I know, but it isn't that hard, either, even if you don't know assembler. If the "call" isn't there, it likely got inlined.

Also, if you are trying to optimize the code by trying various tweaks at the statement level, it's much like shooting skeet blindfolded if you don't look at the asm output. It's time-consuming and unlikely to be successful.
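To make the "look for the call" check concrete, here is a small sketch that puts a by-value helper next to a ref helper so the two can be compared in one listing. The file name test.d and the function names are assumptions for the example, and what actually gets inlined depends on the compiler and version.

```d
// Build and disassemble (file name is an assumption):
//   dmd -c -O -inline -release test.d
//   obj2asm test.obj        (test.o on Linux)
// Then search the listing of caller() for "call" instructions.

uint incByValue(uint num)
{
    return num + 1;
}

void incByRef(ref uint num)
{
    num++;
}

uint caller()
{
    uint a = incByValue(41);  // small by-value function: a candidate for inlining
    incByRef(a);              // ref parameter: this thread reports DMD 1.x leaving a call here
    return a;
}

void main()
{
    assert(caller() == 43);
}
```

If a call to the mangled name of incByRef shows up in caller() but none for incByValue, that matches the behavior reported earlier in the thread.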
Feb 25 2009
next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Feb 25, 2009 at 3:26 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
> Also, if you are trying to optimize the code by trying various tweaks at the
> statement level, it's much like shooting skeet blindfolded if you don't look
> at the asm output. It's time consuming and unlikely to be successful.

In this case it's not entirely helpful that DMD's inlining rules are completely opaque. Do you have a list of what DMD will and won't inline, and the justification for each? If not, could you make one?
Feb 25 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Jarrett Billingsley wrote:
> In this case it's not entirely helpful that DMD's inlining rules are
> completely opaque.  Do you have a list of what DMD will and won't
> inline, and their justifications?  If not, could you make one?

In the immortal words of Oggie-Ben-Doggie, "use the source, Luke". In this case, the source is FuncDeclaration::canInline() in /dmd/src/dmd/inline.c.

Yes, I know, but it's all there is at the moment.
Feb 25 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Feb 25, 2009 at 9:09 AM, Jarrett Billingsley
<jarrett.billingsley gmail.com> wrote:
> On Wed, Feb 25, 2009 at 3:26 AM, Walter Bright
> <newshound1 digitalmars.com> wrote:
>> Also, if you are trying to optimize the code by trying various tweaks at the
>> statement level, it's much like shooting skeet blindfolded if you don't look
>> at the asm output. It's time consuming and unlikely to be successful.
>
> In this case it's not entirely helpful that DMD's inlining rules are completely opaque. Do you have a list of what DMD will and won't inline, and their justifications? If not, could you make one?

Also, looking at the DMD frontend source is *not* an acceptable option.
Feb 25 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jarrett Billingsley wrote:
> Also, looking at the DMD frontend source is *not* an acceptable option.

I knew you'd say that <g>.

On the other hand, inlining or not is, like register allocation and any other optimization, highly implementation-dependent. If you're going to micro-optimize at that level, it really is worthwhile to get familiar with obj2asm and the relevant compiler source code. It'll save you much time in the long run, and will pay off in being able to write consistently faster code.

Or, you could sign up for http://www.astoriaseminar.com/compiler-construction.html <g>.
Feb 25 2009
parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Feb 25, 2009 at 8:59 PM, Walter Bright
<newshound1 digitalmars.com> wrote:
> Jarrett Billingsley wrote:
>> Also, looking at the DMD frontend source is *not* an acceptable option.
>
> I knew you'd say that <g>.

I knew you'd suggest it ;)

> On the other hand, inlining or not is, like register allocation and any
> other optimizations, highly implementation dependent. If you're going to
> micro-optimize at that level, it really is worthwhile to get familiar with
> obj2asm and the relevant compiler source code.

True. However, defining what the compiler does in these optimizations is not just in the interest of performance, but also in the interest of correctness and other implementations. If everyone can see what DMD is and isn't inlining, they can ask "why" or "why not"; they can correct you if you make a mistake; they can suggest optimizations you might not have thought of; and they can see optimizations that fall out as a consequence of the language that they might not have considered when making their own compiler.

Furthermore, things like NRVO either need to be specified in the language or specified in the ABI. You told me before that static opCall for structs is just as efficient as constructors because of NRVO; I didn't and still don't buy it for exactly the reasons you just now gave: optimizations are highly implementation-dependent. It's this kind of stuff that needs to be specified: is NRVO required, or just _really really nice to have_? Insert many other optimizations here.
Feb 25 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Jarrett Billingsley wrote:
> True.  However defining what the compiler does in these optimizations
> is not just in the interest of performance, but also in the interest
> of correctness and other implementations.

Optimization should have nothing to do with correctness.

> If everyone can see what
> DMD is and isn't inlining, they can ask "why" or "why not"; they can
> correct you if you make a mistake; they can suggest optimizations you
> might not have thought of; and they can see optimizations that fall
> out as a consequence of the language that they might not have
> considered when making their own compiler.

If they're working at that level, why avoid looking at the compiler source? Optimization suggestions from someone who knows how compilers work are much more likely to be viable.

> Furthermore things like NRVO either need to be specified in the
> language or specified in the ABI.  You told me before that static
> opCall for structs is just as efficient as constructors because of
> NRVO; I didn't and still don't buy it for exactly the reasons you just
> now gave: optimizations are highly implementation-dependent.  It's
> this kind of stuff that needs to be specified: is NRVO required, or
> just _really really nice to have_?  Insert many other optimizations
> here.

If an optimization is required, then yes, it needs to go in the spec. But inlining is not required.

Let me put it another way. There are *thousands* of optimizations the compiler does, and they often have some very complex interactions. Even enumerating them all would be an enormous time sink. There's nothing particularly special about inlining as opposed to constant folding, dead code elimination, register allocation, instruction scheduling, strength reduction, etc., etc. Even if I wrote such a tome, it would be a waste of time to read it.

The easiest, quickest way to see if an optimization happened is to look at the obj2asm output. Remember the thread a while back about how dmd did a terrible job generating arithmetic code? A quick check with obj2asm showed that the speed problem had nothing to do with the code generation; it was all sucked up by a library module (since fixed).
Feb 25 2009