
digitalmars.D.ldc - Looking for more library optimization patterns

reply "Kai Nacke" <kai redstar.de> writes:
Hi all!

LDC includes the SimplifyDRuntimeCalls pass which should replace 
D library calls with more efficient ones. This is a clone of a 
former LLVM pass which did the same for C runtime calls.

An example is that printf("foo\n") is replaced by puts("foo").

Do you know of similar patterns in D which are worth implementing?
E.g. replacing std.stdio.writefln("foo") with 
std.stdio.writeln("foo").
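
For concreteness, a minimal sketch of that rewrite (both calls are real 
std.stdio functions; the wrapper functions are only illustrative):

import std.stdio;

void before() {
    // The format string has no format specifiers, so all the
    // formatting machinery in writefln is pure overhead here.
    writefln("foo");
}

void after() {
    // Same output, no format-string parsing.
    writeln("foo");
}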

Regards,
Kai
Jan 29 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Kai Nacke:

 LDC includes the SimplifyDRuntimeCalls pass which should 
 replace D library calls with more efficient ones. This is a 
 clone of a former LLVM pass which did the same for C runtime 
 calls.

 An example is that printf("foo\n") is replaced by puts("foo").

 Do you know of similar patterns in D which are worth 
 implementing?
 E.g. replacing std.stdio.writefln("foo") with 
 std.stdio.writeln("foo")
A de-optimization I'd really like in LDC (and dmd) is to not call the 
run-time functions when you perform a vector operation on arrays 
statically known to be very short:

void main() {
    int[3] a, b;
    a[] += b[];
}

Just rewrite that as a plain for loop (as is done for cases where the 
run-time function is not available), and let ldc2 compile it on its 
own. See this thread in D.learn:
http://forum.dlang.org/thread/lqmqsnucadaqlkxkoffc forum.dlang.org

I'd like to replace code like:

size_t i = 0;
foreach (immutable j, const ref bj; bodies) {
    foreach (const ref bk; bodies[j + 1 .. $]) {
        foreach (immutable m; TypeTuple!(0, 1, 2))
            r[i][m] = bj.x[m] - bk.x[m];
        i++;
    }
}

With:

size_t i = 0;
foreach (immutable j, const ref bj; bodies)
    foreach (const ref bk; bodies[j + 1 .. $])
        r[i++][] = bj.x[] - bk.x[];

And keep the same performance, instead of seeing a significant 
slowdown.

Bye,
bearophile
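
For concreteness, a minimal hand-written sketch of the lowering being 
asked for here (what the front end could emit instead of the druntime 
array-op call; not what the compilers currently do):

void main() {
    int[3] a, b;
    // Equivalent scalar loop; with the length known at compile
    // time the optimizer is free to unroll or vectorize it.
    foreach (immutable i; 0 .. a.length)
        a[i] += b[i];
}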
Jan 30 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Being somewhat involved in the aforementioned thread, I second that. 
There seems to be no logical reason to insert a call to 
dSliceOpAssignOrWhateverItIsCalled when all the data is actually 
there. At least in -release -O3 builds :)

Granted, this may be tricky for arbitrarily-typed arrays, but for 
builtins with statically known array sizes? That's a definite 
win.
Jan 30 2014
parent reply "Kai Nacke" <kai redstar.de> writes:
Thanks! Those are good hints!
I am not sure if I can implement them soon, but I'll put them on my 
list for the pass.

If you have more of those hints: please post!

Regards,
Kai
Feb 10 2014
next sibling parent "Andrea Fontana" <nospam example.com> writes:
On Monday, 10 February 2014 at 19:51:33 UTC, Kai Nacke wrote:
 Thanks! Those are good hints!
 I am not sure if I can implement them soon, but I'll put them on 
 my list for the pass.

 If you have more of those hints: please post!

 Regards,
 Kai
I don't know if it applies here, but what about using appender instead 
of concatenation (for example, for concatenations inside loops)?

Andrea
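
A minimal sketch of the pattern being suggested (std.array.appender is 
the standard Phobos facility; the loop body is only illustrative):

import std.array : appender;

string joinLines(in string[] lines) {
    // result ~= line in a loop may reallocate and copy repeatedly;
    // an Appender manages its own capacity and amortizes that cost.
    auto buf = appender!string();
    foreach (line; lines) {
        buf ~= line;
        buf ~= "\n";
    }
    return buf.data;
}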
Feb 11 2014
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Kai Nacke:

 If you have more of those hints: please post!
There are plenty of them to find. But I think you need a strategy like 
this one to find them:
http://en.wikipedia.org/wiki/Brainstorming

D conferences should be good for this purpose.

Bye,
bearophile
Feb 11 2014
prev sibling next sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
 A de-optimization I'd really like in LDC (and dmd) is to not 
 call the run-time functions when you perform a vector 
 operation on arrays statically known to be very short:

 void main() {
     int[3] a, b;
     a[] += b[];
 }

 And just rewrite that with a for loop (as done for cases where 
 the run time function is not available), and let ldc2 compile 
 it on its own.
For short loops, an unrolled version like:

a[0] += b[0];
a[1] += b[1];
a[2] += b[2];

may well be faster than a simple loop like the following one:

foreach (immutable i; 0 .. 3) {
    a[i] += b[i];
}

At least on x86/x86-64. Will that optimization happen too, at a later 
stage?

Ivan Kazmenko.
Feb 15 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Ivan Kazmenko:

 For short loops, an unrolled version like
     a[0] += b[0];
     a[1] += b[1];
     a[2] += b[2];
 may well be faster than a simple loop as the following one:
     foreach (immutable i; 0..3) {
         a[i] += b[i];
     }
 At least on x86/64.
Yes, but ldc is plenty able to unroll small loops with length known 
at compile time.

Bye,
bearophile
Feb 15 2014
parent "Temtaime" <temtaime gmail.com> writes:
What about SSE?

Can

float[16] a, b;
a[] += b[];

be optimized to use only four addps instructions?
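
Not an answer to what the compilers emit today, but a minimal 
hand-written sketch of the four-packed-additions shape being asked 
about, assuming a target with SSE and a compiler that supports 
core.simd vector arithmetic (LDC and GDC do; DMD on x86_64):

import core.simd;

void main() {
    // Express the 16 floats as four 4-lane vectors up front;
    // each += below can then map to a single packed add (addps).
    float4[4] a, b;
    foreach (i; 0 .. 4)
        a[i] += b[i];
}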
Apr 18 2014
prev sibling parent "safety0ff" <safety0ff.dev gmail.com> writes:
On Thursday, 30 January 2014 at 18:36:09 UTC, bearophile wrote:
 A de-optimization I'd really like in LDC (and dmd) is to not 
 call the run-time functions when you perform a vector 
 operation on arrays statically known to be very short:
I just learned about this issue while looking at
http://rosettacode.org/wiki/Hamming_numbers#D
(Alternative version 2, opEquals).

It makes me sad to learn that slice/array operations get turned into 
monstrosities by the compiler. There are some huge wins to be had in 
this area.
Jul 30 2014