www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Two optimizations

reply bearophile <bearophileHUGS lycos.com> writes:
High-level optimizations have to be done mostly by the front-end because the
back-end usually doesn't know about the high-level constructs of D.

So this is the right place to ask for some of such optimizations, while asking
for them on the LLVM or LDC IRC channels is not the best thing to do (even if
LDC developers usually try to do what I ask them, they are very gentle).

---------------------

Recently here I've shown a link to some of the optimizations done by the Scala
compiler (that the JavaVM back-end isn't able to perform), but that post was
ignored. The most important thing it does is to inline most closures of a Scala
program. Such optimization is important for functional-style programming,
because (like tail-call optimization) if present allows the programmer to use
certain idioms with more freedom (like passing delegates to higher order
functions, etc), essentially allowing a different programming style. Probably
this optimization isn't easy to implement, it requires some code. Is someone
interested in adding this to D?

---------------------

This comes from successive simplification of some code of mine, it's a loop
with a kernel over an array, a common operation in my kind of code:

version(Tango) import tango.stdc.stdio: printf;
void main() {
    auto a = new int[30];
    for (int x = 2; x < a.length-2; x++)
        foreach (s; [-2, -1, 0, 1, 2])
            a[x] += a[x + s];
    printf("%d\n", a[5]);
}

This is the asm produced by LDC of the inner foreach loop:

.LBB1_1:
	movl	$4294967294, 132(%esp)
	movl	$4294967295, 136(%esp)
	movl	$0, 140(%esp)
	movl	$1, 144(%esp)
	movl	$2, 148(%esp)
	movl	20(%esp,%eax,4), %ecx
	addl	12(%esp,%eax,4), %ecx
	movl	%ecx, 20(%esp,%eax,4)
	addl	16(%esp,%eax,4), %ecx
	movl	%ecx, 20(%esp,%eax,4)
	addl	%ecx, %ecx
	movl	%ecx, 20(%esp,%eax,4)
	addl	24(%esp,%eax,4), %ecx
	movl	%ecx, 20(%esp,%eax,4)
	addl	28(%esp,%eax,4), %ecx
	movl	%ecx, 20(%esp,%eax,4)
	incl	%eax
	cmpl	$26, %eax


The [-2, -1, 0, 1, 2] array is immutable (and the variable 's' of the foreach
isn't by ref), so can't the following initializations of such array moved out
of the inner loop?

	movl	$4294967294, 132(%esp)
	movl	$4294967295, 136(%esp)
	movl	$0, 140(%esp)
	movl	$1, 144(%esp)
	movl	$2, 148(%esp)


This is a variant of the same code:

version(Tango) import tango.stdc.stdio: printf;
template Tuple(T...) { alias T Tuple; }
alias Tuple!(-2, -1, 0, 1, 2) move;
void main() {
    auto a = new int[30];
    for (int x = 2; x < a.length-2; x++)
        foreach (s; move)
            a[x] += a[x + s];
    printf("%d\n", a[5]);
}

The relative asm is much better:

main:
	pushl	%edi
	subl	$128, %esp
	xorl	%eax, %eax
	movl	$30, %ecx
	leal	8(%esp), %edi
	rep;stosl
	xorl	%eax, %eax
	.align	16
.LBB1_1:
	movl	16(%esp,%eax,4), %ecx
	addl	8(%esp,%eax,4), %ecx
	movl	%ecx, 16(%esp,%eax,4)
	addl	12(%esp,%eax,4), %ecx
	addl	%ecx, %ecx
	movl	%ecx, 16(%esp,%eax,4)
	addl	20(%esp,%eax,4), %ecx
	movl	%ecx, 16(%esp,%eax,4)
	addl	24(%esp,%eax,4), %ecx
	movl	%ecx, 16(%esp,%eax,4)
	incl	%eax
	cmpl	$26, %eax
	jne	.LBB1_1
	movl	28(%esp), %eax
	movl	%eax, 4(%esp)
	movl	$.str, (%esp)
	call	printf
	xorl	%eax, %eax
	addl	$128, %esp
	popl	%edi
	ret	$8




This is a less reduced version of the code, that I'd like the D (LDC) compiler
to optimize very well:

version(Tango) import tango.stdc.stdio: printf;

struct P {
    int x, y;
    P opAdd(P o) { return P(x+o.x, y+o.y); }
}

struct Rect {
    int lx, ly;
    int opIn_r(P p) {
        return p.x >= 0 && p.x < lx && p.y >= 0 && p.y < ly;
    }
}

void main() {
    const int SIZE = 20;
    auto m = new int[][](SIZE, SIZE);
    auto p = P(10, 10);
    // there's another loop for the rows here
    for (int i; i < SIZE; i++)
        foreach (shift;
[P(-2,-1),P(-1,-2),P(1,-2),P(2,-1),P(2,1),P(1,2),P(-1,2),P(-2,1)])
            if (shift + p in Rect(SIZE, SIZE))
                printf("OK\n");
}


Eventually the compiler can produce asm similar to (well, this uses an uint to
avoid testing >= 0, I don't think the LDC compiler will soon learn this trick
too):

template Tuple(T...) { alias T Tuple; }
alias Tuple!(-1,-2,-2,-1,+1,+2,+2,+1) movex;
alias Tuple!(-2,-1,+1,+2,+2,+1,-1,-2) movey;
foreach (uint i, sx; movex)
    if (x + sx < SIZE && y + movey[i] < SIZE)
        printf("OK\n");


If you need more info please ask.

Bye,
bearophile
Jul 27 2009
next sibling parent davidl <davidl nospam.org> writes:
在 Tue, 28 Jul 2009 04:51:42 +0800,bearophile <bearophileHUGS lycos.com>  
写道:

 High-level optimizations have to be done mostly by the front-end because  
 the back-end usually doesn't know about the high-level constructs of D.

 So this is the right place to ask for some of such optimizations, while  
 asking for them on the LLVM or LDC IRC channels is not the best thing to  
 do (even if LDC developers usually try to do what I ask them, they are  
 very gentle).

 ---------------------

 Recently here I've shown a link to some of the optimizations done by the  
 Scala compiler (that the JavaVM back-end isn't able to perform), but  
 that post was ignored. The most important thing it does is to inline  
 most closures of a Scala program. Such optimization is important for  
 functional-style programming, because (like tail-call optimization) if  
 present allows the programmer to use certain idioms with more freedom  
 (like passing delegates to higher order functions, etc), essentially  
 allowing a different programming style. Probably this optimization isn't  
 easy to implement, it requires some code. Is someone interested in  
 adding this to D?

 ---------------------

 This comes from successive simplification of some code of mine, it's a  
 loop with a kernel over an array, a common operation in my kind of code:

 version(Tango) import tango.stdc.stdio: printf;
 void main() {
     auto a = new int[30];
     for (int x = 2; x < a.length-2; x++)
         foreach (s; [-2, -1, 0, 1, 2])
             a[x] += a[x + s];
     printf("%d\n", a[5]);
 }

 This is the asm produced by LDC of the inner foreach loop:

 .LBB1_1:
 	movl	$4294967294, 132(%esp)
 	movl	$4294967295, 136(%esp)
 	movl	$0, 140(%esp)
 	movl	$1, 144(%esp)
 	movl	$2, 148(%esp)
 	movl	20(%esp,%eax,4), %ecx
 	addl	12(%esp,%eax,4), %ecx
 	movl	%ecx, 20(%esp,%eax,4)
 	addl	16(%esp,%eax,4), %ecx
 	movl	%ecx, 20(%esp,%eax,4)
 	addl	%ecx, %ecx
 	movl	%ecx, 20(%esp,%eax,4)
 	addl	24(%esp,%eax,4), %ecx
 	movl	%ecx, 20(%esp,%eax,4)
 	addl	28(%esp,%eax,4), %ecx
 	movl	%ecx, 20(%esp,%eax,4)
 	incl	%eax
 	cmpl	$26, %eax


 The [-2, -1, 0, 1, 2] array is immutable (and the variable 's' of the  
 foreach isn't by ref), so can't the following initializations of such  
 array moved out of the inner loop?

 	movl	$4294967294, 132(%esp)
 	movl	$4294967295, 136(%esp)
 	movl	$0, 140(%esp)
 	movl	$1, 144(%esp)
 	movl	$2, 148(%esp)


 This is a variant of the same code:

 version(Tango) import tango.stdc.stdio: printf;
 template Tuple(T...) { alias T Tuple; }
 alias Tuple!(-2, -1, 0, 1, 2) move;
 void main() {
     auto a = new int[30];
     for (int x = 2; x < a.length-2; x++)
         foreach (s; move)
             a[x] += a[x + s];
     printf("%d\n", a[5]);
 }

 The relative asm is much better:

 main:
 	pushl	%edi
 	subl	$128, %esp
 	xorl	%eax, %eax
 	movl	$30, %ecx
 	leal	8(%esp), %edi
 	rep;stosl
 	xorl	%eax, %eax
 	.align	16
 .LBB1_1:
 	movl	16(%esp,%eax,4), %ecx
 	addl	8(%esp,%eax,4), %ecx
 	movl	%ecx, 16(%esp,%eax,4)
 	addl	12(%esp,%eax,4), %ecx
 	addl	%ecx, %ecx
 	movl	%ecx, 16(%esp,%eax,4)
 	addl	20(%esp,%eax,4), %ecx
 	movl	%ecx, 16(%esp,%eax,4)
 	addl	24(%esp,%eax,4), %ecx
 	movl	%ecx, 16(%esp,%eax,4)
 	incl	%eax
 	cmpl	$26, %eax
 	jne	.LBB1_1
 	movl	28(%esp), %eax
 	movl	%eax, 4(%esp)
 	movl	$.str, (%esp)
 	call	printf
 	xorl	%eax, %eax
 	addl	$128, %esp
 	popl	%edi
 	ret	$8




 This is a less reduced version of the code, that I'd like the D (LDC)  
 compiler to optimize very well:

 version(Tango) import tango.stdc.stdio: printf;

 struct P {
     int x, y;
     P opAdd(P o) { return P(x+o.x, y+o.y); }
 }

 struct Rect {
     int lx, ly;
     int opIn_r(P p) {
         return p.x >= 0 && p.x < lx && p.y >= 0 && p.y < ly;
     }
 }

 void main() {
     const int SIZE = 20;
     auto m = new int[][](SIZE, SIZE);
     auto p = P(10, 10);
     // there's another loop for the rows here
     for (int i; i < SIZE; i++)
         foreach (shift;  
 [P(-2,-1),P(-1,-2),P(1,-2),P(2,-1),P(2,1),P(1,2),P(-1,2),P(-2,1)])
             if (shift + p in Rect(SIZE, SIZE))
                 printf("OK\n");
 }


 Eventually the compiler can produce asm similar to (well, this uses an  
 uint to avoid testing >= 0, I don't think the LDC compiler will soon  
 learn this trick too):

 template Tuple(T...) { alias T Tuple; }
 alias Tuple!(-1,-2,-2,-1,+1,+2,+2,+1) movex;
 alias Tuple!(-2,-1,+1,+2,+2,+1,-1,-2) movey;
 foreach (uint i, sx; movex)
     if (x + sx < SIZE && y + movey[i] < SIZE)
         printf("OK\n");


 If you need more info please ask.

 Bye,
 bearophile
It's better to be in bugzilla. I think this is not at the highest priority right now. Simple implementation idea is checking everything in a loop to see if some could be compile-time constants. Then rewrite the loop when further optimization flags are supplied to the compiler. This could mess up the LoC information for debugging. -- 使用 Opera 革命性的电子邮件客户程序: http://www.opera.com/mail/
Jul 27 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Mon, Jul 27, 2009 at 4:51 PM, bearophile<bearophileHUGS lycos.com> wrote:
 High-level optimizations have to be done mostly by the front-end because the
back-end usually doesn't know about the high-level constructs of D.

 So this is the right place to ask for some of such optimizations, while asking
for them on the LLVM or LDC IRC channels is not the best thing to do (even if
LDC developers usually try to do what I ask them, they are very gentle).
No, *bugzilla* and *the LDC ticket tracker* are the right place to ask for such optimizations. Here, I'll even give you some links. http://d.puremagic.com/issues/ http://dsource.org/projects/ldc/newticket If you continue to refuse to report bugs and requests through the proper channels, you have NO RIGHT to complain about them not being fixed. You also run the risk of running afoul of the good graces of the developers. Please, just do like everyone else and file tickets. It's not that hard, and you have given *no* reason why you don't do so.
Jul 27 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Jarrett Billingsley:
 No, *bugzilla* and *the LDC ticket tracker* are the right place to ask
 for such optimizations.
LDC isn't the right place for such optimizations because: - They are not easy, especially the first one, so Walter may be fitter for such work; - Such optimizations require brain and maybe even planning. So they have to be discussed first. A forum is a fine place for this; - If implemented correctly, the first optimization changes the running time enough to encourage a different programming style (for example Phobos2 can be rewritten a lot, using just delegates instead of strings). So it's something language-wide and not just something that speeds up code a bit; - They are general things, useful regardless the backend (and maybe a backend can't perform such operations because it doesn't know about such higher level things of D), so you want such things not just in LDC, but in DMD too, and in future D compiler too. So implementing it for LDC is bad.
 You also run the risk of running afoul of the good graces of the developers.<
Already done since few months, I fear. When people don't like me anymore, I'll leave. I'm not interested in a career in trolling. Bye, bearophile
Jul 28 2009
parent reply language_fan <foo bar.com.invalid> writes:
Tue, 28 Jul 2009 04:53:02 -0400, bearophile thusly wrote:

 Such optimizations require brain and maybe even planning. So they have
 to be discussed first.
The discussion or collective community opinion won't help a bit. Walter is the only person who decides what goes into the spec and what doesn't.
 If
 implemented correctly, the first optimization changes the running time
 enough to encourage a different programming style (for example Phobos2
 can be rewritten a lot, using just delegates instead of strings).
Functions that take functions as parameters can be dangerous for readability since they allow adding custom control structures. I think the idea in D has been to provide all possible control structures as built-in features. For example foreach, foreach_reverse, soon we probably get foreach_with_filter, foreach_parallel_for_2cores, foreach_parallel_for_4cores etc. The advantage of built-in features is that they remain compatible everywhere. A third party library writer (e.g. tango developpers) may create a tango_foreach_for_2cores and phobos group soon creates phobos_foreach_for_2cores. And some game library designer might create gamename_foreach_for_2cores. It's dangerous to give too much power to the language users. And this is what happens if you start supporting these 'high order functions' with good optimizations. Cause it's possible to optimize them too well with the knowledge we have today of programming languages.
 Already done since few months, I fear. When people don't like me
 anymore, I'll leave. I'm not interested in a career in trolling.
The D developers are supposed to solve practical issues, not take part in latest academic discussion.
 
 Bye,
 bearophile
Jul 28 2009
parent Don <nospam nospam.com> writes:
language_fan wrote:
 Tue, 28 Jul 2009 04:53:02 -0400, bearophile thusly wrote:
 
 Such optimizations require brain and maybe even planning. So they have
 to be discussed first.
The discussion or collective community opinion won't help a bit. Walter is the only person who decides what goes into the spec and what doesn't.
 The D developers are supposed to solve practical issues, not take part in 
 latest academic discussion.
The situations bearophile posted looked pretty practical to me. They're examples of where immutability can be used in optimisation. They belong in bugzilla. Tag them with the 'performance' keyword. Unfortunately, at the present time there are so many bugs in bugzilla that optimisation doesn't get much attention. (One of my performance patches will be in the next DMD, but it's pretty rare event -- they only happen as light relief from difficult and boring bug fixes <g>). Bearophile, PLEASE put them in bugzilla. They'll get lost otherwise, which is a shame since they're good observations.
Jul 28 2009