## digitalmars.D - Expression templates

- Don Clugston <dac nospam.com.au> Dec 12 2006
- Sean Kelly <sean f4.ca> Dec 12 2006
- David L. Davis <SpottedTiger yahoo.com> Dec 12 2006
- David L. Davis <SpottedTiger yahoo.com> Dec 12 2006
- Don Clugston <dac nospam.com.au> Dec 12 2006
- Bill Baxter <dnewsgroup billbaxter.com> Dec 13 2006
- Don Clugston <dac nospam.com.au> Dec 13 2006
- "Christian Kamm" <kamm nospam.de> Dec 14 2006
- Don Clugston <dac nospam.com.au> Dec 14 2006

The recent language improvements (especially opAssign) have made expression templates eminently feasible. Here's a proof-of-concept which implements parsing of the basic vector operations of addition and scalar multiplication. It includes constant folding of scalar multiplication, and does some ordering of the parameters. At compile time, it prints the expression in postfix notation, and at runtime it prints the calculation to be performed. Enjoy.

```d
import std.stdio;
import std.string;

// This would be a vector. For now, just store a name.
class DVec
{
    char[] data;
public:
    const bool isDVec = true; // workaround because is() doesn't work properly

    this(char[] name) { data = name; }

    void opAssign(A)(A expr)
    {
        pragma(msg, "In postfix: " ~ A.expressionString);
        // Do the actual work here. For now, just print a message
        // to prove we have access to the actual variables.
        printf((data ~ " = " ~ expr.getStr ~ \n).ptr);
    }

    Expr!('+', DVec, A) opAdd(A)(A w)
    {
        Expr!('+', DVec, A) q;
        q.left = this;
        q.right = w;
        return q;
    }

    Expr!('*', real, DVec) opMul(A)(A w)
    {
        static assert(is(A : real), "Can only multiply by scalars");
        Expr!('*', real, DVec) q;
        q.left = w;
        q.right = this;
        return q;
    }
}

/// For compile-time debugging
template ExprToText(X)
{
    static if (is(X : real))
        const char[] ExprToText = "scalar";
    else static if (is(X.isExpr))
        const char[] ExprToText = X.expressionString;
    else static if (is(X : DVec))
        const char[] ExprToText = "Vec";
}

// Stores all the data from the expression.
// 'dummy' is a workaround to avoid a 'recursive template' error message.
struct Expr(char operation, LeftExpr, RightExpr, int dummy = 0)
{
    typedef int isExpr; // workaround because is() doesn't work properly
    const char opType = operation;
    const char[] expressionString = ExprToText!(LeftExpr) ~ " "
        ~ ExprToText!(RightExpr) ~ " " ~ operation ~ " ";
    LeftExpr left;
    RightExpr right;

    Expr!('+', Expr!(operation, LeftExpr, RightExpr, 1), A) opAdd(A)(A w)
    {
        Expr!('+', Expr!(operation, LeftExpr, RightExpr, 1), A) q;
        q.left.left = this.left;
        q.left.right = this.right;
        q.right = w;
        return q;
    }

    static if (operation == '*' && is(LeftExpr : real))
    {
        // Constant fold scalars
        Expr!('*', real, RightExpr, 1) opMul(real w)
        {
            Expr!('*', real, RightExpr, 1) q;
            q.right = this.right;
            q.left = this.left * w;
            return q;
        }
    }
    else
    {
        Expr!('*', real, Expr!(operation, LeftExpr, RightExpr, 1)) opMul(A)(A w)
        {
            static assert(is(A : real), "Can only multiply by scalars");
            Expr!('*', real, Expr!(operation, LeftExpr, RightExpr, 1)) q;
            q.right.left = this.left;
            q.right.right = this.right;
            q.left = w;
            return q;
        }
    }

    // Just for debugging.
    char[] getStr()
    {
        char[] s;
        static if (is(left.isExpr))
            s = "(" ~ left.getStr() ~ ")";
        else static if (is(LeftExpr : DVec))
            s = left.data;
        else
            s ~= format("%g", left);
        s ~= operation;
        static if (is(right.isExpr))
            s ~= "(" ~ right.getStr() ~ ")";
        else static if (is(RightExpr : DVec))
            s ~= right.data;
        else
            s ~= format("%g", right);
        return s;
    }
}

void main()
{
    DVec a = new DVec("a");
    DVec b = new DVec("b");
    DVec c = new DVec("c");
    b = a + c;
    b = c * 2;
    a = 27.4 * a + 2 * ((b + c) + 3.2 * c) * 1.234 * 2;
}
```

Dec 12 2006

Don Clugston wrote:The recent language improvements (especially, opAssign) have made expression templates eminently feasible. Here's a proof-of-concept which implements parsing of the basic vector operations of addition and scalar multiplication.

Interesting. Yet another example of where D's reference semantics for classes dramatically simplify things. The code looks quite natural, and without the debugging code is exceedingly short. Nice work! Sean

Dec 12 2006

Don, thanks for sharing! But using dmd v0.177 on WinXP SP2 and compiling the proof-of-concept sample with the command line `dmd expression_templates.d`, I get a compile error for this line: "b = a + c;" (and for the two lines that follow it, if I comment out the previous line). Output:

```
C:\dmd>dmd expression_templates.d
expression_templates.d(123): Error: cannot implicitly convert expression (((a).opAdd)((c))) of type Expr!('+',DVec,DVec) to expression_templates.DVec
```

Is there something I'm not doing correctly? David L.

Dec 12 2006

Don, my bad! Oops! I was still on dmd v0.176... sorry about that. With dmd v0.177 this is the output:

```
C:\dmd>dmd expression_templates.d
In postfix: Vec Vec +
In postfix: scalar Vec *
In postfix: scalar Vec * scalar Vec Vec + scalar Vec * + * +
C:\dmd\bin\..\..\dm\bin\link.exe expression_templates,,,user32+kernel32/noi;

C:\dmd>expression_templates
b = a+c
b = 2*c
a = (27.4*a)+(4.936*((b+c)+(3.2*c)))
```

Thanks again for sharing this code. David L.

Dec 12 2006

Don Clugston wrote:The recent language improvements (especially, opAssign) have made expression templates eminently feasible. Here's a proof-of-concept which implements parsing of the basic vector operations of addition and scalar multiplication.

I've made a slightly improved version, which swaps the parameter ordering in order to hide the latency of `*` on an x87. Translation of the postfix notation into x87 instructions is straightforward:

- `scalar *` becomes `fmul real ptr [c];`
- `Vec +` becomes `fadd double ptr [v + 8*EDX];`
- `+` becomes `fadd ST(1), ST`
- `Vec` (without `+`) becomes `fld double ptr [v + 8*EDX];`

and at the end of the loop there's

```
fstp double ptr [r + 8*EDX];
inc EDX;
jnz start;
```

where c are the addresses of the scalars, v are the addresses of the vectors loaded into integer registers, and r is the address of the result vector. EDX is the counter variable. Using BCS's trick of creating a fake tuple to allow static foreach to operate as a simple counter, we could actually output this code. There would be some mucking around to identify the vectors at each stage, and we'd also run out of integer registers quite quickly, but basically this technique could generate x87 code that is extremely close to optimal. Output:

```
In postfix: Vec Vec +
In postfix: Vec scalar *
In postfix: Vec scalar * Vec Vec + + scalar * Vec scalar * Vec + +
b = c+a
b = c*2
a = (((c*3.2)+(c+b))*4.936)+((a*27.4)+c)
```

Dec 12 2006

Don Clugston wrote:Don Clugston wrote:The recent language improvements (especially, opAssign) have made expression templates eminently feasible. Here's a proof-of-concept which implements parsing of the basic vector operations of addition and scalar multiplication.

I've made a slightly improved version...

I'm making some headway on an nd-array class that uses BLAS/LAPACK under the hood. It would be nifty if I could somehow get this: R = A * x + y to be grokked and translated into a single call to one of BLAS's saxpy/daxpy/caxpy/zaxpy routines. With all the EAX talk, it looks like you're talking about a different level of optimization. The theory with BLAS, anyway, is that you'll have access to a hand-tuned version for the platform you're on, so it'll be hard to beat with any kind of register-level compile-time micromanagement. For instance, the BLAS level 3 (matrix-matrix) routines do blocking to make things easier on the cache for big inputs. And of course there are parallel BLAS implementations around, too. --bb

Dec 13 2006

Bill Baxter wrote (quoting Don Clugston):

I've made a slightly improved version...

I'm making some headway on an nd-array class that uses BLAS/LAPACK under the hood. It would be nifty if I could somehow get this: R = A * x + y to be grokked and translated into a single call to one of BLAS's saxpy/daxpy/caxpy/zaxpy routines. With all the EAX talk, it looks like you're talking about a different level of optimization. The theory with BLAS, anyway, is that you'll have access to a hand-tuned version for the platform you're on, so it'll be hard to beat with any kind of register-level compile-time micromanagement. For instance, the BLAS level 3 (matrix-matrix) routines do blocking to make things easier on the cache for big inputs. And of course there are parallel BLAS implementations around, too.

Quite true. But the code below spits out very-near-optimal asm for x87, with double[] vectors with +, -, and scalar multiplication. For example, the code generated for something like a = b + 3.5 * c - 7 *(d - e)*s; where a, b, c, d, e are vectors and s is a scalar, will (I think) leave BLAS for dead. Think of it as an extended BLAS1 kernel. However, it shouldn't be too difficult to generate calls to DAXPY and kin.

```d
/*
 * Supports vector addition, subtraction, and multiplication by a real scalar.
 * For functions of the form in BLAS1, the code is optimal for early Pentium
 * CPUs, or very nearly so. For example, the code generated for the crucial
 * DAXPY operation is almost identical to that described in Agner Fog's
 * optimisation manual (www.agner.org), and is faster than the unrolled
 * solution published by Intel.
 *
 * Future directions:
 * Actually generate the code (instead of just displaying it)! (tricky).
 * Support dot product (easy).
 * Support += and -= (with optimisation: destination register can be reused).
 * Support floats as well as doubles (moderate).
 * Support imaginary and complex numbers (tricky).
 * Implement loop unrolling for small loops (tricky).
 * Make a version that spits out D code (like Blitz++) (moderate).
 * Make an SSE2 version (extremely tricky).
 */
import std.stdio;  // only for debugging
import std.string; // only for debugging

// This would be a vector. For now, just store a name.
class DVec
{
    char[] data;
public:
    this(char[] name) { data = name; }

    DVec opAssign(A)(A expr)
    {
        pragma(msg, code!(A));
        expr.evaluate(); // Do the actual work here.
        return this;
    }

    Expr!('+', A, DVec) opAdd(A)(A w)
    {
        Expr!('+', A, DVec) q;
        q.right = this;
        q.left = w;
        return q;
    }

    Expr!('-', A, DVec) opSub(A)(A w)
    {
        Expr!('-', A, DVec) q;
        q.right = this;
        q.left = w;
        return q;
    }

    Expr!('*', real, DVec) opMul(A)(A w)
    {
        static assert(is(A : real), "Can only multiply by scalars");
        Expr!('*', real, DVec) q;
        q.right = this;
        q.left = w;
        return q;
    }
}

/// For compile-time debugging
template ExprToText(X, int numvecs = 0, int numscalars = 0)
{
    static if (is(X : real))
        const char[] ExprToText = "Scalar" ~ cast(char)('1' + numscalars);
    else static if (is(X.isExpr))
        const char[] ExprToText = "(" ~ X.ExprString!(numvecs, numscalars) ~ ")";
    else static if (is(X : DVec))
        const char[] ExprToText = "Vec" ~ cast(char)('1' + numvecs);
}

template VecName(int vecnum)
{
    const char[] VecName = "E" ~ cast(char)('A' + vecnum - 1) ~ "X";
}

template ScalarName(int num)
{
    const char[] ScalarName = "Scalar" ~ cast(char)('0' + num);
}

// Vectors 1..4 are stored in EAX, EBX, ECX, EDX.
// ESI is the counter variable.
// EDI is the destination.
// EBP is a scratch register.
// If there are more than 4 source vectors, we'll have to use EBP for the
// spill (not currently implemented).
template code(X)
{
    const char[] code = "-------- GENERATE CODE --------------"\n
        ~ "; " ~ X.ExprString!(0, 0) ~ \n
        ~ "; push used registers "\n
        ~ " mov EBP, vector1.length; "\n
        ~ ";load vectors into EAX, EBX,..."\n
        ~ " mov EAX, vector1.ptr;"\n
        ~ " ... mov EBX, vector2.ptr; ...etc"\n
        ~ " xor ESI, ESI;"\n
        ~ " lea EAX, [EAX + 8*EBP];"\n
        ~ " ... lea EBX, [EBX + 8*EBP]; ... etc"\n
        ~ " sub ESI, EBP; // counter = -length" \n
        ~ " mov EDI, destination;"\n
        ~ " lea EDI, [EDI + 8*EBP];"\n
        ~ " jz short L3; // test for length==0"\n
        ~ X.instr!(0, 0).s
        ~ " jmp short L2;\n"
        ~ "L1:\n"
        ~ X.instr!(0, 0).s
        ~ " fxch ST(1), ST; \t// previous result\n"
        ~ " fstp double ptr [EDI + 8*ESI-8];\n"
        ~ "L2:\n"
        ~ X.finalinstr!(0, 0).s
        ~ " inc ESI;\n jnz L1;\n"
        ~ " fstp double ptr [EDI + 8*ESI-8];\n"
        ~ "L3:" \n
        ~ "; pop used registers";
}

// Stores all the data from the expression.
// 'dummy' is a workaround to avoid a 'recursive template' error message.
struct Expr(char operation, LeftExpr, RightExpr, int dummy = 0)
{
    typedef int isExpr; // workaround because is() doesn't work properly
    const char opType = operation;
    LeftExpr left;
    RightExpr right;

    static if (operation == '*')
    {
        // On x87, the FMUL instruction has a long latency.
        // We arrange the order to give it a chance to finish before we use it again.
        Expr!('+', Expr!(operation, LeftExpr, RightExpr, 1), A) opAdd(A)(A w)
        {
            Expr!('+', Expr!(operation, LeftExpr, RightExpr, 1), A) q;
            q.left.left = this.left;
            q.left.right = this.right;
            q.right = w;
            return q;
        }

        Expr!('-', Expr!(operation, LeftExpr, RightExpr, 1), A) opSub(A)(A w)
        {
            Expr!('-', Expr!(operation, LeftExpr, RightExpr, 1), A) q;
            q.left.left = this.left;
            q.left.right = this.right;
            q.right = w;
            return q;
        }
    }
    else
    {
        Expr!('+', A, Expr!(operation, LeftExpr, RightExpr, 1)) opAdd(A)(A w)
        {
            Expr!('+', A, Expr!(operation, LeftExpr, RightExpr, 1)) q;
            q.right.left = this.left;
            q.right.right = this.right;
            q.left = w;
            return q;
        }

        Expr!('-', A, Expr!(operation, LeftExpr, RightExpr, 1)) opSub(A)(A w)
        {
            Expr!('-', A, Expr!(operation, LeftExpr, RightExpr, 1)) q;
            q.right.left = this.left;
            q.right.right = this.right;
            q.left = w;
            return q;
        }
    }

    static if (operation == '*' && is(RightExpr : real))
    {
        // Constant fold scalars
        Expr!('*', LeftExpr, real) opMul(real w)
        {
            Expr!('*', LeftExpr, real) q;
            q.left = this.left;
            q.right = this.right * w;
            return q;
        }
    }
    else static if (operation == '*' && is(LeftExpr : real))
    {
        // Constant fold scalars
        Expr!('*', RightExpr, real) opMul(real w)
        {
            Expr!('*', RightExpr, real) q;
            q.left = this.right;
            q.right = this.left * w;
            return q;
        }
    }
    else
    {
        Expr!('*', Expr!(operation, LeftExpr, RightExpr, 1), real) opMul(A)(A w)
        {
            static assert(is(A : real), "Can only multiply by scalars");
            Expr!('*', Expr!(operation, LeftExpr, RightExpr, 1), real) q;
            q.left.left = this.left;
            q.left.right = this.right;
            q.right = w;
            return q;
        }
    }

    static if (is(LeftExpr : DVec))
    {
        const int getNumLeftVectors = 1;
        const int getNumLeftScalars = 0;
    }
    else static if (is(LeftExpr.isExpr))
    {
        const int getNumLeftVectors = LeftExpr.totalVectors;
        const int getNumLeftScalars = LeftExpr.totalScalars;
    }
    else
    {
        const int getNumLeftVectors = 0;
        const int getNumLeftScalars = 1;
    }

    static if (is(RightExpr : DVec))
    {
        const int totalVectors = getNumLeftVectors + 1;
        const int totalScalars = getNumLeftScalars;
    }
    else static if (is(RightExpr.isExpr))
    {
        const int totalVectors = getNumLeftVectors + RightExpr.totalVectors;
        const int totalScalars = getNumLeftScalars + RightExpr.totalScalars;
    }
    else
    {
        const int totalVectors = getNumLeftVectors;
        const int totalScalars = getNumLeftScalars + 1;
    }

    template ExprString(int numvecs, int numscalars)
    {
        const char[] ExprString = ExprToText!(LeftExpr, numvecs, numscalars)
            ~ " " ~ operation ~ " "
            ~ ExprToText!(RightExpr, numvecs + getNumLeftVectors,
                numscalars + getNumLeftScalars);
    }

    template finalinstr(int vecbase, int scalarbase)
    {
        static if (operation == '*')
            const char[] oper = " fmul";
        else static if (operation == '-')
            const char[] oper = " fsub";
        else
            const char[] oper = " fadd";

        static if (is(right.isExpr))
            const char[] s = oper ~ "p ST(1), ST;\t\t\n";
        else static if (is(RightExpr : DVec))
            const char[] s = oper ~ " double ptr ["
                ~ VecName!(vecbase + 1 + getNumLeftVectors) ~ " + 8*ESI];\t\n";
        else
            const char[] s = oper ~ " real ptr ["
                ~ ScalarName!(scalarbase + 1 + getNumLeftScalars) ~ "];\n";
    }

    /// Print the instructions required to evaluate it.
    template instr(int vecbase, int scalarbase)
    {
        static if (is(left.isExpr))
            const char[] s1 = left.instr!(vecbase, scalarbase).s
                ~ left.finalinstr!(vecbase, scalarbase).s;
        else static if (is(LeftExpr : DVec))
            const char[] s1 = " fld double ptr [" ~ VecName!(vecbase + 1) ~ " + 8*ESI];\n";
        else
            const char[] s1 = " fld real ptr [" ~ ScalarName!(scalarbase + 1) ~ "];\n";

        static if (is(right.isExpr))
            const char[] s = s1
                ~ right.instr!(vecbase + getNumLeftVectors,
                    scalarbase + getNumLeftScalars).s
                ~ right.finalinstr!(vecbase + getNumLeftVectors,
                    scalarbase + getNumLeftScalars).s;
        else
            const char[] s = s1;
    }

    void evaluate()
    {
        printf("----------\nEvaluate: ");
        printf(ExprString!(0, 0).ptr);
        printf(\n"With values: ");
        printf(getStr().ptr);
        printf(\n);
    }

    // Just for debugging.
    char[] getStr()
    {
        char[] s;
        static if (is(left.isExpr))
            s = "(" ~ left.getStr() ~ ")";
        else static if (is(LeftExpr : DVec))
            s = left.data;
        else
            s ~= format("%g", left);
        s ~= " " ~ operation ~ " ";
        static if (is(right.isExpr))
            s ~= "(" ~ right.getStr() ~ ")";
        else static if (is(RightExpr : DVec))
            s ~= right.data;
        else
            s ~= format("%g", right);
        return s;
    }
}

void main()
{
    DVec a = new DVec("a");
    DVec b = new DVec("b");
    DVec c = new DVec("c");
    b = a + c;
    b = c * 2;
    a = 27.4 * a + 2 * ((b - c) + 3.2 * c) * 1.234 * 2;
    a = 7 * b + a;
}
```

Dec 13 2006

Bill Baxter wrote:I'm making some headway on an nd-array class that uses BLAS/LAPACK under the hood. It would be nifty if I could somehow get this: R = A * x + y to be grokked and translated into a single call to one of BLAS's saxpy/daxpy/caxpy/zaxpy routines.

This would be fabulous! (And shouldn't it work similarly to the way Don's code detects constant folding? If there's a matrix-vector product and one uses opAdd with another vector, modify the call to xGEMV accordingly.) I would definitely use something that makes accessing the BLAS routines as easy as that. It should also be possible to automatically use the calls for symmetric, hermitian, etc. matrices, if the matrix type reflects these properties...

The theory with BLAS anyway, is that you'll have access to a hand-tuned version for the platform you're on, so it'll be hard to beat with any kind of register-level compile-time micromanagement.

I agree. Writing near-optimal assembler code might be doable for vector operations (and a pretty cool thing too), but for matrix-vector and matrix-matrix operations, I'd go with the functions from BLAS level 2 and 3. Please keep us updated on your progress or make it a project on dsource - I'd want to check it out and contribute, if time allows. Cheers, Christian

Dec 14 2006

Christian Kamm wrote (quoting Bill Baxter):I'm making some headway on an nd-array class that uses BLAS/LAPACK under the hood. It would be nifty if I could somehow get this: R = A * x + y to be grokked and translated into a single call to one of BLAS's saxpy/daxpy/caxpy/zaxpy routines.

This would be fabulous! (and shouldn't it work similarly to the way Don's code detects constant folding? If there's a matrix-vector product and one uses opAdd with another vector, modify the call to xGEMV accordingly) I would definitely use something that makes accessing the BLAS routines as easy as that. It should also be possible to automatically use the calls for symmetric, hermitian, etc. matrices, if the matrix type reflects these properties...

That's all feasible. It's not yet feasible to turn A = t * B + A; into a DAXPY call, because there's no way of knowing that the two 'A's are the same. But A += t * B; could be identified as DAXPY without any trouble. Ditto for the matrix/vector operations. (Now, if my proposed enhancement to operator overloading gets implemented (which seems likely to happen eventually), the compiler would turn A = t*B + A; into A.opAddAssign_r(t*B); and then it could be identified as DAXPY.)

The theory with BLAS anyway, is that you'll have access to a hand-tuned version for the platform you're on, so it'll be hard to beat with any kind of register-level compile-time micromanagement.

I agree. Writing near-optimal assembler code might be doable for vector operations (and a pretty cool thing too), but for matrix-vector and matrix-matrix operations, I'd go with the functions from BLAS level 2 and 3.

Absolutely agree. Cache effects are dominant in BLAS 2 and 3, and a lot of very clever people have put a lot of work into getting that right. It's only in BLAS1 that there's much chance of benefit from on-the-fly code creation.

Please keep us updated on your progress or make it a project on dsource - I'd want to check it out and contribute, if time allows.

It's been a bit of a diversion for me. I'll put the code into my 'mathextra' project on dsource (name: mathextra.Blade), but won't do anything with it till after D 1.0. (I've done a lot of proof-of-concept things to try to find the weak points of the D template system, but it generally doesn't make sense to put much effort into them until we get the first feature freeze; the language has improved so much over the past few months that this stuff tends to get obsolete very quickly.) -Don.

Dec 14 2006