
digitalmars.D - SIMD benchmark

reply Walter Bright <newshound2 digitalmars.com> writes:
I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
Anyhow, it's good enough now to play around with. Consider it alpha quality. 
Expect bugs - but make bug reports, as there's a serious lack of source code to 
test it with.
-----------------------
import core.simd;

void test1a(float[4] a) { }

void test1()
{
     float[4] a = 1.2;
     a[] = a[] * 3 + 7;
     test1a(a);
}

void test2a(float4 a) { }

void test2()
{
     float4 a = 1.2;
     a = a * 3 + 7;
     test2a(a);
}

import std.stdio;
import std.datetime;

int main()
{
     test1();
     test2();
     auto b = comparingBenchmark!(test1, test2, 100);
     writeln(b.point);
     return 0;
}
Jan 14 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2012 10:56 PM, Walter Bright wrote:
 as there's a serious lack of source code to
 test it with.
Here's what there is at the moment. Needs much more. https://github.com/D-Programming-Language/dmd/blob/master/test/runnable/testxmm.d
Jan 14 2012
prev sibling next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 15/01/12 6:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
You sure you want proper bug reports for this? There still seems to be a lot of issues. For example, none of these work for me (OSX 64-bit):
----
int4 a = 2;  // Internal error: backend/cod2.c 2630
----
int4 a = void;
int4 b = void;
a = b;  // segfault
----
int4 a = void;
a = simd(XMM.PXOR, a, a);  // segfault
----
I could go on and on really. Very little seems to work at my end.

Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though.
http://d.puremagic.com/test-results/index.ghtml
Jan 15 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 3:49 AM, Peter Alexander wrote:
 Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though.
Yeah, it's just OSX. I had the test for that platform inadvertently disabled, gak.
Jan 15 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha quality.
 Expect bugs - but make bug reports, as there's a serious lack of source code
 to test it with.
I get 20+ speedup without optimisations with GDC on that small test. :)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha quality.
 Expect bugs - but make bug reports, as there's a serious lack of source code
 to test it with.
I get 20+ speedup without optimisations with GDC on that small test. :)
Correction: 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/15/2012 10:10 AM, Iain Buclaw wrote:
 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...
Woo-hoo!
Jan 15 2012
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...
Please, show me the assembly code produced, along with the corresponding D source :-)

Bye,
bearophile
Jan 15 2012
next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:

 Correction: 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...
Please, show me the assembly code produced, along with the corresponding D source :-)
 Bye,
 bearophile
D code:
----
import core.simd;

void test2a(float4 a) { }

float4 test2()
{
    float4 a = 1.2;
    a = a * 3 + 7;
    test2a(a);
    return a;
}
----
Relevant assembly:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .section .rodata.cst4,"aM", progbits,4
        .align  4
_D4test5test2FZNhG4f:
        .cfi_startproc
        movl    $3, %eax
        cvtsi2ss %eax, %xmm0
        movb    $7, %al
        cvtsi2ss %eax, %xmm1
        unpcklps %xmm0, %xmm0
        unpcklps %xmm1, %xmm1
        movlhps %xmm0, %xmm0
        movlhps %xmm1, %xmm1
        mulps   .LC5(%rip), %xmm0
        addps   %xmm1, %xmm0
        ret
        .cfi_endproc
----
As someone pointed out to me, the only optimisation missing was constant propagation, but that doesn't matter too much for now.

Regards

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:

 Correction: 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...
Please, show me the assembly code produced, along with the corresponding D source :-)
 Bye,
 bearophile
For those who can't read AT&T:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .align  16
_D4test5test2FZNhG4f:
        .cfi_startproc
        mov     eax, 3
        cvtsi2ss xmm0, eax
        mov     al, 7
        cvtsi2ss xmm1, eax
        unpcklps xmm0, xmm0
        unpcklps xmm1, xmm1
        movlhps xmm0, xmm0
        movlhps xmm1, xmm1
        mulps   xmm0, XMMWORD PTR .LC5[rip]
        addps   xmm0, xmm1
        ret
        .cfi_endproc
----

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 15 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 15 January 2012 20:10, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:
  I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
  Anyhow, it's good enough now to play around with. Consider it alpha quality.
  Expect bugs - but make bug reports, as there's a serious lack of source code
  to test it with.
I get 20+ speedup without optimisations with GDC on that small test. :)
Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...
Oh my indeed. Haha, well I'm sure that's a fairly artificial result, but yes, this is why I've been harping on for months that it's a bare necessity to provide language support :P
Jan 15 2012
prev sibling next sibling parent reply Andre Tampubolon <andre lc.vlsm.org> writes:
I just built 32 & 64 bit DMD (latest commit on git tree is
f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

Using the 32-bit version, I got this error:
Internal error: backend/cg87.c 1702

The 64-bit version went fine.

Previously, both 32 and 64 bit version had no problem.

On 01/15/2012 01:56 PM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;
 
 void test1a(float[4] a) { }
 
 void test1()
 {
     float[4] a = 1.2;
     a[] = a[] * 3 + 7;
     test1a(a);
 }
 
 void test2a(float4 a) { }
 
 void test2()
 {
     float4 a = 1.2;
     a = a * 3 + 7;
     test2a(a);
 }
 
 import std.stdio;
 import std.datetime;
 
 int main()
 {
     test1();
     test2();
     auto b = comparingBenchmark!(test1, test2, 100);
     writeln(b.point);
     return 0;
 }
Jan 16 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702

 The 64-bit version went fine.

 Previously, both 32 and 64 bit version had no problem.
Which machine?
Jan 16 2012
parent reply Andre Tampubolon <andre lc.vlsm.org> writes:
Well I only have 1 machine, a laptop running 64 bit Arch Linux.
Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
compiled fine using those.
But now, the 32 bit version fails.

Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
 
 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702
 
 The 64-bit version went fine.
 
 Previously, both 32 and 64 bit version had no problem.
Which machine?
Jan 16 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
32 bit SIMD for Linux is not implemented.

It's all 64 bit platforms, and 32 bit OS X.

On 1/16/2012 2:35 AM, Andre Tampubolon wrote:
 Well I only have 1 machine, a laptop running 64 bit Arch Linux.
 Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
 compiled fine using those.
 But now, the 32 bit version fails.

 Walter Bright<newshound2 digitalmars.com>  wrote:
 On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702

 The 64-bit version went fine.

 Previously, both 32 and 64 bit version had no problem.
Which machine?
Jan 16 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }
These two functions should have the same speed. The function that ought to be slower is:

void test1()
{
    float[5] a = 1.2;
    float[] b = a[1 .. $];
    b[] = b[] * 3 + 7;
    test1a(a);
}

Andrei
Jan 16 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 16 January 2012 18:17, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 1/15/12 12:56 AM, Walter Bright wrote:

 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }
These two functions should have the same speed.
A function using float arrays and a function using hardware vectors should certainly not be the same speed.
Jan 16 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/12 10:46 AM, Manu wrote:
 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.
My point was that the version using float arrays should opportunistically use hardware ops whenever possible. Andrei
Jan 16 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 1/16/12 10:46 AM, Manu wrote:

 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.
My point was that the version using float arrays should opportunistically use hardware ops whenever possible.
I think this is a mistake, because such a piece of code never exists outside of some context. If the context it exists within is all FPU code (and it is, it's a float array), then swapping between FPU and SIMD execution units will probably result in the function being slower than the original (also the float array is unaligned). The SIMD version however must exist within a SIMD context, since the API can't implicitly interact with floats; this guarantees that the context of each function matches that within which it lives. This is fundamental to fast vector performance. Using SIMD is an all or nothing decision, you can't just mix it in here and there.

You don't go casting back and forth between floats and ints on every other line... obviously it's imprecise, but it's also a major performance hazard. There is no difference here, except the performance hazard is much worse.
Jan 16 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/16/2012 05:59 PM, Manu wrote:
 On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

     On 1/16/12 10:46 AM, Manu wrote:

         A function using float arrays and a function using hardware vectors
         should certainly not be the same speed.


     My point was that the version using float arrays should
     opportunistically use hardware ops whenever possible.


 I think this is a mistake, because such a piece of code never exists
 outside of some context. If the context it exists within is all FPU code
 (and it is, it's a float array), then swapping between FPU and SIMD
 execution units will probably result in the function being slower than
 the original (also the float array is unaligned). The SIMD version
 however must exist within a SIMD context, since the API can't implicitly
 interact with floats, this guarantees that the context of each function
 matches that within which it lives.
 This is fundamental to fast vector performance. Using SIMD is an all or
 nothing decision, you can't just mix it in here and there.
 You don't go casting back and fourth between floats and ints on every
 other line... obviously it's imprecise, but it's also a major
 performance hazard. There is no difference here, except the performance
 hazard is much worse.
I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.
Jan 16 2012
parent reply Manu <turkeyman gmail.com> writes:
On 16 January 2012 19:01, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 01/16/2012 05:59 PM, Manu wrote:

  On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
   My point was that the version using float arrays should
   opportunistically use hardware ops whenever possible.
  I think this is a mistake, because such a piece of code never exists
  outside of some context. [...] There is no difference here, except the
  performance hazard is much worse.
I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.
x64 can do the swapping too with no penalty, but that is the only architecture that can. So it might be a viable x64 optimisation, but only for x64 codegen, which means any tech to detect and apply the optimisation should live in the back end, not in the front end as a higher level semantic.
Jan 16 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 9:21 AM, Manu wrote:
 x64 can do the swapping too with no penalty, but that is the only architecture
 that can.
Ah, that is a crucial bit of information.
Jan 16 2012
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2012-01-16 16:59:44 +0000, Manu <turkeyman gmail.com> said:

 
 On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 1/16/12 10:46 AM, Manu wrote:
 
 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.
My point was that the version using float arrays should opportunistically use hardware ops whenever possible.
I think this is a mistake, because such a piece of code never exists outside of some context. [...] There is no difference here, except the performance hazard is much worse.
Andrei's idea could be valid as an optimization when the compiler can see that all the operations can be performed with SIMD ops. In this particular case: if test1a(a) is inlined. But it can't work if the float[4] value crosses a function's boundary.

Or instead the optimization could be performed at the semantic level, like this: try to change the type of a float[4] variable to float4, and if it still compiles, use float4 instead. So if you have the same function working with a float[4] and a float4, and if all the functions you call on a given variable support float4, it'll go for float4. But doing that at the semantic level would be rather messy, not counting the combinatorial explosion when multiple variables are at play.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 16 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/12 11:32 AM, Michel Fortin wrote:
 Andrei's idea could be valid as an optimization when the compiler can
 see that all the operations can be performed with SIMD ops. In this
 particular case: if test1a(a) is inlined. But it can't work if the
 float[4] value crosses a function's boundary.
In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy function that follows. No?

Andrei
Jan 16 2012
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2012-01-16 17:57:14 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/16/12 11:32 AM, Michel Fortin wrote:
 Andrei's idea could be valid as an optimization when the compiler can
 see that all the operations can be performed with SIMD ops. In this
 particular case: if test1a(a) is inlined. But it can't work if the
 float[4] value crosses a function's boundary.
In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy functions that follows. No?
That's exactly what I meant: if everything is local to the function you might be able to optimize. In this particular case, if test1a(a) is inlined, everything is local.

But the current example has too much isolation for it to be meaningful. If you returned the result as a float[4] then the optimization doesn't work. If you took an argument as a float[4] it probably wouldn't work either (depending on what you do with the argument). So I don't think it's an optimization you should count on very much.

In fact, the optimization I'd expect the compiler to do in this case is just wipe out all the code, as it does nothing other than put a value in a local variable which is never reused later.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 16 2012
parent reply Manu <turkeyman gmail.com> writes:
On 16 January 2012 21:27, Michel Fortin <michel.fortin michelf.com> wrote:

 In fact, the optimization I'd expect the compiler to do in this case is
 just wipe out all the code, as it does nothing other than putting a value
 in a local variable which is never reused later.
Yes, my first thought when I saw this test was "why is it generating any code at all?".. But I tried to forget about that :)

I am curious though, what is causing that code (on both sides) to not be eliminated? If I write that in C, I'm sure it would generate nothing. Is this a language implementation bug somehow?
Jan 16 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 12:22 PM, Manu wrote:
 Yes, my first thought when I saw this test was "why is it generating any
 code at all?".. But I tried to forget about that :)
 I am curious though, what is causing that code (on both sides) to not be
 eliminated? If I write that in C, I'm sure it would generate nothing. Is
 this a language implementation bug somehow?
Compile with inlining off, and the compiler 'forgets' what the called function does, so it must call it.
Jan 16 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 8:59 AM, Manu wrote:
 (also the float array is
 unaligned).
Currently, it is 4 byte aligned. But the compiler could align freestanding static arrays on 16 bytes without breaking anything. It just cannot align b in:

struct S { int a; float[4] b; }

on a 16 byte boundary, as that would break the ABI. Even worse,

struct S { int a; char[16] s; }

can't be aligned on 16 bytes, as that is a common "small string optimization".
Jan 16 2012
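The ABI point can be checked directly. Here is a small C sketch of the same two layouts (C11, assuming a typical x86/x86-64 ABI where int and float are 4 bytes; the struct names are illustrative): forcing 16-byte alignment on the array member moves it and grows the struct.

```c
#include <assert.h>
#include <stddef.h>

/* The layout described above: b sits right after a, at offset 4. */
struct S   { int a; float b[4]; };

/* What forcing 16-byte alignment on the array member would do:
   b moves to offset 16 and the struct doubles in size, breaking any
   code compiled against the old layout. (C11 _Alignas.) */
struct S16 { int a; _Alignas(16) float b[4]; };
```

Checking with offsetof/sizeof makes the incompatibility concrete: the two structs are not interchangeable across a library boundary.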
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
 My point was that the version using float arrays should opportunistically use
 hardware ops whenever possible.
Yes, you're right. The compiler can opportunistically convert a number of vector operations on static arrays to the SIMD instructions.

Now that the basics are there, there are many, many opportunities to improve the code generation. Even for things like:

int i, j;
i *= 3;
foo();
j *= 3;

the two multiplies can be combined. Also, if operations on a particular integer variable are a subset that is supported by SIMD, that variable could be enregistered in an XMM register, instead of a GP register.

But don't worry, I'm not planning on working on that at the moment :-)
Jan 16 2012
next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 18:59, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
 My point was that the version using float arrays should opportunistically
 use hardware ops whenever possible.
 Yes, you're right. The compiler can opportunistically convert a number of
 vector operations on static arrays to the SIMD instructions.

 Now that the basics are there, there are many, many opportunities to improve
 the code generation. Even for things like:

 int i, j;
 i *= 3;
 foo();
 j *= 3;

 the two multiplies can be combined. Also, if operations on a particular
 integer variable are a subset that is supported by SIMD, that variable could
 be enregistered in an XMM register, instead of a GP register.

 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
Jan 16 2012
next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else.

Example:

int[256] a, b, c;

void foo()
{
    for (int i = 0; i < 256; i++)
        a[i] = b[i] + c[i];
}

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>  wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for. I am not aware of GCC vectorising
 anything else. Example:

 int[256] a, b, c;

 void foo()
 {
     for (int i = 0; i < 256; i++)
         a[i] = b[i] + c[i];
 }
Unfortunately, if the function was this:

void foo(int[] a, int[] b, int[] c)
{
    for (int i = 0; i < 256; i++)
        a[i] = b[i] + c[i];
}

Then it can't vectorize due to aliasing.
Jan 16 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 16 January 2012 23:57, Peter Alexander <peter.alexander.au gmail.com> wrote:

 On 16/01/12 8:56 PM, Iain Buclaw wrote:

 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
  But don't worry, I'm not planning on working on that at the moment :-)
  Leave that sort of optimisation for the backend to handle please. ;-)
 Of course. I suspect Intel's compiler does that one, does gcc?
 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for. [...]
 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c)
 {
     for (int i = 0; i < 256; i++)
         a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.
This is why D needs a __restrict attribute! ;)
Jan 16 2012
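C99 already has the keyword Manu is alluding to. A sketch of the same function with restrict-qualified pointers (the function name mirrors the example in the thread; whether a given compiler actually vectorises the loop is of course not guaranteed):

```c
#include <assert.h>

/* 'restrict' promises the compiler that a, b and c never alias,
   which removes the obstacle to vectorising this loop. Passing
   overlapping arrays would be undefined behaviour. */
void foo(int *restrict a, const int *restrict b, const int *restrict c)
{
    for (int i = 0; i < 256; i++)
        a[i] = b[i] + c[i];
}
```

The trade-off, raised later in the thread, is that the no-overlap promise is unchecked: the programmer carries the burden of correctness.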
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 1:54 PM, Manu wrote:
     Unfortunately, if the function was this:

     void foo(int[] a, int[] b, int[] c) {

       for (int i=0; i<256; i++)
         a[i] = b[i] + c[i];
     }

     Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)
That's why D has:

a[] = b[] + c[];

because the language requires the arrays to be distinct.
Jan 16 2012
parent reply Manu <turkeyman gmail.com> writes:
On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/16/2012 1:54 PM, Manu wrote:

    Unfortunately, if the function was this:

    void foo(int[] a, int[] b, int[] c) {

      for (int i=0; i<256; i++)
        a[i] = b[i] + c[i];
    }

    Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)
 That's why D has:

 a[] = b[] + c[];

 because the language requires the arrays to be distinct.
Surely it would be possible for them to be overlapping slices?
Jan 16 2012
next sibling parent reply =?utf-8?Q?Simen_Kj=C3=A6r=C3=A5s?= <simen.kjaras gmail.com> writes:
On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:

 On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com>  
 wrote:

 On 1/16/2012 1:54 PM, Manu wrote:

    Unfortunately, if the function was this:

    void foo(int[] a, int[] b, int[] c) {

      for (int i=0; i<256; i++)
        a[i] = b[i] + c[i];
    }

    Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)
 That's why D has:

 a[] = b[] + c[];

 because the language requires the arrays to be distinct.
Surely it would be possible for them to be overlapping slices?
If they are, that's your fault and your problem.

"The lvalue slice and any rvalue slices must not overlap."
Jan 16 2012
parent Simen Kjærås <simen.kjaras gmail.com> writes:
On Mon, 16 Jan 2012 23:22:21 +0100, Simen Kjærås <simen.kjaras gmail.com> wrote:

 On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:
  [...]
  Surely it would be possible for them to be overlapping slices?
 If they are, that's your fault and your problem.

 "The lvalue slice and any rvalue slices must not overlap."
Sorry, forgot the link: http://www.d-programming-language.org/arrays.html#array-operations
Jan 16 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2012 2:06 PM, Manu wrote:
 Surely it would be possible for them to be overlapping slices?
Not allowed, treated like array bounds checking.
Jan 16 2012
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
  [...]
 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c)
 {
     for (int i = 0; i < 256; i++)
         a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.
Compile with -fstrict-aliasing then?

I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?)

Regards

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 16/01/12 10:36 PM, Iain Buclaw wrote:
 On 16 January 2012 21:57, Peter Alexander<peter.alexander.au gmail.com>  wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
  [...]
 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c)
 {
     for (int i = 0; i < 256; i++)
         a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.
Compile with -fstrict-aliasing then?
This has nothing to do with strict aliasing. int[257] a; foo(a[1..257], a[0..256], a[0..256]); This doesn't break any strict aliasing rule, but the loop still cannot be (trivially) vectorized. As Manu said, you need something like __restrict (or a linear type system) to solve this problem. http://en.wikipedia.org/wiki/Linear_type_system http://en.wikipedia.org/wiki/Uniqueness_typing
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
Jan 17 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
So you'd change it to this, even in release builds? void foo(int[] a, int[] b, int[] c) { if ( /* arrays overlap */ ) { foreach(i; 0..256) a[i] = b[i] + c[i]; } else { /* vectorized code */ } } i.e. duplicate all loops that can be potentially vectorized depending on aliasing? Please bear in mind that this is a simple example. Seems a bit inefficient (code size).
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
So you'd change it to this, even in release builds?
No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.
Jan 17 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
So you'd change it to this, even in release builds?
No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.
The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap... foreach (i; 0..80) c[i] = b[i]; // Illegal? I know that b[] = c[] would be illegal, but that has nothing to do with the prior discussion.
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 3:23 PM, Peter Alexander wrote:
 On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
So you'd change it to this, even in release builds?
No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.
The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap...
No, not illegal.
 foreach (i; 0..80)
 c[i] = b[i]; // Illegal?
No, not illegal.
 I know that b[] = c[] would be illegal, but that has nothing to do with the
 prior discussion.
Yes, b[]=c[] is illegal.
Jan 17 2012
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 17/01/12 11:34 PM, Walter Bright wrote:
 On 1/17/2012 3:23 PM, Peter Alexander wrote:
 On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.
No, you don't. It can be done with a runtime check, like array bounds checking is done.
So you'd change it to this, even in release builds?
No. Like array bounds, if they overlap, an exception is thrown. Remember, the D spec says that overlapping arrays are illegal.
The D spec says that overlapping arrays are illegal for vector ops. The foo(int[], int[], int[]) function does not use vector ops. Or am I missing something really major? For example, is this legal code? int[100] a; int[] b = a[0..100]; int[] c = a[10..90]; // Illegal? b and c overlap...
No, not illegal.
 foreach (i; 0..80)
 c[i] = b[i]; // Illegal?
No, not illegal.
 I know that b[] = c[] would be illegal, but that has nothing to do
 with the
 prior discussion.
Yes, b[]=c[] is illegal.
So, my original point still stands, you can't vectorise this function: void foo(int[] a, int[] b, int[] c) { foreach (i; 0..256) a[i] = b[i] + c[i]; } Those slices are allowed to overlap, so this cannot be automatically vectorised (without inlining to get better context about those arrays). Without inlining, you need something along the lines of __restrict or uniqueness typing.
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 4:19 PM, Peter Alexander wrote:
 So, my original point still stands, you can't vectorise this function:

 void foo(int[] a, int[] b, int[] c)
 {
 foreach (i; 0..256)
 a[i] = b[i] + c[i];
 }

 Those slices are allowed to overlap, so this cannot be automatically vectorised
 (without inlining to get better context about those arrays).

 Without inlining, you need something along the lines of __restrict or
uniqueness
 typing.
No, you can rewrite it as: a[] = b[] + c[]; and you don't need __restrict or uniqueness. That's what the vector operations are for.
Jan 17 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/18/2012 02:04 AM, Walter Bright wrote:
 On 1/17/2012 4:19 PM, Peter Alexander wrote:
 So, my original point still stands, you can't vectorise this function:

 void foo(int[] a, int[] b, int[] c)
 {
 foreach (i; 0..256)
 a[i] = b[i] + c[i];
 }

 Those slices are allowed to overlap, so this cannot be automatically
 vectorised
 (without inlining to get better context about those arrays).

 Without inlining, you need something along the lines of __restrict or
 uniqueness
 typing.
No, you can rewrite it as: a[] = b[] + c[]; and you don't need __restrict or uniqueness. That's what the vector operations are for.
Are they really a general solution? How do you use vector ops to implement an efficient matrix multiply, for instance?
Jan 17 2012
parent reply "F i L" <witte2008 gmail.com> writes:
Timon Gehr wrote:
 Are they really a general solution? How do you use vector ops 
 to implement an efficient matrix multiply, for instance?
struct Matrix4 { float4 x, y, z, w; auto transform(Matrix4 mat) { Matrix4 rmat; float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x}; float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y}; float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z}; float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w}; float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w}; float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w}; float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w}; float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w}; rmat.x = cx * rx; // simd rmat.y = cy * ry; // simd rmat.z = cz * rz; // simd rmat.w = cw * rw; // simd return rmat; } }
Jan 17 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/18/2012 02:32 AM, F i L wrote:
 Timon Gehr wrote:
 Are they really a general solution? How do you use vector ops to
 implement an efficient matrix multiply, for instance?
struct Matrix4 { float4 x, y, z, w; auto transform(Matrix4 mat) { Matrix4 rmat; float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x}; float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y}; float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z}; float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w}; float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w}; float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w}; float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w}; float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w}; rmat.x = cx * rx; // simd rmat.y = cy * ry; // simd rmat.z = cz * rz; // simd rmat.w = cw * rw; // simd return rmat; } }
The parameter is just squared and returned? Anyway, I was after a general matrix*matrix multiplication, where the operands can get arbitrarily large and where any potential use of __restrict is rendered unnecessary by array vector ops.
Jan 17 2012
next sibling parent "a" <a a.com> writes:
On Wednesday, 18 January 2012 at 01:50:00 UTC, Timon Gehr wrote:

 Anyway, I was after a general matrix*matrix multiplication, 
 where the operands can get arbitrarily large and where any 
 potential use of __restrict is rendered unnecessary by array 
 vector ops.
Here you go. But I agree there are use cases for restrict where vector operations don't help: void matmul(A,B,C)(A a, B b, C c, size_t n, size_t m, size_t l) { for(size_t i = 0; i < n; i++) { c[i*l..i*l + l] = 0; for(size_t j = 0; j < m; j++) c[i*l..i*l + l] += a[i*m + j] * b[j*l..j*l + l]; } }
Jan 17 2012
prev sibling parent "F i L" <witte2008 gmail.com> writes:
Timon Gehr wrote:
 The parameter is just squared and returned?
No, sorry that code is all screwed up and missing a step. My Matrix multiply code looks like this: auto transform(U)(Matrix4!U m) if (isImplicitlyConvertible!(U, T)) { return Matrix4 ( Vector4 ( (m.x.x*x.x) + (m.x.y*y.x) + (m.x.z*z.x) + (m.x.w*w.x), (m.x.x*x.y) + (m.x.y*y.y) + (m.x.z*z.y) + (m.x.w*w.y), (m.x.x*x.z) + (m.x.y*y.z) + (m.x.z*z.z) + (m.x.w*w.z), (m.x.x*x.w) + (m.x.y*y.w) + (m.x.z*z.w) + (m.x.w*w.w) ), Vector4 ( (m.y.x*x.x) + (m.y.y*y.x) + (m.y.z*z.x) + (m.y.w*w.x), (m.y.x*x.y) + (m.y.y*y.y) + (m.y.z*z.y) + (m.y.w*w.y), (m.y.x*x.z) + (m.y.y*y.z) + (m.y.z*z.z) + (m.y.w*w.z), (m.y.x*x.w) + (m.y.y*y.w) + (m.y.z*z.w) + (m.y.w*w.w) ), Vector4 ( (m.z.x*x.x) + (m.z.y*y.x) + (m.z.z*z.x) + (m.z.w*w.x), (m.z.x*x.y) + (m.z.y*y.y) + (m.z.z*z.y) + (m.z.w*w.y), (m.z.x*x.z) + (m.z.y*y.z) + (m.z.z*z.z) + (m.z.w*w.z), (m.z.x*x.w) + (m.z.y*y.w) + (m.z.z*z.w) + (m.z.w*w.w) ), Vector4 ( (m.w.x*x.x) + (m.w.y*y.x) + (m.w.z*z.x) + (m.w.w*w.x), (m.w.x*x.y) + (m.w.y*y.y) + (m.w.z*z.y) + (m.w.w*w.y), (m.w.x*x.z) + (m.w.y*y.z) + (m.w.z*z.z) + (m.w.w*w.z), (m.w.x*x.w) + (m.w.y*y.w) + (m.w.z*z.w) + (m.w.w*w.w) ) ); } which would need to be converted to something more like my previous example in order for SIMD to kick in. I don't know if D's compiler is good enough to optimize the above code into SIMD ops, but I doubt it.
 Anyway, I was after a general matrix*matrix multiplication, 
 where the operands can get arbitrarily large and where any 
 potential use of __restrict is rendered unnecessary by array 
 vector ops.
I don't know enough about simd to confidently discuss this, but I'd imagine there'd have to be quite a lot of compiler magic happening before arbitrarily sized matrix constructs could make use of simd.
Jan 17 2012
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>
 wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)
 Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }
Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.
Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?) Regards -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 16 2012
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com>
wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>
  wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)
 Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
There's auto-vectorisation for for(), foreach(), and foreach_reverse() loops that I have written support for. I am not aware of GCC vectorising anything else. example: int a[256], b[256], c[256]; void foo () { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; }
Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.
Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I forsee there may be issues (maybe have it on for safe code?)
OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-)
What protects these ranges from being overlapping? What if they were sourced from pointers? Are we just to say in D that aliasing is not allowed, and 'you shouldn't do it'? People almost never alias intentionally; it's usually the most insidious of bugs. :/
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 12:17 AM, Manu wrote:
 What protects these ranges from being overlapping?
A runtime check, like array bounds checking.
Jan 17 2012
parent reply Manu <turkeyman gmail.com> writes:
On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/17/2012 12:17 AM, Manu wrote:

 What protects these ranges from being overlapping?
A runtime check, like array bounds checking.
Awesome. How does this map to pointer dereferencing?
Jan 17 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 2:43 AM, Manu wrote:
 On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/17/2012 12:17 AM, Manu wrote:

         What protects these ranges from being overlapping?


     A runtime check, like array bounds checking.


 Awesome.
 How does this map to pointer dereferencing?
It can't. Use dynamic arrays - that's what they're for.
Jan 17 2012
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 17 January 2012 08:17, Manu <turkeyman gmail.com> wrote:
 On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com>
 wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>
 wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for. I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }
Unfortunately, if the function was this: void foo(int[] a, int[] b, int[] c) { for (int i=0; i<256; i++) a[i] = b[i] + c[i]; } Then it can't vectorize due to aliasing.
Compile with -fstrict-aliasing then? I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for safe code?)
OK, have turned on strict aliasing by default for D2. You should now be able to vectorise loops that use locals and parameters. :-)
 What protects these ranges from being overlapping? What if they were sourced
 from pointers?
 Are we just to say in D that aliasing is not allowed, and 'you shouldn't do
 it'? People almost never alias intentionally; it's usually the
 most insidious of bugs. :/
D arrays have a .length property that keeps track of the length of the array. When array bounds checking is turned on (the default when not compiling with -release) an assert is produced when you step outside the bounds of the array. Is this what you mean? -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 17 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 16 Jan 2012 20:25:28 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)
Leave that sort of optimisation for the backend to handle please. ;-)
Of course. I suspect Intel's compiler does that one, does gcc?
Thought of that too, but it's rather tough to manage slots in vector registers. Could probably dust off Don's BLADE library. It seems that gcc and icc are limited to loop optimization. http://gcc.gnu.org/projects/tree-ssa/vectorization.html http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/
Jan 16 2012
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 But don't worry, I'm not planning on working on that at the moment :-)
Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4]; Bye, bearophile
Jan 16 2012
next sibling parent Manu <turkeyman gmail.com> writes:
On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 But don't worry, I'm not planning on working on that at the moment :-)
Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...
Jan 17 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote:

 On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 But don't worry, I'm not planning on working on that at the moment :-)
Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...
If the compiler knows the length is a compile-time constant, you could use a static foreach.
Jan 17 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 17 January 2012 11:48, Martin Nowak <dawg dawgfoto.de> wrote:

 On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote:

  On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:
  Walter:
 But don't worry, I'm not planning on working on that at the moment :-)
Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small it unrolls the operation in-place: int n = 5; auto a = new int[n]; auto b = new int[n]; a[] += b[]; ==> int n = 5; auto a = new int[n]; // a and b are dynamic arrays, auto b = new int[n]; // but their length is easy to find at compile-time a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3]; a[4] += b[4];
If this doesn't already exist, I think it's quite important. I was thinking about needing to repeatedly specialise a template last night for a bunch of short lengths of arrays, for this exact reason. Unrolling short loops must be one of the most trivial and worthwhile optimisations...
 If the compiler knows the length is a compile-time constant, you could use a static foreach.
Great idea! :)
Jan 17 2012
prev sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }
These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martin
Jan 16 2012
parent reply Don Clugston <dac nospam.com> writes:
On 16/01/12 17:51, Martin Nowak wrote:
 On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }
These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martin
Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation. Incidentally, this is the reason that CTFE doesn't work with array ops.
Jan 17 2012
parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac nospam.com> wrote:

 On 16/01/12 17:51, Martin Nowak wrote:
 On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }
These two functions should have the same speed. The function that ought to be slower is: void test1() { float[5] a = 1.2; float[] b = a[1 .. $]; b[] = b[] * 3 + 7; test1a(a); } Andrei
Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime. martin
Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation. Incidentally, this is the reason that CTFE doesn't work with array ops.
Oh, I was literally speaking of the runtime implementation. It should loop with 4 XMM regs, then continue with 1 XMM reg, and finish up scalar. Right now it quantizes on 16 floats and does the remaining ones scalar, which is really bad for very small arrays. I was about to rewrite it at some point. https://gist.github.com/1235470 I think having a runtime template is better than doing this massive extern(C) interface that has to be kept in sync. That would also open up room for a better CTFE integration. martin
Jan 17 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 2:04 AM, Martin Nowak wrote:
 I was about to rewrite it at some point.
 https://gist.github.com/1235470
I think you've got an innovative and clever solution. I'd like to see you finish it!
Jan 17 2012
parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 17 Jan 2012 11:53:35 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/17/2012 2:04 AM, Martin Nowak wrote:
 I was about to rewrite it at some point.
 https://gist.github.com/1235470
I think you've got an innovative and clever solution. I'd like to see you finish it!
Mmh, there was something keeping me from specializing templates, https://github.com/D-Programming-Language/dmd/pull/396 :). But right now I'd rather like to finish the shared library merging.
Jan 17 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2012 5:20 AM, Martin Nowak wrote:
 Mmh, there was something keeping me from specializing templates,
 https://github.com/D-Programming-Language/dmd/pull/396 :).

 But right now I'd rather like to finish the shared library merging.
I agree.
Jan 17 2012