digitalmars.D - SIMD benchmark

Walter Bright (30/30) Jan 14 2012 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.

Walter Bright (3/5) Jan 14 2012 Here's what there is at the moment. Needs much more.
Peter Alexander (17/21) Jan 15 2012 You sure you want proper bug reports for this? There still seems to be a...

Walter Bright (2/3) Jan 15 2012 Yeah, it's just OSX. I had the test for that platform inadvertently disa...

Iain Buclaw (5/9) Jan 15 2012 I get 20+ speedup without optimisations with GDC on that small test. :)
Iain Buclaw (6/12) Jan 15 2012 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up

Walter Bright (2/4) Jan 15 2012 Woo-hoo!
bearophile (4/6) Jan 15 2012 Please, show me the assembly code produced, with its relative D source :...

Iain Buclaw (44/50) Jan 15 2012 D code:
Iain Buclaw (28/34) Jan 15 2012 For those who can't read AT&T:

Manu (5/19) Jan 15 2012 Oh my indeed.
Andre Tampubolon (7/43) Jan 16 2012 I just built 32 & 64 bit DMD (latest commit on git tree is

Walter Bright (2/8) Jan 16 2012 Which machine?

Andre Tampubolon (5/17) Jan 16 2012 Well I only have 1 machine, a laptop running 64 bit Arch Linux.

Walter Bright (3/20) Jan 16 2012 32 bit SIMD for Linux is not implemented.

Andrei Alexandrescu (11/31) Jan 16 2012 These two functions should have the same speed. The function that ought

Manu (3/31) Jan 16 2012 A function using float arrays and a function using hardware vectors shou...

Andrei Alexandrescu (4/6) Jan 16 2012 My point was that the version using float arrays should

Manu (14/21) Jan 16 2012 I think this is a mistake, because such a piece of code never exists

Timon Gehr (3/25) Jan 16 2012 I think DMD now uses XMM registers for scalar floating point arithmetic

Manu (6/40) Jan 16 2012 x64 can do the swapping too with no penalty, but that is the only

Walter Bright (2/4) Jan 16 2012 Ah, that is a crucial bit of information.

Michel Fortin (16/41) Jan 16 2012 Andrei's idea could be valid as an optimization when the compiler can

Andrei Alexandrescu (5/9) Jan 16 2012 In this case it's the exact contrary: the float[4] and the operation are...

Michel Fortin (17/26) Jan 16 2012 That's exactly what I meant, if everything is local to the function you

Manu (6/9) Jan 16 2012 Yes, my first thought when I saw this test was "why is it generating any

Walter Bright (3/8) Jan 16 2012 Compile with inlining off, and the compiler 'forgets' what the called fu...

Walter Bright (15/17) Jan 16 2012 Currently, it is 4 byte aligned. But the compiler could align freestandi...

Walter Bright (13/15) Jan 16 2012 Yes, you're right. The compiler can opportunistically convert a number o...

Iain Buclaw (8/25) Jan 16 2012 ove

Walter Bright (3/5) Jan 16 2012 Of course.

Iain Buclaw (13/20) Jan 16 2012 There's auto-vectorisation for for(), foreach(), and foreach_reverse()

Peter Alexander (7/28) Jan 16 2012 Unfortunately, if the function was this:

Manu (2/40) Jan 16 2012 This is why D needs a __restrict attribute! ;)

Walter Bright (4/11) Jan 16 2012 That's why D has:

Manu (2/19) Jan 16 2012 Surely it would be possible for them to be overlapping slices?

=?utf-8?Q?Simen_Kj=C3=A6r=C3=A5s?= (3/28) Jan 16 2012 If they are, that's your fault and your problem.

=?utf-8?Q?Simen_Kj=C3=A6r=C3=A5s?= (6/36) Jan 16 2012 On Mon, 16 Jan 2012 23:22:21 +0100, Simen Kj=C3=A6r=C3=A5s

Walter Bright (2/3) Jan 16 2012 Not allowed, treated like array bounds checking.

Iain Buclaw (10/48) Jan 16 2012 )

Peter Alexander (10/54) Jan 17 2012 This has nothing to do with strict aliasing.

Walter Bright (3/5) Jan 17 2012 No, you don't. It can be done with a runtime check, like array bounds ch...

Peter Alexander (17/23) Jan 17 2012 So you'd change it to this, even in release builds?

Walter Bright (3/12) Jan 17 2012 No. Like array bounds, if they overlap, an exception is thrown.

Peter Alexander (12/25) Jan 17 2012 The D spec says that overlapping arrays are illegal for vector ops. The

Walter Bright (4/31) Jan 17 2012 No, not illegal.

Peter Alexander (11/48) Jan 17 2012 So, my original point still stands, you can't vectorise this function:

Walter Bright (5/15) Jan 17 2012 No, you can rewrite it as:

Timon Gehr (3/23) Jan 17 2012 Are they really a general solution? How do you use vector ops to

F i L (22/24) Jan 17 2012 struct Matrix4

Timon Gehr (5/29) Jan 17 2012 The parameter is just squared and returned?

a (12/16) Jan 17 2012 Here you go. But I agree there are use cases for restrict where
F i L (40/45) Jan 17 2012 No, sorry that code is all screwed up and missing a step.

Iain Buclaw (8/58) Jan 16 2012 -)

Manu (6/59) Jan 17 2012 What protects these ranges from being overlapping? What if they were

Walter Bright (2/3) Jan 17 2012 A runtime check, like array bounds checking.

Manu (3/7) Jan 17 2012 Awesome.

Walter Bright (2/9) Jan 17 2012 It can't. Use dynamic arrays - that's what they're for.

Iain Buclaw (14/76) Jan 17 2012 t

Martin Nowak (8/13) Jan 16 2012 Thought of that too, but it's rather tough to manage slots in vector

bearophile (17/18) Jan 16 2012 Until better optimizations are implemented, I see a "simple" optimizatio...

Manu (6/24) Jan 17 2012 If this doesn't already exist, I think it's quite important. I was think...
Martin Nowak (3/37) Jan 17 2012 If the compiler knows it's a compile time constant
Manu (2/41) Jan 17 2012 Great idea! :)

Martin Nowak (7/42) Jan 16 2012 Unfortunately druntime's array ops are a mess and fail

Don Clugston (6/52) Jan 17 2012 Yes. The structural problem in the compiler is that array ops get turned...

Martin Nowak (13/68) Jan 17 2012 Oh, I was literally speaking of the runtime implementation.

Walter Bright (2/4) Jan 17 2012 I think you've got an innovative and clever solution. I'd like to see yo...

Martin Nowak (5/10) Jan 17 2012 Mmh, there was something keeping me from specializing templates,

Walter Bright (2/5) Jan 17 2012 I agree.

Walter Bright <newshound2 digitalmars.com> writes:

I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
Anyhow, it's good enough now to play around with. Consider it alpha quality. 
Expect bugs - but make bug reports, as there's a serious lack of source code to 
test it with.
-----------------------
import core.simd;

void test1a(float[4] a) { }

void test1()
{
     float[4] a = 1.2;
     a[] = a[] * 3 + 7;
     test1a(a);
}

void test2a(float4 a) { }

void test2()
{
     float4 a = 1.2;
     a = a * 3 + 7;
     test2a(a);
}

import std.stdio;
import std.datetime;

int main()
{
     test1();
     test2();
     auto b = comparingBenchmark!(test1, test2, 100);
     writeln(b.point);
     return 0;
}

Jan 14 2012
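
The listing above relies on `comparingBenchmark` from the `std.datetime` of the time. As a rough sketch only, assuming a newer Phobos where `std.datetime.stopwatch.benchmark` is the available helper, the same comparison might look like the following. The names `sinkArray`, `sinkVector`, `scaleArray` and `scaleVector` are made up for illustration, the vector constants are spelled out as array literals to stay on the safe side, and `float4` from `core.simd` is assumed to be supported on the target (e.g. x86_64).

```d
import core.simd;
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

// Empty sinks, as in the original test, so the computed value is "used".
void sinkArray(float[4] a) { }
void sinkVector(float4 a) { }

// Array-op version: element-wise a * 3 + 7 on a static float array.
float[4] scaleArray()
{
    float[4] a = 1.2f;
    a[] = a[] * 3 + 7;
    sinkArray(a);
    return a;
}

// Vector version: the same arithmetic on a hardware vector type.
float4 scaleVector()
{
    float4 a = [1.2f, 1.2f, 1.2f, 1.2f];
    float4 three = [3.0f, 3.0f, 3.0f, 3.0f];
    float4 seven = [7.0f, 7.0f, 7.0f, 7.0f];
    a = a * three + seven;
    sinkVector(a);
    return a;
}

void main()
{
    // benchmark returns one Duration per function, in argument order.
    auto r = benchmark!(scaleArray, scaleVector)(1_000_000);
    writeln("float[4]: ", r[0]);
    writeln("float4:   ", r[1]);
}
```

As later replies point out, with inlining and optimisation enabled the whole body may be folded away, so timings from a test this small should be read with care.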

Walter Bright <newshound2 digitalmars.com> writes:

On 1/14/2012 10:56 PM, Walter Bright wrote:
 as there's a serious lack of source code to
 test it with.

Here's what there is at the moment. Needs much more.

https://github.com/D-Programming-Language/dmd/blob/master/test/runnable/testxmm.d

Jan 14 2012

Peter Alexander <peter.alexander.au gmail.com> writes:

On 15/01/12 6:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.

You sure you want proper bug reports for this? There still seem to be a 
lot of issues. For example, none of these work for me (OSX 64-bit).

----

int4 a = 2; // backend/cod2.c 2630

----

int4 a = void;
int4 b = void;
a = b; // segfault

----

int4 a = void;
a = simd(XMM.PXOR, a, a); // segfault

----

I could go on and on really. Very little seems to work at my end.

Actually, looking at the auto-tester, I'm not alone. Just seems to be 
OSX though.

http://d.puremagic.com/test-results/index.ghtml

Jan 15 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/15/2012 3:49 AM, Peter Alexander wrote:
 Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX
 though.

Yeah, it's just OSX. I had the test for that platform inadvertently disabled,
gak.

Jan 15 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha quality.
 Expect bugs - but make bug reports, as there's a serious lack of source code
 to test it with.

I get 20+ speedup without optimisations with GDC on that small test. :)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 15 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com> wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha quality.
 Expect bugs - but make bug reports, as there's a serious lack of source code
 to test it with.

 I get 20+ speedup without optimisations with GDC on that small test. :)

Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
with -O2 and above.  My oh my...



-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 15 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/15/2012 10:10 AM, Iain Buclaw wrote:
 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...

Woo-hoo!

Jan 15 2012

bearophile <bearophileHUGS lycos.com> writes:

Iain Buclaw:

 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...

Please, show me the assembly code produced, with its relative D source :-)

Bye,
bearophile

Jan 15 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:

 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...

 Please, show me the assembly code produced, with its relative D source :-)

 Bye,
 bearophile

D code:
----
import core.simd;

void test2a(float4 a) { }

float4 test2()
{
   float4 a = 1.2;
   a = a * 3 + 7;
   test2a(a);
   return a;
}
----

Relevant assembly:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4

_D4test5test2FZNhG4f:
        .cfi_startproc
        movl    $3, %eax
        cvtsi2ss        %eax, %xmm0
        movb    $7, %al
        cvtsi2ss        %eax, %xmm1
        unpcklps        %xmm0, %xmm0
        unpcklps        %xmm1, %xmm1
        movlhps %xmm0, %xmm0
        movlhps %xmm1, %xmm1
        mulps   .LC5(%rip), %xmm0
        addps   %xmm1, %xmm0
        ret
        .cfi_endproc
----

As someone pointed out to me, the only optimisation missing was
constant propagation, but that doesn't matter too much for now.

Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 15 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 15 January 2012 19:01, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:

 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...

 Please, show me the assembly code produced, with its relative D source :-)

 Bye,
 bearophile

For those who can't read AT&T:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .align 16

_D4test5test2FZNhG4f:
        .cfi_startproc
        mov     eax, 3
        cvtsi2ss        xmm0, eax
        mov     al, 7
        cvtsi2ss        xmm1, eax
        unpcklps        xmm0, xmm0
        unpcklps        xmm1, xmm1
        movlhps xmm0, xmm0
        movlhps xmm1, xmm1
        mulps   xmm0, XMMWORD PTR .LC5[rip]
        addps   xmm0, xmm1
        ret
        .cfi_endproc
----


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 15 2012
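
For readers puzzling over the `.long 1067030938` entries in both listings: that value is the IEEE 754 single-precision bit pattern of `1.2f` (0x3F99999A), repeated four times to fill one 16-byte vector constant. A small sketch to check this, where `bitsOf` is a hypothetical helper name and not part of any library:

```d
import std.stdio : writefln;

// Reinterpret the bits of a float as a 32-bit unsigned integer.
uint bitsOf(float f)
{
    return *cast(uint*) &f;
}

void main()
{
    // Prints the decimal and hexadecimal form of the bit pattern of 1.2f.
    writefln("%d 0x%08X", bitsOf(1.2f), bitsOf(1.2f));
}
```

So the `mulps` against `.LC5` multiplies the broadcast 3.0 by four copies of 1.2, and the `addps` then adds the broadcast 7.0.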

Manu <turkeyman gmail.com> writes:

On 15 January 2012 20:10, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 15 January 2012 16:59, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 15 January 2012 06:56, Walter Bright <newshound2 digitalmars.com>

 wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha


 quality.
 Expect bugs - but make bug reports, as there's a serious lack of source


 code
 to test it with.

 I get 20+ speedup without optimisations with GDC on that small test. :)

 Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
 with -O2 and above.  My oh my...


Oh my indeed.
Haha, well I'm sure that's a fairly artificial result, but yes, this is why
I've been harping on for months that it's a bare necessity to provide
language support :P

Jan 15 2012

Andre Tampubolon <andre lc.vlsm.org> writes:

I just built 32 & 64 bit DMD (latest commit on git tree is
f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

Using the 32-bit version, I got this error:
Internal error: backend/cg87.c 1702

The 64-bit version went fine.

Previously, both 32 and 64 bit version had no problem.

On 01/15/2012 01:56 PM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;
 
 void test1a(float[4] a) { }
 
 void test1()
 {
     float[4] a = 1.2;
     a[] = a[] * 3 + 7;
     test1a(a);
 }
 
 void test2a(float4 a) { }
 
 void test2()
 {
     float4 a = 1.2;
     a = a * 3 + 7;
     test2a(a);
 }
 
 import std.stdio;
 import std.datetime;
 
 int main()
 {
     test1();
     test2();
     auto b = comparingBenchmark!(test1, test2, 100);
     writeln(b.point);
     return 0;
 }

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702

 The 64-bit version went fine.

 Previously, both 32 and 64 bit version had no problem.

Which machine?

Jan 16 2012

Andre Tampubolon <andre lc.vlsm.org> writes:

Well I only have 1 machine, a laptop running 64 bit Arch Linux.
Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
compiled fine using those.
But now, the 32 bit version fails.

Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
 
 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702
 
 The 64-bit version went fine.
 
 Previously, both 32 and 64 bit version had no problem.

 
 Which machine?

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

32 bit SIMD for Linux is not implemented.

It's all 64 bit platforms, and 32 bit OS X.

On 1/16/2012 2:35 AM, Andre Tampubolon wrote:
 Well I only have 1 machine, a laptop running 64 bit Arch Linux.
 Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
 compiled fine using those.
 But now, the 32 bit version fails.

 Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
 I just built 32 & 64 bit DMD (latest commit on git tree is
 f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

 Using the 32-bit version, I got this error:
 Internal error: backend/cg87.c 1702

 The 64-bit version went fine.

 Previously, both 32 and 64 bit version had no problem.

 Which machine?

Jan 16 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }

These two functions should have the same speed. The function that ought 
to be slower is:

void test1()
{
     float[5] a = 1.2;
     float[] b = a[1 .. $];
     b[] = b[] * 3 + 7;
     test1a(a);
}


Andrei

Jan 16 2012
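
Presumably the reasoning behind the `float[5]` variant is alignment: the slice `b` starts one float (4 bytes) into `a`, so `a` and `b` cannot both begin on a 16-byte boundary, and aligned vector loads cannot be assumed for `b`. A minimal sketch showing that offset, where `byteOffset` is a hypothetical helper name:

```d
import std.stdio : writeln;

// Byte distance between the start of a slice and the start of its parent array.
size_t byteOffset(const float[] slice, const float[] parent)
{
    return cast(size_t) slice.ptr - cast(size_t) parent.ptr;
}

void main()
{
    float[5] a = 1.2f;
    float[] b = a[1 .. $];        // starts at a[1], four elements long
    writeln(byteOffset(b, a[]));  // one float past the start of a
    writeln(b.length);
}
```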

Manu <turkeyman gmail.com> writes:

On 16 January 2012 18:17, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>
wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:

 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }

 These two functions should have the same speed.


A function using float arrays and a function using hardware vectors should
certainly not be the same speed.

Jan 16 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/12 10:46 AM, Manu wrote:
 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.

My point was that the version using float arrays should 
opportunistically use hardware ops whenever possible.

Andrei

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>
wrote:

 On 1/16/12 10:46 AM, Manu wrote:

 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.

 My point was that the version using float arrays should opportunistically
 use hardware ops whenever possible.


I think this is a mistake, because such a piece of code never exists
outside of some context. If the context it exists within is all FPU code
(and it is, it's a float array), then swapping between FPU and SIMD
execution units will probably result in the function being slower than the
original (also the float array is unaligned). The SIMD version however must
exist within a SIMD context, since the API can't implicitly interact with
floats, this guarantees that the context of each function matches that
within which it lives.
This is fundamental to fast vector performance. Using SIMD is an all or
nothing decision, you can't just mix it in here and there.
You don't go casting back and forth between floats and ints on every other
line... obviously it's imprecise, but it's also a major performance hazard.
There is no difference here, except the performance hazard is much worse.

Jan 16 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/16/2012 05:59 PM, Manu wrote:
 On 16 January 2012 18:48, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

     On 1/16/12 10:46 AM, Manu wrote:

         A function using float arrays and a function using hardware vectors
         should certainly not be the same speed.


     My point was that the version using float arrays should
     opportunistically use hardware ops whenever possible.


 I think this is a mistake, because such a piece of code never exists
 outside of some context. If the context it exists within is all FPU code
 (and it is, it's a float array), then swapping between FPU and SIMD
 execution units will probably result in the function being slower than
 the original (also the float array is unaligned). The SIMD version
 however must exist within a SIMD context, since the API can't implicitly
 interact with floats, this guarantees that the context of each function
 matches that within which it lives.
 This is fundamental to fast vector performance. Using SIMD is an all or
 nothing decision, you can't just mix it in here and there.
 You don't go casting back and forth between floats and ints on every
 other line... obviously it's imprecise, but it's also a major
 performance hazard. There is no difference here, except the performance
 hazard is much worse.

I think DMD now uses XMM registers for scalar floating point arithmetic 
on x86_64.

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 16 January 2012 19:01, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 01/16/2012 05:59 PM, Manu wrote:

 On 16 January 2012 18:48, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

    On 1/16/12 10:46 AM, Manu wrote:

        A function using float arrays and a function using hardware vectors
        should certainly not be the same speed.


    My point was that the version using float arrays should
    opportunistically use hardware ops whenever possible.


 I think this is a mistake, because such a piece of code never exists
 outside of some context. If the context it exists within is all FPU code
 (and it is, it's a float array), then swapping between FPU and SIMD
 execution units will probably result in the function being slower than
 the original (also the float array is unaligned). The SIMD version
 however must exist within a SIMD context, since the API can't implicitly
 interact with floats, this guarantees that the context of each function
 matches that within which it lives.
 This is fundamental to fast vector performance. Using SIMD is an all or
 nothing decision, you can't just mix it in here and there.
 You don't go casting back and forth between floats and ints on every
 other line... obviously it's imprecise, but it's also a major
 performance hazard. There is no difference here, except the performance
 hazard is much worse.

 I think DMD now uses XMM registers for scalar floating point arithmetic on
 x86_64.

x64 can do the swapping too with no penalty, but that is the only
architecture that can. So it might be a viable x64 optimisation, but only
for x64 codegen, which means any tech to detect and apply the optimisation
should live in the back end, not in the front end as a higher level
semantic.

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 9:21 AM, Manu wrote:
 x64 can do the swapping too with no penalty, but that is the only architecture
 that can.

Ah, that is a crucial bit of information.

Jan 16 2012

Michel Fortin <michel.fortin michelf.com> writes:

On 2012-01-16 16:59:44 +0000, Manu <turkeyman gmail.com> said:

 
 On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>
 wrote:

 
 On 1/16/12 10:46 AM, Manu wrote:
 
 A function using float arrays and a function using hardware vectors
 should certainly not be the same speed.

 
 My point was that the version using float arrays should opportunistically
 use hardware ops whenever possible.

 
 I think this is a mistake, because such a piece of code never exists
 outside of some context. If the context it exists within is all FPU code
 (and it is, it's a float array), then swapping between FPU and SIMD
 execution units will probably result in the function being slower than the
 original (also the float array is unaligned). The SIMD version however must
 exist within a SIMD context, since the API can't implicitly interact with
 floats, this guarantees that the context of each function matches that
 within which it lives.
 This is fundamental to fast vector performance. Using SIMD is an all or
 nothing decision, you can't just mix it in here and there.
 You don't go casting back and forth between floats and ints on every other
 line... obviously it's imprecise, but it's also a major performance hazard.
 There is no difference here, except the performance hazard is much worse.

Andrei's idea could be valid as an optimization when the compiler can 
see that all the operations can be performed with SIMD ops. In this 
particular case: if test1a(a) is inlined. But it can't work if the 
float[4] value crosses a function's boundary.

Or instead the optimization could be performed at the semantic level, 
like this: try to change the type of a variable float[4] to a float4, 
and if it can compile, use it instead. So if you have the same function 
working with a float[4] and a float4, and if all the functions you call 
on a given variable supports float4, it'll go for float4. But doing 
that at the semantic level would be rather messy, not counting the 
combinatorial explosion when multiple variables are at play.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 16 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/12 11:32 AM, Michel Fortin wrote:
 Andrei's idea could be valid as an optimization when the compiler can
 see that all the operations can be performed with SIMD ops. In this
 particular case: if test1a(a) is inlined. But it can't work if the
 float[4] value crosses a function's boundary.

In this case it's the exact contrary: the float[4] and the operation are 
both local to the function. So it all depends on the inlining of the 
dummy functions that follows. No?


Andrei

Jan 16 2012

Michel Fortin <michel.fortin michelf.com> writes:

On 2012-01-16 17:57:14 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/16/12 11:32 AM, Michel Fortin wrote:
 Andrei's idea could be valid as an optimization when the compiler can
 see that all the operations can be performed with SIMD ops. In this
 particular case: if test1a(a) is inlined. But it can't work if the
 float[4] value crosses a function's boundary.

 
 In this case it's the exact contrary: the float[4] and the operation 
 are both local to the function. So it all depends on the inlining of 
 the dummy functions that follows. No?

That's exactly what I meant, if everything is local to the function you 
might be able to optimize. In this particular case, if test1a(a) is 
inlined, everything is local.

But the current example has too much isolation for it to be meaningful. 
If you returned the result as a float[4] then the optimization doesn't 
work. If you took an argument as a float[4] it probably wouldn't work 
either (depending on what you do with the argument). So I don't think 
it's an optimization you should count on very much.

In fact, the optimization I'd expect the compiler to do in this case is 
just wipe out all the code, as it does nothing other than putting a 
value in a local variable which is never reused later.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 16 January 2012 21:27, Michel Fortin <michel.fortin michelf.com> wrote:

 In fact, the optimization I'd expect the compiler to do in this case is
 just wipe out all the code, as it does nothing other than putting a value
 in a local variable which is never reused later.

Yes, my first thought when I saw this test was "why is it generating any
code at all?".. But I tried to forget about that :)
I am curious though, what is causing that code (on both sides) to not be
eliminated? If I write that in C, I'm sure it would generate nothing. Is
this a language implementation bug somehow?

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 12:22 PM, Manu wrote:
 Yes, my first thought when I saw this test was "why is it generating any code
 at all?".. But I tried to forget about that :)
 I am curious though, what is causing that code (on both sides) to not be
 eliminated? If I write that in C, I'm sure it would generate nothing. Is this a
 language implementation bug somehow?

Compile with inlining off, and the compiler 'forgets' what the called function 
does, so it must call it.

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 8:59 AM, Manu wrote:
 (also the float array is
 unaligned).

Currently, it is 4 byte aligned. But the compiler could align freestanding 
static arrays on 16 bytes without breaking anything. It just cannot align:

    struct S
    {
         int a;
         float[4] b;
    }

b on a 16 byte boundary, as that would break the ABI. Even worse,

    struct S
    {
        int a;
        char[16] s;
    }

can't be aligned on 16 bytes as that is a common "small string optimization".

Jan 16 2012
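
The layout constraint described above can be checked directly. This is only an illustrative sketch: the names `S1` and `S2` stand in for the two `struct S` examples, and the numbers assume the usual 4-byte `int` and `float`.

```d
import std.stdio : writeln;

struct S1
{
    int a;
    float[4] b;   // follows a directly, at offset 4
}

struct S2
{
    int a;
    char[16] s;   // same story for a 16-byte char buffer
}

// The array members sit right after the int, so they are only 4-byte aligned.
static assert(S1.b.offsetof == 4);
static assert(S1.sizeof == 20);
static assert(S1.alignof == 4);
static assert(S2.s.offsetof == 4);
static assert(S2.sizeof == 20);

void main()
{
    writeln(S1.b.offsetof, " ", S1.sizeof, " ", S1.alignof);
}
```

Forcing `b` onto a 16-byte boundary would insert 12 bytes of padding after `a` and grow the struct, which is the ABI break mentioned in the post.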

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
 My point was that the version using float arrays should opportunistically use
 hardware ops whenever possible.

Yes, you're right. The compiler can opportunistically convert a number of vector 
operations on static arrays to the SIMD instructions.

Now that the basics are there, there are many, many opportunities to improve the 
code generation. Even for things like:

   int i,j;
   i *= 3;
   foo();
   j *= 3;

the two multiplies can be combined. Also, if operations on a particular integer 
variable are a subset that is supported by SIMD, that variable could be 
enregistered in an XMM register, instead of a GP register.

But don't worry, I'm not planning on working on that at the moment :-)

Jan 16 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 16 January 2012 18:59, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
 My point was that the version using float arrays should opportunistically
 use hardware ops whenever possible.

 Yes, you're right. The compiler can opportunistically convert a number of
 vector operations on static arrays to the SIMD instructions.

 Now that the basics are there, there are many, many opportunities to improve
 the code generation. Even for things like:

  int i,j;
  i *= 3;
  foo();
  j *= 3;

 the two multiplies can be combined. Also, if operations on a particular
 integer variable are a subset that is supported by SIMD, that variable could
 be enregistered in an XMM register, instead of a GP register.

 But don't worry, I'm not planning on working on that at the moment :-)

Leave that sort of optimisation for the backend to handle please. ;-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)

Of course.

I suspect Intel's compiler does that one, does gcc?

Jan 16 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)


 Of course.

 I suspect Intel's compiler does that one, does gcc?

There's auto-vectorisation for for(), foreach(), and foreach_reverse()
loops that I have written support for.  I am not aware of GCC
vectorising anything else.

example:

int a[256], b[256], c[256];
void foo () {
  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
}

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 16 2012
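
The example in the post above is written with C-style declarations. A D spelling of the same loop, using the `foreach` form mentioned there, might look like this. It is only a sketch; whether a given compiler and flag combination actually vectorises it is an assumption, not something the code can show.

```d
import std.stdio : writeln;

// Module-level fixed-size arrays (thread-local by default in D).
int[256] a, b, c;

// Element-wise add over a fixed trip count with no pointer parameters,
// which is the shape of loop an auto-vectoriser can usually handle.
void foo()
{
    foreach (i; 0 .. 256)
        a[i] = b[i] + c[i];
}

void main()
{
    b[] = 1;
    c[] = 2;
    foo();
    writeln(a[0], " ", a[255]);
}
```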

Peter Alexander <peter.alexander.au gmail.com> writes:

On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)


 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
    for (int i=0; i<256; i++)
      a[i] = b[i] + c[i];
 }

Unfortunately, if the function was this:

void foo(int[] a, int[] b, int[] c) {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
}

Then it can't vectorize due to aliasing.

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 16 January 2012 23:57, Peter Alexander <peter.alexander.au gmail.com>wrote:

 On 16/01/12 8:56 PM, Iain Buclaw wrote:

 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/16/2012 11:16 AM, Iain Buclaw wrote:


 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)


 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

This is why D needs a __restrict attribute! ;)

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 1:54 PM, Manu wrote:
     Unfortunately, if the function was this:

     void foo(int[] a, int[] b, int[] c) {

       for (int i=0; i<256; i++)
         a[i] = b[i] + c[i];
     }

     Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)

That's why D has:

    a[] = b[] + c[];

because the language requires the arrays to be distinct.
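
For reference, a minimal sketch of the two forms side by side. The function names `addLoop` and `addVectorOp` are made up for illustration, and the length assert is an assumption of this sketch, not something stated in the thread:

```d
import std.stdio;

// Element-by-element loop: the slices may legally overlap, so the
// compiler has to preserve the exact order of loads and stores.
void addLoop(int[] a, const int[] b, const int[] c)
{
    foreach (i; 0 .. a.length)
        a[i] = b[i] + c[i];
}

// Array operation: the spec forbids overlap between the lvalue slice and
// the rvalue slices, so the implementation is free to vectorise it.
void addVectorOp(int[] a, const int[] b, const int[] c)
{
    assert(a.length == b.length && a.length == c.length);
    a[] = b[] + c[];
}

void main()
{
    auto b = [1, 2, 3, 4];
    auto c = [10, 20, 30, 40];
    auto viaLoop = new int[4];
    auto viaOp = new int[4];

    addLoop(viaLoop, b, c);
    addVectorOp(viaOp, b, c);
    writeln(viaLoop, " ", viaOp);
}
```

With distinct arrays, as here, both forms compute the same result; they only differ in what the compiler is allowed to assume.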

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/16/2012 1:54 PM, Manu wrote:

    Unfortunately, if the function was this:

    void foo(int[] a, int[] b, int[] c) {

      for (int i=0; i<256; i++)
        a[i] = b[i] + c[i];
    }

    Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)

 That's why D has:

   a[] = b[] + c[];

 because the language requires the arrays to be distinct.

Surely it would be possible for them to be overlapping slices?

Jan 16 2012

Simen Kjærås <simen.kjaras gmail.com> writes:

On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:

 On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com>  
 wrote:

 On 1/16/2012 1:54 PM, Manu wrote:

    Unfortunately, if the function was this:

    void foo(int[] a, int[] b, int[] c) {

      for (int i=0; i<256; i++)
        a[i] = b[i] + c[i];
    }

    Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)

 That's why D has:

   a[] = b[] + c[];

 because the language requires the arrays to be distinct.

 Surely it would be possible for them to be overlapping slices?

If they are, that's your fault and your problem.

"The lvalue slice and any rvalue slices must not overlap."

Jan 16 2012

Simen Kjærås <simen.kjaras gmail.com> writes:

On Mon, 16 Jan 2012 23:22:21 +0100, Simen Kjærås <simen.kjaras gmail.com> wrote:

 On Mon, 16 Jan 2012 23:06:12 +0100, Manu <turkeyman gmail.com> wrote:

 On 17 January 2012 00:03, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/16/2012 1:54 PM, Manu wrote:

    Unfortunately, if the function was this:

    void foo(int[] a, int[] b, int[] c) {

      for (int i=0; i<256; i++)
        a[i] = b[i] + c[i];
    }

    Then it can't vectorize due to aliasing.


 This is why D needs a __restrict attribute! ;)

 That's why D has:

   a[] = b[] + c[];

 because the language requires the arrays to be distinct.

 Surely it would be possible for them to be overlapping slices?

 If they are, that's your fault and your problem.

 "The lvalue slice and any rvalue slices must not overlap."

Sorry, forgot the link:

http://www.d-programming-language.org/arrays.html#array-operations

Jan 16 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2012 2:06 PM, Manu wrote:
 Surely it would be possible for them to be overlapping slices?

Not allowed, treated like array bounds checking.

Jan 16 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)



 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

Compile with -fstrict-aliasing then?

I could certainly play about with having this enabled by default, but
I foresee there may be issues (maybe have it on for @safe code?)


Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 16 2012

Peter Alexander <peter.alexander.au gmail.com> writes:

On 16/01/12 10:36 PM, Iain Buclaw wrote:
 On 16 January 2012 21:57, Peter Alexander<peter.alexander.au gmail.com>  wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>
   wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)


 Leave that sort of optimisation for the backend to handle please. ;-)



 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
    for (int i=0; i<256; i++)
      a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

 Compile with -fstrict-aliasing then?

This has nothing to do with strict aliasing.

a[257];
foo(a[1..257], a[0..256], a[0..256]);

This doesn't break any strict aliasing rule, but the loop still cannot 
be (trivially) vectorized.

As Manu said, you need something like __restrict (or a linear type 
system) to solve this problem.

http://en.wikipedia.org/wiki/Linear_type_system
http://en.wikipedia.org/wiki/Uniqueness_typing
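
To make the overlapping call concrete, here is a scaled-down sketch of the same situation, using 5 elements instead of 257 so the values are easy to follow. The helper name `overlappingCall` is hypothetical, and the loop uses `a.length` instead of the hard-coded 256:

```d
import std.stdio;

// Same shape as the function under discussion, but sized by the slice.
void foo(int[] a, int[] b, int[] c)
{
    foreach (i; 0 .. a.length)
        a[i] = b[i] + c[i];
}

// The destination is the source shifted by one element, so every store
// feeds the next iteration's loads.
int[5] overlappingCall()
{
    int[5] buf = [1, 0, 0, 0, 0];
    foo(buf[1 .. 5], buf[0 .. 4], buf[0 .. 4]);
    return buf;
}

void main()
{
    writeln(overlappingCall()); // [1, 2, 4, 8, 16]
}
```

All three slices have the same element type, so no type-based aliasing rule is involved. A version that loaded all four inputs before storing anything would produce [1, 2, 0, 0, 0] instead, which is why the loop cannot be vectorised blindly.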

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type system) to
 solve this problem.

No, you don't. It can be done with a runtime check, like array bounds checking 
is done.

Jan 17 2012

Peter Alexander <peter.alexander.au gmail.com> writes:

On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.

 No, you don't. It can be done with a runtime check, like array bounds
 checking is done.

So you'd change it to this, even in release builds?

void foo(int[] a, int[] b, int[] c)
{
   if ( /* arrays overlap */ )
   {
     foreach(i; 0..256)
       a[i] = b[i] + c[i];
   }
   else
   {
     /* vectorized code */
   }
}


i.e. duplicate all loops that can be potentially vectorized depending on 
aliasing? Please bear in mind that this is a simple example.

Seems a bit inefficient (code size).
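
The versioned function above could be spelled out roughly as follows. This is only a sketch of the idea under discussion, not something any D compiler is known to generate; the `overlaps` helper is hypothetical, and the array operation stands in for the "vectorized code" placeholder:

```d
// Hypothetical helper: true if the two slices share any memory.
bool overlaps(const int[] x, const int[] y)
{
    return x.length != 0 && y.length != 0
        && x.ptr < y.ptr + y.length
        && y.ptr < x.ptr + x.length;
}

void foo(int[] a, int[] b, int[] c)
{
    assert(a.length == b.length && a.length == c.length);

    // Only overlap with the destination matters; b and c are just read.
    if (overlaps(a, b) || overlaps(a, c))
    {
        // Slow path: keep the original element order.
        foreach (i; 0 .. a.length)
            a[i] = b[i] + c[i];
    }
    else
    {
        // Fast path: array operation, which may be vectorised.
        a[] = b[] + c[];
    }
}
```

The cost being pointed out is visible here: both loop bodies end up in the binary, plus the pointer comparisons on every call.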

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.

 No, you don't. It can be done with a runtime check, like array bounds
 checking is done.

 So you'd change it to this, even in release builds?

No. Like array bounds, if they overlap, an exception is thrown.

Remember, the D spec says that overlapping arrays are illegal.

Jan 17 2012

Peter Alexander <peter.alexander.au gmail.com> writes:

On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.

 No, you don't. It can be done with a runtime check, like array bounds
 checking is done.

 So you'd change it to this, even in release builds?

 No. Like array bounds, if they overlap, an exception is thrown.

 Remember, the D spec says that overlapping arrays are illegal.

The D spec says that overlapping arrays are illegal for vector ops. The 
foo(int[], int[], int[]) function does not use vector ops.

Or am I missing something really major?

For example, is this legal code?

int[100] a;
int[] b = a[0..100];
int[] c = a[10..90]; // Illegal? b and c overlap...

foreach (i; 0..80)
     c[i] = b[i]; // Illegal?

I know that b[] = c[] would be illegal, but that has nothing to do with 
the prior discussion.

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 3:23 PM, Peter Alexander wrote:
 On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.

 No, you don't. It can be done with a runtime check, like array bounds
 checking is done.

 So you'd change it to this, even in release builds?

 No. Like array bounds, if they overlap, an exception is thrown.

 Remember, the D spec says that overlapping arrays are illegal.

 The D spec says that overlapping arrays are illegal for vector ops. The
 foo(int[], int[], int[]) function does not use vector ops.

 Or am I missing something really major?

 For example, is this legal code?

 int[100] a;
 int[] b = a[0..100];
 int[] c = a[10..90]; // Illegal? b and c overlap...

No, not illegal.

 foreach (i; 0..80)
 c[i] = b[i]; // Illegal?

No, not illegal.

 I know that b[] = c[] would be illegal, but that has nothing to do with the
 prior discussion.

Yes, b[]=c[] is illegal.

Jan 17 2012

Peter Alexander <peter.alexander.au gmail.com> writes:

On 17/01/12 11:34 PM, Walter Bright wrote:
 On 1/17/2012 3:23 PM, Peter Alexander wrote:
 On 17/01/12 10:55 PM, Walter Bright wrote:
 On 1/17/2012 1:47 PM, Peter Alexander wrote:
 On 17/01/12 9:24 PM, Walter Bright wrote:
 On 1/17/2012 1:20 PM, Peter Alexander wrote:
 As Manu said, you need something like __restrict (or a linear type
 system) to
 solve this problem.

 No, you don't. It can be done with a runtime check, like array bounds
 checking is done.

 So you'd change it to this, even in release builds?

 No. Like array bounds, if they overlap, an exception is thrown.

 Remember, the D spec says that overlapping arrays are illegal.

 The D spec says that overlapping arrays are illegal for vector ops. The
 foo(int[], int[], int[]) function does not use vector ops.

 Or am I missing something really major?

 For example, is this legal code?

 int[100] a;
 int[] b = a[0..100];
 int[] c = a[10..90]; // Illegal? b and c overlap...

 No, not illegal.

 foreach (i; 0..80)
 c[i] = b[i]; // Illegal?

 No, not illegal.

 I know that b[] = c[] would be illegal, but that has nothing to do
 with the
 prior discussion.

 Yes, b[]=c[] is illegal.

So, my original point still stands, you can't vectorise this function:

void foo(int[] a, int[] b, int[] c)
{
   foreach (i; 0..256)
     a[i] = b[i] + c[i];
}

Those slices are allowed to overlap, so this cannot be automatically 
vectorised (without inlining to get better context about those arrays).

Without inlining, you need something along the lines of __restrict or 
uniqueness typing.

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 4:19 PM, Peter Alexander wrote:
 So, my original point still stands, you can't vectorise this function:

 void foo(int[] a, int[] b, int[] c)
 {
 foreach (i; 0..256)
 a[i] = b[i] + c[i];
 }

 Those slices are allowed to overlap, so this cannot be automatically vectorised
 (without inlining to get better context about those arrays).

 Without inlining, you need something along the lines of __restrict or
uniqueness
 typing.

No, you can rewrite it as:

    a[] = b[] + c[];

and you don't need __restrict or uniqueness. That's what the vector operations 
are for.

Jan 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/18/2012 02:04 AM, Walter Bright wrote:
 On 1/17/2012 4:19 PM, Peter Alexander wrote:
 So, my original point still stands, you can't vectorise this function:

 void foo(int[] a, int[] b, int[] c)
 {
 foreach (i; 0..256)
 a[i] = b[i] + c[i];
 }

 Those slices are allowed to overlap, so this cannot be automatically
 vectorised
 (without inlining to get better context about those arrays).

 Without inlining, you need something along the lines of __restrict or
 uniqueness
 typing.

 No, you can rewrite it as:

 a[] = b[] + c[];

 and you don't need __restrict or uniqueness. That's what the vector
 operations are for.

Are they really a general solution? How do you use vector ops to 
implement an efficient matrix multiply, for instance?

Jan 17 2012

"F i L" <witte2008 gmail.com> writes:

Timon Gehr wrote:
 Are they really a general solution? How do you use vector ops 
 to implement an efficient matrix multiply, for instance?

struct Matrix4
{
    float4 x, y, z, w;

    auto transform(Matrix4 mat)
    {
        Matrix4 rmat;

        float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x};
        float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y};
        float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z};
        float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w};

        float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w};
        float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w};
        float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w};
        float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w};

        rmat.x = cx * rx; // simd
        rmat.y = cy * ry; // simd
        rmat.z = cz * rz; // simd
        rmat.w = cw * rw; // simd

        return rmat;
    }
}

Jan 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/18/2012 02:32 AM, F i L wrote:
 Timon Gehr wrote:
 Are they really a general solution? How do you use vector ops to
 implement an efficient matrix multiply, for instance?

 struct Matrix4
 {
 float4 x, y, z, w;

 auto transform(Matrix4 mat)
 {
 Matrix4 rmat;

 float4 cx = {mat.x.x, mat.y.x, mat.z.x, mat.w.x};
 float4 cy = {mat.x.y, mat.y.y, mat.z.y, mat.w.y};
 float4 cz = {mat.x.z, mat.y.z, mat.z.z, mat.w.z};
 float4 cw = {mat.x.w, mat.y.w, mat.z.w, mat.w.w};

 float4 rx = {mat.x.x, mat.x.y, mat.x.z, mat.x.w};
 float4 ry = {mat.y.x, mat.y.y, mat.y.z, mat.y.w};
 float4 rz = {mat.z.x, mat.z.y, mat.z.z, mat.z.w};
 float4 rw = {mat.w.x, mat.w.y, mat.w.z, mat.w.w};

 rmat.x = cx * rx; // simd
 rmat.y = cy * ry; // simd
 rmat.z = cz * rz; // simd
 rmat.w = cw * rw; // simd

 return rmat;
 }
 }

The parameter is just squared and returned?

Anyway, I was after a general matrix*matrix multiplication, where the 
operands can get arbitrarily large and where any potential use of 
__restrict is rendered unnecessary by array vector ops.

Jan 17 2012

"a" <a a.com> writes:

On Wednesday, 18 January 2012 at 01:50:00 UTC, Timon Gehr wrote:

 Anyway, I was after a general matrix*matrix multiplication, 
 where the operands can get arbitrarily large and where any 
 potential use of __restrict is rendered unnecessary by array 
 vector ops.

Here you go. But I agree there are use cases for restrict where 
vector operations don't help

void matmul(A,B,C)(A a, B b, C c, size_t n, size_t m, size_t l)
{
    for(size_t i = 0; i < n; i++)
    {
        c[i*l..i*l + l] = 0;
        for(size_t j = 0; j < m; j++)
            c[i*l..i*l + l] += a[i*m + j] * b[j*l..j*l + l];
    }
}
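
A small usage sketch of the same row-by-row scheme, assuming row-major storage in flat arrays. The non-template variant `matmulRows` and the 2x2 inputs are made up for illustration, and the slices are written with an explicit `[]` to mark the array operations:

```d
import std.stdio;

// c (n x l) = a (n x m) * b (m x l), all stored row-major in flat arrays.
// c must not overlap a or b, as required for array operations.
void matmulRows(const double[] a, const double[] b, double[] c,
                size_t n, size_t m, size_t l)
{
    foreach (i; 0 .. n)
    {
        double[] row = c[i*l .. i*l + l];
        row[] = 0;
        foreach (j; 0 .. m)
        {
            const(double)[] bRow = b[j*l .. j*l + l];
            // One scalar from a times a whole row of b.
            row[] += a[i*m + j] * bRow[];
        }
    }
}

void main()
{
    double[] a = [1, 2, 3, 4];   // [[1, 2], [3, 4]]
    double[] b = [5, 6, 7, 8];   // [[5, 6], [7, 8]]
    auto c = new double[4];

    matmulRows(a, b, c, 2, 2, 2);
    writeln(c); // [19, 22, 43, 50]
}
```

The inner statement only ever touches one row of `b` and one row of `c`, which is what lets each step be expressed as a vector operation.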

Jan 17 2012

"F i L" <witte2008 gmail.com> writes:

Timon Gehr wrote:
 The parameter is just squared and returned?

No, sorry that code is all screwed up and missing a step.
My Matrix multiply code looks like this:

auto transform(U)(Matrix4!U m) if (isImplicitlyConvertible!(U, T))
{
    return Matrix4 (
        Vector4 (
            (m.x.x*x.x) + (m.x.y*y.x) + (m.x.z*z.x) + (m.x.w*w.x),
            (m.x.x*x.y) + (m.x.y*y.y) + (m.x.z*z.y) + (m.x.w*w.y),
            (m.x.x*x.z) + (m.x.y*y.z) + (m.x.z*z.z) + (m.x.w*w.z),
            (m.x.x*x.w) + (m.x.y*y.w) + (m.x.z*z.w) + (m.x.w*w.w)
        ),
        Vector4 (
            (m.y.x*x.x) + (m.y.y*y.x) + (m.y.z*z.x) + (m.y.w*w.x),
            (m.y.x*x.y) + (m.y.y*y.y) + (m.y.z*z.y) + (m.y.w*w.y),
            (m.y.x*x.z) + (m.y.y*y.z) + (m.y.z*z.z) + (m.y.w*w.z),
            (m.y.x*x.w) + (m.y.y*y.w) + (m.y.z*z.w) + (m.y.w*w.w)
        ),
        Vector4 (
            (m.z.x*x.x) + (m.z.y*y.x) + (m.z.z*z.x) + (m.z.w*w.x),
            (m.z.x*x.y) + (m.z.y*y.y) + (m.z.z*z.y) + (m.z.w*w.y),
            (m.z.x*x.z) + (m.z.y*y.z) + (m.z.z*z.z) + (m.z.w*w.z),
            (m.z.x*x.w) + (m.z.y*y.w) + (m.z.z*z.w) + (m.z.w*w.w)
        ),
        Vector4 (
            (m.w.x*x.x) + (m.w.y*y.x) + (m.w.z*z.x) + (m.w.w*w.x),
            (m.w.x*x.y) + (m.w.y*y.y) + (m.w.z*z.y) + (m.w.w*w.y),
            (m.w.x*x.z) + (m.w.y*y.z) + (m.w.z*z.z) + (m.w.w*w.z),
            (m.w.x*x.w) + (m.w.y*y.w) + (m.w.z*z.w) + (m.w.w*w.w)
        )
    );
}


It would need to be converted to something more like my previous example in
order for SIMD to kick in. IDK if D's compiler is good enough to
optimize the above code into SIMD ops, but I doubt it.


 Anyway, I was after a general matrix*matrix multiplication, 
 where the operands can get arbitrarily large and where any 
 potential use of __restrict is rendered unnecessary by array 
 vector ops.

I don't know enough about simd to confidently discuss this, but 
I'd imagine there'd have to be quite a lot of compiler magic 
happening before arbitrarily sized matrix constructs could make 
use of simd.

Jan 17 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)



 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

 Compile with -fstrict-aliasing then?

 I could certainly play about with having this enabled by default, but
 I foresee there may be issues (maybe have it on for @safe code?)


 Regards
 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';

OK, have turned on strict aliasing by default for D2.  You should now
be able to vectorise loops that use locals and parameters. :-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com>

 wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright<newshound2 digitalmars.com>
  wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)



 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

 Compile with -fstrict-aliasing then?

 I could certainly play about with having this enabled by default, but
 I foresee there may be issues (maybe have it on for @safe code?)

 OK, have turned on strict aliasing by default for D2.  You should now
 be able to vectorise loops that use locals and parameters. :-)

What protects these ranges from being overlapping? What if they were
sourced from pointers?
Are we just to say in D that aliasing is not allowed, and 'you shouldn't do
it'? People almost never alias intentionally, it's usually the
most insidious of bugs. :/

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 12:17 AM, Manu wrote:
 What protects these ranges from being overlapping?

A runtime check, like array bounds checking.

Jan 17 2012

Manu <turkeyman gmail.com> writes:

On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com> wrote:

 On 1/17/2012 12:17 AM, Manu wrote:

 What protects these ranges from being overlapping?

 A runtime check, like array bounds checking.

Awesome.
How does this map to pointer dereferencing?

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 2:43 AM, Manu wrote:
 On 17 January 2012 12:33, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 1/17/2012 12:17 AM, Manu wrote:

         What protects these ranges from being overlapping?


     A runtime check, like array bounds checking.


 Awesome.
 How does this map to pointer dereferencing?

It can't. Use dynamic arrays - that's what they're for.

Jan 17 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 17 January 2012 08:17, Manu <turkeyman gmail.com> wrote:
 On 17 January 2012 03:56, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 22:36, Iain Buclaw <ibuclaw ubuntu.com> wrote:
 On 16 January 2012 21:57, Peter Alexander <peter.alexander.au gmail.com> wrote:
 On 16/01/12 8:56 PM, Iain Buclaw wrote:
 On 16 January 2012 19:25, Walter Bright <newshound2 digitalmars.com> wrote:
 On 1/16/2012 11:16 AM, Iain Buclaw wrote:

 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)



 Of course.

 I suspect Intel's compiler does that one, does gcc?

 There's auto-vectorisation for for(), foreach(), and foreach_reverse()
 loops that I have written support for.  I am not aware of GCC
 vectorising anything else.

 example:

 int a[256], b[256], c[256];
 void foo () {
   for (int i=0; i<256; i++)
     a[i] = b[i] + c[i];
 }

 Unfortunately, if the function was this:

 void foo(int[] a, int[] b, int[] c) {

  for (int i=0; i<256; i++)
    a[i] = b[i] + c[i];
 }

 Then it can't vectorize due to aliasing.

 Compile with -fstrict-aliasing then?

 I could certainly play about with having this enabled by default, but
 I foresee there may be issues (maybe have it on for @safe code?)

 OK, have turned on strict aliasing by default for D2.  You should now
 be able to vectorise loops that use locals and parameters. :-)


 What protects these ranges from being overlapping? What if they were sourced
 from pointers?
 Are we just to say in D that aliasing is not allowed, and 'you shouldn't do
 it'? People almost never alias intentionally, it's usually the
 most insidious of bugs. :/

D arrays have a .length property that keeps track of the length of the
array. When array bounds checking is turned on (default when not
compiling with -release) an assert is produced when you step outside
the bounds of the array.


Is this what you mean?

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 17 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 16 Jan 2012 20:25:28 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/16/2012 11:16 AM, Iain Buclaw wrote:
 But don't worry, I'm not planning on working on that at the moment :-)

 Leave that sort of optimisation for the backend to handle please. ;-)

 Of course.

 I suspect Intel's compiler does that one, does gcc?

Thought of that too, but it's rather tough to manage slots in vector  
registers.
Could probably dust off Don's BLADE library.


It seems that gcc and icc are limited to loop optimization.

http://gcc.gnu.org/projects/tree-ssa/vectorization.html
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/

Jan 16 2012

bearophile <bearophileHUGS lycos.com> writes:

Walter:

 But don't worry, I'm not planning on working on that at the moment :-)

Until better optimizations are implemented, I see a "simple" optimization for
vector ops. When the compiler knows the arrays are very small it unrolls the
operation in-place:

int n = 5;
auto a = new int[n];
auto b = new int[n];
a[] += b[];

==>

int n = 5;
auto a = new int[n]; // a and b are dynamic arrays,
auto b = new int[n]; // but their length is easy to find at compile-time
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
a[3] += b[3];
a[4] += b[4];

Bye,
bearophile

Jan 16 2012

Manu <turkeyman gmail.com> writes:

On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 But don't worry, I'm not planning on working on that at the moment :-)

 Until better optimizations are implemented, I see a "simple" optimization
 for vector ops. When the compiler knows the arrays are very small it unrolls
 the operation in-place:

 int n = 5;
 auto a = new int[n];
 auto b = new int[n];
 a[] += b[];

 ==>

 int n = 5;
 auto a = new int[n]; // a and b are dynamic arrays,
 auto b = new int[n]; // but their length is easy to find at compile-time
 a[0] += b[0];
 a[1] += b[1];
 a[2] += b[2];
 a[3] += b[3];
 a[4] += b[4];

If this doesn't already exist, I think it's quite important. I was thinking
about needing to repeatedly specialise a template last night for a bunch of
short lengths of arrays, for this exact reason.
Unrolling short loops must be one of the most trivial and worthwhile
optimisations...

Jan 17 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote:

 On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:

 Walter:

 But don't worry, I'm not planning on working on that at the moment :-)

 Until better optimizations are implemented, I see a "simple" optimization
 for vector ops. When the compiler knows the arrays are very small it unrolls
 the operation in-place:

 int n = 5;
 auto a = new int[n];
 auto b = new int[n];
 a[] += b[];

 ==>

 int n = 5;
 auto a = new int[n]; // a and b are dynamic arrays,
 auto b = new int[n]; // but their length is easy to find at compile-time
 a[0] += b[0];
 a[1] += b[1];
 a[2] += b[2];
 a[3] += b[3];
 a[4] += b[4];

 If this doesn't already exist, I think it's quite important. I was thinking
 about needing to repeatedly specialise a template last night for a bunch of
 short lengths of arrays, for this exact reason.
 Unrolling short loops must be one of the most trivial and worthwhile
 optimisations...

If the compiler knows it's a compile-time constant,
then you could use a static foreach.
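
As a rough illustration of that suggestion: the sketch below uses the `static foreach` statement that D gained in later compiler releases (it did not exist when this thread was written), and the function name `addUnrolled` is hypothetical. The length is a template parameter, so the loop is expanded at compile time:

```d
import std.stdio;

// N is known at compile time, so static foreach expands the body into
// N separate statements: a[0] += b[0]; a[1] += b[1]; ...
void addUnrolled(size_t N)(ref int[N] a, const int[N] b)
{
    static foreach (i; 0 .. N)
        a[i] += b[i];
}

void main()
{
    int[5] a = [1, 2, 3, 4, 5];
    int[5] b = [10, 20, 30, 40, 50];

    addUnrolled(a, b);
    writeln(a); // [11, 22, 33, 44, 55]
}
```

This only covers fixed-size arrays; for dynamic arrays whose length merely happens to be constant, the unrolling would still have to come from the optimiser.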

Jan 17 2012

Manu <turkeyman gmail.com> writes:

On 17 January 2012 11:48, Martin Nowak <dawg dawgfoto.de> wrote:

 On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman gmail.com> wrote:

  On 17 January 2012 05:55, bearophile <bearophileHUGS lycos.com> wrote:
  Walter:
 But don't worry, I'm not planning on working on that at the moment :-)

 Until better optimizations are implemented, I see a "simple" optimization
 for vector ops. When the compiler knows the arrays are very small it unrolls
 the operation in-place:

 int n = 5;
 auto a = new int[n];
 auto b = new int[n];
 a[] += b[];

 ==>

 int n = 5;
 auto a = new int[n]; // a and b are dynamic arrays,
 auto b = new int[n]; // but their length is easy to find at compile-time
 a[0] += b[0];
 a[1] += b[1];
 a[2] += b[2];
 a[3] += b[3];
 a[4] += b[4];

 If this doesn't already exist, I think it's quite important. I was thinking
 about needing to repeatedly specialise a template last night for a bunch of
 short lengths of arrays, for this exact reason.
 Unrolling short loops must be one of the most trivial and worthwhile
 optimisations...

 If the compiler knows it's a compile-time constant,
 then you could use a static foreach.

Great idea! :)

Jan 17 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }

 These two functions should have the same speed. The function that ought  
 to be slower is:

 void test1()
 {
      float[5] a = 1.2;
      float[] b = a[1 .. $];
      b[] = b[] * 3 + 7;
      test1a(a);
 }


 Andrei

Unfortunately druntime's array ops are a mess and fail
to speed up anything below 16 floats.
Additionally there is overhead for a function call and
they have to check alignment at runtime.

martin
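
To illustrate the runtime alignment check mentioned above, here is a minimal sketch. The helper `isAligned16` is hypothetical and is not druntime's actual code; it only shows the kind of test a runtime routine has to make before using aligned SSE loads on a slice:

```d
import std.stdio;

// True if the slice starts on a 16-byte boundary, which is what an
// aligned load or store of four floats requires.
bool isAligned16(const float[] s)
{
    return (cast(size_t) s.ptr & 15) == 0;
}

void main()
{
    float[5] a = 1.2f;
    float[] whole = a[];
    float[] tail = a[1 .. $];   // starts 4 bytes after `whole`

    // The two starts differ by 4 bytes, so at most one of them can sit
    // on a 16-byte boundary.
    writeln(isAligned16(whole), " ", isAligned16(tail));
    assert(!(isAligned16(whole) && isAligned16(tail)));
}
```

This is also why the `a[1 .. $]` variant quoted above is the one expected to be slower: a slice can start at any element, so its alignment is only known at run time.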

Jan 16 2012

Don Clugston <dac nospam.com> writes:

On 16/01/12 17:51, Martin Nowak wrote:
 On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }

 These two functions should have the same speed. The function that
 ought to be slower is:

 void test1()
 {
 float[5] a = 1.2;
 float[] b = a[1 .. $];
 b[] = b[] * 3 + 7;
 test1a(a);
 }


 Andrei

 Unfortunately druntime's array ops are a mess and fail
 to speed up anything below 16 floats.
 Additionally there is overhead for a function call and
 they have to check alignment at runtime.

 martin

Yes. The structural problem in the compiler is that array ops get turned 
into function calls far too early. It happens in the semantic pass, but 
it shouldn't happen in the front-end at all -- it should be done in the 
glue layer, at the beginning of code generation.

Incidentally, this is the reason that CTFE doesn't work with array ops.

Jan 17 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac nospam.com> wrote:

 On 16/01/12 17:51, Martin Nowak wrote:
 On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/12 12:56 AM, Walter Bright wrote:
 I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
 Anyhow, it's good enough now to play around with. Consider it alpha
 quality. Expect bugs - but make bug reports, as there's a serious lack
 of source code to test it with.
 -----------------------
 import core.simd;

 void test1a(float[4] a) { }

 void test1()
 {
 float[4] a = 1.2;
 a[] = a[] * 3 + 7;
 test1a(a);
 }

 void test2a(float4 a) { }

 void test2()
 {
 float4 a = 1.2;
 a = a * 3 + 7;
 test2a(a);
 }

 These two functions should have the same speed. The function that
 ought to be slower is:

 void test1()
 {
 float[5] a = 1.2;
 float[] b = a[1 .. $];
 b[] = b[] * 3 + 7;
 test1a(a);
 }


 Andrei

 Unfortunately druntime's array ops are a mess and fail
 to speed up anything below 16 floats.
 Additionally there is overhead for a function call and
 they have to check alignment at runtime.

 martin

 Yes. The structural problem in the compiler is that array ops get turned  
 into function calls far too early. It happens in the semantic pass, but  
 it shouldn't happen in the front-end at all -- it should be done in the  
 glue layer, at the beginning of code generation.

 Incidentally, this is the reason that CTFE doesn't work with array ops.

Oh, I was literally speaking of the runtime implementation.
It should loop with 4 XMM regs, then continue with 1 XMM reg,
and finish up scalar.
Right now it quantizes on 16 floats and does the remaining
ones scalar, which is really bad for very small arrays.
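
A rough sketch of that tiering, not the actual druntime code: `scaleAdd` is a hypothetical name, and the inner loops are plain scalar stand-ins for the XMM operations (16 floats standing in for 4 XMM regs, 4 floats for 1 XMM reg). Only the loop structure is the point here.

```d
// Hypothetical helper: computes b[i] = b[i] * mul + add in three tiers.
void scaleAdd(float[] b, float mul, float add)
{
    size_t i = 0;

    // Tier 1: blocks of 16 floats (stand-in for 4 XMM registers).
    for (; i + 16 <= b.length; i += 16)
        foreach (j; 0 .. 16)
            b[i + j] = b[i + j] * mul + add;

    // Tier 2: blocks of 4 floats (stand-in for 1 XMM register).
    for (; i + 4 <= b.length; i += 4)
        foreach (j; 0 .. 4)
            b[i + j] = b[i + j] * mul + add;

    // Tier 3: scalar tail, at most 3 elements.
    for (; i < b.length; ++i)
        b[i] = b[i] * mul + add;
}
```

With this layout a 5-element array gets one 4-wide step plus one scalar step, instead of five scalar steps under the 16-float quantization.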

I was about to rewrite it at some point.
https://gist.github.com/1235470

I think having a runtime template is better than
doing this massive extern(C) interface that has
to be kept in sync. That would also open up room
for a better CTFE integration.
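
To illustrate the idea only (this is not the code from the gist; `arrayOp` and `arrayOpNew` are made-up names), a runtime template could look roughly like this. Because it is ordinary D code rather than an extern(C) symbol, the interpreter can evaluate it at compile time:

```d
// Hypothetical template: element-wise dst[i] = lhs[i] <op> rhs[i].
void arrayOp(string op, T)(T[] dst, const(T)[] lhs, const(T)[] rhs)
{
    assert(dst.length == lhs.length && lhs.length == rhs.length);
    foreach (i; 0 .. dst.length)
        dst[i] = cast(T)(mixin("lhs[i] " ~ op ~ " rhs[i]"));
}

// Hypothetical convenience wrapper that allocates the result.
T[] arrayOpNew(string op, T)(const(T)[] lhs, const(T)[] rhs)
{
    auto dst = new T[lhs.length];
    arrayOp!op(dst, lhs, rhs);
    return dst;
}

// Evaluated entirely at compile time.
static assert(arrayOpNew!"+"([1, 2, 3], [10, 20, 30]) == [11, 22, 33]);
```

Specialized SIMD versions would then be template specializations for particular element types and operators, with this generic loop as the fallback.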

martin

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 2:04 AM, Martin Nowak wrote:
 I was about to rewrite it at some point.
 https://gist.github.com/1235470

I think you've got an innovative and clever solution. I'd like to see you
finish it!

Jan 17 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 17 Jan 2012 11:53:35 +0100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 1/17/2012 2:04 AM, Martin Nowak wrote:
 I was about to rewrite it at some point.
 https://gist.github.com/1235470

 I think you've got an innovative and clever solution. I'd like to see  
 you finish it!

Mmh, there was something keeping me from specializing templates,
https://github.com/D-Programming-Language/dmd/pull/396 :).

But right now I'd rather finish the shared library merging first.

Jan 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2012 5:20 AM, Martin Nowak wrote:
 Mmh, there was something keeping me from specializing templates,
 https://github.com/D-Programming-Language/dmd/pull/396 :).

 But right now I'd rather like to finish the shared library merging.

I agree.

Jan 17 2012
