digitalmars.D - SSE in D


Emil Madsen <sovende gmail.com> writes:

Is there a D equivalent of the "xmmintrin.h", or any other convenient way of
doing SSE in D?
- I've been looking into the Array Operators, but will those work, for
instance if I'm doing something alike:
a[3], b[4]
c[4] = a+b;
and when will the compiler write SSE asm for the array operators? - is there
a target=architecture for the compiler? or will it simply write SSE if one
defines something alike -msse4? - I'm having a bit of trouble finding stuff
about SSE for D, sources on the subject anyone?

-- 
// Yours sincerely
// Emil 'Skeen' Madsen

Oct 02 2010

Trass3r <un known.com> writes:

On 02.10.2010 at 15:23, Emil Madsen <sovende gmail.com> wrote:

 Is there a D equivalent of the "xmmintrin.h", or any other convenient way of
 doing SSE in D?
 - I've been looking into the Array Operators, but will those work, for
 instance if I'm doing something alike:
 a[3], b[4]
 c[4] = a+b;
 and when will the compiler write SSE asm for the array operators? - is there
 a target=architecture for the compiler? or will it simply write SSE if one
 defines something alike -msse4? - I'm having a bit of trouble finding stuff
 about SSE for D, sources on the subject anyone?

SSE is supported in inline assembly.
dmd's backend doesn't automatically vectorize code.
gdc and ldc are theoretically able to do it (because of the backends they use),
but I don't know to what extent they really do in practice.

Array operations leverage prewritten optimized SSE code if possible.
See the "Array operations" section of
http://www.digitalmars.com/d/2.0/arrays.html

Oct 02 2010

bearophile <bearophileHUGS lycos.com> writes:

Emil Madsen:

You are asking many different things, let's disentangle your questions a little.

 Is there a D equivalent of the "xmmintrin.h", or any other convenient way of
 doing SSE in D?

The D2 language is not designed to be an academic language; it's designed to be
a reasonably practical language (even though some of its features are not just
buggy or unfinished, but also contain new design ideas that are far from being
"battle tested", so no one knows if they will actually turn out to be good in
large or very large D2 programs).

But its implementation is not fully practical yet. In a compiler like GCC you
may see a ton of dirty or smelly little features that turn out to be
practically useful, or even almost necessary, for real-world code, and that are
absent from the C standard. The D2 compiler lacks a large number of such dirty
utility features. Even the (D1) compiler LDC has some of these necessary
dirty little features, like the allow_inline pragma that allows inlining of
functions that contain asm, and so on. I guess that when D2 is more
finished, and some people write a more efficient implementation of D2,
those little smelly things will be added in abundance.

The little dirty xmmintrin intrinsics are absent from DMD and D, both in
practice and by design. GNU C is not designed much; they just add those SIMD
operations to the ball of mud named GNU C (plus handy operator overloading if
you want to sum or multiply two registers represented as special arrays of
doubles or floats or ints). D here is designed in a bit more idealistic way,
and it tries to be semantically cleaner, so instead of those intrinsics you are
supposed to use vector operations on arrays (both static and dynamic).

Many of these operations are already implemented and more or less work, but
unless your arrays are large they usually slow your code down, because they
are chunks of pre-written asm (that use SSE+ registers too) designed for large
arrays, and they are not inlined. In theory, in the future the D front-end will
be able to replace a sum of two 4-float static arrays with a single SSE
instruction (or little more), if you have compiled the code for SSE-enabled
CPUs. In practice DMD is far from this point, and the development efforts are
(rightly!) focused on finishing core features and removing the worst
implementation (or even design) bugs. Optimization of code generation is a
matter for later.


 - I've been looking into the Array Operators, but will those work, for
 instance if I'm doing something alike:
 a[3], b[4]
 c[4] = a+b;

The right D syntax is:

float[4] a, b, c;
c[] = a[] + b[];

You must always use [] after the array name. Arrays must have the same length.
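
To make the rule concrete, here is a minimal sketch of array operations on both static and dynamic arrays. The helper name scaleAndAdd is made up for this illustration, and the values are arbitrary:

```d
import std.stdio : writeln;

// Hypothetical helper: dst[i] = a[i] * k + b[i] for every i, written as a
// single array operation. All three slices are assumed to have equal length.
void scaleAndAdd(float[] dst, const(float)[] a, const(float)[] b, float k)
{
    dst[] = a[] * k + b[];
}

void main()
{
    float[4] a = [1, 2, 3, 4];
    float[4] b = [10, 20, 30, 40];
    float[4] c;

    c[] = a[] + b[];                    // the form shown above
    assert(c == [11f, 22, 33, 44]);

    scaleAndAdd(c[], a[], b[], 2.0f);   // static arrays are passed as slices
    assert(c == [12f, 24, 36, 48]);

    writeln(c);
}
```

A scalar operand such as k is applied to every element, so the same syntax covers scaling as well as element-wise sums.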

And currently you can't use this syntax:

void main() {
    float[4] a, b;
    float[4] c[] = a[] + b[];
}


That gives the error:

test.d(3): Error: cannot implicitly convert expression (a[] + b[]) of type
float[] to float[4u][]

This is probably caused by an unforeseen design bug: a collision between the D
array syntax and the C-style declaration syntax that D still accepts.

See this bug report for more info about this design problem, that so far most
people (including the main designers) seem to happily ignore:
http://d.puremagic.com/issues/show_bug.cgi?id=3971
Here I have suggested a possible solution, the introduction of a -cstyle
compiler flag, that was ignored even more:
http://d.puremagic.com/issues/show_bug.cgi?id=4580
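
A small sketch of what the collision amounts to, judging only from the error message above: the trailing [] in the declaration is read as a C-style array suffix, so the declared type becomes float[4][] instead of float[4]. The alias and function names below are made up for illustration:

```d
// The type that `float[4] c[]` ends up declaring when the trailing [] is
// read as a C-style suffix: a dynamic array whose elements are float[4].
alias CStyleReading = float[4][];

// Declaring first and assigning afterwards avoids the ambiguity.
float[4] sum(float[4] a, float[4] b)
{
    float[4] c;         // D-style type only, no suffix on the name
    c[] = a[] + b[];    // here [] is purely the array-operation slice
    return c;
}

void main()
{
    CStyleReading rows = [[1f, 2, 3, 4]];
    static assert(is(typeof(rows[0]) == float[4]));
    assert(sum([1f, 2, 3, 4], [1f, 1, 1, 1]) == [2f, 3, 4, 5]);
}
```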

So this code works:

void main() {
    float[4] a, b, c;
    c[] = a[] + b[];
}

But it performs a call to a pre-written asm routine that computes the vector
c=a+b, and that routine uses SSE registers too if your CPU (detected at
runtime) supports them.


 and when will the compiler write SSE asm for the array operators?

DMD currently never writes SSE asm, unless you use those instructions yourself
in inline asm code. The 64-bit DMD will probably be able to use those registers
too, but I have no idea whether the 32-bit DMD will then use them as well. I
hope so, but I have little hope. I'd like to know this.

D1 LDC now uses SSE registers for most of its floating point operations because
LLVM is very bad at using the X86 floating point stack.

Low-level D code written for D1 LDC is usually about as efficient as C code
written for GCC. This is a very good thing. But recently the development of LDC
has slowed down a lot: there is no D2 version of it, it's not updated to the
latest versions of LLVM, and there's no Windows support, because the LLVM devs
are paid by Apple and they don't care to make LLVM work fully (== with
exceptions too) on Windows; they just need to give people the illusion that
LLVM is multi-platform. I used to help LLVM development, but I have stopped
until they add good support for exceptions on Windows.

There is a GCC-based D compiler too, named GDC, and I think it works, but I
have never appreciated it much on Windows. Other people may give you
more/better info on it.


 - is there a target=architecture for the compiler? or will it simply write
 SSE if one defines something alike -msse4? -

LDC D1 allows you to specify the target a little, while I think DMD always
targets a Pentium1.


 I'm having a bit of trouble finding stuff
 about SSE for D, sources on the subject anyone?

There is not much to search :-)

Bye,
bearophile

Oct 02 2010

yoda <yoda talk.info> writes:

Is it just me or does anyone else have problems reading / understanding what he
tries to say? The words are more or less correct, but the grammar is something
really incomprehensible. What is bearophile? Some Asperger child prodigy? The
words sound like they come from some ivory tower 500 feet above us and show no
signs of emotions or social group thinking. Why is he affecting D's development
so much?

Oct 03 2010

Emil Madsen <sovende gmail.com> writes:

 uses SSE registers too if your CPU (detected at runtime) supports them.

How is this done? - using codepaths after a call to cpuid?

And I can see the idea in cleaning up the syntax by replacing intrinsics with
array operators. However, what if I want to, for instance, shuffle? Would it
be possible to overload >> for that, or something? And how would it shuffle:
4 elements or the entire thing? Say I want to shuffle the elements once to the
right, like this:
a b c d --> d a b c
(_mm_shuffle_ps(array, array, _MM_SHUFFLE(2, 1, 0, 3));)

It's just because I need such functionality to implement matrices and the
like using SSE. What would my alternative be? Implementing "xmmintrin.h"
using small bits of inline asm? That, however, wouldn't yield any speed if
it's not getting inlined?

-- 
// Yours sincerely
// Emil 'Skeen' Madsen

Oct 03 2010

bearophile <bearophileHUGS lycos.com> writes:

Emil Madsen:

 uses SSE registers too if your CPU (detected at runtime) supports them.

 How is this done? - using codepaths after a call to cpuid?

In your dmd distribution there is compiler/druntime/phobos source code too,
take a peek there. This source code shows you how it's done:

http://www.dsource.org/projects/druntime/browser/trunk/src/rt/arraydouble.d
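
As a rough sketch of the general idea only (this is not the druntime code), a CPU feature flag can be queried once and used to pick between code paths. The sketch assumes the core.cpuid module and its sse2 property; the function names are made up, and the "SSE2" path is just a stand-in where hand-written asm would go:

```d
import core.cpuid : sse2;

alias AddFn = void function(double[] dst, const(double)[] a, const(double)[] b);

// Plain fallback loop for CPUs without the feature.
void addScalar(double[] dst, const(double)[] a, const(double)[] b)
{
    foreach (i; 0 .. dst.length)
        dst[i] = a[i] + b[i];
}

// Stand-in for a hand-written SSE2 loop; a real routine would contain asm.
void addSse2(double[] dst, const(double)[] a, const(double)[] b)
{
    dst[] = a[] + b[];
}

// Choose the code path from the feature flag reported by the CPU.
AddFn pickAdd()
{
    return sse2 ? &addSse2 : &addScalar;
}

void main()
{
    auto add = pickAdd();
    auto dst = new double[3];
    add(dst, [1.0, 2, 3], [4.0, 5, 6]);
    assert(dst == [5.0, 7, 9]);
}
```

Both paths must produce the same result, so the caller never needs to know which one was selected.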


 what if I want to for instance shuffle? - would it
 be possible to overload >> for that, or something? and how would it shuffle?
 4 elements or the entire thing? - Say I want to shuffle elements once to the
 right like this:
 a b c d --> d a b c

At the moment I think you have to write a little function that performs the
shuffle (and if it contains asm it will not be inlined). A similar solution is
to use a little shuffling struct that uses opDispatch to give a nice shuffling
syntax.
You may also use a string mixin if your asm code must be inlined, but this is
not nice.
I think there is currently no really good way to do what you need. I think
Don or someone else will need to invent something good enough for efficient
shuffling :-)
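
A minimal sketch of such a shuffling struct, written in plain D without any asm, so it shows only the syntax and not the performance. The names Vec4, lane and isSwizzle and the x/y/z/w lettering are assumptions made for this example:

```d
// Maps a component letter to its lane; -1 marks an invalid letter.
int lane(char ch)
{
    switch (ch)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default:  return -1;
    }
}

// True for names made of exactly four valid component letters.
bool isSwizzle(string s)
{
    if (s.length != 4)
        return false;
    foreach (char ch; s)
        if (lane(ch) < 0)
            return false;
    return true;
}

struct Vec4
{
    float[4] v;

    // vec.wxyz builds a new vector whose lanes are picked by the letters.
    Vec4 opDispatch(string s)() const
        if (isSwizzle(s))
    {
        Vec4 r;
        foreach (i, char ch; s)
            r.v[i] = v[lane(ch)];
        return r;
    }
}

void main()
{
    auto abcd = Vec4([1f, 2, 3, 4]);
    auto dabc = abcd.wxyz;              // a b c d --> d a b c
    assert(dabc.v == [4f, 1, 2, 3]);
}
```

Because the name is a compile-time string, a misspelled shuffle such as abcd.wxyq is rejected at compile time instead of failing at run time.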


 implementing
 "xmmintrin.h" using bits of small inline asm? - that however wouldn't yield
 any speed, if its not getting inlined?

In DMD functions that contain asm don't get inlined, so those small snippets
become kind of useless if your purpose is max performance. 

The LDC (D1) compiler, being more practical, has two different ways to do what
you need: the pragma(allow_inline):
http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline
And Inline Assembly Expressions:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

In DMD you probably have to build your code as a string at compile time and
then mix it into the normal code. This is neither handy nor clean, but it may
work.
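
A hedged sketch of the string mixin idea, assuming DMD-style inline asm on a 64-bit x86 target (guarded by the D_InlineAsm_X86_64 version identifier) with a plain array operation as the fallback elsewhere. The names sseBinary and add4 are made up, and the generated asm expects local pointers named pa, pb and pc at the place where it is mixed in:

```d
// Builds the text of an asm block that applies one packed-float SSE
// instruction (for example "addps") to the 4 floats at pa and pb and
// stores the result at pc. The string is produced at compile time.
string sseBinary(string instr)
{
    return "asm {
        mov RAX, pa;
        mov RCX, pb;
        mov RDX, pc;
        movups XMM0, [RAX];
        movups XMM1, [RCX];
        " ~ instr ~ " XMM0, XMM1;
        movups [RDX], XMM0;
    }";
}

float[4] add4(float[4] a, float[4] b)
{
    float[4] c;
    version (D_InlineAsm_X86_64)
    {
        float* pa = a.ptr, pb = b.ptr, pc = c.ptr;
        mixin(sseBinary("addps"));   // the asm text is pasted in right here
    }
    else
    {
        c[] = a[] + b[];             // fallback without inline asm
    }
    return c;
}

void main()
{
    assert(add4([1f, 2, 3, 4], [10f, 20, 30, 40]) == [11f, 22, 33, 44]);
}
```

The wrapper function is only there to make the sketch testable; the point of the mixin is that the same line can be placed directly inside a hot loop, so no call to a separate asm function is involved.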

Bye,
bearophile

Oct 03 2010
