digitalmars.D - Any usable SIMD implementation?

reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
I'm currently working on a templated arrayop implementation (using RPN
to encode ASTs).
So far things worked out great, but now I got stuck b/c apparently none
of the D compilers has a working SIMD implementation (maybe GDC has but
it's very difficult to work w/ the 2.066 frontend).

https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

I don't want to do anything fancy, just unaligned loads, stores, and
integral mul/div. Is this really the current state of SIMD or am I
missing sth.?

-Martin
Mar 31 2016
next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
I don't know how far Ilya's work [1] has advanced, but you may want to join efforts with him. There are also two std.simd packages [2] [3].

BTW, I looked at your code a couple of days ago and I thought that it is a really interesting approach to encode operations like that. I'm just wondering if pursuing this approach is a good idea in the long run, i.e. is it expressive enough to cover the use cases of HPC, which would also need something similar, but for custom linear algebra types?

Here's an interesting video about approaches to solving this problem in C++: https://www.youtube.com/watch?v=hfn0BVOegac

[1]: http://forum.dlang.org/post/nilhvnqbsgqhxdshpqfl@forum.dlang.org
[2]: https://github.com/D-Programming-Language/phobos/pull/2862
[3]: https://github.com/Iakh/simd
Mar 31 2016
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 03/31/2016 10:55 AM, ZombineDev wrote:
 [2]: https://github.com/D-Programming-Language/phobos/pull/2862
Well apparently stores w/ dmd's weird core.simd interface don't work, or I can't figure out (from the non-existent documentation) how to use it.

---
import core.simd;

void test(float4* ptr, float4 val)
{
    __simd_sto(XMM.STOUPS, *ptr, val);
    __simd(XMM.STOUPS, *ptr, val);
    auto val1 = __simd_sto(XMM.STOUPS, *ptr, val);
    auto val2 = __simd(XMM.STOUPS, *ptr, val);
}
---

LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but for some reason comes with its own broken ldc.simd.loadUnaligned instead of providing intrinsics.

---
import core.simd, ldc.simd;

float4 test(float* ptr)
{
    return loadUnaligned!float4(ptr);
}
---

/home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error: can't parse inline LLVM IR:
  %r = load <4 x float>* %p, align 1
  ^
expected comma after load's type

So are 3 different untested and unused APIs really the current state of SIMD?

-Martin
Apr 01 2016
next sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 2 Apr 2016 12:40 am, "Martin Nowak via Digitalmars-d" <
digitalmars-d puremagic.com> wrote:
 On 03/31/2016 10:55 AM, ZombineDev wrote:
 [2]: https://github.com/D-Programming-Language/phobos/pull/2862
 Well apparently stores w/ dmd's weird core.simd interface don't work, or
 I can't figure out (from the non-existent documentation) how to use it.

 ---
 import core.simd;

 void test(float4* ptr, float4 val)
 {
     __simd_sto(XMM.STOUPS, *ptr, val);
     __simd(XMM.STOUPS, *ptr, val);
     auto val1 = __simd_sto(XMM.STOUPS, *ptr, val);
     auto val2 = __simd(XMM.STOUPS, *ptr, val);
 }
 ---

 LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but
 for some reason comes with its own broken ldc.simd.loadUnaligned instead
 of providing intrinsics.

 ---
 import core.simd, ldc.simd;

 float4 test(float* ptr)
 {
     return loadUnaligned!float4(ptr);
 }
 ---

 /home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error: can't parse inline LLVM IR:
   %r = load <4 x float>* %p, align 1
   ^
 expected comma after load's type

 So are 3 different untested and unused APIs really the current state of SIMD?

 -Martin
I would just let the compiler optimize / vectorize the operation, but then again, it is probably just me who thinks these things.

http://goo.gl/XdiKZX

I'm not aware of any intrinsic to load unaligned data. Only to assume alignment.

Iain.
Apr 01 2016
parent reply Martin Nowak <code dawg.eu> writes:
On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:
 I would just let the compiler optimize / vectorize the 
 operation, but then again that it is probably just me who 
 thinks these things.
It's intended to replace the array ops in druntime; relying on vectorizers won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.
 I'm not aware of any intrinsic to load unaligned data. Only to 
 assume alignment.
__builtin_ia32_loadups
__builtin_ia32_storeups
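
To make the dynamic-array case concrete, here is a minimal sketch of the kind of array operation under discussion. The function name `scaleAdd` is made up for illustration, and whether a particular compiler vectorizes it is not guaranteed.

```d
// Array operation on dynamic arrays (slices): the lengths are only known
// at run time, so the work cannot be unrolled for a fixed size.
// `scaleAdd` is a hypothetical name used only for this sketch.
float[] scaleAdd(const(float)[] a, const(float)[] b, float k)
{
    assert(a.length == b.length);
    auto result = new float[a.length];
    // Element-wise: result[i] = a[i] * k + b[i]
    result[] = a[] * k + b[];
    return result;
}
```

The single array-op statement is what the array-op implementation has to turn into unaligned vector loads and stores, since slices carry no alignment guarantee.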
Apr 02 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 2 Apr 2016 9:45 am, "Martin Nowak via Digitalmars-d" <
digitalmars-d puremagic.com> wrote:
 On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:
 I would just let the compiler optimize / vectorize the operation, but
then again that it is probably just me who thinks these things.
 It's intended to replace the array ops in druntime, relying on vectorizers
won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.
 I'm not aware of any intrinsic to load unaligned data. Only to assume
alignment.
 __builtin_ia32_loadups
 __builtin_ia32_storeups
Any agnostic way to... :-)
Apr 02 2016
parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/02/2016 10:19 AM, Iain Buclaw via Digitalmars-d wrote:
 __builtin_ia32_loadups
 __builtin_ia32_storeups
Any agnostic way to... :-)
I'm already using vector types for most operations, so it's somewhat portable. But for whatever reason D doesn't allow multiplication/division w/ integral vectors (departing from GCC/clang) and I can't perform unaligned loads, so I have to resort to intrinsics for that.
Apr 02 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:
 LDC at least has some intrinsics once you find 
 ldc.gccbuiltins_x86, but for some reason comes with it's own 
 broken ldc.simd.loadUnaligned
Please submit a GH issue with LDC, thanks! -Johan
Apr 03 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:
 Well apparently stores w/ dmd's weird core.simd interface don't 
 work, or I can't figure out (from the non-existent 
 documentation) how to use it.
https://github.com/D-Programming-Language/dmd/pull/5625
Apr 03 2016
prev sibling next sibling parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Am I being stupid or is core.simd what you want?
Mar 31 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?
I think you want to write your code using SIMD primitives. But in case you want the compiler to generate SIMD instructions, perhaps ldc.attributes.target may help you: I have not checked what LDC does with SIMD with default commandline parameters. Cheers, Johan
Mar 31 2016
prev sibling next sibling parent Iakh <iaktakh gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Unfortunately mine (https://github.com/Iakh/simd) is far from production code. For now I'm trying to figure out an interface common to all archs/compilers, and it's more about SIMD comparison operations.

You could do loads, stores and mul with the default D SIMD support, but not int div.
Mar 31 2016
prev sibling next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Hello Martin,

Is it possible to introduce compile-time information about the target platform? I am working on a from-scratch BLAS implementation, and there is no hope of creating something usable without CT information about the target.

Best regards,
Ilya
Apr 02 2016
next sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" <digitalmars-d puremagic.com>
wrote:
 On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c apparently none
 of the D compilers has a working SIMD implementation (maybe GDC has but
 it's very difficult to work w/ the 2.066 frontend).
 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d
 I don't want to do anything fancy, just unaligned loads, stores, and
integral mul/div. Is this really the current state of SIMD or am I missing sth.?
 -Martin
Hello Martin, Is it possible to introduce compile time information about target
platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target.
 Best regards,
 Ilya
What kind of information?
Apr 02 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Sunday, 3 April 2016 at 06:33:13 UTC, Iain Buclaw wrote:
 On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" 
 <digitalmars-d puremagic.com> wrote:
 Hello Martin,

 Is it possible to introduce compile time information about 
 target
platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target.
 Best regards,
 Ilya
What kind of information?
Target cpu configuration:
- CPU architecture (done)
- Count of FP/Integer registers
- Allowed sets of instructions: for example, AVX2, FMA4
- Compiler optimization options (for math)

Ilya
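
As a rough sketch of what compile-time target information can look like, the instruction-set item can be partly expressed with predefined version identifiers. This assumes a compiler release that defines `D_SIMD`, `D_AVX` and `D_AVX2`. The names `vectorBytes` and `floatsPerVector` are made up for this example, and register counts or math optimization flags are not covered by it.

```d
// Compile-time target information from predefined version identifiers.
// Identifiers that a compiler does not define are simply never set, so
// this falls through to the scalar case instead of failing to compile.
// `vectorBytes` and `floatsPerVector` are hypothetical names.
version (D_AVX2)
    enum size_t vectorBytes = 32;
else version (D_AVX)
    enum size_t vectorBytes = 32;
else version (D_SIMD)
    enum size_t vectorBytes = 16;
else
    enum size_t vectorBytes = 0;   // no vector support advertised

// Number of floats per vector register, or 1 for the scalar fallback.
enum size_t floatsPerVector = vectorBytes ? vectorBytes / float.sizeof : 1;

// Both values are constants, so they can steer templates and static if.
static assert(floatsPerVector == 1 || floatsPerVector == 4 || floatsPerVector == 8);
```

Because both values are `enum` constants, a kernel can branch on them with `static if` and pay nothing at run time.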
Apr 04 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 04 Apr 2016 14:02:03 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

 Target cpu configuration:
 - CPU architecture (done)
 - Count of FP/Integer registers
 - Allowed sets of instructions: for example, AVX2, FMA4
 - Compiler optimization options (for math)
 
 Ilya
- On amd64, whether floating-point math is handled by the FPU
  or SSE. When emulating floating-point, e.g. for
  float-to-string and string-to-float code, it is useful to
  know where to get the active rounding mode from, since they
  may differ and at least GCC has a switch to choose between
  both.
- For compile time enabling of SSE4 code, a version define is
  sufficient. Sometimes we want to select a code path at
  runtime. For this to work, GDC and LDC use a conservative
  feature set at compile time (e.g. amd64 with SSE2) and tag
  each SSE4 function with an attribute to temporarily elevate
  the instruction set. (e.g. attribute("target", "+sse4"))
  If you didn't tag the function like that the compiler would
  error out, because the SSE4 instructions are not supported
  by a minimal amd64 CPU.
  To put this to good use, we need a reliable way - basically
  a global variable - to check for SSE4 (or POPCNT, etc.). What
  we have now does not work across all compilers.

-- 
Marco
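
The runtime selection described in the second point could be sketched as follows, using the `sse42` property from core.cpuid as the "global variable". The functions `sumGeneric`, `sumSse42` and `selectSum` are hypothetical. The specialised path is only a stand-in, because GDC and LDC spell the target attribute differently and it is left out here.

```d
import core.cpuid : sse42;

// Generic code path, valid on any target.
int sumGeneric(const(int)[] data)
{
    int total = 0;
    foreach (x; data)
        total += x;
    return total;
}

// Stand-in for a specialised path. In real code this function would carry
// the compiler-specific target attribute and use SSE4.2 instructions; here
// it just reuses the generic loop so the sketch runs anywhere.
int sumSse42(const(int)[] data)
{
    return sumGeneric(data);
}

// Pick an implementation once, based on what the running CPU reports.
int function(const(int)[]) selectSum()
{
    return sse42 ? &sumSse42 : &sumGeneric;
}
```

The dispatch is only as reliable as the feature flag itself: if the flag always reads false on some compiler, the generic path is chosen every time.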
Apr 04 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 16:21:15 UTC, Marco Leise wrote:
 Am Mon, 04 Apr 2016 14:02:03 +0000
 schrieb 9il <ilyayaroshenko gmail.com>:
 - On amd64, whether floating-point math is handled by the FPU
   or SSE. When emulating floating-point, e.g. for
   float-to-string and string-to-float code, it is useful to
   know where to get the active rounding mode from, since they
   may differ and at least GCC has a switch to choose between
   both.
 - For compile time enabling of SSE4 code, a version define is
   sufficient. Sometimes we want to select a code path at
   runtime. For this to work, GDC and LDC use a conservative
   feature set at compile time (e.g. amd64 with SSE2) and tag
   each SSE4 function with an attribute to temporarily elevate
   the instruction set. (e.g.  attribute("target", "+sse4"))
   If you didn't tag the function like that the compiler would
   error out, because the SSE4 instructions are not supported
   by a minimal amd64 CPU.
   To put this to good use, we need a reliable way - basically
   a global variable - to check for SSE4 (or POPCNT, etc.). What
   we have now does not work across all compilers.
attribute("target", "+sse4") would not work well for BLAS. BLAS needs compile-time constants. This is very important because BLAS can be 95% portable, so I just need to write code that the compiler can optimize very well.

--Ilya
Apr 04 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 04 Apr 2016 18:35:26 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

  attribute("target", "+sse4")) would not work well for BLAS. BLAS 
 needs compile time constants. This is very important because BLAS 
 can be 95% portable, so I just need to write a code that would be 
 optimized very well by compiler. --Ilya
It's just for the case where you want a generic executable with a generic and a specialized code path. I didn't mean this to be exclusively used without compile-time information about target features. -- Marco
Apr 11 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 9:21 AM, Marco Leise wrote:
    To put this to good use, we need a reliable way - basically
    a global variable - to check for SSE4 (or POPCNT, etc.). What
    we have now does not work across all compilers.
http://dlang.org/phobos/core_cpuid.html
Apr 04 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 4 Apr 2016 11:43:58 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 9:21 AM, Marco Leise wrote:
    To put this to good use, we need a reliable way - basically
    a global variable - to check for SSE4 (or POPCNT, etc.). What
    we have now does not work across all compilers.  
http://dlang.org/phobos/core_cpuid.html
That's what I implied in "what we have now":

    import core.cpuid;
    writeln( mmx );  // prints 'false' with GDC
    version(InlineAsm_X86_Any)
        writeln("DMD and LDC support the Dlang inline assembler");
    else
        writeln("GDC has the GCC extended inline assembler");

Both LLVM and GCC have moved to "extended inline assemblers" that require you to provide information about input, output and scratch registers as well as memory locations, so the compiler can see through the asm-block for register allocation and inlining purposes. It's more difficult to get right, but also more rewarding, as it enables you to write no-overhead "one-liners" and "intrinsics" while having calling conventions still handled by the compiler. An example for GDC:

    struct DblWord { ulong lo, hi; }

    /// Multiplies two machine words and returns a double
    /// machine word.
    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp = void;
        // '=a' and '=d' are outputs to RAX and RDX
        // respectively that are bound to the two
        // fields of 'tmp'.
        // '"a" x' means that we want 'x' as input in
        // RAX and '"rm" y' places 'y' wherever it
        // suits the compiler (any general purpose
        // register or memory location).
        // 'mulq %3' multiplies with the ulong
        // represented by the argument at index 3 (y).
        asm {
            "mulq %3"
             : "=a" tmp.lo, "=d" tmp.hi
             : "a" x, "rm" y;
        }
        return tmp;
    }

In the above example the compiler has enough information to inline the function or directly return the result in RAX:RDX without writing to memory first. The same thing in DMD would likely have turned out slower than emulating this using several uint->ulong multiplies.

Although less powerful, the LDC team implemented Dlang inline assembly according to the specs and so core.cpuid works for them. GDC on the other hand is out of the picture until either
1) GDC adds Dlang inline assembly
2) core.cpuid duplicates most of its assembly code to support
   the GCC extended inline assembler

I would prefer a common extended inline assembler though, because when you use it for performance reasons you typically cannot go with non-inlinable Dlang asm, so you end up with pure D for DMD, GCC asm for GDC and LDC asm - three code paths.

-- 
Marco
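
For reference, the pure D fallback alluded to above (several 32x32->64-bit multiplies) could look roughly like this. It is a sketch, not code from any of the compilers or libraries discussed. `bigMulPortable` is a hypothetical name, and `DblWord` is repeated so the block stands on its own.

```d
struct DblWord { ulong lo, hi; }

/// 64x64 -> 128-bit multiply built from four 32x32 -> 64-bit products.
/// `bigMulPortable` is a hypothetical name for this sketch.
DblWord bigMulPortable(ulong x, ulong y) pure nothrow @nogc @safe
{
    // Split both operands into 32-bit halves.
    immutable ulong xl = x & 0xFFFF_FFFF, xh = x >> 32;
    immutable ulong yl = y & 0xFFFF_FFFF, yh = y >> 32;

    // Partial products; each fits in 64 bits.
    immutable ulong ll = xl * yl;
    immutable ulong lh = xl * yh;
    immutable ulong hl = xh * yl;
    immutable ulong hh = xh * yh;

    // Bits 32..63 of the result plus carries; the sum of three 32-bit
    // quantities cannot overflow a ulong.
    immutable ulong mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);

    DblWord r;
    r.lo = (mid << 32) | (ll & 0xFFFF_FFFF);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```

This is the single code path that works on every compiler, at the cost of four multiplies and some shuffling where the asm version needs one `mul`.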
Apr 11 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/11/2016 7:24 AM, Marco Leise wrote:
 Am Mon, 4 Apr 2016 11:43:58 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 9:21 AM, Marco Leise wrote:
     To put this to good use, we need a reliable way - basically
     a global variable - to check for SSE4 (or POPCNT, etc.). What
     we have now does not work across all compilers.
http://dlang.org/phobos/core_cpuid.html
 That's what I implied in "what we have now":

     import core.cpuid;
     writeln( mmx );  // prints 'false' with GDC
     version(InlineAsm_X86_Any)
         writeln("DMD and LDC support the Dlang inline assembler");
     else
         writeln("GDC has the GCC extended inline assembler");
There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.
 Both LLVM and GCC have moved to "extended inline assemblers"
 that require you to provide information about input, output
 and scratch registers as well as memory locations, so the
 compiler can see through the asm-block for register allocation
 and inlining purposes. It's more difficult to get right, but
 also more rewarding, as it enables you to write no-overhead
 "one-liners" and "intrinsics" while having calling conventions
 still handled by the compiler.
I know, but "more difficult" is a bit of an understatement. For example, core.cpuid has not been implemented using those assemblers.

BTW, dmd's inline assembler does know about which instructions read/write which registers, and makes use of that when inserting the code so it will work with the rest of the code generator's register usage tracking.

I find needing to tell gcc which registers are read/written by a particular instruction to be a step BACKWARDS in usability. This is what computers are supposed to be good for :-)
 An example for GDC:

 	struct DblWord { ulong lo, hi; }

 	/// Multiplies two machine words and returns a double
 	/// machine word.
 	DblWord bigMul(ulong x, ulong y)
 	{
 		DblWord tmp = void;
 		// '=a' and '=d' are outputs to RAX and RDX
 		// respectively that are bound to the two
 		// fields of 'tmp'.
 		// '"a" x' means that we want 'x' as input in
 		// RAX and '"rm" y' places 'y' wherever it
 		// suits the compiler (any general purpose
 		// register or memory location).
 		// 'mulq %3' multiplies with the ulong
 		// represented by the argument at index 3 (y).
 		asm {
 			"mulq %3"
 			 : "=a" tmp.lo, "=d" tmp.hi
 			 : "a" x, "rm" y;
 		}
 		return tmp;
 	}

 In the above example the compiler has enough information to
 inline the function or directly return the result in RAX:RDX
 without writing to memory first. The same thing in DMD would
 likely have turned out slower than emulating this using
 several uint->ulong multiplies.
DMD doesn't inline functions with asm in them, but that is not the fault of the inline assembler.

The only real weakness in the DMD inline assembler is it doesn't support "let the compiler select the register". DMD's strong support for compiler builtins, however, mitigates this to an acceptable level.
 Although less powerful, the LDC team implemented Dlang inline
 assembly according to the specs and so core.cpuid works for
 them. GDC on the other hand is out of the picture until either
 1) GDC adds Dlang inline assembly
 2) core.cpuid duplicates most of its assembly code to support
     the GCC extended inline assembler

 I would prefer a common extended inline assembler though,
 because when you use it for performance reasons you typically
 cannot go with non-inlinable Dlang asm, so you end up with pure
 D for DMD, GCC asm for GDC and LDC asm - three code paths.
Apr 11 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 11 Apr 2016 14:29:11 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/11/2016 7:24 AM, Marco Leise wrote:
 Am Mon, 4 Apr 2016 11:43:58 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:
  
 On 4/4/2016 9:21 AM, Marco Leise wrote:  
     To put this to good use, we need a reliable way - basically
     a global variable - to check for SSE4 (or POPCNT, etc.). What
     we have now does not work across all compilers.  
http://dlang.org/phobos/core_cpuid.html
 That's what I implied in "what we have now":

     import core.cpuid;
     writeln( mmx );  // prints 'false' with GDC
     version(InlineAsm_X86_Any)
         writeln("DMD and LDC support the Dlang inline assembler");
     else
         writeln("GDC has the GCC extended inline assembler");
There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.
LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come.
 Both LLVM and GCC have moved to "extended inline assemblers"
 that require you to provide information about input, output
 and scratch registers as well as memory locations, so the
 compiler can see through the asm-block for register allocation
 and inlining purposes. It's more difficult to get right, but
 also more rewarding, as it enables you to write no-overhead
 "one-liners" and "intrinsics" while having calling conventions
 still handled by the compiler.  
I know, but "more difficult" is a bit of an understatement. For example, core.cpuid has not been implemented using those assemblers.
Yep, and that makes it unavailable in GDC. All feature tests return false, even MMX or SSE2 on amd64.
 BTW, dmd's inline assembler does know about which instructions read/write
which 
 registers, and makes use of that when inserting the code so it will work with 
 the rest of the code generator's register usage tracking.
That is a pleasant surprise. :)
 I find needing to tell gcc which registers are read/written by a particular 
 instruction to be a step BACKWARDS in usability. This is what computers are 
 supposed to be good for :-)
Still, DMD does not inline asm and always adds a function prolog and epilog around asm blocks in an otherwise empty function (correct me if I'm wrong). "naked" means you have to duplicate code for the different calling conventions, in particular Win32.

Your view of GCC (and LLVM) may be a bit biased. First of all you don't need to tell it exactly which registers to use. A rough classification is enough and gives the compiler a good idea of where calculations should be stored upon arrival at the asm statement. You can be specific down to the register name or let the backend choose freely with "rm" (= any register or memory).

An example: We have a variable x that is computed inside a function followed by an asm block that multiplies it with something else. Typically you would "MOV EAX, [x]" to load x into the register that the MUL instruction expects. With extended assemblers you can be declarative about that and just state that x is needed in EAX as an input. You drop the MOV from the asm block and let the compiler figure out in its codegen how x will end up in EAX. That's a step FORWARD in usability.
 DMD doesn't inline functions with asm in them, but that is not the fault of
the 
 inline assembler.
 
 The only real weakness in the DMD inline assembler is it doesn't support "let 
 the compiler select the register". DMD's strong support for compiler builtins, 
 however, mitigate this to an acceptable level.
Yes, I've witnessed that in multiply with overflow check. DMD generates very efficient code for 'mulu'. It's just that the compiler cannot have builtins for everything. (I personally was looking for 64-bit multiply with 128-bit result and SSE4 string scanning.) The extended assemblers in GCC and LLVM allow me to write intrinsics, often as a single(!) instruction, that seamlessly inline into the surrounding code, just as DMD's builtins would do. And it seems to me we could have less backend complexity if we were able to implement intrinsics as library code with the same efficiency. ;)

But most of the time when I want to access a specialized CPU instruction for speed with asm in DMD, the generic pure D code is faster. I would advise using it only if the concept is not expressible in pure D at the moment.

You might add that we shouldn't write asm in the first place, because compilers have become smart enough, but it's not like I was writing large chunks of asm. I use it to write "compiler builtins" in D source code.

-- 
Marco
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
 BTW, dmd's inline assembler does know about which instructions read/write which
 registers, and makes use of that when inserting the code so it will work with
 the rest of the code generator's register usage tracking.
That is a pleasant surprise. :)
https://github.com/D-Programming-Language/dmd/blob/master/src/iasm.c#L1255
 Still, DMD does not inline asm and always adds a function
 prolog and epilog around asm blocks in an otherwise
 empty function (correct me if I'm wrong).
Not if you use "naked".
 "naked" means you
 have to duplicate code for the different calling conventions,
 in particular Win32.
Why complain about it adding a prolog/epilog, and complain about it not adding it?
 Your look on GCC (and LLVM) may be a bit biased. First of all
 you don't need to tell it exactly which registers to use. A
 rough classification is enough and gives the compiler a good
 idea of where calculations should be stored upon arrival at
 the asm statement. You can be specific down to the register
 name or let the backend chose freely with "rm" (= any register
 or memory).
 An example: We have a variable x that is computed inside a
 function followed by an asm block that multiplies it with
 something else. Typically you would "MOV EAX, [x]" to load x
 into the register that the MUL instruction expects. With
 extended assemblers you can be declarative about that and just
 state that x is needed in EAX as an input. You drop the MOV
 from the asm block and let the compiler figure out in its
 codegen, how x will end up in EAX. That's a step FORWARD in
 usability.
It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
Apr 12 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 13:22:12 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.  
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
You mean it would be OK if I duplicated most of the asm in there and created a pull request?
 Still, DMD does not inline asm and always adds a function
 prolog and epilog around asm blocks in an otherwise
 empty function (correct me if I'm wrong).  
Not if you use "naked".
 "naked" means you
 have to duplicate code for the different calling conventions,
 in particular Win32.  
Why complain about it adding a prolog/epilog, and complain about it not adding it?
Yeah, I didn't make this clear. To reduce code repetition I'd like to avoid "naked" and have the compiler handle the calling conventions. Let's compare the earlier example in both GDC and DMD in a coding style that is agnostic wrt. the calling convention. First GDC:

    struct DblWord { ulong lo, hi; }

    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp;
        asm {
            "mulq %[y]"
             : "=a" tmp.lo, "=d" tmp.hi
             : "a" x, [y] "rm" y;
        }
        return tmp;
    }

This is turned into the following instruction sequence (AT&T):

    mov    %rdi,%rax
    mul    %rsi
    retq

Note how elegantly GCC handles the calling convention for us. The prolog reduces to moving 'x' from RDI to RAX where I asked it to place it for the MUL to use as the implicit operand. After multiplying it by the explicit operand in RSI, the resulting two machine words would be in RAX:RDX as we know. I created a data structure to return those two and told GCC to tie tmp.lo to RAX and tmp.hi to RDX. Since the calling convention happens to return structs of 2 machine words in RAX:RDX, the whole assignment to 'tmp' and the return become no-ops. With inlining enabled only the 'mul' would remain. This is the ideal outcome.

Now let's look at the DMD implementation - again letting the compiler figure out the calling convention:

    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp;
        asm {
            mov RAX, x;
            mul y;
            mov tmp+DblWord.lo.offsetof, RAX;
            mov tmp+DblWord.hi.offsetof, RDX;
        }
        return tmp;
    }

This generates the following:

    push   %rbp
    mov    %rsp,%rbp
    sub    $0x20,%rsp
    mov    %rdi,-0x10(%rbp)
    mov    %rsi,-0x8(%rbp)
    lea    -0x20(%rbp),%rax
    xor    %ecx,%ecx
    mov    %rcx,(%rax)
    mov    %rcx,0x8(%rax)
    mov    -0x8(%rbp),%rax
    mulq   -0x10(%rbp)
    mov    %rax,-0x20(%rbp)
    mov    %rdx,-0x18(%rbp)
    mov    -0x18(%rbp),%rdx
    mov    -0x20(%rbp),%rax
    mov    %rbp,%rsp
    pop    %rbp
    retq

In practice GDC will just replace the invocation with a single 'mul' instruction while DMD will emit a call to this 18-instruction function.

Now you keep telling me extended assembly is a step backwards. :)
 It's a step backwards because I can't just say "MUL EAX".
You could write this, you'd only have to tell the assembler that EAX and EDX will be overwritten, something that DMD already knows.
 I have to tell GCC what register the result gets put in.
And by doing this you allow it to figure out the shortest way to return the result in compliance with the calling convention.
 This is, to my mind, ridiculous.
I too find it annoying that I have to inform it about the scratch registers used in the asm, but the rest seems legit to me. At some point you will have to connect variables in the host language with registers in assembly. Doing this in a declarative manner instead of explicit assembly code allows the backend to find the optimal code (literally), as demonstrated above.
 GCC's inline assembler apparently has no knowledge of what
 the opcodes actually do.
Agreed. It seems to treat the assembly text merely as a text template. It is the same with LLVM's extended assembler which borrows heavily from GCC's. This is probably due to the fact that the assembler is historically a standalone executable and as such the authority for interpreting the asm code is outside of the scope of the host language compiler. Under these circumstances we might have gone for the same implementation. -- Marco
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 4:29 PM, Marco Leise wrote:
 Am Tue, 12 Apr 2016 13:22:12 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
You mean it is ok, if I duplicated most of the asm in there and created a pull request ?
It's Boost licensed, and Boost licensed code can be shipped with GPL'd code as far as I know.
            "mulq %[y]"
            : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;
I don't see anything elegant about those lines, starting with "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically. I can see nothing to recommend the: "=a" tmp.lo syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-) I have no idea what: "a" x and: [y] "rm" y mean, nor why the ":" appears sometimes and the "," other times. It does look like it was designed by the same guy who invented TECO macros: https://www.reddit.com/r/programming/comments/4e07lo/last_night_in_a_fit_of_boredom_far_away_from_my/d1xlbh7 but that's not much of a compliment.
 In practice GDC will just replace the invocation with a single
 'mul' instruction while DMD will emit a call to this 18
 instructions long function. Now you keep telling me extended
 assembly is a step backwards. :)
DMD version: DblWord bigMul(ulong x, ulong y) { naked asm { mov RAX,RDI; mul RSI; ret; } }
 GCC's inline assembler apparently has no knowledge of what
 the opcodes actually do.
Agreed.
This is the basis of my assertion it is a step backwards. Granted, it has some nice capability as you've demonstrated. But it sure makes you suffer to get it.
Apr 12 2016
next sibling parent Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 08:22, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 4:29 PM, Marco Leise wrote:
 In practice GDC will just replace the invocation with a single
 'mul' instruction while DMD will emit a call to this 18
 instructions long function. Now you keep telling me extended
 assembly is a step backwards. :)
DMD version: DblWord bigMul(ulong x, ulong y) { naked asm { mov RAX,RDI; mul RSI; ret; } }
In fact the "correct" version of "mul eax" is:

    asm { "mul{l} {%%}eax" : "=a" var : "a" var; }

- Works with both dialects (Intel and ATT)
- Compiler knows the first register ("a") is read and written to, so doesn't keep temporaries stored there.
- Compiler loads the variable "var" into EAX before the statement is executed.
- Compiler knows that the value of "var" is in EAX after the statement is finished.

http://goo.gl/64SSD5

Just toggle on/off intel syntax to see the difference. :-)

I can agree that the way that instruction (or insn) templates look is pretty ugly. But IMO, for the most part on x86 their ugliness is attributed to having to support two types of assembler syntax at once.
Apr 13 2016
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 23:22:37 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

            "mulq %[y]"
            : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;  
I don't see anything elegant about those lines, starting with "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically. I can see nothing to recommend the: "=a" tmp.lo syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-) I have no idea what: "a" x and: [y] "rm" y mean, nor why the ":" appears sometimes and the "," other times.
Tell me again, what's more elegant!

    uint* pnb = cast(uint*)cf.processorNameBuffer.ptr;
    version(GNU)
    {
        asm { "cpuid" : "=a" pnb[0], "=b" pnb[1], "=c" pnb[ 2], "=d" pnb[ 3] : "a" 0x8000_0002; }
        asm { "cpuid" : "=a" pnb[4], "=b" pnb[5], "=c" pnb[ 6], "=d" pnb[ 7] : "a" 0x8000_0003; }
        asm { "cpuid" : "=a" pnb[8], "=b" pnb[9], "=c" pnb[10], "=d" pnb[11] : "a" 0x8000_0004; }
    }
    else version(D_InlineAsm_X86)
    {
        asm pure nothrow @nogc {
            push ESI;
            mov ESI, pnb;
            mov EAX, 0x8000_0002;
            cpuid;
            mov [ESI], EAX;
            mov [ESI+4], EBX;
            mov [ESI+8], ECX;
            mov [ESI+12], EDX;
            mov EAX, 0x8000_0003;
            cpuid;
            mov [ESI+16], EAX;
            mov [ESI+20], EBX;
            mov [ESI+24], ECX;
            mov [ESI+28], EDX;
            mov EAX, 0x8000_0004;
            cpuid;
            mov [ESI+32], EAX;
            mov [ESI+36], EBX;
            mov [ESI+40], ECX;
            mov [ESI+44], EDX;
            pop ESI;
        }
    }
    else version(D_InlineAsm_X86_64)
    {
        asm pure nothrow @nogc {
            push RSI;
            mov RSI, pnb;
            mov EAX, 0x8000_0002;
            cpuid;
            mov [RSI], EAX;
            mov [RSI+4], EBX;
            mov [RSI+8], ECX;
            mov [RSI+12], EDX;
            mov EAX, 0x8000_0003;
            cpuid;
            mov [RSI+16], EAX;
            mov [RSI+20], EBX;
            mov [RSI+24], ECX;
            mov [RSI+28], EDX;
            mov EAX, 0x8000_0004;
            cpuid;
            mov [RSI+32], EAX;
            mov [RSI+36], EBX;
            mov [RSI+40], ECX;
            mov [RSI+44], EDX;
            pop RSI;
        }
    }

-- Marco
Apr 16 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2016 2:40 PM, Marco Leise wrote:
 Tell me again, what's more elegant!
If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like GNU version.
Apr 16 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 16 Apr 2016 21:46:08 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/16/2016 2:40 PM, Marco Leise wrote:
 Tell me again, what's more elegant!
If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like GNU version.
I hate the many pitfalls of extended asm: Forget to mention a side effect in the "clobbers" list and the compiler assumes that register or memory location still holds the value from before the asm. Have an _input_ reg clobbered? Must NOT name it in the clobber list but use it as a dummy output with a dummy variable assignment. The learning curve is steep and as you said, usually unintelligible without prior knowledge. But what I really miss from the last generation of inline assemblers are these points:

1. In most cases you can make the asm transparent to the optimizer leading to:
   1.a Inlining of asm
   1.b Dead-code removal of asm blocks

2. Asm template arguments (e.g. input variables) are bound via constraints:
   2.a Can use output constraint `"=a" var` to mean any of "AL", "AX", "EAX" or "RAX" depending on size of 'var'
   2.b `"r" ptr` can bind 32-bit and 64-bit pointers, often eliminating the need for duplicate asm blocks that only differ in one mention of e.g. RSI vs. ESI.
   2.c The compiler seamlessly integrates host code variables with the asm. No need to manually pick tmp registers to move parameters and output. `"r" myUint` is all it takes for 'myUint' to end up in any of EAX, EDX, ... (whatever the register allocator deems efficient at that point)
   2.d As a net result, asm templates often reduce to a single mnemonic and work with X86, X32 and AMD64.

3. In DMD I often see "naked" used to get rid of function prolog and epilog in an attempt to get an intrinsic-like, fast function. This requires extra care to get the calling convention right and may require more code duplication for e.g. Win32. Asm templates in GCC and LLVM benefit from this speedup automatically, because the backend will remove unneeded prolog/epilog code and even inline small functions.

GCC's historically grown template syntax based on multiple _external_ assembler backends ain't that great and it is a PITA that it cannot understand the mnemonics and figure out side effects itself like DMD.

But I hope I could highlight a few points where classic assemblers as found in Delphi or DMD fall behind in modern convenience and native efficiency. When C was invented it matched the CPUs quite well, but today we have dozens of instructions that C and D syntax has no expression for. All modern compilers devote a considerable amount of backend code to the task of pattern matching code constructs like a layman's POPCNT and replacing them with optimal CPU instructions. More and more we turn to browsing the list of readily available compiler built-ins first, and the next step is to acknowledge the need and make inline assemblers powerful enough for programmers to efficiently implement non-existing intrinsics in library code.

-- Marco
Apr 17 2016
prev sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 12 April 2016 at 22:22, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 9:53 AM, Marco Leise wrote:
 Your look on GCC (and LLVM) may be a bit biased. First of all
 you don't need to tell it exactly which registers to use. A
 rough classification is enough and gives the compiler a good
 idea of where calculations should be stored upon arrival at
 the asm statement. You can be specific down to the register
 name or let the backend chose freely with "rm" (= any register
 or memory).
 An example: We have a variable x that is computed inside a
 function followed by an asm block that multiplies it with
 something else. Typically you would "MOV EAX, [x]" to load x
 into the register that the MUL instruction expects. With
 extended assemblers you can be declarative about that and just
 state that x is needed in EAX as an input. You drop the MOV
 from the asm block and let the compiler figure out in its
 codegen, how x will end up in EAX. That's a step FORWARD in
 usability.
It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
asm { "mul eax"; } - That wasn't so difficult. :-)

I don't know if D data and calling functions from DMD-IASM is safe (it is in GDC extended IASM). But I have always chosen the path that requires the least amount of maintenance burden/overhead. And I'm sorry to say that supporting GCC-style extended assembler both comes for free (handling is managed by the middle-end), and requires no platform-specific support on the language implementation side.

However, I have always considered comparing the two a bit like apples and oranges. DMD compiles to object code, so it makes sense to me that you have an entire assembler embedded in. However GDC compiles to assembly, and I expect that GNU As will know a lot more about what opcodes actually do on, say a Motorola 68k, than the poor man's parser I would be able to write.

There were a lot of challenges supporting DMD-style IASM, all non-existent in DMD. Drawing a list off the top of my head - I'll let you decide whether IASM is pro or con in this area, but again bear in mind that DMD doesn't have to deal with calling an external assembler.

- What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
- Some opcodes in IASM have a different name in the assembler (Emitted fdivrp as fdivp, and fdivp as fdivrp. No idea why but I recall std.math didn't work without the translation).
- Some opcodes are actually directives in disguise (db, ds, dw, ...)
- Frame-relative addressing/displacement of a symbol before the backend has decided where incoming parameters will land is a good way to get hit by a truck.
- GCC backend doesn't support naked functions on x86.
  - Or even in the sense that DMD supports naked functions where there is support (only plain text assembler allowed)
- Want to support ARM? MIPS? PPC? At the time when GDC supported DMD-style IASM for x86, the implementation was over 3000 LOC, adding platform support just looked like an unmanageable nightmare.
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote:
 It's a step backwards because I can't just say "MUL EAX". I have to tell GCC
 what register the result gets put in. This is, to my mind, ridiculous. GCC's
 inline assembler apparently has no knowledge of what the opcodes actually
 do.
asm { "mul eax"; } - That wasn't so difficult. :-)
My understanding is that is not sufficient if you want gcc to track register usage, etc. I could be wrong, I found the documentation on how the gcc inline assembler works to be impossible to figure out what was required and what wasn't. I'd just look at existing examples and modify to suit :-(
 I don't know if D data and calling functions from DMD-IASM is safe
I don't know what you mean by 'safe' in this context. If you follow the ABI it should work.
 (it
 is in GDC extended IASM).  But I have always chosen the path that
 requires the least amount of maintenance burden/overhead.  And I'm
 sorry to say that supporting GCC-style extended assembler both comes
 for free (handling is managed by the middle-end), and requires no
 platform-specific support on the language implementation side.
Your decision makes sense.
 However, I have always considered comparing the two a bit like apples
 and oranges.  DMD compiles to object code, so it makes sense to me
 that you have an entire assembler embedded in.  However GDC compiles
 to assembly, and I expect that GNU As will know a lot more about what
 opcodes actually do on, say a Motorola 68k, than the poor man's parser
 I would be able to write.

 There were a lot of challenges supporting DMD-style IASM, all
 non-existent in DMD.  Drawing a list off the top of my head - I'll let
 you decide whether IASM is pro or con in this area, but again bear in
 mind that DMD doesn't have to deal with calling an external assembler.

 - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
 - Some opcodes in IASM have a different name in the assembler (Emitted
 fdivrp as fdivp, and fdivp as fdivrp. No idea why but I recall
 std.math didn't work without the translation).
DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide.
 - Some opcodes are actually directives in disguise (db, ds, dw, ...)
 - Frame-relative addressing/displacement of a symbol before the
 backend has decided where incoming parameters will land is a good way
 to get hit by a truck.
 - GCC backend doesn't support naked functions on x86.
 - Or even in the sense that DMD supports naked functions where there
 is support (only plain text assembler allowed)
 - Want to support ARM? MIPS? PPC?  At the time when GDC supported
 DMD-style IASM for x86, the implementation was over 3000 LOC, adding
 platform support just looked like an unmanageable nightmare.
I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) and it is not terribly important. But core.cpuid needs to be made to work in GDC, whatever it takes to do so. ---- Personally, I strongly dislike the fact that the GAS syntax is the reverse of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It makes my eyeballs hurt looking at it. It's like giving me a car with the brake and gas pedals reversed. Nothing but accidents result :-) And I don't like that they use different opcodes than the Intel manuals. That just sux, too. But I know that the GNU world is stuck with that, and GDC should behave like the rest of GCC.
Apr 12 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote:
 - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
 - Some opcodes in IASM have a different name in the assembler (Emitted
 fdivrp as fdivp, and fdivp as fdivrp. No idea why but I recall
 std.math didn't work without the translation).
DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide.
My only point was that in GDC, the translation of opcodes to machine code is done in two steps by two separate processes, rather than one. DMD is proof that the benefit of having unified syntax is a big win.
 - Some opcodes are actually directives in disguise (db, ds, dw, ...)
 - Frame-relative addressing/displacement of a symbol before the
 backend has decided where incoming parameters will land is a good way
 to get hit by a truck.
 - GCC backend doesn't support naked functions on x86.
 - Or even in the sense that DMD supports naked functions where there
 is support (only plain text assembler allowed)
 - Want to support ARM? MIPS? PPC?  At the time when GDC supported
 DMD-style IASM for x86, the implementation was over 3000 LOC, adding
 platform support just looked like an unmanageable nightmare.
I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) and it is not terribly important. But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
Indeed, it's been on my TODO list for a long time, among many other things. :-)
 ----

 Personally, I strongly dislike the fact that the GAS syntax is the reverse
 of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It
 makes my eyeballs hurt looking at it. It's like giving me a car with the
 brake and gas pedals reversed. Nothing but accidents result :-) And I don't
 like that they use different opcodes than the Intel manuals. That just sux,
 too.
Like riding a backwards bicycle. :-) https://www.youtube.com/watch?v=MFzDaBzBlL0
 But I know that the GNU world is stuck with that, and GDC should behave like
 the rest of GCC.
Yeah, and I'm glad that you do.
Apr 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 09:51:25 +0200
schrieb Iain Buclaw via Digitalmars-d
<digitalmars-d puremagic.com>:

 On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
  
Indeed, it's been on my TODO list for a long time, among many other things. :-)
Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- Marco
Apr 13 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 11:13, Marco Leise via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 Am Wed, 13 Apr 2016 09:51:25 +0200
 schrieb Iain Buclaw via Digitalmars-d
 <digitalmars-d puremagic.com>:

 On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
Indeed, it's been on my TODO list for a long time, among many other things. :-)
Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- Marco
Yes, cpu_supports is a good way to do it as we only need to invoke __builtin_cpu_init once and cache all values when running 'shared static this()'. I would also like to be able to support other processors too. ARM is a high priority one which should follow suit.
Apr 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 11:21:35 +0200
schrieb Iain Buclaw via Digitalmars-d
<digitalmars-d puremagic.com>:

 Yes, cpu_supports is a good way to do it as we only need to invoke
 __builtin_cpu_init once and cache all values when running 'shared
 static this()'.
I was under the assumption that GCC already emits an 'early' static ctor with a call to __builtin_cpu_init(). It is also likely that we don't need extra code to copy GCC's cache to core.cpuid's cache (unless the cached data is publicly exposed somehow).

What is your stance on the cross module inlining issue? Stuff like hasPopcnt etc. won't be inlined unless you turn them into compiler recognised builtins, right? It's not a blocker, but something to keep in mind when not accessing global variables directly. How about this style as an alternative?:

    immutable bool mmx;
    immutable bool hasPopcnt;

    shared static this()
    {
        import gcc.builtins;
        mmx       = __builtin_cpu_supports("mmx"   ) > 0;
        hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
    }

-- Marco
Apr 13 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
Please do not invent an alternative interface, use the one in core.cpuid:
Apr 13 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 04:14:48 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
  
Please do not invent an alternative interface, use the one in core.cpuid:
Yes, they are all @property functions, and substituting direct access to the globals would work around GDC's lack of cross-module inlining. Otherwise these feature checks, which might be used in hot code, are more costly than they should be. I hate when things get in the way of efficiency. :) -- Marco
Apr 13 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/13/2016 5:47 AM, Marco Leise wrote:
 Yes, they are all  property and a substitution with direct
 access to the globals will work around GDC's lack of
 cross-module inlining. Otherwise these feature checks which
 might be used in hot code, are more costly than they should be.
 I hate when things get in the way of efficiency. :)
It doesn't need to be efficient, because such checks should be done at a higher level in the program's logic, not on low level code. Even so, the program could cache the result of the call.
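A minimal sketch of that kind of caching, assuming the existing core.cpuid properties mmx and hasPopcnt; the cached* names are invented for this example. The program queries core.cpuid once at startup and keeps the answers in its own module-level immutables, so hot code reads a plain global and never depends on the property call being inlined.

```d
import core.cpuid : hasPopcnt, mmx;

// Hypothetical program-level cache of two core.cpuid feature flags.
immutable bool cachedMmx;
immutable bool cachedPopcnt;

shared static this()
{
    // Runs once before main(); immutable globals may be initialized here.
    cachedMmx    = mmx;
    cachedPopcnt = hasPopcnt;
}

// Hot code reads the cached global instead of calling into core.cpuid.
string popcntStrategy()
{
    return cachedPopcnt ? "hardware" : "software";
}
```

This keeps the core.cpuid interface as the single source of truth while giving user code the direct global access asked for above.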
Apr 13 2016
prev sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 13:14, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
Please do not invent an alternative interface, use the one in core.cpuid:
An alternative interface needs to be invented anyway for other CPUs.
Apr 14 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/14/2016 1:21 AM, Iain Buclaw via Digitalmars-d wrote:
 An alternative interface needs to be invented anyway for other CPUs.
That would be fine. But there is no reason to redo core.cpuid for x86 machines.
Apr 14 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?
Target cpu configuration: - CPU architecture (done)
Done.
 - Count of FP/Integer registers
??
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
 - Compiler optimization options (for math)
Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
Apr 04 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:
 On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?
Target cpu configuration: - CPU architecture (done)
Done.
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD floating point registers, and SIMD integer registers does the CPU have?
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
This is not enough. The code needs to know at compile time whether it is AVX or AVX2 (these cases may need completely different source code).
 - Compiler optimization options (for math)
Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
We have LDC and GDC. And it looks like a little bit of standardization based on DMD would be good, even if this would be useless for DMD. With compile time information about the CPU it is possible to always have a fast generic BLAS for any target as soon as LLVM is released for this target. D+LLVM = fast generic BLAS. For DMD and GDC there would be target-specific BLAS optimizations. The OpenBLAS kernels are 30 MB of assembler code! So we would be able to replace it once and for a very long time with Phobos. Best regards, Ilya
Apr 04 2016
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:
 OpenBLAS kernels is 30 MB of assembler code! So we would be 
 able to replace it once and for a very long time with Phobos.
Are you familiar with this project at all? https://github.com/flame/blis
Apr 04 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 21:13:30 UTC, jmh530 wrote:
 On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:
 OpenBLAS kernels is 30 MB of assembler code! So we would be 
 able to replace it once and for a very long time with Phobos.
Are you familiar with this project at all? https://github.com/flame/blis
Thanks for the link. BLIS has the same issue as OpenBLAS - a collection of kernels for each target. I want to write an internal kernel compiler (like CT regex) that will build kernels based on CT information about the target. Best regards, Ilya
Apr 04 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 2:05 PM, 9il wrote:
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD floating point registers, and SIMD integer registers does the CPU have?
These are deducible from X86, X86_64, and SIMD version identifiers.
 Needs to know is it AVX or AVX2 in compile time
Since the compiler never generates AVX or AVX2 instructions, there is no purpose to setting such as a predefined version identifier. You might as well use a: -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and default generate code specific to that CPU.
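A minimal sketch of the user-defined version identifier approach, assuming the build passes -version=AVX when the AVX path is wanted. AVX here is just the user-chosen name from the switch above, not something the compiler sets on its own, and kernelName and describeKernel are invented for this example.

```d
// AVX is a user-defined identifier; it is only set when the build
// passes -version=AVX on the command line.
version (AVX)
    enum string kernelName = "avx";
else
    enum string kernelName = "generic";

// Hypothetical helper showing that the choice is fixed at compile time.
string describeKernel()
{
    return "using the " ~ kernelName ~ " kernel";
}
```

Built without the switch, the else branch is the one compiled in; the other branch only has to parse.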
 (this may be completely different source code for this cases).
It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.
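A minimal sketch of such runtime switching, assuming core.cpuid's avx2 property for detection. Both kernels are hypothetical stand-ins written in plain D; in real code the second would contain the AVX2-specific implementation. The selection happens once at startup through a function pointer.

```d
import core.cpuid : avx2;

// Baseline kernel (hypothetical).
float sumGeneric(const(float)[] a)
{
    float s = 0;
    foreach (x; a)
        s += x;
    return s;
}

// Stand-in for a CPU-specific kernel (hypothetical); here it is only
// unrolled by four, real code would use AVX2-specific instructions.
float sumUnrolled(const(float)[] a)
{
    float[4] acc = 0;
    size_t i = 0;
    for (; i + 4 <= a.length; i += 4)
    {
        acc[0] += a[i];
        acc[1] += a[i + 1];
        acc[2] += a[i + 2];
        acc[3] += a[i + 3];
    }
    float s = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < a.length; ++i)
        s += a[i];
    return s;
}

alias SumFn = float function(const(float)[]);

// Both kernels are linked in; the pointer is chosen once before main().
immutable SumFn sumImpl;

shared static this()
{
    sumImpl = avx2 ? &sumUnrolled : &sumGeneric;
}
```

Callers just write sumImpl(data) and pay one indirect call; nothing about the target CPU has to be known when the program is compiled.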
 We have LDC and GDC. And looks like a little bit standardization based on DMD
 would be good, even if this would be useless for DMD.
There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.
 With compile time information about CPU it is possible to always have fast
 generic BLAS for any target as soon as LLVM is released for this target.
The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Apr 04 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 22:34:06 UTC, Walter Bright wrote:
 On 4/4/2016 2:05 PM, 9il wrote:
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD floating point registers, and SIMD integer registers does the CPU have?
These are deducible from X86, X86_64, and SIMD version identifiers.
It is impossible to deduce from that combination that Xeon Phi has 32 FP registers.
 Needs to know is it AVX or AVX2 in compile time
Since the compiler never generates AVX or AVX2 instructions, there is no purpose to setting such as a predefined version identifier. You might as well use a: -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and default generate code specific to that CPU.
"Since the compiler never generates AVX or AVX2" - this is definitely not true, see, for example, LLVM vectorization and SLP vectorization. This is a normal situation for scientific software, supercomputer software, and high-performance server applications.
 (this may be completely different source code for this cases).
It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.
This approach is complex, and normal for desktop applications. If you have a big cluster of similar computers or you have a supercomputer cluster, the only thing you want to do is `-mcpu=native`/ `-march=native`. And this single compiler flag should be enough to build a high-performance linear algebra application.
 We have LDC and GDC. And looks like a little bit 
 standardization based on DMD
 would be good, even if this would be useless for DMD.
There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.
I just want a unified mechanism to get CT information about the target and optimization switches. It is OK if this information is exposed through different switches on different compilers.
 With compile time information about CPU it is possible to 
 always have fast
 generic BLAS for any target as soon as LLVM is released for 
 this target.
The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Auto-vectorization is only an example (maybe a bad one). I would use SIMD vectors, but I need CT information about the target CPU, because it is impossible to build optimal BLAS kernels without it! My idea is an internal kernel compiler :-) Something similar to compile-time regex, but more complex. Best regards, Ilya
Apr 04 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 11:10 PM, 9il wrote:
 It is impossible to deduce from that combination that Xeon Phi has 32 FP
registers.
Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless.
 "Since the compiler never generates AVX or AVX2" - this is definitely not true,
 see, for example, LLVM vectorization and SLP vectorization.
dmd is not LLVM.
 It's entirely practical to compile code with different source code, link them
 *both* into the executable, and switch between them based on runtime detection
 of the CPU.
This approach is complex,
Not at all. Used to do it all the time in the DOS world (FPU vs emulation).
 I just want a unified mechanism to get CT information about the target and
 optimization switches. It is OK if this information is exposed through different
 switches on different compilers.
Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.
 Auto-vectorization is only an example (maybe a bad one). I would use SIMD vectors, but I
 need CT information about the target CPU, because it is impossible to build optimal
 BLAS kernels without it!
I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
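The `-version=xxx` approach suggested here can be sketched roughly as follows. This is only an illustration: the identifier `UseAVX` is made up for this example, and the "AVX" branch is a placeholder rather than a real hand-tuned kernel.

```d
// Build with:  dmd -version=UseAVX app.d
// (UseAVX is a hypothetical, user-chosen identifier, not a predefined one.)
import std.stdio : writeln;

version (UseAVX)
    enum kernelName = "avx";      // selected only when -version=UseAVX is passed
else
    enum kernelName = "generic";  // default when no switch is given

/// Adds two slices element-wise into dst.
void add(const(float)[] a, const(float)[] b, float[] dst)
{
    assert(a.length == b.length && a.length == dst.length);
    version (UseAVX)
    {
        // Placeholder: a real build would call an AVX-specific kernel here.
        dst[] = a[] + b[];
    }
    else
    {
        dst[] = a[] + b[];
    }
}

void main()
{
    auto dst = new float[3];
    add([1f, 2f, 3f], [10f, 20f, 30f], dst);
    writeln(kernelName, ": ", dst);
}
```

The catch raised elsewhere in the thread still applies: the user has to know which identifier the library expects and pass it by hand.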
Apr 05 2016
next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 It is impossible to deduce from that combination that Xeon Phi 
 has 32 FP registers.
Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless.
 "Since the compiler never generates AVX or AVX2" - this is 
 definitely not true,
 see, for example, LLVM vectorization and SLP vectorization.
dmd is not LLVM.
The particular design and limitations of the dmd backend shouldn't be used to define D. In the extreme, your argument would imply that there's no point having version(ARM) built in to the language, because dmd doesn't support it.
 It's entirely practical to compile code with different source 
 code, link them
 *both* into the executable, and switch between them based on 
 runtime detection
 of the CPU.
This approach is complex,
Not at all. Used to do it all the time in the DOS world (FPU vs emulation).
 I just want a unified mechanism to get CT information 
 about the target and optimization switches. It is OK if this 
 information is exposed through different switches on different compilers.
Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.
 Auto-vectorization is only an example (maybe a bad one). I would use 
 SIMD vectors, but I
 need CT information about the target CPU, because it is impossible 
 to build optimal
 BLAS kernels without it!
I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
So you're suggesting that libraries invent their own list of versions for specific architectures / CPU features, which the user then has to specify somehow on the command line? I want to be able to write code that uses standardised versions that work across various D compilers, with the user only needing to type e.g. -march=native on GDC and get the fastest possible code.
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 2:03 AM, John Colvin wrote:
 So you're suggesting that libraries invent their own list of versions for
 specific architectures / CPU features, which the user then has to specify
 somehow on the command line?
 I want to be able to write code that uses standardised versions that work
across
 various D compilers, with the user only needing to type e.g. -march=native on
 GDC and get the fastest possible code.
There's a line between trying to standardize everything and letting add-on libraries be free to innovate.

Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseam over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)

My experience with command line FPU switches is that few users understand what they do and even fewer use them correctly.

In fact, I suspect that having a command line FPU switch is too global a hammer. A pragma set in just the functions that need it might be much better.

-------

In any case, this is not a blocker for getting the library designed, built and debugged.
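The "n modules plus runtime detection" scheme can be sketched like this. It is a minimal illustration, not how druntime's array ops are actually written: the names `sumGeneric`, `sumSSE2`, `sumAVX2` and `pickSum` are invented here, and the SIMD variants are stand-ins that simply forward to the generic loop. Only the `core.cpuid` queries are real druntime calls.

```d
import core.cpuid : avx2, sse2;
import std.stdio : writeln;

alias SumFn = float function(const(float)[] data);

/// Portable fallback.
float sumGeneric(const(float)[] data)
{
    float acc = 0;
    foreach (x; data)
        acc += x;
    return acc;
}

// Stand-ins: in a real build each of these would live in its own module
// and contain code written for that instruction set.
float sumSSE2(const(float)[] data) { return sumGeneric(data); }
float sumAVX2(const(float)[] data) { return sumGeneric(data); }

/// Picks the best variant once, based on what the running CPU reports.
SumFn pickSum()
{
    if (avx2) return &sumAVX2;
    if (sse2) return &sumSSE2;
    return &sumGeneric;
}

void main()
{
    auto sum = pickSum();
    writeln(sum([1f, 2f, 3f, 4f]));
}
```

The cost of this design is one indirect call per kernel invocation and a larger binary, which is the trade-off the rest of the thread argues about.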
Apr 05 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 On 4/5/2016 2:03 AM, John Colvin wrote:
 There's a line between trying to standardize everything and 
 letting add-on libraries be free to innovate.

 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently implemented.)

 My experience with command line FPU switches is few users 
 understand what they do and even fewer use them correctly.

 In fact, I suspect that having a command line FPU switch is too 
 global a hammer. A pragma set in just the functions that need 
 it might be much better.
What is wrong with a scientist writing `-mcpu=native`?
 -------

 In any case, this is not a blocker for getting the library 
 designed, built and debugged.
Yes, but it is a bad idea to have a separate set of versions just for Phobos, isn't it? Ilya
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 4:17 AM, 9il wrote:
 What is wrong with a scientist writing `-mcpu=native`?
Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
Apr 05 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 00:45:54 UTC, Walter Bright wrote:
 On 4/5/2016 4:17 AM, 9il wrote:
 What is wrong with a scientist writing `-mcpu=native`?
Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
99.99% of them do not need to compile code with different settings. Furthermore, 90% of them don't know what CPU their supercomputer has. They just want the fastest possible code without googling which instructions are available on the CPU.
Apr 05 2016
prev sibling parent reply Joe Duarte <jose.duarte asu.edu> writes:
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently implemented.)
There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions. In these settings -- many of them scientific compute or big data center operators -- they know what servers they have, what CPU platforms they have. They don't care about portability to the past, older computers and so forth. A runtime check would make no sense for them, not for their baseline, and it would probably be a waste of time for them to design code to run on pre-AVX silicon. (AVX is not new anymore -- it's been around for a few years.)

Good examples can be found on Cloudflare's blog, especially Vlad Krasnov's posts. Here's one where he accelerates Golang's crypto libraries: https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/ Companies like CF probably spend millions of dollars on electricity, and there are some workloads where AVX-optimized code can yield tangible monetary savings.

Someone else talked about marking "Broadwell" and other generation names. As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory accelerating instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Software Guard Extensions (SGX) instructions, which are very powerful for creating secure enclaves on an untrusted host.

On the broader SIMD-as-first-class-citizen issue, I think it would be worth thinking about how to bake SIMD into the language instead of bolting it on. If I were designing a new language in 2016, I would take a fresh look at how SIMD could be baked into a language's core constructs. I'd think about new loop abstractions that could make SIMD easier to exploit, and how to nudge programmers away from serial monotonic mindsets and into more of a SIMD/FMA way of reasoning.
Apr 17 2016
next sibling parent Temtaime <temtaime gmail.com> writes:
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently 
 implemented.)
There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions.
In addition, this is the COMPILER's job, not the programmer's! The compiler SHOULD be able to vectorize the code using SSE/AVX depending on a command line switch. Why should I write all this crap? Let the compiler do its work. Also, a compiler CAN generate multiple versions of one function using different SIMD instructions: the Intel C++ Compiler works this way. It generates a few versions of a function, checks the CPU capabilities at run time, and executes the fastest one.
Apr 17 2016
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 
 Someone else talked about marking "Broadwell" and other 
 generation names. As others have said, it's better to specify 
 features. I wanted to chime in with a couple of additional 
 examples. Intel's transactional memory accelerating 
 instructions (TSX) are only available on some Broadwell parts 
 because there was a bug in the original implementation (Haswell 
 and early Broadwell) and it's disabled on most. But the new 
 Broadwell server chips have it, and it's a big deal for some DB 
 workloads. Similarly, only some Skylake chips have the Software 
 Guard Extensions (SGX) instructions, which are very powerful for creating 
 secure enclaves on an untrusted host.
Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
Apr 23 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 23 Apr 2016 10:40:12 +0000
schrieb Johan Engelen <j j.nl>:

 I have a question perhaps you can comment on?
 With LLVM, it is possible to specify something like "+sse3,-sse2" 
 (I did not test whether this actually results in SSE3 
 instructions being used, but no SSE2 instructions). What should 
 be returned when querying whether "sse3" feature is enabled?
 Should __traits(targetHasFeature, "sse3") == true mean that 
 implied features (such as sse and sse2) are also available?
Please do test it. Activating sse3 and disabling sse2 likely causes the compiler to silently re-enable sse2 as a dependency or error out. -- Marco
Apr 23 2016
prev sibling parent Joe Duarte <jose.duarte asu.edu> writes:
On Saturday, 23 April 2016 at 10:40:12 UTC, Johan Engelen wrote:
 On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 
 Someone else talked about marking "Broadwell" and other 
 generation names. As others have said, it's better to specify 
 features. I wanted to chime in with a couple of additional 
 examples. Intel's transactional memory accelerating 
 instructions (TSX) are only available on some Broadwell parts 
 because there was a bug in the original implementation 
 (Haswell and early Broadwell) and it's disabled on most. But 
 the new Broadwell server chips have it, and it's a big deal 
 for some DB workloads. Similarly, only some Skylake chips have 
 the Software Guard Extensions (SGX) instructions, which are very powerful 
 for creating secure enclaves on an untrusted host.
Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
If you specify SSE3, you should definitely get SSE2 and plain old SSE with it. SSE3 is a superset of SSE2 and includes all the SSE2 instructions (more than 100 I think.) I'm not sure about your syntax – I thought the hyphen meant to include the option, not remove it, and I haven't seen the addition sign used for those settings. But I haven't done much with those optimization flags. You wouldn't want to exclude SSE2 support because it's becoming the bare minimum baseline for modern systems, the de facto FP unit. Windows 10 requires a CPU with SSE2, as do more and more applications on the archaic Unix-like platforms.
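The "sse3 implies sse2 implies sse" question can be made concrete with a small compile-time sketch. Everything here is hypothetical: `sseChain` and `withImplied` are invented names, and the ordering of levels is an assumption for illustration, not something taken from LLVM, GCC or any D compiler.

```d
import std.algorithm.searching : canFind;

// Assumed ordering: each level implies all levels before it.
static immutable string[] sseChain =
    ["sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2"];

/// Expands one requested level into the full list it implies.
string[] withImplied(string requested)
{
    string[] result;
    foreach (name; sseChain)
    {
        result ~= name;
        if (name == requested)
            return result;
    }
    return [requested]; // not in the chain: nothing is implied
}

// Evaluated at compile time (CTFE), as a feature query would be.
enum enabled = withImplied("sse3");
static assert(enabled.canFind("sse2"));    // implied by sse3
static assert(!enabled.canFind("sse4.1")); // newer level, not implied

void main() {}
```

Under this model a query for "sse2" returns true whenever "sse3" was requested, which matches the answer above; how a request like "+sse3,-sse2" should be treated is left open here, as it is in the thread.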
May 02 2016
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' 
 on the command line and then switch off that version in your 
 custom code.
I can do that; however, I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong. Ilya
Apr 05 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' on the command
 line and then switch off that version in your custom code.
I can do that; however, I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
Apr 05 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplify the user 
 experience.
 3. This is possible and not very hard to implement if I am not 
 wrong.
Where does the compiler get the information that it should compile for, say, AFX?
No idea about AFX. Did you choose AFX to prevent me from finding an example? You know better than I do that GCC- and LLVM-based compilers have options like march, mcpu, mtarget, mtune and others, and that things like `-mcpu=native` or `-march=native` are allowed. Ilya
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 4:07 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplify the user experience.
 3. This is possible and not very hard to implement if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
 No idea about AFX. Did you choose AFX to prevent me from finding an example?
I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.
Apr 05 2016
next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 
 I want to make it clear that dmd does not generate AFX specific 
 code, has no switch to enable AFX code generation and has no 
 basis for setting predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Apr 05 2016
next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 21:41:46 UTC, Johan Engelen wrote:
 On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 
 I want to make it clear that dmd does not generate AFX 
 specific code, has no switch to enable AFX code generation and 
 has no basis for setting predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Yes, something like that is what I am looking for. Two nitpicks: 1. __target("broadwell") is not a good API. Something like this would be more efficient: enum target = __target(); // .. use target 2. Is it possible to reflect additional settings about the instruction set? Maybe "broadwell,-avx"?
Apr 05 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 I want to make it clear that dmd does not generate AFX specific code, has
 no switch to enable AFX code generation and has no basis for setting
 predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
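To make the maintenance burden concrete, here is a sketch of what such a compile-time model-to-feature table might look like in user code. The function name `featuresOf` is invented, and the table is deliberately tiny and incomplete; it is an illustration of the problem, not a proposed API.

```d
import std.algorithm.searching : canFind;

/// Hypothetical lookup from CPU model name to feature names.
string[] featuresOf(string cpu)
{
    switch (cpu)
    {
        case "nehalem":
            return ["sse2", "sse3", "ssse3", "sse4.1", "sse4.2"];
        case "haswell":
        case "broadwell":
            return ["sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "avx", "avx2"];
        default:
            return []; // unknown model: the table has nothing to say
    }
}

// What code would have to do if the compiler only reported a model name.
enum cpu = "broadwell";
static assert(featuresOf(cpu).canFind("avx2"));

// Any model missing from the table silently yields no features at all.
static assert(featuresOf("some-new-arm-core").length == 0);

void main() {}
```

Every new chip means another `case`, and a missing entry degrades to "no features" without any warning, which is why querying the feature directly is preferable.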
Apr 06 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya
Apr 06 2016
next sibling parent reply Johan Engelen <j j.nl> writes:
On Wednesday, 6 April 2016 at 13:26:51 UTC, 9il wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya
After browsing through some LLVM code, I think it is actually very easy for LDC to also tell you which features (sse2, avx, etc.) a target supports. Probably the most difficult part is defining an API. Ilya made a start here: http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org (but he doesn't like his earlier API "bool a = __target("broadwell")" any more ;-P , I also think enum cpu = __target(); would be nicer)
Apr 06 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 14:31:58 UTC, Johan Engelen wrote:
 Probably the most difficult part is defining an API. Ilya made 
 a start here:
 http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org
 (but he doesn't like his earlier API "bool a = 
 __target("broadwell")" any more ;-P , I also think enum cpu = 
 __target(); would be nicer)
Ahaha)) --Ilya
Apr 06 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 6 April 2016 at 23:26, 9il via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya
Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to a random manufacturer's arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship comparing intel and AMD, let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.
Apr 06 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:
 Sure, but it's an ongoing maintenance task, constantly requiring
 population with metadata for new processors that become available.
 Remember, most processors are arm processors, and there are like 20
 manufacturers of arm chips, and many of those come in a series of
 minor variations with/without sub-features present, and in a lot of
 cases, each permutation of features attached to a random manufacturer's
 arm chip 'X' doesn't actually have a name to describe it. It's also
 completely impractical to declare a particular arm chip by name when
 compiling for arm. It's a sloppy relationship comparing intel and AMD
 let alone the myriad of arm chips available.
 TL;DR, defining architectures with an intel-centric naming convention
 is a very bad idea.
You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
Apr 06 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 6 Apr 2016 20:29:21 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:
 TL;DR, defining architectures with an intel-centric naming convention
 is a very bad idea.  
You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
We can either define the language in terms of CPU models or features, and Manu gave two good reasons to go with features: 1) Typically we end up with "version(SSE4)" and similar in our code, not "version(Haswell)". 2) On ARM chips it turns out to be difficult to translate models to features to begin with. It wasn't a good or bad case for the feature in general. That said, in the long run Dlang should grow said language. Aside from scientific servers there are also a few Linux distributions that compile and install most packages from sources, and telling the compiler to target the host CPU comes naturally there. In practice there is likely some config file that sets an environment variable like CFLAGS to "-march=native" on such systems. I understand that DMD doesn't concern itself with all that, but the D language itself, of which DMD is one implementation, should not artificially be limited compared to popular C/C++ compilers. I died a bit on the inside when I saw Phobos add both popcnt and _popcnt, of which the latter is the version that uses the POPCNT instruction found in newer x86 CPUs. In GCC or LLVM, when we use such an intrinsic, the compiler will take a look at the compilation target and pick the optimal code at compile-time. In one micro-benchmark [1], POPCNT was roughly 50 times faster than bit-twiddling. If I wanted an SSE4 version in otherwise generic amd64 code, I would add attribute("target", "+sse4") before the function using popcnt. So in my eyes a system like GCC offers, where you can specify target features on the command line and also override them for specific functions, is a viable solution that simplifies user code (just picking the popcnt, clz, bsr, ... intrinsic will always be optimal) and Phobos code by making _popcnt et al. superfluous. In addition, the compiler could later error out on mnemonics in our inline assembly that don't exist on the target. This avoids unexpected "Illegal Instruction" crashes. 
[1] http://kent-vandervelden.blogspot.de/2009/10/counting-bits-population-count-and.html -- Marco
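For readers unfamiliar with the "bit-twiddling" fallback mentioned above, here is a minimal sketch of a software population count checked against druntime's `core.bitop.popcnt`. The function name `popcntSoft` is made up for this example; whether `popcnt` itself compiles down to the hardware POPCNT instruction depends on the compiler and target settings, which is exactly the point under discussion.

```d
import core.bitop : popcnt;

/// Portable SWAR population count; needs no special CPU instruction.
uint popcntSoft(uint x)
{
    x = x - ((x >> 1) & 0x5555_5555);                 // 2-bit partial sums
    x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F_0F0F;                 // 8-bit partial sums
    return (x * 0x0101_0101) >> 24;                   // add the four bytes
}

void main()
{
    // The software version must agree with the library routine.
    foreach (v; [0u, 1u, 0xFFu, 0xDEAD_BEEFu, uint.max])
        assert(popcntSoft(v) == popcnt(v));
}
```

With per-function target attributes as described in the post, user code would only ever call one `popcnt` and the compiler would decide between something like the above and the single instruction.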
Apr 11 2016
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 7 Apr 2016 12:25:03 +1000
schrieb Manu via Digitalmars-d <digitalmars-d puremagic.com>:

 On 6 April 2016 at 23:26, 9il via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:  
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:  
 [...]  
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya
Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to a random manufacturer's arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship comparing intel and AMD, let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.
GCC already keeps a cpu <=> feature mapping (after all it needs to know what features it can use when you specify -mcpu) so for GDC exposing available features isn't more difficult than exposing the CPU type. I'm not sure if you can actually enable/disable CPU features manually without -mcpu? However, available features and even the type used to describe the CPU are completely architecture specific in GCC. This means for GDC we have to write custom code for every supported architecture. (We already have to do this for version(Architecture) though). FYI this is handled in the gcc/config subsystem: https://github.com/gcc-mirror/gcc/tree/master/gcc/config #defines for C/ARM: arm_cpu_builtins in https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-c.c (__ARM_NEON__ etc) As you can see the only common requirement for backend architectures is to call def_or_undef_macro. This means we have to modify the gcc/config files and write replacements for arm_cpu_builtins and similar functions. Known ARM cores and feature sets: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-cores.def I guess every backend-architecture has to provide cpu names for -mcpu so that's probably the one thing we could expose to D for all architectures. (Names are of course GCC specific, but I guess LLVM should use compatible names). This is less work to implement in GDC but you'd have to duplicate the GCC feature table in phobos. OTOH standardizing the names and available feature flag means somebody with knowledge in that area has to write down a spec. TLDR: If required we can always expose compiler specific versions (GNU_NEON/LDC_NEON) even without DMD approval/integration. This should be coordinated with LDC though. Somebody has to make a list of needed identifiers, preferably mentioning the matching C macros. Things get much more complicated if you need feature flags not currently used by / present in GCC.
Apr 07 2016
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 On 4/5/2016 4:07 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright 
 wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplified user 
 experience.
 3. This is possible and not very hard to implement if I am 
 not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
No idea about AFX. Do you choose AFX to disallow me to find an example?
I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.
Please keep in mind that D has other compilers, not only DMD. We need a language feature, and I am ok that this feature would be useless for DMD. But the fact that DMD cannot optimize code for, say, AVX, AVX2, AVX-512, FMA4, ..., is not a good reason to reject small language changes that would be very helpful for the D community. Yes, only few of us would use this feature directly, however, many of us would use this under-the-hood in BLAS/SIMD oriented part of Phobos.
Apr 05 2016
parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 6 April 2016 at 06:11:15 UTC, 9il wrote:
 Yes, only few of us would use this feature directly, however, 
 many of us would use this under-the-hood in BLAS/SIMD oriented 
 part of Phobos.
Especially since everyone says to use LDC for the fastest code anyway...
Apr 06 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 5 April 2016 at 20:30, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' on the
 command
 line and then switch off that version in your custom code.
I can do it, however I would like to get this information from compiler. Why?
1. This would help to eliminate configuration bugs.
2. This would reduce work for users and simplified user experience.
3. This is possible and not very hard to implement if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
I would add that GDC and LDC have such compiler flags and it's possible that they could pass the state of those flags through as versions, but all compilers need to agree on the set of versions that will be defined for this purpose. If DMD users express them as -version=[STANDARD_VERSION_NAME], that's fine, I guess, but a proper flag would help avoid the situation where people get the version names wrong, and it feels a little bit more deliberate. Setting a version this way might lead them to presume that it's just an arbitrary setting by the author of the build script, and not actually an agreed standard name that GDC and LDC also produce from their compiler flags. But at very least, the important detail is that the version ID's are standardised and shared among all compilers.
Apr 06 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.
It's a reasonable suggestion; some points:

1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.

2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable.

I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
         ... code ...
    }

Doing it on the command line is certainly the traditional way, but it strikes me as being bug-prone and as unhygienic and obsolete as the C preprocessor is (for similar reasons).
Apr 06 2016
next sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 April 2016 at 10:42, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.
It's a reasonable suggestion; some points: 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
 2. I'm not sure these global settings are the best approach, especially if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.

Runtime selection is not practical in a broad sense. Emitting small fragments of SIMD here and there will probably take a loss if they are all surrounded by a runtime selector. SIMD is all about pipelining, and runtime branches on SIMD version are antithesis to good SIMD usage; they can't be applied for small-scale deployment.

In my experience, runtime selection is desirable for large scale instantiations at an outer level of the work loop. I've tried to design this intent in my library, by making each simd API capable of receiving SIMD version information via template arg, and within the library, the version is always passed through to dependent calls.

The Idea is, if you follow this pattern; propagating a SIMD version template arg through to your outer function, then you can instantiate your higher-level work function for any number of SIMD feature combinations you feel is appropriate.

Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1, it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.
 The main trouble comes about when different modules are
 compiled with different settings. What happens with template code
 generation, when the templates are pulled from different modules? What
 happens when COMDAT functions are generated? (The linker picks one
 arbitrarily and discards the others.) Which settings wind up in the
 executable will be not easily predictable.
In my library design, the baseline simd version (expected from the compiler) is mangled into the symbols, just as in the case a user overrides it when instantiating a code path that may be selected on runtime branch. I had imagined this would solve such link related symbol selection problems. Can you think of cases where this is insufficient?
 I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
         ... code ...
    }

 Doing it on the command line is certainly the traditional way, but it
 strikes me as being bug-prone and as unhygienic and obsolete as the C
 preprocessor is (for similar reasons).
I've done it with a template arg because it can be manually propagated, and users can extrapolate the pattern into their outer work functions, which can then easily have multiple versions instantiated for runtime selection. I think it's also important to mangle it into the symbol name for the reasons I mention above.
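A minimal sketch of the pattern described above, not the actual library: the `SimdVer` enum, the `baseline` constant, and the `mulAdd`/`dot` functions are hypothetical names introduced only for illustration, and the hard-coded baseline stands in for the value that would ideally come from the compiler. It shows the level being passed as a template argument with a default, propagated to dependent calls, and ending up in the mangled symbol name.

```d
// Hypothetical SIMD levels; a real library would derive `baseline` from the
// compiler's target settings, which is the feature requested in this thread.
enum SimdVer { sse2, sse41, avx }
enum SimdVer baseline = SimdVer.sse2;

// Innermost primitive: receives the level as a template argument with a default.
float mulAdd(SimdVer ver = baseline)(float a, float b, float c)
{
    // A real implementation would pick intrinsics via `static if` on `ver`;
    // this placeholder is plain scalar code.
    return a * b + c;
}

// Outer work function: propagates its level to every dependent call.
float dot(SimdVer ver = baseline)(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    float acc = 0;
    foreach (i; 0 .. a.length)
        acc = mulAdd!ver(a[i], b[i], acc);
    return acc;
}

// Different template arguments produce different mangled symbols, so the
// linker cannot merge an SSE2 instance with an AVX instance.
static assert(dot!(SimdVer.sse2).mangleof != dot!(SimdVer.avx).mangleof);

void main()
{
    float[] x = [1, 2, 3], y = [4, 5, 6];
    assert(dot(x, y) == 32);               // baseline instantiation
    assert(dot!(SimdVer.avx)(x, y) == 32); // explicitly selected instantiation
}
```

Callers that never mention `SimdVer` get the baseline everywhere, while an outer work function can be instantiated several times and selected at runtime.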
Apr 06 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does not
 impede writing code that takes advantage of various SIMD code generation at
 compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:

    gdc -simd=AFX foo.d

becomes:

    gdc -simd=AFX -version=AFX foo.d

It's even simpler if you use a makefile variable:

    FPU=AFX

    gdc -simd=$(FPU) -version=$(FPU)

You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)
 2. I'm not sure these global settings are the best approach, especially if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.
It is not necessary to do it that way. Call std.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.
 Runtime selection is not practical in a broad sense. Emitting small
 fragments of SIMD here and there will probably take a loss if they are
 all surrounded by a runtime selector. SIMD is all about pipelining,
 and runtime branches on SIMD version are antithesis to good SIMD
 usage; they can't be applied for small-scale deployment.
 In my experience, runtime selection is desirable for large scale
 instantiations at an outer level of the work loop. I've tried to
 design this intent in my library, by making each simd API capable of
 receiving SIMD version information via template arg, and within the
 library, the version is always passed through to dependent calls.
 The Idea is, if you follow this pattern; propagating a SIMD version
 template arg through to your outer function, then you can instantiate
 your higher-level work function for any number of SIMD feature
 combinations you feel is appropriate.
Doing it at a high level is what I meant, not for each SIMD code fragment.
 Naturally, this process requires a default, otherwise this usage
 baggage will cloud the API everywhere (rather than in the few cases
 where a developer specifically wants to make use of it), and many
 developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
 in my applications, xbox developers would choose AVX1, it's very
 application/target-audience specific, but SSE2 is the only reasonable
 selection if we are not to accept a hint from the command line.
I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set.

Then,

    void app(int simd)() { ... my fabulous app ... }

    int main() {
      auto fpu = core.cpuid.getfpu();
      switch (fpu) {
        case SIMD: app!(SIMD)(); break;
        case SIMD4: app!(SIMD4)(); break;
        default: error("unsupported FPU"); exit(1);
      }
    }
 I've done it with a template arg because it can be manually
 propagated, and users can extrapolate the pattern into their outer
 work functions, which can then easily have multiple versions
 instantiated for runtime selection.
 I think it's also important to mangle it into the symbol name for the
 reasons I mention above.
Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping. And yes, if mangled in as part of the symbol, the linker won't pick the wrong one.
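A compilable sketch of the high-level switch outlined above, under a few assumptions: `core.cpuid.getfpu()` in the outline is shorthand rather than an existing function, so this version queries the `sse2`, `sse41` and `avx2` properties that core.cpuid does provide; the `Simd` enum and the body of `app` are hypothetical placeholders.

```d
import core.cpuid : avx2, sse2, sse41;
import std.stdio : writeln;

// Hypothetical instruction-set tags used as the template parameter.
enum Simd { sse2, sse41, avx2 }

// The entire engine is a template over the instruction set.
int app(Simd level)()
{
    static if (level == Simd.avx2)
        writeln("running the AVX2 instantiation");
    else static if (level == Simd.sse41)
        writeln("running the SSE4.1 instantiation");
    else
        writeln("running the SSE2 instantiation");
    return 0;
}

// One runtime check at the outermost level picks the best instantiation.
int run()
{
    if (avx2)  return app!(Simd.avx2)();
    if (sse41) return app!(Simd.sse41)();
    if (sse2)  return app!(Simd.sse2)();
    writeln("unsupported CPU");
    return 1;
}

int main()
{
    return run();
}
```

The branch is taken once per run, so the cost of runtime selection stays out of the inner loops; the price is one copy of the engine per supported instruction set.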
Apr 06 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 I can understand that it might be demotivating for you, but 
 that is not a blocker. A blocker has no reasonable workaround. 
 This has a trivial workaround:

    gdc -simd=AFX foo.d

 becomes:

    gdc -simd=AFX -version=AFX foo.d

 It's even simpler if you use a makefile variable:

     FPU=AFX

     gdc -simd=$(FPU) -version=$(FPU)
ldc -mcpu=native becomes: ????
 I still don't see how it is a problem to do the switch at a 
 high level. Heck, you could put the ENTIRE ENGINE inside a 
 template, have a template parameter be the instruction set, and 
 instantiate the template for each supported instruction set.

 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
1. Executable size will grow with every instruction set release
2. BLAS already has big executable size

And main:
3. This would not solve the problem for generic BLAS implementation for Phobos at all! How you would force compiler to USE and NOT USE specific vector permutations for example in the same object file? Yes, I know, DMD has not permutations. No, I don't want to write permutation for each architecture. Why? I can write simple D code that generates single LLVM IR code which would work for ALL targets!

Best regards,
Ilya
Apr 07 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 12:59 AM, 9il wrote:
 1. Executable size will grow with every instruction set release
Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)
 3. This would not solve the problem for generic BLAS implementation for Phobos
 at all! How you would force compiler to USE and NOT USE specific vector
 permutations for example in the same object file? Yes, I know, DMD has not
 permutations. No, I don't want to write permutation for each architecture. Why?
 I can write simple D code that generates single LLVM IR code which would work
 for ALL targets!
There's no reason for the compiler to make target CPU information available when writing generic code.
Apr 07 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 09:41:06 UTC, Walter Bright wrote:
 On 4/7/2016 12:59 AM, 9il wrote:
 1. Executable size will grow with every instruction set release
Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)
What about a 1GB 2D game for a phone, or maybe a clock?
 3. This would not solve the problem for generic BLAS 
 implementation for Phobos
 at all! How you would force compiler to USE and NOT USE 
 specific vector
 permutations for example in the same object file? Yes, I know, 
 DMD has not
 permutations. No, I don't want to write permutation for each 
 architecture. Why?
 I can write simple D code that generates single LLVM IR code 
 which would work
 for ALL targets!
There's no reason for the compiler to make target CPU information available when writing generic code.
This is not true for BLAS based on D. You don't want to see the opportunities. The final result of your dogmatic decision would make code slower for DMD, but LDC and GDC would implement required simple features. I just wanted to write fast code for DMD too.
Apr 07 2016
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
 This is not true for BLAS based on D.
Perhaps if you provide him a simplified example he might see what you're talking about?
Apr 07 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 12:35:51 UTC, jmh530 wrote:
 On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
 This is not true for BLAS based on D.
Perhaps if you provide him a simplified example he might see what you're talking about?
He knows what I am talking about. This is about architecture/style/concepts. If Walter disagrees with this then nobody can change it.
Apr 07 2016
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 7 Apr 2016 02:41:06 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 3. This would not solve the problem for generic BLAS implementation
 for Phobos at all! How you would force compiler to USE and NOT USE
 specific vector permutations for example in the same object file?
 Yes, I know, DMD has not permutations. No, I don't want to write
 permutation for each architecture. Why? I can write simple D code
 that generates single LLVM IR code which would work for ALL
 targets!  
There's no reason for the compiler to make target CPU information available when writing generic code.
Actually for GDC/GCC you can't even write functions using certain SIMD stuff as 'generic' code. Unless you use -mavx or -march the builtins are not exposed to user code. IIRC the compiler even complains about inline ASM if you use unsupported instructions. You also can't always compile with the 'biggest' feature set, as GCC might use these features in codegen. TLDR; For GCC/GDC you will have to use target flags / attribute(target) to mix feature sets.
Apr 07 2016
prev sibling next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Wed, 6 Apr 2016 20:27:31 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does
 not impede writing code that takes advantage of various SIMD code
 generation at compile time.  
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:

    gdc -simd=AFX foo.d

becomes:

    gdc -simd=AFX -version=AFX foo.d
The problem is that -march=x can set more than one feature flag. So instead of

    gdc -march=armv7-a

you have to do

    gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32 -fversion=ARM_FEATURE_UNALIGNED ...

You have to know exactly which features are supported for a CPU. Essentially you have to duplicate the CPU<=>feature database already present in GCC (and likely LLVM too) in your Makefile. And you'll need -march=armv7-a anyway to make sure the GCC codegen can use these features as well.

So this issue is not a blocker, but what you propose is a workaround at best, not a solution.
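For illustration, this is roughly what the consuming side looks like once such identifiers are set, whether by hand with -fversion or by the compiler. It is only a sketch: ARM_FEATURE_CRC32 and ARM_FEATURE_UNALIGNED are the hypothetical names from the command line above and are not assumed to be predefined by any compiler, `loadU32` is an invented helper, and the fallback assumes a little-endian byte order.

```d
// Hypothetical feature identifiers; they are only set if passed explicitly,
// e.g. gdc -march=armv7-a -fversion=ARM_FEATURE_UNALIGNED app.d
version (ARM_FEATURE_CRC32)
    enum hasCrc32 = true;
else
    enum hasCrc32 = false;

version (ARM_FEATURE_UNALIGNED)
    enum hasUnaligned = true;
else
    enum hasUnaligned = false;

// Reads a 32-bit little-endian value from a possibly unaligned address.
uint loadU32(const(ubyte)* p)
{
    static if (hasUnaligned)
        return *cast(const(uint)*) p; // single load, relies on hardware support
    else
        // portable bytewise fallback
        return p[0] | (p[1] << 8) | (p[2] << 16) | (cast(uint) p[3] << 24);
}

void main()
{
    ubyte[5] buf = [0, 0x78, 0x56, 0x34, 0x12];
    assert(loadU32(buf.ptr + 1) == 0x12345678);
}
```

Every identifier the code tests has to be repeated on the command line in exactly the right combination for the chosen -march, which is the duplication of the CPU<=>feature table described above.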
Apr 07 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 3:15 AM, Johannes Pfau wrote:
 The problem is that march=x can set more than one
 feature flag. So instead of

 gdc -march=armv7-a
 you have to do
 gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32
 -fversion=ARM_FEATURE_UNALIGNED ...

 You have to know exactly which features are supported for a CPU.
 Essentially you have to duplicate the CPU<=>feature database already
 present in GCC (and likely LLVM too) in your Makefile. And you'll need
 -march=armv7-a anyway to make sure the GCC codegen can use these
 features as well.

 So this issue is not a blocker, but what you propose is a workaround at
 best, not a solution.
Having a veritable blizzard of these predefined versions, that constantly are obsoleted and new ones appearing, seems like a serious problem when trying to standardize the language.
Apr 07 2016
prev sibling next sibling parent reply Kai Nacke <kai redstar.de> writes:
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, Kai
Apr 07 2016
next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Thu, 07 Apr 2016 10:52:42 +0000
schrieb Kai Nacke <kai redstar.de>:

 On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
  
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
Apr 07 2016
parent reply Johan Engelen <j j.nl> writes:
On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:

 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU Indirect 
 Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in druntime/Phobos.
 
 Regards,
 Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine) I looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-) (does `&foo` return `impl`?)
Apr 07 2016
parent reply Johannes Pfau <nospam example.com> writes:
Am Thu, 07 Apr 2016 13:27:05 +0000
schrieb Johan Engelen <j j.nl>:

 On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:
  
 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU Indirect 
 Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in druntime/Phobos.
 
 Regards,
 Kai  
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)
The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference. The main problem here is that, because of cyclic constructor detection, it will be more difficult to implement a generic template solution.

http://www.airs.com/blog/archives/403
"An alternative to all this linker stuff would be a variable holding a function pointer. The function could then be written in assembler to do the indirect jump. The variable would be initialized at program startup time. The efficiency would be the same. The address of the function would be the address of the indirect jump, so function pointers would compare consistently."
 I looked into this some time ago and did not see a reason to use 
 the ifunc mechanism (which would not be available on Windows). I 
 thought it should be implementable in a library, exactly as you 
 did in your dpaste! :-)
 (does `&foo` return `impl`?)
No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &. Here's the alternative using a constructor which makes the address accessible. The syntax will still be different though:

    __gshared void function() foo;

    shared static this()
    {
        foo = &foo1;
    }

    auto addr = &foo;       // address of the variable
    addr = cast(void*)foo;  // the function address
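A self-contained version of the constructor-based variant, as a sketch only: `sumGeneric` and `sumSse41` are hypothetical implementations (the second is a placeholder that just forwards to the first), and the choice is made with the `sse41` property from core.cpuid.

```d
import core.cpuid : sse41;

// Hypothetical implementations; a real sumSse41 would use SIMD instructions.
int sumGeneric(const(int)[] a)
{
    int s = 0;
    foreach (x; a)
        s += x;
    return s;
}

int sumSse41(const(int)[] a)
{
    return sumGeneric(a); // placeholder body
}

// Callers go through this pointer; it is written once, before main runs.
__gshared int function(const(int)[]) sum;

shared static this()
{
    sum = sse41 ? &sumSse41 : &sumGeneric;
}

void main()
{
    assert(sum([1, 2, 3]) == 6);

    auto slot = &sum;                 // address of the variable
    void* target = cast(void*) sum;   // address of the selected function
    assert(*slot is sum);
    assert(target is cast(void*) &sumSse41 || target is cast(void*) &sumGeneric);
}
```

Each call is a single indirect call with no per-call branch. A module holding several such pointers can set them all from the same constructor.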
Apr 07 2016
parent Johan Engelen <j j.nl> writes:
On Thursday, 7 April 2016 at 14:46:06 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 13:27:05 +0000
 schrieb Johan Engelen <j j.nl>:

 On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:
 
 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU 
 Indirect Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in 
 druntime/Phobos.
 
 Regards,
 Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)
The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference.
Yep exactly. For target multiversioned functions, I thought one would want to create one static ctor that calls cpuid once and sets all function ptrs of that module.
 (does `&foo` return `impl`?)
No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &.
OK. Well, the target multiversioning would need compiler support anyway and it is easy to do something slightly different for `&foo` when foo is a multiversioned function. This should be fairly easy to implement in LDC, with some smarts needed in ordering and selecting the best function version.
Apr 07 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 3:52 AM, Kai Nacke wrote:
 On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos.
We already have core.cpuid, which covers most of what that article talks about. The indirect function thing appears to be a way to selectively load from various dlls. But that can be done anyway with core.cpuid and dynamic dll loading, so I'm not sure what advantage it brings.
Apr 07 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 April 2016 at 13:27, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does not
 impede writing code that takes advantage of various SIMD code generation
 at
 compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:

    gdc -simd=AFX foo.d

becomes:

    gdc -simd=AFX -version=AFX foo.d

It's even simpler if you use a makefile variable:

    FPU=AFX

    gdc -simd=$(FPU) -version=$(FPU)
Sure. I've done this in my own tests. I just never published that anyone else should do it.
 You also mentioned being blocked (i.e. demotivated) for *years* by this, and
 I assume that may be because we don't care about SIMD support. That would be
 wrong, as I care a lot about it. But I had no idea you were having a problem
 with this, as you did not file any bug reports. Suffering in silence is
 never going to work :-)
There's been threads, but sure, I could have done more to push it along. Motivation is a complex and not particularly logical emotion, there's a lot of factors feeding into it. Not least of which, is that I haven't been working in games for a while, which means I haven't depended on it for my work. Don't take that to read I have lost interest in the support, just that the pressure is reduced. You'll have noticed that C++ interaction is my recent focus, since that's directly related to my current day-job, and the path that I need to solve now to get D into my work. That's consuming almost 100% of my D-time-allocation... if I could ever manage to just kick that goal, it might free me up >_< .. I keep on trying.
 2. I'm not sure these global settings are the best approach, especially
 if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.
It is not necessary to do it that way. Call core.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.
The author still needs to be able to control at compile-time what min-spec shall be supported. I agree the check is valuable, but I think it's an unrelated detail.
 Runtime selection is not practical in a broad sense. Emitting small
 fragments of SIMD here and there will probably take a loss if they are
 all surrounded by a runtime selector. SIMD is all about pipelining,
 and runtime branches on SIMD version are antithesis to good SIMD
 usage; they can't be applied for small-scale deployment.
 In my experience, runtime selection is desirable for large scale
 instantiations at an outer level of the work loop. I've tried to
 design this intent in my library, by making each simd API capable of
 receiving SIMD version information via template arg, and within the
 library, the version is always passed through to dependent calls.
 The Idea is, if you follow this pattern; propagating a SIMD version
 template arg through to your outer function, then you can instantiate
 your higher-level work function for any number of SIMD feature
 combinations you feel is appropriate.
Doing it at a high level is what I meant, not for each SIMD code fragment.
Sure, so you agree we need a mechanism for the author to tune the default selection then? Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)
 Naturally, this process requires a default, otherwise this usage
 baggage will cloud the API everywhere (rather than in the few cases
 where a developer specifically wants to make use of it), and many
 developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
 in my applications, xbox developers would choose AVX1, it's very
 application/target-audience specific, but SSE2 is the only reasonable
 selection if we are not to accept a hint from the command line.
I still don't see how it is a problem to do the switch at a high level.
It's not a problem, that's exactly my design, but it's not a universal solution.
 Heck, you could put the ENTIRE ENGINE inside a template, have a template
 parameter be the instruction set, and instantiate the template for each
 supported instruction set.

 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
Sure, I've designed for this specifically, but it's not practical to wind this all the way to the top of the stack. Some hot code will make use of this pattern, but small fragments that appear throughout the code don't want to have this baggage applied. They should just work with the developer's deliberately selected default. It's not worth runtime selection on small deployments. You will likely end up with numerous helper functions, which when involved in the runtime-selected loops, would have different versions generated appropriately, but when these helper functions appear on their own, they would want to use a sensible default.
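The pattern described here (a SIMD level passed as a template argument, propagated to helpers, and selected once at an outer level) could be sketched roughly as follows. This is only an illustration: the `Simd` enum, `lanes`, `work` and `dispatch` are hypothetical names, not taken from Manu's library, and the only real API used is the `avx2` and `sse41` queries from core.cpuid.

```d
import core.cpuid : avx2, sse41;
import std.stdio : writeln;

// Hypothetical instruction-set levels; the names are illustrative only.
enum Simd { sse2, sse41, avx2 }

// A small helper that receives the level as a template argument, so each
// instantiation of `work` gets its own matching copy of the helper.
int lanes(Simd level)()
{
    static if (level == Simd.avx2)
        return 8;   // 256-bit registers hold 8 floats
    else
        return 4;   // 128-bit registers hold 4 floats
}

// Outer work function: the level is propagated to every helper it calls.
// Here it only counts how many full vector-sized chunks would be processed.
size_t work(Simd level)(const(float)[] data)
{
    enum n = lanes!level();
    return data.length / n;
}

// Runtime selection happens once, at the outer level of the work loop,
// not around each small SIMD fragment.
size_t dispatch(const(float)[] data)
{
    if (avx2())  return work!(Simd.avx2)(data);
    if (sse41()) return work!(Simd.sse41)(data);
    return work!(Simd.sse2)(data);
}

void main()
{
    auto data = new float[](64);
    writeln(dispatch(data));
}
```

Helpers such as `lanes` stay free of runtime branches; only `dispatch` pays for the check, which matches the "large scale instantiation at an outer level" idea above.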
 I've done it with a template arg because it can be manually
 propagated, and users can extrapolate the pattern into their outer
 work functions, which can then easily have multiple versions
 instantiated for runtime selection.
 I think it's also important to mangle it into the symbol name for the
 reasons I mention above.
Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping.
I guess you haven't looked at my code, but yes, it's all mapped to enums used by the templates. The versions would assign a constant used as the template's default arg.
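A minimal sketch of such a mapping, assuming hypothetical version identifiers `SIMD_AVX` and `SIMD_SSE41` supplied with `-version=` on the command line (neither is predefined by any compiler); the enum and function names are likewise made up for illustration:

```d
// Hypothetical instruction-set levels.
enum Simd { sse2, sse41, avx }

// Version identifiers cannot be template arguments, so they are mapped
// to an ordinary compile-time constant first.
version (SIMD_AVX)
    enum Simd defaultSimd = Simd.avx;
else version (SIMD_SSE41)
    enum Simd defaultSimd = Simd.sse41;
else
    enum Simd defaultSimd = Simd.sse2;   // baseline when no flag is given

// The constant becomes the default template argument, so callers that do
// not care simply get the level selected at build time.
string levelName(Simd level = defaultSimd)()
{
    static if (level == Simd.avx)
        return "avx";
    else static if (level == Simd.sse41)
        return "sse41";
    else
        return "sse2";
}

void main()
{
    import std.stdio : writeln;
    writeln(levelName());              // build-time default
    writeln(levelName!(Simd.avx)());   // explicit override
}
```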
Apr 07 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 5:27 PM, Manu via Digitalmars-d wrote:
 You'll have noticed that C++ interaction is my recent focus, since
 that's directly related to my current day-job, and the path that I
 need to solve now to get D into my work.
We recognize C++ interoperability to be a key feature of D. I hope you like the support you got with the C++ virtual functions! I got bogged down recently with getting the C++ exception handling support working better, hopefully we've turned the corner on that one. I'd hoped to be further along at the moment with C++ interoperability (but it's always going to be a work in progress).
 That's consuming almost 100% of my D-time-allocation... if I could
 ever manage to just kick that goal, it might free me up >_< .. I keep
 on trying.
I do appreciate your efforts in this direction.
 Doing it at a high level is what I meant, not for each SIMD code fragment.
Sure, so you agree we need a mechanism for the author to tune the default selection then?
From the command line, probably not. I like the pragma thing better.
 Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is
 implied by D_SIMD)
It is fine as a default, as it is the baseline minimum machine D is expecting.
Apr 07 2016
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 6 Apr 2016 17:42:30 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.  
It's a reasonable suggestion; some points:
1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.
2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on.
The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable.
better ;-) If you've got a version() block in a template and compile two modules using the same template with different -version flags you'll have exactly that problem. Have an enum myFlag = x; in a config module + static if => problem solved. The problem isn't having global settings, the problem is having to manually specify the same global setting for every source file.
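As a rough illustration of the "enum in a config module + static if" idea: the constant below would normally live in its own module (a hypothetical `config.d`) imported everywhere, so every module sees the same value regardless of its command-line flags. The names `useAvx` and `vectorWidth` are invented for this sketch.

```d
// Would normally sit in a separate, hypothetical `config` module that
// every other module imports; shown inline to keep the sketch in one file.
enum bool useAvx = false;

// Template code branches on the shared constant rather than on a
// per-module -version flag, so all instantiations agree.
int vectorWidth(T)()
{
    static if (useAvx)
        return cast(int)(32 / T.sizeof);   // 256-bit registers
    else
        return cast(int)(16 / T.sizeof);   // 128-bit registers
}

void main()
{
    import std.stdio : writeln;
    writeln(vectorWidth!float(), " ", vectorWidth!double());
}
```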
Apr 07 2016
prev sibling parent reply xenon325 <anm programmer.net> writes:
On Thursday, 7 April 2016 at 00:42:30 UTC, Walter Bright wrote:
 [...] especially if one is writing applications that 
 dynamically adjusts based on the CPU the user is running on. 
 The main trouble comes about when different modules are 
 compiled with different settings. What happens with template 
 code generation, when the templates are pulled from different 
 modules? What happens when COMDAT functions are generated? (The 
 linker picks one arbitrarily and discards the others.) Which 
 settings wind up in the executable will be not easily 
 predictable.

 I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
 	... code ...
    }

 Doing it on the command line is certainly the traditional way, 
 but it strikes me as being bug-prone and as unhygienic and 
 obsolete as the C preprocessor is (for similar reasons).
Have you seen how GCC's function multiversioning [1] works?

This whole thread is far too low-level for me and I'm not sure if GCC's dispatcher overhead is OK, but the syntax looks really nice and it seems to address all of your concerns.

    __attribute__ ((target ("default")))
    int foo ()
    {
      // The default version of foo.
      return 0;
    }

    __attribute__ ((target ("sse4.2")))
    int foo ()
    {
      // foo version for SSE4.2
      return 1;
    }

    __attribute__ ((target ("arch=atom")))
    int foo ()
    {
      // foo version for the Intel ATOM processor
      return 2;
    }

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning

-Alexander
Apr 12 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 10:55:18 +0000
schrieb xenon325 <anm programmer.net>:

 Have you seen how GCC's function multiversioning [1] works?

 This whole thread is far too low-level for me and I'm not sure if
 GCC's dispatcher overhead is OK, but the syntax looks really nice
 and it seems to address all of your concerns.

 	__attribute__ ((target ("default")))
 	int foo ()
 	{
 	  // The default version of foo.
 	  return 0;
 	}

 	__attribute__ ((target ("sse4.2")))
 	int foo ()
 	{
 	  // foo version for SSE4.2
 	  return 1;
 	}

 [1] https://gcc.gnu.org/wiki/FunctionMultiVersioning

 -Alexander
Awesome! I just tried it and it ties runtime and compile-time selection of code paths together in an unprecedented way! As you said, there is the runtime dispatcher overhead if you just compile normally. But if you specifically compile with "gcc -msse4.2 <…>", GCC calls the correct function directly:

    0000000000400512 <main>:
      400512:  e8 f5 ff ff ff        callq  40050c <_Z3foov.sse4.2>
      400517:  f3 c3                 repz retq
      400519:  0f 1f 80 00 00 00 00  nopl   0x0(%rax)

For demonstration purposes I disabled the inliner here. The best thing about it is that for users of libraries employing this technique, it happens behind the scenes and user code stays clean of instrumentation. No ugly versioning and hand written switch-case blocks! (It currently only works with C++ on x86, but I like the general direction.)

-- 
Marco
Apr 12 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
The system seems to call CPUID at startup and for every
multiversioned function, patch an offset in its dispatcher
function. The dispatcher function is then nothing more than a
jump relative to RIP, e.g.:

  jmp    QWORD PTR [rip+0x200bf2]

This is as efficient as it gets short of using whole-program
optimization.
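
Something loosely similar can be approximated by hand in D with a function pointer that is set once at startup. The sketch below is not how GCC implements it and not an existing druntime facility; `foo`, `fooDefault` and `fooSse42` are hypothetical names, and only `core.cpuid.sse42` is a real API.

```d
import core.cpuid : sse42;

// Two versions of the same function (bodies are placeholders).
int fooDefault() { return 0; }
int fooSse42()   { return 1; }

// The "dispatcher slot": calls go through this pointer, which costs one
// indirect call, comparable to the jmp through a patched offset above.
__gshared int function() foo = &fooDefault;

shared static this()
{
    // Runs once at program startup, before main(): query the CPU and
    // point the slot at the best available version.
    if (sse42())
        foo = &fooSse42;
}

void main()
{
    import std.stdio : writeln;
    writeln(foo());
}
```

Unlike the GCC feature, the compiler cannot bypass the pointer when the target is known at build time, so this only covers the runtime half.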

-- 
Marco
Apr 12 2016
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
 Have you seen how GCC's function multiversioning [1] works?
I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers. I don't know how some of those choices would get made at compile time for dynamic arrays. Would need some kind of run-time approach. At least for static arrays, you could do multiple versions of the function and then use template constraints to call whichever function. Some tuning would be necessary.
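For the static-array case, the selection by template constraint might look roughly like this. It is a sketch under assumptions: `smallLimit` is an arbitrary, untuned threshold, `matmul` is a hypothetical name, and the "large" overload merely uses D array operations, which a compiler may or may not vectorise.

```d
// Hypothetical cut-off between "small" and "large"; real code would tune it.
enum size_t smallLimit = 4;

// Overload chosen for small matrices: plain scalar triple loop.
float[N][N] matmul(size_t N)(in float[N][N] a, in float[N][N] b)
if (N <= smallLimit)
{
    float[N][N] c;
    foreach (i; 0 .. N)
        foreach (j; 0 .. N)
        {
            float sum = 0;
            foreach (k; 0 .. N)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return c;
}

// Overload chosen for larger matrices: accumulate whole rows with array
// operations, which gives the compiler a chance to emit SIMD code.
float[N][N] matmul(size_t N)(in float[N][N] a, in float[N][N] b)
if (N > smallLimit)
{
    float[N][N] c;
    foreach (i; 0 .. N)
    {
        c[i][] = 0;
        foreach (k; 0 .. N)
            c[i][] += a[i][k] * b[k][];
    }
    return c;
}

void main()
{
    import std.stdio : writeln;
    float[2][2] a  = [[1.0f, 2.0f], [3.0f, 4.0f]];
    float[2][2] id = [[1.0f, 0.0f], [0.0f, 1.0f]];
    writeln(matmul!2(a, id));
}
```

The choice is made entirely at compile time from `N`; dynamic arrays would still need a runtime size check in front of the two paths.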
Apr 15 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 15 Apr 2016 18:54:12 +0000
schrieb jmh530 <john.michael.hall gmail.com>:

 On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
 Have you seen how GCC's function multiversioning [1] works?
  
I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers.
GCC only has one architecture as a target at a time. As long as this is so, there is little point in contemplating how it handles multiple architectures and network traffic. :) CPUs run the bulk of code, from booting through the kernel and drivers to applications, and there will always be something that can be optimized if it is statically known that a certain instruction set is supported. To pick up your matrices example, imagine OpenGL code that has some 4x4 matrices that are in no direct relation to each other. The GPU is only good at bulk processing, and it doesn't apply here. So you need the general purpose processor and benefit from the knowledge that some SSE level is supported. In general, when you have to make many quick decisions on small amounts of data, the GPU or networking are out of the question. -- Marco
Apr 16 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Tuesday, 5 April 2016 at 09:39:21 UTC, 9il wrote:
 3. This is possible and not very hard to implement if I am not 
 wrong.
Last time I looked into this (related to implementing target, see [1]), I only found some Clang code dealing with this, but now I found LLVM functions about architectures, cpus, features, etc. So I also think it will be relatively easy to implement at least rudimentary support for what you'd want. [1] http://forum.dlang.org/post/eodutgruoofruperrgif@forum.dlang.org
Apr 05 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
I'm not a SIMD expert, I've only played around with SIMD a little, but this confuses me. version(D_SIMD) will tell you when SIMD is implemented, but not what type of SIMD. For instance, if I am on a machine that can use AVX2 instructions, then code in a version(D_SIMD) block will execute, but it should also execute if the processor only supports SSE4. What if the writer of an SIMD library wants to have code execute differently if SSE4 is detected instead of AVX2?
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 2:11 PM, jmh530 wrote:
 version(D_SIMD) will tell you when SIMD is implemented, but not what type of
 SIMD.
The first SIMD level.
 For instance, if I am on a machine that can use AVX2 instructions, then
 code in a version(D_SIMD) block will execute, but it should also execute if the
 processor only supports SSE4. What if the writer of an SIMD library wants to
 have code execute differently if SSE4 is detected instead of AVX2?
Use a runtime switch (see core.cpuid).
Apr 04 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 4 Apr 2016 13:29:11 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?  
Target cpu configuration: - CPU architecture (done)
Done.
 - Count of FP/Integer registers  
??
 - Allowed sets of instructions: for example, AVX2, FMA4  
Done. D_SIMD
I wonder if answers like this are meant to be filled into a template like this: "We have [$2] in place for that. If that doesn't get the job $1, please report whatever is missing to bugzilla. Thanks!" Since otherwise it should be clear that the distinction between AVX2 and FMA4 asks for something more specialized than D_SIMD, which is basically the same as checking the front-end __VERSION__. -- Marco
Apr 11 2016
prev sibling next sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 3 April 2016 at 16:14, 9il via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c apparently none
 of the D compilers has a working SIMD implementation (maybe GDC has but
 it's very difficult to work w/ the 2.066 frontend).


 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, stores, and
 integral mul/div. Is this really the current state of SIMD or am I missing
 sth.?

 -Martin
Hello Martin, Is it possible to introduce compile-time information about the target platform? I am working on a from-scratch BLAS implementation, and there is no hope of creating something usable without CT information about the target. Best regards, Ilya
My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.
Apr 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/3/2016 12:39 AM, Manu via Digitalmars-d wrote:
 My SIMD implementation has been blocked on that for years too.
First I've heard of that.
 I need to know the SIMD level flags passed to the compiler at least,
 and DMD needs to introduce the concept.
Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
Apr 03 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862
Apr 03 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/3/2016 7:12 PM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:
 There is no issue I can find about being blocked for years on SIMD flags. I
 guarantee you that if you never report the problems you're having, you will
 suffer in silence and they will not get fixed :-)
He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862
Yes, but I never noticed that until you posted a link. The place to file bug reports and enhancement requests is on Bugzilla. Otherwise nobody will see them. It's why we have Bugzilla.
Apr 03 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
Apr 04 2016
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:
 I made a bug to track this problem: 
 https://issues.dlang.org/show_bug.cgi?id=15873
You might add link to this thread and github where he made the original comment..
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 10:27 AM, jmh530 wrote:
 On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:
 I made a bug to track this problem:
 https://issues.dlang.org/show_bug.cgi?id=15873
You might add link to this thread and github where he made the original comment..
http://www.digitalmars.com/d/archives/digitalmars/D/Any_usable_SIMD_implementation_282806.html
Apr 04 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 10:23 AM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
I believe the issue is fixed (for DMD) with a documentation improvement.
Apr 04 2016
parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Monday, 4 April 2016 at 19:43:43 UTC, Walter Bright wrote:
 On 4/4/2016 10:23 AM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
I believe the issue is fixed (for DMD) with a documentation improvement.
I believe the problem is that you can't rely on D_SIMD to tell you that SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib@forum.dlang.org.
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 12:55 PM, ZombineDev wrote:
 I believe the issue is fixed (for DMD) with a documentation improvement.
I believe the problem is that you can't rely on D_SIMD to tell you that SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib@forum.dlang.org.
Right, you can't. But the issue here is having the compiler give a predefined version for what is the MINIMUM that the target machine supports. And the D_SIMD does that. There is no purpose to the compiler predefining a version for an instruction set it does not generate code for. You can also do a runtime test with http://dlang.org/phobos/core_cpuid.html
Apr 04 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 On 3 April 2016 at 16:14, 9il via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 Is it possible to introduce compile-time information about the 
 target platform? I am working on a from-scratch BLAS 
 implementation, and there is no hope of creating something usable 
 without CT information about the target.

 Best regards,
 Ilya
My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.
https://github.com/ldc-developers/ldc/pull/1434
Apr 15 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 03 Apr 2016 06:14:23 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

 Hello Martin,
 
 Is it possible to introduce compile-time information about the target 
 platform? I am working on a from-scratch BLAS implementation, and 
 there is no hope of creating something usable without CT information 
 about the target.
 
 Best regards,
 Ilya
+1000! I've hardcoded SSE4 in fast.json, but would much prefer to type version(sse4) and have it compile on older CPUs as well. -- Marco
Apr 04 2016
prev sibling next sibling parent Etienne <etcimon gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Not sure if it's been mentioned, but I've made a best effort to implement GCC's in here: https://github.com/etcimon/botan/tree/master/source/botan/utils/simd
Apr 12 2016
prev sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
ndslice.algorithm [1], [2], compiled with a recent LDC beta, will do all the work for you. The Vectorized flag should be turned on and the last (row) dimension should have stride==1.

Generic matrix-matrix multiplication [3] is available in Mir version 0.16.0-beta2. It should be compiled with a recent LDC beta and the -mcpu=native flag.

[1] http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html
[2] https://github.com/dlang/phobos/pull/4652
[3] http://docs.mir.dlang.io/latest/mir_glas_gemm.html
Aug 22 2016