
digitalmars.D - Any usable SIMD implementation?

reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
I'm currently working on a templated arrayop implementation (using RPN
to encode ASTs).
So far things worked out great, but now I got stuck b/c apparently none
of the D compilers has a working SIMD implementation (maybe GDC has but
it's very difficult to work w/ the 2.066 frontend).

https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

I don't want to do anything fancy, just unaligned loads, stores, and
integral mul/div. Is this really the current state of SIMD or am I
missing sth.?

-Martin
Mar 31 2016
next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
I don't know how far Ilya's work [1] has advanced, but you may want to join efforts with him. There are also two std.simd packages [2] [3].

BTW, I looked at your code a couple of days ago and I thought it is a really interesting approach to encode operations like that. I'm just wondering if pursuing this approach is a good idea in the long run, i.e. is it expressive enough to cover the use cases of HPC, which would also need something similar, but for custom linear algebra types. Here's an interesting video about approaches to solving this problem in C++:
https://www.youtube.com/watch?v=hfn0BVOegac

[1]: http://forum.dlang.org/post/nilhvnqbsgqhxdshpqfl forum.dlang.org
[2]: https://github.com/D-Programming-Language/phobos/pull/2862
[3]: https://github.com/Iakh/simd
Mar 31 2016
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 03/31/2016 10:55 AM, ZombineDev wrote:
 [2]: https://github.com/D-Programming-Language/phobos/pull/2862
Well apparently stores w/ dmd's weird core.simd interface don't work, or I can't figure out (from the non-existent documentation) how to use it.

---
import core.simd;

void test(float4* ptr, float4 val)
{
    __simd_sto(XMM.STOUPS, *ptr, val);
    __simd(XMM.STOUPS, *ptr, val);
    auto val1 = __simd_sto(XMM.STOUPS, *ptr, val);
    auto val2 = __simd(XMM.STOUPS, *ptr, val);
}
---

LDC at least has some intrinsics once you find ldc.gccbuiltins_x86, but for some reason comes with its own broken ldc.simd.loadUnaligned instead of providing intrinsics.

---
import core.simd, ldc.simd;

float4 test(float* ptr)
{
    return loadUnaligned!float4(ptr);
}
---

/home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error: can't parse inline LLVM IR:
    %r = load <4 x float>* %p, align 1
                        ^
expected comma after load's type

So are 3 different untested and unused APIs really the current state of SIMD?

-Martin
Apr 01 2016
next sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 2 Apr 2016 12:40 am, "Martin Nowak via Digitalmars-d" <
digitalmars-d puremagic.com> wrote:
 On 03/31/2016 10:55 AM, ZombineDev wrote:
 [2]: https://github.com/D-Programming-Language/phobos/pull/2862
 Well apparently stores w/ dmd's weird core.simd interface don't work,
 or I can't figure out (from the non-existent documentation) how to use
 it.

 ---
 import core.simd;

 void test(float4* ptr, float4 val)
 {
     __simd_sto(XMM.STOUPS, *ptr, val);
     __simd(XMM.STOUPS, *ptr, val);
     auto val1 = __simd_sto(XMM.STOUPS, *ptr, val);
     auto val2 = __simd(XMM.STOUPS, *ptr, val);
 }
 ---

 LDC at least has some intrinsics once you find ldc.gccbuiltins_x86,
 but for some reason comes with its own broken ldc.simd.loadUnaligned
 instead of providing intrinsics.

 ---
 import core.simd, ldc.simd;

 float4 test(float* ptr)
 {
     return loadUnaligned!float4(ptr);
 }
 ---

 /home/dawg/dlang/ldc-0.17.1/bin/../import/ldc/simd.di(212): Error:
 can't parse inline LLVM IR:
     %r = load <4 x float>* %p, align 1
                         ^
 expected comma after load's type

 So are 3 different untested and unused APIs really the current state
 of SIMD?

 -Martin
I would just let the compiler optimize / vectorize the operation, but then again it is probably just me who thinks these things.

http://goo.gl/XdiKZX

I'm not aware of any intrinsic to load unaligned data. Only to assume alignment.

Iain.
Apr 01 2016
parent reply Martin Nowak <code dawg.eu> writes:
On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:
 I would just let the compiler optimize / vectorize the 
 operation, but then again that it is probably just me who 
 thinks these things.
It's intended to replace the array ops in druntime; relying on vectorizers won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.
 I'm not aware of any intrinsic to load unaligned data. Only to 
 assume alignment.
__builtin_ia32_loadups
__builtin_ia32_storeups
Apr 02 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 2 Apr 2016 9:45 am, "Martin Nowak via Digitalmars-d" <
digitalmars-d puremagic.com> wrote:
 On Saturday, 2 April 2016 at 06:13:24 UTC, Iain Buclaw wrote:
 I would just let the compiler optimize / vectorize the operation, but
then again that it is probably just me who thinks these things.
 It's intended to replace the array ops in druntime; relying on vectorizers
won't suffice, e.g. your example already stops working when I pass dynamic instead of static arrays.
 I'm not aware of any intrinsic to load unaligned data. Only to assume
alignment.
 __builtin_ia32_loadups
 __builtin_ia32_storeups
Any agnostic way to... :-)
Apr 02 2016
parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/02/2016 10:19 AM, Iain Buclaw via Digitalmars-d wrote:
 __builtin_ia32_loadups
 __builtin_ia32_storeups
Any agnostic way to... :-)
I'm already using vector types for most operations, so it's somewhat portable. But for whatever reason D doesn't allow multiplication/division w/ integral vectors (departing from GCC/clang) and I can't perform unaligned loads, so I have to resort to intrinsics for that.
Apr 02 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:
 LDC at least has some intrinsics once you find 
 ldc.gccbuiltins_x86, but for some reason comes with it's own 
 broken ldc.simd.loadUnaligned
Please submit a GH issue with LDC, thanks! -Johan
Apr 03 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Friday, 1 April 2016 at 22:31:00 UTC, Martin Nowak wrote:
 Well apparently stores w/ dmd's weird core.simd interface don't 
 work, or I can't figure out (from the non-existent 
 documentation) how to use it.
https://github.com/D-Programming-Language/dmd/pull/5625
Apr 03 2016
prev sibling next sibling parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Am I being stupid or is core.simd what you want?
Mar 31 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?
I think you want to write your code using SIMD primitives. But in case you want the compiler to generate SIMD instructions, perhaps ldc.attributes.target may help you: http://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.target.28.22feature.22.29.29 I have not checked what LDC does with SIMD with default commandline parameters. Cheers, Johan
Mar 31 2016
prev sibling next sibling parent Iakh <iaktakh gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Unfortunately mine (https://github.com/Iakh/simd) is far from production code. For now I'm trying to figure out an interface common to all archs/compilers, and it's more about SIMD comparison operations.

You could do loads, stores, and mul with default D SIMD support, but not int div.
Mar 31 2016
prev sibling next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Hello Martin,

Is it possible to introduce compile-time information about the target platform? I am working on a from-scratch BLAS implementation, and there is no hope of creating something usable without compile-time information about the target.

Best regards,
Ilya
Apr 02 2016
next sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" <digitalmars-d puremagic.com>
wrote:
 On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c apparently none
 of the D compilers has a working SIMD implementation (maybe GDC has but
 it's very difficult to work w/ the 2.066 frontend).
https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d
 I don't want to do anything fancy, just unaligned loads, stores, and
integral mul/div. Is this really the current state of SIMD or am I missing sth.?
 -Martin
 Hello Martin,

 Is it possible to introduce compile time information about target
 platform? I am working on BLAS from scratch implementation. And it is
 no hope to create something useable without CT information about target.
 Best regards,
 Ilya
What kind of information?
Apr 02 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Sunday, 3 April 2016 at 06:33:13 UTC, Iain Buclaw wrote:
 On 3 Apr 2016 8:15 am, "9il via Digitalmars-d" 
 <digitalmars-d puremagic.com> wrote:
 Hello Martin,

 Is it possible to introduce compile time information about target
 platform? I am working on BLAS from scratch implementation. And it is
 no hope to create something useable without CT information about target.
 Best regards,
 Ilya
What kind of information?
Target CPU configuration:
- CPU architecture (done)
- Count of FP/integer registers
- Allowed instruction sets: for example, AVX2, FMA4
- Compiler optimization options (for math)

Ilya
Apr 04 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 04 Apr 2016 14:02:03 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

 Target cpu configuration:
 - CPU architecture (done)
 - Count of FP/Integer registers
 - Allowed sets of instructions: for example, AVX2, FMA4
 - Compiler optimization options (for math)
 
 Ilya
- On amd64, whether floating-point math is handled by the FPU or SSE. When emulating floating-point, e.g. for float-to-string and string-to-float code, it is useful to know where to get the active rounding mode from, since they may differ, and at least GCC has a switch to choose between both.

- For compile-time enabling of SSE4 code, a version define is sufficient. Sometimes we want to select a code path at runtime. For this to work, GDC and LDC use a conservative feature set at compile time (e.g. amd64 with SSE2) and tag each SSE4 function with an attribute to temporarily elevate the instruction set (e.g. attribute("target", "+sse4")). If you didn't tag the function like that, the compiler would error out, because the SSE4 instructions are not supported by a minimal amd64 CPU. To put this to good use, we need a reliable way - basically a global variable - to check for SSE4 (or POPCNT, etc.). What we have now does not work across all compilers.

-- Marco
Apr 04 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 16:21:15 UTC, Marco Leise wrote:
 Am Mon, 04 Apr 2016 14:02:03 +0000
 schrieb 9il <ilyayaroshenko gmail.com>:
 - On amd64, whether floating-point math is handled by the FPU
   or SSE. When emulating floating-point, e.g. for
   float-to-string and string-to-float code, it is useful to
   know where to get the active rounding mode from, since they
   may differ and at least GCC has a switch to choose between
   both.
 - For compile time enabling of SSE4 code, a version define is
   sufficient. Sometimes we want to select a code path at
   runtime. For this to work, GDC and LDC use a conservative
   feature set at compile time (e.g. amd64 with SSE2) and tag
   each SSE4 function with an attribute to temporarily elevate
   the instruction set. (e.g.  attribute("target", "+sse4"))
   If you didn't tag the function like that the compiler would
   error out, because the SSE4 instructions are not supported
   by a minimal amd64 CPU.
   To put this to good use, we need a reliable way - basically
   a global variable - to check for SSE4 (or POPCNT, etc.). What
   we have now does not work across all compilers.
attribute("target", "+sse4") would not work well for BLAS. BLAS needs compile-time constants. This is very important because BLAS can be 95% portable, so I just need to write code that will be optimized very well by the compiler. --Ilya
Apr 04 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 04 Apr 2016 18:35:26 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

  attribute("target", "+sse4")) would not work well for BLAS. BLAS 
 needs compile time constants. This is very important because BLAS 
 can be 95% portable, so I just need to write a code that would be 
 optimized very well by compiler. --Ilya
It's just for the case where you want a generic executable with a generic and a specialized code path. I didn't mean this to be exclusively used without compile-time information about target features. -- Marco
Apr 11 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 9:21 AM, Marco Leise wrote:
    To put this to good use, we need a reliable way - basically
    a global variable - to check for SSE4 (or POPCNT, etc.). What
    we have now does not work across all compilers.
http://dlang.org/phobos/core_cpuid.html
Apr 04 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 4 Apr 2016 11:43:58 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 9:21 AM, Marco Leise wrote:
    To put this to good use, we need a reliable way - basically
    a global variable - to check for SSE4 (or POPCNT, etc.). What
    we have now does not work across all compilers.  
http://dlang.org/phobos/core_cpuid.html
That's what I implied in "what we have now":

    import core.cpuid;
    writeln( mmx ); // prints 'false' with GDC

    version(InlineAsm_X86_Any)
        writeln("DMD and LDC support the Dlang inline assembler");
    else
        writeln("GDC has the GCC extended inline assembler");

Both LLVM and GCC have moved to "extended inline assemblers" that require you to provide information about input, output and scratch registers as well as memory locations, so the compiler can see through the asm-block for register allocation and inlining purposes. It's more difficult to get right, but also more rewarding, as it enables you to write no-overhead "one-liners" and "intrinsics" while having calling conventions still handled by the compiler.

An example for GDC:

    struct DblWord { ulong lo, hi; }

    /// Multiplies two machine words and returns a double
    /// machine word.
    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp = void;
        // '=a' and '=d' are outputs to RAX and RDX
        // respectively that are bound to the two
        // fields of 'tmp'.
        // '"a" x' means that we want 'x' as input in
        // RAX and '"rm" y' places 'y' wherever it
        // suits the compiler (any general purpose
        // register or memory location).
        // 'mulq %3' multiplies with the ulong
        // represented by the argument at index 3 (y).
        asm {
            "mulq %3"
             : "=a" tmp.lo, "=d" tmp.hi
             : "a" x, "rm" y;
        }
        return tmp;
    }

In the above example the compiler has enough information to inline the function or directly return the result in RAX:RDX without writing to memory first. The same thing in DMD would likely have turned out slower than emulating this using several uint->ulong multiplies.

Although less powerful, the LDC team implemented Dlang inline assembly according to the specs, and so core.cpuid works for them. GDC on the other hand is out of the picture until either
1) GDC adds Dlang inline assembly, or
2) core.cpuid duplicates most of its assembly code to support the GCC extended inline assembler.

I would prefer a common extended inline assembler though, because when you use it for performance reasons you typically cannot go with non-inlinable Dlang asm, so you end up with pure D for DMD, GCC asm for GDC, and LDC asm - three code paths.

-- Marco
Apr 11 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/11/2016 7:24 AM, Marco Leise wrote:
 Am Mon, 4 Apr 2016 11:43:58 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 9:21 AM, Marco Leise wrote:
     To put this to good use, we need a reliable way - basically
     a global variable - to check for SSE4 (or POPCNT, etc.). What
     we have now does not work across all compilers.
http://dlang.org/phobos/core_cpuid.html
 That's what I implied in "what we have now":

     import core.cpuid;
     writeln( mmx ); // prints 'false' with GDC

     version(InlineAsm_X86_Any)
         writeln("DMD and LDC support the Dlang inline assembler");
     else
         writeln("GDC has the GCC extended inline assembler");
There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.
 Both LLVM and GCC have moved to "extended inline assemblers"
 that require you to provide information about input, output
 and scratch registers as well as memory locations, so the
 compiler can see through the asm-block for register allocation
 and inlining purposes. It's more difficult to get right, but
 also more rewarding, as it enables you to write no-overhead
 "one-liners" and "intrinsics" while having calling conventions
 still handled by the compiler.
I know, but "more difficult" is a bit of an understatement. For example, core.cpuid has not been implemented using those assemblers.

BTW, dmd's inline assembler does know about which instructions read/write which registers, and makes use of that when inserting the code, so it will work with the rest of the code generator's register usage tracking.

I find needing to tell gcc which registers are read/written by a particular instruction to be a step BACKWARDS in usability. This is what computers are supposed to be good for :-)
 An example for GDC:

 	struct DblWord { ulong lo, hi; }

 	/// Multiplies two machine words and returns a double
 	/// machine word.
 	DblWord bigMul(ulong x, ulong y)
 	{
 		DblWord tmp = void;
 		// '=a' and '=d' are outputs to RAX and RDX
 		// respectively that are bound to the two
 		// fields of 'tmp'.
 		// '"a" x' means that we want 'x' as input in
 		// RAX and '"rm" y' places 'y' wherever it
 		// suits the compiler (any general purpose
 		// register or memory location).
 		// 'mulq %3' multiplies with the ulong
 		// represented by the argument at index 3 (y).
 		asm {
 			"mulq %3"
 			 : "=a" tmp.lo, "=d" tmp.hi
 			 : "a" x, "rm" y;
 		}
 		return tmp;
 	}

 In the above example the compiler has enough information to
 inline the function or directly return the result in RAX:RDX
 without writing to memory first. The same thing in DMD would
 likely have turned out slower than emulating this using
 several uint->ulong multiplies.
DMD doesn't inline functions with asm in them, but that is not the fault of the inline assembler.

The only real weakness in the DMD inline assembler is that it doesn't support "let the compiler select the register". DMD's strong support for compiler builtins, however, mitigates this to an acceptable level.
 Although less powerful, the LDC team implemented Dlang inline
 assembly according to the specs and so core.cpuid works for
 them. GDC on the other hand is out of the picture until either
 1) GDC adds Dlang inline assembly
 2) core.cpuid duplicates most of its assembly code to support
     the GCC extended inline assembler

 I would prefer a common extended inline assembler though,
 because when you use it for performance reasons you typically
 cannot go with non-inlinable Dlang asm, so you end up with pure
 D for DMD, GCC asm for GDC and LDC asm - three code paths.
Apr 11 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 11 Apr 2016 14:29:11 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/11/2016 7:24 AM, Marco Leise wrote:
 Am Mon, 4 Apr 2016 11:43:58 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:
  
 On 4/4/2016 9:21 AM, Marco Leise wrote:  
     To put this to good use, we need a reliable way - basically
     a global variable - to check for SSE4 (or POPCNT, etc.). What
     we have now does not work across all compilers.  
http://dlang.org/phobos/core_cpuid.html
 That's what I implied in "what we have now":

     import core.cpuid;
     writeln( mmx ); // prints 'false' with GDC

     version(InlineAsm_X86_Any)
         writeln("DMD and LDC support the Dlang inline assembler");
     else
         writeln("GDC has the GCC extended inline assembler");
There's no reason core.cpuid, which has a platform-independent API, cannot be made to work with GDC and LDC. Adding more global variables to do the same thing would add no value and would not be easier to implement.
LDC implements InlineAsm_X86_Any (DMD style asm), so core.cpuid works. GDC is the only compiler that does not implement it. We agree that core.cpuid should provide this information, but what we have now - core.cpuid in a mix with GDC's lack of DMD style asm - does not work in practice for the years to come.
 Both LLVM and GCC have moved to "extended inline assemblers"
 that require you to provide information about input, output
 and scratch registers as well as memory locations, so the
 compiler can see through the asm-block for register allocation
 and inlining purposes. It's more difficult to get right, but
 also more rewarding, as it enables you to write no-overhead
 "one-liners" and "intrinsics" while having calling conventions
 still handled by the compiler.  
I know, but "more difficult" is a bit of an understatement. For example, core.cpuid has not been implemented using those assemblers.
Yep, and that makes it unavailable in GDC. All feature tests return false, even MMX or SSE2 on amd64.
 BTW, dmd's inline assembler does know about which instructions
 read/write which registers, and makes use of that when inserting the
 code so it will work with the rest of the code generator's register
 usage tracking.
That is a pleasant surprise. :)
 I find needing to tell gcc which registers are read/written by a particular 
 instruction to be a step BACKWARDS in usability. This is what computers are 
 supposed to be good for :-)
Still, DMD does not inline asm and always adds a function prolog and epilog around asm blocks in an otherwise empty function (correct me if I'm wrong). "naked" means you have to duplicate code for the different calling conventions, in particular Win32.

Your look on GCC (and LLVM) may be a bit biased. First of all, you don't need to tell it exactly which registers to use. A rough classification is enough and gives the compiler a good idea of where calculations should be stored upon arrival at the asm statement. You can be specific down to the register name or let the backend choose freely with "rm" (= any register or memory).

An example: we have a variable x that is computed inside a function, followed by an asm block that multiplies it with something else. Typically you would "MOV EAX, [x]" to load x into the register that the MUL instruction expects. With extended assemblers you can be declarative about that and just state that x is needed in EAX as an input. You drop the MOV from the asm block and let the compiler figure out in its codegen how x will end up in EAX. That's a step FORWARD in usability.
 DMD doesn't inline functions with asm in them, but that is not the
 fault of the inline assembler.
 
 The only real weakness in the DMD inline assembler is it doesn't support "let 
 the compiler select the register". DMD's strong support for compiler builtins, 
 however, mitigate this to an acceptable level.
Yes, I've witnessed that in multiply with overflow check. DMD generates very efficient code for 'mulu'. It's just that the compiler cannot have builtins for everything. (I personally was looking for 64-bit multiply with 128-bit result and SSE4 string scanning.)

The extended assemblers in GCC and LLVM allow me to write intrinsics, often as a single(!) instruction, that seamlessly inline into the surrounding code, just as DMD's builtins would do. And it seems to me we could have less backend complexity if we were able to implement intrinsics as library code with the same efficiency. ;)

But most of the time when I want to access a specialized CPU instruction for speed with asm in DMD, the generic pure D code is faster. I would advise to only use it if the concept is not expressible in pure D at the moment. You might add that we shouldn't write asm in the first place, because compilers have become smart enough, but it's not like I was writing large chunks of asm. I use it to write "compiler builtins" in D source code.

-- Marco
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
 BTW, dmd's inline assembler does know about which instructions read/write which
 registers, and makes use of that when inserting the code so it will work with
 the rest of the code generator's register usage tracking.
That is a pleasant surprise. :)
https://github.com/D-Programming-Language/dmd/blob/master/src/iasm.c#L1255
 Still, DMD does not inline asm and always adds a function
 prolog and epilog around asm blocks in an otherwise
 empty function (correct me if I'm wrong).
Not if you use "naked".
 "naked" means you
 have to duplicate code for the different calling conventions,
 in particular Win32.
Why complain about it adding a prolog/epilog, and complain about it not adding it?
 Your look on GCC (and LLVM) may be a bit biased. First of all
 you don't need to tell it exactly which registers to use. A
 rough classification is enough and gives the compiler a good
 idea of where calculations should be stored upon arrival at
 the asm statement. You can be specific down to the register
 name or let the backend chose freely with "rm" (= any register
 or memory).
 An example: We have a variable x that is computed inside a
 function followed by an asm block that multiplies it with
 something else. Typically you would "MOV EAX, [x]" to load x
 into the register that the MUL instruction expects. With
 extended assemblers you can be declarative about that and just
 state that x is needed in EAX as an input. You drop the MOV
 from the asm block and let the compiler figure out in its
 codegen, how x will end up in EAX. That's a step FORWARD in
 usability.
It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
Apr 12 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 13:22:12 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.  
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
You mean it is OK if I duplicate most of the asm in there and create a pull request?
 Still, DMD does not inline asm and always adds a function
 prolog and epilog around asm blocks in an otherwise
 empty function (correct me if I'm wrong).  
Not if you use "naked".
 "naked" means you
 have to duplicate code for the different calling conventions,
 in particular Win32.  
Why complain about it adding a prolog/epilog, and complain about it not adding it?
Yeah, I didn't make this clear. To reduce code repetition I'd like to avoid "naked" and have the compiler handle the calling conventions. Let's compare the earlier example in both GDC and DMD in a coding style that is agnostic wrt. the calling convention.

First GDC:

    struct DblWord { ulong lo, hi; }

    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp;
        asm {
            "mulq %[y]"
             : "=a" tmp.lo, "=d" tmp.hi
             : "a" x, [y] "rm" y;
        }
        return tmp;
    }

This is turned into the following instruction sequence (AT&T):

    mov %rdi,%rax
    mul %rsi
    retq

Note how elegantly GCC handles the calling convention for us. The prolog reduces to moving 'x' from RDI to RAX, where I asked it to place it for the MUL to use as the implicit operand. After multiplying it by the explicit operand in RSI, the resulting two machine words are in RAX:RDX as we know. I created a data structure to return those two and told GCC to tie tmp.lo to RAX and tmp.hi to RDX. Since the calling convention happens to return structs of two machine words in RAX:RDX, the whole assignment to 'tmp' and the return become no-ops. With inlining enabled only the 'mul' would remain. This is the ideal outcome.

Now let's look at the DMD implementation - again letting the compiler figure out the calling convention:

    DblWord bigMul(ulong x, ulong y)
    {
        DblWord tmp;
        asm {
            mov RAX, x;
            mul y;
            mov tmp+DblWord.lo.offsetof, RAX;
            mov tmp+DblWord.hi.offsetof, RDX;
        }
        return tmp;
    }

This generates the following:

    push %rbp
    mov  %rsp,%rbp
    sub  $0x20,%rsp
    mov  %rdi,-0x10(%rbp)
    mov  %rsi,-0x8(%rbp)
    lea  -0x20(%rbp),%rax
    xor  %ecx,%ecx
    mov  %rcx,(%rax)
    mov  %rcx,0x8(%rax)
    mov  -0x8(%rbp),%rax
    mulq -0x10(%rbp)
    mov  %rax,-0x20(%rbp)
    mov  %rdx,-0x18(%rbp)
    mov  -0x18(%rbp),%rdx
    mov  -0x20(%rbp),%rax
    mov  %rbp,%rsp
    pop  %rbp
    retq

In practice GDC will just replace the invocation with a single 'mul' instruction, while DMD will emit a call to this 18-instruction-long function. Now you keep telling me extended assembly is a step backwards. :)
 It's a step backwards because I can't just say "MUL EAX".
You could write this, you'd only have to tell the assembler that EAX and EDX will be overwritten, something that DMD already knows.
 I have to tell GCC what register the result gets put in.
And by doing this you allow it to figure out the shortest way to return the result in compliance with the calling convention.
 This is, to my mind, ridiculous.
I too find it annoying that I have to inform it about the scratch registers used in the asm, but the rest seems legit to me. At some point you have to connect variables in the host language with registers in assembly. Doing this in a declarative manner instead of explicit assembly code allows the backend to find the optimal code (literally), as demonstrated above.
 GCC's inline assembler apparently has no knowledge of what
 the opcodes actually do.
Agreed. It seems to treat the assembly text merely as a text template. It is the same with LLVM's extended assembler which borrows heavily from GCC's. This is probably due to the fact that the assembler is historically a standalone executable and as such the authority for interpreting the asm code is outside of the scope of the host language compiler. Under these circumstances we might have gone for the same implementation. -- Marco
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 4:29 PM, Marco Leise wrote:
 Am Tue, 12 Apr 2016 13:22:12 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/12/2016 9:53 AM, Marco Leise wrote:
 LDC implements InlineAsm_X86_Any (DMD style asm), so
 core.cpuid works. GDC is the only compiler that does not
 implement it. We agree that core.cpuid should provide this
 information, but what we have now - core.cpuid in a mix with
 GDC's lack of DMD style asm - does not work in practice for
 the years to come.
Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style in an hour or so. It could even be simply written separately in GAS and linked in. Since this has not been done, I can only conclude that core.cpuid has not been an actual blocker.
You mean it would be OK if I duplicated most of the asm in there and created a pull request?
It's Boost licensed, and Boost licensed code can be shipped with GPL'd code as far as I know.
            "mulq %[y]"
            : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;
I don't see anything elegant about those lines, starting with "mulq", which is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically.

I can see nothing to recommend the:

    "=a" tmp.lo

syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-)

I have no idea what:

    "a" x

and:

    [y] "rm" y

mean, nor why the ":" appears sometimes and the "," other times. It does look like it was designed by the same guy who invented TECO macros:

https://www.reddit.com/r/programming/comments/4e07lo/last_night_in_a_fit_of_boredom_far_away_from_my/d1xlbh7

but that's not much of a compliment.
 In practice GDC will just replace the invokation with a single
 'mul' instruction while DMD will emit a call to this 18
 instructions long function. Now you keep telling me extended
 assembly is a step backwards. :)
DMD version:

DblWord bigMul(ulong x, ulong y)
{
    asm
    {
        naked;
        mov RAX,RDI;
        mul RSI;
        ret;
    }
}
 GCC's inline assembler apparently has no knowledge of what
 the opcodes actually do.
Agreed.
This is the basis of my assertion it is a step backwards. Granted, it has some nice capability as you've demonstrated. But it sure makes you suffer to get it.
Apr 12 2016
next sibling parent Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 08:22, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 4:29 PM, Marco Leise wrote:
 In practice GDC will just replace the invokation with a single
 'mul' instruction while DMD will emit a call to this 18
 instructions long function. Now you keep telling me extended
 assembly is a step backwards. :)
DMD version:

DblWord bigMul(ulong x, ulong y)
{
    asm
    {
        naked;
        mov RAX,RDI;
        mul RSI;
        ret;
    }
}
In fact, the "correct" version of "mul eax" is:

asm { "mul{l} {%%}eax" : "=a" var : "a" var; }

- Works with both dialects (Intel and AT&T)
- Compiler knows the first register ("a") is read and written to, so doesn't keep temporaries stored there.
- Compiler loads the variable "var" into EAX before the statement is executed.
- Compiler knows that the value of "var" is in EAX after the statement is finished.

http://goo.gl/64SSD5

Just toggle on/off intel syntax to see the difference. :-)

I can agree that the way that instruction (or insn) templates look is pretty ugly. But IMO, for the most part on x86 their ugliness is attributable to having to support two types of assembler syntax at once.
Apr 13 2016
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 23:22:37 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

            "mulq %[y]"
            : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;  
I don't see anything elegant about those lines, starting with "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically. I can see nothing to recommend the: "=a" tmp.lo syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-) I have no idea what: "a" x and: [y] "rm" y mean, nor why the ":" appears sometimes and the "," other times.
Tell me again, what's more elegant!

uint* pnb = cast(uint*)cf.processorNameBuffer.ptr;

version(GNU)
{
    asm { "cpuid" : "=a" pnb[0], "=b" pnb[1], "=c" pnb[ 2], "=d" pnb[ 3] : "a" 0x8000_0002; }
    asm { "cpuid" : "=a" pnb[4], "=b" pnb[5], "=c" pnb[ 6], "=d" pnb[ 7] : "a" 0x8000_0003; }
    asm { "cpuid" : "=a" pnb[8], "=b" pnb[9], "=c" pnb[10], "=d" pnb[11] : "a" 0x8000_0004; }
}
else version(D_InlineAsm_X86)
{
    asm pure nothrow @nogc
    {
        push ESI;
        mov ESI, pnb;
        mov EAX, 0x8000_0002;
        cpuid;
        mov [ESI], EAX;
        mov [ESI+4], EBX;
        mov [ESI+8], ECX;
        mov [ESI+12], EDX;
        mov EAX, 0x8000_0003;
        cpuid;
        mov [ESI+16], EAX;
        mov [ESI+20], EBX;
        mov [ESI+24], ECX;
        mov [ESI+28], EDX;
        mov EAX, 0x8000_0004;
        cpuid;
        mov [ESI+32], EAX;
        mov [ESI+36], EBX;
        mov [ESI+40], ECX;
        mov [ESI+44], EDX;
        pop ESI;
    }
}
else version(D_InlineAsm_X86_64)
{
    asm pure nothrow @nogc
    {
        push RSI;
        mov RSI, pnb;
        mov EAX, 0x8000_0002;
        cpuid;
        mov [RSI], EAX;
        mov [RSI+4], EBX;
        mov [RSI+8], ECX;
        mov [RSI+12], EDX;
        mov EAX, 0x8000_0003;
        cpuid;
        mov [RSI+16], EAX;
        mov [RSI+20], EBX;
        mov [RSI+24], ECX;
        mov [RSI+28], EDX;
        mov EAX, 0x8000_0004;
        cpuid;
        mov [RSI+32], EAX;
        mov [RSI+36], EBX;
        mov [RSI+40], ECX;
        mov [RSI+44], EDX;
        pop RSI;
    }
}

-- Marco
Apr 16 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2016 2:40 PM, Marco Leise wrote:
 Tell me again, what's more elgant !
If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like GNU version.
Apr 16 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 16 Apr 2016 21:46:08 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/16/2016 2:40 PM, Marco Leise wrote:
 Tell me again, what's more elgant !  
If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like GNU version.
I hate the many pitfalls of extended asm: forget to mention a side effect in the "clobbers" list and the compiler assumes that register or memory location still holds the value from before the asm. Have an _input_ reg clobbered? You must NOT name it in the clobber list, but use it as a dummy output with a dummy variable assignment. The learning curve is steep and, as you said, it is usually unintelligible without prior knowledge.

But what I really miss from the last generation of inline assemblers are these points:

1. In most cases you can make the asm transparent to the optimizer, leading to:
   1.a Inlining of asm
   1.b Dead-code removal of asm blocks
2. Asm template arguments (e.g. input variables) are bound via constraints:
   2.a The output constraint `"=a" var` can mean any of "AL", "AX", "EAX" or "RAX" depending on the size of 'var'.
   2.b `"r" ptr` can bind 32-bit and 64-bit pointers, often eliminating the need for duplicate asm blocks that only differ in one mention of e.g. RSI vs. ESI.
   2.c The compiler seamlessly integrates host code variables with asm. No need to manually pick tmp registers to move parameters and output. `"r" myUint` is all it takes for 'myUint' to end up in any of EAX, EDX, ... (whatever the register allocator deems efficient at that point).
   2.d As a net result, asm templates often reduce to a single mnemonic and work with X86, X32 and AMD64.
3. In DMD I often see "naked" used to get rid of function prolog and epilog in an attempt to get an intrinsic-like, fast function. This requires extra care to get the calling convention right and may require more code duplication for e.g. Win32. Asm templates in GCC and LLVM benefit from this speedup automatically, because the backend will remove unneeded prolog/epilog code and even inline small functions.

GCC's historically grown template syntax, based on multiple _external_ assembler backends, ain't that great, and it is a PITA that it cannot understand the mnemonics and figure out side effects itself like DMD does.
But I hope I could highlight a few points where classic assemblers as found in Delphi or DMD fall behind in modern convenience and native efficiency. When C was invented it matched the CPUs quite well, but today we have dozens of instructions that C and D syntax has no expression for. All modern compilers spend a considerable amount of backend code on pattern matching code constructs like a layman's POPCNT and replacing them with optimal CPU instructions. More and more we turn to browsing the list of readily available compiler built-ins first, and the next step is to acknowledge the need and make inline assemblers powerful enough for programmers to efficiently implement non-existing intrinsics in library code. -- Marco
Apr 17 2016
prev sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 12 April 2016 at 22:22, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 9:53 AM, Marco Leise wrote:
 Your look on GCC (and LLVM) may be a bit biased. First of all
 you don't need to tell it exactly which registers to use. A
 rough classification is enough and gives the compiler a good
 idea of where calculations should be stored upon arrival at
 the asm statement. You can be specific down to the register
 name or let the backend chose freely with "rm" (= any register
 or memory).
 An example: We have a variable x that is computed inside a
 function followed by an asm block that multiplies it with
 something else. Typically you would "MOV EAX, [x]" to load x
 into the register that the MUL instruction expects. With
 extended assemblers you can be declarative about that and just
 state that x is needed in EAX as an input. You drop the MOV
 from the asm block and let the compiler figure out in its
 codegen, how x will end up in EAX. That's a step FORWARD in
 usability.
It's a step backwards because I can't just say "MUL EAX". I have to tell GCC what register the result gets put in. This is, to my mind, ridiculous. GCC's inline assembler apparently has no knowledge of what the opcodes actually do.
asm { "mul eax"; } - That wasn't so difficult. :-)

I don't know if accessing D data and calling functions from DMD-IASM is safe (it is in GDC extended IASM). But I have always chosen the path that requires the least amount of maintenance burden/overhead. And I'm sorry to say that supporting GCC-style extended assembler both comes for free (handling is managed by the middle-end) and requires no platform-specific support on the language implementation side.

However, I have always considered comparing the two a bit like apples and oranges. DMD compiles to object code, so it makes sense to me that you have an entire assembler embedded in. However GDC compiles to assembly, and I expect that GNU As will know a lot more about what opcodes actually do on, say, a Motorola 68k, than the poor man's parser I would be able to write.

There were a lot of challenges supporting DMD-style IASM in GDC, all non-existent in DMD. Drawing a list off the top of my head - I'll let you decide whether IASM is pro or con in this area, but again bear in mind that DMD doesn't have to deal with calling an external assembler.

- What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
- Some opcodes in IASM have a different name in the assembler (emitted fdivrp as fdivp, and fdivp as fdivrp - no idea why, but I recall std.math didn't work without the translation).
- Some opcodes are actually directives in disguise (db, ds, dw, ...)
- Frame-relative addressing/displacement of a symbol before the backend has decided where incoming parameters will land is a good way to get hit by a truck.
- GCC backend doesn't support naked functions on x86.
- Or even in the sense that DMD supports naked functions where there is support (only plain text assembler allowed).
- Want to support ARM? MIPS? PPC? At the time when GDC supported DMD-style IASM for x86, the implementation was over 3000 LOC; adding platform support just looked like an unmanageable nightmare.
Apr 12 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote:
 It's a step backwards because I can't just say "MUL EAX". I have to tell GCC
 what register the result gets put in. This is, to my mind, ridiculous. GCC's
 inline assembler apparently has no knowledge of what the opcodes actually
 do.
asm { "mul eax"; } - That wasn't so difficult. :-)
My understanding is that this is not sufficient if you want gcc to track register usage, etc. I could be wrong; I found it impossible to figure out from the gcc inline assembler documentation what was required and what wasn't. I'd just look at existing examples and modify them to suit :-(
 I don't know if D data and calling functions from DMD-IASM is safe
I don't know what you mean by 'safe' in this context. If you follow the ABI it should work.
 (it
 is in GDC extended IASM).  But I have always chosen the path that
 requires the least amount of maintenance burden/overhead.  And I'm
 sorry to say that supporting GCC-style extended assembler both comes
 for free (handling is managed by the middle-end), and requires no
 platform-specific support on the language implementation side.
Your decision makes sense.
 However, I have always considered comparing the two a bit like apples
 and oranges.  DMD compiles to object code, so it makes sense to me
 that you have an entire assembler embedded in.  However GDC compiles
 to assembly, and I expect that GNU As will know a lot more about what
 opcodes actually do on, say a Motorola 68k, than the poor mans parser
 I would be able to write.

 There were a lot of challenges supporting DMD-style IASM, all
 non-existent in DMD.  Drawing a list off the top of my head - I'll let
 you decide whether IASM is pro or con in this area, but again bear in
 mind that DMD doesn't have to deal with calling an external assembler.

 - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
 - Some opcodes in IASM have a different name in the assembler (Emitted
 fdivrp as fdivp, and fdivp as fdivrp. No idea why but I recall
 std.math didn't work without the translation).
DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide.
 - Some opcodes are actually directives in disguise (db, ds, dw, ...)
 - Frame-relative addressing/displacement of a symbol before the
 backend has decided where incoming parameters will land is a good way
 to get hit by a truck.
 - GCC backend doesn't support naked functions on x86.
 - Or even in the sense that DMD supports naked functions where there
 is support (only plain text assembler allowed)
 - Want to support ARM? MIPS? PPC?  At the time when GDC supported
 DMD-style IASM for x86, the implementation was over 3000 LOC, adding
 platform support just looked like an unmanageable nightmare.
I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) and it is not terribly important. But core.cpuid needs to be made to work in GDC, whatever it takes to do so.

----

Personally, I strongly dislike the fact that the GAS syntax is the reverse of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It makes my eyeballs hurt looking at it. It's like giving me a car with the brake and gas pedals reversed. Nothing but accidents result :-) And I don't like that they use different opcodes than the Intel manuals. That just sux, too.

But I know that the GNU world is stuck with that, and GDC should behave like the rest of GCC.
Apr 12 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/12/2016 4:35 PM, Iain Buclaw via Digitalmars-d wrote:
 - What dialect am I writing in? (Do I emit mul or mull? eax or %eax?)
 - Some opcodes in IASM have a different name in the assembler (Emitted
 fdivrp as fdivp, and fdivp as fdivrp. No idea why but I recall
 std.math didn't work without the translation).
DMD's iasm uses the opcodes as written in the Intel CPU manuals. There is no MULL opcode in the manual, so no MULL in DMD's iasm. It figures out which opcode by looking at the operands, using the Intel CPU manual as a guide. It's a bit of a pain as there are a lot of special cases, but the end result is pretty straightforward if you're using the Intel CPU manual as a reference guide.
My only point was that in GDC, the translation of opcodes to machine code is done in two steps by two separate processes, rather than one. DMD is proof that having a unified syntax is a big win.
 - Some opcodes are actually directives in disguise (db, ds, dw, ...)
 - Frame-relative addressing/displacement of a symbol before the
 backend has decided where incoming parameters will land is a good way
 to get hit by a truck.
 - GCC backend doesn't support naked functions on x86.
 - Or even in the sense that DMD supports naked functions where there
 is support (only plain text assembler allowed)
 - Want to support ARM? MIPS? PPC?  At the time when GDC supported
 DMD-style IASM for x86, the implementation was over 3000 LOC, adding
 platform support just looked like an unmanageable nightmare.
I understand that GDC has special challenges because it writes to an assembler rather than direct to object code. I understand it is not easy to replicate DMD's iasm functionality. Which is why I haven't given you a hard time about it :-) and it is not terribly important. But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
Indeed, it's been on my TODO list for a long time, among many other things. :-)
 ----

 Personally, I strongly dislike the fact that the GAS syntax is the reverse
 of Intel's. It isn't just GAS, it's GDB and everything else. It just sux. It
 makes my eyeballs hurt looking at it. It's like giving me a car with the
 brake and gas pedals reversed. Nothing but accidents result :-) And I don't
 like that they use different opcodes than the Intel manuals. That just sux,
 too.
Like riding a backwards bicycle. :-) https://www.youtube.com/watch?v=MFzDaBzBlL0
 But I know that the GNU world is stuck with that, and GDC should behave like
 the rest of GCC.
Yeah, and I'm glad that you do.
Apr 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 09:51:25 +0200
schrieb Iain Buclaw via Digitalmars-d
<digitalmars-d puremagic.com>:

 On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
  
Indeed, it's been on my TODO list for a long time, among many other things. :-)
Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- Marco
Apr 13 2016
parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 11:13, Marco Leise via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 Am Wed, 13 Apr 2016 09:51:25 +0200
 schrieb Iain Buclaw via Digitalmars-d
 <digitalmars-d puremagic.com>:

 On 13 April 2016 at 07:59, Walter Bright via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 But core.cpuid needs to be made to work in GDC, whatever it takes to do so.
Indeed, it's been on my TODO list for a long time, among many other things. :-)
Would you want to implement this in the compiler like the checkedint functions? I guess that's the only way to guarantee cross-module inlining with GDC. Otherwise I would use __builtin_cpu_supports (const char *feature). (GCC practically has its own internal core.cpuid implementation made of intrinsics.) -- Marco
Yes, cpu_supports is a good way to do it, as we only need to invoke __builtin_cpu_init once and cache all values when running 'shared static this()'. I would also like to be able to support other processors. ARM is a high-priority one which should follow suit.
Apr 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 11:21:35 +0200
schrieb Iain Buclaw via Digitalmars-d
<digitalmars-d puremagic.com>:

 Yes, cpu_supports is a good way to do it as we only need to invoke
 __builtin_cpu_init once and cache all values when running 'shared
 static this()'.
I was under the assumption that GCC already emits an 'early' static ctor with a call to __builtin_cpu_init(). It is also likely that we don't need extra code to copy GCC's cache to core.cpuid's cache (unless the cached data is publicly exposed somehow).

What is your stance on the cross-module inlining issue? Stuff like hasPopcnt etc. won't be inlined unless you turn them into compiler-recognised builtins, right? It's not a blocker, but something to keep in mind when not accessing global variables directly.

How about this style as an alternative?:

immutable bool mmx;
immutable bool hasPopcnt;

shared static this()
{
    import gcc.builtins;
    mmx       = __builtin_cpu_supports("mmx"   ) > 0;
    hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
}

-- Marco
Apr 13 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
Please do not invent an alternative interface, use the one in core.cpuid: http://dlang.org/phobos/core_cpuid.html#.mmx
Apr 13 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 13 Apr 2016 04:14:48 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
  
Please do not invent an alternative interface, use the one in core.cpuid: http://dlang.org/phobos/core_cpuid.html#.mmx
Yes, they are all @property, and a substitution with direct access to the globals will work around GDC's lack of cross-module inlining. Otherwise these feature checks, which might be used in hot code, are more costly than they should be. I hate when things get in the way of efficiency. :)

-- Marco
Apr 13 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/13/2016 5:47 AM, Marco Leise wrote:
 Yes, they are all  property and a substitution with direct
 access to the globals will work around GDC's lack of
 cross-module inlining. Otherwise these feature checks which
 might be used in hot code, are more costly than they should be.
 I hate when things get in the way of efficiency. :)
It doesn't need to be efficient, because such checks should be done at a higher level in the program's logic, not on low level code. Even so, the program could cache the result of the call.
Apr 13 2016
prev sibling parent reply Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 13 April 2016 at 13:14, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/13/2016 3:58 AM, Marco Leise wrote:
 How about this style as an alternative?:

 immutable bool mmx;
 immutable bool hasPopcnt;

 shared static this()
 {
      import gcc.builtins;
      mmx       = __builtin_cpu_supports("mmx"   ) > 0;
      hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
 }
Please do not invent an alternative interface, use the one in core.cpuid: http://dlang.org/phobos/core_cpuid.html#.mmx
An alternative interface needs to be invented anyway for other CPUs.
Apr 14 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/14/2016 1:21 AM, Iain Buclaw via Digitalmars-d wrote:
 An alternative interface needs to be invented anyway for other CPUs.
That would be fine. But there is no reason to redo core.cpuid for x86 machines.
Apr 14 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?
Target cpu configuration:
- CPU architecture (done)
Done.
 - Count of FP/Integer registers
??
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
 - Compiler optimization options (for math)
Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
Apr 04 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:
 On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?
Target cpu configuration: - CPU architecture (done)
Done.
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD floating point registers, and SIMD integer registers does the CPU have?
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
This is not enough. One needs to know at compile time whether it is AVX or AVX2 (the source code for these cases may be completely different).
 - Compiler optimization options (for math)
Moot. DMD does not have compiler switches to set FP code generation. (This is deliberate.)
We have LDC and GDC. And it looks like a little standardization based on DMD would be good, even if it would be useless for DMD itself. With compile time information about the CPU it is possible to always have a fast generic BLAS for any target as soon as LLVM is released for that target. D+LLVM = fast generic BLAS. For DMD and GDC there would be target-specific BLAS optimizations. The OpenBLAS kernels are 30 MB of assembler code! So we would be able to replace them once and for a very long time with Phobos.

Best regards,
Ilya
Apr 04 2016
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:
 OpenBLAS kernels is 30 MB of assembler code! So we would be 
 able to replace it once and for a very long time with Phobos.
Are you familiar with this project at all? https://github.com/flame/blis
Apr 04 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 21:13:30 UTC, jmh530 wrote:
 On Monday, 4 April 2016 at 21:05:44 UTC, 9il wrote:
 OpenBLAS kernels is 30 MB of assembler code! So we would be 
 able to replace it once and for a very long time with Phobos.
Are you familiar with this project at all? https://github.com/flame/blis
Thanks for the link. BLIS has the same issue as OpenBLAS - a collection of kernels for each target. I want to write an internal kernel compiler (like CT regex) that will build kernels based on CT information about the target.

Best regards,
Ilya
Apr 04 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 2:05 PM, 9il wrote:
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD Floating Point registers, SIMD Integer registers have a CPU?
These are deducible from X86, X86_64, and SIMD version identifiers.
 Needs to know is it AVX or AVX2 in compile time
Since the compiler never generates AVX or AVX2 instructions, there is no purpose in providing such a predefined version identifier. You might as well use a -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and, by default, generate code specific to that CPU.
 (this may be completely different source code for this cases).
It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.
 We have LDC and GDC. And looks like a little bit standardization based on DMD
 would be good, even if this would be useless for DMD.
There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.
 With compile time information about CPU it is possible to always have fast
 generic BLAS for any target as soon as LLVM is released for this target.
The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Apr 04 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Monday, 4 April 2016 at 22:34:06 UTC, Walter Bright wrote:
 On 4/4/2016 2:05 PM, 9il wrote:
 - Count of FP/Integer registers
??
How many general purpose registers, SIMD Floating Point registers, SIMD Integer registers have a CPU?
These are deducible from X86, X86_64, and SIMD version identifiers.
It is impossible to deduce from that combination that Xeon Phi has 32 FP registers.
 Needs to know is it AVX or AVX2 in compile time
Since the compiler never generates AVX or AVX2 instructions, there is no purpose to setting such as a predefined version identifier. You might as well use a: -version=AVX switch. Note that it is a very bad idea for a compiler to detect the CPU it is running on and default generate code specific to that CPU.
"Since the compiler never generates AVX or AVX2" - this is definitely not true; see, for example, LLVM vectorization and SLP vectorization. This is a normal situation for scientific software, supercomputer software, and high-performance server applications.
 (this may be completely different source code for this cases).
It's entirely practical to compile code with different source code, link them *both* into the executable, and switch between them based on runtime detection of the CPU.
This approach is complex, and normal for desktop applications. But if you have a big cluster of similar computers or a supercomputer cluster, the only thing you want to do is `-mcpu=native`/`-march=native`. And this single compiler flag should be enough to build a high-performance linear algebra application.
 We have LDC and GDC. And looks like a little bit 
 standardization based on DMD
 would be good, even if this would be useless for DMD.
There is no such thing as a standard compiler floating point switch, and I'm doubtful defining one would be practical or make much of any sense.
I just want a unified instrument to receive CT information about the target and the optimization switches. It is OK if this information is exposed through different switches on different compilers.
 With compile time information about CPU it is possible to 
 always have fast
 generic BLAS for any target as soon as LLVM is released for 
 this target.
The SIMD instruction set is highly resistant to transforming generic code into optimal vector instructions. Yes, I know about auto-vectorization, and in general it is a doomed and unworkable technology. http://www.amazon.com/dp/0974364924 It's gotta be done by hand to get it to fly.
Auto-vectorization is only one example (maybe a bad one). I would use SIMD vectors, but I need CT information about the target CPU, because it is impossible to build optimal BLAS kernels without it! My idea is an internal kernel compiler :-) Something similar to compile-time regex, but more complex. Best regards, Ilya
Apr 04 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 11:10 PM, 9il wrote:
 It is impossible to deduce from that combination that a Xeon Phi has 32 FP
registers.
Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless.
 "Since the compiler never generates AVX or AVX2" - this is definitely not true,
 see, for example, LLVM vectorization and SLP vectorization.
dmd is not LLVM.
 It's entirely practical to compile code with different source code, link them
 *both* into the executable, and switch between them based on runtime detection
 of the CPU.
This approach is complex,
Not at all. Used to do it all the time in the DOS world (FPU vs emulation).
 I just want an unified instrument to receive CT information about target and
 optimization switches. It is OK if this information would have different
 switches on different compilers.
Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.
 Auto vectorization is only example (maybe bad). I would use SIMD vectors, but I
 need CT information about target CPU, because it is impossible to build optimal
 BLAS kernels without it!
I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
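For concreteness, a minimal sketch of this version-switch approach (the `UseAVX` identifier and the function names are illustrative, not any real druntime API; the identifier would be supplied with `-version=UseAVX` on the command line):

```d
// kernel.d -- pick an implementation at compile time, e.g.
//   dmd -version=UseAVX kernel.d
// `UseAVX` is a made-up identifier, not a predefined one.
float dot(const(float)[] a, const(float)[] b)
{
    version (UseAVX)
    {
        // An AVX path (e.g. core.simd code) would go here.
        return dotScalar(a, b); // placeholder so the sketch compiles
    }
    else
    {
        return dotScalar(a, b);
    }
}

private float dotScalar(const(float)[] a, const(float)[] b)
{
    float sum = 0;
    foreach (i; 0 .. a.length)
        sum += a[i] * b[i];
    return sum;
}
```

Without the switch, the `else` branch is compiled; with it, the custom code takes over.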
Apr 05 2016
next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 It is impossible to deduce from that combination that a Xeon Phi 
 has 32 FP registers.
Since dmd doesn't generate specific code for a Xeon Phi, having a compile time switch for it is meaningless.
 "Since the compiler never generates AVX or AVX2" - this is 
 definitely not true,
 see, for example, LLVM vectorization and SLP vectorization.
dmd is not LLVM.
The particular design and limitations of the dmd backend shouldn't be used to define D. In the extreme, your argument would imply that there's no point having version(ARM) built in to the language, because dmd doesn't support it.
 It's entirely practical to compile code with different source 
 code, link them
 *both* into the executable, and switch between them based on 
 runtime detection
 of the CPU.
This approach is complex,
Not at all. Used to do it all the time in the DOS world (FPU vs emulation).
 I just want an unified instrument to receive CT information 
 about target and
 optimization switches. It is OK if this information would have 
 different
 switches on different compilers.
Optimizations simply do not transfer from one compiler to another, whether the switch is the same or not. They are highly implementation dependent.
 Auto vectorization is only example (maybe bad). I would use 
 SIMD vectors, but I
 need CT information about target CPU, because it is impossible 
 to build optimal
 BLAS kernels without it!
I still don't understand why you cannot just set '-version=xxx' on the command line and then switch off that version in your custom code.
So you're suggesting that libraries invent their own list of versions for specific architectures / CPU features, which the user then has to specify somehow on the command line? I want to be able to write code that uses standardised versions that work across various D compilers, with the user only needing to type e.g. -march=native on GDC and get the fastest possible code.
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 2:03 AM, John Colvin wrote:
 So you're suggesting that libraries invent their own list of versions for
 specific architectures / CPU features, which the user then has to specify
 somehow on the command line?
 I want to be able to write code that uses standardised versions that work
across
 various D compilers, with the user only needing to type e.g. -march=native on
 GDC and get the fastest possible code.
There's a line between trying to standardize everything and letting add-on libraries be free to innovate.

Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseam over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)

My experience with command line FPU switches is that few users understand what they do and even fewer use them correctly. In fact, I suspect that a command line FPU switch is too global a hammer. A pragma set in just the functions that need it might be much better.

-------

In any case, this is not a blocker for getting the library designed, built and debugged.
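A hedged sketch of this n-modules-plus-runtime-detection scheme, using druntime's `core.cpuid` (the module layout and the `sumXxx` names are illustrative; the "SIMD" bodies are scalar placeholders standing in for real `core.simd` implementations):

```d
// Detect the SIMD level once at startup and dispatch through a
// function pointer, as described above. core.cpuid is real druntime.
import core.cpuid : avx, sse2;

int sumScalar(const(int)[] a)
{
    int s = 0;
    foreach (x; a) s += x;
    return s;
}

// Placeholder "SIMD" implementations; a real library would put each
// in its own module built for that instruction set.
int sumSSE2(const(int)[] a) { return sumScalar(a); }
int sumAVX(const(int)[] a)  { return sumScalar(a); }

// Chosen once at startup based on what the CPU reports.
int function(const(int)[]) sumOp;

shared static this()
{
    if (avx)       sumOp = &sumAVX;
    else if (sse2) sumOp = &sumSSE2;
    else           sumOp = &sumScalar;
}
```

Callers only ever see `sumOp`; the binary runs on any CPU and still uses the best code path available.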
Apr 05 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 On 4/5/2016 2:03 AM, John Colvin wrote:
 There's a line between trying to standardize everything and 
 letting add-on libraries be free to innovate.

 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently implemented.)

 My experience with command line FPU switches is few users 
 understand what they do and even fewer use them correctly.

 In fact, I suspect that having a command line FPU switch is too 
 global a hammer. A pragma set in just the functions that need 
 it might be much better.
What's wrong with a scientist writing `-mcpu=native`?
 -------

 In any case, this is not a blocker for getting the library 
 designed, built and debugged.
Yes, but it is a bad idea to have a set of versions just for Phobos, is it not? Ilya
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 4:17 AM, 9il wrote:
 What's wrong with a scientist writing `-mcpu=native`?
Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
Apr 05 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 00:45:54 UTC, Walter Bright wrote:
 On 4/5/2016 4:17 AM, 9il wrote:
 What's wrong with a scientist writing `-mcpu=native`?
Because it would affect all the code in the module and every template it imports, which is a problem if you are using 'static if' and want to compile different pieces with different settings.
99.99% of them do not need to compile code with different settings. Furthermore, 90% of them don't know what CPU their supercomputer has. They just want code that is as fast as possible, without googling which instructions are available on their CPU.
Apr 05 2016
prev sibling parent reply Joe Duarte <jose.duarte asu.edu> writes:
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently implemented.)
There are many organizations in the world building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions. In these settings -- many of them scientific computing shops or big data center operators -- they know what servers they have and what CPU platforms they have. They don't care about portability to the past, older computers and so forth. A runtime check would make no sense for them, not for their baseline, and it would probably be a waste of time for them to design code to run on pre-AVX silicon. (AVX is not new anymore -- it's been around for a few years.)

Good examples can be found on Cloudflare's blog, especially Vlad Krasnov's posts. Here's one where he accelerates Golang's crypto libraries: https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/ Companies like CF probably spend millions of dollars on electricity, and there are some workloads where AVX-optimized code can yield tangible monetary savings.

Someone else talked about marking "Broadwell" and other generation names. As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Software Guard Extensions (SGX), which are very powerful for creating secure enclaves on an untrusted host.

On the broader SIMD-as-first-class-citizen issue, I think it would be worth thinking about how to bake SIMD into the language instead of bolting it on. If I were designing a new language in 2016, I would take a fresh look at how SIMD could be baked into a language's core constructs. 
I'd think about new loop abstractions that could make SIMD easier to exploit, and how to nudge programmers away from serial monotonic mindsets and into more of a SIMD/FMA way of reasoning.
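For what it's worth, D's built-in array operations already nudge in that direction; a tiny example (the compiler and runtime are free to vectorize the implicit loop behind the `[]` syntax):

```d
// A D array op: whole-array arithmetic with no explicit loop.
// The element-wise traversal is left to the compiler/runtime,
// which may lower it to SIMD code.
void scale(float[] dst, const(float)[] src, float k)
{
    dst[] = src[] * k;
}
```

This is the kind of construct that states intent at the array level rather than the serial-loop level.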
Apr 17 2016
next sibling parent Temtaime <temtaime gmail.com> writes:
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
 Besides, I think it's a poor design to customize the app for 
 only one SIMD type. A better idea (I've repeated this ad 
 nauseam over the years) is to have n modules, one for each 
 supported SIMD type. Compile and link all of them in, then 
 detect the SIMD type at runtime and call the corresponding 
 module. (This is how the D array ops are currently 
 implemented.)
There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions.
In addition, it's the COMPILER's work, not the programmer's! The compiler SHOULD be able to vectorize the code using SSE/AVX depending on a command line switch. Why should I have to write all this mess? Let the compiler do its work. Also, the compiler CAN generate multiple versions of one function using different SIMD instructions: the Intel C++ Compiler works this way; it generates a few versions of a function, checks the CPU capabilities at run time, and executes the fastest one.
Apr 17 2016
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 
 Someone else talked about marking "Broadwell" and other 
 generation names. As others have said, it's better to specify 
 features. I wanted to chime in with a couple of additional 
 examples. Intel's transactional memory accelerating 
 instructions (TSX) are only available on some Broadwell parts 
 because there was a bug in the original implementation (Haswell 
 and early Broadwell) and it's disabled on most. But the new 
 Broadwell server chips have it, and it's a big deal for some DB 
 workloads. Similarly, only some Skylake chips have the Software 
 Guard Extensions (SGX), which are very powerful for creating 
 secure enclaves on an untrusted host.
Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
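To make the question concrete, such a trait would presumably be used like this (hypothetical code: no D compiler implemented `targetHasFeature` at the time of this post, so the sketch is the proposed usage, not a working API):

```d
// Hypothetical: choose a kernel at compile time from target features.
// Whether enabling "sse3" should imply that "sse2" also reports true
// is exactly the open question in the post above.
static if (__traits(targetHasFeature, "sse3"))
    enum kernel = "sse3";
else static if (__traits(targetHasFeature, "sse2"))
    enum kernel = "sse2";
else
    enum kernel = "scalar";
```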
Apr 23 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 23 Apr 2016 10:40:12 +0000
schrieb Johan Engelen <j j.nl>:

 I have a question perhaps you can comment on?
 With LLVM, it is possible to specify something like "+sse3,-sse2" 
 (I did not test whether this actually results in SSE3 
 instructions being used, but no SSE2 instructions). What should 
 be returned when querying whether "sse3" feature is enabled?
 Should __traits(targetHasFeature, "sse3") == true mean that 
 implied features (such as sse and sse2) are also available?
Please do test it. Activating sse3 and disabling sse2 likely causes the compiler to silently re-enable sse2 as a dependency or error out. -- Marco
Apr 23 2016
prev sibling parent Joe Duarte <jose.duarte asu.edu> writes:
On Saturday, 23 April 2016 at 10:40:12 UTC, Johan Engelen wrote:
 On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
 
 Someone else talked about marking "Broadwell" and other 
 generation names. As others have said, it's better to specify 
 features. I wanted to chime in with a couple of additional 
 examples. Intel's transactional memory accelerating 
 instructions (TSX) are only available on some Broadwell parts 
 because there was a bug in the original implementation 
 (Haswell and early Broadwell) and it's disabled on most. But 
 the new Broadwell server chips have it, and it's a big deal 
 for some DB workloads. Similarly, only some Skylake chips have 
 the Software Guard Extensions (SGX), which are very powerful 
 for creating secure enclaves on an untrusted host.
Thanks, I've seen similar comments in LLVM code. I have a question perhaps you can comment on? With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled? Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
If you specify SSE3, you should definitely get SSE2 and plain old SSE with it. SSE3 is a superset of SSE2 and includes all the SSE2 instructions (more than 100 I think.) I'm not sure about your syntax – I thought the hyphen meant to include the option, not remove it, and I haven't seen the addition sign used for those settings. But I haven't done much with those optimization flags. You wouldn't want to exclude SSE2 support because it's becoming the bare minimum baseline for modern systems, the de facto FP unit. Windows 10 requires a CPU with SSE2, as do more and more applications on the archaic Unix-like platforms.
May 02 2016
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' 
 on the command line and then switch off that version in your 
 custom code.
I can do it, however I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong. Ilya
Apr 05 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' on the command
 line and then switch off that version in your custom code.
I can do it, however I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
Apr 05 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplify the user 
 experience.
 3. This is possible and not very hard to implement if I am not 
 wrong.
Where does the compiler get the information that it should compile for, say, AFX?
No idea about AFX. Did you choose AFX so that I could not find an example? You know better than I do that GCC- and LLVM-based compilers have options like -march, -mcpu, -mtarget, -mtune and others. And things like `-mcpu=native` or `-march=native` are allowed. Ilya
Apr 05 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2016 4:07 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplify the user experience.
 3. This is possible and not very hard to implement if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
 No idea about AFX. Did you choose AFX so that I could not find an example?
I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.
Apr 05 2016
next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 
 I want to make it clear that dmd does not generate AFX specific 
 code, has no switch to enable AFX code generation and has no 
 basis for setting predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Apr 05 2016
next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 21:41:46 UTC, Johan Engelen wrote:
 On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 
 I want to make it clear that dmd does not generate AFX 
 specific code, has no switch to enable AFX code generation and 
 has no basis for setting predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
Yes, something like that is what I am looking for. Two nitpicks: 1. __target("broadwell") is not a good API. Something like this would be better: enum target = __target(); // .. use target 2. Is it possible to reflect additional settings about the instruction set? Maybe "broadwell,-avx"?
Apr 05 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 I want to make it clear that dmd does not generate AFX specific code, has
 no switch to enable AFX code generation and has no basis for setting
 predefined version identifiers for it.
How about adding a "__target(...)" compile-time function, that would return false if the compiler doesn't know? __target("broadwell") --> true means: target cpu is broadwell, false means compiler doesn't know or target cpu is not broadwell. Would that work for all?
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible CPU ever known and its associated feature set. Knowing the features we're interested in is what we need.
Apr 06 2016
parent reply 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible CPU ever known and its associated feature set. Knowing the features we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module, so compilers would need less work. --Ilya
Apr 06 2016
next sibling parent reply Johan Engelen <j j.nl> writes:
On Wednesday, 6 April 2016 at 13:26:51 UTC, 9il wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible CPU ever known and its associated feature set. Knowing the features we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module, so compilers would need less work. --Ilya
After browsing through some LLVM code, I think it is actually very easy for LDC to also tell you which features (sse2, avx, etc.) a target supports. Probably the most difficult part is defining an API. Ilya made a start here: http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org (but he doesn't like his earlier API "bool a = __target("broadwell")" any more ;-P , I also think enum cpu = __target(); would be nicer)
Apr 06 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 6 April 2016 at 14:31:58 UTC, Johan Engelen wrote:
 Probably the most difficult part is defining an API. Ilya made 
 a start here:
 http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org
 (but he doesn't like his earlier API "bool a = 
 __target("broadwell")" any more ;-P , I also think enum cpu = 
 __target(); would be nicer)
Ahaha)) --Ilya
Apr 06 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 6 April 2016 at 23:26, 9il via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 [...]
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible CPU ever known and its associated feature set. Knowing the features we're interested in is what we need.
Yes, however this can be implemented in a special Phobos module, so compilers would need less work. --Ilya
Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors as they become available. Remember, most processors are ARM processors, and there are something like 20 manufacturers of ARM chips; many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to a random manufacturer's ARM chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular ARM chip by name when compiling for ARM. It's a sloppy relationship comparing Intel and AMD, let alone the myriad of ARM chips available. TL;DR, defining architectures with an Intel-centric naming convention is a very bad idea.
Apr 06 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:
 Sure, but it's an ongoing maintenance task, constantly requiring
 population with metadata for new processors that become available.
 Remember, most processors are arm processors, and there are like 20
 manufacturers of arm chips, and many of those come in a series of
 minor variations with/without sub-features present, and in a lot of
 cases, each permutation of features attached to random manufacturers
 arm chip 'X' doesn't actually have a name to describe it. It's also
 completely impractical to declare a particular arm chip by name when
 compiling for arm. It's a sloppy relationship comparing intel and AMD
 let alone the myriad of arm chips available.
 TL;DR, defining architectures with an intel-centric naming convention
 is a very bad idea.
You're not making a good case for a standard language-defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
Apr 06 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 6 Apr 2016 20:29:21 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:
 TL;DR, defining architectures with an intel-centric naming convention
 is a very bad idea.  
You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).
We can either define the language in terms of CPU models or features, and Manu gave two good reasons to go with features: 1) Typically we end up with "version(SSE4)" and similar in our code, not "version(Haswell)". 2) On ARM chips it turns out to be difficult to translate models to features to begin with. It wasn't a case for or against the feature in general.

That said, in the long run D should grow such a language feature. Aside from scientific servers, there are also a few Linux distributions that compile and install most packages from source, and telling the compiler to target the host CPU comes naturally there. In practice there is likely some config file that sets an environment variable like CFLAGS to "-march=native" on such systems. I understand that DMD doesn't concern itself with all that, but the D language itself, of which DMD is one implementation, should not be artificially limited compared to popular C/C++ compilers.

I died a bit on the inside when I saw Phobos add both popcnt and _popcnt, of which the latter is the version that uses the POPCNT instruction found in newer x86 CPUs. In GCC or LLVM, when we use such an intrinsic, the compiler will take a look at the compilation target and pick the optimal code at compile time. In one micro-benchmark [1], POPCNT was roughly 50 times faster than bit-twiddling. If I wanted an SSE4 version in otherwise generic amd64 code, I would add attribute("target", "+sse4") before the function using popcnt.

So in my eyes a system like GCC's, where you can specify target features on the command line and also override them for specific functions, is a viable solution that simplifies user code (just picking the popcnt, clz, bsr, ... intrinsic will always be optimal) and Phobos code, by making _popcnt et al. superfluous. In addition, the compiler could later error out on mnemonics in our inline assembly that don't exist on the target. This avoids unexpected "Illegal Instruction" crashes. 
[1] http://kent-vandervelden.blogspot.de/2009/10/counting-bits-population-count-and.html -- Marco
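Marco's popcnt point in code (a sketch: `core.bitop.popcnt` is the portable druntime intrinsic, and the comment describes the lowering a target-aware compiler could perform; nothing here is DMD's actual behavior):

```d
import core.bitop : popcnt;

// Portable population count: returns the number of set bits in x.
// With target information, the compiler could emit a single POPCNT
// instruction here instead of a bit-twiddling fallback -- which is
// why a separate _popcnt entry point would become superfluous.
int setBits(uint x)
{
    return popcnt(x);
}
```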
Apr 11 2016
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 7 Apr 2016 12:25:03 +1000
schrieb Manu via Digitalmars-d <digitalmars-d puremagic.com>:

 On 6 April 2016 at 23:26, 9il via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:  
 On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:  
 [...]  
With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know processor model, then we need to keep a compile-time table in our code somewhere if every possible cpu ever known and it's associated feature set. Knowing the feature we're interested is what we need.
Yes, however this can be implemented in a spcial Phobos module. So compilers would need less work. --Ilya
Sure, but it's an ongoing maintenance task, constantly requiring population with metadata for new processors that become available. Remember, most processors are arm processors, and there are like 20 manufacturers of arm chips, and many of those come in a series of minor variations with/without sub-features present, and in a lot of cases, each permutation of features attached to random manufacturers arm chip 'X' doesn't actually have a name to describe it. It's also completely impractical to declare a particular arm chip by name when compiling for arm. It's a sloppy relationship comparing intel and AMD let alone the myriad of arm chips available. TL;DR, defining architectures with an intel-centric naming convention is a very bad idea.
GCC already keeps a cpu <=> feature mapping (after all, it needs to know what features it can use when you specify -mcpu), so for GDC exposing available features isn't more difficult than exposing the CPU type. I'm not sure if you can actually enable/disable CPU features manually without -mcpu? However, available features, and even the type used to describe the CPU, are completely architecture specific in GCC. This means for GDC we have to write custom code for every supported architecture. (We already have to do this for version(Architecture) though.)

FYI this is handled in the gcc/config subsystem: https://github.com/gcc-mirror/gcc/tree/master/gcc/config #defines for C/ARM: arm_cpu_builtins in https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-c.c (__ARM_NEON__ etc.) As you can see, the only common requirement for backend architectures is to call def_or_undef_macro. This means we have to modify the gcc/config files and write replacements for arm_cpu_builtins and similar functions. Known ARM cores and feature sets: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-cores.def

I guess every backend architecture has to provide cpu names for -mcpu, so that's probably the one thing we could expose to D for all architectures. (Names are of course GCC specific, but I guess LLVM should use compatible names.) This is less work to implement in GDC, but you'd have to duplicate the GCC feature table in Phobos. OTOH, standardizing the names and the available feature flags means somebody with knowledge in that area has to write down a spec.

TL;DR: If required, we can always expose compiler specific versions (GNU_NEON/LDC_NEON) even without DMD approval/integration. This should be coordinated with LDC though. Somebody has to make a list of needed identifiers, preferably mentioning the matching C macros. Things get much more complicated if you need feature flags not currently used by / present in GCC.
Apr 07 2016
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 5 April 2016 at 21:29:41 UTC, Walter Bright wrote:
 On 4/5/2016 4:07 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 10:30:19 UTC, Walter Bright wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright 
 wrote:
 1. This would help to eliminate configuration bugs.
 2. This would reduce work for users and simplified user 
 experience.
 3. This is possible and not very hard to implement if I am 
 not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
No idea about AFX. Do you choose AFX to disallow me to find an example?
I want to make it clear that dmd does not generate AFX specific code, has no switch to enable AFX code generation and has no basis for setting predefined version identifiers for it.
Please remember that D has other compilers, not only DMD. We need a language feature, and I am ok that this feature would be useless for DMD. But the fact that DMD cannot optimize code for, say, AVX, AVX2, AVX-512, FMA4, ..., is not a good reason to reject small language changes that would be very helpful for the D community. Yes, only a few of us would use this feature directly; however, many of us would use this under the hood in the BLAS/SIMD-oriented parts of Phobos.
Apr 05 2016
parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 6 April 2016 at 06:11:15 UTC, 9il wrote:
 Yes, only few of us would use this feature directly, however, 
 many of us would use this under-the-hood in BLAS/SIMD oriented 
 part of Phobos.
Especially since everyone says to use LDC for the fastest code anyway...
Apr 06 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 5 April 2016 at 20:30, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/5/2016 2:39 AM, 9il wrote:
 On Tuesday, 5 April 2016 at 08:34:32 UTC, Walter Bright wrote:
 On 4/4/2016 11:10 PM, 9il wrote:
 I still don't understand why you cannot just set '-version=xxx' on the
 command
 line and then switch off that version in your custom code.
I can do it; however, I would like to get this information from the compiler. Why? 1. This would help to eliminate configuration bugs. 2. This would reduce work for users and simplify the user experience. 3. This is possible and not very hard to implement, if I am not wrong.
Where does the compiler get the information that it should compile for, say, AFX?
I would add that GDC and LDC have such compiler flags, and it's possible that they could pass the state of those flags through as versions, but all compilers need to agree on the set of versions that will be defined for this purpose. If DMD users express them as -version=[STANDARD_VERSION_NAME], that's fine, I guess, but a proper flag would help avoid the situation where people get the version names wrong, and it feels a little bit more deliberate. Setting a version this way might lead them to presume that it's just an arbitrary setting by the author of the build script, and not actually an agreed standard name that GDC and LDC also produce from their compiler flags. But at the very least, the important detail is that the version IDs are standardised and shared among all compilers.
Apr 06 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.
It's a reasonable suggestion; some points:

1. This has been characterized as a blocker. It is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.

2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjust based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will not be easily predictable.

I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
         ... code ...
    }

Doing it on the command line is certainly the traditional way, but it strikes me as being bug-prone and as unhygienic and obsolete as the C preprocessor is (for similar reasons).
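A fleshed-out sketch of what the proposed pragma might look like in use; neither this pragma name nor "AFX" exists in any current compiler, and AFX here simply stands in for a concrete instruction set such as AVX:

```d
// Hypothetical pragma (not implemented anywhere): the compiler would be
// allowed to use AFX instructions only inside this block, making the
// feature level lexically scoped instead of a global command-line setting.
pragma(SIMD, AFX)
{
    void blend(float[] dst, const(float)[] a, const(float)[] b, float t)
    {
        foreach (i; 0 .. dst.length)
            dst[i] = a[i] * (1 - t) + b[i] * t;
    }
}
```

The scoping is the point of the proposal: two such blocks with different feature levels could coexist in one module, which a per-module command-line switch cannot express.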
Apr 06 2016
next sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 April 2016 at 10:42, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.
It's a reasonable suggestion; some points: 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
 2. I'm not sure these global settings are the best approach, especially if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.

Runtime selection is not practical in a broad sense. Emitting small fragments of SIMD here and there will probably take a loss if they are all surrounded by a runtime selector. SIMD is all about pipelining, and runtime branches on SIMD version are the antithesis of good SIMD usage; they can't be applied for small-scale deployment. In my experience, runtime selection is desirable for large-scale instantiations at an outer level of the work loop.

I've tried to design this intent into my library, by making each SIMD API capable of receiving SIMD version information via template arg, and within the library, the version is always passed through to dependent calls. The idea is, if you follow this pattern, propagating a SIMD version template arg through to your outer function, then you can instantiate your higher-level work function for any number of SIMD feature combinations you feel is appropriate.

Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1; it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.
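The pattern described above can be sketched roughly as follows; SIMDVer, the feature levels, and the function names are illustrative, not the actual library's API:

```d
// Feature levels, ordered so they can be compared with static if:
enum SIMDVer { SSE2, SSE41, AVX }

// A leaf SIMD API: the level is a template parameter with a baseline
// default, so ordinary callers never have to mention it.
void transform(SIMDVer ver = SIMDVer.SSE2)(float[] dst, const(float)[] src)
{
    static if (ver >= SIMDVer.AVX)
    {
        // An 8-wide AVX path would go here.
        foreach (i, x; src) dst[i] = x * 2;
    }
    else
    {
        // Baseline path.
        foreach (i, x; src) dst[i] = x * 2;
    }
}

// An outer work function forwards the version to dependent calls, so an
// application can instantiate one copy per feature level and pick one
// at runtime, at the outer level of the work loop:
void workLoop(SIMDVer ver = SIMDVer.SSE2)(float[] dst, const(float)[] src)
{
    transform!ver(dst, src);
}
```

Because `ver` is part of each instantiation's mangled name, the two copies of `workLoop` are distinct symbols, which is what keeps the linker from conflating them.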
 The main trouble comes about when different modules are
 compiled with different settings. What happens with template code
 generation, when the templates are pulled from different modules? What
 happens when COMDAT functions are generated? (The linker picks one
 arbitrarily and discards the others.) Which settings wind up in the
 executable will be not easily predictable.
In my library design, the baseline simd version (expected from the compiler) is mangled into the symbols, just as in the case a user overrides it when instantiating a code path that may be selected on runtime branch. I had imagined this would solve such link related symbol selection problems. Can you think of cases where this is insufficient?
 I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
         ... code ...
    }

 Doing it on the command line is certainly the traditional way, but it
 strikes me as being bug-prone and as unhygienic and obsolete as the C
 preprocessor is (for similar reasons).
I've done it with a template arg because it can be manually propagated, and users can extrapolate the pattern into their outer work functions, which can then easily have multiple versions instantiated for runtime selection. I think it's also important to mangle it into the symbol name for the reasons I mention above.
Apr 06 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does not
 impede writing code that takes advantage of various SIMD code generation at
 compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:

    gdc -simd=AFX foo.d

becomes:

    gdc -simd=AFX -version=AFX foo.d

It's even simpler if you use a makefile variable:

    FPU=AFX

    gdc -simd=$(FPU) -version=$(FPU)

You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because you believe we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)
 2. I'm not sure these global settings are the best approach, especially if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.
It is not necessary to do it that way. Call core.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.
 Runtime selection is not practical in a broad sense. Emitting small
 fragments of SIMD here and there will probably take a loss if they are
 all surrounded by a runtime selector. SIMD is all about pipelining,
 and runtime branches on SIMD version are antithesis to good SIMD
 usage; they can't be applied for small-scale deployment.
 In my experience, runtime selection is desirable for large scale
 instantiations at an outer level of the work loop. I've tried to
 design this intent in my library, by making each simd API capable of
 receiving SIMD version information via template arg, and within the
 library, the version is always passed through to dependent calls.
 The Idea is, if you follow this pattern; propagating a SIMD version
 template arg through to your outer function, then you can instantiate
 your higher-level work function for any number of SIMD feature
 combinations you feel is appropriate.
Doing it at a high level is what I meant, not for each SIMD code fragment.
 Naturally, this process requires a default, otherwise this usage
 baggage will cloud the API everywhere (rather than in the few cases
 where a developer specifically wants to make use of it), and many
 developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
 in my applications, xbox developers would choose AVX1, it's very
 application/target-audience specific, but SSE2 is the only reasonable
 selection if we are not to accept a hint from the command line.
I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set. Then,

    void app(int simd)() { ... my fabulous app ... }

    int main() {
      auto fpu = core.cpuid.getfpu();
      switch (fpu) {
        case SIMD: app!(SIMD)(); break;
        case SIMD4: app!(SIMD4)(); break;
        default: error("unsupported FPU"); exit(1);
      }
    }
 I've done it with a template arg because it can be manually
 propagated, and users can extrapolate the pattern into their outer
 work functions, which can then easily have multiple versions
 instantiated for runtime selection.
 I think it's also important to mangle it into the symbol name for the
 reasons I mention above.
Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping. And yes, if mangled in as part of the symbol, the linker won't pick the wrong one.
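Such a mapping could look like this; `D_AVX` is an assumed version identifier here, and `baseline` is an ordinary compile-time value usable as a template argument:

```d
// Version identifiers can't be passed as template arguments directly,
// so map them onto an enum once, at module scope:
enum SIMD { SSE2, AVX }

version (D_AVX)
    enum SIMD baseline = SIMD.AVX;
else
    enum SIMD baseline = SIMD.SSE2;

// `baseline` becomes part of the mangled name of every instantiation,
// so modules built with different settings produce distinct symbols
// and the linker cannot pick the wrong COMDAT:
void kernel(SIMD level = baseline)()
{
    static if (level >= SIMD.AVX)
    {
        // AVX path
    }
    else
    {
        // baseline path
    }
}
```

The `static if` dispatch happens entirely at compile time, so the non-selected branch contributes no code to the instantiation.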
Apr 06 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 I can understand that it might be demotivating for you, but 
 that is not a blocker. A blocker has no reasonable workaround. 
 This has a trivial workaround:

    gdc -simd=AFX foo.d

 becomes:

    gdc -simd=AFX -version=AFX foo.d

 It's even simpler if you use a makefile variable:

     FPU=AFX

     gdc -simd=$(FPU) -version=$(FPU)
ldc -mcpu=native becomes: ????
 I still don't see how it is a problem to do the switch at a 
 high level. Heck, you could put the ENTIRE ENGINE inside a 
 template, have a template parameter be the instruction set, and 
 instantiate the template for each supported instruction set.

 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
1. Executable size will grow with every instruction set release.
2. BLAS already has a big executable size.

And the main point:

3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD does not have permutations. No, I don't want to write permutations for each architecture. Why? I can write simple D code that generates single LLVM IR code which works for ALL targets!

Best regards,
Ilya
Apr 07 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 12:59 AM, 9il wrote:
 1. Executable size will grow with every instruction set release
Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)
 3. This would not solve the problem for generic BLAS implementation for Phobos
 at all! How you would force compiler to USE and NOT USE specific vector
 permutations for example in the same object file? Yes, I know, DMD has not
 permutations. No, I don't want to write permutation for each architecture. Why?
 I can write simple D code that generates single LLVM IR code which would work
 for ALL targets!
There's no reason for the compiler to make target CPU information available when writing generic code.
Apr 07 2016
next sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 09:41:06 UTC, Walter Bright wrote:
 On 4/7/2016 12:59 AM, 9il wrote:
 1. Executable size will grow with every instruction set release
Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space. It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)
What about a 1GB 2D game for a phone, or maybe a watch?
 3. This would not solve the problem for generic BLAS 
 implementation for Phobos
 at all! How you would force compiler to USE and NOT USE 
 specific vector
 permutations for example in the same object file? Yes, I know, 
 DMD has not
 permutations. No, I don't want to write permutation for each 
 architecture. Why?
 I can write simple D code that generates single LLVM IR code 
 which would work
 for ALL targets!
There's no reason for the compiler to make target CPU information available when writing generic code.
This is not true for a BLAS written in D. You don't want to see the opportunities. The end result of this dogmatic decision is that code will be slower with DMD, while LDC and GDC will implement the required simple features anyway. I just wanted to write fast code for DMD too.
Apr 07 2016
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
 This is not true for BLAS based on D.
Perhaps if you provide him a simplified example he might see what you're talking about?
Apr 07 2016
parent 9il <ilyayaroshenko gmail.com> writes:
On Thursday, 7 April 2016 at 12:35:51 UTC, jmh530 wrote:
 On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
 This is not true for BLAS based on D.
Perhaps if you provide him a simplified example he might see what you're talking about?
He knows what I am talking about. This is about architecture/style/concepts. If Walter disagrees with this, then nobody can change it.
Apr 07 2016
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 7 Apr 2016 02:41:06 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 3. This would not solve the problem for generic BLAS implementation
 for Phobos at all! How you would force compiler to USE and NOT USE
 specific vector permutations for example in the same object file?
 Yes, I know, DMD has not permutations. No, I don't want to write
 permutation for each architecture. Why? I can write simple D code
 that generates single LLVM IR code which would work for ALL
 targets!  
There's no reason for the compiler to make target CPU information available when writing generic code.
Actually for GDC/GCC you can't even write functions using certain SIMD stuff as 'generic' code. Unless you use -mavx or -march, the builtins are not exposed to user code. IIRC the compiler even complains about inline ASM if you use unsupported instructions. You also can't always compile with the 'biggest' feature set, as GCC might use these features in codegen. TL;DR: For GCC/GDC you will have to use target flags / attribute(target) to mix feature sets.
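For reference, the attribute-based approach in GDC looks roughly like this; the module name and attribute spelling are taken from GDC's documentation and may differ between releases, and this is GDC-specific, not available in DMD:

```d
// Per-function target selection in GDC: codegen inside the attributed
// function may use AVX2, while the rest of the module stays at the
// baseline feature set given on the command line.
import gcc.attribute;

@attribute("target", "avx2")
float sumAvx2(const(float)[] a)
{
    float s = 0;
    foreach (x; a) s += x;  // GCC may vectorize this with AVX2
    return s;
}

float sumBaseline(const(float)[] a)
{
    float s = 0;
    foreach (x; a) s += x;  // compiled with the default feature set
    return s;
}
```

The caller is still responsible for only invoking the AVX2 variant on hardware that supports it (e.g. after a core.cpuid check).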
Apr 07 2016
prev sibling next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Wed, 6 Apr 2016 20:27:31 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does
 not impede writing code that takes advantage of various SIMD code
 generation at compile time.  
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d
The problem is that -march=x can set more than one feature flag. So instead of

    gdc -march=armv7-a

you have to do

    gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32 -fversion=ARM_FEATURE_UNALIGNED ...

So you have to know exactly which features are supported for a CPU. Essentially you have to duplicate the CPU<=>feature database already present in GCC (and likely LLVM too) in your Makefile. And you'll need -march=armv7-a anyway to make sure the GCC codegen can use these features as well.

So this issue is not a blocker, but what you propose is a workaround at best, not a solution.
Apr 07 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 3:15 AM, Johannes Pfau wrote:
 The problem is that march=x can set more than one
 feature flag. So instead of

 gdc -march=armv7-a
 you have to do
 gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32
 -fversion=ARM_FEATURE_UNALIGNED ...

 Sou have to know exactly which features are supported for a CPU.
 Essentially you have to duplicate the CPU<=>feature database already
 present in GCC (and likely LLVM too) in your Makefile. And you'll need
 -march=armv7-a anyway to make sure the GCC codegen can use these
 features as well.

 So this issue is not a blocker, but what you propose is a workaround at
 best, not a solution.
Having a veritable blizzard of these predefined versions, constantly being obsoleted while new ones appear, seems like a serious problem when trying to standardize the language.
Apr 07 2016
prev sibling next sibling parent reply Kai Nacke <kai redstar.de> writes:
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, Kai
Apr 07 2016
next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Thu, 07 Apr 2016 10:52:42 +0000
schrieb Kai Nacke <kai redstar.de>:

 On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
  
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos. Regards, Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
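In case the paste link goes stale, the platform-independent variant amounts to something like the following; the function names and the AVX check are illustrative, though `core.cpuid.avx` is a real druntime property:

```d
// Lazy dispatch through a function pointer: the first call selects the
// implementation, and every call pays one null check -- the extra branch
// compared to glibc's ifunc mechanism.
import core.cpuid : avx;

void fooGeneric() { /* portable fallback */ }
void fooAVX()     { /* AVX-tuned variant */ }

__gshared void function() impl;

void foo()
{
    if (impl is null)
        impl = avx ? &fooAVX : &fooGeneric;
    impl();
}
```

Note that `foo` here is a wrapper, so `&foo` gives the wrapper's address rather than the selected implementation's, a detail discussed further downthread.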
Apr 07 2016
parent reply Johan Engelen <j j.nl> writes:
On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:

 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU Indirect 
 Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in druntime/Phobos.
 
 Regards,
 Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program)? That would be the same as what you are doing, with no performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine) I looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-) (does `&foo` return `impl`?)
Apr 07 2016
parent reply Johannes Pfau <nospam example.com> writes:
Am Thu, 07 Apr 2016 13:27:05 +0000
schrieb Johan Engelen <j j.nl>:

 On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:
  
 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU Indirect 
 Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in druntime/Phobos.
 
 Regards,
 Kai  
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)
The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor, there's indeed no performance difference. The main problem here is that, because of cyclic constructor detection, it will be more difficult to implement a generic template solution. http://www.airs.com/blog/archives/403 "An alternative to all this linker stuff would be a variable holding a function pointer. The function could then be written in assembler to do the indirect jump. The variable would be initialized at program startup time. The efficiency would be the same. The address of the function would be the address of the indirect jump, so function pointers would compare consistently."
 I looked into this some time ago and did not see a reason to use 
 the ifunc mechanism (which would not be available on Windows). I 
 thought it should be implementable in a library, exactly as you 
 did in your dpaste! :-)
 (does `&foo` return `impl`?)
No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &. Here's the alternative using a constructor, which makes the address accessible. The syntax will still be different, though:

    __gshared void function() foo;
    shared static this() { foo = &foo1; }

    auto addr = &foo;       // address of the variable
    addr = cast(void*)foo;  // the function address
Apr 07 2016
parent Johan Engelen <j j.nl> writes:
On Thursday, 7 April 2016 at 14:46:06 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 13:27:05 +0000
 schrieb Johan Engelen <j j.nl>:

 On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
 Am Thu, 07 Apr 2016 10:52:42 +0000
 schrieb Kai Nacke <kai redstar.de>:
 
 glibc has a special mechanism for resolving the called 
 function during loading. See the section on the GNU 
 Indirect Function Mechanism here: 
 https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
 
 Would be awesome to have something similar in 
 druntime/Phobos.
 
 Regards,
 Kai
Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?: http://dpaste.dzfl.pl/0aa81325a26a
I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference. https://gcc.gnu.org/wiki/FunctionMultiVersioning "To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)
The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference.
Yep exactly. For target multiversioned functions, I thought one would want to create one static ctor that calls cpuid once and sets all function ptrs of that module.
 (does `&foo` return `impl`?)
No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &.
OK. Well, target multiversioning would need compiler support anyway, and it is easy to do something slightly different for `&foo` when foo is a multiversioned function. This should be fairly easy to implement in LDC, with some smarts needed in ordering and selecting the best function version.
Apr 07 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 3:52 AM, Kai Nacke wrote:
 On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries Would be awesome to have something similar in druntime/Phobos.
We already have core.cpuid, which covers most of what that article talks about. The indirect function thing appears to be a way to selectively load from various dlls. But that can be done anyway with core.cpuid and dynamic dll loading, so I'm not sure what advantage it brings.
Apr 07 2016
prev sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 7 April 2016 at 13:27, Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
 1. This has been characterized as a blocker, it is not, as it does not
 impede writing code that takes advantage of various SIMD code generation
 at
 compile time.
It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround: gdc -simd=AFX foo.d becomes: gdc -simd=AFX -version=AFX foo.d It's even simpler if you use a makefile variable: FPU=AFX gdc -simd=$(FPU) -version=$(FPU)
Sure. I've done this in my own tests. I just never published that anyone else should do it.
 You also mentioned being blocked (i.e. demotivated) for *years* by this, and
 I assume that may be because we don't care about SIMD support. That would be
 wrong, as I care a lot about it. But I had no idea you were having a problem
 with this, as you did not file any bug reports. Suffering in silence is
 never going to work :-)
There have been threads, but sure, I could have done more to push it along. Motivation is a complex and not particularly logical emotion; there are a lot of factors feeding into it. Not least of which is that I haven't been working in games for a while, which means I haven't depended on it for my work. Don't take that to mean I have lost interest in the support, just that the pressure is reduced. You'll have noticed that C++ interaction is my recent focus, since that's directly related to my current day-job, and the path that I need to solve now to get D into my work. That's consuming almost 100% of my D-time-allocation... if I could ever manage to just kick that goal, it might free me up >_< .. I keep on trying.
 2. I'm not sure these global settings are the best approach, especially
 if
 one is writing applications that dynamically adjusts based on the CPU the
 user is running on.
They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.
It is not necessary to do it that way. Call core.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.
The author still needs to be able to control at compile-time what min-spec shall be supported. I agree the check is valuable, but I think it's an unrelated detail.
 Runtime selection is not practical in a broad sense. Emitting small
 fragments of SIMD here and there will probably take a loss if they are
 all surrounded by a runtime selector. SIMD is all about pipelining,
 and runtime branches on SIMD version are the antithesis of good SIMD
 usage; they can't be applied for small-scale deployment.
 In my experience, runtime selection is desirable for large scale
 instantiations at an outer level of the work loop. I've tried to
 design this intent in my library, by making each simd API capable of
 receiving SIMD version information via template arg, and within the
 library, the version is always passed through to dependent calls.
 The Idea is, if you follow this pattern; propagating a SIMD version
 template arg through to your outer function, then you can instantiate
 your higher-level work function for any number of SIMD feature
 combinations you feel is appropriate.
Doing it at a high level is what I meant, not for each SIMD code fragment.
Sure, so you agree we need a mechanism for the author to tune the default selection then? Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)
 Naturally, this process requires a default, otherwise this usage
 baggage will cloud the API everywhere (rather than in the few cases
 where a developer specifically wants to make use of it), and many
 developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
 in my applications, xbox developers would choose AVX1, it's very
 application/target-audience specific, but SSE2 is the only reasonable
 selection if we are not to accept a hint from the command line.
I still don't see how it is a problem to do the switch at a high level.
It's not a problem, that's exactly my design, but it's not a universal solution.
 Heck, you could put the ENTIRE ENGINE inside a template, have a template
 parameter be the instruction set, and instantiate the template for each
 supported instruction set.

 Then,

     void app(int simd)() { ... my fabulous app ... }

     int main() {
       auto fpu = core.cpuid.getfpu();
       switch (fpu) {
         case SIMD: app!(SIMD)(); break;
         case SIMD4: app!(SIMD4)(); break;
         default: error("unsupported FPU"); exit(1);
       }
     }
Sure, I've designed for this specifically, but it's not practical to wind this all the way to the top of the stack. Some hot code will make use of this pattern, but small fragments that appear throughout the code don't want to have this baggage applied. They should just work with the developer's deliberately selected default; it's not worth runtime selection on small deployments. You will likely end up with numerous helper functions, which, when involved in the runtime-selected loops, would have different versions generated appropriately, but when these helper functions appear on their own, they would want to use a sensible default.
 I've done it with a template arg because it can be manually
 propagated, and users can extrapolate the pattern into their outer
 work functions, which can then easily have multiple versions
 instantiated for runtime selection.
 I think it's also important to mangle it into the symbol name for the
 reasons I mention above.
Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping.
I guess you haven't looked at my code, but yes, it's all mapped to enums used by the templates. The versions would assign a constant used as the template's default arg.
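As a concrete illustration of the pattern discussed here (SIMD level as a template argument propagated through the work functions, with runtime selection done once at an outer level), here is a hedged sketch. The SIMDVer enum and function names are invented for the example, and the core.cpuid properties (avx, sse42) should be checked against your druntime version:

```d
// Hedged sketch: the SIMD level is a compile-time template parameter
// flowing through to inner helpers, and a single runtime switch picks
// one instantiation at the top of the work loop.
import core.cpuid : avx, sse42;

enum SIMDVer { SSE2, SSE42, AVX }

// Inner helper: the version flows through as a template argument.
// A real implementation would select intrinsics with
// `static if (V >= SIMDVer.SSE42)` and so on.
void axpy(SIMDVer V)(float a, const(float)[] x, float[] y)
{
    foreach (i, ref yi; y) yi += a * x[i];
}

// Outer work function, instantiated once per supported SIMD level.
void work(SIMDVer V)(float[] y, const(float)[] x)
{
    axpy!V(2.0f, x, y);
}

// Runtime dispatch happens once, at the outer level.
void runWork(float[] y, const(float)[] x)
{
    if (avx)        work!(SIMDVer.AVX)(y, x);
    else if (sse42) work!(SIMDVer.SSE42)(y, x);
    else            work!(SIMDVer.SSE2)(y, x);
}
```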
Apr 07 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2016 5:27 PM, Manu via Digitalmars-d wrote:
 You'll have noticed that C++ interaction is my recent focus, since
 that's directly related to my current day-job, and the path that I
 need to solve now to get D into my work.
We recognize C++ interoperability to be a key feature of D. I hope you like the support you got with the C++ virtual functions! I got bogged down recently with getting the C++ exception handling support working better, hopefully we've turned the corner on that one. I'd hoped to be further along at the moment with C++ interoperability (but it's always going to be a work in progress).
 That's consuming almost 100% of my D-time-allocation... if I could
 ever manage to just kick that goal, it might free me up >_< .. I keep
 on trying.
I do appreciate your efforts in this direction.
 Doing it at a high level is what I meant, not for each SIMD code fragment.
Sure, so you agree we need a mechanism for the author to tune the default selection then?
From the command line, probably not. I like the pragma thing better.
 Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is
implied by D_SIMD)
It is fine as a default, as it is the baseline minimum machine D is expecting.
Apr 07 2016
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 6 Apr 2016 17:42:30 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
 But at very least, the important detail is that the version ID's are
 standardised and shared among all compilers.  
It's a reasonable suggestion; some points:

1. This has been characterized as a blocker. It is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.

2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjust based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will not be easily predictable.
That's my #1 argument why '-version' is dangerous and 'static if' is better ;-) If you've got a version() block in a template and compile two modules using the same template with different -version flags you'll have exactly that problem. Have an enum myFlag = x; in a config module + static if => problem solved. The problem isn't having global settings, the problem is having to manually specify the same global setting for every source file.
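The config-module approach suggested above might be sketched like this; the module and enum names are illustrative, not an existing druntime/Phobos API. The point is that the flag lives in one imported module, so every template instantiation sees the same value regardless of per-file compiler flags:

```d
// simd_config.d: one central module holds the build-wide setting.
module simd_config;

enum SIMDLevel { SSE2, SSE41, SSE42, AVX, AVX2 }
enum simdLevel = SIMDLevel.SSE42; // edit (or generate) this one line

// In any other module the setting is consumed with static if,
// avoiding the -version mismatch problem entirely:
//
//     import simd_config;
//
//     static if (simdLevel >= SIMDLevel.SSE42)
//     {
//         // SSE4.2 code path
//     }
//     else
//     {
//         // baseline SSE2 path
//     }
```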
Apr 07 2016
prev sibling parent reply xenon325 <anm programmer.net> writes:
On Thursday, 7 April 2016 at 00:42:30 UTC, Walter Bright wrote:
 [...] especially if one is writing applications that 
 dynamically adjusts based on the CPU the user is running on. 
 The main trouble comes about when different modules are 
 compiled with different settings. What happens with template 
 code generation, when the templates are pulled from different 
 modules? What happens when COMDAT functions are generated? (The 
 linker picks one arbitrarily and discards the others.) Which 
 settings wind up in the executable will be not easily 
 predictable.

 I suspect that using a pragma would be a much better approach:

    pragma(SIMD, AFX)
    {
 	... code ...
    }

 Doing it on the command line is certainly the traditional way, 
 but it strikes me as being bug-prone and as unhygienic and 
 obsolete as the C preprocessor is (for similar reasons).
Have you seen how GCC's function multiversioning [1] works? This whole thread is far too low-level for me and I'm not sure if GCC's dispatcher overhead is OK, but the syntax looks really nice and it seems to address all of your concerns.

    __attribute__ ((target ("default")))
    int foo ()
    {
      // The default version of foo.
      return 0;
    }

    __attribute__ ((target ("sse4.2")))
    int foo ()
    {
      // foo version for SSE4.2
      return 1;
    }

    __attribute__ ((target ("arch=atom")))
    int foo ()
    {
      // foo version for the Intel ATOM processor
      return 2;
    }

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning

-Alexander
Apr 12 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 12 Apr 2016 10:55:18 +0000
schrieb xenon325 <anm programmer.net>:

 Have you seen how GCC's function multiversioning [1] ?
 
  This whole thread is far too low-level for me and I'm not sure if
  GCC's dispatcher overhead is OK, but the syntax looks really nice
  and it seems to address all of your concerns.
 
  	__attribute__ ((target ("default")))
  	int foo ()
  	{
  	  // The default version of foo.
  	  return 0;
  	}
 
  	__attribute__ ((target ("sse4.2")))
  	int foo ()
  	{
  	  // foo version for SSE4.2
  	  return 1;
  	}
 
 
  [1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
 
  -Alexander

Awesome! I just tried it and it ties runtime and compile-time selection of code paths together in an unprecedented way! As you said, there is the runtime dispatcher overhead if you just compile normally. But if you specifically compile with "gcc -msse4.2 <…>", GCC calls the correct function directly:

  0000000000400512 <main>:
    400512:	e8 f5 ff ff ff       	callq  40050c <_Z3foov.sse4.2>
    400517:	f3 c3                	repz retq
    400519:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

For demonstration purposes I disabled the inliner here. The best thing about it is that for users of libraries employing this technique, it happens behind the scenes and user code stays clean of instrumentation. No ugly versioning and hand written switch-case blocks! (It currently only works with C++ on x86, but I like the general direction.)

-- 
Marco
Apr 12 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
The system seems to call CPUID at startup and for every
multiversioned function, patch an offset in its dispatcher
function. The dispatcher function is then nothing more than a
jump relative to RIP, e.g.:

  jmp    QWORD PTR [rip+0x200bf2]

This is as efficient as it gets short of using whole-program
optimization.
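Short of compiler support, that resolve-once, ifunc-like dispatch can be approximated in D today with a function pointer set in a module constructor. A hedged sketch; the names are invented, and the core.cpuid property should be verified against your druntime version:

```d
// Hedged D sketch approximating ifunc-style dispatch: the function
// pointer is resolved once at program startup, so every later call
// is a single indirect jump, much like the patched dispatcher above.
import core.cpuid : sse42;

alias Kernel = void function(float[] a, const(float)[] b);

void kernelScalar(float[] a, const(float)[] b)
{
    foreach (i, ref x; a) x += b[i];
}

void kernelSSE42(float[] a, const(float)[] b)
{
    // Placeholder: a real version would use core.simd intrinsics.
    foreach (i, ref x; a) x += b[i];
}

__gshared Kernel kernel;

shared static this()
{
    // Resolve once at load time, like an ifunc resolver.
    kernel = sse42 ? &kernelSSE42 : &kernelScalar;
}
```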

-- 
Marco
Apr 12 2016
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
 Have you seen how GCC's function multiversioning [1] ?
I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers. I don't know how some of those choices would get made at compile time for dynamic arrays. Would need some kind of run-time approach. At least for static arrays, you could do multiple versions of the function and then use template constraints to call whichever function. Some tuning would be necessary.
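The size-dependent selection described above could only ever be a runtime decision for dynamic arrays; a hedged sketch follows. The thresholds are invented for illustration and would need tuning, and the "SIMD" kernel is a placeholder:

```d
// Hedged sketch: pick a kernel at runtime based on input size.
// Thresholds are made up for illustration; real values need tuning.
import std.parallelism : parallel;

void scaleScalar(float[] a, float s) { foreach (ref x; a) x *= s; }

// Placeholder: a real version would use core.simd / vector ops.
void scaleSIMD(float[] a, float s) { foreach (ref x; a) x *= s; }

void scaleParallel(float[] a, float s)
{
    foreach (ref x; parallel(a)) x *= s;
}

void scale(float[] a, float s)
{
    if (a.length < 64)
        scaleScalar(a, s);      // below SIMD setup cost
    else if (a.length < (1 << 16))
        scaleSIMD(a, s);        // single-core SIMD sweet spot
    else
        scaleParallel(a, s);    // thread overhead starts to pay off
}
```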
Apr 15 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 15 Apr 2016 18:54:12 +0000
schrieb jmh530 <john.michael.hall gmail.com>:

 On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
 Have you seen how GCC's function multiversioning [1] ?
  
I've been thinking about the gcc multiversioning since you mentioned it previously. I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices. For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers.
GCC only has one architecture as a target at a time. As long as this is so, there is little point in contemplating how it handles multiple architectures and network traffic. :) CPUs run the bulk of code, from booting over kernel and drivers to applications and there will always be something that can be optimized if it is statically known that a certain instruction set is supported. To pick up your matrices example, imagine OpenGL code that has some 4x4 matrices that are in no direct relation to each other. The GPU is only good at bulk processing, and it doesn't apply here. So you need the general purpose processor and benefit from the knowledge that some SSE level is supported. In general, when you have to make many quick decisions on small amounts of data the GPU or networking are out of question. -- Marco
Apr 16 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Tuesday, 5 April 2016 at 09:39:21 UTC, 9il wrote:
 3. This is possible and not very hard to implement if I am not 
 wrong.
Last time I looked into this (related to implementing target, see [1]), I only found some Clang code dealing with this, but now I found LLVM functions about architectures, CPUs, features, etc. So I also think it will be relatively easy to implement at least rudimentary support for what you'd want. [1] http://forum.dlang.org/post/eodutgruoofruperrgif forum.dlang.org
Apr 05 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 20:29:11 UTC, Walter Bright wrote:
 - Allowed sets of instructions: for example, AVX2, FMA4
Done. D_SIMD
I'm not a SIMD expert, I've only played around with SIMD a little, but this confuses me. version(D_SIMD) will tell you when SIMD is implemented, but not what type of SIMD. For instance, if I am on a machine that can use AVX2 instructions, then code in a version(D_SIMD) block will execute, but it should also execute if the processor only supports SSE4. What if the writer of an SIMD library wants to have code execute differently if SSE4 is detected instead of AVX2?
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 2:11 PM, jmh530 wrote:
 version(D_SIMD) will tell you when SIMD is implemented, but not what type of
 SIMD.
The first SIMD level.
 For instance, if I am on a machine that can use AVX2 instructions, then
 code in a version(D_SIMD) block will execute, but it should also execute if the
 processor only supports SSE4. What if the writer of an SIMD library wants to
 have code execute differently if SSE4 is detected instead of AVX2?
Use a runtime switch (see core.cpuid).
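Concretely, the split is: version(D_SIMD) guarantees only the SSE2 baseline at compile time, and anything finer-grained is branched on at runtime. A hedged sketch; the core.cpuid property names (avx2, sse41) should be verified against your druntime version:

```d
// Hedged sketch: compile-time baseline via D_SIMD, finer levels
// detected at runtime via core.cpuid.
version (D_SIMD)
{
    import core.cpuid : avx2, sse41;

    // 0 = SSE2 baseline, 1 = SSE4.1, 2 = AVX2.
    int pickPath()
    {
        if (avx2)  return 2;
        if (sse41) return 1;
        return 0; // SSE2 is guaranteed whenever D_SIMD is defined
    }
}
```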
Apr 04 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 4 Apr 2016 13:29:11 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/4/2016 7:02 AM, 9il wrote:
 What kind of information?  
Target cpu configuration: - CPU architecture (done)
Done.
 - Count of FP/Integer registers  
??
 - Allowed sets of instructions: for example, AVX2, FMA4  
Done. D_SIMD
I wonder if answers like this are meant to be filled into a template like this: "We have [$2] in place for that. If that doesn't get the job $1, please report whatever is missing to bugzilla. Thanks!" Since otherwise it should be clear that the distinction between AVX2 and FMA4 asks for something more specialized than D_SIMD, which is basically the same as checking the front-end __VERSION__. -- Marco
Apr 11 2016
prev sibling next sibling parent reply Manu via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 3 April 2016 at 16:14, 9il via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c apparently none
 of the D compilers has a working SIMD implementation (maybe GDC has but
 it's very difficult to work w/ the 2.066 frontend).


 https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d
 https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, stores, and
 integral mul/div. Is this really the current state of SIMD or am I missing
 sth.?

 -Martin
Hello Martin,

Is it possible to introduce compile-time information about the target platform? I am working on a from-scratch BLAS implementation, and there is no hope of creating something usable without CT information about the target.

Best regards,
Ilya
My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.
Apr 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/3/2016 12:39 AM, Manu via Digitalmars-d wrote:
 My SIMD implementation has been blocked on that for years too.
First I've heard of that.
 I need to know the SIMD level flags passed to the compiler at least,
 and DMD needs to introduce the concept.
Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
Apr 03 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
Here is a list of all the open Bugzilla issues tagged with the keyword SIMD: https://issues.dlang.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=SIMD%2C%20&keywords_type=allwords&list_id=207488&query_format=advanced There is no issue I can find about being blocked for years on SIMD flags. I guarantee you that if you never report the problems you're having, you will suffer in silence and they will not get fixed :-)
He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862
Apr 03 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/3/2016 7:12 PM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 22:00:51 UTC, Walter Bright wrote:
 There is no issue I can find about being blocked for years on SIMD flags. I
 guarantee you that if you never report the problems you're having, you will
 suffer in silence and they will not get fixed :-)
He's talked about it on his github PR: https://github.com/D-Programming-Language/phobos/pull/2862
Yes, but I never noticed that until you posted a link. The place to file bug reports and enhancement requests is on Bugzilla. Otherwise nobody will see them. It's why we have Bugzilla.
Apr 03 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
Apr 04 2016
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:
 I made a bug to track this problem: 
 https://issues.dlang.org/show_bug.cgi?id=15873
You might add a link to this thread and the GitHub PR where he made the original comment.
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 10:27 AM, jmh530 wrote:
 On Monday, 4 April 2016 at 17:23:49 UTC, Jack Stouffer wrote:
 I made a bug to track this problem:
 https://issues.dlang.org/show_bug.cgi?id=15873
 You might add a link to this thread and the GitHub PR where he made the original comment.
http://www.digitalmars.com/d/archives/digitalmars/D/Any_usable_SIMD_implementation_282806.html
Apr 04 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 10:23 AM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
I believe the issue is fixed (for DMD) with a documentation improvement.
Apr 04 2016
parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Monday, 4 April 2016 at 19:43:43 UTC, Walter Bright wrote:
 On 4/4/2016 10:23 AM, Jack Stouffer wrote:
 On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 My SIMD implementation has been blocked on that for years too.
 I need to know the SIMD level flags passed to the compiler at 
 least,
 and DMD needs to introduce the concept.
I made a bug to track this problem: https://issues.dlang.org/show_bug.cgi?id=15873
I believe the issue is fixed (for DMD) with a documentation improvement.
I believe the problem is that you can't rely on D_SIMD to tell you whether SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib forum.dlang.org.
Apr 04 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/4/2016 12:55 PM, ZombineDev wrote:
 I believe the issue is fixed (for DMD) with a documentation improvement.
I believe the problem is that you can't rely on D_SIMD that SSE4, FMA, AVX2, AVX-512, etc. are available on the target platform. See also http://forum.dlang.org/post/fnrmgfvqmykttsuuxxib forum.dlang.org.
Right, you can't. But the issue here is having the compiler give a predefined version for what is the MINIMUM that the target machine supports. And the D_SIMD does that. There is no purpose to the compiler predefining a version for an instruction set it does not generate code for. You can also do a runtime test with http://dlang.org/phobos/core_cpuid.html
Apr 04 2016
prev sibling parent Johan Engelen <j j.nl> writes:
On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
 On 3 April 2016 at 16:14, 9il via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
  Is it possible to introduce compile-time information about the 
  target platform? I am working on a from-scratch BLAS 
  implementation, and there is no hope of creating something usable 
  without CT information about the target.

 Best regards,
 Ilya
My SIMD implementation has been blocked on that for years too. I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.
https://github.com/ldc-developers/ldc/pull/1434
Apr 15 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 03 Apr 2016 06:14:23 +0000
schrieb 9il <ilyayaroshenko gmail.com>:

 Hello Martin,
 
  Is it possible to introduce compile-time information about the target 
  platform? I am working on a from-scratch BLAS implementation, and 
  there is no hope of creating something usable without CT information 
  about the target.
 
 Best regards,
 Ilya
+1000! I've hardcoded SSE4 in fast.json, but would much prefer to type version(sse4) and have it compile on older CPUs as well.

-- 
Marco
Apr 04 2016
prev sibling next sibling parent Etienne <etcimon gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/cor
/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
Not sure if it's been mentioned, but I've made a best effort to implement GCC's SIMD intrinsics in here: https://github.com/etcimon/botan/tree/master/source/botan/utils/simd
Apr 12 2016
prev sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
 I'm currently working on a templated arrayop implementation 
 (using RPN
 to encode ASTs).
 So far things worked out great, but now I got stuck b/c 
 apparently none
 of the D compilers has a working SIMD implementation (maybe GDC 
 has but
 it's very difficult to work w/ the 2.066 frontend).

 https://github.com/MartinNowak/druntime/blob/arrayOps/src/cor
/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d

 I don't want to do anything fancy, just unaligned loads, 
 stores, and integral mul/div. Is this really the current state 
 of SIMD or am I missing sth.?

 -Martin
ndslice.algorithm [1], [2] compiled with a recent LDC beta will do all the work for you. The vectorized flag should be turned on, and the last (row) dimension should have stride == 1.

Generic matrix-matrix multiplication [3] is available in Mir version 0.16.0-beta2. It should be compiled with a recent LDC beta and the -mcpu=native flag.

[1] http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html
[2] https://github.com/dlang/phobos/pull/4652
[3] http://docs.mir.dlang.io/latest/mir_glas_gemm.html
Aug 22 2016