digitalmars.D - Why don't we switch to C like floating pointed arithmetic instead of

Seb (106/116) Aug 03 2016 Here's my reply & experience to this, maybe it helps to show the

Andrew Godfrey (16/20) Aug 04 2016 In my experience (production-quality FP coding in C++), you are

Walter Bright (9/14) Aug 04 2016 Also, carefully reading the C Standard, D's behavior is allowed by the C...

Fool (3/7) Aug 04 2016 How can we ensure that toFloat(toFloat(x) + toFloat(y)) does not

Walter Bright (2/4) Aug 04 2016 It's the whole point of it.

Fool (5/10) Aug 04 2016 I'm afraid, I don't understand your implementation. Isn't

Walter Bright (11/14) Aug 04 2016 You're right, in that case, it does. But C does, too:

Fool (11/29) Aug 04 2016 Yes. It seems, however, as if Rick Regan is not advocating this

Walter Bright (6/9) Aug 04 2016 There are cases where doing things at higher precision results in double...

Ilya Yaroshenko (9/22) Aug 05 2016 You are wrong that there are far fewer of those cases. This is

deadalnix (3/27) Aug 05 2016 Most C compilers always promote float to double, so I'm not sure

Ilya Yaroshenko (4/15) Aug 05 2016 1. Could you please provide an assembler example with clang or

deadalnix (46/49) Aug 05 2016 I have better: compile your favorite project with

Ilya Yaroshenko (3/12) Aug 05 2016 Your example is just a speculation. 3.0 force compiler to convert

deadalnix (4/20) Aug 05 2016 Ha you are right.

John Colvin (45/65) Aug 05 2016 Gotta be careful with those examples. See this:

Walter Bright (6/7) Aug 05 2016 Java originally came out with an edict that floats will all be done in f...

Walter Bright (10/16) Aug 05 2016 A library has a lot of algorithms in it, a library requiring exact IEEE

Ilya Yaroshenko (14/37) Aug 05 2016 No. For example std.math.log requires it! But you don't care

Seb (50/73) Aug 05 2016 Yep.

Walter Bright (7/12) Aug 05 2016 Speed is not the only criteria. Accuracy is as well. I've been using C m...

Walter Bright (3/6) Aug 05 2016 I'm interested in correct to the last bit results first, and performance...

Walter Bright (3/9) Aug 05 2016 I'd appreciate it if you could provide links to where these requirements...

Ilya Yaroshenko (30/45) Aug 05 2016 1. https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz

Walter Bright (10/33) Aug 05 2016 I agree that the typical summation algorithm suffers from double roundin...

Ilya Yaroshenko (63/71) Aug 06 2016 Phobos's sum is two different algorithms. Pairwise summation for

Walter Bright (7/62) Aug 06 2016 Thanks for your help with this.

Johannes Pfau (21/113) Aug 06 2016 This is not true for UDAs. LDC and GDC actually implement @attribute

Walter Bright (5/6) Aug 06 2016 You're right that there are gray areas, but the distinction is not arbit...

Walter Bright (6/10) Aug 06 2016 The LDC fastmath bothers me a lot. It throws away proper NaN and infinit...

Ilya Yaroshenko (7/22) Aug 06 2016 OK, then we need a third pragma,`pragma(ieeeRound)`. But

Iain Buclaw via Digitalmars-d (4/27) Aug 06 2016 No pragmas tied to a specific architecture should be allowed in the

Ilya Yaroshenko (5/24) Aug 06 2016 Then probably Mir will drop all compilers, but LDC

Iain Buclaw via Digitalmars-d (8/37) Aug 06 2016 If you need a function to work with an exclusive instruction set or

Ilya Yaroshenko (10/53) Aug 06 2016 What do you mean by "do it yourself"? Write code using FMA GCC

Iain Buclaw via Digitalmars-d (10/70) Aug 06 2016 There are compiler switches for that. Maybe there should be one

David Nadlinger (9/12) Aug 06 2016 This might be a solution for inherently compiler-specific

Patrick Schluter (5/12) Aug 06 2016 Hmmm, that's the whole point of pragmas (at least in C) to

Iain Buclaw via Digitalmars-d (12/26) Aug 06 2016 https://dlang.org/spec/pragma.html#predefined-pragmas

Walter Bright (3/5) Aug 06 2016 A good point. On the other hand, a list of them would be nice so impleme...
David Nadlinger (13/15) Aug 06 2016 I wholeheartedly agree. However, it's not like FP optimisation

Iain Buclaw via Digitalmars-d (5/18) Aug 06 2016 Well, you get fusedMath for free when turning on -mfma or -mfused-madd

Walter Bright (2/5) Aug 06 2016 I understand that, I just don't understand why that wouldn't be done any...

Ilya Yaroshenko (4/12) Aug 06 2016 Some applications requires exactly the same results for different

Walter Bright (2/5) Aug 06 2016 Let me rephrase the question - how does fusing them alter the result?

David Nadlinger (6/8) Aug 06 2016 There is just one rounding operation instead of two.

Walter Bright (5/11) Aug 06 2016 Yup.

Ilya Yaroshenko (5/14) Aug 06 2016 Fused operations are mul/div+add/sub only.

Walter Bright (2/8) Aug 07 2016 ok

Ilya Yaroshenko (3/11) Aug 06 2016 The result became more precise, because single rounding instead

David Nadlinger (15/18) Aug 06 2016 This is true – and precisely the reason why it is actually

Walter Bright (3/15) Aug 06 2016 I didn't know that, thanks for the explanation. But the same can be done...

deadalnix (3/18) Aug 07 2016 It's not as cut and dry. Sometime, processing faster mean you can

Fool (3/3) Aug 05 2016 Here is a relevant example:

Walter Bright (2/5) Aug 04 2016 https://github.com/dlang/druntime/pull/1621
deadalnix (3/19) Aug 04 2016 It is actually very common for C compiler to work with double for

Walter Bright (2/4) Aug 04 2016 In fact, it used to be specified that C behave that way!

Ilya Yaroshenko (4/4) Aug 04 2016 IEEE behaviour by default is required by numeric software.
Iain Buclaw via Digitalmars-d (8/16) Aug 04 2016 It would be nice if explicit casts were honoured by CTFE here.

Walter Bright (4/7) Aug 04 2016 It's important to remember that what gcc does and what the C standard al...
Seb (5/19) Aug 04 2016 I can reproduce this on DPaste (also x86_64).

Iain Buclaw via Digitalmars-d (15/36) Aug 06 2016 When testing the math functions, I chose not to compare results to

Seb <seb wilzba.ch> writes:

There was a recent discussion on Phobos about D's floating point 
behavior [1]. I think Ilya summarized quite elegantly our problem:

 We need to argue with  WalterBright to switch to C like 
 floating pointed arithmetic instead of automatic expansion to 
 reals (this is so horrible that it may kill D for science for 
 ever,  wilzbach can tell a story about Tinflex, I can tell 
 about precise and KBN summations, which does not work correctly 
 with 32-bit DMD). D had a lot of long discussions about math 
 and GC. We are moving to have great D without GC now. Well, I 
 hope we will have D with normal FP arithmetic. But how much 
 years we need to change  WalterBright's opinion on this problem 
 to the common form?

Here's my reply & experience to this, maybe it helps to show the 
problem.
I started to document all the bumps with FP math I ran into on 
our mir wiki [2]. While some of these are expected, there are 
some that are horrible, cruel & yield a constant fight against 
the compiler
FP behavior that is different depending on the (1) target, (2) 
compiler or (3) optimization level is very hard to work with. 
Wasn't the entire point of D to get it right and avoid weird, 
unproductive corner-cases across architectures?
The problem with Tinflex and a lot of other scientific math is 
that you need reproducible, predictive behavior. For example the 
Tinflex algorithm is quite sensitive as it (1) uses pow and exp, 
so errors sum up quickly and (2) it has an ordered heap of 
remaining FP values, which means due to FP magic which will be 
explained below I get totally different resulting values 
depending on the architecture. Note that the ordering itself is 
well defined for equality, e.g. the tuples (mymath(1.5), 100), 
(mymath(1.5), 200) need to result in same ordering. I don't want 
my program to fail just because I compiled for 32-bit, but maybe 
code example show more than words.

Consider the following program, it fails on 32-bit :/

```
alias S = double; // same for float

S fun(S)(in S x)
{
     return -1 / x;
}

void main()
{
     S i = fun!S(3);
     assert(i == S(-1) / 3); // this lines passes
     assert(fun!S(3) == S(-1) / 3); // error on 32-bit

     // just to be clear, the following variants don't work either 
on 32-bit
     assert(fun!S(3) == S(-1.0 / 3);
     assert(fun!S(3) == cast(S) S(-1) / 3);
     assert(fun!S(3) == S(S(-1) / 3));
     assert(S(fun!S(3)) == S(-1) / 3);
     assert(cast(S) fun!S(3) == S(-1) / 3);
}
```

Maybe it's easier to see why this behavior is tricky when we look 
at it in action. For example with this program DMD for x86_64 
will yield the same result whereas x86_32 will yield different 
ones.

```
import std.stdio;

alias S = float;
float shL = 0x1.1b95eep-4; // -0.069235
float shR = 0x1.9c7cfep-7; // -0.012588

F fun(F)(F x)
{
     return 1.0 + x * 2;
}

S f1()
{
     S r = fun(shR);
     S l = fun(shL);
     return (r - l);
}

S f2()
{
     return (fun(shR) - fun(shL));
}

void main
{
     writefln("f1: %a", f1()); //  -0x1.d00cap-4
     writefln("f2: %a", f2());  // on 32-bit: -0x1.d00c9cp-4
     assert(f1 == f2); // fails on 32-bit
}
```

To make matters worse std.math yields different results than 
compiler/assembly intrinsics - note that in this example import 
std.math.pow adds about 1K instructions to the output assembler, 
whereas llvm_powf boils down to the assembly powf. Of course the 
performance of powf is a lot better, I measured [3] that e.g. 
std.math.pow takes ~1.5x as long for both LDC and DMD. Of course 
if you need to run this very often, this cost isn't acceptable.

```
void main()
{
     alias S = float;
     S s1 = 0x1.24c92ep+5;
     S s2 = -0x1.1c71c8p+0;

     import std.math : std_pow = pow;
     import core.stdc.stdio : printf;
     import core.stdc.math: powf;

     printf("std: %a\n", std_pow(s1, s2)); // 0x1.2c155ap-6
     printf("pow: %a\n", s1 ^^ s2); // 0x1.2c155ap-6
     printf("powf: %a\n", powf(s1, s2)); // 0x1.2c1558p-6

     version(LDC)
     {
         import ldc.intrinsics : llvm_pow;
         printf("ldc: %a\n", llvm_pow(s1, s2)); // 0x1.2c1558p-6
     }
}
```

I excluded the discrepancies in FP arithmetics between Windows 
and Linux/macOS as it's hopefully just a bug [4].

[1] https://github.com/dlang/phobos/pull/3217
[2] https://github.com/libmir/mir/wiki/Floating-point-issues
[3] https://github.com/wilzbach/d-benchmarks
[4] https://issues.dlang.org/show_bug.cgi?id=16344

Aug 03 2016

Andrew Godfrey <X y.com> writes:

On Wednesday, 3 August 2016 at 23:00:11 UTC, Seb wrote:
 There was a recent discussion on Phobos about D's floating 
 point behavior [1]. I think Ilya summarized quite elegantly our 
 problem:

 [...]

In my experience (production-quality FP coding in C++), you are 
in error merely by combining floating point with exact comparison 
(==). Even if you have just one compiler and architecture to 
target, you can expect instability if you do this. Writing robust 
FP algorithms is an art and it's made harder if you use 
mathematical thinking, because FP arithmetic lacks many 
properties that integer or fixed-point arithmetic have.

I'm not saying D gets it right (I haven't explored that at all) 
but I am saying you need better examples.

Now, my major experience is in the context of Intel non-SIMD FP, 
where internal precision is 80-bit. I can see the appeal of 
asking for the ability to reduce internal precision to match the 
data type you're using, and I think I've read something written 
by Walter on that topic. But this would hardly be "C-like" FP 
support so I'm not sure that's he topic at hand.

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 7:08 AM, Andrew Godfrey wrote:
 Now, my major experience is in the context of Intel non-SIMD FP, where internal
 precision is 80-bit. I can see the appeal of asking for the ability to reduce
 internal precision to match the data type you're using, and I think I've read
 something written by Walter on that topic. But this would hardly be "C-like" FP
 support so I'm not sure that's he topic at hand.

Also, carefully reading the C Standard, D's behavior is allowed by the C 
Standard. The idea that C requires rounding of all intermediate values to the 
target precision is incorrect, and is not "C-like". C floating point semantics 
can and do vary from platform to platform, and vary based on optimization 
settings, and this is all allowed by the C Standard.

It has been proposed many times that the solution for D is to have a function 
called toFloat() or something like that in core.math, which guarantees a round 
to float precision for its argument. But so far nobody has written such a
function.

Aug 04 2016

Fool <fool dlang.org> writes:

On Thursday, 4 August 2016 at 18:53:23 UTC, Walter Bright wrote:
 It has been proposed many times that the solution for D is to 
 have a function called toFloat() or something like that in 
 core.math, which guarantees a round to float precision for its 
 argument. But so far nobody has written such a function.

How can we ensure that toFloat(toFloat(x) + toFloat(y)) does not 
involve double-rounding?

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 12:03 PM, Fool wrote:
 How can we ensure that toFloat(toFloat(x) + toFloat(y)) does not involve
 double-rounding?

It's the whole point of it.

Aug 04 2016

Fool <fool dlang.org> writes:

On Thursday, 4 August 2016 at 20:00:14 UTC, Walter Bright wrote:
 On 8/4/2016 12:03 PM, Fool wrote:
 How can we ensure that toFloat(toFloat(x) + toFloat(y)) does 
 not involve
 double-rounding?

 It's the whole point of it.

I'm afraid, I don't understand your implementation. Isn't 
toFloat(x) + toFloat(y) computed in real precision (first 
rounding)? Why doesn't toFloat(toFloat(x) + toFloat(y)) involve 
another rounding?

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 1:29 PM, Fool wrote:
 I'm afraid, I don't understand your implementation. Isn't toFloat(x) +
 toFloat(y) computed in real precision (first rounding)? Why doesn't
 toFloat(toFloat(x) + toFloat(y)) involve another rounding?

You're right, in that case, it does. But C does, too:

http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/

This is important to remember when advocating for "C-like" floating point - 
because C simply does not behave as most programmers seem to assume it does.

What toFloat() does is guarantee that its argument is rounded to float.

The best way to approach this when designing fp algorithms is to not require 
them to have reduced precision.

It's also important to realize that on some machines, the hardware does not 
actually support float precision operations, or may do so at a large runtime 
penalty (x87).

Aug 04 2016

Fool <fool dlang.org> writes:

On Thursday, 4 August 2016 at 20:58:57 UTC, Walter Bright wrote:
 On 8/4/2016 1:29 PM, Fool wrote:
 I'm afraid, I don't understand your implementation. Isn't 
 toFloat(x) +
 toFloat(y) computed in real precision (first rounding)? Why 
 doesn't
 toFloat(toFloat(x) + toFloat(y)) involve another rounding?

 You're right, in that case, it does. But C does, too:

 http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/

Yes. It seems, however, as if Rick Regan is not advocating this 
behavior.


 This is important to remember when advocating for "C-like" 
 floating point - because C simply does not behave as most 
 programmers seem to assume it does.

That's right. "C-like" might be what they say but what they want 
is double precision computations to be carried out in double 
precision.


 What toFloat() does is guarantee that its argument is rounded 
 to float.

 The best way to approach this when designing fp algorithms is 
 to not require them to have reduced precision.

I understand your point of view. However, there are (probably 
rare) situations where one requires more control. I think that 
simulating double-double precision arithmetic using Veltkamp 
split was mentioned as a resonable example, earlier.


 It's also important to realize that on some machines, the 
 hardware does not actually support float precision operations, 
 or may do so at a large runtime penalty (x87).

That's another story.

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 11:05 PM, Fool wrote:
 I understand your point of view. However, there are (probably rare) situations
 where one requires more control. I think that simulating double-double
precision
 arithmetic using Veltkamp split was mentioned as a resonable example, earlier.

There are cases where doing things at higher precision results in double 
rounding and a less accurate result. But I am pretty sure there are far fewer
of 
those cases compared to routine computations that get a more accurate result 
with more precision.

If that wasn't true, we wouldn't ever need double precision.

Aug 04 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 06:59:21 UTC, Walter Bright wrote:
 On 8/4/2016 11:05 PM, Fool wrote:
 I understand your point of view. However, there are (probably 
 rare) situations
 where one requires more control. I think that simulating 
 double-double precision
 arithmetic using Veltkamp split was mentioned as a resonable 
 example, earlier.

 There are cases where doing things at higher precision results 
 in double rounding and a less accurate result. But I am pretty 
 sure there are far fewer of those cases compared to routine 
 computations that get a more accurate result with more 
 precision.

 If that wasn't true, we wouldn't ever need double precision.

You are wrong that there are far fewer of those cases. This is 
naive point of view. A lot of netlib math functions require exact 
IEEE arithmetic. Tinflex requires it. Python C backend and Mir 
library require exact IEEE arithmetic. Atmosphere package 
requires it, Atmosphere is used as reference code for my 
publication in JMS, Springer. And the most important case: no one 
top scientific laboratory will use a language without exact IEEE 
arithmetic by default.

Aug 05 2016

deadalnix <deadalnix gmail.com> writes:

On Friday, 5 August 2016 at 07:43:19 UTC, Ilya Yaroshenko wrote:
 On Friday, 5 August 2016 at 06:59:21 UTC, Walter Bright wrote:
 On 8/4/2016 11:05 PM, Fool wrote:
 I understand your point of view. However, there are (probably 
 rare) situations
 where one requires more control. I think that simulating 
 double-double precision
 arithmetic using Veltkamp split was mentioned as a resonable 
 example, earlier.

 There are cases where doing things at higher precision results 
 in double rounding and a less accurate result. But I am pretty 
 sure there are far fewer of those cases compared to routine 
 computations that get a more accurate result with more 
 precision.

 If that wasn't true, we wouldn't ever need double precision.

 You are wrong that there are far fewer of those cases. This is 
 naive point of view. A lot of netlib math functions require 
 exact IEEE arithmetic. Tinflex requires it. Python C backend 
 and Mir library require exact IEEE arithmetic. Atmosphere 
 package requires it, Atmosphere is used as reference code for 
 my publication in JMS, Springer. And the most important case: 
 no one top scientific laboratory will use a language without 
 exact IEEE arithmetic by default.

Most C compilers always promote float to double, so I'm not sure 
what point you are trying to make here.

Aug 05 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 07:59:15 UTC, deadalnix wrote:
 On Friday, 5 August 2016 at 07:43:19 UTC, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This is 
 naive point of view. A lot of netlib math functions require 
 exact IEEE arithmetic. Tinflex requires it. Python C backend 
 and Mir library require exact IEEE arithmetic. Atmosphere 
 package requires it, Atmosphere is used as reference code for 
 my publication in JMS, Springer. And the most important case: 
 no one top scientific laboratory will use a language without 
 exact IEEE arithmetic by default.

 Most C compilers always promote float to double, so I'm not 
 sure what point you are trying to make here.

1. Could you please provide an assembler example with clang or 
recent gcc?
2. C compilers not promote double to 80-bit reals anyway.

Aug 05 2016

deadalnix <deadalnix gmail.com> writes:

On Friday, 5 August 2016 at 08:17:00 UTC, Ilya Yaroshenko wrote:
 1. Could you please provide an assembler example with clang or 
 recent gcc?

I have better: compile your favorite project with 
-Wdouble-promotion and enjoy the rain of warnings.

But try it yourself:

float foo(float a, float b) {
   return 3.0 * a / b;
}

GCC 5.3 gives me

foo(float, float):
         cvtss2sd        xmm0, xmm0
         cvtss2sd        xmm1, xmm1
         mulsd   xmm0, QWORD PTR .LC0[rip]
         divsd   xmm0, xmm1
         cvtsd2ss        xmm0, xmm0
         ret
.LC0:
         .long   0
         .long   1074266112

Which clearly uses double precision.

And clang 3.8:

LCPI0_0:


float)
         cvtss2sd        xmm0, xmm0
         mulsd   xmm0, qword ptr [rip + .LCPI0_0]
         cvtss2sd        xmm1, xmm1
         divsd   xmm0, xmm1
         cvtsd2ss        xmm0, xmm0
         ret

which uses double as well.

 2. C compilers not promote double to 80-bit reals anyway.

VC++ does it on 32 bits build, but initialize the x87 unit to 
double precision (on 80 bits floats - yes that's a x87 setting).

VC++ will keep using float for x64 builds.

Intel compiler use compiler flags to promote or not.

In case you were wondering, this is not limited to X86/64 as GCC 
gives me on ARM:

foo(float, float):
         fmov    d2, 3.0e+0
         fcvt    d0, s0
         fmul    d0, d0, d2
         fcvt    d1, s1
         fdiv    d0, d0, d1
         fcvt    s0, d0
         ret

Which also promotes to double.

Aug 05 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 08:43:48 UTC, deadalnix wrote:
 On Friday, 5 August 2016 at 08:17:00 UTC, Ilya Yaroshenko wrote:
 1. Could you please provide an assembler example with clang or 
 recent gcc?

 I have better: compile your favorite project with 
 -Wdouble-promotion and enjoy the rain of warnings.

 But try it yourself:

 float foo(float a, float b) {
   return 3.0 * a / b;
 }

Your example is just a speculation. 3.0 force compiler to convert 
a and  b to double. This is obvious.

Aug 05 2016

deadalnix <deadalnix gmail.com> writes:

On Friday, 5 August 2016 at 09:21:53 UTC, Ilya Yaroshenko wrote:
 On Friday, 5 August 2016 at 08:43:48 UTC, deadalnix wrote:
 On Friday, 5 August 2016 at 08:17:00 UTC, Ilya Yaroshenko 
 wrote:
 1. Could you please provide an assembler example with clang 
 or recent gcc?

 I have better: compile your favorite project with 
 -Wdouble-promotion and enjoy the rain of warnings.

 But try it yourself:

 float foo(float a, float b) {
   return 3.0 * a / b;
 }

 Your example is just a speculation. 3.0 force compiler to 
 convert a and  b to double. This is obvious.

Ha you are right.

Testing more it seems that gcc and clang are not promoting on 64 
bit code, but still are on 32 bits.

Aug 05 2016

John Colvin <john.loughran.colvin gmail.com> writes:

On Friday, 5 August 2016 at 08:43:48 UTC, deadalnix wrote:
 On Friday, 5 August 2016 at 08:17:00 UTC, Ilya Yaroshenko wrote:
 1. Could you please provide an assembler example with clang or 
 recent gcc?

 I have better: compile your favorite project with 
 -Wdouble-promotion and enjoy the rain of warnings.

 But try it yourself:

 float foo(float a, float b) {
   return 3.0 * a / b;
 }

 GCC 5.3 gives me

 foo(float, float):
         cvtss2sd        xmm0, xmm0
         cvtss2sd        xmm1, xmm1
         mulsd   xmm0, QWORD PTR .LC0[rip]
         divsd   xmm0, xmm1
         cvtsd2ss        xmm0, xmm0
         ret
 .LC0:
         .long   0
         .long   1074266112

Gotta be careful with those examples. See this: 
https://godbolt.org/g/0yNUSG

float foo1(float a, float b) {
   return 3.42 * (a / b);
}

float foo2(float a, float b) {
   return 3.0 * (a / b);
}

float foo3(float a, float b) {
   return 3.0 * a / b;
}

float foo4(float a, float b) {
   return 3.0f * a / b;
}

foo1(float, float):
         divss   xmm0, xmm1
         cvtss2sd        xmm0, xmm0
         mulsd   xmm0, QWORD PTR .LC0[rip]
         cvtsd2ss        xmm0, xmm0
         ret
foo2(float, float):
         divss   xmm0, xmm1
         mulss   xmm0, DWORD PTR .LC2[rip]
         ret
foo3(float, float):
         cvtss2sd        xmm0, xmm0
         cvtss2sd        xmm1, xmm1
         mulsd   xmm0, QWORD PTR .LC4[rip]
         divsd   xmm0, xmm1
         cvtsd2ss        xmm0, xmm0
         ret
foo4(float, float):
         mulss   xmm0, DWORD PTR .LC2[rip]
         divss   xmm0, xmm1
         ret
.LC0:
         .long   4123168604
         .long   1074486312
.LC2:
         .long   1077936128
.LC4:
         .long   0
         .long   1074266112


It depends on the literal value, not just the type.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2016 1:17 AM, Ilya Yaroshenko wrote:
 2. C compilers not promote double to 80-bit reals anyway.

Java originally came out with an edict that floats will all be done in float 
precision, and double in double.

Sun had evidently never used an x87 before, because it soon became obvious that 
it was unworkable on the x87, and the spec was changed to allow intermediate 
values out to 80 bits.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2016 12:43 AM, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This is naive point of
 view. A lot of netlib math functions require exact IEEE arithmetic. Tinflex
 requires it. Python C backend and Mir library require exact IEEE arithmetic.
 Atmosphere package requires it, Atmosphere is used as reference code for my
 publication in JMS, Springer. And the most important case: no one top
scientific
 laboratory will use a language without exact IEEE arithmetic by default.

A library has a lot of algorithms in it, a library requiring exact IEEE 
arithmetic doesn't mean every algorithm in it does. None of the Phobos math 
library functions require it, and as far as I can tell they are correct out to 
the last bit.

And besides, all these libraries presumably work, or used to work, on the x87, 
which does not provide exact IEEE arithmetic for intermediate results without a 
special setting, and that setting substantially slows it down.

By netlib do you mean the Cephes functions? I've used them, and am not aware of 
any of them that require reduced precision.

Aug 05 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 09:24:49 UTC, Walter Bright wrote:
 On 8/5/2016 12:43 AM, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This is 
 naive point of
 view. A lot of netlib math functions require exact IEEE 
 arithmetic. Tinflex
 requires it. Python C backend and Mir library require exact 
 IEEE arithmetic.
 Atmosphere package requires it, Atmosphere is used as 
 reference code for my
 publication in JMS, Springer. And the most important case: no 
 one top scientific
 laboratory will use a language without exact IEEE arithmetic 
 by default.

 A library has a lot of algorithms in it, a library requiring 
 exact IEEE arithmetic doesn't mean every algorithm in it does. 
 None of the Phobos math library functions require it, and as 
 far as I can tell they are correct out to the last bit.

No. For example std.math.log requires it! But you don't care 
about other compilers which not use yl2x and about making it 
template (real version slows down code for double and float).

 And besides, all these libraries presumably work, or used to 
 work, on the x87, which does not provide exact IEEE arithmetic 
 for intermediate results without a special setting, and that 
 setting substantially slows it down.

x87 FPU is deprecated. We have more significant performance 
issues with std.math. For example, it is x87 oriented, is slows 
down the code for double and float. Many functions are not 
inlined. This 2 are real performance problems.

 By netlib do you mean the Cephes functions? I've used them, and 
 am not aware of any of them that require reduced precision.

Yes and many of its functions requires IEEE. For example log2.c 
for doubles:
	z = x - 0.5;
	z -= 0.5;
	y = 0.5 * x  + 0.5;
This code requires IEEE. The same code appears in our std.math :P

Aug 05 2016

Seb <seb wilzba.ch> writes:

On Friday, 5 August 2016 at 09:40:59 UTC, Ilya Yaroshenko wrote:
 On Friday, 5 August 2016 at 09:24:49 UTC, Walter Bright wrote:
 On 8/5/2016 12:43 AM, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This 
 is naive point of
 view. A lot of netlib math functions require exact IEEE 
 arithmetic. Tinflex
 requires it. Python C backend and Mir library require exact 
 IEEE arithmetic.
 Atmosphere package requires it, Atmosphere is used as 
 reference code for my
 publication in JMS, Springer. And the most important case: no 
 one top scientific
 laboratory will use a language without exact IEEE arithmetic 
 by default.

 A library has a lot of algorithms in it, a library requiring 
 exact IEEE arithmetic doesn't mean every algorithm in it does. 
 None of the Phobos math library functions require it, and as 
 far as I can tell they are correct out to the last bit.

 No. For example std.math.log requires it! But you don't care 
 about other compilers which not use yl2x and about making it 
 template (real version slows down code for double and float).

Yep.

1) There are some function (exp, pow, log, round, sqrt) for which 
using llvm_intrinsincs significantly increases your performance.

It's a simple benchmark and might be flawed, but I hope it shows 
the point.

Code is here: 
https://gist.github.com/wilzbach/2b64e10dec66a3153c51fbd1e6848f72

 ldmd -inline -release -O3 -boundscheck=off test.d

fun: pow
std.math.pow   = 15 secs, 914 ms, 102 μs, and 8 hnsecs
core.stdc.pow  = 11 secs, 590 ms, 702 μs, and 5 hnsecs
llvm_pow       = 13 secs, 570 ms, 439 μs, and 7 hnsecs
fun: exp
std.math.exp   = 6 secs, 85 ms, 741 μs, and 7 hnsecs
core.stdc.exp  = 16 secs, 267 ms, 997 μs, and 4 hnsecs
llvm_exp       = 2 secs, 22 ms, and 876 μs
fun: exp2
std.math.exp2  = 3 secs, 117 ms, 624 μs, and 2 hnsecs
core.stdc.exp2 = 2 secs, 973 ms, and 243 μs
llvm_exp2      = 2 secs, 451 ms, 628 μs, and 9 hnsecs
fun: sin
std.math.sin   = 1 sec, 805 ms, 626 μs, and 7 hnsecs
core.stdc.sin  = 17 secs, 743 ms, 33 μs, and 5 hnsecs
llvm_sin       = 2 secs, 95 ms, and 178 μs
fun: cos
std.math.cos   = 2 secs, 820 ms, 684 μs, and 5 hnsecs
core.stdc.cos  = 17 secs, 626 ms, 78 μs, and 1 hnsec
llvm_cos       = 2 secs, 814 ms, 60 μs, and 5 hnsecs
fun: log
std.math.log   = 5 secs, 584 ms, 344 μs, and 5 hnsecs
core.stdc.log  = 16 secs, 443 ms, 893 μs, and 3 hnsecs
llvm_log       = 2 secs, 13 ms, 291 μs, and 1 hnsec
fun: log2
std.math.log2  = 5 secs, 583 ms, 777 μs, and 7 hnsecs
core.stdc.log2 = 2 secs, 800 ms, 848 μs, and 5 hnsecs
llvm_log2      = 2 secs, 165 ms, 849 μs, and 6 hnsecs
fun: sqrt
std.math.sqrt  = 799 ms and 917 μs
core.stdc.sqrt = 864 ms, 834 μs, and 7 hnsecs
llvm_sqrt      = 439 ms, 469 μs, and 2 hnsecs
fun: ceil
std.math.ceil  = 540 ms and 167 μs
core.stdc.ceil = 971 ms, 533 μs, and 6 hnsecs
llvm_ceil      = 562 ms, 490 μs, and 2 hnsecs
fun: round
std.math.round = 3 secs, 52 ms, 567 μs, and 3 hnsecs
core.stdc.round = 958 ms and 217 μs
llvm_round     = 590 ms, 742 μs, and 7 hnsecs


2) As mentioned before they can yield _different_ results

https://dpaste.dzfl.pl/c0ab5131b49d

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2016 4:27 AM, Seb wrote:
 1) There are some function (exp, pow, log, round, sqrt) for which using
 llvm_intrinsincs significantly increases your performance.

 It's a simple benchmark and might be flawed, but I hope it shows the point.

Speed is not the only criteria. Accuracy is as well. I've been using C math 
functions forever, and have constantly discovered that this or that math 
function on this or that platform produces bad results. This is why D's math 
functions are re-implemented in D rather than just forwarding to the C ones.


 2) As mentioned before they can yield _different_ results

 https://dpaste.dzfl.pl/c0ab5131b49d

Ah, but which result is the correct one? I am interested in getting the 
functions correct to the last bit first, and performance second.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2016 2:40 AM, Ilya Yaroshenko wrote:
 No. For example std.math.log requires it! But you don't care about other
 compilers which not use yl2x and about making it template (real version slows
 down code for double and float).

I'm interested in correct to the last bit results first, and performance
second. 
std.math changes that hew to that are welcome.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2016 12:43 AM, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This is naive point of
 view. A lot of netlib math functions require exact IEEE arithmetic. Tinflex
 requires it. Python C backend and Mir library require exact IEEE arithmetic.
 Atmosphere package requires it, Atmosphere is used as reference code for my
 publication in JMS, Springer. And the most important case: no one top
scientific
 laboratory will use a language without exact IEEE arithmetic by default.

I'd appreciate it if you could provide links to where these requirements are. I 
can't find anything on Tinflex, for example.

Aug 05 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 09:40:23 UTC, Walter Bright wrote:
 On 8/5/2016 12:43 AM, Ilya Yaroshenko wrote:
 You are wrong that there are far fewer of those cases. This is 
 naive point of
 view. A lot of netlib math functions require exact IEEE 
 arithmetic. Tinflex
 requires it. Python C backend and Mir library require exact 
 IEEE arithmetic.
 Atmosphere package requires it, Atmosphere is used as 
 reference code for my
 publication in JMS, Springer. And the most important case: no 
 one top scientific
 laboratory will use a language without exact IEEE arithmetic 
 by default.

 I'd appreciate it if you could provide links to where these 
 requirements are. I can't find anything on Tinflex, for example.

1. https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
mathmodule.c, math_fsum has comment:
    Depends on IEEE 754 arithmetic guarantees and half-even 
rounding.
The same algorithm also available in Mir. And it does not work 
with 32 bit DMD.

2. sum_kbn in 
https://github.com/JuliaLang/julia/blob/master/base/reduce.jl
requires ieee arithmetic. The same algorithm also available in 
Mir.

3. http://www.netlib.org/cephes/
See log2.c for example:
	z = x - 0.5;
	z -= 0.5;
	y = 0.5 * x  + 0.5;
This code requires IEEE. And you can found it in Phobos std.math

4. Mir has 5 types of summation, and 3 of them requires IEEE.

5. Tinflex requires IEEE arithmetic because extended precision 
may force algorithm to choose wrong computation branch. The most 
significant part of original code was written in R, and the 
scientists who create this algorithm did not care about non IEEE 
compilers at all.

6. Atmosphere requires IEEE for may functions such as 
https://github.com/9il/atmosphere/blob/master/source/atmosphere/math.d#L616
Without proper IEEE rounding the are not guarantee that this 
functions will stop.

7. The same true for real world implementations of algorithms 
presented in Numeric Recipes, which uses various series expansion 
such as for Modified Bessel Function.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

Thanks for finding these.

On 8/5/2016 3:22 AM, Ilya Yaroshenko wrote:
 1. https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
 mathmodule.c, math_fsum has comment:
    Depends on IEEE 754 arithmetic guarantees and half-even rounding.
 The same algorithm also available in Mir. And it does not work with 32 bit DMD.

 2. sum_kbn in https://github.com/JuliaLang/julia/blob/master/base/reduce.jl
 requires ieee arithmetic. The same algorithm also available in Mir.

I agree that the typical summation algorithm suffers from double rounding. But 
that's one algorithm. I would appreciate if you would review 
http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure it doesn't 
have this problem, and if it does, how we can fix it.


 3. http://www.netlib.org/cephes/
 See log2.c for example:
     z = x - 0.5;
     z -= 0.5;
     y = 0.5 * x  + 0.5;
 This code requires IEEE. And you can found it in Phobos std.math

It'd be great to have a value for x where it fails, then we can add it to the 
unittests and ensure it is fixed.


 4. Mir has 5 types of summation, and 3 of them requires IEEE.

See above for summation.


 5. Tinflex requires IEEE arithmetic because extended precision may force
 algorithm to choose wrong computation branch. The most significant part of
 original code was written in R, and the scientists who create this algorithm
did
 not care about non IEEE compilers at all.

 6. Atmosphere requires IEEE for may functions such as
 https://github.com/9il/atmosphere/blob/master/source/atmosphere/math.d#L616
 Without proper IEEE rounding the are not guarantee that this functions will
stop.

 7. The same true for real world implementations of algorithms presented in
 Numeric Recipes, which uses various series expansion such as for Modified
Bessel
 Function.

I hear you. I'd like to explore ways of solving it. Got any ideas?

Aug 05 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 5 August 2016 at 20:53:42 UTC, Walter Bright wrote:

 I agree that the typical summation algorithm suffers from 
 double rounding. But that's one algorithm. I would appreciate 
 if you would review 
 http://dlang.org/phobos/std_algorithm_iteration.html#sum to 
 ensure it doesn't have this problem, and if it does, how we can 
 fix it.

Phobos's sum is two different algorithms. Pairwise summation for 
Random Access Ranges and Kahan summation for Input Ranges. 
Pairwise summation does not require IEEE rounding, but Kahan 
summation requires it.

The problem with real world example is that it depends on 
optimisation. For example, if all temporary values are rounded, 
this is not a problem, and if all temporary values are not 
rounded this is not a problem too. However if some of them 
rounded and others are not, than this will break Kahan algorithm.

Kahan is the shortest and one of the slowest (comparing with KBN 
for example) summation algorithms. The true story about Kahan, 
that we may have it in Phobos, but we can use pairwise summation 
for Input Ranges without random access, and it will be faster 
then Kahan. So we don't need Kahan for current API at all.

Mir has both Kahan, which works with 32-bit DMD, and pairwise, 
witch works with input ranges.

Kahan, KBN, KB2, and Precise summations is always use `real` or 
`Complex!real` internal values for 32 bit X86 target. The only 
problem with Precise summation, if we need precise result in 
double and use real for internal summation, then the last bit 
will be wrong in the 50% of cases.

Another good point about Mir's summation algorithms, that they 
are Output Ranges. This means they can be used effectively to sum 
multidimensional arrays for example. Also, Precise summator may 
be used to compute exact sum of distributed data.

When we get a decision and solution for rounding problem, I will 
make PR for std.experimental.numeric.sum.

 I hear you. I'd like to explore ways of solving it. Got any 
 ideas?

We need to take the overall picture.

It is very important to recognise that D core team is small and D 
community is not large enough now to involve a lot of new 
professionals. This means that time of existing one engineers is 
very important for D and the most important engineer for D is 
you, Walter.

In the same time we need to move forward fast with language 
changes and druntime changes (GC-less Fibers for example).

So, we need to choose tricky options for development. The most 
important option for D in the science context is to split D 
Programming Language from DMD in our minds. I am not asking to 
remove DMD as reference compiler. Instead of that, we can 
introduce changes in D that can not be optimally implemented in 
DMD (because you have a lot of more important things to do for D 
instead of optimisation) but will be awesome for our LLVM-based 
or GCC-based backends.

We need 2 new pragmas with the same syntax as `pragma(inline, 
xxx)`:

1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, 
div-sub operations.
2. `pragma(fastMath)` equivalents to [1]. This pragma can be used 
to allow extended precision.

This should be 2 separate pragmas. The second one may assume the 
first one.

Recent LDC beta has  fastmath attribute for functions, and it is 
already used in Phobos ndslice.algorithm PR and its Mir's mirror. 
Attributes are alternative for pragmas, but their syntax should 
be extended, see [2]

The old approach is separate compilation, but it is weird, low 
level for users, and requires significant efforts for both small 
and large projects.

[1] http://llvm.org/docs/LangRef.html#fast-math-flags
[2] https://github.com/ldc-developers/ldc/issues/1669

Best regards,
Ilya

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 On Friday, 5 August 2016 at 20:53:42 UTC, Walter Bright wrote:

 I agree that the typical summation algorithm suffers from double rounding. But
 that's one algorithm. I would appreciate if you would review
 http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure it doesn't
 have this problem, and if it does, how we can fix it.

 Phobos's sum is two different algorithms. Pairwise summation for Random Access
 Ranges and Kahan summation for Input Ranges. Pairwise summation does not
require
 IEEE rounding, but Kahan summation requires it.

 The problem with real world example is that it depends on optimisation. For
 example, if all temporary values are rounded, this is not a problem, and if all
 temporary values are not rounded this is not a problem too. However if some of
 them rounded and others are not, than this will break Kahan algorithm.

 Kahan is the shortest and one of the slowest (comparing with KBN for example)
 summation algorithms. The true story about Kahan, that we may have it in
Phobos,
 but we can use pairwise summation for Input Ranges without random access, and
it
 will be faster then Kahan. So we don't need Kahan for current API at all.

 Mir has both Kahan, which works with 32-bit DMD, and pairwise, witch works with
 input ranges.

 Kahan, KBN, KB2, and Precise summations is always use `real` or `Complex!real`
 internal values for 32 bit X86 target. The only problem with Precise summation,
 if we need precise result in double and use real for internal summation, then
 the last bit will be wrong in the 50% of cases.

 Another good point about Mir's summation algorithms, that they are Output
 Ranges. This means they can be used effectively to sum multidimensional arrays
 for example. Also, Precise summator may be used to compute exact sum of
 distributed data.

 When we get a decision and solution for rounding problem, I will make PR for
 std.experimental.numeric.sum.

 I hear you. I'd like to explore ways of solving it. Got any ideas?

 We need to take the overall picture.

 It is very important to recognise that D core team is small and D community is
 not large enough now to involve a lot of new professionals. This means that
time
 of existing one engineers is very important for D and the most important
 engineer for D is you, Walter.

 In the same time we need to move forward fast with language changes and
druntime
 changes (GC-less Fibers for example).

 So, we need to choose tricky options for development. The most important option
 for D in the science context is to split D Programming Language from DMD in our
 minds. I am not asking to remove DMD as reference compiler. Instead of that, we
 can introduce changes in D that can not be optimally implemented in DMD
(because
 you have a lot of more important things to do for D instead of optimisation)
but
 will be awesome for our LLVM-based or GCC-based backends.

 We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub
operations.
 2. `pragma(fastMath)` equivalents to [1]. This pragma can be used to allow
 extended precision.

 This should be 2 separate pragmas. The second one may assume the first one.

 Recent LDC beta has  fastmath attribute for functions, and it is already used
in
 Phobos ndslice.algorithm PR and its Mir's mirror. Attributes are alternative
for
 pragmas, but their syntax should be extended, see [2]

 The old approach is separate compilation, but it is weird, low level for users,
 and requires significant efforts for both small and large projects.

 [1] http://llvm.org/docs/LangRef.html#fast-math-flags
 [2] https://github.com/ldc-developers/ldc/issues/1669

Thanks for your help with this.

Using attributes for this is a mistake. Attributes affect the interface to a 
function, not its internal implementation. Pragmas are suitable for internal 
implementation things. I also oppose using compiler flags, because they tend to 
be overly global, and the details of an algorithm should not be split between 
the source code and the makefile.

Aug 06 2016

Johannes Pfau <nospam example.com> writes:

Am Sat, 6 Aug 2016 02:29:50 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 On Friday, 5 August 2016 at 20:53:42 UTC, Walter Bright wrote:
  
 I agree that the typical summation algorithm suffers from double
 rounding. But that's one algorithm. I would appreciate if you
 would review
 http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure
 it doesn't have this problem, and if it does, how we can fix it. 

 Phobos's sum is two different algorithms. Pairwise summation for
 Random Access Ranges and Kahan summation for Input Ranges. Pairwise
 summation does not require IEEE rounding, but Kahan summation
 requires it.

 The problem with real world example is that it depends on
 optimisation. For example, if all temporary values are rounded,
 this is not a problem, and if all temporary values are not rounded
 this is not a problem too. However if some of them rounded and
 others are not, than this will break Kahan algorithm.

 Kahan is the shortest and one of the slowest (comparing with KBN
 for example) summation algorithms. The true story about Kahan, that
 we may have it in Phobos, but we can use pairwise summation for
 Input Ranges without random access, and it will be faster then
 Kahan. So we don't need Kahan for current API at all.

 Mir has both Kahan, which works with 32-bit DMD, and pairwise,
 witch works with input ranges.

 Kahan, KBN, KB2, and Precise summations is always use `real` or
 `Complex!real` internal values for 32 bit X86 target. The only
 problem with Precise summation, if we need precise result in double
 and use real for internal summation, then the last bit will be
 wrong in the 50% of cases.

 Another good point about Mir's summation algorithms, that they are
 Output Ranges. This means they can be used effectively to sum
 multidimensional arrays for example. Also, Precise summator may be
 used to compute exact sum of distributed data.

 When we get a decision and solution for rounding problem, I will
 make PR for std.experimental.numeric.sum.
  
 I hear you. I'd like to explore ways of solving it. Got any
 ideas?  

 We need to take the overall picture.

 It is very important to recognise that D core team is small and D
 community is not large enough now to involve a lot of new
 professionals. This means that time of existing one engineers is
 very important for D and the most important engineer for D is you,
 Walter.

 In the same time we need to move forward fast with language changes
 and druntime changes (GC-less Fibers for example).

 So, we need to choose tricky options for development. The most
 important option for D in the science context is to split D
 Programming Language from DMD in our minds. I am not asking to
 remove DMD as reference compiler. Instead of that, we can introduce
 changes in D that can not be optimally implemented in DMD (because
 you have a lot of more important things to do for D instead of
 optimisation) but will be awesome for our LLVM-based or GCC-based
 backends.

 We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add,
 div-sub operations. 2. `pragma(fastMath)` equivalents to [1]. This
 pragma can be used to allow extended precision.

 This should be 2 separate pragmas. The second one may assume the
 first one.

 Recent LDC beta has  fastmath attribute for functions, and it is
 already used in Phobos ndslice.algorithm PR and its Mir's mirror.
 Attributes are alternative for pragmas, but their syntax should be
 extended, see [2]

 The old approach is separate compilation, but it is weird, low
 level for users, and requires significant efforts for both small
 and large projects.

 [1] http://llvm.org/docs/LangRef.html#fast-math-flags
 [2] https://github.com/ldc-developers/ldc/issues/1669  

 
 Thanks for your help with this.
 
 Using attributes for this is a mistake. Attributes affect the
 interface to a function

This is not true for UDAs. LDC and GDC actually implement  attribute
as an UDA. And UDAs used in serialization interfaces, the std.benchmark
proposals, ... do not affect the interface either.

 not its internal implementation.

It's possible to reflect on the UDAs of the current function, so this
is not true in general:
-----------------------------
 (40) int foo()
{
    mixin("alias thisFunc = " ~ __FUNCTION__ ~ ";");
    return __traits(getAttributes, thisFunc)[0];
}
-----------------------------
https://dpaste.dzfl.pl/aa0615b40adf

I think this restriction is also quite arbitrary. For end users
attributes provide a much nicer syntax than pragmas. Both GDC and LDC
already successfully use UDAs for function specific backend options, so
DMD is really the exception here.

Additionally, even according to your rules pragma(mangle) should
actually be  mangle.

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 5:09 AM, Johannes Pfau wrote:
 I think this restriction is also quite arbitrary.

You're right that there are gray areas, but the distinction is not arbitrary.

For example, mangling does not affect the interface. It affects the name.

Using an attribute has more downsides, as it affects the whole function rather 
than just part of it, like a pragma would.

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub
operations.
 2. `pragma(fastMath)` equivalents to [1]. This pragma can be used to allow
 extended precision.


The LDC fastmath bothers me a lot. It throws away proper NaN and infinity 
handling, and throws away precision by allowing reciprocal and algebraic 
transformations. As I've said before, correctness should be first, not speed, 
and fastmath has nothing to do with this thread.


I don't know what the point of fusedMath is.

Aug 06 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 We need 2 new pragmas with the same syntax as `pragma(inline, 
 xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, 
 div-sub operations.
 2. `pragma(fastMath)` equivalents to [1]. This pragma can be 
 used to allow
 extended precision.


 The LDC fastmath bothers me a lot. It throws away proper NaN 
 and infinity handling, and throws away precision by allowing 
 reciprocal and algebraic transformations. As I've said before, 
 correctness should be first, not speed, and fastmath has 
 nothing to do with this thread.

OK, then we need a third pragma,`pragma(ieeeRound)`. But 
`pragma(fusedMath)` and `pragma(fastMath)` should be presented 
too.

 I don't know what the point of fusedMath is.

It allows a compiler to replace two arithmetic operations with 
single composed one, see AVX2 (FMA3 for intel and FMA4 for AMD) 
instruction set.

Aug 06 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub
 operations.
 2. `pragma(fastMath)` equivalents to [1]. This pragma can be used to
 allow
 extended precision.



 The LDC fastmath bothers me a lot. It throws away proper NaN and infinity
 handling, and throws away precision by allowing reciprocal and algebraic
 transformations. As I've said before, correctness should be first, not
 speed, and fastmath has nothing to do with this thread.


 OK, then we need a third pragma,`pragma(ieeeRound)`. But `pragma(fusedMath)`
 and `pragma(fastMath)` should be presented too.

 I don't know what the point of fusedMath is.


 It allows a compiler to replace two arithmetic operations with single
 composed one, see AVX2 (FMA3 for intel and FMA4 for AMD) instruction set.

No pragmas tied to a specific architecture should be allowed in the
language spec, please.

Aug 06 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright 
 wrote:
 [...]


 OK, then we need a third pragma,`pragma(ieeeRound)`. But 
 `pragma(fusedMath)`
 and `pragma(fastMath)` should be presented too.

 [...]


 It allows a compiler to replace two arithmetic operations with 
 single composed one, see AVX2 (FMA3 for intel and FMA4 for 
 AMD) instruction set.

 No pragmas tied to a specific architecture should be allowed in 
 the language spec, please.

Then probably Mir will drop all compilers, but LDC
LLVM is tied for real world, so we can tied D for real world too. 
If a compiler can not implement optimization pragma, then this 
pragma can be just ignored by the compiler.

Aug 06 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 6 August 2016 at 12:07, Ilya Yaroshenko via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 [...]



 OK, then we need a third pragma,`pragma(ieeeRound)`. But
 `pragma(fusedMath)`
 and `pragma(fastMath)` should be presented too.

 [...]



 It allows a compiler to replace two arithmetic operations with single
 composed one, see AVX2 (FMA3 for intel and FMA4 for AMD) instruction set.


 No pragmas tied to a specific architecture should be allowed in the
 language spec, please.


 Then probably Mir will drop all compilers, but LDC
 LLVM is tied for real world, so we can tied D for real world too. If a
 compiler can not implement optimization pragma, then this pragma can be just
 ignored by the compiler.

If you need a function to work with an exclusive instruction set or
something as specific as use of composed/fused instructions, then it
is common to use an indirect function resolver to choose the most
relevant implementation for the system that's running the code (a la
 ifunc), then for the targetted fusedMath implementation, do it
yourself.

Aug 06 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 11:10:18 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 12:07, Ilya Yaroshenko via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright 
 wrote:
 [...]



 OK, then we need a third pragma,`pragma(ieeeRound)`. But
 `pragma(fusedMath)`
 and `pragma(fastMath)` should be presented too.

 [...]



 It allows a compiler to replace two arithmetic operations 
 with single composed one, see AVX2 (FMA3 for intel and FMA4 
 for AMD) instruction set.


 No pragmas tied to a specific architecture should be allowed 
 in the language spec, please.


 Then probably Mir will drop all compilers, but LDC
 LLVM is tied for real world, so we can tied D for real world 
 too. If a
 compiler can not implement optimization pragma, then this 
 pragma can be just
 ignored by the compiler.

 If you need a function to work with an exclusive instruction 
 set or something as specific as use of composed/fused 
 instructions, then it is common to use an indirect function 
 resolver to choose the most relevant implementation for the 
 system that's running the code (a la  ifunc), then for the 
 targetted fusedMath implementation, do it yourself.

What do you mean by "do it yourself"? Write code using FMA GCC 
intrinsics? Why I need to do something that can be automated by a 
compiler? Modern approach is to give a hint to the compiler 
instead of write specialised code for different architectures.

It seems you have misunderstood me. I don't want to force 
compiler to use explicit instruction sets. Instead, I want to 
give a hint to a compiler, about what math _transformations_ are 
allowed. And this hints are architecture independent. A compiler 
may a may not use this hints to optimise code.

Aug 06 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 6 August 2016 at 13:30, Ilya Yaroshenko via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 11:10:18 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 12:07, Ilya Yaroshenko via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 [...]




 OK, then we need a third pragma,`pragma(ieeeRound)`. But
 `pragma(fusedMath)`
 and `pragma(fastMath)` should be presented too.

 [...]




 It allows a compiler to replace two arithmetic operations with single
 composed one, see AVX2 (FMA3 for intel and FMA4 for AMD) instruction set.



 No pragmas tied to a specific architecture should be allowed in the
 language spec, please.



 Then probably Mir will drop all compilers, but LDC
 LLVM is tied for real world, so we can tied D for real world too. If a
 compiler can not implement optimization pragma, then this pragma can be
 just
 ignored by the compiler.


 If you need a function to work with an exclusive instruction set or
 something as specific as use of composed/fused instructions, then it is
 common to use an indirect function resolver to choose the most relevant
 implementation for the system that's running the code (a la  ifunc), then
 for the targetted fusedMath implementation, do it yourself.


 What do you mean by "do it yourself"? Write code using FMA GCC intrinsics?
 Why I need to do something that can be automated by a compiler? Modern
 approach is to give a hint to the compiler instead of write specialised code
 for different architectures.

 It seems you have misunderstood me. I don't want to force compiler to use
 explicit instruction sets. Instead, I want to give a hint to a compiler,
 about what math _transformations_ are allowed. And this hints are
 architecture independent. A compiler may a may not use this hints to
 optimise code.

There are compiler switches for that.  Maybe there should be one
pragma to tweak these compiler switches on a per-function basis,
rather than separately named pragmas.  That way you tell the compiler
what you want, rather than it being part of the language logic to
understand what must be turned on/off internally.

First, assume the language knows nothing about what platform it's
running on, then use that as a basis for suggesting new pragmas that
should be supported everywhere.

Aug 06 2016

David Nadlinger <code klickverbot.at> writes:

On Saturday, 6 August 2016 at 12:48:26 UTC, Iain Buclaw wrote:
 There are compiler switches for that.  Maybe there should be 
 one pragma to tweak these compiler switches on a per-function 
 basis, rather than separately named pragmas.

This might be a solution for inherently compiler-specific 
settings (although for LDC we would probably go for "type-safe" 
UDAs/pragmas instead of parsing faux command-line strings).

Floating point transformation semantics aren't compiler-specific, 
though. The corresponding options are used commonly enough in 
certain kinds of code that it doesn't seem prudent to require 
users to resort to compiler-specific ways of expressing them.

  — David

Aug 06 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright 
 wrote:


 No pragmas tied to a specific architecture should be allowed in 
 the language spec, please.

Hmmm, that's the whole point of pragmas (at least in C) to 
specify implementation specific stuff outside of the language 
specs. If it's in the language specs it should be done with 
language specific mechanisms.

Aug 06 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 6 August 2016 at 16:11, Patrick Schluter via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:


 No pragmas tied to a specific architecture should be allowed in the
 language spec, please.


 Hmmm, that's the whole point of pragmas (at least in C) to specify
 implementation specific stuff outside of the language specs. If it's in the
 language specs it should be done with language specific mechanisms.

https://dlang.org/spec/pragma.html#predefined-pragmas

"""
All implementations must support these, even if by just ignoring them.
...
Vendor specific pragma Identifiers can be defined if they are prefixed
by the vendor's trademarked name, in a similar manner to version
identifiers.
"""

So all added pragmas that have no vendor prefix must be treated as
part of the language in order to conform with the specs.

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 3:02 AM, Iain Buclaw via Digitalmars-d wrote:
 No pragmas tied to a specific architecture should be allowed in the
 language spec, please.


A good point. On the other hand, a list of them would be nice so
implementations 
don't step on each other.

Aug 06 2016

David Nadlinger <code klickverbot.at> writes:

On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 No pragmas tied to a specific architecture should be allowed in 
 the language spec, please.

I wholeheartedly agree. However, it's not like FP optimisation 
pragmas would be specific to any particular architecture. They 
just describe classes of transformations that are allowed on top 
of the standard semantics.

For example, whether transforming `a + (b * c)` into a single 
operation is allowed is not a question of the target architecture 
at all, but rather whether the implicit rounding after evaluating 
(b * c) can be skipped or not. While this in turn of course 
enables the compiler to use FMA instructions on x86/AVX, 
ARM/NEON, PPC, …, it is not architecture-specific at all on a 
conceptual level.

  — David

Aug 06 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 6 August 2016 at 22:12, David Nadlinger via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
 No pragmas tied to a specific architecture should be allowed in the
 language spec, please.


 I wholeheartedly agree. However, it's not like FP optimisation pragmas would
 be specific to any particular architecture. They just describe classes of
 transformations that are allowed on top of the standard semantics.

 For example, whether transforming `a + (b * c)` into a single operation is
 allowed is not a question of the target architecture at all, but rather
 whether the implicit rounding after evaluating (b * c) can be skipped or
 not. While this in turn of course enables the compiler to use FMA
 instructions on x86/AVX, ARM/NEON, PPC, …, it is not architecture-specific
 at all on a conceptual level.

Well, you get fusedMath for free when turning on -mfma or -mfused-madd
- whatever is most relevant for the target.

Try adding -mfma here.  http://goo.gl/xsvDXM

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 2:48 AM, Ilya Yaroshenko wrote:
 I don't know what the point of fusedMath is.

 It allows a compiler to replace two arithmetic operations with single composed
 one, see AVX2 (FMA3 for intel and FMA4 for AMD) instruction set.

I understand that, I just don't understand why that wouldn't be done anyway.

Aug 06 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 19:51:11 UTC, Walter Bright wrote:
 On 8/6/2016 2:48 AM, Ilya Yaroshenko wrote:
 I don't know what the point of fusedMath is.

 It allows a compiler to replace two arithmetic operations with 
 single composed
 one, see AVX2 (FMA3 for intel and FMA4 for AMD) instruction 
 set.

 I understand that, I just don't understand why that wouldn't be 
 done anyway.

Some applications requires exactly the same results for different 
architectures (probably because business requirement). So this 
optimization is turned off by default in LDC for example.

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 1:06 PM, Ilya Yaroshenko wrote:
 Some applications requires exactly the same results for different architectures
 (probably because business requirement). So this optimization is turned off by
 default in LDC for example.

Let me rephrase the question - how does fusing them alter the result?

Aug 06 2016

David Nadlinger <code klickverbot.at> writes:

On Saturday, 6 August 2016 at 21:56:06 UTC, Walter Bright wrote:
 Let me rephrase the question - how does fusing them alter the 
 result?

There is just one rounding operation instead of two.

Of course, if floating point values are strictly defined as 
having only a minimum precision, then folding away the rounding 
after the multiplication is always legal.

  — David

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 3:14 PM, David Nadlinger wrote:
 On Saturday, 6 August 2016 at 21:56:06 UTC, Walter Bright wrote:
 Let me rephrase the question - how does fusing them alter the result?

 There is just one rounding operation instead of two.

Makes sense.


 Of course, if floating point values are strictly defined as having only a
 minimum precision, then folding away the rounding after the multiplication is
 always legal.

Yup.

So it does make sense that allowing fused operations would be equivalent to 
having no maximum precision.

Aug 06 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 22:32:08 UTC, Walter Bright wrote:
 On 8/6/2016 3:14 PM, David Nadlinger wrote:
 Of course, if floating point values are strictly defined as 
 having only a
 minimum precision, then folding away the rounding after the 
 multiplication is
 always legal.

 Yup.

 So it does make sense that allowing fused operations would be 
 equivalent to having no maximum precision.

Fused operations are mul/div+add/sub only.
Fused operations does not break compesator subtraction:
auto t = a - x + x;
So, please, make them as separate pragma.

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 11:45 PM, Ilya Yaroshenko wrote:
 So it does make sense that allowing fused operations would be equivalent to
 having no maximum precision.

 Fused operations are mul/div+add/sub only.
 Fused operations does not break compesator subtraction:
 auto t = a - x + x;
 So, please, make them as separate pragma.

ok

Aug 07 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Saturday, 6 August 2016 at 21:56:06 UTC, Walter Bright wrote:
 On 8/6/2016 1:06 PM, Ilya Yaroshenko wrote:
 Some applications requires exactly the same results for 
 different architectures
 (probably because business requirement). So this optimization 
 is turned off by
 default in LDC for example.

 Let me rephrase the question - how does fusing them alter the 
 result?

The result became more precise, because single rounding instead 
of two.

Aug 06 2016

David Nadlinger <code klickverbot.at> writes:

On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 The LDC fastmath bothers me a lot. It throws away proper NaN 
 and infinity handling, and throws away precision by allowing 
 reciprocal and algebraic transformations.

This is true – and precisely the reason why it is actually 
defined (ldc.attributes) as

---
alias fastmath = AliasSeq!(llvmAttr("unsafe-fp-math", "true"), 
llvmFastMathFlag("fast"));
---

This way, users can actually combine different optimisations in a 
more tasteful manner as appropriate for their particular 
application.

Experience has shown that people – even those intimately familiar 
with FP semantics – expect a catch-all kitchen-sink switch for 
all natural optimisations (natural when equating FP values with 
real numbers). This is why the shorthand exists.

  — David

Aug 06 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2016 2:12 PM, David Nadlinger wrote:
 This is true – and precisely the reason why it is actually defined
 (ldc.attributes) as

 ---
 alias fastmath = AliasSeq!(llvmAttr("unsafe-fp-math", "true"),
 llvmFastMathFlag("fast"));
 ---

 This way, users can actually combine different optimisations in a more tasteful
 manner as appropriate for their particular application.

 Experience has shown that people – even those intimately familiar with FP
 semantics – expect a catch-all kitchen-sink switch for all natural
optimisations
 (natural when equating FP values with real numbers). This is why the shorthand
 exists.

I didn't know that, thanks for the explanation. But the same can be done for 
pragmas, as the second argument isn't just true|false, it's an expression.

Aug 06 2016

deadalnix <deadalnix gmail.com> writes:

On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
 On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
 We need 2 new pragmas with the same syntax as `pragma(inline, 
 xxx)`:

 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, 
 div-sub operations.
 2. `pragma(fastMath)` equivalents to [1]. This pragma can be 
 used to allow
 extended precision.


 The LDC fastmath bothers me a lot. It throws away proper NaN 
 and infinity handling, and throws away precision by allowing 
 reciprocal and algebraic transformations. As I've said before, 
 correctness should be first, not speed, and fastmath has 
 nothing to do with this thread.


 I don't know what the point of fusedMath is.

It's not as cut and dry. Sometime, processing faster mean you can 
process more data to begin with, and get a better result.

Aug 07 2016

Fool <fool dlang.org> writes:

Here is a relevant example:

https://hal.inria.fr/inria-00171497v1/document

It is used in at least one real world geometric modeling system.

Aug 05 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 11:53 AM, Walter Bright wrote:
 It has been proposed many times that the solution for D is to have a function
 called toFloat() or something like that in core.math, which guarantees a round
 to float precision for its argument. But so far nobody has written such a
function.

https://github.com/dlang/druntime/pull/1621

Aug 04 2016

deadalnix <deadalnix gmail.com> writes:

On Thursday, 4 August 2016 at 18:53:23 UTC, Walter Bright wrote:
 On 8/4/2016 7:08 AM, Andrew Godfrey wrote:
 Now, my major experience is in the context of Intel non-SIMD 
 FP, where internal
 precision is 80-bit. I can see the appeal of asking for the 
 ability to reduce
 internal precision to match the data type you're using, and I 
 think I've read
 something written by Walter on that topic. But this would 
 hardly be "C-like" FP
 support so I'm not sure that's he topic at hand.

 Also, carefully reading the C Standard, D's behavior is allowed 
 by the C Standard. The idea that C requires rounding of all 
 intermediate values to the target precision is incorrect, and 
 is not "C-like". C floating point semantics can and do vary 
 from platform to platform, and vary based on optimization 
 settings, and this is all allowed by the C Standard.

It is actually very common for C compiler to work with double for 
intermediate values, which isn't far from what D does.

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 1:03 PM, deadalnix wrote:
 It is actually very common for C compiler to work with double for intermediate
 values, which isn't far from what D does.

In fact, it used to be specified that C behave that way!

Aug 04 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

IEEE behaviour by default is required by numeric software. 
 fastmath (like recent LDC) or something like that can be used to 
allow extended precision.

Ilya

Aug 04 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 4 August 2016 at 01:00, Seb via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 Consider the following program, it fails on 32-bit :/

It would be nice if explicit casts were honoured by CTFE here.
toDouble(a + b) just seems to be avoiding the why CTFE ignores the
cast in cast(double)(a + b).

 To make matters worse std.math yields different results than
 compiler/assembly intrinsics - note that in this example import std.math.pow
 adds about 1K instructions to the output assembler, whereas llvm_powf boils
 down to the assembly powf. Of course the performance of powf is a lot
 better, I measured [3] that e.g. std.math.pow takes ~1.5x as long for both
 LDC and DMD. Of course if you need to run this very often, this cost isn't
 acceptable.

This could be something specific to your architecture.  I get the same
result on from all versions of powf, and from GCC builtins too,
regardless of optimization tunings.

Aug 04 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 8/4/2016 2:13 PM, Iain Buclaw via Digitalmars-d wrote:
 This could be something specific to your architecture.  I get the same
 result on from all versions of powf, and from GCC builtins too,
 regardless of optimization tunings.


It's important to remember that what gcc does and what the C standard allows
are 
not necessarily the same - even if the former is standard compliant. C allows 
for a lot of implementation defined FP behavior.

Aug 04 2016

Seb <seb wilzba.ch> writes:

On Thursday, 4 August 2016 at 21:13:23 UTC, Iain Buclaw wrote:
 On 4 August 2016 at 01:00, Seb via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:
 To make matters worse std.math yields different results than 
 compiler/assembly intrinsics - note that in this example 
 import std.math.pow adds about 1K instructions to the output 
 assembler, whereas llvm_powf boils down to the assembly powf. 
 Of course the performance of powf is a lot better, I measured 
 [3] that e.g. std.math.pow takes ~1.5x as long for both LDC 
 and DMD. Of course if you need to run this very often, this 
 cost isn't acceptable.

 This could be something specific to your architecture.  I get 
 the same result on from all versions of powf, and from GCC 
 builtins too, regardless of optimization tunings.

I can reproduce this on DPaste (also x86_64).

https://dpaste.dzfl.pl/c0ab5131b49d

Behavior with a recent LDC build is similar (as annotated with 
the comments).

Aug 04 2016

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 4 August 2016 at 23:38, Seb via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Thursday, 4 August 2016 at 21:13:23 UTC, Iain Buclaw wrote:
 On 4 August 2016 at 01:00, Seb via Digitalmars-d
 <digitalmars-d puremagic.com> wrote:
 To make matters worse std.math yields different results than
 compiler/assembly intrinsics - note that in this example import std.math.pow
 adds about 1K instructions to the output assembler, whereas llvm_powf boils
 down to the assembly powf. Of course the performance of powf is a lot
 better, I measured [3] that e.g. std.math.pow takes ~1.5x as long for both
 LDC and DMD. Of course if you need to run this very often, this cost isn't
 acceptable.

 This could be something specific to your architecture.  I get the same
 result on from all versions of powf, and from GCC builtins too, regardless
 of optimization tunings.


 I can reproduce this on DPaste (also x86_64).

 https://dpaste.dzfl.pl/c0ab5131b49d

 Behavior with a recent LDC build is similar (as annotated with the
 comments).

When testing the math functions, I chose not to compare results to
what C libraries, or CPU instructions return, but rather compared the
results to Wolfram, which I hope I'm correct in saying is a more
reliable and proven source of scientific maths than libm.

As of the time I ported all pure D (not IASM) implementations of math
functions, the results returned from all unittests using 80-bit reals
were identical with Wolfram given up to the last 2 digits as an
acceptable error with some values.  This was true for all except
inputs that were just inside the domain for the function, in which
case only double precision was guaranteed.  Where applicable, they
were also found to in some cases to be more accurate than the inline
assembler or yl2x implementations version paths that are used if you
compile with DMD or LDC.

Aug 06 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Why don't we switch to C like floating pointed arithmetic instead of