digitalmars.D.ldc - A small comparison

"bearophile" <bearophileHUGS lycos.com> writes:
```I've found a post about a functional library for LuaJIT code:

http://rtsisyk.github.io/luafun/intro.html

This dynamically typed Lua code:

require "fun" ()
n = 100
x = sum(map(function(x) return x^2 end, take(n, tabulate(math.sin))))
-- calculate sum(sin(x)^2 for x in 0..n-1)
print(x)
50.011981355266

Gets compiled to:

->LOOP:
0bcaffd0  movsd [rsp+0x8], xmm7
0bcaffda  ucomisd xmm6, xmm1
0bcaffde  jnb 0x0bca0028        ->6
0bcaffec  fld qword [rsp+0x8]
0bcafff0  fsin
0bcafff2  fstp qword [rsp]
0bcafff5  movsd xmm5, [rsp]
0bcafffa  mulsd xmm5, xmm5
0bcafffe  jmp 0x0bcaffd0        ->LOOP

So I've written a similar D version using Phobos:

void main() {
    import std.stdio, std.algorithm, std.range, std.math;
    enum n = 100;
    immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b};
    r.writeln;
}

LDC2 compiles it to (32 bit):

LBB0_1:
fxch
fstpt   44(%esp)
fld %st(0)
fstpt   (%esp)
fld1
fstpt   32(%esp)
calll   __D3std4math3sinFNaNbNfeZe
subl    $12, %esp
fmul    %st(0), %st(0)
fldt    44(%esp)
fstpt   44(%esp)
fldt    32(%esp)
fldt    44(%esp)
decl    %esi
fxch
jne LBB0_1

As you can see, the D version doesn't use the fsin instruction as
LuaJIT does; it calls a library function.

If I import the sin from core.stdc.math ldc2 produces (notice the
usage of xmm registers):

LBB0_1:
movsd   %xmm1, 32(%esp)
movsd   %xmm0, 40(%esp)
movsd   %xmm1, (%esp)
calll   _sin
fstpl   48(%esp)
movsd   48(%esp), %xmm0
mulsd   %xmm0, %xmm0
movsd   40(%esp), %xmm1
movsd   %xmm1, 40(%esp)
movsd   32(%esp), %xmm1
movsd   40(%esp), %xmm0
decl    %esi
jne LBB0_1

I have also written a normal for loop in both C and D. This is C
code:

#include <stdio.h>
#include <math.h>

int main() {
    double r = 0.0;
    int i;
    for (i = 0; i < 100; i++) {
        const double aux = sin(i);
        r += aux * aux;
    }
    printf("%f\n", r);
    return 0;
}

GCC compiles it to:

.L3:
cvtsi2sd    %ebx, %xmm0
call    sin
.L2:
mulsd   %xmm0, %xmm0
cmpl    $100, %ebx
movsd   %xmm0, 8(%rsp)
jne .L3

The Intel compiler compiles it to this (this is even vectorized,
sin2 and mulpd work on two doubles at a time):

..B1.2:                         # Preds ..B1.8 ..B1.7
cvtdq2pd  %xmm8, %xmm0                          #8.32
call      __svml_sin2                           #8.28
mulpd     %xmm0, %xmm0                          #9.20
                                                #7.5
                                                #5.14
                                                #9.9
cmpb      $100, %r12b                           #7.5
jb        ..B1.2        # Prob 99%              #7.5

In scientific code it's common to have small kernels that take
most of the program's run time, so it's important to shave
every possible instruction off such loops.

Do you think Phobos/LDC2 are doing well enough here?

Bye,
bearophile
```
Nov 20 2013
"jerro" <a a.com> writes:
```> void main() {
>     import std.stdio, std.algorithm, std.range, std.math;
>     enum n = 100;
>     immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b};
>     r.writeln;
> }

If I use core.stdc.math.sin and compile the code with:

gdc c.d -o c -O3 -fno-bounds-check -frelease

I get:

404e80:	cvtsi2sd %ebx,%xmm0
404e84:	callq  403070 <sin@plt>
404e89:	mulsd  %xmm0,%xmm0
404e90:	cmp    $0x64,%ebx
404e99:	movsd  %xmm0,0x28(%rsp)
404e9f:	jne    404e80 <_Dmain+0x60>

This is almost exactly the same code as

.L3:
cvtsi2sd    %ebx, %xmm0
call    sin
.L2:
mulsd   %xmm0, %xmm0
cmpl    $100, %ebx
movsd   %xmm0, 8(%rsp)
jne .L3

```
Nov 20 2013
"bearophile" <bearophileHUGS lycos.com> writes:
```jerro:

> If I use core.stdc.math.sin and compile the code with:
>
> gdc c.d -o c -O3 -fno-bounds-check -frelease
>
> I get:
>
> 404e80:	cvtsi2sd %ebx,%xmm0
> 404e84:	callq  403070 <sin@plt>
> 404e89:	mulsd  %xmm0,%xmm0
> 404e90:	cmp    $0x64,%ebx
> 404e99:	movsd  %xmm0,0x28(%rsp)
> 404e9f:	jne    404e80 <_Dmain+0x60>
>
> This is almost exactly the same code as
>
> .L3:
> cvtsi2sd    %ebx, %xmm0
> call    sin
> .L2:
> mulsd   %xmm0, %xmm0
> cmpl    $100, %ebx
> movsd   %xmm0, 8(%rsp)
> jne .L3

I guess the back-end is very similar, and it's able to optimize
the code well :-)

What's the difference between std.math.sin and core.stdc.math.sin?

Bye,
bearophile
```
Nov 21 2013
"Kai Nacke" <kai redstar.de> writes:
```Hi bearophile!

On Wednesday, 20 November 2013 at 22:12:29 UTC, bearophile wrote:
> In scientific code it's common to have small kernels that take
> most of the program's run time, so it's important to shave
> every possible instruction off such loops.
>
> Do you think Phobos/LDC2 are doing well enough here?

The difference between std.math.sin and core.stdc.math.sin is
that std.math.sin is mapped to the LLVM intrinsic llvm.sin while
core.stdc.math.sin is the sine function from the C library.

I would expect that the LLVM intrinsic is replaced with fsin (at
least with -m32). I am unsure why this does not happen.

My understanding is that the auto-vectorizer should kick in here
(see http://llvm.org/docs/Vectorizers.html). I need to dig
deeper to understand what is happening in LLVM.

For sure, Phobos/LDC2 should do better here!!!

Regards,
Kai
```
Nov 22 2013
"Kai Nacke" <kai redstar.de> writes:
```On Friday, 22 November 2013 at 21:44:28 UTC, Kai Nacke wrote:
> I would expect that the LLVM intrinsic is replaced with fsin
> (at least with -m32). I am unsure why this does not happen.

The answer is simple: it is not yet implemented. Just look at the
slides from the last LLVM developers' meeting:
http://www.llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf

Regards,
Kai
```
Dec 13 2013