digitalmars.D.ldc - A small comparison

bearophile (122/122) Nov 20 2013 I've found a post about a functional library for LuaJIT code:

jerro (12/29) Nov 20 2013 If I use core.stdc.math.sin and compile the code with:

bearophile (6/28) Nov 21 2013 I guess the back-end is very similar, and it's able to optimize

Kai Nacke (13/17) Nov 22 2013 The difference between std.math.sin and core.stdc.math.sin is

Kai Nacke (6/8) Dec 13 2013 The answer is simple: it is yet not implemented. Just look at the

"bearophile" <bearophileHUGS lycos.com> writes:

I've found a post about a functional library for LuaJIT code:

http://rtsisyk.github.io/luafun/intro.html


This dynamically typed Lua code:

require "fun" ()
  n = 100
  x = sum(map(function(x) return x^2 end, take(n, 
tabulate(math.sin))))
  -- calculate sum(sin(x)^2 for x in 0..n-1)
  print(x)
  50.011981355266


Gets compiled to:

  ->LOOP:
  0bcaffd0  movsd [rsp+0x8], xmm7
  0bcaffd6  addsd xmm4, xmm5
  0bcaffda  ucomisd xmm6, xmm1
  0bcaffde  jnb 0x0bca0028        ->6
  0bcaffe4  addsd xmm6, xmm0
  0bcaffe8  addsd xmm7, xmm0
  0bcaffec  fld qword [rsp+0x8]
  0bcafff0  fsin
  0bcafff2  fstp qword [rsp]
  0bcafff5  movsd xmm5, [rsp]
  0bcafffa  mulsd xmm5, xmm5
  0bcafffe  jmp 0x0bcaffd0        ->LOOP


So I've written a similar D version using Phobos:

void main() {
     import std.stdio, std.algorithm, std.range;
     enum n = 100;
     immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a 
+ b};
     r.writeln;
}


LDC2 compiles it to (32 bit):

LBB0_1:
     fxch
     fstpt   44(%esp)
     fld %st(0)
     fstpt   (%esp)
     fld1
     faddp   %st(1)
     fstpt   32(%esp)
     calll   __D3std4math3sinFNaNbNfeZe
     subl    $12, %esp
     fmul    %st(0), %st(0)
     fldt    44(%esp)
     faddp   %st(1)
     fstpt   44(%esp)
     fldt    32(%esp)
     fldt    44(%esp)
     decl    %esi
     fxch
     jne LBB0_1


As you see the D version doesn't use the fsin instruction as 
LuaJIT, it calls a library function.


If I import the sin from core.stdc.math ldc2 produces (notice the 
usage of xmm registers):

LBB0_1:
     movsd   %xmm1, 32(%esp)
     movsd   %xmm0, 40(%esp)
     movsd   %xmm1, (%esp)
     calll   _sin
     fstpl   48(%esp)
     movsd   48(%esp), %xmm0
     mulsd   %xmm0, %xmm0
     movsd   40(%esp), %xmm1
     addsd   %xmm0, %xmm1
     movsd   %xmm1, 40(%esp)
     movsd   32(%esp), %xmm1
     movsd   40(%esp), %xmm0
     addsd   LCPI0_0, %xmm1
     decl    %esi
     jne LBB0_1


I have also written a normal for loop in both C and D. This is C 
code:


#include "stdio.h"
#include "math.h"

int main() {
     double r = 0.0;
     int i;
     for (i = 0; i < 100; i++) {
         const double aux = sin(i);
         r += aux * aux;
     }
     printf("%f\n", r);
     return 0;
}


GCC compiles it to:


.L3:
     cvtsi2sd    %ebx, %xmm0
     call    sin
.L2:
     mulsd   %xmm0, %xmm0
     addl    $1, %ebx
     cmpl    $100, %ebx
     addsd   8(%rsp), %xmm0
     movsd   %xmm0, 8(%rsp)
     jne .L3


The Intel compiler compiles it to this (this is even vectorized, 
sin2 and mulpd work on two doubles at a time):


         cvtdq2pd  %xmm8, %xmm0                                  

         call      __svml_sin2                                   

         mulpd     %xmm0, %xmm0                                  

         addb      $2, %r12b                                     

         paddd     %xmm9, %xmm8                                  

         addpd     %xmm0, %xmm10                                 

         cmpb      $100, %r12b                                   





In scientific code it's common to have small kernels that take 
most of the run-time of a program, so it's important to shave 
every instructions from such loops.

Do you think Phobos/LDC2 are doing well enough here?

Bye,
bearophile

Nov 20 2013

"jerro" <a a.com> writes:

 void main() {
     import std.stdio, std.algorithm, std.range;
     enum n = 100;
     immutable double r = n.iota.map!(x => x.sin ^^ 
 2).reduce!q{a + b};
     r.writeln;
 }

If I use core.stdc.math.sin and compile the code with:

gdc c.d -o c -O3 -fno-bounds-check -frelease

I get:

   404e80:	cvtsi2sd %ebx,%xmm0
   404e84:	callq  403070 <sin plt>
   404e89:	mulsd  %xmm0,%xmm0
   404e8d:	add    $0x1,%ebx
   404e90:	cmp    $0x64,%ebx
   404e93:	addsd  0x28(%rsp),%xmm0
   404e99:	movsd  %xmm0,0x28(%rsp)
   404e9f:	jne    404e80 <_Dmain+0x60>

This is almost exactly the same code as

 .L3:
     cvtsi2sd    %ebx, %xmm0
     call    sin
 .L2:
     mulsd   %xmm0, %xmm0
     addl    $1, %ebx
     cmpl    $100, %ebx
     addsd   8(%rsp), %xmm0
     movsd   %xmm0, 8(%rsp)
     jne .L3

Nov 20 2013

"bearophile" <bearophileHUGS lycos.com> writes:

jerro:

 If I use core.stdc.math.sin and compile the code with:

 gdc c.d -o c -O3 -fno-bounds-check -frelease

 I get:

   404e80:	cvtsi2sd %ebx,%xmm0
   404e84:	callq  403070 <sin plt>
   404e89:	mulsd  %xmm0,%xmm0
   404e8d:	add    $0x1,%ebx
   404e90:	cmp    $0x64,%ebx
   404e93:	addsd  0x28(%rsp),%xmm0
   404e99:	movsd  %xmm0,0x28(%rsp)
   404e9f:	jne    404e80 <_Dmain+0x60>

 This is almost exactly the same code as

 .L3:
    cvtsi2sd    %ebx, %xmm0
    call    sin
 .L2:
    mulsd   %xmm0, %xmm0
    addl    $1, %ebx
    cmpl    $100, %ebx
    addsd   8(%rsp), %xmm0
    movsd   %xmm0, 8(%rsp)
    jne .L3


I guess the back-end is very similar, and it's able to optimize 
the code well :-)

What's the difference between std.math.sin and core.stdc.math.sin?

Bye,
bearophile

Nov 21 2013

"Kai Nacke" <kai redstar.de> writes:

Hi bearophile!

On Wednesday, 20 November 2013 at 22:12:29 UTC, bearophile wrote:
 In scientific code it's common to have small kernels that take 
 most of the run-time of a program, so it's important to shave 
 every instructions from such loops.

 Do you think Phobos/LDC2 are doing well enough here?

The difference between std.math.sin and core.stdc.math.sin is 
that std.math.sin is mapped to the LLVM intrinsic llvm.sin while 
core.stdc.math.sin is the sine function from the C library.

I would expect that the LLVM intrinsic is replaced with fsin (at 
least with -m32). I am unsure why this does not happen.

My understanding is that the auto-vectorizer should kick-in here 
(see http://llvm.org/docs/Vectorizers.html). I really need to go 
deeper here to understand what's happening in LLVM.

For sure, Phobos/LDC2 should do better here!!!

Regards,
Kai

Nov 22 2013

"Kai Nacke" <kai redstar.de> writes:

On Friday, 22 November 2013 at 21:44:28 UTC, Kai Nacke wrote:
 I would expect that the LLVM intrinsic is replaced with fsin 
 (at least with -m32). I am unsure why this does not happen.

The answer is simple: it is yet not implemented. Just look at the 
slides from last LLVM developer meeting: 
http://www.llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf

Regards,
Kai

Dec 13 2013

D Programming

C/C++ Programming

Other

digitalmars.D.ldc - A small comparison