digitalmars.D - A new calling convention in VS2013


"bearophile" <bearophileHUGS lycos.com> writes:

Through Reddit I've found an article about 
vector-calling-convention added to VS2013:
http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx


So I have written what I think is a similar D program:


import core.stdc.stdio, core.simd;

struct Particle { float4 x, y; }

Particle addParticles(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

// BUG 10627 and 10523
//alias Particle2 = float4[2];
//Particle2 addParticles(in Particle2 p1, in Particle2 p2) {
//    return p1[] + p2[];
//}

void main() {
    auto p1 = Particle([1, 2, 3, 4], [10, 20, 30, 40]);
    printf("%f %f %f %f %f %f %f %f\n",
           p1.x.array[0], p1.x.array[1], p1.x.array[2], p1.x.array[3],
           p1.y.array[0], p1.y.array[1], p1.y.array[2], p1.y.array[3]);

    auto p2 = Particle([100, 200, 300, 400], [1000, 2000, 3000, 4000]);
    printf("%f %f %f %f %f %f %f %f\n",
           p2.x.array[0], p2.x.array[1], p2.x.array[2], p2.x.array[3],
           p2.y.array[0], p2.y.array[1], p2.y.array[2], p2.y.array[3]);

    auto p3 = addParticles(p1, p2);
    printf("%f %f %f %f %f %f %f %f\n",
           p3.x.array[0], p3.x.array[1], p3.x.array[2], p3.x.array[3],
           p3.y.array[0], p3.y.array[1], p3.y.array[2], p3.y.array[3]);
}


I have compiled with the latest ldc2 (Windows32):

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d


The resulting X86 asm:

__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	subl	$16, %esp
	movaps	40(%ebp), %xmm0
	movaps	56(%ebp), %xmm1
	addps	8(%ebp), %xmm0
	addps	24(%ebp), %xmm1
	movups	%xmm1, 16(%eax)
	movups	%xmm0, (%eax)
	movl	%ebp, %esp
	popl	%ebp
	ret	$64

__Dmain:
...
	movaps	160(%esp), %xmm0
	movaps	176(%esp), %xmm1
	movaps	%xmm1, 48(%esp)
	movaps	%xmm0, 32(%esp)
	movaps	128(%esp), %xmm0
	movaps	144(%esp), %xmm1
	movaps	%xmm1, 16(%esp)
	movaps	%xmm0, (%esp)
	leal	96(%esp), %eax
	calll	__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle
	subl	$64, %esp
	movss	96(%esp), %xmm0
	movss	100(%esp), %xmm1
	movss	104(%esp), %xmm2
	movss	108(%esp), %xmm3
	movss	112(%esp), %xmm4
	movss	116(%esp), %xmm5
	movss	120(%esp), %xmm6
	movss	124(%esp), %xmm7
	cvtss2sd	%xmm7, %xmm7
	movsd	%xmm7, 60(%esp)
	cvtss2sd	%xmm6, %xmm6
	movsd	%xmm6, 52(%esp)
	cvtss2sd	%xmm5, %xmm5
	movsd	%xmm5, 44(%esp)
	cvtss2sd	%xmm4, %xmm4
	movsd	%xmm4, 36(%esp)
	cvtss2sd	%xmm3, %xmm3
	movsd	%xmm3, 28(%esp)
	cvtss2sd	%xmm2, %xmm2
	movsd	%xmm2, 20(%esp)
	cvtss2sd	%xmm1, %xmm1
	movsd	%xmm1, 12(%esp)
	cvtss2sd	%xmm0, %xmm0
	movsd	%xmm0, 4(%esp)
	movl	$_.str3, (%esp)
	calll	___mingw_printf
	xorl	%eax, %eax
	movl	%ebp, %esp
	popl	%ebp
	ret


Are those vector calling conventions useful for D too?

Bye,
bearophile

Jul 13 2013

"Iain Buclaw" <ibuclaw ubuntu.com> writes:

On Saturday, 13 July 2013 at 10:36:01 UTC, bearophile wrote:
 Through Reddit I've found an article about 
 vector-calling-convention added to VS2013:
 http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx

-- snip --

 Are those vector calling conventions useful for D too?

 Bye,
 bearophile


I'd vote for not adding more fluff which makes ABI differences 
between compilers greater.  But it certainly looks like it would 
be useful if you wish to save the time taken to copy the vector 
from XMM registers onto the stack and back again when passing 
values around.
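
As a rough illustration of the copy in question, here is a minimal 
sketch (the names addByValue and addByRef are hypothetical, not taken 
from the article). With the 32-bit ABI used for the asm above, the 
by-value version receives both structs as copies on the stack. Taking 
the arguments by ref passes addresses instead, which avoids that copy 
but still goes through memory rather than XMM registers, so it is not 
a substitute for a vector calling convention.

```d
import core.simd;

struct Particle { float4 x, y; }

// By value: with the ABI shown above, the caller copies both
// 32-byte structs into the argument area before the call.
Particle addByValue(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

// By reference: only two addresses are passed, and the callee
// loads the vectors from the caller's memory.
// Note: ref parameters accept lvalues only.
Particle addByRef(ref const Particle p1, ref const Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}
```

Both functions compute the same result; only the way the arguments 
reach the callee differs.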

Jul 13 2013

Benjamin Thaut <code benjamin-thaut.de> writes:

Am 13.07.2013 12:35, schrieb bearophile:
 The resulting X86 asm:

 __D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:

      pushl    %ebp
      movl    %esp, %ebp
      andl    $-16, %esp
      subl    $16, %esp
      movaps    40(%ebp), %xmm0
      movaps    56(%ebp), %xmm1
      addps    8(%ebp), %xmm0
      addps    24(%ebp), %xmm1
      movups    %xmm1, 16(%eax)
      movups    %xmm0, (%eax)

I think it would be more important for dmd to actually use the XMM 
registers correctly for computations. As you can see from the 
disassembly, dmd generates code that always adds/moves from/to memory 
and does not stay within the registers at all.

http://d.puremagic.com/issues/show_bug.cgi?id=10226

Until dmd uses the XMM registers correctly it doesn't make much sense to 
add a special calling convention for this purpose.

Kind Regards
Benjamin Thaut

Jul 14 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Benjamin Thaut:

 http://d.puremagic.com/issues/show_bug.cgi?id=10226

I see there are codegen inefficiencies.


 Until dmd uses the XMM registers correctly it doesn't make much 
 sense to add a special calling convention for this purpose.

I don't agree, because:
- Even if DMD codegen is far from perfect, it's a good idea 
to improve all things in parallel. Generally improving things (or 
fixing bugs) gives a better result if you adopt a pipelined 
development approach.
- A vector calling convention is meant to be usable on other 
compilers too, like LDC2, that have better codegen. (The asm I 
have shown in this thread comes from LDC2 because dmd doesn't 
even use SIMD registers on Windows 32 bit).

Bye,
bearophile

Jul 14 2013

"Kagamin" <spam here.lot> writes:

Calling convention optimizations can probably be done during 
whole program optimization, which 1) is used for 
computation-intensive applications anyway, and 2) guarantees that 
those fastcall functions are invisible to external code, so 
there's no incompatibility.

Jul 14 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Kagamin:

 Calling convention optimizations can probably be done during 
 whole program optimization, which 1) is used for 
 computation-intensive applications anyway, and 2) guarantees that 
 those fastcall functions are invisible to external code, so 
 there's no incompatibility.

In D you can tag a free function as "private" to make it 
module-private. Maybe in this case the D compiler is free to use 
any kind of calling convention for it.
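
A minimal sketch of that idea, assuming a hypothetical helper named 
addImpl (the split into a private helper and a public wrapper is only 
for illustration, and nothing here says current compilers actually 
change the convention):

```d
import core.simd;

struct Particle { float4 x, y; }

// Module-private helper: no code outside this module can name it,
// so in principle the compiler could pick a register-based
// convention for it. Caveat: its address can still escape the
// module (for example through a function pointer), so the compiler
// would have to rule that out first.
private Particle addImpl(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

// Public entry point: keeps the normal D ABI that other modules
// and other compilers expect.
Particle addParticles(in Particle p1, in Particle p2) pure nothrow {
    return addImpl(p1, p2);
}
```

Only the private function would be a candidate for a custom 
convention; the public one stays ABI-compatible.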

Bye,
bearophile

Jul 14 2013

Benjamin Thaut <code benjamin-thaut.de> writes:

Am 14.07.2013 14:11, schrieb bearophile:
 Benjamin Thaut:

 http://d.puremagic.com/issues/show_bug.cgi?id=10226

 I see there are codegen inefficiencies.


 Until dmd uses the XMM registers correctly it doesn't make much sense
 to add a special calling convention for this purpose.

 I don't agree, because:
 - Even if DMD codegen is far from perfect, it's a good idea to
 improve all things in parallel. Generally improving things (or fixing
 bugs) gives a better result if you adopt a pipelined development approach.
 - A vector calling convention is meant to be usable on other compilers
 too, like LDC2, that have better codegen. (The asm I have shown in this
 thread comes from LDC2 because dmd doesn't even use SIMD registers on
 Windows 32 bit).

 Bye,
 bearophile

I just wanted to say that there are currently bigger fish to fry than 
micro optimization through calling conventions. (GC, allocators, all the 
bugs...)

Did you compile the shown code with optimization enabled or is that a 
debug build? If it is optimized I'm going to be disappointed by LDC's 
codegen.

Kind Regards
Benjamin Thaut

Jul 14 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Benjamin Thaut:

 I just wanted to say that there are currently bigger fish to 
 fry than micro optimization through calling conventions. (GC, 
 allocators, all the bugs...)

I understand and I agree. On the other hand I think there are 
things that (if desired) are better introduced sooner, even though 
some important bugs are not fixed, because they shape the future 
of D a bit.


 Did you compile the shown code with optimization enabled or is 
 that a debug build? If it is optimized I'm going to be 
 disappointed by LDC's codegen.

If you take a look at the original post I have used:

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d

I think that's about the max optimization. I have also added some 
aggressive optimization switches introduced only in the latest LLVM 
version (if you remove them the resulting asm of addParticles is 
about the same, but some of the instructions inside _Dmain are 
ordered less well).

Bye,
bearophile

Jul 14 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Benjamin Thaut:

 If it is optimized I'm going to be disappointed by LDC's 
 codegen.

Why?

Bye,
bearophile

Jul 16 2013
