digitalmars.D.learn - Align a variable on the stack.

reply TheFlyingFiddle <borin.lukas gmail.com> writes:
Is there a built-in way to do this in dmd?

Basically I want to do this:

auto decode(T)(...)
{
    while(...)
    {
       T t = T.init; //I want this aligned to 64 bytes.
    }
}


Currently I am using:

align(64) struct Aligner(T)
{
    T value;
}

auto decode(T)(...)
{
    Aligner!T t = void;
    while(...)
    {
       t.value = T.init;
    }
}

But is there a less hacky way? From the documentation of align it
seems I cannot use it for this purpose. Also, I don't want to have
to put align(64) on my T struct type, since for my use case I am
decoding arrays of T (aligning T itself would pad every array
element).

The reason that I want to do this in the first place is that if
the variable is aligned I get about a 2.5x speedup (I don't
really know why... found it by accident).
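
For completeness, the other workaround I can think of is to
over-allocate a stack buffer and round the pointer up to the next
64-byte boundary. A rough, untested sketch (T here stands in for
my real struct):

struct T { float x, y, z, w; }

void decode()
{
    // Over-allocate by 63 bytes so that a 64-byte boundary is
    // guaranteed to fall somewhere inside the buffer.
    ubyte[T.sizeof + 63] buf = void;

    // Round the buffer address up to the next multiple of 64.
    auto p = cast(T*)((cast(size_t)buf.ptr + 63) & ~cast(size_t)63);

    assert(cast(size_t)p % 64 == 0);
    *p = T.init;
}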
Nov 03 2015
next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 3 November 2015 at 23:29:45 UTC, TheFlyingFiddle 
wrote:
 Is there a built in way to do this in dmd?

 Basically I want to do this:

 auto decode(T)(...)
 {
    while(...)
    {
       T t = T.init; //I want this aligned to 64 bytes.
    }
 }


 Currently I am using:

 align(64) struct Aligner(T)
 {
    T value;
 }

 auto decode(T)(...)
 {
    Aligner!T t = void;
    while(...)
    {
       t.value = T.init;
    }
 }

 But is there a less hacky way? From the documentation of align
 it seems I cannot use it for this purpose. Also, I don't want
 to have to put align(64) on my T struct type, since for my use
 case I am decoding arrays of T.

 The reason that I want to do this in the first place is that if
 the variable is aligned I get about a 2.5x speedup (I don't
 really know why... found it by accident).
Note that there are two different alignments:
         to control padding between instances on the stack (arrays)
         to control padding between members of a struct

align(64) //arrays
struct foo
{
      align(16) short baz; //between members
      align(1) float quux;
}

Your 2.5x speedup is due to aligned vs. unaligned loads and
stores, which for SIMD-type stuff has a really big effect.
Basically, misaligned accesses are really slow. IIRC there was a
blog post (or paper?) about someone on a uC spending a vast
amount of time in ONE misaligned integer assignment, because it
caused traps and got the kernel involved. Not quite as bad on
x86, but still worth avoiding.

As to a less hacky solution, I'm not sure there is one.
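
You can inspect both effects at compile time (a quick sketch; the
exact values depend on how the compiler honours align):

pragma(msg, foo.alignof);        // spacing between array elements
pragma(msg, foo.quux.offsetof);  // placement of quux within foo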
Nov 03 2015
parent reply TheFlyingFiddle <borin.lukas gmail.com> writes:
On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson 
wrote:
 Note that there are two different alignments:
          to control padding between instances on the stack 
 (arrays)
          to control padding between members of a struct

 align(64) //arrays
 struct foo
 {
       align(16) short baz; //between members
       align (1) float quux;
 }

 Your 2.5x speedup is due to aligned vs. unaligned loads and
 stores, which for SIMD-type stuff has a really big effect.
 Basically, misaligned accesses are really slow. IIRC there was
 a blog post (or paper?) about someone on a uC spending a vast
 amount of time in ONE misaligned integer assignment, because it
 caused traps and got the kernel involved. Not quite as bad on
 x86, but still worth avoiding.

 As to a less hacky solution, I'm not sure there is one.
Thanks for the reply. I did some more checking around, and I
found that it was not really an alignment problem; it was caused
by using the default init value of my type.

My starting type:

align(64) struct Phys
{
    float x, y, z, w;
    //More stuff.
} //Was 64 bytes in size at the time.

The above worked fine; it was fast and all. But after a while I
wanted the data in a different format, so I started decoding
positions and other variables into separate arrays. Something
like this:

align(16) struct Pos
{
    float x, y, z, w;
}

This, counter to my limited knowledge of how CPUs work, was much
slower. Doing the same thing lots of times, touching less
memory, with fewer branches, should in theory at least be
faster, right?

So after I ruled out bottlenecks in the parser, I assumed there
was some alignment problem, so I did my Aligner hack. This made
the code run faster, so I assumed alignment was the cause...
Naive! (There was a typo in the code I submitted to begin with:
I actually used a = Aligner!(T).init, not a.value = T.init.)

The slowdown was actually caused by the line t = T.init, no
matter if it was aligned or not. I solved the problem by
changing the struct to look like this:

align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

Basically, T.init gets explicit values. But... this should be
the same Pos.init as the default Pos.init, so I really fail to
understand how this could fix the problem. I guessed the
compiler generates some slightly different code if I do it this
way, and that this slightly different code avoids some
bottleneck in the CPU. But when I took a look at the assembly of
the function, I could not find any difference in the generated
code...

I don't really know where to go from here to figure out the
underlying cause. Does anyone have any suggestions?
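
For reference, one way to check that the two .init blobs really
are bit-identical (a minimal sketch; the raw bytes are compared
because == on the floats themselves would fail, since nan !=
nan):

struct PosA { float x, y, z, w; }

struct PosB
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void main()
{
    import std.stdio;
    auto a = PosA.init;
    auto b = PosB.init;
    // Byte-wise comparison of the two default values.
    writeln((cast(ubyte*)&a)[0 .. PosA.sizeof] ==
            (cast(ubyte*)&b)[0 .. PosB.sizeof]); // expect: true
}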
Nov 04 2015
next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle 
wrote:
 I don't really know where to go from here to figure out the 
 underlying cause. Does anyone have any suggestions?
Can you publish two compilable and runnable versions of the code that exhibit the difference? Then we can have a look at the generated assembly. If there's really different code being generated depending on whether the .init value is explicitly set to float.nan or not, then this suggests there is a bug in DMD.
Nov 05 2015
parent reply TheFlyingFiddle <borin.lukas gmail.com> writes:
On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
 On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle 
 wrote:
 Can you publish two compilable and runnable versions of the 
 code that exhibit the difference? Then we can have a look at 
 the generated assembly. If there's really different code being 
 generated depending on whether the .init value is explicitly 
 set to float.nan or not, then this suggests there is a bug in 
 DMD.
I created a simple example here:

import std.math; // needed for the ^^ operator on floats

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
    t.x = k;
    t.y = k * 2;
    t.z = k / 2;
    t.w = k ^^ 3;
}

__gshared A[] a;
void benchA()
{
    A val;
    foreach(float f; 0 .. 1000_000)
    {
        val = A.init;
        initVal(val, f);
        a ~= val;
    }
}

__gshared B[] b;
void benchB()
{
    B val;
    foreach(float f; 0 .. 1000_000)
    {
        val = B.init;
        initVal(val, f);
        b ~= val;
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
    return 0;
}

output:

Default:  TickDuration(1637842)
Explicit: TickDuration(167088)

~10x slowdown...
Nov 05 2015
parent reply TheFlyingFiddle <borin.lukas gmail.com> writes:
On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle 
wrote:
 On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
 ~10x slowdown...
I forgot to mention this, but I am using DMD 2.069.0-rc2 for x86
Windows.
Nov 05 2015
parent reply TheFlyingFiddle <borin.lukas gmail.com> writes:
On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle 
wrote:
 On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle 
 wrote:
 On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz 
 wrote:
 ~10x slowdown...
 I forgot to mention this, but I am using DMD 2.069.0-rc2 for
 x86 Windows.
I reduced it further:

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
}

void benchA()
{
    foreach(float f; 0 .. 1000_000)
    {
        A val = A.init;
        initVal(val, f);
    }
}

void benchB()
{
    foreach(float f; 0 .. 1000_000)
    {
        B val = B.init;
        initVal(val, f);
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
    readln;
    return 0;
}

Also, I am compiling with: dmd -release -boundscheck=off -inline

The pragma(inline, false) is there to prevent the compiler from
removing the assignment in the loop.
Nov 05 2015
next sibling parent reply rsw0x <anonymous anonymous.com> writes:
On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle 
wrote:
 On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle 
 wrote:
 [...]
 I reduced it further: [...]

These run at the exact same speed for me and produce identical
assembly output, from a quick glance.

dmd 2.069, -O -release -inline
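
(For anyone wanting to repeat the comparison: one way to dump the
generated code on Linux, assuming binutils is installed; on
Windows, Digital Mars' obj2asm tool does the same for OMF
objects:)

dmd -O -release -inline -c bug.d
objdump -d bug.o > bug.asm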
Nov 05 2015
parent reply TheFlyingFiddle <borin.lukas gmail.com> writes:
On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
 On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle 
 wrote:
 On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle 
 wrote:
 [...]
 I reduced it further: [...]

 These run at the exact same speed for me and produce identical
 assembly output, from a quick glance.

 dmd 2.069, -O -release -inline
Are you running on Windows? I tested on Windows x64, and there I
also get the exact same speed for both functions.
Nov 05 2015
next sibling parent rsw0x <anonymous anonymous.com> writes:
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
 On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
 On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle 
 wrote:
 On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle 
 wrote:
 [...]
 I reduced it further: [...]

 These run at the exact same speed for me and produce identical
 assembly output, from a quick glance.

 dmd 2.069, -O -release -inline

 Are you running on Windows? I tested on Windows x64, and there
 I also get the exact same speed for both functions.
Linux x86-64
Nov 05 2015
prev sibling parent reply steven kladitis <steven_kladitis yahoo.com> writes:
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
 On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
 On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle 
 wrote:
 On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle 
 wrote:
 [...]
 I reduced it further: [...]

 These run at the exact same speed for me and produce identical
 assembly output, from a quick glance.

 dmd 2.069, -O -release -inline

 Are you running on Windows? I tested on Windows x64, and there
 I also get the exact same speed for both functions.
I am still disappointed that DMD is not native 64-bit on Windows
yet. Please show exactly how you are getting 64-bit to work on
Windows 10; I have never gotten this to work, for any version of
DMD.

All of my new $400.00 systems are 4 GB, 64-bit Windows 10
machines... and the processor instruction sets are very nice. I
dabble in assembler, and I have always wondered why D does not
take advantage of newer instructions... and 64 bit. I'd love to
see a 64-bit 'droid compiler for D. :):):)
Nov 06 2015
parent BBaz <bb.temp gmx.com> writes:
On Saturday, 7 November 2015 at 03:18:59 UTC, steven kladitis 
wrote:
 [...]
 I am still disappointed that DMD is not native 64 bit in 
 windows yet.
 [...]
It's because they can't make a nice distribution. DMD Win32 is a
nice package that works out of the box (compiler, standard C
lib, standard D lib, linker, etc.) without any further
configuration or requirements. DMD Win64 requires MSVS for the
standard C lib and the linker.
Nov 06 2015
prev sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
Ok, benchA and benchB have the same assembler code generated.
However, I _can_ reproduce the slowdown, albeit on average only
20%-40%, not a factor of 10.

It turns out that it's always the first tested function that's
slower. You can test this by switching benchA and benchB in the
call to benchmark(). I suspect the reason is that the OS is
paging in the code the first time, and we're actually seeing the
cost of the page fault. If you run a second round of benchmarks
after the first one, that one shows more or less the same
performance for both functions.
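
One way to rule that out (a minimal sketch against the code
posted above): do an untimed warm-up round first, so the measured
round no longer pays the first-touch cost:

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    // Warm-up pass: fault the code in, discard the timings.
    benchmark!(benchA, benchB)(1);

    // Measured pass.
    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
    return 0;
}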
Nov 06 2015
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Friday, 6 November 2015 at 11:37:22 UTC, Marc Schütz wrote:
 Ok, benchA and benchB have the same assembler code generated.
 However, I _can_ reproduce the slowdown, albeit on average only
 20%-40%, not a factor of 10.
Forgot to add that this is on Linux x86_64, so that probably explains the difference.
 It turns out that it's always the first tested function that's
 slower. You can test this by switching benchA and benchB in the
 call to benchmark(). I suspect the reason is that the OS is
 paging in the code the first time, and we're actually seeing
 the cost of the page fault. If you run a second round of
 benchmarks after the first one, that one shows more or less the
 same performance for both functions.
Nov 06 2015
parent TheFlyingFiddle <borin.lukas gmail.com> writes:
On Friday, 6 November 2015 at 11:38:29 UTC, Marc Schütz wrote:
 On Friday, 6 November 2015 at 11:37:22 UTC, Marc Schütz wrote:
 Ok, benchA and benchB have the same assembler code generated.
 However, I _can_ reproduce the slowdown, albeit on average only
 20%-40%, not a factor of 10.

 Forgot to add that this is on Linux x86_64, so that probably
 explains the difference.
 It turns out that it's always the first tested function that's
 slower. You can test this by switching benchA and benchB in
 the call to benchmark(). I suspect the reason is that the OS
 is paging in the code the first time, and we're actually
 seeing the cost of the page fault. If you run a second round
 of benchmarks after the first one, that one shows more or less
 the same performance for both functions.
I tested swapping the functions around on Windows x86, and I
still get the same slowdown with the default initializer. On
Windows x64, both functions still run at basically the same
speed.

Interestingly enough, the slowdown disappears if I add another
float variable to the structs. This causes the assembly to
change to using different instructions, so I guess that is why.
Also, it only seems to affect small structs with floats in them;
if I change the members to int, both versions run at the same
speed on x86 as well.
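
(Concretely, the variants that no longer showed the slowdown
looked like this; the names A5 and AI are just for illustration:)

struct A5 { float x, y, z, w, v; } // one extra float: same speed
struct AI { int x, y, z, w; }      // ints instead of floats: same speed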
Nov 06 2015
prev sibling parent BBasile <bb.temp gmx.com> writes:
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle 
wrote:
 [...]
 I solved the problem by changing the struct to look like this.
 align(16) struct Pos
 {
     float x = float.nan;
     float y = float.nan;
     float z = float.nan;
     float w = float.nan;
 }
Wow, that's quite strange. FP members should be initialized (to
float.nan) even without an explicit initializer! E.g. you should
get the same with:

align(16) struct Pos
{
    float x, y, z, w;
}
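
A one-liner to verify that default (a sketch; it relies on nan
being the only value that compares unequal to itself):

static assert(Pos.init.x != Pos.init.x); // holds only for nan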
Nov 05 2015
prev sibling parent reply arGus <mailinator mailinator.com> writes:
I did some testing on Linux and Windows.
I ran the code with ten times the iterations, and found the 
results consistent with what has previously been observed in this 
thread.
The code seems to run just fine on Linux, but is slowed down 10x 
on Windows x86.


Windows (32-bit)

rdmd bug.d -inline -boundscheck=off -release
Default:  TickDuration(14398890)
Explicit: TickDuration(168888)


Linux (64-bit)

rdmd bug.d -m64 -inline -boundscheck=off
Default:  TickDuration(59090876)
Explicit: TickDuration(49529493)


Linux (32-bit)

rdmd bug.d -inline -boundscheck=off
Default:  TickDuration(58882306)
Explicit: TickDuration(49231968)
Nov 06 2015
parent rsw0x <anonymous anonymous.com> writes:
On Friday, 6 November 2015 at 17:55:47 UTC, arGus wrote:
 I did some testing on Linux and Windows.
 I ran the code with ten times the iterations, and found the 
 results consistent with what has previously been observed in 
 this thread.
 The code seems to run just fine on Linux, but is slowed down 
 10x on Windows x86.


 Windows (32-bit)

 rdmd bug.d -inline -boundscheck=off -release
 Default:  TickDuration(14398890)
 Explicit: TickDuration(168888)


 Linux (64-bit)

 rdmd bug.d -m64 -inline -boundscheck=off
 Default:  TickDuration(59090876)
 Explicit: TickDuration(49529493)


 Linux (32-bit)

 rdmd bug.d -inline -boundscheck=off
 Default:  TickDuration(58882306)
 Explicit: TickDuration(49231968)
File a bug report; this probably needs Walter to look at it.
Nov 06 2015