
digitalmars.D.bugs - [Issue 2278] New: Guarantee alignment of stack-allocated variables on x86

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278

           Summary: Guarantee alignment of stack-allocated variables on x86
           Product: D
           Version: 1.034
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: clugdbug yahoo.com.au


Use of SSE instructions in 32-bit Windows is problematic, since Windows and the
C calling convention only align the stack to 4 bytes, not 8.
It's too late for C and C++ to fix this problem. But D still has a chance, with
a simple addition to the ABI...

Insert the following line into the spec:
D functions must be called with a stack aligned to an 8 byte boundary.

And how to implement this:
(1) whenever a D function is called, insert a 'push EBP'/'pop EBP' around it,
if there is an odd number of (pushed arguments + pushed registers so far
in this function). Note that this applies to invoking a delegate, too.
(EBP is the best register to use, since it's guaranteed to be preserved, and
it's almost certainly been used recently. On Intel CPUs this means it won't
cause a register read stall).
(2) if local variables are created, make sure that the frame allocates an even
number of DWORDs. (Create an unused local int, if necessary).
(3) extern() functions need stack alignment code at the top of them, since they
could be called from other languages, with the wrong stack alignment. Here's an
example.
---
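void main()
{
    asm {
        naked;
        push EBP;                // save the caller's frame pointer
        mov EBP, ESP;            // remember the unaligned stack pointer
        and ESP, 0xFFFF_FFC0;    // align to a 64 byte boundary.
        call alignedmain;        // the real entry point, defined elsewhere
        mov ESP, EBP;            // restore the original stack pointer
        pop EBP;                 // restore the caller's frame pointer
        ret;
    }
}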
---
(4) alloca() also needs to ensure that it allocates an even number of DWORDs.

Note that a clever compiler could play games with the frame pointer to
eliminate the (tiny -- approx 1.5 cycles) overhead of (1) in almost all cases.
(eg, by converting one of the 'push reg's into 'mov [EBP+xx], reg' ).

The important thing to note about this solution (compared to using step (3)
everywhere) is that it has lower overhead, and means that the innermost
functions, which are most likely to need stack alignment, don't need to
manually align it. Also note that when the number of pushed DWORDs is already
even, the overhead is _zero_.
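
Steps (2) and (4) above both come down to rounding a byte count up to an even
number of DWORDs. A minimal sketch of that arithmetic in D; the name
roundToEvenDwords is made up for illustration and is not part of DMD or the
spec:

```d
import std.stdio : writeln;

enum size_t dwordSize = 4; // size of one x86 stack slot in bytes

/// Rounds a byte count up to an even number of DWORDs (a multiple of 8 bytes).
/// Hypothetical helper, shown only to illustrate steps (2) and (4).
size_t roundToEvenDwords(size_t bytes) pure nothrow @nogc @safe
{
    immutable dwords = (bytes + dwordSize - 1) / dwordSize; // whole DWORDs needed
    immutable evenDwords = dwords + (dwords & 1);           // one padding DWORD if odd
    return evenDwords * dwordSize;
}

void main()
{
    writeln(roundToEvenDwords(12)); // 3 DWORDs are padded to 4, so 16 bytes
    writeln(roundToEvenDwords(8));  // already 2 DWORDs, stays 8 bytes
}
```

The same helper would apply to a frame size or to the size passed to alloca().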


-- 
Aug 11 2008
next sibling parent reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278






This looks like a broad change for a particular case. The particular case is
short numeric arrays of constant size (because those get stack-allocated). So
why not have the compiler align only those at 8-byte boundaries and leave
everything else alone?

Copy semantics for constant-size arrays will certainly help too.


-- 
Aug 11 2008
parent Don <nospam nospam.com.au> writes:
d-bugmail puremagic.com wrote:
 http://d.puremagic.com/issues/show_bug.cgi?id=2278
 
 
 
 
 

 This looks like a broad change for a particular case. The particular case is
 short numeric arrays of constant size (because those get stack-allocated). So
 why not have the compiler align only those at 8-byte boundaries and leave
 everything else alone?
Yes, it would be possible to align only those functions which use arrays, or
large structures. But, note that
(a) it's relevant for _any_ usage of SSE instructions, not just array
operations. Many C++ compilers are using SSE in place of general-purpose
registers.
(b) It also makes a big difference to the speed of memcpy/memmove, even when no
vector instructions are used. In some cases, it also speeds up floating point
operations on 'real' operands; and
(c) as Walter notes, the procedure for aligning a stack frame is quite clumsy.
(d) if you want pass-by-value for constant-size arrays, you need to align them,
too, and that is only possible by doing this kind of padding of the stack.
 Copy semantics for constant-size arrays will certainly help too.
Yes.

======
A quote from Agner Fog's assembly programming manual:
---
All 64-bit operating systems, and some 32-bit operating systems (Mac OS and
later versions of Linux) keep the stack aligned by 16 at all CALL instructions.
This eliminates the need for the AND instruction and the frame pointer. It is
necessary to propagate this alignment from one CALL instruction to the next by
proper adjustment of the stack pointer in each function.
---
It's really a much nicer solution than multiple frame pointers.
Aug 12 2008
prev sibling next sibling parent reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278






Keeping the stack always aligned is not that simple. The code generator will
also push/pop register pairs when it runs out of them.

Probably the most practical approach is to align static arrays by using the
code to AND the ESP register, but this means that there will be two frame
pointers for the function. Ug.
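
For reference, the AND on ESP just clears the low bits of the stack pointer.
Since the stack grows downwards, rounding the address down reserves the padding.
A small sketch of the masking in D; alignDown is a hypothetical name, and this
shows only the arithmetic, not what DMD emits:

```d
import std.stdio : writefln;

/// Clears the low bits of an address so it becomes a multiple of `alignment`.
/// `alignment` must be a power of two (hypothetical helper for illustration).
size_t alignDown(size_t addr, size_t alignment) pure nothrow @nogc @safe
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0,
           "alignment must be a power of two");
    return addr & ~(alignment - 1);
}

void main()
{
    // Same effect as 'and ESP, 0xFFFF_FFF0' on a 32-bit stack pointer.
    writefln("%X", alignDown(0x0012FF7C, 16));
}
```

The original ESP has to be kept somewhere to undo this, which is where the
second frame pointer comes from.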


-- 
Aug 11 2008
parent Don <nospam nospam.com.au> writes:
d-bugmail puremagic.com wrote:
 http://d.puremagic.com/issues/show_bug.cgi?id=2278
 
 
 
 
 

 Keeping the stack always aligned is not that simple. The code generator will
 also push/pop register pairs when it runs out of them.
Yes, that's why I said it needs an extra push if and only if (pushed arguments
+ pushed registers so far in this function) is odd. The code generator needs a
counter which is incremented for every push, and decremented for every pop.
This counter should be consulted before generating a function call.
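
A rough sketch of such a counter, written in D. PushCounter and its members are
hypothetical names for illustration only and do not correspond to anything in
DMD's code generator:

```d
import std.stdio : writeln;

/// Hypothetical bookkeeping for one function being compiled.
struct PushCounter
{
    int depth; // DWORDs currently pushed in this function

    void onPush() { ++depth; }                    // call for every emitted push
    void onPop()  { assert(depth > 0); --depth; } // call for every emitted pop

    /// Consulted just before emitting a call that pushes `argDwords` arguments:
    /// true if one extra padding push is needed to keep the total even.
    bool callNeedsPadding(int argDwords) const
    {
        return ((depth + argDwords) & 1) != 0;
    }
}

void main()
{
    PushCounter c;
    c.onPush();                     // e.g. a register saved by the code generator
    writeln(c.callNeedsPadding(2)); // 1 + 2 DWORDs is odd, so padding is needed
    c.onPop();
    writeln(c.callNeedsPadding(2)); // 0 + 2 DWORDs is even, no padding
}
```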
 
 Probably the most practical approach is to align static arrays by using the
 code to AND the ESP register, but this means that there will be two frame
 pointers for the function. Ug.
Aug 12 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


shro8822 vandals.uidaho.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |shro8822 vandals.uidaho.edu






IIRC there is an x86 instruction pair (enter/leave?) that moves the top of the
stack in a way that can be undone. If that allows non-literal arguments, a pair
of these around the scope would do it.

offset = FP
offset += ENTER_META_DATA.sizeof
offset &= 0x0f
offset -= ENTER_META_DATA.sizeof

enter offset // push offset space and some metadata
..... scope
leave // pop it all off


-- 
Aug 11 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278






enter & leave are just simple sugar for pushing and popping EBP or whatever.
If you can do it with enter & leave, you can do it simply by replacing them
with pushing & popping EBP.

align(8) void func()  // make sure the stack align to 8
{
}

void func(){} // align to 4 , this might be useful to cut the use of the stack.

Aligning to 8 for everything might leave a lot of stack memory unused (but I'm
not sure about this).

With the instructions mentioned by W, it should be a fair enough trade-off
between runtime efficiency & stack memory usage.


-- 
Aug 12 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278






The problem with entering a function and then aligning the stack is that the
code in the function can no longer access the function parameters with a known
offset.

Probably the best approach to this is to do the equivalent to alloca() -
allocate the aligned data on the stack separately, and store a pointer to it in
the regular stack frame. The compiler can sugar over all this.
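
Done by hand, that approach looks roughly like the sketch below: over-allocate
a local buffer, then keep an aligned pointer into it as an ordinary local.
alignUp is a made-up helper name, and this is only an illustration of the idea,
not what the compiler would generate:

```d
import core.stdc.stdio : printf;

/// Rounds a pointer up to the next multiple of `alignment` (a power of two).
/// The buffer it points into needs at least alignment - 1 spare bytes.
void* alignUp(void* p, size_t alignment) nothrow @nogc
{
    immutable addr = cast(size_t) p;
    return cast(void*)((addr + alignment - 1) & ~(alignment - 1));
}

void main()
{
    enum size_t alignment = 16;

    // Stack buffer for four floats plus enough slack to find an aligned start.
    ubyte[4 * float.sizeof + alignment - 1] raw;

    // The aligned pointer lives in the regular stack frame.
    float* v = cast(float*) alignUp(raw.ptr, alignment);
    v[0 .. 4] = 1.0f;

    printf("%u\n", cast(uint)(cast(size_t) v % alignment)); // prints 0
}
```

The cost is the wasted slack bytes and one extra indirection per access.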


-- 
Aug 12 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |baryluk smp.if.uj.edu.pl



*** Issue 1847 has been marked as a duplicate of this issue. ***

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 15 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278




In D2, on entering main() the stack may or may not be aligned to 8 bytes,
depending on the length of the command line with which the program was run.
This may cause as much as a 2x difference in run time for no apparent reason.
(Lack of alignment is a pity, but this particular case is plainly confusing).

Example. Run with different command lines, for example with and without the
extension.

import core.stdc.stdio: printf;
import std.date: getUTCtime, ticksPerSecond;
void main() {
    double d = 0.0;
    auto t0 = getUTCtime();
    for (size_t i = 0; i < 100_000_000; i++)
        d += 1;
    auto t1 = getUTCtime();
    printf("%lf\n", d);
    printf("%u\n", (cast(size_t)&d) % 8);
    printf("%lf\n", (cast(double)t1 - cast(double)t0) / ticksPerSecond);
}

Also this code shows that inside a frame variables are placed as if stack
alignment was expected. (note that a & d are either both aligned on 8 or both
unaligned)

import core.stdc.stdio: printf;

void main() {
    int a;
    double d;
    printf("%X:%u %X:%u\n", &a, (cast(size_t)&a) % 8, &d, (cast(size_t)&d) % 8);
}

Also +1 for some way to have locals aligned, be it explicit align(n) before
declaration of var, or before function (I like this one), or throughout whole
program.

-- 
Dec 17 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


Benjamin Thaut <code benjamin-thaut.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code benjamin-thaut.de



SSE is getting more and more important in performance-critical applications.
There should be at least one way to make sure that a certain variable that is
being allocated on the stack is aligned. I recently came across this issue in
D2.

-- 
Sep 21 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


Manu <turkeyman gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |turkeyman gmail.com



I'm at the point where I can't reasonably work around this issue anymore.

It's not just for SSE (although that is one very important case); there are
also structures that encapsulate SSE variables (16 byte), structures that must
be L1 line aligned (64/128 bytes), structures that must be GPU page aligned
(4k-ish), virtual page alignment, and occasionally other alignments are
required (for instance, in one case an algorithm's performance was nearly
doubled by aligning to 256 bytes, and squatting a byte of data in the unused
low bits of the pointer).

Structure alignment is really really important, and it's very annoying to
work around (and often wastes memory in doing so).

As we did with 256-bit vectors, can we define the grammar for attributing a
struct with an alignment? Then GDC/LDC can hook it straight up, and DMD can
produce an unsupported message for the time being.
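
To make the 256-byte case concrete: an address that is a multiple of 256 has
its low 8 bits free, so a byte of data can ride along in them. A hedged sketch
in D; the function names are invented for this illustration and the addresses
are plain integers rather than real allocations:

```d
import std.stdio : writefln;

/// Stores `tag` in the low 8 bits of a 256-byte-aligned address.
size_t packTagged(size_t alignedAddr, ubyte tag) pure nothrow @nogc @safe
{
    assert((alignedAddr & 0xFF) == 0, "address must be 256-byte aligned");
    return alignedAddr | tag;
}

/// Recovers the original aligned address from a packed value.
size_t addressOf(size_t packed) pure nothrow @nogc @safe
{
    return packed & ~cast(size_t) 0xFF;
}

/// Recovers the byte stored in the low bits of a packed value.
ubyte tagOf(size_t packed) pure nothrow @nogc @safe
{
    return cast(ubyte)(packed & 0xFF);
}

void main()
{
    immutable packed = packTagged(0x1000_0100, 0x2A);
    writefln("%X %X %X", packed, addressOf(packed), tagOf(packed));
}
```

This only works if the allocator or the compiler can actually guarantee the
alignment, which is the point of the request.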

-- 
May 24 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc



In DMD 2.060beta this problem seems partially solved, for structs:


import core.stdc.stdio: printf;
align(16) struct Foo { ubyte u; }
// struct Foo { ubyte u; } // try this
void main() {
    Foo f1;
    ubyte[3] b1;
    Foo f2;
    ubyte[5] b2;
    Foo f3;
    ubyte[7] b3;
    short s1;
    Foo f4;
    printf("%u\n", cast(size_t)&f1 % 16);
    printf("%u\n", cast(size_t)&f2 % 16);
    printf("%u\n", cast(size_t)&f3 % 16);
    printf("%u\n", cast(size_t)&f4 % 16);
}

Output:
0
0
0
0



But this syntax is not supported yet:

void main() {
    align(16) ubyte u;
}

-- 
Jul 24 2012
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=2278


Temtaime <temtaime gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |temtaime gmail.com



BUMP.

Is this a regression in 2.063.2?

import core.stdc.stdio: printf;
align(16) struct Foo { ubyte u; }
// struct Foo { ubyte u; } // try this
void main() {
    Foo f1;
    ubyte[3] b1;
    Foo f2;
    ubyte[5] b2;
    Foo f3;
    ubyte[7] b3;
    short s1;
    Foo f4;
    printf("%u\n", cast(size_t)&f1 % 16);
    printf("%u\n", cast(size_t)&f2 % 16);
    printf("%u\n", cast(size_t)&f3 % 16);
    printf("%u\n", cast(size_t)&f4 % 16);
}

Output:
8
8
8
8

-- 
Aug 15 2013