digitalmars.D.ldc - How to prevent optimizer from reordering stuff?

reply Dan Olson <zans.is.for.cans yahoo.com> writes:
While tracking down std.math problems for ARM, I find that optimizer
will reorder instructions to get FPSCR flags before the divide
operation.

Is there a way to force instruction ordering here?  I tried the
llvm_memory_fence, but it doesn't do the job.

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;

    real x = 1.0 / zero;

    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv is
after both my inline asm and std.math ieeeFlags().

	vldr	d8, [r0]
	  InlineAsm Start
	vmrs	r4, fpscr
	  InlineAsm End
	mov	r0, r5
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
	vmov.f64	d16, #1.000000e+00
	mov	r0, r5
	vdiv.f64	d8, d16, d8

What to do?
--
Dan
Mar 14 2015
next sibling parent reply "David Nadlinger" <code klickverbot.at> writes:
On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
 While tracking down std.math problems for ARM, I find that optimizer
 will reorder instructions to get FPSCR flags before the divide
 operation.
IIRC FP flag/mode support is a tricky topic in LLVM in general, but this
specific problem seems weird. What are the attributes for
__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The optimizer
should never move code across arbitrary function calls…

David
Mar 14 2015
parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
"David Nadlinger" <code klickverbot.at> writes:

 On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
 While tracking down std.math problems for ARM, I find that optimizer
 will reorder instructions to get FPSCR flags before the divide
 operation.
IIRC FP flag/mode support is a tricky topic in LLVM in general, but this
specific problem seems weird. What are the attributes for
__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The optimizer
should never move code across arbitrary function calls…

David
Hi David.

I don't see any attributes for that function.  I will just paste some of
the -output-ll results since nothing sticks out to me.

declare fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(%std.math.IeeeFlags* noalias sret)

define fastcc void @_D10unittester3fooFZv() {
  %flags = alloca %std.math.IeeeFlags, align 4
  %1 = load double* @_D10unittester4zeroe, align 8
  %2 = fdiv double 1.000000e+00, %1
  %3 = tail call i32 asm sideeffect "vmrs $0, fpscr", "=r"() #0
  call fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(%std.math.IeeeFlags* noalias sret %flags)
  %tmp = call fastcc i1 @_D3std4math9IeeeFlags9divByZeroMFNdZb(%std.math.IeeeFlags* %flags)
  %4 = zext i1 %tmp to i32
  %tmp1 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([11 x i8]* @.str12, i32 0, i32 0), double %2, i32 %3, i32 %4)
  ret void
}

The only guess I have right now for this is from:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042e/IHI0042E_aapcs.pdf

  The FPSCR is the only status register that may be accessed by
  conforming code. It is a global register with the following
  properties:

  - The condition code bits (28-31), the cumulative saturation (QC)
    bit (27) and the cumulative exception-status bits (0-4) are not
    preserved across a public interface.
  (snip)

Maybe that means the compiler can say the FPSCR state from my vdiv.f64 is
undefined across function call boundaries, so ordering should not matter?
Mar 14 2015
parent reply David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:
Hi Dan,

On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
 I don't see any attributes for that function.  I will just paste
 some of the -output-ll results since nothing sticks out to me.
Yeah, seems like everything is in order (no pun intended) after the main IR-level optimizer. This suggests that the reordering happens on the target-specific optimization or instruction selection level. I suppose you could try disabling codegen optimizations if you wanted to investigate this further.
 Maybe that means the compiler can say FPSCR state from my vdiv.f64
 is undefined across function call boundaries, so ordering should not
 matter?
This seems like a reasonable guess. Did you try asking on the LLVM IRC
channel or mailing list? Depending on the outcome (i.e. if the ABI is
really to be interpreted that way), we should probably discuss its
implications for D's FP handling strategy on the main D mailing lists.

Best,
David
Mar 15 2015
parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

 Hi Dan,

 On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
 I don't see any attributes for that function.  I will just paste
 some of the -output-ll results since nothing sticks out to me.
Yeah, seems like everything is in order (no pun intended) after the main IR-level optimizer. This suggests that the reordering happens on the target-specific optimization or instruction selection level. I suppose you could try disabling codegen optimizations if you wanted to investigate this further.
It is a good puzzle. For what it is worth, clang does the same thing with similar code.
 Maybe that means the compiler can say FPSCR state from my vdiv.f64
 is undefined across function call boundaries, so ordering should not
 matter?
This seems like a reasonable guess. Did you try asking on the LLVM IRC channel or mailing list? Depending on the outcome (i.e. if the ABI is really to be interpreted that way), we should probably discuss its implications for D's FP handling strategy on the main D mailing lists.
I have not asked elsewhere yet. I'm going to explore the problem a bit more, then ask.
Mar 15 2015
parent Dan Olson <zans.is.for.cans yahoo.com> writes:
Ok, I have stumbled into an old problem it seems.

C99 invented "#pragma STDC FENV_ACCESS ON" to prevent the optimizer from
reordering instructions that affect the floating-point environment.  See
note [2] here:

http://en.wikipedia.org/wiki/C99#Example

And clang (LLVM) does not support this pragma:

https://llvm.org/bugs/show_bug.cgi?id=10409

The workaround in C is to use volatile variables to force ordering.

And one more reference:

http://wiki.musl-libc.org/wiki/Mathematical_Library#Fenv_and_error_handling
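
For anyone who wants to try the volatile workaround from D rather than C,
here is a minimal sketch. It assumes a druntime recent enough to ship
core.volatile, and that std.math's ieeeFlags/resetIeeeFlags work on the
target. The names forceEval and divideRaisesDivByZero are made up for this
illustration and are not part of Phobos or LDC. The idea is the same as
libm's FORCE_EVAL: the volatile store needs the quotient, so the division
should not be sunk below it. Whether a given backend also keeps the flag
read after the store is something to confirm in the generated assembly.

```d
import core.volatile : volatileStore;
import std.math : ieeeFlags, resetIeeeFlags;

// Mutable global, so the compiler cannot constant-fold the division.
double zero = 0.0;

// Hypothetical helper: force x to be computed by writing its bits through
// a volatile store. This costs a real store to the stack.
void forceEval(double x) @nogc nothrow
{
    ulong sink;
    volatileStore(&sink, *cast(ulong*) &x);
}

// Returns whether the divide-by-zero flag is set after 1.0 / zero.
bool divideRaisesDivByZero()
{
    resetIeeeFlags();          // start from a clean set of sticky flags
    double x = 1.0 / zero;     // should raise the divide-by-zero exception flag
    forceEval(x);              // the store depends on x, so the divide comes first
    return ieeeFlags.divByZero;
}
```

The cost is the extra store, which is the drawback of the volatile trick
compared with a solution that keeps the value in a register.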
Mar 15 2015
prev sibling parent Dan Olson <zans.is.for.cans yahoo.com> writes:
Dan Olson <zans.is.for.cans yahoo.com> writes:

 While tracking down std.math problems for ARM, I find that optimizer
 will reorder instructions to get FPSCR flags before the divide
 operation.

 Is there a way to force instruction ordering here?  I tried the
 llvm_memory_fence, but it doesn't do the job.

 real zero = 0.0;

 void foo()
 {
     import std.math, std.c.stdio, ldc.llvmasm;

     real x = 1.0 / zero;

     auto f = __asm!uint("vmrs $0, fpscr", "=r");
     IeeeFlags flags = ieeeFlags();
     printf("%f, %u %d\n", x, f, flags.divByZero);
 }

 Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv is
 after both my inline asm and std.math ieeeFlags().

 	vldr	d8, [r0]
 	  InlineAsm Start
 	vmrs	r4, fpscr
 	  InlineAsm End
 	mov	r0, r5
 	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
 	vmov.f64	d16, #1.000000e+00
 	mov	r0, r5
 	vdiv.f64	d8, d16, d8

 What to do?
I have a solution.  At least it is a start.

Specifying the result of the floating point operation as an argument of an
empty inline asm gives correct ordering.  And it doesn't do any unnecessary
stores like the C volatile trick (FORCE_EVAL macro).

For my use, I wrapped the inline asm in a function "use()" that is
specific to ARM because of the 'w' constraint.  I am thinking it could be
named FORCE_EVAL to align with what is in linux libm and then made general
for other D cpu targets.

void use(T)(T x) @nogc nothrow
{
    import std.traits, ldc.llvmasm;
    static if (isFloatingPoint!(T))
        __asm("", "w", x);  // arm fp reg
    else
        __asm("", "r", x);  // general-purpose reg
}

Compile as before (-O), but with use(x).

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;

    real x = 1.0 / zero;
    use(x);

    // get float flags in arm specific way
    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    // get float flags D way
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Now vdiv.f64 happens before all the flag fetching.

	vmov.f64	d16, #1.000000e+00
	add	r5, sp, #4
	vldr	d17, [r0]
	mov	r0, r5
	vdiv.f64	d8, d16, d17   <------ yeah!
	  InlineAsm Start
	  InlineAsm End
	  InlineAsm Start
	vmrs	r4, fpscr
	  InlineAsm End
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
--
Dan
Mar 15 2015