D.gnu - -O2/-O3 Optimization bug?

"Mike" <none none.com> writes:
Hello again,

I'm continuing my work on an ARM Cortex-M port of the D Runtime.  
I now have a repository 
(https://github.com/JinShil/D_Runtime_ARM_Cortex-M_study) and a 
wiki 
(https://github.com/JinShil/D_Runtime_ARM_Cortex-M_study/wiki/1.0-Introduction) 
for anyone interested.  I'm doing my best to document the entire 
process.

I tried playing with GDC/GCC optimizations recently, and noticed 
that -O2 and -O3 break the following simple code from my "Hello World" 
experiment 
(http://wiki.dlang.org/Extremely_minimal_semihosted_%22Hello_World%22):

void OnReset()
{
   while(true)
   {
     // Create semihosting message
     uint[3] message =
       [
         2,                            //stderr
         cast(uint)"hello\r\n".ptr,    //ptr to string
         7                             //size of string
       ];

     //Send semihosting command
     SendCommand(0x05, &message);
   }
}

Compiling with...
   arm-none-eabi-gdc -O1 start.d -o start.o
... works fine, but compiling with...
   arm-none-eabi-gdc -O2 start.d -o start.o
... or ...
   arm-none-eabi-gdc -O3 start.d -o start.o
... does not.

I traced this down to the -finline-small-functions and 
-fipa-cp-clone options, so if I compile with...
   arm-none-eabi-gdc -O2 -fno-inline-small-functions start.d -o start.o
... or ...
   arm-none-eabi-gdc -O3 -fno-inline-small-functions -fno-ipa-cp-clone start.d -o start.o
... it works fine.

Comparing the assembly generated with...
   arm-none-eabi-gdc -O1 start.d -o start.o
... and ...
   arm-none-eabi-gdc -O2 start.d -o start.o
... I can see that the "hello\r\n" string constant vanishes from 
the assembly file with the -O2 option.

"So what's the question, Mike?" I hear you say:
1.  Is this just one of the consequences of using -O2/-O3, and I 
should just suck it up and deal with it?
2.  Is this potentially a bug in the GCC backend?
3.  Is this potentially a bug in GDC or the DMD frontend?

Thanks for the help,
Mike
Jan 21 2014
"Mike" <none none.com> writes:
On Wednesday, 22 January 2014 at 00:28:34 UTC, Mike wrote:
 "So what's the question, Mike?" I hear you say:
 1.  Is this just one of the consequences of using -O2/-O3, and 
 I should just suck it up and deal with it?
 2.  Is this potentially a bug in the GCC backend?
 3.  Is this potentially a bug in GDC or the DMD frontend?

I always forget to add the most important piece of information: I'm using the GDC 4.8 branch, back-ported at the beginning of the year and compiled for arm-none-eabi.
Jan 21 2014
Iain Buclaw <ibuclaw gdcproject.org> writes:
On 22 January 2014 00:28, Mike <none none.com> wrote:
 "So what's the question, Mike?" I hear you say:
 1.  Is this just one of the consequences of using -O2/-O3, and I should just
 suck it up and deal with it?
 2.  Is this potentially a bug in the GCC backend?
 3.  Is this potentially a bug in GDC or the DMD frontend?

Personally, I would never use -O3 for low-level start.o kernel stuff. As you are coding on a small board, wouldn't you use -Os instead?
Jan 22 2014
Johannes Pfau <nospam example.com> writes:
Am Wed, 22 Jan 2014 00:28:32 +0000
schrieb "Mike" <none none.com>:

 
 "So what's the question, Mike?" I hear you say:
 1.  Is this just one of the consequences of using -O2/-O3, and I 
 should just suck it up and deal with it?
 2.  Is this potentially a bug in the GCC backend?
 3.  Is this potentially a bug in GDC or the DMD frontend?
 
 Thanks for the help,
 Mike

I can only guess, but this looks like another 'volatile' problem. You'd
have to post the ASM of the optimized version somewhere, and probably
the output of -fdump-tree-optimized for the optimized version.

But anyway, I guess it inlines 'SendCommand' and then thinks you're not
using the message and probably completely optimizes the call away. Then
it sees you're never using message and removes the rest of your code.
If SendCommand was written in D you'd have to mark the target of the
copy volatile (or shared). But I'm not sure how this applies to the
inline asm though. In C you have asm volatile, but I never used that.
This answer seems to state that you have to use asm volatile:
http://stackoverflow.com/a/5057270/471401

So the questions for Iain:
* should we mark all inline ASM blocks as volatile?
* shared can't replace volatile in this case as `shared asm{...}` isn't
  valid
* Should we add some GDC specific way to mark extended ASM blocks as
  volatile? As DMD doesn't optimize ASM blocks at all there's probably
  no need for a standard solution?
Jan 22 2014
Iain Buclaw <ibuclaw gdcproject.org> writes:
On 22 January 2014 15:03, Johannes Pfau <nospam example.com> wrote:
 Am Wed, 22 Jan 2014 00:28:32 +0000
 schrieb "Mike" <none none.com>:

 "So what's the question, Mike?" I hear you say:
 1.  Is this just one of the consequences of using -O2/-O3, and I
 should just suck it up and deal with it?
 2.  Is this potentially a bug in the GCC backend?
 3.  Is this potentially a bug in GDC or the DMD frontend?

 Thanks for the help,
 Mike

 So the questions for Iain:
 * should we mark all inline ASM blocks as volatile?
 * shared can't replace volatile in this case as `shared asm{...}` isn't
   valid
 * Should we add some GDC specific way to mark extended ASM blocks as
   volatile? As DMD doesn't optimize ASM blocks at all there's probably
   no need for a standard solution?

We already do (ExtAsmStatement::toIR -> ASM_VOLATILE_P (exp) = 1;)

Regards
Iain
Jan 22 2014
"Mike" <none none.com> writes:
On Wednesday, 22 January 2014 at 15:03:49 UTC, Johannes Pfau 
wrote:
 Am Wed, 22 Jan 2014 00:28:32 +0000
 schrieb "Mike" <none none.com>:

 
 "So what's the question, Mike?" I hear you say:
 1.  Is this just one of the consequences of using -O2/-O3, and 
 I should just suck it up and deal with it?
 2.  Is this potentially a bug in the GCC backend?
 3.  Is this potentially a bug in GDC or the DMD frontend?
 
 Thanks for the help,
 Mike

 I can only guess, but this looks like another 'volatile' problem. You'd
 have to post the ASM of the optimized version somewhere, and probably
 the output of -fdump-tree-optimized for the optimized version.

 But anyway, I guess it inlines 'SendCommand' and then thinks you're not
 using the message and probably completely optimizes the call away. Then
 it sees you're never using message and removes the rest of your code.
 If SendCommand was written in D you'd have to mark the target of the
 copy volatile (or shared).

Thanks for the response, Johannes. Defining message as "shared uint[3] message" and defining SendCommand as "void SendCommand(int command, shared void* message)" did the trick.
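
For readers following along, the change described above might look roughly like the sketch below. It is not Mike's actual file: the body of SendCommand is reconstructed from the asm template and operands shown in the -fdump-tree-optimized output posted later in the thread, the clobber list is simplified to "r0", "r1", and the code assumes GDC's extended asm syntax on an ARM Cortex-M target. It also relies on this GDC 4.8 build treating shared much like volatile, as the quoted reply suggests.

```d
// Sketch only: needs GDC (GCC-style extended asm) and an ARM Thumb target.
// The asm body is reconstructed from the tree dump, not copied from start.d.
void SendCommand(int command, shared void* message)
{
    asm
    {
        "mov r0, %[cmd]; mov r1, %[msg]; bkpt #0xAB"
        :                                        // no outputs
        : [cmd] "r" command, [msg] "r" message   // inputs
        : "r0", "r1";                            // clobbered registers
    }
}

void OnReset()
{
    while (true)
    {
        // 'shared' keeps the optimizer from dropping the initialization
        // (assumption: shared is handled like volatile in this GDC build)
        shared uint[3] message =
        [
            2,                            // stderr
            cast(uint)"hello\r\n".ptr,    // ptr to string
            7                             // size of string
        ];

        // Send semihosting command
        SendCommand(0x05, &message);
    }
}
```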
Jan 22 2014
"Mike" <none none.com> writes:
On Wednesday, 22 January 2014 at 13:08:53 UTC, Iain Buclaw wrote:
 Personally, I would never use -O3 for low level start.o kernel 
 stuff.

In my simple D program, however, -O2 also doesn't work.
 As you are coding on a small board, wouldn't you instead use 
 -Os ?

I sometimes use -Os and sometimes use -O2/-O3. If I'm controlling something low-speed like a refrigerator or other kitchen appliance, I use -Os so I can use the cheapest chip available. However, for my current project, I'm making an HMI/industrial controller. The HMI will have a software-rendered graphics engine with vector graphics, alpha blending, TrueType fonts, etc. I've already built this in C++, and -O2/-O3 made a very significant difference in my performance benchmarks. I didn't notice any difference between -O2 and -O3, though. It uses about 700KB of Flash memory, and most of that is the TrueType font data, so I'm quite satisfied with my C++ results.

Interestingly, since I started using GCC 4.8 in my C++ project, -O3 breaks my memset function, but -O2 does not, so I'm sticking with -O2 at the moment. With GCC 4.7, -O3 worked fine.

If you can't see any error in my D code and the compiler and optimizer are working properly, shouldn't my program work at these optimization levels without resorting to special qualifiers like shared/volatile?

NOTE: I'll post assembly and the optimization tree when I get home from work today.
Jan 22 2014
"Mike" <none none.com> writes:
On Wednesday, 22 January 2014 at 15:03:49 UTC, Johannes Pfau 
wrote:
 I can only guess, but this looks like another 'volatile' 
 problem. You'd
 have to post the ASM of the optimized version somewhere, and 
 probably
 the output of -fdump-tree-optimized for the optimized version.

Here's the output with -fdump-tree-optimized
*******
;; Function start.OnReset (OnReset, funcdef_no=1, decl_uid=3544, cgraph_uid=1) (executed once)

start.OnReset ()
{
  uint message[3];

  <bb 2>:

  <bb 3>:
  __asm__ __volatile__("mov r0, %[cmd]; mov r1, %[msg]; bkpt #0xAB" : : "cmd" "r" 5, "msg" "r" &message : "r0", "r1", "r1");

  <bb 4>:
  goto <bb 3>;

}

;; Function start.SendCommand (_D5start11SendCommandFiPvZv, funcdef_no=0, decl_uid=3545, cgraph_uid=0)

start.SendCommand (int command, void * message)
{
  <bb 2>:
  __asm__ __volatile__("mov r0, %[cmd]; mov r1, %[msg]; bkpt #0xAB" : : "cmd" "r" command_1(D), "msg" "r" message_2(D) : "r0", "r1", "r1");
  return;

}
*******

Here's the output of the unoptimized version
*******
;; Function start.SendCommand (_D5start11SendCommandFiPvZv, funcdef_no=0, decl_uid=3545, cgraph_uid=0)

start.SendCommand (int command, void * message)
{
  <bb 2>:
  __asm__ __volatile__("mov r0, %[cmd]; mov r1, %[msg]; bkpt #0xAB" : : "cmd" "r" command_1(D), "msg" "r" message_2(D) : "r0", "r1", "r1");
  return;

}

;; Function start.OnReset (OnReset, funcdef_no=1, decl_uid=3544, cgraph_uid=1)

start.OnReset ()
{
  uint message[3];
  <unnamed type> D.3562;
  <unnamed type> _1;

  <bb 2>:
  message = *.LC1;

  <bb 3>:
  _1 = 0;
  if (_1 != 0)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:
  start.SendCommand (5, &message);
  goto <bb 3>;

  <bb 5>:
  message ={v} {CLOBBER};
  return;

}
********

The __asm__ __volatile__ seems to indicate Iain is right.  Notice
the message = *.LC1 in the unoptimized version, but not the
optimized version.  This is the first time I've seen this kind of
output, so can you decipher what's going on?

And here's the optimized assembly.  I'm not sure how to do this,
so I used -fverbose-asm -Wa,-adhln
**********
http://pastebin.com/NY2PNWzS
**********

And here's the unoptimized assembly for comparison
**********
http://pastebin.com/hbThtCsP
**********

Thanks for taking the time.
Mike
Jan 23 2014
Johannes Pfau <nospam example.com> writes:
Am Thu, 23 Jan 2014 11:30:41 +0000
schrieb "Mike" <none none.com>:

 
 The __asm__ __volatile__ seems to indicate Iain is right.  Notice 
 the message = *.LC1 in the unoptimized version, but not the 
 optimized version.  This is the first time I've seen this kind of 
 output, so can you decipher what's going on?

We can see a few things in that output:
* The function really got inlined
* The ASM is still there and marked as volatile
* In the optimized version, the uint[3] message variable is still
  there, but it's not initialized (message = *.LC1 is pseudo code for
  'initialize message on the stack with the data stored at .LC1')

 And here's the optimized assembly.  I'm not sure how to do this, 
 so I used -fverbose-asm -Wa,-adhln
 **********
 http://pastebin.com/NY2PNWzS
 **********

Nice, I always used -S but this output is better of course ;-)

I think what could be happening here is that GCC doesn't know what
memory you're accessing via the message pointer in SendCommand.

See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
and search for "If your assembler instructions access memory in an
unpredictable fashion"

Maybe typing "message" as uint* or uint[3]* instead of void* is already
good enough. Otherwise try using a memory input as described on that
page.
Jan 23 2014
Johannes Pfau <nospam example.com> writes:
Am Wed, 22 Jan 2014 16:49:54 +0000
schrieb Iain Buclaw <ibuclaw gdcproject.org>:

 
 We already do (ExtAsmStatement::toIR -> ASM_VOLATILE_P (exp) = 1;)
 
 Regards
 Iain

I guess I should have looked that up before posting wild speculations ;-)
Jan 23 2014
"Mike" <none none.com> writes:
On Thursday, 23 January 2014 at 16:56:19 UTC, Johannes Pfau wrote:
 I think what could be happening here is that GCC doesn't know 
 what
 memory you're accessing via the message pointer in SendCommand.

 See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
 and search for "If your assembler instructions access memory in 
 an
 unpredictable fashion"

 Maybe typing "message" as uint* or uint[3]* instead of void* is 
 already
 good enough. Otherwise try using a memory input as described on 
 that
 page.

That appears to be it.  I added "memory" to the clobber list and it
worked without any extra modifiers.  And, my executable is now
only 56 bytes.  It still doesn't do anything useful, but it's
small and optimized :-)

Thanks so much for the information and the help.
Mike
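
A sketch of what that change might look like, assuming GDC's extended asm syntax and the asm template shown in the tree dump earlier in the thread. The exact layout of Mike's source is not shown here, so treat the function body as a reconstruction. The "memory" clobber tells GCC that the asm may read or write memory it cannot see through the operands, so the store that initializes message has to be performed before the asm runs.

```d
// Sketch only: needs GDC (GCC-style extended asm) and an ARM Thumb target.
// The asm body is reconstructed from the tree dump, not copied from start.d.
void SendCommand(int command, void* message)
{
    asm
    {
        "mov r0, %[cmd]; mov r1, %[msg]; bkpt #0xAB"
        :                                        // no outputs
        : [cmd] "r" command, [msg] "r" message   // inputs
        : "r0", "r1", "memory";                  // "memory" is the added clobber
    }
}
```

With this version, message can stay a plain uint[3] and SendCommand can keep its void* parameter, which matches the "without any extra modifiers" remark above.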
Jan 24 2014
Johannes Pfau <nospam example.com> writes:
Am Fri, 24 Jan 2014 11:12:36 +0000
schrieb "Mike" <none none.com>:

 That appears to be it.  I added memory to the clobber list and it 
 worked without any extra modifiers.  And, my executable is now 
 only 56 bytes.  It still doesn't do anything useful, but it's 
 small and optimized :-)
 
 Thanks so much for the information and the help.
 

You're welcome!

BTW: The optimizer generated this code for SendCommand:

    mov r3, r0    command, command
    mov r2, r1    message, message
    mov r0, r3;   command
    mov r1, r2;   message

It's a little bit unfortunate that you can't specify in extended
inline asm that you want a specific register on ARM (on x86 you could
tell gcc to use e.g. eax for this variable). Otherwise you could use
something like

    asm { "bkpt #0xAB" : : [cmd] "r0" command, [msg] "r1" message : ; };

and the compiler could avoid those useless moves.
Jan 24 2014