D.gnu - Improving codegen for ARM Cortex-M

Mike Franklin <slavo5150 yahoo.com> writes:
I've finally succeeded in getting a build of my STM32 ARM 
Cortex-M proof of concept in LDC and GDC, thanks to the recent 
changes in both compilers.  So, I now have a way to compare code 
generation between the two compilers.

The project is extremely simple; it just generates a bunch of 
random rectangles on its small LCD screen.  This is done by 
simply writing to memory in a frame buffer.

Unfortunately, GDC's code executes quite a bit slower than LDC's 
code.  The difference is quite noticeable, as I can see the rate 
of the status LED blinking much slower with GDC than with LDC.

The code to do this is below (I simplified it for this 
discussion, but tested to ensure reproduction of the symptoms.  I 
also did away with the random behavior to remove that variable).

a block of code in main.d
---
uint i = 0;
while(true)
{
     lcd.fillRect(x, y, width, height, color);
     if ((i % 1000) == 0)
     {
         statusLED.toggle();
     }

     i++;
}

in lcd.d
---
@noinline pragma(inline, false) void fillRect(int x, int y, uint 
width, uint height, ushort color)
{
     int y2 = y + height;
     for(int _y = y; _y <= y2; _y++)
     {
         ltdc.fillSpan(x, _y, width, color);
     }
}

from ltdc.d
-----------
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
     int start = y * width + x;
     for(int i = 0; i < spanWidth; i++)
     {
         frameBuffer[start + i] = color;
     }
}
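
For anyone who wants to inspect the same loop structure on a desktop compiler, the three fragments above can be combined into a single host-side file. This is only a minimal sketch, not the board code: the 320x240 panel size is an assumption (the mangled symbol `frameBufferG76800t` in the listings only shows that the buffer holds 76800 `ushort` entries), the parameters of `fillRect` are renamed `w` and `h` to keep them apart from the module-level constants, and `countPixels` is a hypothetical helper added purely for checking the result.

```d
// Host-side sketch; NOT the author's board code.
enum width  = 320;   // assumed panel width in pixels
enum height = 240;   // assumed panel height in pixels
__gshared ushort[width * height] frameBuffer;   // 76800 entries

// Writes `spanWidth` pixels of `color` starting at (x, y).
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
    int start = y * width + x;
    for (uint i = 0; i < spanWidth; i++)
    {
        frameBuffer[start + i] = color;
    }
}

// Kept out of line, as in the post, so it shows up as its own symbol.
pragma(inline, false)
void fillRect(int x, int y, uint w, uint h, ushort color)
{
    int y2 = y + h;
    // Inclusive bound, exactly as in the post: rows y through y + h.
    for (int _y = y; _y <= y2; _y++)
    {
        fillSpan(x, _y, w, color);
    }
}

// Hypothetical helper: counts how many pixels currently hold `color`.
size_t countPixels(ushort color)
{
    size_t n = 0;
    foreach (p; frameBuffer)
    {
        if (p == color) n++;
    }
    return n;
}

void main()
{
    fillRect(10, 20, 5, 3, 0xF800);
    // Rows 20 through 23 (four rows) of 5 pixels each.
    assert(countPixels(0xF800) == 20);
}
```

Because the loop bound in `fillRect` is inclusive, a call with a height of 3 writes four rows. The post does not say whether that is intended, so the sketch keeps the behaviour unchanged.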

LDC disassembly
---------------
ldc2 -conf= -disable-simplify-libcalls -c -Os  
-mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 
-Isource/runtime -boundscheck=off

<_D5board3lcd8fillRectFiikktZv>:
80000b8:  e92d 43f0   stmdb  sp!, {r4, r5, r6, r7, r8, r9, lr}
80000bc:  eb03 0e01   add.w  lr, r3, r1
80000c0:  459e        cmp  lr, r3
80000c2:  bfb8        it  lt
80000c4:  e8bd 83f0   ldmialt.w  sp!, {r4, r5, r6, r7, r8, r9, pc}






80000e0:  1a54        subs  r4, r2, r1



80000ec:  b1f2        cbz  r2, 800012c 
<_D5board3lcd8fillRectFiikktZv+0x74>


80000f4:  d30a        bcc.n  800010c 
<_D5board3lcd8fillRectFiikktZv+0x54>
80000f6:  463e        mov  r6, r7






8000108:  42ac        cmp  r4, r5
800010a:  d1f5        bne.n  80000f8 
<_D5board3lcd8fillRectFiikktZv+0x40>
800010c:  b171        cbz  r1, 800012c 
<_D5board3lcd8fillRectFiikktZv+0x74>



8000118:  4435        add  r5, r6

800011e:  d005        beq.n  800012c 
<_D5board3lcd8fillRectFiikktZv+0x74>



8000128:  bf18        it  ne



8000132:  4573        cmp  r3, lr
8000134:  ddda        ble.n  80000ec 
<_D5board3lcd8fillRectFiikktZv+0x34>
8000136:  e8bd 83f0   ldmia.w  sp!, {r4, r5, r6, r7, r8, r9, pc}

GDC disassembly
---------------
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs 
-nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 
-mfloat-abi=hard -Isource/runtime -fno-bounds-check 
-ffunction-sections -fdata-sections -fno-weak

<_D5board3lcd8fillRectFiikktZv>:
800049c:  b470        push  {r4, r5, r6}
800049e:  440b        add  r3, r1
80004a0:  4299        cmp  r1, r3

80004a6:  dc15        bgt.n  80004d4 
<_D5board3lcd8fillRectFiikktZv+0x38>



<_D5board3lcd8fillRectFiikktZv+0x3c>)
80004b2:  4410        add  r0, r2



80004be:  b122        cbz  r2, 80004ca 
<_D5board3lcd8fillRectFiikktZv+0x2e>
80004c0:  19a0        adds  r0, r4, r6

80004c6:  42a0        cmp  r0, r4
80004c8:  d1fb        bne.n  80004c2 
<_D5board3lcd8fillRectFiikktZv+0x26>

80004cc:  428b        cmp  r3, r1

80004d2:  daf4        bge.n  80004be 
<_D5board3lcd8fillRectFiikktZv+0x22>
80004d4:  bc70        pop  {r4, r5, r6}
80004d6:  4770        bx  lr
80004d8:  20000000   .word  0x20000000

For both LDC and GDC `fillSpan` gets inlined into `fillRect`.  I 
had to disable inlining for `fillRect` to make it easier to 
compare the disassembly, otherwise all I get is a huge `main`.

I used `-O2` for GDC because `-Os` was even slower, and didn't 
inline `fillSpan`.

Although GDC's code is shorter, LDC's code is faster.  My guess 
is that this is due to the `ldm` and `stm` instructions in the 
LDC disassembly, which are load-multiple and store-multiple 
instructions, but I'm not sure.

I've tried a number of different optimization permutations (too 
many to list here), but they didn't seem to make any difference.

I ask for any insight you might have, should you wish to give 
this your attention.  Regardless, I'll keep investigating.

Thanks,
Mike
Jul 20 2018
Mike Franklin <slavo5150 yahoo.com> writes:
Actually the assembly output from objdump isn't quite accurate.  
Here's the generated assembly from the compiler.

LDC
---
ldc2 -conf= -disable-simplify-libcalls -c -Os  
-mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 
-Isource/runtime -boundscheck=off

_D5board3lcd8fillRectFiikktZv:
	.fnstart
	.save	{r4, r5, r6, r7, r8, r9, lr}
	push.w	{r4, r5, r6, r7, r8, r9, lr}
	add.w	lr, r3, r1
	cmp	lr, r3
	it	lt
	poplt.w	{r4, r5, r6, r7, r8, r9, pc}


	movw	r8, :lower16:_D5board4ltdc11frameBufferG76800t


	movt	r8, :upper16:_D5board4ltdc11frameBufferG76800t
	subs	r4, r2, r1



.LBB1_1:
	cbz	r2, .LBB1_8


	blo	.LBB1_5
	mov	r6, r7
.LBB1_4:



	strh	r0, [r6]


	cmp	r4, r5
	bne	.LBB1_4
.LBB1_5:
	cbz	r1, .LBB1_8



	add	r5, r6

	beq	.LBB1_8



	it	ne

.LBB1_8:


	cmp	r3, lr
	ble	.LBB1_1
	pop.w	{r4, r5, r6, r7, r8, r9, pc}

GDC
---
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs 
-nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 
-mfloat-abi=hard -Isource/runtime -fno-bounds-check 
-ffunction-sections -fdata-sections -fno-weak

_D5board3lcd8fillRectFiikktZv:
	.fnstart
.LFB4:
	@ args = 4, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	push	{r4, r5, r6}
	add	r3, r3, r1
	cmp	r1, r3

	bgt	.L47


	ldr	r4, .L58
	add	r0, r0, r2



.L51:
	cbz	r2, .L49
	adds	r0, r4, r6
.L50:

	cmp	r0, r4
	bne	.L50
.L49:

	cmp	r3, r1

	bge	.L51
.L47:
	pop	{r4, r5, r6}
	bx	lr

Mike
Jul 20 2018
Mike Franklin <slavo5150 yahoo.com> writes:
On Friday, 20 July 2018 at 12:49:59 UTC, Mike Franklin wrote:

 GDC
 ---
 arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs 
 -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 
 -mfloat-abi=hard -Isource/runtime -fno-bounds-check 
 -ffunction-sections -fdata-sections -fno-weak

 _D5board3lcd8fillRectFiikktZv:
 	.fnstart
 .LFB4:
 	@ args = 4, pretend = 0, frame = 0
 	@ frame_needed = 0, uses_anonymous_args = 0
 	@ link register save eliminated.
 	push	{r4, r5, r6}
 	add	r3, r3, r1
 	cmp	r1, r3

 	bgt	.L47


 	ldr	r4, .L58
 	add	r0, r0, r2



 .L51:
 	cbz	r2, .L49
 	adds	r0, r4, r6
 .L50:

 	cmp	r0, r4
 	bne	.L50
 .L49:

 	cmp	r3, r1

 	bge	.L51
 .L47:
 	pop	{r4, r5, r6}
 	bx	lr

Gah.  Sorry folks.  I keep screwing up.  I can see above that the 
`fillSpan` function is not being inlined.  I must be doing 
something wrong.  Please ignore this thread.

Sorry,
Mike
Jul 20 2018
Mike Franklin <slavo5150 yahoo.com> writes:
On Friday, 20 July 2018 at 11:11:12 UTC, Mike Franklin wrote:

 I ask for any insight you might have, should you wish to give 
 this your attention.  Regardless, I'll keep investigating.

Just to follow up, after I enabled `-funroll-loops` for GDC, it 
was almost twice as fast as LDC, though the code size was a 
little larger.  Bottom line is: I just need to learn the 
compilers better (both of them) and learn how to tune them for 
the application.

Mike
Jul 20 2018
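
As a rough illustration of what loop unrolling does to a loop like the one in `fillSpan`, here is a hand-unrolled variant. It is a sketch under stated assumptions, not what GDC emits with `-funroll-loops`: the unroll factor of 4 is arbitrary, `fillSpanUnrolled` is a hypothetical name, and the 320x240 buffer size is assumed for the host build.

```d
// Hand-unrolled illustration; NOT the compiler's actual output.
enum width = 320;                            // assumed panel width in pixels
__gshared ushort[width * 240] frameBuffer;   // assumed 320x240 panel

void fillSpanUnrolled(int x, int y, uint spanWidth, ushort color)
{
    // Compute the start address once instead of indexing from the base.
    ushort* p = frameBuffer.ptr + (y * width + x);
    uint i = 0;

    // Main loop: four stores per iteration, so the compare-and-branch
    // overhead is paid once per four pixels instead of once per pixel.
    for (; i + 4 <= spanWidth; i += 4)
    {
        p[i]     = color;
        p[i + 1] = color;
        p[i + 2] = color;
        p[i + 3] = color;
    }

    // Remainder loop: the 0 to 3 pixels left over.
    for (; i < spanWidth; i++)
    {
        p[i] = color;
    }
}

void main()
{
    fillSpanUnrolled(3, 1, 10, 0x07E0);
    assert(frameBuffer[width + 2]  == 0);       // pixel before the span
    assert(frameBuffer[width + 3]  == 0x07E0);  // first pixel of the span
    assert(frameBuffer[width + 12] == 0x07E0);  // last pixel of the span
    assert(frameBuffer[width + 13] == 0);       // pixel after the span
}
```

The trade-off matches the one described in the follow-up: fewer branches per pixel written, at the cost of a larger loop body and therefore larger code.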