
D.gnu - Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M

reply Mike <none none.com> writes:
This is just an experience report for those who might be 
interested.

Given this pull request 
(https://github.com/D-Programming-GDC/GDC/pull/456) I thought I'd 
try testing a more recent GDC with my STM32 demo 
(https://github.com/JinShil/stm32f42_discovery_demo) and see what 
the state of things is.  Here are my results.

TypeInfo Stubs
===============
I was able to reduce object.d down to the following:

-- object.d --
module object;

alias size_t    = typeof(int.sizeof);
alias ptrdiff_t = typeof(cast(void*)0 - cast(void*)0);

alias string = immutable(char)[];

class Object
{ }

class TypeInfo
{ }

class TypeInfo_Const : TypeInfo
{
     size_t getHash(in void *p) const nothrow { return 0; }
}
----------------------

I wasn't able to completely omit TypeInfo because for some reason 
the compiler is still looking for `getHash` in `TypeInfo_Const`.

-- output from gdc --
object.d:1:1: error: class object.TypeInfo_Const is forward 
referenced when looking for 'getHash'
  module object;
  ^
cc1d: error: no property 'getHash' for type 
'object.TypeInfo_Const'
-----------------------

But, most of the `TypeInfo` stubs are no longer required.

TypeInfo Bloat
===============
Unfortunately the TypeInfo bloat documented here 
(https://issues.dlang.org/show_bug.cgi?id=14758) is still there.  
I suppose that's to be expected as it appears the aforementioned 
pull request only removed the need to write `TypeInfo` stubs.

Binary size for the STM32 demo is about 600kB when it should be 
more like 6kB.

Compile Speed
=============
It takes about 1 minute 30 seconds to build and link the STM32 
demo resulting in a 6kB binary.  That's pretty bad, but I suspect 
that's a DMD CTFE problem.

Mike
Jun 25
next sibling parent reply "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 25 June 2017 at 12:18, Mike via D.gnu <d.gnu puremagic.com> wrote:
 This is just an experience report for those who might be interested.

 Given this pull request (https://github.com/D-Programming-GDC/GDC/pull/456)
 I thought I'd try testing a more recent GDC with my STM32 demo
 (https://github.com/JinShil/stm32f42_discovery_demo) and see what the state
 of things is.  Here are my results.

 TypeInfo Stubs
 ===============
 I was able to reduce object.d down to the following:

 -- object.d --
 module object;

 alias size_t    = typeof(int.sizeof);
 alias ptrdiff_t = typeof(cast(void*)0 - cast(void*)0);

 alias string = immutable(char)[];

 class Object
 { }

 class TypeInfo
 { }

 class TypeInfo_Const : TypeInfo
 {
     size_t getHash(in void *p) const nothrow { return 0; }
 }
 ----------------------

 I wasn't able to completely omit TypeInfo because for some reason the
 compiler is still looking for `getHash` in `TypeInfo_Const`.

 -- output from gdc --
 object.d:1:1: error: class object.TypeInfo_Const is forward referenced when
 looking for 'getHash'
  module object;
  ^
 cc1d: error: no property 'getHash' for type 'object.TypeInfo_Const'
This would be coming from the front-end. The only functions that gdc itself generates are for ModuleInfo; all other artificial data and routines are emitted because the compiler was told to.
 -----------------------

 But, most of the `TypeInfo` stubs are no longer required.

 TypeInfo Bloat
 ===============
 Unfortunately the TypeInfo bloat documented here
 (https://issues.dlang.org/show_bug.cgi?id=14758) is still there.  I suppose
 that's to be expected as it appears the aforementioned pull request only
 removed the need to write `TypeInfo` stubs.

 Binary size for the STM32 demo is about 600kB when it should be more like
 6kB.
Out of curiosity, is that the original built binary, or post trimming (strip / --gc-sections)? There might be a size optimization possibility putting the TypeInfo in comdat, perhaps we could give that a go. There's also https://github.com/D-Programming-GDC/GDC/pull/100 - perhaps there should be a revival of that.
 Compile Speed
 =============
 It takes about 1 minute 30 seconds to build and link the STM32 demo
 resulting in a 6kB binary.  That's pretty bad, but I suspect that's a DMD
 CTFE problem.
Thanks for the update. Iain.
Jun 25
parent reply Mike <none none.com> writes:
On Sunday, 25 June 2017 at 10:44:26 UTC, Iain Buclaw wrote:
 Out of curiosity, is that the original built binary, or post 
 trimming
 (strip / --gc-sections)?
I'm compiling with -ffunction-sections and -fdata-sections, and linking with --gc-sections.
 There might be a size optimization possibility putting the 
 TypeInfo in comdat, perhaps we could give that a go.
Don't know what that means, but if it helps reduce dead code and doesn't have any unintended consequences, sounds good!
 There's also https://github.com/D-Programming-GDC/GDC/pull/100 
 - perhaps there should be a revival of that.
I'm not really interested in that because it's too blunt of an instrument. I'd like to use TypeInfo, but only if I'm doing dynamic casts or other things that require such runtime information. Also, I'd only want the TypeInfo for the types that need it in my binary. I've said this before but I'll repeat: I like TypeInfo; I just don't like dead code.

Mike
Jun 25
next sibling parent reply Mike <none none.com> writes:
On Sunday, 25 June 2017 at 10:53:35 UTC, Mike wrote:

 I'm not really interested in that because it's too blunt of an
 instrument.  I'd like to use TypeInfo, but only if I'm doing 
 dynamic casts or other things that require such runtime 
 information.  Also, I'd only want the TypeInfo for the types 
 that need it in my binary.  I've said this before but I'll 
 repeat:  I like TypeInfo; I just don't like dead code.
Just a little more information about this. It doesn't appear that the entire TypeInfo object is remaining in the binary. It appears to be only the `name` of the type that the linker just can't seem to "garbage collect" from the .rodata section.

Mike
Jun 25
parent reply "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 25 June 2017 at 13:30, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Sunday, 25 June 2017 at 10:53:35 UTC, Mike wrote:

 I'm not really interested in that because it's too blunt of an instrument.
 I'd like to use TypeInfo, but only if I'm doing dynamic casts or other
 things that require such runtime information.  Also, I'd only want the
 TypeInfo for the types that need it in my binary.  I've said this before but
 I'll repeat:  I like TypeInfo; I just don't like dead code.
 Just a little more information about this. It doesn't appear that the entire TypeInfo object is remaining in the binary. It appears to be only the `name` of the type that the linker just can't seem to "garbage collect" from the .rodata section.

 Mike
Ah ha! It seems that for whatever reason, binutils can't strip strings. But if you wrap that string in a static symbol, then it has no problem removing it.

Using your small test here: https://issues.dlang.org/show_bug.cgi?id=14758

Making the following modifications for gdc.

---
long syscall(long code, long arg1 = 0, in void* arg2 = null, long arg3 = 0)
{
    long result = void;

    asm {
        "syscall"
        : /* "=a" (result) ??? No output operands == asm volatile?
             That seems wrong... */
        : "a" (code), "D" (arg1), "S" (arg2), "d" (arg3);
    }

    return result;
}
---

Compiling with -fdata-sections -Os shows that .rodata now only contains:

Contents of section .rodata:
 40010e 48656c6c 6f0a00                      Hello..

I'll make a formal PR when I have some time later.

Iain.
Jun 25
parent Mike <none none.com> writes:
On Sunday, 25 June 2017 at 13:40:56 UTC, Iain Buclaw wrote:

 I'll make a formal PR when I have some time later.

 Iain.
For those who may encounter this post at a later date, the PR is here: https://github.com/D-Programming-GDC/GDC/pull/505 and it seems to solve the problem. Yeah!
Jun 26
prev sibling parent Johannes Pfau <nospam example.com> writes:
On Sun, 25 Jun 2017 10:53:35 +0000,
Mike <none none.com> wrote:
 
 
 I'm not really interested in that because it's too blunt of an
 instrument.  I'd like to use TypeInfo, but only if I'm doing 
 dynamic casts or other things that require such runtime 
 information.  Also, I'd only want the TypeInfo for the types that 
 need it in my binary.  I've said this before but I'll repeat:  I 
 like TypeInfo; I just don't like dead code.
 
I think dynamic casts might actually be the only valid feature for TypeInfo. Everything else will hopefully switch to templated interfaces + CTFE introspection. -- Johannes
Jun 25
prev sibling parent reply Mike <none none.com> writes:
On Sunday, 25 June 2017 at 10:18:04 UTC, Mike wrote:

 Compile Speed
 =============
 It takes about 1 minute 30 seconds to build and link the STM32 
 demo resulting in a 6kB binary.  That's pretty bad, but I 
 suspect that's a DMD CTFE problem.
I've been informed that this PR (https://github.com/dlang/dmd/pull/6418) may fix this. I just might test it today if I can get it to compile with GDC. Do you accept cherry-picked PRs like this?

Mike
Jun 25
next sibling parent "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 26 June 2017 at 04:23, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Sunday, 25 June 2017 at 10:18:04 UTC, Mike wrote:

 Compile Speed
 =============
 It takes about 1 minute 30 seconds to build and link the STM32 demo
 resulting in a 6kB binary.  That's pretty bad, but I suspect that's a DMD
 CTFE problem.
 I've been informed that this PR (https://github.com/dlang/dmd/pull/6418) may fix this. I just might test it today if I can get it to compile with GDC. Do you accept cherry-picked PRs like this?
As it is a bug fix, yes I do. Iain.
Jun 26
prev sibling parent reply "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 26 June 2017 at 04:23, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Sunday, 25 June 2017 at 10:18:04 UTC, Mike wrote:

 Compile Speed
 =============
 It takes about 1 minute 30 seconds to build and link the STM32 demo
 resulting in a 6kB binary.  That's pretty bad, but I suspect that's a DMD
 CTFE problem.
 I've been informed that this PR (https://github.com/dlang/dmd/pull/6418) may fix this. I just might test it today if I can get it to compile with GDC. Do you accept cherry-picked PRs like this?
Looks like it would be best to apply these PRs in the order listed:

https://github.com/dlang/dmd/pull/5276
https://github.com/dlang/dmd/pull/5948
https://github.com/dlang/dmd/pull/6418

Regards
Iain.
Jun 27
next sibling parent reply Mike <none none.com> writes:
On Tuesday, 27 June 2017 at 17:55:02 UTC, Iain Buclaw wrote:

 Looks like it would be best to apply these PRs in the order 
 listed:

 https://github.com/dlang/dmd/pull/5276 
 https://github.com/dlang/dmd/pull/5948 
 https://github.com/dlang/dmd/pull/6418
Thanks. I actually wasn't aware, until yesterday, that GDC isn't using the DDMD frontend. How do you usually handle such things? Just use the transpiler between your ears? Also, what's the plan for GDC w/ DDMD? Mike
Jun 27
parent "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 27 June 2017 at 23:27, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Tuesday, 27 June 2017 at 17:55:02 UTC, Iain Buclaw wrote:

 Looks like it would be best to apply these PRs in the order listed:

 https://github.com/dlang/dmd/pull/5276
 https://github.com/dlang/dmd/pull/5948
 https://github.com/dlang/dmd/pull/6418
Thanks. I actually wasn't aware, until yesterday, that GDC isn't using the DDMD frontend. How do you usually handle such things? Just use the transpiler between your ears?
Carefully backport as needed. I have been given approval to create a dmd-cxx branch in dlang where I'll put the continued maintenance of the C++ implementation until a time when we're ready to switch.
 Also, what's the plan for GDC w/ DDMD?
Get it so that we can test the latest dmd/stable branch without changing any code (just swap out the dmd frontend). Once we're happy that everything works, keep in sync with the dmd/stable branch, so there is no more trailing behind releases. Regressions in DMD are then handled immediately, not after a new release has gone out.

Regards,
Iain.
Jun 27
prev sibling parent reply Mike <none none.com> writes:
On Tuesday, 27 June 2017 at 17:55:02 UTC, Iain Buclaw wrote:

 Looks like it would be best to apply these PRs in the order 
 listed:

 https://github.com/dlang/dmd/pull/5276 
 https://github.com/dlang/dmd/pull/5948 
 https://github.com/dlang/dmd/pull/6418
Compiling https://github.com/JinShil/stm32f42_discovery_demo

GDC 7.1 Without https://github.com/D-Programming-GDC/GDC/pull/507
-----------------------------------------------------------------
time arm-none-eabi-gdc -c -O3 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

real    1m14.780s
user    0m53.201s
sys     0m5.704s

time arm-none-eabi-ld binary/firmware.o -Tlinker/linker.ld --gc-sections -o binary/firmware

real    0m34.636s
user    0m33.120s
sys     0m1.201s

GDC 7.1 With https://github.com/D-Programming-GDC/GDC/pull/507
--------------------------------------------------------------
time arm-none-eabi-gdc -c -O3 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

real    0m55.745s
user    0m50.962s
sys     0m2.338s

time arm-none-eabi-ld binary/firmware.o -Tlinker/linker.ld --gc-sections -o binary/firmware

real    0m33.768s
user    0m33.212s
sys     0m0.503s

PR507 seems to have made a mild improvement.

In an effort to draw a comparison, I modified the code as little as possible to get it to build with DMD, and this is the result:
--------------------------------------------------
time dmd -m32 -c -conf= -boundscheck=off -release -betterC -Isource/runtime source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -of=binary/firmware.o

real    0m3.086s
user    0m2.569s
sys     0m0.517s

Not sure if that's a valid comparison, but that's a huge difference.

I'd be glad to do some troubleshooting if you have any ideas. But I may not get to it right away as I will be traveling for a few weeks.

Mike
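For readers who want to put a number on comparisons like the ones above, here is a minimal sketch that converts the `real` values printed by `time` into seconds and reports the relative saving. The helper names `to_seconds` and `percent_saved` are made up for this illustration and are not part of any tool mentioned in the thread. The sketch assumes the `NmS.SSSs` format shown above and a POSIX shell with awk.

```shell
# to_seconds VALUE: convert a `time` value such as "1m14.780s" to seconds.
# (Hypothetical helper; splits the value on the "m" and "s" suffixes.)
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# percent_saved BEFORE AFTER: percentage by which AFTER is lower than BEFORE.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Compile-step `real` times quoted in the post, without and with PR 507.
without_pr=$(to_seconds 1m14.780s)
with_pr=$(to_seconds 0m55.745s)
echo "compile time saved: $(percent_saved "$without_pr" "$with_pr")%"
```

A single wall-clock measurement is noisy, so a figure like this is only a rough indication unless the builds are repeated a few times.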
Jun 28
next sibling parent reply Johannes Pfau <nospam example.com> writes:
On Wednesday, 28 June 2017 at 11:19:10 UTC, Mike wrote:
 I'd be glad to do some troubleshooting if you have any ideas.  
 But I may not get to it right away as I will be traveling for a 
 few weeks.

 Mike
Are we sure this is a frontend problem? Have you already tried -fsyntax-only to remove the backend computation? Also which DMD version did you use for comparison? I'd try 2.068 (if your code compiles with that version) and 2.071.2. There may have been other changes in newer frontend versions improving build performance.
Jun 28
parent reply Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 12:15:17 UTC, Johannes Pfau wrote:

 Have you already tried -fsyntax-only to remove the backend 
 computation?
time arm-none-eabi-gdc -c -O3 -fsyntax-only -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

real    0m1.858s
user    0m1.157s
sys     0m0.241s

Does that indicate a backend problem? If you look at my previous post, linking was also quite high at 33s.
 Also which DMD version did you use for comparison? I'd try 
 2.068 (if your code compiles with that version) and 2.071.2. 
 There may have been other changes in newer frontend versions 
 improving build performance.
dmd --version
DMD64 D Compiler v2.074.0

Mike
Jun 28
parent Johannes Pfau <nospam example.com> writes:
On Wed, 28 Jun 2017 12:27:17 +0000,
Mike <none none.com> wrote:

 Does that indicate a backend problem?  If you look at my previous 
 post, linking was also quite high at 33s.
Probably. AFAIR -fsyntax-only includes all frontend processing so the long runtime is likely due to the optimization phases, assembling and linking. However, let's wait for Iain for a more competent response ;-) -- Johannes
Jun 28
prev sibling next sibling parent reply Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 11:19:10 UTC, Mike wrote:

 I'd be glad to do some troubleshooting if you have any ideas.
I just checked in a branch that will compile with a desktop GDC compiler in case someone wants to give it a try. I tested with 6.2.1 20160830 from the Arch Linux repository. It won't produce a working binary, but it will go through the motions.

git clone https://github.com/JinShil/stm32f42_discovery_demo.git
cd stm32f42_discovery_demo/
git checkout desktop-gdc
rdmd build.d

Mike
Jun 28
parent reply Johannes Pfau <nospam example.com> writes:
On Wed, 28 Jun 2017 12:56:40 +0000,
Mike <none none.com> wrote:

 On Wednesday, 28 June 2017 at 11:19:10 UTC, Mike wrote:
 
 I'd be glad to do some troubleshooting if you have any ideas.  
 I just checked in a branch that will compile with a desktop GDC compiler in case someone wants to give it a try. I tested with 6.2.1 20160830 from the Arch Linux repository. It won't produce a working binary, but it will go through the motions.

 git clone https://github.com/JinShil/stm32f42_discovery_demo.git
 cd stm32f42_discovery_demo/
 git checkout desktop-gdc
 rdmd build.d

 Mike
I guess if archlinux GDC is slow as well this is unlikely the main cause, but did you build your compiler in release mode (--enable-checking=release)? -- Johannes
Jun 28
parent Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 13:48:54 UTC, Johannes Pfau wrote:

 I guess if archlinux GDC is slow as well this is unlikely the 
 main cause, but did you build your compiler in release mode 
 (--enable-checking=release)?
Thanks, I wasn't aware of that. I just rebuilt and it brought the time down another 25%.

time arm-none-eabi-gdc -c -O3 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

real    1m0.518s
user    0m54.594s
sys     0m2.558s

Mike
Jun 28
prev sibling parent reply "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 28 June 2017 at 13:19, Mike via D.gnu <d.gnu puremagic.com> wrote:
 PR507 seems to have made a mild improvement.
OK, so will pull it in.

As has already been said, if you aren't doing so already, always use --enable-checking=release; any other value (apart from none) is for testing builds only.

You probably want to tone down on optimizations as well. -O3 will be doing a lot of work, sometimes for little or no gain. In most cases, -O2 -finline-functions is good enough, which can be abbreviated further as simply -Os. [for full list of enabled/disabled passes: gdc -Q -Os --help=optimizers]

You can see a breakdown of what areas the compiler spends the most time in with -ftime-report.

Regards,
Iain.
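A `-ftime-report` dump can run to a hundred rows, so it may help to rank them. The following is a minimal sketch, assuming the usual GCC row layout of `name : N (P%) usr N (P%) sys N (P%) wall N kB (P%) ggc`. The function name `rank_passes` is hypothetical, and the three sample rows are copied from the report posted later in this thread.

```shell
# rank_passes: read a -ftime-report dump on stdin and print
# "<wall seconds>  <pass name>", slowest first.
# (Hypothetical helper; rows without a "wall" column, such as TOTAL, are skipped.)
rank_passes() {
    awk -F' *: *' '/ wall / {
        # $2 holds the four measurements; the text before "wall" is the third.
        split($2, col, /usr|sys|wall/)
        split(col[3], w, " ")
        name = $1
        sub(/^ +/, "", name)
        printf "%8.2f  %s\n", w[1], name
    }' | sort -rn
}

# Sample rows taken from the full report posted later in the thread.
rank_passes <<'EOF'
 phase parsing           :   2.21 ( 4%) usr   0.32 ( 2%) sys   2.55 ( 3%) wall  160684 kB ( 6%) ggc
 phase opt and generate  :  51.89 (96%) usr  20.13 (98%) sys  72.29 (97%) wall 2381419 kB (94%) ggc
 ipa icf                 :   1.59 ( 3%) usr   0.01 ( 0%) sys   1.60 ( 2%) wall      11 kB ( 0%) ggc
EOF
```

GCC writes the report to stderr, so a real invocation would look something like `arm-none-eabi-gdc ... -ftime-report 2>&1 | rank_passes | head`. Note that the `phase ...` rows are totals that include the individual passes listed below them, so they will always sort to the top.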
Jun 28
parent reply Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 16:52:26 UTC, Iain Buclaw wrote:

 You probably want to tone down on optimizations as well.  -O3 
 will be doing a lot of work, sometimes for little or no gain.  
 In most cases, -O2 -finline-functions is good enough, which can 
 be abbreviated further as simply -Os.  [for full list of 
 enabled/disabled passes: gdc -Q -Os --help=optimizers]

 You can see a breakdown of what areas the compiler spends the 
 most time in with -ftime-report
Compiling with -O3
------------------
phase opt and generate : 50.74 (96%) usr 21.24 (99%) sys 72.94 (97%) wall 2426962 kB (94%) ggc
TOTAL                  : 53.02           21.49           75.55           2589984 kB

real    1m21.339s
user    0m57.086s
sys     0m22.297s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   6228       0  153600  159828   27054 binary/firmware

Compiling with -O2 -finline-functions
-------------------------------------
phase opt and generate : 50.71 (96%) usr 20.58 (98%) sys 72.04 (97%) wall 2381419 kB (94%) ggc
TOTAL                  : 52.89           20.93           74.63           2544441 kB

real    1m20.755s
user    0m56.857s
sys     0m21.826s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   5912       0  153600  159512   26f18 binary/firmware

Compiling with -O0
------------------
phase opt and generate : 22.95 (91%) usr  5.42 (94%) sys 28.38 (92%) wall 1777106 kB (92%) ggc
TOTAL                  : 25.14            5.74           30.94           1940102 kB

real    0m36.476s
user    0m29.600s
sys     0m6.647s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
  45250       0  153600  198850   308c2 binary/firmware

-------------------------------------------------------------------------
The vast majority of time is spent in "phase opt and generate". A few observations:

* Elapsed time isn't much different between -O3 and -O2 -finline-functions
* -O2 -finline-functions gave me a smaller binary :)
* -O0 reduced time significantly, but "phase opt and generate" still takes an awfully long time relative to everything else

What exactly is "phase opt and generate"? I'm assuming "opt" means optimizer, but why is it taking such a long time even with -O0? Maybe it's the "generate" part of that that's the most significant.

With -O0 there's still quite a few things enabled, so maybe I'll start appending a "-fno" to each one and see if I can find a culprit.
-O0 -Q --help=optimizers
  -faggressive-loop-optimizations [enabled]
  -fauto-inc-dec [enabled]
  -fdce [enabled]
  -fdelete-null-pointer-checks [enabled]
  -fdse [enabled]
  -fearly-inlining [enabled]
  -ffp-contract=[off|on|fast] fast
  -ffp-int-builtin-inexact [enabled]
  -ffunction-cse [enabled]
  -fgcse-lm [enabled]
  -finline [enabled]
  -finline-atomics [enabled]
  -fira-hoist-pressure [enabled]
  -fira-share-save-slots [enabled]
  -fira-share-spill-slots [enabled]
  -fivopts [enabled]
  -fjump-tables [enabled]
  -flifetime-dse [enabled]
  -fmath-errno [enabled]
  -fpeephole [enabled]
  -fplt [enabled]
  -fprefetch-loop-arrays [enabled]
  -fprintf-return-value [enabled]
  -freg-struct-return [enabled]
  -frename-registers [enabled]
  -frtti [enabled]
  -fsched-critical-path-heuristic [enabled]
  -fsched-dep-count-heuristic [enabled]
  -fsched-group-heuristic [enabled]
  -fsched-interblock [enabled]
  -fsched-last-insn-heuristic [enabled]
  -fsched-rank-heuristic [enabled]
  -fsched-spec [enabled]
  -fsched-spec-insn-heuristic [enabled]
  -fsched-stalled-insns-dep [enabled]
  -fschedule-fusion [enabled]
  -fshort-enums [enabled]
  -fshrink-wrap-separate [enabled]
  -fsigned-zeros [enabled]
  -fsimd-cost-model=[unlimited|dynamic|cheap] unlimited
  -fsplit-ivs-in-unroller [enabled]
  -fssa-backprop [enabled]
  -fstack-reuse=[all|named_vars|none] all
  -fstdarg-opt [enabled]
  -fstrict-volatile-bitfields [enabled]
  -fno-threadsafe-statics [enabled]
  -ftrapping-math [enabled]
  -ftree-cselim [enabled]
  -ftree-forwprop [enabled]
  -ftree-loop-if-convert [enabled]
  -ftree-loop-im [enabled]
  -ftree-loop-ivcanon [enabled]
  -ftree-loop-optimize [enabled]
  -ftree-phiprop [enabled]
  -ftree-reassoc [enabled]
  -ftree-scev-cprop [enabled]
  -fvar-tracking [enabled]
  -fvar-tracking-assignments [enabled]
  -fweb [enabled]
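The idea of trying a "-fno" form of each enabled option can be scripted rather than typed by hand. The following is a minimal sketch, assuming the compiler prints one option per line followed by `[enabled]`, as `-Q --help=optimizers` normally does. The function name `negate_enabled` is made up for this illustration. Options that take a value, such as `-ffp-contract=...`, are skipped, and nothing here guarantees that every generated flag is accepted by the compiler or has any effect on compile time.

```shell
# negate_enabled: read `-Q --help=optimizers` output on stdin and print the
# negated form of every option reported as [enabled].
# (Hypothetical helper; options with "=value" settings are ignored.)
negate_enabled() {
    awk '$2 == "[enabled]" && $1 ~ /^-f/ {
        flag = $1
        if (flag ~ /^-fno-/)
            sub(/^-fno-/, "-f", flag)    # already negative: flip it back
        else
            sub(/^-f/, "-fno-", flag)    # positive: add the no- prefix
        print flag
    }'
}

# A few lines taken from the listing above.
negate_enabled <<'EOF'
  -fdce [enabled]
  -ffp-contract=[off|on|fast] fast
  -fno-threadsafe-statics [enabled]
  -fweb [enabled]
EOF
```

With a real toolchain the input would come from something like `arm-none-eabi-gdc -O0 -Q --help=optimizers | negate_enabled`, and each resulting flag could then be added to the build one at a time while watching the `-ftime-report` totals.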
Jun 28
parent reply "Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 28 June 2017 at 23:15, Mike via D.gnu <d.gnu puremagic.com> wrote:
 -------------------------------------------------------------------------
 The vast majority of time is spent in "phase opt and generate".  A few
 observations:

 * Elapsed time isn't much different between -O3 and -O2 -finline-functions
 * -O2 -finline-functions gave me a smaller binary :)
I did say that -Os (optimize for size) is practically identical to this. So I'm not surprised. ;-) And yeah, one of the big differences between -O2 and -O3 is that when it comes to inlining, -O3 mostly disregards size and cost heuristics.
 * -O0 reduced time significantly, but "phase opt and generate" still takes
 an awfully long time relative to everything else

 What exactly is "phase opt and generate"?  I'm assuming "opt" means
 optimizer, but why is it taking such a long time even with -O0?  Maybe it's
 the "generate" part of that that's the most significant.
Phase opt and generate is the top-level timer for the entire "backend" compilation phase. I was expecting to see more of a breakdown of individual passes.
 With -O0 there's still quite a few things enabled, so maybe I'll start
 appending a "-fno" to each one and see if I can find a culprit.
A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary? If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more. Regards, Iain.
Jun 28
next sibling parent reply Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

 Phase opt and generate is the top-level timer for the entire
 "backend" compilation phase.  I was expecting to see more of a 
 breakdown of individual passes.
Sorry, it didn't look broken down to me. Here's the full report.

arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections -ftime-report source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

Execution times (seconds)
 phase setup                 :  0.01 ( 0%) usr  0.00 ( 0%) sys  0.01 ( 0%) wall    2310 kB ( 0%) ggc
 phase parsing               :  2.21 ( 4%) usr  0.32 ( 2%) sys  2.55 ( 3%) wall  160684 kB ( 6%) ggc
 phase opt and generate      : 51.89 (96%) usr 20.13 (98%) sys 72.29 (97%) wall 2381419 kB (94%) ggc
 phase last asm              :  0.00 ( 0%) usr  0.00 ( 0%) sys  0.01 ( 0%) wall      26 kB ( 0%) ggc
 phase finalize              :  0.00 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
 garbage collection          :  0.90 ( 2%) usr  0.04 ( 0%) sys  1.05 ( 1%) wall       0 kB ( 0%) ggc
 dump files                  :  4.17 ( 8%) usr  1.96 (10%) sys  5.67 ( 8%) wall       0 kB ( 0%) ggc
 callgraph construction      :  0.66 ( 1%) usr  0.20 ( 1%) sys  1.07 ( 1%) wall   26036 kB ( 1%) ggc
 callgraph optimization      :  1.55 ( 3%) usr  0.78 ( 4%) sys  1.89 ( 3%) wall    1689 kB ( 0%) ggc
 ipa dead code removal       :  0.29 ( 1%) usr  0.00 ( 0%) sys  0.28 ( 0%) wall       0 kB ( 0%) ggc
 ipa inheritance graph       :  0.01 ( 0%) usr  0.00 ( 0%) sys  0.02 ( 0%) wall       0 kB ( 0%) ggc
 ipa devirtualization        :  0.02 ( 0%) usr  0.00 ( 0%) sys  0.02 ( 0%) wall       0 kB ( 0%) ggc
 ipa cp                      :  0.21 ( 0%) usr  0.01 ( 0%) sys  0.18 ( 0%) wall    6160 kB ( 0%) ggc
 ipa inlining heuristics     :  0.69 ( 1%) usr  0.15 ( 1%) sys  0.67 ( 1%) wall   88573 kB ( 3%) ggc
 ipa function splitting      :  0.02 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       9 kB ( 0%) ggc
 ipa comdats                 :  0.05 ( 0%) usr  0.00 ( 0%) sys  0.05 ( 0%) wall       0 kB ( 0%) ggc
 ipa various optimizations   :  0.08 ( 0%) usr  0.00 ( 0%) sys  0.09 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference               :  0.12 ( 0%) usr  0.00 ( 0%) sys  0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa profile                 :  0.07 ( 0%) usr  0.00 ( 0%) sys  0.07 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const              :  0.38 ( 1%) usr  0.09 ( 0%) sys  0.54 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                     :  1.59 ( 3%) usr  0.01 ( 0%) sys  1.60 ( 2%) wall      11 kB ( 0%) ggc
 ipa SRA                     :  0.02 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa free lang data          :  0.03 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa free inline summary     :  0.03 ( 0%) usr  0.00 ( 0%) sys  0.04 ( 0%) wall       0 kB ( 0%) ggc
 cfg construction            :  0.15 ( 0%) usr  0.06 ( 0%) sys  0.12 ( 0%) wall       5 kB ( 0%) ggc
 cfg cleanup                 :  0.66 ( 1%) usr  0.27 ( 1%) sys  1.04 ( 1%) wall      17 kB ( 0%) ggc
 trivially dead code         :  0.12 ( 0%) usr  0.05 ( 0%) sys  0.38 ( 1%) wall       0 kB ( 0%) ggc
 df scan insns               :  0.45 ( 1%) usr  0.19 ( 1%) sys  0.56 ( 1%) wall    5569 kB ( 0%) ggc
 df multiple defs            :  0.24 ( 0%) usr  0.06 ( 0%) sys  0.28 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs            :  0.15 ( 0%) usr  0.03 ( 0%) sys  0.26 ( 0%) wall       0 kB ( 0%) ggc
 df live regs                :  0.60 ( 1%) usr  0.25 ( 1%) sys  0.70 ( 1%) wall       0 kB ( 0%) ggc
 df live&initialized regs    :  0.32 ( 1%) usr  0.13 ( 1%) sys  0.57 ( 1%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains :  0.05 ( 0%) usr  0.03 ( 0%) sys  0.11 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes    :  0.56 ( 1%) usr  0.18 ( 1%) sys  0.86 ( 1%) wall    2562 kB ( 0%) ggc
 register information        :  0.14 ( 0%) usr  0.13 ( 1%) sys  0.40 ( 1%) wall       0 kB ( 0%) ggc
 alias analysis              :  0.79 ( 1%) usr  0.34 ( 2%) sys  1.14 ( 2%) wall   28569 kB ( 1%) ggc
 alias stmt walking          :  0.10 ( 0%) usr  0.02 ( 0%) sys  0.07 ( 0%) wall       0 kB ( 0%) ggc
 register scan               :  0.07 ( 0%) usr  0.01 ( 0%) sys  0.11 ( 0%) wall     106 kB ( 0%) ggc
 rebuild jump labels         :  0.05 ( 0%) usr  0.05 ( 0%) sys  0.15 ( 0%) wall       0 kB ( 0%) ggc
 parser (global)             :  2.19 ( 4%) usr  0.32 ( 2%) sys  2.51 ( 3%) wall  160144 kB ( 6%) ggc
 early inlining heuristics   :  0.17 ( 0%) usr  0.09 ( 0%) sys  0.24 ( 0%) wall   19510 kB ( 1%) ggc
 inline parameters           :  0.35 ( 1%) usr  0.18 ( 1%) sys  0.44 ( 1%) wall   58124 kB ( 2%) ggc
 integration                 :  0.63 ( 1%) usr  0.24 ( 1%) sys  0.85 ( 1%) wall   80071 kB ( 3%) ggc
 tree gimplify               :  0.48 ( 1%) usr  0.17 ( 1%) sys  0.53 ( 1%) wall  109681 kB ( 4%) ggc
 tree eh                     :  0.13 ( 0%) usr  0.07 ( 0%) sys  0.20 ( 0%) wall   13982 kB ( 1%) ggc
 tree CFG construction       :  0.19 ( 0%) usr  0.05 ( 0%) sys  0.17 ( 0%) wall   54230 kB ( 2%) ggc
 tree CFG cleanup            :  0.69 ( 1%) usr  0.38 ( 2%) sys  1.19 ( 2%) wall    1131 kB ( 0%) ggc
 tree tail merge             :  0.11 ( 0%) usr  0.02 ( 0%) sys  0.09 ( 0%) wall       0 kB ( 0%) ggc
 tree VRP                    :  0.93 ( 2%) usr  0.35 ( 2%) sys  1.29 ( 2%) wall   89761 kB ( 4%) ggc
 tree Early VRP              :  0.21 ( 0%) usr  0.08 ( 0%) sys  0.31 ( 0%) wall   42204 kB ( 2%) ggc
 tree copy propagation       :  0.06 ( 0%) usr  0.03 ( 0%) sys  0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                    :  1.78 ( 3%) usr  0.85 ( 4%) sys  2.50 ( 3%) wall    4103 kB ( 0%) ggc
 tree PHI insertion          :  0.07 ( 0%) usr  0.02 ( 0%) sys  0.03 ( 0%) wall    6571 kB ( 0%) ggc
 tree SSA rewrite            :  0.16 ( 0%) usr  0.06 ( 0%) sys  0.20 ( 0%) wall   20087 kB ( 1%) ggc
 tree SSA other              :  0.21 ( 0%) usr  0.13 ( 1%) sys  0.51 ( 1%) wall    5602 kB ( 0%) ggc
 tree SSA incremental        :  0.15 ( 0%) usr  0.10 ( 0%) sys  0.30 ( 0%) wall      60 kB ( 0%) ggc
 tree operand scan           :  0.34 ( 1%) usr  0.22 ( 1%) sys  0.56 ( 1%) wall   56364 kB ( 2%) ggc
 dominator optimization      :  0.73 ( 1%) usr  0.22 ( 1%) sys  0.75 ( 1%) wall    7545 kB ( 0%) ggc
 backwards jump threading    :  0.30 ( 1%) usr  0.09 ( 0%) sys  0.25 ( 0%) wall     111 kB ( 0%) ggc
 tree SRA                    :  0.13 ( 0%) usr  0.04 ( 0%) sys  0.17 ( 0%) wall      28 kB ( 0%) ggc
 isolate eroneous paths      :  0.04 ( 0%) usr  0.03 ( 0%) sys  0.09 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                    :  0.68 ( 1%) usr  0.24 ( 1%) sys  0.85 ( 1%) wall    7302 kB ( 0%) ggc
 tree PHI const/copy prop    :  0.05 ( 0%) usr  0.02 ( 0%) sys  0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree split crit edges       :  0.05 ( 0%) usr  0.06 ( 0%) sys  0.17 ( 0%) wall      19 kB ( 0%) ggc
 tree reassociation          :  0.23 ( 0%) usr  0.07 ( 0%) sys  0.38 ( 1%) wall       6 kB ( 0%) ggc
 tree PRE                    :  1.28 ( 2%) usr  0.48 ( 2%) sys  1.78 ( 2%) wall   50466 kB ( 2%) ggc
 tree FRE                    :  0.69 ( 1%) usr  0.36 ( 2%) sys  1.22 ( 2%) wall   17297 kB ( 1%) ggc
 tree code sinking           :  0.10 ( 0%) usr  0.05 ( 0%) sys  0.13 ( 0%) wall       6 kB ( 0%) ggc
 tree linearize phis         :  0.19 ( 0%) usr  0.08 ( 0%) sys  0.27 ( 0%) wall   41714 kB ( 2%) ggc
 tree backward propagate     :  0.02 ( 0%) usr  0.00 ( 0%) sys  0.07 ( 0%) wall       0 kB ( 0%) ggc
 tree forward propagate      :  0.23 ( 0%) usr  0.08 ( 0%) sys  0.38 ( 1%) wall      62 kB ( 0%) ggc
 tree phiprop                :  0.06 ( 0%) usr  0.01 ( 0%) sys  0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE       :  0.21 ( 0%) usr  0.15 ( 1%) sys  0.36 ( 0%) wall     209 kB ( 0%) ggc
 tree aggressive DCE         :  0.28 ( 1%) usr  0.12 ( 1%) sys  0.44 ( 1%) wall   83438 kB ( 3%) ggc
 tree buildin call DCE       :  0.06 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE                    :  0.09 ( 0%) usr  0.09 ( 0%) sys  0.21 ( 0%) wall       0 kB ( 0%) ggc
 PHI merge                   :  0.07 ( 0%) usr  0.04 ( 0%) sys  0.11 ( 0%) wall       0 kB ( 0%) ggc
 tree loop optimization      :  0.02 ( 0%) usr  0.01 ( 0%) sys  0.02 ( 0%) wall       0 kB ( 0%) ggc
 loopless fn                 :  0.04 ( 0%) usr  0.01 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree loop invariant motion  :  0.01 ( 0%) usr  0.02 ( 0%) sys  0.07 ( 0%) wall       1 kB ( 0%) ggc
 complete unrolling          :  0.05 ( 0%) usr  0.04 ( 0%) sys  0.12 ( 0%) wall     136 kB ( 0%) ggc
 tree iv optimization        :  0.00 ( 0%) usr  0.00 ( 0%) sys  0.02 ( 0%) wall     120 kB ( 0%) ggc
 tree copy headers           :  0.03 ( 0%) usr  0.02 ( 0%) sys  0.03 ( 0%) wall       7 kB ( 0%) ggc
 tree SSA uncprop            :  0.28 ( 1%) usr  0.13 ( 1%) sys  0.31 ( 0%) wall       0 kB ( 0%) ggc
 tree NRV optimization       :  0.05 ( 0%) usr  0.00 ( 0%) sys  0.04 ( 0%) wall     849 kB ( 0%) ggc
 tree switch conversion      :  0.03 ( 0%) usr  0.00 ( 0%) sys  0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree strlen optimization    :  0.03 ( 0%) usr  0.01 ( 0%) sys  0.09 ( 0%) wall
0 kB ( 0%) ggc dominance frontiers : 0.09 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc dominance computation : 1.42 ( 3%) usr 0.51 ( 2%) sys 1.94 ( 3%) wall 0 kB ( 0%) ggc control dependences : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc out of ssa : 0.22 ( 0%) usr 0.10 ( 0%) sys 0.35 ( 0%) wall 7465 kB ( 0%) ggc expand vars : 0.02 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 506 kB ( 0%) ggc expand : 0.63 ( 1%) usr 0.24 ( 1%) sys 1.12 ( 1%) wall 63840 kB ( 3%) ggc post expand cleanups : 0.24 ( 0%) usr 0.04 ( 0%) sys 0.23 ( 0%) wall 18401 kB ( 1%) ggc varconst : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 539 kB ( 0%) ggc lower subreg : 0.07 ( 0%) usr 0.01 ( 0%) sys 0.05 ( 0%) wall 0 kB ( 0%) ggc jump : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc forward prop : 0.73 ( 1%) usr 0.26 ( 1%) sys 0.86 ( 1%) wall 2110 kB ( 0%) ggc CSE : 0.50 ( 1%) usr 0.19 ( 1%) sys 0.73 ( 1%) wall 1053 kB ( 0%) ggc dead code elimination : 0.23 ( 0%) usr 0.07 ( 0%) sys 0.38 ( 1%) wall 0 kB ( 0%) ggc dead store elim1 : 0.24 ( 0%) usr 0.09 ( 0%) sys 0.48 ( 1%) wall 1039 kB ( 0%) ggc dead store elim2 : 0.27 ( 0%) usr 0.14 ( 1%) sys 0.39 ( 1%) wall 960 kB ( 0%) ggc loop analysis : 0.10 ( 0%) usr 0.06 ( 0%) sys 0.11 ( 0%) wall 0 kB ( 0%) ggc loop init : 1.34 ( 2%) usr 0.51 ( 2%) sys 1.93 ( 3%) wall 183463 kB ( 7%) ggc loop invariant motion : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 1 kB ( 0%) ggc loop doloop : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 36 kB ( 0%) ggc loop fini : 0.61 ( 1%) usr 0.31 ( 2%) sys 0.94 ( 1%) wall 0 kB ( 0%) ggc CPROP : 0.21 ( 0%) usr 0.05 ( 0%) sys 0.21 ( 0%) wall 295 kB ( 0%) ggc PRE : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.06 ( 0%) wall 4 kB ( 0%) ggc auto inc dec : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.15 ( 0%) wall 934 kB ( 0%) ggc CSE 2 : 0.29 ( 1%) usr 0.18 ( 1%) sys 0.44 ( 1%) wall 171 kB ( 0%) ggc branch prediction : 0.20 ( 0%) usr 0.06 ( 0%) sys 0.16 ( 0%) wall 4067 kB ( 0%) ggc combiner : 0.84 ( 2%) usr 0.22 ( 1%) sys 
1.42 ( 2%) wall 13624 kB ( 1%) ggc if-conversion : 0.28 ( 1%) usr 0.08 ( 0%) sys 0.41 ( 1%) wall 2 kB ( 0%) ggc scheduling : 1.45 ( 3%) usr 0.63 ( 3%) sys 2.15 ( 3%) wall 4177 kB ( 0%) ggc integrated RA : 1.83 ( 3%) usr 0.70 ( 3%) sys 2.45 ( 3%) wall 964084 kB (38%) ggc LRA non-specific : 0.69 ( 1%) usr 0.33 ( 2%) sys 0.90 ( 1%) wall 2272 kB ( 0%) ggc LRA virtuals elimination: 0.27 ( 0%) usr 0.15 ( 1%) sys 0.36 ( 0%) wall 1881 kB ( 0%) ggc LRA reload inheritance : 0.09 ( 0%) usr 0.04 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc LRA create live ranges : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.12 ( 0%) wall 1 kB ( 0%) ggc LRA hard reg assignment : 0.09 ( 0%) usr 0.05 ( 0%) sys 0.20 ( 0%) wall 0 kB ( 0%) ggc reload : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc reload CSE regs : 0.46 ( 1%) usr 0.09 ( 0%) sys 0.43 ( 1%) wall 2852 kB ( 0%) ggc thread pro- & epilogue : 0.25 ( 0%) usr 0.13 ( 1%) sys 0.45 ( 1%) wall 37093 kB ( 1%) ggc if-conversion 2 : 0.06 ( 0%) usr 0.02 ( 0%) sys 0.18 ( 0%) wall 0 kB ( 0%) ggc peephole 2 : 0.11 ( 0%) usr 0.04 ( 0%) sys 0.18 ( 0%) wall 11 kB ( 0%) ggc hard reg cprop : 0.12 ( 0%) usr 0.05 ( 0%) sys 0.19 ( 0%) wall 0 kB ( 0%) ggc scheduling 2 : 1.05 ( 2%) usr 0.44 ( 2%) sys 1.67 ( 2%) wall 3203 kB ( 0%) ggc machine dep reorg : 0.21 ( 0%) usr 0.05 ( 0%) sys 0.26 ( 0%) wall 10319 kB ( 0%) ggc reorder blocks : 0.10 ( 0%) usr 0.03 ( 0%) sys 0.20 ( 0%) wall 20 kB ( 0%) ggc shorten branches : 0.16 ( 0%) usr 0.05 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc final : 0.88 ( 2%) usr 0.47 ( 2%) sys 1.51 ( 2%) wall 15600 kB ( 1%) ggc variable output : 0.30 ( 1%) usr 0.03 ( 0%) sys 0.33 ( 0%) wall 10352 kB ( 0%) ggc symout : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc tree if-combine : 0.05 ( 0%) usr 0.02 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc straight-line strength reduction: 0.13 ( 0%) usr 0.07 ( 0%) sys 0.22 ( 0%) wall 0 kB ( 0%) ggc store merging : 0.07 ( 0%) usr 0.03 ( 0%) sys 0.01 ( 0%) wall 9 kB ( 0%) ggc address lowering : 
0.01 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc early local passes : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc unaccounted optimizations: 0.01 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc rest of compilation : 5.83 (11%) usr 2.63 (13%) sys 8.49 (11%) wall 101391 kB ( 4%) ggc unaccounted post reload : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc unaccounted late compilation: 0.03 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc remove unused locals : 0.18 ( 0%) usr 0.08 ( 0%) sys 0.22 ( 0%) wall 0 kB ( 0%) ggc address taken : 0.13 ( 0%) usr 0.03 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc rebuild frequencies : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc repair loop structures : 0.07 ( 0%) usr 0.03 ( 0%) sys 0.14 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 54.11 20.45 74.91 2544441 kB
 A thought just occurred to me, you are compiling the entire 
 program + object.d right?  Nothing else will link/be linked to 
 the binary?
I'm passing all files except druntime files via the command line. druntime files are imported via -Isource/runtime. But essentially yes, I'm compiling the entire application in one command so I can get cross-module inlining.

I tried moving all runtime files to the command line, but I get errors about __entrypoint.

cc1d: error: module __entrypoint is in file '__entrypoint.d' which cannot be read
Specify path to file '__entrypoint.d' with -I switch

 If that is the case, you should definitely compile with 
 -fwhole-program.  I suspect that may cut down your compilation 
 time by half or even more.
If I only import __entrypoint.d, pass the rest of the runtime files on the command line, and compile with -fwhole-program, it compiles in 5s, but I only get an 8-byte binary. I suspect this is due to the error above about __entrypoint. That is, if there's no entry point, the whole program gets garbage collected.

I think you might be on to something here though. I'm out of time now; gotta catch a plane soon. I'll try to do more troubleshooting when I return.

Mike
Jun 28
"Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 29 June 2017 at 00:50, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
 If that is the case, you should definitely compile with -fwhole-program.
 I suspect that may cut down your compilation time by half or even more.

 If I only import __entrypoint.d and pass the rest of the runtime files on
 the command line and compile with -fwhole-program, it compiles in 5s, but
 I only get an 8byte binary. I suspect this is due to the error above about
 __entrypoint. That is, if there's no entry point, the whole program gets
 garbage collected.

 I think you might be on to something here though.

Yes, it seems there are more than a few hundred functions being emitted, and as they are all considered externally visible, none can be removed during the optimization pass. If they are all considered static (in the C sense), then unused and inlined functions can be removed immediately, giving the backend less work.

I think the only caveat with using -fwhole-program is that C main must be present in the compilation, otherwise, as you've noted, everything gets removed as unused code.

Regards,
Iain.
Jun 28
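
As a rough illustration of the caveat above, here is a minimal sketch in D of a translation unit that does contain C main. It is written for a hosted toolchain rather than the STM32 demo, and the names `twice` and `unused` are hypothetical. Under -fwhole-program, GCC treats every public function other than main as if it were static, so one would expect `twice` to be inlined and dropped and `unused` to be discarded, while `main` stays as the root that keeps reachable code alive.

```d
// Assumption: hypothetical helper, reachable only through main.
// With every symbol treated as local, it can be inlined and then removed.
int twice(int x)
{
    return 2 * x;
}

// Assumption: hypothetical function that nothing references.
// Without -fwhole-program it must be emitted because another object
// file might call it; with the flag it is a candidate for removal.
int unused(int x)
{
    return x + 1;
}

// The C entry point. This is the symbol -fwhole-program keeps visible,
// so it has to be part of the same compilation.
extern (C) int main()
{
    return twice(21) == 42 ? 0 : 1;
}
```

Building this with something like `gdc -O2 -fwhole-program` and inspecting the object file with nm, then repeating without the flag, is one way to compare which symbols survive; the exact result depends on the compiler version and optimization level.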
Mike <none none.com> writes:
On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

 A thought just occurred to me, you are compiling the entire 
 program + object.d right?  Nothing else will link/be linked to 
 the binary?

 If that is the case, you should definitely compile with 
 -fwhole-program.  I suspect that may cut down your compilation 
 time by half or even more.

I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success. Regardless of what I do, the compiler doesn't emit anything except main.

arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc \
    -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 \
    -fwhole-program -Isource/entrypoint -fno-bounds-check \
    -ffunction-sections -fdata-sections \
    source/gcc/attribute.d source/runtime/exception.d \
    source/runtime/invariant.d source/runtime/object.d \
    source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d \
    source/board/package.d source/board/statusLED.d source/board/ltdc.d \
    source/board/random.d source/board/spi5.d source/stm32f42/nvic.d \
    source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d \
    source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d \
    source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d \
    source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d \
    source/main.d -o binary/firmware.o

arm-none-eabi-nm binary/firmware.o
         U _Dmain
         U _d_run_main
00000000 T main

Are you sure this works with multiple modules passed to the compiler on one line?

Mike
Jul 17
"Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

 A thought just occurred to me, you are compiling the entire program +
 object.d right?  Nothing else will link/be linked to the binary?

 If that is the case, you should definitely compile with -fwhole-program.
 I suspect that may cut down your compilation time by half or even more.

 I'm back and have spent the last two days trying to get my project
 compiled with -fwhole-program in an effort to reduce the compile times,
 but I haven't had any success. Regardless of what I do, the compiler
 doesn't emit anything except main.

 arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc \
     -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 \
     -fwhole-program -Isource/entrypoint -fno-bounds-check \
     -ffunction-sections -fdata-sections \
     source/gcc/attribute.d source/runtime/exception.d \
     source/runtime/invariant.d source/runtime/object.d \
     source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d \
     source/board/package.d source/board/statusLED.d source/board/ltdc.d \
     source/board/random.d source/board/spi5.d source/stm32f42/nvic.d \
     source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d \
     source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d \
     source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d \
     source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d \
     source/main.d -o binary/firmware.o

 arm-none-eabi-nm binary/firmware.o
          U _Dmain
          U _d_run_main
 00000000 T main

 Are you sure this works with multiple modules passed to the compiler on
 one line?

Humm, it looks like it can't sufficiently determine that both the _d_run_main declarations are for the same symbol. After making a tweak, it compiles a program that does a little bit more.

https://gist.github.com/ibuclaw/d1fad77b074fded2682a16df93369a84

This is the compiled result:

https://gist.github.com/ibuclaw/93c707f4729ffc4376539283defdc709

Iain.
Jul 17
"Iain Buclaw via D.gnu" <d.gnu puremagic.com> writes:
On 18 July 2017 at 01:19, Iain Buclaw <ibuclaw gdcproject.org> wrote:
 On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu puremagic.com> wrote:
 On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

 A thought just occurred to me, you are compiling the entire program +
 object.d right?  Nothing else will link/be linked to the binary?

 If that is the case, you should definitely compile with -fwhole-program.
 I suspect that may cut down your compilation time by half or even more.

 I'm back and have spent the last two days trying to get my project
 compiled with -fwhole-program in an effort to reduce the compile times,
 but I haven't had any success. Regardless of what I do, the compiler
 doesn't emit anything except main.

 arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc \
     -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 \
     -fwhole-program -Isource/entrypoint -fno-bounds-check \
     -ffunction-sections -fdata-sections \
     source/gcc/attribute.d source/runtime/exception.d \
     source/runtime/invariant.d source/runtime/object.d \
     source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d \
     source/board/package.d source/board/statusLED.d source/board/ltdc.d \
     source/board/random.d source/board/spi5.d source/stm32f42/nvic.d \
     source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d \
     source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d \
     source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d \
     source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d \
     source/main.d -o binary/firmware.o

 arm-none-eabi-nm binary/firmware.o
          U _Dmain
          U _d_run_main
 00000000 T main

 Are you sure this works with multiple modules passed to the compiler on
 one line?

 Humm, it looks like it can't sufficiently determine that both the
 _d_run_main declarations are for the same symbol. After making a tweak,
 it compiles a program that does a little bit more.

In fact, after inspecting the unoptimized result, I think I can safely say this is the entire program, as it is intended to be built.

Now, what can we learn from this? What can and should be done better? Could this be achieved without -fwhole-program?

So far, the compiler really does just force every symbol to be emitted, at the cost of compilation time (no function can be removed during optimization). I think this is really just a workaround for not properly setting the visibility of compiled symbols (there's a lack of documentation in D about this). I think we could do better and be more explicit in setting this up. However, the problem has almost always been templates that get cut but shouldn't have been, and so end up undefined at link time. There are probably a few bug reports to be raised.

Regards
Iain.
Jul 17
Mike <none none.com> writes:
On Monday, 17 July 2017 at 23:57:02 UTC, Iain Buclaw wrote:

 In fact, after inspecting the unoptimized result, I think I can 
 safely say this is the entire program, as it is intended to be 
 built.

If I compile without optimizations, I also get a complete binary, even without the changes you made. But with optimizations, everything gets removed.

Anyway, I'm going to move on from this for now and just try to get a working binary. I believe once I properly implement volatileLoad and volatileStore, I'll be able to get a working binary.

Mike
Jul 18
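
For readers unfamiliar with the volatileLoad and volatileStore mentioned in the last post, here is a minimal sketch of how they are typically used for register access. It assumes a hosted compiler whose druntime provides `core.volatile` (older releases exposed the same functions from `core.bitop`); a runtime-less project like the STM32 demo would have to supply its own declarations. The names `fakeRegister`, `setBits`, and `clearBits` are hypothetical, and an ordinary variable stands in for a memory-mapped register.

```d
import core.volatile : volatileLoad, volatileStore;

// Assumption: stand-in for a memory-mapped peripheral register.
// On real hardware this would be a fixed address from the datasheet.
__gshared uint fakeRegister;

// Read-modify-write that sets the bits in `mask`.
// Each access goes through a volatile intrinsic, so the compiler
// should neither drop it nor merge it with neighbouring accesses.
void setBits(uint* reg, uint mask)
{
    volatileStore(reg, volatileLoad(reg) | mask);
}

// Read-modify-write that clears the bits in `mask`.
void clearBits(uint* reg, uint mask)
{
    volatileStore(reg, volatileLoad(reg) & ~mask);
}
```

Calling `setBits(&fakeRegister, 0b0101)` followed by `clearBits(&fakeRegister, 0b0001)` leaves only bit 2 set. Plain pointer reads and writes would compute the same value here, but an optimizer is free to remove or reorder them, which is a plausible way for register code to vanish at -O2.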