
digitalmars.D - DMD and GDC are unnecessarily using heap allocations for closures

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
Consider the following example:

```D
import std.algorithm, std.range, std.stdio;

long binary_search(long n, long val, long i) {
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}

// A solution for https://atcoder.jp/contests/abc230/tasks/abc230_e
long solve(long n) {
   long ans = 0, i = n;
   while (i != 0) {
     long val = n / i;
     auto cnt = binary_search(n, val, i);
     ans += cnt * val;
     i -= cnt;
   }
   return ans;
}

void main() {
   long ans = 0;
   foreach (n ; 1 .. 100000)
     ans += solve(n);
   writeln(ans);
}
```

Benchmarks with GDC 11.2.0, LDC 1.27.1 (LLVM 12.0.0) and DMD 
2.091.1:
```
$ dmd -O -g -release -inline test.d && time ./test
55836809328

real	0m10.654s
user	0m11.001s
sys	0m0.052s

$ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
55836809328

real	0m6.520s
user	0m6.519s
sys	0m0.000s


$ ldc2 -O -g -release test.d && time ./test
55836809328

real	0m1.904s
user	0m1.903s
sys	0m0.000s
```
LDC produces significantly faster code here, and one of the major contributing factors is that LDC is able to avoid heap allocations (as can be confirmed by running a profiler). It is possible to force LDC to also use heap allocations by adding the '--disable-gc2stack' option, and then the performance drops:

```
$ ldc2 -O -g -release --disable-gc2stack test.d && time ./test
55836809328

real	0m3.621s
user	0m3.620s
sys	0m0.000s
```

So only LDC is doing a proper job here and lives up to its reputation as a state-of-the-art optimizing compiler. But not everything is perfect even with LDC. In another thread (https://forum.dlang.org/thread/t6rijv$nlb$1@digitalmars.com) Steven Schveighoffer mentioned @nogc annotations. Now if I add the @nogc attribute to the 'binary_search' function, then LDC refuses to compile the source code:

```
$ ldc2 -O -g -release test.d && time ./test
test.d(3): Error: function `test.binary_search` is `@nogc` yet allocates closures with the GC
test.d(4):        test.binary_search.__lambda4 closes over variable n at test.d(3)
```
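
For readers unfamiliar with this error, here is a minimal sketch (purely illustrative, not from the thread) of what the frontend is complaining about: whenever a delegate that closes over locals may outlive the enclosing stack frame, the whole frame is moved to the GC heap, and `@nogc` rejects that up front.

```D
// Minimal sketch (assumption: illustrative only, not part of the original post).
// The returned delegate closes over the local `n`, so `n` must outlive this
// call and the frontend puts the frame on the GC heap; marking `makeCounter`
// as @nogc would be rejected for the same reason as `binary_search` above.
int delegate() makeCounter() {
   int n = 0;
   return () => ++n;   // the delegate escapes, so `n` must outlive this call
}

void main() {
   auto next = makeCounter();
   assert(next() == 1);
   assert(next() == 2);
}
```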

What do you think about all of this?
May 29 2022
user1234 <user1234@12.de> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty).

So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.

Not sure if anything can be done about GDC. Iain does not seem to appreciate reports of this kind in GCC bugzilla and I don't want to annoy him unnecessarily.
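
For illustration, a hedged sketch of the manual inlining mentioned above (assumption: this is just one plausible way to write it; with the lambda moved into `solve`, the captured frame belongs to `solve`, so the closure is allocated once per `solve` call instead of once per `binary_search` call):

```D
import std.algorithm, std.range;

long solve(long n) {
   long ans = 0, i = n, val = 0;
   while (i != 0) {
      val = n / i;
      // former binary_search body, inlined by hand:
      long cnt = iota(1, i + 1).map!(x => n / x == val)
                 .assumeSorted.upperBound(false).length;
      ans += cnt * val;
      i -= cnt;
   }
   return ans;
}

void main() {
   assert(solve(10) == 27);   // sum over x in [1, 10] of 10 / x
}
```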
May 30 2022
Iain Buclaw <ibuclaw@gdcproject.org> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 Not sure if anything can be done about GDC. Iain does not seem 
 to appreciate reports of this kind in GCC bugzilla and I don't 
 want to annoy him unnecessarily.
Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
May 30 2022
deadalnix <deadalnix@gmail.com> writes:
On Monday, 30 May 2022 at 10:34:18 UTC, Iain Buclaw wrote:
 On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 Not sure if anything can be done about GDC. Iain does not seem 
 to appreciate reports of this kind in GCC bugzilla and I don't 
 want to annoy him unnecessarily.
Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
It still would benefit from it. All the other transformations that LLVM applies to the code are likely to uncover opportunities to revert GC allocations. For instance:

```d
auto foo() {
    return new Object();
}

void bar() {
    foo();
}
```

Nothing in the code "as is" allows you to remove the allocation, but once foo is inlined into bar, the object can be allocated on the stack (and then optimized away :P).

Maybe this is trivial enough that it could realistically be done in the front end, but I think this makes the point well enough: optimizations often do uncover other optimization opportunities.
May 30 2022
Nicholas Wilson <iamthewilsonator@hotmail.com> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 So far I'm quite happy about the quality of the code generated 
 by LDC and it has never failed me. But I wonder if the @nogc 
 attribute can play a bit nicer with the GC2Stack IR 
 optimization pass? For example, somehow postpone the error 
 reporting in the frontend and only fail if GC2Stack is actually 
 unable to eliminate all GC allocations in the @nogc code. Maybe 
 I can report this to LDC developers.
(LDC dev) I doubt this will be able to be fixed. @nogc is a frontend construct, and it would be difficult, if not impossible, to have it depend on the ability of backend optimisation passes that may or may not be able to remove all calls to the GC (if they are enabled at all).

LDC does have some compiler-specific semantic analysis passes, but all of them run after DMD's. GC2Stack is an IR-level optimisation, not a frontend one, and IR is only generated after all frontend semantic analysis has completed without error.
May 30 2022
Johan <j@j.nl> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.
The `@nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object (unnamed object), but does not require it when a function returns a named object.)

In short: what you propose for LDC should not be done.

-Johan
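
To make the distinction concrete, a minimal sketch (illustrative only, not from the post): `@nogc` is enforced during semantic analysis, before any optimization pass runs, so the following is rejected at every optimization level even though the allocation is trivially dead.

```D
// Minimal sketch (assumption: illustrative only).
int* wasted() @nogc {
   int* p = new int;   // error: `new` allocates with the GC (exact wording varies by compiler)
   return null;        // ...even though `p` is unused and an optimizer would delete it
}
```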
May 30 2022
max haughton <maxhaton@gmail.com> writes:
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:
 On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.
The `@nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object (unnamed object), but does not require it when a function returns a named object.) In short: what you propose for LDC should not be done. -Johan
Ideally the dmd frontend would have a nice, reliable IR to do this analysis properly and then have a more aggressive notion of @nogc, but unfortunately it doesn't. SDC does have an IR, but it's a long way away from being able to worry about this kind of issue. Hypothetically it would be much more feasible to drop into an IR where you can do dataflow analysis properly and then go back to an AST.

Also, the semantics can be specified as a set of test cases that any D compiler must be able to resolve without closures. That way you can rely on it, but you don't have to force the implementations to do a certain thing (other than doing something at all).

I did fiddle around with getting GCC to do this but I'm not sure it can be done without breaking a sweat.
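
A hedged sketch of what one such test case might look like (assumption: this exact case is not specified anywhere today, and, like `binary_search` above, current frontends still allocate a closure for it even though nothing escapes):

```D
import std.algorithm : map, sum;
import std.range : iota;

// The lambda captures `n`, but the whole pipeline is consumed before the
// function returns, so a conforming compiler could keep the frame on the
// stack; a test-suite-style spec would require exactly that.
long sumOfRemainders(long n) {
   return iota(1, n).map!(x => x % n).sum;
}

void main() {
   assert(sumOfRemainders(5) == 1 + 2 + 3 + 4);
}
```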
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:
 The `@nogc` attribute is a language-level guarantee that no GC 
 allocations will happen in that function.
Then maybe another attribute @makebestefforttohavenogcinoptimizedbuild (with a different, shorter name) could be introduced? The GC allocations originating from such functions would be additionally marked as 'please optimize me out' by the frontend. The GC2Stack IR optimization pass would do its normal job, and if some of the marked GC allocations still remain after GC2Stack, then the compiler could spit out a helpful performance warning message. Non-optimized debug builds can do GC allocations without making any noise. And it's okay for any compiler to just ignore this attribute (similar to how 'inline' is handled in the C language).

I think that there are two somewhat different use cases for @nogc:

1. we can't tolerate any GC allocations, because the runtime has no garbage collector at all
2. we can do GC allocations, but strongly desire to avoid them for performance reasons

The current @nogc is apparently designed for use case 1 and is overly conservative. But I'm primarily interested in use case 2.
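
A purely hypothetical sketch of how such an attribute could be spelled (assumption: neither this UDA nor the 'warn if the allocation survives optimization' behavior exists in any D compiler today; the code below merely illustrates the proposal):

```D
import std.algorithm : map;
import std.range : assumeSorted, iota;

struct preferNoGC {}   // hypothetical marker a frontend could recognize

@preferNoGC
long binary_search(long n, long val, long i) {
   // Debug builds may allocate the closure freely; an optimized build would
   // only emit a performance warning if the allocation survives a
   // GC2Stack-style pass.
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}

void main() {
   assert(binary_search(10, 1, 10) == 5);   // 10 / x == 1 for x in [6, 10]
}
```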
 This guarantee can not depend on an optimization, unless any 
 compiler implementation is required to implement that 
 optimization also for unoptimized code: effectively making it 
 no longer an "optimization", but just normal language behavior. 
 (example: C++17 requires copy elision when a function returns a 
 temporary object (unnamed object), but does not require it 
 when a function returns a named object.)
Would it be difficult/practical to implement that optimization (or some simplified form of it) also for unoptimized code in the frontend and make it the normal language behavior?
May 31 2022
max haughton <maxhaton@gmail.com> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
Also it's worth noting that you can actually make some of these range patterns @nogc by using zip(repeat(closureVar), blah).map
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 11:49:44 UTC, max haughton wrote:
 On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
Also it's worth noting that you can actually make some of these range patterns @nogc by using zip(repeat(closureVar), blah).map
Yes, this works and

```D
long binary_search(long n, long val, long i) {
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}
```

can indeed be rewritten as

```D
long binary_search(long n, long val, long i) @nogc {
   struct C { long n, val; }
   return zip(repeat(C(n, val)), iota(1, i + 1))
          .map!(x => x[0].n / x[1] == x[0].val)
          .assumeSorted.upperBound(false).length;
}
```

This results in roughly no performance change for LDC (GC2Stack already took care of it before), a major performance improvement for GDC (6.467s -> 4.553s) and a major performance regression for DMD (10.024s -> 13.733s). I'm not concerned about DMD, because it is known to be bad at optimizing this stuff anyway. But the readability of the code suffers, even though this is without any doubt a usable escape hatch for `@nogc` programs. Thanks a lot for the hint.

Are there any better solutions/workarounds planned in the compiler frontend? As a random experiment I tried to plaster the `scope` keyword everywhere and this unsurprisingly had no effect.
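
For completeness, a hedged sketch of one more workaround that is not mentioned in the thread: since the pipeline is just a binary search over a monotone predicate, it can also be written by hand with no captures at all, which is trivially `@nogc` (assumption: like the range version, it relies on the caller passing val == n / i, which is what makes the `assumeSorted` trick valid in the first place):

```D
long binary_search(long n, long val, long i) @nogc nothrow {
   // n / x is non-increasing in x, so within [1, i] the condition
   // n / x == val holds exactly on a suffix [lo, i]; find lo.
   long lo = 1, hi = i + 1;
   while (lo < hi) {
      immutable mid = lo + (hi - lo) / 2;
      if (n / mid > val)
         lo = mid + 1;
      else
         hi = mid;
   }
   return i + 1 - lo;   // number of x in [1, i] with n / x == val
}

void main() {
   assert(binary_search(10, 1, 10) == 5);   // matches the range-based version
}
```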
Dec 10 2023
Iain Buclaw <ibuclaw@gdcproject.org> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
 55836809328

 real	0m6.520s
 user	0m6.519s
 sys	0m0.000s

 What do you think about all of this?
Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:
 On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
 55836809328

 real	0m6.520s
 user	0m6.519s
 sys	0m0.000s

 What do you think about all of this?
Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
It's statically linked with libphobos. Both GDC and LDC can inline everything here, and one major difference is that LDC is also able to eliminate the GC allocation: https://d.godbolt.org/z/x1jK1M149

Another major difference is that LDC does an extra "cheat": it uses a 32-bit division instruction if the dividend and divisor are small enough. But it only does this trick for '-mcpu=x86-64' (the default) and stops doing it for '-mcpu=native' (which in my case is nehalem): https://d.godbolt.org/z/8ExEqqE41

If everything is manually inlined into the main function, then the benchmarks look like this:

```
$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

       1.947334248 seconds time elapsed
       1.921595000 seconds user
       0.000000000 seconds sys
```

```
$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

       4.600090630 seconds time elapsed
       4.599885000 seconds user
       0.000000000 seconds sys
```

```
$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328

 Performance counter stats for './a.out':

       4.626366827 seconds time elapsed
       4.605163000 seconds user
       0.000000000 seconds sys
```

GDC and LDC become equally fast if the closure allocation overhead is negligible and if LDC does not use 32-bit division instead of 64-bit division.
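
The 32-bit division trick can be sketched by hand (a hedged sketch; assumption: this merely mimics the kind of runtime bypass the backend emits, the actual decision is made inside LLVM's codegen and `fastDiv` is not an LDC intrinsic):

```D
long fastDiv(long a, long b) {
   // If both operands fit into 32 unsigned bits, a 32-bit `div` is much
   // cheaper than the 64-bit one on many x86 CPUs.
   if (((a | b) >>> 32) == 0)
      return cast(uint) a / cast(uint) b;
   return a / b;
}

void main() {
   assert(fastDiv(10, 3) == 10 / 3);
   assert(fastDiv(1L << 40, 3) == (1L << 40) / 3);
}
```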
May 30 2022