
digitalmars.D - DMD and GDC are unnecessarily using heap allocations for closures

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
Consider the following example:

```D
import std.algorithm, std.range, std.stdio;

long binary_search(long n, long val, long i) {
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}

// A solution for https://atcoder.jp/contests/abc230/tasks/abc230_e
long solve(long n) {
   long ans = 0, i = n;
   while (i != 0) {
     long val = n / i;
     auto cnt = binary_search(n, val, i);
     ans += cnt * val;
     i -= cnt;
   }
   return ans;
}

void main() {
   long ans = 0;
   foreach (n ; 1 .. 100000)
     ans += solve(n);
   writeln(ans);
}
```

Benchmarks with GDC 11.2.0, LDC 1.27.1 (LLVM 12.0.0) and DMD 
2.091.1:
```
$ dmd -O -g -release -inline test.d && time ./test
55836809328

real	0m10.654s
user	0m11.001s
sys	0m0.052s

$ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
55836809328

real	0m6.520s
user	0m6.519s
sys	0m0.000s


$ ldc2 -O -g -release test.d && time ./test
55836809328

real	0m1.904s
user	0m1.903s
sys	0m0.000s
```
LDC produces significantly faster code here, and one of the major contributing factors is that LDC is able to avoid heap allocations (as can be confirmed by running a profiler). It is possible to force LDC to also use heap allocations by adding the '--disable-gc2stack' option, and then the performance drops:

```
$ ldc2 -O -g -release --disable-gc2stack test.d && time ./test
55836809328

real	0m3.621s
user	0m3.620s
sys	0m0.000s
```

So only LDC is doing a proper job here and lives up to its reputation as a state-of-the-art optimizing compiler. But not everything is perfect even with LDC. In another thread (https://forum.dlang.org/thread/t6rijv$nlb$1@digitalmars.com) Steven Schveighoffer mentioned @nogc annotations. Now if I add the @nogc attribute to the 'binary_search' function, then LDC refuses to compile the source code:

```
$ ldc2 -O -g -release test.d && time ./test
test.d(3): Error: function `test.binary_search` is `@nogc` yet allocates closures with the GC
test.d(4):        test.binary_search.__lambda4 closes over variable n at test.d(3)
```
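
For readers unfamiliar with this error, here is a minimal sketch (purely illustrative, not from the thread) of what the frontend is complaining about: whenever a delegate that closes over locals may outlive the enclosing stack frame, the whole frame is moved to the GC heap, and `@nogc` rejects that up front.

```D
// Minimal sketch (assumption: illustrative only, not part of the original post).
// The returned delegate closes over the local `n`, so `n` must outlive this
// call and the frontend puts the frame on the GC heap; marking `makeCounter`
// as @nogc would be rejected for the same reason as `binary_search` above.
int delegate() makeCounter() {
   int n = 0;
   return () => ++n;   // the delegate escapes, so `n` must outlive this call
}

void main() {
   auto next = makeCounter();
   assert(next() == 1);
   assert(next() == 2);
}
```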

What do you think about all of this?
May 29 2022
user1234 <user1234@12.de> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty).

So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.

Not sure if anything can be done about GDC. Iain does not seem to appreciate reports of this kind in GCC bugzilla and I don't want to annoy him unnecessarily.
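
For illustration, a hedged sketch of the manual inlining mentioned above (assumption: this is just one plausible way to write it; with the lambda moved into `solve`, the captured frame belongs to `solve`, so the closure is allocated once per `solve` call instead of once per `binary_search` call):

```D
import std.algorithm, std.range;

long solve(long n) {
   long ans = 0, i = n, val = 0;
   while (i != 0) {
      val = n / i;
      // former binary_search body, inlined by hand:
      long cnt = iota(1, i + 1).map!(x => n / x == val)
                 .assumeSorted.upperBound(false).length;
      ans += cnt * val;
      i -= cnt;
   }
   return ans;
}

void main() {
   assert(solve(10) == 27);   // sum over x in [1, 10] of 10 / x
}
```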
May 30 2022
Iain Buclaw <ibuclaw@gdcproject.org> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 Not sure if anything can be done about GDC. Iain does not seem 
 to appreciate reports of this kind in GCC bugzilla and I don't 
 want to annoy him unnecessarily.
Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
May 30 2022
deadalnix <deadalnix@gmail.com> writes:
On Monday, 30 May 2022 at 10:34:18 UTC, Iain Buclaw wrote:
 On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 Not sure if anything can be done about GDC. Iain does not seem 
 to appreciate reports of this kind in GCC bugzilla and I don't 
 want to annoy him unnecessarily.
Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
It still would benefit from it. All the other transformations that LLVM applies to the code are likely to uncover opportunities to revert GC allocations. For instance:

```d
auto foo() {
    return new Object();
}

void bar() {
    foo();
}
```

Nothing in the code "as is" allows you to remove the allocation, but once foo is inlined into bar, the object can be allocated on the stack (and then optimized away :P).

Maybe this is trivial enough that it could realistically be done in the front end, but I think this makes the point well enough: optimizations often do uncover other optimization opportunities.
May 30 2022
Nicholas Wilson <iamthewilsonator@hotmail.com> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 So far I'm quite happy about the quality of the code generated 
 by LDC and it has never failed me. But I wonder if the @nogc 
 attribute can play a bit nicer with the GC2Stack IR 
 optimization pass? For example, somehow postpone the error 
 reporting in the frontend and only fail if GC2Stack is actually 
 unable to eliminate all GC allocations in the @nogc code. Maybe 
 I can report this to LDC developers.
(LDC dev) I doubt this will be able to be fixed. @nogc is a frontend construct, and it would be difficult, if not impossible, to have it depend on the ability of backend optimisation passes that may or may not be able to remove all calls to the GC (if they are enabled at all).

LDC does have some compiler-specific semantic analysis passes, but all of them run after DMD's. GC2Stack is an IR-level optimisation, not a frontend one, and IR is only generated after all frontend semantic analysis has completed without error.
May 30 2022
Johan <j@j.nl> writes:
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.
The `@nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object (unnamed object), but does not require it when a function returns a named object.)

In short: what you propose for LDC should not be done.

-Johan
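
To make the distinction concrete, a minimal sketch (illustrative only, not from the post): `@nogc` is enforced during semantic analysis, before any optimization pass runs, so the following is rejected at every optimization level even though the allocation is trivially dead.

```D
// Minimal sketch (assumption: illustrative only).
int* wasted() @nogc {
   int* p = new int;   // error: `new` allocates with the GC (exact wording varies by compiler)
   return null;        // ...even though `p` is unused and an optimizer would delete it
}
```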
May 30 2022
max haughton <maxhaton@gmail.com> writes:
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:
 On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:
 On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:
 maybe `private long binary_search(long n, long val, long i){...}` will make other compilers think that the stuff is only used locally?
The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the @nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the @nogc code. Maybe I can report this to LDC developers.
The `@nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object (unnamed object), but does not require it when a function returns a named object.) In short: what you propose for LDC should not be done. -Johan
Ideally the dmd frontend would have a nice, reliable IR to do this analysis properly and then have a more aggressive notion of @nogc, but unfortunately it doesn't. SDC does have an IR, but it's a long way away from being able to worry about this kind of issue. Hypothetically it would be much more feasible to drop into an IR where you can do dataflow analysis properly and then go back to an AST.

Also, the semantics can be specified as a set of test cases that any D compiler must be able to resolve without closures. That way you can rely on it, but you don't have to force the implementations to do a certain thing (other than doing something at all).

I did fiddle around with getting GCC to do this but I'm not sure it can be done without breaking a sweat.
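
A hedged sketch of what one such test case might look like (assumption: this exact case is not specified anywhere today, and, like `binary_search` above, current frontends still allocate a closure for it even though nothing escapes):

```D
import std.algorithm : map, sum;
import std.range : iota;

// The lambda captures `n`, but the whole pipeline is consumed before the
// function returns, so a conforming compiler could keep the frame on the
// stack; a test-suite-style spec would require exactly that.
long sumOfRemainders(long n) {
   return iota(1, n).map!(x => x % n).sum;
}

void main() {
   assert(sumOfRemainders(5) == 1 + 2 + 3 + 4);
}
```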
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:
 The `@nogc` attribute is a language-level guarantee that no GC 
 allocations will happen in that function.
Then maybe another attribute @makebestefforttohavenogcinoptimizedbuild (with a different, shorter name) could be introduced? The GC allocations originating from such functions would be additionally marked as 'please optimize me out' by the frontend. The GC2Stack IR optimization pass would do its normal job, and if some of the marked GC allocations still remain after GC2Stack, then the compiler could spit out a helpful performance warning message. Non-optimized debug builds can do GC allocations without making any noise. And it's okay for any compiler to just ignore this attribute (similar to how 'inline' is handled in the C language).

I think that there are two somewhat different use cases for @nogc:

1. we can't tolerate any GC allocations, because the runtime has no garbage collector at all
2. we can do GC allocations, but strongly desire to avoid them for performance reasons

The current @nogc is apparently designed for use case 1 and is overly conservative. But I'm primarily interested in use case 2.
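
A purely hypothetical sketch of how such an attribute could be spelled (assumption: neither this UDA nor the 'warn if the allocation survives optimization' behavior exists in any D compiler today; the code below merely illustrates the proposal):

```D
import std.algorithm : map;
import std.range : assumeSorted, iota;

struct preferNoGC {}   // hypothetical marker a frontend could recognize

@preferNoGC
long binary_search(long n, long val, long i) {
   // Debug builds may allocate the closure freely; an optimized build would
   // only emit a performance warning if the allocation survives a
   // GC2Stack-style pass.
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}

void main() {
   assert(binary_search(10, 1, 10) == 5);   // 10 / x == 1 for x in [6, 10]
}
```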
 This guarantee can not depend on an optimization, unless any 
 compiler implementation is required to implement that 
 optimization also for unoptimized code: effectively making it 
 no longer an "optimization", but just normal language behavior. 
 (example: C++17 requires copy elision when a function returns a 
 temporary object (unnamed object), but does not require it 
 when a function returns a named object.)
Would it be difficult/practical to implement that optimization (or some simplified form of it) also for unoptimized code in the frontend and make it the normal language behavior?
May 31 2022
max haughton <maxhaton@gmail.com> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
Also it's worth noting that you can actually make some of these range patterns @nogc by using zip(repeat(closureVar), blah).map
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 11:49:44 UTC, max haughton wrote:
 On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 Consider the following example:

 ```D
 import std.algorithm, std.range, std.stdio;

 [...]
Also it's worth noting that you can actually make some of these range patterns @nogc by using zip(repeat(closureVar), blah).map
Yes, this works and

```D
long binary_search(long n, long val, long i) {
   return iota(1, i + 1).map!(x => n / x == val)
          .assumeSorted.upperBound(false).length;
}
```

can indeed be rewritten as

```D
long binary_search(long n, long val, long i) @nogc {
   struct C { long n, val; }
   return zip(repeat(C(n, val)), iota(1, i + 1))
          .map!(x => x[0].n / x[1] == x[0].val)
          .assumeSorted.upperBound(false).length;
}
```

This results in roughly no performance change for LDC (GC2Stack already took care of it before), a major performance improvement for GDC (6.467s -> 4.553s) and a major performance regression for DMD (10.024s -> 13.733s). I'm not concerned about DMD, because it is known to be bad at optimizing this stuff anyway. But the readability of the code suffers, even though this is without any doubt a usable escape hatch for `@nogc` programs. Thanks a lot for the hint.

Are there any better solutions/workarounds planned in the compiler frontend? As a random experiment I tried to plaster the `scope` keyword everywhere and this unsurprisingly had no effect.
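
For completeness, a hedged sketch of one more workaround that is not mentioned in the thread: since the pipeline is just a binary search over a monotone predicate, it can also be written by hand with no captures at all, which is trivially `@nogc` (assumption: like the range version, it relies on the caller passing val == n / i, which is what makes the `assumeSorted` trick valid in the first place):

```D
long binary_search(long n, long val, long i) @nogc nothrow {
   // n / x is non-increasing in x, so within [1, i] the condition
   // n / x == val holds exactly on a suffix [lo, i]; find lo.
   long lo = 1, hi = i + 1;
   while (lo < hi) {
      immutable mid = lo + (hi - lo) / 2;
      if (n / mid > val)
         lo = mid + 1;
      else
         hi = mid;
   }
   return i + 1 - lo;   // number of x in [1, i] with n / x == val
}

void main() {
   assert(binary_search(10, 1, 10) == 5);   // matches the range-based version
}
```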
Dec 10 2023
Iain Buclaw <ibuclaw@gdcproject.org> writes:
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
 55836809328

 real	0m6.520s
 user	0m6.519s
 sys	0m0.000s

 What do you think about all of this?
Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
May 30 2022
Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:
 On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
 $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
 55836809328

 real	0m6.520s
 user	0m6.519s
 sys	0m0.000s

 What do you think about all of this?
Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
It's statically linked with libphobos. Both GDC and LDC can inline everything here, and one major difference is that LDC is also able to eliminate the GC allocation: https://d.godbolt.org/z/x1jK1M149

Another major difference is that LDC does an extra "cheat": it uses a 32-bit division instruction if the dividend and divisor are small enough. But it only does this trick for '-mcpu=x86-64' (the default) and stops doing it for '-mcpu=native' (which in my case is nehalem): https://d.godbolt.org/z/8ExEqqE41

If everything is manually inlined into the main function, then the benchmarks look like this:

```
$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

       1.947334248 seconds time elapsed
       1.921595000 seconds user
       0.000000000 seconds sys
```

```
$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

       4.600090630 seconds time elapsed
       4.599885000 seconds user
       0.000000000 seconds sys
```

```
$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328

 Performance counter stats for './a.out':

       4.626366827 seconds time elapsed
       4.605163000 seconds user
       0.000000000 seconds sys
```

GDC and LDC become equally fast if the closure allocation overhead is negligible and if LDC does not use 32-bit division instead of 64-bit division.
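
The 32-bit division trick can be sketched by hand (a hedged sketch; assumption: this merely mimics the kind of runtime bypass the backend emits, the actual decision is made inside LLVM's codegen and `fastDiv` is not an LDC intrinsic):

```D
long fastDiv(long a, long b) {
   // If both operands fit into 32 unsigned bits, a 32-bit `div` is much
   // cheaper than the 64-bit one on many x86 CPUs.
   if (((a | b) >>> 32) == 0)
      return cast(uint) a / cast(uint) b;
   return a / b;
}

void main() {
   assert(fastDiv(10, 3) == 10 / 3);
   assert(fastDiv(1L << 40, 3) == (1L << 40) / 3);
}
```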
May 30 2022