digitalmars.D - DMD and GDC are unnecessarily using heap allocations for closures
- Siarhei Siamashka (72/72) May 29 2022 Consider the following example:
- user1234 (4/8) May 30 2022 maybe `private long binary_search(long n, long val, long i){...}`
- Siarhei Siamashka (15/18) May 30 2022 The code can be massaged in various ways to reduce the allocation
- Iain Buclaw (5/8) May 30 2022 Since when have I said that? Maybe I'm missing a context. :-)
- deadalnix (19/27) May 30 2022 It still would benefit from it. All the other transformation that
- Nicholas Wilson (10/17) May 30 2022 (LDC dev) I doubt this will be able to be fixed. @nogc is a
- Johan (11/26) May 30 2022 The `@nogc` attribute is a language-level guarantee that no GC
- max haughton (14/43) May 30 2022 Ideally the dmd frontend would have a nice reliable IR to do this
- Siarhei Siamashka (23/32) May 31 2022 Then maybe another attribute
- max haughton (3/7) May 30 2022 Also it's worth noting you can actually make some of these range
- Siarhei Siamashka (29/38) Dec 10 2023 Yes, this works and
- Iain Buclaw (4/10) May 30 2022 Out of curiosity, are you linking in phobos statically or
- Siarhei Siamashka (90/102) May 30 2022 It's statically linked with libphobos. Both GDC and LDC can
Consider the following example: ```D import std.algorithm, std.range, std.stdio; long binary_search(long n, long val, long i) { return iota(1, i + 1).map!(x => n / x == val).assumeSorted.upperBound(false).length; } // A solution for https://atcoder.jp/contests/abc230/tasks/abc230_e long solve(long n) { long ans = 0, i = n; while (i != 0) { long val = n / i; auto cnt = binary_search(n, val, i); ans += cnt * val; i -= cnt; } return ans; } void main() { long ans = 0; foreach (n ; 1 .. 100000) ans += solve(n); writeln(ans); } ``` Benchmarks with GDC 11.2.0, LDC 1.27.1 (LLVM 12.0.0) and DMD 2.091.1: ``` $ dmd -O -g -release -inline test.d && time ./test 55836809328 real 0m10.654s user 0m11.001s sys 0m0.052s $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out 55836809328 real 0m6.520s user 0m6.519s sys 0m0.000s $ ldc2 -O -g -release test.d && time ./test 55836809328 real 0m1.904s user 0m1.903s sys 0m0.000s ``` LDC produces significantly faster code here and one of the major contributing factors is that LDC is able to avoid heap allocations (as can be confirmed by running a profiler). It is possible to force LDC to also use heap allocations via adding '--disable-gc2stack' option and the performance drops: ``` $ ldc2 -O -g -release --disable-gc2stack test.d && time ./test 55836809328 real 0m3.621s user 0m3.620s sys 0m0.000s ``` So only LDC is doing a proper job and lives up to its state of the art optimizing compiler reputation. But not everything is perfect even with LDC. In another thread https://forum.dlang.org/thread/t6rijv$nlb$1 digitalmars.com Steven Schveighoffer mentioned nogc annotations. Now if I add nogc attribute to 'binary_search' function, then LDC refuses to compile the source code: ``` $ ldc2 -O -g -release test.d && time ./test test.d(3): Error: function `test.binary_search` is ` nogc` yet allocates closures with the GC test.d(4): test.binary_search.__lambda4 closes over variable n at test.d(3) ``` What do you think about all of this?
May 29 2022
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:Consider the following example: ```D import std.algorithm, std.range, std.stdio; [...]maybe `private long binary_search(long n, long val, long i){...}` will make think other compilers that the stuff is only used localy ?
May 30 2022
On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:maybe `private long binary_search(long n, long val, long i){...}` will make think other compilers that the stuff is only used localy ?The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the nogc code. Maybe I can report this to LDC developers. Not sure if anything can be done about GDC. Iain does not seem to appreciate reports of this kind in GCC bugzilla and I don't want to annoy him unnecessarily.
May 30 2022
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:Not sure if anything can be done about GDC. Iain does not seem to appreciate reports of this kind in GCC bugzilla and I don't want to annoy him unnecessarily.Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
May 30 2022
On Monday, 30 May 2022 at 10:34:18 UTC, Iain Buclaw wrote:On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:It still would benefit from it. All the other transformation that LLVM will apply to the code are likely to uncover opportunities to revert GC allocations. For instance: ```d auto foo() { return new Object(); } void bar() { foo(); } ``` Nothing in the code "as this" allow you to remove the allocation, but once foo is inlined in bar, then the object can be allocated on stack (and then optimized away :P). Maybe this is trivial enough so that it could realistically be done in the front end, but I think this makes the point well enough: optimizations often do uncover other optimization opportunities.Not sure if anything can be done about GDC. Iain does not seem to appreciate reports of this kind in GCC bugzilla and I don't want to annoy him unnecessarily.Since when have I said that? Maybe I'm missing a context. :-) IMO the DMD front-end should be fixed to not tell GDC and LDC to allocate a closure in the first place, then LDC wouldn't have the need for a custom pass to revert the GC allocations.
May 30 2022
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the nogc code. Maybe I can report this to LDC developers.(LDC dev) I doubt this will be able to be fixed. nogc is a frontend construct and would be difficult, if not impossible, to have it depend on the ability of backend optimisation passes that may or may not be able to remove all calls to the GC (if they are enabled at all). LDC does have some specific semantic analysis passes but all of them are done after DMD's ones. GC2stack is an IR level optimisation not a frontend one and IR is only generated after all frontend semantic analysis is completed without error.
May 30 2022
On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:The ` nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object+ (unnamed object), but does not require it when a function returns a named object.) In short: what you propose for LDC should not be done. -Johanmaybe `private long binary_search(long n, long val, long i){...}` will make think other compilers that the stuff is only used localy ?The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the nogc code. Maybe I can report this to LDC developers.
May 30 2022
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:On Monday, 30 May 2022 at 10:24:12 UTC, Siarhei Siamashka wrote:Ideally the dmd frontend would have a nice reliable IR to do this analysis properly and then have a more aggressive notion of nogc, but it doesn't unfortunately. SDC does have an IR but it's a long way away from being able to worry about this kind of issue. Hypothetically it would be much more feasible to drop into an IR where you can do dataflow analysis properly then go back to an AST. Also, the semantics can be specified as a set of test cases any D compiler must be able to resolve without closures. That way you can rely on it but don't have to force the implementations to do a certain thing (other than doing something at all). I did fiddle around with getting GCC to do this but I'm not sure it can be done without breaking a sweat.On Monday, 30 May 2022 at 07:42:29 UTC, user1234 wrote:The ` nogc` attribute is a language-level guarantee that no GC allocations will happen in that function. This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object+ (unnamed object), but does not require it when a function returns a named object.) In short: what you propose for LDC should not be done. -Johanmaybe `private long binary_search(long n, long val, long i){...}` will make think other compilers that the stuff is only used localy ?The code can be massaged in various ways to reduce the allocation overhead. For example, the whole 'binary_search' function is not necessary and the code from it can be manually inlined (this reduces the allocation penalty). So far I'm quite happy about the quality of the code generated by LDC and it has never failed me. But I wonder if the nogc attribute can play a bit nicer with the GC2Stack IR optimization pass? For example, somehow postpone the error reporting in the frontend and only fail if GC2Stack is actually unable to eliminate all GC allocations in the nogc code. Maybe I can report this to LDC developers.
May 30 2022
On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:The ` nogc` attribute is a language-level guarantee that no GC allocations will happen in that function.Then maybe another attribute makebestefforttohavenocginoptimizedbuild (with a different shorter name) could be introduced? So that the GC allocations originating from such functions would be additionally marked as 'please optimize me out' by the frontend. The GC2Stack IR optimization pass would do its normal job. And if some of such marked GC allocations still remain after GC2Stack, then the compiler could spit out a helpful performance warning message. The non-optimized debug builds can do GC allocations without making any noise. And its okay for any compiler to just ignore this attribute (similar to how 'inline' is handled in C language). I think that there are two somewhat different use cases for nogc: 1. we can't tolerate any GC allocations, because the runtime has no garbage collector at all 2. we can do GC allocations, but strongly desire to avoid them for performance reasons The current nogc is apparently designed for the use case 1 and is overly conservative. But I'm primarily interested in the use case 2.This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object+ (unnamed object), but does not require it when a function returns a named object.)Would it be difficult/practical to implement that optimization (or some simplified form of it) also for unoptimized code in the frontend and make it the normal language behavior?
May 31 2022
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:Consider the following example: ```D import std.algorithm, std.range, std.stdio; [...]Also it's worth noting you can actually make some of these range patterns nogc via using zip(repeat(closureVar), blah).map
May 30 2022
On Monday, 30 May 2022 at 11:49:44 UTC, max haughton wrote:On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:Yes, this works and ```D long binary_search(long n, long val, long i) { return iota(1, i + 1).map!(x => n / x == val).assumeSorted.upperBound(false).length; } ``` can be indeed rewritten as ```D long binary_search(long n, long val, long i) nogc { struct C { long n, val; } return zip(repeat(C(n, val)), iota(1, i + 1)).map!(x => x[0].n / x[1] == x[0].val) .assumeSorted.upperBound(false).length; } ``` Results in roughly no performance changes for LDC thanks to GC2Stack already taking care of it before, a major performance improvement for GDC (6.467s -> 4.553s) and a major performance regression for DMD (10.024s -> 13.733s). I'm not concerned about DMD, because it is known to be bad at optimizing this stuff anyway. But the readability of the code suffers, even though this is without any doubt a usable escape hatch for ` nogc` programs. Thanks a lot for the hint. Are there no other better solutions/workarounds planned in the compiler frontend? As a random experiment I tried to plaster the `scope` keyword everywhere and this unsurprisingly had no effect.Consider the following example: ```D import std.algorithm, std.range, std.stdio; [...]Also it's worth noting you can actually make some of these range patterns nogc via using zip(repeat(closureVar), blah).map
Dec 10 2023
On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:$ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out 55836809328 real 0m6.520s user 0m6.519s sys 0m0.000s What do you think about all of this?Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
May 30 2022
On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:It's statically linked with libphobos. Both GDC and LDC can inline everything here. And one major difference is that LDC is also able to eliminate GC allocation: https://d.godbolt.org/z/x1jK1M149 Another major difference is that LDC does some extra "cheat" to use a 32-bit division instruction if dividend and divisor are small enough. But it only does this trick for '-mcpu=x86-64' (default) and stops doing it for '-mcpu=native' (which in my case is nehalem): https://d.godbolt.org/z/8ExEqqE41 If everything is manually inlined into main function, then benchmarks look like this: ``` $ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2 55836809328 Performance counter stats for './test2': utilized frontend cycles idle backend cycles idle per cycle stalled cycles per insn all branches 1.947334248 seconds time elapsed 1.921595000 seconds user 0.000000000 seconds sys ``` ``` $ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2 55836809328 Performance counter stats for './test2': utilized frontend cycles idle backend cycles idle per cycle stalled cycles per insn all branches 4.600090630 seconds time elapsed 4.599885000 seconds user 0.000000000 seconds sys ``` ``` $ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out 55836809328 Performance counter stats for './a.out': utilized frontend cycles idle backend cycles idle per cycle stalled cycles per insn all branches 4.626366827 seconds time elapsed 4.605163000 seconds user 0.000000000 seconds sys ``` GDC and LDC become equally fast if closure allocations overhead is negligible and if LDC does not use 32-bit division instead of 64-bit one.$ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out 55836809328 real 0m6.520s user 0m6.519s sys 0m0.000s What do you think about all of this?Out of curiosity, are you linking in phobos statically or dynamically? You can force either with `-static-libphobos` or `-shared-libphobos`.
May 30 2022