digitalmars.D.learn - vectorization of a simple loop -- not in DMD?

reply Ivan Kazmenko <gassa mail.ru> writes:
Hi.

I'm looking at the compiler output of DMD (-O -release), LDC (-O 
-release), and GDC (-O3) for a simple array operation:

```
void add1 (int [] a)
{
     foreach (i; 0..a.length)
         a[i] += 1;
}
```

Here are the outputs: https://godbolt.org/z/GcznbjEaf

From what I gather at the view linked above, DMD does not use XMM 
registers for speedup, and does not unroll the loop either.  
Switching between 32bit and 64bit doesn't help either.  However, 
I recall in the past it was capable of at least some of these 
optimizations.  So, how do I enable them for such a function?

Ivan Kazmenko.
Jul 11 2022
next sibling parent reply max haughton <maxhaton gmail.com> writes:
On Monday, 11 July 2022 at 18:15:16 UTC, Ivan Kazmenko wrote:
 Hi.

 I'm looking at the compiler output of DMD (-O -release), LDC 
 (-O -release), and GDC (-O3) for a simple array operation:

 ```
 void add1 (int [] a)
 {
     foreach (i; 0..a.length)
         a[i] += 1;
 }
 ```

 Here are the outputs: https://godbolt.org/z/GcznbjEaf

 From what I gather at the view linked above, DMD does not use 
 XMM registers for speedup, and does not unroll the loop either. 
  Switching between 32bit and 64bit doesn't help either.  
 However, I recall in the past it was capable of at least some 
 of these optimizations.  So, how do I enable them for such a 
 function?

 Ivan Kazmenko.
How long ago is the past? The godbolt.org dmd is quite old. The dmd backend is ancient, it isn't really capable of these kinds of loop optimizations.
Jul 11 2022
parent reply IGotD- <nise nise.com> writes:
On Monday, 11 July 2022 at 18:19:41 UTC, max haughton wrote:
 The dmd backend is ancient, it isn't really capable of these 
 kinds of loop optimizations.
I've said it several times before. Just deprecate the DMD backend, it's just not up to the task anymore. This is not criticism of its original purpose: back in the 90s and early 2000s it made sense to create your own backend. Time has moved on and we have LLVM and GCC backends with a lot of CPU support that the D project could never achieve by itself. The D project should just can the DMD backend in order to free up resources for more important tasks. Some people say they like it because it is fast, yes it is fast because it doesn't do much.
Jul 11 2022
next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
On Monday, 11 July 2022 at 21:46:10 UTC, IGotD- wrote:
 On Monday, 11 July 2022 at 18:19:41 UTC, max haughton wrote:
 The dmd backend is ancient, it isn't really capable of these 
 kinds of loop optimizations.
 I've said it several times before. Just deprecate the DMD backend, it's just not up to the task anymore. This is not criticism of its original purpose: back in the 90s and early 2000s it made sense to create your own backend. Time has moved on and we have LLVM and GCC backends with a lot of CPU support that the D project could never achieve by itself. The D project should just can the DMD backend in order to free up resources for more important tasks. Some people say they like it because it is fast, yes it is fast because it doesn't do much.
I use D because DMD compiles my huge project in ~1 second (full clean rebuild).

It is a competitive advantage that many languages don't have.

LDC clean full rebuild
```
$ time dub build -f --compiler=ldc2
Performing "debug" build using ldc2 for x86_64.
game ~master: building configuration "desktop"...
Linking...

real    0m18.033s
user    0m0.000s
sys     0m0.015s
```

LDC incremental
```
$ time dub build --compiler=ldc2
Performing "debug" build using ldc2 for x86_64.
game ~master: building configuration "desktop"...
Linking...

real    0m17.215s
user    0m0.000s
sys     0m0.000s
```

DMD clean full rebuild
```
$ time dub build -f --compiler=dmd
Performing "debug" build using dmd for x86_64.
game ~master: building configuration "desktop"...
Linking...

real    0m1.348s
user    0m0.031s
sys     0m0.015s
```

DMD incremental
```
$ time dub build --compiler=dmd
Performing "debug" build using dmd for x86_64.
game ~master: building configuration "desktop"...
Linking...

real    0m1.249s
user    0m0.000s
sys     0m0.000s
```

The day DMD gets removed is the day i will go to a different language.

I want to thank Walter for maintaining DMD the compiler, and making it incredibly fast at compiling code.

Its release performance can't beat LLVM and its many optimizations, but the advantage is that it allows VERY FAST and QUICK iteration time, which is ESSENTIAL for developing software.
Jul 11 2022
next sibling parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Monday, 11 July 2022 at 22:16:05 UTC, ryuukk_ wrote:
 I use D because DMD compiles my huge project in ~1 second (full 
 clean rebuild)

 It is a competitive advantage that many languages don't have
Other programming languages typically use an interpreter for quick iterations and rapid development. For example, the Python programming language has the CPython interpreter, the PyPy just-in-time compiler, and the Cython optimizing static compiler (not perfect right now, but shows a lot of promise). D still has a certain advantage over interpreters, because DMD-generated code is typically up to twice as slow as LDC-generated code, and no slower than that. If the x86 architecture stops being dominant in the future and gets displaced by ARM or RISC-V, then this may become a problem for DMD. But we'll cross that bridge when we get there.
Jul 11 2022
prev sibling parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Monday, 11 July 2022 at 22:16:05 UTC, ryuukk_ wrote:
 LDC clean full rebuild
 ```
 $ time dub build -f --compiler=ldc2
 Performing "debug" build using ldc2 for x86_64.
 game ~master: building configuration "desktop"...
 Linking...

 real    0m18.033s
 user    0m0.000s
 sys     0m0.015s
 ```

 DMD clean full rebuild
 ```
 $ time dub build -f --compiler=dmd
 Performing "debug" build using dmd for x86_64.
 game ~master: building configuration "desktop"...
 Linking...

 real    0m1.348s
 user    0m0.031s
 sys     0m0.015s
 ```
BTW, I'm very curious about investigating the reason for such a huge build time difference, but can't reproduce it on my computer. For example, compiling the DUB source code itself via the same DUB commands only results in DMD building roughly twice as fast (which is great, but nowhere close to a ~13x difference):

```
$ git clone https://github.com/dlang/dub.git
$ cd dub
```

```
$ time dub build -f --compiler=ldc2
Performing "debug" build using ldc2 for x86_64.
dub 1.29.1+commit.38.g7f6f024f: building configuration "application"...
Serializing composite type Flags!(BuildRequirement) which has no serializable fields
Serializing composite type Flags!(BuildOption) which has no serializable fields
Linking...

real	0m34.371s
user	0m32.883s
sys	0m1.488s
```

```
$ time dub build -f --compiler=dmd
Performing "debug" build using dmd for x86_64.
dub 1.29.1+commit.38.g7f6f024f: building configuration "application"...
Serializing composite type Flags!(BuildRequirement) which has no serializable fields
Serializing composite type Flags!(BuildOption) which has no serializable fields
Linking...

real	0m14.078s
user	0m12.941s
sys	0m1.129s
```

Is there an open source DUB package that can be used to reproduce a huge build time difference between LDC and DMD?
Jul 12 2022
next sibling parent reply bauss <jacobbauss gmail.com> writes:
On Tuesday, 12 July 2022 at 07:06:37 UTC, Siarhei Siamashka wrote:
 ```

 real	0m34.371s
 user	0m32.883s
 sys	0m1.488s
 ```
 ```

 real	0m14.078s
 user	0m12.941s
 sys	0m1.129s

 ```

 Is there an open source DUB package, which can be used to 
 reproduce a huge build time difference between LDC and DMD?
You don't think this difference is huge? DMD is over 2x as fast.
Jul 12 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Tuesday, 12 July 2022 at 07:58:44 UTC, bauss wrote:
 You don't think this difference is huge? DMD is over 2x as fast.
I think that DMD having more than 10x faster compilation speed in ryuukk_'s project shows that there is likely either a misconfiguration in DUB build setup or some other low hanging fruit for LDC. This looks like an opportunity to easily improve something in a major way.
Jul 12 2022
parent ryuukk_ <ryuukk.dev gmail.com> writes:
On Tuesday, 12 July 2022 at 09:18:02 UTC, Siarhei Siamashka wrote:
 On Tuesday, 12 July 2022 at 07:58:44 UTC, bauss wrote:
 You don't think this difference is huge? DMD is over 2x as 
 fast.
I think that DMD having more than 10x faster compilation speed in ryuukk_'s project shows that there is likely either a misconfiguration in DUB build setup or some other low hanging fruit for LDC. This looks like an opportunity to easily improve something in a major way.
You were right! Looks like i accidentally put a dflags entry (O3) into the debug config for ldc!

```
$ time dub build -f --compiler=ldc2
Performing "debug" build using ldc2 for x86_64.
game ~master: building configuration "desktop"...
Linking...
   Creating library .dub\build\desktop-debug-windows-x86_64-ldc_v1.30.0-beta1-4B08B3C693144187830F0F15271A53A3\game.lib and object .dub\build\desktop-debug-windows-x86_64-ldc_v1.30.0-beta1-4B08B3C693144187830F0F15271A53A3\game.exp
LINK : warning LNK4098: defaultlib 'libvcruntime' conflicts with use of other libs; use /NODEFAULTLIB:library

real    0m4.521s
user    0m0.000s
sys     0m0.000s
```

Incremental:
```
$ time dub build --compiler=ldc2
Performing "debug" build using ldc2 for x86_64.
game ~master: building configuration "desktop"...
Linking...
   Creating library .dub\build\desktop-debug-windows-x86_64-ldc_v1.30.0-beta1-4B08B3C693144187830F0F15271A53A3\game.lib and object .dub\build\desktop-debug-windows-x86_64-ldc_v1.30.0-beta1-4B08B3C693144187830F0F15271A53A3\game.exp
LINK : warning LNK4098: defaultlib 'libvcruntime' conflicts with use of other libs; use /NODEFAULTLIB:library

real    0m4.516s
user    0m0.015s
sys     0m0.000s
```

Here is the updated result: down to 4.5 sec.
Jul 12 2022
prev sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
How do i achieve fast compile speed (results above were on 
windows, on linux i get much faster results):

I maintain healthy project management:

- Templates ONLY when necessary and when the cost is worth the 
time saved in the long term
   - this is why i try to lobby for builtin tagged union instead 
of std.sumtype

- Dependencies: only dependencies WITHOUT dependencies of their 
own, and i keep them at the bare minimum!

- Imports of std: i simply don't, and i copy/paste functions i 
need

- I avoid dub packages, instead i prefer import/src path, and i 
cherry-pick what i need
Jul 12 2022
parent reply bauss <jacobbauss gmail.com> writes:
On Tuesday, 12 July 2022 at 10:32:36 UTC, ryuukk_ wrote:
 How do i achieve fast compile speed (results above were on 
 windows, on linux i get much faster results):

 I maintain healthy project management:
This
 - Imports of std: i simply don't, and i copy/paste functions i 
 need

 - I avoid dub packages, instead i prefer import/src path, and i 
 cherry-pick what i need
And this. Can be argued to not be healthy project management. Of course if you're alone it doesn't matter, but if it's a larger project that will have multiple maintainers then it will never work and will tarnish the project entirely.
Jul 12 2022
parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
On Tuesday, 12 July 2022 at 12:47:26 UTC, bauss wrote:
 Of course if you're alone it doesn't matter, but if it's a 
 larger project that will have multiple maintainers then it will 
 never work and will tarnish the project entirely.
That's true, i work solo on my project so it doesn't bother me.

It definitely is something hard to balance.

But one sure thing is that it's something you have to monitor every so often; if you don't, you end up with poor build speed that's harder to fix.

I wonder if DMD/LDC/GDC have built in tools to profile and track performance.

Rust has this: https://perf.rust-lang.org/

Maybe we need to do something similar.
Jul 12 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Tuesday, 12 July 2022 at 13:23:36 UTC, ryuukk_ wrote:
 I wonder if DMD/LDC/GDC have built in tools to profile and 
 track performance
Linux has a decent system wide profiler: https://perf.wiki.kernel.org/index.php/Main_Page And there are other useful tools, such as callgrind. To take advantage of all these tools, DMD/LDC/GDC only need to provide debugging symbols in the generated binaries, which they already do. Profiling applications to identify performance bottlenecks is very easy nowadays.
 Rust has this: https://perf.rust-lang.org/

 Maybe we need to do something similar
What is this website? Are they tracking performance differences between different versions of Rust or something? Like ensuring that the compile time does not regress without them noticing this immediately?
Jul 13 2022
parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
On Thursday, 14 July 2022 at 05:30:58 UTC, Siarhei Siamashka 
wrote:
 On Tuesday, 12 July 2022 at 13:23:36 UTC, ryuukk_ wrote:
 I wonder if DMD/LDC/GDC have built in tools to profile and 
 track performance
 Linux has a decent system wide profiler: https://perf.wiki.kernel.org/index.php/Main_Page And there are other useful tools, such as callgrind. To take advantage of all these tools, DMD/LDC/GDC only need to provide debugging symbols in the generated binaries, which they already do. Profiling applications to identify performance bottlenecks is very easy nowadays.

I am not talking about linux, and i am not talking about 3rd party tools.

I am talking about the developers of DMD/LDC/GDC: do they profile the compilers? Do they provide ways to monitor/track performance? Do they benchmark specific parts of the compilers?

I am not talking about the output of valgrind.

Zig also has: https://ziglang.org/perf/ (very slow to load)

Having such a thing is more useful than being able to plug valgrind god knows how into the compiler and try to decipher what does what and what results correspond to what internally. And what about a graph over time to catch regressions?

DMD is very fast at compiling code, so i guess Walter is doing enough work to monitor all of that.

LDC, on the other hand... they'd benefit a lot from having such a thing in place.
Jul 14 2022
parent max haughton <maxhaton gmail.com> writes:
On Thursday, 14 July 2022 at 13:00:24 UTC, ryuukk_ wrote:
 On Thursday, 14 July 2022 at 05:30:58 UTC, Siarhei Siamashka 
 wrote:
 On Tuesday, 12 July 2022 at 13:23:36 UTC, ryuukk_ wrote:
 I wonder if DMD/LDC/GDC have built in tools to profile and 
 track performance
 Linux has a decent system wide profiler: https://perf.wiki.kernel.org/index.php/Main_Page And there are other useful tools, such as callgrind. To take advantage of all these tools, DMD/LDC/GDC only need to provide debugging symbols in the generated binaries, which they already do. Profiling applications to identify performance bottlenecks is very easy nowadays.

 I am not talking about linux, and i am not talking about 3rd party tools.

 I am talking about the developers of DMD/LDC/GDC: do they profile the compilers? Do they provide ways to monitor/track performance? Do they benchmark specific parts of the compilers?

 I am not talking about the output of valgrind.

 Zig also has: https://ziglang.org/perf/ (very slow to load)

 Having such a thing is more useful than being able to plug valgrind god knows how into the compiler and try to decipher what does what and what results correspond to what internally. And what about a graph over time to catch regressions?

 DMD is very fast at compiling code, so i guess Walter is doing enough work to monitor all of that.

 LDC, on the other hand... they'd benefit a lot from having such a thing in place.
Running valgrind on the compiler is completely trivial. Builtin profilers are often terrible. LDC and GDC and dmd all have instrumenting profilers builtin, of varying quality. gprof in particular is somewhat infamous. dmd isn't particularly fast, it does a lot of unnecessary work. LDC is slow because LLVM is slow. We need a graph over time, yes.
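
As an illustration of the built-in instrumenting profilers mentioned above, here is a minimal sketch that assumes DMD's `-profile` switch. The function `sumSquares` and the file name `app.d` are hypothetical, chosen only so the profiler has something to count. This profiles a compiled program rather than the compiler itself, and it does not provide the graph over time discussed in this thread.

```d
// Hypothetical file name: app.d
// Build with instrumentation (assumed command line):
//     dmd -profile app.d
// Running the resulting binary should write trace.log and trace.def
// to the current directory, with per-function call counts and timings.
import std.stdio : writeln;

// Hypothetical workload: sum of i*i for i in [0, n).
ulong sumSquares(uint n)
{
    ulong total = 0;
    foreach (i; 0 .. n)
        total += cast(ulong) i * i; // widen first to avoid 32-bit overflow
    return total;
}

void main()
{
    writeln(sumSquares(1_000)); // prints 332833500
}
```
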
Jul 14 2022
prev sibling parent bachmeier <no spam.net> writes:
On Monday, 11 July 2022 at 21:46:10 UTC, IGotD- wrote:

 Just deprecate the DMD backend, it's just not up to the 
 task anymore.
Just deprecate LDC and GDC. They compile slowly and are unlikely to ever deliver fast compile times, due to their design.
 Some people say they like it because it is fast, yes it is fast 
 because it doesn't do much.
If it produces code that's fast enough, there is zero benefit to using a different compiler. If you use a development workflow that's heavy on compilation, stay away from LDC or GDC until you're done - and even then, you might not have any motivation to use either.
Jul 12 2022
prev sibling next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Monday, 11 July 2022 at 18:15:16 UTC, Ivan Kazmenko wrote:
 Hi.

 I'm looking at the compiler output of DMD (-O -release), LDC 
 (-O -release), and GDC (-O3) for a simple array operation:

 ```
 void add1 (int [] a)
 {
     foreach (i; 0..a.length)
         a[i] += 1;
 }
 ```

 Here are the outputs: https://godbolt.org/z/GcznbjEaf

 From what I gather at the view linked above, DMD does not use 
 XMM registers for speedup, and does not unroll the loop either.
[snip] Specifying a SIMD capable target will reveal an even wider gap in capability. (LDC -mcpu=x86-64-v3 or gdc -march=x86-64-v3).
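
For comparing the compilers under the flags mentioned in this thread, a small self-contained sketch may help. It puts the loop from the question next to the same operation written as a D array operation (`a[] += 1`). The name `add1ArrayOp` is made up for this example. Array operations are implemented in runtime library code that may use vector instructions on some compilers and targets; whether that actually happens is an assumption to check in the disassembly, not something this sketch demonstrates.

```d
// Compile with the flags quoted in the thread and inspect the output, e.g.:
//     dmd  -O -release
//     ldc2 -O -release -mcpu=x86-64-v3
//     gdc  -O3 -march=x86-64-v3
import std.stdio : writeln;

// The loop from the question, unchanged.
void add1(int[] a)
{
    foreach (i; 0 .. a.length)
        a[i] += 1;
}

// Same effect written as a whole-slice array operation.
// (Hypothetical name, not from the original post.)
void add1ArrayOp(int[] a)
{
    a[] += 1;
}

void main()
{
    int[] x = [1, 2, 3, 4, 5, 6, 7];
    int[] y = x.dup;      // independent copy for the second variant
    add1(x);
    add1ArrayOp(y);
    assert(x == y);       // both variants must agree
    writeln(y);           // prints [2, 3, 4, 5, 6, 7, 8]
}
```
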
Jul 11 2022
prev sibling parent z <z z.com> writes:
On Monday, 11 July 2022 at 18:15:16 UTC, Ivan Kazmenko wrote:
 Hi.

 I'm looking at the compiler output of DMD (-O -release), LDC 
 (-O -release), and GDC (-O3) for a simple array operation:

 ```
 void add1 (int [] a)
 {
     foreach (i; 0..a.length)
         a[i] += 1;
 }
 ```

 Here are the outputs: https://godbolt.org/z/GcznbjEaf

 From what I gather at the view linked above, DMD does not use 
 XMM registers for speedup, and does not unroll the loop either. 
  Switching between 32bit and 64bit doesn't help either.  
 However, I recall in the past it was capable of at least some 
 of these optimizations.  So, how do I enable them for such a 
 function?

 Ivan Kazmenko.
No, not in DMD. DMD generates what looks like 32-bit code adapted to x86_64.

LDC may optimize this kind of loop with a three-way branch depending on how many array elements remain, but it can generate both very good loop code (particularly when AVX-512 is available and the struct/data arrangement in memory is unfavorable for SIMD) and very questionable code.

You may be losing performance for obscure reasons that look like gnomes decided to steal your precious CPU cycles, and when that happens there is no way to fix it other than manually going in with a disassembler/debugger, changing defective optimizations in hot code paths to something faster, and then saving the result back to the executable file. (Yikes, i know.)
Jul 14 2022