
digitalmars.D - Performance test of short-circuiting AliasSeq

reply Stefan Koch <uplink.coder googlemail.com> writes:
Hi,

So I asked myself whether the PR 
https://github.com/dlang/dmd/pull/11057

which complicated the compiler internals and got pulled on the 
basis that it would increase performance, did actually increase 
performance.
So I ran it on a staticMap benchmark which it should speed up.

I am going to use the same benchmark as I used for
https://forum.dlang.org/post/kktulflpozdrsxeinfbg forum.dlang.org

Firstly, let me post the results of the benchmark without Walter's 
patch applied.

---

   Time (mean ± σ):     436.0 ms ±  14.6 ms    [User: 384.0 ms, 
System: 51.8 ms]
   Range (min … max):   413.5 ms … 476.7 ms    40 runs


   Time (mean ± σ):     219.9 ms ±   5.6 ms    [User: 210.0 ms, 
System: 10.0 ms]
   Range (min … max):   208.6 ms … 235.9 ms    40 runs


   Time (mean ± σ):     330.4 ms ±   7.1 ms    [User: 290.8 ms, 
System: 39.5 ms]
   Range (min … max):   316.6 ms … 345.7 ms    40 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.50 ± 0.05 times faster than './dmd.sh sm.d -version=Walter'
     1.98 ± 0.08 times faster than './dmd.sh sm.d'

---

If you care to compare the timings at the bottom, they pretty much 
match the results I measured in the benchmark previously and 
posted in the thread above.
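As a side note on reading those summary lines: the "1.98 ± 0.08 times faster" figures can be reproduced from the quoted means and standard deviations. The ± term appears to come from first-order error propagation for a quotient (that formula is my assumption about how hyperfine derives it, not something stated in this thread):

```python
import math

def speedup(mean_slow, sigma_slow, mean_fast, sigma_fast):
    """Ratio of means, with first-order error propagation for a quotient."""
    ratio = mean_slow / mean_fast
    err = ratio * math.sqrt((sigma_slow / mean_slow) ** 2
                            + (sigma_fast / mean_fast) ** 2)
    return ratio, err

# times in ms, from the run without the patch
print(speedup(436.0, 14.6, 219.9, 5.6))  # plain  vs DotDotDot: ~ (1.98, 0.08)
print(speedup(330.4, 7.1, 219.9, 5.6))   # Walter vs DotDotDot: ~ (1.50, 0.05)
```

Both results match the summary hyperfine printed above, so the uncertainties on the ratios are not mysterious.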

Now let's see how the test performs with the patches applied.


   Time (mean ± σ):     423.5 ms ±   8.9 ms    [User: 377.6 ms, 
System: 45.7 ms]
   Range (min … max):   411.0 ms … 444.0 ms    40 runs


   Time (mean ± σ):     231.0 ms ±   4.3 ms    [User: 220.3 ms, 
System: 10.9 ms]
   Range (min … max):   223.3 ms … 243.9 ms    40 runs


   Time (mean ± σ):     342.4 ms ±   8.1 ms    [User: 306.4 ms, 
System: 36.0 ms]
   Range (min … max):   331.0 ms … 375.1 ms    40 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.48 ± 0.04 times faster than './dmd.sh sm.d -version=Walter'
     1.83 ± 0.05 times faster than './dmd.sh sm.d'

We see the difference between `...` and Walter's unrolled 
staticMap shrink.
And we do see a decrease (in time) for the divide-and-conquer 
version of staticMap.

However, we also see the mean times of the `...` and Walter's 
unrolled staticMap actually increase.

That made me curious, and I repeated the measurements with a 
higher repetition count, just to make sure that this is not a 
fluke.

---- "short-circuit" patch applied.
uplink uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine 
"./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d 
-version=Walter" -r 90

   Time (mean ± σ):     425.9 ms ±  13.4 ms    [User: 373.7 ms, 
System: 51.9 ms]
   Range (min … max):   409.6 ms … 468.8 ms    90 runs


   Time (mean ± σ):     234.3 ms ±   9.5 ms    [User: 224.1 ms, 
System: 10.2 ms]
   Range (min … max):   220.0 ms … 272.1 ms    90 runs


   Time (mean ± σ):     340.6 ms ±   7.1 ms    [User: 299.7 ms, 
System: 40.9 ms]
   Range (min … max):   328.9 ms … 359.3 ms    90 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.45 ± 0.07 times faster than './dmd.sh sm.d -version=Walter'
     1.82 ± 0.09 times faster than './dmd.sh sm.d'

This is consistent with what we got before.
For good measure (pun intended), I tested the DMD version without 
the patch with an increased repetition count as well.

---- without "short-circuit" patch:
uplink uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine 
"./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d 
-version=Walter" -r 90

   Time (mean ± σ):     428.9 ms ±  11.3 ms    [User: 376.2 ms, 
System: 52.3 ms]
   Range (min … max):   412.8 ms … 464.5 ms    90 runs


   Time (mean ± σ):     217.8 ms ±   5.2 ms    [User: 208.9 ms, 
System: 9.0 ms]
   Range (min … max):   209.0 ms … 241.6 ms    90 runs


   Time (mean ± σ):     329.9 ms ±   9.4 ms    [User: 287.8 ms, 
System: 41.9 ms]
   Range (min … max):   318.7 ms … 364.6 ms    90 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.51 ± 0.06 times faster than './dmd.sh sm.d -version=Walter'
     1.97 ± 0.07 times faster than './dmd.sh sm.d'

The results seem quite solid.
At least on the benchmark I have used, "short-circuiting" AliasSeq 
leads to a 4% slowdown for Walter's unrolled staticMap, and a 7% 
slowdown for `...`.
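For reference, those percentages follow from the mean times quoted above (unpatched vs. patched); the exact figure depends on which run you take, and the quoted 4% and 7% are consistent with these up to rounding:

```python
def slowdown(before_ms, after_ms):
    """Relative slowdown going from the unpatched to the patched compiler."""
    return (after_ms - before_ms) / before_ms

print(f"Walter, 40 runs: {slowdown(330.4, 342.4):.1%}")  # ~3.6%
print(f"Walter, 90 runs: {slowdown(329.9, 340.6):.1%}")  # ~3.2%
print(f"...,    40 runs: {slowdown(219.9, 231.0):.1%}")  # ~5.0%
print(f"...,    90 runs: {slowdown(217.8, 234.3):.1%}")  # ~7.6%
```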

I think we should not include assumed performance improvements 
without measuring them.
Jun 01
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
TLDR; Performance patch caused a slowdown. Why? Because the checking for the case which it wants to optimize takes more time than you save by optimizing it.
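That claim can be phrased as a toy cost model (all numbers below are made up purely for illustration): a guarded fast path only pays off when the guard's hit rate is high enough to amortize the cost of the check itself.

```python
def total_cost(n_calls, hit_rate, check_cost, slow_cost, fast_cost):
    """Total cost with a guarded fast path vs. always taking the slow path."""
    guarded = n_calls * (check_cost
                         + hit_rate * fast_cost
                         + (1 - hit_rate) * slow_cost)
    plain = n_calls * slow_cost
    return guarded, plain

# If the guard never fires, the check is pure overhead: a net slowdown.
guarded, plain = total_cost(1_000_000, 0.0, 5, 100, 10)
print(guarded > plain)  # True

# With a decent hit rate the same guard pays for itself.
guarded, plain = total_cost(1_000_000, 0.5, 5, 100, 10)
print(guarded < plain)  # True
```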
Jun 01
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to optimize takes more 
 time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Jun 01
next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Oh hey, Vtune is free again for noncommercial use! Nice, there was a phase where it was only available commercially. That said, for multi-run timings I'm very fond of multitime ( https://tratt.net/laurie/src/multitime/ ) and perf-tools can also give you good per-instruction analytics, though its presentation is not nearly as nice as vtune's.
Jun 02
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Vtune doesn't do intrusive profiling as far as I know. Therefore its measurements _can_ miss functions which are called often but are very short-running. A profile obtained by periodic sampling has the potential of misrepresenting the situation. In this case finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling. Your short-circuiting patch was the only thing that changed between the versions. I do have a fully instrumented version of dmd. I know exactly which function spends how much time processing which symbol. I will post those results soon.
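For contrast with sampling: an intrusive (instrumenting) profiler records every entry and exit, so call counts are exact even for very short-running functions. The tool described above lives inside dmd itself; the following decorator-based Python sketch is only an analogy for the idea, not its actual implementation:

```python
import time
from collections import defaultdict

# name -> [call count, cumulative seconds]
stats = defaultdict(lambda: [0, 0.0])

def instrument(fn):
    """Record exact call counts and cumulative time, intrusive-profiler style."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = stats[fn.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@instrument
def tiny():
    # called often, very short-running: a sampler could under-represent
    # this, but the instrumented call count below is exact
    pass

for _ in range(10_000):
    tiny()
print(stats["tiny"][0])  # 10000
```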
Jun 02
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 Vtune doesn't do intrusive profiling as far as I know.
 Therefore its measurements _can_ miss functions which are 
 called often but are very short-running.
 A profile obtained by periodic sampling has the potential 
 of misrepresenting the situation.
I don't see how this can be. A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
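To illustrate the point with a toy simulation (made-up workload, purely for illustration): whether a function's 10% of active time comes as one long run or as a hundred short bursts, uniform random samples land inside it about 10% of the time either way.

```python
import random

def sampled_share(active_intervals, total_time, n_samples, seed=0):
    """Fraction of uniform random samples that land inside the active intervals."""
    rng = random.Random(seed)
    hits = sum(
        any(start <= t < end for start, end in active_intervals)
        for t in (rng.uniform(0, total_time) for _ in range(n_samples))
    )
    return hits / n_samples

total = 1000.0
one_long = [(0.0, 100.0)]                                  # 10% in one block
many_short = [(i * 10.0, i * 10.0 + 1.0) for i in range(100)]  # 10% in bursts

print(sampled_share(one_long, total, 100_000))    # ~0.10
print(sampled_share(many_short, total, 100_000))  # ~0.10
```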
Jun 02
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 09:28:48 UTC, FeepingCreature wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 Vtune doesn't do intrusive profiling as far as I know.
 Therefore its measurements _can_ miss functions which are 
 called often but are very short-running.
 A profile obtained by periodic sampling has the potential 
 of misrepresenting the situation.
I don't see how this can be. A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
You are correct. Given a uniform program execution and very fine-grained sampling, i.e. sampling breaks at sub-microsecond intervals, you get an understanding of where the program spends its time overall, in quite a detailed manner. However, because you are frequently interrupting execution in short intervals, you are also changing the execution profile, making your results potentially inaccurate.
Jun 02
parent FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 09:48:47 UTC, Stefan Koch wrote:
 You are correct.
 Given a uniform program execution and very fine-grained 
 sampling, i.e. sampling breaks at sub-microsecond intervals, 
 you get an understanding of where the program spends its 
 time overall, in quite a detailed manner.

 However, because you are frequently interrupting execution 
 in short intervals, you are also changing the execution 
 profile, making your results potentially inaccurate.
Ah, that's true...
Jun 02
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Vtune doesn't do intrusive profiling as far as I know. Therefore its measurements _can_ miss functions which are called often but are very short-running. A profile obtained by periodic sampling has the potential of misrepresenting the situation. In this case finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling. Your short-circuiting patch was the only thing that changed between the versions. I do have a fully instrumented version of dmd. I know exactly which function spends how much time processing which symbol. I will post those results soon.
So I have a little profiling tool in DMD which tells me how often certain functions are called. This is the output for the test with the patch:

=== Phase Time Distribution : ===
phase                                                      avgTime      absTime    freq
dmd.dsymbolsem.dsymbolSemantic                           127270.86  37145399296  291861
dmd.dsymbolsem.templateInstanceSemantic                  355060.25  17450856448   49149
dmd.expressionsem.symbolToExp                               993.92    113991488  114689
dmd.dtemplate.TemplateInstance.syntaxCopy                  1327.78     92444120   69623
dmd.dmangle.Mangler.mangleType                              293.39     60637312  206680
dmd.dtemplate.TemplateDeclaration.findExistingInstance      950.36     19462416   20479
dmd.mtype.TypeIdentifier.syntaxCopy                         203.80     12570752   61681
Type::mangleToBuffer                                         83.38      2107624   25277
dmd.dmangle.Mangler.mangleSymbol                            573.35       108936     190
dmd.func.FuncDeclaration.functionSemantic                   374.00          374       1

=== Phase Time Distribution : ===
phase                                                      avgTime      absTime    freq
dmd.dsymbolsem.dsymbolSemantic                           124654.58  36400631808  292012
dmd.dsymbolsem.templateInstanceSemantic                  348265.00  17116876800   49149
dmd.expressionsem.symbolToExp                              1013.76    116267384  114689
dmd.dtemplate.TemplateInstance.syntaxCopy                  1317.11     91701088   69623
dmd.dmangle.Mangler.mangleType                              288.24     59609648  206809
dmd.dtemplate.TemplateDeclaration.findExistingInstance      954.04     19537828   20479
dmd.mtype.TypeIdentifier.syntaxCopy                         205.71     12690568   61693
Type::mangleToBuffer                                         91.08      2307294   25332
dmd.dmangle.Mangler.mangleSymbol                            832.58       169014     203
dmd.func.FuncDeclaration.functionSemantic                   680.00          680       1

As you can see, templateInstanceSemantic is called the same number of times in both. That would imply the optimization never actually gets triggered.
Jun 02
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 19:21:32 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 [...]
So I have a little profiling tool in DMD which tells me how often certain functions are being called. This is the output for the test with the patch: [...]
The bottom table is with the patch applied. Do not read too much into the timings; they are in cycles, and therefore liable to drift around by a few hundred.
Jun 02
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 19:30:16 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 19:21:32 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 [...]
So I have a little profiling tool in DMD which tells me how often certain functions are being called. This is the output for the test with the patch: [...]
The bottom table is with the patch applied. Do not read too much into the timings; they are in cycles, and therefore liable to drift around by a few hundred.
Argh I meant the bottom is without. The top table is with the patch applied.
Jun 02
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
To state this more fairly: the short-circuiting patch did IMPROVE the average performance of the older staticMap implementation by 0.7%, and only showed a decrease in performance with the new staticMap and `...`, by 4% and 7% respectively.
Jun 01
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
The reason the old version of staticMap did not see the slowdown is that I didn't disable codegen; code-gen inefficiencies occurring when emitting an unreasonable number of symbols tend to hide other problems.

Here is a benchmark which does not rely on our branch but uses the released dmd 2.092.0.

Enjoy!

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 90

   Time (mean ± σ):     452.8 ms ±   7.5 ms    [User: 415.5 ms, System: 37.4 ms]
   Range (min … max):   442.2 ms … 483.9 ms    90 runs

   Time (mean ± σ):     455.1 ms ±  10.4 ms    [User: 417.3 ms, System: 37.7 ms]
   Range (min … max):   441.5 ms … 489.2 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o- -version=Walter' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90

   Time (mean ± σ):     583.2 ms ±  11.0 ms    [User: 529.9 ms, System: 53.1 ms]
   Range (min … max):   570.0 ms … 631.0 ms    90 runs

   Time (mean ± σ):     584.3 ms ±  14.3 ms    [User: 533.1 ms, System: 51.0 ms]
   Range (min … max):   566.5 ms … 657.9 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o-' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90

   Time (mean ± σ):     583.4 ms ±  10.5 ms    [User: 529.2 ms, System: 54.0 ms]
   Range (min … max):   566.9 ms … 624.0 ms    90 runs

   Time (mean ± σ):     585.9 ms ±  13.9 ms    [User: 530.5 ms, System: 55.2 ms]
   Range (min … max):   565.0 ms … 631.7 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o-' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'
Jun 03
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 3 June 2020 at 14:52:09 UTC, Stefan Koch wrote:
 The reason the old version of staticMap did not see the 
 slowdown is because I didn't disable codegen.

 [...]
Disregard this one. I had AliasSeq defined as:

template AliasSeq(seq...) { enum AliasSeq = seq; }

which does not trigger the optimization. When I instead define AliasSeq as

template AliasSeq(seq...) { alias AliasSeq = seq; }

the optimization triggers and you get:

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 50

   Time (mean ± σ):     296.2 ms ±   6.8 ms    [User: 263.6 ms, System: 32.5 ms]
   Range (min … max):   285.7 ms … 330.8 ms    50 runs

   Time (mean ± σ):     301.4 ms ±  11.7 ms    [User: 270.6 ms, System: 30.8 ms]
   Range (min … max):   285.6 ms … 333.3 ms    50 runs

Summary
   './dmd_without_patch sm.d -c -o- -version=Walter' ran
     1.02 ± 0.05 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 50

   Time (mean ± σ):     388.6 ms ±   8.6 ms    [User: 346.5 ms, System: 42.2 ms]
   Range (min … max):   378.5 ms … 419.3 ms    50 runs

   Time (mean ± σ):     375.7 ms ±   9.9 ms    [User: 332.8 ms, System: 42.8 ms]
   Range (min … max):   362.2 ms … 396.3 ms    50 runs

Summary
   './dmd_with_patch sm.d -c -o-' ran
     1.03 ± 0.04 times faster than './dmd_without_patch sm.d -c -o-'

This is somewhat consistent with the previous results. The test that does not trigger the optimization shows no measurable difference; that means that if the optimization does not trigger, no performance penalty is incurred FOR THIS TEST.

Another thing that's surprising: somehow applying the patch reduces the size of the binary. Which just goes to show that you really cannot tell right from wrong anymore with modern optimizers.

-rwxrwxr-x 1 uplink uplink 19281504 Jun  3 16:30 dmd_without_patch
-rwxrwxr-x 1 uplink uplink 19279120 Jun  3 16:31 dmd_with_patch

My guess is that llvm's inliner went less crazy because of an unpredictable branch in there.
Jun 03