
digitalmars.D - Performance test of short-circuiting AliasSeq

reply Stefan Koch <uplink.coder googlemail.com> writes:
Hi,

So I asked myself whether the PR 
https://github.com/dlang/dmd/pull/11057

which complicated the compiler internals and got pulled on the 
basis that it would increase performance, did actually increase 
performance.
So I ran it on a staticMap benchmark which it should speed up.

I am going to use the same benchmark as I used for
https://forum.dlang.org/post/kktulflpozdrsxeinfbg forum.dlang.org

Firstly, let me post the results of the benchmark without Walter's 
patch applied.

---

   Time (mean ± σ):     436.0 ms ±  14.6 ms    [User: 384.0 ms, 
System: 51.8 ms]
   Range (min … max):   413.5 ms … 476.7 ms    40 runs


   Time (mean ± σ):     219.9 ms ±   5.6 ms    [User: 210.0 ms, 
System: 10.0 ms]
   Range (min … max):   208.6 ms … 235.9 ms    40 runs


   Time (mean ± σ):     330.4 ms ±   7.1 ms    [User: 290.8 ms, 
System: 39.5 ms]
   Range (min … max):   316.6 ms … 345.7 ms    40 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.50 ± 0.05 times faster than './dmd.sh sm.d -version=Walter'
     1.98 ± 0.08 times faster than './dmd.sh sm.d'

---

If you care to compare the timings at the bottom, they pretty much 
match the results I measured in the benchmark previously and 
posted in the thread above.
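As a side note on reading those summary lines: the "1.98 ± 0.08 times faster" figures can be reproduced from the quoted means and standard deviations. The ± term appears to come from first-order error propagation for a quotient (that formula is my assumption about how hyperfine derives it, not something stated in this thread):

```python
import math

def speedup(mean_slow, sigma_slow, mean_fast, sigma_fast):
    """Ratio of means, with first-order error propagation for a quotient."""
    ratio = mean_slow / mean_fast
    err = ratio * math.sqrt((sigma_slow / mean_slow) ** 2
                            + (sigma_fast / mean_fast) ** 2)
    return ratio, err

# times in ms, from the run without the patch
print(speedup(436.0, 14.6, 219.9, 5.6))  # plain  vs DotDotDot: ~ (1.98, 0.08)
print(speedup(330.4, 7.1, 219.9, 5.6))   # Walter vs DotDotDot: ~ (1.50, 0.05)
```

Both results match the summary hyperfine printed above, so the uncertainties on the ratios are not mysterious.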

Now let's see how the test performs with the patches applied.


   Time (mean ± σ):     423.5 ms ±   8.9 ms    [User: 377.6 ms, 
System: 45.7 ms]
   Range (min … max):   411.0 ms … 444.0 ms    40 runs


   Time (mean ± σ):     231.0 ms ±   4.3 ms    [User: 220.3 ms, 
System: 10.9 ms]
   Range (min … max):   223.3 ms … 243.9 ms    40 runs


   Time (mean ± σ):     342.4 ms ±   8.1 ms    [User: 306.4 ms, 
System: 36.0 ms]
   Range (min … max):   331.0 ms … 375.1 ms    40 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.48 ± 0.04 times faster than './dmd.sh sm.d -version=Walter'
     1.83 ± 0.05 times faster than './dmd.sh sm.d'

We see the difference between `...` and Walter's unrolled 
staticMap shrink.
And we do see a decrease (in time) for the divide-and-conquer 
version of staticMap.

However, we also see the mean times of the `...` and Walter's 
unrolled staticMap actually increase.

That made me curious, and I repeated the measurements with a 
higher repetition count, just to make sure that this is not a 
fluke.

---- "short-circuit" patch applied.
uplink uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine 
"./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d 
-version=Walter" -r 90

   Time (mean ± σ):     425.9 ms ±  13.4 ms    [User: 373.7 ms, 
System: 51.9 ms]
   Range (min … max):   409.6 ms … 468.8 ms    90 runs


   Time (mean ± σ):     234.3 ms ±   9.5 ms    [User: 224.1 ms, 
System: 10.2 ms]
   Range (min … max):   220.0 ms … 272.1 ms    90 runs


   Time (mean ± σ):     340.6 ms ±   7.1 ms    [User: 299.7 ms, 
System: 40.9 ms]
   Range (min … max):   328.9 ms … 359.3 ms    90 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.45 ± 0.07 times faster than './dmd.sh sm.d -version=Walter'
     1.82 ± 0.09 times faster than './dmd.sh sm.d'

This is consistent with what we got before.
For good measure (pun intended), I tested the DMD version without 
the patch with an increased repetition count as well.

---- without "short-circuit" patch:
uplink uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine 
"./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d 
-version=Walter" -r 90

   Time (mean ± σ):     428.9 ms ±  11.3 ms    [User: 376.2 ms, 
System: 52.3 ms]
   Range (min … max):   412.8 ms … 464.5 ms    90 runs


   Time (mean ± σ):     217.8 ms ±   5.2 ms    [User: 208.9 ms, 
System: 9.0 ms]
   Range (min … max):   209.0 ms … 241.6 ms    90 runs


   Time (mean ± σ):     329.9 ms ±   9.4 ms    [User: 287.8 ms, 
System: 41.9 ms]
   Range (min … max):   318.7 ms … 364.6 ms    90 runs

Summary
   './dmd.sh sm.d -version=DotDotDot' ran
     1.51 ± 0.06 times faster than './dmd.sh sm.d -version=Walter'
     1.97 ± 0.07 times faster than './dmd.sh sm.d'

The results seem quite solid.
At least on the benchmark I have used, "short-circuiting" AliasSeq 
leads to a 4% slowdown for Walter's unrolled staticMap, and a 7% 
slowdown for `...`.
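For reference, those percentages follow from the mean times quoted above (unpatched vs. patched); the exact figure depends on which run you take, and the quoted 4% and 7% are consistent with these up to rounding:

```python
def slowdown(before_ms, after_ms):
    """Relative slowdown going from the unpatched to the patched compiler."""
    return (after_ms - before_ms) / before_ms

print(f"Walter, 40 runs: {slowdown(330.4, 342.4):.1%}")  # ~3.6%
print(f"Walter, 90 runs: {slowdown(329.9, 340.6):.1%}")  # ~3.2%
print(f"...,    40 runs: {slowdown(219.9, 231.0):.1%}")  # ~5.0%
print(f"...,    90 runs: {slowdown(217.8, 234.3):.1%}")  # ~7.6%
```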

I think we should not include assumed performance improvements 
without measuring them.
Jun 01
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
TLDR; Performance patch caused a slowdown. Why? Because the checking for the case which it wants to optimize takes more time than you save by optimizing it.
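That claim can be phrased as a toy cost model (all numbers below are made up purely for illustration): a guarded fast path only pays off when the guard's hit rate is high enough to amortize the cost of the check itself.

```python
def total_cost(n_calls, hit_rate, check_cost, slow_cost, fast_cost):
    """Total cost with a guarded fast path vs. always taking the slow path."""
    guarded = n_calls * (check_cost
                         + hit_rate * fast_cost
                         + (1 - hit_rate) * slow_cost)
    plain = n_calls * slow_cost
    return guarded, plain

# If the guard never fires, the check is pure overhead: a net slowdown.
guarded, plain = total_cost(1_000_000, 0.0, 5, 100, 10)
print(guarded > plain)  # True

# With a decent hit rate the same guard pays for itself.
guarded, plain = total_cost(1_000_000, 0.5, 5, 100, 10)
print(guarded < plain)  # True
```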
Jun 01
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to optimize takes more 
 time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Jun 01
next sibling parent FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Oh hey, Vtune is free again for noncommercial use! Nice, there was a phase where it was only available commercially. That said, for multi-run timings I'm very fond of multitime ( https://tratt.net/laurie/src/multitime/ ) and perf-tools can also give you good per-instruction analytics, though its presentation is not nearly as nice as vtune's.
Jun 02
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Vtune doesn't do intrusive profiling as far as I know. Therefore its measurements _can_ miss functions which are called often but are very short-running. A profile obtained by periodic sampling has the potential of misrepresenting the situation. In this case finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling. Your short-circuiting patch was the only thing that changed between the versions. I do have a fully instrumented version of dmd. I know exactly which function spends how much time processing which symbol. I will post those results soon.
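For contrast with sampling: an intrusive (instrumenting) profiler records every entry and exit, so call counts are exact even for very short-running functions. The tool described above lives inside dmd itself; the following decorator-based Python sketch is only an analogy for the idea, not its actual implementation:

```python
import time
from collections import defaultdict

# name -> [call count, cumulative seconds]
stats = defaultdict(lambda: [0, 0.0])

def instrument(fn):
    """Record exact call counts and cumulative time, intrusive-profiler style."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = stats[fn.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@instrument
def tiny():
    # called often, very short-running: a sampler could under-represent
    # this, but the instrumented call count below is exact
    pass

for _ in range(10_000):
    tiny()
print(stats["tiny"][0])  # 10000
```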
Jun 02
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 Vtune doesn't do intrusive profiling as far as I know.
 Therefore its measurements _can_ miss functions which are 
 called often but are very short-running.
 A profile obtained by periodic sampling has the potential 
 of misrepresenting the situation.
I don't see how this can be. A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
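To illustrate the point with a toy simulation (made-up workload, purely for illustration): whether a function's 10% of active time comes as one long run or as a hundred short bursts, uniform random samples land inside it about 10% of the time either way.

```python
import random

def sampled_share(active_intervals, total_time, n_samples, seed=0):
    """Fraction of uniform random samples that land inside the active intervals."""
    rng = random.Random(seed)
    hits = sum(
        any(start <= t < end for start, end in active_intervals)
        for t in (rng.uniform(0, total_time) for _ in range(n_samples))
    )
    return hits / n_samples

total = 1000.0
one_long = [(0.0, 100.0)]                                  # 10% in one block
many_short = [(i * 10.0, i * 10.0 + 1.0) for i in range(100)]  # 10% in bursts

print(sampled_share(one_long, total, 100_000))    # ~0.10
print(sampled_share(many_short, total, 100_000))  # ~0.10
```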
Jun 02
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 09:28:48 UTC, FeepingCreature wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 Vtune doesn't do intrusive profiling as far as I know.
 Therefore its measurements _can_ miss functions which are 
 called often but are very short-running.
 A profile obtained by periodic sampling has the potential 
 of misrepresenting the situation.
I don't see how this can be. A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
You are correct. Given a uniform program execution and very fine-grained sampling, i.e. sampling breaks at sub-microsecond intervals, you get an understanding of where the program spends its time overall, in quite a detailed manner. However, because you are frequently interrupting execution in short intervals, you are also changing the execution profile, making your results potentially inaccurate.
Jun 02
parent FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 2 June 2020 at 09:48:47 UTC, Stefan Koch wrote:
 You are correct.
 Given a uniform program execution and very fine-grained 
 sampling, i.e. sampling breaks at sub-microsecond intervals, 
 you get an understanding of where the program spends its 
 time overall, in quite a detailed manner.

 However, because you are frequently interrupting execution 
 in short intervals, you are also changing the execution 
 profile, making your results potentially inaccurate.
Ah, that's true...
Jun 02
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
 On 6/1/2020 1:23 PM, Stefan Koch wrote:
 TLDR; Performance patch caused a slowdown.
 Why? Because the checking for the case which it wants to 
 optimize takes more time than you save by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Vtune doesn't do intrusive profiling as far as I know. Therefore its measurements _can_ miss functions which are called often but are very short-running. A profile obtained by periodic sampling has the potential of misrepresenting the situation. In this case finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling. Your short-circuiting patch was the only thing that changed between the versions. I do have a fully instrumented version of dmd. I know exactly which function spends how much time processing which symbol. I will post those results soon.
So I have a little profiling tool in DMD which tells me how often certain functions are called. This is the output for the test with the patch:

=== Phase Time Distribution : ===
phase                                                      avgTime      absTime    freq
dmd.dsymbolsem.dsymbolSemantic                           127270.86  37145399296  291861
dmd.dsymbolsem.templateInstanceSemantic                  355060.25  17450856448   49149
dmd.expressionsem.symbolToExp                               993.92    113991488  114689
dmd.dtemplate.TemplateInstance.syntaxCopy                  1327.78     92444120   69623
dmd.dmangle.Mangler.mangleType                              293.39     60637312  206680
dmd.dtemplate.TemplateDeclaration.findExistingInstance      950.36     19462416   20479
dmd.mtype.TypeIdentifier.syntaxCopy                         203.80     12570752   61681
Type::mangleToBuffer                                         83.38      2107624   25277
dmd.dmangle.Mangler.mangleSymbol                            573.35       108936     190
dmd.func.FuncDeclaration.functionSemantic                   374.00          374       1

=== Phase Time Distribution : ===
phase                                                      avgTime      absTime    freq
dmd.dsymbolsem.dsymbolSemantic                           124654.58  36400631808  292012
dmd.dsymbolsem.templateInstanceSemantic                  348265.00  17116876800   49149
dmd.expressionsem.symbolToExp                              1013.76    116267384  114689
dmd.dtemplate.TemplateInstance.syntaxCopy                  1317.11     91701088   69623
dmd.dmangle.Mangler.mangleType                              288.24     59609648  206809
dmd.dtemplate.TemplateDeclaration.findExistingInstance      954.04     19537828   20479
dmd.mtype.TypeIdentifier.syntaxCopy                         205.71     12690568   61693
Type::mangleToBuffer                                         91.08      2307294   25332
dmd.dmangle.Mangler.mangleSymbol                            832.58       169014     203
dmd.func.FuncDeclaration.functionSemantic                   680.00          680       1

As you can see, templateInstanceSemantic is called the same number of times in both. That would imply the optimization never actually gets triggered.
Jun 02
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 19:21:32 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 [...]
So I have a little profiling tool in DMD which tells me how often certain functions are being called. This is the output for the test with the patch: [...]
The bottom table is with the patch applied. Do not read too much into the timings; they are in cycles, and therefore liable to drift around by a few hundred.
Jun 02
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 19:30:16 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 19:21:32 UTC, Stefan Koch wrote:
 On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
 [...]
So I have a little profiling tool in DMD which tells me how often certain functions are being called. This is the output for the test with the patch: [...]
The bottom table is with the patch applied. Do not read too much into the timings; they are in cycles, and therefore liable to drift around by a few hundred.
Argh I meant the bottom is without. The top table is with the patch applied.
Jun 02
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
To state this more fairly: the short-circuiting patch did IMPROVE the average performance of the older staticMap implementation by 0.7%, and only showed a decrease in performance with the new staticMap and `...`, by 4% and 7% respectively.
Jun 01
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
 Hi,

 So I've asked myself if the PR 
 https://github.com/dlang/dmd/pull/11057

 [...]
The reason the old version of staticMap did not see the slowdown is that I didn't disable codegen; code-gen inefficiencies occurring when emitting an unreasonable number of symbols tend to hide other problems.

Here is a benchmark which does not rely on our branch but uses the released dmd 2.092.0.

Enjoy!

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 90

   Time (mean ± σ):     452.8 ms ±   7.5 ms    [User: 415.5 ms, System: 37.4 ms]
   Range (min … max):   442.2 ms … 483.9 ms    90 runs

   Time (mean ± σ):     455.1 ms ±  10.4 ms    [User: 417.3 ms, System: 37.7 ms]
   Range (min … max):   441.5 ms … 489.2 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o- -version=Walter' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90

   Time (mean ± σ):     583.2 ms ±  11.0 ms    [User: 529.9 ms, System: 53.1 ms]
   Range (min … max):   570.0 ms … 631.0 ms    90 runs

   Time (mean ± σ):     584.3 ms ±  14.3 ms    [User: 533.1 ms, System: 51.0 ms]
   Range (min … max):   566.5 ms … 657.9 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o-' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90

   Time (mean ± σ):     583.4 ms ±  10.5 ms    [User: 529.2 ms, System: 54.0 ms]
   Range (min … max):   566.9 ms … 624.0 ms    90 runs

   Time (mean ± σ):     585.9 ms ±  13.9 ms    [User: 530.5 ms, System: 55.2 ms]
   Range (min … max):   565.0 ms … 631.7 ms    90 runs

Summary
   './dmd_without_patch sm.d -c -o-' ran
     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'
Jun 03
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 3 June 2020 at 14:52:09 UTC, Stefan Koch wrote:
 The reason the old version of staticMap did not see the 
 slowdown is because I didn't disable codegen.

 [...]
Disregard this one. I had AliasSeq defined as:

template AliasSeq(seq...) { enum AliasSeq = seq; }

which does not trigger the optimization. When I instead define AliasSeq as

template AliasSeq(seq...) { alias AliasSeq = seq; }

the optimization triggers and you get:

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 50

   Time (mean ± σ):     296.2 ms ±   6.8 ms    [User: 263.6 ms, System: 32.5 ms]
   Range (min … max):   285.7 ms … 330.8 ms    50 runs

   Time (mean ± σ):     301.4 ms ±  11.7 ms    [User: 270.6 ms, System: 30.8 ms]
   Range (min … max):   285.6 ms … 333.3 ms    50 runs

Summary
   './dmd_without_patch sm.d -c -o- -version=Walter' ran
     1.02 ± 0.05 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'

uplink uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 50

   Time (mean ± σ):     388.6 ms ±   8.6 ms    [User: 346.5 ms, System: 42.2 ms]
   Range (min … max):   378.5 ms … 419.3 ms    50 runs

   Time (mean ± σ):     375.7 ms ±   9.9 ms    [User: 332.8 ms, System: 42.8 ms]
   Range (min … max):   362.2 ms … 396.3 ms    50 runs

Summary
   './dmd_with_patch sm.d -c -o-' ran
     1.03 ± 0.04 times faster than './dmd_without_patch sm.d -c -o-'

This is somewhat consistent with the previous results. The test that does not trigger the optimization shows no measurable difference; that means that if the optimization does not trigger, no performance penalty is incurred FOR THIS TEST.

Another thing that's surprising: somehow applying the patch reduces the size of the binary. Which just goes to show that you really cannot tell right from wrong anymore with modern optimizers.

-rwxrwxr-x 1 uplink uplink 19281504 Jun  3 16:30 dmd_without_patch
-rwxrwxr-x 1 uplink uplink 19279120 Jun  3 16:31 dmd_with_patch

My guess is that llvm's inliner went less crazy because of an unpredictable branch in there.
Jun 03