
digitalmars.D.learn - D Mir: standard deviation speed

tastyminerals <tastyminerals gmail.com> writes:
I am trying to implement standard deviation calculation in Mir 
for benchmark purposes.
I have two implementations. One is the straightforward std = 
sqrt(mean(abs(x - x.mean())**2)) and the other follows Welford's 
algorithm for computing variance (as described here: 
https://www.johndcook.com/blog/standard_deviation/).

However, although the first implementation should be less 
efficient / slower, the benchmarking results show a startling 
difference in its favour. I'd like to understand if I am doing 
something wrong and would appreciate some explanation.


import std.math : abs;
import mir.ndslice;
import mir.math.common : pow, sqrt, fastmath;
import mir.math.sum : sum;
import mir.math.stat : mean;

@fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
{
     pragma(inline, false);
     if (flatMatrix.empty)
         return 0.0;
     double n = cast(double) flatMatrix.length;
     double mu = flatMatrix.mean;
     return (flatMatrix.map!(a => (a - mu).abs ^^ 2).sum!"fast" / n).sqrt;
}



@fastmath double sdWelford(T)(Slice!(T*, 1) flatMatrix)
{
     pragma(inline, false);
     if (flatMatrix.empty)
         return 0.0;

     double m0 = 0.0;
     double m1 = 0.0;
     double s0 = 0.0;
     double s1 = 0.0;
     double n = 0.0;
     foreach (x; flatMatrix.field)
     {
         ++n;
         m1 = m0 + (x - m0) / n;
         s1 = s0 + (x - m0) * (x - m1);
         m0 = m1;
         s0 = s1;
     }
     // switch to n - 1 for sample variance
     return (s1 / n).sqrt;
}
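
For reference, here is a minimal, self-contained usage sketch (not the actual benchmark harness, and the input matrix is made up) that runs both functions on a flattened 60x60 matrix:

import mir.ndslice;
import std.math : approxEqual;
import std.stdio : writeln;

void main()
{
    // made-up input: a 60x60 matrix of doubles, flattened to Slice!(double*, 1)
    auto matrix = iota(60, 60).map!(a => cast(double) (a % 7)).slice;
    auto flat = matrix.flattened;
    writeln("sd0:       ", sd0(flat));
    writeln("sdWelford: ", sdWelford(flat));
    // both implementations should agree up to floating-point error
    assert(approxEqual(sd0(flat), sdWelford(flat)));
}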

Benchmarking:

Naive std (1k loops):
   std of [60, 60] matrix 0.001727
   std of [300, 300] matrix 0.043452
   std of [600, 600] matrix 0.182177
   std of [800, 800] matrix 0.345367

std with Welford's variance (1k loops):
   std of [60, 60] matrix 0.0225476
   std of [300, 300] matrix 0.534528
   std of [600, 600] matrix 2.0714
   std of [800, 800] matrix 3.60142
Jul 14 2020
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 [...]
It would be helpful to provide a link.

You should only need one accumulator for the mean and one for the centered sum of squares; see the Python example under Welford's algorithm at https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance (a D sketch of what I mean is at the end of this post). Using the extra accumulators may have broken optimization somehow.

variance and standardDeviation were recently added to mir.math.stat. They have the option to switch between Welford's algorithm and the others: what you call the naive algorithm is VarianceAlgo.twoPass, and Welford's algorithm can be toggled with VarianceAlgo.online, which is the default option. It would also be interesting if you re-did the analysis with the built-in mir functions.

There are some other small differences between your implementation and the one in mir, beyond the issue discussed above. You take the absolute value before squaring and force the use of sum!"fast". Another difference is that VarianceAlgo.online in mir uses a precise calculation of the mean rather than the fast running update that Welford uses. This may have a modest impact on performance, but should provide more accurate results.
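
Something along these lines (an untested sketch following the Wikipedia update; the names are mine, and it assumes the same imports as your post):

private double sdWelfordOnePass(T)(Slice!(T*, 1) x)
{
    pragma(inline, false);
    if (x.empty)
        return 0.0;
    double mu = 0.0; // running mean
    double m2 = 0.0; // centered sum of squares
    double n = 0.0;
    foreach (e; x.field)
    {
        ++n;
        immutable delta = e - mu;
        mu += delta / n;
        m2 += delta * (e - mu);
    }
    // divide by n - 1 instead for the sample standard deviation
    return sqrt(m2 / n);
}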
Jul 14 2020
tastyminerals <tastyminerals gmail.com> writes:
On Tuesday, 14 July 2020 at 19:36:21 UTC, jmh530 wrote:
 [...]
Ok, the wiki page looks more informative, I shall look into my Welford implementation.

I've just used standardDeviation from Mir and it showed even worse results than both of the examples above. Here is a (WIP) project as of now. Line 160 in
https://github.com/tastyminerals/mir_benchmarks_2/blob/master/source/basic_ops.d

std of [60, 60] matrix 0.0389492 (> 0.001727)
std of [300, 300] matrix 1.03592 (> 0.043452)
std of [600, 600] matrix 4.2875 (> 0.182177)
std of [800, 800] matrix 7.9415 (> 0.345367)
Jul 14 2020
jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 15 July 2020 at 05:57:56 UTC, tastyminerals wrote:
 [snip]

 Here is a (WIP) project as of now.
 Line 160 in 
 https://github.com/tastyminerals/mir_benchmarks_2/blob/master/source/basic_ops.d

 std of [60, 60] matrix 0.0389492 (> 0.001727)
 std of [300, 300] matrix 1.03592 (> 0.043452)
 std of [600, 600] matrix 4.2875 (> 0.182177)
 std of [800, 800] matrix 7.9415 (> 0.345367)
I changed the dflags-ldc to "-mcpu=native -O" and compiled with `dub run --compiler=ldc2`. I got similar results as yours for both in the initial run. I changed sd to

@fmamath private double sd(T)(Slice!(T*, 1) flatMatrix)
{
    pragma(inline, false);
    if (flatMatrix.empty)
        return 0.0;
    double n = cast(double) flatMatrix.length;
    double mu = flatMatrix.mean;
    return (flatMatrix.map!(a => (a - mu) ^^ 2).sum!"precise" / n).sqrt;
}

and got

std of [10, 10] matrix 0.0016321
std of [20, 20] matrix 0.0069788
std of [300, 300] matrix 2.42063
std of [60, 60] matrix 0.0828711
std of [600, 600] matrix 9.72251
std of [800, 800] matrix 18.1356

And the biggest change by far was sum!"precise" instead of sum!"fast".

When I ran your benchStd function with

ans = matrix.flattened.standardDeviation!(double, "online", "fast");

I got

std of [10, 10] matrix 1e-07
std of [20, 20] matrix 0
std of [300, 300] matrix 0
std of [60, 60] matrix 1e-07
std of [600, 600] matrix 0
std of [800, 800] matrix 0

I got the same result with Summation.naive. That almost seems too low. The default is Summation.appropriate, which resolves to Summation.pairwise in this case. It is faster than Summation.precise, but still slower than Summation.naive or Summation.fast. Your welfordSD should line up with Summation.naive.

When I change that to

ans = matrix.flattened.standardDeviation!(double, "online", "precise");

I get

Running .\mir_benchmarks_2.exe
std of [10, 10] matrix 0.0031737
std of [20, 20] matrix 0.0153603
std of [300, 300] matrix 4.15738
std of [60, 60] matrix 0.171211
std of [600, 600] matrix 17.7443
std of [800, 800] matrix 34.2592

I also tried changing your welfordSD function based on the stuff I mentioned above, but it did not make a large difference.
Jul 15 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 11:23:00 UTC, jmh530 wrote:
 [...]
 I changed the dflags-ldc to "-mcpu=native -O" and compiled with `dub run --compiler=ldc2`. I got similar results as yours for both in the initial run. I changed sd to

 @fmamath private double sd(T)(Slice!(T*, 1) flatMatrix)
@fastmath violates all summation algorithms except `"fast"`. The same bug is in the original author's post.
Jul 15 2020
jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 15 July 2020 at 11:26:19 UTC, 9il wrote:
 [snip]
 @fmamath private double sd(T)(Slice!(T*, 1) flatMatrix)
@fastmath violates all summation algorithms except `"fast"`. The same bug is in the original author's post.
I hadn't realized that @fmamath was the problem, rather than @fastmath overall. @fmamath is used on many mir.math.stat functions, though admittedly not in the accumulators.
Jul 15 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 11:37:23 UTC, jmh530 wrote:
 On Wednesday, 15 July 2020 at 11:26:19 UTC, 9il wrote:
 [snip]
 @fmamath private double sd(T)(Slice!(T*, 1) flatMatrix)
@fastmath violates all summation algorithms except `"fast"`. The same bug is in the original author's post.
I hadn't realized that @fmamath was the problem, rather than @fastmath overall. @fmamath is used on many mir.math.stat functions, though admittedly not in the accumulators.
Ah, no, my bad! You wrote @fmamath; I read it as @fastmath. @fmamath is OK here.
Jul 15 2020
jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 15 July 2020 at 11:41:35 UTC, 9il wrote:
 [snip]

 Ah, no, my bad! You wrote @fmamath; I read it as @fastmath. @fmamath is OK here.
I've mixed up @fastmath and @fmamath as well. No worries.
Jul 15 2020
9il <ilyayaroshenko gmail.com> writes:
On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
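
For example, a variant without the attribute (an untested sketch, names made up) leaves the summation algorithm alone; `sum` without a template argument uses Summation.appropriate, which typically resolves to pairwise for doubles:

import mir.ndslice;
import mir.math.common : sqrt;
import mir.math.sum : sum;
import mir.math.stat : mean;

// no @fastmath here, so the compiler is not allowed to reassociate the summation
private double sdDefaultSum(T)(Slice!(T*, 1) flatMatrix)
{
    if (flatMatrix.empty)
        return 0.0;
    immutable n = cast(double) flatMatrix.length;
    immutable mu = flatMatrix.mean;
    return sqrt(flatMatrix.map!(a => (a - mu) * (a - mu)).sum / n);
}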
Jul 14 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
`mean` is a summation algorithm too.
Jul 14 2020
tastyminerals <tastyminerals gmail.com> writes:
On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
Jul 14 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 06:00:46 UTC, tastyminerals wrote:
 On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
They both are more precise by default.
Jul 14 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 06:55:51 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:00:46 UTC, tastyminerals wrote:
 On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
They both are more precise by default.
This was a reply to your other post in the thread, sorry. Mir algorithms are more precise by default than the algorithms you have provided.
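
For example (an illustrative sketch, with values chosen to force catastrophic cancellation; not from the benchmark):

import mir.math.sum : sum;

void main()
{
    double[] xs = [1.0e16, 1.0, -1.0e16];
    // naive left-to-right summation loses the 1.0 to rounding,
    // the precise summator recovers it
    assert(xs.sum!"naive" == 0.0);
    assert(xs.sum!"precise" == 1.0);
}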
Jul 14 2020
tastyminerals <tastyminerals gmail.com> writes:
On Wednesday, 15 July 2020 at 06:57:21 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:55:51 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:00:46 UTC, tastyminerals 
 wrote:
 On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 On Tuesday, 14 July 2020 at 19:04:45 UTC, tastyminerals 
 wrote:
 @fastmath private double sd0(T)(Slice!(T*, 1) flatMatrix)
@fastmath shouldn't really be used with summation algorithms except the `"fast"` version of them. Otherwise, they may or may not behave like "fast".
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
They both are more precise by default.
This was a reply to your other post in the thread, sorry. Mir algorithms are more precise by default than the algorithms you have provided.
Right. Is this why standardDeviation is significantly slower?
Jul 15 2020
9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 15 July 2020 at 07:34:59 UTC, tastyminerals wrote:
 On Wednesday, 15 July 2020 at 06:57:21 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:55:51 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:00:46 UTC, tastyminerals 
 wrote:
 On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 [...]
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
They both are more precise by default.
This was a reply to your other post in the thread, sorry. Mir algorithms are more precise by default than the algorithms you have provided.
Right. Is this why standardDeviation is significantly slower?
Yes. It allows you to pick a summation option; you can try options other than the default in your benchmarks.
Jul 15 2020
tastyminerals <tastyminerals gmail.com> writes:
On Wednesday, 15 July 2020 at 07:51:31 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 07:34:59 UTC, tastyminerals wrote:
 On Wednesday, 15 July 2020 at 06:57:21 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:55:51 UTC, 9il wrote:
 On Wednesday, 15 July 2020 at 06:00:46 UTC, tastyminerals 
 wrote:
 On Wednesday, 15 July 2020 at 02:08:48 UTC, 9il wrote:
 [...]
Good to know. So, it's fine to use it with sum!"fast" but better avoid it for general purposes.
They both are more precise by default.
This was a reply to your other post in the thread, sorry. Mir algorithms are more precise by default than the algorithms you have provided.
Right. Is this why standardDeviation is significantly slower?
Yes. It allows you to pick a summation option; you can try options other than the default in your benchmarks.
Indeed, I played around with VarianceAlgo and Summation options, and they impact the end result a lot.

ans = matrix.flattened.standardDeviation!(VarianceAlgo.naive, Summation.appropriate);

std of [300, 300] matrix 0.375903
std of [60, 60] matrix 0.0156448
std of [600, 600] matrix 1.54429
std of [800, 800] matrix 3.03954

ans = matrix.flattened.standardDeviation!(VarianceAlgo.online, Summation.appropriate);

std of [300, 300] matrix 1.12404
std of [60, 60] matrix 0.041968
std of [600, 600] matrix 5.01617
std of [800, 800] matrix 8.64363

Summation.fast behaves strangely, though. I wonder what happened here?

ans = matrix.flattened.standardDeviation!(VarianceAlgo.naive, Summation.fast);

std of [300, 300] matrix 1e-06
std of [60, 60] matrix 9e-07
std of [600, 600] matrix 1.2e-06
std of [800, 800] matrix 9e-07

ans = matrix.flattened.standardDeviation!(VarianceAlgo.online, Summation.fast);

std of [300, 300] matrix 9e-07
std of [60, 60] matrix 9e-07
std of [600, 600] matrix 1.1e-06
std of [800, 800] matrix 1e-06
Jul 15 2020