digitalmars.D.learn - Calculating mean and standard deviation with std.algorithm.reduce

Joseph Rushton Wakeling (41/41) Feb 13 2013 Hello all,

jerro (18/25) Feb 13 2013 I guess reduce would need to operate on tuples of M and Q, so you

jerro (5/6) Feb 13 2013 A small correction : you would need to use x.drop(1) instead of

bearophile (7/13) Feb 13 2013 See enumerate():

jerro (13/17) Feb 13 2013 I think the argument to enumerate here should be 2 (and the x

FG (12/16) Feb 13 2013 Typical thing with examples - they try to be terse and show off a mechan...

Joseph Rushton Wakeling (10/19) Feb 13 2013 Nice :-)
Joseph Rushton Wakeling (4/6) Feb 13 2013 My favourite -- in the tutorial for a really serious piece of scientific...

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

Hello all,

A little challenge that's been bothering me today.

The docs for std.algorithm give an illustration of its use to calculate mean
and 
standard deviation in a single pass:

       // Compute sum and sum of squares in one pass
       r = reduce!("a + b", "a + b * b")(tuple(0.0, 0.0), a);
       assert(r[0] == 35); // sum
       assert(r[1] == 233); // sum of squares
       // Compute average and standard deviation from the above
       auto avg = r[0] / a.length;
       auto stdev = sqrt(r[1] / a.length - avg * avg);

However, this formula for standard deviation is one that is well known for
being 
subject to potentially fatal rounding error.  Consider the following:

       float[] a = [10_000.0f, 10_001.0f, 10_002.0f];

       auto r = reduce!("a + b", "a + b * b")(tuple(0.0f, 0.0f), a);

       auto avg = r[0] / a.length;
       auto sd = sqrt(r[1] / a.length - avg * avg);
       writeln(avg, "\t", sd);

... which gives a value of 0 for standard deviation when compiled with LDC or 
GDC, and -nan with DMD, when the correct answer is 0.816497.

An improved online formula calculates together two quantities:

              {  x[1]    if k == 1
     M[k]  =  {
              {  M[k-1] + (x[k] - M[k-1]) / k    , if k > 1

              {  0      if k == 1
     Q[k]  =  {
              {  Q[k-1] + (k - 1) * ((x[k] - M[k-1]) ^^ 2) / k   ,  if k > 1

... which one can readily prove give you respectively the mean of x[1], ..., 
x[k] and the sum of squares of deviations from the mean; so that you can 
calculate sample variance as Q[n] / (n - 1), or standard variance as Q[n] / n 
with standard deviation being given by the square root of this.

In a reduce sense, you'd want to calculate the mean according to,

     reduce!"a + (b - a)/k"(0.0, x);

... where k represents the index count 1, 2, 3, ...  However, it's not evident 
to me how you could get reduce() to know this counting value.

Where calculating Q[k] is concerned, you need to know both the index value
_and_ 
the value of the corresponding M[k].  Again, it's not evident to me how you'd 
indicate to reduce() what is needed.

Can anyone offer advice on how to achieve these calculations using reduce?

Thanks & best wishes,

     -- Joe

Feb 13 2013

"jerro" <a a.com> writes:

 ... where k represents the index count 1, 2, 3, ...  However, 
 it's not evident to me how you could get reduce() to know this 
 counting value.

You would use zip and sequence to add indices to x, like this:

reduce!reducer(initial, zip(x, sequence!"n"))

 Where calculating Q[k] is concerned, you need to know both the 
 index value _and_ the value of the corresponding M[k].  Again, 
 it's not evident to me how you'd indicate to reduce() what is 
 needed.

I guess reduce would need to operate on tuples of M and Q, so you 
would have something like:

alias Tuple!(float, float) MQ;

MQ reducer(MQ mq, Tuple!(float, int) xi)
{
     return MQ(
         // compute new values or M and Q here
     );
}

And you would then call reduce like this:

reduce!reducer(MQ(x.front, 0), zip(x, sequence!"n"))

You could also use Tuple of M, Q and k instead of using zip and 
sequence. You would then pass MQk(x.front, 0, 0) as first 
argument to reduce (I'm assuming zero based indexing here) and 
simply compute the k component of the return value in reducer as 
mqk[2] + 1.

Feb 13 2013

"jerro" <a a.com> writes:

 reduce!reducer(MQ(x.front, 0), zip(x, sequence!"n"))

A small correction : you would need to use x.drop(1) instead of 
x, because the first element of x is only used to compute the 
initial value of 1. If you wanted k to have the same meaning as 
the one in your formula, you would need to use sequence!"n + 2" 
instead of sequence!"n".

Feb 13 2013

"bearophile" <bearophileHUGS lycos.com> writes:

jerro:

 reduce!reducer(MQ(x.front, 0), zip(x, sequence!"n"))

 A small correction : you would need to use x.drop(1) instead of 
 x, because the first element of x is only used to compute the 
 initial value of 1. If you wanted k to have the same meaning as 
 the one in your formula, you would need to use sequence!"n + 2" 
 instead of sequence!"n".

See enumerate():
http://d.puremagic.com/issues/show_bug.cgi?id=5550

I think with enumerate it becomes:
MQ(x.front, 0).enumerate(1).reduce!reducer()

Bye,
bearophile

Feb 13 2013

"jerro" <a a.com> writes:

 See enumerate():
 http://d.puremagic.com/issues/show_bug.cgi?id=5550

I like this enumerate() thing. Is there any particular reason why
it isn't in phobos, or is it just that no one has added it yet?

 I think with enumerate it becomes:
 MQ(x.front, 0).enumerate(1).reduce!reducer()

I think the argument to enumerate here should be 2 (and the x
range is missing, of course).

Another way to do this is:

MQk(x.front, 0, 2).reduce!reducer(x.drop(1))

Or using a lambda instead of reducer():

auto mqk = tuple(x.front, 0.to!double, 2).reduce!
      ((mqk, x) => tuple(
          mqk[0] + (x - mqk[0]) / mqk[2],
          mqk[1] + (mqk[2] - 1) * ((x - mqk[0]) ^^ 2) / mqk[2],
          mqk[2] + 1))
      (x.drop(1));

Feb 13 2013

FG <home fgda.pl> writes:

On 2013-02-13 14:44, Joseph Rushton Wakeling wrote:
 The docs for std.algorithm give an illustration of its use to calculate mean
and
 standard deviation in a single pass: [...]
 However, this formula for standard deviation is one that is well known for
being
 subject to potentially fatal rounding error.

Typical thing with examples - they try to be terse and show off a mechanism
like 
reduce, without going into too much details and hence are unusable IRL.

You can use reduce and put the division and subtraction into the reduce itself 
to prevent overflows. You also won't end up with jaw-dropping tuples, sorry. :)

     float[] a = [10_000.0f, 10_001.0f, 10_002.0f];
     auto n = a.length;
     auto avg = reduce!((a, b) => a + b / n)(0.0f, a);
     auto var = reduce!((a, b) => a + pow(b - avg, 2) / n)(0.0f, a);
     auto sd = sqrt(var);
     writeln(avg, "\t", sd);

Output: 10001   0.816497

Feb 13 2013

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On 02/13/2013 03:48 PM, FG wrote:
 You can use reduce and put the division and subtraction into the reduce itself
 to prevent overflows. You also won't end up with jaw-dropping tuples, sorry. :)

      float[] a = [10_000.0f, 10_001.0f, 10_002.0f];
      auto n = a.length;
      auto avg = reduce!((a, b) => a + b / n)(0.0f, a);
      auto var = reduce!((a, b) => a + pow(b - avg, 2) / n)(0.0f, a);
      auto sd = sqrt(var);
      writeln(avg, "\t", sd);

 Output: 10001   0.816497

Nice :-)

The thing that's desirable about the example given in the docs is that it's a 
one-pass approach to getting standard deviation (or rather, mean and standard 
deviation together).  Two reduce commands as you suggest is fine for small
data, 
but if you have something really large you'd rather compute it in one pass. 
You 
might also have input that is not [easily] repeatable, so you _have_ to do it 
online -- imagine that your input range is e.g. a file byLine (you'd prefer to 
avoid parsing that twice), or input from a tape (OK, OK, old fashioned) or 
perhaps data from an external source.

Feb 13 2013

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On 02/13/2013 03:48 PM, FG wrote:
 Typical thing with examples - they try to be terse and show off a mechanism
like
 reduce, without going into too much details and hence are unusable IRL.

My favourite -- in the tutorial for a really serious piece of scientific code 
written in C:

      int n = rand() % 10;

Feb 13 2013

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Calculating mean and standard deviation with std.algorithm.reduce