www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - RandomSample with specified random number generator

Hello all,

An interesting challenge with the randomSample functionality in random.d.

randomSample takes as input a range, a number of points to sample from that 
range, and (optionally) a random number generator.  It returns a range 
corresponding to a lazily-calculated sample of the desired size.

If you don't pass it a specific random number generator, then it just uses the 
thread-global random number generator.  A consequence of this is that the
sample 
changes each time you use it, i.e. if you write,

       auto sample1 = randomSample(iota(0, 100), 5);
       writeln(sample1);
       writeln(sample1);
       writeln(sample1);

... you get 3 different random samples, e.g.

       [0, 17, 18, 43, 77]
       [20, 25, 34, 53, 97]
       [6, 12, 28, 42, 87]

Conversely, if you _do_ pass randomSample a specific random number generator, a 
copy of it is stored inside the RandomSample struct.  This is necessary
because, 
as the sample is lazily evaluated, you need to guarantee the RNG's existence at 
the point when you evaluate.

A consequence of this is that if you evaluate the sample multiple times, you
get 
the same result, i.e. if you do:

       auto sample2 = randomSample(iota(0, 100), 5, rndGen);
       writeln("Sample with specified RNG:");
       writeln(sample2);
       writeln(sample2);
       writeln(sample2);

then you get something like e.g.

       [33, 35, 54, 55, 94]
       [33, 35, 54, 55, 94]
       [33, 35, 54, 55, 94]

... which is problematic because it's different behaviour from the case without 
a specific RNG being used, but not problematic per se.  One can see a logic for 
either case (a sample range always giving the same result, or always giving a 
different and independent result), it's just a matter of being clear which will 
happen.

However, a real problem arises if you want to have multiple samples each being 
passed a specific source of randomness.  If you do:

       auto sample3 = randomSample(iota(0, 100), 5, rndGen);
       writeln(sample3);
       auto sample4 = randomSample(iota(0, 100), 5, rndGen);
       writeln(sample4);
       auto sample5 = randomSample(iota(0, 100), 5, rndGen);
       writeln(sample5);

... then you'd expect in principle to get 3 different samples.  However,
because 
rndGen is _copied_ rather than used directly, its state does not get updated 
when the samples are evaluated.  As a consequence, each of the samples above 
gets passed _the same random number generator in the same state_, and you get 
out the same samples, e.g.

       [33, 35, 54, 55, 94]
       [33, 35, 54, 55, 94]
       [33, 35, 54, 55, 94]

(... yes, the same as sample2, assuming we're still in the same program and 
haven't made any other uses of rndGen in the meantime).

What this means is that randomSample is impossible to use effectively with a 
specified random number generator.  It's not just that successive "different" 
samples produce the same output, it's also that if you generate a random sample 
and then generate other random numbers afterwards, they will be correlated.

I can see only two possible resolutions of this issue, but I'm curious if
anyone 
else can think of something different.

   (i) store a reference to the random number generator instead of a copy
       (is this even possible?).  The trouble is, this isn't safe, because
       what if e.g. you have a function which does something like

           auto generateSample()
           {
               auto rng = Random(unpredictableSeed);
               auto sample = randomSample(iota(0, 100), 5, rng);
               return sample;
           }

           void main()
           {
               auto sample = generateSample();
               writeln(sample);  // Won't work, because rng will no longer
exist.
           }

       ... so this option seems impermissible.

  (ii) when copying the random number generator, seed it with an unpredictable
       seed before generating the first point of the random sample.  This will
       effectively make the sample an independent thread with respect to random
       number generation.  The disadvantage is that you lose the reproducibility
       of the sampling behaviour (e.g. computer game where you want to avoid the
       player being able to save/reload/try again and get a different outcome)
       and there might be unexpected correlations that occur if the seeding or
       the random number generation algorithm are done poorly.

The second seems the only really valid option to me, and has the advantage of 
making identical the behaviour of the sample whether or not it's given a 
specific RNG (i.e. using the sample 3 different times gets you 3 different
samples).

... any thoughts on this and on possible alternative options?

Thanks & best wishes,

     -- Joe
Jun 12 2012