www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Ranges and random numbers -- again

reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Hello all,

I think my last post on this topic may have been too convoluted, so here's
another go.  The goal here is to try and derive some well-defined strategies for
designing ranges that make use of random number generators.

The purpose of this first email is to justify what I think should be the first
basic rule of random ranges [that is, ranges where popFront() makes use of
random or pseudo-random numbers to generate the new value of front()].  That
rule is:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

To see why, let's start by considering a simple input range that provides an
infinite stream of random numbers in the range [min, max).

    struct SimpleRandomRange(RNG = void)
        if(isUniformRNG!RNG || is(RNG == void))
    {
        private real _min, _max, _u;

        static if (!is(RNG == void))
        {
            private RNG _gen;

            this(real min, real max, RNG gen)
            {
                _min = min;
                _max = max;
                _gen = gen;
                _u = uniform(_min, _max, _gen);
            }
        }
        else
        {
            this(real min, real max)
            {
                _min = min;
                _max = max;
                _u = uniform(_min, _max);
            }
        }

        immutable bool empty = false;

        real front()  property pure nothrow const
        {
            return _u;
        }

        void popFront()
        {
            static if (is(RNG == void))
                _u = uniform(_min, _max);
            else
                _u = uniform(_min, _max, _gen);
        }
    }

    auto simpleRandomRange()(real min, real max)
    {
        return SimpleRandomRange!void(min, max);
    }

    auto simpleRandomRange(RNG)(real min, real max, RNG gen)
        if (isUniformRNG!RNG)
    {
        return SimpleRandomRange!RNG(min, max, gen);
    }


Let's consider the basics of how this works.  If we don't specify a generator,

    auto r1 = simpleRandomRange(0.0, 1.0);

... then the call to uniform() will make use of the thread-local default random
number generator, rndGen.  If we do specify an RNG:

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0, 1.0, gen);

... then SimpleRandomRange will store a copy of it to use when generating the
random numbers.

So, what happens when we try and read numbers from this?  Let's try and generate
some output from this range:

    auto r1 = simpleRandomRange(0.0L, 1.0L);

    auto take1 = r1.take(5);
    auto take2 = r1.take(5);

    writeln(take1);
    writeln(take1);
    writeln(take2);
    writeln(take2);

... which might give us for example:

    [0.816207, 0.40483, 0.638285, 0.00737022, 0.467244]
    [0.816207, 0.902301, 0.0961124, 0.611573, 0.579579]
    [0.816207, 0.788726, 0.614613, 0.613375, 0.136722]
    [0.816207, 0.0942176, 0.894744, 0.751959, 0.77816]

We notice immediately that these sequences are all different, except for the
very first number, which is always identical.  What's more, it's identical for
each of the two 'takes'.

What's going on here?  Well, the first value is being set in the constructor;
subsequent values are being set by calls to the global RNG rndGen, and because
its state is being updated, that affects all subsequent values generated.

So, it means that

    (1) Different 'takes' from the same range will be different;

    (2) Reading twice from the same 'take' will also generate a different
        sequence.

That in itself is not so bad -- but the always-the-same first value is a
problem, as it will bias any statistics we generate.

What happens instead if we specify a generator?

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0L, 1.0L, gen);

    auto take3 = r2.take(5);
    writeln(take3);
    writeln(take3);

... then we may get for example:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

i.e., two identical sequences.  Superficially this looks nice -- a random
sequence that is consistent within the same 'take'.  But now let's take another
5 numbers:

    auto take4 = r2.take(5);
    writeln(take4);
    writeln(take4);

... and we get again:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

Why is this the same?  When we pass gen to the constructor of SimpleRandomRange,
it copies it by value -- meaning that updates to the internally-stored _gen
don't propagate back to the source RNG.  This is extremely undesirable, because
it means that we can generate very strong correlations between what ought to be
independent random sequences.

How can we resolve this dilemma?  One thought would be to ensure that any RNG
stored inside SimpleRandomRange is seeded unpredictably.  Manually, without
changing SimpleRandomRange, we can do this by making sure that we do something
like:

    auto r3 = simpleRandomRange(0.0L, 1.0L, Random(unpredictableSeed));
    auto take5 = r3.take(5);

    writeln(take5);
    writeln(take5);

    auto take6 = r3.take(5);
    writeln(take6);
    writeln(take6);

... but in fact this again produces identical output all round.

The other alternative is for random number generators to be redefined as
reference types.  A simple experiment is to redefine MersenneTwisterEngine as a
final class rather than a struct [N.B. this is probably _not_ the optimal way to
turn RNGs into reference types, but is quick and easy to try out!]:

    auto gen2 = new MtClass19937(unpredictableSeed);
    auto r5 = simpleRandomRange(0.0L, 1.0L, gen2);
    auto take9 = r5.take(5);
    auto take10 = r5.take(5);

    writeln(take9);
    writeln(take9);
    writeln(take10);
    writeln(take10);

... which gives us output similar to the original example using rndGen:

    [0.844254, 0.136284, 0.0805684, 0.351376, 0.830413]
    [0.844254, 0.863786, 0.983373, 0.389967, 0.257907]
    [0.844254, 0.825822, 0.176731, 0.397709, 0.470349]
    [0.844254, 0.349027, 0.118591, 0.707107, 0.893357]

i.e., an identical first value, but otherwise different.

One might object that the previous example -- the one where simpleRandomRange
was passed Random(unpredictableSeed) -- offers a superior outcome.  After all,
it generates a reproducible pseudo-random sequence that (theoretically)
shouldn't generate correlations with other (pseudo-)random activity in the
program.  The fact that two different .take commands generate identical output
is a feature, not a problem -- each time you're reading from the start of the
range, so you'd expect to get the same values!

To this I'd offer three responses.  The first is that enforcing an unpredictable
seed for the random range (in effect, making it operate like an independent
thread in terms of its internal RNG) makes things difficult to debug as, for
debugging purposes, you want to be able to have predictable seeds for streams of
pseudo-random numbers.  The second is that it seems to me that having a
multiplicity of separate RNGs risks statistical reliability: it's difficult to
be certain that two different "unpredictable" seeds won't in fact produce
correlated sequences, and becomes more difficult if you have many different
separately seeded RNGs.

Finally, note that even if you could set aside these issues, the reproducibility
of the sequence rests on the fact that your RNG is a pseudo-random number
generator.  If instead you pass SimpleRandomRange an RNG that reads from a true
source of randomness (say, /dev/random) it will be impossible to generate
predictable sequences.

Hence the rule which I posited at the beginning of this email, and which I'll
restate here, which seems to me the only way to ensure consistent behaviour of
these entities regardless of the source of randomness used:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

As a corollary, this means that pseudo-random number generators (in Phobos and
elsewhere) should be redefined as reference types, which has already been
pointed out in D bugzilla:
http://d.puremagic.com/issues/show_bug.cgi?id=7067

However, that's not the end of the story.  Remember that first value from the
range being always the same?  This also needs dealing with, and (I think) is
non-trivial.  I will discuss this in the next email.

In the meantime -- I'm interested in feedback on the proposed rule. :-)

Thanks and best wishes,

    -- Joe
Jun 17 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
      **************************************************************************
      * Reading multiple times from the start of the same random range, should *
      * produce different (and statistically independent) results each time.   *
      **************************************************************************

I don't think that's a good rule. A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values. Andrei
Jun 17 2013
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 17 June 2013 at 19:32:20 UTC, Joseph Rushton Wakeling 
wrote:
 In the meantime -- I'm interested in feedback on the proposed 
 rule. :-)

 Thanks and best wishes,

     -- Joe

Good analysis but (sorry) I think you are on the wrong track. One of the major problems in std.random is that the ranges use value semantics. This means they are *saved* whenever they are copied, as opposed to just referenced. This creates the problems you have mentioned, and even more. I have tried to fix it before: http://forum.dlang.org/thread/oiczxzkzxketxitncghl forum.dlang.org FWI, I gave up on the project, because it was too complex for me to handle an entire module. But there were no reasons for it to not work. In any case, you can *also* take a look at these 2 other conversations, they sup up the problem pretty quickly (especially the second): http://forum.dlang.org/thread/kuvhrfnrrbhzpyoirqgt forum.dlang.org http://forum.dlang.org/thread/mailman.259.1357667544.22503.digitalmars-d puremagic.com So I don't agree with your entire point: ************************************************************************** * Reading multiple times from the start of the same random range, should * * produce different (and statistically independent) results each time. * ************************************************************************** A random range should be viewed (IMO) as nothing more than a range that "was" (conceptually) simply filled with random numbers. Calling front on the same range twice in a row *should* produce the same result: No call to popFront => no change to the range. If it did change, it'd be a blatant violation of the range concept. It also means you can't have safe/nothrow/pure/const "front". Being able to *save* a random range (which your proposal would prevent) can have useful applications too. It means you can iterate on the same random number sequence several times, lazily, without having to store the results in a buffer. The problems you are trying to fix can (imo) can be better fixed with the reference approach.
Jun 17 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/17/13 6:29 PM, Joseph Rushton Wakeling wrote:
 On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
 A random range should be viewed (IMO) as nothing more than a range that "was"
 (conceptually) simply filled with random numbers. Calling front on the same
 range twice in a row *should* produce the same result: No call to popFront => 
no
 change to the range. If it did change, it'd be a blatant violation of the range
 concept. It also means you can't have safe/nothrow/pure/const "front".

Completely agree, and I don't think this is in contradiction with what I've proposed. My proposed "rule" might be better stated to clarify this.

Perhaps this would be a better statement: ************************************************************************ * Iterating fully over a given random range should produce a different * * sequence for each such complete iteration. * ************************************************************************ So, if you do, SomeRandomRange r; x = r.front; y = r.front; assert(x == y); // Passes!! But SomeRandomRange r; arr1 = array(r); arr2 = array(r); assert(x != y); // the two arrays are filled with different sequences.

Once you consume an input range, there's no way to consume it again. It's done. Andrei
Jun 17 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/13 4:57 AM, Joseph Rushton Wakeling wrote:
 On 06/18/2013 05:00 AM, Andrei Alexandrescu wrote:
 Once you consume an input range, there's no way to consume it again. It's done.

Sure, but what I'm proposing is a stronger constraint than "random ranges should be input ranges".

I'm not sure I understand what you are proposing. If a true random range is to be an input range with additional constraints, one simple thing is once .empty returns true there's no more extracting elements from it. Andrei
Jun 18 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. The above
 situation is a prime example of this problem. We *need* to make RNGs
 passed by reference. For situations where you *want* to duplicate a
 pseudo random sequence, an explicit method should be provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404. Andrei
Jun 18 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/13 11:11 AM, monarch_dodra wrote:
 On Tuesday, 18 June 2013 at 14:39:38 UTC, Andrei Alexandrescu wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. The above
 situation is a prime example of this problem. We *need* to make RNGs
 passed by reference. For situations where you *want* to duplicate a
 pseudo random sequence, an explicit method should be provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404. Andrei

That sounds nice, but is it possible? The only way to truly forward everything is via an alias this.

Nonono, that must use introspection and mixins, like WhiteHole and BlackHole do. Andrei
Jun 18 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/13 2:11 PM, Jonathan M Davis wrote:
 On Tuesday, June 18, 2013 10:39:38 Andrei Alexandrescu wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. The above
 situation is a prime example of this problem. We *need* to make RNGs
 passed by reference. For situations where you *want* to duplicate a
 pseudo random sequence, an explicit method should be provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404.

That may be a good way to go about it, but the default such as what rndGen returns really should be a reference type, and only ranges whose state can actually be copied such that continuing to iterate over them gives the same values should be forward ranges IMHO. Using Class!T might be able to make it so that we can avoid needing a std.random2, but we'll probably still need some minor code breakage in there somewhere. - Jonathan M Davis

How about defining rndGenRef and putting rndGen on the deprecation path? Andrei
Jun 18 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/18/13 3:32 PM, monarch_dodra wrote:
 Also: Class!PRNG.save would not actually save, it would merely return a
 copy of its PRNG payload, and not a new class instance. This could be
 problematic (for starters, it wouldn't qualify for forward range...)

Thought of that, updated the bug report. Destroy. Andrei
Jun 19 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 04:22:38PM -0400, Andrei Alexandrescu wrote:
 On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
     **************************************************************************
     * Reading multiple times from the start of the same random range, should *
     * produce different (and statistically independent) results each time.   *
     **************************************************************************

I don't think that's a good rule. A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values.

Yeah, if a range encapsulates a non-repeatable RNG (e.g., a "true" RNG that, say, reads environmental variables to randomize its output, or reads from a hardware RNG), then it should be an input range. Such ranges cannot have any meaningful .save semantics, so they should not be forward ranges. OTOH, one of the main issues in the current std.random is that RNGs are implemented as pass-by-value structs, which is a bad choice IMO. If we change them to have reference semantics instead, many of these issues would be solved. T -- 2+2=4. 2*2=4. 2^2=4. Therefore, +, *, and ^ are the same operation.
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 09:22 PM, Andrei Alexandrescu wrote:
 On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
      **************************************************************************
      * Reading multiple times from the start of the same random range, should *
      * produce different (and statistically independent) results each time.   *
      **************************************************************************

I don't think that's a good rule. A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values.

Let me clarify. I'm _not_ talking about pseudo-random number generating ranges such as MersenneTwisterEngine, which can indeed be forward ranges and in fact are deterministic. What I mean by a "random range" is one where popFront() calls a source of randomness (real or pseudo) in order to update the value of front(). So, I don't consider MersenneTwister to be a random range -- but RandomSample is, because it calls uniform() within popFront(). When saying "reading from the start", I also didn't mean to imply that successive calls to .front should produce different values. By "reading from the start" I meant something like this: SomeRandomRange r; foreach(x; r) writeln(r); writeln(); foreach(x; r) writeln(r); ... should produce two statistically independent sequences. RandomSample is interesting to consider in light of your remarks -- its .save can't in general guarantee the same values moving forward, because it can't guarantee that the source of randomness can reproduce the same values: property typeof(this) save() { auto ret = this; ret._input = _input.save; return ret; } If (and only if) the RandomSample instance contains an internal generator, then the auto ret = this; line will copy it (by value) and so the saved copy will reproduce the sequence that would have been produced by the original. However, if the RandomSample instance was initiated with Random == void in the template parameters, then it will use rndGen to generate random numbers, and so the forward sequence will be different each time. So in this sense the "save" is meaningless. Now, we could and maybe should deal with that by forbidding save() in the case where Random == void, but on the other hand, the idea of a random number generator copied by value creates other problems to do with statistical safety. If I do this: auto sample = randomSample(iota(100), 5, rndGen); ... which is an "obvious" thing to do, then the results of that will be correlated with subsequent things I do involving rndGen -- all because the generator gets copied by value. On the other hand, if we assume that the generator is a reference type, then it will inevitably produce different samples each time it's iterated over.
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 09:36 PM, monarch_dodra wrote:
 Good analysis but (sorry) I think you are on the wrong track.
 
 One of the major problems in std.random is that the ranges use value semantics.
 This means they are *saved* whenever they are copied, as opposed to just
 referenced. This creates the problems you have mentioned, and even more.

I agree that the fact that pseudo-random number generators use value semantics is a serious problem, and I was thinking of our previous discussions in preparing these remarks. However, I'll stress again for the avoidance of doubt that when referring to "random ranges" I was not referring to pseudo-random number generators, but to ranges that _call_ random-number generators (real or pseudo) inside popFront().
 I have tried to fix it before:
 http://forum.dlang.org/thread/oiczxzkzxketxitncghl forum.dlang.org
 FWI, I gave up on the project, because it was too complex for me to handle an
 entire module. But there were no reasons for it to not work.

I remember your work and was sad to see that it was not accepted -- actually one reason to start this discussion was to try and push awareness back to your contributions :-)
 A random range should be viewed (IMO) as nothing more than a range that "was"
 (conceptually) simply filled with random numbers. Calling front on the same
 range twice in a row *should* produce the same result: No call to popFront =>
no
 change to the range. If it did change, it'd be a blatant violation of the range
 concept. It also means you can't have safe/nothrow/pure/const "front".

Completely agree, and I don't think this is in contradiction with what I've proposed. My proposed "rule" might be better stated to clarify this.
 Being able to *save* a random range (which your proposal would prevent) can
have
 useful applications too. It means you can iterate on the same random number
 sequence several times, lazily, without having to store the results in a
buffer.

I agree, but there are broader issues to do with .save and ranges that call random number generators in popFront(). If they are calling rndGen() or any other "global" random number generator, then it's impossible for them to reproduce the same sequence over and over.
 The problems you are trying to fix can (imo) can be better fixed with the
 reference approach.

On the contrary, consider the SimpleRandomRange -- or RandomSample -- with a specified RNG that has reference semantics. You'll get a different sequence each time you iterate over it, and .save will not enable you to reproduce the sequence. I'd be happy to see a way in which one _could_ reproduce the sequence with .save, but it would have to be statistically reliable. (That is, if I pass it a given random number generator, then the results of the sequence would have to be uncorrelated with other subsequent calls to that generator in other parts of the program.)
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
 A random range should be viewed (IMO) as nothing more than a range that "was"
 (conceptually) simply filled with random numbers. Calling front on the same
 range twice in a row *should* produce the same result: No call to popFront =>
no
 change to the range. If it did change, it'd be a blatant violation of the range
 concept. It also means you can't have safe/nothrow/pure/const "front".

Completely agree, and I don't think this is in contradiction with what I've proposed. My proposed "rule" might be better stated to clarify this.

Perhaps this would be a better statement: ************************************************************************ * Iterating fully over a given random range should produce a different * * sequence for each such complete iteration. * ************************************************************************ So, if you do, SomeRandomRange r; x = r.front; y = r.front; assert(x == y); // Passes!! But SomeRandomRange r; arr1 = array(r); arr2 = array(r); assert(x != y); // the two arrays are filled with different sequences.
Jun 17 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 17 June 2013 at 22:18:44 UTC, Joseph Rushton Wakeling 
wrote:
 On 06/17/2013 09:36 PM, monarch_dodra wrote:
 Good analysis but (sorry) I think you are on the wrong track.
 
 One of the major problems in std.random is that the ranges use 
 value semantics.
 This means they are *saved* whenever they are copied, as 
 opposed to just
 referenced. This creates the problems you have mentioned, and 
 even more.

I agree that the fact that pseudo-random number generators use value semantics is a serious problem, and I was thinking of our previous discussions in preparing these remarks.

I didn't even realize you were part of that thread :/ Sorry about that.
 On the contrary, consider the SimpleRandomRange -- or 
 RandomSample -- with a
 specified RNG that has reference semantics.  You'll get a 
 different sequence
 each time you iterate over it, and .save will not enable you to 
 reproduce the
 sequence.

I don't think you should pass rndGen to adapter ranges if you want any kind of deterministic behavior. You should really give such ranges their own new generator, eg: auto sample = randomSample(iota(100), 5, Random(unpredictableSeed)); That said, it would help even more if we had something *even* simpler, such as: "Random random()" or "Random newRandomRange()", and then have the more idiomatic: auto sample = randomSample(iota(100), 5, newRandomRange());
Jun 17 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 17 June 2013 at 22:29:34 UTC, Joseph Rushton Wakeling 
wrote:
 On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
 A random range should be viewed (IMO) as nothing more than a 
 range that "was"
 (conceptually) simply filled with random numbers. Calling 
 front on the same
 range twice in a row *should* produce the same result: No 
 call to popFront => no
 change to the range. If it did change, it'd be a blatant 
 violation of the range
 concept. It also means you can't have safe/nothrow/pure/const 
 "front".

Completely agree, and I don't think this is in contradiction with what I've proposed. My proposed "rule" might be better stated to clarify this.

Perhaps this would be a better statement: ************************************************************************ * Iterating fully over a given random range should produce a different * * sequence for each such complete iteration. * ************************************************************************

I'm not sure what that means but...
 So, if you do,

     SomeRandomRange r;
     x = r.front;
     y = r.front;
     assert(x == y);  // Passes!!

 But

     SomeRandomRange r;
     arr1 = array(r);
     arr2 = array(r);
     assert(x != y);  // the two arrays are filled with 
 different sequences.

With reference ranges, that is indeed what you'd get. I could even add: SomeRandomForwardRange r; auto arr1 = array(r.save); auto arr2 = array(r); assert(x == y); // This time both are the same
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 11:34 PM, monarch_dodra wrote:
 I don't think you should pass rndGen to adapter ranges if you want any kind of
 deterministic behavior.

Actually, _passing_ rndGen with the current value semantics _will_ produce deterministic behaviour. The problem is that it's statistically unsafe as the copy-by-value means that the results of the adapter ranges will correlate with subsequent other uses of rndGen.
 You should really give such ranges their own new
 generator, eg:
 
    auto sample = randomSample(iota(100), 5, Random(unpredictableSeed));

I did discuss that in my longer email. I consider this to be statistically unsound because you can't guarantee that the results of Random(unpredictableSeed) will be statistically independent from the results of other random number generation activities in your program (a risk which increases with the number of random samples you generate in your program). It's also undesirable because relying on the user to be virtuous and generate a new, unpredictably seeded RNG is (to my mind) too trustworthy. If we consider D's goal to be safe-by-default, where statistics are concerned that should mean that the easy thing to do is also statistically "safe".
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 09:36 PM, monarch_dodra wrote:
 Being able to *save* a random range (which your proposal would prevent) can
have
 useful applications too. It means you can iterate on the same random number
 sequence several times, lazily, without having to store the results in a
buffer.

One further remark on this. I agree that it would be nice to be able to .save where possible -- that is, if one has a pseudo-random number sequence one should be able to save, and only if the source of randomness is "truly" random should the adapter range be an InputRange rather than ForwardRange. I concluded that this wasn't feasible more out of despair than desire, because I felt that deterministic, .save-able behaviour had its own traps that were potentially severe, because they would involve generating unintended statistical correlations that the user probably wouldn't notice.
Jun 17 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 11:18:36PM +0100, Joseph Rushton Wakeling wrote:
 On 06/17/2013 09:36 PM, monarch_dodra wrote:
 Good analysis but (sorry) I think you are on the wrong track.
 
 One of the major problems in std.random is that the ranges use value
 semantics.  This means they are *saved* whenever they are copied, as
 opposed to just referenced. This creates the problems you have
 mentioned, and even more.

I agree that the fact that pseudo-random number generators use value semantics is a serious problem, and I was thinking of our previous discussions in preparing these remarks.

Yeah we need to change RNGs to have reference semantics. I consider that a major design flaw in std.random. [...]
 I have tried to fix it before:
 http://forum.dlang.org/thread/oiczxzkzxketxitncghl forum.dlang.org
 FWI, I gave up on the project, because it was too complex for me to
 handle an entire module. But there were no reasons for it to not
 work.

I remember your work and was sad to see that it was not accepted -- actually one reason to start this discussion was to try and push awareness back to your contributions :-)

What were the reasons it was not accepted? T -- If it breaks, you get to keep both pieces. -- Software disclaimer notice
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 12:10 AM, H. S. Teoh wrote:
 What were the reasons it was not accepted?

I think mostly that, as a patch to std.random, it was a breaking change. There was a proposal to make a std.random2 but that would have been a rather more major piece of work :-(
Jun 17 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 11:36 PM, monarch_dodra wrote:
 I could even add:
 
      SomeRandomForwardRange r;
      auto arr1 = array(r.save);
      auto arr2 = array(r);
      assert(x == y); // This time both are the same

Try: auto gen2 = new MtClass19937(unpredictableSeed); /* the above is just MersenneTwister tweaked to be * a final class, hence reference type */ auto r5 = simpleRandomRange(0.0L, 1.0L, gen2); auto arr1 = array(r5.save.take(5)); auto arr2 = array(r5.take(5)); writeln(arr1); writeln(arr2); ... and you get two different sequences of numbers.
Jun 17 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, June 18, 2013 00:16:24 Joseph Rushton Wakeling wrote:
 On 06/18/2013 12:10 AM, H. S. Teoh wrote:
 What were the reasons it was not accepted?

I think mostly that, as a patch to std.random, it was a breaking change. There was a proposal to make a std.random2 but that would have been a rather more major piece of work :-(

Yeah. Changing std.random to use reference types would be a breaking change unless you just created new classes for everything. It would just be cleaner to do std.random2 instead. But that means that someone has to take the time to prepare it for the review queue. It would probably be a pretty easy sell though, since it can probably stay mostly the same aside from the struct -> class change (though at that point, we might as well take the opportunity to make sure that anything else that should be redesigned about it gets redesigned appropriately). - Jonathan M Davis
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 05:00 AM, Andrei Alexandrescu wrote:
 Once you consume an input range, there's no way to consume it again. It's done.

Sure, but what I'm proposing is a stronger constraint than "random ranges should be input ranges".
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 08:06 AM, Jonathan M Davis wrote:
 It would probably be a pretty easy sell  though, since it can probably stay
 mostly the same aside from the struct -> class change (though at that point,
 we might as well take the opportunity to  make sure that anything else that
 should be redesigned about it gets  redesigned appropriately).

Yea, this is also my feeling, which is part of why I'm pushing this concept of "random ranges" -- I want to ensure that the related issues are properly understood and discussed and some well-thought-out design patterns are prepared in order to ensure good and statistically reliable functionality in std.random2. One small note -- I'd have thought that a struct with an internal pointer-to-payload (allocated using manual memory management, not GC) would have been a superior design for pseudo-random number generators compared to making them final classes. The latter is just the easiest thing to do for simple tests of PRNG-as-reference-type.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 05:00 AM, Andrei Alexandrescu wrote:
 On 6/17/13 6:29 PM, Joseph Rushton Wakeling wrote:
 On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
 A random range should be viewed (IMO) as nothing more than a range that "was"
 (conceptually) simply filled with random numbers. Calling front on the same
 range twice in a row *should* produce the same result: No call to popFront
 =>  no
 change to the range. If it did change, it'd be a blatant violation of the range
 concept. It also means you can't have safe/nothrow/pure/const "front".

Completely agree, and I don't think this is in contradiction with what I've proposed. My proposed "rule" might be better stated to clarify this.

Perhaps this would be a better statement: ************************************************************************ * Iterating fully over a given random range should produce a different * * sequence for each such complete iteration. * ************************************************************************ So, if you do, SomeRandomRange r; x = r.front; y = r.front; assert(x == y); // Passes!! But SomeRandomRange r; arr1 = array(r); arr2 = array(r); assert(x != y); // the two arrays are filled with different sequences.

Once you consume an input range, there's no way to consume it again. It's done.

Let me be certain I understand what you're saying. Is it that something like this: SomeInputRange r; arr1 = array(r); arr2 = array(r); ... is not legit? Or are you saying that the properties I've described are simply the properties of input ranges? If the latter, then note that what I'm proposing is something stronger than just, "Random ranges should be input ranges." Once again I should rephrase: "Iterating fully over a given random range should produce a different and statistically independent sequence for each such complete iteration." I don't come to that conclusion because I _want_ random ranges to be un-.save-able, but because I think without that design choice, there will simply be too many ways to unknowingly generate unwanted correlations in random-number-using programs. I'll follow up on that point later today.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/17/2013 09:36 PM, monarch_dodra wrote:
 Being able to *save* a random range (which your proposal would prevent) can
have
 useful applications too. It means you can iterate on the same random number
 sequence several times, lazily, without having to store the results in a
buffer.

If I recall right, there are a number of Phobos entities that make use of the opportunity to .save -- some of std.algorithm, I think? I'll browse the code myself, but if anyone could point me to some explicit examples, it would be very useful.
Jun 18 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, June 18, 2013 10:27:11 Joseph Rushton Wakeling wrote:
 On 06/18/2013 08:06 AM, Jonathan M Davis wrote:
 It would probably be a pretty easy sell  though, since it can probably
 stay
 mostly the same aside from the struct -> class change (though at that
 point, we might as well take the opportunity to  make sure that anything
 else that should be redesigned about it gets  redesigned appropriately).

Yea, this is also my feeling, which is part of why I'm pushing this concept of "random ranges" -- I want to ensure that the related issues are properly understood and discussed and some well-thought-out design patterns are prepared in order to ensure good and statistically reliable functionality in std.random2. One small note -- I'd have thought that a struct with an internal pointer-to-payload (allocated using manual memory management, not GC) would have been a superior design for pseudo-random number generators compared to making them final classes. The latter is just the easiest thing to do for simple tests of PRNG-as-reference-type.

An internal payload might make more sense, but it probably needs to be done in a way that it can be controlled (which may require custom allocators), because even allocating with malloc all the time might be a problem for the guys who are really picky about memory stuff - though maybe that level of customization could be added later. I don't know. I also don't work in that kind of environment, so I don't know exactly what such programmers find acceptable. - Jonathan M Davis
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 10:32 AM, Jonathan M Davis wrote:
 An internal payload might make more sense, but it probably needs to be done in 
 a way that it can be controlled (which may require custom allocators), because 
 even allocating with malloc all the time might be a problem for the guys who 
 are really picky about memory stuff - though maybe that level of customization 
 could be added later. I don't know. I also don't work in that kind of 
 environment, so I don't know exactly what such programmers find acceptable.

Ahh, very good point. I guess we should probably invite Manu and Adam Ruppe to comment on this ...
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 09:32:35 UTC, Jonathan M Davis wrote:
 On Tuesday, June 18, 2013 10:27:11 Joseph Rushton Wakeling 
 wrote:
 On 06/18/2013 08:06 AM, Jonathan M Davis wrote:
 It would probably be a pretty easy sell  though, since it 
 can probably
 stay
 mostly the same aside from the struct -> class change 
 (though at that
 point, we might as well take the opportunity to  make sure 
 that anything
 else that should be redesigned about it gets  redesigned 
 appropriately).

Yea, this is also my feeling, which is part of why I'm pushing this concept of "random ranges" -- I want to ensure that the related issues are properly understood and discussed and some well-thought-out design patterns are prepared in order to ensure good and statistically reliable functionality in std.random2. One small note -- I'd have thought that a struct with an internal pointer-to-payload (allocated using manual memory management, not GC) would have been a superior design for pseudo-random number generators compared to making them final classes. The latter is just the easiest thing to do for simple tests of PRNG-as-reference-type.

An internal payload might make more sense, but it probably needs to be done in a way that it can be controlled (which may require custom allocators), because even allocating with malloc all the time might be a problem for the guys who are really picky about memory stuff - though maybe that level of customization could be added later. I don't know. I also don't work in that kind of environment, so I don't know exactly what such programmers find acceptable. - Jonathan M Davis

Memory wise, classes or pointer to payload will both have a memory allocation cost anyways. classes will be on the GC, whereas structs *may* use malloc + ref count. The advantage of classes is that you can default initialize them, which is a real + given that some random ranges may need priming. (or we could bring up yet again the subject of allowing structs to have explicitly callable constructors with no arguments). What I remember, is that there were some people who did numerics that didn't want to have reference types, because of the overhead to access the payload. They *really* wanted to be able to declare their types on stack. What I did find though is that if you implement all of the PRNGs as value types, you can then easily wrap them all inside a reference types or class. This: 1: Makes implementation easier 2: Leaves a door open for users that want explicitly value types (at their own responsibility). The last issue is input vs forward. IMO, PRNG's should be input by *default*, and forward (when possible) on demand. Making a PRNG forward by default means you can't prevent an algorithm from saving your range. That means that *even* if you have a reference type, you could still have duplicate series. In the above example with dual call to "array", one can *hope* to have two different ranges... but there is nothing preventing array from calling "save" just to troll the user. fill does (used to) do it :D I was also able to implement this pretty easily: Just give the saveable ranges the "dup" primitive. Users of the PRNG can then explicitly dup if they want to very safely and explicitly. If they really want a ForwardPRNG, then a simple adaptor that adds save in terms of dup is provided.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 10:30 AM, Joseph Rushton Wakeling wrote:
 I don't come to that conclusion because I _want_ random ranges to be
 un-.save-able, but because I think without that design choice, there will
simply
 be too many ways to unknowingly generate unwanted correlations in
 random-number-using programs.
 
 I'll follow up on that point later today.

Just as a simple example -- and this involving purely pseudo-random number generation, not "random ranges" as I've conceived them [*] -- consider the following: auto t1 = rndGen.take(5); writeln(t1); auto t2 = rndGen.take(5); writeln(t2); I'd expect that to produce two distinct sequences. In fact it produces two identical sequences. I think this must be because, passed a forward range like Mt19937, std.range.Take uses this constructor: this(R input) { _original = input; _current = input.save; } ... and so when rndGen.take(5) is consumed, rndGen itself is not iterated forward. To me that's a very serious potential source of statistical errors in a program, and it's particularly bad because it's innocuous -- absent the writeln of the results, it would be easy to never realize what's happening here. Now imagine the potential consequences for D's use in scientific simulation -- I don't want to see D get raked over the coals because some researcher's published simulation results turned out to be spurious ... :-) I'd be very happy to see a way in which this issue can be resolved while preserving the opportunity to .save, but I don't personally see one that doesn't rely on a heavy amount of user virtue to avoid these statistical pitfalls. [* Actually, rndGen.take(n) is arguably a random range according to my definition, because rndGen.take(n).popFront() will request a (pseudo-)random number.]
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 11:00 AM, monarch_dodra wrote:
 The last issue is input vs forward. IMO, PRNG's should be input by *default*,
 and forward (when possible) on demand.

I think that's potentially a very good design concept, and might offer a friendly way round the issues I mentioned with rndGen.take() in my email of a few minutes ago.
 Making a PRNG forward by default means you can't prevent an algorithm from
saving
 your range. That means that *even* if you have a reference type, you could
still
 have duplicate series.

Indeed -- I was about to follow up on my rndGen.take example with one involving a final class version of MersenneTwisterEngine, but I think it's obvious without the example :-)
 In the above example with dual call to "array", one can *hope* to have two
different
 ranges... but there is nothing preventing array from calling "save" just to
 troll the user. fill does (used to) do it :D
 
 I was also able to implement this pretty easily: Just give the saveable ranges
 the "dup" primitive. Users of the PRNG can then explicitly dup if they want to
 very safely and explicitly. If they really want a ForwardPRNG, then a simple
 adaptor that adds save in terms of dup is provided.

I think in principle one could have a .forward() method/property that would return a forward-range wrapper of the range in question. So, with this concept, one would expect, auto t1 = rndGen.take(5); writeln(t1); auto t2 = rndGen.take(5); writeln(t2); ... to produce _different_ sequences (as rndGen would be merely an input range, and not .save-able), whereas auto gen = rndGen.forward; auto t1 = gen.take(5); writeln(t1); auto t2 = gen.take(5); writeln(t2); .... would produce identical sequences as before. It could in turn be a good design principle that pseudo-random sequences (and random ranges deriving from pseudo-random sequences) should define the .forward property, whereas "true" random sequences and their derivatives would lack it.
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 10:16:02 UTC, Joseph Rushton Wakeling 
wrote:
 std.range.Take uses this constructor:

     this(R input) { _original = input; _current = input.save; }

It shouldn't. It is the caller who should chose to (or not to) save.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 11:31 AM, monarch_dodra wrote:
 On Tuesday, 18 June 2013 at 10:16:02 UTC, Joseph Rushton Wakeling wrote:
 std.range.Take uses this constructor:

     this(R input) { _original = input; _current = input.save; }

It shouldn't. It is the caller who should chose to (or not to) save.

Ack, you're quite right. Take doesn't even have a constructor -- that's a constructor for Cycle. I agree with your principle about it being the caller's decision whether or not to .save, but can that be assumed in general? I thought there were Phobos algorithms or ranges that deliberately took saved copies for internal use.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 06/18/2013 11:54 AM, Joseph Rushton Wakeling wrote:
 On 06/18/2013 11:31 AM, monarch_dodra wrote:
 On Tuesday, 18 June 2013 at 10:16:02 UTC, Joseph Rushton Wakeling wrote:
 std.range.Take uses this constructor:

     this(R input) { _original = input; _current = input.save; }

It shouldn't. It is the caller who should chose to (or not to) save.

Ack, you're quite right. Take doesn't even have a constructor -- that's a constructor for Cycle.

Proof that I was wrong -- please find attached a test module that implements a final class (hence, reference type) version of Mersenne Twister. If we then do, auto gen = new MtClass19937(unpredictableSeed); auto t1 = gen.take(5); auto t2 = gen.take(5); writeln(t1); writeln(t2); ... we get different sequences. However, the .save as implemented needs rewriting (I haven't touched it yet) as currently it simply returns a reference copy. I'll get onto that. :-) Anyway, _assuming_ that we can rely on algorithms and ranges to not surreptitiously .save the random number generator, this does seem to offer a genuine alternative to my more problematic proposal. I'm very happy about this as I had feared my own idea was "least bad" rather than "best", never a nice position to be in.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 06/18/2013 12:09 PM, Joseph Rushton Wakeling wrote:
 ... we get different sequences.  However, the .save as implemented needs
 rewriting (I haven't touched it yet) as currently it simply returns a reference
 copy.  I'll get onto that. :-)

Tweaked version attached. This adds a new constructor which copies a PRNG of the same type: this(typeof(this) source) { this.mt[] = source.mt[]; this.mti = source.mti; this._y = source._y; } ... and .save then makes use of this to return a duplicate: property typeof(this) save() { return new typeof(this)(this); } With this in hand, we can do, auto gen = new MtClass19937(unpredictableSeed); auto t1 = gen.take(5); auto t2 = gen.take(5); writeln(t1); writeln(t2); ... and get two independent sequences, whereas, auto t3 = gen.take(5); auto t4 = gen.save.take(5); writeln(t3); writeln(t4); ... produces two identical sequences. This still requires some intelligence/virtue on behalf of the programmer, though: if we do, auto t5 = gen.save.take(5); writeln(t5); ... and don't also take 5 from gen itself, then subsequent uses of gen will be correlated with t5, probably unintentionally. However, I think that's a fair tradeoff for the ability to .save at all. It's also now trivial to rewrite things like my SimpleRandomRange such that, auto gen = new MtClass19937(unpredictableSeed); auto r = simpleRandomRange(0.0L, 1.0L, gen); auto t1 = r5.take(5); auto t2 = r5.take(5); writeln(t1); writeln(t2); ... produces different sequences, whereas, auto t3 = r5.take(5); auto t4 = r5.save.take(5); writeln(t3); writeln(t4); ... produce the same. Ditto for "real" random ranges, such as std.random.RandomSample. So we have a solution to the first set of problems identified, that is much nicer than the one I had in mind. Very good! However, at least one more niggle remains -- to be followed up on.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 12:30 PM, Andrei Alexandrescu wrote:
 I'm not sure I understand what you are proposing. If a true random range is to
 be an input range with additional constraints, one simple thing is once .empty
 returns true there's no more extracting elements from it.

Sorry, that very brief email was sent in error -- I accidentally clicked the "send" button before I finished writing. The longer follow-up has more detail, though I think monarch_dodra's feedback has convinced me that my proposal is not necessary.
Jun 18 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 11:15:51AM +0100, Joseph Rushton Wakeling wrote:
 On 06/18/2013 10:30 AM, Joseph Rushton Wakeling wrote:
 I don't come to that conclusion because I _want_ random ranges to be
 un-.save-able, but because I think without that design choice, there
 will simply be too many ways to unknowingly generate unwanted
 correlations in random-number-using programs.
 
 I'll follow up on that point later today.

Just as a simple example -- and this involving purely pseudo-random number generation, not "random ranges" as I've conceived them [*] -- consider the following: auto t1 = rndGen.take(5); writeln(t1); auto t2 = rndGen.take(5); writeln(t2); I'd expect that to produce two distinct sequences. In fact it produces two identical sequences. I think this must be because, passed a forward range like Mt19937, std.range.Take uses this constructor: this(R input) { _original = input; _current = input.save; } ... and so when rndGen.take(5) is consumed, rndGen itself is not iterated forward.

No, I just looked at the source code, and there is no call to .save in the ctor. This is just another symptom of RNGs being passed by value: t1 and t2 are operating on two copies of rndGen passed by value. Were they passed by reference, this wouldn't have been a problem. I say again that RNGs being passed by value is a major BUG. The above situation is a prime example of this problem. We *need* to make RNGs passed by reference. For situations where you *want* to duplicate a pseudo random sequence, an explicit method should be provided to clone the RNG. Duplication of RNGs should never be implicit. [...]
 I'd be very happy to see a way in which this issue can be resolved
 while preserving the opportunity to .save, but I don't personally see
 one that doesn't rely on a heavy amount of user virtue to avoid these
 statistical pitfalls.

Just make RNGs passed by reference, and the above problem will not happen. T -- Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.
Jun 18 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 12:06:41AM -0700, Jonathan M Davis wrote:
 On Tuesday, June 18, 2013 00:16:24 Joseph Rushton Wakeling wrote:
 On 06/18/2013 12:10 AM, H. S. Teoh wrote:
 What were the reasons it was not accepted?

I think mostly that, as a patch to std.random, it was a breaking change. There was a proposal to make a std.random2 but that would have been a rather more major piece of work :-(

Yeah. Changing std.random to use reference types would be a breaking change unless you just created new classes for everything. It would just be cleaner to do std.random2 instead.

The name "std.random2" still makes me cringe. I wish there was a good alternative, but the only names I can think of are along the lines of std.rnd or std.rand, which are inferior names we wouldn't want to be stuck with. :-( T -- Why waste time learning, when ignorance is instantaneous? -- Hobbes, from Calvin & Hobbes
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 03:16 PM, H. S. Teoh wrote:
 No, I just looked at the source code, and there is no call to .save in
 the ctor. This is just another symptom of RNGs being passed by value: t1
 and t2 are operating on two copies of rndGen passed by value. Were they
 passed by reference, this wouldn't have been a problem.

You're right, it was my mistake -- I was scanning/searching the source code too quickly and a search for 'this(' took me into another struct without me realizing.
 Just make RNGs passed by reference, and the above problem will not
 happen.

That was the conclusion I also came to after monarch_dodra's examples. However, I'm still concerned about circumstances where an algorithm might choose to operate on a .save copy rather than on the original range -- this could result in unnoticed statistical correlations. Can I reasonably assume that no D language/runtime/standard library algorithm would do that? And if not, what are the examples where a .save is consumed instead of the original range?
Jun 18 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 11:27:03AM +0100, Joseph Rushton Wakeling wrote:
 On 06/18/2013 11:00 AM, monarch_dodra wrote:
 The last issue is input vs forward. IMO, PRNG's should be input by
 *default*, and forward (when possible) on demand.


+1. [...]
 In the above example with dual call to "array", one can *hope* to
 have two different ranges... but there is nothing preventing array
 from calling "save" just to troll the user. fill does (used to) do
 it :D
 
 I was also able to implement this pretty easily: Just give the
 saveable ranges the "dup" primitive. Users of the PRNG can then
 explicitly dup if they want to very safely and explicitly. If they
 really want a ForwardPRNG, then a simple adaptor that adds save in
 terms of dup is provided.


+1. We should follow D's dictum of being safe/correct by default, but allow the user to get under the hood if they need to. Copying a PRNG state is a potentially unsafe (in the probabilistic sense) operation, so it should be made explicit -- call .dup if the user really wants a copy of the PRNG sequence, pass by reference otherwise.
 I think in principle one could have a .forward() method/property that
 would return a forward-range wrapper of the range in question.
 
 So, with this concept, one would expect,
 
     auto t1 = rndGen.take(5);
     writeln(t1);
 
     auto t2 = rndGen.take(5);
     writeln(t2);
 
 ... to produce _different_ sequences (as rndGen would be merely an input range,
 and not .save-able), whereas
 
     auto gen = rndGen.forward;
 
     auto t1 = gen.take(5);
     writeln(t1);
 
     auto t2 = gen.take(5);
     writeln(t2);
 
 .... would produce identical sequences as before.

Actually, take doesn't (and shouldn't!) call .save, so both cases should return different sequences. The cause of the original problem was that RNGs are being passed by value, so you're actually working on copies of the PRNG state. Also, the above code is obscure; someone reading the code may have missed the call to .forward and thus misunderstand the intent of the code. I much rather provide .dup and force the user to write this: auto gen = rndGen; auto t1 = gen.dup.take(5); // Explicit call to .dup auto t2 = gen.take(5); This states up-front that we're expecting two identical sequences from the (P)RNG, so someone reading the code wouldn't be misled to think two different sequences are being generated. Self-documenting code is always better than code whose semantics rely on obscure contextual information that is easily missed. T -- Debian GNU/Linux: Cray on your desktop.
Jun 18 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 03:32:02PM +0100, Joseph Rushton Wakeling wrote:
 On 06/18/2013 03:16 PM, H. S. Teoh wrote:
 No, I just looked at the source code, and there is no call to .save in
 the ctor. This is just another symptom of RNGs being passed by value: t1
 and t2 are operating on two copies of rndGen passed by value. Were they
 passed by reference, this wouldn't have been a problem.

You're right, it was my mistake -- I was scanning/searching the source code too quickly and a search for 'this(' took me into another struct without me realizing.
 Just make RNGs passed by reference, and the above problem will not
 happen.

That was the conclusion I also came to after monarch_dodra's examples. However, I'm still concerned about circumstances where an algorithm might choose to operate on a .save copy rather than on the original range -- this could result in unnoticed statistical correlations. Can I reasonably assume that no D language/runtime/standard library algorithm would do that? And if not, what are the examples where a .save is consumed instead of the original range?

Well, .save *is* used in a few places; the few that I know of include cartesianProduct (at least one of the ranges need to be iterated over repeatedly in order to compute the product), and find(), I believe, where the needle needs to be at least a forward range so that multiple candidate subranges can be checked repeatedly. And joiner forwards .save to its wrapper range, though it doesn't actually call it as part of its operation. There may be a few other places where .save is used. W.r.t. (P)RNGs, the places where .save is used are where random ranges wouldn't really make sense anyway -- if I *really* wanted to search a range for a random needle, I'd make an array out of prng.take(n) first, and then pass that to find(); it'd be more efficient, if nothing else. And I can't think of any useful application of taking a cartesian product of a random range. T -- "Hi." "'Lo."
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 14:18:19 UTC, H. S. Teoh wrote:
 [...]

 Just make RNGs passed by reference, and the above problem will 
 not
 happen.


 T

Well... it's more than *just* pass-by-reference, its actual store-by-reference (eg "reference semantic"). In the case of "take", take store a copy of the passed range, so a pass-by-reference wouldn't really change much. I'm not sure if that's what you meant, but just in case. We don't want any ambiguity on the words we are using.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

On 06/18/2013 03:38 PM, H. S. Teoh wrote:
 W.r.t. (P)RNGs, the places where .save is used are where random ranges
 wouldn't really make sense anyway -- if I *really* wanted to search a
 range for a random needle, I'd make an array out of prng.take(n) first,
 and then pass that to find(); it'd be more efficient, if nothing else.
 And I can't think of any useful application of taking a cartesian
 product of a random range.

I'll have a look through these things and see if there are any ways in which they might reasonably offer a way to come unstuck statistics-wise. I've attached a little example. Here we call: auto gen = new MtClass19937(unpredictableSeed); auto r = simpleRandomRange(0.0L, 1.0L, gen); auto t1 = r.take(5); auto s = r.save; auto t2 = s.take(5); writeln(t1); writeln(t2); auto t3 = r.take(5); auto t4 = s.take(5); writeln(t3); writeln(t4); Note that the first pair and last pair produce the same output, but the successive takes from r (t1 and t3) are different from each other, as are the successive takes from the save s (t2 and t4). This is what you'd expect, but an algorithm might assume that repeatedly consuming r.save would result in the same output again and again and again.
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 14:39:38 UTC, Andrei Alexandrescu 
wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. 
 The above
 situation is a prime example of this problem. We *need* to 
 make RNGs
 passed by reference. For situations where you *want* to 
 duplicate a
 pseudo random sequence, an explicit method should be provided 
 to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404. Andrei

That sounds nice, but is it possible? The only way to truly forward everything is via an alias this. However, while all the calls are properly forwarded, it tends to fail for "is" traits. EG: struct S { int s; alias INT = int; void foo() {writeln("foo()");} void foo(int) {writeln("foo(int)");} void foo2(Args)(int) {writeln("foo!" ~ Args.stringof ~ "(int)");} property bool empty(){return true;} property int front(){return 1;} void popFront(){} } class Class(S) { private S __s; this(Args...)(auto ref Args args){__s = S(args);} alias __s this; } void main() { auto c = new Class!S(5); c.foo(); c.foo(5); c.foo2!short(5); c.s = 5; //it's a trap! Class!S.INT a = 5; assert(isInputRange!S); assert(isInputRange!(Class!S)); } Here, this compiles, but fails at the last line... Or maybe I'm not creative enough... ?
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 03:38 PM, H. S. Teoh wrote:
 Well, .save *is* used in a few places; the few that I know of include
 cartesianProduct (at least one of the ranges need to be iterated over
 repeatedly in order to compute the product), and find(), I believe,
 where the needle needs to be at least a forward range so that multiple
 candidate subranges can be checked repeatedly. And joiner forwards .save
 to its wrapper range, though it doesn't actually call it as part of its
 operation. There may be a few other places where .save is used.

Try: auto gen = new MtClass19937(1001); auto t1 = gen.take(2); auto t2 = gen.take(2); writeln(cartesianProduct(t1, t2)); auto t3 = gen.take(2); auto t4 = gen.take(2); writeln(cartesianProduct(t3, t4)); The results are interesting. t1 and t2 are evaluated to be equal to one another, but it looks like the original source range is consumed, as t3 and t4 are equal to one another but different from t1 and t2. Now try: auto gen = new MtClass19937(1001); auto t1 = gen.take(2); auto t2 = gen.take(2); writeln(cartesianProduct(t1, iota(2))); writeln(cartesianProduct(t1, iota(2))); writeln(cartesianProduct(t1, iota(2))); writeln(cartesianProduct(t1, iota(2))); ... identical results all round. The first argument is not consumed, cartesianProduct must iterate only over a saved copy. NOW try: auto gen = new MtClass19937(1001); auto t1 = gen.take(2); auto t2 = gen.take(2); writeln(cartesianProduct(iota(2), t1)); writeln(cartesianProduct(iota(2), t1)); writeln(cartesianProduct(iota(2), t2)); writeln(cartesianProduct(iota(2), t2)); ... and the results keep changing. The first argument to cartesianProduct is not consumed, the second is. I feel very uncomfortable about the idea of cases like this where the consumption of the source of randomness is so unpredictable.
Jun 18 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 04:26 PM, Joseph Rushton Wakeling wrote:
 ... and the results keep changing.  The first argument to cartesianProduct is
 not consumed, the second is.

I guess it's down to this part of cartesianProduct: static if (isInfinite!R1 && isInfinite!R2) { ... } else static if (isInputRange!R2 && isForwardRange!R1 && !isInfinite!R1) { return joiner(map!((ElementType!R2 a) => zip(range1.save, repeat(a))) (range2)); } else static if (isInputRange!R1 && isForwardRange!R2 && !isInfinite!R2) { return joiner(map!((ElementType!R1 a) => zip(repeat(a), range2.save)) (range1)); } In the case where both R1 and R2 are finite forward ranges, cartesianProduct will save range1 and not range2. I accept that this may not be an area where random input is expected, but it still feels "ouch" and is the kind of unexpected correlation problem I had anticipated. If we remove the .save property from MtClass19937 then the problem vanishes, but it is a shame to have to lose .save for pseudo-random number generators.
Jun 18 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 04:50:07PM +0200, monarch_dodra wrote:
 On Tuesday, 18 June 2013 at 14:18:19 UTC, H. S. Teoh wrote:
[...]

Just make RNGs passed by reference, and the above problem will not
happen.


T

Well... it's more than *just* pass-by-reference, its actual store-by-reference (eg "reference semantic").

Good point.
 In the case of "take", take store a copy of the passed range, so a
 pass-by-reference wouldn't really change much.

If the RNG were a reference type like a class, then the wrapper range Take would simply have another reference to the class instance, so the problem is avoided. But if it were only passed by reference but is a value type (e.g. modify take() to have a ref argument), then the problem will still persist, since the wrapper ctor will copy the RNG instance.
 I'm not sure if that's what you meant, but just in case. We don't
 want any ambiguity on the words we are using.

Thanks! T -- Государство делает вид, что платит нам зарплату, а мы делаем вид, что работаем.
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 16:11:06 UTC, Andrei Alexandrescu 
wrote:
 On 6/18/13 11:11 AM, monarch_dodra wrote:
 On Tuesday, 18 June 2013 at 14:39:38 UTC, Andrei Alexandrescu 
 wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. 
 The above
 situation is a prime example of this problem. We *need* to 
 make RNGs
 passed by reference. For situations where you *want* to 
 duplicate a
 pseudo random sequence, an explicit method should be 
 provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404. Andrei

That sounds nice, but is it possible? The only way to truly forward everything is via an alias this.

Nonono, that must use introspection and mixins, like WhiteHole and BlackHole do. Andrei

Hum... just looked at the code. I was hoping there was a way to do this without pulling out the heavy artillery...
Jun 18 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, June 18, 2013 10:39:38 Andrei Alexandrescu wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. The above
 situation is a prime example of this problem. We *need* to make RNGs
 passed by reference. For situations where you *want* to duplicate a
 pseudo random sequence, an explicit method should be provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404.

That may be a good way to go about it, but the default such as what rndGen returns really should be a reference type, and only ranges whose state can actually be copied such that continuing to iterate over them gives the same values should be forward ranges IMHO. Using Class!T might be able to make it so that we can avoid needing a std.random2, but we'll probably still need some minor code breakage in there somewhere. - Jonathan M Davis
Jun 18 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 18 June 2013 at 19:21:23 UTC, Andrei Alexandrescu 
wrote:
 On 6/18/13 2:11 PM, Jonathan M Davis wrote:
 On Tuesday, June 18, 2013 10:39:38 Andrei Alexandrescu wrote:
 On 6/18/13 10:16 AM, H. S. Teoh wrote:
 I say again that RNGs being passed by value is a major BUG. 
 The above
 situation is a prime example of this problem. We *need* to 
 make RNGs
 passed by reference. For situations where you *want* to 
 duplicate a
 pseudo random sequence, an explicit method should be 
 provided to clone
 the RNG. Duplication of RNGs should never be implicit.

Occasionally copying a RNG's state is useful, but I agree most of the time you want to just take a reference to it. I think a good way toward a solution is http://d.puremagic.com/issues/show_bug.cgi?id=10404.

That may be a good way to go about it, but the default such as what rndGen returns really should be a reference type, and only ranges whose state can actually be copied such that continuing to iterate over them gives the same values should be forward ranges IMHO. Using Class!T might be able to make it so that we can avoid needing a std.random2, but we'll probably still need some minor code breakage in there somewhere. - Jonathan M Davis

How about defining rndGenRef and putting rndGen on the deprecation path? Andrei

Even with 10404, I still think we'd need a new module. An adaptor helps, but these ranges really need to be ref by default (amongst other changes). We'd probably have to end up with MersenneRef, as well as pretty much everything else... Also: Class!PRNG.save would not actually save, it would merely return a copy of its PRNG payload, and not a new class instance. This could be problematic (for starters, it wouldn't qualify for forward range...)
Jun 18 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, June 18, 2013 15:21:23 Andrei Alexandrescu wrote:
 How about defining rndGenRef and putting rndGen on the deprecation path?

We could, though that's a bit ugly as names go IMHO. We could almost get away with simply changing what rndGen returns, but unfortunately, it returns Random rather than auto, so if anyone typed it explicitly, we'd break their code - unless we changed Random to be aliased to Class!Mt19937 instead of Mt19937, but then anything using Random explictly without rndGen would break. My _guess_ would be that we could get away with just changing rndGen to return something else and break almost no one, but unfortunately we can't know that for sure. We probably should have had rndGen return auto rather than Random, but we have more experience with ranges now than when std.random was written. Regardless, I think that we should be sure of what we want to do with the module as a whole before we start making tweaks like that, since if we decide that we really do want std.random2, there's not much point in tweaking std.random, and std.random2.rndGen could return what we want. If we do stick with std.random though, I think that we should come up with a better name than rndGenRef (maybe rndRange or genRnd or simply random). - Jonathan M Davis
Jun 18 2013
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/18/2013 08:35 PM, Jonathan M Davis wrote:
 Regardless, I think that we should be sure of what we want to do with the 
 module as a whole before we start making tweaks like that, since if we decide 
 that we really do want std.random2, there's not much point in tweaking 
 std.random, and std.random2.rndGen could return what we want. If we do stick 
 with std.random though, I think that we should come up with a better name than 
 rndGenRef (maybe rndRange or genRnd or simply random).

If you're having a std.random2 then I don't see a reason to change the name -- keep it as rndGen. In _most_ cases of downstream code it won't break, so it smooths the upgrade path. I presume using std.random and std.random2 together would be considered an error, so is there really any risk of a clash? Also, assuming we do go with std.random2 rather than an upgrade to std.random, it may be better to go with structs-carrying-a-pointer-to-a-payload rather than with classes; that in turn might allow for less breakage in users' code. For example, if the Random alias is switched from a struct to a class, then users would have to go through their code tweaking stuff like this: - auto gen = Random(unpredictableSeed); + auto gen = new Random(unpredictableSeed); Whereas if RNGs are still structs on the outside (with reference to payload inside) then the original line should still work.
Jun 18 2013