digitalmars.D - Mir Random and Dlang Ranges [Example]

Ilya Yaroshenko (34/34) Nov 25 2016 The discussion [2] about Mir Random [1] and Range API become a

Andrei Alexandrescu (3/4) Nov 25 2016 I thought I agreed that a noncopyable struct with opCall is fine in

Joseph Rushton Wakeling (5/8) Nov 25 2016 FWIW, I suspect that Ilya means simply that it became long and

Ilya Yaroshenko (3/11) Nov 25 2016 Yes, thanks

Andrei Alexandrescu (2/11) Nov 25 2016 I'd say we can bury the hatchet and move on with Ilya's API. -- Andrei

Ilya Yaroshenko (4/21) Nov 25 2016 Yes, this was clear. There are also others who may disagree. This

Joseph Rushton Wakeling (22/28) Nov 25 2016 Without wanting to re-open the debate now, it seems to me that

Joseph Rushton Wakeling (37/40) Nov 25 2016 It's not a holy war, but your code example doesn't really address

Ilya Yaroshenko (20/47) Nov 25 2016 What kind of state should not be copied by value? I thought it is

Joseph Rushton Wakeling (76/85) Nov 25 2016 Unfortunately that's not true. The sampling algorithm pulls some

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

The discussion [2] about Mir Random [1] and Range API become a 
holy war [2]. I am putting the following example here. It also 
can be found at GitHub[1].

------
import std.range, std.stdio;

import random;
import random.variable: NormalVariable;

// Engines are allocated on stack or global

auto rng = Random(unpredictableSeed);

// Engines are passed by reference to algorithms

auto sample = rng

     // Random variables are passed by value

     .randomRange(NormalVariable!double(0, 1))

      // Fix sample length equal to 1000 elements (Input Range API)

     .take(1000)

     // Allocates memory and performs computation

     .array;

writeln(sample);
------

If you still think that Dlang Random API should be a Range API 
_by default_ instead of opCall, please submit a solid Range API 
solution, that will be fine for the real world applications such 
as Normal Variance Mean Mixtures [3], be  nogc  safe and be 
without data duplication. Only predefined ranges should be used 
for NormalVariable and the second distribution. It is an API 
incapsulation problem that have a nice solution using opCall 
because RandomVariables are tuples of params with opCall(Engine) 
method.

[1] https://github.com/libmir/mir-random
[2] 
http://forum.dlang.org/thread/zlhuomzishrqrjjpxfho forum.dlang.org
[3] https://en.wikipedia.org/wiki/Normal_variance-mean_mixture

Best regards,
Ilya

Nov 25 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 11/25/16 9:27 AM, Ilya Yaroshenko wrote:
 The discussion [2] about Mir Random [1] and Range API become a holy war [2].

I thought I agreed that a noncopyable struct with opCall is fine in 
conjunction with a range API adapter that uses a pointer. -- Andrei

Nov 25 2016

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Friday, 25 November 2016 at 15:04:24 UTC, Andrei Alexandrescu 
wrote:
 I thought I agreed that a noncopyable struct with opCall is 
 fine in conjunction with a range API adapter that uses a 
 pointer. -- Andrei

FWIW, I suspect that Ilya means simply that it became long and 
convoluted, not that anyone is taking a religious position on 
particular issues.

Nov 25 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 25 November 2016 at 15:22:36 UTC, Joseph Rushton 
Wakeling wrote:
 On Friday, 25 November 2016 at 15:04:24 UTC, Andrei 
 Alexandrescu wrote:
 I thought I agreed that a noncopyable struct with opCall is 
 fine in conjunction with a range API adapter that uses a 
 pointer. -- Andrei

 FWIW, I suspect that Ilya means simply that it became long and 
 convoluted, not that anyone is taking a religious position on 
 particular issues.

Yes, thanks

Nov 25 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 11/25/16 10:30 AM, Ilya Yaroshenko wrote:
 On Friday, 25 November 2016 at 15:22:36 UTC, Joseph Rushton Wakeling wrote:
 On Friday, 25 November 2016 at 15:04:24 UTC, Andrei Alexandrescu wrote:
 I thought I agreed that a noncopyable struct with opCall is fine in
 conjunction with a range API adapter that uses a pointer. -- Andrei

 FWIW, I suspect that Ilya means simply that it became long and
 convoluted, not that anyone is taking a religious position on
 particular issues.

 Yes, thanks

I'd say we can bury the hatchet and move on with Ilya's API. -- Andrei

Nov 25 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 25 November 2016 at 16:04:07 UTC, Andrei Alexandrescu 
wrote:
 On 11/25/16 10:30 AM, Ilya Yaroshenko wrote:
 On Friday, 25 November 2016 at 15:22:36 UTC, Joseph Rushton 
 Wakeling wrote:
 On Friday, 25 November 2016 at 15:04:24 UTC, Andrei 
 Alexandrescu wrote:
 I thought I agreed that a noncopyable struct with opCall is 
 fine in
 conjunction with a range API adapter that uses a pointer. -- 
 Andrei

 FWIW, I suspect that Ilya means simply that it became long and
 convoluted, not that anyone is taking a religious position on
 particular issues.

 Yes, thanks

 I'd say we can bury the hatchet and move on with Ilya's API. -- 
 Andrei

Yes, this was clear. There are also others who may disagree. This 
is the reason for this thread. --Ilya

Nov 25 2016

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Friday, 25 November 2016 at 16:08:15 UTC, Ilya Yaroshenko 
wrote:
 On Friday, 25 November 2016 at 16:04:07 UTC, Andrei 
 Alexandrescu wrote:
 I'd say we can bury the hatchet and move on with Ilya's API. 
 -- Andrei

 Yes, this was clear. There are also others who may disagree. 
 This is the reason for this thread. --Ilya

Without wanting to re-open the debate now, it seems to me that 
it's worth distinguishing carefully between:

   * the changes necessary to support Ilya's goals in terms of 
generating really
     high quality random variates (where I'm not sure the switch 
from ranges to
     functors really matters that much, except as a preference), 
and

   * the changes necessary to avoid the known issues in terms of 
accidental
     copying-by-value of RNG state (where the functor-based 
proposal is
     interesting, but not necessarily the best or only solution).

On the second point, I would suggest maintaining an open mind 
both towards Ilya's proposal but also to the possibility that 
there might be better alternatives.

(Minor remark: the copying-by-value of RNG state is not limited 
to random engines [i.e. pseudo-random algorithms]; it also 
applies to e.g. a random device which caches a buffer of bits 
read from a 'true' random source.)

Nov 25 2016

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Friday, 25 November 2016 at 14:27:13 UTC, Ilya Yaroshenko 
wrote:
 The discussion [2] about Mir Random [1] and Range API become a 
 holy war [2]. I am putting the following example here. It also 
 can be found at GitHub[1].

It's not a holy war, but your code example doesn't really address 
the key issues to do with the interaction of random number 
generation and ranges, which are not really much about how you 
create a reference-type range for RNGs themselves.  (Your 
solution is one of several.)

What would convince me of the merits of your approach is an 
implementation of `RandomSample` (as in, the range that extracts 
a subset of n elements out of the first N elements of an input 
range, with equal probability of selecting any particular 
element) that deals with the following challenges:

* its own internal state (or at least, those parts of it that 
correspond to state underlying the pseudo-random process, which 
is NOT limited to the RNG state) is not at risk of being copied 
by value;

* it's possible to generate multiple `RandomSample` instances 
deep in the inner loop of a program, without creating slow-down 
related to memory allocation (this was the reason I abandoned a 
class-based approach);

* it's easy to fit the `RandomSample` implementation into a range 
chain.

The last point doesn't have to be quite as simple as the sort of 
thing you can do currently, such as:

     iota(100).randomSample(10, rng).filter!(a => a > 50).writeln;

Ideally it would be possible to create a function that creates 
that kind of range chain and returns it, as in:

     auto myMagicRange(RNG)(size_t n, size_t m, ref RNG rng)
     {
         assert(n < m);

         return iota(m).randomSample(n, rng).filter!(a => a > n/2);
     }

     // Now output 10,000 unique filtered samples ...
     foreach (_; 0 .. 10_000)
     {
         myMagicRange(10, 100, rng).writeln;
     }

Nov 25 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Friday, 25 November 2016 at 15:09:46 UTC, Joseph Rushton 
Wakeling wrote:
 * its own internal state (or at least, those parts of it that 
 correspond to state underlying the pseudo-random process, which 
 is NOT limited to the RNG state) is not at risk of being copied 
 by value;

What kind of state should not be copied by value? I thought it is 
only an Engine.
Engines can not be copied by value in Mir Random. If you mean 
counters, than it is not correct for Dlang. Almost all D counters 
like ranges are copied by value. A `ref` for randomSample should 
be used to functions if thay want to modify its counters (the 
same logic exists in std.experimental.primitives like popFront ).

 * it's possible to generate multiple `RandomSample` instances 
 deep in the inner loop of a program, without creating slow-down 
 related to memory allocation (this was the reason I abandoned a 
 class-based approach);

No problem, the same approach like for RandomRange should work.

 * it's easy to fit the `RandomSample` implementation into a 
 range chain.

No problem, ditto

 The last point doesn't have to be quite as simple as the sort 
 of thing you can do currently, such as:

     iota(100).randomSample(10, rng).filter!(a => a > 
 50).writeln;

Why not? rng can be passed by ref and and its pointer can be 
preserved internally. So the same syntax will work in Mir Random.

 Ideally it would be possible to create a function that creates 
 that kind of range chain and returns it, as in:

     auto myMagicRange(RNG)(size_t n, size_t m, ref RNG rng)
     {
         assert(n < m);

         return iota(m).randomSample(n, rng).filter!(a => a > 
 n/2);
     }

     // Now output 10,000 unique filtered samples ...
     foreach (_; 0 .. 10_000)
     {
         myMagicRange(10, 100, rng).writeln;
     }

This works for RandomRange and can be done for RandomSample the 
same way.

BTW, I am looking for your RandomSample PR! :-)

range.algorithm [1] seems to be a proper module for it (it is 
small for now, so your name should be the first here).

[1] 
https://github.com/libmir/mir-random/blob/master/source/random/algorithm.d

Nov 25 2016

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Friday, 25 November 2016 at 15:56:45 UTC, Ilya Yaroshenko
wrote:
What kind of state should not be copied by value? I thought it
is only an Engine.

Unfortunately that's not true. The sampling algorithm pulls some
tricks to try to reduce the number of calls to the random number
generator (as well as to avoid calls to some more expensive
calculations to transform those variates into what is wanted).

The variable in question is here:
https://github.com/dlang/phobos/blob/f6b4e89d64eb998b80a70f23426e0e509e1933d8/std/random.d#L2413

And you can see here the places where regenerating it from the
RNG may be avoided:
https://github.com/dlang/phobos/blob/f6b4e89d64eb998b80a70f23426e0e509e1933d8/std/random.d#L2699-L2702
https://github.com/dlang/phobos/blob/f6b4e89d64eb998b80a70f23426e0e509e1933d8/std/random.d#L2763

Essentially the sampling algorithm carries out its own little
internal pseudo-random process which falls back to using the RNG
when necessary.

Now, in this specific case, it might be possible to avoid that.
The paper where this algorithm is introduced is more than 30
years old and it's possible that optimization is no longer very
relevant on modern hardware with modern RNGs, so it might be fine
to just generate the value concerned from the RNG with every
popFront().

However, that would need to be benchmarked; and more generally,
it doesn't seem to me to be reasonable to assume that other
random algorithms might not have similar requirements to maintain
some state associated with an internal pseudo-random process.

Engines can not be copied by value in Mir Random. If you mean
counters, than it is not correct for Dlang. Almost all D
counters like ranges are copied by value. A `ref` for
randomSample should be used to functions if thay want to modify
its counters (the same logic exists in
std.experimental.primitives like popFront ).

All of the "No problem" in your remarks comes out of the
assumption that it's just the RNG whose state must never be
copied by value. But even if we can manage to get rid of the
problematic state from this particular algorithm, I don't think
that's an assumption that can necessarily be generalized to all
random algorithms, which may do all sorts of clever state-based
tricks to optimize their performance.

Can that kind of state also be wrapped up inside a struct that
can't be copied by value? Possibly, but then that makes it
rather more tricky to use RandomSample in a friendly way inside
extended range chains as described in my previous post.

Stepping back from the specifics: one of my gripes, not with you
personally, but with the general discussion around this issue is
that people almost always seem to assume it's all about the RNGs,
and that if you stop them from being copied by value, then
everything else will fall out just fine. It's very rare that
people ever give consideration to the random algorithms except as
an afterthought. That means that the solutions proposed very
often seem nice and promising when looked at in terms of RNGs,
and sometimes also in terms of random distributions, but fall
flat the moment random algorithms come into play.

I've had a lot of conversations with different people around this
issue, and I've tried out a bunch of different ideas, including
functor RNGs which would be wrapped with a range API accessing
them via pointer (a suggestion Dicebot among others made to me).
In every case, I've run into difficulty when moving from RNGs and
distributions to random algorithms like RandomCover and
RandomSample.

Now, that's very possibly just my failure -- it would be great if
you could identify a solution where I could not. But the reason
I described the RandomSample-related problem above was not to get
assurances that it would not be a problem. It was meant to
encourage you to look in detail yourself at this problem, to
experience its nuances for yourself. It may not be as
unproblematic as you think.

BTW, I am looking for your RandomSample PR! :-)

Well, I would love to give you one, but in this case I do think
it may be worth your examining the algorithm and its
implementation in more detail yourself.

Please understand, I'm not trying to be obstructive here. I
genuinely think there are factors you have not considered in this
case, that would be useful for you to look into in greater depth
before you finalize your designs. BTW the original research
papers on the algorithm are here, if you want to take a look:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.6400&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.1689&rep=rep1&type=pdf

What I will follow up with is to benchmark the sampling algorithm
without the optimization trick. I'll also try to submit a PR for
the xoroshiro and xorshift* RNGs, and post issues containing some
of the feedback promised on other aspects of the design. You may
have to be a little patient with me, though ... it's a busy time
right now :-\

Nov 25 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Mir Random and Dlang Ranges [Example]