
digitalmars.D.learn - What is the use case of RandomCover?

reply "Ivan Kazmenko" <gassa mail.ru> writes:
I'm unsure whether I should post to the ".learn" sub-forum or 
some other one in such a case, but still.

I wonder what is the use case of std.random.randomCover when one 
has std.random.randomShuffle.  I was trying to use it just to get 
a random permutation of integers, and found randomCover prior to 
randomShuffle.  However, for a number of elements as low as 
10,000, the delay was already rather surprising, so I searched 
for a faster solution and found that randomShuffle does a superior 
job.  And now I wonder: how does one correctly use randomCover?  
Below is a sample test program showing the difference.

-----
import std.array;
import std.random;
import std.range;
import std.stdio;

immutable int MAX_N = 10_000;

// compile with -version=randomCover or -version=randomShuffle
int [] fun (int n, ref Random rnd)
{
	auto t = array (iota (n));
	version (randomCover)
	{
		auto c = randomCover (t, rnd);
	}
	version (randomShuffle)
	{
		auto c = t;
		randomShuffle (c, rnd);
	}

	return array (c);
}

void main ()
{
	auto rnd = Random (123456789);
	writeln (fun (MAX_N, rnd));
	writeln (fun (MAX_N, rnd) == fun (MAX_N, rnd));
}
-----

Here is a comparison:

1. Speed.
+randomShuffle performs O(n) steps and O(n) uniform() calls.
-randomCover performs O(n^2) steps and O(n^2) uniform() calls.

The latter however can (and perhaps should?) be optimized to 
O(n): in the implementation, the line
             auto chooseMe = uniform(0, k, _rnd) == 0;
can be moved outside the foreach loop, storing the integer
         auto toPick = uniform(0, k, _rnd);
instead of a bool.  I can try and write the respective patch if 
needed.
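
To make this concrete, here is a sketch of the loop I have in 
mind (my own code with made-up names, not the actual Phobos 
internals):

-----
import std.random;

// Draw the rank once per step, then scan for the rank-th unchosen
// element: a single uniform() call instead of up to k of them.
size_t pickNext (const(bool)[] chosen, size_t k, ref Random rnd)
{
	auto toPick = uniform (0, k, rnd);  // k = number of unchosen elements
	foreach (i, c; chosen)
		if (!c && toPick-- == 0)
			return i;
	assert (0);  // unreachable while k > 0
}

void main ()
{
	auto rnd = Random (123456789);
	auto chosen = new bool [10];
	foreach_reverse (k; 1 .. 11)  // k unchosen elements remain
		chosen[pickNext (chosen, k, rnd)] = true;
	foreach (c; chosen)
		assert (c);  // every element got picked exactly once
}
-----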

2. Size.
-randomShuffle does not allocate anything extra, but modifies the 
range in place, and so requires allocating another range of n 
values if the original range has to be stored too.
+randomCover only allocates an array of n bools.  If that is the 
intended advantage, the implementation would be better off using 
a bit array instead of a bool array, as in this enhancement 
proposition: http://d.puremagic.com/issues/show_bug.cgi?id=2898
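
For example (a sketch, assuming std.bitmanip.BitArray would be 
acceptable in the implementation):

-----
import std.bitmanip;

void main ()
{
	enum n = 10_000;
	BitArray chosen;
	chosen.length = n;  // ~n/8 bytes of flags instead of n bytes of bools
	chosen[42] = true;  // mark element 42 as picked
	assert (chosen[42] && !chosen[43]);
}
-----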

3. Laziness.
-randomShuffle just does its job once.
+randomCover produces some sort of a lazy generator instead.  
Still, the generator performs an O(n) computation on each step, 
so the benefit is debatable.

4. Convenience.
+randomShuffle called multiple times with the same RNG advances 
the internal state of the RNG and thus produces different 
results.  If one needs the same results, that is still 
achievable by explicitly saving and restoring the internal state 
of the RNG (see the sketch below).
-randomCover called multiple times with the same RNG copies the 
RNG each time by value and thus produces the same result.  That 
is not the intended behavior in the majority of use cases I can 
imagine (e.g., generating different random permutations in a 
loop).  This is already the topic of an issue I found: 
http://d.puremagic.com/issues/show_bug.cgi?id=7067
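
To illustrate the saving-and-restoring point from item 4, a 
minimal sketch that relies on the value semantics of the current 
Random:

-----
import std.random;

void main ()
{
	auto rnd = Random (123456789);
	auto saved = rnd;  // value copy: a snapshot of the internal state
	auto a = [1, 2, 3, 4, 5];
	auto b = a.dup;
	randomShuffle (a, rnd);
	rnd = saved;       // restore the snapshot
	randomShuffle (b, rnd);
	assert (a == b);   // the same permutation, reproducibly
}
-----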

Now, the only case I can think of where randomCover should be 
preferred to randomShuffle is when you have a huge range 
(hundreds of MB), but you need to iterate over only the first 
few values of the cover.  Is there any other?

Whether the above is indeed the intended use of randomCover or 
not, I think that the intended use (and a reference to 
randomShuffle for other cases) should be mentioned in the 
documentation along with the time complexity.

More on the topic of optimization: the performance of the whole 
randomCover computation can be improved to O(n log n) by using a 
Fenwick tree or a similar structure to make each popFront run in 
O(log n).  But it will then require storing n integers, not n 
bools, thus losing the advantage of having smaller memory 
requirements than randomShuffle with a copy.
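
To make the Fenwick tree idea concrete, here is a rough sketch 
(entirely my own code, not anything from Phobos):

-----
import std.random;
import std.stdio;

// A Fenwick tree over 0/1 counts of still-unchosen elements.
// Drawing the next element is one uniform() call plus an O(log n)
// descent, for O(n log n) overall.
struct FenwickPicker
{
	int[] tree;     // 1-based Fenwick tree; tree[i] covers (i - (i & -i), i]
	int n;          // total number of elements
	int remaining;  // elements not yet picked

	this (int n)
	{
		this.n = n;
		remaining = n;
		tree = new int [n + 1];
		foreach (int i; 1 .. n + 1)  // O(n) build: every count starts at 1
		{
			tree[i] += 1;
			int j = i + (i & -i);
			if (j <= n)
				tree[j] += tree[i];
		}
	}

	// Remove and return the 0-based index of the rank-th unchosen element.
	int pick (int rank)
	{
		int pos = 0, mask = 1;
		while (mask * 2 <= n)
			mask *= 2;
		for (; mask > 0; mask /= 2)  // binary descent over prefix sums
		{
			int next = pos + mask;
			if (next <= n && tree[next] <= rank)
			{
				rank -= tree[next];
				pos = next;
			}
		}
		for (int i = pos + 1; i <= n; i += i & -i)  // decrement counts
			tree[i] -= 1;
		remaining -= 1;
		return pos;
	}
}

void main ()
{
	auto rnd = Random (123456789);
	auto picker = FenwickPicker (10);
	int[] perm;
	while (picker.remaining > 0)
		perm ~= picker.pick (uniform (0, picker.remaining, rnd));
	writeln (perm);  // a uniformly random permutation of 0 .. 10
}
-----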

-----
Ivan Kazmenko.
Feb 18 2013
next sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
 1. Speed.
 +randomShuffle performs O(n) steps and O(n) uniform() calls.
 -randomCover performs O(n^2) steps and O(n^2) uniform() calls.

 The latter however can (and perhaps should?) be optimized to 
 O(n): in the implementation, the line
Sorry, this part doesn't look clear.  O(n^2) total time with 
O(n^2) uniform() calls can be optimized to the same O(n^2) total 
time but with only O(n) uniform() calls.  It can be further 
optimized to O(n log n) total time with O(n) uniform() calls 
using a Fenwick tree, but that will then require storing n ints 
instead of n bools.
Feb 18 2013
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/18/2013 03:35 PM, Ivan Kazmenko wrote:
 I wonder what is the use case of std.random.randomCover when one has
 std.random.randomShuffle.  I was trying to use it just to get a random
 permutation of integers, and found randomCover prior to randomShuffle.
 However, for a number of elements as low as 10,000, the delay was already
 rather surprising, so I searched for a faster solution and found that
 randomShuffle does a
 superior job.  And now I wonder: how does one correctly use randomCover? Below
 is a sample test program showing the difference.
Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.
Feb 18 2013
parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
 Surely, because randomShuffle is re-ordering the elements 
 in-place, whereas randomCover is not.
Sorry, but that does not answer my question.

If "in-place shuffle" versus "function return value" was the only 
intended difference, randomCover could as well look like this 
(simplified to int[] case for expressiveness):

-----
int[] randomCover(int[] r, ref/*or not*/ Random rnd)
{
    auto s = r.dup;
    randomShuffle(s, rnd);
    return s;
}
-----

However, there are ~70 lines in the implementation quite 
different from just calling randomShuffle, presumably for some 
purpose.  And my question is, what is that purpose?
Feb 19 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 19 February 2013 at 09:04:12 UTC, Ivan Kazmenko wrote:
 Surely, because randomShuffle is re-ordering the elements 
 in-place, whereas randomCover is not.
 Sorry, but that does not answer my question.
 
 If "in-place shuffle" versus "function return value" was the only 
 intended difference, randomCover could as well look like this 
 (simplified to int[] case for expressiveness):
 
 -----
 int[] randomCover(int[] r, ref/*or not*/ Random rnd)
 {
     auto s = r.dup;
     randomShuffle(s, rnd);
     return s;
 }
 -----
 
 However, there are ~70 lines in the implementation quite 
 different from just calling randomShuffle, presumably for some 
 purpose.  And my question is, what is that purpose?
Hum... randomShuffle and randomSample actually have nothing to 
do with each other.  RandomSample will not shuffle anything, but 
will just (lazily) iterate over the original range, skipping 
elements, making it a "random sample".  It will not, under any 
circumstance, re-arrange any element.  Try this:

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (i, ref a; arr)
        a = cast(int) i;
    assert(isSorted(arr[]));
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 4).array();
        assert(isSorted(sample));
        writefln("[%(%3s,%)]", sample);
    }
}
//--------
[  8, 12, 16, 21]
[  5, 12, 17, 23]
[  1, 15, 21, 22]
[  9, 15, 21, 23]
[  6,  7, 13, 24]
[  4,  6,  9, 11]
[  9, 12, 20, 23]
[  5, 14, 20, 23]
[  0,  2, 10, 24]
[  1,  9, 13, 23]
//--------

I also want to comment on your "randomSample" vs "randomShuffle" 
implementation suggestion. Keep in mind that:
a) randomSample doesn't allocate, whereas your suggestion 
does;
b) randomSample gives direct access to the elements, whereas 
your suggestion doesn't.

If you don't care about a) and b), then by all means, dup away, 
and get better performance!  But think about the fact that you 
wouldn't be able to do something like this...

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 5);
        foreach (ref a; sample)
            ++a;
    }
    writefln("[%(%3s,%)]", arr[]);
}
//--------

Last but not least, be warned there is a long-standing bug with 
anything in Phobos that takes a PRNG by value.  Basically, the 
PRNG will be duplicated, and generate the same sequence over 
and over.  Ergo, do NOT pass a specific random generator to 
things like randomSample or randomShuffle.

This problem is one of the major reasons we are currently (and 
slowly) re-designing random into random2.
Feb 19 2013
next sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
Hi!

Thank you for the reply.

 Hum... randomShuffle and randomSample actually have nothing to 
 do with each other.
 <snip>
I'd like to note that my post is about randomCover, not 
randomSample.  I do see the difference between the purpose of 
randomSample and randomShuffle.  But randomCover's effect is, at 
first glance, just a slower version of randomSample wrapped as a 
lazy generator.
 I also want to comment on your "randomSample" vs "randomShuffle" 
 implementation suggestion. Keep in mind that:
 a) randomSample doesn't allocate, whereas your suggestion 
 does;
 b) randomSample gives direct access to the elements, whereas 
 your suggestion doesn't.

 If you don't care about a) and b), then by all means, dup away, 
 and get better performance!

 But think about the fact that you wouldn't be able to do 
 something like this...
 <snip>
         auto sample = randomSample(arr[], 5);
         foreach(ref a; sample)
             ++a;
That stands for randomCover, too.  Well, thank you, perhaps 
that's the difference I was seeking.

If this is the intended difference, my proposition to enhance 
randomCover's performance and usefulness transforms into:

1. Document the usage of randomCover with an example such as the 
above, and refer to randomShuffle as a faster version for simpler 
use cases.

2. Optimize the performance by putting Fenwick trees to good use.  
Currently, randomCover'ing 10,000 elements takes time on the 
order of one second, and for 100,000 or more elements it is 
hardly usable.
 Last but not least, be warned there is a long-standing bug with 
 anything in Phobos that takes a PRNG by value. Basically, the 
 PRNG will be duplicated, and generate the same sequence over 
 and over. Ergo, do NOT pass a specific random generator to 
 things like randomSample or randomShuffle.

 This problem is one of the major reasons we are currently (and 
 slowly) re-designing random into random2.
So, there is general agreement that in random2, RNGs should by 
default get passed by reference everywhere?  That's nice to hear.

-----
Ivan Kazmenko.
Feb 19 2013
next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
 I'd like to note that my post is about randomCover, not randomSample.  I do see
 the difference between the purpose of randomSample and randomShuffle.  But
 randomCover's effect is, at first glance, just a slower version of
 randomSample wrapped as a lazy generator.
To be honest, I think RandomCover is (statistically) unsafe to 
use right now, as it takes a copy of the random number generator 
it is given _by value_.  So unless you pass it an 
unpredictably-seeded new RNG that will not be used in any other 
context, you're going to get correlation between the output of 
RandomCover and other sequences of random numbers you generate 
elsewhere in your code.  (This problem is also found in 
RandomSample, but at least in that case there is an alternative 
of calling rndGen.)

As for purpose -- I think the point of RandomCover is to provide 
a means of generating a random shuffle that doesn't re-order 
elements of the original range in place, and that doesn't 
require creating a copy.  It's probably worth investigating the 
algorithms out there for this kind of functionality, because I 
doubt the algorithm that's given is optimal.
Feb 19 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 19 February 2013 at 12:18:43 UTC, Joseph Rushton 
Wakeling wrote:
 On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
 I'd like to note that my post is about randomCover, not 
 randomSample.  I do see
 the difference between the purpose of randomSample and 
 randomShuffle.  But
 randomCover's effect is, at the first glance, just a slower 
 version of
 randomSample wrapped as a lazy generator.
To be honest, I think RandomCover is (statistically) unsafe to use right now, as it takes a copy of the random number generator it is given _by value_.
If it keeps a copy, how it is passed (value or ref) actually becomes irrelevant. This is also a strong argument for the reference semantic PRNG approach.
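
A sketch of what reference semantics buys us (just the idea, not 
the actual random2 design; the full UniformRNG interface is 
omitted for brevity):

//--------
import std.random;

// Wrapping the engine in a class gives reference semantics: copies
// share state instead of silently duplicating the sequence.
final class RefRandom
{
    Mt19937 engine;
    this(uint seed) { engine.seed(seed); }
    @property uint front() { return engine.front; }
    void popFront() { engine.popFront(); }
    enum bool empty = false;
}

void main()
{
    auto a = new RefRandom(42);
    auto b = a;                  // copies the reference, not the state
    b.popFront();
    assert(a.front == b.front);  // both advanced: they share one engine
}
//--------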
Feb 19 2013
parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/19/2013 01:25 PM, monarch_dodra wrote:
 If it keeps a copy, how it is passed (value or ref) actually becomes
irrelevant.
Yes, I didn't explain myself well there. The real problem is that copying really duplicates the RNG rather than just copying a reference to the RNG engine.
 This is also a strong argument for the reference semantic PRNG approach.
Yes. Are you having any luck in discussions on how to map out a random2 where this will be implemented? I remember your original code was rejected as a plain update to Phobos because of fears over code breakage (though TBH I think this is a case where if you break someone's code you're doing them a favour, as they were probably doing something unwise).
Feb 21 2013
prev sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
Joseph Rushton Wakeling wrote:
 It's probably worth investigating the algorithms out there for 
 this kind of functionality, because I doubt the algorithm 
 that's given is optimal.
Given that the lazy version will hardly be possible without 
allocating O(n) additional memory anyway, the simple solution is:

- build a random permutation of iota(n) at start using 
randomShuffle (instead of maintaining the private bool[] _chosen 
array);

- lazily iterate the range according to the permutation.

It is n ints of memory instead of n bools, which is comparable, 
but O(n) initialization and O(1) per step, which is definitely 
better than now.
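
A sketch of that scheme (my own code, not tested against the 
exact Phobos API):

-----
import std.algorithm;
import std.array;
import std.random;
import std.range;
import std.stdio;

// Shuffle an index permutation once, then iterate the range
// through it lazily: O(n) setup, O(1) per step.
auto lazyCover (R) (R r, ref Random rnd)
{
	auto order = array (iota (r.length));  // n indices instead of n bools
	randomShuffle (order, rnd);            // O(n) setup
	return map!(i => r[i])(order);         // O(1) per step
}

void main ()
{
	auto rnd = Random (123456789);
	auto data = [10, 20, 30, 40, 50];
	writeln (lazyCover (data, rnd));  // e.g. [30, 10, 50, 20, 40]
}
-----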
Feb 19 2013
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 19 February 2013 at 11:46:54 UTC, Ivan Kazmenko wrote:
 Hi!

 Thank you for the reply.

 Hum... randomShuffle and randomSample actually have nothing to 
 do with each other.
 <snip>
 I'd like to note that my post is about randomCover, not 
 randomSample.  I do see the difference between the purpose of 
 randomSample and randomShuffle.  But randomCover's effect is, at 
 first glance, just a slower version of randomSample wrapped as a 
 lazy generator.
Hum... sorry about that.
 I also want to comment on your "randomSample" vs 
 "randomShuffle" implementation suggestion. Keep in mind that:
 a) randomSample doesn't allocate, whereas your suggestion 
 does;
 b) randomSample gives direct access to the elements, whereas 
 your suggestion doesn't.

 If you don't care about a) and b), then by all means, dup 
 away, and get better performance!

 But think about the fact that you wouldn't be able to do 
 something like this...
 <snip>
        auto sample = randomSample(arr[], 5);
        foreach(ref a; sample)
            ++a;
 That stands for randomCover, too.  Well, thank you, perhaps 
 that's the difference I was seeking.
 
 If this is the intended difference, my proposition to enhance 
 randomCover's performance and usefulness transforms into:
 
 1. Document the usage of randomCover with an example such as the 
 above, and refer to randomShuffle as a faster version for simpler 
 use cases.
 
 2. Optimize the performance by putting Fenwick trees to good use.  
 Currently, randomCover'ing 10,000 elements takes time on the 
 order of one second, and for 100,000 or more elements it is 
 hardly usable.
Extra documentation never hurts, but keep in mind that a ton of 
algorithms in Phobos are lazy and operate this way.  Usually, 
only the lazy version is implemented, as the eager version is 
trivial (as you suggested).  AFAIK, most of the ranges in 
std.range are lazy (though not obviously so) in one way or 
another.
 Last but not least, be warned there is a long-standing bug 
 with anything in Phobos that takes a PRNG by value. Basically, 
 the PRNG will be duplicated, and generate the same sequence 
 over and over. Ergo, do NOT pass a specific random generator 
 to things like randomSample or randomShuffle.

 This problem is one of the major reasons we are currently (and 
 slowly) re-designing random into random2.
 So, there is general agreement that in random2, RNGs should by 
 default get passed by reference everywhere?  That's nice to hear.
 
 -----
 Ivan Kazmenko.
The agreement is rather to make them reference types, so even 
when passed by value, you won't accidentally duplicate them.  
This is important, as you sometimes pass random ranges to 
algorithms that have nothing to do with randomness.

I'd say the "textbook" example would be:

//----
import std.random;
import std.algorithm;
import std.stdio;

void main()
{
    uint[5] arr1;
    arr1[].fill(rndGen);
    uint[5] arr2;
    arr2[].fill(rndGen);
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);
}
//----
arr1: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
arr2: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
//----

Oops!
Feb 19 2013
parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
monarch_dodra wrote:
 void main()
 {
     uint[5] arr1;
     arr1[].fill(rndGen);
     uint[5] arr2;
     arr2[].fill(rndGen);
     writeln("arr1: ", arr1[]);
     writeln("arr2: ", arr2[]);
 }
 //----
 arr1: [3622200385, 2579361262, 3208046123, 1753759120, 
 133131992]
 arr2: [3622200385, 2579361262, 3208046123, 1753759120, 
 133131992]
 //----
 Oops!
Oops indeed!

It should be noted that a common need is to have reproducible 
RNG routines.  For example, I was angry at Python breaking the 
reproducibility of everything except the vanilla random.random() 
around version 3.2.  Still, as the language is going to have 
random2 anyway, and reproducibility will already suffer much 
from the RNG becoming a reference type, it will be a good time 
for some cleanup.

On a side note, as D also uses MT19937 as the main RNG, random2 
designers may want to address the same accuracy issue which 
Python did.  More on that topic here: 
http://bugs.python.org/issue9025
Feb 19 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 19 February 2013 at 21:54:54 UTC, Ivan Kazmenko wrote:
 monarch_dodra wrote:
 void main()
 {
    uint[5] arr1;
    arr1[].fill(rndGen);
    uint[5] arr2;
    arr2[].fill(rndGen);
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);
 }
 //----
 arr1: [3622200385, 2579361262, 3208046123, 1753759120, 
 133131992]
 arr2: [3622200385, 2579361262, 3208046123, 1753759120, 
 133131992]
 //----
 Oops!
 Oops indeed!
 
 It should be noted that a common need is to have reproducible 
 RNG routines.  For example, I was angry at Python breaking the 
 reproducibility of everything except the vanilla random.random() 
 around version 3.2.  Still, as the language is going to have 
 random2 anyway, and reproducibility will already suffer much 
 from the RNG becoming a reference type, it will be a good time 
 for some cleanup.
In theory, there should be no reproducibility breakage, as the 
generators themselves will not be changed (they are _strictly_ 
modeled after Boost's, at least the ones we have).  The breakage 
will only occur if the PRNG was being duplicated and generating 
the same sequence twice.  In that case, though, one would argue 
it's not reproducibility breakage, but straight-up broken code 
being fixed...
Feb 19 2013
parent "Ivan Kazmenko" <gassa mail.ru> writes:
monarch_dodra wrote:
 In theory, there should be no reproducibility breakage, as the 
 generators themselves will not be changed (they are _strictly_ 
 modeled after Boost's, at least the ones we have).
Indeed, I just looked at the Boost and C++11 implementations, 
and they currently have these divisions too.  Having this kind 
of reproducibility across languages is a benefit in itself.

bearophile wrote:
 Is this to report in Bugzilla?
Sorry, I fear it's a false alarm, and the answer is no.

In any case, before advocating such a change further, I would 
like to know why the Boost and C++11 people didn't do that in 
their realms.  Maybe the division is actually more portable.  
Maybe extracting bits is indeed faster with MT19937, but it can 
produce poorer-quality random values with RNGs which are not so 
well-distributed bit-by-bit.

-----
Ivan Kazmenko.
Feb 19 2013
prev sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
 On a side note, as D also uses MT19937 as the main RNG, random2 
 designers may want to address the same accuracy issue which 
 Python did.  More on that topic here: 
 http://bugs.python.org/issue9025
Sorry for replying to myself again, but the above-mentioned 
accuracy issue is not about MT19937.  It's about how we transform 
the (uniform enough) bits from MT19937 into an integer in some 
range [0, LIMIT).  It looks like the inaccuracy is already 
addressed in D, so, sorry again for the bother.

Still, the current implementation uses divisions (which are slow 
on most CPUs), and for a good source of random bits such as 
MT19937, an implementation which repeatedly gets exactly 
core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) 
may be faster.
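
For illustration, a sketch of the idea (my own code, not a 
drop-in for Phobos; it assumes LIMIT fits in 31 bits):

-----
import core.bitop;
import std.random;

// Rejection sampling with a bit mask: draw bsr(limit) + 1 low bits
// and retry until the value falls in [0, limit).  The mask keeps the
// expected number of draws at two or fewer, with no division.
uint uniformBelow (uint limit, ref Random rnd)
{
	assert (limit > 0 && bsr (limit) < 31);
	immutable uint mask = (1u << (bsr (limit) + 1)) - 1;
	uint result;
	do
	{
		result = rnd.front & mask;  // keep only the low bits we need
		rnd.popFront ();
	}
	while (result >= limit);
	return result;
}

void main ()
{
	auto rnd = Random (123456789);
	foreach (_; 0 .. 1000)
		assert (uniformBelow (10, rnd) < 10);
}
-----

-----
Ivan Kazmenko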
Feb 19 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Ivan Kazmenko:

 Still, the current implementation uses divisions (which are slow 
 on most CPUs), and for a good source of random bits such as 
 MT19937, an implementation which repeatedly gets exactly 
 core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) 
 may be faster.
Is this something to report in Bugzilla?

Bye,
bearophile
Feb 19 2013
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 02/19/2013 12:27 PM, monarch_dodra wrote:
 Last but not least, be warned there is a long-standing bug with anything in
 Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and
 generate the same sequence over and over. Ergo, do NOT pass a specific random
 generator to things like randomSample or randomShuffle.
Sorry, I missed your remark on this when writing my own email. But I'm not sure this issue can be over-emphasized ... :-)
Feb 19 2013