## digitalmars.D - randomSample

- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> May 23 2009
- dsimcha <dsimcha yahoo.com> May 23 2009
- BCS <none anon.com> May 23 2009
- BCS <none anon.com> May 23 2009
- Jason House <jason.james.house gmail.com> May 23 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> May 23 2009
- "Lionello Lunesu" <lionello lunesu.remove.com> May 24 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> May 25 2009
- Lionello Lunesu <lio lunesu.remove.com> May 25 2009

I'm writing a range to select a random sampling of a forward range. The number of elements in the forward range is already known. The strategy is: at any moment, I know I need to select k random samples out of n. To implement popFront, I generate a random number in [0, n - k]. That's the probability that the next item in the input will be part of the selection. If the random number is 0, then the element got selected. Otherwise, I must ignore that guy so I popFront the input range, decrease n by one, and repeat. This seems reasonable but somehow the selection comes skewed: the elements towards the beginning of the range are less represented than the ones towards the end. Where am I wrong? Andrei

May 23 2009

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleI'm writing a range to select a random sampling of a forward range. The number of elements in the forward range is already known. The strategy is: at any moment, I know I need to select k random samples out of n. To implement popFront, I generate a random number in [0, n - k]. That's the probability that the next item in the input will be part of the selection. If the random number is 0, then the element got selected. Otherwise, I must ignore that guy so I popFront the input range, decrease n by one, and repeat. This seems reasonable but somehow the selection comes skewed: the elements towards the beginning of the range are less represented than the ones towards the end. Where am I wrong? Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both read the code and did a monte carlo test for some short sequences to see if the distribution was uniform on the space of possible permutations. Isn't this just a case of using RandomCover, but stopping before iterating through the entire range? In other words, wouldn't the following work: auto selectK(R)(R someRange, uint K, Random gen) { assert(K <= someRange.length); return take(K, randomCover(someRange, gen)); }

May 23 2009

dsimcha wrote:== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleI'm writing a range to select a random sampling of a forward range. The number of elements in the forward range is already known. The strategy is: at any moment, I know I need to select k random samples out of n. To implement popFront, I generate a random number in [0, n - k]. That's the probability that the next item in the input will be part of the selection. If the random number is 0, then the element got selected. Otherwise, I must ignore that guy so I popFront the input range, decrease n by one, and repeat. This seems reasonable but somehow the selection comes skewed: the elements towards the beginning of the range are less represented than the ones towards the end. Where am I wrong? Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both read the code and did a monte carlo test for some short sequences to see if the distribution was uniform on the space of possible permutations. Isn't this just a case of using RandomCover, but stopping before iterating through the entire range? In other words, wouldn't the following work: auto selectK(R)(R someRange, uint K, Random gen) { assert(K <= someRange.length); return take(K, randomCover(someRange, gen)); }

You are right. I'm just less than happy with randomCover because it needs additional memory, so I eliminated it as a possibility from the get-go. Andrei

May 23 2009

Hello Andrei,

Check my reasoning on this: It will be skewed because as soon as the system gets behind (and statistically it will half the time) the elements start being selected with a higher probability. Take the related case, select each element with equal probability until you have enough selected or have just enough left to get the correct amount. The last element will get selected about half the time no matter what percentage you want.

May 23 2009

Hello BCS,Check my reasoning on this: It will be skewed because as soon as the system gets behind (and statistically it will half the time) the elements start being selected with a higher probability. Take the related case, select each element with equal probability until you have enough selected or have just enough left to get the correct amount. The last element will get selected about half the time no matter what percentage you want.

after some ad-hoc tests ("fun time with Excel!") it seems that the above is off in the weeds.

May 23 2009

Andrei Alexandrescu Wrote:

You have your probability wrong. The selection probability is k/n. Pick a number in [0,n) and pick the element if the result is in [0,k). If you do select an item, decrement k. Always decrement n when popping, regardless of if the element is selected.

May 23 2009

Jason House wrote:Andrei Alexandrescu Wrote:

You have your probability wrong. The selection probability is k/n. Pick a number in [0,n) and pick the element if the result is in [0,k). If you do select an item, decrement k. Always decrement n when popping, regardless of if the element is selected.

Great, thanks, that works perfect. If I want to choose _toSelect stuff out of _available items, this works to position to the next element: for (;;) { auto r = uniform(0, _available); if (r < _toSelect) { // chosen! return; } // not chosen, retry assert(!_input.empty); _input.popFront; --_available; assert(_available > 0); } As an aside, it looks like the right-open range for random numbers is a good default. Andrei

May 23 2009

Andrei, I noticed in random.d, uniform template, that popFront is called in different locations for integral compared to floating point types: for integral types you .front first and .popFront afterwards, but for floating point types you start with .popFront and then check .front. This has a peculiar effect: for example, if you do uniform(0.0,100.0) followed by uniform(0,100) there's a big chance that the integral part of the first random number is equal to the second random number. import std.stdio, std.random; void main() { writeln(uniform(0.0,100.0)); writeln(uniform(0,100)); } I don't think this warrants a bug report, but I do think the location of .popFront should be standardized, either before or after any .front. Just sayin'. L.

May 24 2009

Lionello Lunesu wrote:Andrei, I noticed in random.d, uniform template, that popFront is called in different locations for integral compared to floating point types: for integral types you .front first and .popFront afterwards, but for floating point types you start with .popFront and then check .front. This has a peculiar effect: for example, if you do uniform(0.0,100.0) followed by uniform(0,100) there's a big chance that the integral part of the first random number is equal to the second random number. import std.stdio, std.random; void main() { writeln(uniform(0.0,100.0)); writeln(uniform(0,100)); } I don't think this warrants a bug report, but I do think the location of .popFront should be standardized, either before or after any .front. Just sayin'. L.

Definitely warrants a bug report. I'm busy today but I might find time for it tomorrow. Andrei

May 25 2009

Definitely warrants a bug report. I'm busy today but I might find time for it tomorrow.

http://d.puremagic.com/issues/show_bug.cgi?id=3025

May 25 2009