www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - randomSample

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I'm writing a range to select a random sampling of a forward range. The 
number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random samples 
out of n. To implement popFront, I generate a random number in [0, n - 
k]. That's the probability that the next item in the input will be part 
of the selection. If the random number is 0, then the element got 
selected. Otherwise, I must ignore that guy so I popFront the input 
range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the 
elements towards the beginning of the range are less represented than 
the ones towards the end. Where am I wrong?


Andrei
May 23 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I'm writing a range to select a random sampling of a forward range. The
 number of elements in the forward range is already known.
 The strategy is: at any moment, I know I need to select k random samples
 out of n. To implement popFront, I generate a random number in [0, n -
 k]. That's the probability that the next item in the input will be part
 of the selection. If the random number is 0, then the element got
 selected. Otherwise, I must ignore that guy so I popFront the input
 range, decrease n by one, and repeat.
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both read the code and did a monte carlo test for some short sequences to see if the distribution was uniform on the space of possible permutations. Isn't this just a case of using RandomCover, but stopping before iterating through the entire range? In other words, wouldn't the following work: auto selectK(R)(R someRange, uint K, Random gen) { assert(K <= someRange.length); return take(K, randomCover(someRange, gen)); }
May 23 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I'm writing a range to select a random sampling of a forward range. The
 number of elements in the forward range is already known.
 The strategy is: at any moment, I know I need to select k random samples
 out of n. To implement popFront, I generate a random number in [0, n -
 k]. That's the probability that the next item in the input will be part
 of the selection. If the random number is 0, then the element got
 selected. Otherwise, I must ignore that guy so I popFront the input
 range, decrease n by one, and repeat.
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both read the code and did a monte carlo test for some short sequences to see if the distribution was uniform on the space of possible permutations. Isn't this just a case of using RandomCover, but stopping before iterating through the entire range? In other words, wouldn't the following work: auto selectK(R)(R someRange, uint K, Random gen) { assert(K <= someRange.length); return take(K, randomCover(someRange, gen)); }

You are right. I'm just less than happy with randomCover because it needs additional memory, so I eliminated it as a possibility from the get-go. Andrei
May 23 2009
prev sibling next sibling parent reply BCS <none anon.com> writes:
Hello Andrei,

 I'm writing a range to select a random sampling of a forward range.
 The number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random
 samples out of n. To implement popFront, I generate a random number in
 [0, n - k]. That's the probability that the next item in the input
 will be part of the selection. If the random number is 0, then the
 element got selected. Otherwise, I must ignore that guy so I popFront
 the input range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 
 Andrei
 

Check my reasoning on this: It will be skewed because as soon as the system gets behind (and statistically it will half the time) the elements start being selected with a higher probability. Take the related case, select each element with equal probability until you have enough selected or have just enough left to get the correct amount. The last element will get selected about half the time no matter what percentage you want.
May 23 2009
parent BCS <none anon.com> writes:
Hello BCS,

 Check my reasoning on this:
 
 It will be skewed because as soon as the system gets behind (and
 statistically it will half the time) the elements start being selected
 with a higher probability.
 
 Take the related case, select each element with equal probability
 until you have enough selected or have just enough left to get the
 correct amount. The last element will get selected about half the time
 no matter what percentage you want.
 

after some ad-hoc tests ("fun time with Excel!") it seems that the above is off in the weeds.
May 23 2009
prev sibling parent reply Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 I'm writing a range to select a random sampling of a forward range. The 
 number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random samples 
 out of n. To implement popFront, I generate a random number in [0, n - 
 k]. That's the probability that the next item in the input will be part 
 of the selection. If the random number is 0, then the element got 
 selected. Otherwise, I must ignore that guy so I popFront the input 
 range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the 
 elements towards the beginning of the range are less represented than 
 the ones towards the end. Where am I wrong?
 
 
 Andrei

You have your probability wrong. The selection probability is k/n. Pick a number in [0,n) and pick the element if the result is in [0,k). If you do select an item, decrement k. Always decrement n when popping, regardless of if the element is selected.
May 23 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 I'm writing a range to select a random sampling of a forward range.
 The number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random
 samples out of n. To implement popFront, I generate a random number
 in [0, n - k]. That's the probability that the next item in the
 input will be part of the selection. If the random number is 0,
 then the element got selected. Otherwise, I must ignore that guy so
 I popFront the input range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the 
 elements towards the beginning of the range are less represented
 than the ones towards the end. Where am I wrong?
 
 
 Andrei

You have your probability wrong. The selection probability is k/n. Pick a number in [0,n) and pick the element if the result is in [0,k). If you do select an item, decrement k. Always decrement n when popping, regardless of if the element is selected.

Great, thanks, that works perfect. If I want to choose _toSelect stuff out of _available items, this works to position to the next element: for (;;) { auto r = uniform(0, _available); if (r < _toSelect) { // chosen! return; } // not chosen, retry assert(!_input.empty); _input.popFront; --_available; assert(_available > 0); } As an aside, it looks like the right-open range for random numbers is a good default. Andrei
May 23 2009
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
Andrei,

I noticed in random.d, uniform template, that popFront is called in 
different locations for integral compared to floating point types: for 
integral types you .front first and .popFront afterwards, but for floating 
point types you start with .popFront and then check .front.

This has a peculiar effect: for example, if you do uniform(0.0,100.0) 
followed by uniform(0,100) there's a big chance that the integral part of 
the first random number is equal to the second random number.

import std.stdio, std.random;
void main()
{
  writeln(uniform(0.0,100.0));
  writeln(uniform(0,100));
}

I don't think this warrants a bug report, but I do think the location of 
.popFront should be standardized, either before or after any .front.

Just sayin'.

L. 
May 24 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Lionello Lunesu wrote:
 Andrei,
 
 I noticed in random.d, uniform template, that popFront is called in 
 different locations for integral compared to floating point types: for 
 integral types you .front first and .popFront afterwards, but for 
 floating point types you start with .popFront and then check .front.
 
 This has a peculiar effect: for example, if you do uniform(0.0,100.0) 
 followed by uniform(0,100) there's a big chance that the integral part 
 of the first random number is equal to the second random number.
 
 import std.stdio, std.random;
 void main()
 {
  writeln(uniform(0.0,100.0));
  writeln(uniform(0,100));
 }
 
 I don't think this warrants a bug report, but I do think the location of 
 .popFront should be standardized, either before or after any .front.
 
 Just sayin'.
 
 L.

Definitely warrants a bug report. I'm busy today but I might find time for it tomorrow. Andrei
May 25 2009
parent Lionello Lunesu <lio lunesu.remove.com> writes:
 Definitely warrants a bug report. I'm busy today but I might find time 
 for it tomorrow.

http://d.puremagic.com/issues/show_bug.cgi?id=3025
May 25 2009