## digitalmars.D - randomSample

• Andrei Alexandrescu (12/12) May 23 2009 I'm writing a range to select a random sampling of a forward range. The
• dsimcha (10/22) May 23 2009 I'm confused. As far as I can tell, RandomCover is correct in 2.030. I...
• Andrei Alexandrescu (5/29) May 23 2009 You are right. I'm just less than happy with randomCover because it
• BCS (8/24) May 23 2009 Check my reasoning on this:
• BCS (3/14) May 23 2009 after some ad-hoc tests ("fun time with Excel!") it seems that the above...
• Jason House (2/18) May 23 2009 You have your probability wrong. The selection probability is k/n. Pick ...
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?

Andrei
```
May 23 2009
dsimcha <dsimcha yahoo.com> writes:
```== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

I'm confused.  As far as I can tell, RandomCover is correct in 2.030.  I both
the code and did a monte carlo test for some short sequences to see if the
distribution was uniform on the space of possible permutations.  Isn't this
just a
case of using RandomCover, but stopping before iterating through the entire
range?
In other words, wouldn't the following work:

auto selectK(R)(R someRange, uint K, Random gen) {
assert(K <= someRange.length);
return take(K, randomCover(someRange, gen));
}
```
May 23 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```dsimcha wrote:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

I'm confused.  As far as I can tell, RandomCover is correct in 2.030.  I both
the code and did a monte carlo test for some short sequences to see if the
distribution was uniform on the space of possible permutations.  Isn't this
just a
case of using RandomCover, but stopping before iterating through the entire
range?
In other words, wouldn't the following work:

auto selectK(R)(R someRange, uint K, Random gen) {
assert(K <= someRange.length);
return take(K, randomCover(someRange, gen));
}

You are right. I'm just less than happy with randomCover because it
needs additional memory, so I eliminated it as a possibility from the
get-go.

Andrei
```
May 23 2009
BCS <none anon.com> writes:
```Hello Andrei,

I'm writing a range to select a random sampling of a forward range.
The number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random
samples out of n. To implement popFront, I generate a random number in
[0, n - k]. That's the probability that the next item in the input
will be part of the selection. If the random number is 0, then the
element got selected. Otherwise, I must ignore that guy so I popFront
the input range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?

Andrei

Check my reasoning on this:

It will be skewed because as soon as the system gets behind (and statistically
it will half the time) the elements start being selected with a higher
probability.

Take the related case, select each element with equal probability until you
have enough selected or have just enough left to get the correct amount.
The last element will get selected about half the time no matter what
percentage
you want.
```
May 23 2009
BCS <none anon.com> writes:
```Hello BCS,

Check my reasoning on this:

It will be skewed because as soon as the system gets behind (and
statistically it will half the time) the elements start being selected
with a higher probability.

Take the related case, select each element with equal probability
until you have enough selected or have just enough left to get the
correct amount. The last element will get selected about half the time
no matter what percentage you want.

after some ad-hoc tests ("fun time with Excel!") it seems that the above
is off in the weeds.
```
May 23 2009
Jason House <jason.james.house gmail.com> writes:
```Andrei Alexandrescu Wrote:

I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?

Andrei

You have your probability wrong. The selection probability is k/n. Pick a
number in [0,n) and pick the element if the result is in [0,k). If you do
select an item, decrement k. Always decrement n when popping, regardless of if
the element is selected.
```
May 23 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Jason House wrote:
Andrei Alexandrescu Wrote:

I'm writing a range to select a random sampling of a forward range.
The number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random
samples out of n. To implement popFront, I generate a random number
in [0, n - k]. That's the probability that the next item in the
input will be part of the selection. If the random number is 0,
then the element got selected. Otherwise, I must ignore that guy so
I popFront the input range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented
than the ones towards the end. Where am I wrong?

Andrei

You have your probability wrong. The selection probability is k/n.
Pick a number in [0,n) and pick the element if the result is in
[0,k). If you do select an item, decrement k. Always decrement n when
popping, regardless of if the element is selected.

Great, thanks, that works perfect. If I want to choose _toSelect stuff
out of _available items, this works to position to the next element:

for (;;)
{
auto r = uniform(0, _available);
if (r < _toSelect)
{
// chosen!
return;
}
// not chosen, retry
assert(!_input.empty);
_input.popFront;
--_available;
assert(_available > 0);
}

As an aside, it looks like the right-open range for random numbers is a
good default.

Andrei
```
May 23 2009
"Lionello Lunesu" <lionello lunesu.remove.com> writes:
```Andrei,

I noticed in random.d, uniform template, that popFront is called in
different locations for integral compared to floating point types: for
integral types you .front first and .popFront afterwards, but for floating

This has a peculiar effect: for example, if you do uniform(0.0,100.0)
followed by uniform(0,100) there's a big chance that the integral part of
the first random number is equal to the second random number.

import std.stdio, std.random;
void main()
{
writeln(uniform(0.0,100.0));
writeln(uniform(0,100));
}

I don't think this warrants a bug report, but I do think the location of
.popFront should be standardized, either before or after any .front.

Just sayin'.

L.
```
May 24 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Lionello Lunesu wrote:
Andrei,

I noticed in random.d, uniform template, that popFront is called in
different locations for integral compared to floating point types: for
integral types you .front first and .popFront afterwards, but for

This has a peculiar effect: for example, if you do uniform(0.0,100.0)
followed by uniform(0,100) there's a big chance that the integral part
of the first random number is equal to the second random number.

import std.stdio, std.random;
void main()
{
writeln(uniform(0.0,100.0));
writeln(uniform(0,100));
}

I don't think this warrants a bug report, but I do think the location of
.popFront should be standardized, either before or after any .front.

Just sayin'.

L.

Definitely warrants a bug report. I'm busy today but I might find time
for it tomorrow.

Andrei
```
May 25 2009
Lionello Lunesu <lio lunesu.remove.com> writes:
``` Definitely warrants a bug report. I'm busy today but I might find time
for it tomorrow.

http://d.puremagic.com/issues/show_bug.cgi?id=3025
```
May 25 2009