digitalmars.D - randomSample

Andrei Alexandrescu (12/12) May 23 2009 I'm writing a range to select a random sampling of a forward range. The

dsimcha (10/22) May 23 2009 I'm confused. As far as I can tell, RandomCover is correct in 2.030. I...

Andrei Alexandrescu (5/29) May 23 2009 You are right. I'm just less than happy with randomCover because it

BCS (8/24) May 23 2009 Check my reasoning on this:

BCS (3/14) May 23 2009 after some ad-hoc tests ("fun time with Excel!") it seems that the above...

Jason House (2/18) May 23 2009 You have your probability wrong. The selection probability is k/n. Pick ...

Andrei Alexandrescu (20/43) May 23 2009 Great, thanks, that works perfect. If I want to choose _toSelect stuff

Lionello Lunesu (18/18) May 24 2009 Andrei,

Andrei Alexandrescu (4/28) May 25 2009 Definitely warrants a bug report. I'm busy today but I might find time

Lionello Lunesu (1/3) May 25 2009 http://d.puremagic.com/issues/show_bug.cgi?id=3025

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I'm writing a range to select a random sampling of a forward range. The 
number of elements in the forward range is already known.

The strategy is: at any moment, I know I need to select k random samples 
out of n. To implement popFront, I generate a random number in [0, n - 
k]. That's the probability that the next item in the input will be part 
of the selection. If the random number is 0, then the element got 
selected. Otherwise, I must ignore that guy so I popFront the input 
range, decrease n by one, and repeat.

This seems reasonable but somehow the selection comes skewed: the 
elements towards the beginning of the range are less represented than 
the ones towards the end. Where am I wrong?


Andrei

May 23 2009

dsimcha <dsimcha yahoo.com> writes:

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I'm writing a range to select a random sampling of a forward range. The
 number of elements in the forward range is already known.
 The strategy is: at any moment, I know I need to select k random samples
 out of n. To implement popFront, I generate a random number in [0, n -
 k]. That's the probability that the next item in the input will be part
 of the selection. If the random number is 0, then the element got
 selected. Otherwise, I must ignore that guy so I popFront the input
 range, decrease n by one, and repeat.
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 Andrei

I'm confused.  As far as I can tell, RandomCover is correct in 2.030.  I both
read
the code and did a monte carlo test for some short sequences to see if the
distribution was uniform on the space of possible permutations.  Isn't this
just a
case of using RandomCover, but stopping before iterating through the entire
range?
 In other words, wouldn't the following work:

auto selectK(R)(R someRange, uint K, Random gen) {
    assert(K <= someRange.length);
    return take(K, randomCover(someRange, gen));
}

May 23 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I'm writing a range to select a random sampling of a forward range. The
 number of elements in the forward range is already known.
 The strategy is: at any moment, I know I need to select k random samples
 out of n. To implement popFront, I generate a random number in [0, n -
 k]. That's the probability that the next item in the input will be part
 of the selection. If the random number is 0, then the element got
 selected. Otherwise, I must ignore that guy so I popFront the input
 range, decrease n by one, and repeat.
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 Andrei

 
 I'm confused.  As far as I can tell, RandomCover is correct in 2.030.  I both
read
 the code and did a monte carlo test for some short sequences to see if the
 distribution was uniform on the space of possible permutations.  Isn't this
just a
 case of using RandomCover, but stopping before iterating through the entire
range?
  In other words, wouldn't the following work:
 
 auto selectK(R)(R someRange, uint K, Random gen) {
     assert(K <= someRange.length);
     return take(K, randomCover(someRange, gen));
 }

You are right. I'm just less than happy with randomCover because it 
needs additional memory, so I eliminated it as a possibility from the 
get-go.

Andrei

May 23 2009

BCS <none anon.com> writes:

Hello Andrei,

 I'm writing a range to select a random sampling of a forward range.
 The number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random
 samples out of n. To implement popFront, I generate a random number in
 [0, n - k]. That's the probability that the next item in the input
 will be part of the selection. If the random number is 0, then the
 element got selected. Otherwise, I must ignore that guy so I popFront
 the input range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the
 elements towards the beginning of the range are less represented than
 the ones towards the end. Where am I wrong?
 
 Andrei
 

Check my reasoning on this:

It will be skewed because as soon as the system gets behind (and statistically 
it will half the time) the elements start being selected with a higher
probability.

Take the related case, select each element with equal probability until you 
have enough selected or have just enough left to get the correct amount. 
The last element will get selected about half the time no matter what
percentage 
you want.

May 23 2009

BCS <none anon.com> writes:

Hello BCS,

 Check my reasoning on this:
 
 It will be skewed because as soon as the system gets behind (and
 statistically it will half the time) the elements start being selected
 with a higher probability.
 
 Take the related case, select each element with equal probability
 until you have enough selected or have just enough left to get the
 correct amount. The last element will get selected about half the time
 no matter what percentage you want.
 

after some ad-hoc tests ("fun time with Excel!") it seems that the above 
is off in the weeds.

May 23 2009

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 I'm writing a range to select a random sampling of a forward range. The 
 number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random samples 
 out of n. To implement popFront, I generate a random number in [0, n - 
 k]. That's the probability that the next item in the input will be part 
 of the selection. If the random number is 0, then the element got 
 selected. Otherwise, I must ignore that guy so I popFront the input 
 range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the 
 elements towards the beginning of the range are less represented than 
 the ones towards the end. Where am I wrong?
 
 
 Andrei

You have your probability wrong. The selection probability is k/n. Pick a
number in [0,n) and pick the element if the result is in [0,k). If you do
select an item, decrement k. Always decrement n when popping, regardless of if
the element is selected.

May 23 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 I'm writing a range to select a random sampling of a forward range.
 The number of elements in the forward range is already known.
 
 The strategy is: at any moment, I know I need to select k random
 samples out of n. To implement popFront, I generate a random number
 in [0, n - k]. That's the probability that the next item in the
 input will be part of the selection. If the random number is 0,
 then the element got selected. Otherwise, I must ignore that guy so
 I popFront the input range, decrease n by one, and repeat.
 
 This seems reasonable but somehow the selection comes skewed: the 
 elements towards the beginning of the range are less represented
 than the ones towards the end. Where am I wrong?
 
 
 Andrei

 
 You have your probability wrong. The selection probability is k/n.
 Pick a number in [0,n) and pick the element if the result is in
 [0,k). If you do select an item, decrement k. Always decrement n when
 popping, regardless of if the element is selected.

Great, thanks, that works perfect. If I want to choose _toSelect stuff 
out of _available items, this works to position to the next element:

         for (;;)
         {
             auto r = uniform(0, _available);
             if (r < _toSelect)
             {
                 // chosen!
                 return;
             }
             // not chosen, retry
             assert(!_input.empty);
             _input.popFront;
             --_available;
             assert(_available > 0);
         }

As an aside, it looks like the right-open range for random numbers is a 
good default.


Andrei

May 23 2009

"Lionello Lunesu" <lionello lunesu.remove.com> writes:

Andrei,

I noticed in random.d, uniform template, that popFront is called in 
different locations for integral compared to floating point types: for 
integral types you .front first and .popFront afterwards, but for floating 
point types you start with .popFront and then check .front.

This has a peculiar effect: for example, if you do uniform(0.0,100.0) 
followed by uniform(0,100) there's a big chance that the integral part of 
the first random number is equal to the second random number.

import std.stdio, std.random;
void main()
{
  writeln(uniform(0.0,100.0));
  writeln(uniform(0,100));
}

I don't think this warrants a bug report, but I do think the location of 
.popFront should be standardized, either before or after any .front.

Just sayin'.

L.

May 24 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Lionello Lunesu wrote:
 Andrei,
 
 I noticed in random.d, uniform template, that popFront is called in 
 different locations for integral compared to floating point types: for 
 integral types you .front first and .popFront afterwards, but for 
 floating point types you start with .popFront and then check .front.
 
 This has a peculiar effect: for example, if you do uniform(0.0,100.0) 
 followed by uniform(0,100) there's a big chance that the integral part 
 of the first random number is equal to the second random number.
 
 import std.stdio, std.random;
 void main()
 {
  writeln(uniform(0.0,100.0));
  writeln(uniform(0,100));
 }
 
 I don't think this warrants a bug report, but I do think the location of 
 .popFront should be standardized, either before or after any .front.
 
 Just sayin'.
 
 L.

Definitely warrants a bug report. I'm busy today but I might find time 
for it tomorrow.

Andrei

May 25 2009

Lionello Lunesu <lio lunesu.remove.com> writes:

 Definitely warrants a bug report. I'm busy today but I might find time 
 for it tomorrow.

http://d.puremagic.com/issues/show_bug.cgi?id=3025

May 25 2009

D Programming

C/C++ Programming

Other

digitalmars.D - randomSample