## digitalmars.D - randomSample with unknown length

Magnus Lie Hetland <magnus hetland.org> writes:
```Reading the doc for std.random.randomSample, I saw that "The total
length of r must be known". There are rather straightforward algorithms
for drawing random samples *without* knowing this. This might be useful
if one wants to support input ranges, I guess?

Take, for example, the method described by Knuth (TAoP 2), for
selecting n elements uniformly at random from an input range:

- Select the first n elements as the current sample.
- Each subsequent element is rejected with a probability of 1 - n/t,
where t is the number seen so far.
- If a new item is selected, it replaces a random item in the current sample.

A cool property of this is that at any time, the current sample is one
drawn randomly (i.e., uniformly, without replacement) from the items
you've seen so far, so you could really stop at any point. That is,
stop iterating over the input; you can't really give the output as a
small-footprint range here (as far as I can see). Seems you have to
allocate room for n pointers. (Or you *could* just keep track of which
objects were swapped in -- might be worth the overhead if n is large
compared to the input size.)

--
Magnus Lie Hetland
http://hetland.org
```
Feb 02 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 2/2/11 6:03 AM, Magnus Lie Hetland wrote:
Reading the doc for std.random.randomSample, I saw that "The total
length of r must be known". There are rather straightforward algorithms
for drawing random samples *without* knowing this. This might be useful
if one wants to support input ranges, I guess?

Take, for example, the method described by Knuth (TAoP 2), for selecting
n elements uniformly at random from an input range:

- Select the first n elements as the current sample.
- Each subsequent element is rejected with a probability of 1 - n/t,
where t is the number seen so far.
- If a new item is selected, it replaces a random item in the current
sample.

A cool property of this is that at any time, the current sample is one
drawn randomly (i.e., uniformly, without replacement) from the items
you've seen so far, so you could really stop at any point. That is, stop
iterating over the input; you can't really give the output as a
small-footprint range here (as far as I can see). Seems you have to
allocate room for n pointers. (Or you *could* just keep track of which
objects were swapped in -- might be worth the overhead if n is large
compared to the input size.)

I posted a problem solved by the algorithm above (and others, more
sophisticated ones) as a challenge to this group a couple of years ago.

randomSample is designed to subsample a large stream in constant space
and without needing to scan the entire stream in order to output the
first element. I used in in my dissertation where e.g. I had to select
100K samples from a stream of many millions.

Having a reservoir sample available would be nice. I'd be thrilled if
you coded up a reservoirSample(r, n) function for addition to std.random.

Andrei
```
Feb 02 2011
"Simen kjaeraas" <simen.kjaras gmail.com> writes:
```------------xSiOOR6ZU4bV01lWI6ZjHX
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

Having a reservoir sample available would be nice. I'd be thrilled if
you coded up a reservoirSample(r, n) function for addition to std.random.

Done. No comments or unittests yet, though.

randomSampleRange( R, num ) takes num samples from an input range, and
keeps a reservoir that is updated as you traverse the range (lazy, if you
wish).

randomSample( R, num ) takes num samples from all over a range (eager).

--
Simen
------------xSiOOR6ZU4bV01lWI6ZjHX
Content-Disposition: attachment; filename=RandomSample.d
Content-Type: application/octet-stream; name="RandomSample.d"
Content-Transfer-Encoding: Base64

77u/bW9kdWxlIFJhbmRvbVNhbXBsZTsNCg0KaW1wb3J0IHN0ZC5yYW5nZTsNCmlt
cG9ydCBzdGQucmFuZG9tOw0KaW1wb3J0IHN0ZC5zdGRpbzsNCg0Kc3RydWN0IFJh
bmRvbVNhbXBsZSggUiApIGlmICggaXNJbnB1dFJhbmdlIVIgKSB7DQoJUiByYW5n
ZTsNCglFbGVtZW50VHlwZSEoIFIgKVtdIF9mcm9udDsNCglib29sIGVtcHR5Ow0K
CVJhbmRvbSBybmQ7DQoJdWxvbmcgcG9zOw0KCQ0KCXRoaXMoIFIgciwgc2l6ZV90
IG51bSApIHsNCgkJc3RhdGljIGlmICggaXNGb3J3YXJkUmFuZ2UhUiApIHsNCgkJ
CXJhbmdlID0gci5zYXZlOw0KCQl9IGVsc2Ugew0KCQkJcmFuZ2UgPSByOw0KCQl9
DQoJCV9mcm9udCA9IGFycmF5KCB0YWtlKCByYW5nZSwgbnVtICkgKTsNCgkJc3Rh
dGljIGlmICggaXNGb3J3YXJkUmFuZ2UhUiApIHsNCgkJCXBvcEZyb250TiggcmFu
Z2UsIG51bSApOw0KCQl9DQoJCQ0KCQllbXB0eSA9IGZhbHNlOw0KCQlybmQgPSBS
YW5kb20oIHVucHJlZGljdGFibGVTZWVkICk7DQoJCXBvcyA9IG51bTsNCgl9DQoJ
DQoJQHByb3BlcnR5IEVsZW1lbnRUeXBlIShSKVtdIGZyb250KCApIHsNCgkJcmV0
dXJuIF9mcm9udC5kdXA7DQoJfQ0KCQ0KCXZvaWQgcG9wRnJvbnQoICkgew0KCQlh
c3NlcnQoICFlbXB0eSApOw0KCQlpZiAoIHVuaWZvcm0oIDAsIHBvcywgcm5kICkg
PCBfZnJvbnQubGVuZ3RoICkgew0KCQkJX2Zyb250WyB1bmlmb3JtKCAwLCBfZnJv
bnQubGVuZ3RoLCBybmQgKSBdID0gcmFuZ2UuZnJvbnQ7DQoJCX0NCgkJcmFuZ2Uu
cG9wRnJvbnQoICk7DQoJCXBvcysrOw0KCQllbXB0eSA9IHJhbmdlLmVtcHR5Ow0K
CX0NCn0NCg0KUmFuZG9tU2FtcGxlIVIgcmFuZG9tU2FtcGxlUmFuZ2UoIFIgKSgg
UiByLCBzaXplX3QgbnVtID0gMSApIGlmICggaXNJbnB1dFJhbmdlIVIgKSB7DQoJ
cmV0dXJuIFJhbmRvbVNhbXBsZSFSKCByLCBudW0gKTsNCn0NCg0KVFtdIHJhbmRv
bVNhbXBsZUltcGwoIFQsIFIgKSggVFtdIHJlc3VsdCwgUiByICkgew0KCWF1dG8g
cm5kID0gUmFuZG9tKCB1bnByZWRpY3RhYmxlU2VlZCApOw0KCWZvcmVhY2ggKCBp
LCBlOyByICkgew0KCQlpZiAoIHVuaWZvcm0oIDAsIHJlc3VsdC5sZW5ndGggKyBp
ICsgMSwgcm5kICkgPCByZXN1bHQubGVuZ3RoICkgew0KCQkJcmVzdWx0WyB1bmlm
b3JtKCAwLCByZXN1bHQubGVuZ3RoLCBybmQgKSBdID0gZTsNCgkJfQ0KCX0NCgly
ZXR1cm4gcmVzdWx0Ow0KfQ0KDQpFbGVtZW50VHlwZSEoIFIgKVtdIHJhbmRvbVNh
bXBsZSggUiApKCBSIHIsIHNpemVfdCBudW0gPSAxICkgaWYgKCBpc0lucHV0UmFu
Z2UhUiAmJiAhaXNGb3J3YXJkUmFuZ2UhUiApIHsNCglhdXRvIHJlc3VsdCA9IGFy
cmF5KCB0YWtlKCByLCBudW0gKSApOw0KCXJldHVybiByYW5kb21TYW1wbGVJbXBs
KCByZXN1bHQsIHJyICk7DQp9DQoNCkVsZW1lbnRUeXBlISggUiApW10gcmFuZG9t
U2FtcGxlKCBSICkoIFIgciwgc2l6ZV90IG51bSA9IDEgKSBpZiAoIGlzRm9yd2Fy
ZFJhbmdlIVIgKSB7DQoJYXV0byByciA9IHIuc2F2ZTsNCglhdXRvIHJlc3VsdCA9
IGFycmF5KCB0YWtlKCByciwgbnVtICkgKTsNCglwb3BGcm9udE4oIHJyLCBudW0g
KTsNCglyZXR1cm4gcmFuZG9tU2FtcGxlSW1wbCggcmVzdWx0LCByciApOw0KfQ0K
DQp2b2lkIG1haW4oICkgew0KCXdyaXRlbG4oICJSYW5nZSBzYW1wbGU6IiApOw0K
CWZvcmVhY2ggKCBlOyByYW5kb21TYW1wbGVSYW5nZSggIlNhbXBsZSB0ZXh0Iiwg
MyApICkgew0KCQl3cml0ZWxuKCBlICk7DQoJfQ0KCXdyaXRlbG4oICJTaW5nbGUg
c2FtcGxlOiIgKTsNCgl3cml0ZWxuKCByYW5kb21TYW1wbGUoICJTYW1wbGUgdGV4
dCIsIDMgKSApOw0KfQ==

------------xSiOOR6ZU4bV01lWI6ZjHX--
```
Feb 02 2011
Magnus Lie Hetland <magnus hetland.org> writes:
```On 2011-02-02 16:32:25 +0100, Andrei Alexandrescu said:

randomSample is designed to subsample a large stream in constant space
and without needing to scan the entire stream in order to output the
first element.

Sure. I was just thinking that you could have a version for the cases
where there was no end in sight :)

I used in in my dissertation where e.g. I had to select 100K samples
from a stream of many millions.

Cool.

Having a reservoir sample available would be nice. I'd be thrilled if
you coded up a reservoirSample(r, n) function for addition to
std.random.

Seems Simen beat me to it :)

--
Magnus Lie Hetland
http://hetland.org
```
Feb 02 2011