## digitalmars.D - random k-sample of a file

- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- dsimcha <dsimcha yahoo.com> Oct 09 2008
- Walter Bright <newshound1 digitalmars.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- BCS <ao pathlink.com> Oct 09 2008
- "Steven Schveighoffer" <schveiguy yahoo.com> Oct 09 2008
- Walter Bright <newshound1 digitalmars.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Walter Bright <newshound1 digitalmars.com> Oct 09 2008
- BCS <ao pathlink.com> Oct 09 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- KennyTM~ <kennytm gmail.com> Oct 09 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 09 2008
- "Denis Koroskin" <2korden gmail.com> Oct 09 2008
- bearophile <bearophileHUGS lycos.com> Oct 09 2008
- bearophile <bearophileHUGS lycos.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- bearophile <bearophileHUGS lycos.com> Oct 09 2008
- bearophile <bearophileHUGS lycos.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 10 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 10 2008
- Ary Borenszweig <ary esperanto.org.ar> Oct 10 2008
- bearophile <bearophileHUGS lycos.com> Oct 10 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 10 2008
- Sergey Gromov <snake.scaly gmail.com> Oct 10 2008
- Kirk McDonald <kirklin.mcdonald gmail.com> Oct 09 2008
- Jason House <jason.james.house gmail.com> Oct 09 2008
- Kirk McDonald <kirklin.mcdonald gmail.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Kirk McDonald <kirklin.mcdonald gmail.com> Oct 09 2008
- bearophile <bearophileHUGS lycos.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Christopher Wright <dhasenan gmail.com> Oct 10 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 10 2008
- Christopher Wright <dhasenan gmail.com> Oct 11 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Kirk McDonald <kirklin.mcdonald gmail.com> Oct 09 2008
- BCS <ao pathlink.com> Oct 09 2008
- Jason House <jason.james.house gmail.com> Oct 09 2008
- BCS <ao pathlink.com> Oct 09 2008
- Russell Lewis <webmaster villagersonline.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Russell Lewis <webmaster villagersonline.com> Oct 09 2008
- Christopher Wright <dhasenan gmail.com> Oct 09 2008
- "Carlos" <carlos-smith sympatico.ca> Oct 09 2008
- Sergey Gromov <snake.scaly gmail.com> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 09 2008
- Sergey Gromov <snake.scaly gmail.com> Oct 10 2008
- bearophile <bearophileHUGS lycos.com> Oct 10 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 10 2008
- "Jarrett Billingsley" <jarrett.billingsley gmail.com> Oct 09 2008
- "Denis Koroskin" <2korden gmail.com> Oct 10 2008
- Michel Fortin <michel.fortin michelf.com> Oct 10 2008
- "Jarrett Billingsley" <jarrett.billingsley gmail.com> Oct 10 2008
- "Denis Koroskin" <2korden gmail.com> Oct 10 2008

I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei

Oct 09 2008

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei

Pretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks. Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing. (I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.

Oct 09 2008

dsimcha wrote:== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei

Pretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks.

So this is essentially an index of the file's lines?Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing.

So this would shuffle that index.(I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.

Can it be done faster and maybe in one pass? Andrei

Oct 09 2008

Andrei Alexandrescu wrote:The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Oct 09 2008

Walter Bright wrote:Andrei Alexandrescu wrote:The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

No go if the file is a stream. Can it be done in one pass? Andrei

Oct 09 2008

Reply to Andrei,Walter Bright wrote:Andrei Alexandrescu wrote:The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Andrei

read and store the first k lines with there line number, if no lines remain, output stored lines read another line replace an existing line with a probability based on the current line and the replaced line's location if no lines remain ... Now all that remains is to figure out that function

Oct 09 2008

"Andrei Alexandrescu" wroteWalter Bright wrote:Andrei Alexandrescu wrote:

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

No go if the file is a stream. Can it be done in one pass?

Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k. -Steve

Oct 09 2008

Steven Schveighoffer wrote:"Andrei Alexandrescu" wroteWalter Bright wrote:Andrei Alexandrescu wrote:

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k.

k is a "small" number in the sense that you can keep something proportional to k in working memory. You shouldn't keep something proportional to n in working memory. Andrei

Oct 09 2008

Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.

Oct 09 2008

Walter Bright wrote:Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.

I am getting revenge for "I never use anything in std.algorithm" :o). Andrei

Oct 09 2008

Andrei Alexandrescu wrote:Walter Bright wrote:Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.

I am getting revenge for "I never use anything in std.algorithm" :o).

Get in line for everyone getting their revenge on me!

Oct 09 2008

Reply to Andrei,Walter Bright wrote:Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?

uniform distribution. Unless you're willing to make a guess based on the file size.

Andrei

"Et tu, Brute!" - (Julius Caesar, Act III, Scene I).

Oct 09 2008

Walter Bright wrote:

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).

Oct 09 2008

Denis Koroskin wrote:On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:Walter Bright wrote:Andrei Alexandrescu wrote:

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told thatThe file's size is unbounded so loading it in memory is not an option.

Keeping k of the lines in memory is fine. Andrei

Oct 09 2008

Ary Borenszweig wrote:Walter Bright wrote:

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).

I guess p depends on the number of lines unfortunately. Because p needs to be decreasing with line count, otherwise the first few line will almost always be replaced (not uniform then).

Oct 09 2008

Denis Koroskin wrote:On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:Walter Bright wrote:

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told that

If you can't read more lines in this step, the answer are all the lines.The file's size is unbounded so loading it in memory is not an option.

Oct 09 2008

On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:Walter Bright wrote:

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told thatThe file's size is unbounded so loading it in memory is not an option.

Oct 09 2008

Andrei Alexandrescu:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones: from sys import argv from random import sample filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: samples = set(sample(xrange(nlines), k)) for i, line in enumerate(file(filename)): if i in samples: print line If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution. I'll think about the more complex problem in the next post. Later, bearophile

Oct 09 2008

I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory: from sys import argv from random import random filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: select = k remaining = nlines for line in file(filename): if random() < float(select) / remaining: print line select -= 1 remaining -= 1 I'll think for a solution that avoids reading the file twice then... Bye, bearophile

Oct 09 2008

bearophile wrote:I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory: from sys import argv from random import random filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: select = k remaining = nlines for line in file(filename): if random() < float(select) / remaining: print line select -= 1 remaining -= 1 I'll think for a solution that avoids reading the file twice then... Bye, bearophile

This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance. Andrei

Oct 09 2008

Andrei Alexandrescu:This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance.

If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice. You can run this code: from random import random from collections import defaultdict def main(freqs, nrepeats): for i in xrange(nrepeats): chosen = [] select = k remaining = n for value in values: if random() < (float(select) / remaining): chosen.append(value) select -= 1 remaining -= 1 for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] Its output: [1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526] It shows the values are balanced. If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly. Bye, bearophile

Oct 09 2008

Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange # randrange gives a random integer in [0, n) filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_lines Now I'll look for possible bugs in this third version and to the problem Andrei has just told me :-) Bye, bearophile

Oct 09 2008

bearophile wrote:Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange # randrange gives a random integer in [0, n) filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works. Andrei

Oct 09 2008

Andrei Alexandrescu escribió:bearophile wrote:Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange # randrange gives a random integer in [0, n) filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Andrei

Oct 09 2008

Ary Borenszweig wrote:Andrei Alexandrescu escribió:bearophile wrote:Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange # randrange gives a random integer in [0, n) filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3... Andrei

Oct 09 2008

Andrei Alexandrescu escribió:Ary Borenszweig wrote:Andrei Alexandrescu escribió:bearophile wrote:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei

Oct 09 2008

Ary Borenszweig wrote:Andrei Alexandrescu escribió:Ary Borenszweig wrote:Andrei Alexandrescu escribió:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should. Andrei

Oct 09 2008

Andrei Alexandrescu wrote:Ary Borenszweig wrote:Andrei Alexandrescu escribió:Ary Borenszweig wrote:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)Andrei

Oct 10 2008

Ary Borenszweig wrote:Andrei Alexandrescu wrote:Ary Borenszweig wrote:Andrei Alexandrescu escribió:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o) Andrei

Oct 10 2008

Andrei Alexandrescu wrote:Ary Borenszweig wrote:Andrei Alexandrescu wrote:Ary Borenszweig wrote:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't have Kirk's code here at work (I told Thunderbird not to download everything I had in my home)... but if I remember correctly, it had the same problem, it was something like: if (0 == uniform(0, k)): or something like that, and should be 0 != ... I wonder why both used the wrong probability... (that's why I made the math, because I wasn't sure with that probability it worked) I really liked the problem! It has a very simple and elegant solution... I thought the probability would be something more complex, but it turned out to be simple. :-)

Oct 10 2008

Andrei Alexandrescu:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious. Bye, bearophile

Oct 10 2008

bearophile wrote:Andrei Alexandrescu:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

No, that's a different problem. They want to extract n random samples from N records *without replacement*. It's a very different ballgame. See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html The problem as I stated is reasonably easy and my implementation should be correct. Andrei

Oct 10 2008

Fri, 10 Oct 2008 10:55:51 -0400, bearophile wrote:Andrei Alexandrescu:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

I think your algorithm has one mistake. The last line of your text gets into the k with probability 1/n while it should be k/n. Replace 1.0 with 1.0*k and the distribution will be absolutely (theoretically) uniform.

Oct 10 2008

Ary Borenszweig wrote:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound. -- Kirk McDonald

Oct 09 2008

Kirk McDonald wrote:Ary Borenszweig wrote:

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound.

That's why I thought it's such a beautiful problem. My proof is inductive over n. The invariant is that the k lines held represent a uniform selection of the n lines seen so far at any moment. Base: For n <= k, vacuously satisfied. Inductive step: We have a random selection of n lines so far in array selection, and we add one more line. We need to prove that the new selection is also uniform. The new line has probability k / (n + 1) of being part of the selection. We toss a skewed coin with that probability. If it turns heads, we pick the new element for keeping. Any element in the array should go with equal probability (by the inductive assumption), so we roll a k-sided dice to choose one element to throw away. Andrei

Oct 09 2008

Andrei Alexandrescu Wrote:

Based on your later stated requirements, use a heap. Insert each line into the heap using a random number as the key. After pre-loading the first k elements, each insert into the heap is followed by the removal of the head element. Performance is O(characters in file + log2(k)*lines in file) Obviously, the random number generator needs a range of values that exceeds the lines in the file by a good margin, or else you'll get biases in what gets kept.

Oct 09 2008

Andrei Alexandrescu wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. -- Kirk McDonald

Oct 09 2008

Kirk McDonald wrote:Andrei Alexandrescu wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei

Oct 09 2008

Andrei Alexandrescu wrote:Kirk McDonald wrote:Andrei Alexandrescu wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei

No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.) -- Kirk McDonald

Oct 09 2008

Kirk McDonald wrote:Andrei Alexandrescu wrote:Kirk McDonald wrote:Andrei Alexandrescu wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei

No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)

Sounds good. Andrei

Oct 09 2008

Andrei Alexandrescu:But where's the D code you guys?

My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than Python to create working prototypes. Bye, bearophile

Oct 09 2008

bearophile wrote:Andrei Alexandrescu:But where's the D code you guys?

My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than Python to create working prototypes.

True men only write D. Your solution looks eerily similar to mine. #!/usr/bin/rdmd import std.stdio, std.contracts, std.conv, std.random; void main(string[] args) { invariant k = parse!(uint)(args[1]); enforce(k > 0, "Must pass a strictly positive selection size"); string[] selection; auto gen = Random(unpredictableSeed); foreach (ulong tally, char[] line; lines(stdin)) { if (selection.length < k) { // Selection not full; add to it selection ~= line.idup; } else { auto t = uniform(gen, 0, tally + 1); if (t > k) continue; // no luck // Replace a random element in the selection selection[uniform(gen, 0, k - 1)] = line.idup; } } // Print selection foreach (s; selection) write(s); }

Oct 09 2008

Andrei Alexandrescu wrote:enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert? I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size

Oct 10 2008

Christopher Wright wrote:Andrei Alexandrescu wrote:enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert?

Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size

That looks pretty darn nice. Andrei

Oct 10 2008

Andrei Alexandrescu wrote:Christopher Wright wrote:Andrei Alexandrescu wrote:enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert?

Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.

True -- I forgot about that.I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size

That looks pretty darn nice. Andrei

If you want it, the code's in a public subversion repository: http://svn.dsource.org/projects/dmocks/dunit/trunk/ modules dunit.expect and dunit.constraint Currently BSD licensed, but if anyone wants it in a different license, that's up for negotiation. Though I think it would take some effort to make it work with Phobos.

Oct 11 2008

Kirk McDonald wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... Andrei

Oct 09 2008

Andrei Alexandrescu wrote:Kirk McDonald wrote:

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... Andrei

To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-) -- Kirk McDonald

Oct 09 2008

Kirk McDonald wrote:To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-)

I think you forgot to attach the D solution. ;o) Andrei

Oct 09 2008

Reply to Andrei,

struct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

Oct 09 2008

BCS Wrote:Reply to Andrei,

struct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

You should check if your new random number is good enough to keep before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.

Oct 09 2008

Reply to Jason,BCS Wrote:Reply to Andrei,

{ char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.

I was going for simplicity and ease of reading within a given big-O class. Switching to insertion sort or reusing data buffers would also do better.

Oct 09 2008

BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODE Andrei Alexandrescu wrote:

Oct 09 2008

Russell Lewis wrote:BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODE

Then the last line will be part of the selection with probability 1. Andrei

Oct 09 2008

Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows: BEGIN CODE int count = k; while(!feof(stdin)) { if(random(0,count) >= k) ReadALine(); // this discards the line else { int i = random(0,k); bufs[i] = ReadALine(); } count++; } END CODE Note that I'm assuming that random(a,b) is a function that randomly returns an integer in the range [a,b). This new version works because the most-recently-read line has a k/count chance of getting picked up into the current working set, and when we do choose to discard some line, we give all of the old lines an equal chance of survival. I don't have to to formally work out the math right now, but I'm pretty sure that that would give each line in the entire file a k/count chance of being in the set at any given point in the process. Russ Russell Lewis wrote:BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODE Andrei Alexandrescu wrote:

Oct 09 2008

Russell Lewis wrote:Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows: BEGIN CODE int count = k; while(!feof(stdin)) { if(random(0,count) >= k) ReadALine(); // this discards the line else { int i = random(0,k); bufs[i] = ReadALine(); } count++; } END CODE

Yah that works. Nice! Andrei

Oct 09 2008

Andrei Alexandrescu wrote:

You can't do a uniform random distribution without knowing the length. Your options are: - Hold the entire file in memory. - Hold the entire file so far in memory. Once you get over a certain threshold, remove lines randomly. You have to tweak the random distribution for the axing method to make sure it's uniform. If this is going to be used often on huge data sets, that is a reasonable effort. How would you need to tweak the random distribution? The easiest way is to get two random numbers: what section to remove lines from and which line in that section to remove stuff from. You can work out some annoying recurrence to find out how to get the probability of axing from one section given that it has been available for k rounds. I might work out the recurrence tonight, but I don't have a chalkboard.

Oct 09 2008

: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.

Oct 09 2008

Carlos wrote:: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.

I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision. Andrei

Oct 09 2008

Andrei Alexandrescu wrote:Carlos wrote:: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.

I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision.

I think "the number of total samples..." is the correct statement. Andrei

Oct 09 2008

Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.

Oct 09 2008

Sergey Gromov wrote:Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample; On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts? Andrei

Oct 09 2008

Jarrett Billingsley wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Well I'd like to extend std.stdio with capabilities that meet and exceed those in std.stream. The design of std.stream is ok, but makes it hard to compose streams, and requires a lot of hand coding to define new streams. Andrei

Oct 09 2008

Thu, 09 Oct 2008 22:28:51 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample;

Good advice, thanks. The Sample definition is unimportant for the solution but eats up a lot of visual space.On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?

While experimenting with Bearophile's loaderhttp://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

a file line-by-line. So the alternative should be as fast.

Oct 10 2008

Sergey Gromov:While experimenting with Bearophile's loaderhttp://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

a file line-by-line. So the alternative should be as fast.

BufferedFile seems the faster in Phobos, but it's 2-3 times slower than the line-by-line reading you have in Python, and I think it has 4 bugs, so I don't use it much, that's why I use the xfile() of my libs, that is a little slower, but without those bugs. Few bugs of BufferedFile can be seen if you try to load a very large file made of a single line (crashes D every time), or you try reading lines with \0 at the end of the line, etc. They are corner cases, usually. I suggest reading the C sources of the I/O of Python (or Perl, if you are able to read the sources. I think the differences in clarity between Python and Perl can be seen even in their C sources). Bye, bearophile

Oct 10 2008

Jarrett Billingsley wrote:On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

;)

That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei

Oct 10 2008

Denis Koroskin wrote:On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Jarrett Billingsley wrote:On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

side-by-side ;)

_another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei

Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)

That would make Phobos dependent on Tango, thus running into those licensing issues I have little understanding of. Andrei

Oct 10 2008

On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Oct 09 2008

On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)

Oct 10 2008

On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:

My attempt. Works in one pass, and distributes uniformly but only when k is a factor of the number of lines in the file. At least, I believe the distribution is uniform, but feel free to prove me wrong. /* Uniform random value r, with r >= min && r < max. */ real random(uint min, uint max) { ... } /* Shuffle elements uniformly in the given string array. */ void shuffle(ref string[] array) { ... } string[] challenge(alias R)(R input, uint k) { string[] result; uint line; real pickFactor = 1; // Initial stage: fill initial k strings while (!input.eof && line < k) { result ~= input.readLine; ++line; } shuffle(result); while (!input.eof) { // Next stage: half the chances of picking a candidate. pickFactor /= 2; string[] candidates; line = 0; while (!input.eof && line < k) { candidates ~= input.readLine; ++line; } shuffle(candidates); for (uint i; i < candidates.length; ++i) if (random(0, 1) < pickFactor) result[i] = candidates[i]; // Note: if candidate.length != result.length // we don't have uniformity. } return result; } -- Michel Fortin michel.fortin michelf.com http://michelf.com/

Oct 10 2008

On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)

That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Oct 10 2008

On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Jarrett Billingsley wrote:On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

side-by-side ;)

_another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei

Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)

Oct 10 2008