www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - random k-sample of a file

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I just ran across a nice little problem that I wanted to share as an 
exercise for the interested in futile pastimes.

The challenge, should you accept it, is to write a program that given a 
number k and a file, outputs k lines picked uniformly at random from the 
file. (If the file has less than k lines, it should all be output.) The 
file's size is unbounded so loading it in memory is not an option. How'd 
you go about it?


Andrei
Oct 09 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a
 number k and a file, outputs k lines picked uniformly at random from the
 file. (If the file has less than k lines, it should all be output.) The
 file's size is unbounded so loading it in memory is not an option. How'd
 you go about it?
 Andrei

Pretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks. Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing. (I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a
 number k and a file, outputs k lines picked uniformly at random from the
 file. (If the file has less than k lines, it should all be output.) The
 file's size is unbounded so loading it in memory is not an option. How'd
 you go about it?
 Andrei

Pretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks.

So this is essentially an index of the file's lines?
 Shuffle the ulong.sizeof chunks of this
 file the same way you would shuffle an ordinary array, using seeking as the
 equivalent to array indexing.

So this would shuffle that index.
 (I never said this was going to be efficient.)  For
 each of the first k offsets after shuffling, print everything in the original
file
 from that offset to the first EOL character encountered.

Can it be done faster and maybe in one pass? Andrei
Oct 09 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

No go if the file is a stream. Can it be done in one pass? Andrei
Oct 09 2008
next sibling parent BCS <ao pathlink.com> writes:
Reply to Andrei,

 Walter Bright wrote:
 
 Andrei Alexandrescu wrote:
 
 The challenge, should you accept it, is to write a program that
 given a number k and a file, outputs k lines picked uniformly at
 random from the file. (If the file has less than k lines, it should
 all be output.) The file's size is unbounded so loading it in memory
 is not an option. How'd you go about it?
 

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Andrei

read and store the first k lines with there line number, if no lines remain, output stored lines read another line replace an existing line with a probability based on the current line and the replaced line's location if no lines remain ... Now all that remains is to figure out that function
Oct 09 2008
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Andrei Alexandrescu" wrote
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

No go if the file is a stream. Can it be done in one pass?

Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k. -Steve
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines


Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k.

k is a "small" number in the sense that you can keep something proportional to k in working memory. You shouldn't keep something proportional to n in working memory. Andrei
Oct 09 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.
Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.

I am getting revenge for "I never use anything in std.algorithm" :o). Andrei
Oct 09 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.

I am getting revenge for "I never use anything in std.algorithm" :o).

Get in line for everyone getting their revenge on me!
Oct 09 2008
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Andrei,

 Walter Bright wrote:
 
 Andrei Alexandrescu wrote:
 
 No go if the file is a stream. Can it be done in one pass?
 

uniform distribution. Unless you're willing to make a guess based on the file size.

Andrei

"Et tu, Brute!" - (Julius Caesar, Act III, Scene I).
Oct 09 2008
prev sibling next sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).
Oct 09 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig 
 <ary esperanto.org.ar> wrote:
 
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told that
 The file's size is unbounded so loading it in memory is not an option.


Keeping k of the lines in memory is fine. Andrei
Oct 09 2008
prev sibling next sibling parent KennyTM~ <kennytm gmail.com> writes:
Ary Borenszweig wrote:
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).

I guess p depends on the number of lines unfortunately. Because p needs to be decreasing with line count, otherwise the first few line will almost always be replaced (not uniform then).
Oct 09 2008
prev sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
Denis Koroskin wrote:
 On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig 
 <ary esperanto.org.ar> wrote:
 
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told that

If you can't read more lines in this step, the answer are all the lines.
 
 The file's size is unbounded so loading it in memory is not an option.


Oct 09 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar>  
wrote:

 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given  
 a number k and a file, outputs k lines picked uniformly at random from  
 the file. (If the file has less than k lines, it should all be  
 output.) The file's size is unbounded so loading it in memory is not  
 an option. How'd you go about it?

2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")

What if there is only one line in a file? Andrei told that
 The file's size is unbounded so loading it in memory is not an option.

Oct 09 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones: from sys import argv from random import sample filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: samples = set(sample(xrange(nlines), k)) for i, line in enumerate(file(filename)): if i in samples: print line If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution. I'll think about the more complex problem in the next post. Later, bearophile
Oct 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
I am not reading the anwers written by others, of course :-) With the help of
"Programming Pearls" here is my second version, that spares the memory required
for the chosen ones, so this code runs with very little memory:

from sys import argv
from random import random

filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))

if k >= nlines:
    for line in file(filename):
        print line
else:
    select = k
    remaining = nlines

    for line in file(filename):
        if random() < float(select) / remaining:
            print line
            select -= 1
        remaining -= 1

I'll think for a solution that avoids reading the file twice then...

Bye,
bearophile
Oct 09 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 I am not reading the anwers written by others, of course :-) With the help of
"Programming Pearls" here is my second version, that spares the memory required
for the chosen ones, so this code runs with very little memory:
 
 from sys import argv
 from random import random
 
 filename = argv[1]
 k = int(argv[2])
 nlines = sum(1 for _ in file(filename))
 
 if k >= nlines:
     for line in file(filename):
         print line
 else:
     select = k
     remaining = nlines
 
     for line in file(filename):
         if random() < float(select) / remaining:
             print line
             select -= 1
         remaining -= 1
 
 I'll think for a solution that avoids reading the file twice then...
 
 Bye,
 bearophile

This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance. Andrei
Oct 09 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 This is unfair. First line in the original file is included with 
 probability k / n. The probability of including the second line is 
 dependent on the event of including the first line. If it wasn't 
 included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines 
 should be given equal chance.

If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice. You can run this code: from random import random from collections import defaultdict def main(freqs, nrepeats): for i in xrange(nrepeats): chosen = [] select = k remaining = n for value in values: if random() < (float(select) / remaining): chosen.append(value) select -= 1 remaining -= 1 for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] Its output: [1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526] It shows the values are balanced. If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly. Bye, bearophile
Oct 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Third solution, this requires a storage of k lines (but you can keep this
storage on disk):

from sys import argv
from random import random, randrange
# randrange gives a random integer in [0, n)

filename = argv[1]
k = int(argv[2])
assert k > 0

chosen_lines = []
for i, line in enumerate(file(filename)):
    if i < k:
        chosen_lines.append(line)
    else:
        if random() < (1.0 / (i+1)):
            chosen_lines[randrange(k)] = line

print chosen_lines

Now I'll look for possible bugs in this third version and to the problem Andrei
has just told me :-)

Bye,
bearophile
Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep this
storage on disk):
 
 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)
 
 filename = argv[1]
 k = int(argv[2])
 assert k > 0
 
 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line
 
 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works. Andrei
Oct 09 2008
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(
 
 Andrei
 

Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3... Andrei
Oct 09 2008
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
 
 Andrei

Oct 09 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should. Andrei
Oct 09 2008
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)
 
 
 Andrei

Oct 10 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ary Borenszweig wrote:
 Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o) Andrei
Oct 10 2008
next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't have Kirk's code here at work (I told Thunderbird not to download everything I had in my home)... but if I remember correctly, it had the same problem, it was something like: if (0 == uniform(0, k)): or something like that, and should be 0 != ... I wonder why both used the wrong probability... (that's why I made the math, because I wasn't sure with that probability it worked) I really liked the problem! It has a very simple and elegant solution... I thought the probability would be something more complex, but it turned out to be simple. :-)
Oct 10 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say
 
 if random() > (1.0 / (i+1)):
 
 instead of
 
 if random() < (1.0 / (i+1)):
 
 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious. Bye, bearophile
Oct 10 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say

 if random() > (1.0 / (i+1)):

 instead of

 if random() < (1.0 / (i+1)):

 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)


I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

No, that's a different problem. They want to extract n random samples from N records *without replacement*. It's a very different ballgame. See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html The problem as I stated is reasonably easy and my implementation should be correct. Andrei
Oct 10 2008
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
 Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say
 
 if random() > (1.0 / (i+1)):
 
 instead of
 
 if random() < (1.0 / (i+1)):
 
 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

I think your algorithm has one mistake. The last line of your text gets into the k with probability 1/n while it should be k/n. Replace 1.0 with 1.0*k and the distribution will be absolutely (theoretically) uniform.
Oct 10 2008
prev sibling parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound. -- Kirk McDonald
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Kirk McDonald wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribió:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange
 # randrange gives a random integer in [0, n)

 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound.

That's why I thought it's such a beautiful problem. My proof is inductive over n. The invariant is that the k lines held represent a uniform selection of the n lines seen so far at any moment. Base: For n <= k, vacuously satisfied. Inductive step: We have a random selection of n lines so far in array selection, and we add one more line. We need to prove that the new selection is also uniform. The new line has probability k / (n + 1) of being part of the selection. We toss a skewed coin with that probability. If it turns heads, we pick the new element for keeping. Any element in the array should go with equal probability (by the inductive assumption), so we roll a k-sided dice to choose one element to throw away. Andrei
Oct 09 2008
prev sibling next sibling parent Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

Based on your later stated requirements, use a heap. Insert each line into the heap using a random number as the key. After pre-loading the first k elements, each insert into the heap is followed by the removal of the head element. Performance is O(characters in file + log2(k)*lines in file) Obviously, the random number generator needs a range of values that exceeds the lines in the file by a good margin, or else you'll get biases in what gets kept.
Oct 09 2008
prev sibling next sibling parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. -- Kirk McDonald
Oct 09 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei
Oct 09 2008
next sibling parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei

No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.) -- Kirk McDonald
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?


 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. Andrei

No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)

Sounds good. Andrei
Oct 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 But where's the D code you guys?

My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }
 D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than Python to create working prototypes. Bye, bearophile
Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 But where's the D code you guys?

My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }
 D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than Python to create working prototypes.

True men only write D. Your solution looks eerily similar to mine. #!/usr/bin/rdmd import std.stdio, std.contracts, std.conv, std.random; void main(string[] args) { invariant k = parse!(uint)(args[1]); enforce(k > 0, "Must pass a strictly positive selection size"); string[] selection; auto gen = Random(unpredictableSeed); foreach (ulong tally, char[] line; lines(stdin)) { if (selection.length < k) { // Selection not full; add to it selection ~= line.idup; } else { auto t = uniform(gen, 0, tally + 1); if (t > k) continue; // no luck // Replace a random element in the selection selection[uniform(gen, 0, k - 1)] = line.idup; } } // Print selection foreach (s; selection) write(s); }
Oct 09 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert? I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size
Oct 10 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Christopher Wright wrote:
 Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert?

Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.
 I've written an assertions framework, but it provides some more options 
 than assert:
 
 expect.because ("Must pass a strictly positive selection size").that 
 (k).greaterThan (0);
 
 If that failed, it would produce:
 AssertionException: Expected: greater than 0 but was: -4 -- Must pass a 
 strictly positive selection size

That looks pretty darn nice. Andrei
Oct 10 2008
parent Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
 Christopher Wright wrote:
 Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert?

Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.

True -- I forgot about that.
 I've written an assertions framework, but it provides some more 
 options than assert:

 expect.because ("Must pass a strictly positive selection size").that 
 (k).greaterThan (0);

 If that failed, it would produce:
 AssertionException: Expected: greater than 0 but was: -4 -- Must pass 
 a strictly positive selection size

That looks pretty darn nice. Andrei

If you want it, the code's in a public subversion repository: http://svn.dsource.org/projects/dmocks/dunit/trunk/ modules dunit.expect and dunit.constraint Currently BSD licensed, but if anyone wants it in a different license, that's up for negotiation. Though I think it would take some effort to make it work with Phobos.
Oct 11 2008
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... Andrei
Oct 09 2008
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... Andrei

To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-) -- Kirk McDonald
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Kirk McDonald wrote:
 To be fair, I started writing it before bearophile posted his, and 
 indeed did not see his solution until after I had posted it. :-)

I think you forgot to attach the D solution. ;o) Andrei
Oct 09 2008
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Andrei,

 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given
 a number k and a file, outputs k lines picked uniformly at random from
 the file. (If the file has less than k lines, it should all be
 output.) The file's size is unbounded so loading it in memory is not
 an option. How'd you go about it?
 
 Andrei
 

struct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
Oct 09 2008
parent reply Jason House <jason.james.house gmail.com> writes:
BCS Wrote:

 Reply to Andrei,
 
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given
 a number k and a file, outputs k lines picked uniformly at random from
 the file. (If the file has less than k lines, it should all be
 output.) The file's size is unbounded so loading it in memory is not
 an option. How'd you go about it?
 
 Andrei
 

struct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

You should check if your new random number is good enough to keep before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.
Oct 09 2008
parent BCS <ao pathlink.com> writes:
Reply to Jason,

 BCS Wrote:
 
 Reply to Andrei,
 
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that
 given a number k and a file, outputs k lines picked uniformly at
 random from the file. (If the file has less than k lines, it should
 all be output.) The file's size is unbounded so loading it in memory
 is not an option. How'd you go about it?
 
 Andrei
 

{ char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.

I was going for simplicity and ease of reading within a given big-O class. Switching to insertion sort or reusing data buffers would also do better.
Oct 09 2008
prev sibling next sibling parent reply Russell Lewis <webmaster villagersonline.com> writes:
BEGIN CODE
char[][] choose(int k)
{
   char[][] bufs; bufs.length = k;

   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();

   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }

   return bufs;
}

END CODE

Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

Oct 09 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Russell Lewis wrote:
 BEGIN CODE
 char[][] choose(int k)
 {
   char[][] bufs; bufs.length = k;
 
   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();
 
   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }
 
   return bufs;
 }
 
 END CODE

Then the last line will be part of the selection with probability 1. Andrei
Oct 09 2008
prev sibling parent reply Russell Lewis <webmaster villagersonline.com> writes:
Sorry, I overlooked the (obvious) fact that this biases things towards 
the end of the file.  Change the while loop as follows:

BEGIN CODE
   int count = k;
   while(!feof(stdin))
   {
     if(random(0,count) >= k)
       ReadALine();   // this discards the line
     else
     {
       int i = random(0,k);
       bufs[i] = ReadALine();
     }
     count++;
   }
END CODE

Note that I'm assuming that random(a,b) is a function that randomly 
returns an integer in the range [a,b).

This new version works because the most-recently-read line has a k/count 
  chance of getting picked up into the current working set, and when we 
do choose to discard some line, we give all of the old lines an equal 
chance of survival.

I don't have to to formally work out the math right now, but I'm pretty 
sure that that would give each line in the entire file a k/count chance 
of being in the set at any given point in the process.

Russ

Russell Lewis wrote:
 BEGIN CODE
 char[][] choose(int k)
 {
   char[][] bufs; bufs.length = k;
 
   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();
 
   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }
 
   return bufs;
 }
 
 END CODE
 
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei


Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Russell Lewis wrote:
 Sorry, I overlooked the (obvious) fact that this biases things towards 
 the end of the file.  Change the while loop as follows:
 
 BEGIN CODE
   int count = k;
   while(!feof(stdin))
   {
     if(random(0,count) >= k)
       ReadALine();   // this discards the line
     else
     {
       int i = random(0,k);
       bufs[i] = ReadALine();
     }
     count++;
   }
 END CODE

Yah that works. Nice! Andrei
Oct 09 2008
prev sibling next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

You can't do a uniform random distribution without knowing the length. Your options are: - Hold the entire file in memory. - Hold the entire file so far in memory. Once you get over a certain threshold, remove lines randomly. You have to tweak the random distribution for the axing method to make sure it's uniform. If this is going to be used often on huge data sets, that is a reasonable effort. How would you need to tweak the random distribution? The easiest way is to get two random numbers: what section to remove lines from and which line in that section to remove stuff from. You can work out some annoying recurrence to find out how to get the probability of axing from one section given that it has been available for k rounds. I might work out the recurrence tonight, but I don't have a chalkboard.
Oct 09 2008
parent reply "Carlos" <carlos-smith sympatico.ca> writes:
: You can't do a uniform random distribution without knowing the length.

Probably true for other distribution.
Most certainly not true for uniform distribution (with a raisonable k)

You can work on a subset of the file. Let say 1000 records.
The distribution being uniform, you can select (or eliminate),
a percentage of each subset and the results for the whole file
will be ok.
Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Carlos wrote:
 : You can't do a uniform random distribution without knowing the length.
 
 Probably true for other distribution.
 Most certainly not true for uniform distribution (with a raisonable k)
 
 You can work on a subset of the file. Let say 1000 records.
 The distribution being uniform, you can select (or eliminate),
 a percentage of each subset and the results for the whole file
 will be ok.

I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision. Andrei
Oct 09 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Andrei Alexandrescu wrote:
 Carlos wrote:
 : You can't do a uniform random distribution without knowing the length.

 Probably true for other distribution.
 Most certainly not true for uniform distribution (with a raisonable k)

 You can work on a subset of the file. Let say 1000 records.
 The distribution being uniform, you can select (or eliminate),
 a percentage of each subset and the results for the whole file
 will be ok.

I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision.

I think "the number of total samples..." is the correct statement. Andrei
Oct 09 2008
prev sibling next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 09 Oct 2008 13:56:48 -0500,
Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.
Oct 09 2008
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Sergey Gromov wrote:
 Thu, 09 Oct 2008 13:56:48 -0500,
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample; On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts? Andrei
Oct 09 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jarrett Billingsley wrote:
 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 On a different topic, your use of std.stream reminded me. I was thinking of
 yanking it from Phobos. Thoughts?

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Well I'd like to extend std.stdio with capabilities that meet and exceed those in std.stream. The design of std.stream is ok, but makes it hard to compose streams, and requires a lot of hand coding to define new streams. Andrei
Oct 09 2008
prev sibling next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 09 Oct 2008 22:28:51 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 13:56:48 -0500,
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample;

Good advice, thanks. The Sample definition is unimportant for the solution but eats up a lot of visual space.
 On a different topic, your use of std.stream reminded me. I was thinking 
 of yanking it from Phobos. Thoughts?

While experimenting with Bearophile's loader
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

a file line-by-line. So the alternative should be as fast.
Oct 10 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 While experimenting with Bearophile's loader
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

a file line-by-line. So the alternative should be as fast.

BufferedFile seems the faster in Phobos, but it's 2-3 times slower than the line-by-line reading you have in Python, and I think it has 4 bugs, so I don't use it much, that's why I use the xfile() of my libs, that is a little slower, but without those bugs. Few bugs of BufferedFile can be seen if you try to load a very large file made of a single line (crashes D every time), or you try reading lines with \0 at the end of the line, etc. They are corner cases, usually. I suggest reading the C sources of the I/O of Python (or Perl, if you are able to read the sources. I think the differences in clarity between Python and Perl can be seen even in their C sources). Bye, bearophile
Oct 10 2008
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking
 of
 yanking it from Phobos. Thoughts?

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

;)

That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei
Oct 10 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> 
 wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was 
 thinking
 of
 yanking it from Phobos. Thoughts?

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

side-by-side ;)

_another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei

Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)

That would make Phobos dependent on Tango, thus running into those licensing issues I have little understanding of. Andrei
Oct 10 2008
prev sibling next sibling parent "Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking of
 yanking it from Phobos. Thoughts?

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 09 2008
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley  
<jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was  
 thinking of
 yanking it from Phobos. Thoughts?

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)
Oct 10 2008
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be output.) 
 The file's size is unbounded so loading it in memory is not an option. 
 How'd you go about it?

My attempt. Works in one pass, and distributes uniformly but only when k is a factor of the number of lines in the file. At least, I believe the distribution is uniform, but feel free to prove me wrong. /* Uniform random value r, with r >= min && r < max. */ real random(uint min, uint max) { ... } /* Shuffle elements uniformly in the given string array. */ void shuffle(ref string[] array) { ... } string[] challenge(alias R)(R input, uint k) { string[] result; uint line; real pickFactor = 1; // Initial stage: fill initial k strings while (!input.eof && line < k) { result ~= input.readLine; ++line; } shuffle(result); while (!input.eof) { // Next stage: half the chances of picking a candidate. pickFactor /= 2; string[] candidates; line = 0; while (!input.eof && line < k) { candidates ~= input.readLine; ++line; } shuffle(candidates); for (uint i; i < candidates.length; ++i) if (random(0, 1) < pickFactor) result[i] = candidates[i]; // Note: if candidate.length != result.length // we don't have uniformity. } return result; } -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 10 2008
prev sibling next sibling parent "Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:
On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking
 of
 yanking it from Phobos. Thoughts?

..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)

That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?
Oct 10 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com>  
 wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was  
 thinking
 of
 yanking it from Phobos. Thoughts?

I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.

side-by-side ;)

_another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. Andrei

Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)
Oct 10 2008