digitalmars.D - random k-sample of a file

Andrei Alexandrescu (8/8) Oct 09 2008 I just ran across a nice little problem that I wanted to share as an

dsimcha (11/19) Oct 09 2008 Pretty simple, actually. Read the file one character at a time. Get th...

Andrei Alexandrescu (5/27) Oct 09 2008 So this would shuffle that index.

Walter Bright (5/10) Oct 09 2008 1. read the file, counting lines => nlines

Andrei Alexandrescu (3/14) Oct 09 2008 No go if the file is a stream. Can it be done in one pass?

BCS (8/27) Oct 09 2008 read and store the first k lines with there line number,
Steven Schveighoffer (5/18) Oct 09 2008 Any bounds on the input k? If not, then no. In that case, you must do ...

Andrei Alexandrescu (5/22) Oct 09 2008 k is a "small" number in the sense that you can keep something

Walter Bright (4/5) Oct 09 2008 No, because without knowing the number of lines, you cannot produce a

Andrei Alexandrescu (3/10) Oct 09 2008 I am getting revenge for "I never use anything in std.algorithm" :o).

Walter Bright (2/12) Oct 09 2008 Get in line for everyone getting their revenge on me!
BCS (2/17) Oct 09 2008 "Et tu, Brute!" - (Julius Caesar, Act III, Scene I).

Ary Borenszweig (12/23) Oct 09 2008 Can't this be done in one single pass?

Denis Koroskin (3/19) Oct 09 2008 What if there is only one line in a file? Andrei told that

Andrei Alexandrescu (3/29) Oct 09 2008 Keeping k of the lines in memory is fine.
Ary Borenszweig (2/28) Oct 09 2008

KennyTM~ (4/33) Oct 09 2008 I guess p depends on the number of lines unfortunately.

bearophile (19/26) Oct 09 2008 It's an interesting problem, I can offer two solutions, the first one is...

bearophile (20/20) Oct 09 2008 I am not reading the anwers written by others, of course :-) With the he...

Andrei Alexandrescu (7/33) Oct 09 2008 This is unfair. First line in the original file is included with

bearophile (30/35) Oct 09 2008 If each of them have the same chance, then you can't guarantee the resul...

bearophile (18/18) Oct 09 2008 Third solution, this requires a storage of k lines (but you can keep thi...

Andrei Alexandrescu (4/23) Oct 09 2008 We have a winner!!! There is actually a very simple proof on how and why...

Ary Borenszweig (10/37) Oct 09 2008 Say you want 2 lines from a file that has 3 lines. Say the lines are a,

Andrei Alexandrescu (3/42) Oct 09 2008 Nothing except you stop the induction at step 3...

Ary Borenszweig (9/53) Oct 09 2008 ... which is the last step in this case. There are only three lines.

Andrei Alexandrescu (7/63) Oct 09 2008 Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 /

Ary Borenszweig (7/75) Oct 10 2008 Ah, ok. It works if you do that. So bearophile's code must say

Andrei Alexandrescu (3/80) Oct 10 2008 His prize gets retired. Now how does Kirk's code fare? :o)

Ary Borenszweig (12/92) Oct 10 2008 I don't have Kirk's code here at work (I told Thunderbird not to
bearophile (29/41) Oct 10 2008 I don't like contests...

Andrei Alexandrescu (7/26) Oct 10 2008 No, that's a different problem. They want to extract n random samples
Sergey Gromov (6/56) Oct 10 2008 I think your algorithm has one mistake. The last line of your text gets...

Kirk McDonald (25/82) Oct 09 2008 To reiterate the code I was talking about (the solution for k == 1):

Andrei Alexandrescu (13/104) Oct 09 2008 That's why I thought it's such a beautiful problem. My proof is

Jason House (4/15) Oct 09 2008 Based on your later stated requirements, use a heap. Insert each line i...
Kirk McDonald (32/43) Oct 09 2008 This is a classic interview question. The solution for k == 1 is easy:

Andrei Alexandrescu (5/60) Oct 09 2008 This looks good too, but I'm not sure where i will start from in the

Kirk McDonald (14/80) Oct 09 2008 No, i should start at k (which it does, that is what the start=k

Andrei Alexandrescu (3/84) Oct 09 2008 Sounds good.

bearophile (20/22) Oct 09 2008 My third converted to D, using my libs:

Andrei Alexandrescu (23/44) Oct 09 2008 True men only write D. Your solution looks eerily similar to mine.

Christopher Wright (9/10) Oct 10 2008 Out of curiosity, why are you using enforce rather than assert?

Andrei Alexandrescu (5/18) Oct 10 2008 Enforce is never ignored, even in production builds. You wouldn't want a...

Christopher Wright (8/30) Oct 11 2008 If you want it, the code's in a public subversion repository:

Andrei Alexandrescu (5/61) Oct 09 2008 Oh forgot to mention this is superior numerically to bearophile's

Kirk McDonald (5/70) Oct 09 2008 To be fair, I started writing it before bearophile posted his, and

Andrei Alexandrescu (3/5) Oct 09 2008 I think you forgot to attach the D solution. ;o)

BCS (23/34) Oct 09 2008 struct S

Jason House (2/44) Oct 09 2008 You should check if your new random number is good enough to keep before...

BCS (3/47) Oct 09 2008 I was going for simplicity and ease of reading within a given big-O clas...

Russell Lewis (15/26) Oct 09 2008 BEGIN CODE

Andrei Alexandrescu (3/21) Oct 09 2008 Then the last line will be part of the selection with probability 1.
Russell Lewis (27/58) Oct 09 2008 Sorry, I overlooked the (obvious) fact that this biases things towards

Andrei Alexandrescu (3/20) Oct 09 2008 Yah that works. Nice!

Christopher Wright (15/26) Oct 09 2008 You can't do a uniform random distribution without knowing the length.

Carlos (7/7) Oct 09 2008 : You can't do a uniform random distribution without knowing the length.

Andrei Alexandrescu (4/13) Oct 09 2008 I think you can do even nonuniform distributions. The number of samples

Andrei Alexandrescu (3/16) Oct 09 2008 I think "the number of total samples..." is the correct statement.

Sergey Gromov (38/46) Oct 09 2008 Here's mine

Andrei Alexandrescu (6/62) Oct 09 2008 Sweet! Suggestion: replace Sample's definition with:

Jarrett Billingsley (5/7) Oct 09 2008 ..what? You want to remove most of Phobos' IO capabilities?

Andrei Alexandrescu (5/15) Oct 09 2008 Well I'd like to extend std.stdio with capabilities that meet and exceed...
Denis Koroskin (4/12) Oct 10 2008 Yeah, why need them when Tango is soon to be used with Phobos side-by-si...

Jarrett Billingsley (4/19) Oct 10 2008 That's kind of what I'm thinking. It seems like a waste of time for

Andrei Alexandrescu (7/28) Oct 10 2008 Well Walter wouldn't be working on it anyway. The thing is, if we want

Denis Koroskin (4/34) Oct 10 2008 Great! Then, how about getting familiar with Tango IO and building your ...

Andrei Alexandrescu (4/43) Oct 10 2008 That would make Phobos dependent on Tango, thus running into those

Sergey Gromov (7/71) Oct 10 2008 Good advice, thanks. The Sample definition is unimportant for the

bearophile (6/10) Oct 10 2008 BufferedFile seems the faster in Phobos, but it's 2-3 times slower than ...

Michel Fortin (45/53) Oct 10 2008 My attempt. Works in one pass, and distributes uniformly but only when

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I just ran across a nice little problem that I wanted to share as an 
exercise for the interested in futile pastimes.

The challenge, should you accept it, is to write a program that given a 
number k and a file, outputs k lines picked uniformly at random from the 
file. (If the file has less than k lines, it should all be output.) The 
file's size is unbounded so loading it in memory is not an option. How'd 
you go about it?


Andrei

Oct 09 2008

dsimcha <dsimcha yahoo.com> writes:

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a
 number k and a file, outputs k lines picked uniformly at random from the
 file. (If the file has less than k lines, it should all be output.) The
 file's size is unbounded so loading it in memory is not an option. How'd
 you go about it?
 Andrei

Pretty simple, actually.  Read the file one character at a time.  Get the offset
of the beginning of each line as a ulong or something.  Store this information
in
another file essentially as an on-disk array, where, for example, the first
ulong.sizeof bytes represent the bit pattern for the offset of the first line,
the
next ulong.sizeof bytes represent the bit pattern for the offset of the second
line, etc.  Use plenty of casting tricks.  Shuffle the ulong.sizeof chunks of
this
file the same way you would shuffle an ordinary array, using seeking as the
equivalent to array indexing.  (I never said this was going to be efficient.) 
For
each of the first k offsets after shuffling, print everything in the original
file
from that offset to the first EOL character encountered.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a
 number k and a file, outputs k lines picked uniformly at random from the
 file. (If the file has less than k lines, it should all be output.) The
 file's size is unbounded so loading it in memory is not an option. How'd
 you go about it?
 Andrei

 
 Pretty simple, actually.  Read the file one character at a time.  Get the
offset
 of the beginning of each line as a ulong or something.  Store this information
in
 another file essentially as an on-disk array, where, for example, the first
 ulong.sizeof bytes represent the bit pattern for the offset of the first line,
the
 next ulong.sizeof bytes represent the bit pattern for the offset of the second
 line, etc.  Use plenty of casting tricks.

So this is essentially an index of the file's lines?

 Shuffle the ulong.sizeof chunks of this
 file the same way you would shuffle an ordinary array, using seeking as the
 equivalent to array indexing.

So this would shuffle that index.

 (I never said this was going to be efficient.)  For
 each of the first k offsets after shuffling, print everything in the original
file
 from that offset to the first EOL character encountered.

Can it be done faster and maybe in one pass?


Andrei

Oct 09 2008

Walter Bright <newshound1 digitalmars.com> writes:

Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

1. read the file, counting lines => nlines
2. select k random numbers 0..nlines
3. read the file line by line again, outputting the line if it's one of 
the selected lines

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

 
 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one of 
 the selected lines

No go if the file is a stream. Can it be done in one pass?

Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 Walter Bright wrote:
 
 Andrei Alexandrescu wrote:
 
 The challenge, should you accept it, is to write a program that
 given a number k and a file, outputs k lines picked uniformly at
 random from the file. (If the file has less than k lines, it should
 all be output.) The file's size is unbounded so loading it in memory
 is not an option. How'd you go about it?
 

 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one
 of
 the selected lines

 No go if the file is a stream. Can it be done in one pass?
 
 Andrei
 

read and store the first k lines with there line number, 
if no lines remain, output stored lines
read another line 
replace an existing line with a probability based on the current line and 
the replaced line's location
if no lines remain ...


Now all that remains is to figure out that function

Oct 09 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Andrei Alexandrescu" wrote
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one of 
 the selected lines

 No go if the file is a stream. Can it be done in one pass?

Any bounds on the input k?  If not, then no.  In that case, you must do one 
pass to count the lines, otherwise, you would end up possibly throwing away 
lines that needed to be output to satisfy k.

-Steve

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Steven Schveighoffer wrote:
 "Andrei Alexandrescu" wrote
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one of 
 the selected lines

 No go if the file is a stream. Can it be done in one pass?

 
 Any bounds on the input k?  If not, then no.  In that case, you must do one 
 pass to count the lines, otherwise, you would end up possibly throwing away 
 lines that needed to be output to satisfy k.

k is a "small" number in the sense that you can keep something 
proportional to k in working memory. You shouldn't keep something 
proportional to n in working memory.

Andrei

Oct 09 2008

Walter Bright <newshound1 digitalmars.com> writes:

Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

No, because without knowing the number of lines, you cannot produce a 
uniform distribution.

Unless you're willing to make a guess based on the file size.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Walter Bright wrote:
 Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

 
 No, because without knowing the number of lines, you cannot produce a 
 uniform distribution.
 
 Unless you're willing to make a guess based on the file size.

I am getting revenge for "I never use anything in std.algorithm" :o).

Andrei

Oct 09 2008

Walter Bright <newshound1 digitalmars.com> writes:

Andrei Alexandrescu wrote:
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 No go if the file is a stream. Can it be done in one pass?

 No, because without knowing the number of lines, you cannot produce a 
 uniform distribution.

 Unless you're willing to make a guess based on the file size.

 
 I am getting revenge for "I never use anything in std.algorithm" :o).

Get in line for everyone getting their revenge on me!

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 Walter Bright wrote:
 
 Andrei Alexandrescu wrote:
 
 No go if the file is a stream. Can it be done in one pass?
 

 No, because without knowing the number of lines, you cannot produce a
 uniform distribution.
 
 Unless you're willing to make a guess based on the file size.
 

 I am getting revenge for "I never use anything in std.algorithm" :o).
 
 Andrei
 

"Et tu, Brute!" - (Julius Caesar, Act III, Scene I).

Oct 09 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

 
 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one of 
 the selected lines

Can't this be done in one single pass?

My thoughts were:

1. read the first k lines and assign them to the ones to return (the 
"stored lines")
2. probability of replacing a line: p = ??
3. for each new line, with probability p, replace a stored line (I don't 
know which), and decrease probability (p -= ??)

If you don't decrease the value of p in each iterations, the probability 
of keeping a line in the first group tends to zero.

I don't know the values of ??... maybe someone else can find them?

(I don't know how to randomly choose k elements of n, n >= k).

Oct 09 2008

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar>  
wrote:

 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given  
 a number k and a file, outputs k lines picked uniformly at random from  
 the file. (If the file has less than k lines, it should all be  
 output.) The file's size is unbounded so loading it in memory is not  
 an option. How'd you go about it?

  1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one of  
 the selected lines

 Can't this be done in one single pass?

 My thoughts were:

 1. read the first k lines and assign them to the ones to return (the  
 "stored lines")

What if there is only one line in a file? Andrei told that

 The file's size is unbounded so loading it in memory is not an option.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig 
 <ary esperanto.org.ar> wrote:
 
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?

  1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one 
 of the selected lines

 Can't this be done in one single pass?

 My thoughts were:

 1. read the first k lines and assign them to the ones to return (the 
 "stored lines")

 
 What if there is only one line in a file? Andrei told that
 
 The file's size is unbounded so loading it in memory is not an option.


Keeping k of the lines in memory is fine.

Andrei

Oct 09 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Denis Koroskin wrote:
 On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig 
 <ary esperanto.org.ar> wrote:
 
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?

  1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one 
 of the selected lines

 Can't this be done in one single pass?

 My thoughts were:

 1. read the first k lines and assign them to the ones to return (the 
 "stored lines")

 
 What if there is only one line in a file? Andrei told that

If you can't read more lines in this step, the answer are all the lines.

 
 The file's size is unbounded so loading it in memory is not an option.

Oct 09 2008

KennyTM~ <kennytm gmail.com> writes:

Ary Borenszweig wrote:
 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?

 1. read the file, counting lines => nlines
 2. select k random numbers 0..nlines
 3. read the file line by line again, outputting the line if it's one 
 of the selected lines

 
 Can't this be done in one single pass?
 
 My thoughts were:
 
 1. read the first k lines and assign them to the ones to return (the 
 "stored lines")
 2. probability of replacing a line: p = ??
 3. for each new line, with probability p, replace a stored line (I don't 
 know which), and decrease probability (p -= ??)
 
 If you don't decrease the value of p in each iterations, the probability 
 of keeping a line in the first group tends to zero.
 
 I don't know the values of ??... maybe someone else can find them?
 
 (I don't know how to randomly choose k elements of n, n >= k).

I guess p depends on the number of lines unfortunately.
Because p needs to be decreasing with line count, otherwise the first 
few line will almost always be replaced (not uniform then).

Oct 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

It's an interesting problem, I can offer two solutions, the first one is the
one I use in practice, that is use a scripting language (but the same exact
program can be written in D, using my libs it becomes very similar) that loads
the file two times, the first to count the lines, then creates an array of
randomly chosen indexes (without replacement!) and uses them to print the
chosen ones: 

from sys import argv
from random import sample

filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))

if k >= nlines:
    for line in file(filename):
        print line
else:
    samples = set(sample(xrange(nlines), k))
    for i, line in enumerate(file(filename)):
        if i in samples:
            print line

If your practical tests show you that reading your file twice is too much slow,
then you may want to look for a more complex solution.

I'll think about the more complex problem in the next post.

Later,
bearophile

Oct 09 2008

bearophile <bearophileHUGS lycos.com> writes:

I am not reading the anwers written by others, of course :-) With the help of
"Programming Pearls" here is my second version, that spares the memory required
for the chosen ones, so this code runs with very little memory:

from sys import argv
from random import random

filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))

if k >= nlines:
    for line in file(filename):
        print line
else:
    select = k
    remaining = nlines

    for line in file(filename):
        if random() < float(select) / remaining:
            print line
            select -= 1
        remaining -= 1

I'll think for a solution that avoids reading the file twice then...

Bye,
bearophile

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 I am not reading the anwers written by others, of course :-) With the help of
"Programming Pearls" here is my second version, that spares the memory required
for the chosen ones, so this code runs with very little memory:
 
 from sys import argv
 from random import random
 
 filename = argv[1]
 k = int(argv[2])
 nlines = sum(1 for _ in file(filename))
 
 if k >= nlines:
     for line in file(filename):
         print line
 else:
     select = k
     remaining = nlines
 
     for line in file(filename):
         if random() < float(select) / remaining:
             print line
             select -= 1
         remaining -= 1
 
 I'll think for a solution that avoids reading the file twice then...
 
 Bye,
 bearophile

This is unfair. First line in the original file is included with 
probability k / n. The probability of including the second line is 
dependent on the event of including the first line. If it wasn't 
included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines 
should be given equal chance.

Andrei

Oct 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 This is unfair. First line in the original file is included with 
 probability k / n. The probability of including the second line is 
 dependent on the event of including the first line. If it wasn't 
 included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines 
 should be given equal chance.

If each of them have the same chance, then you can't guarantee the result has k
items. So you must have a biased choice.

You can run this code:

from random import random
from collections import defaultdict

def main(freqs, nrepeats):
    for i in xrange(nrepeats):
        chosen = []
        select = k
        remaining = n

        for value in values:
            if random() < (float(select) / remaining):
                chosen.append(value)
                select -= 1
            remaining -= 1

        for c in chosen:
            freqs[c] += 1

    return freqs

import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]


Its output:

[1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475,
1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467,
1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439,
1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525,
1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503,
1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507,
1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472,
1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526]

It shows the values are balanced.

If that little benchmark isn't enough, my code is based on algorithm S by
Knuth, you can take a look on its Art of computer programming for a
demonstration that it work correctly.

Bye,
bearophile

Oct 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Third solution, this requires a storage of k lines (but you can keep this
storage on disk):

from sys import argv
from random import random, randrange


filename = argv[1]
k = int(argv[2])
assert k > 0

chosen_lines = []
for i, line in enumerate(file(filename)):
    if i < k:
        chosen_lines.append(line)
    else:
        if random() < (1.0 / (i+1)):
            chosen_lines[randrange(k)] = line

print chosen_lines

Now I'll look for possible bugs in this third version and to the problem Andrei
has just told me :-)

Bye,
bearophile

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep this
storage on disk):
 
 from sys import argv
 from random import random, randrange

 
 filename = argv[1]
 k = int(argv[2])
 assert k > 0
 
 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line
 
 print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why 
this works.

Andrei

Oct 09 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 
 We have a winner!!! There is actually a very simple proof on how and why 
 this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, 
b and c.

What's the probability that c belongs to the result? It's "1.0 / (i+1)", 
where i = 2, so 1/3.

What's the probability that a does not belong to the result? Well, c 
must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. 
So it's 1/3 * 1/2 = 1/6.

What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.

What am I doing wrong? :-(

 
 Andrei

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how and 
 why this works.

 
 Say you want 2 lines from a file that has 3 lines. Say the lines are a, 
 b and c.
 
 What's the probability that c belongs to the result? It's "1.0 / (i+1)", 
 where i = 2, so 1/3.
 
 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. 
 So it's 1/3 * 1/2 = 1/6.
 
 What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
 
 What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

Andrei

Oct 09 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can keep 
 this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how and 
 why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines are 
 a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.

 What am I doing wrong? :-(

 
 Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines.

p(a) = 5/6
p(b) = 5/6
p(c) = 1/3

That doesn't seem uniform.

In another post, Kirk says: "Of the remaining 2 out of 3 chances, there 
is a 50% chance the second line will be chosen, and a 50% chance of the 
first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

 
 Andrei

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how and 
 why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines are 
 a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 
 ... which is the last step in this case. There are only three lines.
 
 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3
 
 That doesn't seem uniform.
 
 In another post, Kirk says: "Of the remaining 2 out of 3 chances, there 
 is a 50% chance the second line will be chosen, and a 50% chance of the 
 first line". Why "of the remaining"? It's in that 1 out of 3 chance, or 
 not?

Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
3.0. This is because c has the "right" to sit equally in either of the k 
positions. If code doesn't do that, there's a bug in it.

Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, 
as it should.


Andrei

Oct 09 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how 
 and why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines are 
 a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 
 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 ... which is the last step in this case. There are only three lines.

 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3

 That doesn't seem uniform.

 In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
 there is a 50% chance the second line will be chosen, and a 50% chance 
 of the first line". Why "of the remaining"? It's in that 1 out of 3 
 chance, or not?

 
 Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
 3.0. This is because c has the "right" to sit equally in either of the k 
 positions. If code doesn't do that, there's a bug in it.
 
 Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, 
 as it should.

Ah, ok. It works if you do that. So bearophile's code must say

if random() > (1.0 / (i+1)):

instead of

if random() < (1.0 / (i+1)):

(that is, if random() returns a real number between 0 and 1 with uniform 
distribution)

 
 
 Andrei

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ary Borenszweig wrote:
 Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how 
 and why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines 
 are a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, 
 c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must 
 choose 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 
 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 ... which is the last step in this case. There are only three lines.

 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3

 That doesn't seem uniform.

 In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
 there is a 50% chance the second line will be chosen, and a 50% 
 chance of the first line". Why "of the remaining"? It's in that 1 out 
 of 3 chance, or not?

 Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
 3.0. This is because c has the "right" to sit equally in either of the 
 k positions. If code doesn't do that, there's a bug in it.

 Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 
 1.0/3.0, as it should.

 
 Ah, ok. It works if you do that. So bearophile's code must say
 
 if random() > (1.0 / (i+1)):
 
 instead of
 
 if random() < (1.0 / (i+1)):
 
 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

Andrei

Oct 10 2008

Ary Borenszweig <ary esperanto.org.ar> writes:

Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how 
 and why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines 
 are a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, 
 c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must 
 choose 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 
 = 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 ... which is the last step in this case. There are only three lines.

 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3

 That doesn't seem uniform.

 In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
 there is a 50% chance the second line will be chosen, and a 50% 
 chance of the first line". Why "of the remaining"? It's in that 1 
 out of 3 chance, or not?

 Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
 3.0. This is because c has the "right" to sit equally in either of 
 the k positions. If code doesn't do that, there's a bug in it.

 Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 
 1.0/3.0, as it should.

 Ah, ok. It works if you do that. So bearophile's code must say

 if random() > (1.0 / (i+1)):

 instead of

 if random() < (1.0 / (i+1)):

 (that is, if random() returns a real number between 0 and 1 with 
 uniform distribution)

 
 His prize gets retired. Now how does Kirk's code fare? :o)

I don't have Kirk's code here at work (I told Thunderbird not to 
download everything I had in my home)... but if I remember correctly, it 
had the same problem, it was something like:

if (0 == uniform(0, k)):

or something like that, and should be 0 != ...

I wonder why both used the wrong probability...

(that's why I made the math, because I wasn't sure with that probability 
it worked)

I really liked the problem! It has a very simple and elegant solution... 
I thought the probability would be something more complex, but it turned 
out to be simple. :-)

Oct 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say
 
 if random() > (1.0 / (i+1)):
 
 instead of
 
 if random() < (1.0 / (i+1)):
 
 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

 
 His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests...

This simple benchmark shows that my third version is wrong in both ways, and it
keeps being wrong even if you remove the chosen.append(line), etc.

The problem isn't easy, this paper shows that it's not easy to create an
efficient implementation even when the total line count is known:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

from random import random, randrange
from collections import defaultdict

def main(k, values, freqs, nrepeats):
    for i in xrange(nrepeats):
        chosen = [None] * k
        for i, line in enumerate(values):
            if i < k:
                chosen.append(line)
            else:
                if random() < (1.0 / (i+1)):
                    chosen[randrange(k)] = line

        for c in chosen:
            freqs[c] += 1

    return freqs

import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]


My first version works quite probably, and probably the second one too.

I think there are ways to solve this problem keeping only k lines, but the code
isn't obvious.

Bye,
bearophile

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say

 if random() > (1.0 / (i+1)):

 instead of

 if random() < (1.0 / (i+1)):

 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

 His prize gets retired. Now how does Kirk's code fare? :o)

 
 I don't like contests...
 
 This simple benchmark shows that my third version is wrong in both ways, and
it keeps being wrong even if you remove the chosen.append(line), etc.
 
 The problem isn't easy, this paper shows that it's not easy to create an
efficient implementation even when the total line count is known:
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

No, that's a different problem. They want to extract n random samples 
from N records *without replacement*. It's a very different ballgame. 
See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html

The problem as I stated is reasonably easy and my implementation should 
be correct.


Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
 Andrei Alexandrescu:
 Ah, ok. It works if you do that. So bearophile's code must say
 
 if random() > (1.0 / (i+1)):
 
 instead of
 
 if random() < (1.0 / (i+1)):
 
 (that is, if random() returns a real number between 0 and 1 with uniform 
 distribution)

 
 His prize gets retired. Now how does Kirk's code fare? :o)

 
 I don't like contests...
 
 This simple benchmark shows that my third version is wrong in both ways, and
it keeps being wrong even if you remove the chosen.append(line), etc.
 
 The problem isn't easy, this paper shows that it's not easy to create an
efficient implementation even when the total line count is known:
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
 
 from random import random, randrange
 from collections import defaultdict
 
 def main(k, values, freqs, nrepeats):
     for i in xrange(nrepeats):
         chosen = [None] * k
         for i, line in enumerate(values):
             if i < k:
                 chosen.append(line)
             else:
                 if random() < (1.0 / (i+1)):
                     chosen[randrange(k)] = line
 
         for c in chosen:
             freqs[c] += 1
 
     return freqs
 
 import psyco; psyco.full()
 k = 15
 n = 100
 values = range(n)
 freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
 print [freqs.get(i, 0) for i in xrange(n)]
 
 
 My first version works quite probably, and probably the second one too.
 
 I think there are ways to solve this problem keeping only k lines, but the
code isn't obvious.

I think your algorithm has one mistake.  The last line of your text gets 
into the k with probability 1/n while it should be k/n.  Replace 1.0 
with 1.0*k and the distribution will be absolutely (theoretically) 
uniform.

Oct 10 2008

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how and 
 why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines are 
 a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 
 ... which is the last step in this case. There are only three lines.
 
 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3
 
 That doesn't seem uniform.
 
 In another post, Kirk says: "Of the remaining 2 out of 3 chances, there 
 is a 50% chance the second line will be chosen, and a 50% chance of the 
 first line". Why "of the remaining"? It's in that 1 out of 3 chance, or 
 not?
 

To reiterate the code I was talking about (the solution for k == 1):

from random import randint

chosen_line = None

for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

print chosen_line

Assume there are three lines in the file. The algorithm churns through 
to the last line. We have the following state:

i == 2
chosen_line = <one of the first two lines>

Then we call randint(0, 2). This gives us a 1 out of 3 chance of 
chosen_line being the last line (if the random number is 0). Therefore, 
there is also a 2 out of 3 chance of it being some *other* line (if the 
random number is 1 or 2).

It turns out there is an equal chance for it being either of the two 
lines. The algorithm starts on the first line, and does randint(0, 0), 
meaning chosen_line is always set to the first line initially. On the 
second line, it does randint(0, 1), which means it has a 50% chance of 
changing chosen_line to the second line.

The math continues to work out in this way as the number of lines 
increases without bound.

-- 
Kirk McDonald

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kirk McDonald wrote:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 Ary Borenszweig wrote:
 Andrei Alexandrescu escribi�:
 bearophile wrote:
 Third solution, this requires a storage of k lines (but you can 
 keep this storage on disk):

 from sys import argv
 from random import random, randrange


 filename = argv[1]
 k = int(argv[2])
 assert k > 0

 chosen_lines = []
 for i, line in enumerate(file(filename)):
     if i < k:
         chosen_lines.append(line)
     else:
         if random() < (1.0 / (i+1)):
             chosen_lines[randrange(k)] = line

 print chosen_lines

 We have a winner!!! There is actually a very simple proof on how 
 and why this works.

 Say you want 2 lines from a file that has 3 lines. Say the lines are 
 a, b and c.

 What's the probability that c belongs to the result? It's "1.0 / 
 (i+1)", where i = 2, so 1/3.

 What's the probability that a does not belong to the result? Well, c 
 must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
 0. So it's 1/3 * 1/2 = 1/6.

 What's the probability that a belongs to the result? It's 1 - 1/6 = 
 5/6.

 What am I doing wrong? :-(

 Nothing except you stop the induction at step 3...

 ... which is the last step in this case. There are only three lines.

 p(a) = 5/6
 p(b) = 5/6
 p(c) = 1/3

 That doesn't seem uniform.

 In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
 there is a 50% chance the second line will be chosen, and a 50% chance 
 of the first line". Why "of the remaining"? It's in that 1 out of 3 
 chance, or not?

 
 To reiterate the code I was talking about (the solution for k == 1):
 
 from random import randint
 
 chosen_line = None
 
 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line
 
 print chosen_line
 
 Assume there are three lines in the file. The algorithm churns through 
 to the last line. We have the following state:
 
 i == 2
 chosen_line = <one of the first two lines>
 
 Then we call randint(0, 2). This gives us a 1 out of 3 chance of 
 chosen_line being the last line (if the random number is 0). Therefore, 
 there is also a 2 out of 3 chance of it being some *other* line (if the 
 random number is 1 or 2).
 
 It turns out there is an equal chance for it being either of the two 
 lines. The algorithm starts on the first line, and does randint(0, 0), 
 meaning chosen_line is always set to the first line initially. On the 
 second line, it does randint(0, 1), which means it has a 50% chance of 
 changing chosen_line to the second line.
 
 The math continues to work out in this way as the number of lines 
 increases without bound.

That's why I thought it's such a beautiful problem. My proof is 
inductive over n. The invariant is that the k lines held represent a 
uniform selection of the n lines seen so far at any moment.

Base: For n <= k, vacuously satisfied.

Inductive step: We have a random selection of n lines so far in array 
selection, and we add one more line. We need to prove that the new 
selection is also uniform. The new line has probability k / (n + 1) of 
being part of the selection. We toss a skewed coin with that 
probability. If it turns heads, we pick the new element for keeping. Any 
element in the array should go with equal probability (by the inductive 
assumption), so we roll a k-sided dice to choose one element to throw away.


Andrei

Oct 09 2008

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei


Based on your later stated requirements, use a heap.  Insert each line into the
heap using a random number as the key.  After pre-loading the first k elements,
each insert into the heap is followed by the removal of the head element.

Performance is O(characters in file + log2(k)*lines in file)

Obviously, the random number generator needs a range of values that exceeds the
lines in the file by a good margin, or else you'll get biases in what gets kept.

Oct 09 2008

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

This is a classic interview question. The solution for k == 1 is easy:

from random import randint

chosen_line = None

for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

print chosen_line

(It is worth noting that randint() operates over an inclusive range.)

If you do the math, this works out to a uniform distribution. For 
instance, say the file has three lines. Once we read in the third line, 
there is a 1 out of 3 chance that it will be picked as the chosen_line. 
Of the remaining 2 out of 3 chances, there is a 50% chance the second 
line will be chosen, and a 50% chance of the first line.

Doing this for k > 1 becomes more complicated. We start by reading in 
the first k lines of the file. (And if we run out of lines before that, 
we're done.)

import itertools
from random import randint

f = open('filename')

k_lines = list(itertools.islice(f, k))



for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line

for line in k_lines:
     print line

This is my first crack at a solution. I am not sure how close to a 
uniform distribution this works out to.

-- 
Kirk McDonald

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

 
 This is a classic interview question. The solution for k == 1 is easy:
 
 from random import randint
 
 chosen_line = None
 
 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line
 
 print chosen_line
 
 (It is worth noting that randint() operates over an inclusive range.)
 
 If you do the math, this works out to a uniform distribution. For 
 instance, say the file has three lines. Once we read in the third line, 
 there is a 1 out of 3 chance that it will be picked as the chosen_line. 
 Of the remaining 2 out of 3 chances, there is a 50% chance the second 
 line will be chosen, and a 50% chance of the first line.
 
 Doing this for k > 1 becomes more complicated. We start by reading in 
 the first k lines of the file. (And if we run out of lines before that, 
 we're done.)
 
 import itertools
 from random import randint
 
 f = open('filename')
 
 k_lines = list(itertools.islice(f, k))
 


 for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line
 
 for line in k_lines:
     print line
 
 This is my first crack at a solution. I am not sure how close to a 
 uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the 
loop. It should start at k + 1.

But where's the D code you guys? D is better than Python at scripting. Ahem.


Andrei

Oct 09 2008

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

 This is a classic interview question. The solution for k == 1 is easy:

 from random import randint

 chosen_line = None

 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

 print chosen_line

 (It is worth noting that randint() operates over an inclusive range.)

 If you do the math, this works out to a uniform distribution. For 
 instance, say the file has three lines. Once we read in the third 
 line, there is a 1 out of 3 chance that it will be picked as the 
 chosen_line. Of the remaining 2 out of 3 chances, there is a 50% 
 chance the second line will be chosen, and a 50% chance of the first 
 line.

 Doing this for k > 1 becomes more complicated. We start by reading in 
 the first k lines of the file. (And if we run out of lines before 
 that, we're done.)

 import itertools
 from random import randint

 f = open('filename')

 k_lines = list(itertools.islice(f, k))



 for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line

 for line in k_lines:
     print line

 This is my first crack at a solution. I am not sure how close to a 
 uniform distribution this works out to.

 
 This looks good too, but I'm not sure where i will start from in the 
 loop. It should start at k + 1.
 
 But where's the D code you guys? D is better than Python at scripting. 
 Ahem.
 
 
 Andrei

No, i should start at k (which it does, that is what the start=k 
parameter to enumerate() is for). This is because the range passed to 
randint() starts at 0. The islice function (like slices in D and indeed 
regular slices in Python) uses an [inclusive, exclusive) range. Thus, 
itertools.islice(f, k) will grab the first k lines from the file. (The 
opening index of 0 is implied.)

islice is a function for "slicing" arbitrary iterable objects, 
particularly ones that can't normally be sliced. By slicing the first k 
lines from the file in this way, we also consume the first k lines from 
the file object, and so the for loop starts on line k, which is why I 
pass start=k to enumerate(). (Remember, I am counting lines from 0.)

-- 
Kirk McDonald

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that 
 given a number k and a file, outputs k lines picked uniformly at 
 random from the file. (If the file has less than k lines, it should 
 all be output.) The file's size is unbounded so loading it in memory 
 is not an option. How'd you go about it?


 Andrei

 This is a classic interview question. The solution for k == 1 is easy:

 from random import randint

 chosen_line = None

 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

 print chosen_line

 (It is worth noting that randint() operates over an inclusive range.)

 If you do the math, this works out to a uniform distribution. For 
 instance, say the file has three lines. Once we read in the third 
 line, there is a 1 out of 3 chance that it will be picked as the 
 chosen_line. Of the remaining 2 out of 3 chances, there is a 50% 
 chance the second line will be chosen, and a 50% chance of the first 
 line.

 Doing this for k > 1 becomes more complicated. We start by reading in 
 the first k lines of the file. (And if we run out of lines before 
 that, we're done.)

 import itertools
 from random import randint

 f = open('filename')

 k_lines = list(itertools.islice(f, k))



 for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line

 for line in k_lines:
     print line

 This is my first crack at a solution. I am not sure how close to a 
 uniform distribution this works out to.

 This looks good too, but I'm not sure where i will start from in the 
 loop. It should start at k + 1.

 But where's the D code you guys? D is better than Python at scripting. 
 Ahem.


 Andrei

 
 No, i should start at k (which it does, that is what the start=k 
 parameter to enumerate() is for). This is because the range passed to 
 randint() starts at 0. The islice function (like slices in D and indeed 
 regular slices in Python) uses an [inclusive, exclusive) range. Thus, 
 itertools.islice(f, k) will grab the first k lines from the file. (The 
 opening index of 0 is implied.)
 
 islice is a function for "slicing" arbitrary iterable objects, 
 particularly ones that can't normally be sliced. By slicing the first k 
 lines from the file in this way, we also consume the first k lines from 
 the file object, and so the for loop starts on line k, which is why I 
 pass start=k to enumerate(). (Remember, I am counting lines from 0.)
 

Sounds good.

Andrei

Oct 09 2008

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 But where's the D code you guys?

My third converted to D, using my libs:

import std.conv, d.all;

void main(string[] args) {
    assert(args.length == 3);
    string filename = args[1];
    int k = toInt(args[2]);
    assert (k > 0);

    string[] chosen_lines;
    foreach (i, line; xfile(filename))
        if (i < k)
            chosen_lines ~= line;
        else
            if (fastRandom() < (1.0 / (i+1)))
                chosen_lines[randInt(k-1)] = line;

    putr(chosen_lines);
}


 D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than
Python to create working prototypes.

Bye,
bearophile

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 But where's the D code you guys?

 
 My third converted to D, using my libs:
 
 import std.conv, d.all;
 
 void main(string[] args) { assert(args.length == 3); string filename
 = args[1]; int k = toInt(args[2]); assert (k > 0);
 
 string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) 
 chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) 
 chosen_lines[randInt(k-1)] = line;
 
 putr(chosen_lines); }
 
 
 D is better than Python at scripting. Ahem.

 
 I know many languages, and so far I have never found something better
 than Python to create working prototypes.

True men only write D. Your solution looks eerily similar to mine.



import std.stdio, std.contracts, std.conv, std.random;

void main(string[] args) {
     invariant k = parse!(uint)(args[1]);
     enforce(k > 0, "Must pass a strictly positive selection size");
     string[] selection;
     auto gen = Random(unpredictableSeed);

     foreach (ulong tally, char[] line; lines(stdin)) {
         if (selection.length < k) {
             // Selection not full; add to it
             selection ~= line.idup;
         } else {
             auto t = uniform(gen, 0, tally + 1);
             if (t > k) continue; // no luck
             // Replace a random element in the selection
             selection[uniform(gen, 0, k - 1)] = line.idup;
         }
     }

     // Print selection
     foreach (s; selection) write(s);
}

Oct 09 2008

Christopher Wright <dhasenan gmail.com> writes:

Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

Out of curiosity, why are you using enforce rather than assert?

I've written an assertions framework, but it provides some more options 
than assert:

expect.because ("Must pass a strictly positive selection size").that 
(k).greaterThan (0);

If that failed, it would produce:
AssertionException: Expected: greater than 0 but was: -4 -- Must pass a 
strictly positive selection size

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Christopher Wright wrote:
 Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

 
 Out of curiosity, why are you using enforce rather than assert?

Enforce is never ignored, even in production builds. You wouldn't want a 
release build of your app to skip checking command line arguments.

 I've written an assertions framework, but it provides some more options 
 than assert:
 
 expect.because ("Must pass a strictly positive selection size").that 
 (k).greaterThan (0);
 
 If that failed, it would produce:
 AssertionException: Expected: greater than 0 but was: -4 -- Must pass a 
 strictly positive selection size

That looks pretty darn nice.


Andrei

Oct 10 2008

Christopher Wright <dhasenan gmail.com> writes:

Andrei Alexandrescu wrote:
 Christopher Wright wrote:
 Andrei Alexandrescu wrote:
     enforce(k > 0, "Must pass a strictly positive selection size");

 Out of curiosity, why are you using enforce rather than assert?

 
 Enforce is never ignored, even in production builds. You wouldn't want a 
 release build of your app to skip checking command line arguments.

True -- I forgot about that.

 I've written an assertions framework, but it provides some more 
 options than assert:

 expect.because ("Must pass a strictly positive selection size").that 
 (k).greaterThan (0);

 If that failed, it would produce:
 AssertionException: Expected: greater than 0 but was: -4 -- Must pass 
 a strictly positive selection size

 
 That looks pretty darn nice.
 
 
 Andrei

If you want it, the code's in a public subversion repository:
http://svn.dsource.org/projects/dmocks/dunit/trunk/
modules dunit.expect and dunit.constraint

Currently BSD licensed, but if anyone wants it in a different license, 
that's up for negotiation. Though I think it would take some effort to 
make it work with Phobos.

Oct 11 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

 
 This is a classic interview question. The solution for k == 1 is easy:
 
 from random import randint
 
 chosen_line = None
 
 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line
 
 print chosen_line
 
 (It is worth noting that randint() operates over an inclusive range.)
 
 If you do the math, this works out to a uniform distribution. For 
 instance, say the file has three lines. Once we read in the third line, 
 there is a 1 out of 3 chance that it will be picked as the chosen_line. 
 Of the remaining 2 out of 3 chances, there is a 50% chance the second 
 line will be chosen, and a 50% chance of the first line.
 
 Doing this for k > 1 becomes more complicated. We start by reading in 
 the first k lines of the file. (And if we run out of lines before that, 
 we're done.)
 
 import itertools
 from random import randint
 
 f = open('filename')
 
 k_lines = list(itertools.islice(f, k))
 


 for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line
 
 for line in k_lines:
     print line
 
 This is my first crack at a solution. I am not sure how close to a 
 uniform distribution this works out to.
 

Oh forgot to mention this is superior numerically to bearophile's
because it's not affected by floating point quantization vagaries. But 
it came later too...

Andrei

Oct 09 2008

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Andrei Alexandrescu wrote:
 Kirk McDonald wrote:
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random 
 from the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

 This is a classic interview question. The solution for k == 1 is easy:

 from random import randint

 chosen_line = None

 for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

 print chosen_line

 (It is worth noting that randint() operates over an inclusive range.)

 If you do the math, this works out to a uniform distribution. For 
 instance, say the file has three lines. Once we read in the third 
 line, there is a 1 out of 3 chance that it will be picked as the 
 chosen_line. Of the remaining 2 out of 3 chances, there is a 50% 
 chance the second line will be chosen, and a 50% chance of the first 
 line.

 Doing this for k > 1 becomes more complicated. We start by reading in 
 the first k lines of the file. (And if we run out of lines before 
 that, we're done.)

 import itertools
 from random import randint

 f = open('filename')

 k_lines = list(itertools.islice(f, k))



 for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line

 for line in k_lines:
     print line

 This is my first crack at a solution. I am not sure how close to a 
 uniform distribution this works out to.

 
 Oh forgot to mention this is superior numerically to bearophile's
 because it's not affected by floating point quantization vagaries. But 
 it came later too...
 
 Andrei
 

To be fair, I started writing it before bearophile posted his, and 
indeed did not see his solution until after I had posted it. :-)

-- 
Kirk McDonald

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kirk McDonald wrote:
 To be fair, I started writing it before bearophile posted his, and 
 indeed did not see his solution until after I had posted it. :-)

I think you forgot to attach the D solution. ;o)

Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given
 a number k and a file, outputs k lines picked uniformly at random from
 the file. (If the file has less than k lines, it should all be
 output.) The file's size is unbounded so loading it in memory is not
 an option. How'd you go about it?
 
 Andrei
 

struct S
{
 char[] line;
 real v;
 int opCmp(S s)
 {
  if(s.v < this.v) return -1;
  else if(s.v > this.v) return +1;
  return 0;
 }
}

int k;
S[] buf = new S[k+1];

foreach(s;buf) s.v = real.max;

foreach(char[] line; stream)
{
    delete buf[$-1].line;
    buf[$-1].line = line.dup;;
    buf[$-1].v = reaqlRand();  // returns v > real.min and v < real.max
    buf.sort;
}

foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);

Oct 09 2008

Jason House <jason.james.house gmail.com> writes:

BCS Wrote:

 Reply to Andrei,
 
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given
 a number k and a file, outputs k lines picked uniformly at random from
 the file. (If the file has less than k lines, it should all be
 output.) The file's size is unbounded so loading it in memory is not
 an option. How'd you go about it?
 
 Andrei
 

 
 struct S
 {
  char[] line;
  real v;
  int opCmp(S s)
  {
   if(s.v < this.v) return -1;
   else if(s.v > this.v) return +1;
   return 0;
  }
 }
 
 int k;
 S[] buf = new S[k+1];
 
 foreach(s;buf) s.v = real.max;
 
 foreach(char[] line; stream)
 {
     delete buf[$-1].line;
     buf[$-1].line = line.dup;;
     buf[$-1].v = reaqlRand();  // returns v > real.min and v < real.max
     buf.sort;
 }
 
 foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
 
 

You should check if your new random number is good enough to keep before doing
the array manipulation and string dup... That small change should yield an
near-optimal big O. A heap should give slightly faster performance for moderate
values of k.

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Jason,

 BCS Wrote:
 
 Reply to Andrei,
 
 I just ran across a nice little problem that I wanted to share as an
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that
 given a number k and a file, outputs k lines picked uniformly at
 random from the file. (If the file has less than k lines, it should
 all be output.) The file's size is unbounded so loading it in memory
 is not an option. How'd you go about it?
 
 Andrei
 

 struct S
 {
 char[] line;
 real v;
 int opCmp(S s)
 {
 if(s.v < this.v) return -1;
 else if(s.v > this.v) return +1;
 return 0;
 }
 }
 int k;
 S[] buf = new S[k+1];
 foreach(s;buf) s.v = real.max;
 
 foreach(char[] line; stream)
 {
 delete buf[$-1].line;
 buf[$-1].line = line.dup;;
 buf[$-1].v = reaqlRand();  // returns v > real.min and v < real.max
 buf.sort;
 }
 foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
 

 You should check if your new random number is good enough to keep
 before doing the array manipulation and string dup... That small
 change should yield an near-optimal big O. A heap should give slightly
 faster performance for moderate values of k.
 

I was going for simplicity and ease of reading within a given big-O class.

Switching to insertion sort or reusing data buffers would also do better.

Oct 09 2008

Russell Lewis <webmaster villagersonline.com> writes:

BEGIN CODE
char[][] choose(int k)
{
   char[][] bufs; bufs.length = k;

   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();

   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }

   return bufs;
}

END CODE

Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Russell Lewis wrote:
 BEGIN CODE
 char[][] choose(int k)
 {
   char[][] bufs; bufs.length = k;
 
   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();
 
   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }
 
   return bufs;
 }
 
 END CODE

Then the last line will be part of the selection with probability 1.

Andrei

Oct 09 2008

Russell Lewis <webmaster villagersonline.com> writes:

Sorry, I overlooked the (obvious) fact that this biases things towards 
the end of the file.  Change the while loop as follows:

BEGIN CODE
   int count = k;
   while(!feof(stdin))
   {
     if(random(0,count) >= k)
       ReadALine();   // this discards the line
     else
     {
       int i = random(0,k);
       bufs[i] = ReadALine();
     }
     count++;
   }
END CODE

Note that I'm assuming that random(a,b) is a function that randomly 
returns an integer in the range [a,b).

This new version works because the most-recently-read line has a k/count 
  chance of getting picked up into the current working set, and when we 
do choose to discard some line, we give all of the old lines an equal 
chance of survival.

I don't have to to formally work out the math right now, but I'm pretty 
sure that that would give each line in the entire file a k/count chance 
of being in the set at any given point in the process.

Russ

Russell Lewis wrote:
 BEGIN CODE
 char[][] choose(int k)
 {
   char[][] bufs; bufs.length = k;
 
   for(int i=0; i<k; i++)
     bufs[i] = ReadALine();
 
   while(!feof(stdin))
   {
     int i = random(0,k);
     bufs[i] = ReadALine();
   }
 
   return bufs;
 }
 
 END CODE
 
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given 
 a number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be 
 output.) The file's size is unbounded so loading it in memory is not 
 an option. How'd you go about it?


 Andrei

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Russell Lewis wrote:
 Sorry, I overlooked the (obvious) fact that this biases things towards 
 the end of the file.  Change the while loop as follows:
 
 BEGIN CODE
   int count = k;
   while(!feof(stdin))
   {
     if(random(0,count) >= k)
       ReadALine();   // this discards the line
     else
     {
       int i = random(0,k);
       bufs[i] = ReadALine();
     }
     count++;
   }
 END CODE

Yah that works. Nice!

Andrei

Oct 09 2008

Christopher Wright <dhasenan gmail.com> writes:

Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?
 
 
 Andrei

You can't do a uniform random distribution without knowing the length.

Your options are:
- Hold the entire file in memory.
- Hold the entire file so far in memory. Once you get over a certain 
threshold, remove lines randomly.

You have to tweak the random distribution for the axing method to make 
sure it's uniform. If this is going to be used often on huge data sets, 
that is a reasonable effort.

How would you need to tweak the random distribution? The easiest way is 
to get two random numbers: what section to remove lines from and which 
line in that section to remove stuff from. You can work out some 
annoying recurrence to find out how to get the probability of axing from 
one section given that it has been available for k rounds.

I might work out the recurrence tonight, but I don't have a chalkboard.

Oct 09 2008

"Carlos" <carlos-smith sympatico.ca> writes:

: You can't do a uniform random distribution without knowing the length.

Probably true for other distribution.
Most certainly not true for uniform distribution (with a raisonable k)

You can work on a subset of the file. Let say 1000 records.
The distribution being uniform, you can select (or eliminate),
a percentage of each subset and the results for the whole file
will be ok.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Carlos wrote:
 : You can't do a uniform random distribution without knowing the length.
 
 Probably true for other distribution.
 Most certainly not true for uniform distribution (with a raisonable k)
 
 You can work on a subset of the file. Let say 1000 records.
 The distribution being uniform, you can select (or eliminate),
 a percentage of each subset and the results for the whole file
 will be ok.

I think you can do even nonuniform distributions. The number of samples 
seen should not influence your subsampling decision.

Andrei

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Andrei Alexandrescu wrote:
 Carlos wrote:
 : You can't do a uniform random distribution without knowing the length.

 Probably true for other distribution.
 Most certainly not true for uniform distribution (with a raisonable k)

 You can work on a subset of the file. Let say 1000 records.
 The distribution being uniform, you can select (or eliminate),
 a percentage of each subset and the results for the whole file
 will be ok.

 
 I think you can do even nonuniform distributions. The number of samples 
 seen should not influence your subsampling decision.

I think "the number of total samples..." is the correct statement.

Andrei

Oct 09 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Thu, 09 Oct 2008 13:56:48 -0500,
Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

Here's mine

module k_sample;

import std.conv: to;
import std.stream: BufferedFile;
import std.random: Random, uniform, unpredictableSeed;
import std.stdio: writefln;

void main(string[] args)
{
    auto f = new BufferedFile(args[1]);
    auto k = to!(int)(args[2]);
    assert(k > 0);

    struct Sample
    {
        string text;
        ulong line;
        this(string s, ulong n)
        {
            text = s;
            line = n;
        }
    }

    Sample[] samples;
    auto gen = Random(unpredictableSeed);

    foreach (ulong n, char[] s; f)
    {
        if (samples.length < k)
            samples ~= Sample(s.idup, n+1);
        else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
            samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
    }

    foreach (sample; samples)
        writefln("%8s: %s", sample.line, sample.text);
}

Usage: k-sample <file> <k>

It's easy to prove that a probability of any particular line appearing 
in the samples is k/n where n is the total number of lines in the text.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Thu, 09 Oct 2008 13:56:48 -0500,
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

 
 Here's mine
 
 module k_sample;
 
 import std.conv: to;
 import std.stream: BufferedFile;
 import std.random: Random, uniform, unpredictableSeed;
 import std.stdio: writefln;
 
 void main(string[] args)
 {
     auto f = new BufferedFile(args[1]);
     auto k = to!(int)(args[2]);
     assert(k > 0);
 
     struct Sample
     {
         string text;
         ulong line;
         this(string s, ulong n)
         {
             text = s;
             line = n;
         }
     }
 
     Sample[] samples;
     auto gen = Random(unpredictableSeed);
 
     foreach (ulong n, char[] s; f)
     {
         if (samples.length < k)
             samples ~= Sample(s.idup, n+1);
         else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
             samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
     }
 
     foreach (sample; samples)
         writefln("%8s: %s", sample.line, sample.text);
 }
 
 Usage: k-sample <file> <k>
 
 It's easy to prove that a probability of any particular line appearing 
 in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with:

alias Tuple!(string, "text", ulong, "line") Sample;

On a different topic, your use of std.stream reminded me. I was thinking 
of yanking it from Phobos. Thoughts?


Andrei

Oct 09 2008

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking of
 yanking it from Phobos. Thoughts?

..what?  You want to remove most of Phobos' IO capabilities?

I'm guessing there's something else you'd like to replace it with, but
I wonder what it could be.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jarrett Billingsley wrote:
 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 
 On a different topic, your use of std.stream reminded me. I was thinking of
 yanking it from Phobos. Thoughts?

 
 ..what?  You want to remove most of Phobos' IO capabilities?
 
 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

Well I'd like to extend std.stdio with capabilities that meet and exceed 
  those in std.stream. The design of std.stream is ok, but makes it hard 
to compose streams, and requires a lot of hand coding to define new streams.

Andrei

Oct 09 2008

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley  
<jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was  
 thinking of
 yanking it from Phobos. Thoughts?

 ..what?  You want to remove most of Phobos' IO capabilities?

 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

Yeah, why need them when Tango is soon to be used with Phobos side-by-side  
;)

Oct 10 2008

"Jarrett Billingsley" <jarrett.billingsley gmail.com> writes:

On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking
 of
 yanking it from Phobos. Thoughts?

 ..what?  You want to remove most of Phobos' IO capabilities?

 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

 Yeah, why need them when Tango is soon to be used with Phobos side-by-side
 ;)

That's kind of what I'm thinking.  It seems like a waste of time for
_another_ IO framework to be developed when we have two already.
Priorities, priorities.  How about fixing frontend bugs instead?

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was thinking
 of
 yanking it from Phobos. Thoughts?

 ..what?  You want to remove most of Phobos' IO capabilities?

 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

 Yeah, why need them when Tango is soon to be used with Phobos side-by-side
 ;)

 
 That's kind of what I'm thinking.  It seems like a waste of time for
 _another_ IO framework to be developed when we have two already.
 Priorities, priorities.  How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want
e.g. std.algorithm to work on I/O, there must be integration of I/O with
the range concept. As far as I can see things, ranges will be quite a
powerful currency allowing many std components to trade data with one
another.

Andrei

Oct 10 2008

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com>  
 wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was  
 thinking
 of
 yanking it from Phobos. Thoughts?

 ..what?  You want to remove most of Phobos' IO capabilities?

 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

 Yeah, why need them when Tango is soon to be used with Phobos  
 side-by-side
 ;)

  That's kind of what I'm thinking.  It seems like a waste of time for
 _another_ IO framework to be developed when we have two already.
 Priorities, priorities.  How about fixing frontend bugs instead?

 Well Walter wouldn't be working on it anyway. The thing is, if we want
 e.g. std.algorithm to work on I/O, there must be integration of I/O with
 the range concept. As far as I can see things, ranges will be quite a
 powerful currency allowing many std components to trade data with one
 another.

 Andrei

Great! Then, how about getting familiar with Tango IO and building your  
concept on top of it? ;)

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Jarrett Billingsley wrote:
 On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> 
 wrote:
 On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
 <jarrett.billingsley gmail.com> wrote:

 On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On a different topic, your use of std.stream reminded me. I was 
 thinking
 of
 yanking it from Phobos. Thoughts?

 ..what?  You want to remove most of Phobos' IO capabilities?

 I'm guessing there's something else you'd like to replace it with, but
 I wonder what it could be.

 Yeah, why need them when Tango is soon to be used with Phobos 
 side-by-side
 ;)

  That's kind of what I'm thinking.  It seems like a waste of time for
 _another_ IO framework to be developed when we have two already.
 Priorities, priorities.  How about fixing frontend bugs instead?

 Well Walter wouldn't be working on it anyway. The thing is, if we want
 e.g. std.algorithm to work on I/O, there must be integration of I/O with
 the range concept. As far as I can see things, ranges will be quite a
 powerful currency allowing many std components to trade data with one
 another.

 Andrei

 
 Great! Then, how about getting familiar with Tango IO and building your 
 concept on top of it? ;)

That would make Phobos dependent on Tango, thus running into those 
licensing issues I have little understanding of.

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Thu, 09 Oct 2008 22:28:51 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 13:56:48 -0500,
 Andrei Alexandrescu wrote:
 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.

 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from the 
 file. (If the file has less than k lines, it should all be output.) The 
 file's size is unbounded so loading it in memory is not an option. How'd 
 you go about it?

 
 Here's mine
 
 module k_sample;
 
 import std.conv: to;
 import std.stream: BufferedFile;
 import std.random: Random, uniform, unpredictableSeed;
 import std.stdio: writefln;
 
 void main(string[] args)
 {
     auto f = new BufferedFile(args[1]);
     auto k = to!(int)(args[2]);
     assert(k > 0);
 
     struct Sample
     {
         string text;
         ulong line;
         this(string s, ulong n)
         {
             text = s;
             line = n;
         }
     }
 
     Sample[] samples;
     auto gen = Random(unpredictableSeed);
 
     foreach (ulong n, char[] s; f)
     {
         if (samples.length < k)
             samples ~= Sample(s.idup, n+1);
         else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
             samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
     }
 
     foreach (sample; samples)
         writefln("%8s: %s", sample.line, sample.text);
 }
 
 Usage: k-sample <file> <k>
 
 It's easy to prove that a probability of any particular line appearing 
 in the samples is k/n where n is the total number of lines in the text.

 
 Sweet! Suggestion: replace Sample's definition with:
 
 alias Tuple!(string, "text", ulong, "line") Sample;

Good advice, thanks.  The Sample definition is unimportant for the 
solution but eats up a lot of visual space.

 On a different topic, your use of std.stream reminded me. I was thinking 
 of yanking it from Phobos. Thoughts?

While experimenting with Bearophile's loader
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

I've found out that BufferedFile is the fastest available way to process 
a file line-by-line.  So the alternative should be as fast.

Oct 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Sergey Gromov:
 While experimenting with Bearophile's loader
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402

 I've found out that BufferedFile is the fastest available way to process 
 a file line-by-line.  So the alternative should be as fast.

BufferedFile seems the faster in Phobos, but it's 2-3 times slower than the
line-by-line reading you have in Python, and I think it has 4 bugs, so I don't
use it much, that's why I use the xfile() of my libs, that is a little slower,
but without those bugs.

Few bugs of BufferedFile can be seen if you try to load a very large file made
of a single line (crashes D every time), or you try reading lines with \0 at
the end of the line, etc. They are corner cases, usually.

I suggest reading the C sources of the I/O of Python (or Perl, if you are able
to read the sources. I think the differences in clarity between Python and Perl
can be seen even in their C sources).

Bye,
bearophile

Oct 10 2008

Michel Fortin <michel.fortin michelf.com> writes:

On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I just ran across a nice little problem that I wanted to share as an 
 exercise for the interested in futile pastimes.
 
 The challenge, should you accept it, is to write a program that given a 
 number k and a file, outputs k lines picked uniformly at random from 
 the file. (If the file has less than k lines, it should all be output.) 
 The file's size is unbounded so loading it in memory is not an option. 
 How'd you go about it?

My attempt. Works in one pass, and distributes uniformly but only when 
k is a factor of the number of lines in the file. At least, I believe 
the distribution is uniform, but feel free to prove me wrong.

/* Uniform random value r, with r >= min && r < max. */
real random(uint min, uint max) { ... }

/* Shuffle elements uniformly in the given string array. */
void shuffle(ref string[] array) { ... }

string[] challenge(alias R)(R input, uint k)
{
	string[] result;
	uint line;
	real pickFactor = 1;

	// Initial stage: fill initial k strings
	while (!input.eof && line < k)
	{
		result ~= input.readLine;
		++line;
	}
	shuffle(result);

	while (!input.eof)
	{
		// Next stage: half the chances of picking a candidate.
		pickFactor /= 2;
		string[] candidates;
		line = 0;

		while (!input.eof && line < k)
		{
			candidates ~= input.readLine;
			++line;
		}
		shuffle(candidates);

		for (uint i; i < candidates.length; ++i)
			if (random(0, 1) < pickFactor)
				result[i] = candidates[i];
		// Note: if candidate.length != result.length
		// we don't have uniformity.
	}
	return result;
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 10 2008

D Programming

C/C++ Programming

Other

digitalmars.D - random k-sample of a file