digitalmars.D - shuffling lines in a stream

Andrei Alexandrescu (15/15) Oct 09 2008 The high traffic enjoyed by "random k-sample of a file" confirms that

BCS (2/22) Oct 09 2008 Um... I'm not following that.

Andrei Alexandrescu (21/45) Oct 09 2008 Very simple: given a file, just randomly change the order of its lines

BCS (2/7) Oct 09 2008 I got the bit about the end effect. The above is what I didn't get.

Andrei Alexandrescu (3/12) Oct 09 2008 Don't worry about that for now.

BCS (10/25) Oct 09 2008 cache the whole file into a tmp file

Andrei Alexandrescu (4/35) Oct 09 2008 I agree with the last paragraph, but lseeking seems overly inefficient.

BCS (3/14) Oct 10 2008 algorithmically, I don't think the lseek will matter, as to I/O and cach...

Andrei Alexandrescu (8/24) Oct 10 2008 I think it does. Essentially you impose random access on the input, or

BCS (3/33) Oct 10 2008 Is that not an I/O effect?

Andrei Alexandrescu (8/35) Oct 10 2008 Well I agree things can be coded such that you make the copy only if the...

BCS (4/13) Oct 10 2008 I think Big-O on mine is n.

Andrei Alexandrescu (6/20) Oct 10 2008 I disagree. Big-O with lseek under the rug? That's a non-solution

BCS (2/25) Oct 10 2008 In that case this isn't a problem /I/ am intersted in trying to solve.

Andrei Alexandrescu (6/26) Oct 09 2008 Forgot to mention a desirable property. It would be nice if the
Robert Fraser (3/23) Oct 09 2008 I know it's been suggested before, but we really need

Bruno Medeiros (5/8) Oct 16 2008 Word.

BLS (7/27) Oct 09 2008 Just in general

Andrei Alexandrescu (9/37) Oct 10 2008 I think I overspecified it. Pure and simple you must randomize lines in

Sergey Gromov (32/46) Oct 10 2008 Well, my temporary file contains the whole stream. Because there is a

Andrei Alexandrescu (4/23) Oct 10 2008 Yes, that probability exists, but that doesn't imply you must load the

Sergey Gromov (4/27) Oct 10 2008 This implies though that I cannot start output before I read all the

Andrei Alexandrescu (4/30) Oct 10 2008 Creating temporary files is fine as long as you never need to store the

KennyTM~ (3/35) Oct 10 2008 So HDD (or any file system) is not counted as memory? Then just store

Andrei Alexandrescu (5/41) Oct 10 2008 This takes us back to one lseek per line. You can create temporary

Sergey Gromov (18/33) Oct 10 2008 Ok, let me re-formulate the task then, since the creation of a temporary...

Andrei Alexandrescu (3/36) Oct 10 2008 Feeling doesn't qualify.
Christopher Wright (4/37) Oct 11 2008 You have to continually alter k to maintain a random distribution, but

Sean Kelly (10/10) Oct 10 2008 It would be slower than the seeking option, but something like a

Andrei Alexandrescu (4/15) Oct 10 2008 Sweet! This is even better than my solution. (And it's much faster than

Christopher Wright (5/22) Oct 11 2008 You can reduce the number of passes by using more temporary files per

Andrei Alexandrescu (7/18) Oct 10 2008 I think I found a problem with your solution. Consider you break the

Sergey Gromov (7/25) Oct 10 2008 After merging b and c you end up with a and bc. Then you randomly merge...

Andrei Alexandrescu (3/29) Oct 10 2008 How do you "randomly merge"? Describe the process.

Sergey Gromov (17/46) Oct 10 2008 a[] and b[] are files to merge. ab[] is the result. a.length and

Andrei Alexandrescu (3/49) Oct 10 2008 This should work. Use of ranges noted :o).

Sergey Gromov (7/57) Oct 10 2008 I just rigorously proved that probability of line from a getting into
Sergey Gromov (8/23) Oct 10 2008 Actually I felt an urge to write:

Sean Kelly (24/54) Oct 10 2008 while( a.hasMore || b.hasMore )

Andrei Alexandrescu (15/71) Oct 10 2008 Yah, Sergey got me thinking too. What if you have a large file that

Sergey Gromov (36/112) Oct 10 2008 Here's my little proof. Paste this into wikipedia to see formulas.

Sean Kelly (4/22) Oct 10 2008 When a is merged with (b&c) then its lines will be interleaved randomly

Andrei Alexandrescu (9/31) Oct 10 2008 I think that works with Sergey's appropriate crooking of the dice. For

Russell Lewis (47/67) Oct 13 2008 I've read most of the responses (not all) and haven't yet seen a

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

The high traffic enjoyed by "random k-sample of a file" confirms that 
plenty of people around here have idle cycles ready to be spent on fun 
tasks.

In that vein, here's another problem. Say you want to randomly shuffle 
lines in a stream. You don't know the size beforehand, and one pass is 
preferable. The stream can be larger than the working memory, but not an 
unbounded amount of times larger. Also lines in the stream can be 
unbounded, but it is guaranteed that at least a nonzero fraction of the 
file fits in working memory.

You can use temporary files, but of course minimizing their total size 
is recommended. Speed is of course important. Draft your solution and 
explain why you think it's correct and advantageous.

In case you hate random stuff, just wait; the next problem won't have 
any random element in it :o).


Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 The high traffic enjoyed by "random k-sample of a file" confirms that
 plenty of people around here have idle cycles ready to be spent on fun
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle
 lines in a stream. You don't know the size beforehand, and one pass is
 preferable. The stream can be larger than the working memory, but not
 an unbounded amount of times larger. Also lines in the stream can be
 unbounded, but it is guaranteed that at least a nonzero fraction of
 the file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size
 is recommended. Speed is of course important. Draft your solution and
 explain why you think it's correct and advantageous.
 
 In case you hate random stuff, just wait; the next problem won't have
 any random element in it :o).
 
 Andrei
 

Um... I'm not following that.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 The high traffic enjoyed by "random k-sample of a file" confirms that
 plenty of people around here have idle cycles ready to be spent on fun
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle
 lines in a stream. You don't know the size beforehand, and one pass is
 preferable. The stream can be larger than the working memory, but not
 an unbounded amount of times larger. Also lines in the stream can be
 unbounded, but it is guaranteed that at least a nonzero fraction of
 the file fits in working memory.

 You can use temporary files, but of course minimizing their total size
 is recommended. Speed is of course important. Draft your solution and
 explain why you think it's correct and advantageous.

 In case you hate random stuff, just wait; the next problem won't have
 any random element in it :o).

 Andrei

 
 Um... I'm not following that.

Very simple: given a file, just randomly change the order of its lines 
and output them. For example (assume read from standard input):

$ ./shuffle
a
b
c
^D
b
a
c
$ ./shuffle
a
b
c
^D
b
c
a
$ _


Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 You don't know the size beforehand, and
 one pass is preferable. The stream can be larger than the working
 memory, but not an unbounded amount of times larger. Also lines in
 the stream can be unbounded, but it is guaranteed that at least a
 nonzero fraction of the file fits in working memory.

I got the bit about the end effect. The above is what I didn't get.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 You don't know the size beforehand, and
 one pass is preferable. The stream can be larger than the working
 memory, but not an unbounded amount of times larger. Also lines in
 the stream can be unbounded, but it is guaranteed that at least a
 nonzero fraction of the file fits in working memory.

 
 I got the bit about the end effect. The above is what I didn't get.

Don't worry about that for now.

Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 BCS wrote:
 
 Reply to Andrei,
 
 You don't know the size beforehand, and
 one pass is preferable. The stream can be larger than the working
 memory, but not an unbounded amount of times larger. Also lines in
 the stream can be unbounded, but it is guaranteed that at least a
 nonzero fraction of the file fits in working memory.

 I got the bit about the end effect. The above is what I didn't get.
 

 Don't worry about that for now.
 
 Andrei
 

cache the whole file into a tmp file
store slice data on the way through (start/length of each line)
shuffle that set
lseek/read/write each slice in order

Quick'n'dirty

The only fun part is how to shuffle the deck. What I think might be easiest 
to do would be to qsort it with rand as the compare function.

I don't think there is any way to avoid storing the whole file because for 
a uniform sort there is a possibility that the last line will come out first.

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 BCS wrote:

 Reply to Andrei,

 You don't know the size beforehand, and
 one pass is preferable. The stream can be larger than the working
 memory, but not an unbounded amount of times larger. Also lines in
 the stream can be unbounded, but it is guaranteed that at least a
 nonzero fraction of the file fits in working memory.

 I got the bit about the end effect. The above is what I didn't get.

 Don't worry about that for now.

 Andrei

 
 cache the whole file into a tmp file
 store slice data on the way through (start/length of each line)
 shuffle that set
 lseek/read/write each slice in order
 
 Quick'n'dirty
 
 The only fun part is how to shuffle the deck. What I think might be 
 easiest to do would be to qsort it with rand as the compare function.
 
 I don't think there is any way to avoid storing the whole file because 
 for a uniform sort there is a possibility that the last line will come 
 out first.

I agree with the last paragraph, but lseeking seems overly inefficient. 
Could you avoid that?

Andrei

Oct 09 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 BCS wrote:
 

 I don't think there is any way to avoid storing the whole file
 because for a uniform sort there is a possibility that the last line
 will come out first.
 

 I agree with the last paragraph, but lseeking seems overly
 inefficient. Could you avoid that?
 
 Andrei
 

algorithmically, I don't think the lseek will matter, as to I/O and cache 
effects, I'll leave that to someone else.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 BCS wrote:

 
 I don't think there is any way to avoid storing the whole file
 because for a uniform sort there is a possibility that the last line
 will come out first.

 I agree with the last paragraph, but lseeking seems overly
 inefficient. Could you avoid that?

 Andrei

 
 algorithmically, I don't think the lseek will matter,

I think it does. Essentially you impose random access on the input, or 
copy to a medium that offers it.

gunzip --stdout bigfile.gz | shuffle

You'll have to compulsively store a copy of the input. Besides, random 
access is kind of a dicey proposition on large files. Of course, only 
measurement will show...


Andrei

Oct 10 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 BCS wrote:
 
 Reply to Andrei,
 
 BCS wrote:
 
 I don't think there is any way to avoid storing the whole file
 because for a uniform sort there is a possibility that the last
 line will come out first.
 

 I agree with the last paragraph, but lseeking seems overly
 inefficient. Could you avoid that?
 
 Andrei
 

 algorithmically, I don't think the lseek will matter,
 

 I think it does. Essentially you impose random access on the input, or
 copy to a medium that offers it.
 
 gunzip --stdout bigfile.gz | shuffle
 
 You'll have to compulsively store a copy of the input.

You have to anyway, or is that not what you agread with above?

 Besides, random
 access is kind of a dicey proposition on large files. Of course, only
 measurement will show...

Is that not an I/O effect?

  as to I/O and cache effects, I'll leave that to someone else.

 
 Andrei

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 BCS wrote:

 Reply to Andrei,

 BCS wrote:

 I don't think there is any way to avoid storing the whole file
 because for a uniform sort there is a possibility that the last
 line will come out first.

 I agree with the last paragraph, but lseeking seems overly
 inefficient. Could you avoid that?

 Andrei

 algorithmically, I don't think the lseek will matter,

 I think it does. Essentially you impose random access on the input, or
 copy to a medium that offers it.

 gunzip --stdout bigfile.gz | shuffle

 You'll have to compulsively store a copy of the input.

 
 You have to anyway, or is that not what you agread with above?

Well I agree things can be coded such that you make the copy only if the 
file is too large anyway. But one lseek call per line is a complete 
disaster, whether you call it I/O effect or whatnot. I'd go as far as 
saying I'd never even try such a solution, especially since there exists 
a forward-pass algorithm. Of course measuring would be a great way to 
prove me wrong.

Andrei

Oct 10 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,


 Well I agree things can be coded such that you make the copy only if
 the file is too large anyway. But one lseek call per line is a
 complete disaster, whether you call it I/O effect or whatnot. I'd go
 as far as saying I'd never even try such a solution, especially since
 there exists a forward-pass algorithm. Of course measuring would be a
 great way to prove me wrong.
 
 Andrei
 

I think Big-O on mine is n.

I don't think that can be beat. (Maybe I was being not at all clear that 
I was only looking at that aspect, the rest doesn't interest me as a puzzle)

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BCS wrote:
 Reply to Andrei,
 
 
 Well I agree things can be coded such that you make the copy only if
 the file is too large anyway. But one lseek call per line is a
 complete disaster, whether you call it I/O effect or whatnot. I'd go
 as far as saying I'd never even try such a solution, especially since
 there exists a forward-pass algorithm. Of course measuring would be a
 great way to prove me wrong.

 Andrei

 
 I think Big-O on mine is n.

I disagree. Big-O with lseek under the rug? That's a non-solution 
because it essentially redefines the problem into "if I had a 
random-access operator then this is pretty much like array shuffling". 
If you have a random-access operator there is no problem to think about.

Andrei

Oct 10 2008

BCS <ao pathlink.com> writes:

Reply to Andrei,

 BCS wrote:
 
 Reply to Andrei,
 
 Well I agree things can be coded such that you make the copy only if
 the file is too large anyway. But one lseek call per line is a
 complete disaster, whether you call it I/O effect or whatnot. I'd go
 as far as saying I'd never even try such a solution, especially
 since there exists a forward-pass algorithm. Of course measuring
 would be a great way to prove me wrong.
 
 Andrei
 

 I think Big-O on mine is n.
 

 I disagree. Big-O with lseek under the rug? That's a non-solution
 because it essentially redefines the problem into "if I had a
 random-access operator then this is pretty much like array shuffling".
 If you have a random-access operator there is no problem to think
 about.
 
 Andrei
 

In that case this isn't a problem /I/ am intersted in trying to solve.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.
 
 In case you hate random stuff, just wait; the next problem won't have 
 any random element in it :o).
 
 
 Andrei

Forgot to mention a desirable property. It would be nice if the 
algorithm would naturally subscale, e.g. if ran on a short file, it 
would produce a shuffled file with the speed competitive with an 
algorithm conceived for shorter files.

Andrei

Oct 09 2008

Robert Fraser <fraserofthenight gmail.com> writes:

Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.
 
 In case you hate random stuff, just wait; the next problem won't have 
 any random element in it :o).
 
 
 Andrei

I know it's been suggested before, but we really need 
digitalmars.d.offtopic

Oct 09 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Robert Fraser wrote:
 
 I know it's been suggested before, but we really need 
 digitalmars.d.offtopic

Word.

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Oct 16 2008

BLS <nanali nospam-wanadoo.fr> writes:

Andrei Alexandrescu schrieb:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.
 
 In case you hate random stuff, just wait; the next problem won't have 
 any random element in it :o).
 
 
 Andrei

Just in general
I would use a memory mapped file. Map data in chunks, say 4MB. Chunks 
are unmapped as necessary ~LRU, means implement a LRUCache.

This is how some "in memory database-engines" solve that sort of problems.
Also possible : I miss the real problem :(
Bjoern

Oct 09 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

BLS wrote:
 Andrei Alexandrescu schrieb:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not 
 an unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of 
 the file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

 In case you hate random stuff, just wait; the next problem won't have 
 any random element in it :o).


 Andrei

 
 Just in general
 I would use a memory mapped file. Map data in chunks, say 4MB. Chunks 
 are unmapped as necessary ~LRU, means implement a LRUCache.
 
 This is how some "in memory database-engines" solve that sort of problems.
 Also possible : I miss the real problem :(

I think I overspecified it. Pure and simple you must randomize lines in 
a file that may not fit in memory. Note that memory mapping may help the 
speed of whichever implementation, but gives zero indication on how you 
would actually tackle the problem.

Please add [OT] to the title when answering, as I did, so people can 
filter it out. I think these are cool things to discuss nonetheless in 
D's context because it gives us ideas on language features and libraries.


Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Thu, 09 Oct 2008 23:42:47 -0500,
Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

Well, my temporary file contains the whole stream.  Because there is a 
probability that the last line of the stream goes first.

module shuffle;

import std.cstream: din, dout;
import std.stream: BufferedFile, FileMode;
import std.file: remove, exists;
import std.random: Random, uniform, unpredictableSeed, randomShuffle;
import std.stdio: writeln;

void main()
{
    auto gen = Random(unpredictableSeed);

    enum tempName = "temp.$$$";
    if (exists(tempName))
        remove(tempName);
    auto temp = new BufferedFile(tempName,
        FileMode.In | FileMode.OutNew);

    ulong[] cache;

    foreach (char[] s; din)
    {
        cache ~= temp.position;
        temp.writeLine(s);
    }

    randomShuffle(cache, gen);

    foreach (pos; cache)
    {
        temp.position = pos;
        writeln(temp.readLine());
    }

    temp.close();
}

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

 
 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

Yes, that probability exists, but that doesn't imply you must load the 
entire file in memory.

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 07:55:38 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

 
 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

 
 Yes, that probability exists, but that doesn't imply you must load the 
 entire file in memory.

This implies though that I cannot start output before I read all the 
lines from the stream.  Therefore I must store those lines somewhere.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load the 
 entire file in memory.

 
 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

Creating temporary files is fine as long as you never need to store the 
entire input in memory.

Andrei

Oct 10 2008

KennyTM~ <kennytm gmail.com> writes:

Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms 
 that plenty of people around here have idle cycles ready to be 
 spent on fun tasks.

 In that vein, here's another problem. Say you want to randomly 
 shuffle lines in a stream. You don't know the size beforehand, and 
 one pass is preferable. The stream can be larger than the working 
 memory, but not an unbounded amount of times larger. Also lines in 
 the stream can be unbounded, but it is guaranteed that at least a 
 nonzero fraction of the file fits in working memory.

 You can use temporary files, but of course minimizing their total 
 size is recommended. Speed is of course important. Draft your 
 solution and explain why you think it's correct and advantageous.

 Well, my temporary file contains the whole stream.  Because there is 
 a probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load 
 the entire file in memory.

 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

 
 Creating temporary files is fine as long as you never need to store the 
 entire input in memory.
 
 Andrei

So HDD (or any file system) is not counted as memory? Then just store 
all temporary variables to HDD.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

KennyTM~ wrote:
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms 
 that plenty of people around here have idle cycles ready to be 
 spent on fun tasks.

 In that vein, here's another problem. Say you want to randomly 
 shuffle lines in a stream. You don't know the size beforehand, and 
 one pass is preferable. The stream can be larger than the working 
 memory, but not an unbounded amount of times larger. Also lines in 
 the stream can be unbounded, but it is guaranteed that at least a 
 nonzero fraction of the file fits in working memory.

 You can use temporary files, but of course minimizing their total 
 size is recommended. Speed is of course important. Draft your 
 solution and explain why you think it's correct and advantageous.

 Well, my temporary file contains the whole stream.  Because there 
 is a probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load 
 the entire file in memory.

 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

 Creating temporary files is fine as long as you never need to store 
 the entire input in memory.

 Andrei

 
 So HDD (or any file system) is not counted as memory? Then just store 
 all temporary variables to HDD.

This takes us back to one lseek per line. You can create temporary 
files. They behave as files usually do - they are better accessed in 
sequential patterns and not too many times.

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 09:20:44 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load the 
 entire file in memory.

 
 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

 
 Creating temporary files is fine as long as you never need to store the 
 entire input in memory.

Ok, let me re-formulate the task then, since the creation of a temporary 
file containing all the lines seems inevitable.  You basically want to 
uniformly shuffle lines in a very large file.

The idea is this.  Let N be number of lines in the file, which you know 
after storing it.  There are two passes, forward and backward.  They are 
symmetric.  On a forward pass you read as many lines as you can fit in 
memory, let it be m.  Shuffle them.  Then choose a number, k which is
0 < k < m and is somehow based upon the m and N, like k = m * m/N.  
Write the k first lines back into the file and discard them from memory.  
Append as many new lines to your memory array as possible.  Shuffle.  
Choose k.  Write.  Repeat.  The reverse pass is exactly the same but you 
start from the end of file.

The tricky part is how to choose k.  It obviously needs to take number 
of discarded lines, q, into account like in k = m * m/(N-q).  But my 
math-fu is almost non-existent so I'm unable to prove if this approach 
can give an uniform distribution.  I feel like it can though.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Fri, 10 Oct 2008 09:20:44 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load the 
 entire file in memory.

 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

 Creating temporary files is fine as long as you never need to store the 
 entire input in memory.

 
 Ok, let me re-formulate the task then, since the creation of a temporary 
 file containing all the lines seems inevitable.  You basically want to 
 uniformly shuffle lines in a very large file.
 
 The idea is this.  Let N be number of lines in the file, which you know 
 after storing it.  There are two passes, forward and backward.  They are 
 symmetric.  On a forward pass you read as many lines as you can fit in 
 memory, let it be m.  Shuffle them.  Then choose a number, k which is
 0 < k < m and is somehow based upon the m and N, like k = m * m/N.  
 Write the k first lines back into the file and discard them from memory.  
 Append as many new lines to your memory array as possible.  Shuffle.  
 Choose k.  Write.  Repeat.  The reverse pass is exactly the same but you 
 start from the end of file.
 
 The tricky part is how to choose k.  It obviously needs to take number 
 of discarded lines, q, into account like in k = m * m/(N-q).  But my 
 math-fu is almost non-existent so I'm unable to prove if this approach 
 can give an uniform distribution.  I feel like it can though.

Feeling doesn't qualify.

Andrei

Oct 10 2008

Christopher Wright <dhasenan gmail.com> writes:

Sergey Gromov wrote:
 Fri, 10 Oct 2008 09:20:44 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 07:55:38 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Well, my temporary file contains the whole stream.  Because there is a 
 probability that the last line of the stream goes first.

 Yes, that probability exists, but that doesn't imply you must load the 
 entire file in memory.

 This implies though that I cannot start output before I read all the 
 lines from the stream.  Therefore I must store those lines somewhere.

 Creating temporary files is fine as long as you never need to store the 
 entire input in memory.

 
 Ok, let me re-formulate the task then, since the creation of a temporary 
 file containing all the lines seems inevitable.  You basically want to 
 uniformly shuffle lines in a very large file.
 
 The idea is this.  Let N be number of lines in the file, which you know 
 after storing it.  There are two passes, forward and backward.  They are 
 symmetric.  On a forward pass you read as many lines as you can fit in 
 memory, let it be m.  Shuffle them.  Then choose a number, k which is
 0 < k < m and is somehow based upon the m and N, like k = m * m/N.  
 Write the k first lines back into the file and discard them from memory.  
 Append as many new lines to your memory array as possible.  Shuffle.  
 Choose k.  Write.  Repeat.  The reverse pass is exactly the same but you 
 start from the end of file.
 
 The tricky part is how to choose k.  It obviously needs to take number 
 of discarded lines, q, into account like in k = m * m/(N-q).  But my 
 math-fu is almost non-existent so I'm unable to prove if this approach 
 can give an uniform distribution.  I feel like it can though.

You have to continually alter k to maintain a random distribution, but 
you have an invariant 0 <= k <= m. Those bounds mean you can't use this 
approach.

Oct 11 2008

Sean Kelly <sean invisibleduck.org> writes:

It would be slower than the seeking option, but something like a 
randomized mergesort would work as well.  If the program can buffer k 
lines in a file containing n lines, then read the first k lines into 
memory, shuffle them, and write them out to a temporary file.  Repeat 
until the input is exhausted.  Now randomly pick two of the temporary 
files and randomly merge them.  Repeat until two temporary files remain, 
then output the result of the final random merge to the screen.

For small files (ie. where n<k) the file would be read, shuffled in 
memory, and printed to the screen, assuming the proper checks were in place.


Sean

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.
 
 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

Sweet! This is even better than my solution. (And it's much faster than 
seeking.)

Andrei

Oct 10 2008

Christopher Wright <dhasenan gmail.com> writes:

Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files 
 remain, then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 
 Sweet! This is even better than my solution. (And it's much faster than 
 seeking.)
 
 Andrei

You can reduce the number of passes by using more temporary files per 
iteration. For instance, with O(log n) files per iteration, you'll get a 
time complexity of O(n log* n) rather than O(n log n). However, using 
more files makes the disk cache unhappy.

Oct 11 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.
 
 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

I think I found a problem with your solution. Consider you break the 
file in three chunks: a, b, c. Then you pick at random b and c and 
randomly merge them. The problem is, you make early the decision that 
nothing in a will appear in the first two thirds of the result. So the 
quality of randomness suffers. How would you address that?

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 15:54:18 -0500,
Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.
 
 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 
 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

After merging b and c you end up with a and bc.  Then you randomly merge 
these two files and lines from a have all the chances to appear anywhere 
within result.

When randomly merging, the probability of picking a line from a file 
should be proportional to a number of lines left in that file.

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

 
 After merging b and c you end up with a and bc.  Then you randomly merge 
 these two files and lines from a have all the chances to appear anywhere 
 within result.
 
 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

How do you "randomly merge"? Describe the process.

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 16:10:29 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

 
 After merging b and c you end up with a and bc.  Then you randomly merge 
 these two files and lines from a have all the chances to appear anywhere 
 within result.
 
 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 
 How do you "randomly merge"? Describe the process.

a[] and b[] are files to merge.  ab[] is the result.  a.length and 
b.length are number of lines left in either file.

while (a.length || b.length)
{
  if (uniform(gen, 0, a.length + b.length) < a.length)
  {
    ab.put(a.head);
    a.next();
  }
  else
  {
    ab.put(b.head);
    b.next();
  }
}

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sergey Gromov wrote:
 Fri, 10 Oct 2008 16:10:29 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

 After merging b and c you end up with a and bc.  Then you randomly merge 
 these two files and lines from a have all the chances to appear anywhere 
 within result.

 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 How do you "randomly merge"? Describe the process.

 
 a[] and b[] are files to merge.  ab[] is the result.  a.length and 
 b.length are number of lines left in either file.
 
 while (a.length || b.length)
 {
   if (uniform(gen, 0, a.length + b.length) < a.length)
   {
     ab.put(a.head);
     a.next();
   }
   else
   {
     ab.put(b.head);
     b.next();
   }
 }

This should work. Use of ranges noted :o).

Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 16:27:03 -0500,
Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 16:10:29 -0500,
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files remain, 
 then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

 After merging b and c you end up with a and bc.  Then you randomly merge 
 these two files and lines from a have all the chances to appear anywhere 
 within result.

 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 How do you "randomly merge"? Describe the process.

 
 a[] and b[] are files to merge.  ab[] is the result.  a.length and 
 b.length are number of lines left in either file.
 
 while (a.length || b.length)
 {
   if (uniform(gen, 0, a.length + b.length) < a.length)
   {
     ab.put(a.head);
     a.next();
   }
   else
   {
     ab.put(b.head);
     b.next();
   }
 }

 
 This should work.

I just rigorously proved that probability of line from a getting into 
output position 0 is equal to probability of line from a getting into 
position 1 and is equal to a.length/(a.length+b.length).  I need to 
recollect TeX to write this proof down for electronic use.

 Use of ranges noted :o).

Except that a.length and b.length should be stored in files somehow.

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 16:27:03 -0500,
Andrei Alexandrescu wrote:
 while (a.length || b.length)
 {
   if (uniform(gen, 0, a.length + b.length) < a.length)
   {
     ab.put(a.head);
     a.next();
   }
   else
   {
     ab.put(b.head);
     b.next();
   }
 }

 
 This should work. Use of ranges noted :o).

Actually I felt an urge to write:

if (uniform(gen, 0, a.length + b.length) < a.length)
  ab.put(a.next);
else
  ab.put(b.next);

I had to really get hold of myself.  ;)

Oct 10 2008

Sean Kelly <sean invisibleduck.org> writes:

Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer 
 k lines in a file containing n lines, then read the first k lines 
 into memory, shuffle them, and write them out to a temporary file.  
 Repeat until the input is exhausted.  Now randomly pick two of the 
 temporary files and randomly merge them.  Repeat until two temporary 
 files remain, then output the result of the final random merge to 
 the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were 
 in place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So 
 the quality of randomness suffers. How would you address that?

 After merging b and c you end up with a and bc.  Then you randomly 
 merge these two files and lines from a have all the chances to appear 
 anywhere within result.

 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 
 How do you "randomly merge"? Describe the process.


while( a.hasMore || b.hasMore )
{
     if( rand % 2 )
         r ~= a.readLine;
     else
         r ~= b.readLine;
}
while( a.hasMore )
     r ~= a.readLine;
while( b.hasMore )
     r ~= b.readLine;

The downside to randomly choosing two files is that it's possible you 
could end up always choosing a small and a large file, thus increasing 
the chance that the large file will have a bunch of lines left over 
which are essentially appended.

The easy way around this would be to maintain a list of files so only 
equal-sized files are merged (similar to how mergesort works).  However, 
I think a better way would be to use some sort of bias for the random 
check above so that different-sized files are on equal footing.  Perhaps 
write out the file with the first 4 bytes containing the number of lines 
in that file and use that to construct the bias.  I'll leave those with 
better math-fu than I to come up with a formula which would be fair.


Sean

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer 
 k lines in a file containing n lines, then read the first k lines 
 into memory, shuffle them, and write them out to a temporary file.  
 Repeat until the input is exhausted.  Now randomly pick two of the 
 temporary files and randomly merge them.  Repeat until two 
 temporary files remain, then output the result of the final random 
 merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were 
 in place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision 
 that nothing in a will appear in the first two thirds of the result. 
 So the quality of randomness suffers. How would you address that?

 After merging b and c you end up with a and bc.  Then you randomly 
 merge these two files and lines from a have all the chances to appear 
 anywhere within result.

 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 How do you "randomly merge"? Describe the process.

 
 
 while( a.hasMore || b.hasMore )
 {
     if( rand % 2 )
         r ~= a.readLine;
     else
         r ~= b.readLine;
 }
 while( a.hasMore )
     r ~= a.readLine;
 while( b.hasMore )
     r ~= b.readLine;
 
 The downside to randomly choosing two files is that it's possible you 
 could end up always choosing a small and a large file, thus increasing 
 the chance that the large file will have a bunch of lines left over 
 which are essentially appended.

Yah, Sergey got me thinking too. What if you have a large file that 
doesn't fit in memory by exactly one line?

Then I shuffle (n - 1) lines at random, and there remains the last line 
to write. If I choose from either fragment with probability 0.5, the 
last line will be with high probability among the first in the result!

 The easy way around this would be to maintain a list of files so only 
 equal-sized files are merged (similar to how mergesort works).  However, 
 I think a better way would be to use some sort of bias for the random 
 check above so that different-sized files are on equal footing.  Perhaps 
 write out the file with the first 4 bytes containing the number of lines 
 in that file and use that to construct the bias.  I'll leave those with 
 better math-fu than I to come up with a formula which would be fair.

(In similar situations I've put the number of lines right in the name of 
the temporary file. I kid you not! And it's O(1) too!)

It looks like any fragment should be chosen from with a probability 
proportional to its length. The lingering question for me is whether 
that probability needs to be updated, or can be fixed at the beginning 
of merging. I think fixed doesn't quite work unless special measures are 
taken at whenever a range is done.

Again many thanks to all participants for your illuminating insights.


Andrei

Oct 10 2008

Sergey Gromov <snake.scaly gmail.com> writes:

Fri, 10 Oct 2008 17:45:28 -0500,
Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 Sergey Gromov wrote:
 Fri, 10 Oct 2008 15:54:18 -0500,
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer 
 k lines in a file containing n lines, then read the first k lines 
 into memory, shuffle them, and write them out to a temporary file.  
 Repeat until the input is exhausted.  Now randomly pick two of the 
 temporary files and randomly merge them.  Repeat until two 
 temporary files remain, then output the result of the final random 
 merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were 
 in place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision 
 that nothing in a will appear in the first two thirds of the result. 
 So the quality of randomness suffers. How would you address that?

 After merging b and c you end up with a and bc.  Then you randomly 
 merge these two files and lines from a have all the chances to appear 
 anywhere within result.

 When randomly merging, the probability of picking a line from a file 
 should be proportional to a number of lines left in that file.

 How do you "randomly merge"? Describe the process.

 
 
 while( a.hasMore || b.hasMore )
 {
     if( rand % 2 )
         r ~= a.readLine;
     else
         r ~= b.readLine;
 }
 while( a.hasMore )
     r ~= a.readLine;
 while( b.hasMore )
     r ~= b.readLine;
 
 The downside to randomly choosing two files is that it's possible you 
 could end up always choosing a small and a large file, thus increasing 
 the chance that the large file will have a bunch of lines left over 
 which are essentially appended.

 
 Yah, Sergey got me thinking too. What if you have a large file that 
 doesn't fit in memory by exactly one line?
 
 Then I shuffle (n - 1) lines at random, and there remains the last line 
 to write. If I choose from either fragment with probability 0.5, the 
 last line will be with high probability among the first in the result!
 
 The easy way around this would be to maintain a list of files so only 
 equal-sized files are merged (similar to how mergesort works).  However, 
 I think a better way would be to use some sort of bias for the random 
 check above so that different-sized files are on equal footing.  Perhaps 
 write out the file with the first 4 bytes containing the number of lines 
 in that file and use that to construct the bias.  I'll leave those with 
 better math-fu than I to come up with a formula which would be fair.

 
 (In similar situations I've put the number of lines right in the name of 
 the temporary file. I kid you not! And it's O(1) too!)
 
 It looks like any fragment should be chosen from with a probability 
 proportional to its length. The lingering question for me is whether 
 that probability needs to be updated, or can be fixed at the beginning 
 of merging. I think fixed doesn't quite work unless special measures are 
 taken at whenever a range is done.
 
 Again many thanks to all participants for your illuminating insights.

Here's my little proof.  Paste this into wikipedia to see formulas.

Probability of a line from <math>a</math> getting into the first line of 
result:

<math>
p_a^1=\frac{n_a}{n_a+n_b}
</math>

Likewise for <math>b</math>:

<math>
p_b^1=\frac{n_b}{n_a+n_b}
</math>

Probability of a line from <math>a</math> becoming a second line of the 
output, <math>p_a^2</math>, is:

<math>
p_a^2=p_a^2\Big|_{a_1} + p_a^2\Big|_{b_1}
</math>

where <math>p_a^2\Big|_{a_1}</math> and <math>p_a^2\Big|_{b_1}</math> 
are probabilities of a line from <math>a</math> getting into position 
<math>2</math> when the first line is from <math>a</math> and from 
<math>b</math> respectively.

<math>
p_a^2\Big|_{a_1} = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a-1+n_b} = 
\frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}
</math>

<math>
p_a^2\Big|_{b_1} = \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac
{n_a n_b}{(n_a+n_b)(n_a+n_b-1)}
</math>

Therefore

<math>
p_a^2 = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{n_a n_b}{(n_a+n_b)
(n_a+n_b-1)} = \frac{n_a(n_a+n_b-1)}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a}
{n_a+n_b}
</math>

So at least <math>p_a^1=p_a^2</math>.  Induce anyone?

Oct 10 2008

Sean Kelly <sean invisibleduck.org> writes:

Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files 
 remain, then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 
 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

When a is merged with (b&c) then its lines will be interleaved randomly 
in the result.


Sean

Oct 10 2008

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Sean Kelly wrote:
 Andrei Alexandrescu wrote:
 Sean Kelly wrote:
 It would be slower than the seeking option, but something like a 
 randomized mergesort would work as well.  If the program can buffer k 
 lines in a file containing n lines, then read the first k lines into 
 memory, shuffle them, and write them out to a temporary file.  Repeat 
 until the input is exhausted.  Now randomly pick two of the temporary 
 files and randomly merge them.  Repeat until two temporary files 
 remain, then output the result of the final random merge to the screen.

 For small files (ie. where n<k) the file would be read, shuffled in 
 memory, and printed to the screen, assuming the proper checks were in 
 place.

 I think I found a problem with your solution. Consider you break the 
 file in three chunks: a, b, c. Then you pick at random b and c and 
 randomly merge them. The problem is, you make early the decision that 
 nothing in a will appear in the first two thirds of the result. So the 
 quality of randomness suffers. How would you address that?

 
 When a is merged with (b&c) then its lines will be interleaved randomly 
 in the result.

I think that works with Sergey's appropriate crooking of the dice. For 
the record, my solution involved generating k shuffled sub-files, and 
then roll a fair k-sided dice and write one line in the chosen file to 
the output, until all files are exhausted. Sergey showed that my 
solution is wrong unless it so happens the last fragment is equal to all 
others. Thanks!

Well are we ready for the next puzzle, or bored?

Andrei

Oct 10 2008

Russell Lewis <webmaster villagersonline.com> writes:

I've read most of the responses (not all) and haven't yet seen a
duplicate of my algorithm, so here it is.  You can debate whether the
cost of opening many little files is as bad as lseek or not...I don't know.

BEGIN CODE

void main()
{
  CreateAnEmptyTmpDirectoryAndChdirToIt();

  while(!feof(stdin))
  {
    char[] name = Random8CharacterFilename();

    if(!file_exists(name))
      WriteToFile(name, ReadALine(stdin));
    else
    {
      if(CoinFlip())
        WriteToFile("junk", ReadALine(stdin));
      else
      {
        rename(name, "junk");
        WriteToFile(name, ReadALine(stdin));
      }

      // shuffle "junk" to some non-duplicate name.  chains of
      // renames are allowed so as to prevent any biasing of
      // the odds towards lines earlier in the file.
      name = Random8CharacterFilename();
      while(file_exists(name))
      {
        if(CoinFlip())
        {
          rename(name, "tmp");
          rename("junk", name);
          rename("tmp", "junk")
        }
        name = Random8CharacterFilename();
      }
    }
  }

  char[][] names = GetSortedFilenames();
  foreach(name; names)
  {
    PrintFileContents(name);
    unlink(name);
  }

  DestroyTmpDir();
}

END CODE

Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.
 
 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.
 
 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.
 
 In case you hate random stuff, just wait; the next problem won't have 
 any random element in it :o).
 
 
 Andrei

Oct 13 2008

D Programming

C/C++ Programming

Other

digitalmars.D - shuffling lines in a stream