
digitalmars.D - Re: [OT] shuffling lines in a stream

reply Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.

Yes, that probability exists, but that doesn't imply you must load the entire file in memory. Andrei

What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. Some library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing? I look forward to the next challenge ;)
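Jason's index-then-display plan can be sketched roughly as follows (Python for illustration; the function name and structure are invented for this sketch, not anything from the thread). Pass one records each line's byte offset; pass two shuffles the offsets and seeks to each line in turn.

```python
import random

def indexed_shuffle(path, out_path, seed=None):
    """Two-pass shuffle: index each line's byte offset, then emit
    the lines in a random order via one seek per line."""
    rng = random.Random(seed)
    offsets = []
    with open(path, "rb") as f:       # pass 1: build the line index
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    rng.shuffle(offsets)              # Fisher-Yates over the index
    with open(path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:           # pass 2: one random seek per line
            f.seek(off)
            out.write(f.readline())
```

Note the index itself needs memory proportional to the number of lines, and the per-line random seeks are exactly the cost Andrei objects to in his reply.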
Oct 10 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 [snip: challenge quoted in full above]

Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.

Yes, that probability exists, but that doesn't imply you must load the entire file in memory. Andrei

What you want to optimize is still unclear to me.

At the end of the day, devise a means to take care of the task. Faster is better, occupying less space is better.
 At its heart, you must pass over the data twice: once to index it,
 and once to display it. Some library-based memory-mapped file or a
 seekable stream seems like enough to get the job done. Neither that
 bit nor the randomization of indices seems particularly interesting.
 What am I missing?

Your solution will be slow. Doing one random seek per line is too much, even in a memory-mapped file. Mapping into memory is not magic! We are talking about many millions of lines. Also you can't memory-map streams unless you create a full-blown copy first.
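For what it's worth, one seek-free strategy (a sketch under my own assumptions, not necessarily the solution Andrei has in mind) is to scatter lines uniformly at random into a handful of temporary bucket files during a single sequential pass, then shuffle each bucket in memory. If every bucket fits in RAM, which the "nonzero fraction fits in memory" guarantee lets you arrange by choosing enough buckets, the concatenated output is a uniform shuffle: it is equivalent to sorting by an i.i.d. random key whose high bits pick the bucket. All names below are invented for the sketch.

```python
import os
import random
import tempfile

def external_shuffle(path, out_path, n_buckets=8, seed=None):
    """Shuffle a file larger than RAM using only sequential I/O:
    scatter lines into temporary buckets, then shuffle each bucket."""
    rng = random.Random(seed)
    tmpdir = tempfile.mkdtemp()
    buckets = [open(os.path.join(tmpdir, "bucket%d" % i), "w+b")
               for i in range(n_buckets)]
    with open(path, "rb") as f:
        for line in f:                    # single sequential pass
            rng.choice(buckets).write(line)  # uniform scatter
    with open(out_path, "wb") as out:
        for b in buckets:
            b.seek(0)
            lines = b.readlines()         # each bucket must fit in RAM
            rng.shuffle(lines)            # in-memory shuffle per bucket
            out.writelines(lines)
            b.close()
```

Total temporary space equals the input size, and every read and write is sequential, so there are no per-line seeks.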
 I look forward to the next challenge ;)

I think you will be surprised by the juicy parts of this one. Andrei
Oct 10 2008
prev sibling parent reply BLS <nanali nospam-wanadoo.fr> writes:
Jason House schrieb:
 [snip: challenge and Jason's two-pass indexing reply, quoted in full above]

I think (I may be wrong) that the -normal distribution- is the key to optimizing speed. Let's say the stream Andrei is talking about is a book. Each sentence in this book consists of N characters (end of line is a DOT). Now we read the first ~10 pages, character by character. Based on this information we are able to say that each sentence has an average size of 100 characters, and the normal distribution looks similar to: no graphic possible :( so Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution

Indeed, NOT the solution Andrei is looking for, because indexing is still necessary. But due to the fact that finding the EOL/DOT is probably faster by creating a slice from position 90 to 110 (in sentence #1, +-10 percent from our average sentence size), this could already be a win.

Further: during the indexing process we can also adjust our first (base) normal distribution, as well as our +- tolerance.

Heck, it is late... hope this is not completely bullshi* Bjoern
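The prefix-sampling step of Bjoern's idea, estimating the average line length and a tolerance from the first few "pages", might look like this (a sketch; the function name and sample size are made up for illustration):

```python
import math

def line_length_stats(path, sample_bytes=64 * 1024):
    """Estimate the mean and standard deviation of line length
    from a prefix of the file."""
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    # Drop the trailing partial line; +1 counts the newline itself.
    lengths = [len(line) + 1 for line in sample.split(b"\n")[:-1]]
    n = len(lengths)
    mean = sum(lengths) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)
    return mean, std
```

With a mean of 100 and a +-10% tolerance, one would probe bytes 90-110 of each sentence for the terminator first, falling back to a wider scan on a miss.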
Oct 10 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
BLS wrote:
 [snip: earlier thread and Bjoern's normal-distribution suggestion, quoted in full above]

Of course, this relies on the assumption that sentences have normally-distributed lengths (which may or may not be true of any specific document). You might be interested in checking out something called a "kernel density estimator", which is the typical solution for estimating the shape of an unknown distribution based on a series of samples. Here's the relevant Wikipedia article: http://en.wikipedia.org/wiki/Kernel_density_estimation

I learned about this technique because it was the subject of a job interview question once. The question was something like: "Given a file containing 100 billion floating point values, what's the most effective algorithm for finding the median value? Your machine only has room for a billion values in memory at any given time, and making multiple passes over the file is very expensive."

In my solution, I built a histogram during the first pass, counting the number of entries in each bucket. At the end of the first pass, I could isolate the bucket containing the median, and then during the second pass I'd be able to definitively identify the median value. They pointed out several degenerate cases where my algorithm would collapse (because I always had to make assumptions about the shape of the distribution), and I left the interview stumped about the real answer.

When I got home, I did my research and found the KDE technique, which I've used on a few projects since then.

--benji
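Benji's interview answer, as described, might be sketched like this (Python; the names and the assumption of a known value range [lo, hi) are mine). Pass one histograms the data; pass two keeps only the values that fall in the median's bucket, which is where the shape-of-distribution assumption bites: that one bucket, too, must fit in memory.

```python
def histogram_median(read_values, lo, hi, n_buckets=1024):
    """Two-pass median: locate the median's histogram bucket in
    pass one, collect exactly that bucket in pass two.
    `read_values` is a zero-argument callable returning a fresh
    iterator over the data (one call per pass)."""
    width = (hi - lo) / n_buckets
    bucket_of = lambda v: min(int((v - lo) / width), n_buckets - 1)
    counts = [0] * n_buckets
    n = 0
    for v in read_values():              # pass 1: build the histogram
        counts[bucket_of(v)] += 1
        n += 1
    target = n // 2                      # 0-based rank of the (upper) median
    cum = 0
    for b, c in enumerate(counts):       # find the bucket holding that rank
        if cum + c > target:
            break
        cum += c
    in_bucket = sorted(v for v in read_values()  # pass 2: just that bucket
                       if bucket_of(v) == b)
    return in_bucket[target - cum]
```

A heavily skewed distribution can still dump nearly everything into one bucket, which is exactly the degenerate case the interviewers pointed out.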
Oct 10 2008
parent BLS <nanali nospam-wanadoo.fr> writes:
Benji Smith schrieb:
 [snip: Benji's reply and earlier thread, quoted in full above]

Thanks for the pointer, Benji. I should have a book about it somewhere... well, have to look another day. And yes, my idea is pretty weak. bjoern
Oct 10 2008