
digitalmars.D - Re: [OT] shuffling lines in a stream

reply Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Sergey Gromov wrote:
 Thu, 09 Oct 2008 23:42:47 -0500,
 Andrei Alexandrescu wrote:
 The high traffic enjoyed by "random k-sample of a file" confirms that 
 plenty of people around here have idle cycles ready to be spent on fun 
 tasks.

 In that vein, here's another problem. Say you want to randomly shuffle 
 lines in a stream. You don't know the size beforehand, and one pass is 
 preferable. The stream can be larger than the working memory, but not an 
 unbounded amount of times larger. Also lines in the stream can be 
 unbounded, but it is guaranteed that at least a nonzero fraction of the 
 file fits in working memory.

 You can use temporary files, but of course minimizing their total size 
 is recommended. Speed is of course important. Draft your solution and 
 explain why you think it's correct and advantageous.

Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.

Yes, that probability exists, but that doesn't imply you must load the entire file in memory. Andrei

What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. Some library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing? I look forward to the next challenge ;)
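Jason's index-then-display plan can be sketched roughly as follows (Python for illustration; the function name and structure are invented for this sketch, not anything from the thread). Pass one records each line's byte offset; pass two shuffles the offsets and seeks to each line in turn.

```python
import random

def indexed_shuffle(path, out_path, seed=None):
    """Two-pass shuffle: index each line's byte offset, then emit
    the lines in a random order via one seek per line."""
    rng = random.Random(seed)
    offsets = []
    with open(path, "rb") as f:       # pass 1: build the line index
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    rng.shuffle(offsets)              # Fisher-Yates over the index
    with open(path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:           # pass 2: one random seek per line
            f.seek(off)
            out.write(f.readline())
```

Note the index itself needs memory proportional to the number of lines, and the per-line random seeks are exactly the cost Andrei objects to in his reply.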
Oct 10 2008
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 [snip: challenge quoted in full above]

Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.

Yes, that probability exists, but that doesn't imply you must load the entire file in memory. Andrei

What you want to optimize is still unclear to me.

At the end of the day, devise a means to take care of the task. Faster is better, occupying less space is better.
 At its heart, you must pass over the data twice: once to index it,
 and once to display it. Some library-based memory-mapped file or a
 seekable stream seems like enough to get the job done. Neither that
 bit nor the randomization of indices seems particularly interesting.
 What am I missing?

Your solution will be slow. Doing one random seek per line is too much, even in a memory-mapped file. Mapping into memory is not magic! We are talking about many millions of lines. Also you can't memory-map streams unless you create a full-blown copy first.
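For what it's worth, one seek-free strategy (a sketch under my own assumptions, not necessarily the solution Andrei has in mind) is to scatter lines uniformly at random into a handful of temporary bucket files during a single sequential pass, then shuffle each bucket in memory. If every bucket fits in RAM, which the "nonzero fraction fits in memory" guarantee lets you arrange by choosing enough buckets, the concatenated output is a uniform shuffle: it is equivalent to sorting by an i.i.d. random key whose high bits pick the bucket. All names below are invented for the sketch.

```python
import os
import random
import tempfile

def external_shuffle(path, out_path, n_buckets=8, seed=None):
    """Shuffle a file larger than RAM using only sequential I/O:
    scatter lines into temporary buckets, then shuffle each bucket."""
    rng = random.Random(seed)
    tmpdir = tempfile.mkdtemp()
    buckets = [open(os.path.join(tmpdir, "bucket%d" % i), "w+b")
               for i in range(n_buckets)]
    with open(path, "rb") as f:
        for line in f:                    # single sequential pass
            rng.choice(buckets).write(line)  # uniform scatter
    with open(out_path, "wb") as out:
        for b in buckets:
            b.seek(0)
            lines = b.readlines()         # each bucket must fit in RAM
            rng.shuffle(lines)            # in-memory shuffle per bucket
            out.writelines(lines)
            b.close()
```

Total temporary space equals the input size, and every read and write is sequential, so there are no per-line seeks.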
 I look forward to the next challenge ;)

I think you will be surprised by the juicy parts of this one. Andrei
Oct 10 2008
prev sibling parent reply BLS <nanali nospam-wanadoo.fr> writes:
Jason House schrieb:
 [snip: challenge and Jason's two-pass indexing reply, quoted in full above]

I think (I may be wrong) that the -normal distribution- is the key to optimizing speed. Let's say the stream Andrei is talking about is a book. Each sentence in this book consists of N characters (end of line is a DOT). Now we read the first ~10 pages, character by character. Based on this information we are able to say that each sentence has an average size of 100 characters, and the normal distribution looks similar to: no graphic possible :( so Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution

Indeed, NOT the solution Andrei is looking for, because indexing is still necessary. But due to the fact that finding the EOL/DOT is probably faster by creating a slice from position 90 to 110 (in sentence #1, +-10 percent from our average sentence size), this could already be a win.

Further: during the indexing process we can also adjust our first (base) normal distribution, as well as our +- tolerance.

Heck, it is late... hope this is not completely bullshi* Bjoern
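The prefix-sampling step of Bjoern's idea, estimating the average line length and a tolerance from the first few "pages", might look like this (a sketch; the function name and sample size are made up for illustration):

```python
import math

def line_length_stats(path, sample_bytes=64 * 1024):
    """Estimate the mean and standard deviation of line length
    from a prefix of the file."""
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    # Drop the trailing partial line; +1 counts the newline itself.
    lengths = [len(line) + 1 for line in sample.split(b"\n")[:-1]]
    n = len(lengths)
    mean = sum(lengths) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)
    return mean, std
```

With a mean of 100 and a +-10% tolerance, one would probe bytes 90-110 of each sentence for the terminator first, falling back to a wider scan on a miss.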
Oct 10 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
BLS wrote:
 [snip: earlier thread and Bjoern's normal-distribution suggestion, quoted in full above]

Of course, this relies on the assumption that sentences have normally-distributed lengths (which may or may not be true of any specific document). You might be interested in checking out something called a "kernel density estimator", which is the typical solution for estimating the shape of an unknown distribution based on a series of samples. Here's the relevant Wikipedia article: http://en.wikipedia.org/wiki/Kernel_density_estimation

I learned about this technique because it was the subject of a job interview question once. The question was something like: "Given a file containing 100 billion floating point values, what's the most effective algorithm for finding the median value? Your machine only has room for a billion values in memory at any given time, and making multiple passes over the file is very expensive."

In my solution, I built a histogram during the first pass, counting the number of entries in each bucket. At the end of the first pass, I could isolate the bucket containing the median, and then during the second pass I'd be able to definitively identify the median value. They pointed out several degenerate cases where my algorithm would collapse (because I always had to make assumptions about the shape of the distribution), and I left the interview stumped about the real answer.

When I got home, I did my research and found the KDE technique, which I've used on a few projects since then.

--benji
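Benji's interview answer, as described, might be sketched like this (Python; the names and the assumption of a known value range [lo, hi) are mine). Pass one histograms the data; pass two keeps only the values that fall in the median's bucket, which is where the shape-of-distribution assumption bites: that one bucket, too, must fit in memory.

```python
def histogram_median(read_values, lo, hi, n_buckets=1024):
    """Two-pass median: locate the median's histogram bucket in
    pass one, collect exactly that bucket in pass two.
    `read_values` is a zero-argument callable returning a fresh
    iterator over the data (one call per pass)."""
    width = (hi - lo) / n_buckets
    bucket_of = lambda v: min(int((v - lo) / width), n_buckets - 1)
    counts = [0] * n_buckets
    n = 0
    for v in read_values():              # pass 1: build the histogram
        counts[bucket_of(v)] += 1
        n += 1
    target = n // 2                      # 0-based rank of the (upper) median
    cum = 0
    for b, c in enumerate(counts):       # find the bucket holding that rank
        if cum + c > target:
            break
        cum += c
    in_bucket = sorted(v for v in read_values()  # pass 2: just that bucket
                       if bucket_of(v) == b)
    return in_bucket[target - cum]
```

A heavily skewed distribution can still dump nearly everything into one bucket, which is exactly the degenerate case the interviewers pointed out.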
Oct 10 2008
parent BLS <nanali nospam-wanadoo.fr> writes:
Benji Smith schrieb:
 [snip: Benji's reply and earlier thread, quoted in full above]

Thanks for the pointer, Benji. I should have a book about it somewhere... well, have to look another day. And yes, my idea is pretty weak. bjoern
Oct 10 2008