## digitalmars.D - topN using a heap

- Andrei Alexandrescu (17/17) Jan 16 2016 https://github.com/D-Programming-Language/phobos/pull/3934
- Ivan Kazmenko (21/32) Jan 16 2016 Yeah, this sounds nice. Suppose we have an array of length n and
- Timon Gehr (3/8) Jan 16 2016 The implementation falls back to topNHeap whenever k is within the first...
- Andrei Alexandrescu (3/13) Jan 16 2016 Correct. The pedantic approach would be to only do that when k is up to
- Timon Gehr (7/21) Jan 16 2016 Ivan's analysis suggests that even something significantly larger, like
- Andrei Alexandrescu (5/7) Jan 16 2016 I'm not clear on how you got to that boundary. There are a few
- Ilya Yaroshenko (4/12) Jan 16 2016 size_t.sizeof * 8 - bsr(n) should be good approximation for
- Ivan Kazmenko (10/18) Jan 17 2016 The average case is O(n + (k log n log k)) for small enough k.
- Andrei Alexandrescu (12/14) Jan 17 2016 I don't understand how you derived the average case complexity, and I
- Ivan Kazmenko (42/58) Jan 17 2016 Yeah, I have the same notion.
- Andrei Alexandrescu (10/11) Jan 17 2016 OK, very nice. Thanks! I've modified topN to work as follows. In a loop:
- Ivan Kazmenko (27/39) Jan 18 2016 Here goes the test which shows quadratic behavior for the new
- Ivan Kazmenko (39/51) Jan 18 2016 Perhaps I should include a textual summary as well.
- Ivan Kazmenko (5/9) Jan 18 2016 Correction: this is the result of removing a uniform call in pull
- Ilya (3/12) Jan 18 2016 A RNGs don't improve worst case. It only changes an permutation
- Ivan Kazmenko (4/6) Jan 18 2016 Still, use of RNG makes it impossible to construct the worst case
- Andrei Alexandrescu (2/7) Jan 18 2016 BTW I forgot to thank you for this analysis. This is good stuff. -- Andr...
- Ilya (3/9) Jan 18 2016 No, it is definitely possible, because RNGs are Pseudo-RNGs.
- Andrei Alexandrescu (3/11) Jan 18 2016 unpredictableSeed uses the system clock as a source of randomness, so
- Ilya (3/20) Jan 18 2016 Would this work for pure functions? --Ilya
- Timon Gehr (2/17) Jan 18 2016 Only if they accept the RNG as an additional argument.
- Ilya (3/20) Jan 18 2016 Exactly :) So, I hope we would not add Pseudo-RNGs in
- Andrei Alexandrescu (27/42) Jan 18 2016 Of course not. I think this back-and-forth takes away from the gist of
- Ilya (7/61) Jan 18 2016 (a) Hope no
- Ivan Kazmenko (16/17) Jan 18 2016 No, sort utilizes Introsort (Quicksort but switch to Heapsort if
- Andrei Alexandrescu (3/18) Jan 19 2016 Do you think sort and topN would be attackable if they used a
- Ivan Kazmenko (17/20) Jan 19 2016 Yes, but with a little interactivity (not generating the input in
- Andrei Alexandrescu (3/9) Jan 20 2016 [snip]
- Xinok (29/59) Jan 18 2016 To be clear, sort is NOT attackable. Introsort is used for
- Andrei Alexandrescu (11/28) Jan 19 2016 Sorry, my bad. Could you please point to the code doing the
- Ivan Kazmenko (19/22) Jan 18 2016 Since point 4 is in fact already fixed a good while ago, my
- Andrei Alexandrescu (10/29) Jan 19 2016 Yah, I also think heap/double-heap topN is better; median-of-medians
- Andrei Alexandrescu (7/8) Jan 19 2016 Destroy! I made it part of
- Timon Gehr (4/10) Jan 19 2016 The switching heuristic is bad. It always switches after at most 8 steps...
- Andrei Alexandrescu (4/15) Jan 19 2016 My algebra is completely rotten. Thanks! Updated
- Jon Degenhardt (25/31) Sep 21 2016 Not completely clear from this thread what the conclusion was wrt
- Andrei Alexandrescu (3/10) Sep 21 2016 I have it on my list to move https://arxiv.org/abs/1606.00484 into
- Jon Degenhardt (5/18) Sep 21 2016 Very good, thanks. It'll be interesting to see how this algorithm
- Andrei Alexandrescu (4/20) Sep 21 2016 Preliminary results indicate that QuickselectAdaptive is about 10x
- Andrei Alexandrescu (2/2) Sep 23 2016 Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which...
- Stefan Koch (3/6) Sep 23 2016 annotate the templates.
- Andrei Alexandrescu (9/14) Sep 23 2016 I don't want to lose the deduction. I used another workaround (enumerate...
- Jon Degenhardt (6/10) Sep 23 2016 That's a very nice result. Both the absolute numbers and that all
- Andrei Alexandrescu (129/132) Sep 24 2016 Got curious so I tried to patch my ldc installation (0.17.1 (DMD
- Jon Degenhardt (18/33) Sep 25 2016 I added the topN pull request to a local LDC build (current LDC
- Andrei Alexandrescu (2/35) Sep 25 2016 Thanks for this work! -- Andrei
- Andrei Alexandrescu (2/7) Sep 23 2016 Nagonna work for generic functions -- Andrei
- deadalnix (9/11) Jan 18 2016 I got problem with that even when crytographically secure
- Andrei Alexandrescu (3/12) Jan 18 2016 How would this translate to a matter of selecting the pivot during sort?...
- deadalnix (4/6) Jan 18 2016 A large chunk of a given datacenter going quadratic at the same
- Andrei Alexandrescu (14/26) Jan 18 2016 Well it does improve things. The probability of hitting the worst case
- Ilya (7/41) Jan 18 2016 1. Yes, probability of hitting the worst case repeatedly is is
- Timon Gehr (2/8) Jan 18 2016 You also need to predict the seed. How do you do that?
- Ilya (3/13) Jan 18 2016 We can not use unpredictable seed (like system clock) in pure
- Timon Gehr (4/17) Jan 18 2016 Clearly. The point is, there already was an impure implementation of
- Ilya (3/25) Jan 18 2016 There is already implementation with predictable seed. Proof:
- Timon Gehr (3/7) Jan 18 2016 The default RNG is seeded with unpredictableSeed. What is the point you
- Ilya (3/11) Jan 18 2016 unpredictableSeed is initialized only once and can be easily
- Andrei Alexandrescu (9/22) Jan 18 2016 https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/rando...
- Timon Gehr (2/29) Jan 18 2016 https://en.wikipedia.org/wiki/Introselect
- Ivan Kazmenko (19/34) Jan 18 2016 Possible, but only if the seed and previous usage of victim
- Timon Gehr (3/5) Jan 18 2016 Yes, it selected the pivot uniformly at random using the global RNG.
- Ivan Kazmenko (28/44) Jan 17 2016 I'd suggest being able to switch between implementations.
- Andrei Alexandrescu (21/26) Jan 17 2016 A nice idea, but it seems a bit overengineered. The differences among
- Ziad Hatahet via Digitalmars-d (3/4) Jan 16 2016 Why not the first 10?
- Andrei Alexandrescu (2/6) Jan 16 2016 So you get to put the appropriate element in the 11th position. -- Andre...
- Andrei Alexandrescu (4/14) Jan 16 2016 To clarify this using a degenerate case: say someone calls topN(r, 0)
- Ivan Kazmenko (4/6) Jan 16 2016 And why this? Do we additionally require the k-th element to
- Andrei Alexandrescu (2/7) Jan 16 2016 Yes. -- Andrei
- deadalnix (4/23) Jan 17 2016 A common way to do it is to go quicksort, but only recurse on one
- Andrei Alexandrescu (4/6) Jan 17 2016 Yah, that's quickselect (which this work started from). It's linear, and...
- deadalnix (3/10) Jan 17 2016 Yeah; forget about me, I was dumb on that one.

https://github.com/D-Programming-Language/phobos/pull/3934 So, say you're looking for the smallest 10 elements out of 100_000. The quickselect algorithm (which topN currently uses) will successively partition the set in (assuming the pivot choice works well) 50_000, 25_000, etc chunks all the way down to finding the smallest 10. That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10: 1. Organize the first 11 elements into a max heap 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property. 3. At the end, swap the largest in the heap with the 10th and you're done! This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often. Destroy! Andrei

Jan 16 2016

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10: 1. Organize the first 11 elements into a max heap 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property. 3. At the end, swap the largest in the heap with the 10th and you're done! This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often.Yeah, this sounds nice. Suppose we have an array of length n and want to find the smallest k elements. The first step (heapify) is O(k) operations. 1. Let us analyze the rest for "small enough" k. 1a. If the array is "random", the expected number of insertions is the sum of probabilities that each new element is inserted. This number is (k/(k+1)) + (k/(k+2)) + ... + k/n, which is k times a part of harmonic series, that is, on the order of k log n. In each case, O(log k) operations are required to process the new element. In total, the "average" case takes (n-k) + (k log n log k), a total of O(n) for small enough values of k. 1b. In the worst case (each element is less than the current top), this still requires (n-k) log k operations, so we can say the total is O(n log k). 2. For k of the same order as n, the algorithm is O(n log n), just as a regular HeapSort. To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.

Jan 16 2016

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average....To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.

Jan 16 2016

On 1/16/16 8:00 PM, Timon Gehr wrote:On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:Correct. The pedantic approach would be to only do that when k is up to log(n). -- AndreiOn Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average....To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.

Jan 16 2016

On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:On 1/16/16 8:00 PM, Timon Gehr wrote:Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:Correct. The pedantic approach would be to only do that when k is up to log(n). -- AndreiOn Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average....To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.

Jan 16 2016

On 1/16/16 9:37 PM, Timon Gehr wrote:Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k.I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei

Jan 16 2016

On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu wrote:On 1/16/16 9:37 PM, Timon Gehr wrote:size_t.sizeof * 8 - bsr(n) should be good approximation for sorting. --IlyaIvan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k.I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei

Jan 16 2016

On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu wrote:On 1/16/16 9:37 PM, Timon Gehr wrote:The average case is O(n + (k log n log k)) for small enough k. So, any k below roughly n / log^2 (n) makes the second summand less than the first. Of course, the constant not present in asymptotic analysis, as well as handling the worst case, may make this notion impractical. Measure is the way :) . The worst case is just a decreasing sequence, or almost decreasing if we want to play against branch prediction as well.Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k.I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei

Jan 17 2016

On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:The average case is O(n + (k log n log k)) for small enough k. So, any k below roughly n / log^2 (n) makes the second summand less than the first.I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k. Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct? If k is any fraction of n, the worst case complexity comes to O(n + n log n) i.e. O(n log n). By your calculation, the average complexity comes to O(n log^2 n), i.e. greater than the worst case. So there's a mistake somewhere. Could you please clarify matters? Thanks! Andrei

Jan 17 2016

On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu wrote:On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:Yeah, I have the same notion.The average case is O(n + (k log n log k)) for small enough k. So, any k below roughly n / log^2 (n) makes the second summand less than the first.I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k. Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct?If k is any fraction of n, the worst case complexity comes to O(n + n log n) i.e. O(n log n). By your calculation, the average complexity comes to O(n log^2 n), i.e. greater than the worst case. So there's a mistake somewhere.The upper bound O(n + k log k log n) is correct, but it's tight only for small k. I'll explain below.Could you please clarify matters? Thanks!Here is a more verbose version. Suppose for simplicity that the array consists real numbers of the same continuous random distribution. (A discrete random distribution would complicate the analysis a bit, since the probability of two elements being equal would become greater than zero.) Then, element a_i is the minimum of a_1, a_2, ..., a_i with probability 1/i. Moreover, it is the second minimum with probability 1/i, the third minimum with probability 1/i, ..., the maximum of a_1, a_2, ..., a_i with the same probability 1/i. To prove this, we can show that, if we look at the ordering permutation p_1, p_2, ..., p_i such that of a_{p_1} < a_{p_2} < ... a_{p_i}, each of the i! possible permutations is equally probable. What immediately follows is that when we track k smallest elements and consider element i>k, the probability that it will change the k smallest elements encountered so far is k/i. Summing that for i from k+1 to n, we get the expected number of heap insertions: (k/(k+1)) + (k/(k+2)) + ... + k/n. Now, this is just k multiplied by: (1/1 + 1/2 + 1/3 + ... + 1/n) *minus* (1/1 + 1/2 + 1/3 + ... + 1/k). As these are harmonic series, the first line is asymptotically equal to log n, and the second line asymptotically equal to log k. To summarize, the expected number of heap insertions for an array of random reals is k (log n - log k), and the complexity is O(n + k log k (log n - log k)). What I did here for simplicity is just omit the "minus logarithm of k" part to get an upper bound. For small enough k, the bound is tight. For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) / 2. On the other hand, when k=Theta(n), log n - log k = log(n/k) = Theta(1). Ivan Kazmenko.

Jan 17 2016

On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:Here is a more verbose version.OK, very nice. Thanks! I've modified topN to work as follows. In a loop: * If nth <= r.length / log2(r.length)^^2 (or is similarly close to r.length), use topNHeap with one heap and stop * If nth <= r.length / log2(r.length) (or is similarly close to r.length), use topNHeap with two heaps and stop * Take a quickselect step, adjust r and nth, and continue That seems to work nicely for all values involved. All - let me know how things can be further improved. Thx! Andrei

Jan 17 2016

On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu wrote:On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:Here goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task) My local timings with "-O -release -noboundscheck -inline": building Element array: 4989 msecs checking Element array: 5018 msecs checking int array: 626 msecs With "-debug -unittest": building Element array: 8384 msecs checking Element array: 8380 msecs checking int array: 2987 msecs If we change the length MAX_N to something observable, say, 50, the array is: [0, 536870911, 2, 536870911, 4, 536870911, 6, 36, 8, 33, 10, 35, 12, 536870911, 14, 32, 16, 34, 18, 536870911, 536870911, 22, 23, 21, 20, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 31, 30, 536870911, 29, 536870911, 28, 536870911, 27, 536870911, 26, 536870911, 25, 536870911, 24, 536870911] The old version (2.070.0-b2) could not be tricked with it, does it use random? The inspiration is the paper "A Killer Adversary for Quicksort": http://www.cs.dartmouth.edu/~doug/mdmspe.pdf (I already mentioned it on the forums a while ago) Ivan Kazmenko.Here is a more verbose version.OK, very nice. Thanks! I've modified topN to work as follows. In a loop: * If nth <= r.length / log2(r.length)^^2 (or is similarly close to r.length), use topNHeap with one heap and stop * If nth <= r.length / log2(r.length) (or is similarly close to r.length), use topNHeap with two heaps and stop * Take a quickselect step, adjust r and nth, and continue That seems to work nicely for all values involved. All - let me know how things can be further improved. Thx! Andrei

Jan 18 2016

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu wrote:Perhaps I should include a textual summary as well. The code on DPaste starts by constructing an array of Elements of size MAX_N; in the code, MAX_N is 50_000. After that, we run the victim function on our array. Here, the victim is topN (array, MAX_N / 2); it could be sort (array) or something else. An Element contains, or rather, pretends to contain, an int value. Here is how Element is special: the result of comparison for two Elements is decided on-the-fly. An Element can be either UNDECIDED or have a fixed value. Initially, all elements are UNDECIDED. When we compare two Elements and at least one of them has a fixed value, the comparison is resolved naturally, and UNDECIDED element is greater than any fixed element. When we compare two UNDECIDED elements, the one which participated more in the comparisons so far gets a fixed value: greater than any other value fixed so far, but still less than UNDECIDED. This way, the results of old comparisons are consistent with the new fixed value. Now, what do we achieve by running the victim function? Turns out that the algorithms using the idea of QuickSort or QuickSelect tend to make most comparisons against their current pivot value. Our Element responds to that by fixing the pivot to one of the lowest available values. After that, a partition using such pivot will have only few, O(1), elements before the pivot, and the rest after the pivot. In total, this will lead to quadratic performance. After running the victim function on our array of Elements (which - careful here - already takes quadratic time), we reorder them in their original order (to do that, each Element also stores its original index). Now, we can re-run the algorithm on the array obtained so far. If the victim function is (strongly) pure, it will inevitably make the same comparisons in the same order. The only difference is that their result will already be decided. Alternatively, we can make an int array of the current values in our array of Elements (also in their original order). Running the victim function on the int array must also make the same (quadratic number of) comparisons in the same order. Ivan Kazmenko.All - let me know how things can be further improved. Thx!Here goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task) The inspiration is the paper "A Killer Adversary for Quicksort": http://www.cs.dartmouth.edu/~doug/mdmspe.pdf (I already mentioned it on the forums a while ago) Ivan Kazmenko.

Jan 18 2016

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:Here goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task)Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.

Jan 18 2016

On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:A RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaHere goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task)Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.

Jan 18 2016

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:A RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On 1/18/16 6:27 PM, Ivan Kazmenko wrote:On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:BTW I forgot to thank you for this analysis. This is good stuff. -- AndreiA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:No, it is definitely possible, because RNGs are Pseudo-RNGs. --IlyaA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On 1/18/16 6:44 PM, Ilya wrote:On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- AndreiA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:On 1/18/16 6:44 PM, Ilya wrote:Would this work for pure functions? --IlyaOn Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- AndreiA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On 01/19/2016 12:51 AM, Ilya wrote:On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:Only if they accept the RNG as an additional argument.On 1/18/16 6:44 PM, Ilya wrote:Would this work for pure functions? --IlyaA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On Monday, 18 January 2016 at 23:55:38 UTC, Timon Gehr wrote:On 01/19/2016 12:51 AM, Ilya wrote:Exactly :) So, I hope we would not add Pseudo-RNGs in std.algorithm. --IlyaOn Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:Only if they accept the RNG as an additional argument.On 1/18/16 6:44 PM, Ilya wrote:Would this work for pure functions? --IlyaOn Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei[...]No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya

Jan 18 2016

On 1/18/16 6:51 PM, Ilya wrote:On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened: 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses. 2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression. 3. So I made getPivot use just a bit of randomness taken from the length of the range. 4. sort was and is attackable before all of these changes 5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable. 6. PRNGs don't have any other downside than making functions impure. The way I see it we have these solutions: (a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades. (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all. (c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern. AndreiA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:On 1/18/16 6:51 PM, Ilya wrote:(a) Hope no (b) Yes (c) Memory addresses may not work with user defined ranges and hashes could be slow. (d) Make PRNGs optional. --IlyaA RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaStill, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.

Jan 18 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:4. sort was and is attackable before all of these changesNo, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse too deep): see https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/so ting.d#L1048-L1053. This improvement happened a year or two ago, when algorithm.d was not even separated into sub-modules. My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits that by not being able to slow the sort down. But, if I comment out line 1053 to disable the switching, things go quadratic as expected. L1053: depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3; The same remedy can help for topN: if we called partition more than log (length) in base (3/2) times, switch to the heap implementation of topN regardless of whether n is close to the edge. Ivan Kazmenko.

Jan 18 2016

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- Andrei4. sort was and is attackable before all of these changesNo, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse too deep): see https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053. This improvement happened a year or two ago, when algorithm.d was not even separated into sub-modules. My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits that by not being able to slow the sort down. But, if I comment out line 1053 to disable the switching, things go quadratic as expected. L1053: depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3; The same remedy can help for topN: if we called partition more than log (length) in base (3/2) times, switch to the heap implementation of topN regardless of whether n is close to the edge.

Jan 19 2016

On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote:On 01/18/2016 09:21 PM, Ivan Kazmenko wrote: Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- AndreiYes, but with a little interactivity (not generating the input in advance) and more labor (finding the state of RNG). Suppose the adversary has the means to generate an input for sort/topN, inspect the output, and then generate a bad input. With the current (not-secure) RNGs, from the first two actions, they can learn the current state of RNG: with sort, the order of elements which compare as equal depends on the choice of pivot, and therefore exposes the random bits involved; with topN, the order of everything except what topN guarantees does the same thing. After that, generating a bad input is definitely possible, since everything is known and deterministic at this point. The previously discussed method can be used, or something more clever if they want to make it fast. Ivan Kazmenko.

Jan 19 2016

On 01/19/2016 08:08 PM, Ivan Kazmenko wrote:On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote:[snip] Thanks! -- AndreiOn 01/18/2016 09:21 PM, Ivan Kazmenko wrote: Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- AndreiYes, but with a little interactivity (not generating the input in advance) and more labor (finding the state of RNG).

Jan 20 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened: 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses. 2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression. 3. So I made getPivot use just a bit of randomness taken from the length of the range. 4. sort was and is attackable before all of these changes 5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable. 6. PRNGs don't have any other downside than making functions impure. The way I see it we have these solutions: (a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades. (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all. (c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern. AndreiTo be clear, sort is NOT attackable. Introsort is used for unstable sorting which begins with quick sort but falls back to heap sort after too many recursions to maintain O(n log n) running time. Timsort is used for stable sorting which is a variant of merge sort but still runs in O(n log n) time. In either case, you're guaranteed to have O(n log n) running time in the worst case. On the other hand, someone can improve upon quick sort by choosing better pivots but be careful not to add too much overhead. A couple years ago, I tried choosing the pivot from a median of five but the result was as much as 10% slower. One idea is to begin initially choosing better pivots, but once the sublists fall below a certain length, simply choose the pivot from a median of three to avoid the extra overhead. As for topN, there are two approaches I'm aware of that are deterministic without resorting to impurities or RNGs. The first is introselect which is similar to introsort in that it has a fall back algorithm. The other approach is to choose the pivot from a median of medians. My idea is a variant of the latter approach: Begin by choosing the pivot from a median of five. If you continuously choose bad pivots, take a larger sample of elements to choose the pivot from. Keep growing the sample as necessary. Speaking of RNGs, they're technically pure as long as you always use the same seed. So what if we generated a random seed at the start of the process, but then reused that same seed over and over in pure functions for the duration of the process?

Jan 18 2016

On 01/18/2016 09:42 PM, Xinok wrote:To be clear, sort is NOT attackable. Introsort is used for unstable sorting which begins with quick sort but falls back to heap sort after too many recursions to maintain O(n log n) running time. Timsort is used for stable sorting which is a variant of merge sort but still runs in O(n log n) time. In either case, you're guaranteed to have O(n log n) running time in the worst case.Sorry, my bad. Could you please point to the code doing the introspection? I might do the same in topN.On the other hand, someone can improve upon quick sort by choosing better pivots but be careful not to add too much overhead. A couple years ago, I tried choosing the pivot from a median of five but the result was as much as 10% slower. One idea is to begin initially choosing better pivots, but once the sublists fall below a certain length, simply choose the pivot from a median of three to avoid the extra overhead.That's what https://github.com/D-Programming-Language/phobos/pull/3922 and in my measurements it's about as fast or faster than the current. Could you please re-run your measurements against that PR?Speaking of RNGs, they're technically pure as long as you always use the same seed. So what if we generated a random seed at the start of the process, but then reused that same seed over and over in pure functions for the duration of the process?That's a great idea. The way D is defined, it already implies that "immutable" is process-immutable, not immutable across runs. Consider http://dpaste.dzfl.pl/2fa5baf0c149. Very nice insights, Xinok. Thanks! Andrei

Jan 19 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:4. sort was and is attackable before all of these changes (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.Since point 4 is in fact already fixed a good while ago, my suggestion would be (b): to do the same for topN. Introselect (https://en.wikipedia.org/wiki/Introselect), already mentioned in this thread, uses median-of-medians to achieve worst case O(n) performance if we recurse too deep. There is already an issue suggesting to implement it: https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement deterministic topN). In fact, the O(n log n) heap approach as it is implemented now could be faster than O(n) median-of-medians approach on reasonable inputs. So, someone will have to implement median-of-medians and, well, measure. At the very least, googling for "median of medians in practice" and such yields the tag wiki from StackOverflow.com: "The constant factor in the O(n) is large, and the algorithm is not commonly used in practice.". Ivan Kazmenko.

Jan 18 2016

On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:Yah, I also think heap/double-heap topN is better; median-of-medians (MoM) is theoretically nice but practically less so. Though there are two things that could help MoM in D: (a) median of five (core part of MoM) is relatively cheap to compute with six comparisons; (b) we can use stride() to compute median without needing to copy a fifth of the input or to complicate the algorithm - just recurse on the stride! All in all this might be a good algo to implement and test. So anyhow we need the "intro" part. I'll get to that soon. Andrei4. sort was and is attackable before all of these changes (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.Since point 4 is in fact already fixed a good while ago, my suggestion would be (b): to do the same for topN. Introselect (https://en.wikipedia.org/wiki/Introselect), already mentioned in this thread, uses median-of-medians to achieve worst case O(n) performance if we recurse too deep. There is already an issue suggesting to implement it: https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement deterministic topN). In fact, the O(n log n) heap approach as it is implemented now could be faster than O(n) median-of-medians approach on reasonable inputs. So, someone will have to implement median-of-medians and, well, measure. At the very least, googling for "median of medians in practice" and such yields the tag wiki from StackOverflow.com: "The constant factor in the O(n) is large, and the algorithm is not commonly used in practice.".

Jan 19 2016

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:So anyhow we need the "intro" part. I'll get to that soon.Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 One interesting side topic would be how to avoid the use of goto without losing clarity. Andrei

Jan 19 2016

On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:The switching heuristic is bad. It always switches after at most 8 steps. I'd just use the second heuristic discussed in https://en.wikipedia.org/wiki/Introselect .So anyhow we need the "intro" part. I'll get to that soon.Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 ...

Jan 19 2016

On 01/19/2016 01:27 PM, Timon Gehr wrote:On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:My algebra is completely rotten. Thanks! Updated https://github.com/andralex/phobos/commit/cdd59b51a397ee1a4584e55c07d921c59acb5978. AndreiOn 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:The switching heuristic is bad. It always switches after at most 8 steps.So anyhow we need the "intro" part. I'll get to that soon.Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 ...

Jan 19 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:So let me summarize what has happened: 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses. [snip]Not completely clear from this thread what the conclusion was wrt getting known topN performance issues addressed. From pull requests it appears identified fixes are in current release versions of the DMD/LDC. However, I hit significant issues on one of the first large data sets I tried. Not an artificial data, but one with very skewed distributions of values (a google ngram file). Details here: https://issues.dlang.org/show_bug.cgi?id=16517. Includes test program, url for the ngram file. A brief summary - Data file is a TSV file with 3 numeric fields, a bit over 10 million values each with different distribution properties. Used both topN and sort get the median value. Since this was median, it topN for the mid-point value, not at one end or the other. (This is a specific callout for some of the issues identified.) Timing comparison of sort and topN, times in milliseconds: sort topN Field 2: 289 1756 Field 3: 285 148793 Field 4: 273 668906 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.

Sep 21 2016

On 9/21/16 4:16 AM, Jon Degenhardt wrote:Timing comparison of sort and topN, times in milliseconds: sort topN Field 2: 289 1756 Field 3: 285 148793 Field 4: 273 668906 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.I have it on my list to move https://arxiv.org/abs/1606.00484 into Phobos. Thanks for the data! -- Andrei

Sep 21 2016

On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote:On 9/21/16 4:16 AM, Jon Degenhardt wrote:Very good, thanks. It'll be interesting to see how this algorithm does on this data set. --JonTiming comparison of sort and topN, times in milliseconds: sort topN Field 2: 289 1756 Field 3: 285 148793 Field 4: 273 668906 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.I have it on my list to move https://arxiv.org/abs/1606.00484 into Phobos. Thanks for the data! -- Andrei

Sep 21 2016

On 09/21/2016 01:37 PM, Jon Degenhardt wrote:On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote:Preliminary results indicate that QuickselectAdaptive is about 10x faster than sort, net of data reading overheads, on the second-field data set. I'll proceed with submitting the algorithm in Phobos. -- AndreiOn 9/21/16 4:16 AM, Jon Degenhardt wrote:Very good, thanks. It'll be interesting to see how this algorithm does on this data set.Timing comparison of sort and topN, times in milliseconds: sort topN Field 2: 289 1756 Field 3: 285 148793 Field 4: 273 668906 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.I have it on my list to move https://arxiv.org/abs/1606.00484 into Phobos. Thanks for the data! -- Andrei

Sep 21 2016

Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei

Sep 23 2016

On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andreiannotate the templates.

Sep 23 2016

On 09/23/2016 11:40 AM, Stefan Koch wrote:On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:I don't want to lose the deduction. I used another workaround (enumerate the potential unsafe operations under an if (false) and then forward to a trusted function) in https://github.com/dlang/phobos/pull/4815. Reviews are welcome! BTW, as I commented in https://issues.dlang.org/show_bug.cgi?id=16517, using the new topN implementation instead of sort to compute the median over google's 1-grams is over 11x faster using dmd. AndreiWork is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andreiannotate the templates.

Sep 23 2016

On Friday, 23 September 2016 at 16:09:18 UTC, Andrei Alexandrescu wrote:BTW, as I commented in https://issues.dlang.org/show_bug.cgi?id=16517, using the new topN implementation instead of sort to compute the median over google's 1-grams is over 11x faster using dmd.That's a very nice result. Both the absolute numbers and that all three sets achieve similar performance, as they different distribution characteristics. --Jon

Sep 23 2016

On 09/23/2016 04:45 PM, Jon Degenhardt wrote:That's a very nice result. Both the absolute numbers and that all three sets achieve similar performance, as they different distribution characteristics.Got curious so I tried to patch my ldc installation (0.17.1 (DMD v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d uses the new syntax in various places. Probably the best baseline is the equivalent C++ code using the mature implementation of nth_element, see http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end of this message). Here's what I got: http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below. The topN code produced with dmd is faster in virtually all cases (has parity for -f 4, which I suspect is a degenerate case with most elements equal, which exercises only a small part of the algorithm). For C++ I used -O3 -DNDEBUG, any other flags I should use? As an aside, I enjoy poking fun at the stunning inefficiency of iostreams in my public talks; they delivered again :o). Andrei =========================== shell $ g++ -O3 -DNDEBUG -omedian_by_sort_cpp issue16517.cpp $ g++ -O3 -DNDEBUG -DTOPN -omedian_by_topn_cpp issue16517.cpp $ dmd -O -inline -release -ofmedian_by_sort -boundscheck=off issue16517.d $ dmd -O -inline -release -version=topn -ofmedian_by_topn -boundscheck=off issue16517.d $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp median (via sort): 1972; lines: 10512769; total time (ms): 3419.11; sort time (ms): 314.086 $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 1972; lines: 10512769; total time (ms): 1462; sort time (ms): 399 $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp median (via nth_element): 1972; lines: 10512769; total time (ms): 3255.41; nth_element time (ms): 40.676 $ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 1972; lines: 10512769; total time (ms): 1211; topN time (ms): 32 $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp median (via sort): 3; lines: 10512769; total time (ms): 2949.11; sort time (ms): 297.814 $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 3; lines: 10512769; total time (ms): 1351; sort time (ms): 382 $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp median (via nth_element): 3; lines: 10512769; total time (ms): 2316.75; nth_element time (ms): 65.041 $ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 3; lines: 10512769; total time (ms): 973; topN time (ms): 59 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp median (via sort): 2; lines: 10512769; total time (ms): 2614.47; sort time (ms): 351.285 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 2; lines: 10512769; total time (ms): 1269; sort time (ms): 361 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp median (via nth_element): 2; lines: 10512769; total time (ms): 2443.17; nth_element time (ms): 76.211 $ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 2; lines: 10512769; total time (ms): 998; topN time (ms): 77 =========================== =========================== issue16517.cpp #include <iostream> #include <cstdio> #include <ctime> #include <vector> #include <algorithm> using namespace std; int main() { clock_t swTotal, swSort; swTotal = clock(); vector<double> values; double num; while (cin >> num) { values.push_back(num); } size_t medianIndex = values.size() / 2; swSort = clock(); #ifdef TOPN const char* method = "nth_element"; nth_element(values.begin(), values.begin() + medianIndex, values.end()); #else const char* method = "sort"; sort(values.begin(), values.end()); #endif clock_t done = clock(); printf("median (via %s): %g; lines: %zu; total time (ms): %g; %s time (ms): %g\n", method, values[medianIndex], values.size(), (done - swTotal) * 1000. / CLOCKS_PER_SEC, method, (done - swSort) * 1000. / CLOCKS_PER_SEC); } =========================== =========================== issue16517.d import std.stdio; import std.array : appender; import std.algorithm : sort, topN; import std.conv; import std.range; import std.datetime: StopWatch; void main() { StopWatch swTotal, swSort; swTotal.start; Appender!(double[]) values; foreach (line; stdin.byLine) values.put(line.to!double); size_t medianIndex = values.data.length/2; swSort.start; version(topn) { topN(values.data, medianIndex); auto method = "topN"; } else { sort(values.data); auto method = "sort"; } swTotal.stop; swSort.stop; writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s time (ms): %d", method, values.data[medianIndex], values.data.length, swTotal.peek.msecs, method, swSort.peek.msecs); } ===========================

Sep 24 2016

On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei Alexandrescu wrote:Got curious so I tried to patch my ldc installation (0.17.1 (DMD v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d uses the new syntax in various places. Probably the best baseline is the equivalent C++ code using the mature implementation of nth_element, see http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end of this message). Here's what I got: http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below. The topN code produced with dmd is faster in virtually all cases (has parity for -f 4, which I suspect is a degenerate case with most elements equal, which exercises only a small part of the algorithm). For C++ I used -O3 -DNDEBUG, any other flags I should use? [snip]I added the topN pull request to a local LDC build (current LDC master). Also built the C++ nth_element and sort program you listed. (GCC 4.9.3, compiler line: g++-mp-4.9 -O3 -static-libgcc -static-libstdc++ -std=c++11). Did 7 runs for each variant on the three fields in ngram file. Median times are below. Median sort times (ms): Field 2 Field 3 Field 4 DMD 351 348 326 LDC 273 261 245 C++ 281 260 246 Median topN / nth_element times (ms): Field 2 Field 3 Field 4 LDC 21 44 57 C++ 41 60 71 Looking very good indeed!

Sep 25 2016

On 9/25/16 4:19 AM, Jon Degenhardt wrote:On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei Alexandrescu wrote:Thanks for this work! -- AndreiGot curious so I tried to patch my ldc installation (0.17.1 (DMD v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d uses the new syntax in various places. Probably the best baseline is the equivalent C++ code using the mature implementation of nth_element, see http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end of this message). Here's what I got: http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below. The topN code produced with dmd is faster in virtually all cases (has parity for -f 4, which I suspect is a degenerate case with most elements equal, which exercises only a small part of the algorithm). For C++ I used -O3 -DNDEBUG, any other flags I should use? [snip]I added the topN pull request to a local LDC build (current LDC master). Also built the C++ nth_element and sort program you listed. (GCC 4.9.3, compiler line: g++-mp-4.9 -O3 -static-libgcc -static-libstdc++ -std=c++11). Did 7 runs for each variant on the three fields in ngram file. Median times are below. Median sort times (ms): Field 2 Field 3 Field 4 DMD 351 348 326 LDC 273 261 245 C++ 281 260 246 Median topN / nth_element times (ms): Field 2 Field 3 Field 4 LDC 21 44 57 C++ 41 60 71 Looking very good indeed!

Sep 25 2016

On 09/23/2016 11:40 AM, Stefan Koch wrote:On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:Nagonna work for generic functions -- Andrei

Sep 23 2016

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- AndreiI got problem with that even when crytographically secure randomness wasn't needed more than once. A specific case included adding jitter on some timeout to avoid all hosts to expire at the same time, and used mersenne twister with a pseudo random seed. There still were way too many hosts ending up with the same timeout. System time is not enough.

Jan 18 2016

On 1/18/16 6:55 PM, deadalnix wrote:On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:How would this translate to a matter of selecting the pivot during sort? -- AndreiunpredictableSeed uses the system clock as a source of randomness, so we're good there. -- AndreiI got problem with that even when crytographically secure randomness wasn't needed more than once. A specific case included adding jitter on some timeout to avoid all hosts to expire at the same time, and used mersenne twister with a pseudo random seed. There still were way too many hosts ending up with the same timeout. System time is not enough.

Jan 18 2016

On Tuesday, 19 January 2016 at 00:17:30 UTC, Andrei Alexandrescu wrote:How would this translate to a matter of selecting the pivot during sort? -- AndreiA large chunk of a given datacenter going quadratic at the same time.

Jan 18 2016

On 1/18/16 6:18 PM, Ilya wrote:On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input. I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc. As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address. Thoughts? AndreiOn Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:A RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaHere goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task)Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.

Jan 18 2016

On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu wrote:On 1/18/16 6:18 PM, Ilya wrote:1. Yes, probability of hitting the worst case repeatedly is is practically zero. But RNGs do not change this probability. 2. It is possible to build attack for our RNGs, because they are Pseudo-RNGs. --IlyaOn Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input. I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc. As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address. Thoughts? Andrei

Jan 18 2016

On 01/19/2016 12:50 AM, Ilya wrote:... 1. Yes, probability of hitting the worst case repeatedly is is practically zero. But RNGs do not change this probability. 2. It is possible to build attack for our RNGs, because they are Pseudo-RNGs. --IlyaYou also need to predict the seed. How do you do that?

Jan 18 2016

On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:On 01/19/2016 12:50 AM, Ilya wrote:We can not use unpredictable seed (like system clock) in pure functions. --Ilya... 1. Yes, probability of hitting the worst case repeatedly is is practically zero. But RNGs do not change this probability. 2. It is possible to build attack for our RNGs, because they are Pseudo-RNGs. --IlyaYou also need to predict the seed. How do you do that?

Jan 18 2016

On 01/19/2016 12:55 AM, Ilya wrote:On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:Clearly. The point is, there already was an impure implementation of topN whose expected linear running time is still specified in the documentation. The current implementation does not fulfill this bound.On 01/19/2016 12:50 AM, Ilya wrote:We can not use unpredictable seed (like system clock) in pure functions. --Ilya... 1. Yes, probability of hitting the worst case repeatedly is is practically zero. But RNGs do not change this probability. 2. It is possible to build attack for our RNGs, because they are Pseudo-RNGs. --IlyaYou also need to predict the seed. How do you do that?

Jan 18 2016

On Tuesday, 19 January 2016 at 00:02:08 UTC, Timon Gehr wrote:On 01/19/2016 12:55 AM, Ilya wrote:There is already implementation with predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --IlyaOn Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:Clearly. The point is, there already was an impure implementation of topN whose expected linear running time is still specified in the documentation. The current implementation does not fulfill this bound.

Jan 18 2016

On 01/19/2016 01:12 AM, Ilya wrote:There is already implementation with predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --IlyaThe default RNG is seeded with unpredictableSeed. What is the point you are trying to make?

Jan 18 2016

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:On 01/19/2016 01:12 AM, Ilya wrote:unpredictableSeed is initialized only once and can be easily estimated. --IlyaThere is already implementation with predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --IlyaThe default RNG is seeded with unpredictableSeed. What is the point you are trying to make?

Jan 18 2016

On 1/18/16 7:46 PM, Ilya wrote:On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? AndreiOn 01/19/2016 01:12 AM, Ilya wrote:unpredictableSeed is initialized only once and can be easily estimated. --IlyaThere is already implementation with predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --IlyaThe default RNG is seeded with unpredictableSeed. What is the point you are trying to make?

Jan 18 2016

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:On 1/18/16 7:46 PM, Ilya wrote:A good variant with minimal overhead is to call unpredictableSeed for sorting big arrays each time (one seed per array): a. A hacker would not able to estimate a seed using other API calls. For example, "give me a set of random numbers". b. A hacker would not be able to estimate a seed using a small arrays because they don't use RNGs. (and they have not any overhead!) c. A hacker would not be able to estimate a seed for big arrays, because attack based on time measurement would not work for big arrays. IlyaOn Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? Andrei

Jan 18 2016

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:On 1/18/16 7:46 PM, Ilya wrote:A good variant with minimal overhead is to call unpredictableSeed for sorting big arrays each time (one seed per array): a. A hacker would not able to estimate a seed using other API calls. For example, "give me a set of random numbers". b. A hacker would not be able to estimate a seed using a small arrays because they don't use RNGs. (and they have not any overhead!) c. A hacker would not be able to estimate a seed for big arrays, because attack based on time measurement would not work for big arrays. EDIT: c. ... because each big array has own seed and time measurement would not work for big arrays with different seeds. Ilya

Jan 18 2016

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:

Jan 18 2016

On 01/19/2016 12:39 AM, Andrei Alexandrescu wrote:On 1/18/16 6:18 PM, Ilya wrote:https://en.wikipedia.org/wiki/Introselect

Jan 18 2016

On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu wrote:On 1/18/16 6:18 PM, Ilya wrote:Possible, but only if the seed and previous usage of victim program's global RNG are also known to the attacker... at which point the victim may well have a bigger problem than just quadratic topN.A RNGs don't improve worst case. It only changes an permutation for worst case. --IlyaWell it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input.I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc.Is it feasible to have two overloads of sort and/or topN, one pure and the other with better complexity guarantees?As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address.Hmm. Isn't using any pointers still breaking strong purity? Say the GC moved our array, and it now has a different address. A strongly pure function would order the elements exactly the same then. The same goes about sorting, where the thing depending on pivot choice is the relative order of "equal" elements. From a theoretical standpoint (not taking current D purity rules into account), I'd say using a pointer (which may be modified by GC) is as pure as just allowing a static RNG (a global one, or even another instance dedicated specifically to sort/topN). Ivan Kazmenko.

Jan 18 2016

On 01/18/2016 01:00 PM, Ivan Kazmenko wrote:The old version (2.070.0-b2) could not be tricked with it, does it use random?Yes, it selected the pivot uniformly at random using the global RNG. (This is also why the documentation says topN is O(n) in expectation.)

Jan 18 2016

On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:I'd suggest being able to switch between implementations. Recall that, when sorting, we have SwapStrategy.stable which is, well, stable, but additionally guarantees n log n with a larger constant (MergeSort). And we have SwapStrategy.unstable which uses Introsort, which, in turn, has smaller constant on sane inputs. Here, Andrei's suggestion is to also have two implementations, let us call them TopNStrategy.quickSelect and TopNStrategy.heap. The quickSelect one is O(n) on sane inputs but O(n^2) on an artificial worst case. The heap implementation is also O(n) when k is close to 0 or n, but O(n log n) otherwise. What's important is that it is also O(n log n) worst case. The current proposal switches between strategies based on a heuristic (k < n/8 or k > 7n/8 if I understand correctly), which may not be the best strategy in all cases. So, what can be done is to introduce TopNStrategy.auto which is the default and uses the (current or better) heuristic to switch between strategies, but also leave a way to explicitly select one of the two strategies, just like one can do now with sort!(..., SwapStrategy.whatWeExplicitlyWant). All the code for both strategies is already there. Each strategy has its benefits not covered by the other (quickSelect is faster for k~n and sane inputs, heap is faster for k close to boundary and special inputs). So, provide the default but let the user choose. Ivan Kazmenko.On 1/16/16 8:00 PM, Timon Gehr wrote:Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei

Jan 17 2016

On 01/17/2016 06:55 AM, Ivan Kazmenko wrote:So, what can be done is to introduce TopNStrategy.auto which is the default and uses the (current or better) heuristic to switch between strategies, but also leave a way to explicitly select one of the two strategies, just like one can do now with sort!(..., SwapStrategy.whatWeExplicitlyWant).A nice idea, but it seems a bit overengineered. The differences among approaches are rather subtle and explaining the circumstances under which one does better than the other is about as difficult as making the choice in the implementation. BTW there's yet another approach: 1. Create a max heap for r[0 .. nth] 2. Create a min heap for r[nth .. $] 3. As long as r[nth] < r[0], swap them and restore the heap property in the two heaps At the end of the process we have the smallest element in r[nth .. $], which is exactly in the r[nth] position, greater than or equal to the largest element in r[0 .. nth], which is exactly what the doctor prescribed. Complexity of this turns rather nice. Step 1 is O(k), step 2 is O(n - k). In the worst case (r is sorted descending), step 3 will stop after k iterations and does k heap insertions in the two heaps. So overall complexity is O(n + k log(k) + k log(n)). Since k <= n, keep the largest terms: O(n + k log(n)). So if k is proportional to n / log(n), we get O(n). And that's worst case! BTW I figured how to do stable partition. That'll come in a distinct PR. Andrei

Jan 17 2016

On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d < digitalmars-d puremagic.com> wrote:1. Organize the first 11 elements into a max heapWhy not the first 10?

Jan 16 2016

On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote: 1. Organize the first 11 elements into a max heap Why not the first 10?So you get to put the appropriate element in the 11th position. -- Andrei

Jan 16 2016

On 1/16/16 9:12 PM, Andrei Alexandrescu wrote:On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:To clarify this using a degenerate case: say someone calls topN(r, 0) i.e. find the minimum. Then you'd need a mini-heap of exactly one element to track the smallest element found so far. -- AndreiOn Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote: 1. Organize the first 11 elements into a max heap Why not the first 10?So you get to put the appropriate element in the 11th position. -- Andrei

Jan 16 2016

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:3. At the end, swap the largest in the heap with the 10th and you're done!And why this? Do we additionally require the k-th element to arrive exactly on k-th place?

Jan 16 2016

On 1/16/16 5:27 PM, Ivan Kazmenko wrote:On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:Yes. -- Andrei3. At the end, swap the largest in the heap with the 10th and you're done!And why this? Do we additionally require the k-th element to arrive exactly on k-th place?

Jan 16 2016

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:https://github.com/D-Programming-Language/phobos/pull/3934 So, say you're looking for the smallest 10 elements out of 100_000. The quickselect algorithm (which topN currently uses) will successively partition the set in (assuming the pivot choice works well) 50_000, 25_000, etc chunks all the way down to finding the smallest 10. That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10: 1. Organize the first 11 elements into a max heap 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property. 3. At the end, swap the largest in the heap with the 10th and you're done! This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often. Destroy! AndreiA common way to do it is to go quicksort, but only recurse on one side of the set. That should give log(n)^2 complexity on average.

Jan 17 2016

On 1/17/16 8:07 PM, deadalnix wrote:A common way to do it is to go quicksort, but only recurse on one side of the set. That should give log(n)^2 complexity on average.Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei

Jan 17 2016

On Monday, 18 January 2016 at 01:38:16 UTC, Andrei Alexandrescu wrote:On 1/17/16 8:07 PM, deadalnix wrote:Yeah; forget about me, I was dumb on that one.A common way to do it is to go quicksort, but only recurse on one side of the set. That should give log(n)^2 complexity on average.Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei

Jan 17 2016