
digitalmars.D - topN using a heap

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://github.com/D-Programming-Language/phobos/pull/3934

So, say you're looking for the smallest 10 elements out of 100_000. The 
quickselect algorithm (which topN currently uses) will successively 
partition the set into (assuming the pivot choice works well) chunks of 
50_000, 25_000, etc., all the way down to finding the smallest 10.

That's quite a bit of work, so 3934 uses an alternate strategy for 
finding the smallest 10:

1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element is found 
that is smaller than the largest in the heap, swap it with the largest 
in the heap then restore the heap property.

3. At the end, swap the largest in the heap with the 10th and you're done!

This is very effective, yet seldom referred to in the selection 
literature. In fact, a less efficient approach (heapify the entire 
range) is discussed more often.
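For concreteness, here is a small D sketch of those three steps (my own illustration, not the code in the pull request; `smallestK` is a hypothetical name):

```d
import std.algorithm.mutation : swap;
import std.algorithm.searching : maxIndex;
import std.container : heapify;

/// After the call, a[0 .. k] holds the k smallest elements of a,
/// in no particular order. Illustrative only.
void smallestK(int[] a, size_t k)
{
    assert(k + 1 <= a.length);
    auto heap = heapify(a[0 .. k + 1]); // 1. max heap over the first k+1 slots
    foreach (i; k + 1 .. a.length)      // 2. scan the remaining elements
        if (a[i] < heap.front)
        {
            immutable largest = heap.front;
            heap.replaceFront(a[i]);    //    new element in, heap restored
            a[i] = largest;             //    old largest swapped out
        }
    // 3. the largest of the k+1 is the (k+1)-th smallest overall;
    //    move it out of the first k slots
    immutable m = a[0 .. k + 1].maxIndex;
    swap(a[m], a[k]);
}
```

With the numbers from the post, this scans the 100_000 elements once and does a heap operation only when an element beats the current maximum of the 11-element heap.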


Destroy!

Andrei
Jan 16
next sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 That's quite a bit of work, so 3934 uses an alternate strategy 
 for finding the smallest 10:

 1. Organize the first 11 elements into a max heap

 2. Scan all other elements progressively. Whenever an element 
 is found that is smaller than the largest in the heap, swap it 
 with the largest in the heap then restore the heap property.

 3. At the end, swap the largest in the heap with the 10th and 
 you're done!

 This is very effective, and is seldom referred in the selection 
 literature. In fact, a more inefficient approach (heapify the 
 entire range) is discussed more often.
Yeah, this sounds nice.

Suppose we have an array of length n and want to find the smallest k 
elements. The first step (heapify) is O(k) operations.

1. Let us analyze the rest for "small enough" k.

1a. If the array is "random", the expected number of insertions is the 
sum of probabilities that each new element is inserted. This number is 
(k/(k+1)) + (k/(k+2)) + ... + k/n, which is k times a part of harmonic 
series, that is, on the order of k log n. In each case, O(log k) 
operations are required to process the new element. In total, the 
"average" case takes (n-k) + (k log n log k), a total of O(n) for small 
enough values of k.

1b. In the worst case (each element is less than the current top), this 
still requires (n-k) log k operations, so we can say the total is 
O(n log k).

2. For k of the same order as n, the algorithm is O(n log n), just as a 
regular HeapSort.

To summarize: for k << n we have O(n) average case and O(n log k) worst 
case. For k ~ n we have O(n log n), as HeapSort.
Jan 16
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...
To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.
The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.
Jan 16
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/16 8:00 PM, Timon Gehr wrote:
 On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...
To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.
The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.
Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei
Jan 16
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
 On 1/16/16 8:00 PM, Timon Gehr wrote:
 On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...
To summarize: For k<<n we have O(n) average case and O(n log k) worst case. For k~n we have O(n log n) as HeapSort.
The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.
Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei
Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)
Jan 16
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly larger, like
 n/log(n)² might work as an upper bound for k.
I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei
Jan 16
next sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:
 On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly 
 larger, like
 n/log(n)² might work as an upper bound for k.
I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei
size_t.sizeof * 8 - bsr(n) should be a good approximation for sorting. --Ilya
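In case it helps readers: `bsr` from `core.bitop` gives floor(log2(n)) for n > 0 in a single instruction, which is the cheap integer log being alluded to here (the wrapper name below is mine):

```d
import core.bitop : bsr;

/// floor(log2(n)) for n > 0, via the "bit scan reverse" intrinsic.
size_t floorLog2(size_t n)
{
    assert(n > 0);
    return bsr(n); // index of the highest set bit
}
```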
Jan 16
prev sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:
 On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly 
 larger, like
 n/log(n)² might work as an upper bound for k.
I'm not clear on how you got to that boundary. There are a few implementation details to be minded as well (quickly and carefully approximating the log etc). Could you please make a commented pull request against my own? Thanks! -- Andrei
The average case is O(n + (k log n log k)) for small enough k. So, any k 
below roughly n / log^2 (n) makes the second summand less than the first.

Of course, the constant not present in asymptotic analysis, as well as 
handling the worst case, may make this notion impractical. Measure is 
the way :) .

The worst case is just a decreasing sequence, or an almost decreasing 
one if we want to play against branch prediction as well.
Jan 17
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
 The average case is O(n + (k log n log k)) for small enough k. So, any k
 below roughly n / log^2 (n) makes the second summand less than the first.
I don't understand how you derived the average case complexity, and I 
don't understand how you derived the bound for k.

Regarding the complexity: for all k (small or large), it takes O(k) to 
build a heap. Then in the worst case we do (n - k) heap insertions, i.e. 
total complexity is O(k + (n - k) * log(k)). Is that correct?

If k is any fraction of n, the worst case complexity comes to 
O(n + n log n), i.e. O(n log n). By your calculation, the average 
complexity comes to O(n log^2 n), i.e. greater than the worst case. So 
there's a mistake somewhere.

Could you please clarify matters? Thanks!

Andrei
Jan 17
parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu 
wrote:
 On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
 The average case is O(n + (k log n log k)) for small enough k. 
 So, any k
 below roughly n / log^2 (n) makes the second summand less than 
 the first.
I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k. Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct?
Yeah, I have the same notion.
 If k is any fraction of n, the worst case complexity comes to 
 O(n + n log n) i.e. O(n log n). By your calculation, the 
 average complexity comes to O(n log^2 n), i.e. greater than the 
 worst case. So there's a mistake somewhere.
The upper bound O(n + k log k log n) is correct, but it's tight only for small k. I'll explain below.
 Could you please clarify matters? Thanks!
Here is a more verbose version.

Suppose for simplicity that the array consists of real numbers of the 
same continuous random distribution. (A discrete random distribution 
would complicate the analysis a bit, since the probability of two 
elements being equal would become greater than zero.)

Then, element a_i is the minimum of a_1, a_2, ..., a_i with probability 
1/i. Moreover, it is the second minimum with probability 1/i, the third 
minimum with probability 1/i, ..., the maximum of a_1, a_2, ..., a_i 
with the same probability 1/i. To prove this, we can show that, if we 
look at the ordering permutation p_1, p_2, ..., p_i such that 
a_{p_1} < a_{p_2} < ... < a_{p_i}, each of the i! possible permutations 
is equally probable.

What immediately follows is that when we track the k smallest elements 
and consider element i > k, the probability that it will change the k 
smallest elements encountered so far is k/i. Summing that for i from 
k+1 to n, we get the expected number of heap insertions: 
(k/(k+1)) + (k/(k+2)) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n)
*minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically equal to 
log n, and the second line is asymptotically equal to log k.

To summarize, the expected number of heap insertions for an array of 
random reals is k (log n - log k), and the complexity is 
O(n + k log k (log n - log k)). What I did here for simplicity is just 
omit the "minus logarithm of k" part to get an upper bound.

For small enough k, the bound is tight. For example, when k = sqrt(n), 
log n - log k = log(n/k) = (log n) / 2. On the other hand, when 
k = Theta(n), log n - log k = log(n/k) = Theta(1).

Ivan Kazmenko.
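As a sanity check of the k (log n - log k) estimate, here is a quick simulation of my own (not code from the thread): it counts displacement events on uniformly random arrays and compares the average against the harmonic-sum prediction.

```d
import std.container : heapify;
import std.random : Random, uniform01;

/// Number of times a scanned element displaces the current maximum of
/// the k smallest seen so far, i.e. the number of heap insertions.
size_t countInsertions(double[] a, size_t k)
{
    auto heap = heapify(a[0 .. k].dup); // max heap of the first k elements
    size_t count;
    foreach (x; a[k .. $])
        if (x < heap.front)
        {
            heap.replaceFront(x);
            ++count;
        }
    return count;
}

/// Average insertion count over several random arrays; the analysis
/// above predicts roughly k * (log n - log k).
double averageInsertions(size_t n, size_t k, size_t trials, uint seed)
{
    auto rng = Random(seed);
    size_t total;
    foreach (t; 0 .. trials)
    {
        auto a = new double[n];
        foreach (ref x; a)
            x = uniform01(rng);
        total += countInsertions(a, k);
    }
    return cast(double) total / trials;
}
```

For n = 10_000 and k = 10, the prediction k (H_n - H_k) is about 69, and the simulated average lands close to that.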
Jan 17
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
 Here is a more verbose version.
OK, very nice. Thanks! I've modified topN to work as follows. In a loop:

* If nth <= r.length / log2(r.length)^^2 (or is similarly close to 
r.length), use topNHeap with one heap and stop

* If nth <= r.length / log2(r.length) (or is similarly close to 
r.length), use topNHeap with two heaps and stop

* Take a quickselect step, adjust r and nth, and continue

That seems to work nicely for all values involved. All - let me know how 
things can be further improved. Thx!

Andrei
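If I read the dispatch right, it can be pictured like this (a hypothetical classifier of my own; `bsr` stands in for log2, and the branch names are mine, not the pull request's code):

```d
import core.bitop : bsr;
import std.algorithm.comparison : min;

/// Which branch the loop described above would take for a range of
/// length n and index nth. Illustrative only.
string chooseStrategy(size_t n, size_t nth)
{
    assert(n > 0 && nth < n);
    immutable lg = n > 1 ? bsr(n) : 1;      // ~ floor(log2(n))
    immutable edge = min(nth, n - 1 - nth); // distance to the nearer end
    if (edge <= n / (lg * lg))
        return "one heap";
    if (edge <= n / lg)
        return "two heaps";
    return "quickselect step";
}
```

For example, with n = 100_000: nth = 10 selects the one-heap branch, nth = 5_000 the two-heap branch, and nth = 50_000 a quickselect step.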
Jan 17
parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
wrote:
 On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
 Here is a more verbose version.
OK, very nice. Thanks! I've modified topN to work as follows. In a loop: * If nth <= r.length / log2(r.length)^^2 (or is similarly close to r.length), use topNHeap with one heap and stop * If nth <= r.length / log2(r.length) (or is similarly close to r.length), use topNHeap with two heaps and stop * Take a quickselect step, adjust r and nth, and continue That seems to work nicely for all values involved. All - let me know how things can be further improved. Thx! Andrei
Here goes the test which shows quadratic behavior for the new version: 
http://dpaste.dzfl.pl/e4b3bc26c3cf 
(dpaste kills the slow code before it completes the task)

My local timings with "-O -release -noboundscheck -inline":
building Element array: 4989 msecs
checking Element array: 5018 msecs
checking int array: 626 msecs

With "-debug -unittest":
building Element array: 8384 msecs
checking Element array: 8380 msecs
checking int array: 2987 msecs

If we change the length MAX_N to something observable, say, 50, the 
array is:
[0, 536870911, 2, 536870911, 4, 536870911, 6, 36, 8, 33, 10, 35, 12, 536870911, 14, 32, 16, 34, 18, 536870911, 536870911, 22, 23, 21, 20, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 31, 30, 536870911, 29, 536870911, 28, 536870911, 27, 536870911, 26, 536870911, 25, 536870911, 24, 536870911]

The old version (2.070.0-b2) could not be tricked with it; does it use 
random?

The inspiration is the paper "A Killer Adversary for Quicksort": 
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf 
(I already mentioned it on the forums a while ago)

Ivan Kazmenko.
Jan 18
next sibling parent Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
 wrote:
 All - let me know how things can be further improved. Thx!
Here goes the test which shows quadratic behavior for the new version: http://dpaste.dzfl.pl/e4b3bc26c3cf (dpaste kills the slow code before it completes the task) The inspiration is the paper "A Killer Adversary for Quicksort": http://www.cs.dartmouth.edu/~doug/mdmspe.pdf (I already mentioned it on the forums a while ago) Ivan Kazmenko.
Perhaps I should include a textual summary as well.

The code on DPaste starts by constructing an array of Elements of size 
MAX_N; in the code, MAX_N is 50_000. After that, we run the victim 
function on our array. Here, the victim is topN (array, MAX_N / 2); it 
could be sort (array) or something else.

An Element contains, or rather, pretends to contain, an int value. Here 
is how Element is special: the result of comparison for two Elements is 
decided on-the-fly. An Element can be either UNDECIDED or have a fixed 
value. Initially, all elements are UNDECIDED. When we compare two 
Elements and at least one of them has a fixed value, the comparison is 
resolved naturally, and an UNDECIDED element is greater than any fixed 
element. When we compare two UNDECIDED elements, the one which 
participated more in the comparisons so far gets a fixed value: greater 
than any other value fixed so far, but still less than UNDECIDED. This 
way, the results of old comparisons are consistent with the new fixed 
value.

Now, what do we achieve by running the victim function? It turns out 
that algorithms using the idea of QuickSort or QuickSelect tend to make 
most comparisons against their current pivot value. Our Element responds 
to that by fixing the pivot to one of the lowest available values. After 
that, a partition using such a pivot will have only a few, O(1), 
elements before the pivot, and the rest after the pivot. In total, this 
leads to quadratic performance.

After running the victim function on our array of Elements (which - 
careful here - already takes quadratic time), we reorder them in their 
original order (to do that, each Element also stores its original 
index). Now, we can re-run the algorithm on the array obtained so far. 
If the victim function is (strongly) pure, it will inevitably make the 
same comparisons in the same order. The only difference is that their 
result will already be decided.

Alternatively, we can make an int array of the current values in our 
array of Elements (also in their original order). Running the victim 
function on the int array must also make the same (quadratic number of) 
comparisons in the same order.

Ivan Kazmenko.
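The mechanism can be condensed into a few dozen lines of D (my own reconstruction after McIlroy's paper, not the DPaste code; all names are mine). The victim here is a deliberately naive first-element-pivot quicksort, since Phobos' own sort has the Introsort fallback:

```d
import std.algorithm.mutation : swap;

enum UNDECIDED = int.max;
int[] value;        // value[id], or UNDECIDED if not yet fixed
size_t[] uses;      // how many comparisons element id has taken part in
int nextValue;      // next low value to pin a likely pivot to
size_t comparisons; // total comparisons, to observe the blow-up

struct Elem
{
    size_t id;
    int opCmp(const Elem rhs) const
    {
        ++comparisons;
        ++uses[id];
        ++uses[rhs.id];
        if (value[id] == UNDECIDED && value[rhs.id] == UNDECIDED)
        {
            // the more-used element is the likely pivot: fix it now
            if (uses[id] >= uses[rhs.id])
                value[id] = nextValue++;
            else
                value[rhs.id] = nextValue++;
        }
        return (value[id] > value[rhs.id]) - (value[id] < value[rhs.id]);
    }
}

/// Fresh adversarial array of n elements, resetting the shared state.
Elem[] freshAdversary(size_t n)
{
    value = new int[n];
    value[] = UNDECIDED;
    uses = new size_t[n];
    nextValue = 0;
    comparisons = 0;
    Elem[] a;
    foreach (i; 0 .. n)
        a ~= Elem(i);
    return a;
}

/// The victim: textbook quicksort with first-element pivot.
void naiveQuickSort(Elem[] a)
{
    if (a.length < 2)
        return;
    size_t store = 1;
    foreach (i; 1 .. a.length)
        if (a[i] < a[0])
            swap(a[i], a[store++]);
    swap(a[0], a[store - 1]);
    naiveQuickSort(a[0 .. store - 1]);
    naiveQuickSort(a[store .. $]);
}
```

Sorting a 200-element adversarial array this way takes a quadratic number of comparisons, far above the ~n log n of a well-behaved run.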
Jan 18
prev sibling next sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new 
 version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)
Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.
Jan 18
parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new 
 version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)
Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.
RNGs don't improve the worst case. They only change the permutation that triggers it. --Ilya
Jan 18
next sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation 
 for worst case. --Ilya
Still, use of an RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
Jan 18
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 6:27 PM, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
BTW I forgot to thank you for this analysis. This is good stuff. -- Andrei
Jan 18
prev sibling parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
Jan 18
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Jan 18
next sibling parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Would this work for pure functions? --Ilya
Jan 18
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 12:51 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Would this work for pure functions? --Ilya
Only if they accept the RNG as an additional argument.
Jan 18
parent Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 23:55:38 UTC, Timon Gehr wrote:
 On 01/19/2016 12:51 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
 Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 [...]
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Would this work for pure functions? --Ilya
Only if they accept the RNG as an additional argument.
Exactly :) So, I hope we would not add Pseudo-RNGs to std.algorithm. --Ilya
Jan 18
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 6:51 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Would this work for pure functions? --Ilya
Of course not. I think this back-and-forth takes away from the gist of 
things. So let me summarize what has happened:

1. topN was reportedly slow. It was using a random pivot. I made it use 
getPivot (deterministic) instead of a random pivot in 
https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is 
also what sort uses.

2. Now that both functions use getPivot, I set out to improve it in 
https://github.com/D-Programming-Language/phobos/pull/3922. The problem 
is I couldn't make getPivot impure; pure functions already call sort, so 
making it impure would have been a regression.

3. So I made getPivot use just a bit of randomness taken from the length 
of the range.

4. sort was and is attackable before all of these changes

5. So now we have pure topN and sort (subject to the range offering pure 
primitives) but they are both attackable.

6. PRNGs don't have any other downside than making functions impure.

The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no regression. 
Then proceed and make sort() avoid quadratic cases in other ways, e.g. 
switch to heapsort if performance degrades.

(b) Improve sort() first, then apply a similar strategy to improving 
topN. Do not use RNGs at all.

(c) Seek randomness in other places, e.g. address of elements, data 
hashes etc. Come with a solution that may still be attacked in narrow 
cases but make that unlikely enough to be a real concern.


Andrei
Jan 18
next sibling parent Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:51 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
 Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for
 worst case. --Ilya
Still, use of RNG makes it impossible to construct the worst case beforehand, once and for all. In that sense, this is a regression.
No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
unpredictableSeed uses the system clock as a source of randomness, so we're good there. -- Andrei
Would this work for pure functions? --Ilya
Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened: 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses. 2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression. 3. So I made getPivot use just a bit of randomness taken from the length of the range. 4. sort was and is attackable before all of these changes 5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable. 6. PRNGs don't have any other downside than making functions impure. The way I see it we have these solutions: (a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades. (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all. (c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern. Andrei
(a) Hope no.
(b) Yes.
(c) Memory addresses may not work with user defined ranges, and hashes 
could be slow.
(d) Make PRNGs optional.
--Ilya
Jan 18
prev sibling next sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 4. sort was and is attackable before all of these changes
No, sort utilizes Introsort (Quicksort, but switch to Heapsort if we 
recurse too deep): see 
https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053. 
This improvement happened a year or two ago, when algorithm.d was not 
even separated into sub-modules.

My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits 
that by not being able to slow the sort down. But, if I comment out line 
1053 to disable the switching, things go quadratic as expected.

L1053: depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3;

The same remedy can help for topN: if we called partition more than 
log (length) in base (3/2) times, switch to the heap implementation of 
topN regardless of whether n is close to the edge.

Ivan Kazmenko.
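Sketched in D, that remedy looks like this (my own stand-in code: a first-element-pivot quickselect with a partition budget of roughly log_{3/2}(n), falling back here to a plain sort where the real code would switch to the heap-based topN):

```d
import std.algorithm.mutation : swap;
import std.algorithm.sorting : sort;
import core.bitop : bsr;

/// Moves the nth smallest element of r into r[nth], with smaller
/// elements before it and larger after. Depth-limited, so the worst
/// case is O(n log n) rather than quadratic.
void topNIntro(int[] r, size_t nth)
{
    // log_{3/2}(n) = log2(n) / log2(1.5) ~ 1.7 * log2(n)
    size_t budget = r.length < 2 ? 1 : bsr(r.length) * 17 / 10 + 1;
    while (r.length > 1)
    {
        if (budget-- == 0)
        {
            sort(r); // stand-in for the heap-based topN fallback
            return;
        }
        // Lomuto partition around the first element
        size_t store = 1;
        foreach (i; 1 .. r.length)
            if (r[i] < r[0])
                swap(r[i], r[store++]);
        swap(r[0], r[store - 1]);
        immutable p = store - 1; // pivot's final position
        if (p == nth)
            return;
        if (nth < p)
            r = r[0 .. p];
        else
        {
            nth -= p + 1;
            r = r[p + 1 .. $];
        }
    }
}
```

On adversarial inputs (e.g. a decreasing sequence with a first-element pivot), the budget runs out after ~log_{3/2}(n) partitions and the fallback takes over.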
Jan 18
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
 4. sort was and is attackable before all of these changes
No, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse too deep): see https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053. This improvement happened a year or two ago, when algorithm.d was not even separated into sub-modules. My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits that by not being able to slow the sort down. But, if I comment out line 1053 to disable the switching, things go quadratic as expected. L1053: depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3; The same remedy can help for topN: if we called partition more than log (length) in base (3/2) times, switch to the heap implementation of topN regardless of whether n is close to the edge.
Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- Andrei
Jan 19
parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu 
wrote:
 On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 Do you think sort and topN would be attackable if they used a 
 per-process-seeded RNG as per Xinok's idea? -- Andrei
Yes, but with a little interactivity (not generating the input in 
advance) and more labor (finding the state of the RNG).

Suppose the adversary has the means to generate an input for sort/topN, 
inspect the output, and then generate a bad input. With the current 
(not-secure) RNGs, from the first two actions, they can learn the 
current state of the RNG: with sort, the order of elements which compare 
as equal depends on the choice of pivot, and therefore exposes the 
random bits involved; with topN, the order of everything except what 
topN guarantees does the same thing.

After that, generating a bad input is definitely possible, since 
everything is known and deterministic at this point. The previously 
discussed method can be used, or something more clever if they want to 
make it fast.

Ivan Kazmenko.
Jan 19
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/19/2016 08:08 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote:
 On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 Do you think sort and topN would be attackable if they used a
 per-process-seeded RNG as per Xinok's idea? -- Andrei
Yes, but with a little interactivity (not generating the input in advance) and more labor (finding the state of RNG).
[snip] Thanks! -- Andrei
Jan 20
prev sibling next sibling parent reply Xinok <xinok live.com> writes:
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 Of course not. I think this back-and-forth takes away from the 
 gist of things. So let me summarize what has happened:

 1. topN was reportedly slow. It was using a random pivot. I 
 made it use getPivot (deterministic) instead of a random pivot 
 in https://github.com/D-Programming-Language/phobos/pull/3921. 
 getPivot is also what sort uses.

 2. Now that both functions use getPivot, I set out to improve 
 it in 
 https://github.com/D-Programming-Language/phobos/pull/3922. The 
 problem is I couldn't make getPivot impure; pure functions 
 already call sort, so making it impure would have been a 
 regression.

 3. So I made getPivot use just a bit of randomness taken from 
 the length of the range.

 4. sort was and is attackable before all of these changes

 5. So now we have pure topN and sort (subject to the range 
 offering pure primitives) but they are both attackable.

 6. PRNGs don't have any other downside than making functions 
 impure.

 The way I see it we have these solutions:

 (a) make topN still use a random pivot. That way there's no 
 regression. Then proceed and make sort() avoid quadratic cases 
 in other ways, e.g. switch to heapsort if performance degrades.

 (b) Improve sort() first, then apply a similar strategy to 
 improving topN. Do not use RNGs at all.

 (c) Seek randomness in other places, e.g. address of elements, 
 data hashes etc. Come with a solution that may still be 
 attacked in narrow cases but make that unlikely enough to be a 
 real concern.


 Andrei
To be clear, sort is NOT attackable. Introsort is used for unstable 
sorting, which begins with quick sort but falls back to heap sort after 
too many recursions to maintain O(n log n) running time. Timsort is used 
for stable sorting, which is a variant of merge sort but still runs in 
O(n log n) time. In either case, you're guaranteed to have O(n log n) 
running time in the worst case.

On the other hand, someone can improve upon quick sort by choosing 
better pivots, but be careful not to add too much overhead. A couple 
years ago, I tried choosing the pivot from a median of five but the 
result was as much as 10% slower. One idea is to begin initially 
choosing better pivots, but once the sublists fall below a certain 
length, simply choose the pivot from a median of three to avoid the 
extra overhead.

As for topN, there are two approaches I'm aware of that are 
deterministic without resorting to impurities or RNGs. The first is 
introselect, which is similar to introsort in that it has a fall-back 
algorithm. The other approach is to choose the pivot from a median of 
medians. My idea is a variant of the latter approach: begin by choosing 
the pivot from a median of five. If you continuously choose bad pivots, 
take a larger sample of elements to choose the pivot from. Keep growing 
the sample as necessary.

Speaking of RNGs, they're technically pure as long as you always use the 
same seed. So what if we generated a random seed at the start of the 
process, but then reused that same seed over and over in pure functions 
for the duration of the process?
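That last idea can be sketched directly in D: initialize an immutable seed once per process in a module constructor; pure functions are allowed to read immutable globals, so the seed stays usable from pure code (the `scramble` mixing function below is a made-up example, not anyone's proposed API):

```d
import std.random : unpredictableSeed;

immutable uint processSeed; // fixed once per process

shared static this()
{
    processSeed = unpredictableSeed;
}

/// Pure code may read immutable module-level state, so randomness fixed
/// at process start remains available here. Hypothetical example.
uint scramble(uint x) pure
{
    return (x * 2654435769u) ^ processSeed;
}
```

Within one run, scramble is deterministic (so pure is honest); across runs the seed, and hence any "random" pivot choices derived from it, differ.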
Jan 18
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/18/2016 09:42 PM, Xinok wrote:
 To be clear, sort is NOT attackable. Introsort is used for unstable
 sorting which begins with quick sort but falls back to heap sort after
 too many recursions to maintain O(n log n) running time. Timsort is used
 for stable sorting which is a variant of merge sort but still runs in
 O(n log n) time. In either case, you're guaranteed to have O(n log n)
 running time in the worst case.
Sorry, my bad. Could you please point to the code doing the introspection? I might do the same in topN.
 On the other hand, someone can improve upon quick sort by choosing
 better pivots but be careful not to add too much overhead. A couple
 years ago, I tried choosing the pivot from a median of five but the
 result was as much as 10% slower. One idea is to begin initially
 choosing better pivots, but once the sublists fall below a certain
 length, simply choose the pivot from a median of three to avoid the
 extra overhead.
That's what https://github.com/D-Programming-Language/phobos/pull/3922 does, and in my measurements it's about as fast as or faster than the current implementation. Could you please re-run your measurements against that PR?
 Speaking of RNGs, they're technically pure as long as you always use the
 same seed. So what if we generated a random seed at the start of the
 process, but then reused that same seed over and over in pure functions
 for the duration of the process?
That's a great idea. The way D is defined, it already implies that "immutable" is process-immutable, not immutable across runs. Consider http://dpaste.dzfl.pl/2fa5baf0c149. Very nice insights, Xinok. Thanks! Andrei
Jan 19
prev sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 4. sort was and is attackable before all of these changes

 (b) Improve sort() first, then apply a similar strategy to 
 improving topN. Do not use RNGs at all.
Since point 4 is in fact already fixed a good while ago, my suggestion would be (b): to do the same for topN.

Introselect (https://en.wikipedia.org/wiki/Introselect), already mentioned in this thread, uses median-of-medians to achieve worst case O(n) performance if we recurse too deep. There is already an issue suggesting to implement it: https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement deterministic topN).

In fact, the O(n log n) heap approach as it is implemented now could be faster than the O(n) median-of-medians approach on reasonable inputs. So, someone will have to implement median-of-medians and, well, measure. At the very least, googling for "median of medians in practice" and such yields the tag wiki from StackOverflow.com: "The constant factor in the O(n) is large, and the algorithm is not commonly used in practice.".

Ivan Kazmenko.
Jan 18
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
 4. sort was and is attackable before all of these changes

 (b) Improve sort() first, then apply a similar strategy to improving
 topN. Do not use RNGs at all.
Since point 4 is in fact already fixed a good while ago, my suggestion would be (b): to do the same for topN. Introselect (https://en.wikipedia.org/wiki/Introselect), already mentioned in this thread, uses median-of-medians to achieve worst case O(n) performance if we recurse too deep. There is already an issue suggesting to implement it: https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement deterministic topN). In fact, the O(n log n) heap approach as it is implemented now could be faster than O(n) median-of-medians approach on reasonable inputs. So, someone will have to implement median-of-medians and, well, measure. At the very least, googling for "median of medians in practice" and such yields the tag wiki from StackOverflow.com: "The constant factor in the O(n) is large, and the algorithm is not commonly used in practice.".
Yah, I also think heap/double-heap topN is better; median-of-medians (MoM) is theoretically nice but practically less so. Though there are two things that could help MoM in D: (a) median of five (core part of MoM) is relatively cheap to compute with six comparisons; (b) we can use stride() to compute median without needing to copy a fifth of the input or to complicate the algorithm - just recurse on the stride! All in all this might be a good algo to implement and test. So anyhow we need the "intro" part. I'll get to that soon. Andrei
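The six-comparison median of five mentioned in (a) can be written out explicitly. Below is a Python sketch of the well-known comparison network (illustrative only, not the Phobos code); each comparison is numbered so they can be counted.

```python
def median_of_5(a, b, c, d, e):
    """Median of five values using exactly six comparisons."""
    if a > b: a, b = b, a               # 1: sort pair (a, b)
    if c > d: c, d = d, c               # 2: sort pair (c, d)
    if a > c: a, b, c, d = c, d, a, b   # 3: make a <= c
    # a is now the minimum of {a, b, c, d}, hence at most the 2nd
    # smallest of all five -- it cannot be the median.  The median
    # is therefore the 2nd smallest of the remaining {b, c, d, e}.
    if b > e: b, e = e, b               # 4: sort pair (b, e)
    if b > c:                           # 5: c is the min of the four
        return b if b < d else d        # 6: 2nd smallest is min(b, d)
    else:                               #    b is the min of the four
        return e if e < c else c        # 6: 2nd smallest is min(e, c)
```

The stride trick in (b) would then apply this to every group of five and recurse on the strided view of medians, avoiding the copy.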
Jan 19
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.
Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 One interesting side topic would be how to avoid the use of goto without losing clarity. Andrei
Jan 19
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:
 On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.
Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 ...
The switching heuristic is bad. It always switches after at most 8 steps. I'd just use the second heuristic discussed in https://en.wikipedia.org/wiki/Introselect .
Jan 19
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/19/2016 01:27 PM, Timon Gehr wrote:
 On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:
 On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.
Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934. https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 ...
The switching heuristic is bad. It always switches after at most 8 steps.
My algebra is completely rotten. Thanks! Updated https://github.com/andralex/phobos/commit/cdd59b51a397ee1a4584e55c07d921c59acb5978. Andrei
Jan 19
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:
 unpredictableSeed uses the system clock as a source of 
 randomness, so we're good there. -- Andrei
I've had problems with that even when cryptographically secure randomness wasn't needed more than once. A specific case involved adding jitter to a timeout to avoid all hosts expiring at the same time, using a Mersenne twister with a pseudo-random seed. There were still way too many hosts ending up with the same timeout. System time is not enough.
Jan 18
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 6:55 PM, deadalnix wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 unpredictableSeed uses the system clock as a source of randomness, so
 we're good there. -- Andrei
I've had problems with that even when cryptographically secure randomness wasn't needed more than once. A specific case involved adding jitter to a timeout to avoid all hosts expiring at the same time, using a Mersenne twister with a pseudo-random seed. There were still way too many hosts ending up with the same timeout. System time is not enough.
How would this translate to a matter of selecting the pivot during sort? -- Andrei
Jan 18
parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 19 January 2016 at 00:17:30 UTC, Andrei Alexandrescu 
wrote:
 How would this translate to a matter of selecting the pivot 
 during sort? -- Andrei
A large chunk of a given datacenter going quadratic at the same time.
Jan 18
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)
Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.
RNGs don't improve the worst case. They only change which permutation triggers the worst case. --Ilya
Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input.

I'm not sure whether we should worry about this. Probably we do, because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc.

As a sort of compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address. Thoughts?

Andrei
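The compromise could look roughly like this. A hedged sketch, where Python's `id()` stands in for the address of the first element; the function name and the mixing constant are illustrative, not a proposed Phobos API.

```python
def pivot_seed(arr):
    """Mix the range length with the (integral) address of the first
    element to get a pivot seed.  Deterministic within a run, but
    different for ranges living at different addresses."""
    addr = id(arr) if arr else 0   # stand-in for &r[0] in D
    golden = 0x9E3779B97F4A7C15    # 64-bit golden-ratio mixing constant
    return (len(arr) * golden ^ addr) & 0xFFFFFFFFFFFFFFFF
```

An attacker who controls the contents but not the allocation address can then no longer predict the pivot sequence; as noted above, an input that always lands at the same address remains attackable.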
Jan 18
next sibling parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko 
 wrote:
 Here goes the test which shows quadratic behavior for the 
 new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)
Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.
RNGs don't improve the worst case. They only change which permutation triggers the worst case. --Ilya
Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input. I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc. As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address. Thoughts? Andrei
1. Yes, the probability of hitting the worst case repeatedly is practically zero. But RNGs do not change this probability. 2. It is possible to build an attack for our RNGs, because they are pseudo-RNGs. --Ilya
Jan 18
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, the probability of hitting the worst case repeatedly is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build an attack for our RNGs, because they are
 pseudo-RNGs.
 --Ilya
You also need to predict the seed. How do you do that?
Jan 18
parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, the probability of hitting the worst case repeatedly is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build an attack for our RNGs, because they
 are pseudo-RNGs.
 --Ilya
You also need to predict the seed. How do you do that?
We cannot use an unpredictable seed (like the system clock) in pure functions. --Ilya
Jan 18
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 12:55 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, the probability of hitting the worst case repeatedly is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build an attack for our RNGs, because they are
 pseudo-RNGs.
 --Ilya
You also need to predict the seed. How do you do that?
We cannot use an unpredictable seed (like the system clock) in pure functions. --Ilya
Clearly. The point is, there already was an impure implementation of topN whose expected linear running time is still specified in the documentation. The current implementation does not fulfill this bound.
Jan 18
parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 00:02:08 UTC, Timon Gehr wrote:
 On 01/19/2016 12:55 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, the probability of hitting the worst case repeatedly is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build an attack for our RNGs, because they
 are pseudo-RNGs.
 --Ilya
You also need to predict the seed. How do you do that?
We cannot use an unpredictable seed (like the system clock) in pure functions. --Ilya
Clearly. The point is, there already was an impure implementation of topN whose expected linear running time is still specified in the documentation. The current implementation does not fulfill this bound.
There is already an implementation with a predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --Ilya
Jan 18
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
Jan 18
parent reply Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
unpredictableSeed is initialized only once and can be easily estimated. --Ilya
Jan 18
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
unpredictableSeed is initialized only once and can be easily estimated. --Ilya
https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? Andrei
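A rough sketch of the scheme just described, for readers following along. MinstdRand0 is the classic Lehmer generator x -> 16807·x mod (2^31 - 1); the entropy mixing below only approximates the description above and is not a copy of the Phobos code.

```python
import os, threading, time

M = 2**31 - 1  # Mersenne-prime modulus of MinstdRand0

class MinstdRand0:
    """Lehmer LCG: x -> 16807 * x mod (2^31 - 1)."""
    def __init__(self, seed):
        self.x = seed % M or 1  # state must be nonzero

    def next(self):
        self.x = (16807 * self.x) % M
        return self.x

# Seeded once from the PID, the thread ID and the current time.
_rng = MinstdRand0(os.getpid() ^ threading.get_ident() ^ time.time_ns())

def unpredictable_seed():
    """XOR the time of the call with the generator's next value."""
    return time.time_ns() ^ _rng.next()
```

Each call advances the hidden LCG state and folds in a fresh timestamp, which is what makes estimating the seed from outside hard, though as discussed above, not impossible.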
Jan 18
next sibling parent Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
unpredictableSeed is initialized only once and can be easily estimated. --Ilya
https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? Andrei
A good variant with minimal overhead is to call unpredictableSeed each time a big array is sorted (one seed per array):

a. A hacker would not be able to estimate the seed using other API calls, for example "give me a set of random numbers".

b. A hacker would not be able to estimate the seed using small arrays, because they don't use RNGs (and so have no overhead!).

c. A hacker would not be able to estimate the seed for big arrays, because an attack based on time measurement would not work for big arrays.

Ilya
Jan 18
prev sibling next sibling parent Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
unpredictableSeed is initialized only once and can be easily estimated. --Ilya
https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? Andrei
A good variant with minimal overhead is to call unpredictableSeed each time a big array is sorted (one seed per array):

a. A hacker would not be able to estimate the seed using other API calls, for example "give me a set of random numbers".

b. A hacker would not be able to estimate the seed using small arrays, because they don't use RNGs (and so have no overhead!).

c. A hacker would not be able to estimate the seed for big arrays, because an attack based on time measurement would not work for big arrays.

EDIT: c. ... because each big array has its own seed, and time measurement would not work for big arrays with different seeds.

Ilya
Jan 18
prev sibling parent Ilya <ilyayaroshenko gmail.com> writes:
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya
The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?
unpredictableSeed is initialized only once and can be easily estimated. --Ilya
https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120 unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0. How do you think things could be improved? Andrei
A good variant with minimal overhead is to call unpredictableSeed each time a big array is sorted (one seed per array):

a. A hacker would not be able to estimate the seed using other API calls, for example "give me a set of random numbers".

b. A hacker would not be able to estimate the seed using small arrays, because they don't use RNGs (and so have no overhead!).

c. A hacker would not be able to estimate the seed for big arrays, because an attack based on time measurement would not work for big arrays.

EDIT: c. ... because each big array has its own seed, and time measurement would not work for big arrays with different seeds.

__EDIT2__: To be fair, (c) is not really correct. To do (c) correctly, a thread-local global seed should be used to generate unpredictableSeed for each big array.

Ilya
Jan 18
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/19/2016 12:39 AM, Andrei Alexandrescu wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)
Correction: this is the result of removing a uniform call in pull 3921. Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.
RNGs don't improve the worst case. They only change which permutation triggers the worst case. --Ilya
Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input. I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite. The nice thing about not relying on randomness is that pure functions can call sort, topN etc. As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address. Thoughts? Andrei
https://en.wikipedia.org/wiki/Introselect
Jan 18
prev sibling parent Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 RNGs don't improve the worst case. They only change which 
 permutation triggers the worst case. --Ilya
Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input.
Possible, but only if the seed and previous usage of the victim program's global RNG are also known to the attacker... at which point the victim may well have a bigger problem than just quadratic topN.
 I'm not sure whether we should worry about this. Probably we do 
 because sort attacks are a press favorite.

 The nice thing about not relying on randomness is that pure 
 functions can call sort, topN etc.
Is it feasible to have two overloads of sort and/or topN, one pure and the other with better complexity guarantees?
 As a sort of a compromise I was thinking of seeding the RNG 
 with not only the length of the range, but also the integral 
 representation of the address of the first element. This would 
 still allow an attack if the input is always at the same 
 address.
Hmm. Isn't using any pointers still breaking strong purity? Say the GC moved our array, and it now has a different address. A strongly pure function would order the elements exactly the same then. The same goes for sorting, where the thing depending on pivot choice is the relative order of "equal" elements.

From a theoretical standpoint (not taking current D purity rules into account), I'd say using a pointer (which may be modified by the GC) is as pure as just allowing a static RNG (a global one, or even another instance dedicated specifically to sort/topN).

Ivan Kazmenko.
Jan 18
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/18/2016 01:00 PM, Ivan Kazmenko wrote:
 The old version (2.070.0-b2) could not be tricked with it, does it use
 random?
Yes, it selected the pivot uniformly at random using the global RNG. (This is also why the documentation says topN is O(n) in expectation.)
Jan 18
prev sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:
 On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
 On 1/16/16 8:00 PM, Timon Gehr wrote:
 The implementation falls back to topNHeap whenever k is 
 within the first
 or last ~n/8 elements and therefore is Ω(n log n) on average.
Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei
Ivan's analysis suggests that even something significantly larger, like n/log(n)², might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)
I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, well, stable, but additionally guarantees n log n with a larger constant (MergeSort). And we have SwapStrategy.unstable which uses Introsort, which, in turn, has a smaller constant on sane inputs.

Here, Andrei's suggestion is to also have two implementations, let us call them TopNStrategy.quickSelect and TopNStrategy.heap. The quickSelect one is O(n) on sane inputs but O(n^2) on an artificial worst case. The heap implementation is also O(n) when k is close to 0 or n, but O(n log n) otherwise. What's important is that it is also O(n log n) in the worst case.

The current proposal switches between strategies based on a heuristic (k < n/8 or k > 7n/8 if I understand correctly), which may not be the best strategy in all cases.

So, what can be done is to introduce TopNStrategy.auto which is the default and uses the (current or better) heuristic to switch between strategies, but also leave a way to explicitly select one of the two strategies, just like one can do now with sort!(..., SwapStrategy.whatWeExplicitlyWant).

All the code for both strategies is already there. Each strategy has its benefits not covered by the other (quickSelect is faster for k~n and sane inputs, heap is faster for k close to the boundary and special inputs). So, provide the default but let the user choose.

Ivan Kazmenko.
Jan 17
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 01/17/2016 06:55 AM, Ivan Kazmenko wrote:
 So, what can be done is to introduce TopNStrategy.auto which is the
 default and uses the (current or better) heuristic to switch between
 strategies, but also leave a way to explicitly select one of the two
 strategies, just like one can do now with sort!(...,
 SwapStrategy.whatWeExplicitlyWant).
A nice idea, but it seems a bit overengineered. The differences among approaches are rather subtle, and explaining the circumstances under which one does better than the other is about as difficult as making the choice in the implementation.

BTW there's yet another approach:

1. Create a max heap for r[0 .. nth]

2. Create a min heap for r[nth .. $]

3. As long as r[nth] < r[0], swap them and restore the heap property in the two heaps

At the end of the process we have the smallest element in r[nth .. $], which is exactly in the r[nth] position, greater than or equal to the largest element in r[0 .. nth], which is exactly what the doctor prescribed.

Complexity of this turns out rather nice. Step 1 is O(k), step 2 is O(n - k). In the worst case (r is sorted descending), step 3 will stop after k iterations and does k heap insertions in the two heaps. So overall complexity is O(n + k log(k) + k log(n)). Since k <= n, keep the largest terms: O(n + k log(n)). So if k is proportional to n / log(n), we get O(n). And that's worst case!

BTW I figured out how to do stable partition. That'll come in a distinct PR.

Andrei
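The three steps above can be sketched with a stock binary-heap library. Python's heapq is used here (a max heap is simulated by negating values); this is an illustration of the double-heap idea, not the eventual Phobos code.

```python
import heapq

def top_n_double_heap(r, nth):
    """Return a rearrangement of r in which index nth holds the nth
    smallest element (0-based), everything before it is <= it, and
    everything after it is >= it."""
    # 1. max heap over r[0 .. nth] (negate values for heapq's min heap)
    left = [-x for x in r[:nth]]
    heapq.heapify(left)
    # 2. min heap over r[nth .. $]
    right = list(r[nth:])
    heapq.heapify(right)
    # 3. while the min of the right heap is below the max of the left
    #    heap, swap the two tops and restore both heap properties
    while left and right and right[0] < -left[0]:
        top_left, top_right = -left[0], right[0]
        heapq.heapreplace(left, -top_right)  # pop max, push new value
        heapq.heapreplace(right, top_left)   # pop min, push new value
    return [-x for x in left] + right
```

When the loop exits, the top of the right heap, which sits exactly at index nth, is the smallest element of r[nth .. $] and is >= everything in r[0 .. nth], as described.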
Jan 17
prev sibling next sibling parent reply Ziad Hatahet via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 1. Organize the first 11 elements into a max heap
Why not the first 10?
Jan 16
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
 On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
 <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote:


     1. Organize the first 11 elements into a max heap


 Why not the first 10?
So you get to put the appropriate element in the 11th position. -- Andrei
Jan 16
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/16 9:12 PM, Andrei Alexandrescu wrote:
 On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
 On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
 <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote:


     1. Organize the first 11 elements into a max heap


 Why not the first 10?
So you get to put the appropriate element in the 11th position. -- Andrei
To clarify this using a degenerate case: say someone calls topN(r, 0) i.e. find the minimum. Then you'd need a mini-heap of exactly one element to track the smallest element found so far. -- Andrei
Jan 16
prev sibling next sibling parent reply Ivan Kazmenko <gassa mail.ru> writes:
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 3. At the end, swap the largest in the heap with the 10th and 
 you're done!
And why this? Do we additionally require the k-th element to arrive exactly in the k-th place?
Jan 16
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/16 5:27 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 3. At the end, swap the largest in the heap with the 10th and you're
 done!
And why this? Do we additionally require the k-th element to arrive exactly in the k-th place?
Yes. -- Andrei
Jan 16
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 https://github.com/D-Programming-Language/phobos/pull/3934

 So, say you're looking for the smallest 10 elements out of 
 100_000. The quickselect algorithm (which topN currently uses) 
 will successively partition the set in (assuming the pivot 
 choice works well) 50_000, 25_000, etc chunks all the way down 
 to finding the smallest 10.

 That's quite a bit of work, so 3934 uses an alternate strategy 
 for finding the smallest 10:

 1. Organize the first 11 elements into a max heap

 2. Scan all other elements progressively. Whenever an element 
 is found that is smaller than the largest in the heap, swap it 
 with the largest in the heap then restore the heap property.

 3. At the end, swap the largest in the heap with the 10th and 
 you're done!

 This is very effective, and is seldom referred in the selection 
 literature. In fact, a more inefficient approach (heapify the 
 entire range) is discussed more often.


 Destroy!

 Andrei
A common way to do it is to go quicksort, but only recurse on one side of the set. That should give log(n)^2 complexity on average.
Jan 17
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/16 8:07 PM, deadalnix wrote:
 A common way to do it is to go quicksort, but only recurse on one side
 of the set. That should give log(n)^2 complexity on average.
Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei
Jan 17
parent deadalnix <deadalnix gmail.com> writes:
On Monday, 18 January 2016 at 01:38:16 UTC, Andrei Alexandrescu 
wrote:
 On 1/17/16 8:07 PM, deadalnix wrote:
 A common way to do it is to go quicksort, but only recurse on 
 one side
 of the set. That should give log(n)^2 complexity on average.
Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei
Yeah; forget about me, I was dumb on that one.
Jan 17