## digitalmars.D - simultaneous multiple key sorting algorithm

- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 27 2012
- sclytrack <sclytrack fake.com> Jan 27 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 27 2012
- "Christof Schardt" <christof schardt.info> Jan 27 2012
- "Andrew Krieger" <akrieger fb.com> Jan 27 2012
- "Robert Jacques" <sandford jhu.edu> Jan 27 2012
- Derek <ddparnell bigpond.com> Jan 27 2012
- "Vladimir Panteleev" <vladimir thecybershadow.net> Jan 27 2012
- Manfred Nowak <svv1999 hotmail.com> Jan 27 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 27 2012
- Manfred Nowak <svv1999 hotmail.com> Jan 29 2012
- "Peter Alexander" <peter.alexander.au gmail.com> Jan 27 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 27 2012
- Manfred Nowak <svv1999 hotmail.com> Jan 27 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 28 2012
- Manfred Nowak <svv1999 hotmail.com> Jan 28 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 28 2012
- Manfred Nowak <svv1999 hotmail.com> Jan 28 2012
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jan 28 2012
- "H. S. Teoh" <hsteoh quickfur.ath.cx> Jan 27 2012
- "H. S. Teoh" <hsteoh quickfur.ath.cx> Jan 28 2012

Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas? Thanks, Andrei
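The brute force just described can be sketched in a few lines. This is my own illustration (in Python rather than D, since the idea is language-neutral); `Person`, `ranks`, and `rank_sum_sort` are hypothetical names, not code from the thread:

```python
from collections import namedtuple

Person = namedtuple("Person", "name height hairlength")

def ranks(people, key):
    """Map each person to their rank (0-based position) when sorted by `key`."""
    order = sorted(people, key=key)
    return {p: i for i, p in enumerate(order)}

def rank_sum_sort(people):
    """Order people by rank(height) + rank(hairlength): two rank
    computations plus a final sort, with temporary rank tables --
    exactly the brute force described above."""
    rh = ranks(people, lambda p: p.height)
    rl = ranks(people, lambda p: p.hairlength)
    return sorted(people, key=lambda p: rh[p] + rl[p])
```

Note that the two rank tables play the role of the "at least one temporary array" in the post.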

Jan 27 2012

On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas? Thanks, Andrei

http://en.wikipedia.org/wiki/K-d_tree Sclytrack

Jan 27 2012

On 1/27/12 3:07 PM, sclytrack wrote:On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas? Thanks, Andrei

http://en.wikipedia.org/wiki/K-d_tree

Thanks. I think k-d trees are good for general operations (insertion, deletion, and nearest neighbor query) whereas here we're interested in sorting. Andrei

Jan 27 2012

The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary

I think you are mixing two problem levels. Level one (theory): how are the two properties weighted against each other and combined into a total weight (a weight function)? Level two (programming): how can this be expressed in an optimal algorithmic way, given D2? Your question suggests that the (naive) sum function is already the solution to level one. But I suspect that solving the problem requires finding a proper weight function rather than a combination algorithm. Christof

Jan 27 2012

On 1/27/12 5:35 PM, Christof Schardt wrote:The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary

I think, you are mixing two problem levels: Level one (theory): How are the two properties weighted against each other and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).Level two (programming): how can this be expressed in an optimal algorithmic way, given D2. Your question suggests that the (naive) sum-function is already the solution to level one.

This is not naive, because the quantities being summed are ranks, not key values.But I suspect that solving the problem requires finding a proper weight function rather than a combination algorithm.

No, I think this is a confusion. Andrei

Jan 27 2012

A k-d tree isn't directly applicable here - the problem is that the input data isn't normalized, so you can't just drop people into a 2d grid consisting of one axis being height, and the other axis being hair length. You need to 'normalize' the heights and lengths into ranks first, *then* insert them, and normalizing the lengths -> ranks is an NlogN process to begin with. Constructing a k-d tree isn't free either - you either have to pre-sort your inputs, or sort-as-you-go, which works out to NlogN as well. On Friday, 27 January 2012 at 20:07:44 UTC, sclytrack wrote:On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:

http://en.wikipedia.org/wiki/K-d_tree Sclytrack

Jan 27 2012

On Fri, 27 Jan 2012 16:45:41 -0600, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 1/27/12 5:35 PM, Christof Schardt wrote:The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary

I think, you are mixing two problem levels: Level one (theory): How are the two properties weighted against each other and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).

Weighting doesn't imply a simple linear constant. The rank function is an example of a non-linear weighting scheme. Non-linear weighting functions are common in complex sorting and classification problems.

Jan 27 2012

On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

Just a further clarification ... If a person is tall and has short hair, would they be included in the sort? In other words, are you ONLY looking for tall AND long-haired people or are you looking for relative ranking of tallness and hairiness? -- Derek Parnell Melbourne, Australia

Jan 27 2012

On 1/27/12 5:27 PM, Derek wrote:On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

Just a further clarification ... If a person is tall and has short hair, would they be included in the sort? In other words, are you ONLY looking for tall AND long-haired people or are you looking for relative ranking of tallness and hairiness?

Everybody is included in the sort: the size of the array before and after the sort is the same. Then, of course, the top K problem is very interesting too. Andrei

Jan 27 2012

On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:Here's a multikey sorting problem that's bothering me; I think it's well researched but I can't find the right keywords. Say we have a collection of people with height and hair length information. We want to get people who are "tall and long-haired". More precisely, we want to order people by rank(p.height) + rank(p.hairlength), where rank(p.k) is the function that returns the rank of person p if ordered by key k. The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

I don't think you can get much better than this. Knowing the rank (for one key) of one element requires looking at all the other elements. Thus, knowing the rank of all elements isn't possible in linear time or faster. Knowing the ranks for only one key doesn't get you much information regarding the final result. Even the item ranked last on one key can tie for first place in combination with the other. These observations seem to impose a strict dependency on the order in which the data is calculated. I don't see any shortcuts, but I'd love to be proven wrong. You may want to ask on http://cstheory.stackexchange.com/ .

Jan 27 2012

Andrei Alexandrescu wrote:That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks might be computed by a diagonal sweep over the area defined by the two dimensions and populated by the elements. Population and sweep would each be linear in time. -manfred

Jan 27 2012

On 1/27/12 8:26 PM, Manfred Nowak wrote:Andrei Alexandrescu wrote:That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks might be computed by a diagonal sweep over the area defined by the two dimensions and populated by the elements. Population and sweep would each be linear in time. -manfred

Interesting, I have some thoughts along the same lines. Could you please give more detail? Andrei

Jan 27 2012

Manfred Nowak wrote:Population and sweep would each be linear in time.

The sweep is linear in time only if the populatable points of the area are bounded by O(n), which is not necessarily the case. Therefore, I don't see how to find, in the general case, a solution faster than O( n + k*log( k)), where `n' is the number of elements of the input and `k' is the predefined number of "best of the breed" to be sorted. -manfred

Jan 29 2012

On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

Can you define "better"? Obviously you can't beat the O(n lg n) for comparison based ordering. The O(n) space may be avoidable, but it doesn't seem like the end of the world. Does this problem occur frequently enough in practice to justify having an optimised version in the standard library? Three sorts + O(n) space sounds fine to me.

Jan 27 2012

On 1/27/12 8:33 PM, Peter Alexander wrote:On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

Can you define "better"? Obviously you can't beat the O(n lg n) for comparison based ordering. The O(n) space may be avoidable, but it doesn't seem like the end of the world. Does this problem occur frequently enough in practice to justify having an optimised version in the standard library? Three sorts + O(n) space sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is something that is often needed even if not perceived as such. For example, the stock screeners offered by financial sites are crappy - they allow you to specify a range of e.g. P/E, P/book value, dividends etc. etc. and sort by one of those. This is crappy - generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.) A colleague at Facebook mentioned that you can use O(n) bucket sort for the last-leg sorting. The best approach I'm thinking of is to allocate an extra array of size_t. First sort by k1 and then fill the array with 0, 1, ..., n-1. Then simultaneously sort the two arrays by k2, and then add 0 to the first slot, 1 to the second slot, ..., n-1 to the last slot. Continue this for any number of keys, and finally sort using the secondary array as a key. This works for any number of keys and only requires one extra array of size_t. The more interesting problem is yielding the answer lazily. I'm unclear what the best algorithm for such would be. Andrei
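The single-auxiliary-array scheme described above can be sketched as follows. This is my Python rendering, not Andrei's D code; the per-item pair `[item, accumulator]` stands in for the parallel size_t array, and `multikey_rank_sort` is a hypothetical name:

```python
def multikey_rank_sort(items, keys):
    """Sketch of the scheme above: one integer accumulator per element
    collects its rank under each key in turn; a final pass sorts by the
    accumulated rank sum."""
    pairs = [[item, 0] for item in items]  # the size_t array, initialized
    for keyfn in keys:
        # Sort by the current key, then add each element's position
        # (its rank under this key) into its accumulator slot.
        pairs.sort(key=lambda pr: keyfn(pr[0]))
        for i, pr in enumerate(pairs):
            pr[1] += i
    # Final sort by the accumulated rank sum. Since the sums lie in
    # [0, len(keys)*(n-1)], an O(n) bucket/counting sort would also
    # work here, as mentioned above.
    pairs.sort(key=lambda pr: pr[1])
    return [item for item, _ in pairs]
```

The sketch initializes the accumulator to zero and adds the rank after each key's sort, which is equivalent to the fill-then-add ordering in the post.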

Jan 27 2012

Andrei Alexandrescu wrote:generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)

This is quite different from, and easier than, the problem initially stated, because ranks need not be computed. 1) Compute the individual importance `ii' of each element `e', which seems to be the sum over all categories `c' of `e.c.value * c.importance'. 2) Use the linear median algorithm to find the element `ek' whose individual importance `ek.ii' has rank `k' in the set of all elements, where `k' is the number of elements acceptable as "best of the breed". 3) Return all elements `e' with `e.ii >= ek.ii'. This can be done by a sweep over the array. Total running time approximately: #elements * #categories. Please note that this relies on having chosen the vector of relative importances for the categories correctly. How does one know about this correctness? For example, in marketing it is desirable to occupy unique selling points, because the gain in profit might be huge. But this means that the loss for the buyer is also huge. -manfred

Jan 27 2012

On 1/28/12 1:08 AM, Manfred Nowak wrote:Andrei Alexandrescu wrote:generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)

This is quite different and easier than the problem initially stated, because ranks must not be computed.

Actually they need to be computed. I think it's possible to find cases where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2. One potentially confusing issue is that importance applies to rank, not features. Andrei
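Such a case is easy to construct. Here is a concrete instance (my own numbers, with alpha = beta = 1): element a has the smaller rank sum yet the larger weighted feature sum, so the two orderings genuinely disagree:

```python
# Five points with coordinates (k1, k2); a's k2 value is an outlier,
# which inflates its feature sum without changing its rank much.
points = {"a": (1, 100), "b": (5, 6), "c": (2, 1), "d": (3, 2), "e": (4, 3)}

def rank(name, dim):
    """0-based rank of `name` among all points, ordered by coordinate `dim`."""
    return sorted(points, key=lambda n: points[n][dim]).index(name)

rank_sum = {n: rank(n, 0) + rank(n, 1) for n in points}
feat_sum = {n: points[n][0] + points[n][1] for n in points}  # alpha = beta = 1

# a: ranks 0 + 4 = 4, features 101;  b: ranks 4 + 3 = 7, features 11.
assert rank_sum["a"] < rank_sum["b"] and feat_sum["a"] > feat_sum["b"]
```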

Jan 28 2012

Andrei Alexandrescu wrote: [ About ranks]Actually they need to be computed.

Then the problem is still too unclear.I think it's possible to find cases where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.

Of course. Especially if the a, b, ... are sortable only partially. But if the ranks are known, the problem can be finished with the linear median algorithm.One potentially confusing issue is that importance applies to rank, not features.

Do you mean that alpha * rank(a, k1) + beta * rank(a, k2) has to be extremized for all `a'? Again: the problem is still unclear. -manfred

Jan 28 2012

On 1/28/12 9:10 AM, Manfred Nowak wrote:Andrei Alexandrescu wrote: [ About ranks]Actually they need to be computed.

Then the problem is still too unclear.

It's very simple and clear in formulation actually. Given a range r of elements, define rank(a, p) as the position that object a would have if range r were sorted by predicate p. So the rank is a number between 0 and n-1 (n being the number of elements in the range). For example, if we sort [ 40, 10, 30, 20 ] using "<" as a predicate, the rank of 20 is 1. (For simplicity we can consider for now that all elements are distinct.) Now say we have multiple predicates, p1, p2, ..., pk. We want to sort elements in increasing order of rank(a, p1) + ... + rank(a, pk).I think it's possible to find cases where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.
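The definition of rank above, including the [ 40, 10, 30, 20 ] example, can be written out directly (my own sketch; distinct elements assumed, as in the post):

```python
def rank(r, a, pred=lambda x, y: x < y):
    """rank(a, p) as defined above: the position a would occupy if the
    range r were sorted by predicate pred. With distinct elements this
    is just the count of elements that precede a under pred."""
    return sum(1 for x in r if pred(x, a))

# Sorting [40, 10, 30, 20] with "<" gives [10, 20, 30, 40],
# so 20 lands at position 1.
```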

Of course. Especially if the a, b, ... are sortable only partielly. But if the ranks are known, the problem can be finished with the linear median algorithm.One potentially confusing issue is that importance applies to rank, not features.

Do you mean that alpha * rank(a, k1) + beta * rank(a, k2) has to be extremized for all `a'?

I don't understand this.Again: the problem is still unclear.

Hopefully the above has clarified it. Thanks for looking into it! Andrei

Jan 28 2012

Andrei Alexandrescu wrote:we have multiple predicates, p1, p2, ..., pk. We want to sort elements in increasing order of rank(a, p1) + ... + rank(a, pk).

Got it. Thinking occasionally. -manfred

Jan 28 2012

On 1/27/12 11:02 PM, H. S. Teoh wrote:If you treat the attributes as components of an n-dimensional vector, then you could recast the problem as the convex hull of k hypercuboids, where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The points on the convex hull then are the high-ranking points (points inside the hull are obviously suboptimal). Furthermore, the closest points to the line spanned by the vector (1,1,1,...) are the points that maximize the most number of attributes. If you "peel away" the points on the convex hull, then the convex hull generated by the remaining points gives you the "second class" points. Peel away those, and the convex hull gives you the "third class" points, etc.. Granted, this isn't exactly the original problem as stated, but it seems to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the space is defined by rank, not the original features. Andrei

Jan 28 2012

On 1/28/12 10:29 AM, H. S. Teoh wrote:Actually, the reason I chose hypercubes was because I was hoping that the convex hull with hypercubes in the original features' space would be equivalent (or easily massaged to be equivalent) to using simplexes in rank space. (Although the part about the (1,1,1,...) vector may not work out quite that nicely in this case.) Otherwise it doesn't really give you a good algorithm since you'll have to precompute the ranks.

I understand. You're up to something; if you get to prove anything in that direction, I'd be interested. Thanks! Andrei

Jan 28 2012

On Fri, Jan 27, 2012 at 09:22:26PM -0500, Andrei Alexandrescu wrote: [...]I wasn't thinking necessarily of the standard library, but it is something that is often needed even if not perceived as such. For example, the stock screeners offered by financial sites are crappy - they allow you to specify a range of e.g. P/E, P/book value, dividends etc. etc. and sort by one of those. This is crappy - generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)

If I understand the problem correctly, isn't this a kind of optimization problem? I.e., given a set A of k objects with n attributes, find the object(s) that maximizes a given measure function f over its attributes, where f is a weighted sum of n attributes. If you treat the attributes as components of an n-dimensional vector, then you could recast the problem as the convex hull of k hypercuboids, where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The points on the convex hull then are the high-ranking points (points inside the hull are obviously suboptimal). Furthermore, the closest points to the line spanned by the vector (1,1,1,...) are the points that maximize the most number of attributes. If you "peel away" the points on the convex hull, then the convex hull generated by the remaining points gives you the "second class" points. Peel away those, and the convex hull gives you the "third class" points, etc.. Granted, this isn't exactly the original problem as stated, but it seems to be closer to what you intend to achieve with your stocks example. T -- If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher

Jan 27 2012

On Sat, Jan 28, 2012 at 08:12:52AM -0600, Andrei Alexandrescu wrote:On 1/27/12 11:02 PM, H. S. Teoh wrote:If you treat the attributes as components of an n-dimensional vector, then you could recast the problem as the convex hull of k hypercuboids, where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The points on the convex hull then are the high-ranking points (points inside the hull are obviously suboptimal). Furthermore, the closest points to the line spanned by the vector (1,1,1,...) are the points that maximize the most number of attributes. If you "peel away" the points on the convex hull, then the convex hull generated by the remaining points gives you the "second class" points. Peel away those, and the convex hull gives you the "third class" points, etc.. Granted, this isn't exactly the original problem as stated, but it seems to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the space is defined by rank, not the original features.

Actually, the reason I chose hypercubes was because I was hoping that the convex hull with hypercubes in the original features' space would be equivalent (or easily massaged to be equivalent) to using simplexes in rank space. (Although the part about the (1,1,1,...) vector may not work out quite that nicely in this case.) Otherwise it doesn't really give you a good algorithm since you'll have to precompute the ranks. T -- War doesn't prove who's right, just who's left. -- BSD Games' Fortune

Jan 28 2012