www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - simultaneous multiple key sorting algorithm

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Here's a multikey sorting problem that's bothering me; I think it's well 
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length 
information. We want to get people who are "tall and long-haired". More 
precisely, we want to order people by rank(p.height) + 
rank(p.hairlength), where rank(p.k) is the function that returns the 
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and 
then sort by their sum. That's three sorts and at least one temporary 
array. But I think things can be done considerably better. Any ideas?


Thanks,

Andrei
Jan 27 2012
next sibling parent reply sclytrack <sclytrack fake.com> writes:
On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?


 Thanks,

 Andrei

http://en.wikipedia.org/wiki/K-d_tree Sclytrack
Jan 27 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 3:07 PM, sclytrack wrote:
 On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?


 Thanks,

 Andrei

http://en.wikipedia.org/wiki/K-d_tree

Thanks. I think k-d trees are good for general operations (insertion, deletion, and nearest neighbor query) whereas here we're interested in sorting. Andrei
Jan 27 2012
parent reply "Christof Schardt" <christof schardt.info> writes:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



I think, you are mixing two problem levels: Level one (theory): How are the two properties weighted against each other and can be combined to a total weight (a weight-function). Level two (programming): how can this be expressed in an optimal algorithmical way, given D2. Your question suggests, the the (naive) sum-function is already the solution to level one. But I suspect, that the solution for the problem requires rather to find a proper weight function than a combinatation-algorithm. Christof
Jan 27 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 5:35 PM, Christof Schardt wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



I think, you are mixing two problem levels: Level one (theory): How are the two properties weighted against each other and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).
 Level two (programming):
 how can this be expressed in an optimal algorithmical way,
 given D2.

 Your question suggests, the the (naive) sum-function is
 already the solution to level one.

This is not naive, because the quantities being summed are ranks, not key values.
 But I suspect, that the solution for the problem requires
 rather to find a proper weight function than a
 combinatation-algorithm.

No, I think this is a confusion. Andrei
Jan 27 2012
prev sibling next sibling parent "Andrew Krieger" <akrieger fb.com> writes:
A k-d tree isn't directly applicable here - the problem is that 
the input data isn't normalized, so you can't just drop people 
into a 2d grid consisting of one axis being height, and the other 
axis being hair length. You need to 'normalize' the heights and 
lengths into ranks first, *then* insert them, and normalizing the 
lengths -> ranks is an NlogN process to begin with. Constructing 
a k-d tree isn't free either - you either have to pre-sort your 
inputs, or sort-as-you-go, which works out to NlogN as well.

On Friday, 27 January 2012 at 20:07:44 UTC, sclytrack wrote:
 On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think 
 it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and 
 long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that 
 returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two 
 ranks and
 then sort by their sum. That's three sorts and at least one 
 temporary
 array. But I think things can be done considerably better. Any 
 ideas?


 Thanks,

 Andrei

http://en.wikipedia.org/wiki/K-d_tree Sclytrack

Jan 27 2012
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 27 Jan 2012 16:45:41 -0600, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 1/27/12 5:35 PM, Christof Schardt wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



I think, you are mixing two problem levels: Level one (theory): How are the two properties weighted against each other and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).

Weighting doesn't imply a simple linear constant. The rank function is an example of a non-linear weighting scheme. Non-linear weighting functions are common in complex sorting and classification problems.
Jan 27 2012
prev sibling next sibling parent reply Derek <ddparnell bigpond.com> writes:
On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Here's a multikey sorting problem that's bothering me; I think it's well  
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length  
 information. We want to get people who are "tall and long-haired". More  
 precisely, we want to order people by rank(p.height) +  
 rank(p.hairlength), where rank(p.k) is the function that returns the  
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and  
 then sort by their sum. That's three sorts and at least one temporary  
 array. But I think things can be done considerably better. Any ideas?

Just a further clarification ... If a person is tall and has short hair, would they be included in the sort? In other words, are you ONLY looking for tall AND long-haired people or are you looking for relative ranking of tallness and hairiness? -- Derek Parnell Melbourne, Australia
Jan 27 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 5:27 PM, Derek wrote:
 On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Here's a multikey sorting problem that's bothering me; I think it's
 well researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired".
 More precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?

Just a further clarification ... If a person is tall and has short hair, would they be included in the sort? In other words, are you ONLY looking for tall AND long-haired people or are you looking for relative ranking of tallness and hairiness?

Everybody is included in the sort: the size of the array before and after the sort is the same. Then, of course, the top K problem is very interesting too. Andrei
Jan 27 2012
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu 
wrote:
 Here's a multikey sorting problem that's bothering me; I think 
 it's well researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length 
 information. We want to get people who are "tall and 
 long-haired". More precisely, we want to order people by 
 rank(p.height) + rank(p.hairlength), where rank(p.k) is the 
 function that returns the rank of person p if ordered by key k.

 The brute force approach would essentially compute the two 
 ranks and then sort by their sum. That's three sorts and at 
 least one temporary array. But I think things can be done 
 considerably better. Any ideas?

I don't think you can get much better than this. Knowing the rank (for one key) for one element requires looking at all the other elements. Thus, knowing the rank of all elements isn't possible in linear time or faster. Knowing the ranks of only one key doesn't get you much information regarding the final result. Even the item ranked last on one key can tie for first place in combination with the other. These observations seem to impose a strict dependency on the order the data is calculated. I don't see any shortcuts, but I'd love to be proven wrong. You may want to ask on http://cstheory.stackexchange.com/ .
Jan 27 2012
prev sibling next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Andrei Alexandrescu wrote:

 That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks might be computed by a diogonal sweep over the area defined by the two dimensions and populated by the elements. Population and sweep would each be linear in time. -manfred
Jan 27 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 8:26 PM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks might be computed by a diogonal sweep over the area defined by the two dimensions and populated by the elements. Population and sweep would each be linear in time. -manfred

Interesting, I have some thoughts along the same lines. Could you please give more detail? Andrei
Jan 27 2012
prev sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:

 Population and sweep would each be linear in time.

The sweep is linear in time only, if the populatable points of the area are bount by O(n), which is not neccessary the case. Therefore, I don't see to find in the generell case a solution faster than O( n + k*log( k)), where `n' is the number of elements of the input and `k' is the predefined number of "best of the breed" to be sorted. -manfred
Jan 29 2012
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu 
wrote:
 The brute force approach would essentially compute the two 
 ranks and then sort by their sum. That's three sorts and at 
 least one temporary array. But I think things can be done 
 considerably better. Any ideas?

Can you define "better"? Obviously you can't beat the O(n lg n) for comparison based ordering. The O(n) space may be avoidable, but it doesn't seem like the end of the world. Does this problem occur frequently enough in practice to justify having an optimised version in the standard library? Three sorts + O(n) space sounds fine to me.
Jan 27 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 8:33 PM, Peter Alexander wrote:
 On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?

Can you define "better"? Obviously you can't beat the O(n lg n) for comparison based ordering. The O(n) space may be avoidable, but it doesn't seem like the end of the world. Does this problem occur frequently enough in practice to justify having an optimised version in the standard library? Three sorts + O(n) space sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is something that is often needed even if not perceived as such. For example, the stock screeners offered by financial sites are crappy - they allow you to specify a range of e.g. P/E, P/book value, dividends etc. etc. and sort by one of those. This is crappy - generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.) A colleague at Facebook mentioned that you can use O(n) bucket sort for the last-leg sorting. The best approach I'm thinking of is to allocate an extra array of size_t. First sort by k1 and then fill the array with 0, 1, ..., n-1. Then sort simultaneously the two arrays by k2, and then add 0 to the first slot, 1 to the second slot, ..., n-1 to the last slot. Continue this by any number of keys, and finally sort using the secondary array as a key. This works for any number of keys and only requires one extra array of size_t. The more interesting problem is yielding the answer lazily. I'm unclear what the best algorithm for such would be. Andrei
Jan 27 2012
next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Andrei Alexandrescu wrote:

 generally what one wants is a selection of "best of breed" stocks
 that are among the top ones in a variety of categories. (Relative
 importance could be assigned to each category.) 

This is quite different and easier than the problem initially stated, because ranks must not be computed. 1) Compute the individual importance `ii' of each element `e', which seems to be the sum over all categories `c' of all `e.c.value * c.importance' 2) Use the linear median algorithm to find the element `ek' whose individual importance `ek.ii' has rank `k' in the set of all elements `e' and `k' is the number of elements `e' acceptable as "best of the breed" 3) return all elements `e' whith `e.ii >= ek.ii'. This can be done by a sweep over the array. Total running time aproximately: #elements * #categories Please note, that this relies on having choosen the vector of relativ importances for the categories correctly. How does one know about this correctness? For example in marketing it is desired to occupy unique selling points, because the gain of profit might be huge. But this means, that also the less for the buyer is huge. -manfred
Jan 27 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/28/12 1:08 AM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 generally what one wants is a selection of "best of breed" stocks
 that are among the top ones in a variety of categories. (Relative
 importance could be assigned to each category.)

This is quite different and easier than the problem initially stated, because ranks must not be computed.

Actually they need to be computed. I think it's possible to find cases where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta.k2. One potentially confusing issue is that importance applies to rank, not features. Andrei
Jan 28 2012
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Andrei Alexandrescu wrote:

[ About ranks]
 Actually they need to be computed.

Then the problem is still too unclear.
 I think it's possible to find cases where
    rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
 but
    alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta.k2.

Of course. Especially if the a, b, ... are sortable only partielly. But if the ranks are known, the problem can be finished with the linear median algorithm.
 One potentially confusing issue is that importance applies to
 rank, not features.

Do you mean that alpha * rank(a, k1) + beta * rank(a, k2) has to be extremized for all `a'? Again: the problem is still unclear. -manfred
Jan 28 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/28/12 9:10 AM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 [ About ranks]
 Actually they need to be computed.

Then the problem is still too unclear.

It's very simple and clear in formulation actually. Given a range r of elements, define rank(a, p) as the position that object a would have if range r would be sorted by predicate p. So the rank is a number between 0 and n-1 (n being the number of elements in the range.) For example, if we sort [ 40, 10, 30, 20 ] using "<" as a predicate, the rank of 20 is 1. (For simplicity we can consider for now that all elements are distinct.) Now say we have multiple predicates, p1, p2, ..., pk. We want to sort elements in increasing order of rank(a, p1) + ... + rank(a, pk).
 I think it's possible to find cases where
     rank(a, k1) + rank(a, k2)<  rank(b, k1) + rank(b, k2)
 but
     alpha * a.k1 + beta * a.k2>  alpha * b.k1 + beta.k2.

Of course. Especially if the a, b, ... are sortable only partielly. But if the ranks are known, the problem can be finished with the linear median algorithm.
 One potentially confusing issue is that importance applies to
 rank, not features.

Do you mean that alpha * rank(a, k1) + beta * rank(a, k2) has to be extremized for all `a'?

I don't understand this.
 Again: the problem is still unclear.

Hopefully the above has clarified it. Thanks for looking into it! Andrei
Jan 28 2012
parent Manfred Nowak <svv1999 hotmail.com> writes:
Andrei Alexandrescu wrote:

 we have multiple predicates, p1, p2, ..., pk. We want to sort 
 elements in increasing order of rank(a, p1) + ... + rank(a, pk).

Got it. Thinking ocasionally. -manfred
Jan 28 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/27/12 11:02 PM, H. S. Teoh wrote:
 If you treat the attributes as components of an n-dimensional vector,
 then you could recast the problem as the convex hull of k hypercuboids,
 where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
 points on the convex hull then are the high-ranking points (points
 inside the hull are obviously suboptimal). Furthermore, the closest
 points to the line spanned by the vector (1,1,1,...) are the points that
 maximize the most number of attributes.

 If you "peel away" the points on the convex hull, then the convex hull
 generated by the remaining points gives you the "second class" points.
 Peel away those, and the convex hull gives you the "third class" points,
 etc..

 Granted, this isn't exactly the original problem as stated, but it seems
 to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the space is defined by rank, not the original features. Andrei
Jan 28 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/28/12 10:29 AM, H. S. Teoh wrote:
 Actually, the reason I chose hypercubes was because I was hoping that
 the convex hull with hypercubes in the original features' space would be
 equivalent (or easily massaged to be equivalent) to using simplexes in
 rank space.  (Although the part about the (1,1,1,...) vector may not
 work out quite that nicely in this case.) Otherwise it doesn't really
 give you a good algorithm since you'll have to precompute the ranks.

I understand. You're up to something; if you get to prove anything in that direction, I'd be interested. Thanks! Andrei
Jan 28 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 27, 2012 at 09:22:26PM -0500, Andrei Alexandrescu wrote:
[...]
 I wasn't thinking necessarily of the standard library, but it is
 something that is often needed even if not perceived as such. For
 example, the stock screeners offered by financial sites are crappy -
 they allow you to specify a range of e.g. P/E, P/book value, dividends
 etc. etc. and sort by one of those. This is crappy - generally what
 one wants is a selection of "best of breed" stocks that are among the
 top ones in a variety of categories. (Relative importance could be
 assigned to each category.)

If I understand the problem correctly, isn't this a kind of optimization problem? I.e., given a set A of k objects with n attributes, find the object(s) that maximizes a given measure function f over its attributes, where f is a weighted sum of n attributes. If you treat the attributes as components of an n-dimensional vector, then you could recast the problem as the convex hull of k hypercuboids, where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The points on the convex hull then are the high-ranking points (points inside the hull are obviously suboptimal). Furthermore, the closest points to the line spanned by the vector (1,1,1,...) are the points that maximize the most number of attributes. If you "peel away" the points on the convex hull, then the convex hull generated by the remaining points gives you the "second class" points. Peel away those, and the convex hull gives you the "third class" points, etc.. Granted, this isn't exactly the original problem as stated, but it seems to be closer to what you intend to achieve with your stocks example. T -- If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Jan 27 2012
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jan 28, 2012 at 08:12:52AM -0600, Andrei Alexandrescu wrote:
 On 1/27/12 11:02 PM, H. S. Teoh wrote:
If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the space is defined by rank, not the original features.

Actually, the reason I chose hypercubes was because I was hoping that the convex hull with hypercubes in the original features' space would be equivalent (or easily massaged to be equivalent) to using simplexes in rank space. (Although the part about the (1,1,1,...) vector may not work out quite that nicely in this case.) Otherwise it doesn't really give you a good algorithm since you'll have to precompute the ranks. T -- War doesn't prove who's right, just who's left. -- BSD Games' Fortune
Jan 28 2012