digitalmars.D - simultaneous multiple key sorting algorithm

Andrei Alexandrescu (12/12) Jan 27 2012 Here's a multikey sorting problem that's bothering me; I think it's well...

sclytrack (3/15) Jan 27 2012 http://en.wikipedia.org/wiki/K-d_tree

Andrei Alexandrescu (5/24) Jan 27 2012 Thanks. I think k-d trees are good for general operations (insertion,

Christof Schardt (13/15) Jan 27 2012 I think, you are mixing two problem levels:

Andrei Alexandrescu (6/20) Jan 27 2012 This is not naive, because the quantities being summed are ranks, not

Robert Jacques (2/12) Jan 27 2012 Weighting doesn't imply a simple linear constant. The rank function is a...

Andrew Krieger (9/35) Jan 27 2012 A k-d tree isn't directly applicable here - the problem is that

Derek (9/19) Jan 27 2012 Just a further clarification ...

Andrei Alexandrescu (5/23) Jan 27 2012 Everybody is included in the sort: the size of the array before and

Vladimir Panteleev (13/24) Jan 27 2012 I don't think you can get much better than this.
Manfred Nowak (6/7) Jan 27 2012 If a temporary array is allowed, the ranks and the sum of the ranks

Andrei Alexandrescu (4/11) Jan 27 2012 Interesting, I have some thoughts along the same lines. Could you please...
Manfred Nowak (9/10) Jan 29 2012 The sweep is linear in time only, if the populatable points of the area

Peter Alexander (9/13) Jan 27 2012 Can you define "better"?

Andrei Alexandrescu (21/32) Jan 27 2012 I wasn't thinking necessarily of the standard library, but it is

H. S. Teoh (24/32) Jan 27 2012 [...]

Andrei Alexandrescu (4/17) Jan 28 2012 That's close. Closer is to use simplexes instead of hypercubes. And the

H. S. Teoh (11/30) Jan 28 2012 [...]

Andrei Alexandrescu (4/10) Jan 28 2012 I understand. You're up to something; if you get to prove anything in

Manfred Nowak (20/23) Jan 27 2012 This is quite different and easier than the problem initially

Andrei Alexandrescu (7/13) Jan 28 2012 Actually they need to be computed. I think it's possible to find cases

Manfred Nowak (11/18) Jan 28 2012 Then the problem is still too unclear.

Andrei Alexandrescu (13/30) Jan 28 2012 It's very simple and clear in formulation actually.

Manfred Nowak (3/5) Jan 28 2012 Got it. Thinking ocasionally.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Here's a multikey sorting problem that's bothering me; I think it's well 
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length 
information. We want to get people who are "tall and long-haired". More 
precisely, we want to order people by rank(p.height) + 
rank(p.hairlength), where rank(p.k) is the function that returns the 
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and 
then sort by their sum. That's three sorts and at least one temporary 
array. But I think things can be done considerably better. Any ideas?


Thanks,

Andrei

Jan 27 2012

sclytrack <sclytrack fake.com> writes:

On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?


 Thanks,

 Andrei


http://en.wikipedia.org/wiki/K-d_tree


Sclytrack

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 3:07 PM, sclytrack wrote:
 On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?


 Thanks,

 Andrei


 http://en.wikipedia.org/wiki/K-d_tree

Thanks. I think k-d trees are good for general operations (insertion, 
deletion, and nearest neighbor query) whereas here we're interested in 
sorting.

Andrei

Jan 27 2012

"Christof Schardt" <christof schardt.info> writes:

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



I think, you are mixing two problem levels:

Level one (theory):
How are the two properties weighted against each other
and can be combined to a total weight (a weight-function).

Level two (programming):
how can this be expressed in an optimal algorithmical way,
given D2.

Your question suggests, the the (naive) sum-function is
already the solution to level one.
But I suspect, that the solution for the problem requires
rather to find a proper weight function than a
combinatation-algorithm.

Christof

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 5:35 PM, Christof Schardt wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



 I think, you are mixing two problem levels:

 Level one (theory):
 How are the two properties weighted against each other
 and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).

 Level two (programming):
 how can this be expressed in an optimal algorithmical way,
 given D2.

 Your question suggests, the the (naive) sum-function is
 already the solution to level one.

This is not naive, because the quantities being summed are ranks, not 
key values.

 But I suspect, that the solution for the problem requires
 rather to find a proper weight function than a
 combinatation-algorithm.

No, I think this is a confusion.


Andrei

Jan 27 2012

"Robert Jacques" <sandford jhu.edu> writes:

On Fri, 27 Jan 2012 16:45:41 -0600, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 1/27/12 5:35 PM, Christof Schardt wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary



 I think, you are mixing two problem levels:

 Level one (theory):
 How are the two properties weighted against each other
 and can be combined to a total weight (a weight-function).

 There's no weighing. The sorting order is rank(k1) + rank(k2).

Weighting doesn't imply a simple linear constant. The rank function is an
example of a non-linear weighting scheme. Non-linear weighting functions are
common in complex sorting and classification problems.

Jan 27 2012

"Andrew Krieger" <akrieger fb.com> writes:

A k-d tree isn't directly applicable here - the problem is that 
the input data isn't normalized, so you can't just drop people 
into a 2d grid consisting of one axis being height, and the other 
axis being hair length. You need to 'normalize' the heights and 
lengths into ranks first, *then* insert them, and normalizing the 
lengths -> ranks is an NlogN process to begin with. Constructing 
a k-d tree isn't free either - you either have to pre-sort your 
inputs, or sort-as-you-go, which works out to NlogN as well.

On Friday, 27 January 2012 at 20:07:44 UTC, sclytrack wrote:
 On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
 Here's a multikey sorting problem that's bothering me; I think 
 it's well
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and 
 long-haired". More
 precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that 
 returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two 
 ranks and
 then sort by their sum. That's three sorts and at least one 
 temporary
 array. But I think things can be done considerably better. Any 
 ideas?


 Thanks,

 Andrei


 http://en.wikipedia.org/wiki/K-d_tree


 Sclytrack

Jan 27 2012

Derek <ddparnell bigpond.com> writes:

On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Here's a multikey sorting problem that's bothering me; I think it's well  
 researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length  
 information. We want to get people who are "tall and long-haired". More  
 precisely, we want to order people by rank(p.height) +  
 rank(p.hairlength), where rank(p.k) is the function that returns the  
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and  
 then sort by their sum. That's three sorts and at least one temporary  
 array. But I think things can be done considerably better. Any ideas?

Just a further clarification ...

If a person is tall and has short hair, would they be included in the  
sort? In other words, are you ONLY looking for tall AND long-haired people  
or are you looking for relative ranking of tallness and hairiness?

-- 
Derek Parnell
Melbourne, Australia

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 5:27 PM, Derek wrote:
 On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 Here's a multikey sorting problem that's bothering me; I think it's
 well researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length
 information. We want to get people who are "tall and long-haired".
 More precisely, we want to order people by rank(p.height) +
 rank(p.hairlength), where rank(p.k) is the function that returns the
 rank of person p if ordered by key k.

 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?

 Just a further clarification ...

 If a person is tall and has short hair, would they be included in the
 sort? In other words, are you ONLY looking for tall AND long-haired
 people or are you looking for relative ranking of tallness and hairiness?

Everybody is included in the sort: the size of the array before and 
after the sort is the same. Then, of course, the top K problem is very 
interesting too.

Andrei

Jan 27 2012

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu 
wrote:
 Here's a multikey sorting problem that's bothering me; I think 
 it's well researched but I can't find the right keywords.

 Say we have a collection of people with height and hair length 
 information. We want to get people who are "tall and 
 long-haired". More precisely, we want to order people by 
 rank(p.height) + rank(p.hairlength), where rank(p.k) is the 
 function that returns the rank of person p if ordered by key k.

 The brute force approach would essentially compute the two 
 ranks and then sort by their sum. That's three sorts and at 
 least one temporary array. But I think things can be done 
 considerably better. Any ideas?

I don't think you can get much better than this.

Knowing the rank (for one key) for one element requires looking 
at all the other elements. Thus, knowing the rank of all elements 
isn't possible in linear time or faster.

Knowing the ranks of only one key doesn't get you much 
information regarding the final result. Even the item ranked last 
on one key can tie for first place in combination with the other.

These observations seem to impose a strict dependency on the 
order the data is calculated. I don't see any shortcuts, but I'd 
love to be proven wrong.

You may want to ask on http://cstheory.stackexchange.com/ .

Jan 27 2012

Manfred Nowak <svv1999 hotmail.com> writes:

Andrei Alexandrescu wrote:

 That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks 
might be computed by a diogonal sweep over the area defined by the two 
dimensions and populated by the elements.

Population and sweep would each be linear in time.

-manfred

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 8:26 PM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 That's three sorts and at least one temporary array.

 If a temporary array is allowed, the ranks and the sum of the ranks
 might be computed by a diogonal sweep over the area defined by the two
 dimensions and populated by the elements.

 Population and sweep would each be linear in time.

 -manfred

Interesting, I have some thoughts along the same lines. Could you please 
give more detail?

Andrei

Jan 27 2012

Manfred Nowak <svv1999 hotmail.com> writes:

Manfred Nowak wrote:

 Population and sweep would each be linear in time.

The sweep is linear in time only, if the populatable points of the area 
are bount by O(n), which is not neccessary the case.

Therefore, I don't see to find in the generell case a solution faster 
than
   O( n + k*log( k)),
where `n' is the number of elements of the input and `k' is the 
predefined number of "best of the breed" to be sorted.

-manfred

Jan 29 2012

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu 
wrote:
 The brute force approach would essentially compute the two 
 ranks and then sort by their sum. That's three sorts and at 
 least one temporary array. But I think things can be done 
 considerably better. Any ideas?

Can you define "better"?

Obviously you can't beat the O(n lg n) for comparison based 
ordering. The O(n) space may be avoidable, but it doesn't seem 
like the end of the world.

Does this problem occur frequently enough in practice to justify 
having an optimised version in the standard library? Three sorts 
+ O(n) space sounds fine to me.

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 8:33 PM, Peter Alexander wrote:
 On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:
 The brute force approach would essentially compute the two ranks and
 then sort by their sum. That's three sorts and at least one temporary
 array. But I think things can be done considerably better. Any ideas?

 Can you define "better"?

 Obviously you can't beat the O(n lg n) for comparison based ordering.
 The O(n) space may be avoidable, but it doesn't seem like the end of the
 world.

 Does this problem occur frequently enough in practice to justify having
 an optimised version in the standard library? Three sorts + O(n) space
 sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is 
something that is often needed even if not perceived as such. For 
example, the stock screeners offered by financial sites are crappy - 
they allow you to specify a range of e.g. P/E, P/book value, dividends 
etc. etc. and sort by one of those. This is crappy - generally what one 
wants is a selection of "best of breed" stocks that are among the top 
ones in a variety of categories. (Relative importance could be assigned 
to each category.)

A colleague at Facebook mentioned that you can use O(n) bucket sort for 
the last-leg sorting.

The best approach I'm thinking of is to allocate an extra array of 
size_t. First sort by k1 and then fill the array with 0, 1, ..., n-1. 
Then sort simultaneously the two arrays by k2, and then add 0 to the 
first slot, 1 to the second slot, ..., n-1 to the last slot. Continue 
this by any number of keys, and finally sort using the secondary array 
as a key. This works for any number of keys and only requires one extra 
array of size_t.

The more interesting problem is yielding the answer lazily. I'm unclear 
what the best algorithm for such would be.


Andrei

Jan 27 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 27, 2012 at 09:22:26PM -0500, Andrei Alexandrescu wrote:
[...]
 I wasn't thinking necessarily of the standard library, but it is
 something that is often needed even if not perceived as such. For
 example, the stock screeners offered by financial sites are crappy -
 they allow you to specify a range of e.g. P/E, P/book value, dividends
 etc. etc. and sort by one of those. This is crappy - generally what
 one wants is a selection of "best of breed" stocks that are among the
 top ones in a variety of categories. (Relative importance could be
 assigned to each category.)

[...]

If I understand the problem correctly, isn't this a kind of optimization
problem? I.e., given a set A of k objects with n attributes, find the
object(s) that maximizes a given measure function f over its attributes,
where f is a weighted sum of n attributes.

If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.


T

-- 
If the comments and the code disagree, it's likely that *both* are
wrong. -- Christopher

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/27/12 11:02 PM, H. S. Teoh wrote:
 If you treat the attributes as components of an n-dimensional vector,
 then you could recast the problem as the convex hull of k hypercuboids,
 where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
 points on the convex hull then are the high-ranking points (points
 inside the hull are obviously suboptimal). Furthermore, the closest
 points to the line spanned by the vector (1,1,1,...) are the points that
 maximize the most number of attributes.

 If you "peel away" the points on the convex hull, then the convex hull
 generated by the remaining points gives you the "second class" points.
 Peel away those, and the convex hull gives you the "third class" points,
 etc..

 Granted, this isn't exactly the original problem as stated, but it seems
 to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the 
space is defined by rank, not the original features.

Andrei

Jan 28 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jan 28, 2012 at 08:12:52AM -0600, Andrei Alexandrescu wrote:
 On 1/27/12 11:02 PM, H. S. Teoh wrote:
If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

 
 That's close. Closer is to use simplexes instead of hypercubes. And
 the space is defined by rank, not the original features.

[...]

Actually, the reason I chose hypercubes was because I was hoping that
the convex hull with hypercubes in the original features' space would be
equivalent (or easily massaged to be equivalent) to using simplexes in
rank space.  (Although the part about the (1,1,1,...) vector may not
work out quite that nicely in this case.) Otherwise it doesn't really
give you a good algorithm since you'll have to precompute the ranks.


T

-- 
War doesn't prove who's right, just who's left. -- BSD Games' Fortune

Jan 28 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/28/12 10:29 AM, H. S. Teoh wrote:
 Actually, the reason I chose hypercubes was because I was hoping that
 the convex hull with hypercubes in the original features' space would be
 equivalent (or easily massaged to be equivalent) to using simplexes in
 rank space.  (Although the part about the (1,1,1,...) vector may not
 work out quite that nicely in this case.) Otherwise it doesn't really
 give you a good algorithm since you'll have to precompute the ranks.

I understand. You're up to something; if you get to prove anything in 
that direction, I'd be interested. Thanks!

Andrei

Jan 28 2012

Manfred Nowak <svv1999 hotmail.com> writes:

Andrei Alexandrescu wrote:

 generally what one wants is a selection of "best of breed" stocks
 that are among the top ones in a variety of categories. (Relative
 importance could be assigned to each category.) 

This is quite different and easier than the problem initially
stated, because ranks must not be computed.

1) Compute the individual importance `ii' of each element `e', which 
seems to be the sum over all categories `c' of all
  `e.c.value * c.importance'  
2) Use the linear median algorithm to find the element `ek' whose 
individual importance `ek.ii' has rank `k' in the set of all elements 
`e' and `k' is the number of elements `e' acceptable as "best of the 
breed"
3) return all elements `e' whith `e.ii >= ek.ii'. This can be done
by a sweep over the array. 

Total running time aproximately: #elements * #categories


Please note, that this relies on having choosen the vector of relativ 
importances for the categories correctly. How does one know about this 
correctness?

For example in marketing it is desired to occupy unique selling 
points, because the gain of profit might be huge. But this means, that 
also the less for the buyer is huge. 

-manfred

Jan 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/28/12 1:08 AM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 generally what one wants is a selection of "best of breed" stocks
 that are among the top ones in a variety of categories. (Relative
 importance could be assigned to each category.)

 This is quite different and easier than the problem initially
 stated, because ranks must not be computed.

Actually they need to be computed. I think it's possible to find cases 
where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * 
a.k1 + beta * a.k2 > alpha * b.k1 + beta.k2.

One potentially confusing issue is that importance applies to rank, not 
features.


Andrei

Jan 28 2012

Manfred Nowak <svv1999 hotmail.com> writes:

Andrei Alexandrescu wrote:

[ About ranks]
 Actually they need to be computed.

Then the problem is still too unclear.

 I think it's possible to find cases where
    rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
 but
    alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta.k2.

Of course. Especially if the a, b, ... are sortable only partielly. 
But if the ranks are known, the problem can be finished with the 
linear median algorithm.

 One potentially confusing issue is that importance applies to
 rank, not features.

Do you mean that
  alpha * rank(a, k1) + beta * rank(a, k2)
has to be extremized for all `a'?

Again: the problem is still unclear.

-manfred

Jan 28 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/28/12 9:10 AM, Manfred Nowak wrote:
 Andrei Alexandrescu wrote:

 [ About ranks]
 Actually they need to be computed.

 Then the problem is still too unclear.

It's very simple and clear in formulation actually.

Given a range r of elements, define rank(a, p) as the position that 
object a would have if range r would be sorted by predicate p. So the 
rank is a number between 0 and n-1 (n being the number of elements in 
the range.) For example, if we sort [ 40, 10, 30, 20 ] using "<" as a 
predicate, the rank of 20 is 1. (For simplicity we can consider for now 
that all elements are distinct.)

Now say we have multiple predicates, p1, p2, ..., pk. We want to sort 
elements in increasing order of rank(a, p1) + ... + rank(a, pk).

 I think it's possible to find cases where
     rank(a, k1) + rank(a, k2)<  rank(b, k1) + rank(b, k2)
 but
     alpha * a.k1 + beta * a.k2>  alpha * b.k1 + beta.k2.

 Of course. Especially if the a, b, ... are sortable only partielly.
 But if the ranks are known, the problem can be finished with the
 linear median algorithm.

 One potentially confusing issue is that importance applies to
 rank, not features.

 Do you mean that
    alpha * rank(a, k1) + beta * rank(a, k2)
 has to be extremized for all `a'?

I don't understand this.

 Again: the problem is still unclear.

Hopefully the above has clarified it. Thanks for looking into it!


Andrei

Jan 28 2012

Manfred Nowak <svv1999 hotmail.com> writes:

Andrei Alexandrescu wrote:

 we have multiple predicates, p1, p2, ..., pk. We want to sort 
 elements in increasing order of rank(a, p1) + ... + rank(a, pk).

Got it. Thinking ocasionally.

-manfred

Jan 28 2012

D Programming

C/C++ Programming

Other

digitalmars.D - simultaneous multiple key sorting algorithm