## digitalmars.D - simultaneous multiple key sorting algorithm

• Andrei Alexandrescu (12/12) Jan 27 2012 Here's a multikey sorting problem that's bothering me; I think it's well...
• sclytrack (3/15) Jan 27 2012 http://en.wikipedia.org/wiki/K-d_tree
• Andrei Alexandrescu (5/24) Jan 27 2012 Thanks. I think k-d trees are good for general operations (insertion,
• Christof Schardt (13/15) Jan 27 2012 I think, you are mixing two problem levels:
• Andrei Alexandrescu (6/20) Jan 27 2012 This is not naive, because the quantities being summed are ranks, not
• Robert Jacques (2/12) Jan 27 2012 Weighting doesn't imply a simple linear constant. The rank function is a...
• Andrew Krieger (9/35) Jan 27 2012 A k-d tree isn't directly applicable here - the problem is that
• Derek (9/19) Jan 27 2012 Just a further clarification ...
• Andrei Alexandrescu (5/23) Jan 27 2012 Everybody is included in the sort: the size of the array before and
• Vladimir Panteleev (13/24) Jan 27 2012 I don't think you can get much better than this.
• Manfred Nowak (6/7) Jan 27 2012 If a temporary array is allowed, the ranks and the sum of the ranks
• Andrei Alexandrescu (4/11) Jan 27 2012 Interesting, I have some thoughts along the same lines. Could you please...
• Manfred Nowak (9/10) Jan 29 2012 The sweep is linear in time only, if the populatable points of the area
• Peter Alexander (9/13) Jan 27 2012 Can you define "better"?
• Andrei Alexandrescu (21/32) Jan 27 2012 I wasn't thinking necessarily of the standard library, but it is
• H. S. Teoh (24/32) Jan 27 2012 [...]
• Andrei Alexandrescu (4/17) Jan 28 2012 That's close. Closer is to use simplexes instead of hypercubes. And the
• Manfred Nowak (20/23) Jan 27 2012 This is quite different and easier than the problem initially
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Thanks,

Andrei
```
Jan 27 2012
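The brute-force approach described in the post (three sorts plus a temporary array) might look like the following sketch. Python is used for illustration only; `Person`, the field names, and `brute_force_rank_sort` are hypothetical, not from the thread:

```python
from collections import namedtuple

# Hypothetical record type for illustration
Person = namedtuple("Person", "name height hairlength")

def brute_force_rank_sort(people):
    """Brute force as described above: compute each person's rank under
    each key separately, then sort by the sum of the two ranks."""
    n = len(people)
    rank_sum = [0] * n  # the temporary array of accumulated ranks
    for key in (lambda p: p.height, lambda p: p.hairlength):
        # one sort per key; position r in the sorted order is the rank
        order = sorted(range(n), key=lambda i: key(people[i]))
        for r, i in enumerate(order):
            rank_sum[i] += r
    # third sort: order people by their summed ranks
    return [people[i] for i in sorted(range(n), key=lambda i: rank_sum[i])]
```

Note that the tallest person with the shortest hair and the shortest person with the longest hair can end up with equal rank sums, which is why the thread later asks whether this combination rule is really what's wanted.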
sclytrack <sclytrack fake.com> writes:
```On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Thanks,

Andrei

http://en.wikipedia.org/wiki/K-d_tree

Sclytrack
```
Jan 27 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 3:07 PM, sclytrack wrote:
On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Thanks,

Andrei

http://en.wikipedia.org/wiki/K-d_tree

Thanks. I think k-d trees are good for general operations (insertion,
deletion, and nearest neighbor query) whereas here we're interested in
sorting.

Andrei
```
Jan 27 2012
"Christof Schardt" <christof schardt.info> writes:
``` The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary

I think you are mixing two problem levels:

Level one (theory):
how are the two properties weighted against each other,
and how can they be combined into a total weight (a weight function)?

Level two (programming):
how can this be expressed in an optimal algorithmic way,
given D2?

Your question suggests that the (naive) sum function is
already the solution to level one.
But I suspect that solving the problem requires
finding a proper weight function rather than a
combination algorithm.

Christof
```
Jan 27 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 5:35 PM, Christof Schardt wrote:
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary

I think you are mixing two problem levels:

Level one (theory):
how are the two properties weighted against each other,
and how can they be combined into a total weight (a weight function)?

There's no weighing. The sorting order is rank(k1) + rank(k2).

Level two (programming):
how can this be expressed in an optimal algorithmic way,
given D2?

Your question suggests that the (naive) sum function is
already the solution to level one.

This is not naive, because the quantities being summed are ranks, not
key values.

But I suspect that solving the problem requires
finding a proper weight function rather than a
combination algorithm.

No, I think this is a confusion.

Andrei
```
Jan 27 2012
"Robert Jacques" <sandford jhu.edu> writes:
```On Fri, 27 Jan 2012 16:45:41 -0600, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
On 1/27/12 5:35 PM, Christof Schardt wrote:
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary

I think you are mixing two problem levels:

Level one (theory):
how are the two properties weighted against each other,
and how can they be combined into a total weight (a weight function)?

There's no weighing. The sorting order is rank(k1) + rank(k2).

Weighting doesn't imply a simple linear constant. The rank function is an
example of a non-linear weighting scheme. Non-linear weighting functions are
common in complex sorting and classification problems.
```
Jan 27 2012
"Andrew Krieger" <akrieger fb.com> writes:
```A k-d tree isn't directly applicable here - the problem is that
the input data isn't normalized, so you can't just drop people
into a 2d grid consisting of one axis being height, and the other
axis being hair length. You need to 'normalize' the heights and
lengths into ranks first, *then* insert them, and normalizing the
lengths -> ranks is an NlogN process to begin with. Constructing
a k-d tree isn't free either - you either have to pre-sort your
inputs, or sort-as-you-go, which works out to NlogN as well.

On Friday, 27 January 2012 at 20:07:44 UTC, sclytrack wrote:
On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:
Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Thanks,

Andrei

http://en.wikipedia.org/wiki/K-d_tree

Sclytrack
```
Jan 27 2012
Derek <ddparnell bigpond.com> writes:
```On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Just a further clarification ...

If a person is tall and has short hair, would they be included in the
sort? In other words, are you ONLY looking for tall AND long-haired people
or are you looking for relative ranking of tallness and hairiness?

--
Derek Parnell
Melbourne, Australia
```
Jan 27 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 5:27 PM, Derek wrote:
On Sat, 28 Jan 2012 06:36:49 +1100, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

Here's a multikey sorting problem that's bothering me; I think it's
well researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired".
More precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Just a further clarification ...

If a person is tall and has short hair, would they be included in the
sort? In other words, are you ONLY looking for tall AND long-haired
people or are you looking for relative ranking of tallness and hairiness?

Everybody is included in the sort: the size of the array before and
after the sort is the same. Then, of course, the top K problem is very
interesting too.

Andrei
```
Jan 27 2012
Vladimir Panteleev writes:
```On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu
wrote:
Here's a multikey sorting problem that's bothering me; I think
it's well researched but I can't find the right keywords.

Say we have a collection of people with height and hair length
information. We want to get people who are "tall and
long-haired". More precisely, we want to order people by
rank(p.height) + rank(p.hairlength), where rank(p.k) is the
function that returns the rank of person p if ordered by key k.

The brute force approach would essentially compute the two
ranks and then sort by their sum. That's three sorts and at
least one temporary array. But I think things can be done
considerably better. Any ideas?

I don't think you can get much better than this.

Knowing the rank (for one key) for one element requires looking
at all the other elements. Thus, knowing the rank of all elements
isn't possible in linear time or faster.

Knowing the ranks of only one key doesn't get you much
information regarding the final result. Even the item ranked last
on one key can tie for first place in combination with the other.

These observations seem to impose a strict dependency on the
order the data is calculated. I don't see any shortcuts, but I'd
love to be proven wrong.

You may want to ask on http://cstheory.stackexchange.com/ .
```
Jan 27 2012
Manfred Nowak <svv1999 hotmail.com> writes:
```Andrei Alexandrescu wrote:

That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks
might be computed by a diagonal sweep over the area defined by the two
dimensions and populated by the elements.

Population and sweep would each be linear in time.

-manfred
```
Jan 27 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 8:26 PM, Manfred Nowak wrote:
Andrei Alexandrescu wrote:

That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks
might be computed by a diagonal sweep over the area defined by the two
dimensions and populated by the elements.

Population and sweep would each be linear in time.

-manfred

Interesting, I have some thoughts along the same lines. Could you please
give more detail?

Andrei
```
Jan 27 2012
Manfred Nowak <svv1999 hotmail.com> writes:
```Manfred Nowak wrote:

Population and sweep would each be linear in time.

The sweep is linear in time only if the populatable points of the area
are bounded by O(n), which is not necessarily the case.

Therefore, I don't see how to find, in the general case, a solution
faster than
O( n + k*log( k)),
where `n' is the number of elements of the input and `k' is the
predefined number of "best of the breed" to be sorted.

-manfred
```
Jan 29 2012
"Peter Alexander" <peter.alexander.au gmail.com> writes:
```On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu
wrote:
The brute force approach would essentially compute the two
ranks and then sort by their sum. That's three sorts and at
least one temporary array. But I think things can be done
considerably better. Any ideas?

Can you define "better"?

Obviously you can't beat the O(n lg n) for comparison based
ordering. The O(n) space may be avoidable, but it doesn't seem
like the end of the world.

Does this problem occur frequently enough in practice to justify
having an optimised version in the standard library? Three sorts
+ O(n) space sounds fine to me.
```
Jan 27 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 8:33 PM, Peter Alexander wrote:
On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Can you define "better"?

Obviously you can't beat the O(n lg n) for comparison based ordering.
The O(n) space may be avoidable, but it doesn't seem like the end of the
world.

Does this problem occur frequently enough in practice to justify having
an optimised version in the standard library? Three sorts + O(n) space
sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is
something that is often needed even if not perceived as such. For
example, the stock screeners offered by financial sites are crappy -
they allow you to specify a range of e.g. P/E, P/book value, dividends
etc. etc. and sort by one of those. This is crappy - generally what one
wants is a selection of "best of breed" stocks that are among the top
ones in a variety of categories. (Relative importance could be assigned
to each category.)

A colleague at Facebook mentioned that you can use O(n) bucket sort for
the last-leg sorting.

The best approach I'm thinking of is to allocate an extra array of
size_t. First sort by k1 and then fill the array with 0, 1, ..., n-1.
Then sort simultaneously the two arrays by k2, and then add 0 to the
first slot, 1 to the second slot, ..., n-1 to the last slot. Continue
this by any number of keys, and finally sort using the secondary array
as a key. This works for any number of keys and only requires one extra
array of size_t.

The more interesting problem is yielding the answer lazily. I'm unclear
what the best algorithm for such would be.

Andrei
```
Jan 27 2012
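The scheme Andrei sketches above (one auxiliary array of rank sums accumulated across one sort per key, with the O(n) bucket sort his colleague suggested for the final leg) might be rendered as follows. Python is used for illustration only; the function name and data shapes are hypothetical:

```python
def multikey_rank_sort(items, keys):
    """Sketch of the scheme above: accumulate each element's rank under
    every key into one auxiliary array, then order by the rank sums."""
    n = len(items)
    if n == 0:
        return []
    pairs = [(it, 0) for it in items]  # (element, accumulated rank sum)
    for key in keys:
        # sort elements and their rank sums together by this key ...
        pairs.sort(key=lambda pr: key(pr[0]))
        # ... then add each element's position, i.e. its rank under `key`
        pairs = [(it, s + i) for i, (it, s) in enumerate(pairs)]
    # final leg: rank sums are integers in [0, len(keys)*(n-1)], so an
    # O(n) bucket (counting) sort applies, as noted in the thread
    buckets = [[] for _ in range(len(keys) * (n - 1) + 1)]
    for it, s in pairs:
        buckets[s].append(it)
    return [it for b in buckets for it in b]
```

Only one extra slot per element is needed beyond the sorts themselves, matching the "one extra array of size_t" in the post; the k sorts still dominate at O(k n log n).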
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Jan 27, 2012 at 09:22:26PM -0500, Andrei Alexandrescu wrote:
[...]
I wasn't thinking necessarily of the standard library, but it is
something that is often needed even if not perceived as such. For
example, the stock screeners offered by financial sites are crappy -
they allow you to specify a range of e.g. P/E, P/book value, dividends
etc. etc. and sort by one of those. This is crappy - generally what
one wants is a selection of "best of breed" stocks that are among the
top ones in a variety of categories. (Relative importance could be
assigned to each category.)

[...]

If I understand the problem correctly, isn't this a kind of optimization
problem? I.e., given a set A of k objects with n attributes, find the
object(s) that maximizes a given measure function f over its attributes,
where f is a weighted sum of n attributes.

If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

T

--
If the comments and the code disagree, it's likely that *both* are
wrong. -- Christopher
```
Jan 27 2012
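The peeling idea above is known as computing convex layers ("onion peeling"). A 2D sketch, using Andrew's monotone chain algorithm for the hull step (an assumption of this sketch, not something from the thread), might look like:

```python
def convex_hull(points):
    """Monotone chain convex hull of 2D points; returns hull vertices in
    counter-clockwise order (collinear points are dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convex_layers(points):
    """Peel hulls repeatedly: layer 0 is the outer hull ("first class"
    points above), layer 1 the hull of what remains, and so on."""
    remaining = list(points)
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        hull_set = set(hull)
        remaining = [p for p in remaining if p not in hull_set]
    return layers
```

For the rank-space variant Andrei prefers, the same peeling would be applied after replacing each coordinate by its rank.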
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/27/12 11:02 PM, H. S. Teoh wrote:
If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the
space is defined by rank, not the original features.

Andrei
```
Jan 28 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Sat, Jan 28, 2012 at 08:12:52AM -0600, Andrei Alexandrescu wrote:
On 1/27/12 11:02 PM, H. S. Teoh wrote:
If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.

If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..

Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And
the space is defined by rank, not the original features.

[...]

Actually, the reason I chose hypercubes was because I was hoping that
the convex hull with hypercubes in the original features' space would be
equivalent (or easily massaged to be equivalent) to using simplexes in
rank space.  (Although the part about the (1,1,1,...) vector may not
work out quite that nicely in this case.) Otherwise it doesn't really
give you a good algorithm since you'll have to precompute the ranks.

T

--
War doesn't prove who's right, just who's left. -- BSD Games' Fortune
```
Jan 28 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/28/12 10:29 AM, H. S. Teoh wrote:
Actually, the reason I chose hypercubes was because I was hoping that
the convex hull with hypercubes in the original features' space would be
equivalent (or easily massaged to be equivalent) to using simplexes in
rank space.  (Although the part about the (1,1,1,...) vector may not
work out quite that nicely in this case.) Otherwise it doesn't really
give you a good algorithm since you'll have to precompute the ranks.

I understand. You're onto something; if you get to prove anything in
that direction, I'd be interested. Thanks!

Andrei
```
Jan 28 2012
Manfred Nowak <svv1999 hotmail.com> writes:
```Andrei Alexandrescu wrote:

generally what one wants is a selection of "best of breed" stocks
that are among the top ones in a variety of categories. (Relative
importance could be assigned to each category.)

This is quite different from, and easier than, the problem initially
stated, because ranks need not be computed.

1) Compute the individual importance `ii' of each element `e', which
seems to be the sum over all categories `c' of all
`e.c.value * c.importance'
2) Use the linear median algorithm to find the element `ek' whose
individual importance `ek.ii' has rank `k' in the set of all elements
`e', where `k' is the number of elements `e' acceptable as "best of the
breed"
3) Return all elements `e' with `e.ii >= ek.ii'. This can be done
by a sweep over the array.

Total running time approximately: #elements * #categories

Please note that this relies on having chosen the vector of relative
importances correctly.

For example, in marketing it is desirable to occupy unique selling
points, because the gain in profit might be huge. But this means that
the loss for the buyer may also be huge.

-manfred
```
Jan 27 2012
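Manfred's three steps above might be sketched as follows. Python is used for illustration; `heapq.nlargest` stands in for the linear-time selection step, so this sketch runs in O(n log k) for the selection rather than the linear time a median-of-medians selection would give:

```python
import heapq

def best_of_breed(elements, importances, k):
    """Step 1: score each element by the weighted sum of its category
    values. Steps 2-3: keep the k highest-scoring elements. A true
    linear-time median/selection algorithm would make steps 2-3 O(n)."""
    def score(e):
        return sum(v * w for v, w in zip(e, importances))
    return heapq.nlargest(k, elements, key=score)
```

As Andrei points out in the reply below this post, weighting raw feature values this way is not equivalent to weighting ranks, which is why the original problem still requires the rank computation.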
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/28/12 1:08 AM, Manfred Nowak wrote:
Andrei Alexandrescu wrote:

generally what one wants is a selection of "best of breed" stocks
that are among the top ones in a variety of categories. (Relative
importance could be assigned to each category.)

This is quite different from, and easier than, the problem initially
stated, because ranks need not be computed.

Actually they need to be computed. I think it's possible to find cases
where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha *
a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.

One potentially confusing issue is that importance applies to rank, not
features.

Andrei
```
Jan 28 2012
Manfred Nowak <svv1999 hotmail.com> writes:
```Andrei Alexandrescu wrote:

Actually they need to be computed.

Then the problem is still too unclear.

I think it's possible to find cases where
rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
but
alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.

Of course. Especially if the a, b, ... are sortable only partially.
But if the ranks are known, the problem can be finished with the
linear median algorithm.

One potentially confusing issue is that importance applies to
rank, not features.

Do you mean that
alpha * rank(a, k1) + beta * rank(a, k2)
has to be extremized for all `a'?

Again: the problem is still unclear.

-manfred
```
Jan 28 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/28/12 9:10 AM, Manfred Nowak wrote:
Andrei Alexandrescu wrote:

Actually they need to be computed.

Then the problem is still too unclear.

It's very simple and clear in formulation actually.

Given a range r of elements, define rank(a, p) as the position that
object a would have if range r would be sorted by predicate p. So the
rank is a number between 0 and n-1 (n being the number of elements in
the range.) For example, if we sort [ 40, 10, 30, 20 ] using "<" as a
predicate, the rank of 20 is 1. (For simplicity we can consider for now
that all elements are distinct.)

Now say we have multiple predicates, p1, p2, ..., pk. We want to sort
elements in increasing order of rank(a, p1) + ... + rank(a, pk).

I think it's possible to find cases where
rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
but
alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.

Of course. Especially if the a, b, ... are sortable only partially.
But if the ranks are known, the problem can be finished with the
linear median algorithm.

One potentially confusing issue is that importance applies to
rank, not features.

Do you mean that
alpha * rank(a, k1) + beta * rank(a, k2)
has to be extremized for all `a'?

I don't understand this.

Again: the problem is still unclear.

Hopefully the above has clarified it. Thanks for looking into it!

Andrei
```
Jan 28 2012
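Andrei's definition above can be checked with a small sketch (Python for illustration; `less` stands for the predicate p, and elements are assumed distinct as in the post):

```python
def rank(r, a, less):
    """Position `a` would occupy if range `r` were sorted by predicate
    `less`: the count of elements that precede it, a number in 0..n-1."""
    return sum(1 for x in r if less(x, a))

def rank_sum(r, a, preds):
    """The quantity the elements are to be ordered by:
    rank(a, p1) + ... + rank(a, pk)."""
    return sum(rank(r, a, p) for p in preds)
```

With r = [40, 10, 30, 20] and "<" as the predicate, rank(r, 20) is 1, matching the example in the post.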
Manfred Nowak <svv1999 hotmail.com> writes:
```Andrei Alexandrescu wrote:

we have multiple predicates, p1, p2, ..., pk. We want to sort
elements in increasing order of rank(a, p1) + ... + rank(a, pk).

Got it. Thinking occasionally.

-manfred
```
Jan 28 2012