digitalmars.D - [OT] n-way union

Andrei Alexandrescu (24/24) May 23 2009 This is somewhat OT but I think it's an interesting problem. Consider

Sean Kelly (14/42) May 23 2009 It seems like there are two basic options: either merge the arrays (as

Sean Kelly (9/9) May 23 2009 I have a better idea which is pretty much in line with the heap-based

Sean Kelly (2/11) May 23 2009 hehe... "pretty much in line" turns out to be "exactly the same." Oh we...

Jason House (4/35) May 23 2009 My first thought was to throw everything in a heap map with key=r.front ...
BCS (8/39) May 23 2009 in sudocode

Don (16/60) May 26 2009 In practice, I bet it'll go faster with a couple more steps:

Andrei Alexandrescu (9/37) May 26 2009 Correct. Unfortunately my heap primitives didn't quite allow that so

Don (11/50) May 26 2009 Depending on what "going through each element" means. Certainly, if you

Andrei Alexandrescu (5/15) May 26 2009 Oh, I see. So yes, comparisons can be reduced. So how exactly do you

Don (10/25) May 26 2009 That was my original thought, but it's easy to come up with cases that
Jason House (3/21) May 26 2009 This conversation made me realize a simple optimization (forgive me if i...

Andrei Alexandrescu (7/39) May 26 2009 Cool! That can actually be implemented as a micro-optimization with the

Frits van Bommel (4/24) May 26 2009 If you keep it separate, you only need one comparison to determine the c...

Andrei Alexandrescu (3/29) May 26 2009 Good point, and also I don't need to worry about adding new heap primiti...

Robert Fraser (2/34) May 23 2009 I'd be interested if there's some way to remove duplicates.
Georg Wrede (38/65) May 25 2009 If we'd know anything about the data, such as, the max value is always

Andrei Alexandrescu (3/40) May 25 2009 You can assume that each array is sorted.

Georg Wrede (4/45) May 25 2009 Err, you didn't comment on my algorithm, at the end. So, either it is

Jason House (2/48) May 25 2009 Don't get offended, he didn't respond to anyone else's algorithm either....

Andrei Alexandrescu (11/12) May 25 2009 I just wanted to bring what I think is an interesting problem to the

Jason House (3/17) May 25 2009 I bet BCS had the advantage of a real keyboard ;)

BCS (2/4) May 25 2009 you are 89% correct: http://en.wikipedia.org/wiki/Acer_Aspire_One

Andrei Alexandrescu (20/26) May 25 2009 I am sorry, at the first two reads I did not understand your algorithm.

Georg Wrede (71/101) May 25 2009 Canonical one. Righhhhhtttttt......

Andrei Alexandrescu (21/32) May 25 2009 [snip]

Georg Wrede (4/7) May 26 2009 Maybe I didn't understand BCS's algorithm. To me it looked slower. I'll
Georg Wrede (4/7) May 26 2009 Maybe I misunderstood the point of this OT thread. One could believe the...

bearophile (11/22) May 25 2009 Better to yield lazily the output items.

Andrei Alexandrescu (4/10) May 25 2009 I don't think that applies here. Unstable removal is O(1) and restoring

bearophile (9/10) May 25 2009 It's better for nWayUnion to be a lazy iterable.

Andrei Alexandrescu (5/8) May 25 2009 The policy of std.algorithm is to not surreptitiously allocate memory if...

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

This is somewhat OT but I think it's an interesting problem. Consider 
the following data:

double[][] a =
[
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
];

We want to compute an n-way union, i.e., efficiently span all elements 
in all arrays in a, in sorted order. You can assume that each individual 
array in a is sorted. The output of n-way union should be:

auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
];
assert(equal(nWayUnion(a), witness[]));

The STL and std.algorithm have set_union that does that for two sets: 
for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
n-way unions poses additional challenges. What would be a fast 
algorithm? (Imagine a could be a very large range of ranges).

Needless to say, nWayUnion is a range :o).

Finally, why would anyone care for something like this?


Andrei

May 23 2009

Sean Kelly <sean invisibleduck.org> writes:

Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

It seems like there are two basic options: either merge the arrays (as
per merge sort) and deal with multiple passes across the same data
elements or insert the elements from each array into a single
destination array and deal with a bunch of memmove operations.

The merge option is kind of interesting because it could benefit from a
parallel range of sorts.  Each front() op could actually return a range
which contained the front element of each non-empty range.  Pass this to 
a min() op accepting a range and drop the min element into the 
destination.  The tricky part would be working things in such a way that 
the originating range could have popFront() called once the insertion 
had occurred.

 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?

Other than mergeSort?  I'd think that being able to perform union of N 
sets in unison would be a nice way to eliminate arbitrary restrictions.

May 23 2009

Sean Kelly <sean invisibleduck.org> writes:

I have a better idea which is pretty much in line with the heap-based 
solutions already posted.  Place all the ranges in a min priority queue 
where the priority of a range is equal to the value of its top element. 
  Pop the value off the heap at the front of the queue.  If the heap is 
empty then remove it.  If it contains more elements then fix up the 
priority queue to restore the invariant (since the priority of the head 
range may have changed).  The temporary space would just be of size N 
where N is the number of ranges, and the number of duplicate compares is 
minimized.

May 23 2009

Sean Kelly <sean invisibleduck.org> writes:

Sean Kelly wrote:
 I have a better idea which is pretty much in line with the heap-based 
 solutions already posted.  Place all the ranges in a min priority queue 
 where the priority of a range is equal to the value of its top element. 
  Pop the value off the heap at the front of the queue.  If the heap is 
 empty then remove it.  If it contains more elements then fix up the 
 priority queue to restore the invariant (since the priority of the head 
 range may have changed).  The temporary space would just be of size N 
 where N is the number of ranges, and the number of duplicate compares is 
 minimized.

hehe... "pretty much in line" turns out to be "exactly the same."  Oh well.

May 23 2009

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
      [ 1, 4, 7, 8 ],
      [ 1, 7 ],
      [ 1, 7, 8],
      [ 4 ],
      [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
      1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

My first thought was to throw everything in a heap map with key=r.front and
value=r

optimal.
 
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 
 Andrei

May 23 2009

BCS <none anon.com> writes:

Hello Andrei,

 This is somewhat OT but I think it's an interesting problem. Consider
 the following data:
 
 double[][] a =
 [
 [ 1, 4, 7, 8 ],
 [ 1, 7 ],
 [ 1, 7, 8],
 [ 4 ],
 [ 7 ],
 ];
 We want to compute an n-way union, i.e., efficiently span all elements
 in all arrays in a, in sorted order. You can assume that each
 individual array in a is sorted. The output of n-way union should be:
 
 auto witness = [
 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets:
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But
 n-way unions poses additional challenges. What would be a fast
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 Andrei
 

in sudocode

1. build a min heap out of the elements based on the first element of each 
array
2. remove the first value from the root of the heap,
3. if empty array, drop it
4. if not empty re-heap
5. goto 2 if anything left

May 23 2009

Don <nospam nospam.com> writes:

BCS wrote:
 Hello Andrei,
 
 This is somewhat OT but I think it's an interesting problem. Consider
 the following data:

 double[][] a =
 [
 [ 1, 4, 7, 8 ],
 [ 1, 7 ],
 [ 1, 7, 8],
 [ 4 ],
 [ 7 ],
 ];
 We want to compute an n-way union, i.e., efficiently span all elements
 in all arrays in a, in sorted order. You can assume that each
 individual array in a is sorted. The output of n-way union should be:

 auto witness = [
 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets:
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But
 n-way unions poses additional challenges. What would be a fast
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).

 Finally, why would anyone care for something like this?

 Andrei

 
 in sudocode
 
 1. build a min heap out of the elements based on the first element of 
 each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left
 
 

In practice, I bet it'll go faster with a couple more steps:
2. remove the first value (from array X) from the root of the heap,
2a. Determine new root of the heap (don't need to do a full re-heap yet).
2b. move all values from X until empty or >= new root.
4. reheap

There are interesting cases like:
[5,6,7,8,9,9,......9]
[30, 31,32,33,34...38]
[1,1,1,1,1,1,1,1,1,1,...1]
[10,11,12,13,14,15,16,16,......., 16]

where in theory, the total number of comparisons required is a function 
only of the number of arrays, and is independent of the actual number of 
elements.
If it's common to have runs where consecutive values come from a single 
array, and the arrays are random access, you could do MUCH better than O(n).

May 26 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Don wrote:
 BCS wrote:
 in sudocode

 1. build a min heap out of the elements based on the first element of 
 each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left

 In practice, I bet it'll go faster with a couple more steps:
 2. remove the first value (from array X) from the root of the heap,
 2a. Determine new root of the heap (don't need to do a full re-heap yet).
 2b. move all values from X until empty or >= new root.
 4. reheap

Correct. Unfortunately my heap primitives didn't quite allow that so 
currently I do it the "hard" way: pop from top, remove top from heap, 
reinsert (if not empty). But this is definitely an optimization worth 
doing. I think I'll implement it.

 There are interesting cases like:
 [5,6,7,8,9,9,......9]
 [30, 31,32,33,34...38]
 [1,1,1,1,1,1,1,1,1,1,...1]
 [10,11,12,13,14,15,16,16,......., 16]
 
 where in theory, the total number of comparisons required is a function 
 only of the number of arrays, and is independent of the actual number of 
 elements.
 If it's common to have runs where consecutive values come from a single 
 array, and the arrays are random access, you could do MUCH better than 
 O(n).

I think you can't do better. Complexity is O(n) at a minimum because you 
need to go through each element. The heap-based algorithm has O(n log 
m). What you could improve on is the log m factor (m = the number of sets).


Andrei

May 26 2009

Don <nospam nospam.com> writes:

Andrei Alexandrescu wrote:
 Don wrote:
 BCS wrote:
 in sudocode

 1. build a min heap out of the elements based on the first element of 
 each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left

 In practice, I bet it'll go faster with a couple more steps:
 2. remove the first value (from array X) from the root of the heap,
 2a. Determine new root of the heap (don't need to do a full re-heap yet).
 2b. move all values from X until empty or >= new root.
 4. reheap

 
 Correct. Unfortunately my heap primitives didn't quite allow that so 
 currently I do it the "hard" way: pop from top, remove top from heap, 
 reinsert (if not empty). But this is definitely an optimization worth 
 doing. I think I'll implement it.
 
 There are interesting cases like:
 [5,6,7,8,9,9,......9]
 [30, 31,32,33,34...38]
 [1,1,1,1,1,1,1,1,1,1,...1]
 [10,11,12,13,14,15,16,16,......., 16]

 where in theory, the total number of comparisons required is a 
 function only of the number of arrays, and is independent of the 
 actual number of elements.
 If it's common to have runs where consecutive values come from a 
 single array, and the arrays are random access, you could do MUCH 
 better than O(n).

 
 I think you can't do better. Complexity is O(n) at a minimum because you 
 need to go through each element. The heap-based algorithm has O(n log 
 m). What you could improve on is the log m factor (m = the number of sets).

Depending on what "going through each element" means. Certainly, if you 
need to move them, you can't beat O(n) moves, but you can definitely get 
O(m*m) < < O(n) comparisons in the best case I described. And that would 
also be the number of splicing operations.

You could approximate this without hurting the worst-case too much, by 
(for example) once you get a run of more than K entries from the same 
array, step forward 2 entries; if that's still less than the root, copy 
both to the output, then step forward 4 entries, etc.

That should give a best-case O(m log n) comparisons/splices.
Which might not be of any practical use, of course.

May 26 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).

 
 Depending on what "going through each element" means. Certainly, if you 
 need to move them, you can't beat O(n) moves, but you can definitely get 
 O(m*m) < < O(n) comparisons in the best case I described. And that would 
 also be the number of splicing operations.

Oh, I see. So yes, comparisons can be reduced. So how exactly do you 
think that can be done? By binary searching the upper bound of the head 
to find a long run of identical elements?


Andrei

May 26 2009

Don <nospam nospam.com> writes:

Andrei Alexandrescu wrote:
 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).

 Depending on what "going through each element" means. Certainly, if 
 you need to move them, you can't beat O(n) moves, but you can 
 definitely get O(m*m) < < O(n) comparisons in the best case I 
 described. And that would also be the number of splicing operations.

 
 Oh, I see. So yes, comparisons can be reduced. So how exactly do you 
 think that can be done? By binary searching the upper bound of the head 
 to find a long run of identical elements?

That was my original thought, but it's easy to come up with cases that 
would perform really badly -- so that you do a binary search for every 
two elements.
Which was why I suggested repeated doubling of the stride distance, I 
think you can prove that's never worse than a linear search of every 
element.
Although, if it was worth doing, I'd imagine that implementations of STL 
set_union would already be doing it.

Anyway, it's a really interesting problem.

May 26 2009

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).

 
 Depending on what "going through each element" means. Certainly, if you 
 need to move them, you can't beat O(n) moves, but you can definitely get 
 O(m*m) < < O(n) comparisons in the best case I described. And that would 
 also be the number of splicing operations.

 
 Oh, I see. So yes, comparisons can be reduced. So how exactly do you 
 think that can be done? By binary searching the upper bound of the head 
 to find a long run of identical elements?
 
 
 Andrei

This conversation made me realize a simple optimization (forgive me if it's
what Don was implying)

Instead of storing m ranges in a heap, store only m-1 ranges. The range that
would be the min if the heap is kept separate. Each pop operation will compare
the new front with the min heap element. This allows avoiding the re-heaping
when the next element comes from the same range. If s is the number of times
the source range must switch, then the performance is O( s*log(m) + n - s). The
initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). 
Technically, the re-heaping steps are log(m-1), but that isn't important with
big O.

May 26 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum
 because you need to go through each element. The heap-based
 algorithm has O(n log m). What you could improve on is the log
 m factor (m = the number of sets).

 Depending on what "going through each element" means. Certainly,
 if you need to move them, you can't beat O(n) moves, but you can
 definitely get O(m*m) < < O(n) comparisons in the best case I
 described. And that would also be the number of splicing
 operations.

 Oh, I see. So yes, comparisons can be reduced. So how exactly do
 you think that can be done? By binary searching the upper bound of
 the head to find a long run of identical elements?
 
 
 Andrei

 
 This conversation made me realize a simple optimization (forgive me
 if it's what Don was implying)
 
 Instead of storing m ranges in a heap, store only m-1 ranges. The
 range that would be the min if the heap is kept separate. Each pop
 operation will compare the new front with the min heap element. This
 allows avoiding the re-heaping when the next element comes from the
 same range. If s is the number of times the source range must switch,
 then the performance is O( s*log(m) + n - s). The initial heaping is
 O(m*log(m)), but that's never worse than O(s*log(m)).  Technically,
 the re-heaping steps are log(m-1), but that isn't important with big
 O.

Cool! That can actually be implemented as a micro-optimization with the 
classic heap: right after you extract from the top (arr[0]), you look at 
its children (which are arr[1] and arr[2]). If both front()'s are still 
no less than yours, everything stays as it was and you're home free.

I'll implement this, thanks.


Andrei

May 26 2009

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Andrei Alexandrescu wrote:
 Jason House wrote:
 This conversation made me realize a simple optimization (forgive me
 if it's what Don was implying)

 Instead of storing m ranges in a heap, store only m-1 ranges. The
 range that would be the min if the heap is kept separate. Each pop
 operation will compare the new front with the min heap element. This
 allows avoiding the re-heaping when the next element comes from the
 same range. If s is the number of times the source range must switch,
 then the performance is O( s*log(m) + n - s). The initial heaping is
 O(m*log(m)), but that's never worse than O(s*log(m)).  Technically,
 the re-heaping steps are log(m-1), but that isn't important with big
 O.

 
 Cool! That can actually be implemented as a micro-optimization with the 
 classic heap: right after you extract from the top (arr[0]), you look at 
 its children (which are arr[1] and arr[2]). If both front()'s are still 
 no less than yours, everything stays as it was and you're home free.
 
 I'll implement this, thanks.

If you keep it separate, you only need one comparison to determine the current 
range still contains the minimum element.
Just saying :).

May 26 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Frits van Bommel wrote:
 Andrei Alexandrescu wrote:
 Jason House wrote:
 This conversation made me realize a simple optimization (forgive me
 if it's what Don was implying)

 Instead of storing m ranges in a heap, store only m-1 ranges. The
 range that would be the min if the heap is kept separate. Each pop
 operation will compare the new front with the min heap element. This
 allows avoiding the re-heaping when the next element comes from the
 same range. If s is the number of times the source range must switch,
 then the performance is O( s*log(m) + n - s). The initial heaping is
 O(m*log(m)), but that's never worse than O(s*log(m)).  Technically,
 the re-heaping steps are log(m-1), but that isn't important with big
 O.

 Cool! That can actually be implemented as a micro-optimization with 
 the classic heap: right after you extract from the top (arr[0]), you 
 look at its children (which are arr[1] and arr[2]). If both front()'s 
 are still no less than yours, everything stays as it was and you're 
 home free.

 I'll implement this, thanks.

 
 If you keep it separate, you only need one comparison to determine the 
 current range still contains the minimum element.
 Just saying :).

Good point, and also I don't need to worry about adding new heap primitives.

Andrei

May 26 2009

Robert Fraser <fraserofthenight gmail.com> writes:

Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 
 Andrei

I'd be interested if there's some way to remove duplicates.

May 23 2009

Georg Wrede <georg.wrede iki.fi> writes:

Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).

If we'd know anything about the data, such as, the max value is always 
smaller than the total number of elements in the subarrays, then we'd 
probably more easily invent a decent algorithm.

But the totally general algorithm has to be more inefficient. And 
constructing (not worst-case, but) tough-case data is trivial. For 
example, take a thousand subarrays, each a thousand elements long, 
containing random uints from the inclusive range 0..uint.max.

It's likely that you then can pop the first element in a[0], but the 
second element is larger, and the first element in a[1] is also larger. 
This results in that a is still sorted, except that a[0] may belong to a 
later position.

On the other hand, in your example, after popping all the ones, a is 
severely out of order.

---------------------

You probably have to have an array of structs, where each struct 
contains a reference to a subarray, and a copy of the first value of 
this subarray.

Sort the array of structs, and beginning with the first struct, pop from 
the pointed-to subarray as long as it contains elements equal to the 
first one. Then pop from the subarray that belongs to following struct, 
and so on, until a subbarray's first element already is larger.

While popping, you obviously update the value copies, and append to an 
output array.

Then sort the array of structs, trim it in case some of the structs now 
point to empty subarrays, and start over.


This is only moderately efficient, but had Stepanof or anybody else come 
up with something essentially better, you'd have heard of it already.

The only upshot of this algorithm is that the array a doesn't have to be 
sorted initially. (The subarrays, of course still do.)


The algorithm could be enhanced by, instead of an array of structs, 
having a singly liked list of these structs. Each time you pop, you move 
the relevant struct to another list, poppedStructs. When done, you only 
have to sort the poppedStructs list, and then /merge/ it with the list 
of unpopped structs, which still is sorted. This would obviously 
drastically improve efficiency, since it is likely that what needs to be 
sorted now is only a few items per round.

Hmmm. Maybe this isn't all that inefficient, after all...

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each 
 individual array in a is sorted. The output of n-way union should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).

 
 If we'd know anything about the data, such as, the max value is always 
 smaller than the total number of elements in the subarrays, then we'd 
 probably more easily invent a decent algorithm.
 
 But the totally general algorithm has to be more inefficient. And 
 constructing (not worst-case, but) tough-case data is trivial. For 
 example, take a thousand subarrays, each a thousand elements long, 
 containing random uints from the inclusive range 0..uint.max.

You can assume that each array is sorted.

Andrei

May 25 2009

Georg Wrede <georg.wrede iki.fi> writes:

Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all 
 elements in all arrays in a, in sorted order. You can assume that 
 each individual array in a is sorted. The output of n-way union 
 should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).

 If we'd know anything about the data, such as, the max value is always 
 smaller than the total number of elements in the subarrays, then we'd 
 probably more easily invent a decent algorithm.

 But the totally general algorithm has to be more inefficient. And 
 constructing (not worst-case, but) tough-case data is trivial. For 
 example, take a thousand subarrays, each a thousand elements long, 
 containing random uints from the inclusive range 0..uint.max.

 
 You can assume that each array is sorted.

Err, you didn't comment on my algorithm, at the end. So, either it is 
worthless to an extent not deserving even a dismissal, or you didn't 
read the rest of the post.

May 25 2009

Jason House <jason.james.house gmail.com> writes:

Georg Wrede Wrote:

 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all 
 elements in all arrays in a, in sorted order. You can assume that 
 each individual array in a is sorted. The output of n-way union 
 should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).

 If we'd know anything about the data, such as, the max value is always 
 smaller than the total number of elements in the subarrays, then we'd 
 probably more easily invent a decent algorithm.

 But the totally general algorithm has to be more inefficient. And 
 constructing (not worst-case, but) tough-case data is trivial. For 
 example, take a thousand subarrays, each a thousand elements long, 
 containing random uints from the inclusive range 0..uint.max.

 
 You can assume that each array is sorted.

 
 Err, you didn't comment on my algorithm, at the end. So, either it is 
 worthless to an extent not deserving even a dismissal, or you didn't 
 read the rest of the post.

Don't get offended, he didn't respond to anyone else's algorithm either.

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Jason House wrote:
 Don't get offended, he didn't respond to anyone else's algorithm either. 

I just wanted to bring what I think is an interesting problem to the 
table; my response shouldn't be used as a yardstick for one's ability to 
define a good algorithm.

Now that I feel pressured to give an answer, I'd say that BCS's answer 
is right on the money. It's the exact algorithm I put in std.algorithm 
(not checked in yet), and it works leanly and meanly on a large data set 
as we speak. Sean's, bearophile's, and your answers were also very close 
to BCS's, just that yours was a tad less detailed and the others came a 
bit later.


Andrei

May 25 2009

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 Jason House wrote:
 Don't get offended, he didn't respond to anyone else's algorithm either. 

 
 I just wanted to bring what I think is an interesting problem to the 
 table; my response shouldn't be used as a yardstick for one's ability to 
 define a good algorithm.
 
 Now that I feel pressured to give an answer, 

Sorry if made you feel pressured. I didn't expect a response.


 I'd say that BCS's answer 
 is right on the money. It's the exact algorithm I put in std.algorithm 
 (not checked in yet), and it works leanly and meanly on a large data set 
 as we speak. Sean's, bearophile's, and your answers were also very close 
 to BCS's, just that yours was a tad less detailed and the others came a 
 bit later.

I bet BCS had the advantage of a real keyboard ;)

May 25 2009

BCS <none anon.com> writes:

Hello Jason,

 I bet BCS had the advantage of a real keyboard ;)
 

you are 89% correct: http://en.wikipedia.org/wiki/Acer_Aspire_One

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 You can assume that each array is sorted.

 
 Err, you didn't comment on my algorithm, at the end. So, either it is 
 worthless to an extent not deserving even a dismissal, or you didn't 
 read the rest of the post.

I am sorry, at the first two reads I did not understand your algorithm. 
I noticed that it is more complicated than the canonical one, and also 
it necessitates storage to hold the entire output. At the third read I 
did understand it, or at least I understood how you mean it to work. It 
has at least one bug here:

========
Sort the array of structs, and beginning with the first struct, pop from 
the pointed-to subarray as long as it contains elements equal to the 
first one. Then pop from the subarray that belongs to following struct, 
and so on, until a subbarray's first element already is larger.
========

It would fail on the data:

double[][] a =
[
     [ 1, 1, 4, 7, 8 ],
     [ 7 ],
];

You will pop all 1s and then you'll pop 7. You should be popping 4.


Andrei

May 25 2009

Georg Wrede <georg.wrede iki.fi> writes:

Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 You can assume that each array is sorted.

 Err, you didn't comment on my algorithm, at the end. So, either it is 
 worthless to an extent not deserving even a dismissal, or you didn't 
 read the rest of the post.

 
 I am sorry, at the first two reads I did not understand your algorithm. 
 I noticed that it is more complicated than the canonical one

Canonical one. Righhhhhtttttt......

 From the outset I thought that you did have a solution in mind. And I 
thought that the pause of a couple of days before any seroious attemnts 
only meant that everybody else understood it, too. So, nobody wanted to 
spend 5 days inventing something which would only be dismissed "because 
Knuth did it better 50 years ago".

 and also it necessitates storage to hold the entire output.

I never said whether it'd be on stack or not. An array (or a linked 
list) of structs can be placed anywhere. Especially if you know (i.e. 
can easily count) the total number of items, you can allocate up front, 
be it on the heap or the stack.

 At the third read I did understand it, or at least I understood how
 you mean it to work. It has at least one bug here:
 
 ========
 Sort the array of structs, and beginning with the first struct, pop from 
 the pointed-to subarray as long as it contains elements equal to the 
 first one. Then pop from the subarray that belongs to following struct, 
 and so on, until a subbarray's first element already is larger.
 ========
 
 It would fail on the data:
 
 double[][] a =
 [
     [ 1, 1, 4, 7, 8 ],
     [ 7 ],
 ];
 
 You will pop all 1s and then you'll pop 7. You should be popping 4.

Sigh.

So you didn't read that part carefully. You also didn't read the 
following part, at all.

I admit, I relied too much on your (perceived, superior) ability, and 
thus described the algorithm top-level only.

My first thought was to actually write the whole thing in D2, but then I 
remembered that (as an old man), I code things at most at 10% the speed 
of some of the young and strong here. I didn't want to spend 2 days on 
something I would demand of myself of doing in 2 hours. Hence the 
pseudocode.


My algorithm would pop both the 1's. Then it would see that the 4 is 
bigger and give up. It would then go to a[0+1], and find nothing there 
(i.e. 7 is larger than 1), so it'd consider this round done.

Now, where I didn't "follow description" was, my algorithm was 
insensitive to there being several instances of the same value in a 
subarray. (My mistake, the "1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8" thing led 
me astray.) BUT IF YOU LOOK CAREFULLY  at my code, you'll see that it 
actually doens't make a difference. Why????

   Because, having popped a number of items from
   a number of sub-arrays, the sub-arrays aren't
   in any order at all. So, there arises a need
   to do some sorting.

INCIDENTALLY, which means, "You will pop all 1s and then you'll pop 7" 
*definitely* means you didn't read my post carefully.


No biggie, my university Pascal teacher did the same thing. Since I'm a 
slow coder, I used to both present my algorithms in "regular peoples' 
language", and my math proofs in writing, instead of a lot of hairy 
equations. Both guys gave me 0's for a number of things, and then I had 
to set up an appointment to see both to explain and clarify. (The one 
thing I remember as a killer was about my thesis (STL), which I 
submitted, then went to Spain for 3 weeks, returned on an overnight 
flight with no sleep at all, and at 8am went to see my score, and found 
myself disqualified. I went straight to the professor and confronted 
him. He explained that this assignment meant one had to prove the case 
for a small value, prove it for the next value, then proove that if it 
was valid for both, then it would have to be valid for the one following 
them. When I said that "this is exactly what I've done here", he went 
silet for 10 minutes, then blushed, and gave me a "barely accepted score"...

Sure, my first idea was to invent and quote real D2 code. But I thought 
that being a "regular" programmer, /everything/ in the code would be 
*rewritten* anyway, so why spend 4 days writing something I can describe 
while I'm f**ing reading the assignment.


Then it turns out I get omitted the first time, the second time I'm not 
understood, the 3rd time I'm not understood either, and the 4th time I'm 
misunderstood.

Were it not for D2, I'd suggest the reader ease up a bit. Save his and 
the audience's time. Getting it (err, reading it) right the first time 
(even if reading at half the speed) is much more efficient (for 
"himself", not to speak of the entire community and their combined 
wasted effort) than spawning a whole tree of response-subtrees, which 
(in the worst case) he has to answer individually. (So everybody loses.)


===================================

Which all means, if somebody else actually did read my description, and 
did implement it, Andrei'd have a hard time beating their code.

(As an additional hint, to save extra speed, one wouldn't wait till the 
turn is over, and then sort hen poppedItems, but instead, insert-sort 
them while iterating over the structs list.)

At the same time, one could keep a tab on the next "least-value" while 
"popping", so as not to have to compare to integers between the current 
and the next actual poppable value.

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 I am sorry, at the first two reads I did not understand your 
 algorithm. I noticed that it is more complicated than the canonical one

 
 Canonical one. Righhhhhtttttt......
 
  From the outset I thought that you did have a solution in mind. And I 
 thought that the pause of a couple of days before any seroious attemnts 
 only meant that everybody else understood it, too. So, nobody wanted to 
 spend 5 days inventing something which would only be dismissed "because 
 Knuth did it better 50 years ago".

[snip]

I don't even know how to start answering this and particularly what 
follows (30% of I can't even figure where it comes from and where it 
goes), but allow me to retort with just a few short and clear points.

1. You were not set up. I did have a solution in mind, but it's not from 
Knuth or some other classic. I didn't actually find it anywhere (which 
is why I consider the problem interesting to discuss in this group), but 
it's rather simple and clear because five people in this group came to 
it independently.

2. I'm not sure - are you reproaching me I didn't sit down to understand 
your algorithm (and pardon me, it wasn't the clearest exposition there 
is and it still has more than a few kinks to be ironed out)? And are you 
comparing the situation with your professor etc.? And am I supposed to 
thank you for not making a biggie out of it? Well your instructor was 
supposed to read your work, and I am not by any stretch of imagination. 
That was rather inappropriate.

3. If you are alleging that if somebody implemented your algorithm, it 
would run faster than BCS's, you are simply mistaken. Yours simply does 
more work than necessary, and no amount of buts and ifs can cover for that.


Andrei

May 25 2009

Georg Wrede <georg.wrede iki.fi> writes:

Andrei Alexandrescu wrote:
 3. If you are alleging that if somebody implemented your algorithm, it 
 would run faster than BCS's, you are simply mistaken. Yours simply does 
 more work than necessary, and no amount of buts and ifs can cover for that.

Maybe I didn't understand BCS's algorithm. To me it looked slower. I'll 
have to wait till I see the one actually implemented, and do a 
comparison. If nothing else, I'll then see how much mine is slower.

May 26 2009

Georg Wrede <georg.wrede iki.fi> writes:

Andrei Alexandrescu wrote:
 Well your instructor was supposed to read your work, and I am not by
 any stretch of imagination. That was rather inappropriate.

Maybe I misunderstood the point of this OT thread. One could believe the 
thread was created to look at alternative algorithms by people.

 Finally, why would anyone care for something like this?

Has this been answered already? What was the answer?

May 26 2009

bearophile <bearophileHUGS lycos.com> writes:

Georg Wrede:
 You probably have to have an array of structs, where each struct 
 contains a reference to a subarray, and a copy of the first value of 
 this subarray.

Keeping a cached copy of the first item of the subarray ("subrange") may be
better in some situations and worse (compared to keeping just a
reference/pointer) in other situations. And it's not very simple to design a
data structure able to adapt itself dynamically to such two situations.


 Sort the array of structs, and beginning with the first struct, pop from 
 the pointed-to subarray as long as it contains elements equal to the 
 first one. Then pop from the subarray that belongs to following struct, 
 and so on, until a subbarray's first element already is larger.
 While popping, you obviously update the value copies, and append to an 
 output array.

Better to yield lazily the output items.
And probably keeping them in a heap is more efficient than scanning them all.


 The algorithm could be enhanced by, instead of an array of structs, 
 having a singly liked list of these structs.

Today linked lists are mostly dead. In most situations arrays (or arrays with
holes, plus few other smart things) are more efficient, both in memory and
speed.

--------------------

For 	Andrei: in my last post I have suggested to remove from the heap the
references to the sub-ranges as soon they are exhausted. But lot of tests of
mine have shown me that's often not the most efficient strategy:
http://www.fantascienza.net/leonardo/ar/list_deletions.html
So better things are possible. The most refined strategy may be overkill, but
some medium strategy may be OK.

Bye,
bearophile

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 For 	Andrei: in my last post I have suggested to remove from the heap
 the references to the sub-ranges as soon they are exhausted. But lot
 of tests of mine have shown me that's often not the most efficient
 strategy: http://www.fantascienza.net/leonardo/ar/list_deletions.html
  So better things are possible. The most refined strategy may be
 overkill, but some medium strategy may be OK.

I don't think that applies here. Unstable removal is O(1) and restoring 
the heap property is O(log n) in theory and very close to O(1) in practice.

Andrei

May 25 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

Needless to say, nWayUnion is a range :o).<

It's better for nWayUnion to be a lazy iterable.
As probably others have already said, you can keep an heap of pointers, and
compute the heap invariant using as opCmp a function that uses the values they
point to. When nWayUnion yields an item, the relative pointer goes one step
forward and the heap invariant has to be restored. When one array finishes, the
relative pointer is removed before restoring the heap invariant.

The algorithm can be generalized: the input can be an array (random access
range? memory mapped file and arrays are among the most important cases) of
sorted iterables (even lazy ones), that is an array of lazy sorted ranges.

The most general input is a lazy range of lazy sorted ranges, in this situation
the nWayUnion has to first "duplicate" it into an eager array of such ranges,
and then turn it into an heap as before.
I guess in most situations you don't have more than thausands of such sorted
iterables, so cerating such temporary array isn't too much bad.
(nWayUnion may also have a small fixed size array of pointers/indexes on the
stack, for the simple case of an array of 3-5 arrays).

Bye,
bearophile

May 25 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 The most general input is a lazy range of lazy sorted ranges, in this
situation the nWayUnion has to first "duplicate" it into an eager array of such
ranges, and then turn it into an heap as before.
 I guess in most situations you don't have more than thausands of such sorted
iterables, so cerating such temporary array isn't too much bad.
 (nWayUnion may also have a small fixed size array of pointers/indexes on the
stack, for the simple case of an array of 3-5 arrays).

The policy of std.algorithm is to not surreptitiously allocate memory if 
it can help it. So the user is left responsible with creating a range of 
ranges with random access.

Andrei

May 25 2009

D Programming

C/C++ Programming

Other

digitalmars.D - [OT] n-way union