www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Ranges/algorithms for aggregation

reply =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
Is there a neat way to do this transformation with ranges and 
std.algorithms?

     Input:
     -------
     B foo
     B bar
     C ble
     B big
     A begga

     Output: (aggregated and sorted on length)
     -------
     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]

The most obvious way (to me) to do this without standard 
algorithms is with an AA to the aggregation. The most obvious way 
(to me) to do this with std.algorithms is:

     B foo
     B bar
     C ble
     B big
     A begga

     =>

     [B, foo]
     [B, bar]
     [C, ble]
     [B, big]
     [A, begga]

     =>

     [B, foo]
     [B, bar]
     [B, big]
     [C, ble]
     [A, begga]

     =>

     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]

But this seems wasteful on memory. Is there a better way to do 
this in a more algorithmic way?
Mar 21 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Luís Marques:

 Is there a neat way to do this transformation with ranges and 
 std.algorithms?

     Input:
     -------
     B foo
     B bar
     C ble
     B big
     A begga

     Output: (aggregated and sorted on length)
     -------
     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]
What is the desired output data structure? An associative array of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 2-tuples enough? There are various ways to solve your problem. Related: https://d.puremagic.com/issues/show_bug.cgi?id=5968 https://d.puremagic.com/issues/show_bug.cgi?id=9842 Bye, bearophile
Mar 21 2014
parent reply =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Friday, 21 March 2014 at 15:38:23 UTC, bearophile wrote:
    Output: (aggregated and sorted on length)
    -------
    B -> [foo, bar, big]
    C -> [ble]
    A -> [begga]
What is the desired output data structure? An associative array of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 2-tuples enough?
I'm doing this for a D newbie, to teach him the range/algorithmic approach. The function he wrote to output the result of that transformation takes as an input an array of arrays (the latter), but he builds that input iteratively using an AA of arrays (the former). I asked him about that mismatch and at the time he told me that "for now" he only needed the latter, suggesting he had other future plans where he might need the AA, but I'm not sure. So let's just say that the client is unclear on his requirements, which does happen in the real world anyway :-). In any case, I think the hashGroupBy is what I was asking about :-). Neat. (did anyone actually implement it?) I'm not sure how if "a dynamic arrays of dynamic arrays of 2-tuples" sufficed that would help with the intermediate step, if we wanted to avoid the sorting step. Did you have anything in particular in mind there?
Mar 21 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Luís Marques:

 So let's just say that the client is unclear on his 
 requirements, which does happen in the real world anyway :-).
Yes, it happens, But it's a problem, because often if you know what you need you can produce the results more efficiently :-)
 (did anyone actually implement it?)
Not yet.
 I'm not sure how if "a dynamic arrays of dynamic arrays of 
 2-tuples" sufficed that would help with the intermediate step, 
 if we wanted to avoid the sorting step. Did you have anything 
 in particular in mind there?
I think this problem needs a sorting or a hash. One possible solution, if you don't need an associative array as output, is to use a multiSort followed by a building of groups using slicing. It could be efficient enough. Later you search the keys with a some kind of binary search. Bye, bearophile
Mar 21 2014
parent =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Friday, 21 March 2014 at 16:04:45 UTC, bearophile wrote:
 I think this problem needs a sorting or a hash. One possible 
 solution, if you don't need an associative array as output, is 
 to use a multiSort followed by a building of groups using 
 slicing. It could be efficient enough. Later you search the 
 keys with a some kind of binary search.
The number of keys is large and unbounded (not just three as in my example), so I guess this multiSort approach would not be practical, right? I think we really need the hashGroupBy.
Mar 21 2014
prev sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Fri, 21 Mar 2014 15:22:13 +0000, Luís Marques wrote:

 Is there a neat way to do this transformation with ranges and
 std.algorithms?
 
      Input:
      -------
      B foo B bar C ble B big A begga
 
      Output: (aggregated and sorted on length)
      -------
      B -> [foo, bar, big]
      C -> [ble]
      A -> [begga]
 
 The most obvious way (to me) to do this without standard algorithms is
 with an AA to the aggregation. The most obvious way (to me) to do this
 with std.algorithms is:
 
      B foo B bar C ble B big A begga
 
      =>
 
      [B, foo]
      [B, bar]
      [C, ble]
      [B, big]
      [A, begga]
 
      =>
 
      [B, foo]
      [B, bar]
      [B, big]
      [C, ble]
      [A, begga]
 
      =>
 
      B -> [foo, bar, big]
      C -> [ble]
      A -> [begga]
 
 But this seems wasteful on memory. Is there a better way to do this in a
 more algorithmic way?
This pull request[1] for groupBy has been hanging around for a year now, driving me to copy-and-paste the implementation into a couple of my projects. Using it, you could do this: auto tuples = ... // get your list of (B, foo), (B, bar), etc. auto output = tuples.sort!`a[0] < b[0]` .groupBy!`a[0] == b[0]`; // output is a range of: // [ // [(A, begga)], // [(B, foo), (B, bar), (B, big)], // [(C, ble)] // ] The advantage being that output isn't an array at all but a lazy range of lazy ranges. 1 https://github.com/D-Programming-Language/phobos/pull/1186
Mar 21 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 21, 2014 at 04:10:12PM +0000, Justin Whear wrote:
[...]
 This pull request[1] for groupBy has been hanging around for a year
 now, driving me to copy-and-paste the implementation into a couple of
 my projects.  Using it, you could do this:
 
 auto tuples = ... // get your list of (B, foo), (B, bar), etc.
 auto output = tuples.sort!`a[0] < b[0]`
                     .groupBy!`a[0] == b[0]`;
 // output is a range of:
 //    [
 //     [(A, begga)],
 //     [(B, foo), (B, bar), (B, big)],
 //     [(C, ble)]
 //    ]
 
 The advantage being that output isn't an array at all but a lazy range
 of lazy ranges.
 
 1 https://github.com/D-Programming-Language/phobos/pull/1186
Be aware, though, that groupBy only compares *adjacent* elements for equivalence; it does not sort the input. So if your input has equivalent elements interspersed with non-equivalent elements, you will have the equivalent elements split into multiple runs in the output. Example: auto data = [ tuple(1, "a") tuple(1, "b") tuple(2, "c") tuple(1, "d") ]; writeln(data.groupBy!((a,b) => a[0] == b[0])); Will output: [[tuple(1, "a"), tuple(1, "b")], [tuple(2, "c")], [tuple(1, "d")]] Which may not be what the OP wants. T -- Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn
Mar 21 2014
parent reply =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Friday, 21 March 2014 at 16:53:46 UTC, H. S. Teoh wrote:
 Be aware, though, that groupBy only compares *adjacent* 
 elements for
 equivalence; it does not sort the input. So if your input has 
 equivalent
 elements interspersed with non-equivalent elements, you will 
 have the
 equivalent elements split into multiple runs in the output.
I think that's why Justin used sort. The hashGroupBy proposed by bearophile would avoid the sort and the additional memory usage though, so that would be even better.
Mar 21 2014
parent reply =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Friday, 21 March 2014 at 17:20:38 UTC, Luís Marques wrote:
 I think that's why Justin used sort. The hashGroupBy proposed 
 by bearophile would avoid the sort and the additional memory 
 usage though, so that would be even better.
I was thinking, we don't even need the full power of sort. Is there a standard algorithm that makes elements with equal keys be in sequence, but that otherwise is less expensive than sort?
Mar 21 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Luís Marques:

 I was thinking, we don't even need the full power of sort. Is 
 there a standard algorithm that makes elements with equal keys 
 be in sequence, but that otherwise is less expensive than sort?
I don't know any, how is it supposed to know where to group items? Usually you build an associative array for that. Bye, bearophile
Mar 21 2014
parent reply =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Saturday, 22 March 2014 at 01:08:11 UTC, bearophile wrote:
 how is it supposed to know where to group items? Usually you 
 build an associative array for that.
It would swap elements, like sort, so it doesn't need to put them anywhere, just permute them. The advantage is this: Input: [7, 3, 7, 1, 1, 1, 1] Output sort: [1, 1, 1, 1, 3, 7, 7] Output groupSort: [3, 7, 7, 1, 1, 1, 1] groupSort (or whatever it would be called) only makes one swap, while sort makes a lot of them. So groupSort is a lot cheaper. I'm not sure what the asymptotic time complexity of groupSort is, at this moment's notice (I guess it would depend on what strategy it would use).
Mar 21 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Luís Marques:

 It would swap elements, like sort, so it doesn't need to put 
 them anywhere, just permute them. The advantage is this:

 Input: [7, 3, 7, 1, 1, 1, 1]
 Output sort: [1, 1, 1, 1, 3, 7, 7]
 Output groupSort: [3, 7, 7, 1, 1, 1, 1]
I think to swap items it must know where to swap them to. And you can't do that efficiently if the groups are unsorted. Bye, bearophile
Mar 22 2014