www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - groupBy predicates: unary, binary, or both?

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I'm working on 
https://github.com/D-Programming-Language/phobos/pull/1186, which is 
somewhat important; "group by" is a powerful operation. Along the way, I 
stumbled upon an interesting issue in which I wanted to consult the 
community.

Usually groupBy is used to find runs of equivalent elements in a range. 
For example:

[1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6].groupBy()

yields the range [[1, 1, 1], [2, 2], [3, 3, 3], [4], [5, 5], [6]].

As is usual with Phobos algorithms, groupBy accepts a predicate. The 
default (as illustrated above) is "a = b", i.e. all elements in a group 
are equal to one another.

Equality is transitive and commutative. But there are useful cases in 
which predicates are not commutative. Consider we want to find strictly 
monotonic subranges in the range. We'd write:

auto r = [1, 3, 2, 4, 5, 1].groupBy!"a < b";

That should produce [[1, 3], [2, 4, 5], [1]]. For non-strict monotonic 
runs, the predicate would be "a <= b" etc. All that is pretty awesome.

However, that makes life a bit tougher for the algorithm - it must only 
compare adjacent elements only. In the case of "a = b", it suffices to 
save the first element in a group and compare it against every other 
element in the group.

Meanwhile, a very similar pull request 
(https://github.com/D-Programming-Language/phobos/pull/1453) uses unary 
predicates, i.e. an optional transformation function that is then used 
in conjunction with "==" to decide which elements belong in the same group.

Unary predicates make life simpler for the algorithm (save the transform 
of the first element, then compare it against the transform of the next 
etc) and are often easier to write by the end user, too (e.g. just write 
"a.length" instead of "a.length == b.length" to group by length).

So I was thinking to allow both cases, with the understanding that 
grouping by unary predicates uses "==" for comparison whereas grouping 
by binary predicates looks at adjacent elements to figure out group 
membership. That approach would, however, preclude the use of string 
lambdas (because deducing arity for string lambdas is possible, but 
quite unwieldy).

What do you think?


Andrei
Dec 29 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
There is precedence for this in phobos with schwatzSort: It use a 
unaryFun, followed by a binaryFun.

I've found this works quite well: More often than not, a unary 
fun is more than enough (the default "a < b"/"a == b" is fine). 
And in the cases where you *do* need binary comparison, then 
having a unary fun anywas is fine. EG:

"MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")();

I think it is both an elegant and efficient approach. That's what 
I'd go for.

-------

I think we should handle non-commutative predicates. It's not 
actually *that* much more complicated. Instead of storing in each 
Group a value + range, simply save the range twice, and have one 
range be 1 step ahead of the first. I think a few phobos algos do 
this. minPos does something kind of similar.

The "overhead" is an initial extra save yes, but the cost of each 
pop is to simply pop both ranges (no extra saves there). You 
never actually have to make element copies (which may fail if the 
range holds const elements to boot...). Also, if the ElementType 
is big, then storing an extra range has good chances to actually 
take up *less* room.

My stance is that supporting this would be a powerful and useful 
tool, and I don't think there'd be that much overhead either, so 
I say let's go for it.
Dec 29 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/13 2:20 AM, monarch_dodra wrote:
 There is precedence for this in phobos with schwatzSort: It use a
 unaryFun, followed by a binaryFun.

 I've found this works quite well: More often than not, a unary fun is
 more than enough (the default "a < b"/"a == b" is fine). And in the
 cases where you *do* need binary comparison, then having a unary fun
 anywas is fine. EG:

 "MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")();

 I think it is both an elegant and efficient approach. That's what I'd go
 for.

Hm so that's different from both my suggestions. It's also more difficult to explain, though indeed there's precedent with schwartz.
 I think we should handle non-commutative predicates. It's not actually
 *that* much more complicated. Instead of storing in each Group a value +
 range, simply save the range twice, and have one range be 1 step ahead
 of the first. I think a few phobos algos do this. minPos does something
 kind of similar.

The existing pull request already supports non-commutative predicates by means of copying the front. I'd thought about using the step-behind-range trick but that's less efficient than a copy in many cases. It's more general though as you point out, so I'll probably switch to that. Andrei
Dec 29 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 which is somewhat important; "group by" is a powerful operation.

I agree, it's a quite commonly useful operation.
 So I was thinking to allow both cases, with the understanding 
 that grouping by unary predicates uses "==" for comparison 
 whereas grouping by binary predicates looks at adjacent 
 elements to figure out group membership. That approach would, 
 however, preclude the use of string lambdas (because deducing 
 arity for string lambdas is possible, but quite unwieldy).

In similar cases the simplest solution is to have two functions with different (but similar) names.
 However, that makes life a bit tougher for the algorithm - it 
 must only compare adjacent elements only.

But another common need is to group by equivalent classes using an associative array. Both C# and Clojure have this functionality (on LINQ and in the functions toolkit). This means that a "hashGroup" builds an associative array inside and returns it. See also: http://d.puremagic.com/issues/show_bug.cgi?id=5968 http://d.puremagic.com/issues/show_bug.cgi?id=9842 Bye, bearophile
Dec 29 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/13 2:27 AM, bearophile wrote:
 But another common need is to group by equivalent classes using an
 associative array. Both C# and Clojure have this functionality (on LINQ
 and in the functions toolkit). This means that a "hashGroup" builds an
 associative array inside and returns it.


 See also:
 http://d.puremagic.com/issues/show_bug.cgi?id=5968
 http://d.puremagic.com/issues/show_bug.cgi?id=9842

I think hash joins should come first. Andrei
Dec 29 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/29/2013 05:17 PM, Andrei Alexandrescu wrote:
 On 12/29/13 2:27 AM, bearophile wrote:
 But another common need is to group by equivalent classes using an
 associative array. Both C# and Clojure have this functionality (on LINQ
 and in the functions toolkit). This means that a "hashGroup" builds an
 associative array inside and returns it.


 See also:
 http://d.puremagic.com/issues/show_bug.cgi?id=5968
 http://d.puremagic.com/issues/show_bug.cgi?id=9842

I think hash joins should come first. Andrei

What hash join algorithm do you have in mind that does not use hashGroupBy as a component? I think accumAssocArray or similar would be quite handy as well, as it is more general. A naive implementation r.hashGroupBy!f would then for example be given by: r.map!(a=>tuple(f(a),a)) // range of key-value pairs .accumAssocArray!((a,b)=>a~=b) // method of combination ((ElementType!(typeof(r))[]).init); // initial value
Dec 29 2013
prev sibling next sibling parent "w0rp" <devw0rp gmail.com> writes:
I'm sorry to bump up an old thread, but whatever happened to 
this? I was just using 'groupby' in Python for a common data 
migration pattern I've developed...

# Build a dictionary of parent -> child relationships
# from a datbase query, for quick and dirty data migration.
# This runs in linear time.
{
     y[0], tuple(y[1])
     for y in
     groupby(
         some_query_func(
             "SELECT parent_id, ... "
             # ...
             "ORDER BY parent_id "
         ),
         lambda x: x["parent_id"]
     )
}

... and I was thinking "it would be nice to have this in D, too!"
Jul 03 2014
prev sibling parent "w0rp" <devw0rp gmail.com> writes:
On Thursday, 3 July 2014 at 10:48:53 UTC, w0rp wrote:
     y[0], tuple(y[1])

That's suppose to be a : there sorry, not a comma.
Jul 03 2014