www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [WORK] groupBy is in! Next: aggregate

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
So H.S. Teoh awesomely took 
https://github.com/D-Programming-Language/phobos/pull/2878 to 
completion. We now have a working and fast relational "group by" facility.

See it at work!

----
#!/usr/bin/rdmd

void main()
{
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
}
----

[[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

The next step is to define an aggregate() function, which is a lot 
similar to reduce() but works on ranges of ranges and aggregates a 
function over each group. Continuing the previous example:

     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;

should print:

[453, 600, 929, 812, 529, 768]

The aggregate function should support aggregating several functions at 
once, e.g. aggregate!(min, max) etc.

Takers?


Andrei
Jan 23 2015
next sibling parent Justin Whear <justin economicmodeling.com> writes:
On Fri, 23 Jan 2015 10:08:30 -0800, Andrei Alexandrescu wrote:

 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.
 
This is great news. It seems like every time I make use of component programming, I need groupBy at least once. I have a D file with an old copy of a groupBy implementation (I think it's Andrei's original stab at it) and it gets copied around to the various projects.
Jan 23 2015
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.
Unfortunately it doesn't work in pure/ safe/nothrow code because of limitations in the current RefCounted implementation. [...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:
 
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;
 
 should print:
 
 [453, 600, 929, 812, 529, 768]
 
 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.
 
 Takers?
[...] Isn't that just a simple matter of defining aggregate() in terms of map() and reduce()? Working example: import std.algorithm.comparison : max; import std.algorithm.iteration; import std.stdio; auto aggregate(alias func, RoR)(RoR ror) { return ror.map!(reduce!func); } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; } Output is as expected. T -- Verbing weirds language. -- Calvin (& Hobbes)
Jan 23 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 10:29 AM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.
Unfortunately it doesn't work in pure/ safe/nothrow code because of limitations in the current RefCounted implementation. [...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:

      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .aggregate!max
          .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?
[...] Isn't that just a simple matter of defining aggregate() in terms of map() and reduce()? Working example: import std.algorithm.comparison : max; import std.algorithm.iteration; import std.stdio; auto aggregate(alias func, RoR)(RoR ror) { return ror.map!(reduce!func); } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; } Output is as expected.
Clever! Or, conversely, I'm not that bright! Yes, this is awesome. Probably the actual name "aggregate" should be defined even with that trivial implementation to help folks like me :o). -- Andrei
Jan 23 2015
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 10:29:13AM -0800, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
[...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:
 
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;
 
 should print:
 
 [453, 600, 929, 812, 529, 768]
 
 The aggregate function should support aggregating several functions
 at once, e.g. aggregate!(min, max) etc.
[...] Here's a working variadic implementation: import std.algorithm.comparison : max, min; import std.algorithm.iteration; import std.stdio; template aggregate(funcs...) { auto aggregate(RoR)(RoR ror) { return ror.map!(reduce!funcs); } } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!(max,min) .writeln; } Output (kinda ugly, but it works): [Tuple!(int, int)(453, 293), Tuple!(int, int)(600, 600), Tuple!(int, int)(929, 339), Tuple!(int, int)(812, 222), Tuple!(int, int)(529, 529), Tuple!(int, int)(768, 768)] Of course, it will require a little more polish before merging into Phobos, but the core implementation is nowhere near the complexity of groupBy. T -- The best compiler is between your ears. -- Michael Abrash
Jan 23 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:
 Of course, it will require a little more polish before merging into
 Phobos, but the core implementation is nowhere near the complexity of
 groupBy.
open https://github.com/D-Programming-Language/phobos/pulls [F5]... [F5]... [F5]... Andrei
Jan 23 2015
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 10:47:28AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:
Of course, it will require a little more polish before merging into
Phobos, but the core implementation is nowhere near the complexity of
groupBy.
open https://github.com/D-Programming-Language/phobos/pulls [F5]... [F5]... [F5]...
[...] void main() { foreach (iota(0 .. 60 * 60 * F5sPerSecond)) writeln("[F5]..."); writeln(q"ENDMSG https://github.com/D-Programming-Language/phobos/pull/2899 ENDMG); } ;-) T -- No! I'm not in denial!
Jan 23 2015
prev sibling parent =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
On Friday, 23 January 2015 at 18:47:29 UTC, Andrei Alexandrescu 
wrote:
 open https://github.com/D-Programming-Language/phobos/pulls

 [F5]... [F5]... [F5]...


 Andrei
Haha, that's funny!
Jan 23 2015
prev sibling next sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by" facility.

 See it at work!

 ----
 #!/usr/bin/rdmd

 void main()
 {
      import std.algorithm, std.stdio;
      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:

      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .aggregate!max
          .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?


 Andrei
In most languages group by yields a tuple of {group key, group values}. For example (Ruby or Crystal): a = [1, 4, 2, 4, 5, 2, 3, 7, 9] groups = a.group_by { |x| x % 3 } puts groups #=> {1 => [1, 4, 4, 7], 2 => [2, 5, 2], 0 => [3, 9]} In C# it's also called group by: http://www.dotnetperls.com/groupby Java: http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function- SQL: http://www.w3schools.com/sql/sql_groupby.asp So I'm not sure "groupBy" is a good name for this.
Jan 23 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 12:19 PM, Ary Borenszweig wrote:
 In most languages group by yields a tuple of {group key, group values}.
Interesting, thanks. Looks like we're at a net loss of information with our current approach. quickfur, do you think you could expose a tuple with "key" and "values"? The former would be the function value, the latter would be what we offer right now. That would apply only to the unary version of groupBy. Andrei
Jan 23 2015
parent "Laeeth Isharc" <laeethnospam nospam.laeeth.com> writes:
On Friday, 23 January 2015 at 20:28:32 UTC, Andrei Alexandrescu 
wrote:
 On 1/23/15 12:19 PM, Ary Borenszweig wrote:
 In most languages group by yields a tuple of {group key, group 
 values}.
Interesting, thanks. Looks like we're at a net loss of information with our current approach. quickfur, do you think you could expose a tuple with "key" and "values"? The former would be the function value, the latter would be what we offer right now. That would apply only to the unary version of groupBy. Andrei
groupby hack below ? I haven't yet read the source code and don't feel I understand ranges deeply enough to know if this will work in the general case. But it at least works for the example (I think). Laeeth. #!/usr/bin/rdmd void main() { import std.algorithm, std.stdio, std.range; auto index=[293, 453, 600, 929, 339, 812, 222, 680, 529, 768]; auto vals=[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; auto zippy=zip(index,vals); zippy.groupBy!(a=> a[0] & 1) .writeln; } [root fedorabox test]# ./groupby [[Tuple!(int, int)(293, 1), Tuple!(int, int)(453, 2)], [Tuple!(int, int)(600, 3)], [Tuple!(int, int)(929, 4), Tuple!(int, int)(339, 5)], [Tuple!(int, int)(812, 6), Tuple!(int, int)(222, 7), Tuple!(int, int)(680, 8)], [Tuple!(int, int)(529, 9)], [Tuple!(int, int)(768, 10)]]
Jan 23 2015
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group 
 values}.
I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation). Bye, bearophile
Jan 23 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 12:38 PM, bearophile wrote:
 Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group values}.
I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- Andrei
Jan 23 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/23/15 12:38 PM, bearophile wrote:
Ary Borenszweig:

In most languages group by yields a tuple of {group key, group
values}.
I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- Andrei
^^^^ (!!!) *cowers and hides from the rotten tomatoes...* T -- It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
Jan 23 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 1:34 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 On 1/23/15 12:38 PM, bearophile wrote:
 Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group
 values}.
I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- Andrei
^^^^ (!!!)
I knew that pun won't go unremarked! :o) -- Andrei
Jan 23 2015
prev sibling parent reply "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Friday, 23 January 2015 at 20:19:31 UTC, Ary Borenszweig wrote:
 On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group 
 by" facility.

 See it at work!

 ----
 #!/usr/bin/rdmd

 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 The next step is to define an aggregate() function, which is a 
 lot
 similar to reduce() but works on ranges of ranges and 
 aggregates a
 function over each group. Continuing the previous example:

     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several 
 functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?


 Andrei
In most languages group by yields a tuple of {group key, group values}. For example (Ruby or Crystal): a = [1, 4, 2, 4, 5, 2, 3, 7, 9] groups = a.group_by { |x| x % 3 } puts groups #=> {1 => [1, 4, 4, 7], 2 => [2, 5, 2], 0 => [3, 9]} In C# it's also called group by: http://www.dotnetperls.com/groupby Java: http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function- SQL: http://www.w3schools.com/sql/sql_groupby.asp So I'm not sure "groupBy" is a good name for this.
You are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by. The best example I know of: https://clojuredocs.org/clojure.core/group-by https://clojuredocs.org/clojure.core/partition-by
Jan 23 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
[...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.
[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately. T -- Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jan 23 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
 [...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.
[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- Andrei
Jan 23 2015
next sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:
 On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
 [...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.
[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- Andrei
Another name might be chunkBy: it returns chunks that are grouped by some logic.
Jan 24 2015
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jan 25, 2015 at 01:39:59AM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:
On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
[...]
You are talking about two different functions here. group by and
partition by. The function that has been implemented is often
called partition by.
[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- Andrei
Another name might be chunkBy: it returns chunks that are grouped by some logic.
Incidentally, that was the original name I implemented it under. T -- What do you get if you drop a piano down a mineshaft? A flat minor.
Jan 24 2015
prev sibling parent "Olivier Grant" <olivier.grant gmail.com> writes:
 We already have partition() functions that actually partition a 
 range into two subranges, so adding partitionBy with a 
 different meaning may be confusing. -- Andrei
In ruby, the closest to D's currently-named groupBy method is a set of three methods: slice_before slice_after slice_when http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-slice_when Your example in ruby would be: 2.2.0 > [293, 453, 600, 929, 339, 812, 222, 680, 529, 768].slice_when { |x,y| x & 1 != y & 1 }.to_a => [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]] O.
Jan 25 2015
prev sibling next sibling parent reply "MattCoder" <stop spam.com> writes:
On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu 
wrote:
 So H.S. Teoh awesomely took 
 https://github.com/D-Programming-Language/phobos/pull/2878 to 
 completion. We now have a working and fast relational "group 
 by" facility.

 See it at work!

 ----
 #!/usr/bin/rdmd

 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]
Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1). Matheus.
Jan 23 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 09:56:11PM +0000, MattCoder via Digitalmars-d wrote:
 On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:
So H.S. Teoh awesomely took
https://github.com/D-Programming-Language/phobos/pull/2878 to
completion.  We now have a working and fast relational "group by"
facility.

See it at work!

----
#!/usr/bin/rdmd

void main()
{
    import std.algorithm, std.stdio;
    [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
        .groupBy!(a => a & 1)
        .writeln;
}
----

[[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]
Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1).
[...] It's kind of a misnomer, because it actually only considers *consecutive runs* of equivalent elements; it doesn't look at the whole range before deciding what goes in which group. So technically it should be called consecutiveGroupBy or consecutivePartitionBy, but that's too big a mouthful to be a usable name. What you describe could be an interesting candidate to add, though. It could iterate over distinct values of the predicate, and traverse the forward range (input ranges obviously can't work unless you allocate, which makes it no longer lazy) each time. This, however, has O(n*k) complexity where k is the number of distinct predicate values. If it's anything bigger than bool or a relatively small enum, it would be impractical. Moreover, the requirement to traverse the range multiple times kinda sux; you might as well just call filter() with different expected values yourself, in a loop. In fact, you could ostensibly implement it something along these lines: auto globalPartition(alias pred, R)(R range) { alias Values = typeof(pred(range.front, range.front)); return iota(Values.min, Values.max) .map!(v => range.save.filter!(pred(e) == v)); } T -- Who told you to swim in Crocodile Lake without life insurance??
Jan 23 2015
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
H. S. Teoh:

 What you describe could be an interesting candidate to add, 
 though. It
 could iterate over distinct values of the predicate, and 
 traverse the
 forward range (input ranges obviously can't work unless you 
 allocate,
 which makes it no longer lazy) each time. This, however, has 
 O(n*k)
 complexity where k is the number of distinct predicate values.
Let's allocate, creating an associative array inside the grouping function :-) Bye, bearophile
Jan 23 2015
parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 1/23/15 7:30 PM, bearophile wrote:
 H. S. Teoh:

 What you describe could be an interesting candidate to add, though. It
 could iterate over distinct values of the predicate, and traverse the
 forward range (input ranges obviously can't work unless you allocate,
 which makes it no longer lazy) each time. This, however, has O(n*k)
 complexity where k is the number of distinct predicate values.
Let's allocate, creating an associative array inside the grouping function :-) Bye, bearophile
All languages I know do this for `group by` (because of the complexity involved), and I think it's ok to do so.
Jan 24 2015
prev sibling parent "MattCoder" <stop spam.com> writes:
On Friday, 23 January 2015 at 22:20:19 UTC, H. S. Teoh wrote:
 It's kind of a misnomer, because it actually only considers 
 *consecutive
 runs* of equivalent elements; it doesn't look at the whole 
 range before
 deciding what goes in which group. So technically it should be 
 called
 consecutiveGroupBy or consecutivePartitionBy, but that's too 
 big a
 mouthful to be a usable name.

 What you describe could be an interesting candidate to add, 
 though. It
 could iterate over distinct values of the predicate, and 
 traverse the
 forward range (input ranges obviously can't work unless you 
 allocate,
 which makes it no longer lazy) each time. This, however, has 
 O(n*k)
 complexity where k is the number of distinct predicate values. 
 If it's
 anything bigger than bool or a relatively small enum, it would 
 be
 impractical. Moreover, the requirement to traverse the range 
 multiple
 times kinda sux; you might as well just call filter() with 
 different
 expected values yourself, in a loop. In fact, you could 
 ostensibly
 implement it something along these lines:

 	auto globalPartition(alias pred, R)(R range) {
 		alias Values = typeof(pred(range.front, range.front));

 		return iota(Values.min, Values.max)
 			.map!(v => range.save.filter!(pred(e) == v));
 	}
...
Alright and thanks for the whole explanation. Matheus.
Jan 23 2015
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 1:56 PM, MattCoder wrote:
 On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.

 See it at work!

 ----
 #!/usr/bin/rdmd

 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]
Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1).
Yah, that would be partition(). -- Andrei
Jan 23 2015
prev sibling parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Fri, 2015-01-23 at 10:08 -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]
 #!/usr/bin/rdmd
 
 void main()
 {
      import std.algorithm, std.stdio;
      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .writeln;
 }
 ----
 
 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]
 
[…] I think I must be missing something, for me the result of a groupBy operation on the above input data should be: [1:[293, 453, 929, 339, 529], 0:[600, 812, 222, 680, 768]] i.e. a map with keys being the cases and values being the values that meet the case. In this example a & 1 asks for cases "lowest bit 0 or 1" aka "odd or even". There is nothing wrong with the semantics of the result above, but is it's name "group by" as understood by the rest of the world? -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 26 2015
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Russel Winder:

 but is it's name "group by" as understood by the rest of the 
 world?
Nope... Bye, bearophile
Jan 26 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:
 
but is it's name "group by" as understood by the rest of the world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released... T -- Don't drink and derive. Alcohol and algebra don't mix.
Jan 26 2015
next sibling parent reply "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Monday, 26 January 2015 at 16:13:40 UTC, H. S. Teoh wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via 
 Digitalmars-d wrote:
 Russel Winder:
 
but is it's name "group by" as understood by the rest of the 
world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug*
Andrei had a point about `partition` being used already. I liked Oliver's suggestion to go with slice-something. `sliceBy` might be worth considering. It even hints at the (efficient) implementation.
Jan 26 2015
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 26 January 2015 at 16:44:20 UTC, Ulrich Küttler wrote:
 Andrei had a point about `partition` being used already. I 
 liked Oliver's suggestion to go with slice-something. `sliceBy` 
 might be worth considering. It even hints at the (efficient) 
 implementation.
Does it return slices? If not, pick a different verb, e.g. "split".
Jan 26 2015
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...
My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. Andrei
Jan 26 2015
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...
My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. Andrei
That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range. Again, other languages return an associative array in this case.
Jan 26 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/26/15 9:50 AM, Ary Borenszweig wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...
My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. Andrei
That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range.
The implementation right now is quite interesting but not complicated, and achieves lazy grouping in a single pass.
 Again, other languages return an associative array in this case.
I think our approach is superior. Andrei
Jan 26 2015
parent reply zeljkog <zeljkog home.com> writes:
On 26.01.15 19:05, Andrei Alexandrescu wrote:
 On 1/26/15 9:50 AM, Ary Borenszweig wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?
Nope...
[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...
My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. Andrei
That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range.
The implementation right now is quite interesting but not complicated, and achieves lazy grouping in a single pass.
 Again, other languages return an associative array in this case.
I think our approach is superior. Andrei
I think std.experimental.algorithm.groupBy is one option. To postpone thing.
Jan 26 2015
parent "Phil" <phil.j.ellison gmail.com> writes:
I also found the behaviour confusing given the name. I like 
ChunkBy.
Jan 26 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
wrote:
Russel Winder:

but is it's name "group by" as understood by the rest of the world?
Nope...
[...]
My suggestion was to keep the name but change the code of your
groupBy implementation to return tuple(key, lazyValues) instead of
just lazyValues. That needs to happen only for binary predicates;
unary predicates will all have alternating true/false keys.
Huh, what? I think there's some misunderstanding here. The unary version of the current groupBy translates to a binary predicate: groupBy!(a => a.field) is equivalent to: groupBy!((a, b) => a.field == b.field) I don't see how this has anything to do with alternating keys. [...]
 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.
 
 Again, other languages return an associative array in this case.
I think we're talking past each other here. What groupBy currently does is to group elements by evaluating the predicate on *consecutive runs* of elements. What some people seem to demand is a function that groups elements by *global evaluation* of the predicate over all elements. These two are similar but divergent functions, and conflating them is not helping this discussion in any way. If "group by" in other languages refers to the latter function, then that means "groupBy" is poorly-named and we need to come up with a better name for it. Changing it to return tuples and what-not seems to be beating around the bush to me. T -- Computers are like a jungle: they have monitor lizards, rams, mice, c-moss, binary trees... and bugs.
Jan 26 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/26/15 10:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?
Nope...
[...]
 My suggestion was to keep the name but change the code of your
 groupBy implementation to return tuple(key, lazyValues) instead of
 just lazyValues. That needs to happen only for binary predicates;
 unary predicates will all have alternating true/false keys.
Huh, what? I think there's some misunderstanding here. The unary version of the current groupBy translates to a binary predicate: groupBy!(a => a.field) is equivalent to: groupBy!((a, b) => a.field == b.field) I don't see how this has anything to do with alternating keys.
Here's how. Basically the binary-predicate version has only Boolean keys that may be false or true. They will alternate because it's the change that triggers creation of a new group. In this example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!((a, b) => (a & 3) == (b & 3)) the groupBy function has no information about the result of a & 3. All it "sees" is the result of the predicate: true, false, true, false... HOWEVER, if you write it like this: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => (a & 3)) then groupBy sees the actual value of the function and can emit the proper key. So the key (ahem) here is to make groupBy with unary predicate different from groupBy with binary predicate. The former returns the tuple, the latter is unchanged. Makes sense?
 [...]
 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.

 Again, other languages return an associative array in this case.
I think we're talking past each other here. What groupBy currently does is to group elements by evaluating the predicate on *consecutive runs* of elements. What some people seem to demand is a function that groups elements by *global evaluation* of the predicate over all elements. These two are similar but divergent functions, and conflating them is not helping this discussion in any way.
Agreed.
 If "group by" in other languages refers to the latter function, then
 that means "groupBy" is poorly-named and we need to come up with a
 better name for it. Changing it to return tuples and what-not seems to
 be beating around the bush to me.
I like our notion of groupBy the same way I like the notion that something must be a random-access range in order to be sorted. (Other languages give the illusion they sort streams by internally converting them to arrays.) D offers better control, better flexibility, and richer semantics. Andrei
Jan 26 2015
parent "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Monday, 26 January 2015 at 18:31:05 UTC, Andrei Alexandrescu 
wrote:
 So the key (ahem) here is to make groupBy with unary predicate 
 different from groupBy with binary predicate. The former 
 returns the tuple, the latter is unchanged. Makes sense?
The current implementation has a certain beauty and does not lend itself easily to a split in two different versions. Also, the additional key return value might be unnecessary and unexpected in many cases. You can easily build the extended groupBy function out of the current one, if you only calculate the predicate beforehand. If the predicate is expensive, this might be sensible anyway. auto groupByStar(alias pred, Range)(Range r) { return r.map!(pred, "a") .groupBy!(a => a[0]) .map!(inner => tuple(inner.front[0], inner.map!"a[1]")) ; } void main() { auto a = iota(10); foreach (r; a.groupByStar!"a/3") { writeln(r[0], " ", r[1]); } } I do not know if this solution is any better. The above might be a terrible idea. Note: I failed to compile `r.groupBy!"a[0]"` with iteration.d(1648): Error: function expected before (), not "a[0]" of type string
Jan 26 2015
prev sibling parent reply "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
 If "group by" in other languages refers to the latter function, 
 then
 that means "groupBy" is poorly-named and we need to come up 
 with a
 better name for it. Changing it to return tuples and what-not 
 seems to
 be beating around the bush to me.


 T
T: you are good with algorithms. In many applications you have a bunch of results and want to summarise them. This is often what the corporate manager is doing with Excel pivot tables, and it is what the groupby function is used for in pandas. See here for a simple tutorial. http://wesmckinney.com/blog/?p=125 And here for a summary of what pandas can do with data: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html Is there any reason why we shouldn't add to Phobos: median, ranking, stddev, variance, correlation, covariance, skew, kurtosis, quantile, moving average, exp mov average, rolling window (see pandas)? I personally am fine with the implementation we have (although as Ray Dalio would say. I haven't yet earned the right that you should care what I think). All that it means is that you need to sort on multi key your results first before passing to groupby. My question is how much is lost by doing it in two steps (sort, groupby) rather than one. I don't think all that much, but it is not my field, I am also not that bothered, because this comes at the end of processing, not within the inner loop, so for me I don't think it makes a difference for now. If data sets reach commoncrawl type sizes then it might be different, although I will take D over java any day, warts and all. In any case, the documentation should be very clear on what groupby does, and how the user can do what he might be expecting to achieve, coming from a different framework. It would be interesting to benchmark D against pandas (which is implemented in cython for the key bits) and see how we look.
Jan 26 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote:
[...]
 My question is how much is lost by doing it in two steps (sort,
 groupby) rather than one.
[...] I'm not sure I understand what you mean by "lost"... I assume you're not talking about data loss, since that's not applicable either way. In any case, as with all things, there's a tradeoff. Building a hash table of groups, as some people demand, has the advantage that the grouping predicate is applied globally, and no sorting (O(n log n)) is needed. The disadvantage is that the entire data set has to fit in memory, and you cannot process it lazily. Before you've seen all of the data, you cannot know whether there are more distinct groups to come, or whether currently known groups have more members. Once the data size gets too large, you run into trouble. Plus, some people are adverse to gratuitous GC use in Phobos algorithms. Perhaps some of this is misplaced, but that's the perception. The current implementation has the advantage that it requires very little memory to work, and can process data lazily. It only needs to see a tiny portion of the entire data set to do its work. Each group returned is also lazy, so it doesn't need to store the whole group in its entirety. It can handle 10GB long groups in a 100GB data set streamed over the network with very little memory, whereas such a monstrous thing would be infeasibly resource-hungry if you need to allocate an on-disk hashtable (very inefficient!), iterate over the entire dataset, and then be I/O bound as you iterate over each group loading group members from disk. The big disadvantage, however, is that if your data is not sorted, then the groupings returned won't be quite what you'd expect, since the predicate is evaluated only over consecutive runs of elements, not globally over the entire data set. Which one is better? That depends on your specific application and its needs. For some applications, there is simply no way around the fact that you have a large dataset with randomly-ordered elements, so the sorting has to be done *somewhere*. For other applications, you *can* make stream your data in sorted order -- or perhaps it doesn't care if groupings aren't global -- so you can take advantage of the fact that the current groupBy is extremely cheap. It requires minimal memory to do its work, and it can do so inside a pipeline without requiring anything more than an input range. Basically, std.algorithm is (currently) primarily focused on linear algorithms. While there *are* some sorting algorithms and sublinear algorithms that take advantage of sorted data, it is still primarily focused on linear processing. There is no good abstraction as yet for more complex processing like multidimensional or graph traversals. As a result, it is best suited for stream-based processing. Generally speaking, most application code is primarily concerned with linear processing (copy this text from here to there, scan this text for some linear string AKA search keyword, render this line of text to the screen, draw this line around the window, move this item along in the processing queue, etc.). Non-linear computations generally take place only in dedicated computing modules of limited scope in the application, where one tends to write dedicated algorithms rather than reuse stock generic algorithms.
 In any case, the documentation should be very clear on what groupby
 does, and how the user can do what he might be expecting to achieve,
 coming from a different framework.
[...] As far as I know, the current groupBy docs explain quite clearly what it does. If you find it still inadequate or unclear, please file bugs against it so that we can look into improving the docs. T -- GEEK = Gatherer of Extremely Enlightening Knowledge
Jan 26 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/26/15 12:31 PM, H. S. Teoh via Digitalmars-d wrote:
 As far as I know, the current groupBy docs explain quite clearly what it
 does.
I agree. That said, perhaps a change of name would avoid confusion. I like groupBy as we currently define it, it's straight within the spirit of D. However, some people may legitimately expect it does global processing and groups all identical keys together. So here's what I think we should do: 1. Rename our current groupBy to chunkBy 2. Change code to have chunkBy with binary predicates return tuples, as discussed 3. Add a method called groupEquivalent to SortedRange. It takes no predicate (because predicate already is embedded in SortedRange). It groups together elements that are equivalent according to the sorting order. Returns a range of ranges (similar to what groupBy returns now). 4. Perhaps add a global chunkEquivalent as an alias for groupBy!((a, b) => a == b). So then in order to get relational group-by semantics by sorting, one would write range.sort!(predicate).groupEquivalent... Andrei
Jan 26 2015
parent reply "Dicebot" <public dicebot.lv> writes:
Don't forget that there is already 
http://dlang.org/phobos/std_algorithm.html#.group which matches 
groupBy in this context (only groups consequitive elements). It 
would be totally awkward to have those named differently (which 
was why I have suggested to change the name from chunkBy to 
groupBy in the very beginning)
Jan 26 2015
parent "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:
On Monday, 26 January 2015 at 20:58:36 UTC, Dicebot wrote:
 Don't forget that there is already 
 http://dlang.org/phobos/std_algorithm.html#.group which matches 
 groupBy in this context (only groups consequitive elements). It 
 would be totally awkward to have those named differently (which 
 was why I have suggested to change the name from chunkBy to 
 groupBy in the very beginning)
This seems valid to me. I'd suggest to keep the name and implementation as it is and add the "common" groupBy function as an example. https://github.com/D-Programming-Language/phobos/pull/2915
Jan 26 2015
prev sibling next sibling parent "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
Thank you for the thoughtful reply.

I meant lost in terms of extra processing time and memory 
consumption in order to achieve the fairly common use case of a 
groupby pivot table or pandas style (ie what you get if you sort 
the data by group and then run D groupby)

If you first sort the data, and then run groupby, then it seems 
likely in some cases this will be more expensive than a straight 
pandas style groupby., and my question was how much more 
expensive.

If you needed to sort the data anyway, because you were producing 
a report, this doesn't matter.  But if you didn't need it sorted 
because you just wanted to produce statistics on each group (eg 
median, inter quartile difference) then it might.

I hadn't thought clearly before about in memory vs not.

I guess for pandas style groupby, having to sort a dataset 
externally just to use groupby is a comparative disaster because 
it seems to involve much more I/O than is needed.  If I am not 
mistaken, this might be one reason to have also at some point a 
separate function that does pandas groupby efficiently for  data 
that may not be held in memory.

Can you explain why for hashgroupby the entire dataset needs to 
fit in memory?  If I were writing this naively, one could have an 
associative array of lists of relevant items hashed by group.  
You go through each line once, add the index to the list for that 
group, then at the end sort the group keys and return indices or 
whatever the slice equivalent is of the lines matching each 
group.  But I may be missing some obvious points.

Ie my imagined use case has large data for each line, but the 
hash table and lists would fit in memory.

If thinking of a case where the data per line is not that large, 
but there are so many lines that it becomes an external sort then 
I see your point.  However I suggest that the use case I describe 
- whilst not the only one - is important enough at least to 
briefly consider.

"The  big disadvantage, however,
 is that
 if your data is not sorted, then the groupings returned won't 
 be quite
 what you'd expect, since the predicate is evaluated only over
 consecutive runs of elements, not globally over the entire data 
 set."
Well - it is a completely different function... I really do believe that pandas style dataframe type processing could be an interesting area for D, and most of what is needed to construct this is already here.
 As far as I know, the current groupBy docs explain quite 
 clearly what it
 does.  If you find it still inadequate or unclear, please file 
 bugs
 against it so that we can look into improving the docs.
Sorry - I had hadn't thought to check! Will do so now. Laeeth.
Jan 26 2015
prev sibling parent "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
 As far as I know, the current groupBy docs explain quite 
 clearly what it
 does.  If you find it still inadequate or unclear, please file 
 bug against it so that we can look into improving the docs.
Read the docs now - they are perfect within the context of the style of documentation (and these documents themselves have certainly been improved over the past months). It is very clear to me, but I am not sure it would be to everyone I work with. The formal library description may not be the place for this, but I do think many less sophisticated readers would find the style below more accessible and easier to follow (this is not about groupBy, but a broader question of documentation to supplement the formal reference description for the standard library): http://pandas.pydata.org/pandas-docs/stable/groupby.html I appreciate that it is all very well to suggest improvements but someone actually has to do the work. Recognizing the direction of what might be helpful is perhaps a start though. I might be on my own with this view, but I don't think I am.
Jan 26 2015
prev sibling parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, 2015-01-26 at 08:11 -0800, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wr=
ote:
 Russel Winder:
=20
but is it's name "group by" as understood by the rest of the world?
=20 Nope...
[...] =20 I proposed to rename it but it got shot down. *shrug*
What name do you think works for this operation?
 We still have a short window of time to sort this out, before 2.067 is
 released...
To be honest, given the confirmation that the semantics of this operation and that of the one other languages call groupBy are different, then 2.067 should not go out with this operation called groupBy.
=20
--=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 26 2015