www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - sort API design

reply Johannes Riecken <johannes.riecken gmail.com> writes:
I had looked at how to do case-insensitive string sorting with 
Phobos and without looking at the documentation much, I went with 
the following code and then wrongly concluded that `<` does 
case-insensitive comparisons on strings by default:

```
import std.algorithm;
import std.stdio;
import std.string;

void main() {
     auto arr0 = ["foo", "bar", "Baz"];
     auto case_sensitive   = sort!((a, b) => a           < b       
    )(arr0);
     auto case_insensitive = sort!((a, b) => a.toLower() < 
b.toLower())(arr0);
     writeln(case_sensitive);
     writeln(case_insensitive);
}
```
Output:
```
["bar", "Baz", "foo"]
["bar", "Baz", "foo"]
```

Later I learned that I was trapped by `sort` being in-place but 
still returning a value. If this API were to be designed from 
scratch, I wonder if it would be possible to rule out this 
incorrect usage at compile-time while still retaining the other 
nice features of the `sort` function like being in-place and 
giving strong types which can be used to choose faster algorithms 
to further transform the result, etc.? Or are there any ideas in 
programming language research that would help with such a case?
Nov 20 2019
next sibling parent mipri <mipri minimaltype.com> writes:
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken 
wrote:
 Or are there
 any ideas in programming language research that would help with 
 such a case?
I remember reading that Ada/SPARK did some checking that could've helped here, but I can't find a reference now, and I'm not sure of the terminology. It might be one of the implications of the data flow analysis that it does. Suppose you had this in your code: int x; x = 2; x = 3; do_something(x); What is the point of the first assignment? Waste heat from the CPU? Of course there's no point, and SPARK makes it an error to do that: your program is doing some *work*, it is generating some outputs, so it must go on to use those outputs, and not throw them away. And in your program that led to a confusing result, you're also performing work that is lost:
     auto case_sensitive   = sort!((a, b) => a           < b
    )(arr0);
     auto case_insensitive = sort!((a, b) => a.toLower() <
 b.toLower())(arr0);
That first sort() has a return value, but a conceptual output is the order that it provides to arr0, and the second sort() can be statically known to destroy order in an arr0 provided to it. So: 1. sort() discards arr0's order!0 and then provides a new order!1 2. sort() discards arr0's order!1 and then provides a new order!2 3. writeln reads order!2 4. writeln reads order!2 Or, written like the first example above: Order o; o = order!1; o = order!2; do_something(o); What is the point of the first assignment? Waste heat from the CPU?
Nov 20 2019
prev sibling next sibling parent Dennis <dkorpel gmail.com> writes:
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken 
wrote:
 Or are there any ideas in programming language research that 
 would help with such a case?
Try to make it a habit to use `const` instead of `auto` if you don't expect the variable to be mutated. ``` const arr0 = ["foo", "bar", "Baz"]; ``` This is obviously not a fancy beginner-friendly solution like the thing mipri describes, but it is something that you can start doing today that might help.
Nov 20 2019
prev sibling next sibling parent reply Dukc <ajieskola gmail.com> writes:
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken 
wrote:
 Later I learned that I was trapped by `sort` being in-place but 
 still returning a value. If this API were to be designed from 
 scratch, I wonder if it would be possible to rule out this 
 incorrect usage at compile-time while still retaining the other 
 nice features of the `sort` function like being in-place and 
 giving strong types which can be used to choose faster 
 algorithms to further transform the result, etc.?
Theoretically it could do the same thing as `std.algortihm.move` - assign the original range to it's init value, so if it's accidently reused it'd be more obvious. But of course it won't be worth doing anymore because of the breakage. Perhaps in the std v2 Wilzbach has talked about, if the authors so wish.
Nov 20 2019
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, November 20, 2019 6:15:12 AM MST Dukc via Digitalmars-d wrote:
 On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken

 wrote:
 Later I learned that I was trapped by `sort` being in-place but
 still returning a value. If this API were to be designed from
 scratch, I wonder if it would be possible to rule out this
 incorrect usage at compile-time while still retaining the other
 nice features of the `sort` function like being in-place and
 giving strong types which can be used to choose faster
 algorithms to further transform the result, etc.?
Theoretically it could do the same thing as `std.algortihm.move` - assign the original range to it's init value, so if it's accidently reused it'd be more obvious. But of course it won't be worth doing anymore because of the breakage. Perhaps in the std v2 Wilzbach has talked about, if the authors so wish.
The fact that sort returns a SortedRange is useful for some algorithms (since they can then statically detect that the range is sorted), but it's also useful to be able to use sort the range in-place and continue to use it with the exact same type. The current API is only a problem if you don't read the documentation. - Jonathan M Davis
Nov 20 2019
prev sibling parent reply Johannes Riecken <johannes.riecken gmail.com> writes:
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken 
wrote:
 I had looked at how to do case-insensitive string sorting with 
 Phobos and without looking at the documentation much, I went 
 with the following code and then wrongly concluded that `<` 
 does case-insensitive comparisons on strings by default:

 ```
 import std.algorithm;
 import std.stdio;
 import std.string;

 void main() {
     auto arr0 = ["foo", "bar", "Baz"];
     auto case_sensitive   = sort!((a, b) => a           < b
    )(arr0);
     auto case_insensitive = sort!((a, b) => a.toLower() < 
 b.toLower())(arr0);
     writeln(case_sensitive);
     writeln(case_insensitive);
 }
 ```
 Output:
 ```
 ["bar", "Baz", "foo"]
 ["bar", "Baz", "foo"]
 ```

 Later I learned that I was trapped by `sort` being in-place but 
 still returning a value. If this API were to be designed from 
 scratch, I wonder if it would be possible to rule out this 
 incorrect usage at compile-time while still retaining the other 
 nice features of the `sort` function like being in-place and 
 giving strong types which can be used to choose faster 
 algorithms to further transform the result, etc.? Or are there 
 any ideas in programming language research that would help with 
 such a case?
Thanks, that's some nice ideas in this thread! I should definitely get into the habit of using const wherever possible. Jonathan, I think a major selling point of Phobos and D is that in lots of cases it can catch erroneous API usage at compile time.
Nov 21 2019
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, November 21, 2019 3:19:45 PM MST Johannes Riecken via 
Digitalmars-d wrote:
 On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken

 wrote:
 I had looked at how to do case-insensitive string sorting with
 Phobos and without looking at the documentation much, I went
 with the following code and then wrongly concluded that `<`
 does case-insensitive comparisons on strings by default:

 ```
 import std.algorithm;
 import std.stdio;
 import std.string;

 void main() {

     auto arr0 = ["foo", "bar", "Baz"];
     auto case_sensitive   = sort!((a, b) => a           < b

    )(arr0);

     auto case_insensitive = sort!((a, b) => a.toLower() <

 b.toLower())(arr0);

     writeln(case_sensitive);
     writeln(case_insensitive);

 }
 ```
 Output:
 ```
 ["bar", "Baz", "foo"]
 ["bar", "Baz", "foo"]
 ```

 Later I learned that I was trapped by `sort` being in-place but
 still returning a value. If this API were to be designed from
 scratch, I wonder if it would be possible to rule out this
 incorrect usage at compile-time while still retaining the other
 nice features of the `sort` function like being in-place and
 giving strong types which can be used to choose faster
 algorithms to further transform the result, etc.? Or are there
 any ideas in programming language research that would help with
 such a case?
Thanks, that's some nice ideas in this thread! I should definitely get into the habit of using const wherever possible. Jonathan, I think a major selling point of Phobos and D is that in lots of cases it can catch erroneous API usage at compile time.
There's nothing erroneous about sorting a range in two different ways (though it obviously wouldn't make sense to do so twice in row without doing anything with the range in between). And having sort return a new range that was sorted while not affecting the one that was passed in would require either sorting lazily (which really doesn't work) or allocating an entirely new range in order to sort into - neither of which makes much sense. So, I'm honestly very surprised that anyone would think that calling sort would result in a new range without affecting the one that was passed in. Regardless, the documentation makes the behavior of sort clear, and anyone using a function needs to read its documentation if they don't want to run into problems. If you don't read documentation, you're going to make bad assumptions about how a function behaves eventually - if not frequently - and end up with bugs in your code. It is of course nice when the language is able to catch incorrect behavior for you at compile time, but I don't see how that could be done here. Sure, you used the function incorrectly, and if sort allocated a new range to sort into and returned that, you wouldn't have been bitten. However, pretty much every sort function I have ever seen sorts in-place, so I would expect most folks to assume that sort sorted the range in-place. And if sort returned a newly allocated range, anyone making that assumption and used sort without reading the documentation would end up with buggy code and could have complaints similiar to yours. Whether sort returns a newly allocated range or sorts in-place, _someone_ is going to make the wrong assumption about how it works and get bitten without the compiler being able to help them out. And the current implementation is by far more efficient than the version that returned a newly allocated range would be. It's also more flexible, since it allows you to sort the same range multiple times, multiple ways if that makes sense for the code. Anyone who wants to get a new range can simply allocate a new array to hold the data - either via .dup if the range to be sorted is an array or by calling save on the range and passing it to std.array.array, making your code something like auto case_sensitive = sort!((a, b) => a < b)(arr0.dup); auto case_insensitive = sort!((a, b) => a.toLower() < b.toLower())(arr0.dup); So, I'm sorry that you gotten bitten by sort not working the way you expected, but I don't see how that could have been prevented without implementing the range in the way that you expected, which would then screw over anyone who assumed that the range would be sorted in-place. The compiler can't fix this for you. Ultimately, you have to read the documentation. And given the inefficiency of returning a sorted range without affecting the one that was passed in, the current implementation makes far more sense than how you assumed that sort worked. Even if D is able to catch more problems at compile time than another language, there's still a limit to how many bugs it can catch. There's a reason that unit tests are a feature that's built into D. - Jonathan M Davis
Nov 21 2019
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 21 November 2019 at 23:28:48 UTC, Jonathan M Davis 
wrote:
 So, I'm sorry that you gotten bitten by sort not working the 
 way you expected, but I don't see how that could have been 
 prevented without implementing the range in the way that you 
 expected, which would then screw over anyone who assumed that 
 the range would be sorted in-place.
This has come up several times before, here are two of them: https://forum.dlang.org/search?q=Sorted+past+tense&search=Search It can be fixed with careful naming conventions. (lazy sorting make sense actually... Heap... Bucket...)
Nov 21 2019
prev sibling parent mipri <mipri minimaltype.com> writes:
On Thursday, 21 November 2019 at 23:28:48 UTC, Jonathan M Davis 
wrote:
 So, I'm sorry that you gotten bitten by sort not working the
 way you expected, but I don't see how that could have been
 prevented without implementing the range in the way that you
 expected, which would then screw over anyone who assumed that
 the range would be sorted in-place.
I don't disagree, but the problem also wasn't presented as a defect of Phobos, but as "I did wrote this code. I learned it was wrong. How might my error have been averted?" From some other languages, | Language | Sort | Mutates? | Returns? | Both? | |----------+-------------+----------+----------+-------| | Perl | sort | No | Yes | No | | Python | list.sort | Yes | No | No | | Python | list.sorted | No | Yes | No | | C | qsort | Yes | No | No | | C++ | std::sort | Yes | No | No | | Rust | vec::sort | Yes | No | No | The norm is to do one or the other, and not both. Why does D do both? Because it has an efficiency focus (avoid unnecessary copying) and also an expressiveness focus (go ahead and do things with long pipelines of transformations), and this combination is uncommon. Actually I'm surprised Rust doesn't do both... and come to think of it, I *was* surprised by that, when I rewrote a benchmark in it: I tried putting the .sort_by() in the middle of a pipeline of function calls and was confused by the error. So I had the exact opposite "I did a thing. I learned it was wrong." problem with sorting in Rust. There's another reason D does both: thing.change().change().change() is how you build pipelines in D, and each of these change() functions needs to return the input to the next change(). But note, this isn't a deliberate language feature, it's not taking advantage of "pipeline syntax", it's just a consequence of how methods work in OOP languages. So it's more like a pattern :) One of those awful things that highlights a failure of language design, where programmers, instead of directly expressing their intentions with code, must laboriously work around a void in the language. Or so I hear. https://www.martinfowler.com/articles/collection-pipeline/ https://clojure.org/guides/threading_macros Suppose, instead of expecting D programmers to make use of the "collection pipelien" pattern, D just had a collection pipeline feature? Like: thing |> change1() |> change2() |> change3() If it's a language feature, D can, since it knows the types of all these functions, apply a rule like if a function is void, reuse the last input for the next function otherwise, behave like . chaining With this, Phobos could have a void sort that could still appear in the middle of a long pipeline of transformations, and confusions like the one in this thread about Phobos, and *also* the one I had about Rust's sort, would both not be possible. Instead you'd have the new potential confusion of thinking that sort doesn't mutate because you saw it in a pipeline :) To be clear, I don't recommend changing anything. I just think it's not true that D has nothing to do with this.
Nov 21 2019