www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Suggestion: Operator `in` for slices

reply Quirin Schroll <qs.il.paperinik gmail.com> writes:
The operator `in` between `T` and `T[]` is requested a lot of 
times. Most people suggest it should return a `bool` signifying 
if some value of the array on the right-hand side is equal (`==`) 
to the `T` value on the left-hand side. This is the obvious thing 
to do.

I guess it has never been implemented because this information 
(true/false) is so much less than it could be. If D used 1-based 
indexing, `in` could return the index if found or 0 if not found. 
Because 0 == false, one could use `if (auto x in array)`. 
However, D's indexing 0-based, so 0 is a valid index. `in` cannot 
return a plain `size_t`.

My idea is inspired by how AAs `in` operator does not return a 
(near-useless) `bool`, but a value that can be used like a 
`bool`, but provides much more valuable information: a `T*`. It 
would work returning a `T*` for slices' `in` as well, but it is 
much less useful than the index. (One could get the index of 
something from the pointer to it using pointer subtraction, but 
pointer arithmetic not considered ` safe`.) The index practically 
*is* the pointer.

So, can we return an object that is, for almost all intents and 
purposes, usable as an index, but evaluating it via `if` returns 
`true` or `false` depending on whether the value was found or not.

So, my suggestion is that the `in` operator for slices returns a 
value of this type:
```D
struct InResult
{
     size_t index = size_t.max;
     alias index this;
     bool opCast(T : bool)() const { return index != size_t.max; }
}
```

In fact, this is very much a simple optional type. It takes a 
very unlikely valid value of its underlying type and reinterprets 
it as the invalid value.
Dec 18 2021
parent reply Adam Ruppe <destructionator gmail.com> writes:
On Saturday, 18 December 2021 at 18:29:56 UTC, Quirin Schroll 
wrote:
 I guess it has never been implemented because this information 
 (true/false) is so much less than it could be.
Nah, the general rule against it is due to speed/computational complexity. The `in` operator on AAs has a speed of O(log n) or O(1). `in` on slices would exceed that being O(n). I'm not sure the speed rule is written or not, but generally `in` is supposed to have the same speed requirement as `[n]` which is sub-linear.
Dec 18 2021
next sibling parent reply Quirin Schroll <qs.il.paperinik gmail.com> writes:
On Saturday, 18 December 2021 at 18:47:33 UTC, Adam Ruppe wrote:
 On Saturday, 18 December 2021 at 18:29:56 UTC, Quirin Schroll 
 wrote:
 I guess it has never been implemented because this information 
 (true/false) is so much less than it could be.
Nah, the general rule against it is due to speed/computational complexity. The `in` operator on AAs has a speed of O(log n) or O(1). `in` on slices would exceed that being O(n).
How is that even a criterion? `foreach` on arrays has O(n) complexity as well. You cannot say it's okay for `foreach` because it obviously has to be that way. The same can be said for linear search. It doesn't take a genius that searching an unordered structure cannot be better than O(n) in the worst case.
Dec 18 2021
parent Dukc <ajieskola gmail.com> writes:
On Saturday, 18 December 2021 at 20:17:51 UTC, Quirin Schroll 
wrote:
 Nah, the general rule against it is due to speed/computational 
 complexity. The `in` operator on AAs has a speed of O(log n) 
 or O(1). `in` on slices would exceed that being O(n).
How is that even a criterion? `foreach` on arrays has O(n) complexity as well. You cannot say it's okay for `foreach` because it obviously has to be that way.
`foreach` is not supposed to give that guarantee. `in` is. Just like the indexing operator (`range[index]`). That does not say you can't define a function that checks for an element in O(n) time, but it should be called something else than `in` then. It's the same reason as why we don't define `range[index]` as `range.save.dropExactly(index).front` for all forward ranges that do not explicitly declare `opIndex`. It's not that you can't index a range like that, but that we usually want a compile error rather than silent O(n) indexing if we are expecting O(1) or O(log n) indexing.
 The same can be said for linear search. It doesn't take a 
 genius that searching an unordered structure cannot be better 
 than O(n) in the worst case.
But it isn't always immediately clear to the reader what the type of the data structure is.
Dec 19 2021
prev sibling next sibling parent reply Dennis <dkorpel gmail.com> writes:
On Saturday, 18 December 2021 at 18:47:33 UTC, Adam Ruppe wrote:
 The `in` operator on AAs has a speed of O(log n) or O(1). `in` 
 on slices would exceed that being O(n).
Who do people keep repeating that? - Worst case time complexity of the `in` operator in AA's is actually O(n). This can easily happen in practice when your `toHash` has a poor implementation, which I have had before - In custom types you can overload the `in` operator to something with arbitrarily bad time complexity. - Despite scaling with O(n), a linear scan is pretty fast. You can probably scan your entire RAM in a matter of seconds. The whole "it ruins performance guarantees" so far seems completely hypothetical and is not something that came from experience with actual code. The problem with `in` for slices to me is that it would be inconsistent with how it works for AAs, where `in` searches for a *key* and returns a pointer to a *value*. The same for slices would look like this: ```D T* opIn(size_t i, T[] arr) { return i >= arr.length ? null : &arr[i]; } void main() { int[3] a = [10, 20, 30]; assert((10 in a) == null); // index 10 is out of bounds assert((2 in a) == &a[2]); } ``` But people suggest `in` to search a *value* in a slice. I've once overloaded the `in` operator for a custom 2D array type used to represent a maze in which path finding was done. There it was useful for bounds checking since writing comparisons with `width` and `height` all the time is cumbersome. I wouldn't mind giving `in` this meaning so you can do easy bounds checks in slices as well, but I think it will be to confusing for newcomers expecting `in` to work like in Python.
Dec 18 2021
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 18 December 2021 at 20:46:06 UTC, Dennis wrote:
 - Despite scaling with O(n), a linear scan is pretty fast. You 
 can probably scan your entire RAM in a matter of seconds.
Also, if the array is sorted it would be O(log N). You can achieve this either by adding a high level IR that can deduce that the array is sorted or making it part of the type system.
Dec 18 2021
prev sibling next sibling parent reply Dukc <ajieskola gmail.com> writes:
On Saturday, 18 December 2021 at 20:46:06 UTC, Dennis wrote:
 On Saturday, 18 December 2021 at 18:47:33 UTC, Adam Ruppe wrote:
 The `in` operator on AAs has a speed of O(log n) or O(1). `in` 
 on slices would exceed that being O(n).
Who do people keep repeating that? - Worst case time complexity of the `in` operator in AA's is actually O(n). This can easily happen in practice when your `toHash` has a poor implementation, which I have had before - In custom types you can overload the `in` operator to something with arbitrarily bad time complexity.
These are bugs. The O(log n) assumption still makes sense when your code works as it's supposed to.
 - Despite scaling with O(n), a linear scan is pretty fast. You 
 can probably scan your entire RAM in a matter of seconds.
A few seconds is by far too much. Remember, the data structure might be fed to function designed by introspection. The function sees "Ah, an `in` operator. `in` is on average O(log N) at worst, I can call it hundreds of times without significant slowdown", and compiles such that it takes a quarter of an hour to execute. Had the `in` operator respected the O(log N) assumption, the introspecting function would have failed to compile alerting the programmer, or used an alternative, quicker implementation.
Dec 19 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Sunday, 19 December 2021 at 10:10:21 UTC, Dukc wrote:
 function sees "Ah, an `in` operator. `in` is on average O(log 
 N) at worst, I can call it hundreds of times without 
 significant slowdown", and compiles such that it takes a 
 quarter of an hour to execute.
I know this is a common abuse of notation in C++ circles, but O(…) means worst case. Also if something is O(log N), then it is also O(N) or any other bigger bound. The notations O(N) means that N is sufficient to put an upper bound on the computation growth (in terms of input size) for each and every possible input of any size. Average case is not given with Big-Oh-notation, but done "accurately" by presenting a model for the distribution of input data and then solving it. (usually a very big task) When some C++ libraries talk of amortized complexity, they probably should not use O-notation, but we get what they mean. E.g. if you do a long series of operations then the costly reallocations will be amortized, so they therefore ignore it in the analysis and "pretend" it does not happen. But it isn't on average. To do on-average-calculations you need a model of the input and the operations, and the calculcations can take a scholar many days to figure out. E.g. one strategy is to translate the model and patterns of operations to recurrence relations and solving it as integrals… a very tedious and time consuming job.
Dec 19 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Sunday, 19 December 2021 at 11:07:30 UTC, Ola Fosheim Grøstad 
wrote:
 Average case is not given with Big-Oh-notation, but done 
 "accurately" by presenting a model for the distribution of 
 input data and then solving it. (usually a very big task)
By this I mean that using "O(…)" alone means worst case. We can say that quicksort is O(N²). Not qualifying the statements implies worst case. If we consider all inputs equally likely then we can also claim that the average case complexity is limited by O(N log N). Now, if AA was implemented as a hashtable with a constant upper-bound on the linked lists (say 100) then we could say that "in" would be O(1). We could also claim that inserts would be O(1) amortized (ignore rehashing, assuming a distribution that is relatively flat). So in general, we should be able to have AAs where all operations (insert, delete and membership tests) are O(1) amortized.
Dec 19 2021
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/19/21 6:45 AM, Ola Fosheim Grøstad wrote:
 Now, if AA was implemented as a hashtable with a constant upper-bound on 
 the linked lists (say 100) then we could say that "in" would be O(1).
D AAs do not use linked lists any more. It uses open addressing. But having a bound on how many collisions can happen is ineffective against a bad hash function (say, one that always returns 0). The expectation of the hash table is that the hash function is good. Without a good hash function, the complexity goes to crap.
 
 We could also claim that inserts would be O(1) amortized (ignore 
 rehashing, assuming a distribution that is relatively flat).
That is what we claim (and what hash algorithms in general claim). amortized O(1) lookups and inserts. -Steve
Dec 19 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Sunday, 19 December 2021 at 18:36:12 UTC, Steven Schveighoffer 
wrote:
 That is what we claim (and what hash algorithms in general 
 claim). amortized O(1) lookups and inserts.
Yes, if you don't shrink. You need to prove that the claim holds for all possible orderings of inserts, deletions and lookups. So you need to prove that you always do O(N) operations between rehashing, so you get O(N) operations of complexity O(1) + one rehash of complexity O(N) = O(N).
Dec 19 2021
next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/19/21 2:34 PM, Ola Fosheim Grøstad wrote:
 On Sunday, 19 December 2021 at 18:36:12 UTC, Steven Schveighoffer wrote:
 That is what we claim (and what hash algorithms in general claim). 
 amortized O(1) lookups and inserts.
Yes, if you don't shrink. You need to prove that the claim holds for all possible orderings of inserts, deletions and lookups. So you need to prove that you always do O(N) operations between rehashing, so you get O(N) operations of complexity O(1) + one rehash of complexity O(N) = O(N).
Please see existing research, this is not a new concept. -Steve
Dec 21 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 21 December 2021 at 17:02:22 UTC, Steven 
Schveighoffer wrote:
 On 12/19/21 2:34 PM, Ola Fosheim Grøstad wrote:
 On Sunday, 19 December 2021 at 18:36:12 UTC, Steven 
 Schveighoffer wrote:
 That is what we claim (and what hash algorithms in general 
 claim). amortized O(1) lookups and inserts.
Yes, if you don't shrink. You need to prove that the claim holds for all possible orderings of inserts, deletions and lookups. So you need to prove that you always do O(N) operations between rehashing, so you get O(N) operations of complexity O(1) + one rehash of complexity O(N) = O(N).
Please see existing research, this is not a new concept.
What do you mean by research, this is a trivial topic. In order to get O(1) amortised you have to delay shrinkage, you must prove that any possible sequence does not reallocate more frequently than N operations.
Dec 21 2021
prev sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Sunday, 19 December 2021 at 19:34:18 UTC, Ola Fosheim Grøstad 
wrote:
 On Sunday, 19 December 2021 at 18:36:12 UTC, Steven 
 Schveighoffer wrote:
 That is what we claim (and what hash algorithms in general 
 claim). amortized O(1) lookups and inserts.
Yes, if you don't shrink. You need to prove that the claim holds for all possible orderings of inserts, deletions and lookups. So you need to prove that you always do O(N) operations between rehashing, so you get O(N) operations of complexity O(1) + one rehash of complexity O(N) = O(N).
To guarantee amortized O(1) deletion, it is sufficient to ensure that `shrink_threshold * growth_factor < growth_threshold`. And if you look at the source code for D's built-in AAs, you will see that this is, in fact, ensured: https://github.com/dlang/druntime/blob/v2.098.1/src/rt/aaA.d#L17-L27 So, we can safely put this discussion to rest, I think.
Dec 22 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 22 December 2021 at 14:27:32 UTC, Paul Backus wrote:
 https://github.com/dlang/druntime/blob/v2.098.1/src/rt/aaA.d#L17-L27

 So, we can safely put this discussion to rest, I think.
I haven't looked at the implementation of AAs or the allocator used, so I cannot tell whether it is O(1) amortized or not. You can test it by repeatedly adding/removing items near the thresholds. A preliminary test suggests that the observed behaviour is rather wasteful, which should be a warning for anyone who considers using it in production. It grows by a factor of 4, that is a lot of waste. It also appears to grow on deletion: adding changed 0 : 0 changed 6 : 8 changed 25 : 32 changed 102 : 128 changed 409 : 512 changed 1638 : 2048 changed 6553 : 8192 removing changed 4095 : 32768 changed 1023 : 8192 changed 255 : 2048 changed 63 : 512 changed 15 : 128 changed 3 : 32 ``` import std.stdio; struct AA { Impl* impl; }; struct Impl { void*[] buckets; }; void main(){ immutable N = 6554; int[int] a; ulong nn = 0; writeln("adding"); for(int i = 0; i<N; i++){ auto b = a; a[i] = i; auto aa = cast(AA*) &a; auto n = aa.impl.buckets.length; if (n!=nn) { writeln("changed ", i, " : ", nn); nn = n; } } writeln("\nremoving"); for(int i = N-1; i>=0; i--){ auto b = a; a.remove(i); auto aa = cast(AA*) &a; auto n = aa.impl.buckets.length; if (n!=nn) { writeln("changed ", i, " : ", nn); nn = n; } } } ```
Dec 29 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 29 December 2021 at 10:53:13 UTC, Ola Fosheim 
Grøstad wrote:
 removing
 changed 4095 : 32768
 changed 1023 : 8192
 changed 255 : 2048
 changed 63 : 512
 changed 15 : 128
 changed 3 : 32
To put numbers on this, when deleting a long series of keys the hash table has consistently less than 13% filled slots, and right before shrinking it has 3% filled slots (1024/32768). That is a lot of wasted memory, 97%. Hash tables should not be worse than 75% waste as a rule of thumb. With such high constant factors O(1) notation is of little value.
Dec 29 2021
parent reply Paul Backus <snarwin gmail.com> writes:
On Wednesday, 29 December 2021 at 11:26:48 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 29 December 2021 at 10:53:13 UTC, Ola Fosheim 
 Grøstad wrote:
 removing
 changed 4095 : 32768
 changed 1023 : 8192
 changed 255 : 2048
 changed 63 : 512
 changed 15 : 128
 changed 3 : 32
To put numbers on this, when deleting a long series of keys the hash table has consistently less than 13% filled slots, and right before shrinking it has 3% filled slots (1024/32768). That is a lot of wasted memory, 97%. Hash tables should not be worse than 75% waste as a rule of thumb. With such high constant factors O(1) notation is of little value.
Seems like an issue worth filing a bugzilla report about.
Dec 29 2021
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 29 December 2021 at 21:18:30 UTC, Paul Backus wrote:
 Seems like an issue worth filing a bugzilla report about.
Probably, I don't use AAs, but I can file an issue if nobody else has already.
Jan 02 2022
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 29 December 2021 at 21:18:30 UTC, Paul Backus wrote:
 Seems like an issue worth filing a bugzilla report about.
I am not able to register for issues.dlang.org, does not not accept gmail accounts. I guess someone else has to file it.
Jan 07 2022
next sibling parent bauss <jj_1337 live.dk> writes:
On Friday, 7 January 2022 at 10:07:02 UTC, Ola Fosheim Grøstad 
wrote:
 On Wednesday, 29 December 2021 at 21:18:30 UTC, Paul Backus 
 wrote:
 Seems like an issue worth filing a bugzilla report about.
I am not able to register for issues.dlang.org, does not not accept gmail accounts. I guess someone else has to file it.
Is that real? That gmail accounts don't work? That's so odd.
Jan 07 2022
prev sibling parent reply Mike Parker <aldacron gmail.com> writes:
On Friday, 7 January 2022 at 10:07:02 UTC, Ola Fosheim Grøstad 
wrote:
 On Wednesday, 29 December 2021 at 21:18:30 UTC, Paul Backus 
 wrote:
 Seems like an issue worth filing a bugzilla report about.
I am not able to register for issues.dlang.org, does not not accept gmail accounts. I guess someone else has to file it.
I registered with a gmail account. What error do you see?
Jan 07 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 7 January 2022 at 10:45:39 UTC, Mike Parker wrote:
 On Friday, 7 January 2022 at 10:07:02 UTC, Ola Fosheim Grøstad 
 wrote:
 On Wednesday, 29 December 2021 at 21:18:30 UTC, Paul Backus 
 wrote:
 Seems like an issue worth filing a bugzilla report about.
I am not able to register for issues.dlang.org, does not not accept gmail accounts. I guess someone else has to file it.
I registered with a gmail account. What error do you see?
«The e-mail address you entered (firstname.middlename.lastname gmail.com) didn't pass our syntax checking for a legal email address. A legal address must contain exactly one ' ', and at least one '.' after the . Currently, registering using Gmail addresses is not allowed due to spam. It also must not contain any illegal characters.»
Jan 07 2022
prev sibling next sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 12/18/21 12:46 PM, Dennis wrote:

 The problem with `in` for slices to me is that it would be inconsistent
 with how it works for AAs, where `in` searches for a *key* and returns a
 pointer to a *value*.
That. Ali
Dec 19 2021
prev sibling next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 18 December 2021 at 20:46:06 UTC, Dennis wrote:
 I wouldn't mind giving `in` this meaning so you can do easy 
 bounds checks in slices as well, but I think it will be to 
 confusing for newcomers expecting `in` to work like in Python.
But your proposal would kinda work like in Python? In [3]: 'a' in {'a':1, 'b':2} Out[3]: True In [4]: 'a' in ['a','b'] Out[4]: True
Dec 19 2021
parent reply Dennis <dkorpel gmail.com> writes:
On Sunday, 19 December 2021 at 17:11:23 UTC, Ola Fosheim Grøstad 
wrote:
 But your proposal would kinda work like in Python?

 In [3]: 'a' in {'a':1, 'b':2}
 Out[3]: True
 In [4]: 'a' in ['a','b']
 Out[4]: True
In my proposal `'a' in ['a', 'b']` will return `null` because it looks for the key 'a', and since the ASCII value of 'a' is `0x61` and the array length 2, it will return `null`.
Dec 20 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 20 December 2021 at 10:34:44 UTC, Dennis wrote:
 In my proposal `'a' in ['a', 'b']` will return `null` because 
 it looks for the key 'a', and since the ASCII value of 'a' is 
 `0x61` and the array length 2, it will return `null`.
Oh yeah, I'm sorry, you looked at the index as a key. I'm not sure how I managed to mix that up… But yeah, just doing what Python does, but with a pointer to a value or null instead of boolean, makes sense in D.
Dec 20 2021
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/18/21 3:46 PM, Dennis wrote:
 On Saturday, 18 December 2021 at 18:47:33 UTC, Adam Ruppe wrote:
 The `in` operator on AAs has a speed of O(log n) or O(1). `in` on 
 slices would exceed that being O(n).
Who do people keep repeating that? - Worst case time complexity of the `in` operator in AA's is actually O(n). This can easily happen in practice when your `toHash` has a poor implementation, which I have had before
Well, yeah. But that doesn't mean it happens often or even regularly. You need to have a bad `toHash` function. We shouldn't lower the bar for `in` just because it's possible to make it perform poorly.
 - In custom types you can overload the `in` operator to something with 
 arbitrarily bad time complexity.
There's your answer -- make a custom type that allows `in` with an array-like structure. Or make a UFCS function.
 - Despite scaling with O(n), a linear scan is pretty fast. You can 
 probably scan your entire RAM in a matter of seconds.
Complexity is not some old-fashioned idea that has no relevance on today's hardware. Linear searching is a real bottleneck, and identifying `in` to be sub-linear is a useful design decision. If you are searching your entire memory by using an `opCmp`, I bet it will be more than a few seconds.
 The whole "it ruins performance guarantees" so far seems completely 
 hypothetical and is not something that came from experience with actual 
 code.
The design is sound. You use `in`, you get sub-linear (i.e. fast) performance. That kind of expectation is useful from a generic point of view. It's similar to opIndex for a linked list -- it's possible, it makes it so you can used linked lists in places that might expect arrays, but generic code that uses it might perform really badly as the code expects opIndex to be fast.
 The problem with `in` for slices to me is that it would be inconsistent 
 with how it works for AAs, where `in` searches for a *key* and returns a 
 pointer to a *value*. The same for slices would look like this:
[snip]
 But people suggest `in` to search a *value* in a slice.
Yes, this is a good point against. But the thing is, we also have `find`, which works fine, I'm not sure why we need `in` support.
 I've once overloaded the `in` operator for a custom 2D array type used 
 to represent a maze in which path finding was done. There it was useful 
 for bounds checking since writing comparisons with `width` and `height` 
 all the time is cumbersome. I wouldn't mind giving `in` this meaning so 
 you can do easy bounds checks in slices as well, but I think it will be 
 to confusing for newcomers expecting `in` to work like in Python.
Yes, that's not a terrible use case for it. But you are right that `in` will be confusing to newcomers, especially when you learn that an array is a collection of data. `if(x in arr)` doesn't seem to suggest that it will be looking at indexes. -Steve
Dec 19 2021
parent reply Dennis <dkorpel gmail.com> writes:
On Sunday, 19 December 2021 at 18:31:46 UTC, Steven Schveighoffer 
wrote:
 There's your answer -- make a custom type that allows `in` with 
 an array-like structure.
For the record, I'm not arguing in favor of `in` for slices. I'm just baffled that whenever this comes up, the first thing people bring up is the 'time complexity' argument.
 That kind of expectation is useful from a generic point of view.
And this is my biggest problem with that argument: it's about this mystical function that takes a generic data structure and uses the `in` operator multiple times on it, which can now suddenly be called on an array, ruining performance to the surprise of the programmer. What is this function? Does it exist in Phobos currently?
Dec 20 2021
next sibling parent reply Araq <rumpf_a web.de> writes:
On Monday, 20 December 2021 at 11:04:00 UTC, Dennis wrote:
 On Sunday, 19 December 2021 at 18:31:46 UTC, Steven 
 Schveighoffer wrote:
 There's your answer -- make a custom type that allows `in` 
 with an array-like structure.
For the record, I'm not arguing in favor of `in` for slices. I'm just baffled that whenever this comes up, the first thing people bring up is the 'time complexity' argument.
 That kind of expectation is useful from a generic point of 
 view.
And this is my biggest problem with that argument: it's about this mystical function that takes a generic data structure and uses the `in` operator multiple times on it, which can now suddenly be called on an array, ruining performance to the surprise of the programmer. What is this function? Does it exist in Phobos currently?
And how come it cannot store the result of the "in" operation in a local variable for re-use? And how exactly is calling an oh-so-efficient sub-linear O(log N) function repeatedly when you can save the result in a local *not* a problem?
Dec 20 2021
parent reply Biotronic <simen.kjaras gmail.com> writes:
On Monday, 20 December 2021 at 11:28:52 UTC, Araq wrote:
 On Monday, 20 December 2021 at 11:04:00 UTC, Dennis wrote:
 And this is my biggest problem with that argument: it's about 
 this mystical function that takes a generic data structure and 
 uses the `in` operator multiple times on it, which can now 
 suddenly be called on an array, ruining performance to the 
 surprise of the programmer.

 What is this function? Does it exist in Phobos currently?
And how come it cannot store the result of the "in" operation in a local variable for re-use? And how exactly is calling an oh-so-efficient sub-linear O(log N) function repeatedly when you can save the result in a local *not* a problem?
bool isSubset(T1, T2)(T1 needles, T2 haystack) { foreach (needle; needles) { if (!(needle in haystack)) return false; } return true; } If 'in' is O(1), that's a fairly sensible implementation. If it's O(N), it's not. Since the lookup uses different arguments each time, local variables can't help you. Note that, of course, there's no problem if `x in arr` is a simple bounds check - if your bounds check on an array is O(N), you're doing it wrong. -- Simen
Dec 20 2021
parent Dennis <dkorpel gmail.com> writes:
On Monday, 20 December 2021 at 13:07:40 UTC, Biotronic wrote:
 If 'in' is O(1), that's a fairly sensible implementation.
Thanks, that's an illuminating example.
Dec 20 2021
prev sibling next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 20 December 2021 at 11:04:00 UTC, Dennis wrote:
 On Sunday, 19 December 2021 at 18:31:46 UTC, Steven 
 Schveighoffer wrote:
 There's your answer -- make a custom type that allows `in` 
 with an array-like structure.
For the record, I'm not arguing in favor of `in` for slices. I'm just baffled that whenever this comes up, the first thing people bring up is the 'time complexity' argument.
Yes, I agree with this. When I use "in" in Python I am very much aware that it does a linear scan, but it is a very useful shorthand that can replaces a long and confusing sequence of or-expressions. E.g. ``` if keyword in ['if', 'while', 'for] ``` is a very useful and clear syntactical construct.
Dec 20 2021
parent reply Paul Backus <snarwin gmail.com> writes:
On Monday, 20 December 2021 at 13:54:56 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 20 December 2021 at 11:04:00 UTC, Dennis wrote:
 On Sunday, 19 December 2021 at 18:31:46 UTC, Steven 
 Schveighoffer wrote:
 There's your answer -- make a custom type that allows `in` 
 with an array-like structure.
For the record, I'm not arguing in favor of `in` for slices. I'm just baffled that whenever this comes up, the first thing people bring up is the 'time complexity' argument.
Yes, I agree with this. When I use "in" in Python I am very much aware that it does a linear scan, but it is a very useful shorthand that can replaces a long and confusing sequence of or-expressions. E.g. ``` if keyword in ['if', 'while', 'for] ``` is a very useful and clear syntactical construct.
```d import std.algorithm.searching; if (["if", "while", "for"].canFind(keyword)) { // ... } ```
Dec 20 2021
next sibling parent Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Monday, 20 December 2021 at 14:21:08 UTC, Paul Backus wrote:
 On Monday, 20 December 2021 at 13:54:56 UTC, Ola Fosheim 
 Grøstad wrote:
 E.g.

 ```
 if keyword in ['if', 'while', 'for]
 ```

 is a very useful and clear syntactical construct.
```d import std.algorithm.searching; if (["if", "while", "for"].canFind(keyword)) { // ... } ```
TBH, for that particular case, this looks nicer, and IMO documents the intent better than the original Pythonism: ```d import std.algorithm.comparison : among; if (keyword.among("if", "while", "for")) { /* ... */ } ``` It also doesn't let work go to waste, as it returns a 1-based index.
Dec 20 2021
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 20 December 2021 at 14:21:08 UTC, Paul Backus wrote:
 ```d
 import std.algorithm.searching;

 if (["if", "while", "for"].canFind(keyword)) {
     // ...
 }
 ```
You can always create a library construct, but then you should also argue in favour of removing AAs and "in". You either do it, or you don't. Why the half-measures?
Dec 20 2021
next sibling parent Paul Backus <snarwin gmail.com> writes:
On Monday, 20 December 2021 at 21:11:28 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 20 December 2021 at 14:21:08 UTC, Paul Backus wrote:
 ```d
 import std.algorithm.searching;

 if (["if", "while", "for"].canFind(keyword)) {
     // ...
 }
 ```
You can always create a library construct, but then you should also argue in favour of removing AAs and "in". You either do it, or you don't. Why the half-measures?
There are a lot of features one could potentially add to D to improve it, and limited resources to devote to doing so. I'm not necessarily *against* adding `in` for slices, but since we already have a perfectly good library solution, I don't think it deserves a high priority. Feel free to write the DIP yourself if you disagree. :)
Dec 20 2021
prev sibling parent reply Nick Treleaven <nick geany.org> writes:
On Monday, 20 December 2021 at 21:11:28 UTC, Ola Fosheim Grøstad 
wrote:
 You can always create a library construct, but then you should 
 also argue in favour of removing AAs and "in". You either do 
 it, or you don't. Why the half-measures?
No, `canFind` and `among` search for a given value in a range. `in` checks if a given key exists and returns a pointer to the associated data (not a pointer to the matching key). This is why Phobos SortedRange does not implement opIn even though it would be sub-linear. Dennis is right that conceptually a key for an array is an index. An array element value is the data associated with an index.
Dec 21 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 21 December 2021 at 17:06:04 UTC, Nick Treleaven 
wrote:
 Dennis is right that conceptually a key for an array is an 
 index. An array element value is the data associated with an 
 index.
What it is conceptually, depends on how you conceptualize it. The most common conceptualization of "in" is as a set operation and you cannot represent a set with array indexes.
Dec 22 2021
parent reply Nick Treleaven <nick geany.org> writes:
On Wednesday, 22 December 2021 at 12:43:26 UTC, Ola Fosheim 
Grøstad wrote:
 What it is conceptually, depends on how you conceptualize it. 
 The most common conceptualization of "in" is as a set operation 
 and you cannot represent a set with array indexes.
`in` is defined for RedBlackTree, but it returns bool: https://dlang.org/phobos/std_container_rbtree.html#.RedBlackTree.opBinaryRight So aside from the complexity argument, `element in arr` could be defined if it returned bool, not a pointer (in order to be consistent with AAs not returning a pointer to the *key*). But due to time complexity it makes more sense to define `index in arr`.
Jan 01 2022
parent reply Elronnd <elronnd elronnd.net> writes:
On Saturday, 1 January 2022 at 16:40:02 UTC, Nick Treleaven wrote:
 On Wednesday, 22 December 2021 at 12:43:26 UTC, Ola Fosheim 
 Grøstad wrote:
 What it is conceptually, depends on how you conceptualize it. 
 The most common conceptualization of "in" is as a set 
 operation and you cannot represent a set with array indexes.
`in` is defined for RedBlackTree, but it returns bool: https://dlang.org/phobos/std_container_rbtree.html#.RedBlackTree.opBinaryRight
A redblacktree is not a sequence, though; its elements are not keyed in any way.
Jan 01 2022
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Sunday, 2 January 2022 at 02:33:22 UTC, Elronnd wrote:
 A redblacktree is not a sequence, though; its elements are not 
 keyed in any way.
Right, "in" is a set operator. You can implement a set as a dict with null-values, then you can think of the keys as values. If you think of the ADT as the set-concept then what is a key and what is a value is a matter of interpretation. If you think of a binary tree (or rbtree) as a set, then the distinction between key and value becomes blurred.
Jan 02 2022
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/20/21 6:04 AM, Dennis wrote:
 On Sunday, 19 December 2021 at 18:31:46 UTC, Steven Schveighoffer wrote:
 There's your answer -- make a custom type that allows `in` with an 
 array-like structure.
For the record, I'm not arguing in favor of `in` for slices. I'm just baffled that whenever this comes up, the first thing people bring up is the 'time complexity' argument.
One thing not really being discussed is that there is a difference between "some library" defining slow `in` operators, or slow `opIndex`, and dlang/Phobos doing it. D picked the path of trying to ensure complexity consistency for `in`, but it's more of a stylistic rule, not necessarily a requirement. -Steve
Dec 21 2021
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Tuesday, 21 December 2021 at 17:00:49 UTC, Steven 
Schveighoffer wrote:
 One thing not really being discussed is that there is a 
 difference between "some library" defining slow `in` operators, 
 or slow `opIndex`, and dlang/Phobos doing it.

 D picked the path of trying to ensure complexity consistency 
 for `in`, but it's more of a stylistic rule, not necessarily a 
 requirement.
Then maybe we in should be implemented; Have it check if you have it sorted (*unless you do assumesorted template*). If it is sorted naturally use BinarySearch, and if not it would probably do a linear search **BUT** give a warning message and file/line number so it can be fixed/traced? (*Or just make it have to be Binary Search and asserts out if it isn't sorted*) As for bool vs pointer return... probably return a pointer as that can easily be tested as bool for no extra cost.
Jan 02 2022
parent reply Nick Treleaven <nick geany.org> writes:
On Monday, 3 January 2022 at 00:15:53 UTC, Era Scarecrow wrote:
  Then maybe we in should be implemented; Have it check if you 
 have it sorted (*unless you do assumesorted template*). If it 
 is sorted naturally use BinarySearch, and if not it would 
 probably do a linear search **BUT** give a warning message and 
 file/line number so it can be fixed/traced? (*Or just make it 
 have to be Binary Search and asserts out if it isn't sorted*)

  As for bool vs pointer return... probably return a pointer as 
 that can easily be tested as bool for no extra cost.
It would make sense to implement `in` for SortedRange, but there is no use case for returning a pointer. You already have the value on the left hand side, the only reason for a pointer would be to mutate the element in the range. Mutating that would potentially stop the range being in order, violating the point of SortedRange.
Jan 04 2022
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Tuesday, 4 January 2022 at 18:23:33 UTC, Nick Treleaven wrote:
 It would make sense to implement `in` for SortedRange, but 
 there is no use case for returning a pointer. You already have 
 the value on the left hand side, the only reason for a pointer 
 would be to mutate the element in the range.
Sorry, i was somehow thinking there was a key/value pair in this and not just a normal sorted block. But yes there may still be use cases. Consider i do opCmp on a struct, and it only does comparisons on say a name portion (*Or UPC or some other immutable property ID*), but no other part of the stored data. The name/ID would be how it's sorted; But any attached data which is non-sorted would be inaccessible otherwise. This could be phone number, address, grades, just about anything (*and those items could be changed without changing the sorting order*) Though you can probably do 2 steps it seems like a bit of extra pointless work to me. Or maybe AA's are a better choice in that case. I'm not sure.
Jan 05 2022
parent Nick Treleaven <nick geany.org> writes:
On Wednesday, 5 January 2022 at 23:16:48 UTC, Era Scarecrow wrote:
 Consider i do opCmp on a struct, and it only does comparisons 
 on say a name portion
Ok, you're right. The search key then wouldn't need to contain the other data which is not used for comparisons.
 But any attached data which is non-sorted would be inaccessible 
 otherwise. This could be phone number, address, grades, just 
 about anything (*and those items could be changed without 
 changing the sorting order*)
True, but if SortedRange exposed a pointer to the matching element, it should probably be const so the data that *is* compared is not changed. (It could perhaps alternatively return a one-based index instead, which would convert to bool as expected).
Jan 07 2022
prev sibling parent Biotronic <simen.kjaras gmail.com> writes:
On Saturday, 18 December 2021 at 18:47:33 UTC, Adam Ruppe wrote:
 I'm not sure the speed rule is written or not, but generally 
 `in` is supposed to have the same speed requirement as `[n]` 
 which is sub-linear.
It's in [std.container](https://dlang.org/phobos/std_container.html), bottom of the page. I [remember a discussion on complexity](https://forum.dlang.org/post/n3qtgq$2frv$1 digitalmars.com) that prompted Andrei to create a library for marking functions with their complexity. I'm not sure if this was the original source for the complexities given in std.container. My brain is also telling me the values were sourced from Alexander Stepanov's [Elements of Programming](http://elementsofprogramming.com/), but I can't find a source for that, and I'm too lazy to look very deep. -- Biotronic
Dec 19 2021