www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Improving std.algorithm.find

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I was thinking of improving std.find. We have this bug report:

http://d.puremagic.com/issues/show_bug.cgi?id=3923

which is pretty vague but does have a point.

For starters, it should be told that std.algorithm.find _does_ a lot, 
which at least partially justifies its complexity. One thing that I 
haven't seen other find()s doing is to be able to search in one pass a 
given range for multiple other ranges, like this:

int[] a = [ 1, 4, 2, 3 ];
assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));

When passed more than two arguments, find returns a tuple continuing the 
searched ranged positioned on the element found and the 1-based index of 
the parameter that was found. The trick is that find() makes exactly one 
pass through the searched range, which is often more efficient than 
searching the same range for each element in turn. Also the one-pass 
approach works with input ranges so it doesn't put pressure on the range 
capabilities.

However the simplified find() looks like, I'd like to keep this feature 

complicates find's signature and implementation.

Another aspect I'd like to discuss is use of Boyer-Moore and related 
fast finding techniques. Currently the use of B-M is explicit and a bit 
awkward. I was thinking instead to automatically detect its 
appropriatedness and use it transparently. Bearophile posted at a link 
to such a blend string search routine 
(http://effbot.org/zone/stringlib.htm) that I think can be generalized 
quite easily.

Any ideas - please chime in.


Andrei
Jul 17 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Saturday 17 July 2010 15:55:26 Andrei Alexandrescu wrote:
 I was thinking of improving std.find. We have this bug report:
 
 http://d.puremagic.com/issues/show_bug.cgi?id=3923
 
 which is pretty vague but does have a point.
 
 For starters, it should be told that std.algorithm.find _does_ a lot,
 which at least partially justifies its complexity. One thing that I
 haven't seen other find()s doing is to be able to search in one pass a
 given range for multiple other ranges, like this:
 
 int[] a = [ 1, 4, 2, 3 ];
 assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));
 
 When passed more than two arguments, find returns a tuple continuing the
 searched ranged positioned on the element found and the 1-based index of
 the parameter that was found. The trick is that find() makes exactly one
 pass through the searched range, which is often more efficient than
 searching the same range for each element in turn. Also the one-pass
 approach works with input ranges so it doesn't put pressure on the range
 capabilities.
 
 However the simplified find() looks like, I'd like to keep this feature

 complicates find's signature and implementation.
 
 Another aspect I'd like to discuss is use of Boyer-Moore and related
 fast finding techniques. Currently the use of B-M is explicit and a bit
 awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 
 Any ideas - please chime in.
 
 
 Andrei
I think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off. The find() that takes only a predicate and the range to search in is much better: Range find(alias pred, Range)(Range haystack); Now, unfortunately, I'm not sure how to improve the main version of find other than the return type. FindResult!(Range,Ranges) find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges needles); becomes Range find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges needles); However, that (alias pred = "a == b", Range, Ranges...) is still pretty ugly. Unfortunately, I'm not sure how you would eliminate that. The only part that the caller would normally care about is alias pred = "a == b", but the rest is part of the template definition as well, and I don't think that we have a way to clean that up at the moment. The other problem is that you're getting two totally different return types depending on whether you give it one needle or many (which may make Range not work as a return type anyway since what you get in the case of multiple needs isn't exactly a Range). So, I'd suggest either making it so that they return the same thing and you have to do a comparison to figure out which needle was found (in many cases, all you care about is that one of them was found anyway, but it would be less efficient in cases where you do care), or you should split find() into two functions - one which takes a single needle and one which takes many. That should make things a bit clearer. It doesn't really change the functionality, but the docs would be less cluttered. Another option would be to make find only return a range for multiple needles but have a separate function which returns a tuple if you really care, thereby simplifying find() a bit. In such a case, it would probably still be best to separate find() itself into two functions, but it wouldn't be as critical. As for integrating BoyerMoore into find(), if you can do it cleanly and seemlessly, I'm all for it. It would be good if there were a note about it in the documentation so that those using it wouldn't feel the need to reimplement it themselves, but I'm all for having versions of find() be specialized for the type that they're searching on if it can be done seemlessly. I'd even suggest doing the same for the various container types which could benefit from a specialized find - though perhaps those would be better served by having a special version of find() as part of their definition which std.algorithm.find() then calls. If anything, I've been more interested in canFind() and until() being made to match up with find(). I'd like to be able to give them both pretty much the same arguments and then get the bool from canFind() and the range that find() would have walked over in the case of until(). A function which gave you both the range that until() would have given you as well as the one which find() would have given you would be nice as well. I previously opened a bug with thoughts along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888 In any case, I think that find() itself is more or less fine from a usage standpoint. It's the docs that need help. And splitting up the function and trying to simplify the signature in the docs (whether the signature itself is really simplified or not) is what really needs to happen. The other thing would be to add a lot more examples. That would be a big boon for a lot of std.algorithm functions. They're not necessarily hard to use, and the examples show it much more clearly than the signatures often do. The problem is that the signatures are so ugly. And the fact that they have to be (in terms of the actual function if not the documentation) is pretty much required given what they do, so I'm not sure how much that can really be fixed. Maybe some extensions to DDoc need to be done for it happen, but it would be good if we could find a way to give simplified function signatures for a lot of these functions (possibly with a way in the docs to see the full function signature) so that users of the docs can see what they need to to use the functions without having to wade through the nasty (albeit necessarily so) signatures. - Jonathan M Davis
Jul 17 2010
parent reply sybrandy <sybrandy gmail.com> writes:
 I think that the problem with find() is not so much find() but it's
documentation.
 In all honesty, anything with a return type like FindResult!(Range, Ranges) is
 going to scare people off.
I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot. Casey
Jul 19 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Monday, July 19, 2010 16:36:57 sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation. In all honesty, anything with a return type like
 FindResult!(Range, Ranges) is going to scare people off.
I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot. Casey
In most cases, the return types are just ranges, but because they're constructed by the algorithm, the algorithm creates its own range type. So, generally, all you need to know is that a range is being returned. But even then, you run into some issues where you need to do stuff like run the returned range through array() to get something useable - depending on what you're doing with it. Regardless of the return type though, std.algorithm is definitely a prime example of how auto is your friend. That way you often don't have to care all that much about what the return type really looks like. However, the documentation on it is obviously ugly. It really should give the programmer exactly what they need to use the function and nothing more. As it stands, you have to really want to use it to figure it out. Personally, I think that it's well worth it - std.algorithm is fantastic - but the documentation is nowhere near as straightforward as it should be (even if using the functions is actually pretty straightforward in most cases). - Jonathan M Davis
Jul 19 2010
parent reply sybrandy <sybrandy gmail.com> writes:
 In most cases, the return types are just ranges, but because they're
constructed
 by the algorithm, the algorithm creates its own range type. So, generally, all
 you need to know is that a range is being returned. But even then, you run into
 some issues where you need to do stuff like run the returned range through
 array() to get something useable - depending on what you're doing with it.
 Regardless of the return type though, std.algorithm is definitely a prime
example
 of how auto is your friend. That way you often don't have to care all that much
 about what the return type really looks like.
That paragraph would have been immensely useful to me a few months ago.
 However, the documentation on it is obviously ugly. It really should give the
 programmer exactly what they need to use the function and nothing more. As it
 stands, you have to really want to use it to figure it out. Personally, I think
 that it's well worth it - std.algorithm is fantastic - but the documentation is
 nowhere near as straightforward as it should be (even if using the functions is
 actually pretty straightforward in most cases).
Agreed. I'm also a big fan of examples as it's very easy to look at an example and get something working. Then, once you see it work and get somewhat of an understanding, you can dig deeper into the details to learn more. Casey
Jul 20 2010
parent bearophile <bearophileHUGS lycos.com> writes:
sybrandy:
 Agreed.  I'm also a big fan of examples as it's very easy to look at an 
 example and get something working.  Then, once you see it work and get 
 somewhat of an understanding, you can dig deeper into the details to 
 learn more.
I am not sure, but I think find() is not what I need, even if you add lot of usage examples to it. (It seems my list of find-like micropatterns got ignored, I don't know why, maybe they don't look enough like things done by STL C++.) Bye, bearophile
Jul 20 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 06:36 PM, sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation.
 In all honesty, anything with a return type like FindResult!(Range,
 Ranges) is
 going to scare people off.
I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot.
I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone) bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by default. I'll work on improving the documentation. Andrei
Jul 19 2010
parent sybrandy <sybrandy gmail.com> writes:
On 07/19/2010 10:18 PM, Andrei Alexandrescu wrote:
 On 07/19/2010 06:36 PM, sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation.
 In all honesty, anything with a return type like FindResult!(Range,
 Ranges) is
 going to scare people off.
I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot.
I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone) bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by default. I'll work on improving the documentation. Andrei
Muchas gracias from your loyal minions. Casey
Jul 20 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei:

Any ideas - please chime in.<
As with most software it's better to first ask yourself what your software will mostly be used for, and then to write it. Otherwise you are wasting your time and the user time. The most common usages (I call them micropatterns) for find-like functions (every one of the following seven operations takes an optional transform function like schwartzSort): 1) boolean predicate, is the item/iterable x inside the iterable foo? ("in" operator, or if not possible a "isIn" function). 2) int function, tell me how many items/arrays x are inside the iterable foo ("count" function). 3) int function, tell me the index of the first item/iterable x in the iterable foo ("index" function). 4 extra) find lazily all the indexes of an item/iterable x inside the iterable foo ("indexes" struct or class). 5) find the first max/min item in a iterable ("max", "min" functions, they have overloads for 2 and 3 items too). 6) find the index of the max/min item in a iterable ("maxPos", "minPos" functions). 7) find lazily all the max/min items in a iterable ("maxs", "mins" structs or classes). An iterable is a Range (like a string, array, AA, etc) or a struct/class that defines just opApply. In my dlibs1 in all those functions if the sequence is an associative arrays, the search is done on the keys. D2 AAs have .byKey() that reduces this need a bit. In the first 4 micropatterns it's good to have a specialization for strings, to increase performance with some string-specific tricks . Bye, bearophile
Jul 18 2010
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
*****
    I was thinking of improving std.find.

    For starters, it should be told that std.algorithm.find _does_ a lot,
which at least partially justifies its complexity. One thing that I haven't
seen other find()s doing is to be able to search in one pass a given range
for multiple other ranges, like this:

    int[] a = [ 1, 4, 2, 3 ];
    assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));

    When passed more than two arguments, find returns a tuple continuing the
searched ranged positioned on the element found and the 1-based index of the
parameter that was found. The trick is that find() makes exactly one pass
through the searched range, which is often more efficient than searching the
same range for each element in turn. Also the one-pass approach works with
input ranges so it doesn't put pressure on the range capabilities.
*****

You could consider than find() is a multiple-returns function: it may return
0, 1 or many values (the successive finds). Which can, obviously, be easily
represented by a range. I see find() as bit like filter(): its return type
is a range determined by the predicate and whose elements are the rest of
the input range.
so:
find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are [4,2,3,4,0]
and [4,0].

If no needle is present in haystack, then the result range is empty. It's
just it's an empty range of ranges, not an empty version of the original
range.

A complement information can be how many steps were taken (that is, the
index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
another function could take care of that: findIndexed(haystack, needle) or
find!(withIndex)(haystack, needle).

The advantage of returning a range is the possibility to iterate on the
results. I, for one, like the idea of having find() in the same family that
workhorses like map() or filter()... The drawback is the necessity to test
the result with empty to know if you had results, but that's not different
from the current version.

So, you could have a findFirst() that does not return a range, but one
element, like the current find() does. Or call what I suggest findAll().

Now, for many needles... Currently, you pass the predicate "a==b" to find()
and use it on all needles. A possible generalization is to have a isOneOf()
predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current
element being tested and the b's are the needles passed to find(). This
generic calling of pred with (a, needles) could be done for any predicate,
like for example:

find!(isBetween)(haystack, 0, 10)

where isBetween(a, left, right) returns true if (left <= a && a <= right).

The default would be "a==b" for one needle and isOneOf(a, b...) if
needles.length > 1. So you still get the current behavior, with a bit of
possible customization thrown in, like the isBetween example. I don't think
that's possible today?

The limit of this is you cannot easily get the index of the found needle.
btw, could you remind us why you used a 1-based index? My opinion on this is
that if you want to find any needle, then you mostly care for them being
present, and much less for wich one is the first. Also, the found element is
accessible by taking the front of the retuned range. Why provide it to the
caller?

Another problem I see is to get the nice current behavior of having a needle
being a range like haystack:

find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

And that's I particularly like.


*****
    Another aspect I'd like to discuss is use of Boyer-Moore and related
fast finding techniques. Currently the use of B-M is explicit and a bit
awkward. I was thinking instead to automatically detect its appropriatedness
and use it transparently. Bearophile posted at a link to such a blend string
search routine (http://effbot.org/zone/stringlib.htm) that I think can be
generalized quite easily.
*****

Indeed, reading this, it seems like it could be generalized easily. You need
a random-access range with a defined length. So the usage of such an algo
could indeed be done transparently. Hmm...


Philippe
Jul 18 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/18/2010 04:30 PM, Philippe Sigaud wrote:
 On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
 You could consider than find() is a multiple-returns function: it may
 return 0, 1 or many values (the successive finds). Which can, obviously,
 be easily represented by a range. I see find() as bit like filter(): its
 return type is a range determined by the predicate and whose elements
 are the rest of the input range.
 so:
 find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are
 [4,2,3,4,0] and [4,0].

 If no needle is present in haystack, then the result range is empty.
 It's just it's an empty range of ranges, not an empty version of the
 original range.

 A complement information can be how many steps were taken (that is, the
 index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
 another function could take care of that: findIndexed(haystack, needle)
 or find!(withIndex)(haystack, needle).
I agree that a findAll function yielding a range is nice, however it's not the simplest interface. I think there should be a function that just finds something somewhere.
 Now, for many needles... Currently, you pass the predicate "a==b" to
 find() and use it on all needles. A possible generalization is to have a
 isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
 the current element being tested and the b's are the needles passed to
 find(). This generic calling of pred with (a, needles) could be done for
 any predicate, like for example:

 find!(isBetween)(haystack, 0, 10)
Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever.
 The limit of this is you cannot easily get the index of the found
 needle. btw, could you remind us why you used a 1-based index? My
 opinion on this is that if you want to find any needle, then you mostly
 care for them being present, and much less for wich one is the first.
I used one-based index to make tests easier - as you said, most often you care for the presence so zero/nonzero is easiest to tell. Then I thought it wouldn't hurt to provide the index since it's not any extra work.
 Also, the found element is accessible by taking the front of the retuned
 range. Why provide it to the caller?
Because that doesn't generalize well to searching ranges (as oposed to individual elements).
 Another problem I see is to get the nice current behavior of having a
 needle being a range like haystack:

 find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

 And that's I particularly like.
Not sure I understand whether that's good or bad, but I also want to provide the likes of findSkip() (it's been discussed here under a different name, findConsume) that positions the searched range after the found range: findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1] It's very useful in practice.
 *****
      Another aspect I'd like to discuss is use of Boyer-Moore and
 related fast finding techniques. Currently the use of B-M is explicit
 and a bit awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 *****

 Indeed, reading this, it seems like it could be generalized easily. You
 need a random-access range with a defined length. So the usage of such
 an algo could indeed be done transparently. Hmm...
The current B-M implementation already works with any random-access range with length, but it's not integrated transparently with the regular find. I wonder what the threshold conditions might look like. Andrei
Jul 18 2010
parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 I agree that a findAll function yielding a range is nice, however it's not
 the simplest interface. I think there should be a function that just finds
 something somewhere.
Yes, I agree that's the common case. What I personally would like is a way to find an element in an ordered container (binary tree...)
  Now, for many needles... Currently, you pass the predicate "a==b" to
 find() and use it on all needles. A possible generalization is to have a
 isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
 the current element being tested and the b's are the needles passed to
 find(). This generic calling of pred with (a, needles) could be done for
 any predicate, like for example:

 find!(isBetween)(haystack, 0, 10)
Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever.
The problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms. A not-so-uncommon scenario for me is having one range delivering limits (or zones to explore, or any other runtime arguments), and finding a value inside those limits in another range. Ideally, I'd like to use find() on a spatial structure like an R-tree. ( http://en.wikipedia.org/wiki/R-tree ). But I admit it's not that common.
  The limit of this is you cannot easily get the index of the found
 needle. btw, could you remind us why you used a 1-based index? My
 opinion on this is that if you want to find any needle, then you mostly
 care for them being present, and much less for wich one is the first.
I used one-based index to make tests easier - as you said, most often you care for the presence so zero/nonzero is easiest to tell. Then I thought it wouldn't hurt to provide the index since it's not any extra work.
Ah, I thought the scenario was: 1) test for the returned range (as the tuple._0 element) being empty or not. 2) if not empty, something was found, get the index. I forgot getting 0 as index would mean 'no element found'. If I understood you correctly, then I'd prefer a -1 index indicating failure, and a 0-based indexing on the needles if on is found. But that's a cosmetic issue.
  Also, the found element is accessible by taking the front of the retuned
 range. Why provide it to the caller?
Because that doesn't generalize well to searching ranges (as oposed to individual elements).
Good point.
  Another problem I see is to get the nice current behavior of having a
 needle being a range like haystack:

 find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

 And that's I particularly like.
Not sure I understand whether that's good or bad,
That's good. I mean, the current situation is good and that's a functionality I like.
 but I also want to provide the likes of findSkip() (it's been discussed
 here under a different name, findConsume) that positions the searched range
 after the found range:

 findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]

 It's very useful in practice.
I see that. And and empty range if it found nothing? Another useful one is split() to cut the range in two or three parts : before the match, the match itself (which can be one of the needles, so we do not know it in advance) and the part after that. Return that triple as a tuple. Hey, that way you can even implement replace: return chain(firstpart, replacedmatch, tailpart). Anything that can mimic regex functions on generic ranges is good, IMO. I regularly find myself wanting to do some regex-like matching. We do not need real pattern matching but even with simplified predicates, we can do many things.
  *****
     Another aspect I'd like to discuss is use of Boyer-Moore and
 related fast finding techniques. Currently the use of B-M is explicit
 and a bit awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 *****

 Indeed, reading this, it seems like it could be generalized easily. You
 need a random-access range with a defined length. So the usage of such
 an algo could indeed be done transparently. Hmm...
The current B-M implementation already works with any random-access range with length, but it's not integrated transparently with the regular find. I wonder what the threshold conditions might look like.
I'm sorry I didn't look at your implementation, as I don't know Boyer-Moore that much. Does it need to know the alphabet size or do you use another variant? Philippe
Jul 19 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 02:12 PM, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:


     I agree that a findAll function yielding a range is nice, however
     it's not the simplest interface. I think there should be a function
     that just finds something somewhere.


 Yes, I agree that's the common case.
 What I personally would like is a way to find an element in an ordered
 container (binary tree...)
Wouldn't that be a member of the binary tree?
     Aren't such scenarios better served by putting the limits inside the
     predicate? Otherwise the list of what can be done could go on forever.


 The problem is that, as the predicate is passed at compile-time, the
 limits must be CT-defined too. I'd like the limits to be defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.
I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]);
 A not-so-uncommon scenario for me is having one range delivering limits
 (or zones to explore, or any other runtime arguments), and finding a
 value inside those limits in another range. Ideally, I'd like to use
 find() on a spatial structure like an R-tree. (
 http://en.wikipedia.org/wiki/R-tree ).

 But I admit it's not that common.
Unless you manage to define appropriate and generally applicable iterators, I think that would overgeneralize find().
 Anything that can mimic regex functions on generic ranges is good, IMO.
 I regularly find myself wanting to do some regex-like matching. We do
 not need real pattern matching but even with simplified predicates, we
 can do many things.
I agree we're in desperate need for a good range-based, character-divorced, statically-computed regex engine.
 I'm sorry I didn't look at your implementation, as I don't know
 Boyer-Moore that much. Does it need to know the alphabet size or do you
 use another variant?
My explanation of B-M would do little more than pasting information from various places on the Net. Andrei
Jul 19 2010
parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 What I personally would like is a way to find an element in an ordered
 container (binary tree...)
Wouldn't that be a member of the binary tree?
Yes, indeed. Unless I define some cousin to range, a recursive range, to iterate on any recursive container, like trees or graphs and then devise algorithms on them. I played with this idea a few weeks ago but went to do other things. Anyway, carry on. Btw, I still plan to update some simple generic graphs and trees structs to obey TotalContainer interface when it makes sense, and update my graph algorithms accordingly. If I ever get somewhere, I'll post something.
     Aren't such scenarios better served by putting the limits inside the
    predicate? Otherwise the list of what can be done could go on forever.


 The problem is that, as the predicate is passed at compile-time, the
 limits must be CT-defined too. I'd like the limits to be defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.
I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]); Ah yes, but I regularly use algorithms structs as return values, like this:
auto nTimes(E, R)(E multiplier, R range) { return map!((ElementType!R e) { return e*multiplier;})(range); } Error: delegate std.algorithm.__dgliteral2 cannot access frame of function __dgliteral2| I ran into this yesterday while trying to encode power series as (possibly infinite) ranges, and getting problems for the operators overloading. Too bad, I got to the point where I could calculate sin/cos/exp values quite precisely and then could easily define derivation and such. There are workaround, but they are not as handy. I had the same problems some time ago, trying to define predicates on the fly, though I cannot right now remember what it was. I know it's not a problem in std.algorithm per se, but it's a limit on its usefulness, as far as I'm concerned.
  Anything that can mimic regex functions on generic ranges is good, IMO.
 I regularly find myself wanting to do some regex-like matching. We do
 not need real pattern matching but even with simplified predicates, we
 can do many things.
I agree we're in desperate need for a good range-based, character-divorced, statically-computed regex engine.
Oh, a full statically computed engine would be the Graal. But I did not go that far. Being able to simply match, split and replace at runtime on any range would be good also, without going all regex. Heck, I may even have already coded something like this, I'll go and check. What I'm trying to do right now as a back-burner task is a function transforming a range in a set-like range (each element is present only once), but testing for "did I already see that?" with a user-provided predicate.
  I'm sorry I didn't look at your implementation, as I don't know
 Boyer-Moore that much. Does it need to know the alphabet size or do you
 use another variant?
My explanation of B-M would do little more than pasting information from various places on the Net.
OK. Philippe
Jul 19 2010
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 20.07.2010 0:05, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> 
 wrote:


         What I personally would like is a way to find an element in an
         ordered
         container (binary tree...)


     Wouldn't that be a member of the binary tree?


 Yes, indeed. Unless I define some cousin to range, a recursive range, 
 to iterate on any recursive container, like trees or graphs and then 
 devise algorithms on them. I played with this idea a few weeks ago but 
 went to do other things. Anyway, carry on.
 Btw, I still plan to update some simple generic graphs and trees 
 structs to obey TotalContainer interface when it makes sense, and 
 update my graph algorithms accordingly. If I ever get somewhere, I'll 
 post something.


            Aren't such scenarios better served by putting the limits
         inside the
            predicate? Otherwise the list of what can be done could go
         on forever.


         The problem is that, as the predicate is passed at
         compile-time, the
         limits must be CT-defined too. I'd like the limits to be
         defined at
         runtime. I run into this regularly, while using predicates in
         std.algorithms.


     I think this is a misunderstanding. All of std.algorithm works
     with delegates:


        int[] a = [ 1, 4, 2, 3 ];
        int b = 2;
        assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 Ah yes, but I regularly use algorithms structs as return values, like 
 this:

 auto nTimes(E, R)(E multiplier, R range)
 {
     return map!((ElementType!R e) { return e*multiplier;})(range);
 }
Try this workaround, replacing delegate with nested function, works for me: auto nTimes(E, R)(E multiplier, R range){ ElementType!R fun(ElementType!R e) { return e*multiplier; } return map!(fun)(range); } For some reason, compiler never plays nice with map, and I believe there are some issues with current implementation (2.047 release ) that Andrei (hopefully) fixed in recent commit.
 Philippe
-- Dmitry Olshansky
Jul 19 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 05:36 PM, Dmitry Olshansky wrote:
 On 20.07.2010 0:05, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:


 What I personally would like is a way to find an element in an
 ordered
 container (binary tree...)


 Wouldn't that be a member of the binary tree?


 Yes, indeed. Unless I define some cousin to range, a recursive range,
 to iterate on any recursive container, like trees or graphs and then
 devise algorithms on them. I played with this idea a few weeks ago but
 went to do other things. Anyway, carry on.
 Btw, I still plan to update some simple generic graphs and trees
 structs to obey TotalContainer interface when it makes sense, and
 update my graph algorithms accordingly. If I ever get somewhere, I'll
 post something.


 Aren't such scenarios better served by putting the limits
 inside the
 predicate? Otherwise the list of what can be done could go
 on forever.


 The problem is that, as the predicate is passed at
 compile-time, the
 limits must be CT-defined too. I'd like the limits to be
 defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.


 I think this is a misunderstanding. All of std.algorithm works
 with delegates:


 int[] a = [ 1, 4, 2, 3 ];
 int b = 2;
 assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 Ah yes, but I regularly use algorithms structs as return values, like
 this:

 auto nTimes(E, R)(E multiplier, R range)
 {
 return map!((ElementType!R e) { return e*multiplier;})(range);
 }
Try this workaround, replacing delegate with nested function, works for me: auto nTimes(E, R)(E multiplier, R range){ ElementType!R fun(ElementType!R e) { return e*multiplier; } return map!(fun)(range); } For some reason, compiler never plays nice with map, and I believe there are some issues with current implementation (2.047 release ) that Andrei (hopefully) fixed in recent commit.
Yes, I'd discussed this with Walter on Friday and he said the semantics should be identical in the two cases. So we're dealing with a compiler bug. Andrei
Jul 19 2010
prev sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Jul 20, 2010 at 00:36, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:


 Ah yes, but I regularly use algorithms structs as return values, like this:
 auto nTimes(E, R)(E multiplier, R range)
 {
    return map!((ElementType!R e) { return e*multiplier;})(range);
 }
 Try this workaround, replacing delegate with nested function, works for me:

 auto nTimes(E, R)(E multiplier, R range){
    ElementType!R fun(ElementType!R e) { return e*multiplier; }
    return map!(fun)(range);
 }
 For some reason, compiler never plays nice with map, and I believe there
 are some issues with current implementation (2.047 release ) that Andrei
 (hopefully) fixed in recent commit.
Hmm, using a nested function was the first thing I tried, and it didn't work at the time. Good to know it's a bug. right now, it does not work, though: void main() { auto arr = [0,1,2,3,4]; writeln(array(nTimes(3, arr))); // prints 0, then four big numbers: it's stomping on another part of memory? } Strangely, even using 0 as a multiplier does not work. Using .save() in nTimes neither. You see, these kinds of issues all have solutions (in that particular case, the workaround is a bit complicated but works in practice), but they are cumbersome and limit my usage of std.algorithms. FWIW, the workaround I used there is to take the multiplier, call repeat on it to make it a range (3,3,3,3 ...), zip the two ranges together and call map with a function like this: times(A,B)(Tuple!(A,B) t) { return t._0*t._1;} so, you get: (0,3), (1,3), (2,3), ... and then 0,3,6, etc. Extension of this idea saved me many times. Philippe
Jul 19 2010
prev sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Jul 18, 2010 at 06:19, Jonathan M Davis wrote:
*****
I think that the problem with find() is not so much find() but it's
documentation.
In all honesty, anything with a return type like FindResult!(Range, Ranges)
is
going to scare people off.
*****

What should be done (in a perfect world) is to have a simplified signature
that can become the complete signature on a click

auto find!()(haystack, needles)  =>  FindResult!(etc.  ...
the empty !() being a cue that it's a template function and that there is
more to it that meets the eye.

(edit: crap, you said it afterwards. I should not reply hours after first
reading a message)

*****
If anything, I've been more interested in canFind() and until() being made
to
match up with find(). I'd like to be able to give them both pretty much the
same
arguments and then get the bool from canFind() and the range that find()
would
have walked over in the case of until(). A function which gave you both the
range that until() would have given you as well as the one which find()
would
have given you would be nice as well. I previously opened a bug with
thoughts
along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888
*****

Ah, these are called takeWhile, dropWhile and span, in Haskell and other
sequence-heavy languages.
Or splitBy, cutAt, etc. Quite handy to have, I agree std.algorithm should
have them. But then, it's huge module already.


*****
In any case, I think that find() itself is more or less fine from a usage
standpoint. It's the docs that need help. The other thing would
be to add a lot more examples. That would be a big boon for a lot of
std.algorithm functions. They're not necessarily hard to use, and the
examples
show it much more clearly than the signatures often do.
*****

I sometimes dream of a complement to Phobos that would demonstrate its use,
either by providing lots of examples for each function or by taking a module
and tackling a task with it. Obviously, as both a complement to Phobos and a
way to demo-nstrate D, it should be called Deimos, the second moon of Mars,
Phobos' sibling.

Now, the Deimos Project, that's some cool name, if I may say so.

But I guess the wiki already does this.

Philippe
Jul 18 2010