digitalmars.D - algorithm API design question

Andrei Alexandrescu (16/16) Sep 29 2013 I need a function that finds a run of length k of elements equal to x in...

Timon Gehr (7/23) Sep 30 2013 If the specialized algorithm runs faster, this should be done anyway.

Peter Alexander (6/19) Sep 30 2013 This.

Timon Gehr (5/21) Sep 30 2013 Boyer-Moore will already do this (and more). I guess you can be sure to

Andrei Alexandrescu (7/30) Sep 30 2013 B-M has significant preprocessing costs, so it's not appropriate for all...

Timon Gehr (7/39) Sep 30 2013 (My point was that the compiler knows it statically too, and the set of
Timon Gehr (3/8) Sep 30 2013 I guess another point is that Boyer-Moore requires more capabilities of

Andrei Alexandrescu (3/5) Sep 30 2013 Upon mismatch, restart search right after the mismatch point.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I need a function that finds a run of length k of elements equal to x in 
a range r, and I presume such a simple yet nontrivial algorithm (a 
dozen-liner) should be part of std.algorithm.

This raises an interesting question - what form should the API have. I 
see three options:

1. The existing find(r1, r2) figures out a way to dynamically check that 
r2 is a run of identical elements and tailor the argument accordingly. 
For example, during Boyer-Moore initialization that test comes cheap.

2. We should statically specialize find(r1, r2) for the case r2 is an 
instance of Repeat. The specialization runs the tailored algorithm. The 
user is supposed to call e.g. find(r, repeat(x, k)) to benefit of the 
specialized algorithm.

3. We should introduce a new function called e.g. findRun(r, x, k).

Each option has advantages and disadvantages. What do you all think is 
the best API?


Andrei

Sep 29 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/30/2013 04:35 AM, Andrei Alexandrescu wrote:
 I need a function that finds a run of length k of elements equal to x in
 a range r, and I presume such a simple yet nontrivial algorithm (a
 dozen-liner) should be part of std.algorithm.

 This raises an interesting question - what form should the API have. I
 see three options:

 1. The existing find(r1, r2) figures out a way to dynamically check that
 r2 is a run of identical elements and tailor the argument accordingly.
 For example, during Boyer-Moore initialization that test comes cheap.

This kind of thing tends to not pay off in the average case.

 2. We should statically specialize find(r1, r2) for the case r2 is an
 instance of Repeat. The specialization runs the tailored algorithm. The
 user is supposed to call e.g. find(r, repeat(x, k)) to benefit of the
 specialized algorithm.

If the specialized algorithm runs faster, this should be done anyway.
What is the optimization that would the specialized algorithm run faster 
though?

 3. We should introduce a new function called e.g. findRun(r, x, k).

:(

 Each option has advantages and disadvantages. What do you all think is
 the best API?


 Andrei

We already have 2.

Sep 30 2013

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Monday, 30 September 2013 at 09:15:22 UTC, Timon Gehr wrote:
 On 09/30/2013 04:35 AM, Andrei Alexandrescu wrote:
 2. We should statically specialize find(r1, r2) for the case 
 r2 is an
 instance of Repeat. The specialization runs the tailored 
 algorithm. The
 user is supposed to call e.g. find(r, repeat(x, k)) to benefit 
 of the
 specialized algorithm.

 If the specialized algorithm runs faster, this should be done 
 anyway.
 What is the optimization that would the specialized algorithm 
 run faster though?

This.

The faster algorithm is to presumably jump along for random 
access ranges. e.g. if you are searching for xxxxxxxxxx then 
check r[9], if isn't x, check r[19] and so on. If it is x then 
check the preceding 10 elements.

Sep 30 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/30/2013 11:59 AM, Peter Alexander wrote:
 On Monday, 30 September 2013 at 09:15:22 UTC, Timon Gehr wrote:
 On 09/30/2013 04:35 AM, Andrei Alexandrescu wrote:
 2. We should statically specialize find(r1, r2) for the case r2 is an
 instance of Repeat. The specialization runs the tailored algorithm. The
 user is supposed to call e.g. find(r, repeat(x, k)) to benefit of the
 specialized algorithm.

 If the specialized algorithm runs faster, this should be done anyway.
 What is the optimization that would the specialized algorithm run
 faster though?

 This.

 The faster algorithm is to presumably jump along for random access
 ranges. e.g. if you are searching for xxxxxxxxxx then check r[9], if
 isn't x, check r[19] and so on. If it is x then check the preceding 10
 elements.

Boyer-Moore will already do this (and more). I guess you can be sure to 
get rid of the helper tables by specialising it manually. (But there is 
no fundamental obstacle that would prevent the compiler to perform this 
partial evaluation automatically).

Sep 30 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/30/13 5:04 AM, Timon Gehr wrote:
 On 09/30/2013 11:59 AM, Peter Alexander wrote:
 On Monday, 30 September 2013 at 09:15:22 UTC, Timon Gehr wrote:
 On 09/30/2013 04:35 AM, Andrei Alexandrescu wrote:
 2. We should statically specialize find(r1, r2) for the case r2 is an
 instance of Repeat. The specialization runs the tailored algorithm. The
 user is supposed to call e.g. find(r, repeat(x, k)) to benefit of the
 specialized algorithm.

 If the specialized algorithm runs faster, this should be done anyway.
 What is the optimization that would the specialized algorithm run
 faster though?

 This.

 The faster algorithm is to presumably jump along for random access
 ranges. e.g. if you are searching for xxxxxxxxxx then check r[9], if
 isn't x, check r[19] and so on. If it is x then check the preceding 10
 elements.

 Boyer-Moore will already do this (and more). I guess you can be sure to
 get rid of the helper tables by specialising it manually. (But there is
 no fundamental obstacle that would prevent the compiler to perform this
 partial evaluation automatically).

B-M has significant preprocessing costs, so it's not appropriate for all 
cases. A function specialized in finding runs would be optimal without 
preprocessing. Another way of looking at it: it's wasteful to pass 
repeat(x, k) into Boyer-Moore to have it rediscover what the caller knew 
already statically.


Andrei

Sep 30 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/30/2013 05:45 PM, Andrei Alexandrescu wrote:
 On 9/30/13 5:04 AM, Timon Gehr wrote:
 On 09/30/2013 11:59 AM, Peter Alexander wrote:
 On Monday, 30 September 2013 at 09:15:22 UTC, Timon Gehr wrote:
 On 09/30/2013 04:35 AM, Andrei Alexandrescu wrote:
 2. We should statically specialize find(r1, r2) for the case r2 is an
 instance of Repeat. The specialization runs the tailored algorithm.
 The
 user is supposed to call e.g. find(r, repeat(x, k)) to benefit of the
 specialized algorithm.

 If the specialized algorithm runs faster, this should be done anyway.
 What is the optimization that would the specialized algorithm run
 faster though?

 This.

 The faster algorithm is to presumably jump along for random access
 ranges. e.g. if you are searching for xxxxxxxxxx then check r[9], if
 isn't x, check r[19] and so on. If it is x then check the preceding 10
 elements.

 Boyer-Moore will already do this (and more). I guess you can be sure to
 get rid of the helper tables by specialising it manually. (But there is
 no fundamental obstacle that would prevent the compiler to perform this
 partial evaluation automatically).

 B-M has significant preprocessing costs, so it's not appropriate for all
 cases. A function specialized in finding runs would be optimal without
 preprocessing. Another way of looking at it: it's wasteful to pass
 repeat(x, k) into Boyer-Moore to have it rediscover what the caller knew
 already statically.


 Andrei

(My point was that the compiler knows it statically too, and the set of 
program transformations necessary to get rid of it in a standard 
Boyer-Moore implementation should be simple enough. Eg, any occurrence 
of needle[j]!=needle[i] will be known to be false. Probably current 
compilers will not inline array lookups and/or get rid of the allocation 
though.)

Sep 30 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/30/2013 05:45 PM, Andrei Alexandrescu wrote:

 B-M has significant preprocessing costs, so it's not appropriate for all
 cases. A function specialized in finding runs would be optimal without
 preprocessing.
 ...

I guess another point is that Boyer-Moore requires more capabilities of 
the input type.

Sep 30 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/30/13 2:15 AM, Timon Gehr wrote:
 What is the optimization that would the specialized algorithm run faster
 though?

Upon mismatch, restart search right after the mismatch point.

Andrei

Sep 30 2013

D Programming

C/C++ Programming

Other

digitalmars.D - algorithm API design question