www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Ranges of char and wchar

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
A discussion is building around 
https://github.com/D-Programming-Language/phobos/pull/2149, which is a 
nice initiative by Walter to allow Phobos users to avoid or control 
memory allocation.

First instance of the pull request copied the inputs into an output range.

The second instance (right now) creates an input range that lazily 
creates the result. The element type of that range is the encoding type 
of the first argument (i.e. char or wchar most of the time). This is 
different from string/wstring/etc element-wise iteration because it'll 
be done code unit-wise, not code point-wise.

We need a robust idiom for doing such string manipulation without 
allocation, for which setExtension is just an example. Going the output 
range route has nice things going for it because the output range 
decides the encoding in advance and then accepts via put() calls any 
encoding, with only the minimum transcoding needed.

However output range means the string operation will be done eagerly, 
whereas lazy has advantages (nice piping, saving on work etc).

On the other hand, there's the risk of becoming "more catholic than the 
Pope" by insisting on lazy string processing. Most string operations are 
eager, and insisting on a general framework for lazy encoded operations 
on strings may be an exaggeration.

D{iscuss,estroy}!


Andrei
May 08 2014
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 08, 2014 at 10:46:12AM -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a
 nice initiative by Walter to allow Phobos users to avoid or control
 memory allocation.
 
 First instance of the pull request copied the inputs into an output
 range.
 
 The second instance (right now) creates an input range that lazily
 creates the result.

I've thought about input ranges vs. output ranges for a bit. I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy. This way, you maximize the usability of the function -- it can participate in UFCS chains, compose with other std.algorithm functions, etc.. [...]
 We need a robust idiom for doing such string manipulation without
 allocation, for which setExtension is just an example. Going the
 output range route has nice things going for it because the output
 range decides the encoding in advance and then accepts via put() calls
 any encoding, with only the minimum transcoding needed.

The problem with this approach is that it hampers usage in UFCS pipelines.
 However output range means the string operation will be done eagerly,
 whereas lazy has advantages (nice piping, saving on work etc).
 
 On the other hand, there's the risk of becoming "more catholic than
 the Pope" by insisting on lazy string processing. Most string
 operations are eager, and insisting on a general framework for lazy
 encoded operations on strings may be an exaggeration.

In terms of usability, my opinion is that it makes most sense to return an input range. Let the user decide when the result should be copied into an output range (via std.algorithm.copy). Compare the following for constructing a path from a directory name, a filename, and an extension: Case 1: setExtension takes an output range: // Look how ugly this is: string dirname = ...; string filename = ...; // Need temp buffer to store result char[128] result; char[] outputRange = result[]; dirname.copy(outputRange); setExtension(filename, ".ext", outputRange); writeln(result); Case 2: setExtension takes an input range: // Look how clean this is: string dirname = ...; string filename = ...; writeln(chain(dirname, setExtension(filename, ".ext"))); In case 1, the user has to manually create various intermediate buffers to store intermediate results. I used a trivial example here, but in application code, the processing you need is usually far more complex. This means creating lots of intermediate buffers, making sure you link the right ones together, etc.. The code becomes very verbose, and becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers refer to which fragment of the result again? Oh oops, I think I passed the wrong output range to setExtension). In case 2, the user decides when a buffer is needed and when it's not. The function calls chain very nicely. The code is more readable, and easy to maintain (and needless allocations -- including temporary static buffers -- are eliminated). T -- Nearly all men can stand adversity, but if you want to test a man's character, give him power. -- Abraham Lincoln
May 08 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 11:19 AM, H. S. Teoh via Digitalmars-d wrote:
 In case 1, the user has to manually create various intermediate buffers
 to store intermediate results. I used a trivial example here, but in
 application code, the processing you need is usually far more complex.
 This means creating lots of intermediate buffers, making sure you link
 the right ones together, etc.. The code becomes very verbose, and
 becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers
 refer to which fragment of the result again? Oh oops, I think I passed
 the wrong output range to setExtension).

 In case 2, the user decides when a buffer is needed and when it's not.
 The function calls chain very nicely. The code is more readable, and
 easy to maintain (and needless allocations -- including temporary static
 buffers -- are eliminated).

I think you nailed it. Being able to eliminate temporary buffers is a big win. The fastest way to manage allocated memory is to not need allocated memory.
May 08 2014
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 5/8/2014 2:19 PM, H. S. Teoh via Digitalmars-d wrote:
 This way, you maximize the usability of the function -- it can
 participate in UFCS chains, compose with other std.algorithm functions,
 etc..

I think that actually raises an issue worth considering. I posit that the results of an InputRange setExtention don't make any sense with most std.algorithm functions: What's an InputRange? It's a set of elements. So what are the "elements" of a setExtention? Uhh... The result of setExtention really just has exactly one "logical" element: the resulting filesystem path. But that's not necessarily the *actual* result. The actual result is the "elements" of "however the hell setExtention's internal algorithm happened to feel like plopping them out". What are you going to zip() them, map() them, sort() them? All completely meaningless. The "elements" of setExtention's InputRange are NOT defined. At least not in any well-encapsulated way. And really, they SHOULDN'T be defined. So we have setExtention, and anything that follows it's model, returning what's more or less an undefined result that's basically useless for pretty much anything besides plopping straight into array() or copy(). If the big benefit of going InputRange instead of OutputRange for such functions is just simply call chaining, then maybe we're solving the wrong problem and should instead be looking at ways to improve the usage of OutputRange-based functions.
May 08 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/8/14, 1:41 PM, "Luís Marques" <luis luismarques.eu>" wrote:
 On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
 Digitalmars-d wrote:
 I've thought about input ranges vs. output ranges for a bit.  I think it
 doesn't make sense for functions that process data to take an output
 range: output ranges are data sinks, and should only be used for the
 endpoint of a data processing pipeline. Since the string function
 doesn't know whether or not it's the last in a pipeline (only the
 calling code can know this), it should return an input range. If the
 user code wants to put the result into an output range, then it should
 simply use std.algorithm.copy.

I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.

Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch* Andrei
May 08 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice
 initiative by Walter to allow Phobos users to avoid or control memory
allocation.

The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn
May 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/8/14, 11:27 AM, Walter Bright wrote:
 On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a
 nice
 initiative by Walter to allow Phobos users to avoid or control memory
 allocation.

The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn

4. Rule of least surprise Some may be surprised that auto c = setExtension(a, b) doesn't actually just do it. So changing a and/or b before using c would yield surprising result for someone coming from a background of eager-string languages. Andrei
May 08 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 2:20 PM, Andrei Alexandrescu wrote:
 Some may be surprised that auto c = setExtension(a, b) doesn't actually just do
 it. So changing a and/or b before using c would yield surprising result for
 someone coming from a background of eager-string languages.

It's true that when I first encountered C#'s LINQ, I was surprised that it was lazy. It's also true that most of std.algorithm is lazy. Apart from coming up with a new naming convention (and renaming algorithms in Phobos), I don't see any obvious solution to what's lazy and what's not. One possibility is to informally (i.e. in the documentation rather than the core language spec) call something an 'algorithm' if it is lazy and 'function' if it is eager.
May 08 2014
parent Jacob Carlborg <doob me.com> writes:
On 08/05/14 23:33, Walter Bright wrote:

 It's true that when I first encountered C#'s LINQ, I was surprised that
 it was lazy.

 It's also true that most of std.algorithm is lazy. Apart from coming up
 with a new naming convention (and renaming algorithms in Phobos), I
 don't see any obvious solution to what's lazy and what's not.

 One possibility is to informally (i.e. in the documentation rather than
 the core language spec) call something an 'algorithm' if it is lazy and
 'function' if it is eager.

Don't know if it helps, but we could add a UDA indicating which functions are lazy. This wouldn't require any renaming of existing functions. -- /Jacob Carlborg
May 09 2014
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 5/8/2014 1:46 PM, Andrei Alexandrescu wrote:
 However output range means the string operation will be done eagerly,
 whereas lazy has advantages (nice piping, saving on work etc).

Isn't eagerness what we have array() for? Also, while naming is often just bikeshedding, in this case I think we really need to address it. It's unfortunate that the new InputRange version can't be overloaded with the non-range version, because distinguishing them with "setExt" vs "setExtention" is just a non-option as far as I'm concerned. But IMO that's the minor stuff, I think there's a bigger elephant in the room here: The various benefits of doing it as an InputRange instead of OutputRange, combined with the terrible mess that resulted from converting such a straightforward function from OutputRange to InputRange is yet another clear indication that we REALLY need a coroutine-like/C#-like way to create InputRanges. Do we really expect simple algorithms exploding into messes like that to become the norm in D? I certainly hope not. I'd rather just go back to opApply. I'm convinced this is a real problem for D moving forward, and I don't like continually seeing it get swept under the same "too unimportant for right now" rug as things that actually belong under there (like named parameters, ast macros, or pattern matching). Even C# is making us look really, really bad here, and C# can't even do generic code worth a damn.
May 08 2014
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 08, 2014 at 11:27:54AM -0700, Walter Bright via Digitalmars-d wrote:
 On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
A discussion is building around
https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice
initiative by Walter to allow Phobos users to avoid or control memory
allocation.

The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn

I think having setExtension() return an input range is the best solution. It will satisfy all 3 requirements: it's easy to make it pure, safe, nothrow, nogc (since it lazily generates the result); it's easy to compose with UFCS chains, and it's consistent with the rest of the range-based functions in Phobos. Having setExtension (or any string function) take an output range would break (2) -- you have to allocate intermediate buffers to serve as output ranges if you want to do further processing of the result. If setExtension returns an input range, then you can just use std.algorithm.copy to copy the result to an output range should you need to. Going the other way, you can't easily convert an output range into an input range. There is the issue of decoding, though. Perhaps setExtension should take a compile-time argument specifying which char type is desired in the result? So you could do: string filename, ext; wstring pathname_w = filename.setExtension!wchar(ext).array; dstring pathname_d = filename.setExtension!dchar(ext).array; This way, if the output char type is identical to the input char type, the function can bypass decoding. T -- "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next. -- (Stolen from the net)
May 08 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 However output range means the string operation will be done eagerly, whereas
 lazy has advantages (nice piping, saving on work etc).

The lazy vs eager issue seems at first blush to be an esoteric, potayto-potahto bikeshed. But I think it's more interesting than that. Lazy means that the computation is divided into two pieces: 1. creating an engine to do the calculations 2. feeding the data to the engine to generate a result This, of course, reminds me of regex. By creating the regex engine separately, and then using that engine for multiple sets of data, one can afford to spend extra computational effort at optimizing the engine. Creating the engine can even be done at compile time (as Dmitry's most excellent regex engine does). Engines can make it easier to parallelize computation. And lastly, it's easy to make an eager version by wrapping a lazy one. But to make a lazy version from an eager one means reimplementing it.
May 08 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

Or yield()-ing inside the eager one's sink. And note also there is such as thing as a stackless "fiber", so I'm not certain a full-fledged context-switching fiber would necessarily be required (although it might - stackless fibers do have limitations).
May 08 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

Or yield()-ing inside the eager one's sink.

The data is still supplied to it.
May 08 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 5/8/2014 4:13 PM, Walter Bright wrote:
 8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

Or yield()-ing inside the eager one's sink.

The data is still supplied to it.

I think I've lost track of what you mean here. What data is still supplied to what, and what point does that illustrate?
May 08 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/8/2014 4:24 PM, Nick Sabalausky wrote:
 On 5/8/2014 4:13 PM, Walter Bright wrote:
 8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

Or yield()-ing inside the eager one's sink.

The data is still supplied to it.

I think I've lost track of what you mean here. What data is still supplied to what, and what point does that illustrate?

When creating an engine vs calculating the result, the former does not require the actual data to be supplied. regex is a classic example of that.
May 08 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu 
wrote:
 A discussion is building around 
 https://github.com/D-Programming-Language/phobos/pull/2149, 
 which is a nice initiative by Walter to allow Phobos users to 
 avoid or control memory allocation.

 First instance of the pull request copied the inputs into an 
 output range.

 The second instance (right now) creates an input range that 
 lazily creates the result. The element type of that range is 
 the encoding type of the first argument (i.e. char or wchar 
 most of the time). This is different from string/wstring/etc 
 element-wise iteration because it'll be done code unit-wise, 
 not code point-wise.

 We need a robust idiom for doing such string manipulation 
 without allocation, for which setExtension is just an example. 
 Going the output range route has nice things going for it 
 because the output range decides the encoding in advance and 
 then accepts via put() calls any encoding, with only the 
 minimum transcoding needed.

 However output range means the string operation will be done 
 eagerly, whereas lazy has advantages (nice piping, saving on 
 work etc).

 On the other hand, there's the risk of becoming "more catholic 
 than the Pope" by insisting on lazy string processing. Most 
 string operations are eager, and insisting on a general 
 framework for lazy encoded operations on strings may be an 
 exaggeration.

 D{iscuss,estroy}!


 Andrei

A little thing I've been thinking about on a related note: someOutputRange <| someInputRange; equivalent to someInputRange |> someOutputRange; which is conceptually the same as someInputRange.copy(someOutputRange); perhaps implemented by introducing new operators opSink (in the output range) and opSource (in the input range). These would be invoked as a pair, opSource returning an input range which is passed to opSink. If not implemented, opSource defaults to `auto opSource() { return this; }` or even `alias opSource = this;`. opSink would default to a std.algorithm.copy style thing (can't really be std.algorithm.copy itself because you don't want the dependency). I'm not sure what ground this really breaks, if any, but it sure looks nice to me.
May 08 2014
prev sibling next sibling parent =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
Digitalmars-d wrote:
 I've thought about input ranges vs. output ranges for a bit.  I 
 think it
 doesn't make sense for functions that process data to take an 
 output
 range: output ranges are data sinks, and should only be used 
 for the
 endpoint of a data processing pipeline. Since the string 
 function
 doesn't know whether or not it's the last in a pipeline (only 
 the
 calling code can know this), it should return an input range. 
 If the
 user code wants to put the result into an output range, then it 
 should
 simply use std.algorithm.copy.

I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.
May 08 2014
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/8/14, 1:41 PM, "Lus Marques" <luis luismarques.eu>" wrote:
On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
Digitalmars-d wrote:
I've thought about input ranges vs. output ranges for a bit.  I
think it doesn't make sense for functions that process data to take
an output range: output ranges are data sinks, and should only be
used for the endpoint of a data processing pipeline. Since the
string function doesn't know whether or not it's the last in a
pipeline (only the calling code can know this), it should return an
input range. If the user code wants to put the result into an output
range, then it should simply use std.algorithm.copy.

I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.

Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch*

Yeah, I'm not so sure it's a good idea for std.format to be lazy. But there are valid use cases for both, so now I'm wondering if we should adopt Walter's idea of using a naming convention to distinguish between algorithms (lazy) and functions (eager). T -- WINDOWS = Will Install Needless Data On Whole System -- CompuMan
May 08 2014
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 08, 2014 at 02:54:32PM -0700, H. S. Teoh via Digitalmars-d wrote:
 On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via
Digitalmars-d wrote:
 On 5/8/14, 1:41 PM, "Lus Marques" <luis luismarques.eu>" wrote:


I agree with H. S. Teoh. Indeed, I was thinking of trying to create
an alternative version of std.format which returned an InputRange,
instead of taking an OutputRange. The benefit of this becomes
obvious when you want to use std.format() in your own ranges, which
currently impairs laziness, pipelining, avoiding memory
allocations, etc.

Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch*

Yeah, I'm not so sure it's a good idea for std.format to be lazy. But there are valid use cases for both, so now I'm wondering if we should adopt Walter's idea of using a naming convention to distinguish between algorithms (lazy) and functions (eager).

On second thoughts, perhaps std.format *should* be lazy, or at least, the core implementation should be lazy, and std.format should be an eager wrapper around it (using std.algorithm.copy to copy the result of the underlying formatting engine to the output range). The problem is, the current design of std.format is heavily dependent on output ranges (e.g., user-defined types that define a toString that takes a delegate as an output range), and changing this now is going to cause large-scale breakage of existing code. I don't see any clean way of continuing to support toString(delegate) without introducing yet more hidden allocations (to buffer the output of toString, for example), if we switch std.format to a lazy implementation. The only way I can think of to do this is to have language-level support for yield(), then you can hook things up this way: module std.format; auto format(A...)(string fmt, A args) { struct Result { ... property auto front() { // assume typeof(args[i]) = user-defined type args[i].toString((const(char)[] data) { // This switches to a different // fiber than the one running // toString. yield(data); }); ... } ... } return Result(...); } module usercode; struct T { void toString(scope void delegate(const(char)[]) sink) { ... // Each of these calls will be yielded by the // sink, so they will "block" until the outer // caller has processes the new data. sink("abc"); sink("def"); ... } } This requires heavy-duty yield() support in the language, though. I don't know how likely such a thing will make it into the language... T -- Winners never quit, quitters never win. But those who never quit AND never win are idiots.
May 08 2014
prev sibling next sibling parent =?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:
On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu
wrote:
 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }

 *ouch*

You might think that's bad, but there's nothing in that example that is specific to formatting; that kind of behavior can happen with any lazy range taking input by ref. (I guess that's one drawback of providing functional programming without immutability?) To be clear, I'm not saying that the idea was to make the existing format function lazy, break existing code and surprise people with that change in semantics. format() can be an eager wrapper for the new lazy functionality. And eagerness/laziness should be part of the specification IMO, even more than asymptotic complexity should be part of the STL spec, since (bug > performance bug).
May 08 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/8/14, 10:46 AM, Andrei Alexandrescu wrote:
[snip]

Here's a list of algorithms in std.algorithm that would need to be 
looked at if we want to handle ranges of char and wchar properly (at 
least versions with predicates). When I say "looked at" I mean "modified 
appropriately to handle ranges of char and wchar in addition to strings".

========================

Searching: all  any  canFind  count  countUntil  commonPrefix  endsWith 
  find  findAdjacent  findAmong  findSkip  findSplit  findSplitAfter 
findSplitBefore  minCount  minPos  mismatch  skipOver  startsWith  until

Comparison: cmp  equal  levenshteinDistance  levenshteinDistanceAndPath 
  mismatch

Iteration: filter  filterBidirectional  group  map  reduce  splitter  uniq

Sorting: completeSort  isPartitioned  isSorted  makeIndex 
nextPermutation  nextEvenPermutation  partialSort  partition  partition3 
  schwartzSort  sort  topN  topNCopy

Set operations: (probably none, because set ops are rare on strings)

Mutation: bringToFront  copy  fill  initializeAll  moveAll  moveSome 
remove  reverse  strip  stripLeft  stripRight  swapRanges  uninitializedFill

========================

Of these, only a fraction (maybe 60-70%) are actually used with strings. 
Still that leaves a significant amount of work to do if we want 
std.algorithm to work smoothly with ranges of char/wchar. Here 
"smoothly" means "if it compiles and runs it won't do something UTF-goofy".



Andrei
May 08 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/8/14, 3:33 PM, Andrei Alexandrescu wrote:
 Of these, only a fraction (maybe 60-70%) are actually used with strings.
 Still that leaves a significant amount of work to do if we want
 std.algorithm to work smoothly with ranges of char/wchar. Here
 "smoothly" means "if it compiles and runs it won't do something UTF-goofy".

The larger point here (I forgot) is that user code that's written in the style of std.algorithm would need similar adjustments in order to work with ranges of char/wchar. Andrie
May 08 2014
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Classification of functions in http://dlang.org/phobos/std_path.html:

1. Functions that allocate GC memory and return an array:

   dirName, setExtension, defaultExtension, buildPath, buildNormalizedPath, 
absolutePath, relativePath, expandTilde

These are prime candidates to be converted into algorithms.


2. Functions that return a slice of their input arrays:

   baseName, rootName, driveName, stripDrive, extension,

Should be enhanced to accept RandomAccessRange and produce RandomAccessRange


3. Functions that return a range:

   pathSplitter

Enhance to accept RandomAccessRange


4. Functions that take an array and compute a value:

   isRooted, isAbsolute, filenameCmp, globMatch, isValidPath

These are reducers, and should be enhanced to accept input ranges.
May 08 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Thursday, 8 May 2014 at 22:27:11 UTC, Luís Marques wrote:
 (I guess that's one
 drawback of providing functional programming without
 immutability?)

Another issue is iteration might be not repeatable, especially if a closure accidentally slips into the range. In C# iteration is not destructing, in D one can save a range.
May 09 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu 
wrote:
 Interesting. So then the range returned by format() will save 
 everything passed to it, which means...

 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }

 *ouch*

 Andrei

void fun(int[] a) { auto before = a.map!(a => a + 1); foreach (ref e; a) ++e; auto after = a.map!(a => a + 1); writeln(before, "\n--->\n", after); } Laziness and mutable references are not the best of friends.
May 10 2014
prev sibling parent "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu 
wrote:
 A discussion is building around 
 https://github.com/D-Programming-Language/phobos/pull/2149, 
 which is a nice initiative by Walter to allow Phobos users to 
 avoid or control memory allocation.

 First instance of the pull request copied the inputs into an 
 output range.

 The second instance (right now) creates an input range that 
 lazily creates the result. The element type of that range is 
 the encoding type of the first argument (i.e. char or wchar 
 most of the time). This is different from string/wstring/etc 
 element-wise iteration because it'll be done code unit-wise, 
 not code point-wise.

I'm with H. S. Teoh that we should prefer the input range approach. My limited experience with output ranges suggests to me that they really need to be the end point, where the data is actually leaving the program (writing to disk, over a network). And as Teoh mentions, if you need it in specific memory you can copy an input range into it. D is already fairly lazy, and .array() is great for doing eager evaluation into the GC. For strings having something similar which allows for the original encoding to remain would be good.
May 13 2014