digitalmars.D - Ranges of char and wchar

Andrei Alexandrescu (23/23) May 08 2014 A discussion is building around

H. S. Teoh via Digitalmars-d (51/73) May 08 2014 I've thought about input ranges vs. output ranges for a bit. I think it

Walter Bright (4/16) May 08 2014 I think you nailed it.
Nick Sabalausky (21/24) May 08 2014 I think that actually raises an issue worth considering. I posit that
=?UTF-8?B?Ikx1w61z?= Marques" (8/23) May 08 2014 I agree with H. S. Teoh. Indeed, I was thinking of trying to

Andrei Alexandrescu (12/28) May 08 2014 Interesting. So then the range returned by format() will save everything...

H. S. Teoh via Digitalmars-d (9/39) May 08 2014 [...]
H. S. Teoh via Digitalmars-d (52/79) May 08 2014 [...]
=?UTF-8?B?Ikx1w61z?= Marques" (13/21) May 08 2014 You might think that's bad, but there's nothing in that example

Kagamin (4/7) May 09 2014 Another issue is iteration might be not repeatable, especially if

Peter Alexander (10/21) May 10 2014 void fun(int[] a)

Walter Bright (7/10) May 08 2014 The setExtension() function is itself not very important, but what is im...

H. S. Teoh via Digitalmars-d (25/38) May 08 2014 I think having setExtension() return an input range is the best
Andrei Alexandrescu (6/18) May 08 2014 4. Rule of least surprise

Walter Bright (8/11) May 08 2014 It's true that when I first encountered C#'s LINQ, I was surprised that ...

Jacob Carlborg (6/14) May 09 2014 Don't know if it helps, but we could add a UDA indicating which

Nick Sabalausky (21/23) May 08 2014 Isn't eagerness what we have array() for?
Walter Bright (14/16) May 08 2014 The lazy vs eager issue seems at first blush to be an esoteric, potayto-...

Nick Sabalausky (5/6) May 08 2014 Or yield()-ing inside the eager one's sink.

Walter Bright (2/7) May 08 2014 The data is still supplied to it.

Nick Sabalausky (3/11) May 08 2014 I think I've lost track of what you mean here. What data is still

Walter Bright (4/16) May 08 2014 When creating an engine vs calculating the result, the former does not r...

John Colvin (18/46) May 08 2014 A little thing I've been thinking about on a related note:
Andrei Alexandrescu (25/25) May 08 2014 On 5/8/14, 10:46 AM, Andrei Alexandrescu wrote:

Andrei Alexandrescu (5/9) May 08 2014 The larger point here (I forgot) is that user code that's written in the...

Walter Bright (14/14) May 08 2014 Classification of functions in http://dlang.org/phobos/std_path.html:
Jesse Phillips (11/23) May 13 2014 I'm with H. S. Teoh that we should prefer the input range

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

A discussion is building around 
https://github.com/D-Programming-Language/phobos/pull/2149, which is a 
nice initiative by Walter to allow Phobos users to avoid or control 
memory allocation.

First instance of the pull request copied the inputs into an output range.

The second instance (right now) creates an input range that lazily 
creates the result. The element type of that range is the encoding type 
of the first argument (i.e. char or wchar most of the time). This is 
different from string/wstring/etc element-wise iteration because it'll 
be done code unit-wise, not code point-wise.

We need a robust idiom for doing such string manipulation without 
allocation, for which setExtension is just an example. Going the output 
range route has nice things going for it because the output range 
decides the encoding in advance and then accepts via put() calls any 
encoding, with only the minimum transcoding needed.

However output range means the string operation will be done eagerly, 
whereas lazy has advantages (nice piping, saving on work etc).

On the other hand, there's the risk of becoming "more catholic than the 
Pope" by insisting on lazy string processing. Most string operations are 
eager, and insisting on a general framework for lazy encoded operations 
on strings may be an exaggeration.

D{iscuss,estroy}!


Andrei

May 08 2014

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, May 08, 2014 at 10:46:12AM -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a
 nice initiative by Walter to allow Phobos users to avoid or control
 memory allocation.
 
 First instance of the pull request copied the inputs into an output
 range.
 
 The second instance (right now) creates an input range that lazily
 creates the result.

I've thought about input ranges vs. output ranges for a bit.  I think it
doesn't make sense for functions that process data to take an output
range: output ranges are data sinks, and should only be used for the
endpoint of a data processing pipeline. Since the string function
doesn't know whether or not it's the last in a pipeline (only the
calling code can know this), it should return an input range. If the
user code wants to put the result into an output range, then it should
simply use std.algorithm.copy.

This way, you maximize the usability of the function -- it can
participate in UFCS chains, compose with other std.algorithm functions,
etc..


[...]
 We need a robust idiom for doing such string manipulation without
 allocation, for which setExtension is just an example. Going the
 output range route has nice things going for it because the output
 range decides the encoding in advance and then accepts via put() calls
 any encoding, with only the minimum transcoding needed.

The problem with this approach is that it hampers usage in UFCS
pipelines.


 However output range means the string operation will be done eagerly,
 whereas lazy has advantages (nice piping, saving on work etc).
 
 On the other hand, there's the risk of becoming "more catholic than
 the Pope" by insisting on lazy string processing. Most string
 operations are eager, and insisting on a general framework for lazy
 encoded operations on strings may be an exaggeration.

[...]

In terms of usability, my opinion is that it makes most sense to return
an input range. Let the user decide when the result should be copied
into an output range (via std.algorithm.copy).

Compare the following for constructing a path from a directory name, a
filename, and an extension:

Case 1: setExtension takes an output range:

	// Look how ugly this is:
	string dirname = ...;
	string filename = ...;

	// Need temp buffer to store result
	char[128] result;
	char[] outputRange = result[];

	dirname.copy(outputRange);
	setExtension(filename, ".ext", outputRange);

	writeln(result);

Case 2: setExtension takes an input range:

	// Look how clean this is:
	string dirname = ...;
	string filename = ...;

	writeln(chain(dirname, setExtension(filename, ".ext")));

In case 1, the user has to manually create various intermediate buffers
to store intermediate results. I used a trivial example here, but in
application code, the processing you need is usually far more complex.
This means creating lots of intermediate buffers, making sure you link
the right ones together, etc.. The code becomes very verbose, and
becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers
refer to which fragment of the result again? Oh oops, I think I passed
the wrong output range to setExtension).

In case 2, the user decides when a buffer is needed and when it's not.
The function calls chain very nicely. The code is more readable, and
easy to maintain (and needless allocations -- including temporary static
buffers -- are eliminated).


T

-- 
Nearly all men can stand adversity, but if you want to test a man's character,
give him power. -- Abraham Lincoln

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 11:19 AM, H. S. Teoh via Digitalmars-d wrote:
 In case 1, the user has to manually create various intermediate buffers
 to store intermediate results. I used a trivial example here, but in
 application code, the processing you need is usually far more complex.
 This means creating lots of intermediate buffers, making sure you link
 the right ones together, etc.. The code becomes very verbose, and
 becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers
 refer to which fragment of the result again? Oh oops, I think I passed
 the wrong output range to setExtension).

 In case 2, the user decides when a buffer is needed and when it's not.
 The function calls chain very nicely. The code is more readable, and
 easy to maintain (and needless allocations -- including temporary static
 buffers -- are eliminated).

I think you nailed it.

Being able to eliminate temporary buffers is a big win. The fastest way to 
manage allocated memory is to not need allocated memory.

May 08 2014

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On 5/8/2014 2:19 PM, H. S. Teoh via Digitalmars-d wrote:
 This way, you maximize the usability of the function -- it can
 participate in UFCS chains, compose with other std.algorithm functions,
 etc..

I think that actually raises an issue worth considering. I posit that 
the results of an InputRange setExtention don't make any sense with most 
std.algorithm functions:

What's an InputRange? It's a set of elements. So what are the "elements" 
of a setExtention? Uhh...

The result of setExtention really just has exactly one "logical" 
element: the resulting filesystem path. But that's not necessarily the 
*actual* result. The actual result is the "elements" of "however the 
hell setExtention's internal algorithm happened to feel like plopping 
them out". What are you going to zip() them, map() them, sort() them? 
All completely meaningless.

The "elements" of setExtention's InputRange are NOT defined. At least 
not in any well-encapsulated way. And really, they SHOULDN'T be defined.

So we have setExtention, and anything that follows it's model, returning 
what's more or less an undefined result that's basically useless for 
pretty much anything besides plopping straight into array() or copy().

If the big benefit of going InputRange instead of OutputRange for such 
functions is just simply call chaining, then maybe we're solving the 
wrong problem and should instead be looking at ways to improve the usage 
of OutputRange-based functions.

May 08 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
Digitalmars-d wrote:
 I've thought about input ranges vs. output ranges for a bit.  I 
 think it
 doesn't make sense for functions that process data to take an 
 output
 range: output ranges are data sinks, and should only be used 
 for the
 endpoint of a data processing pipeline. Since the string 
 function
 doesn't know whether or not it's the last in a pipeline (only 
 the
 calling code can know this), it should return an input range. 
 If the
 user code wants to put the result into an output range, then it 
 should
 simply use std.algorithm.copy.

I agree with H. S. Teoh. Indeed, I was thinking of trying to
create an alternative version of std.format which returned an
InputRange, instead of taking an OutputRange. The benefit of this
becomes obvious when you want to use std.format() in your own
ranges, which currently impairs laziness, pipelining, avoiding
memory allocations, etc.

May 08 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/8/14, 1:41 PM, "Luís Marques" <luis luismarques.eu>" wrote:
 On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
 Digitalmars-d wrote:
 I've thought about input ranges vs. output ranges for a bit.  I think it
 doesn't make sense for functions that process data to take an output
 range: output ranges are data sinks, and should only be used for the
 endpoint of a data processing pipeline. Since the string function
 doesn't know whether or not it's the last in a pipeline (only the
 calling code can know this), it should return an input range. If the
 user code wants to put the result into an output range, then it should
 simply use std.algorithm.copy.

 I agree with H. S. Teoh. Indeed, I was thinking of trying to
 create an alternative version of std.format which returned an
 InputRange, instead of taking an OutputRange. The benefit of this
 becomes obvious when you want to use std.format() in your own
 ranges, which currently impairs laziness, pipelining, avoiding
 memory allocations, etc.

Interesting. So then the range returned by format() will save everything 
passed to it, which means...

int fun(int[] a)
{
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
}

*ouch*

Andrei

May 08 2014

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/8/14, 1:41 PM, "Lu�s Marques" <luis luismarques.eu>" wrote:
On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via
Digitalmars-d wrote:
I've thought about input ranges vs. output ranges for a bit.  I
think it doesn't make sense for functions that process data to take
an output range: output ranges are data sinks, and should only be
used for the endpoint of a data processing pipeline. Since the
string function doesn't know whether or not it's the last in a
pipeline (only the calling code can know this), it should return an
input range. If the user code wants to put the result into an output
range, then it should simply use std.algorithm.copy.

I agree with H. S. Teoh. Indeed, I was thinking of trying to create
an alternative version of std.format which returned an InputRange,
instead of taking an OutputRange. The benefit of this becomes obvious
when you want to use std.format() in your own ranges, which currently
impairs laziness, pipelining, avoiding memory allocations, etc.

 
 Interesting. So then the range returned by format() will save
 everything passed to it, which means...
 
 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }
 
 *ouch*

[...]

Yeah, I'm not so sure it's a good idea for std.format to be lazy. But
there are valid use cases for both, so now I'm wondering if we should
adopt Walter's idea of using a naming convention to distinguish between
algorithms (lazy) and functions (eager).


T

-- 
WINDOWS = Will Install Needless Data On Whole System -- CompuMan

May 08 2014

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, May 08, 2014 at 02:54:32PM -0700, H. S. Teoh via Digitalmars-d wrote:
 On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via
Digitalmars-d wrote:
 On 5/8/14, 1:41 PM, "Lu�s Marques" <luis luismarques.eu>" wrote:


[...]
I agree with H. S. Teoh. Indeed, I was thinking of trying to create
an alternative version of std.format which returned an InputRange,
instead of taking an OutputRange. The benefit of this becomes
obvious when you want to use std.format() in your own ranges, which
currently impairs laziness, pipelining, avoiding memory
allocations, etc.

 
 Interesting. So then the range returned by format() will save
 everything passed to it, which means...
 
 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }
 
 *ouch*

 [...]
 
 Yeah, I'm not so sure it's a good idea for std.format to be lazy. But
 there are valid use cases for both, so now I'm wondering if we should
 adopt Walter's idea of using a naming convention to distinguish
 between algorithms (lazy) and functions (eager).

[...]

On second thoughts, perhaps std.format *should* be lazy, or at least,
the core implementation should be lazy, and std.format should be an
eager wrapper around it (using std.algorithm.copy to copy the result of
the underlying formatting engine to the output range).

The problem is, the current design of std.format is heavily dependent on
output ranges (e.g., user-defined types that define a toString that
takes a delegate as an output range), and changing this now is going to
cause large-scale breakage of existing code. I don't see any clean way
of continuing to support toString(delegate) without introducing yet more
hidden allocations (to buffer the output of toString, for example), if
we switch std.format to a lazy implementation.

The only way I can think of to do this is to have language-level support
for yield(), then you can hook things up this way:

	module std.format;
	auto format(A...)(string fmt, A args) {
		struct Result {
			...
			 property auto front() {
				// assume typeof(args[i]) = user-defined type
				args[i].toString((const(char)[] data) {
					// This switches to a different
					// fiber than the one running
					// toString.
					yield(data);
				});
				...
			}
			...
		}
		return Result(...);
	}

	module usercode;
	struct T {
		void toString(scope void delegate(const(char)[]) sink)
		{
			...
			// Each of these calls will be yielded by the
			// sink, so they will "block" until the outer
			// caller has processes the new data.
			sink("abc");
			sink("def");
			...
		}
	}

This requires heavy-duty yield() support in the language, though. I
don't know how likely such a thing will make it into the language...


T

-- 
Winners never quit, quitters never win. But those who never quit AND never win
are idiots.

May 08 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu
wrote:
 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }

 *ouch*

You might think that's bad, but there's nothing in that example
that is specific to formatting; that kind of behavior can happen
with any lazy range taking input by ref. (I guess that's one
drawback of providing functional programming without
immutability?) To be clear, I'm not saying that the idea was to
make the existing format function lazy, break existing code and
surprise people with that change in semantics. format() can be an
eager wrapper for the new lazy functionality. And
eagerness/laziness should be part of the specification IMO, even
more than asymptotic complexity should be part of the STL spec,
since (bug > performance bug).

May 08 2014

"Kagamin" <spam here.lot> writes:

On Thursday, 8 May 2014 at 22:27:11 UTC, Luís Marques wrote:
 (I guess that's one
 drawback of providing functional programming without
 immutability?)

Another issue is iteration might be not repeatable, especially if 

not destructing, in D one can save a range.

May 09 2014

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu 
wrote:
 Interesting. So then the range returned by format() will save 
 everything passed to it, which means...

 int fun(int[] a)
 {
    auto before = format("Before: %s", a);
    foreach (ref e; a) ++e;
    auto after = format("After: %s", a);
    writeln(before, "\n--->\n", after);
 }

 *ouch*

 Andrei

void fun(int[] a)
{
    auto before = a.map!(a => a + 1);
    foreach (ref e; a) ++e;
    auto after = a.map!(a => a + 1);
    writeln(before, "\n--->\n", after);
}

Laziness and mutable references are not the best of friends.

May 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice
 initiative by Walter to allow Phobos users to avoid or control memory
allocation.

The setExtension() function is itself not very important, but what is important 
is an example for how to put together ranges.

Some design goals:

1. purity,  safe, nothrow,  nogc
2. composability
3. have them work in a consistent way, so there's less for a user to learn

May 08 2014

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, May 08, 2014 at 11:27:54AM -0700, Walter Bright via Digitalmars-d wrote:
 On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
A discussion is building around
https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice
initiative by Walter to allow Phobos users to avoid or control memory
allocation.

 
 The setExtension() function is itself not very important, but what is
 important is an example for how to put together ranges.
 
 Some design goals:
 
 1. purity,  safe, nothrow,  nogc
 2. composability
 3. have them work in a consistent way, so there's less for a user to learn

I think having setExtension() return an input range is the best
solution. It will satisfy all 3 requirements: it's easy to make it pure,
safe, nothrow, nogc (since it lazily generates the result); it's easy to
compose with UFCS chains, and it's consistent with the rest of the
range-based functions in Phobos.

Having setExtension (or any string function) take an output range would
break (2) -- you have to allocate intermediate buffers to serve as
output ranges if you want to do further processing of the result.

If setExtension returns an input range, then you can just use
std.algorithm.copy to copy the result to an output range should you need
to. Going the other way, you can't easily convert an output range into
an input range.

There is the issue of decoding, though. Perhaps setExtension should take
a compile-time argument specifying which char type is desired in the
result? So you could do:

	string filename, ext;
	wstring pathname_w = filename.setExtension!wchar(ext).array;
	dstring pathname_d = filename.setExtension!dchar(ext).array;

This way, if the output char type is identical to the input char
type, the function can bypass decoding.


T

-- 
"Outlook not so good." That magic 8-ball knows everything! I'll ask
about Exchange Server next. -- (Stolen from the net)

May 08 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/8/14, 11:27 AM, Walter Bright wrote:
 On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 A discussion is building around
 https://github.com/D-Programming-Language/phobos/pull/2149, which is a
 nice
 initiative by Walter to allow Phobos users to avoid or control memory
 allocation.

 The setExtension() function is itself not very important, but what is
 important is an example for how to put together ranges.

 Some design goals:

 1. purity,  safe, nothrow,  nogc
 2. composability
 3. have them work in a consistent way, so there's less for a user to learn

4. Rule of least surprise

Some may be surprised that auto c = setExtension(a, b) doesn't actually 
just do it. So changing a and/or b before using c would yield surprising 
result for someone coming from a background of eager-string languages.


Andrei

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 2:20 PM, Andrei Alexandrescu wrote:
 Some may be surprised that auto c = setExtension(a, b) doesn't actually just do
 it. So changing a and/or b before using c would yield surprising result for
 someone coming from a background of eager-string languages.


lazy.

It's also true that most of std.algorithm is lazy. Apart from coming up with a 
new naming convention (and renaming algorithms in Phobos), I don't see any 
obvious solution to what's lazy and what's not.

One possibility is to informally (i.e. in the documentation rather than the
core 
language spec) call something an 'algorithm' if it is lazy and 'function' if it 
is eager.

May 08 2014

Jacob Carlborg <doob me.com> writes:

On 08/05/14 23:33, Walter Bright wrote:


 it was lazy.

 It's also true that most of std.algorithm is lazy. Apart from coming up
 with a new naming convention (and renaming algorithms in Phobos), I
 don't see any obvious solution to what's lazy and what's not.

 One possibility is to informally (i.e. in the documentation rather than
 the core language spec) call something an 'algorithm' if it is lazy and
 'function' if it is eager.

Don't know if it helps, but we could add a UDA indicating which 
functions are lazy. This wouldn't require any renaming of existing 
functions.

-- 
/Jacob Carlborg

May 09 2014

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On 5/8/2014 1:46 PM, Andrei Alexandrescu wrote:
 However output range means the string operation will be done eagerly,
 whereas lazy has advantages (nice piping, saving on work etc).

Isn't eagerness what we have array() for?

Also, while naming is often just bikeshedding, in this case I think we 
really need to address it. It's unfortunate that the new InputRange 
version can't be overloaded with the non-range version, because 
distinguishing them with "setExt" vs "setExtention" is just a non-option 
as far as I'm concerned.

But IMO that's the minor stuff, I think there's a bigger elephant in the 
room here:

The various benefits of doing it as an InputRange instead of 
OutputRange, combined with the terrible mess that resulted from 
converting such a straightforward function from OutputRange to 
InputRange is yet another clear indication that we REALLY need a 

simple algorithms exploding into messes like that to become the norm in 
D? I certainly hope not. I'd rather just go back to opApply.

I'm convinced this is a real problem for D moving forward, and I don't 
like continually seeing it get swept under the same "too unimportant for 
right now" rug as things that actually belong under there (like named

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
 However output range means the string operation will be done eagerly, whereas
 lazy has advantages (nice piping, saving on work etc).


The lazy vs eager issue seems at first blush to be an esoteric, potayto-potahto 
bikeshed.

But I think it's more interesting than that.

Lazy means that the computation is divided into two pieces:

1. creating an engine to do the calculations
2. feeding the data to the engine to generate a result

This, of course, reminds me of regex. By creating the regex engine separately, 
and then using that engine for multiple sets of data, one can afford to spend 
extra computational effort at optimizing the engine. Creating the engine can 
even be done at compile time (as Dmitry's most excellent regex engine does).

Engines can make it easier to parallelize computation.

And lastly, it's easy to make an eager version by wrapping a lazy one. But to 
make a lazy version from an eager one means reimplementing it.

May 08 2014

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

Or yield()-ing inside the eager one's sink.

And note also there is such as thing as a stackless "fiber", so I'm not 
certain a full-fledged context-switching fiber would necessarily be 
required (although it might - stackless fibers do have limitations).

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

 Or yield()-ing inside the eager one's sink.

The data is still supplied to it.

May 08 2014

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On 5/8/2014 4:13 PM, Walter Bright wrote:
 8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

 Or yield()-ing inside the eager one's sink.

 The data is still supplied to it.

I think I've lost track of what you mean here. What data is still 
supplied to what, and what point does that illustrate?

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/8/2014 4:24 PM, Nick Sabalausky wrote:
 On 5/8/2014 4:13 PM, Walter Bright wrote:
 8/2014 12:22 PM, Nick Sabalausky wrote:
 On 5/8/2014 2:46 PM, Walter Bright wrote:
 But to make a lazy version from an eager one means reimplementing it.

 Or yield()-ing inside the eager one's sink.

 The data is still supplied to it.

 I think I've lost track of what you mean here. What data is still supplied to
 what, and what point does that illustrate?

When creating an engine vs calculating the result, the former does not require 
the actual data to be supplied.

regex is a classic example of that.

May 08 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu 
wrote:
 A discussion is building around 
 https://github.com/D-Programming-Language/phobos/pull/2149, 
 which is a nice initiative by Walter to allow Phobos users to 
 avoid or control memory allocation.

 First instance of the pull request copied the inputs into an 
 output range.

 The second instance (right now) creates an input range that 
 lazily creates the result. The element type of that range is 
 the encoding type of the first argument (i.e. char or wchar 
 most of the time). This is different from string/wstring/etc 
 element-wise iteration because it'll be done code unit-wise, 
 not code point-wise.

 We need a robust idiom for doing such string manipulation 
 without allocation, for which setExtension is just an example. 
 Going the output range route has nice things going for it 
 because the output range decides the encoding in advance and 
 then accepts via put() calls any encoding, with only the 
 minimum transcoding needed.

 However output range means the string operation will be done 
 eagerly, whereas lazy has advantages (nice piping, saving on 
 work etc).

 On the other hand, there's the risk of becoming "more catholic 
 than the Pope" by insisting on lazy string processing. Most 
 string operations are eager, and insisting on a general 
 framework for lazy encoded operations on strings may be an 
 exaggeration.

 D{iscuss,estroy}!


 Andrei

A little thing I've been thinking about on a related note:

someOutputRange <| someInputRange;

equivalent to

someInputRange |> someOutputRange;

which is conceptually the same as

someInputRange.copy(someOutputRange);

perhaps implemented by introducing new operators opSink (in the 
output range) and opSource (in the input range). These would be 
invoked as a pair, opSource returning an input range which is 
passed to opSink.

If not implemented, opSource defaults to `auto opSource() { 
return this; }` or even `alias opSource = this;`. opSink would 
default to a std.algorithm.copy style thing (can't really be 
std.algorithm.copy itself because you don't want the dependency).

I'm not sure what ground this really breaks, if any, but it sure 
looks nice to me.

May 08 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/8/14, 10:46 AM, Andrei Alexandrescu wrote:
[snip]

Here's a list of algorithms in std.algorithm that would need to be 
looked at if we want to handle ranges of char and wchar properly (at 
least versions with predicates). When I say "looked at" I mean "modified 
appropriately to handle ranges of char and wchar in addition to strings".

========================

Searching: all  any  canFind  count  countUntil  commonPrefix  endsWith 
  find  findAdjacent  findAmong  findSkip  findSplit  findSplitAfter 
findSplitBefore  minCount  minPos  mismatch  skipOver  startsWith  until

Comparison: cmp  equal  levenshteinDistance  levenshteinDistanceAndPath 
  mismatch

Iteration: filter  filterBidirectional  group  map  reduce  splitter  uniq

Sorting: completeSort  isPartitioned  isSorted  makeIndex 
nextPermutation  nextEvenPermutation  partialSort  partition  partition3 
  schwartzSort  sort  topN  topNCopy

Set operations: (probably none, because set ops are rare on strings)

Mutation: bringToFront  copy  fill  initializeAll  moveAll  moveSome 
remove  reverse  strip  stripLeft  stripRight  swapRanges  uninitializedFill

========================

Of these, only a fraction (maybe 60-70%) are actually used with strings. 
Still that leaves a significant amount of work to do if we want 
std.algorithm to work smoothly with ranges of char/wchar. Here 
"smoothly" means "if it compiles and runs it won't do something UTF-goofy".



Andrei

May 08 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/8/14, 3:33 PM, Andrei Alexandrescu wrote:
 Of these, only a fraction (maybe 60-70%) are actually used with strings.
 Still that leaves a significant amount of work to do if we want
 std.algorithm to work smoothly with ranges of char/wchar. Here
 "smoothly" means "if it compiles and runs it won't do something UTF-goofy".

The larger point here (I forgot) is that user code that's written in the 
style of std.algorithm would need similar adjustments in order to work 
with ranges of char/wchar.

Andrie

May 08 2014

Walter Bright <newshound2 digitalmars.com> writes:

Classification of functions in http://dlang.org/phobos/std_path.html:

1. Functions that allocate GC memory and return an array:

   dirName, setExtension, defaultExtension, buildPath, buildNormalizedPath, 
absolutePath, relativePath, expandTilde

These are prime candidates to be converted into algorithms.


2. Functions that return a slice of their input arrays:

   baseName, rootName, driveName, stripDrive, extension,

Should be enhanced to accept RandomAccessRange and produce RandomAccessRange


3. Functions that return a range:

   pathSplitter

Enhance to accept RandomAccessRange


4. Functions that take an array and compute a value:

   isRooted, isAbsolute, filenameCmp, globMatch, isValidPath

These are reducers, and should be enhanced to accept input ranges.

May 08 2014

"Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:

On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu 
wrote:
 A discussion is building around 
 https://github.com/D-Programming-Language/phobos/pull/2149, 
 which is a nice initiative by Walter to allow Phobos users to 
 avoid or control memory allocation.

 First instance of the pull request copied the inputs into an 
 output range.

 The second instance (right now) creates an input range that 
 lazily creates the result. The element type of that range is 
 the encoding type of the first argument (i.e. char or wchar 
 most of the time). This is different from string/wstring/etc 
 element-wise iteration because it'll be done code unit-wise, 
 not code point-wise.

I'm with H. S. Teoh that we should prefer the input range 
approach. My limited experience with output ranges suggests to me 
that they really need to be the end point, where the data is 
actually leaving the program (writing to disk, over a network). 
And as Teoh mentions, if you need it in specific memory you can 
copy an input range into it.

D is already fairly lazy, and .array() is great for doing eager 
evaluation into the GC. For strings having something similar 
which allows for the original encoding to remain would be good.

May 13 2014

D Programming

C/C++ Programming

Other

digitalmars.D - Ranges of char and wchar