www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Lots of low hanging fruit in Phobos

reply Walter Bright <newshound2 digitalmars.com> writes:
A major goal for D in the short term is to reduce reliance in Phobos on the GC. 
I was looking at std.string last night, and I noticed a couple things:

1. The inputs are constrained to being strings. This is overly restrictive, the 
inputs should be InputRanges.

2. The outputs should be a range, too. In fact, the string functions should 
become algorithms. Then they won't need to allocate any memory at all.

The existing functions should not be removed, but perhaps rewritten as wrappers 
around the algorithm versions.

I've found that rewriting traditional code, which is what std.string is now, in 
terms of algorithms is a bit mind-bending. But it's well worth it, and fun.

So who wants to step up? Don't have to do the whole thing in one go, just pick
a 
function and do that one.
Mar 06 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 The existing functions should not be removed, but perhaps 
 rewritten as wrappers around the algorithm versions.
How does one handle case sensitivity for ranges of abstract types?
 I've found that rewriting traditional code, which is what 
 std.string is now, in terms of algorithms is a bit 
 mind-bending. But it's well worth it, and fun.
Would be pretty neat if std.string and std.regex would work with char-like types which actually carry more data per character. That way, it'd be possible to do string/regex transforms (search & replace, etc.) but keep track where exactly each character came from.
Mar 06 2014
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 06 Mar 2014 16:30:46 -0500, Vladimir Panteleev  
<vladimir thecybershadow.net> wrote:

 On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 The existing functions should not be removed, but perhaps rewritten as  
 wrappers around the algorithm versions.
How does one handle case sensitivity for ranges of abstract types?
Perhaps he means many of the functions only accept const(Char)[], where Char can be templatized, but it really should accept any range. -Steve
Mar 06 2014
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 01:30, Vladimir Panteleev пишет:
 On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 The existing functions should not be removed, but perhaps rewritten as
 wrappers around the algorithm versions.
How does one handle case sensitivity for ranges of abstract types?
+1
 I've found that rewriting traditional code, which is what std.string
 is now, in terms of algorithms is a bit mind-bending. But it's well
 worth it, and fun.
Would be pretty neat if std.string and std.regex would work with char-like types which actually carry more data per character. That way, it'd be possible to do string/regex transforms (search & replace, etc.) but keep track where exactly each character came from.
Exactly. I've been toying with idea of having generic notion of Alphabets for more then a year now. That would also be generalizing code unit / code point stuff of Unicode, into legacy encodings and beyond. One case I had in mind is the very limited (A, C, T, G) alphabet in bio-informatics. -- Dmitry Olshansky
Mar 06 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Thursday, 6 March 2014 at 21:30:47 UTC, Vladimir Panteleev 
wrote:
 On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 The existing functions should not be removed, but perhaps 
 rewritten as wrappers around the algorithm versions.
How does one handle case sensitivity for ranges of abstract types?
The same way std.algorithm does: through equality predicates.
Mar 06 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 1:30 PM, Vladimir Panteleev wrote:
 On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 The existing functions should not be removed, but perhaps rewritten as
 wrappers around the algorithm versions.
How does one handle case sensitivity for ranges of abstract types?
Use 'static if' to detect the element type. And besides, I had meant that an algorithm should work on an InputRange of char's, and not be restricted to char[].
Mar 06 2014
prev sibling next sibling parent reply "Meta" <jared771 gmail.com> writes:
On Thursday, 6 March 2014 at 21:26:45 UTC, Walter Bright wrote:
 A major goal for D in the short term is to reduce reliance in 
 Phobos on the GC. I was looking at std.string last night, and I 
 noticed a couple things:

 1. The inputs are constrained to being strings. This is overly 
 restrictive, the inputs should be InputRanges.

 2. The outputs should be a range, too. In fact, the string 
 functions should become algorithms. Then they won't need to 
 allocate any memory at all.

 The existing functions should not be removed, but perhaps 
 rewritten as wrappers around the algorithm versions.

 I've found that rewriting traditional code, which is what 
 std.string is now, in terms of algorithms is a bit 
 mind-bending. But it's well worth it, and fun.

 So who wants to step up? Don't have to do the whole thing in 
 one go, just pick a function and do that one.
Seems like a good idea to reduce memory usage wherever possible, but I thought that the reason std.string exists (and duplicates some functionality that exists elsewhere) is to provide string-specific versions that were either optimized specifically for strings, or have completely different functionality due to working with strings.
Mar 06 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 1:40 PM, Meta wrote:
 Seems like a good idea to reduce memory usage wherever possible, but I thought
 that the reason std.string exists (and duplicates some functionality that
exists
 elsewhere) is to provide string-specific versions that were either optimized
 specifically for strings, or have completely different functionality due to
 working with strings.
1. By using template specializations, algorithms can be custom optimized for certain inputs. 2. I expect the developer of these algorithms to do performance profiling of them vs the originals, and address any problems. 3. I've done some similar algorithms, and was able to achieve performance parity, and sometimes even do better (because no memory needed to be allocated). std.string was one of the first Phobos modules written, was written by myself, and long predates notions of ranges and algorithms. It has evolved since then, but its fundamental nature has not changed. It's time for that to change to a modern, D style.
Mar 06 2014
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 01:26:46PM -0800, Walter Bright wrote:
 A major goal for D in the short term is to reduce reliance in Phobos
 on the GC. I was looking at std.string last night, and I noticed a
 couple things:
 
 1. The inputs are constrained to being strings. This is overly
 restrictive, the inputs should be InputRanges.
 
 2. The outputs should be a range, too. In fact, the string functions
 should become algorithms. Then they won't need to allocate any
 memory at all.
[...] What about using output ranges? T -- What do you call optometrist jokes? Vitreous humor.
Mar 06 2014
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 00:44:45 UTC, H. S. Teoh wrote:
 What about using output ranges?
I think most the string functions should be transformative, like std.algorithm.map, so they take an input range and return an input range. This lets them chain most easily, letting the user sink them into a particular range at the end. Though we could do a bit of magic to both take an output range and return an input range for it (which can also be backward-compatible, as we talked about in a thread a month or so ago), the most straightforward way is surely to treat it all like map.
Mar 06 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 4:43 PM, H. S. Teoh wrote:
 On Thu, Mar 06, 2014 at 01:26:46PM -0800, Walter Bright wrote:
 A major goal for D in the short term is to reduce reliance in Phobos
 on the GC. I was looking at std.string last night, and I noticed a
 couple things:

 1. The inputs are constrained to being strings. This is overly
 restrictive, the inputs should be InputRanges.

 2. The outputs should be a range, too. In fact, the string functions
 should become algorithms. Then they won't need to allocate any
 memory at all.
[...] What about using output ranges?
A great question. I tend to regard output ranges as suitable for a container to expose. An algorithm reads an InputRange, and presents its output as another InputRange. This is so that algorithms can be easily chained together by the user, and the last algorithm in the chain would be std.algorithm.copy(OutputRange). For example, we could implement toStringz() as an algorithm that looks like: auto toStringz(S)(S s) { return chain(s, repeat(0, 1)) } and use it like this: string s = ...; char[] buffer = ...; s.toStringz.copy(buffer); Note how the algorithm version of toStringz does not allocate any memory. buffer would be the output range, but note how what it actually is is, and how its memory is allocated, is selected by the user. (std.buffer.scopebuffer is ideal for this sort of usage.) std.file is loaded with calls to toStringz(), each of which leaks memory quite unnecessarily. Using an algorithm version of toStringz(), along with scopebuffer, will neatly eliminated these leaks. You might be tempted to think that these are just little leaks, who cares. But I recently wrote a D program that did tens of thousands of file operations, and those little leaks turned into a raging river.
Mar 06 2014
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 05:15:45PM -0800, Walter Bright wrote:
 On 3/6/2014 4:43 PM, H. S. Teoh wrote:
On Thu, Mar 06, 2014 at 01:26:46PM -0800, Walter Bright wrote:
A major goal for D in the short term is to reduce reliance in Phobos
on the GC. I was looking at std.string last night, and I noticed a
couple things:

1. The inputs are constrained to being strings. This is overly
restrictive, the inputs should be InputRanges.

2. The outputs should be a range, too. In fact, the string functions
should become algorithms. Then they won't need to allocate any
memory at all.
[...] What about using output ranges?
A great question. I tend to regard output ranges as suitable for a container to expose. An algorithm reads an InputRange, and presents its output as another InputRange. This is so that algorithms can be easily chained together by the user, and the last algorithm in the chain would be std.algorithm.copy(OutputRange).
After some thought, and also some recent experiences, I've come to the same conclusion. Output ranges are basically sinks, which means they are no good for further chaining (attempting to do so is an exercise in frustration). So pretty much everything that generates or transforms data should return an input range, and output ranges should only be used when you're going to stop processing the data at that point, e.g., save it into a buffer, write to a file, etc.. Unfortunately, input ranges are somewhat tedious to write -- nice foreach loops have to be broken up into .empty, .front, .popFront, which is a lot of boilerplate code and not so nice in inner loops (though I suppose the idea is to improve compiler inlining to handle these cases). I wonder if a mid- to long-term goal for D would be to ease writing input ranges by moving some of the boilerplate into the language, or providing range builder facilities in Phobos. The most nagging part of writing input ranges in the current language is the inability to use foreach to generate the returned elements. (Well, you *can* do that to a buffer array and then return the array, but that kinda defeats the purpose of GC avoidance.) Some kind of built-in coroutine syntax with 'yield' would help things a LOT.
 For example, we could implement toStringz() as an algorithm that looks
 like:
 
     auto toStringz(S)(S s) {
         return chain(s, repeat(0, 1))
     }
 
 and use it like this:
 
     string s = ...;
     char[] buffer = ...;
     s.toStringz.copy(buffer);
 
 Note how the algorithm version of toStringz does not allocate any
 memory. buffer would be the output range, but note how what it
 actually is is, and how its memory is allocated, is selected by the
 user.
Yes, I like this. This is the right way to go. [...]
 You might be tempted to think that these are just little leaks, who
 cares. But I recently wrote a D program that did tens of thousands
 of file operations, and those little leaks turned into a raging
 river.
Yeah, I'm currently working on a program that generates potentially huge numbers of n-dimensional coordinates, and currently the code does this using ranges that return array literals (ouch). This is pretty bad for performance and also for GC pressure, so I'm thinking to restructure the code some time in the near future to get rid of obligatory allocations, much like how you describe it above. T -- Customer support: the art of getting your clients to pay for your own incompetence.
Mar 06 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 5:37 PM, H. S. Teoh wrote:
 Unfortunately, input ranges are somewhat tedious to write
I know. But we need to make the effort, at least for Phobos. The payoff is it makes user code much less effort.
Mar 06 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
H. S. Teoh:

 Some kind of built-in coroutine syntax with
 'yield' would help things a LOT.
Some persons have quickly shot down this idea that I have suggested: https://d.puremagic.com/issues/show_bug.cgi?id=11880 Bye, bearophile
Mar 06 2014
prev sibling next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/6/2014 8:37 PM, H. S. Teoh wrote:
 Unfortunately, input ranges are somewhat tedious to write -- nice
 foreach loops have to be broken up into .empty, .front, .popFront, which
 is a lot of boilerplate code and not so nice in inner loops (though I
 suppose the idea is to improve compiler inlining to handle these cases).
 I wonder if a mid- to long-term goal for D would be to ease writing
 input ranges by moving some of the boilerplate into the language, or
 providing range builder facilities in Phobos. The most nagging part of
 writing input ranges in the current language is the inability to use
 foreach to generate the returned elements. (Well, you *can* do that to a
 buffer array and then return the array, but that kinda defeats the
 purpose of GC avoidance.) Some kind of built-in coroutine syntax with
 'yield' would help things a LOT.
need a way to make input ranges (maybe even forward ranges if we're automatically converted behind-the-scenes into stackless fibers (ie, into state machines). A while back, I tried doing a library solution for this via regular fibers, but there turned out to be a lot of overhead due to the fiber's context-switching [1]. We need to take inspiration from two things: C's "protothreads" IMO. In both cases, the body of a coroutine is automagically transformed into a big switch statement. Yield statements then become roughly "return tuple([yielded value], [resume point]); case [resume point]:...". And then invoking the coroutine again automatically passes in the last [resume point] returned so the big switch() can jump straight to where it left off. The main difference in our case would be that we'd also auto-generate all the boilerplate rigging for this to be an input range. [1] http://semitwist.com/articles/article/view/combine-coroutines-and-input-ranges-for-dead-simple-d-iteration [2] C's "protothreads" library: http://dunkels.com/adam/pt/expansion.html
Mar 06 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
These are good ideas, but for now we need to just bite the bullet and fix
Phobos.
Mar 06 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/6/2014 10:59 PM, Walter Bright wrote:
 These are good ideas, but for now we need to just bite the bullet and
 fix Phobos.
Agreed.
Mar 06 2014
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/07/2014 02:37 AM, H. S. Teoh wrote:
 Unfortunately, input ranges are somewhat tedious to write -- nice
 foreach loops have to be broken up into .empty, .front, .popFront, which
 is a lot of boilerplate code and not so nice in inner loops (though I
 suppose the idea is to improve compiler inlining to handle these cases).
I think that the separation of front and popFront causes much of this.
Mar 07 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 01:14:29AM +0100, Timon Gehr wrote:
 On 03/07/2014 02:37 AM, H. S. Teoh wrote:
Unfortunately, input ranges are somewhat tedious to write -- nice
foreach loops have to be broken up into .empty, .front, .popFront,
which is a lot of boilerplate code and not so nice in inner loops
(though I suppose the idea is to improve compiler inlining to handle
these cases).
I think that the separation of front and popFront causes much of this.
I'm on the fence about that one. On the one hand, some ranges that do complex calculations in .front are better off with .popAndReturnFront. But on the other hand, there are also algorithms for which you don't want to advance the range just to pick off its front element. Since the separated semantics can't be achieved with a non-separated range API, I lean slightly towards favoring the current split approach. Having a way to auto-generate input range boilerplate, though, would be really, *really* nice. Coroutine-style code would be ideal. T -- Bomb technician: If I'm running, try to keep up.
Mar 07 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 5:09 PM, H. S. Teoh wrote:
 Having a way to auto-generate input range boilerplate, though, would be
 really, *really* nice. Coroutine-style code would be ideal.
I usually start writing an InputRange by pasting in a boilerplate version, then modifying it.
Mar 07 2014
prev sibling next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 8 March 2014 at 01:10:38 UTC, H. S. Teoh wrote:
 Having a way to auto-generate input range boilerplate, though, 
 would be really, *really* nice.
Eh, I don't think it is a big deal and would be fairly limited compared to the current setup. If you use a fiber or state variable or something for the yield this yield that trick, how do you go backward? Random access? I think the best the yield stuff can do is maybe forward range and maybe infinite (probably with an annotation of some sort, since otherwise, the infiniteness wouldn't be obvious at compile time). So the best we're looking to automate is input or perhaps forward ranges. And how hard are these really to write? yield query(string q) { auto result = c_query(toStringz(q)); while(!HasRow(result)) yield GetNextRow(result); } OK, that is kinda nice, but, is the status quo so bad? (BTW the reason I went with some kind of C database api here is everything else I could think of are actually pretty short when using std.algorithm functions to help define them.) struct query { private Result result; this(string q) { result = c_query(toStringz(q)); if(!empty) popFront(); } Row front; property bool empty() { return HasRow(result); } void popFront() in { assert(!empty); } body { front = GetNextRow(result); } } It is certainly a bit longer, but it isn't that bad, and is easily extended to other range capabilities. Translating recursive iteration to a range does take a bit more, you need to track your local variables and put them in a stack of your own, but even that isn't too hard (though a bit wordier). I guess the whole yield thing can be kinda nice, I'm just not sure it is that big of a win given its other limitations compared to full ranges.
Mar 07 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 struct query {
     private Result result;
     this(string q) {
          result = c_query(toStringz(q));
          if(!empty) popFront();
     }

     Row front;
      property bool empty() { return HasRow(result); }
     void popFront() in { assert(!empty); } body {
          front = GetNextRow(result);
     }
 }


 It is certainly a bit longer, but it isn't that bad, and is 
 easily extended to other range capabilities.
Your code is badly formatted.
 Translating recursive iteration to a range does take a bit 
 more, you need to track your local variables and put them in a 
 stack of your own, but even that isn't too hard (though a bit 
 wordier).
In generally this is rather bad, unless you are a C programmer used to use tweezers and needles to implement your own hash set every other day :-(
 I guess the whole yield thing can be kinda nice, I'm just not 
 sure it is that big of a win given its other limitations 
 compared to full ranges.
yield is limited, but in a large amount of cases it's enough, especially if it's well implemented (now Python 3 has yield that is usable for coroutines too, and it has recently added another optimization). For the other cases you can use normal range protocols as you have done. Bye, bearophile
Mar 07 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 (now Python 3 has yield that is usable for coroutines too, and 
 it has recently added another optimization).
http://theburningmonk.com/2011/09/fsharp-yield-vs-yield/ Bye, bearophile
Mar 08 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:

I have not yet found the paper, but I think this was the topic: http://blogs.msdn.com/b/dsyme/archive/2007/09/22/some-details-on-f-computation-expressions-aka-monadic-or-workflow-syntax.aspx Bye, bearophile
Mar 08 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
 I have not yet found the paper,
Here it is: http://tomasp.net/academic/papers/computation-zoo/ More on the same: http://en.wikibooks.org/wiki/F_Sharp_Programming/Computation_Expressions http://msdn.microsoft.com/en-us/library/dd233182.aspx Bye, bearophile
Mar 08 2014
prev sibling next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 8 March 2014 at 02:15:19 UTC, Adam D. Ruppe wrote:
 I guess the whole yield thing can be kinda nice, I'm just not 
 sure it is that big of a win given its other limitations 
 compared to full ranges.
I slapped this together to see if we can do it with a mixin template: http://arsdnet.net/dcode/next.d struct ShortRange { int next() { switch(callNumber) { case 0: return 50; case 1: return 60; case 2: return 70; case 3: return finished(); default: assert(0); } } mixin MakeShortRange!(); } Wordier than yield ShortRange() { yield 50; yield 60; yield 70; } but the same idea. (In fact, i'm sure the whole yield thing should pretty much just rewrite into something like this anyway). My hypothetical from the previous post would be written: struct ShortRange { Result result; this(string s) { result = c_query(s); } Row next() { if(!HasRow(result)) return finished(); return GetNextRow(result); } mixin MakeShortRange!(); } Which is a bit shorter and perhaps nicer to use than the long form range.
Mar 07 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 Wordier than
 yield ShortRange() { yield 50; yield 60; yield 70; }

 but the same idea.
In general a typed syntax could be: yield(int) ShortRange() { yield 50; yield 60; yield 70; } Or even: yield(auto) ShortRange() { yield 50; yield 60; yield 70; } So yield(auto) could be sugar for yield(auto): yield ShortRange() { yield 50; yield 60; yield 70; } Bye, bearophile
Mar 07 2014
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 03:27:15AM +0000, bearophile wrote:
 Adam D. Ruppe:
 
Wordier than
yield ShortRange() { yield 50; yield 60; yield 70; }

but the same idea.
In general a typed syntax could be: yield(int) ShortRange() { yield 50; yield 60; yield 70; } Or even: yield(auto) ShortRange() { yield 50; yield 60; yield 70; } So yield(auto) could be sugar for yield(auto): yield ShortRange() { yield 50; yield 60; yield 70; }
[...] Yield syntax is generally not a big deal with simple ranges, but once you start needing recursion and branching logic, things become significantly hairier without yield. For example: yield ComplicatedRange(int arg) { if (arg == 0) { foreach (i; 0 .. 10) yield i; } else if (arg >= 1 && arg <= 5) { yield 10; yield 20; yield 30; } else if (arg >= 6) { foreach (i; 10 .. 20) { if (i >= 5 && i < 9) { yield 40; yield 50; } else { yield i*2; } } } } With yield, the branching logic is clear and relatively easy to follow. Without yield, this will turn into a nasty mess of nested state variables, due to the different numbers of yielded items within nested conditional blocks. (Of course, you could argue that if things get this complicated, one should be splitting it up into multiple composed ranges instead, but not everything is readily decomposible.) T -- Bare foot: (n.) A device for locating thumb tacks on the floor.
Mar 07 2014
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/08/2014 03:15 AM, Adam D. Ruppe wrote:
 So the best we're looking to automate is input or perhaps forward
 ranges. And how hard are these really to write?

 yield query(string q) {
     auto result = c_query(toStringz(q));
     while(!HasRow(result))
        yield GetNextRow(result);
 }

 OK, that is kinda nice, but, is the status quo so bad?
 (BTW the reason I
 went with some kind of C database api here is everything else I could
 think of are actually pretty short when using std.algorithm functions to
 help define them.)

 struct query {
      private Result result;
      this(string q) {
           result = c_query(toStringz(q));
           if(!empty) popFront();
      }

      Row front;
       property bool empty() { return HasRow(result); }
      void popFront() in { assert(!empty); } body {
           front = GetNextRow(result);
      }
 }


 It is certainly a bit longer, but it isn't that bad, and is easily
 extended to other range capabilities.
This does not do the same thing. It computes the first row even if it is never requested. std.algorithm.filter suffers from the same problem. This is part of the reason why I think the separation of front and popFront is suboptimal design.
Mar 08 2014
parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Saturday, 8 March 2014 at 09:25:38 UTC, Timon Gehr wrote:
 This does not do the same thing. It computes the first row even 
 if it is never requested. std.algorithm.filter suffers from the 
 same problem. This is part of the reason why I think the 
 separation of front and popFront is suboptimal design.
Please have a look at this pull request: https://github.com/D-Programming-Language/phobos/pull/1987
Mar 09 2014
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/7/2014 9:15 PM, Adam D. Ruppe wrote:
 On Saturday, 8 March 2014 at 01:10:38 UTC, H. S. Teoh wrote:
 Having a way to auto-generate input range boilerplate, though, would
 be really, *really* nice.
Eh, I don't think it is a big deal and would be fairly limited compared to the current setup. If you use a fiber or state variable or something for the yield this yield that trick, how do you go backward? Random access?
Oh it would definitely be for nothing more advanced than a forward range. It's a coroutine, ie a generator function, so the elements are determined by execution flow. Since execution can't go in reverse or random-access, anything beyond forward range is ruled out. But you knew that :) Forward range should be possible though (at least for pure-ish coroutines anyway), since all you'd have to do is clone the coroutine's internal state structure (which it must already have anyway, since you can't have coroutines without it). While it couldn't be used for bidirectional or random-access ranges, it would still be a big help for generators - ie the whole point of coroutines in the first place. Other languages have coroutines - we'd have coroutines that can be used as ranges :)
 I think the best the yield stuff can do is maybe forward range and maybe
 infinite (probably with an annotation of some sort, since otherwise, the
 infiniteness wouldn't be obvious at compile time).


 So the best we're looking to automate is input or perhaps forward
 ranges. And how hard are these really to write?

 yield query(string q) {
     auto result = c_query(toStringz(q));
     while(!HasRow(result))
        yield GetNextRow(result);
 }

 OK, that is kinda nice, but, is the status quo so bad? (BTW the reason I
 went with some kind of C database api here is everything else I could
 think of are actually pretty short when using std.algorithm functions to
 help define them.)

 struct query {
      private Result result;
      this(string q) {
           result = c_query(toStringz(q));
           if(!empty) popFront();
      }

      Row front;
       property bool empty() { return HasRow(result); }
      void popFront() in { assert(!empty); } body {
           front = GetNextRow(result);
      }
 }


 It is certainly a bit longer, but it isn't that bad, and is easily
 extended to other range capabilities.


 Translating recursive iteration to a range does take a bit more, you
 need to track your local variables and put them in a stack of your own,
 but even that isn't too hard (though a bit wordier).


 I guess the whole yield thing can be kinda nice, I'm just not sure it is
 that big of a win given its other limitations compared to full ranges.
I dunno, what you wrote sounds to me like a pretty convincing argument in favor of coroutine ranges. ;) Keep in mind, making *all* ranges easier to write isn't the charter here, and it doesn't need to be: Bidirectional and random-access ranges ARE more complicated than generators, so I think they're already just about as easy to write as we *can* make them. Instead, the problem we're looking at solving here is exactly what you've described above: Making generators in D is more complicated and boilerplate-y than it really needs to be (unless you want to give up on ranges and use opApply - which still isn't particularly great without that mixin you suggested a couple years ago to cover up the return value ugliness). Generators may be a subset of ranges, but they're an important subset. generators. What really itches at me is that we're so tantalizingly close with opApply. The only real problem with it (aside from the return value system being kinda awkward compared to a "yield" statement) is that it can't be used as a range. And it can't be converted to a range without using Phobos fibers which imposes too much of an overhead to be a one-size-fits-all technique for generators.
Mar 08 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/8/2014 5:56 AM, Nick Sabalausky wrote:
 What really itches at me is that we're so tantalizingly close with
 opApply. The only real problem with it (aside from the return value
 system being kinda awkward compared to a "yield" statement) is that it
 can't be used as a range. And it can't be converted to a range without
 using Phobos fibers which imposes too much of an overhead to be a
 one-size-fits-all technique for generators.
The troublesome thing about that is anyone who writes a generator in D is forced to choose between [mostly] straightforward implementation logic (ie opApply) *or* range compatibility. I think we can do better than to force an ugly choice like that.
Mar 08 2014
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/08/2014 02:09 AM, H. S. Teoh wrote:
 On Sat, Mar 08, 2014 at 01:14:29AM +0100, Timon Gehr wrote:
 On 03/07/2014 02:37 AM, H. S. Teoh wrote:
 Unfortunately, input ranges are somewhat tedious to write -- nice
 foreach loops have to be broken up into .empty, .front, .popFront,
 which is a lot of boilerplate code and not so nice in inner loops
 (though I suppose the idea is to improve compiler inlining to handle
 these cases).
I think that the separation of front and popFront causes much of this.
I'm on the fence about that one. On the one hand, some ranges that do complex calculations in .front are better off with .popAndReturnFront. But on the other hand, there are also algorithms for which you don't want to advance the range just to pick off its front element. Since the separated semantics can't be achieved with a non-separated range API, I lean slightly towards favoring the current split approach. ...
I agree that there are trade-offs involved.
 Having a way to auto-generate input range boilerplate, though, would be
 really, *really* nice. Coroutine-style code would be ideal.
 ...
The drawback is that without the compiler being really clever, ranges thus defined will be just input ranges.
Mar 08 2014
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 10:38:50AM +0100, Timon Gehr wrote:
 On 03/08/2014 02:09 AM, H. S. Teoh wrote:
On Sat, Mar 08, 2014 at 01:14:29AM +0100, Timon Gehr wrote:
On 03/07/2014 02:37 AM, H. S. Teoh wrote:
Unfortunately, input ranges are somewhat tedious to write -- nice
foreach loops have to be broken up into .empty, .front, .popFront,
which is a lot of boilerplate code and not so nice in inner loops
(though I suppose the idea is to improve compiler inlining to
handle these cases).
I think that the separation of front and popFront causes much of this.
I'm on the fence about that one. On the one hand, some ranges that do complex calculations in .front are better off with .popAndReturnFront. But on the other hand, there are also algorithms for which you don't want to advance the range just to pick off its front element. Since the separated semantics can't be achieved with a non-separated range API, I lean slightly towards favoring the current split approach. ...
I agree that there are trade-offs involved.
Having a way to auto-generate input range boilerplate, though, would
be really, *really* nice. Coroutine-style code would be ideal.
...
The drawback is that without the compiler being really clever, ranges thus defined will be just input ranges.
We can achieve at least forward ranges with coroutines, if they're pure. T -- Your inconsistency is the only consistent thing about you! -- KD
Mar 08 2014
prev sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
W dniu 2014-03-08 02:09, H. S. Teoh pisze:
 Having a way to auto-generate input range boilerplate, though, would be
 really, *really* nice. Coroutine-style code would be ideal.
https://github.com/pszturmaj/dgenerators
Mar 08 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/8/2014 5:58 PM, Piotr Szturmaj wrote:
 W dniu 2014-03-08 02:09, H. S. Teoh pisze:
 Having a way to auto-generate input range boilerplate, though, would be
 really, *really* nice. Coroutine-style code would be ideal.
https://github.com/pszturmaj/dgenerators
Yea, there is that approach, which is nice in certain ways. I did the same kinda thing a couple years ago[1], but your API appears much, much nicer. The unfortunate downside, though, is that as jerro demonstrated[2] there's a high overhead when using fibers for iteration. It likely wouldn't be an issue for some things, like IO, but for other things it could be a problem. So what we end up with is an unfortunate three-way choice anytime someone needs a generator in D: 1. Give up on range compatibility (opApply) 2. Give up on straightforward implementation (input/forward range) 3. Give up on maximum performance (fiber-based coroutine range) D's just dancing all around the ideal solution for generators without ever actually hitting the mark. Which I find frustrating because I know that's a target D's capable of nailing. [1] http://semitwist.com/articles/article/view/combine-coroutines-and-input-ranges-for-dead-simple-d-iteration [2] http://forum.dlang.org/thread/jno6o5$qtb$1 digitalmars.com
Mar 08 2014
parent reply "w0rp" <devw0rp gmail.com> writes:
On Saturday, 8 March 2014 at 23:23:21 UTC, Nick Sabalausky wrote:
 3. Give up on maximum performance (fiber-based coroutine range)
I think that's what I would go for. yield with co-routines could be a quick and dirty way to create an InputRange at some performance cost. Then writing the InputRange by hand would then be an optimisation. I think I tend to write things in such a way where I have room to optimise in such a way, but I almost always reach for the easier to write, less performant option first. (It's a bonus when the easier way is also the more performant way.)
Mar 08 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
w0rp:

 3. Give up on maximum performance (fiber-based coroutine range)
I think that's what I would go for.
Yet there's no need for that. You can have your pie and eat it too, with a small cost. D can rewrite code that contains a yield in an efficient finite state machine (this is what ShedSkin same). Bye, bearophile
Mar 09 2014
parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 10:30:37 UTC, bearophile wrote:
 w0rp:

 3. Give up on maximum performance (fiber-based coroutine 
 range)
I think that's what I would go for.
Yet there's no need for that. You can have your pie and eat it too, with a small cost. D can rewrite code that contains a yield in an efficient finite state machine (this is what does the same).
How does this handle recursive generators, e.g. like you would see with a depth-first tree traversal? The "state" in this case is a call stack. How is the memory allocated for the stack?
Mar 09 2014
next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Sunday, 9 March 2014 at 11:00:13 UTC, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 10:30:37 UTC, bearophile wrote:
 w0rp:

 3. Give up on maximum performance (fiber-based coroutine 
 range)
I think that's what I would go for.
Yet there's no need for that. You can have your pie and eat it too, with a small cost. D can rewrite code that contains a yield in an efficient finite state machine (this is what does the same).
How does this handle recursive generators, e.g. like you would see with a depth-first tree traversal? The "state" in this case is a call stack. How is the memory allocated for the stack?
The generators mentioned are stack-less. You'd need to maintain the stack yourself.
Mar 09 2014
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 11:17:14 UTC, Tobias Pankrath wrote:
 On Sunday, 9 March 2014 at 11:00:13 UTC, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 10:30:37 UTC, bearophile wrote:
 w0rp:

 3. Give up on maximum performance (fiber-based coroutine 
 range)
I think that's what I would go for.
Yet there's no need for that. You can have your pie and eat it too, with a small cost. D can rewrite code that contains a yield in an efficient finite state machine (this is what compiler does the same).
How does this handle recursive generators, e.g. like you would see with a depth-first tree traversal? The "state" in this case is a call stack. How is the memory allocated for the stack?
The generators mentioned are stack-less. You'd need to maintain the stack yourself.
That's unfortunate. I find that non-recursive generators are pretty easy to write as ranges anyway :-/
Mar 09 2014
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Peter Alexander:

 How does this handle recursive generators, e.g. like you would 
 see with a depth-first tree traversal? The "state" in this case 
 is a call stack. How is the memory allocated for the stack?
A Python 2.x test program: class Node: def __init__(self, data, left, right): self.data = data self.left = left self.right = right def in_order(node): if node is not None: for n in in_order(node.left): yield n yield node.data for n in in_order(node.right): yield n def main(): tree = Node(1, Node(2, Node(4, Node(7, None, None), None), Node(5, None, None)), Node(3, Node(6, Node(8, None, None), Node(9, None, None)), None)) for n in in_order(tree): print n, print main() It prints: 7 4 2 5 1 8 6 9 3 ShedSkin translates it to some C++ code, here I show only the in_order function: class __gen_in_order : public __iter<__ss_int> { public: Node *node; __iter<__ss_int> *__0, *__1, *__4, *__5; __ss_int __2, __6, n; __iter<__ss_int>::for_in_loop __3, __7; int __last_yield; __gen_in_order(Node *node) { this->node = node; __last_yield = -1; } __ss_int __get_next() { switch(__last_yield) { case 0: goto __after_yield_0; case 1: goto __after_yield_1; case 2: goto __after_yield_2; default: break; } if ((node!=NULL)) { FOR_IN(n,in_order(node->left),0,2,3) __last_yield = 0; __result = n; return __result; __after_yield_0:; END_FOR __last_yield = 1; __result = node->data; return __result; __after_yield_1:; FOR_IN(n,in_order(node->right),4,6,7) __last_yield = 2; __result = n; return __result; __after_yield_2:; END_FOR } __stop_iteration = true; } }; __iter<__ss_int> *in_order(Node *node) { return new __gen_in_order(node); } Where FOR_IN is a macro to some foreach-like code: #define FOR_IN(e, iter, temp, i, t) \ { \ Note this is not efficient, because it could generate O(n^2) code, but this is not fault of ShedSkin, as the original Python code has the same problem. They have recently (in Python 3.3) added to Python the syntax "yield from" that will allow to remove that performance problem (currently I think the performance is not improved): http://legacy.python.org/dev/peps/pep-0380/ Look especially at the section regarding optimization: http://legacy.python.org/dev/peps/pep-0380/#optimisations With this improvement the in_order function becomes: def in_order(node): if node is not None: yield from in_order(node.left) yield node.data yield from in_order(node.right) kinds of yield are "yield" and "yield!": http://theburningmonk.com/2011/09/fsharp-yield-vs-yield/ http://stackoverflow.com/questions/3500488/f-yield-operator-implementation-and-possible-c-sharp-equivalents Bye, bearophile
Mar 09 2014
prev sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 7 March 2014 at 01:15:43 UTC, Walter Bright wrote:
 On 3/6/2014 4:43 PM, H. S. Teoh wrote:
 What about using output ranges?
A great question. I tend to regard output ranges as suitable for a container to expose. An algorithm reads an InputRange, and presents its output as another InputRange. This is so that algorithms can be easily chained together by the user, and the last algorithm in the chain would be std.algorithm.copy(OutputRange).
While I prefer this approach most of the time, I fear it has a performance cost over sink-style algorithms (whether they use an output range or build an array result). I guess the question is whether this cost is generally offset by gains in the memory allocation code or not.
 For example, we could implement toStringz() as an algorithm 
 that looks like:

     auto toStringz(S)(S s) {
         return chain(s, repeat(0, 1))
     }
std.range.only is more suitable than `repeat` here (and I don't know if `chain` would let you chain a range of characters with a range of integers): auto toStringz(S)(S s) if (is(Unqual!(ElementType!S) == dchar)) { return s.chain('\0'.only()); } Either way, perhaps the most serious problem is that when using `copy` to write the result to an array, both UTF decoding and re-encoding takes place (the result is always a range of `dchar`).
Mar 06 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 5:54 PM, Jakob Ovrum wrote:
 While I prefer this approach most of the time, I fear it has a performance cost
 over sink-style algorithms (whether they use an output range or build an array
 result). I guess the question is whether this cost is generally offset by gains
 in the memory allocation code or not.
std.buffer.scopebuffer was developed for that purpose, and as the discussion on why it is the way it is shows, it was developed with extensive use of a profiler to ensure there was no speed compromise from using it.
 For example, we could implement toStringz() as an algorithm that looks like:

     auto toStringz(S)(S s) {
         return chain(s, repeat(0, 1))
     }
std.range.only is more suitable than `repeat` here
Yes, I missed that. I'm still learning about the proper way to use ranges.
 (and I don't know if `chain`
 would let you chain a range of characters with a range of integers):
My code fragment is untested and pretty much off the cuff. A real version would be a bit more complex, as you suggest.
 Either way, perhaps the most serious problem is that when using `copy` to write
 the result to an array, both UTF decoding and re-encoding takes place (the
 result is always a range of `dchar`).
I know. I struggled with that myself, and I object to the design decision that auto-decodes and auto-encodes any ranges that deal with char's. It automatically turns fast code into molasses. I wound up casting everything to ubytes, ugh. But it's too late to change that now, sigh.
Mar 06 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 01:26, Walter Bright пишет:
 A major goal for D in the short term is to reduce reliance in Phobos on
 the GC. I was looking at std.string last night, and I noticed a couple
 things:

 1. The inputs are constrained to being strings. This is overly
 restrictive, the inputs should be InputRanges.
Fun is most of std.algorithm is special-cased on strings. So we have double volume of crap. -- Dmitry Olshansky
Mar 07 2014