www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 9674] New: std.algorithm.filter problems with non-deterministic _input.front

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674

           Summary: std.algorithm.filter problems with non-deterministic
                    _input.front
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc


--- Comment #0 from bearophile_hugs eml.cc 2013-03-09 08:05:14 PST ---
This Python2.6 program:


from itertools import imap, ifilter
from sys import stdout
def fun(x):
    stdout.write("*")
    return x
list(ifilter(lambda x: True, imap(fun, xrange(5))))



Prints:

*****




While this similar D program:


import std.stdio: write;
import std.array: array;
import std.range: iota;
import std.algorithm: map, filter;
void main() {
    iota(5).map!((x) {
        write("*");
        return x;
    }).filter!(_ => true).array;
}


Prints:

**********



This is a problem if the function given to map is not deterministic, or if it's
costly to evaluate.




Removing both the iota() and the array() from the D code, using a counter, but
keeping both map() and filter():


import std.algorithm: map, filter;
int count = 0;
void main() {
    auto r1 = [0, 0, 0, 0, 0];
    auto r2 = map!((int x){ count++; return 0; })(r1);
    auto r3 = filter!((int y){ return true; })(r2);
    foreach (z; r3) {}
    assert(count == 10); // passes.
}




To better show what's going on, this is a shortened version of the filter(), it
calls pred() only n times, and it calls _input.front more often:


struct FilterResult(alias pred, Range) {
    Range _input;

    this(Range r) {
        _input = r;
        while (!_input.empty && !pred(_input.front)) {
            _input.popFront();
        }
    }

     property bool empty() { return _input.empty; }

    void popFront() {
        do {
            _input.popFront();
        } while (!_input.empty && !pred(_input.front));
    }

     property auto ref front() {
        return _input.front;
    }
}



Calling _input.front more often means it assumes _input.front is _like_ a pure
function. But generally in D (and Python) this is not true.

I think at least it should be documented in std.algorithm ddocs that filter()
assumes the front() of its argument range to act like a pure function (despite
there isn't type sytem enforcement of this). But maybe it's better to change
std.algorithm.filter.

This modified filter() caches _input.front and it works with not deterministic
_input.front functions too:



import std.stdio, std.range, std.algorithm, std.traits, std.random;

struct FilterResult2(alias pred, Range) {
    import std.traits: ForeachType;

    Range _input;
    ForeachType!Range rFront;

    this(Range r) {
        _input = r;

        while (!_input.empty) {
            this.rFront = _input.front;
            if (pred(this.rFront))
                break;
            _input.popFront();
        }
    }

     property bool empty() { return _input.empty; }

    void popFront() {
        do {
            _input.popFront();
            if (_input.empty)
                break;
            this.rFront = _input.front;
            if (pred(this.rFront))
                break;
        } while (true);
    }

     property auto ref front() {
        return this.rFront;
    }
}

FilterResult2!(pred, Range) filter2(alias pred, Range)(Range rs) {
    return FilterResult2!(pred, Range)(rs);
}

// Follows demo code------

auto distinct(Range)(Range r)
if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    });
}


/**
This is a useful function, it yields lazily the distinct items
from the input iterable.
It is similar to uniq, but doesn't need the input range to be sorted.
*/
auto distinct2(Range)(Range r)
if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter2!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    });
}

void main() {
    100.iota.map!(_ => uniform(0, 10))
    .distinct.array.sort.writeln; // Bad output.

    100.iota.map!(_ => uniform(0, 10))
    .distinct2.array.sort.writeln; // Good output.
}



What are the downsides of replacing std.algorithm.filter() with something
similar to this filter2() (with the missing methods, Unqual, etc)?


This topic was discussed at length in this thread:
http://forum.dlang.org/thread/fcbtejylrwxfploajuto forum.dlang.org

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 09 2013
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674


timon.gehr gmx.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timon.gehr gmx.ch


--- Comment #1 from timon.gehr gmx.ch 2013-03-09 15:20:21 PST ---
A solution that does not require caching front is the following:

struct Filter(alias a,R){
    private R source;
    // ...
    private bool processed = false;
    void process(){
        if(!processed)
            while(!source.empty && !a(source.front))
                source.popFront();
        processed = true;
    }
     property auto ref front(){ process(); return source.front; }
     property bool empty(){ process(); return source.empty; }
    void popFront(){ process(); source.popFront(); }
}

It also fixes the problem of filter not being properly lazy (i.e. currently the
function call might loop forever, even if no element is actually requested
later.)

Of course, this might be slower than the current solution in case the compiler
is unable to properly track the 'processed' field value.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #2 from timon.gehr gmx.ch 2013-03-09 15:31:40 PST ---
(In reply to comment #1)
 A solution that does not require caching front is the following:
 
 ...
     void popFront(){ process(); source.popFront(); }

This should obviously reset the flag: void popFront(){ process(); source.popFront(); processed=false; } -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #3 from bearophile_hugs eml.cc 2013-03-09 15:39:49 PST ---
I am trying your code, but I see duplications in the output:


import std.stdio: writeln;
import std.algorithm: map, sort;
import std.array: array;
import std.range: iota, isInputRange;
import std.random: uniform;
import std.traits: ForeachType;


struct FilterResult3(alias pred, Range) {
    private Range _input;

    this(Range r) {
        _input = r;
    }

    private bool processed = false;

    void process() {
        if (!processed)
            while (!_input.empty && !pred(_input.front))
                _input.popFront();
        processed = true;
    }

     property auto ref front() {
        process();
        return _input.front;
    }

     property bool empty() {
        process();
        return _input.empty;
    }

    void popFront() {
        process();
        _input.popFront();
        processed = false;
    }
}

FilterResult3!(pred, Range) filter3(alias pred, Range)(Range rs) {
    return FilterResult3!(pred, Range)(rs);
}


auto distinct3(Range)(Range r)
if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter3!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    });
}

void main() {
    100.iota.map!(_ => uniform(0, 10))
    .distinct3.array.sort.writeln; //
}

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #4 from timon.gehr gmx.ch 2013-03-09 15:49:49 PST ---
(In reply to comment #3)
 I am trying your code, but I see duplications in the output:
 ...
 
 void main() {
     100.iota.map!(_ => uniform(0, 10))
     .distinct3.array.sort.writeln; //
 }

The code depends on front of the source range being consistent across calls. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #5 from timon.gehr gmx.ch 2013-03-09 15:57:31 PST ---
(In reply to comment #4)
 ...
 
 The code depends on front of the source range being consistent across calls.

(If that is not what is wanted, a lot of caching would have to be introduced into most std.algorithm/std.range ranges.) -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #6 from bearophile_hugs eml.cc 2013-03-09 16:35:29 PST ---
(In reply to comment #4 and #5)

 The code depends on front of the source range being consistent across calls.

front() is not tagged with "pure", so you can't depend on that.
 (If that is not what is wanted, a lot of caching would have to be introduced

Otherwise I think the "pure" attribute needs be added in many of those Phobos places. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #7 from timon.gehr gmx.ch 2013-03-09 16:50:30 PST ---
(In reply to comment #6)
 (In reply to comment #4 and #5)
 
 The code depends on front of the source range being consistent across calls.

front() is not tagged with "pure", ...

How would that help? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674


Marco Leise <Marco.Leise gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Marco.Leise gmx.de


--- Comment #8 from Marco Leise <Marco.Leise gmx.de> 2013-03-09 18:08:45 PST ---
(In reply to comment #7)
 (In reply to comment #6)
 (In reply to comment #4 and #5)
 
 The code depends on front of the source range being consistent across calls.

front() is not tagged with "pure", ...

How would that help?

Pure _does_ make it consistent across calls. That's how it helps. :) And arguably a range's front property - in theory - fits that shoe. As the user of a range I'd typically expect front to be a dumb, pure property. In this case it is incorrect to call map with a mapping function that accesses global state for example. Or front has to cache the last result, so it doesn't need to ever be recalculated. But then practically every "front" property should be cached and it starts to look like a design flaw. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 09 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9674



--- Comment #9 from timon.gehr gmx.ch 2013-03-10 01:55:54 PST ---
(In reply to comment #8)
 (In reply to comment #7)
 (In reply to comment #6)
 (In reply to comment #4 and #5)
 
 The code depends on front of the source range being consistent across calls.

front() is not tagged with "pure", ...

How would that help?

Pure _does_ make it consistent across calls. That's how it helps. :)

Actually it does not. :) struct R{ int x; property int front()pure{ return x++; } } I guess you want const pure. (but there are still type system holes.)
 And arguably a range's front property - in theory - fits that shoe. As the user
 of a range I'd typically expect front to be a dumb, pure property.
 
 In this case it is incorrect to call map with a mapping function that accesses
 global state for example.

One issue is that recalculation may be expensive, or it might be cheap, or even removed by CSE.
 Or front has to cache the last result, so it doesn't
 need to ever be recalculated. But then practically every "front" property
 should be cached and it starts to look like a design flaw.

Agreed, it is not too clear what to do about this. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 10 2013