digitalmars.D - Iterator straw-man

Sean Kelly (145/145) Nov 07 2006 C++ supports pointers as iterators largely for compatibility with

Craig Black (10/10) Nov 07 2006 I do agree that using classes and interfaces to represent iterators is a...

Sean Kelly (6/19) Nov 07 2006 Yup. The reason for a proposal is twofold: to establish a common idiom
Bill Baxter (14/27) Nov 07 2006 I haven't read Seans big long proposal yet, but I think the method that

Sean Kelly (3/22) Nov 08 2006 I agree. In my proposal I used fwdIt and revIt for lack of a better ide...

Bill Baxter (44/81) Nov 07 2006 I don't think opApply should be the mechanism by which foreach accepts

Benji Smith (9/12) Nov 08 2006 Why the continued assumption that there are only two valid kinds of

Bill Baxter (14/28) Nov 08 2006 Because those two will be recognized specially by foreach and

Sean Kelly (20/117) Nov 08 2006 True enough. I mostly added it as an alternative for situations where

Bill Baxter (5/14) Nov 08 2006 That's true. With all current uses you could see the 'op' as synonymous...

Sean Kelly <sean f4.ca> writes:

C++ supports pointers as iterators largely for compatibility with 
existing/old code and because of the prevalence of pointer use in the 
language.  D however, is not burdened with the need for backwards 
compatibility, nor is pointer use at all common.  Languages without 
pointers typically implement iterators as a single object that is itself 
aware of the end of the referenced sequence.  But while this works quite 
well for sequential operations, it falls apart somewhat for random 
access operations, particularly in languages lacking operator overloading.

Unlike C++, D contains a fairly robust built-in dynamic array type. 
These arrays may be passed by value efficiently and contain both 
built-in property methods and user extension of property semantics using 
the following convention:

     void mutate( char[] c, int i );
     char[] c;
     c.mutate( 0 );

D also supports a slice syntax which is roughly comparable to a 
self-contained random access iterator.  Thus "abcdef"[0 .. 2] returns a 
dynamic array type supporting all the typical array operations, 
including the length property for determining an end point.  Slicing is 
supported for user-defined types as well, so this method may be applied 
to any container supporting random access.

Given the presence of D's slice syntax, it would be a shame to ignore it 
in favor of pointer-style random access iterators for algorithm use. 
And while the idea of a single iterator design for all access methods 
(sequential and random access) seems vaguely appealing, it relies 
heavily on traits templates to allow proper function overloading for 
different iterator categories.  In fact, overloading template functions 
for different iterator categories is so annoying that the C++ committee 
is adding concept checking to C++ 0x to simplify the implementation of 
such overloaded template functions.

I believe D's slice syntax offers a perfect opportunity to provide a 
reasonable, intuitive syntax for robust "smart" random-access iterators. 
  The built-in array type will work as-is, and a user-defined random 
access container could support the same syntax by providing opIndex, 
opIndexAssign, opSlice, and opSliceAssign for both the container and 
iterator types.  Obtaining an iterator on a random access container, 
then, would be done via a slice operation:

     auto i = myCont[0 .. $];
     foreach( e; i ) {} // sequential access
     auto e = i[0];     // random access

The remaining iterator categories in C++ are essentially variations of 
forward iterators and bidirectional iterators, and could roughly be 
described as requiring a means of advancing, testing whether the 
iterator references a valid element, and of accessing the referenced 
element.  Testing validity in C++ is done by comparing the data iterator 
with an "end" iterator for equality, as again, this syntax works with 
pointers as well.  But using two separate iterators for defining the 
bounds of a range:

     * Is cumbersome.  It means passing two parameters to every function 
requiring an iterator instead of one.

     * Is dangerous.  Comparing the "end" iterator of one range with the 
data iterator of another range is legal if the data (and possibly 
container) types match, and can result in an attempt to read/write past 
the end of the referenced sequence.

     * Is unnecessary.  It is more natural, and more common, to perform 
random access operations via array index operations, which leaves 
typical iterator use largely restricted to sequential access.  This 
being the case, why support pointers as valid iterators?

What follows is a rough description of a "smart" read-only iterator for D:

interface Iterator<T>
{
     bool atEnd();
     T next();
     T curr();
     int opApply( int delegate( T ) dg );
}

The nuances are unimportant, but the basic idea is that any such 
iterator should provide a means of advancing, testing whether the 
iterator references a valid element, and of referencing that element. 
Also, opApply is supported to allow iterators to be used in foreach 
operations for simple uses.

Assuming such a design, the iterator type determines progress through 
the container (forward, reverse, stepped, etc) rather than the algorithm 
being applied.  Here is a straightforward implementation of some 
algorithms using such iterators, 'C' is a container type for demonstration:

     bool contains( C cont, E match ) {
         foreach( e; cont ) {
             if( e == match )
                 return true;
         }
         return false;
     }

     I find_first( C cont, E match ) {
         auto i = cont.fwdIt;
         foreach( e; i ) {
             if( e == match )
                 return i;
         }
         return i;
     }

     I find_last( C cont, E match ) {
         auto i = cont.revIt;
         foreach( e; i ) {
             if( e == match )
                 return i;
         }
         return i;
     }

Given the above, it seems awkward that the iterator must be declared 
outside of foreach so a copy may be made to return the position of the 
found element.  Therefore, foreach should allow an arbitrary key type to 
be returned by opApply.  For array operations, the key could continue to 
be an integer representing the array postion., but in sequence 
operations the key could be an iterator representing the current 
position.  Thus, the following operations would be equivalent:

1)  auto i = cont.fwdIt;
     foreach( e; i ) {
         if( e == match )
             return i;
     }

2)  foreach( i, e; cont.fwdIt ) {
         if( e == match )
             return i;
     }

3)  foreach( i, e; cont ) {
         if( e == match )
             return i;
     }

This being the case, I believe iterators must be copyable in a generic 
manner, and ideally, copies must be relatively efficient to perform. 
Therefore, I suggest the addition of a .dup() property to the standard 
iterator interface to flatten the differences between class and 
struct-based iterators, and it may be advisable to use structs as 
iterators in the standard case and forget about classes and interfaces 
altogether.  That said, if a reference to the current iterator could be 
passed around within foreach instead of a copy being made for each 
iteration, then much of the cost could be eliminated.  This should 
easily be possible using the same auto-dereferencing semantics used by 
"inout" parameters in D already.

So in summation, D adequately represents "smart" random access iterators 
via array slice semantics and sequential iteration is best represented 
using a single "smart" iterator object as opposed to a pair of "dumb" 
iterators as per C++.  Finally, the example code above suggests that the 
need for a foreach_reverse statement may not be necessary if fwdIt and 
revIt properties are provided for built-in arrays.  Though the only 
functional difference between:

     foreach( i, e; array )

and

     foreach( i, e; array.fwdIt )

would be the "key" type returned in 'i'.  In the first case it would be 
an integer index and in the second it would be an iterator.

As an aside, foreach_reverse may remain useful in instances where we 
want the index of the current element instead of an iterator.  Or array 
iterators may provide a means to obtain the relevant index.  Perhaps 
someone could make an argument for or against based on these ideas.

Nov 07 2006

"Craig Black" <cblack ara.com> writes:

I do agree that using classes and interfaces to represent iterators is a bad 
idea for performance reasons.  I also agree that I don't think it's 
necessary to support raw pointers.

IMO iterators should be structs such that iteration can be performed as 
follows:

for(auto i = container.begin(); i.isValid(); i.next())  write(i.value);

If I am not mistaken, this can already be done in D.  With foreach support 
this would simplify to.

foreach(i; container)  write(i);

-Craig

Nov 07 2006

Sean Kelly <sean f4.ca> writes:

Craig Black wrote:
 I do agree that using classes and interfaces to represent iterators is a bad 
 idea for performance reasons.  I also agree that I don't think it's 
 necessary to support raw pointers.
 
 IMO iterators should be structs such that iteration can be performed as 
 follows:
 
 for(auto i = container.begin(); i.isValid(); i.next())  write(i.value);
 
 If I am not mistaken, this can already be done in D.  With foreach support 
 this would simplify to.
 
 foreach(i; container)  write(i);

Yup.  The reason for a proposal is twofold: to establish a common idiom 
for people to follow so code can interact reliably, and to determine 
whether any built-in behavior needs to be tweaked to support iterators 
in a way this is both generic (for UDTs, AAs, and arrays) and elegant.


Sean

Nov 07 2006

Bill Baxter <dnewsgroup billbaxter.com> writes:

Craig Black wrote:
 I do agree that using classes and interfaces to represent iterators is a bad 
 idea for performance reasons.  I also agree that I don't think it's 
 necessary to support raw pointers.
 
 IMO iterators should be structs such that iteration can be performed as 
 follows:
 
 for(auto i = container.begin(); i.isValid(); i.next())  write(i.value);
 
 If I am not mistaken, this can already be done in D.  With foreach support 
 this would simplify to.
 
 foreach(i; container)  write(i);

I haven't read Seans big long proposal yet, but I think the method that 
returns an iterator should not be called 'begin'.  I should be called 
'forward' or 'iterator' or 'iter' or 'forward_iterator' or something 
like that.

Then i would hope that all of the following would be possible:

    foreach(i; container.iter)  write(i);
    foreach(i; container.reverse_iter)  write(i);
    foreach(i; container.my_skip_every_prime_number_iter) write(i);

foreach(i; container) can look for the .iter (or opIterator, or whatever 
standard name is decided upon).

foreach_reverse(i; container) [shudder] can look for the .riter (or 
opIteratorReverse, or whatever).

--bb

Nov 07 2006

Sean Kelly <sean f4.ca> writes:

Bill Baxter wrote:
 Craig Black wrote:
 I do agree that using classes and interfaces to represent iterators is 
 a bad idea for performance reasons.  I also agree that I don't think 
 it's necessary to support raw pointers.

 IMO iterators should be structs such that iteration can be performed 
 as follows:

 for(auto i = container.begin(); i.isValid(); i.next())  write(i.value);

 If I am not mistaken, this can already be done in D.  With foreach 
 support this would simplify to.

 foreach(i; container)  write(i);

 
 I haven't read Seans big long proposal yet, but I think the method that 
 returns an iterator should not be called 'begin'.  I should be called 
 'forward' or 'iterator' or 'iter' or 'forward_iterator' or something 
 like that.

I agree.  In my proposal I used fwdIt and revIt for lack of a better idea.


Sean

Nov 08 2006

Bill Baxter <dnewsgroup billbaxter.com> writes:

Sean Kelly wrote:

I mostly agree with you.  But have some comments on the specifics.

 What follows is a rough description of a "smart" read-only iterator for D:
 
 interface Iterator<T>
 {
     bool atEnd();
     T next();
     T curr();
     int opApply( int delegate( T ) dg );
 }
 
 The nuances are unimportant, but the basic idea is that any such 
 iterator should provide a means of advancing, testing whether the 
 iterator references a valid element, and of referencing that element. 
 Also, opApply is supported to allow iterators to be used in foreach 
 operations for simple uses.

I don't think opApply should be the mechanism by which foreach accepts 
iterators.  It's just too convoluted and too much a burden on iterator 
implementors.  And it's not going to be as efficient as a raw for-loop, 
either.

 Assuming such a design, the iterator type determines progress through 
 the container (forward, reverse, stepped, etc) rather than the algorithm 
 being applied.  Here is a straightforward implementation of some 
 algorithms using such iterators, 'C' is a container type for demonstration:
 
     bool contains( C cont, E match ) {
         foreach( e; cont ) {
             if( e == match )
                 return true;
         }
         return false;
     }

It should be possible to make a template out of that, too.

      bool contains(C,E)( C cont, E match ) {
          foreach( e; cont ) {
              if( e == match )
                  return true;
          }
          return false;
      }

Current compiler limitations aside, that *should* be reducible to the 
same efficiency as a for loop with simple pointer (or index) manipulation.

 [...] 
 
 Given the above, it seems awkward that the iterator must be declared 
 outside of foreach so a copy may be made to return the position of the 
 found element.  Therefore, foreach should allow an arbitrary key type to 
 be returned by opApply.  

I like that.  It makes sense.  The iterator itself is analogous to the 
int index when foreach-ing arrays.  That seems very logical.

Given that similarity it seems like it would make sense for indexing a 
container by an iterator to work as well:

foreach( i, e; cont.fwdIt ) {
       if( e == match ) {
           writefln("Matched item ", cont[i]);
           return i;
       }
}

I.e. cont[i] becomes synonymous with i.curr().

That would be easy to arrange if opIndex was allowed to take an iterator 
as the index.  (I think it's a good idea to remove the restriction on 
opIndex arguments in any event, but this is another justification.)

An opIndex for an iterator would take the simple form:

    static T opIndex(ContIterator i) { return i.curr(); }

If you had write-able iterator with say a .set(T v) method, then you 
could also have opIndexAssign:

    static T opIndexAssign(ContIterator i, T x) { return i.set(x); }

If the Iterator type has a pointer to its container, you could make the 
methods non-static and assert(this==i.container).

 As an aside, foreach_reverse may remain useful in instances where we 
 want the index of the current element instead of an iterator.  Or array 
 iterators may provide a means to obtain the relevant index.  Perhaps 
 someone could make an argument for or against based on these ideas.

foreach_reverse(i,x; cont) would remain (marginally) useful as a 
shortcut for the special reverse iterator name:  I.e. a synonym for
     foreach(i,x; cont.revIter)

I'm guessing the specially-ordained iterator method names will probably 
end up being opIter and opIterReverse (or maybe same with 'Iterator' 
spelled out) just following the pattern established so far.

--bb

Nov 07 2006

Benji Smith <dlanguage benjismith.net> writes:

Bill Baxter wrote:
 I'm guessing the specially-ordained iterator method names will probably 
 end up being opIter and opIterReverse (or maybe same with 'Iterator' 
 spelled out) just following the pattern established so far.

Why the continued assumption that there are only two valid kinds of 
iterator? A hashtable might have forward and reverse iterators for both 
its keys and its values, thereby requiring four distinct iterators. A 
tree might have depth-first, breadth-first, in-order, and reverse-order 
iterators. Any collection class might have a random-order iterator. I 
think it's awfully presumptuous to bake the notion of forward and 
reverse iteration directly into the language.

--benji

Nov 08 2006

Bill Baxter <wbaxter gmail.com> writes:

Benji Smith wrote:
 Bill Baxter wrote:
 
 I'm guessing the specially-ordained iterator method names will 
 probably end up being opIter and opIterReverse (or maybe same with 
 'Iterator' spelled out) just following the pattern established so far.

 
 
 Why the continued assumption that there are only two valid kinds of 
 iterator? 

Because those two will be recognized specially by foreach and 
foreach_reverse without having to specify a method name.  But you'll 
still be able to pass in a specific iterator of whatever kind you wish.

A hashtable might have forward and reverse iterators for both
 its keys and its values, thereby requiring four distinct iterators. 

Yep, for hashtable you may have to name them all explicitly:
    foreach(k; ht.keys)
    foreach(v; ht.values)
    foreach(k; ht.keysReverse)
    foreach(v; ht.valuesReverse)

A
 tree might have depth-first, breadth-first, in-order, and reverse-order 
 iterators. Any collection class might have a random-order iterator. I 
 think it's awfully presumptuous to bake the notion of forward and 
 reverse iteration directly into the language.

Can't say I disagree with you, but it's already there with foreach and 
foreach_reverse and opApply/opApplyReverse.

--bb

Nov 08 2006

Sean Kelly <sean f4.ca> writes:

Bill Baxter wrote:
 Sean Kelly wrote:
 
 I mostly agree with you.  But have some comments on the specifics.
 
 What follows is a rough description of a "smart" read-only iterator 
 for D:

 interface Iterator<T>
 {
     bool atEnd();
     T next();
     T curr();
     int opApply( int delegate( T ) dg );
 }

 The nuances are unimportant, but the basic idea is that any such 
 iterator should provide a means of advancing, testing whether the 
 iterator references a valid element, and of referencing that element. 
 Also, opApply is supported to allow iterators to be used in foreach 
 operations for simple uses.

 
 I don't think opApply should be the mechanism by which foreach accepts 
 iterators.  It's just too convoluted and too much a burden on iterator 
 implementors.  And it's not going to be as efficient as a raw for-loop, 
 either.

True enough.  I mostly added it as an alternative for situations where 
users didn't need the power of a manual loop.  And I think 
implementation would be pretty simple for most/all iterators:

     for( ; !atEnd(); next() )
     {
         ret = dg( curr() );
         // handle ret
     }

 Assuming such a design, the iterator type determines progress through 
 the container (forward, reverse, stepped, etc) rather than the 
 algorithm being applied.  Here is a straightforward implementation of 
 some algorithms using such iterators, 'C' is a container type for 
 demonstration:

     bool contains( C cont, E match ) {
         foreach( e; cont ) {
             if( e == match )
                 return true;
         }
         return false;
     }

 
 It should be possible to make a template out of that, too.
 
      bool contains(C,E)( C cont, E match ) {
          foreach( e; cont ) {
              if( e == match )
                  return true;
          }
          return false;
      }
 
 Current compiler limitations aside, that *should* be reducible to the 
 same efficiency as a for loop with simple pointer (or index) manipulation.

Yup.  The function above was meant for illustrating an idea more than it 
was a proposal for how to implement a contains function :-)

 [...]
 Given the above, it seems awkward that the iterator must be declared 
 outside of foreach so a copy may be made to return the position of the 
 found element.  Therefore, foreach should allow an arbitrary key type 
 to be returned by opApply.  

 
 I like that.  It makes sense.  The iterator itself is analogous to the 
 int index when foreach-ing arrays.  That seems very logical.
 
 Given that similarity it seems like it would make sense for indexing a 
 container by an iterator to work as well:
 
 foreach( i, e; cont.fwdIt ) {
       if( e == match ) {
           writefln("Matched item ", cont[i]);
           return i;
       }
 }
 
 I.e. cont[i] becomes synonymous with i.curr().

Hrm, good point.

 That would be easy to arrange if opIndex was allowed to take an iterator 
 as the index.  (I think it's a good idea to remove the restriction on 
 opIndex arguments in any event, but this is another justification.)

Is it currently required to be an integer?  I had no idea.

 An opIndex for an iterator would take the simple form:
 
    static T opIndex(ContIterator i) { return i.curr(); }
 
 If you had write-able iterator with say a .set(T v) method, then you 
 could also have opIndexAssign:
 
    static T opIndexAssign(ContIterator i, T x) { return i.set(x); }
 
 If the Iterator type has a pointer to its container, you could make the 
 methods non-static and assert(this==i.container).

All good ideas.

 As an aside, foreach_reverse may remain useful in instances where we 
 want the index of the current element instead of an iterator.  Or 
 array iterators may provide a means to obtain the relevant index.  
 Perhaps someone could make an argument for or against based on these 
 ideas.

 
 foreach_reverse(i,x; cont) would remain (marginally) useful as a 
 shortcut for the special reverse iterator name:  I.e. a synonym for
     foreach(i,x; cont.revIter)

True enough.

 I'm guessing the specially-ordained iterator method names will probably 
 end up being opIter and opIterReverse (or maybe same with 'Iterator' 
 spelled out) just following the pattern established so far.

Following the D convention, the 'op' prefix suggests a use form 
different from that of a normal function call.  Personally, I'd be fine 
with this done as plain old property methods: fwdIt/revIt, or something 
along those lines.


Sean

Nov 08 2006

Bill Baxter <wbaxter gmail.com> writes:

Sean Kelly wrote:
 Bill Baxter wrote:
 
 Sean Kelly wrote:


 
 Following the D convention, the 'op' prefix suggests a use form 
 different from that of a normal function call.  Personally, I'd be fine 
 with this done as plain old property methods: fwdIt/revIt, or something 
 along those lines.

That's true.  With all current uses you could see the 'op' as synonymous 
with "you ain't never gonna call this directly".   So something intended 
to be called directly doesn't necessarily have to follow that convention.

--bb

Nov 08 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Iterator straw-man