digitalmars.D - efficient input range buffering

Martin Nowak (12/12) Oct 02 2011 I've written a wrapper to promote input ranges to buffered forward range...

Dmitry Olshansky (23/34) Oct 02 2011 Like it, but one thing catches my eye - why use GC for blocks when the

Robert Jacques (2/16) Oct 02 2011 Consider that the elements could contain pointers, etc.

Dmitry Olshansky (6/26) Oct 02 2011 Mmh.. You are right, I was thinking I/O as soon as I seen line about

Martin Nowak (21/60) Oct 02 2011 I've added that, didn't before as it's always quite a lot of boilerplate...

Dmitry Olshansky (19/86) Oct 02 2011 IMO implicit save is nice and well as long as it stays cheap e.g.

travert phare.normalesup.org (Christophe) (3/19) Oct 02 2011 I'll be glad to have a look at this code, but a minimum of documentation...
bearophile (25/26) Oct 02 2011 To avoid the GC to waste time following the linked list, isn't it better...
Timon Gehr (20/32) Oct 02 2011 I have implemented something similar (but it is not generic). It uses a

Martin Nowak (22/61) Oct 02 2011 Which is about the same as I did where free memory blocks are recycled

Timon Gehr (8/48) Oct 02 2011 It does not. The only position that is important is that of the very

Martin Nowak (3/57) Oct 02 2011 I see, little more elaborate than I guessed from your description.

"Martin Nowak" <dawg dawgfoto.de> writes:

I've written a wrapper to promote input ranges to buffered forward ranges.
It allows to write range agnostic lexers/parsers with infinite lookahead.

Buffering is done through a singly linked list of memory blocks that are  
reference counted.
Each saved range holds a reference to all future blocks.
Blocks are recycled when being no longer used.

https://gist.github.com/1257196

There is a major issue with the somewhat broken implicit-save-through-copy  
concept.
A lot of copies in function parameters, foreach loops etc. will also create
references and thus can be easily responsible for inefficiencies.

martin

Oct 02 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the 
whole thing is already refcounted? Straight malloc/free would be a 
better fit. Certainly it may use an allocator when we have them.

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.

While the concept of ForwardRange is quite elegant, I found myself 
wondering on occasion that InputRange with restore could be way more 
efficient. Consider:

auto r = range();
auto save_it = r; ///at least full bitwise copy and/or refCounting
if(do_stuff(r))
	...
else
    r = save_it;//another one
...

Compared to:

auto save_pt = r.savepoint(); //different object, minimal overhead
if(do_stuff(r))
	...
else
    r.restore(save_pt);//range knows how to do this efficiently
...

-- 
Dmitry Olshansky

Oct 02 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Sun, 02 Oct 2011 10:46:38 -0400, Dmitry Olshansky <dmitry.olsh gmail.com>
wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 Like it, but one thing catches my eye - why use GC for blocks when the
 whole thing is already refcounted? Straight malloc/free would be a
 better fit. Certainly it may use an allocator when we have them.

Consider that the elements could contain pointers, etc.

Oct 02 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02.10.2011 21:06, Robert Jacques wrote:
 On Sun, 02 Oct 2011 10:46:38 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 Like it, but one thing catches my eye - why use GC for blocks when the
 whole thing is already refcounted? Straight malloc/free would be a
 better fit. Certainly it may use an allocator when we have them.

 Consider that the elements could contain pointers, etc.

Mmh.. You are right, I was thinking I/O as soon as I seen line about 
lexers/parsers. Though it could be an optimization if hasIndirections!T 
is false.

-- 
Dmitry Olshansky

Oct 02 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 02 Oct 2011 16:46:38 +0200, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward  
 ranges.
 It allows to write range agnostic lexers/parsers with infinite  
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 Like it, but one thing catches my eye - why use GC for blocks when the  
 whole thing is already refcounted? Straight malloc/free would be a  
 better fit. Certainly it may use an allocator when we have them.

I've added that, didn't before as it's always quite a lot of boilerplate  
to get this right.

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also  
 create
 references and thus can be easily responsible for inefficiencies.

 While the concept of ForwardRange is quite elegant, I found myself  
 wondering on occasion that InputRange with restore could be way more  
 efficient. Consider:

 auto r = range();
 auto save_it = r; ///at least full bitwise copy and/or refCounting
 if(do_stuff(r))
 	...
 else
     r = save_it;//another one
 ...

 Compared to:

 auto save_pt = r.savepoint(); //different object, minimal overhead
 if(do_stuff(r))
 	...
 else
     r.restore(save_pt);//range knows how to do this efficiently
 ...

Looks interesting and clearly has some advantages.
I personally tend to alway write .save when I intend so and
thought that way more phobos functions should take ranges by auto ref  
(i.e. actually advance them).
But I recapture from an earlier discussion that most people liked the  
implicit save
and some proposed to ditch save at all.

How would restore allow to actually duplicate the range at different  
positions.
I think a pretty common use case for forward ranges is backtracking:
auto start = range.save;
popWhile!pred(range);
doSomeThing(start, range);

Will I need to be able to do this?
auto range2 = range;
range.restore(start);
doSomeThing(range, range2);

Oct 02 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02.10.2011 21:29, Martin Nowak wrote:
 On Sun, 02 Oct 2011 16:46:38 +0200, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 Like it, but one thing catches my eye - why use GC for blocks when the
 whole thing is already refcounted? Straight malloc/free would be a
 better fit. Certainly it may use an allocator when we have them.

 I've added that, didn't before as it's always quite a lot of boilerplate
 to get this right.

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

 While the concept of ForwardRange is quite elegant, I found myself
 wondering on occasion that InputRange with restore could be way more
 efficient. Consider:

 auto r = range();
 auto save_it = r; ///at least full bitwise copy and/or refCounting
 if(do_stuff(r))
 ...
 else
 r = save_it;//another one
 ...

 Compared to:

 auto save_pt = r.savepoint(); //different object, minimal overhead
 if(do_stuff(r))
 ...
 else
 r.restore(save_pt);//range knows how to do this efficiently
 ...

 Looks interesting and clearly has some advantages.
 I personally tend to alway write .save when I intend so and
 thought that way more phobos functions should take ranges by auto ref
 (i.e. actually advance them).

And .save is sort of the right thing to do. But I never bothered using it.

 But I recapture from an earlier discussion that most people liked the
 implicit save
 and some proposed to ditch save at all.

IMO implicit save is nice and well as long as it stays cheap e.g. 
arrays, plain in-memory stuff.
Once our hands are on efficiency things like buffered files it's all 
about being explicit enough.

 How would restore allow to actually duplicate the range at different
 positions.
 I think a pretty common use case for forward ranges is backtracking:

In your example it's duplication being used to do backtracking, it 
doesn't need to. There are use cases where you don't need a separate 
copy at all.

 auto start = range.save;
 popWhile!pred(range);
 doSomeThing(start, range);

For better or worse my idea is to ditch duplication ability completely. 
That's why I've been talking about InputRange specifically, it's sort of 
extension on it.
Now if we wanted to, it could be extend to forward ranges so they can do 
both: restore states and be duplicated.

 Will I need to be able to do this?

No the original intent is make new kind of InputRange that can go "back" 
to some state.
Thus:

 auto range2 = range; //same object
 range.restore(start); //resets to start
 doSomeThing(range, range2); // call with 2 identical objects at start, yikes!

-- 
Dmitry Olshansky

Oct 02 2011

travert phare.normalesup.org (Christophe) writes:

"Martin Nowak" , dans le message (digitalmars.D:145927), a écrit :
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.
 
 Buffering is done through a singly linked list of memory blocks that are  
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.
 
 https://gist.github.com/1257196
 
 There is a major issue with the somewhat broken implicit-save-through-copy  
 concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.
 
 martin

I'll be glad to have a look at this code, but a minimum of documentation 
would be nice.

Oct 02 2011

bearophile <bearophileHUGS lycos.com> writes:

Martin Nowak:

 https://gist.github.com/1257196

To avoid the GC to waste time following the linked list, isn't it better to use
a dynamic array of pointers (or a dynamic array of arrays) (so it becomes a
flat data structure)?

And:


version (unittest)
{
    import std.stdio, std.algorithm;
}

unittest
{
    static struct InputRange
    {
        this(size_t *pn)
        {
            _pn = pn;
        }

        bool empty() const { return *_pn == 1_000_000; }

I suggest ==>

unittest
{
    import std.stdio, std.algorithm;

    static struct InputRange
    {
        bool empty() const { return *_pn == 1_000_000; }

Bye,
bearophile

Oct 02 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.

 martin

I have implemented something similar (but it is not generic). It uses a 
dynamic circular buffer on top of a single dynamic array. I believe it 
is more efficient that way for usual parsing tasks. (and uses less lines 
of code) The clients of the range have to explicitly state that they 
want to go back to a certain state. It works in LIFO order, which is 
sufficient for parsing with lookahead.

auto a=r.pushAnchor(); // remember current state in a
r.popFront(); // do some lookahead
auto b=r.pushAnchor(); // remember current state in b
r.popFront(); // do some lookahead
r.popFront(); // ditto
r.popAnchor(b); // go back to b
r.popFront(); // do lookahead again
r.popAnchor(a); // go back to a

The buffer will dynamically be resized if a lot of lookahead is required 
and be reused otherwise. It has minimal overhead and it uses the 
execution stack to store the states (just indexes into the original 
range). The final buffer size is less than twice the largest lookahead 
required.

Oct 02 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward  
 ranges.
 It allows to write range agnostic lexers/parsers with infinite  
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also  
 create
 references and thus can be easily responsible for inefficiencies.

 martin

 I have implemented something similar (but it is not generic). It uses a  
 dynamic circular buffer on top of a single dynamic array. I believe it  
 is more efficient that way for usual parsing tasks. (and uses less lines  
 of code) The clients of the range have to explicitly state that they  
 want to go back to a certain state. It works in LIFO order, which is  
 sufficient for parsing with lookahead.

Which is about the same as I did where free memory blocks are recycled
and appended to the front. So to say a cyclic singly linked list.
Only you have two performance issues when using a ring buffer based on  
arrays.
  - The head reader needs to check if he can overwrite a location for every  
element.
  - When resizing the buffer you need to copy the already read data  
(possibly causing costly copy ctor invocations).

With block based reads you never need to copy read data and it also frees  
from
checking for overwrites until the block end is reached.

 auto a=r.pushAnchor(); // remember current state in a
 r.popFront(); // do some lookahead
 auto b=r.pushAnchor(); // remember current state in b
 r.popFront(); // do some lookahead
 r.popFront(); // ditto
 r.popAnchor(b); // go back to b
 r.popFront(); // do lookahead again
 r.popAnchor(a); // go back to a

 The buffer will dynamically be resized if a lot of lookahead is required  
 and be reused otherwise. It has minimal overhead and it uses the  
 execution stack to store the states (just indexes into the original  
 range). The final buffer size is less than twice the largest lookahead  
 required.

I also begin to think that the implicit save is not very useful and we  
should
rather use a paired restore.
Maybe the ForwardRange concept can be extended to
is(typeof(range.save) == typeof(range)) ||  
is(typeof(range.restore(range.save))).
But as Dmitry pointed out it is not always possible to allow duplication
so maybe this requires a new concept of restoreable InputRange, something  
in between
InputRange and ForwardRange.

Oct 02 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/03/2011 01:14 AM, Martin Nowak wrote:
 On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

 martin

 I have implemented something similar (but it is not generic). It uses
 a dynamic circular buffer on top of a single dynamic array. I believe
 it is more efficient that way for usual parsing tasks. (and uses less
 lines of code) The clients of the range have to explicitly state that
 they want to go back to a certain state. It works in LIFO order, which
 is sufficient for parsing with lookahead.

 Which is about the same as I did where free memory blocks are recycled
 and appended to the front. So to say a cyclic singly linked list.
 Only you have two performance issues when using a ring buffer based on
 arrays.
 - The head reader needs to check if he can overwrite a location for
 every element.

It does not. The only position that is important is that of the very 
first anchor, therefore there is only one check per buffer refill. (the 
range saves the very first anchor, as well as how many anchors are around).

 - When resizing the buffer you need to copy the already read data
 (possibly causing costly copy ctor invocations).

The buffer is always doubled in size, quickly reaching the maximal size 
needed (I start with a quite large buffer, usually it is never resized). 
There are no invocations of the copy ctor, because the data can be moved.

 With block based reads you never need to copy read data and it also
 frees from
 checking for overwrites until the block end is reached.

I never need to copy data, and moving data only happens on degenerate input.

Oct 02 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 03 Oct 2011 03:12:00 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/03/2011 01:14 AM, Martin Nowak wrote:
 On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that  
 are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

 martin

 I have implemented something similar (but it is not generic). It uses
 a dynamic circular buffer on top of a single dynamic array. I believe
 it is more efficient that way for usual parsing tasks. (and uses less
 lines of code) The clients of the range have to explicitly state that
 they want to go back to a certain state. It works in LIFO order, which
 is sufficient for parsing with lookahead.

 Which is about the same as I did where free memory blocks are recycled
 and appended to the front. So to say a cyclic singly linked list.
 Only you have two performance issues when using a ring buffer based on
 arrays.
 - The head reader needs to check if he can overwrite a location for
 every element.

 It does not. The only position that is important is that of the very  
 first anchor, therefore there is only one check per buffer refill. (the  
 range saves the very first anchor, as well as how many anchors are  
 around).

 - When resizing the buffer you need to copy the already read data
 (possibly causing costly copy ctor invocations).

 The buffer is always doubled in size, quickly reaching the maximal size  
 needed (I start with a quite large buffer, usually it is never resized).  
 There are no invocations of the copy ctor, because the data can be moved.

 With block based reads you never need to copy read data and it also
 frees from
 checking for overwrites until the block end is reached.

 I never need to copy data, and moving data only happens on degenerate  
 input.

I see, little more elaborate than I guessed from your description.
Is the code public?

Oct 02 2011

D Programming

C/C++ Programming

Other

digitalmars.D - efficient input range buffering