www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - efficient input range buffering

reply "Martin Nowak" <dawg dawgfoto.de> writes:
I've written a wrapper to promote input ranges to buffered forward ranges.
It allows to write range agnostic lexers/parsers with infinite lookahead.

Buffering is done through a singly linked list of memory blocks that are  
reference counted.
Each saved range holds a reference to all future blocks.
Blocks are recycled when being no longer used.

https://gist.github.com/1257196

There is a major issue with the somewhat broken implicit-save-through-copy  
concept.
A lot of copies in function parameters, foreach loops etc. will also create
references and thus can be easily responsible for inefficiencies.

martin
Oct 02 2011
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the whole thing is already refcounted? Straight malloc/free would be a better fit. Certainly it may use an allocator when we have them.
 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.

While the concept of ForwardRange is quite elegant, I found myself wondering on occasion that InputRange with restore could be way more efficient. Consider: auto r = range(); auto save_it = r; ///at least full bitwise copy and/or refCounting if(do_stuff(r)) ... else r = save_it;//another one ... Compared to: auto save_pt = r.savepoint(); //different object, minimal overhead if(do_stuff(r)) ... else r.restore(save_pt);//range knows how to do this efficiently ... -- Dmitry Olshansky
Oct 02 2011
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02.10.2011 21:06, Robert Jacques wrote:
 On Sun, 02 Oct 2011 10:46:38 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the whole thing is already refcounted? Straight malloc/free would be a better fit. Certainly it may use an allocator when we have them.

Consider that the elements could contain pointers, etc.

Mmh.. You are right, I was thinking I/O as soon as I seen line about lexers/parsers. Though it could be an optimization if hasIndirections!T is false. -- Dmitry Olshansky
Oct 02 2011
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02.10.2011 21:29, Martin Nowak wrote:
 On Sun, 02 Oct 2011 16:46:38 +0200, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the whole thing is already refcounted? Straight malloc/free would be a better fit. Certainly it may use an allocator when we have them.

to get this right.
 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

While the concept of ForwardRange is quite elegant, I found myself wondering on occasion that InputRange with restore could be way more efficient. Consider: auto r = range(); auto save_it = r; ///at least full bitwise copy and/or refCounting if(do_stuff(r)) ... else r = save_it;//another one ... Compared to: auto save_pt = r.savepoint(); //different object, minimal overhead if(do_stuff(r)) ... else r.restore(save_pt);//range knows how to do this efficiently ...

Looks interesting and clearly has some advantages. I personally tend to alway write .save when I intend so and thought that way more phobos functions should take ranges by auto ref (i.e. actually advance them).

And .save is sort of the right thing to do. But I never bothered using it.
 But I recapture from an earlier discussion that most people liked the
 implicit save
 and some proposed to ditch save at all.

IMO implicit save is nice and well as long as it stays cheap e.g. arrays, plain in-memory stuff. Once our hands are on efficiency things like buffered files it's all about being explicit enough.
 How would restore allow to actually duplicate the range at different
 positions.
 I think a pretty common use case for forward ranges is backtracking:

In your example it's duplication being used to do backtracking, it doesn't need to. There are use cases where you don't need a separate copy at all.
 auto start = range.save;
 popWhile!pred(range);
 doSomeThing(start, range);

For better or worse my idea is to ditch duplication ability completely. That's why I've been talking about InputRange specifically, it's sort of extension on it. Now if we wanted to, it could be extend to forward ranges so they can do both: restore states and be duplicated.
 Will I need to be able to do this?

No the original intent is make new kind of InputRange that can go "back" to some state. Thus:
 auto range2 = range; //same object
 range.restore(start); //resets to start
 doSomeThing(range, range2); // call with 2 identical objects at start, yikes!

-- Dmitry Olshansky
Oct 02 2011
prev sibling next sibling parent travert phare.normalesup.org (Christophe) writes:
"Martin Nowak" , dans le message (digitalmars.D:145927), a écrit :
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.
 
 Buffering is done through a singly linked list of memory blocks that are  
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.
 
 https://gist.github.com/1257196
 
 There is a major issue with the somewhat broken implicit-save-through-copy  
 concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.
 
 martin

I'll be glad to have a look at this code, but a minimum of documentation would be nice.
Oct 02 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 02 Oct 2011 10:46:38 -0400, Dmitry Olshansky <dmitry.olsh gmail.com>
wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the whole thing is already refcounted? Straight malloc/free would be a better fit. Certainly it may use an allocator when we have them.

Consider that the elements could contain pointers, etc.
Oct 02 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Martin Nowak:

 https://gist.github.com/1257196

To avoid the GC to waste time following the linked list, isn't it better to use a dynamic array of pointers (or a dynamic array of arrays) (so it becomes a flat data structure)? And: version (unittest) { import std.stdio, std.algorithm; } unittest { static struct InputRange { this(size_t *pn) { _pn = pn; } bool empty() const { return *_pn == 1_000_000; } I suggest ==> unittest { import std.stdio, std.algorithm; static struct InputRange { bool empty() const { return *_pn == 1_000_000; } Bye, bearophile
Oct 02 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 02 Oct 2011 16:46:38 +0200, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 On 02.10.2011 18:12, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward  
 ranges.
 It allows to write range agnostic lexers/parsers with infinite  
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

Like it, but one thing catches my eye - why use GC for blocks when the whole thing is already refcounted? Straight malloc/free would be a better fit. Certainly it may use an allocator when we have them.

to get this right.
 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also  
 create
 references and thus can be easily responsible for inefficiencies.

While the concept of ForwardRange is quite elegant, I found myself wondering on occasion that InputRange with restore could be way more efficient. Consider: auto r = range(); auto save_it = r; ///at least full bitwise copy and/or refCounting if(do_stuff(r)) ... else r = save_it;//another one ... Compared to: auto save_pt = r.savepoint(); //different object, minimal overhead if(do_stuff(r)) ... else r.restore(save_pt);//range knows how to do this efficiently ...

Looks interesting and clearly has some advantages. I personally tend to alway write .save when I intend so and thought that way more phobos functions should take ranges by auto ref (i.e. actually advance them). But I recapture from an earlier discussion that most people liked the implicit save and some proposed to ditch save at all. How would restore allow to actually duplicate the range at different positions. I think a pretty common use case for forward ranges is backtracking: auto start = range.save; popWhile!pred(range); doSomeThing(start, range); Will I need to be able to do this? auto range2 = range; range.restore(start); doSomeThing(range, range2);
Oct 02 2011
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward ranges.
 It allows to write range agnostic lexers/parsers with infinite lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also create
 references and thus can be easily responsible for inefficiencies.

 martin

I have implemented something similar (but it is not generic). It uses a dynamic circular buffer on top of a single dynamic array. I believe it is more efficient that way for usual parsing tasks. (and uses less lines of code) The clients of the range have to explicitly state that they want to go back to a certain state. It works in LIFO order, which is sufficient for parsing with lookahead. auto a=r.pushAnchor(); // remember current state in a r.popFront(); // do some lookahead auto b=r.pushAnchor(); // remember current state in b r.popFront(); // do some lookahead r.popFront(); // ditto r.popAnchor(b); // go back to b r.popFront(); // do lookahead again r.popAnchor(a); // go back to a The buffer will dynamically be resized if a lot of lookahead is required and be reused otherwise. It has minimal overhead and it uses the execution stack to store the states (just indexes into the original range). The final buffer size is less than twice the largest lookahead required.
Oct 02 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 10/03/2011 01:14 AM, Martin Nowak wrote:
 On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

 martin

I have implemented something similar (but it is not generic). It uses a dynamic circular buffer on top of a single dynamic array. I believe it is more efficient that way for usual parsing tasks. (and uses less lines of code) The clients of the range have to explicitly state that they want to go back to a certain state. It works in LIFO order, which is sufficient for parsing with lookahead.

and appended to the front. So to say a cyclic singly linked list. Only you have two performance issues when using a ring buffer based on arrays. - The head reader needs to check if he can overwrite a location for every element.

It does not. The only position that is important is that of the very first anchor, therefore there is only one check per buffer refill. (the range saves the very first anchor, as well as how many anchors are around).
 - When resizing the buffer you need to copy the already read data
 (possibly causing costly copy ctor invocations).

The buffer is always doubled in size, quickly reaching the maximal size needed (I start with a quite large buffer, usually it is never resized). There are no invocations of the copy ctor, because the data can be moved.
 With block based reads you never need to copy read data and it also
 frees from
 checking for overwrites until the block end is reached.

I never need to copy data, and moving data only happens on degenerate input.
Oct 02 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward  
 ranges.
 It allows to write range agnostic lexers/parsers with infinite  
 lookahead.

 Buffering is done through a singly linked list of memory blocks that are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also  
 create
 references and thus can be easily responsible for inefficiencies.

 martin

I have implemented something similar (but it is not generic). It uses a dynamic circular buffer on top of a single dynamic array. I believe it is more efficient that way for usual parsing tasks. (and uses less lines of code) The clients of the range have to explicitly state that they want to go back to a certain state. It works in LIFO order, which is sufficient for parsing with lookahead.

and appended to the front. So to say a cyclic singly linked list. Only you have two performance issues when using a ring buffer based on arrays. - The head reader needs to check if he can overwrite a location for every element. - When resizing the buffer you need to copy the already read data (possibly causing costly copy ctor invocations). With block based reads you never need to copy read data and it also frees from checking for overwrites until the block end is reached.
 auto a=r.pushAnchor(); // remember current state in a
 r.popFront(); // do some lookahead
 auto b=r.pushAnchor(); // remember current state in b
 r.popFront(); // do some lookahead
 r.popFront(); // ditto
 r.popAnchor(b); // go back to b
 r.popFront(); // do lookahead again
 r.popAnchor(a); // go back to a

 The buffer will dynamically be resized if a lot of lookahead is required  
 and be reused otherwise. It has minimal overhead and it uses the  
 execution stack to store the states (just indexes into the original  
 range). The final buffer size is less than twice the largest lookahead  
 required.

should rather use a paired restore. Maybe the ForwardRange concept can be extended to is(typeof(range.save) == typeof(range)) || is(typeof(range.restore(range.save))). But as Dmitry pointed out it is not always possible to allow duplication so maybe this requires a new concept of restoreable InputRange, something in between InputRange and ForwardRange.
Oct 02 2011
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 03 Oct 2011 03:12:00 +0200, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 10/03/2011 01:14 AM, Martin Nowak wrote:
 On Sun, 02 Oct 2011 21:15:12 +0200, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 10/02/2011 04:12 PM, Martin Nowak wrote:
 I've written a wrapper to promote input ranges to buffered forward
 ranges.
 It allows to write range agnostic lexers/parsers with infinite
 lookahead.

 Buffering is done through a singly linked list of memory blocks that  
 are
 reference counted.
 Each saved range holds a reference to all future blocks.
 Blocks are recycled when being no longer used.

 https://gist.github.com/1257196

 There is a major issue with the somewhat broken
 implicit-save-through-copy concept.
 A lot of copies in function parameters, foreach loops etc. will also
 create
 references and thus can be easily responsible for inefficiencies.

 martin

I have implemented something similar (but it is not generic). It uses a dynamic circular buffer on top of a single dynamic array. I believe it is more efficient that way for usual parsing tasks. (and uses less lines of code) The clients of the range have to explicitly state that they want to go back to a certain state. It works in LIFO order, which is sufficient for parsing with lookahead.

and appended to the front. So to say a cyclic singly linked list. Only you have two performance issues when using a ring buffer based on arrays. - The head reader needs to check if he can overwrite a location for every element.

It does not. The only position that is important is that of the very first anchor, therefore there is only one check per buffer refill. (the range saves the very first anchor, as well as how many anchors are around).
 - When resizing the buffer you need to copy the already read data
 (possibly causing costly copy ctor invocations).

The buffer is always doubled in size, quickly reaching the maximal size needed (I start with a quite large buffer, usually it is never resized). There are no invocations of the copy ctor, because the data can be moved.
 With block based reads you never need to copy read data and it also
 frees from
 checking for overwrites until the block end is reached.

I never need to copy data, and moving data only happens on degenerate input.

Is the code public?
Oct 02 2011