
digitalmars.D - [RFC] I/O and Buffer Range

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Didn't mean this to end up as a kind of under-the-tree present... Planned 
to post long ago but I barely had the time to flesh it out.

TL;DR: Solve the input side of I/O problem with new kind of ranges.
I'm taking on the buffering primitive on top of the low-level unbuffered 
I/O sources specifically.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text & binary alike) in D soon 
finds out that while the slice-em-up approach with random access ranges 
works very nicely, any attempt to extend that beyond arrays and 
in-memory stuff gets really awkward. It adds up to implementing no small 
amount of bookkeeping (buffering), all the while having to use 
primitives that just don't map well to the underlying "hardware" - the 
buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? 
Save the range, do 3 front/popFront calls with empty checks, then copy 
the saved range back. Painful, especially when you know full well that 
it could be just a length check plus 3 direct reads of an array. Even if 
the underlying buffer allows you to slice up the contents and do 
array-wise operations on them, the dumbed-down forward range interface 
just won't let you.
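To make the cost concrete, here is a sketch of that LL(3) check; only 
save/front/popFront are standard range primitives, while startsWithABC 
and the commented buffer-range variant (using the proposal's `lookahead` 
name) are hypothetical illustrations:

```d
import std.range;

// Forward-range way: fork the range and step it element by element.
bool startsWithABC(R)(R r)
    if (isForwardRange!R)
{
    auto s = r.save;            // copy the whole iteration state
    foreach (ch; "abc")
    {
        if (s.empty || s.front != ch)
            return false;
        s.popFront();           // one empty check + one pop per element
    }
    return true;
}

// Buffer-range way (hypothetical): one length check plus 3 direct reads.
//     auto w = buf.lookahead(3);
//     bool ok = (w.length == 3 && w[] == cast(const(ubyte)[]) "abc");
```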

And once all of the hard (and useless) work to support forward ranges 
is done, you realize you have a second parser that should support 
forward ranges as well. This leads to the conclusion that parsers simply 
don't want to work on forward ranges; they need something bigger, and 
all the extra work you did on buffering really belongs to that bigger 
primitive.

To put the last nail in the coffin - most of the interesting sources 
can't even be forward ranges! A stream over a TCP socket - how do you 
'save' that, Sherlock? Keeping in mind that it must return the same 
type, we'd have been better off calling it "fork" back then.

To sum it up: the current selection of ranges is not good enough to 
work with I/O _efficiently_. Yet ranges are very neat, and we'd better 
retain their strengths and capitalize on the existing work done in this 
area (e.g. the slew of good stuff in std.algorithm and std.range). We've 
got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side 
of the I/O problem. Output IMHO might need some tweaks, but with 
OutputRange it's in a much better shape than input.

The said new input primitive (from now on a "buffer range") must 
naturally and efficiently support the following:
1) The zero-copy, insanely fast & popular slice-em-up approach for 
in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like.
3) Memory-mapped files, esp. ones that don't fit into memory.
4) Primitives that (all kinds of) parsers need, including lookahead and 
lookbehind.

It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and 
buffer ranges with input sources, we will have grown something powerful 
indeed.

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for 
is to simplify the set of primitives and/or hear better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with e.g. a bit trimmed-down std.regex using this 
abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there is no more need to 
work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before: 
fgets + std.regex.matchFirst. Note that this (old) solution cheats 
twice - it seeks only 1 match per line, and it knows the line length 
ahead of time.

For me this proves that surpassing the C run-time performance-wise is 
well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined 
here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives but my goal was 
sufficient first, minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing 
(and foreach). This also gets us somewhat closer to the 
backwards-compatibility goal. :)

Accept that, given requirements (1)-(3), we are working with a "sliding 
window" over whatever the true source of data is. Thus the sliding 
window can be a whole array, a mapped area of a file, or the buffer of a 
network stream. A sliding window may be moved across the input stream 
(~ re-loading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the 
sliding-window model and match them with the primitives a parser would 
"like".

2. Slicing "after the fact".

This means the ability to mark relevant parts of the buffer as the 
start of something "useful", and to require the underlying 
implementation, when the time comes to move the window, to keep the data 
starting from there. Typically, later "down the stream", when the 
boundaries of the slice are established, it's extracted, examined and 
(rarely!) copied over. The ability to avoid the copy unless absolutely 
necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner), rewritten 
with my primitives, looks like this:

{
   //pin this position so that buffering won't lose it
   auto m = input.mark();
   while(!input.empty && isAlpha(input.front)){
       input.popFront();
   }
   //get a slice from 'm' to the current position
   auto word = input.slice(m);
   //grab a copy if we haven't seen it before
   if(word !in identifiers)
       identifiers.insert(word.idup);
} //here m's destructor unpins the position in input

To address slicing (a parser requirement) I had to introduce the marking 
and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save the iteration state was a bad idea. 
It's not just wasteful memory-wise, it's _semantically_ costly. In the 
case of a buffer range it would also imply some "re-reading" of the I/O 
source, as the copy has to be a completely independent view of the same 
input(!). Hence the "save" of a forward range is an inadequate primitive 
that must be dropped.

However, when the time comes to backtrack (and many parsers do that, 
quite often) the state has to be restored. To minimize the primitive 
count and not break requirement (2), reuse 'mark' to save the state and 
add 'seek' to restore the position to the marked point. Since it was 
pinned, it's always in the buffer and readily accessible, keeping the 
ability to work over streams.
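As a sketch of this mark/seek backtracking (the tryParseKeyword and 
parseIdentifier helpers are hypothetical, the primitives are the 
proposal's):

```d
// Hypothetical backtracking with mark/seek - no range copying involved.
auto m = input.mark();        // pin the current position in the buffer
if (!tryParseKeyword(input))  // may consume some input on a partial match
{
    input.seek(m);            // roll back: the pinned data is still in the window
    parseIdentifier(input);   // try the alternative production
}
// m's destructor unpins the position when it goes out of scope
```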

4. Even cheaper save/restore.

Something discovered the hard way in a practical setting is that 
setting up the boundaries of stuff to slice (captures in regex) must be 
dirt cheap - integer-assignment cheap.

This means that using marking for every sub-slice is prohibitively 
costly, as each mark has to communicate with the buffer (currently by 
bumping a counter in some array). The idea that worked out in std.regex 
is to use a single "vantage point" mark, take one big slice off it, and 
then make sub-slices of that as the plain array it is. Then, from time 
to time, the "vantage point" is updated when there are no sub-slices in 
mid-air.

This leads to a 'tell' primitive that gives the offset from a given 
mark to the current position, plus 'seek' overloads that work with 
relative offsets (positive and negative).
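A sketch of the vantage-point technique (mark/tell/slice names per the 
proposal; the capture bookkeeping shown is a hypothetical illustration):

```d
// One real mark per match attempt; every capture boundary is an integer.
auto vantage = input.mark();            // the only buffer interaction
size_t capStart = input.tell(vantage);  // integer-assignment cheap
// ... match more of the input ...
size_t capEnd = input.tell(vantage);

// Materialize captures lazily off the single big slice:
auto window  = input.slice(vantage);    // plain array view from the mark
auto capture = window[capStart .. capEnd];
```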

Another use for relative seek is skipping over data without looking at 
it, potentially dropping whole buffers away (there is an overlap with 
popFrontN though). Also, fixed-width backtracking is easily had with 
relative seek if we can instruct the buffer range to keep at least the 
last K bytes in the buffer (i.e. require minimal history).

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an 
array, hence one may easily take a peek at some portion of it as a plain 
ubyte[]. The implementation makes sure it buffers up enough data as 
required, and/or returns an empty range if that's not possible. This 
supports things like LL(k) lookahead and the fixed-length lookbehind 
that is common in regex zero-width assertions.

These two could also be implemented with a relative 'seek' plus 
front/popFront; the question is how effective that would be.
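For example, LL(2) dispatch over the proposed interface might look like 
this sketch (skipLineComment/skipBlockComment are hypothetical helpers; 
lookahead is the proposal's primitive):

```d
// Peek two bytes without consuming them; lookahead buffers up more
// data if needed and returns an empty slice when the input is too short.
auto ahead = input.lookahead(2);
if (ahead.length == 2 && ahead[0] == '/')
{
    if (ahead[1] == '/')
        skipLineComment(input);
    else if (ahead[1] == '*')
        skipBlockComment(input);
}
```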

-- 
Dmitry Olshansky
Dec 29 2013
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 [snip]

Hmm, just yesterday I was rewriting a parser to use a buffer instead of 
loading the whole file in memory, so this is quite timely for me. 
Questions:

1. What happens when the distance between the pinned and current 
position exceeds the size of the buffer (sliding window)? Is the buffer 
size increased, or is the stream rewound if possible and the range 
returned by the slice does seeking?

2. I don't understand the rationale behind the current semantics of 
lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the 
first whitespace char, you don't know how many chars to request. 
Wouldn't it be better to make these functions return the ENTIRE 
available buffer in O(1)?

I guess I see the point when applied to regular expressions, where the 
user explicitly specifies how many characters to look ahead/behind. 
However, I think in most use cases the amount is not known beforehand 
(without imposing arbitrary limitations on users like "Thou shalt not 
have variable identifiers longer than 32 characters"), so the pattern 
would be "try a cheap lookahead/behind, and if that fails, do an 
expensive one".

3. I think ideally the final design would use something like what 
std.allocator does with "unbounded" and "chooseAtRuntime" - some uses 
might not need lookahead or lookbehind or other features at all, so 
having a way to disable the relevant code would benefit those cases.
Dec 29 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Dec-2013 02:45, Vladimir Panteleev wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 [snip]

 Hmm, just yesterday I was rewriting a parser to use a buffer instead
 of loading the whole file in memory, so this is quite timely for me.
 Questions:

 1. What happens when the distance between the pinned and current
 position exceeds the size of the buffer (sliding window)? Is the
 buffer size increased, or is the stream rewound if possible and the
 range returned by the slice does seeking?

It's expected that the window is increased. The exact implementation may 
play any dirty tricks it sees fit, as long as it can provide a slice 
over the pinned area. In short - maintain the illusion that the window 
has increased.

I would be against a seeking range and would most likely opt for 
memory-mapped files instead, but it all depends on the exact numbers.
 2. I don't understand the rationale behind the current semantics of
 lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the
 first whitespace char, you don't know how many chars to request.

If you want to 'find' something, just do front/popFront, no? Or do you 
specifically want to do array-wise operations?
 Wouldn't it be better to make these functions return the ENTIRE
 available buffer in O(1)?

Indeed, now I think that 2 overloads would be better:

auto lookahead(size_t n); // exactly n bytes, re-buffering as needed
auto lookahead();         // all that is available in the window, no re-buffering

Similar for lookbehind.
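Usage would then follow the cheap-then-expensive pattern mentioned in 
the question (a sketch assuming those overload semantics; scanFor and 
the growth amount are hypothetical):

```d
// Cheap path: scan whatever the window already holds, O(1) to obtain.
auto cheap = input.lookahead();
if (!scanFor(cheap, ' '))
{
    // Expensive path: ask for more than the window currently holds,
    // forcing a re-buffer (the exact growth policy is implementation-defined).
    auto more = input.lookahead(cheap.length + 512);
    if (more.length)
        scanFor(more, ' ');
}
```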
 I guess I see the point when applied to regular expressions, where the
 user explicitly specifies how many characters to look ahead/behind.

Actually the user doesn't - our lookahead/lookbehind is variable-length. 
One thing I would have to drop is unbounded lookbehind, not that it's 
critical.
 However, I think in most use cases the amount is not known beforehand
 (without imposing arbitrary limitations on users like "Thou shalt not
 have variable identifiers longer than 32 characters"), so the pattern
 would be "try a cheap lookahead/behind, and if that fails, do an
 expensive one".

I would say that in a case where you need arbitrary-length lookahead:

m = mark, seek + popFront x N, seek(m)

should fit the bill. Or, as is the case in regex at the moment - mark 
once, and seek back to some position relative to it. In one word - 
backtracking.

An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421
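The pattern described above, spelled out as a sketch (mark/seek per the 
proposal, the identifier scan being a hypothetical use case):

```d
import std.ascii : isAlphaNum;

// Arbitrary-length lookahead: pin, probe forward, roll back.
auto m = input.mark();
size_t n = 0;
while (!input.empty && isAlphaNum(input.front))
{
    input.popFront();   // probe as far as needed; pinned data stays buffered
    ++n;
}
input.seek(m);          // restore: we looked ahead n chars without consuming
```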
 3. I think ideally the final design would use something like what
 std.allocator does with "unbounded" and "chooseAtRuntime" - some uses
 might not need lookahead or lookbehind or other features at all, so
 having a way to disable the relevant code would benefit those cases.

It makes sense to make lookahead and lookbehind optional. As for the 
code - for the moment it doesn't add much and builds on stuff already 
there. Though I suspect some other implementations would be able to "cut 
corners" more efficiently.

-- 
Dmitry Olshansky
Dec 30 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Dec-2013 09:29, Marco Leise wrote:
 Am Sun, 29 Dec 2013 22:45:35 +0000
 schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

 Yeah, been there too. I guess after X years of programming chances are
 good you implemented some buffer with look-ahead. :) I use those
 primitives:

 ubyte[] mapAvailable() pure nothrow
 ubyte[] mapAtLeast(in ℕ count)

Yeah, that's a good idea. I mean the primitives, not the Unicode in type 
names ;)

-- 
Dmitry Olshansky
Dec 30 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 29 Dec 2013 22:45:35 +0000
schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

Yeah, been there too. I guess after X years of programming chances are 
good you implemented some buffer with look-ahead. :)

I use those primitives:

ubyte[] mapAvailable() pure nothrow
ubyte[] mapAtLeast(in ℕ count)

See: https://github.com/mleise/piped/blob/master/src/piped/circularbuffer.d#L400

Both return a slice over all the available buffered data. mapAtLeast() 
also waits till enough data is available from the producer thread. The 
circular buffer is automatically extended if the producer wants to write 
a larger chunk or the consumer needs a larger window. (I was mostly 
focused on lock-free operation in not-starved situations and minimal 
bounds checking, so the code is a bit more sophisticated than the 
average circular buffer.)

-- 
Marco
Dec 29 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 30 Dec 2013 12:08:18 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 30-Dec-2013 09:29, Marco Leise =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 Am Sun, 29 Dec 2013 22:45:35 +0000
 schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

 Yeah, been there too. I guess after X years of programming chances are
 good you implemented some buffer with look-ahead. :) I use those
 primitives:

 ubyte[] mapAvailable() pure nothrow
 ubyte[] mapAtLeast(in ℕ count)

Yeah, that's a good idea. I mean primitives, not Unicode in type names ;)

One day it'll catch on :D

-- 
Marco
Dec 30 2013
prev sibling next sibling parent reply "Brad Anderson" <eco gnuk.net> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 Proposal

Having never written any parser I'm not really qualified to seriously 
give comments or review it, but it all looks very nice to me.

Speaking as just an end user of these things: whenever I use ranges over 
files or from, say, std.net.curl, the byLine/byChunk interface always 
feels terribly awkward to use, which often leads to me just giving up 
and loading the entire file/resource into an array. It's the boundaries 
that I stumble over. byLine never fits when I want to extract something 
multiline, but byChunk doesn't fit either, because if what I'm searching 
for lands on a chunk boundary I'll miss it.

Being able to just do a matchAll() on a file, std.net.curl, etc. without 
sacrificing performance and memory would be such a massive gain for 
usability.

Just a simple example of where I couldn't figure out how to utilize 
either byLine or byChunk without adding some clunky homegrown buffering 
solution. This is code that scrapes website titles from the pages of 
URLs in IRC messages.

---
auto scrapeTitles(M)(in M message)
{
    static url_re = ctRegex!(r"(https?|ftp)://[^\s/$.?#].[^\s]*", "i");
    static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
    static ws_re = ctRegex!(r"(\s{2,}|\n|\t)");

    auto utf8 = new EncodingSchemeUtf8;
    auto titles = matchAll(message, url_re)
        .map!(match => match.captures[0])
        .map!((url) => get(url).ifThrown([]))
        .map!(bytes => cast(string) utf8.sanitize(cast(immutable(ubyte)[]) bytes))
        .map!(content => matchFirst(content, title_re))
        .filter!(captures => !captures.empty)
        .map!(capture => capture[1].idup) // dup so original is GCed
        .map!(title => title.entitiesToUni.replace(ws_re, " "))
        .array;
    return titles;
}
---

I really, really didn't want to use that std.net.curl.get(). It causes 
all sorts of problems if someone links to a huge resource. I just could 
not figure out how to utilize byLine (the title regex capture can be 
multiline) or byChunk cleanly.
Code elegance (a lot of it due to Jakob Ovrum's help in IRC) was really 
a goal here, as this is just a toy, so I went with get() for the time 
being, but it's always sad to sacrifice elegance for performance. I 
certainly didn't want to add some elaborate evergrowing buffer in the 
middle of this otherwise clean UFCS chain (and I'm not even sure how to 
incrementally regex-search a growing buffer, or if that's even 
possible).

If I'm understanding your proposal correctly, that get(url) could be 
replaced with a hypothetical std.net.curl "buffer range" and that could 
be passed directly to matchFirst. It would only take up, at most, the 
size of the buffer in memory (which could grow if the capture grows to 
be larger than the buffer) and wouldn't read the unneeded portion of the 
resource at all. That would be such a huge win for everyone, so I'm very 
excited about this proposal. It addresses all of my current problems.

P.S. I love std.regex more and more every day. It made that 
entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d
Dec 30 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 05:51, Brad Anderson wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 Proposal

 Having never written any parser I'm not really qualified to seriously
 give comments or review it but it all looks very nice to me. Speaking
 as just an end user of these things whenever I use ranges over files
 or from, say, std.net.curl the byLine/byChunk interface always feels
 terribly awkward to use which often leads to me just giving up and
 loading the entire file/resource into an array. It's the boundaries
 that I stumble over. byLine never fits when I want to extract
 something multiline but byChunk doesn't fit because if what I'm
 searching for lands on a boundary I'll miss it.

Exactly, the situation is simply not good enough. I can assure you that on the side of parser writers it's even less appealing.
 Being able to just do a matchAll() on a file, std.net.curl, etc. without
 sacrificing performance and memory would be such a massive gain for
 usability.

.. and performance ;)
 Just a simple example of where I couldn't figure out how to utilize
 either byLine or byChunk without adding some clunky homegrown buffering
 solution. This is code that scrapes website titles from the pages of
 URLs in IRC messages.

 I really, really didn't want to use that std.net.curl.get().  It causes
 all sorts of problems if someone links to a huge resource.

*Nods*
 I just could
 not figure out how to utilize byLine (the title regex capture can be
 multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob
 Ovrum's help in IRC) was really a goal here as this is just a toy so I
 went with get() for the time being but it's always sad to sacrifice
 elegance for performance. I certainly didn't want to add some elaborate
 evergrowing buffer in the middle of this otherwise clean UFCS chain (and
 I'm not even sure how to incrementally regex search the growing buffer
 or if that's even possible).

I thought of providing something like that - an incremental match that 
takes pieces of data slice by slice, with some kind of not-yet-matched 
object to mess with. But it was solving the wrong problem. And it shows 
that backtracking engines simply can't work like that; they would want 
to go back to the prior pieces.
 If I'm understanding your proposal correctly that get(url) could be
 replaced with a hypothetical std.net.curl "buffer range" and that could
 be passed directly to matchFirst. It would only take up, at most, the
 size of the buffer in memory (which could grow if the capture grows to
 be larger than the buffer) and wouldn't read the unneeded portion of the
 resource at all. That would be such a huge win for everyone so I'm very
 excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)
 P.S. I love std.regex more and more every day. It made that
 entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with a functor rox!

-- 
Dmitry Olshansky
Dec 31 2013
prev sibling next sibling parent reply "Joseph Cassman" <jc7919 outlook.com> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 [...]

Interesting idea. Seems to fill a need I have been facing with some 
parsing code.

Since I was unclear about how its functionality compares to ForwardRange 
I took a look through std.algorithm. If typed versions of 
lookahead/lookbehind were added, it seems like ForwardRange could be 
replaced with this new range type, at least in certain places. For 
example, it seems to me that the following code (from the article "On 
Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

ForwardRange findAdjacent(ForwardRange r)
{
    if (!r.empty()) {
        auto s = r.save();
        s.popFront();
        for (; !s.empty(); r.popFront(), s.popFront()) {
            if (r.front() == s.front()) break;
        }
    }
    return r;
}

also could be written as

BufferedRange findAdjacent(BufferedRange r)
{
    for (; !r.empty; r.popFront()) {
        if (r.front == r.lookahead(1)[0]) break;
    }
    return r;
}

Perhaps the mark and/or slice functionality could be leveraged to do 
something similar but be more performant.

If anyone can see how the new range type compares to ForwardRange that 
would be cool.

Joseph
Dec 30 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 05:53, Joseph Cassman wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 [...]

 Interesting idea. Seems to fill a need I have been facing with some
 parsing code. Since I was unclear about how its functionality compares
 to ForwardRange I took a look through std.algorithm. If typed versions
 of lookahead/lookbehind were added it seems like ForwardRange could be
 replaced with this new range type, at least in certain places. For
 example, it seems to me that the following code (from the article "On
 Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

 ForwardRange findAdjacent(ForwardRange r)
 {
     if (!r.empty()) {
         auto s = r.save();
         s.popFront();
         for (; !s.empty(); r.popFront(), s.popFront()) {
             if (r.front() == s.front()) break;
         }
     }
     return r;
 }

surprised myself). A proper name would be ForkableRange; indeed, look at 
this code:

auto s = r.save();
s.popFront();

s is a fork of r (an independent view that has a life of its own). The 
original algorithm even _reads_ easier if the correct term is used:

auto s = r.fork();
s.popFront();
...

Now it's clear that we fork a range and advance the forked copy one step 
ahead of the original.
 also could be written as

     BufferedRange findAdjacent(BufferedRange r) {
        for(;!r.empty;r.popFront()) {
           if(r.front == r.lookahead(1)[0]) break;
        }
        return r;
     }

 Perhaps the mark and/or slice functionality could be leveraged to do
 something similar but be more performant.

 If anyone can see how the new range type compares to ForwardRange that
 would be cool.

I'm thinking there might be a way to bridge the new range type with 
ForwardRange, but not directly as defined at the moment.

A possibility I'm considering is to separate out a Buffer object (not a 
range) and let it be shared among views - light-weight buffer-ranges. 
Then, if we imagine that these light-weight buffer-ranges work as marks 
(i.e. they pin down the buffer) in the current proposal, they could be 
forward ranges.

I need to think on this, as the ability to integrate well with 
forward-ish algorithms would be a great improvement.

-- 
Dmitry Olshansky
Dec 31 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 22:46, Joseph Cassman wrote:
 On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
 31-Dec-2013 05:53, Joseph Cassman пишет:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:



 I'm thinking there might be a way to bridge the new range type with
 ForwardRange but not directly as defined at the moment.

 A possibility I consider is to separate a Buffer object (not a range),
 and let it be shared among views - light-weight buffer-ranges. Then if
 we imagine that these light-weight buffer-ranges are working as marks
 (i.e. they pin down the buffer) in the current proposal then they
 could be forward ranges.


I've created a fork where I've implemented just that. As a bonus I also 
tweaked the stream primitives, so it now works with pipes or whatever 
input stdin happens to be.

Links stay the same:
Docs: http://blackwhale.github.io/datapicked/dpick.buffer.traits.html
Code: https://github.com/blackwhale/datapicked/tree/fwd-buffer-range/dpick/buffer

The description has been largely simplified and the primitive count 
reduced.

1. A buffer range is a forward range. It has reference semantics.

2. A copy produced by _save_ is an independent view of the underlying 
buffer (or window).

3. No bytes can be discarded that are seen by some existing view. Thus 
each reference pins its position in the buffer.

4. The 3 new primitives are:

Range slice(BufferRange r);
    Returns a slice of the window between the current range's position
    and r's. It must be a random-access range.

ptrdiff_t tell(BufferRange r);
    Returns the difference in window positions between the current range
    and r. Note that unlike slice(r).length this can be both positive
    and negative.

bool seek(ptrdiff_t ofs);
    Resets the buffer state to an offset from the current position. The
    return value indicates the success of the operation. It may fail if
    there is not enough data, or (if ofs is negative) if that portion of
    the data was already discarded.

5. Lookahead and lookbehind are extra primitives that were left intact 
for the moment. Where applicable a range may provide lookahead:

Range lookahead();         // as much as is available in the window
Range lookahead(size_t n); // either exactly n, or nothing if not available

And lookbehind:

Range lookbehind();         // as much as is available in the window
Range lookbehind(size_t n); // either exactly n, or nothing if not available

These should probably be tested for as separate traits.
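Under this reworked design a consumer could look like the following 
sketch (semantics assumed from the description above; the 5-element 
token is an arbitrary example):

```d
// Two forward-range views share one window; slice/tell relate them.
auto start = input.save;         // independent view, pins its position
foreach (_; 0 .. 5)
    input.popFront();            // advance only one of the views

auto token = start.slice(input); // random-access slice between the two views
// tell() gives the signed distance between the views, and seek() moves
// a view within the still-pinned portion of the window.
```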
 input-source <--> buffer range <--> parser/consumer

 Meaning that if we can mix and match parsers with buffer ranges, and
 buffer ranges with input sources we had grown something powerful indeed.

 Being able to wrap an already-in-use range object with the buffer
 interface, as you do in the sample code
 (https://github.com/blackwhale/datapicked/blob/master/dgrep.d), is
 good for composability. It also allows existing functionality in
 std.algorithm to be reused as-is.

It was more about wrapping an array, but it's got to integrate well with 
what we have. I could imagine a use case for buffering an input range. 
Then, I think, a buffer range of something other than bytes would be in 
order.
 I think the new range type could also be added directly to some new, or
 perhaps retrofitted into existing, code to add the new functionality
 without sacrificing performance. In that way the internal payload
 already used to get the data (say by the input range) could be reused
 without having to allocate new memory to support the buffer API.

 As one idea of using a buffer range from the start, a function template
 by(T) (where T is ubyte, char, wchar, or dchar) could be added to
 std.stdio.

IMHO C run-time I/O has no use in D. The amount of work spent on 
special-casing the non-locking primitives of each C run-time, repeating 
legacy mistakes (like text mode, codepages and locales) and stumbling 
over portability problems (getc is a macro, which we can't have) would 
have been better spent elsewhere - designing our own I/O framework.

I've put together something pretty simple and fast for a buffer range 
directly on top of native I/O:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/stream.d

It needs a bit better error messages than a naked enforce, and a bit of 
tweaking of its memory management. It already runs circles around the 
existing std.stdio.
 It would return a buffer range object providing more
 functionality than byChunk or byLine while adding access to the entire
 stream of data in a file in a contiguous and yet efficient manner.

Drop 'efficient' if we're talking about interfacing with the C run-time. 
Otherwise, yes, absolutely.
 Seems
 to help with the issues faced in processing file data mentioned in
 previous comments in this thread.

-- Dmitry Olshansky
Jan 04 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2014 09:22, Jason White wrote:
 On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky wrote:
 IMHO C run-time I/O has no use in D. The amount of work spent on
 special-casing the non-locking primitives of each C run-time,
 repeating legacy mistakes (like text mode, codepages and locales) and
 stumbling on portability problems (getc is a macro we can't have)
 would have been better spent elsewhere - designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API: https://github.com/jasonwhite/io/blob/master/src/io/file.d

It's obvious and adds no extra value.
 It allows for more robust open-flags than the fopen-style flags (like
 "r+"). For seeking, I adapted it to use your concept of marks (which I
 quite like) instead of SEEK_CUR and friends.

 Please feel free to use this! (The Windows implementation hasn't been
 tested yet, so it probably doesn't work.)

Will poke around. I like this (I mean composition): https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17
 BTW, I was also working on buffered streams, but couldn't figure out a
 good way to do it.

-- Dmitry Olshansky
Jan 05 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2014 15:08, Jason White пишет:
 On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
 As an advice I'd suggest to drop the 'Data' part in
 writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).

In my view text implies something like:

    void write(const(char)[]);
    size_t read(char[]);

And binary would be:

    void write(const(ubyte)[]);
    size_t read(ubyte[]);

Should not clash.
 Will poke around. I like this (I mean composition):
 https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets).

In-memory array IMHO better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead wrapping a stream and/or array as a buffer range proved to me to be more natural (extending capabilities).
Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought. -- Dmitry Olshansky
Jan 05 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-Jan-2014 09:41, Jason White пишет:
 On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
 I my view text implies something like:

 void write(const(char)[]);
 size_t read(char[]);

 And binary would be:

 void write(const(ubyte)[]);
 size_t read(ubyte[]);

 Should not clash.

 Those would do the same thing for either text or binary data. When I
 say text writing, I guess I mean the serialization of any type to text
 (like what std.stdio.write does):

     void write(T)(T value);         // Text writing
     void write(const(ubyte)[] buf); // Binary writing

     write([1, 2, 3]); // want to write "[1, 2, 3]"
                       // but writes "\x01\x02\x03"

 This clashes. We need to be able to specify if we want to write/read a
 text representation or just the raw binary data. In the above case, the
 most specialized overload will be called.

Ok, now I see. In my eye though serialization completely hides the raw stream write. So:

struct SomeStream{
    void write(const(ubyte)[] data...);
}

struct Serializer(Stream){
    void write(T)(T value); //calls stream.write inside of it
private:
    Stream stream;
}
 In-memory array IMHO better not pretend to be a stream. This kind of
 wrapping goes in the wrong direction (losing capabilities). Instead
 wrapping a stream and/or array as a buffer range proved to me to be
 more natural (extending capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations?

I think it may be best not to. A buffer builds on top of an unbuffered stream. If there is a need to do large reads it may as well use the naked stream and not worry about the extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as a read even though it's not.
 I don't see why it would conflict with a
 range interface. As I understand it, ranges act on a single element at a
 time while streams act on multiple elements at a time. For ArrayBuffer
 in datapicked, a stream-style read is just lookahead(n) and cur += n.
 What capabilities are lost?

In short - lookahead is slicing, read would be copying. For me the prime capability of an array is slicing, which is dirt cheap O(1). On the other hand, the stream interface is all about copying bytes to a user-provided array. In this setting it means that if you want to wrap an array as a stream, then it must follow the generic stream interface. The latter cannot and should not think of slicing and the like. Then while wrapping it in some adapter up the chain, it's no longer seen as an array (because the adapter is generic too and is made for streams). This is what I call capability loss.
 If buffers/arrays provide a stream interface, then they can be used by
 code that doesn't directly need the buffering capabilities but would
 still benefit from them.

See above - it would be better if the code was written for ranges, not streams. Then e.g. slicing of a buffer range on top of an array works just as cheaply as it did for arrays, and zero copies are made (= performance).
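To make the zero-copy argument concrete, here is a toy comparison (the streamRead helper is hypothetical, for illustration only): an array slice shares the underlying storage, while a stream-style read has to copy into the caller's buffer.

```d
// Demonstrates the capability-loss argument: slicing an array is O(1)
// and shares storage, while a generic stream-style read must copy into
// a caller-provided buffer.
size_t streamRead(const(ubyte)[] src, ref size_t pos, ubyte[] dst)
{
    import std.algorithm.comparison : min;
    auto n = min(src.length - pos, dst.length);
    dst[0 .. n] = src[pos .. pos + n]; // the copy a stream cannot avoid
    pos += n;
    return n;
}

unittest
{
    ubyte[] source = [10, 20, 30, 40];

    // "lookahead is slicing": no copy, same underlying memory.
    auto view = source[1 .. 3];
    assert(view.ptr is source.ptr + 1);

    // "read would be copying": data duplicated into dst.
    ubyte[4] dst;
    size_t pos;
    assert(streamRead(source, pos, dst[]) == 4);
    assert(dst[] == source[] && dst.ptr !is source.ptr);
}
```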
 Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking as thread locality makes no difference there.

Actually these objects do just fine, since the OS does the locking (or makes sure of something equivalent). If your stream is TLS there is no need for extra locking at all (no interleaving of I/O calls is possible) regardless of its kind. Shared instances would need locking, as 2 threads may request some operation, and as the OS locks only on a per-syscall basis something cruel may happen in the code that deals with buffering etc.
Jan 06 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Jan-2014 11:59, Jason White пишет:
 On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides raw
 stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

 I was thinking it should also have "alias stream this;", but maybe
 that's not the best thing to do for a serializer.

 I concede: I've done s/(read|write)Data/\1/g on
 https://github.com/jasonwhite/io/blob/master/src/io/file.d
 and it now works on Windows with useful exception messages.

Cool, got to steal sysErrorString too! :)
 Actually these objects do just fine, since OS does the locking (or
 makes sure of something equivalent). If your stream is TLS there is no
 need for extra locking at all (no interleaving of I/O calls is
 possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request some
 operation, and as OS locks only on per sys-call basis something cruel
 may happen in the code that deals with buffering etc.

Oh yeah, you're right. As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio.

Then our goals are aligned. Be sure to take a peek at (if you haven't already): https://github.com/schveiguy/phobos/blob/new-io/std/io.d I have my share criticisms for it but it's a nice piece of work that addresses many questions of I/O I've yet to consider.
 Stuff like
 serializers and lexers would be more robust and easier to write.

Indeed and even beyond that. -- Dmitry Olshansky
Jan 07 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Jan-2014 19:55, Steven Schveighoffer пишет:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

 Yes, I'm gearing up to revisit that after a long D hiatus, and I came
 across this thread. At this point, I really really like the ideas that
 you have in this. It solves an issue that I struggled with, and my
 solution was quite clunky.

 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy
 (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 Its job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

The only interesting thing I'd add here is that some buffers may work without an underlying stream. The best examples are arrays and MM-files.
 3. Filter that has access to transform the buffer data/copy it.
 4. Ranges that use the buffer/filter to process/present the data.

Yes, yes and yes. I find it surprisingly good to see our vision seems to match. I was half-expecting you'd come along and destroy it all ;)
 The problem I struggled with is the presentation of UTF data of any
 format as char[] wchar[] or dchar[]. 2 things need to happen. First is
 that the data needs to be post-processed to perform any necessary byte
 swapping. The second is to transcode the data into the correct width.

 In this way, you can process UTF data of any type (I even have code to
 detect the encoding and automatically process it), and then use it in a
 way that makes sense for your code.

 My solution was to paste in a "processing" delegate into the class
 hierarchy of buffered streams that allowed one read/write access to the
 buffer. But it's clunky, and difficult to deal with in a generalized
 fashion.

 But the idea of using a buffer in between the stream and the range, and
 possibly bolting together multiple transformations in a clean way, makes
 this problem easy to solve, and I think it is closer to the vision
 Andrei/Walter have.

In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte and itself present a buffer interface (but of wchar). My own stuff currently deals only in ubyte, and the limited decoding is represented by a "decode" function that takes a buffer of ubyte and decodes UTF-8. I think typed buffers/filters are the way to go.
 I also like the idea of "pinning" the data instead of my mechanism of
 using a delegate (which was similar but not as general). It also has
 better opportunities for optimization.

 Other ideas that came to me that buffer filters could represent:

 * compression/decompression
 * encryption

 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

I'm seriously not seeing how interfacing with the C runtime could be fast enough.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

I'm thinking a generic locking wrapper is possible along the lines of:

shared Locked!(GenericBuffer!char) stdin; //usage

struct Locked(T){
shared:
private:
    T _this;
    Mutex mut;
public:
    //forwarded methods
}

The wrapper will introduce a lock, and implement every method of the wrapped struct roughly like this:

mut.lock();
scope(exit) mut.unlock();
(cast(T*)_this).method(args);

I'm sure it could be pretty automatic.
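For what it's worth, the forwarding part can indeed be made pretty automatic with opDispatch. The sketch below ignores shared (the hard part discussed later in the thread) and only shows the locked-forwarding mechanics; Locked and its layout here are illustrative, not a worked-out design:

```d
import core.sync.mutex : Mutex;

// Sketch: forward every method call to the wrapped object under a lock.
// A real version would take `shared` into account.
struct Locked(T)
{
    private T payload;
    private Mutex mut;

    this(T payload)
    {
        this.payload = payload;
        this.mut = new Mutex;
    }

    // Any method name dispatches to the wrapped object, lock held.
    auto opDispatch(string method, Args...)(Args args)
    {
        mut.lock();
        scope(exit) mut.unlock();
        return mixin("payload." ~ method)(args);
    }
}

unittest
{
    static struct Counter { int n; int inc() { return ++n; } }
    auto c = Locked!Counter(Counter());
    assert(c.inc() == 1); // forwarded through opDispatch
    assert(c.inc() == 2);
}
```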
 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

Lexers/parsers enjoy it - i.e. they work pretty much as ranges, especially when skipping spaces and the like. As I said, the main reason was: if it fits as a range, why not? After all it makes one-pass processing of data trivial, as it rides on top of foreach:

foreach(octet; mybuffer)
{
    if(interesting(octet))
        do_cool_stuff();
}

Things like countUntil make perfect sense when called on a buffer (e.g. to find a matching sentinel).
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 00:00, Steven Schveighoffer пишет:
 On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 16-Jan-2014 19:55, Steven Schveighoffer пишет:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:



 In essence a transcoding filter for UTF-16 would wrap a buffer of
 ubyte and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read. Think of the case of stdin, where the clients will be varied and written by many different people, and its interface is decided by Phobos. But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).

The other way around :) 4 code units - 1 code point.
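The corrected relationship is easy to check with std.utf.codeLength, which returns the number of code units a single code point occupies in a given encoding:

```d
import std.utf : codeLength;

unittest
{
    // U+10348 (GOTHIC LETTER HWAIR) - a single code point:
    assert(codeLength!char('\U00010348') == 4);  // 4 UTF-8 code units
    assert(codeLength!wchar('\U00010348') == 2); // a UTF-16 surrogate pair
    assert(codeLength!dchar('\U00010348') == 1); // always 1 in UTF-32
}
```

So re-using the same buffer works when reading UTF-32 as UTF-8: the output can never need more bytes than the input occupied.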
 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

I seriously not seeing how interfacing with C runtime could be fast enough.

 It's not. But an important stipulation, in order for this all to be
 accepted, is that it doesn't break existing code that expects things
 like printf and writef to interleave properly. However, I think we can
 have an opt-in scheme, and there are certain cases where we can
 proactively switch to a D-buffer scheme. For example, if you get a
 ByLine range, it expects to exhaust the data from the stream, and may
 not properly work with C printf. The idea is that stdio.File can switch
 at runtime from FILE * to D streams as needed or directed.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

 I'm thinking a generic locking wrapper is possible along the lines of:

 shared Locked!(GenericBuffer!char) stdin; //usage

 struct Locked(T){
 shared:
 private:
     T _this;
     Mutex mut;
 public:
     //forwarded methods
 }

 The wrapper will introduce a lock, and implement every method of the
 wrapped struct roughly like this:

 mut.lock();
 scope(exit) mut.unlock();
 (cast(T*)_this).method(args);

 I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.

The requirement may be that it's pure, or should I say "well-contained". In other words, as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...
 I've been expecting to have to do something like this, but not looking
 forward to it :(

 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

Lexers/parsers enjoy it - i.e. they work pretty much as ranges especially when skipping spaces and the like. As I said the main reason was: if it fits as range why not? After all it makes one-pass processing of data trivial as it rides on top of foreach: foreach(octect; mybuffer) { if(intersting(octect)) do_cool_stuff(); } Things like countUntil make perfect sense when called on buffer (e.g. to find matching sentinel).

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases. I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

"Convenient to work with" does ring good to me. I simply see no need to reinvent std.algorithm on buffers, especially the ones that just scan left-to-right. An example would be calculating a checksum of a stream (say the data comes from a pipe or socket, i.e. buffered). It's a trivial application of std.algorithm.reduce and there's no need to reinvent that wheel IMHO. -- Dmitry Olshansky
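The checksum example mentioned above, written against a plain ubyte[] (any input range of ubyte - buffered or not - would work the same way; a simple additive checksum stands in for a real one):

```d
import std.algorithm.iteration : reduce;

unittest
{
    ubyte[] data = [1, 2, 3, 250];
    // Simple additive checksum, truncated to one byte. A buffer range
    // over a pipe or socket would drop into the same call unchanged.
    auto sum = reduce!((a, b) => (a + b) & 0xFF)(0, data);
    assert(sum == 0); // (1 + 2 + 3 + 250) & 0xFF == 0
}
```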
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 02:41, Steven Schveighoffer пишет:
 On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 The other way around :) 4 code units - 1 code point.

I knew that was going to happen :)

Aye. BTW I haven't thought of writing into the buffer, but it works exactly the same. It could even be read/write - "discarded" data is written to the underlying stream, freshly loaded data is read from the stream. In the case of an input-only stream writes are no-ops; for output-only, reads are no-ops. With pinning it makes for cool multi-pass algorithms that actually output stuff into the file.
 This would be a key addition for ANY type in order to properly work with
 shared. BUT, I don't see how it works safely generically because you
 necessarily have to cast away shared in order to call the methods. You
 would have to limit this to only working on types it was intended for.

The requirement may be that it's pure, or should I say "well-contained". In other words, as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...

I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.

Aye.
 Convenient to work with does ring good to me. I simply see no need to
 reinvent std.algorithm on buffers especially the ones that just scan
 left-to-right.
 Example would be calculating a checksum of a stream (say data comes
 from a pipe or socket i.e. buffered). It's a trivial application of
 std.algorithm.reduce and there no need to reinvent that wheel IMHO.

I again think I am being misunderstood. Example might be appropriate:

Looking at the post by Walter I see I need to clarify things. And if you browse the thread you'd see my understanding also changed with time - I started with no stinkin' forward ranges with buffered I/O, only to later eat my words and make them happen.

First, let's call "buffer" a thing that packs an array and a few extras to support pinning, refilling of data from the underlying stream, and extending. It exposes the current "window" of the underlying stream. Now we have pins in that buffer that move along. A pin not only enforces that all data "to the left" stays accessible but also is a "view" of the buffer. It's even conceptually a range with extra primitives, outlined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I should stick to naming it BufferRange. The "real buffer" is an internal construct and BufferRanges share ownership of it. This is exactly what I ended up with in my second branch.

"real" buffer:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L417
vs buffer range over it:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L152

Naming issues are apparently even worse than Walter implies.
 struct Buffer {...} // implements BOTH forward range and Buffer primitives

 struct OtherBuffer {...} // only implements Buffer primitives.

 If isBuffer is modified to not require isForwardRange, then both Buffer
 and OtherBuffer will work as buffers, but only Buffer works as a range.

 But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?

I think I should call it BufferRange from now on, and bring my docs in line. It may make sense to provide the naked Buffer itself as a construct (it has a simpler interface) and have a generic BufferRange wrapper. I'm just not seeing user code using the former - too error-prone and unwieldy.
 So we can implement buffers that are both ranges and buffers, and will
 work with std.algorithm without modification (and that's fine and
 expected by me), but do we need to *require* that? Are we
 over-specifying?

A random-access range had to require .front/.back - maybe that was over-specification too? Some stuff is easy to index but not to "drop an item off at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.
 Is there a possibility that someone can invent a buffer
 that satisfies the primitives of say a line-by-line reader, but cannot
 possibly be a forward range?

I hardly can see that:

front --> lookahead(1)[0]
empty --> lookahead(1).length == 0
popFront --> seek(1) or enforce(seek(1))
save --> well, there has got to be a way to pin the data in the buffer?

And they surely could be better implemented inside of a specific buffer range. -- Dmitry Olshansky
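That mapping can be written down directly as a shim. RangeOver and the tiny Buf below are illustrative stand-ins (following the hypothetical lookahead/seek primitives from the proposal); save is a plain copy here, where a real buffer range would pin the data:

```d
// Forward-range shim over anything providing lookahead(n)/seek(n).
struct RangeOver(B)
{
    B buf;

    @property bool empty() { return buf.lookahead(1).length == 0; }
    @property auto front() { return buf.lookahead(1)[0]; }
    void popFront() { buf.seek(1); }
    @property RangeOver save() { return this; } // stand-in for pinning
}

unittest
{
    // Minimal stand-in buffer, just enough for the shim.
    static struct Buf
    {
        ubyte[] data;
        ubyte[] lookahead(size_t n)
        { return data[0 .. n < data.length ? n : data.length]; }
        bool seek(size_t n)
        { if (n > data.length) return false; data = data[n .. $]; return true; }
    }

    import std.algorithm.searching : countUntil;
    auto r = RangeOver!Buf(Buf([5, 6, 7, 0, 9]));
    assert(countUntil(r, 0) == 3); // find a sentinel via std.algorithm
}
```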
Jan 17 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.

Agreed, a read/write buffer is a bad idea. As for a write-only buffer implemented in the same style as the read-only one, well, I need to try it first. -- Dmitry Olshansky
Jan 17 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 18:52, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 09:12:56 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.

Agreed, read/write buffer is a bad idea. As for write-only buffer implemented in the same style as read-only, well I need to try it first.

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I'm doubting that, actually. Just use the freaking MM-file directly - you'd be better off efficiency-wise. And there would be no need to switch back and forth. Modern OSes use the paging subsystem for files anyway (unless opened with O_DIRECT or similar).
 I think in my code, what I do is if someone wants to read, and is
 currently writing (on a read/write stream), I flush the buffer, then
 switch it to be in read mode, and then seek to the appropriate place, to
 fill the buffer.

 When switching from read to write, it's much easier, since the buffer
 starts out empty.

 So I think a buffer type that can be used for read *OR* write is good,
 just not simultaneously.

See above. I don't think there is a need for these paces with flushing stuff. And if it comes to pipes and sockets then just having 2 buffers fits the bill better.
 However, simultaneous read/write is probably something that could be
 useful for a niche use case (e.g. some transformative process), as long
 as you don't mind having 2 file descriptors open :)

-- Dmitry Olshansky
Jan 17 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 13:19, Kagamin пишет:
 On Thursday, 16 January 2014 at 15:55:07 UTC, Steven Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy
 (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 It's job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer?

What's the problem? I don't see how struct/class changes anything there; it's a member field that is an array, which we surely can expand.
 Won't it suffer from AA syndrome?

A buffer is created with factory functions only. It's not like an AA that grows from an empty/null state. An empty buffer (.init) doesn't grow - it's simply empty. -- Dmitry Olshansky
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 10:40:02 -0500, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:52, Steven Schveighoffer пишет:

 Keep in mind that a read/write buffered stream *should* exist. It's just
 that you need to switch how you use the buffer.

 I'm doubting that actually. Just use the freaking MM-file directly you'd
 be better off efficiency wise. And there would be no need to switch back
 and forth. Modern OS use paging subsystem for files anyway (unless
 opened with O_DIRECT or similar).

Remember we have to support whatever stdio.File does, and I think it does support this.

-Steve
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 09:12:56 -0500, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

 In my experience, the code/process for a write buffer is significantly
 different than the code for a read buffer. A read/write buffer is very
 difficult to make, because you either need 2 file pointers, or need to
 constantly seek the single file pointer to overwrite the existing data.

 Agreed, read/write buffer is a bad idea. As for write-only buffer
 implemented in the same style as read-only, well I need to try it first.

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I think in my code, what I do is if someone wants to read, and is currently writing (on a read/write stream), I flush the buffer, then switch it to be in read mode, and then seek to the appropriate place, to fill the buffer.

When switching from read to write, it's much easier, since the buffer starts out empty.

So I think a buffer type that can be used for read *OR* write is good, just not simultaneously.

However, simultaneous read/write is probably something that could be useful for a niche use case (e.g. some transformative process), as long as you don't mind having 2 file descriptors open :)

-Steve
Jan 17 2014
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 31 December 2013 at 01:51:23 UTC, Brad Anderson wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
 wrote:
 Proposal

 Having never written any parser I'm not really qualified to seriously
 give comments or review it, but it all looks very nice to me.

 Speaking as just an end user of these things: whenever I use ranges
 over files or from, say, std.net.curl, the byLine/byChunk interface
 always feels terribly awkward to use, which often leads to me just
 giving up and loading the entire file/resource into an array.

 It's the boundaries that I stumble over. byLine never fits when I want
 to extract something multiline, but byChunk doesn't fit either, because
 if what I'm searching for lands on the boundary I'll miss it. Being
 able to just do a matchAll() on a file, std.net.curl, etc. without
 sacrificing performance and memory would be such a massive gain for
 usability.

 Just a simple example of where I couldn't figure out how to utilize
 either byLine or byChunk without adding some clunky homegrown buffering
 solution: this is code that scrapes website titles from the pages of
 URLs in IRC messages.

 [...]

 If I'm understanding your proposal correctly, that get(url) could be
 replaced with a hypothetical std.net.curl "buffer range" and that could
 be passed directly to matchFirst. It would only take up, at most, the
 size of the buffer in memory (which could grow if the capture grows to
 be larger than the buffer) and wouldn't read the unneeded portion of
 the resource at all. That would be such a huge win for everyone, so I'm
 very excited about this proposal. It addresses all of my current
 problems.

 [...]

I find the same to be the case with APIs used to process files. It would be a real win to have a simple, efficient, and possibly non-blocking range interface over a file, able to access a single byte or character at a time. I remember a discussion on this forum that spoke positively about the async and await combination of keywords in .NET 4.5. If asynchronous and synchronous functionality could be combined with a flexible buffer in a UFCS call chain, I think that would hit a sweet spot for code composability and understandability.

Joseph
Dec 30 2013
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky 
wrote:
 31-Dec-2013 05:53, Joseph Cassman пишет:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
 wrote:

 I'm thinking there might be a way to bridge the new range type with
 ForwardRange, but not directly as defined at the moment. A possibility
 I'm considering is to separate out a Buffer object (not a range), and
 let it be shared among views - light-weight buffer-ranges. Then, if we
 imagine that these light-weight buffer-ranges work as marks (i.e. they
 pin down the buffer) in the current proposal, they could be forward
 ranges. I need to think on this, as the ability to integrate well with
 forward-ish algorithms would be a great improvement.

I think I now understand a bit better what you were thinking when you first posted
 input-source <--> buffer range <--> parser/consumer

 Meaning that if we can mix and match parsers with buffer 
 ranges, and buffer ranges with input sources we had grown 
 something powerful indeed.

Being able to wrap an already-in-use range object with the buffer interface as you do in the sample code (https://github.com/blackwhale/datapicked/blob/master/dgrep.d) is good for composability. It also allows existing functionality in std.algorithm to be reused as-is.

I think the new range type could also be added directly to some new, or perhaps retrofitted into existing, code to add the new functionality without sacrificing performance. In that way the internal payload already used to get the data (say by the input range) could be reused without having to allocate new memory to support the buffer API.

As one idea of using a buffer range from the start, a function template by(T) (where T is ubyte, char, wchar, or dchar) could be added to std.stdio. It would return a buffer range object providing more functionality than byChunk or byLine while adding access to the entire stream of data in a file in a contiguous and yet efficient manner. Seems to help with the issues faced in processing file data mentioned in previous comments in this thread.

Joseph
Dec 31 2013
prev sibling next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
I've been rewriting a bit of the lexer in DScanner.

https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
(Ignore the "invariant" block. I've been trying to hunt down some 
unrelated memory corruption issue)

One thing that I've found to be very useful is the ability to 
increment column or index counters inside of the lexer range's 
popFront method. I think that I'd end up using my own range type 
for arrays, but construct a lexer range on top of your buffer 
range for anything else.
Jan 04 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Jan-2014 13:39, Brian Schott wrote:
 I've been rewriting a bit of the lexer in DScanner.

 https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
 (Ignore the "invariant" block. I've been trying to hunt down some
 unrelated memory corruption issue)

startsWith & peek could be implemented with the proposed lookahead, so overall the buffer range looks like a fit for your use case.
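For illustration, here is a minimal sketch of how startsWith falls out of a lookahead primitive. The ArrayBuffer type and its lookahead/seek members are hypothetical stand-ins for the proposed API, not the actual datapicked code:

```d
import std.stdio;

// Hypothetical minimal array-backed buffer: lookahead(n) returns a slice
// of exactly n items, or an empty slice if fewer than n remain.
struct ArrayBuffer(T)
{
    T[] data;
    size_t cur;

    inout(T)[] lookahead(size_t n) inout
    {
        return (data.length - cur >= n) ? data[cur .. cur + n] : null;
    }

    void seek(size_t n) { cur += n; }
}

// startsWith in terms of lookahead: one length check plus one array
// compare, no per-element front/popFront/save dance.
bool startsWith(Buf, T)(ref Buf buf, const(T)[] prefix)
{
    auto ahead = buf.lookahead(prefix.length);
    return ahead.length != 0 && ahead == prefix;
}

void main()
{
    auto buf = ArrayBuffer!char("import std;".dup);
    writeln(startsWith(buf, "import")); // true
    writeln(startsWith(buf, "module")); // false
}
```

peek would be the same idea with lookahead(1).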
 One thing that I've found to be very useful is the ability to increment
 column or index counters inside of the lexer range's popFront method.

I think that I'd end up using my own range type for arrays, but construct
 a lexer range on top of your buffer range for anything else.

I think it should be possible to wrap a buffer range, add the related accounting, and simplify the interface for your own use.

As an experiment I have a greatly simplified buffer range - it builds on top of forward range, making it instantly compatible with most of std.algorithm. The only problems at the moment are slightly worse performance with DMD (who cares) and that it segfaults with LDC (and this is a problem). Would be nice to test with LDC first but I think I'll publish it anyway, stay tuned.

-- 
Dmitry Olshansky
Jan 04 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Jan-2014 13:23, Brian Schott wrote:
 My experimental lexer generator now uses the buffer range.
 https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx

Cool!

A minor note: https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/d/lexer.d#L487

lookahead(n) should always give a slice of length n, or 0, so you may as well test for != 0. In general you should avoid marking too often, as it has a bit of overhead.

I'm currently in favor of my 2nd design where marking is replaced by .save returning an independent view of the buffer, making Buffer a normal forward range that is cheap to copy.
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/

Sadly it segfaults with LDC so I can't quite assess its performance :(
 The problem that I have with it right now is that
 range.lookbehind(1).length != range.lookahead(1).length. This was
 confusing.

That indeed may look confusing at first, but keep in mind that range.front is in fact a lookahead of 1. Thus all algorithms that can work with a lookahead of 1 - LL(1), LALR(1), etc. - would easily work with any forward/input range (though any practical parser needs to slice or copy parts of the input aside).

-- 
Dmitry Olshansky
Jan 12 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky 
wrote:
 IMHO C run-time I/O has no use in D. The amount of work spent 
 on special-casing the non-locking primitives of each C run-time,
 repeating legacy mistakes (like text mode, codepages and 
 locales) and stumbling on portability problems (getc is a macro 
 we can't have) would have been better spent elsewhere - 
 designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API:
https://github.com/jasonwhite/io/blob/master/src/io/file.d

It allows for more robust open-flags than the fopen-style flags (like "r+"). For seeking, I adapted it to use your concept of marks (which I quite like) instead of SEEK_CUR and friends. Please feel free to use this! (The Windows implementation hasn't been tested yet, so it probably doesn't work.)

BTW, I was also working on buffered streams, but couldn't figure out a good way to do it.
Jan 04 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
 As an advice I'd suggest to drop the 'Data' part in 
 writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).
 Will poke around. I like this (I mean composition):
 https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets). Currently, std.stdio has all three of those facets rolled into one.
Jan 05 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
 In my view text implies something like:

 void write(const(char)[]);
 size_t read(char[]);

 And binary would be:

 void write(const(ubyte)[]);
 size_t read(ubyte[]);

 Should not clash.

Those would do the same thing for either text or binary data. When I say text writing, I guess I mean the serialization of any type to text (like what std.stdio.write does):

void write(T)(T value);         // Text writing
void write(const(ubyte)[] buf); // Binary writing

write([1, 2, 3]); // want to write "[1, 2, 3]"
                  // but writes "\x01\x02\x03"

This clashes. We need to be able to specify if we want to write/read a text representation or just the raw binary data. In the above case, the most specialized overload will be called.
 In-memory array IMHO better not pretend to be a stream. This 
 kind of wrapping goes in the wrong direction (losing 
 capabilities). Instead wrapping a stream and/or array as a 
 buffer range proved to me to be more natural (extending 
 capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations? I don't see why it would conflict with a range interface. As I understand it, ranges act on a single element at a time while streams act on multiple elements at a time. For ArrayBuffer in datapicked, a stream-style read is just lookahead(n) and cur += n. What capabilities are lost?

If buffers/arrays provide a stream interface, then they can be used by code that doesn't directly need the buffering capabilities but would still benefit from them.
Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking as thread locality makes no difference there.
Jan 05 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides 
 raw stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer.

I concede, I've s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.
 Shouldn't buffers/arrays provide a stream interface in 
 addition to
 buffer-specific operations?

I think it may be best not to. Buffer builds on top of an unbuffered stream. If there is a need to do large reads it may as well use the naked stream and not worry about extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as read even though it's not.
 I don't see why it would conflict with a
 range interface. As I understand it, ranges act on a single 
 element at a
 time while streams act on multiple elements at a time. For 
 ArrayBuffer
 in datapicked, a stream-style read is just lookahead(n) and 
 cur += n.
 What capabilities are lost?

In short - lookahead is slicing, read would be copying. For me the prime capability of an array is slicing, which is dirt cheap O(1). On the other hand the stream interface is all about copying bytes to the user-provided array.

In this setting it means that if you want to wrap an array as a stream, then it must follow the generic stream interface. The latter cannot and should not think of slicing and the like. Then while wrapping it in some adapter up the chain it's no longer seen as an array (because the adapter is generic too and is made for streams). This is what I call capability loss.
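The slicing-vs-copying distinction shows up directly with plain D arrays; this is just an illustration of the capability gap being discussed, not any proposed API:

```d
// A slice ("lookahead") aliases the buffer's storage, O(1) and zero-copy;
// a stream-style read must copy into a caller-provided array.
void main()
{
    ubyte[] storage = [10, 20, 30, 40];

    // "lookahead": the slice shares memory with the source, nothing copied.
    ubyte[] view = storage[0 .. 2];
    view[0] = 99;
    assert(storage[0] == 99);  // same underlying bytes

    // stream-style "read": bytes are copied into the user's array.
    ubyte[2] dest;
    dest[] = storage[2 .. 4];  // copy
    dest[0] = 7;
    assert(storage[2] == 30);  // original untouched
}
```

Once an array is hidden behind a generic read-into-buffer interface, only the second, copying style is available to code up the chain.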
 If buffers/arrays provide a stream interface, then they can be 
 used by
 code that doesn't directly need the buffering capabilities but 
 would
 still benefit from them.

See above - it would be better if the code was written for ranges, not streams. Then e.g. slicing of a buffer range on top of an array works just as cheaply as it does for arrays, and zero copies are made (=performance).

Okay, I see. I'm just concerned about composability. I'll have to think more about how it's affected.

(BTW, you can probably simplify lookahead/lookbehind with look(ptrdiff_t n) where the sign of n indicates ahead/behind.)
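A rough sketch of that suggestion, with a hypothetical Window type (not part of either library), where the sign of n picks the direction:

```d
// Hypothetical unified look(n): positive n looks ahead of the cursor,
// negative n looks behind it; empty slice when out of range.
struct Window
{
    ubyte[] data;
    size_t cur;

    ubyte[] look(ptrdiff_t n)
    {
        if (n >= 0)
            return (data.length - cur >= n) ? data[cur .. cur + n] : null;
        else
            return (cur >= -n) ? data[cur + n .. cur] : null;
    }
}

void main()
{
    auto w = Window([1, 2, 3, 4], 2);
    assert(w.look(2)  == [3, 4]);  // ahead
    assert(w.look(-2) == [1, 2]);  // behind
}
```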
 Actually these objects do just fine, since OS does the locking 
 (or makes sure of something equivalent). If your stream is TLS 
 there is no need for extra locking at all (no interleaving of 
 I/O calls is possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request 
 some operation, and as OS locks only on per sys-call basis 
 something cruel may happen in the code that deals with 
 buffering etc.

Oh yeah, you're right.

As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio. Stuff like serializers and lexers would be more robust and easier to write.
Jan 06 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
My experimental lexer generator now uses the buffer range. 
https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx

The problem that I have with it right now is that 
range.lookbehind(1).length != range.lookahead(1).length. This was 
confusing.
Jan 09 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky 
<dmitry.olsh gmail.com> wrote:

 07-Jan-2014 11:59, Jason White wrote:
 On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides raw
 stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer. I concede, I've s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.

Cool, got to steal sysErrorString too! :)
 Actually these objects do just fine, since OS does the locking (or
 makes sure of something equivalent). If your stream is TLS there is no
 need for extra locking at all (no interleaving of I/O calls is
 possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request some
 operation, and as OS locks only on per sys-call basis something cruel
 may happen in the code that deals with buffering etc.

 Oh yeah, you're right. As a side note: I would love to get a kick-ass
 I/O stream package into Phobos. It could replace std.stream as well as
 std.stdio.

 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Yes, I'm gearing up to revisit that after a long D hiatus, and I came across this thread.

At this point, I really really like the ideas that you have in this. It solves an issue that I struggled with, and my solution was quite clunky.

I am thinking of this layout for streams/buffers:

1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written)
2. Buffer like you have, based on a struct, with specific primitives. Its job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.
3. Filter that has access to transform the buffer data/copy it.
4. Ranges that use the buffer/filter to process/present the data.

The problem I struggled with is the presentation of UTF data of any format as char[], wchar[] or dchar[]. 2 things need to happen. First is that the data needs to be post-processed to perform any necessary byte swapping. The second is to transcode the data into the correct width.

In this way, you can process UTF data of any type (I even have code to detect the encoding and automatically process it), and then use it in a way that makes sense for your code.

My solution was to paste in a "processing" delegate into the class hierarchy of buffered streams that allowed one read/write access to the buffer. But it's clunky, and difficult to deal with in a generalized fashion.

But the idea of using a buffer in between the stream and the range, and possibly bolting together multiple transformations in a clean way, makes this problem easy to solve, and I think it is closer to the vision Andrei/Walter have.

I also like the idea of "pinning" the data instead of my mechanism of using a delegate (which was similar but not as general). It also has better opportunities for optimization.
Other ideas that came to me that buffer filters could represent:
* compression/decompression
* encryption

I am going to study your code some more and see how I can update my code to use it. I still need to maintain the std.stdio.File interface, and Walter is insistent that the initial state of stdout/err/in must be synchronous with C (which kind of sucks, but I have plans on how to make it not be so bad).

There is still a lot of work left to do, but I think one of the hard parts is done, namely dealing with UTF transcoding. The remaining sticky part is dealing with shared. But with structs, this should make things much easier.

One question, is there a reason a buffer type has to be a range at all? I can see where it's easy to make it a range, but I don't see higher-level code using the range primitives when dealing with chunks of a stream.

-Steve
Jan 16 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky 
<dmitry.olsh gmail.com> wrote:

 16-Jan-2014 19:55, Steven Schveighoffer wrote:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

 Yes, I'm gearing up to revisit that after a long D hiatus, and I came
 across this thread.

 At this point, I really really like the ideas that you have in this. It
 solves an issue that I struggled with, and my solution was quite clunky.

 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy (which
 I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 Its job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

 The only interesting thing I'd add here is that some buffer may work
 without underlying stream. Best examples are arrays and MM-files.

Yes, but I would stress that for convenience, the buffer should forward some of the stream primitives (such as seeking) in cases where seeking is possible, at least in the case of a buffer that wraps a stream.

That actually is another point that would have sucked with my class-based solution -- allocating a class to use an array as backing.
 3. Filter that has access to transform the buffer data/copy it.
 4. Ranges that use the buffer/filter to process/present the data.

 Yes, yes and yes. I find it surprisingly good to see our vision seems to
 match. I was half-expecting you'd come along and destroy it all ;)

:) I've been preaching for a while that ranges don't make good streams, and that streams should be classes, but I hadn't considered splitting out the buffer. I think it's the right balance.
 The problem I struggled with is the presentation of UTF data of any
 format as char[] wchar[] or dchar[]. 2 things need to happen. First is
 that the data needs to be post-processed to perform any necessary byte
 swapping. The second is to transcode the data into the correct width.

 In this way, you can process UTF data of any type (I even have code to
 detect the encoding and automatically process it), and then use it in a
 way that makes sense for your code.

 My solution was to paste in a "processing" delegate into the class
 hierarchy of buffered streams that allowed one read/write access to the
 buffer. But it's clunky, and difficult to deal with in a generalized
 fashion.

 But the idea of using a buffer in between the stream and the range, and
 possibly bolting together multiple transformations in a clean way, makes
 this problem easy to solve, and I think it is closer to the vision
 Andrei/Walter have.

 In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte
 and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read. Think of the case of stdin, where the clients will be varied and written by many different people, and its interface is decided by Phobos.

But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).
 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

 I'm seriously not seeing how interfacing with C runtime could be fast
 enough.

It's not. But an important stipulation in order for this to all be accepted is that it doesn't break existing code that expects things like printf and writef to interleave properly.

However, I think we can have an opt-in scheme, and there are certain cases where we can proactively switch to a D-buffer scheme. For example, if you get a ByLine range, it expects to exhaust the data from the stream, and may not properly work with C printf.

The idea is that stdio.File can switch at runtime from FILE * to D streams as needed or directed.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

 I'm thinking a generic locking wrapper is possible along the lines of:

 shared Locked!(GenericBuffer!char) stdin; //usage

 struct Locked(T){
 shared:
 private:
 	T _this;
 	Mutex mut;
 public:
 	//forwarded methods
 }

 The wrapper will introduce a lock, and implement every method of the
 wrapped struct roughly like this:

 mut.lock();
 scope(exit) mut.unlock();
 (cast(T*)&_this).method(args);

 I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.

I've been expecting to have to do something like this, but not looking forward to it :(
 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

 Lexers/parsers enjoy it - i.e. they work pretty much as ranges
 especially when skipping spaces and the like. As I said the main reason
 was: if it fits as range why not? After all it makes one-pass processing
 of data trivial as it rides on top of foreach:

 foreach(octet; mybuffer)
 {
 	if(interesting(octet))
 		do_cool_stuff();
 }

 Things like countUntil make perfect sense when called on a buffer (e.g.
 find matching sentinel).

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases.

I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

-Steve
Jan 16 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:

BufferRange
Buffer
InputStream
Stream
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 00:18, Walter Bright wrote:
 On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is
 defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:
BufferRange
Buffer

You even quoted it:
BufferRange .. (for now called simply Buffer)...

 InputStream
 Stream

One and the same, as I dealt with the input side of the problem only.

-- 
Dmitry Olshansky
Jan 16 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2014 2:30 PM, Dmitry Olshansky wrote:
 17-Jan-2014 00:18, Walter Bright wrote:
 On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is
 defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:
BufferRange
Buffer

You even quoted it: >BufferRange .. (for now called simply Buffer)...

I know. But using two names for the same thing makes for confusing documentation. There also appears to be no rationale for "for the moment".
 InputStream
 Stream

One and the same as I dealt with the input side of the problem only.

Same problem with multiple names for the same thing, also there's no definition of either name.
Jan 16 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 The other way around :) 4 code units - 1 code point.

I knew that was going to happen :)
 This would be a key addition for ANY type in order to properly work with
 shared. BUT, I don't see how it works safely generically because you
 necessarily have to cast away shared in order to call the methods. You
 would have to limit this to only working on types it was intended for.

The requirement may be that it's pure or should I say "well-contained". In other words as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...

I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.
 Convenient to work with does ring good to me. I simply see no need to  
 reinvent std.algorithm on buffers especially the ones that just scan  
 left-to-right.
 Example would be calculating a checksum of a stream (say data comes from  
 a pipe or socket i.e. buffered). It's a trivial application of  
 std.algorithm.reduce and there no need to reinvent that wheel IMHO.

I again think I am being misunderstood. An example might be appropriate:

struct Buffer {...}      // implements BOTH forward range and Buffer primitives
struct OtherBuffer {...} // only implements Buffer primitives

If isBuffer is modified to not require isForwardRange, then both Buffer and OtherBuffer will work as buffers, but only Buffer works as a range. But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?

So we can implement buffers that are both ranges and buffers, and they will work with std.algorithm without modification (and that's fine and expected by me), but do we need to *require* that? Are we over-specifying? Is there a possibility that someone can invent a buffer that satisfies the primitives of say a line-by-line reader, but cannot possibly be a forward range?

-Steve
Jan 16 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Thursday, 16 January 2014 at 15:55:07 UTC, Steven 
Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class 
 hierarchy (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific 
 primitives. It's job is to collect data from the underlying 
 stream, and present it to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?
Jan 17 2014
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 17 January 2014 at 09:19:12 UTC, Kagamin wrote:
 On Thursday, 16 January 2014 at 15:55:07 UTC, Steven 
 Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class 
 hierarchy (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific 
 primitives. It's job is to collect data from the underlying 
 stream, and present it to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?

It is the lazily initialized nature of AAs that causes the problem. Structs can, but don't have to, replicate the behaviour.
Jan 17 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 17 January 2014 at 09:33:41 UTC, Dmitry Olshansky 
wrote:
 What's the problem? I don't see how struct/class can change 
 there anything, it's a member field that is an array that we 
 surely can expand.

Ah, I thought one can copy the buffer, now I see you pass it by ref.
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly  
 the same. It could be even read/write -"discarded" data is written to  
 underlying stream, freshly loaded is read from stream. Now in case of  
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually  
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.
 But as you have it now, isBuffer!OtherBuffer is false. Is this  
 necessary?

I think I should call it BufferRange from now on. And bring my docs in line. It may make sense to provide naked Buffer itself as a construct (it has simpler interface) and have generic BufferRange wrapper. I'm just not seeing user code using the former - too error prone and unwieldy.

In a sense, I agree. There aren't many things to do directly with a buffer (i.e. there will likely be few filters), and a buffer range provides enough primitives to make front/popFront/empty trivial.
 So we can implement buffers that are both ranges and buffers, and will
 work with std.algorithm without modification (and that's fine and
 expected by me), but do we need to *require* that? Are we
 over-specifying?

Random-access ranges had to require .front/.back - maybe that was over-specification too? Some stuff is easy to index but not to "drop an item off at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.
 Is there a possibility that someone can invent a buffer
 that satisfies the primitives of say a line-by-line reader, but cannot
 possibly be a forward range?

I hardly can see that:

front    --> lookahead(1)[0]
empty    --> lookahead(1).length != 0
popFront --> seek(1) or enforce(seek(1))
save     --> well there got to be a way to pin the data in the buffer?

And they surely could be better implemented inside of a specific buffer range.
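That mechanical mapping can be sketched as a tiny shim; ArrayBuffer, lookahead and seek are illustrative names here, not the actual buffer API:

```d
// Wrap any type with lookahead/seek as an input range, following the
// front/empty/popFront mapping described in the message above.
struct RangeOver(Buf)
{
    Buf* buf;

    @property auto front() { return buf.lookahead(1)[0]; }
    @property bool empty() { return buf.lookahead(1).length == 0; }
    void popFront()        { buf.seek(1); }
}

// Toy array-backed "buffer" to exercise the shim.
struct ArrayBuffer
{
    ubyte[] data;
    size_t cur;

    ubyte[] lookahead(size_t n)
    {
        return (data.length - cur >= n) ? data[cur .. cur + n] : null;
    }

    void seek(size_t n) { cur += n; }
}

void main()
{
    import std.algorithm : equal;

    auto b = ArrayBuffer([1, 2, 3]);
    auto r = RangeOver!ArrayBuffer(&b);
    assert(equal(r, [1, 2, 3]));
}
```

save is the part that doesn't fall out mechanically: it needs some way to pin the buffer contents, which is exactly the point under discussion.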

I'll have to take a closer look at your code to have a reasonable response. But I think this looks fine so far. -Steve
Jan 17 2014