
digitalmars.D - [RFC] I/O and Buffer Range

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Didn't mean this to end up as a kind of under-the-tree present... Planned 
to post long ago but I barely had the time to flesh it out.

TL;DR: Solve the input side of I/O problem with new kind of ranges.
I'm taking on the buffering primitive on top of the low-level unbuffered 
I/O sources specifically.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text & binary alike) in D soon 
finds out that while the slice-em-up approach with random access ranges 
works very nicely, any attempt to extend that beyond arrays and 
in-memory stuff gets really awkward. It adds up to implementing no small 
amount of bookkeeping (buffering), all the while having to use 
primitives that just don't map well to the underlying "hardware" - the 
buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? 
Save the range, do 3 front/popFront calls with empty checks, then copy 
the saved range back. Painful, especially when you know full well that 
it could be just a length check plus 3 direct reads of an array. Even if 
the underlying buffer allows you to slice up the contents and do 
array-wise operations on them, the dumbed-down forward range interface 
just won't let you.
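To make the cost concrete, here is a sketch of that LL(3) check; only 
save/front/popFront are standard range primitives, while startsWithABC 
and the commented buffer-range variant (using the proposal's `lookahead` 
name) are hypothetical illustrations:

```d
import std.range;

// Forward-range way: fork the range and step it element by element.
bool startsWithABC(R)(R r)
    if (isForwardRange!R)
{
    auto s = r.save;            // copy the whole iteration state
    foreach (ch; "abc")
    {
        if (s.empty || s.front != ch)
            return false;
        s.popFront();           // one empty check + one pop per element
    }
    return true;
}

// Buffer-range way (hypothetical): one length check plus 3 direct reads.
//     auto w = buf.lookahead(3);
//     bool ok = (w.length == 3 && w[] == cast(const(ubyte)[]) "abc");
```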

And once all of the hard (and useless) work to support forward ranges 
is done, you realize you have a second parser that should support 
forward ranges as well. This leads to the conclusion that parsers simply 
don't want to work on forward ranges; they need something bigger, and 
all the extra work you did on buffering really belongs to that bigger 
primitive.

To put the last nail in the coffin - most of the interesting sources 
can't even be forward ranges! A stream over a TCP socket - how do you 
'save' that, Sherlock? Keeping in mind that it must return the same 
type, we'd have been better off calling it "fork" back then.

To sum it up: the current selection of ranges is not good enough to 
work with I/O _efficiently_. Yet ranges are very neat, and we'd better 
retain their strengths and capitalize on the existing work done in this 
area (e.g. the slew of good stuff in std.algorithm and std.range). We've 
got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side 
of the I/O problem. Output IMHO might need some tweaks, but with 
OutputRange it's in a much better shape than input.

The said new input primitive (from now on a "buffer range") must 
naturally and efficiently support the following:
1) The zero-copy, insanely fast & popular slice-em-up approach for 
in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like.
3) Memory-mapped files, esp. ones that don't fit into memory.
4) Primitives that (all kinds of) parsers need, including lookahead and 
lookbehind.

It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and 
buffer ranges with input sources, we will have grown something powerful 
indeed.

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for 
is to simplify the set of primitives and/or hear better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with e.g. a bit trimmed-down std.regex using this 
abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there is no more need to 
work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before: 
fgets + std.regex.matchFirst. Note that this (old) solution cheats 
twice - it seeks only 1 match per line, and it knows the line length 
ahead of time.

For me this proves that surpassing the C run-time performance-wise is 
well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined 
here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives but my goal was 
sufficient first, minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing 
(and foreach). This also gets us somewhat closer to the 
backwards-compatibility goal. :)

Accept that, given requirements (1)-(3), we are working with a "sliding 
window" over whatever the true source of data is. Thus the sliding 
window can be a whole array, a mapped area of a file, or the buffer of a 
network stream. A sliding window may be moved across the input stream 
(~ re-loading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the 
sliding-window model and match them with the primitives a parser would 
"like".

2. Slicing "after the fact".

This means the ability to mark relevant parts of the buffer as the 
start of something "useful", and to require the underlying 
implementation, when the time comes to move the window, to keep the data 
starting from there. Typically, later "down the stream", when the 
boundaries of the slice are established, it's extracted, examined and 
(rarely!) copied over. The ability to avoid the copy unless absolutely 
necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner), rewritten 
with my primitives, looks like this:

{
   //pin this position so that buffering won't lose it
   auto m = input.mark();
   while(!input.empty && isAlpha(input.front)){
       input.popFront();
   }
   //get a slice from 'm' to the current position
   auto word = input.slice(m);
   //grab a copy if we haven't seen it before
   if(word !in identifiers)
       identifiers.insert(word.idup);
} //here m's destructor unpins the position in input

To address slicing (a parser requirement) I had to introduce the marking 
and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save the iteration state was a bad idea. 
It's not just wasteful memory-wise, it's _semantically_ costly. In the 
case of a buffer range it would also imply some "re-reading" of the I/O 
source, as the copy has to be a completely independent view of the same 
input(!). Hence the "save" of a forward range is an inadequate primitive 
that must be dropped.

However, when the time comes to backtrack (and many parsers do that, 
quite often) the state has to be restored. To minimize the primitive 
count and not break requirement (2), reuse 'mark' to save the state and 
add 'seek' to restore the position to the marked point. Since it was 
pinned, it's always in the buffer and readily accessible, keeping the 
ability to work over streams.
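As a sketch of this mark/seek backtracking (the tryParseKeyword and 
parseIdentifier helpers are hypothetical, the primitives are the 
proposal's):

```d
// Hypothetical backtracking with mark/seek - no range copying involved.
auto m = input.mark();        // pin the current position in the buffer
if (!tryParseKeyword(input))  // may consume some input on a partial match
{
    input.seek(m);            // roll back: the pinned data is still in the window
    parseIdentifier(input);   // try the alternative production
}
// m's destructor unpins the position when it goes out of scope
```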

4. Even cheaper save/restore.

Something discovered the hard way in a practical setting is that 
setting up the boundaries of stuff to slice (captures in regex) must be 
dirt cheap - integer-assignment cheap.

This means that using marking for every sub-slice is prohibitively 
costly, as each mark has to communicate with the buffer (currently by 
bumping a counter in some array). The idea that worked out in std.regex 
is to use a single "vantage point" mark, take one big slice off it, and 
then make sub-slices of that as the plain array it is. Then, from time 
to time, the "vantage point" is updated when there are no sub-slices in 
mid-air.

This leads to a 'tell' primitive that gives the offset from a given 
mark to the current position, plus 'seek' overloads that work with 
relative offsets (positive and negative).
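A sketch of the vantage-point technique (mark/tell/slice names per the 
proposal; the capture bookkeeping shown is a hypothetical illustration):

```d
// One real mark per match attempt; every capture boundary is an integer.
auto vantage = input.mark();            // the only buffer interaction
size_t capStart = input.tell(vantage);  // integer-assignment cheap
// ... match more of the input ...
size_t capEnd = input.tell(vantage);

// Materialize captures lazily off the single big slice:
auto window  = input.slice(vantage);    // plain array view from the mark
auto capture = window[capStart .. capEnd];
```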

Another use for relative seek is skipping over data without looking at 
it, potentially dropping whole buffers away (there is an overlap with 
popFrontN though). Also, fixed-width backtracking is easily had with 
relative seek if we can instruct the buffer range to keep at least the 
last K bytes in the buffer (i.e. require minimal history).

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an 
array, hence one may easily take a peek at some portion of it as a plain 
ubyte[]. The implementation makes sure it buffers up enough data as 
required, and/or returns an empty range if that's not possible. This 
supports things like LL(k) lookahead and the fixed-length lookbehind 
that is common in regex zero-width assertions.

These two could also be implemented with a relative 'seek' plus 
front/popFront; the question is how effective that would be.
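For example, LL(2) dispatch over the proposed interface might look like 
this sketch (skipLineComment/skipBlockComment are hypothetical helpers; 
lookahead is the proposal's primitive):

```d
// Peek two bytes without consuming them; lookahead buffers up more
// data if needed and returns an empty slice when the input is too short.
auto ahead = input.lookahead(2);
if (ahead.length == 2 && ahead[0] == '/')
{
    if (ahead[1] == '/')
        skipLineComment(input);
    else if (ahead[1] == '*')
        skipBlockComment(input);
}
```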

-- 
Dmitry Olshansky
Dec 29 2013
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 [snip]

Hmm, just yesterday I was rewriting a parser to use a buffer instead of 
loading the whole file in memory, so this is quite timely for me. 
Questions:

1. What happens when the distance between the pinned and current 
position exceeds the size of the buffer (sliding window)? Is the buffer 
size increased, or is the stream rewound if possible and the range 
returned by the slice does seeking?

2. I don't understand the rationale behind the current semantics of 
lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the 
first whitespace char, you don't know how many chars to request. 
Wouldn't it be better to make these functions return the ENTIRE 
available buffer in O(1)?

I guess I see the point when applied to regular expressions, where the 
user explicitly specifies how many characters to look ahead/behind. 
However, I think in most use cases the amount is not known beforehand 
(without imposing arbitrary limitations on users like "Thou shalt not 
have variable identifiers longer than 32 characters"), so the pattern 
would be "try a cheap lookahead/behind, and if that fails, do an 
expensive one".

3. I think ideally the final design would use something like what 
std.allocator does with "unbounded" and "chooseAtRuntime" - some uses 
might not need lookahead or lookbehind or other features at all, so 
having a way to disable the relevant code would benefit those cases.
Dec 29 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Dec-2013 02:45, Vladimir Panteleev wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 [snip]

 Hmm, just yesterday I was rewriting a parser to use a buffer instead
 of loading the whole file in memory, so this is quite timely for me.
 Questions:

 1. What happens when the distance between the pinned and current
 position exceeds the size of the buffer (sliding window)? Is the
 buffer size increased, or is the stream rewound if possible and the
 range returned by the slice does seeking?

It's expected that the window is increased. The exact implementation may 
play any dirty tricks it sees fit, as long as it can provide a slice 
over the pinned area. In short - maintain the illusion that the window 
has increased.

I would be against a seeking range and would most likely opt for 
memory-mapped files instead, but it all depends on the exact numbers.
 2. I don't understand the rationale behind the current semantics of
 lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the
 first whitespace char, you don't know how many chars to request.

If you want to 'find' something, just do front/popFront, no? Or do you 
specifically want to do array-wise operations?
 Wouldn't it be better to make these functions return the ENTIRE
 available buffer in O(1)?

Indeed, now I think that 2 overloads would be better:

auto lookahead(size_t n); // exactly n bytes, re-buffering as needed
auto lookahead();         // all that is available in the window, no re-buffering

Similar for lookbehind.
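Usage would then follow the cheap-then-expensive pattern mentioned in 
the question (a sketch assuming those overload semantics; scanFor and 
the growth amount are hypothetical):

```d
// Cheap path: scan whatever the window already holds, O(1) to obtain.
auto cheap = input.lookahead();
if (!scanFor(cheap, ' '))
{
    // Expensive path: ask for more than the window currently holds,
    // forcing a re-buffer (the exact growth policy is implementation-defined).
    auto more = input.lookahead(cheap.length + 512);
    if (more.length)
        scanFor(more, ' ');
}
```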
 I guess I see the point when applied to regular expressions, where the
 user explicitly specifies how many characters to look ahead/behind.

Actually the user doesn't - our lookahead/lookbehind is variable-length. 
One thing I would have to drop is unbounded lookbehind, not that it's 
critical.
 However, I think in most use cases the amount is not known beforehand
 (without imposing arbitrary limitations on users like "Thou shalt not
 have variable identifiers longer than 32 characters"), so the pattern
 would be "try a cheap lookahead/behind, and if that fails, do an
 expensive one".

I would say that in a case where you need arbitrary-length lookahead:

m = mark, seek + popFront x N, seek(m)

should fit the bill. Or, as is the case in regex at the moment - mark 
once, and seek back to some position relative to it. In one word - 
backtracking.

An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421
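The pattern described above, spelled out as a sketch (mark/seek per the 
proposal, the identifier scan being a hypothetical use case):

```d
import std.ascii : isAlphaNum;

// Arbitrary-length lookahead: pin, probe forward, roll back.
auto m = input.mark();
size_t n = 0;
while (!input.empty && isAlphaNum(input.front))
{
    input.popFront();   // probe as far as needed; pinned data stays buffered
    ++n;
}
input.seek(m);          // restore: we looked ahead n chars without consuming
```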
 3. I think ideally the final design would use something like what
 std.allocator does with "unbounded" and "chooseAtRuntime" - some uses
 might not need lookahead or lookbehind or other features at all, so
 having a way to disable the relevant code would benefit those cases.

It makes sense to make lookahead and lookbehind optional. As for the 
code - for the moment it doesn't add much and builds on stuff already 
there. Though I suspect some other implementations would be able to "cut 
corners" more efficiently.

-- 
Dmitry Olshansky
Dec 30 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Dec-2013 09:29, Marco Leise wrote:
 Am Sun, 29 Dec 2013 22:45:35 +0000
 schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

 Yeah, been there too. I guess after X years of programming chances are
 good you implemented some buffer with look-ahead. :) I use those
 primitives:

 ubyte[] mapAvailable() pure nothrow
 ubyte[] mapAtLeast(in ℕ count)

Yeah, that's a good idea. I mean the primitives, not the Unicode in type 
names ;)

-- 
Dmitry Olshansky
Dec 30 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 29 Dec 2013 22:45:35 +0000
schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

Yeah, been there too. I guess after X years of programming chances are 
good you implemented some buffer with look-ahead. :)

I use those primitives:

ubyte[] mapAvailable() pure nothrow
ubyte[] mapAtLeast(in ℕ count)

See: https://github.com/mleise/piped/blob/master/src/piped/circularbuffer.d#L400

Both return a slice over all the available buffered data. mapAtLeast() 
also waits till enough data is available from the producer thread. The 
circular buffer is automatically extended if the producer wants to write 
a larger chunk or the consumer needs a larger window. (I was mostly 
focused on lock-free operation in not-starved situations and minimal 
bounds checking, so the code is a bit more sophisticated than the 
average circular buffer.)

-- 
Marco
Dec 29 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 30 Dec 2013 12:08:18 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 30-Dec-2013 09:29, Marco Leise =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 Am Sun, 29 Dec 2013 22:45:35 +0000
 schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 [snip]

 2. I don't understand the rationale behind the current semantics
 of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
 find the first whitespace char, you don't know how many chars to
 request. Wouldn't it be better to make these functions return the
 ENTIRE available buffer in O(1)?

 Yeah, been there too. I guess after X years of programming chances are
 good you implemented some buffer with look-ahead. :) I use those
 primitives:

 ubyte[] mapAvailable() pure nothrow
 ubyte[] mapAtLeast(in ℕ count)

Yeah, that's a good idea. I mean primitives, not Unicode in type names ;)

One day it'll catch on :D

-- 
Marco
Dec 30 2013
prev sibling next sibling parent reply "Brad Anderson" <eco gnuk.net> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 Proposal

Having never written any parser I'm not really qualified to seriously 
give comments or review it, but it all looks very nice to me.

Speaking as just an end user of these things: whenever I use ranges over 
files or from, say, std.net.curl, the byLine/byChunk interface always 
feels terribly awkward to use, which often leads to me just giving up 
and loading the entire file/resource into an array. It's the boundaries 
that I stumble over. byLine never fits when I want to extract something 
multiline, but byChunk doesn't fit either, because if what I'm searching 
for lands on a chunk boundary I'll miss it.

Being able to just do a matchAll() on a file, std.net.curl, etc. without 
sacrificing performance and memory would be such a massive gain for 
usability.

Just a simple example of where I couldn't figure out how to utilize 
either byLine or byChunk without adding some clunky homegrown buffering 
solution. This is code that scrapes website titles from the pages of 
URLs in IRC messages.

---
auto scrapeTitles(M)(in M message)
{
    static url_re = ctRegex!(r"(https?|ftp)://[^\s/$.?#].[^\s]*", "i");
    static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
    static ws_re = ctRegex!(r"(\s{2,}|\n|\t)");

    auto utf8 = new EncodingSchemeUtf8;
    auto titles = matchAll(message, url_re)
        .map!(match => match.captures[0])
        .map!((url) => get(url).ifThrown([]))
        .map!(bytes => cast(string) utf8.sanitize(cast(immutable(ubyte)[]) bytes))
        .map!(content => matchFirst(content, title_re))
        .filter!(captures => !captures.empty)
        .map!(capture => capture[1].idup) // dup so original is GCed
        .map!(title => title.entitiesToUni.replace(ws_re, " "))
        .array;
    return titles;
}
---

I really, really didn't want to use that std.net.curl.get(). It causes 
all sorts of problems if someone links to a huge resource. I just could 
not figure out how to utilize byLine (the title regex capture can be 
multiline) or byChunk cleanly.
Code elegance (a lot of it due to Jakob Ovrum's help in IRC) was really 
a goal here, as this is just a toy, so I went with get() for the time 
being, but it's always sad to sacrifice elegance for performance. I 
certainly didn't want to add some elaborate evergrowing buffer in the 
middle of this otherwise clean UFCS chain (and I'm not even sure how to 
incrementally regex-search a growing buffer, or if that's even 
possible).

If I'm understanding your proposal correctly, that get(url) could be 
replaced with a hypothetical std.net.curl "buffer range" and that could 
be passed directly to matchFirst. It would only take up, at most, the 
size of the buffer in memory (which could grow if the capture grows to 
be larger than the buffer) and wouldn't read the unneeded portion of the 
resource at all. That would be such a huge win for everyone, so I'm very 
excited about this proposal. It addresses all of my current problems.

P.S. I love std.regex more and more every day. It made that 
entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d
Dec 30 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 05:51, Brad Anderson wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 Proposal

 Having never written any parser I'm not really qualified to seriously
 give comments or review it but it all looks very nice to me. Speaking
 as just an end user of these things whenever I use ranges over files
 or from, say, std.net.curl the byLine/byChunk interface always feels
 terribly awkward to use which often leads to me just giving up and
 loading the entire file/resource into an array. It's the boundaries
 that I stumble over. byLine never fits when I want to extract
 something multiline but byChunk doesn't fit because if what I'm
 searching for lands on a boundary I'll miss it.

Exactly, the situation is simply not good enough. I can assure you that on the side of parser writers it's even less appealing.
 Being able to just do a matchAll() on a file, std.net.curl, etc. without
 sacrificing performance and memory would be such a massive gain for
 usability.

.. and performance ;)
 Just a simple example of where I couldn't figure out how to utilize
 either byLine or byChunk without adding some clunky homegrown buffering
 solution. This is code that scrapes website titles from the pages of
 URLs in IRC messages.

 I really, really didn't want to use that std.net.curl.get().  It causes
 all sorts of problems if someone links to a huge resource.

*Nods*
 I just could
 not figure out how to utilize byLine (the title regex capture can be
 multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob
 Ovrum's help in IRC) was really a goal here as this is just a toy so I
 went with get() for the time being but it's always sad to sacrifice
 elegance for performance. I certainly didn't want to add some elaborate
 evergrowing buffer in the middle of this otherwise clean UFCS chain (and
 I'm not even sure how to incrementally regex search the growing buffer
 or if that's even possible).

I thought of providing something like that - an incremental match that 
takes pieces of data slice by slice, with some kind of not-yet-matched 
object to mess with. But it was solving the wrong problem. And it shows 
that backtracking engines simply can't work like that; they would want 
to go back to the prior pieces.
 If I'm understanding your proposal correctly that get(url) could be
 replaced with a hypothetical std.net.curl "buffer range" and that could
 be passed directly to matchFirst. It would only take up, at most, the
 size of the buffer in memory (which could grow if the capture grows to
 be larger than the buffer) and wouldn't read the unneeded portion of the
 resource at all. That would be such a huge win for everyone so I'm very
 excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)
 P.S. I love std.regex more and more every day. It made that
 entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with a functor rox!

-- 
Dmitry Olshansky
Dec 31 2013
prev sibling next sibling parent reply "Joseph Cassman" <jc7919 outlook.com> writes:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
 [...]

Interesting idea. Seems to fill a need I have been facing with some 
parsing code.

Since I was unclear about how its functionality compares to ForwardRange 
I took a look through std.algorithm. If typed versions of 
lookahead/lookbehind were added, it seems like ForwardRange could be 
replaced with this new range type, at least in certain places. For 
example, it seems to me that the following code (from the article "On 
Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

ForwardRange findAdjacent(ForwardRange r)
{
    if (!r.empty()) {
        auto s = r.save();
        s.popFront();
        for (; !s.empty(); r.popFront(), s.popFront()) {
            if (r.front() == s.front()) break;
        }
    }
    return r;
}

also could be written as

BufferedRange findAdjacent(BufferedRange r)
{
    for (; !r.empty; r.popFront()) {
        if (r.front == r.lookahead(1)[0]) break;
    }
    return r;
}

Perhaps the mark and/or slice functionality could be leveraged to do 
something similar but be more performant.

If anyone can see how the new range type compares to ForwardRange that 
would be cool.

Joseph
Dec 30 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 05:53, Joseph Cassman wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
 [...]

 Interesting idea. Seems to fill a need I have been facing with some
 parsing code. Since I was unclear about how its functionality compares
 to ForwardRange I took a look through std.algorithm. If typed versions
 of lookahead/lookbehind were added it seems like ForwardRange could be
 replaced with this new range type, at least in certain places. For
 example, it seems to me that the following code (from the article "On
 Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

 ForwardRange findAdjacent(ForwardRange r)
 {
     if (!r.empty()) {
         auto s = r.save();
         s.popFront();
         for (; !s.empty(); r.popFront(), s.popFront()) {
             if (r.front() == s.front()) break;
         }
     }
     return r;
 }

surprised myself). A proper name would be ForkableRange; indeed, look at 
this code:

auto s = r.save();
s.popFront();

s is a fork of r (an independent view that has a life of its own). The 
original algorithm even _reads_ easier if the correct term is used:

auto s = r.fork();
s.popFront();
...

Now it's clear that we fork a range and advance the forked copy one step 
ahead of the original.
 also could be written as

     BufferedRange findAdjacent(BufferedRange r) {
        for(;!r.empty;r.popFront()) {
           if(r.front == r.lookahead(1)[0]) break;
        }
        return r;
     }

 Perhaps the mark and/or slice functionality could be leveraged to do
 something similar but be more performant.

 If anyone can see how the new range type compares to ForwardRange that
 would be cool.

I'm thinking there might be a way to bridge the new range type with 
ForwardRange, but not directly as defined at the moment.

A possibility I'm considering is to separate out a Buffer object (not a 
range) and let it be shared among views - light-weight buffer-ranges. 
Then, if we imagine that these light-weight buffer-ranges work as marks 
(i.e. they pin down the buffer) in the current proposal, they could be 
forward ranges.

I need to think on this, as the ability to integrate well with 
forward-ish algorithms would be a great improvement.

-- 
Dmitry Olshansky
Dec 31 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Dec-2013 22:46, Joseph Cassman wrote:
 On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
 31-Dec-2013 05:53, Joseph Cassman пишет:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:



 I'm thinking there might be a way to bridge the new range type with
 ForwardRange but not directly as defined at the moment.

 A possibility I consider is to separate a Buffer object (not a range),
 and let it be shared among views - light-weight buffer-ranges. Then if
 we imagine that these light-weight buffer-ranges are working as marks
 (i.e. they pin down the buffer) in the current proposal then they
 could be forward ranges.


I've created a fork where I've implemented just that. As a bonus I also 
tweaked the stream primitives, so it now works with pipes or whatever 
input stdin happens to be.

Links stay the same:
Docs: http://blackwhale.github.io/datapicked/dpick.buffer.traits.html
Code: https://github.com/blackwhale/datapicked/tree/fwd-buffer-range/dpick/buffer

The description has been largely simplified and the primitive count 
reduced.

1. A buffer range is a forward range. It has reference semantics.

2. A copy produced by _save_ is an independent view of the underlying 
buffer (or window).

3. No bytes can be discarded that are seen by some existing view. Thus 
each reference pins its position in the buffer.

4. The 3 new primitives are:

Range slice(BufferRange r);
    Returns a slice of the window between the current range's position
    and r's. It must be a random-access range.

ptrdiff_t tell(BufferRange r);
    Returns the difference in window positions between the current range
    and r. Note that unlike slice(r).length this can be both positive
    and negative.

bool seek(ptrdiff_t ofs);
    Resets the buffer state to an offset from the current position. The
    return value indicates the success of the operation. It may fail if
    there is not enough data, or (if ofs is negative) if that portion of
    the data was already discarded.

5. Lookahead and lookbehind are extra primitives that were left intact 
for the moment. Where applicable a range may provide lookahead:

Range lookahead();         // as much as is available in the window
Range lookahead(size_t n); // either exactly n, or nothing if not available

And lookbehind:

Range lookbehind();         // as much as is available in the window
Range lookbehind(size_t n); // either exactly n, or nothing if not available

These should probably be tested for as separate traits.
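Under this reworked design a consumer could look like the following 
sketch (semantics assumed from the description above; the 5-element 
token is an arbitrary example):

```d
// Two forward-range views share one window; slice/tell relate them.
auto start = input.save;         // independent view, pins its position
foreach (_; 0 .. 5)
    input.popFront();            // advance only one of the views

auto token = start.slice(input); // random-access slice between the two views
// tell() gives the signed distance between the views, and seek() moves
// a view within the still-pinned portion of the window.
```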
 input-source <--> buffer range <--> parser/consumer

 Meaning that if we can mix and match parsers with buffer ranges, and
 buffer ranges with input sources we had grown something powerful indeed.

 Being able to wrap an already-in-use range object with the buffer
 interface, as you do in the sample code
 (https://github.com/blackwhale/datapicked/blob/master/dgrep.d), is
 good for composability. It also allows existing functionality in
 std.algorithm to be reused as-is.

It was more about wrapping an array, but it's got to integrate well with 
what we have. I could imagine a use case for buffering an input range. 
Then, I think, a buffer range of something other than bytes would be in 
order.
 I think the new range type could also be added directly to some new, or
 perhaps retrofitted into existing, code to add the new functionality
 without sacrificing performance. In that way the internal payload
 already used to get the data (say by the input range) could be reused
 without having to allocate new memory to support the buffer API.

 As one idea of using a buffer range from the start, a function template
 by(T) (where T is ubyte, char, wchar, or dchar) could be added to
 std.stdio.

IMHO C run-time I/O has no use in D. The amount of work spent on 
special-casing the non-locking primitives of each C run-time, repeating 
legacy mistakes (like text mode, codepages and locales) and stumbling 
over portability problems (getc is a macro, which we can't have) would 
have been better spent elsewhere - designing our own I/O framework.

I've put together something pretty simple and fast for a buffer range 
directly on top of native I/O:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/stream.d

It needs a bit better error messages than a naked enforce, and a bit of 
tweaking of its memory management. It already runs circles around the 
existing std.stdio.
 It would return a buffer range object providing more
 functionality than byChunk or byLine while adding access to the entire
 stream of data in a file in a contiguous and yet efficient manner.

Drop 'efficient' if we're talking about interfacing with the C run-time. 
Otherwise, yes, absolutely.
 Seems
 to help with the issues faced in processing file data mentioned in
 previous comments in this thread.

-- Dmitry Olshansky
Jan 04 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2014 09:22, Jason White wrote:
 On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky wrote:
 IMHO C run-time I/O has no use in D. The amount of work spent on
 special-casing the non-locking primitives of each C run-time,
 repeating legacy mistakes (like text mode, codepages and locales) and
 stumbling on portability problems (getc is a macro we can't have)
 would have been better spent elsewhere - designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API: https://github.com/jasonwhite/io/blob/master/src/io/file.d

It's obvious and adds no extra value.
 It allows for more robust open-flags than the fopen-style flags (like
 "r+"). For seeking, I adapted it to use your concept of marks (which I
 quite like) instead of SEEK_CUR and friends.

 Please feel free to use this! (The Windows implementation hasn't been
 tested yet, so it probably doesn't work.)

Will poke around. I like this (I mean composition): https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17
 BTW, I was also working on buffered streams, but couldn't figure out a
 good way to do it.

-- Dmitry Olshansky
Jan 05 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2014 15:08, Jason White пишет:
 On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
 As an advice I'd suggest to drop the 'Data' part in
 writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).

In my view text implies something like:

    void write(const(char)[]);
    size_t read(char[]);

And binary would be:

    void write(const(ubyte)[]);
    size_t read(ubyte[]);

Should not clash.
 Will poke around. I like this (I mean composition):
 https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets).

In-memory array IMHO better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead wrapping a stream and/or array as a buffer range proved to me to be more natural (extending capabilities).
Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought. -- Dmitry Olshansky
Jan 05 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-Jan-2014 09:41, Jason White пишет:
 On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
 I my view text implies something like:

 void write(const(char)[]);
 size_t read(char[]);

 And binary would be:

 void write(const(ubyte)[]);
 size_t read(ubyte[]);

 Should not clash.

 Those would do the same thing for either text or binary data. When I
 say text writing, I guess I mean the serialization of any type to text
 (like what std.stdio.write does):

     void write(T)(T value);         // Text writing
     void write(const(ubyte)[] buf); // Binary writing

     write([1, 2, 3]); // want to write "[1, 2, 3]"
                       // but writes "\x01\x02\x03"

 This clashes. We need to be able to specify if we want to write/read a
 text representation or just the raw binary data. In the above case, the
 most specialized overload will be called.

Ok, now I see. In my eye though serialization completely hides the raw stream write. So:

struct SomeStream{
    void write(const(ubyte)[] data...);
}

struct Serializer(Stream){
    void write(T)(T value); //calls stream.write inside of it
private:
    Stream stream;
}
 In-memory array IMHO better not pretend to be a stream. This kind of
 wrapping goes in the wrong direction (losing capabilities). Instead
 wrapping a stream and/or array as a buffer range proved to me to be
 more natural (extending capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations?

I think it may be best not to. A buffer builds on top of an unbuffered stream. If there is a need to do large reads it may as well use the naked stream and not worry about the extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as a read even though it's not.
 I don't see why it would conflict with a
 range interface. As I understand it, ranges act on a single element at a
 time while streams act on multiple elements at a time. For ArrayBuffer
 in datapicked, a stream-style read is just lookahead(n) and cur += n.
 What capabilities are lost?

In short - lookahead is slicing, read would be copying. For me the prime capability of an array is slicing, which is dirt cheap O(1). On the other hand, the stream interface is all about copying bytes to a user-provided array. In this setting it means that if you want to wrap an array as a stream, then it must follow the generic stream interface. The latter cannot and should not think of slicing and the like. Then while wrapping it in some adapter up the chain, it's no longer seen as an array (because the adapter is generic too and is made for streams). This is what I call capability loss.
 If buffers/arrays provide a stream interface, then they can be used by
 code that doesn't directly need the buffering capabilities but would
 still benefit from them.

See above - it would be better if the code was written for ranges, not streams. Then e.g. slicing of a buffer range on top of an array works just as cheaply as it did for arrays, and zero copies are made (= performance).
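To make the zero-copy argument concrete, here is a toy comparison (the streamRead helper is hypothetical, for illustration only): an array slice shares the underlying storage, while a stream-style read has to copy into the caller's buffer.

```d
// Demonstrates the capability-loss argument: slicing an array is O(1)
// and shares storage, while a generic stream-style read must copy into
// a caller-provided buffer.
size_t streamRead(const(ubyte)[] src, ref size_t pos, ubyte[] dst)
{
    import std.algorithm.comparison : min;
    auto n = min(src.length - pos, dst.length);
    dst[0 .. n] = src[pos .. pos + n]; // the copy a stream cannot avoid
    pos += n;
    return n;
}

unittest
{
    ubyte[] source = [10, 20, 30, 40];

    // "lookahead is slicing": no copy, same underlying memory.
    auto view = source[1 .. 3];
    assert(view.ptr is source.ptr + 1);

    // "read would be copying": data duplicated into dst.
    ubyte[4] dst;
    size_t pos;
    assert(streamRead(source, pos, dst[]) == 4);
    assert(dst[] == source[] && dst.ptr !is source.ptr);
}
```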
 Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking as thread locality makes no difference there.

Actually these objects do just fine, since the OS does the locking (or makes sure of something equivalent). If your stream is TLS there is no need for extra locking at all (no interleaving of I/O calls is possible) regardless of its kind. Shared instances would need locking, as 2 threads may request some operation, and as the OS locks only on a per-syscall basis something cruel may happen in the code that deals with buffering etc.
Jan 06 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Jan-2014 11:59, Jason White пишет:
 On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides raw
 stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

 I was thinking it should also have "alias stream this;", but maybe
 that's not the best thing to do for a serializer.

 I concede: I've done s/(read|write)Data/\1/g on
 https://github.com/jasonwhite/io/blob/master/src/io/file.d
 and it now works on Windows with useful exception messages.

Cool, got to steal sysErrorString too! :)
 Actually these objects do just fine, since OS does the locking (or
 makes sure of something equivalent). If your stream is TLS there is no
 need for extra locking at all (no interleaving of I/O calls is
 possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request some
 operation, and as OS locks only on per sys-call basis something cruel
 may happen in the code that deals with buffering etc.

Oh yeah, you're right. As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio.

Then our goals are aligned. Be sure to take a peek at (if you haven't already): https://github.com/schveiguy/phobos/blob/new-io/std/io.d I have my share criticisms for it but it's a nice piece of work that addresses many questions of I/O I've yet to consider.
 Stuff like
 serializers and lexers would be more robust and easier to write.

Indeed and even beyond that. -- Dmitry Olshansky
Jan 07 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Jan-2014 19:55, Steven Schveighoffer пишет:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

 Yes, I'm gearing up to revisit that after a long D hiatus, and I came
 across this thread. At this point, I really really like the ideas that
 you have in this. It solves an issue that I struggled with, and my
 solution was quite clunky.

 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy
 (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 Its job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

The only interesting thing I'd add here is that some buffers may work without an underlying stream. The best examples are arrays and MM-files.
 3. Filter that has access to transform the buffer data/copy it.
 4. Ranges that use the buffer/filter to process/present the data.

Yes, yes and yes. I find it surprisingly good to see our vision seems to match. I was half-expecting you'd come along and destroy it all ;)
 The problem I struggled with is the presentation of UTF data of any
 format as char[] wchar[] or dchar[]. 2 things need to happen. First is
 that the data needs to be post-processed to perform any necessary byte
 swapping. The second is to transcode the data into the correct width.

 In this way, you can process UTF data of any type (I even have code to
 detect the encoding and automatically process it), and then use it in a
 way that makes sense for your code.

 My solution was to paste in a "processing" delegate into the class
 hierarchy of buffered streams that allowed one read/write access to the
 buffer. But it's clunky, and difficult to deal with in a generalized
 fashion.

 But the idea of using a buffer in between the stream and the range, and
 possibly bolting together multiple transformations in a clean way, makes
 this problem easy to solve, and I think it is closer to the vision
 Andrei/Walter have.

In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte and itself present a buffer interface (but of wchar). My own stuff currently deals only in ubyte, and the limited decoding is represented by a "decode" function that takes a buffer of ubyte and decodes UTF-8. I think typed buffers/filters are the way to go.
 I also like the idea of "pinning" the data instead of my mechanism of
 using a delegate (which was similar but not as general). It also has
 better opportunities for optimization.

 Other ideas that came to me that buffer filters could represent:

 * compression/decompression
 * encryption

 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

I'm seriously not seeing how interfacing with the C runtime could be fast enough.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

I'm thinking a generic locking wrapper is possible along the lines of:

shared Locked!(GenericBuffer!char) stdin; //usage

struct Locked(T){
shared:
private:
    T _this;
    Mutex mut;
public:
    //forwarded methods
}

The wrapper will introduce a lock, and implement every method of the wrapped struct roughly like this:

mut.lock();
scope(exit) mut.unlock();
(cast(T*)_this).method(args);

I'm sure it could be pretty automatic.
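For what it's worth, the forwarding part can indeed be made pretty automatic with opDispatch. The sketch below ignores shared (the hard part discussed later in the thread) and only shows the locked-forwarding mechanics; Locked and its layout here are illustrative, not a worked-out design:

```d
import core.sync.mutex : Mutex;

// Sketch: forward every method call to the wrapped object under a lock.
// A real version would take `shared` into account.
struct Locked(T)
{
    private T payload;
    private Mutex mut;

    this(T payload)
    {
        this.payload = payload;
        this.mut = new Mutex;
    }

    // Any method name dispatches to the wrapped object, lock held.
    auto opDispatch(string method, Args...)(Args args)
    {
        mut.lock();
        scope(exit) mut.unlock();
        return mixin("payload." ~ method)(args);
    }
}

unittest
{
    static struct Counter { int n; int inc() { return ++n; } }
    auto c = Locked!Counter(Counter());
    assert(c.inc() == 1); // forwarded through opDispatch
    assert(c.inc() == 2);
}
```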
 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

Lexers/parsers enjoy it - i.e. they work pretty much as ranges, especially when skipping spaces and the like. As I said, the main reason was: if it fits as a range, why not? After all it makes one-pass processing of data trivial, as it rides on top of foreach:

foreach(octet; mybuffer)
{
    if(interesting(octet))
        do_cool_stuff();
}

Things like countUntil make perfect sense when called on a buffer (e.g. to find a matching sentinel).
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 00:00, Steven Schveighoffer пишет:
 On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 16-Jan-2014 19:55, Steven Schveighoffer пишет:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:



 In essence a transcoding filter for UTF-16 would wrap a buffer of
 ubyte and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read. Think of the case of stdin, where the clients will be varied and written by many different people, and its interface is decided by Phobos. But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).

The other way around :) 4 code units - 1 code point.
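The corrected relationship is easy to check with std.utf.codeLength, which returns the number of code units a single code point occupies in a given encoding:

```d
import std.utf : codeLength;

unittest
{
    // U+10348 (GOTHIC LETTER HWAIR) - a single code point:
    assert(codeLength!char('\U00010348') == 4);  // 4 UTF-8 code units
    assert(codeLength!wchar('\U00010348') == 2); // a UTF-16 surrogate pair
    assert(codeLength!dchar('\U00010348') == 1); // always 1 in UTF-32
}
```

So re-using the same buffer works when reading UTF-32 as UTF-8: the output can never need more bytes than the input occupied.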
 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

I seriously not seeing how interfacing with C runtime could be fast enough.

 It's not. But an important stipulation, in order for this all to be
 accepted, is that it doesn't break existing code that expects things
 like printf and writef to interleave properly. However, I think we can
 have an opt-in scheme, and there are certain cases where we can
 proactively switch to a D-buffer scheme. For example, if you get a
 ByLine range, it expects to exhaust the data from the stream, and may
 not properly work with C printf. The idea is that stdio.File can switch
 at runtime from FILE * to D streams as needed or directed.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

 I'm thinking a generic locking wrapper is possible along the lines of:

 shared Locked!(GenericBuffer!char) stdin; //usage

 struct Locked(T){
 shared:
 private:
     T _this;
     Mutex mut;
 public:
     //forwarded methods
 }

 The wrapper will introduce a lock, and implement every method of the
 wrapped struct roughly like this:

 mut.lock();
 scope(exit) mut.unlock();
 (cast(T*)_this).method(args);

 I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.

The requirement may be that it's pure, or should I say "well-contained". In other words, as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...
 I've been expecting to have to do something like this, but not looking
 forward to it :(

 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

Lexers/parsers enjoy it - i.e. they work pretty much as ranges especially when skipping spaces and the like. As I said the main reason was: if it fits as range why not? After all it makes one-pass processing of data trivial as it rides on top of foreach: foreach(octect; mybuffer) { if(intersting(octect)) do_cool_stuff(); } Things like countUntil make perfect sense when called on buffer (e.g. to find matching sentinel).

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases. I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

"Convenient to work with" does ring good to me. I simply see no need to reinvent std.algorithm on buffers, especially the ones that just scan left-to-right. An example would be calculating a checksum of a stream (say the data comes from a pipe or socket, i.e. buffered). It's a trivial application of std.algorithm.reduce and there's no need to reinvent that wheel IMHO. -- Dmitry Olshansky
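The checksum example mentioned above, written against a plain ubyte[] (any input range of ubyte - buffered or not - would work the same way; a simple additive checksum stands in for a real one):

```d
import std.algorithm.iteration : reduce;

unittest
{
    ubyte[] data = [1, 2, 3, 250];
    // Simple additive checksum, truncated to one byte. A buffer range
    // over a pipe or socket would drop into the same call unchanged.
    auto sum = reduce!((a, b) => (a + b) & 0xFF)(0, data);
    assert(sum == 0); // (1 + 2 + 3 + 250) & 0xFF == 0
}
```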
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 02:41, Steven Schveighoffer пишет:
 On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 The other way around :) 4 code units - 1 code point.

I knew that was going to happen :)

Aye. BTW I haven't thought of writing into the buffer, but it works exactly the same. It could even be read/write - "discarded" data is written to the underlying stream, freshly loaded data is read from the stream. In the case of an input-only stream writes are no-ops; for output-only, reads are no-ops. With pinning it makes for cool multi-pass algorithms that actually output stuff into the file.
 This would be a key addition for ANY type in order to properly work with
 shared. BUT, I don't see how it works safely generically because you
 necessarily have to cast away shared in order to call the methods. You
 would have to limit this to only working on types it was intended for.

The requirement may be that it's pure, or should I say "well-contained". In other words, as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...

I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.

Aye.
 Convenient to work with does ring good to me. I simply see no need to
 reinvent std.algorithm on buffers especially the ones that just scan
 left-to-right.
 Example would be calculating a checksum of a stream (say data comes
 from a pipe or socket i.e. buffered). It's a trivial application of
 std.algorithm.reduce and there no need to reinvent that wheel IMHO.

I again think I am being misunderstood. Example might be appropriate:

Looking at the post by Walter I see I need to clarify things. And if you browse the thread you'd see my understanding also changed with time - I started with no stinkin' forward ranges with buffered I/O, only to later eat my words and make them happen.

First, let's call "buffer" a thing that packs an array and a few extras to support pinning, refilling of data from the underlying stream, and extending. It exposes the current "window" of the underlying stream. Now we have pins in that buffer that move along. A pin not only enforces that all data "to the left" stays accessible but also is a "view" of the buffer. It's even conceptually a range with extra primitives, outlined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I should stick to naming it BufferRange. The "real buffer" is an internal construct and BufferRanges share ownership of it. This is exactly what I ended up with in my second branch.

"real" buffer:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L417
vs buffer range over it:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/buffer.d#L152

Naming issues are apparently even worse than Walter implies.
 struct Buffer {...} // implements BOTH forward range and Buffer primitives

 struct OtherBuffer {...} // only implements Buffer primitives.

 If isBuffer is modified to not require isForwardRange, then both Buffer
 and OtherBuffer will work as buffers, but only Buffer works as a range.

 But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?

I think I should call it BufferRange from now on, and bring my docs in line. It may make sense to provide the naked Buffer itself as a construct (it has a simpler interface) and have a generic BufferRange wrapper. I'm just not seeing user code using the former - too error-prone and unwieldy.
 So we can implement buffers that are both ranges and buffers, and will
 work with std.algorithm without modification (and that's fine and
 expected by me), but do we need to *require* that? Are we
 over-specifying?

A random-access range had to require .front/.back - maybe that was over-specification too? Some stuff is easy to index but not to "drop an item off at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.
 Is there a possibility that someone can invent a buffer
 that satisfies the primitives of say a line-by-line reader, but cannot
 possibly be a forward range?

I hardly can see that:

front --> lookahead(1)[0]
empty --> lookahead(1).length == 0
popFront --> seek(1) or enforce(seek(1))
save --> well, there has got to be a way to pin the data in the buffer?

And they surely could be better implemented inside of a specific buffer range. -- Dmitry Olshansky
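That mapping can be written down directly as a shim. RangeOver and the tiny Buf below are illustrative stand-ins (following the hypothetical lookahead/seek primitives from the proposal); save is a plain copy here, where a real buffer range would pin the data:

```d
// Forward-range shim over anything providing lookahead(n)/seek(n).
struct RangeOver(B)
{
    B buf;

    @property bool empty() { return buf.lookahead(1).length == 0; }
    @property auto front() { return buf.lookahead(1)[0]; }
    void popFront() { buf.seek(1); }
    @property RangeOver save() { return this; } // stand-in for pinning
}

unittest
{
    // Minimal stand-in buffer, just enough for the shim.
    static struct Buf
    {
        ubyte[] data;
        ubyte[] lookahead(size_t n)
        { return data[0 .. n < data.length ? n : data.length]; }
        bool seek(size_t n)
        { if (n > data.length) return false; data = data[n .. $]; return true; }
    }

    import std.algorithm.searching : countUntil;
    auto r = RangeOver!Buf(Buf([5, 6, 7, 0, 9]));
    assert(countUntil(r, 0) == 3); // find a sentinel via std.algorithm
}
```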
Jan 17 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.

Agreed, a read/write buffer is a bad idea. As for a write-only buffer implemented in the same style as the read-only one, well, I need to try it first. -- Dmitry Olshansky
Jan 17 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 18:52, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 09:12:56 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.

Agreed, read/write buffer is a bad idea. As for write-only buffer implemented in the same style as read-only, well I need to try it first.

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I'm doubting that, actually. Just use the freaking MM-file directly - you'd be better off efficiency-wise. And there would be no need to switch back and forth. Modern OSes use the paging subsystem for files anyway (unless opened with O_DIRECT or similar).
 I think in my code, what I do is if someone wants to read, and is
 currently writing (on a read/write stream), I flush the buffer, then
 switch it to be in read mode, and then seek to the appropriate place, to
 fill the buffer.

 When switching from read to write, it's much easier, since the buffer
 starts out empty.

 So I think a buffer type that can be used for read *OR* write is good,
 just not simultaneously.

See above. I don't think there is a need for these paces with flushing stuff. And if it comes to pipes and sockets then just having 2 buffers fits the bill better.
 However, simultaneous read/write is probably something that could be
 useful for a niche use case (e.g. some transformative process), as long
 as you don't mind having 2 file descriptors open :)

-- Dmitry Olshansky
Jan 17 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 13:19, Kagamin пишет:
 On Thursday, 16 January 2014 at 15:55:07 UTC, Steven Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy
 (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 It's job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer?

What's the problem? I don't see how struct/class changes anything there; it's a member field that is an array, which we surely can expand.
 Won't it suffer from AA syndrome?

A buffer is created with factory functions only. It's not like an AA that grows from an empty/null state. An empty buffer (.init) doesn't grow - it's simply empty. -- Dmitry Olshansky
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 10:40:02 -0500, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:52, Steven Schveighoffer пишет:

 Keep in mind that a read/write buffered stream *should* exist. It's just
 that you need to switch how you use the buffer.

 I'm doubting that actually. Just use the freaking MM-file directly you'd
 be better off efficiency wise. And there would be no need to switch back
 and forth. Modern OS use paging subsystem for files anyway (unless
 opened with O_DIRECT or similar).

Remember we have to support whatever stdio.File does, and I think it does support this.

-Steve
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 09:12:56 -0500, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 17-Jan-2014 18:03, Steven Schveighoffer пишет:
 On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly
 the same. It could be even read/write -"discarded" data is written to
 underlying stream, freshly loaded is read from stream. Now in case of
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually
 output stuff into the file.

 In my experience, the code/process for a write buffer is significantly
 different than the code for a read buffer. A read/write buffer is very
 difficult to make, because you either need 2 file pointers, or need to
 constantly seek the single file pointer to overwrite the existing data.

 Agreed, read/write buffer is a bad idea. As for write-only buffer
 implemented in the same style as read-only, well I need to try it first.

Keep in mind that a read/write buffered stream *should* exist. It's just that you need to switch how you use the buffer.

I think in my code, what I do is if someone wants to read, and is currently writing (on a read/write stream), I flush the buffer, then switch it to be in read mode, and then seek to the appropriate place, to fill the buffer.

When switching from read to write, it's much easier, since the buffer starts out empty.

So I think a buffer type that can be used for read *OR* write is good, just not simultaneously.

However, simultaneous read/write is probably something that could be useful for a niche use case (e.g. some transformative process), as long as you don't mind having 2 file descriptors open :)

-Steve
Jan 17 2014
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 31 December 2013 at 01:51:23 UTC, Brad Anderson wrote:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
 wrote:
 Proposal

 Having never written any parser I'm not really qualified to seriously
 give comments or review it, but it all looks very nice to me.

 Speaking as just an end user of these things: whenever I use ranges
 over files or from, say, std.net.curl, the byLine/byChunk interface
 always feels terribly awkward to use, which often leads to me just
 giving up and loading the entire file/resource into an array.

 It's the boundaries that I stumble over. byLine never fits when I want
 to extract something multiline, but byChunk doesn't fit either, because
 if what I'm searching for lands on the boundary I'll miss it. Being
 able to just do a matchAll() on a file, std.net.curl, etc. without
 sacrificing performance and memory would be such a massive gain for
 usability.

 Just a simple example of where I couldn't figure out how to utilize
 either byLine or byChunk without adding some clunky homegrown buffering
 solution: this is code that scrapes website titles from the pages of
 URLs in IRC messages.

 [...]

 If I'm understanding your proposal correctly, that get(url) could be
 replaced with a hypothetical std.net.curl "buffer range" and that could
 be passed directly to matchFirst. It would only take up, at most, the
 size of the buffer in memory (which could grow if the capture grows to
 be larger than the buffer) and wouldn't read the unneeded portion of
 the resource at all. That would be such a huge win for everyone, so I'm
 very excited about this proposal. It addresses all of my current
 problems.

 [...]

I find the same to be the case with APIs used to process files. It would be a real win to have a simple, efficient, and possibly non-blocking range interface over a file, able to access a single byte or character at a time. I remember a discussion on this forum that spoke positively about the async and await combination of keywords in .NET 4.5. If asynchronous and synchronous functionality could be combined with a flexible buffer in a UFCS call chain, I think that would hit a sweet spot for code composability and understandability.

Joseph
Dec 30 2013
prev sibling next sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky 
wrote:
 31-Dec-2013 05:53, Joseph Cassman пишет:
 On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
 wrote:

 I'm thinking there might be a way to bridge the new range type with
 ForwardRange, but not directly as defined at the moment. A possibility
 I'm considering is to separate out a Buffer object (not a range), and
 let it be shared among views - light-weight buffer-ranges. Then, if we
 imagine that these light-weight buffer-ranges work as marks (i.e. they
 pin down the buffer) in the current proposal, they could be forward
 ranges. I need to think on this, as the ability to integrate well with
 forward-ish algorithms would be a great improvement.

I think I now understand a bit better what you were thinking when you first posted
 input-source <--> buffer range <--> parser/consumer

 Meaning that if we can mix and match parsers with buffer 
 ranges, and buffer ranges with input sources we had grown 
 something powerful indeed.

Being able to wrap an already-in-use range object with the buffer interface as you do in the sample code (https://github.com/blackwhale/datapicked/blob/master/dgrep.d) is good for composability. It also allows existing functionality in std.algorithm to be reused as-is.

I think the new range type could also be added directly to some new, or perhaps retrofitted into existing, code to add the new functionality without sacrificing performance. In that way the internal payload already used to get the data (say by the input range) could be reused without having to allocate new memory to support the buffer API.

As one idea of using a buffer range from the start, a function template by(T) (where T is ubyte, char, wchar, or dchar) could be added to std.stdio. It would return a buffer range object providing more functionality than byChunk or byLine while adding access to the entire stream of data in a file in a contiguous and yet efficient manner. Seems to help with the issues faced in processing file data mentioned in previous comments in this thread.

Joseph
Dec 31 2013
prev sibling next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
I've been rewriting a bit of the lexer in DScanner.

https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
(Ignore the "invariant" block. I've been trying to hunt down some 
unrelated memory corruption issue)

One thing that I've found to be very useful is the ability to 
increment column or index counters inside of the lexer range's 
popFront method. I think that I'd end up using my own range type 
for arrays, but construct a lexer range on top of your buffer 
range for anything else.
Jan 04 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Jan-2014 13:39, Brian Schott wrote:
 I've been rewriting a bit of the lexer in DScanner.

 https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
 (Ignore the "invariant" block. I've been trying to hunt down some
 unrelated memory corruption issue)

startsWith & peek could be implemented with the proposed lookahead, so overall the buffer range looks like a fit for your use case.
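For illustration, here is a minimal sketch of how startsWith falls out of a lookahead primitive. The ArrayBuffer type and its lookahead/seek members are hypothetical stand-ins for the proposed API, not the actual datapicked code:

```d
import std.stdio;

// Hypothetical minimal array-backed buffer: lookahead(n) returns a slice
// of exactly n items, or an empty slice if fewer than n remain.
struct ArrayBuffer(T)
{
    T[] data;
    size_t cur;

    inout(T)[] lookahead(size_t n) inout
    {
        return (data.length - cur >= n) ? data[cur .. cur + n] : null;
    }

    void seek(size_t n) { cur += n; }
}

// startsWith in terms of lookahead: one length check plus one array
// compare, no per-element front/popFront/save dance.
bool startsWith(Buf, T)(ref Buf buf, const(T)[] prefix)
{
    auto ahead = buf.lookahead(prefix.length);
    return ahead.length != 0 && ahead == prefix;
}

void main()
{
    auto buf = ArrayBuffer!char("import std;".dup);
    writeln(startsWith(buf, "import")); // true
    writeln(startsWith(buf, "module")); // false
}
```

peek would be the same idea with lookahead(1).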
 One thing that I've found to be very useful is the ability to increment
 column or index counters inside of the lexer range's popFront method.

I think that I'd end up using my own range type for arrays, but construct
 a lexer range on top of your buffer range for anything else.

I think it should be possible to wrap a buffer range, add the related accounting, and simplify the interface for your own use.

As an experiment I have a greatly simplified buffer range - it builds on top of forward range, making it instantly compatible with most of std.algorithm. The only problems at the moment are slightly worse performance with DMD (who cares) and that it segfaults with LDC (and this is a problem). Would be nice to test with LDC first but I think I'll publish it anyway, stay tuned.

-- 
Dmitry Olshansky
Jan 04 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Jan-2014 13:23, Brian Schott wrote:
 My experimental lexer generator now uses the buffer range.
 https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx

Cool!

A minor note: https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/d/lexer.d#L487

lookahead(n) should always give a slice of length n, or 0, so you may as well test for != 0. In general you should avoid marking too often, as it has a bit of overhead.

I'm currently in favor of my 2nd design where marking is replaced by .save returning an independent view of the buffer, making Buffer a normal forward range that is cheap to copy.
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/

Sadly it segfaults with LDC so I can't quite assess its performance :(
 The problem that I have with it right now is that
 range.lookbehind(1).length != range.lookahead(1).length. This was
 confusing.

That indeed may look confusing at first, but keep in mind that range.front is in fact a lookahead of 1. Thus all algorithms that can work with a lookahead of 1 - LL(1), LALR(1), etc. - would easily work with any forward/input range (though any practical parser needs to slice or copy parts of the input aside).

-- 
Dmitry Olshansky
Jan 12 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky 
wrote:
 IMHO C run-time I/O has no use in D. The amount of work spent 
 on special-casing the non-locking primitives of each C run-time,
 repeating legacy mistakes (like text mode, codepages and 
 locales) and stumbling on portability problems (getc is a macro 
 we can't have) would have been better spent elsewhere - 
 designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API:
https://github.com/jasonwhite/io/blob/master/src/io/file.d

It allows for more robust open-flags than the fopen-style flags (like "r+"). For seeking, I adapted it to use your concept of marks (which I quite like) instead of SEEK_CUR and friends. Please feel free to use this! (The Windows implementation hasn't been tested yet, so it probably doesn't work.)

BTW, I was also working on buffered streams, but couldn't figure out a good way to do it.
Jan 04 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
 As an advice I'd suggest to drop the 'Data' part in 
 writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).
 Will poke around. I like this (I mean composition):
 https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets). Currently, std.stdio has all three of those facets rolled into one.
Jan 05 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
 In my view text implies something like:

 void write(const(char)[]);
 size_t read(char[]);

 And binary would be:

 void write(const(ubyte)[]);
 size_t read(ubyte[]);

 Should not clash.

Those would do the same thing for either text or binary data. When I say text writing, I guess I mean the serialization of any type to text (like what std.stdio.write does):

void write(T)(T value);         // Text writing
void write(const(ubyte)[] buf); // Binary writing

write([1, 2, 3]); // want to write "[1, 2, 3]"
                  // but writes "\x01\x02\x03"

This clashes. We need to be able to specify if we want to write/read a text representation or just the raw binary data. In the above case, the most specialized overload will be called.
 In-memory array IMHO better not pretend to be a stream. This 
 kind of wrapping goes in the wrong direction (losing 
 capabilities). Instead wrapping a stream and/or array as a 
 buffer range proved to me to be more natural (extending 
 capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations? I don't see why it would conflict with a range interface. As I understand it, ranges act on a single element at a time while streams act on multiple elements at a time. For ArrayBuffer in datapicked, a stream-style read is just lookahead(n) and cur += n. What capabilities are lost?

If buffers/arrays provide a stream interface, then they can be used by code that doesn't directly need the buffering capabilities but would still benefit from them.
Currently, std.stdio has all three of
 those facets rolled into one.

Locking though is a province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking as thread locality makes no difference there.
Jan 05 2014
prev sibling next sibling parent "Jason White" <54f9byee3t32 gmail.com> writes:
On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides 
 raw stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer.

I concede, I've s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.
 Shouldn't buffers/arrays provide a stream interface in 
 addition to
 buffer-specific operations?

I think it may be best not to. Buffer builds on top of an unbuffered stream. If there is a need to do large reads it may as well use the naked stream and not worry about extra copying happening in the buffer layer. I need to think on this, seeing as lookahead + seek could be labeled as read even though it's not.
 I don't see why it would conflict with a
 range interface. As I understand it, ranges act on a single 
 element at a
 time while streams act on multiple elements at a time. For 
 ArrayBuffer
 in datapicked, a stream-style read is just lookahead(n) and 
 cur += n.
 What capabilities are lost?

In short - lookahead is slicing, read would be copying. For me the prime capability of an array is slicing, which is dirt cheap O(1). On the other hand the stream interface is all about copying bytes to the user-provided array.

In this setting it means that if you want to wrap an array as a stream, then it must follow the generic stream interface. The latter cannot and should not think of slicing and the like. Then while wrapping it in some adapter up the chain it's no longer seen as an array (because the adapter is generic too and is made for streams). This is what I call capability loss.
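The slicing-vs-copying distinction shows up directly with plain D arrays; this is just an illustration of the capability gap being discussed, not any proposed API:

```d
// A slice ("lookahead") aliases the buffer's storage, O(1) and zero-copy;
// a stream-style read must copy into a caller-provided array.
void main()
{
    ubyte[] storage = [10, 20, 30, 40];

    // "lookahead": the slice shares memory with the source, nothing copied.
    ubyte[] view = storage[0 .. 2];
    view[0] = 99;
    assert(storage[0] == 99);  // same underlying bytes

    // stream-style "read": bytes are copied into the user's array.
    ubyte[2] dest;
    dest[] = storage[2 .. 4];  // copy
    dest[0] = 7;
    assert(storage[2] == 30);  // original untouched
}
```

Once an array is hidden behind a generic read-into-buffer interface, only the second, copying style is available to code up the chain.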
 If buffers/arrays provide a stream interface, then they can be 
 used by
 code that doesn't directly need the buffering capabilities but 
 would
 still benefit from them.

See above - it would be better if the code was written for ranges, not streams. Then e.g. slicing of a buffer range on top of an array works just as cheaply as it does for arrays, and zero copies are made (=performance).

Okay, I see. I'm just concerned about composability. I'll have to think more about how it's affected.

(BTW, you can probably simplify lookahead/lookbehind with look(ptrdiff_t n) where the sign of n indicates ahead/behind.)
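A rough sketch of that suggestion, with a hypothetical Window type (not part of either library), where the sign of n picks the direction:

```d
// Hypothetical unified look(n): positive n looks ahead of the cursor,
// negative n looks behind it; empty slice when out of range.
struct Window
{
    ubyte[] data;
    size_t cur;

    ubyte[] look(ptrdiff_t n)
    {
        if (n >= 0)
            return (data.length - cur >= n) ? data[cur .. cur + n] : null;
        else
            return (cur >= -n) ? data[cur + n .. cur] : null;
    }
}

void main()
{
    auto w = Window([1, 2, 3, 4], 2);
    assert(w.look(2)  == [3, 4]);  // ahead
    assert(w.look(-2) == [1, 2]);  // behind
}
```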
 Actually these objects do just fine, since OS does the locking 
 (or makes sure of something equivalent). If your stream is TLS 
 there is no need for extra locking at all (no interleaving of 
 I/O calls is possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request 
 some operation, and as OS locks only on per sys-call basis 
 something cruel may happen in the code that deals with 
 buffering etc.

Oh yeah, you're right.

As a side note: I would love to get a kick-ass I/O stream package into Phobos. It could replace std.stream as well as std.stdio. Stuff like serializers and lexers would be more robust and easier to write.
Jan 06 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
My experimental lexer generator now uses the buffer range. 
https://github.com/Hackerpilot/Dscanner/tree/NewLexer/stdx

The problem that I have with it right now is that 
range.lookbehind(1).length != range.lookahead(1).length. This was 
confusing.
Jan 09 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky 
<dmitry.olsh gmail.com> wrote:

 07-Jan-2014 11:59, Jason White wrote:
 On Monday, 6 January 2014 at 10:26:27 UTC, Dmitry Olshansky wrote:
 Ok, now I see. In my eye though serialization completely hides raw
 stream write.

 So:
 struct SomeStream{
     void write(const(ubyte)[] data...);
 }

 struct Serializer(Stream){
     void write(T)(T value); //calls stream.write inside of it
 private:
     Stream stream;
 }

I was thinking it should also have "alias stream this;", but maybe that's not the best thing to do for a serializer. I concede, I've s/(read|write)Data/\1/g on https://github.com/jasonwhite/io/blob/master/src/io/file.d and it now works on Windows with useful exception messages.

Cool, got to steal sysErrorString too! :)
 Actually these objects do just fine, since OS does the locking (or
 makes sure of something equivalent). If your stream is TLS there is no
 need for extra locking at all (no interleaving of I/O calls is
 possible) regardless of its kind.

 Shared instances would need locking as 2 threads may request some
 operation, and as OS locks only on per sys-call basis something cruel
 may happen in the code that deals with buffering etc.

 Oh yeah, you're right. As a side note: I would love to get a kick-ass
 I/O stream package into Phobos. It could replace std.stream as well as
 std.stdio.

 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Yes, I'm gearing up to revisit that after a long D hiatus, and I came across this thread.

At this point, I really really like the ideas that you have in this. It solves an issue that I struggled with, and my solution was quite clunky.

I am thinking of this layout for streams/buffers:

1. Unbuffered stream used for raw i/o, based on a class hierarchy (which I have pretty much written)
2. Buffer like you have, based on a struct, with specific primitives. Its job is to collect data from the underlying stream, and present it to consumers as a random-access buffer.
3. Filter that has access to transform the buffer data/copy it.
4. Ranges that use the buffer/filter to process/present the data.

The problem I struggled with is the presentation of UTF data of any format as char[], wchar[] or dchar[]. 2 things need to happen. First is that the data needs to be post-processed to perform any necessary byte swapping. The second is to transcode the data into the correct width.

In this way, you can process UTF data of any type (I even have code to detect the encoding and automatically process it), and then use it in a way that makes sense for your code.

My solution was to paste in a "processing" delegate into the class hierarchy of buffered streams that allowed one read/write access to the buffer. But it's clunky, and difficult to deal with in a generalized fashion.

But the idea of using a buffer in between the stream and the range, and possibly bolting together multiple transformations in a clean way, makes this problem easy to solve, and I think it is closer to the vision Andrei/Walter have.

I also like the idea of "pinning" the data instead of my mechanism of using a delegate (which was similar but not as general). It also has better opportunities for optimization.
Other ideas that came to me that buffer filters could represent:
* compression/decompression
* encryption

I am going to study your code some more and see how I can update my code to use it. I still need to maintain the std.stdio.File interface, and Walter is insistent that the initial state of stdout/err/in must be synchronous with C (which kind of sucks, but I have plans on how to make it not be so bad).

There is still a lot of work left to do, but I think one of the hard parts is done, namely dealing with UTF transcoding. The remaining sticky part is dealing with shared. But with structs, this should make things much easier.

One question, is there a reason a buffer type has to be a range at all? I can see where it's easy to make it a range, but I don't see higher-level code using the range primitives when dealing with chunks of a stream.

-Steve
Jan 16 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky 
<dmitry.olsh gmail.com> wrote:

 16-Jan-2014 19:55, Steven Schveighoffer wrote:
 On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Then our goals are aligned. Be sure to take a peek at (if you haven't
 already):
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

 Yes, I'm gearing up to revisit that after a long D hiatus, and I came
 across this thread.

 At this point, I really really like the ideas that you have in this. It
 solves an issue that I struggled with, and my solution was quite clunky.

 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class hierarchy (which
 I have pretty much written)
 2. Buffer like you have, based on a struct, with specific primitives.
 Its job is to collect data from the underlying stream, and present it
 to consumers as a random-access buffer.

 The only interesting thing I'd add here is that some buffer may work
 without underlying stream. Best examples are arrays and MM-files.

Yes, but I would stress that for convenience, the buffer should forward some of the stream primitives (such as seeking) in cases where seeking is possible, at least in the case of a buffer that wraps a stream.

That actually is another point that would have sucked with my class-based solution -- allocating a class to use an array as backing.
 3. Filter that has access to transform the buffer data/copy it.
 4. Ranges that use the buffer/filter to process/present the data.

 Yes, yes and yes. I find it surprisingly good to see our vision seems to
 match. I was half-expecting you'd come along and destroy it all ;)

:) I've been preaching for a while that ranges don't make good streams, and that streams should be classes, but I hadn't considered splitting out the buffer. I think it's the right balance.
 The problem I struggled with is the presentation of UTF data of any
 format as char[] wchar[] or dchar[]. 2 things need to happen. First is
 that the data needs to be post-processed to perform any necessary byte
 swapping. The second is to transcode the data into the correct width.

 In this way, you can process UTF data of any type (I even have code to
 detect the encoding and automatically process it), and then use it in a
 way that makes sense for your code.

 My solution was to paste in a "processing" delegate into the class
 hierarchy of buffered streams that allowed one read/write access to the
 buffer. But it's clunky, and difficult to deal with in a generalized
 fashion.

 But the idea of using a buffer in between the stream and the range, and
 possibly bolting together multiple transformations in a clean way, makes
 this problem easy to solve, and I think it is closer to the vision
 Andrei/Walter have.

 In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte
 and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read. Think of the case of stdin, where the clients will be varied and written by many different people, and its interface is decided by Phobos.

But a transcoding buffer may make some optimizations. For instance, reading a UTF32 file as utf-8 can re-use the same buffer, as no code unit uses more than 4 code points (did I get that right?).
 I am going to study your code some more and see how I can update my code
 to use it. I still need to maintain the std.stdio.File interface, and
 Walter is insistent that the initial state of stdout/err/in must be
 synchronous with C (which kind of sucks, but I have plans on how to make
 it not be so bad).

 I'm seriously not seeing how interfacing with C runtime could be fast
 enough.

It's not. But an important stipulation in order for this to all be accepted is that it doesn't break existing code that expects things like printf and writef to interleave properly.

However, I think we can have an opt-in scheme, and there are certain cases where we can proactively switch to a D-buffer scheme. For example, if you get a ByLine range, it expects to exhaust the data from the stream, and may not properly work with C printf.

The idea is that stdio.File can switch at runtime from FILE * to D streams as needed or directed.
 There is still a lot of work left to do, but I think one of the hard
 parts is done, namely dealing with UTF transcoding. The remaining sticky
 part is dealing with shared. But with structs, this should make things
 much easier.

 I'm thinking a generic locking wrapper is possible along the lines of:

 shared Locked!(GenericBuffer!char) stdin; //usage

 struct Locked(T){
 shared:
 private:
 	T _this;
 	Mutex mut;
 public:
 	//forwarded methods
 }

 The wrapper will introduce a lock, and implement every method of the
 wrapped struct roughly like this:

 mut.lock();
 scope(exit) mut.unlock();
 (cast(T*)&_this).method(args);

 I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with shared. BUT, I don't see how it works safely generically because you necessarily have to cast away shared in order to call the methods. You would have to limit this to only working on types it was intended for.

I've been expecting to have to do something like this, but not looking forward to it :(
 One question, is there a reason a buffer type has to be a range at all?
 I can see where it's easy to make it a range, but I don't see
 higher-level code using the range primitives when dealing with chunks of
 a stream.

 Lexers/parsers enjoy it - i.e. they work pretty much as ranges
 especially when skipping spaces and the like. As I said the main reason
 was: if it fits as range why not? After all it makes one-pass processing
 of data trivial as it rides on top of foreach:

 foreach(octet; mybuffer)
 {
 	if(interesting(octet))
 		do_cool_stuff();
 }

 Things like countUntil make perfect sense when called on a buffer (e.g.
 find matching sentinel).

I think I misstated my question. What I am curious about is why a type must be a forward range to pass isBuffer. Of course, if it makes sense for a buffer type to also be a range, it can certainly implement that interface as well. But I don't know that I would need those primitives in all cases.

I don't have any specific use case for having a buffer that doesn't implement a range interface, but I am hesitant to necessarily couple the buffer interface to ranges just because we can't think of a counter-case :)

-Steve
Jan 16 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:

BufferRange
Buffer
InputStream
Stream
Jan 16 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2014 00:18, Walter Bright wrote:
 On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is
 defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:
BufferRange
Buffer

You even quoted it:
BufferRange .. (for now called simply Buffer)...

 InputStream
 Stream

One and the same, as I dealt with the input side of the problem only.

-- 
Dmitry Olshansky
Jan 16 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2014 2:30 PM, Dmitry Olshansky wrote:
 17-Jan-2014 00:18, Walter Bright wrote:
 On 12/29/2013 2:02 PM, Dmitry Olshansky wrote:
 The BufferRange concept itself (for now called simply Buffer) is
 defined here:
 http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I am confused because there are 4 terms conflated here:
BufferRange
Buffer

You even quoted it: >BufferRange .. (for now called simply Buffer)...

I know. But using two names for the same thing makes for confusing documentation. There also appears to be no rationale for "for the moment".
 InputStream
 Stream

One and the same as I dealt with the input side of the problem only.

Same problem with multiple names for the same thing, also there's no definition of either name.
Jan 16 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 16 Jan 2014 17:28:31 -0500, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 The other way around :) 4 code units - 1 code point.

I knew that was going to happen :)
 This would be a key addition for ANY type in order to properly work with
 shared. BUT, I don't see how it works safely generically because you
 necessarily have to cast away shared in order to call the methods. You
 would have to limit this to only working on types it was intended for.

The requirement may be that it's pure or should I say "well-contained". In other words as long as it doesn't smuggle references somewhere else it should be fine. That is to say it's 100% fool-proof, nor do I think that essentially simulating a synchronized class is always a good thing to do...

I think you meant *not* 100% :) But yeah, regardless of how generalized it is, this is likely the best path. I think this is the tack that std.stdio.File takes anyway, except it's using FILE *'s locking mechanism.
 Convenient to work with does ring good to me. I simply see no need to  
 reinvent std.algorithm on buffers especially the ones that just scan  
 left-to-right.
 Example would be calculating a checksum of a stream (say data comes from  
 a pipe or socket i.e. buffered). It's a trivial application of  
 std.algorithm.reduce and there no need to reinvent that wheel IMHO.

I again think I am being misunderstood. An example might be appropriate:

struct Buffer {...}      // implements BOTH forward range and Buffer primitives
struct OtherBuffer {...} // only implements Buffer primitives

If isBuffer is modified to not require isForwardRange, then both Buffer and OtherBuffer will work as buffers, but only Buffer works as a range. But as you have it now, isBuffer!OtherBuffer is false. Is this necessary?

So we can implement buffers that are both ranges and buffers, and they will work with std.algorithm without modification (and that's fine and expected by me), but do we need to *require* that? Are we over-specifying? Is there a possibility that someone can invent a buffer that satisfies the primitives of say a line-by-line reader, but cannot possibly be a forward range?

-Steve
Jan 16 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Thursday, 16 January 2014 at 15:55:07 UTC, Steven 
Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class 
 hierarchy (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific 
 primitives. It's job is to collect data from the underlying 
 stream, and present it to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?
Jan 17 2014
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 17 January 2014 at 09:19:12 UTC, Kagamin wrote:
 On Thursday, 16 January 2014 at 15:55:07 UTC, Steven 
 Schveighoffer wrote:
 I am thinking of this layout for streams/buffers:

 1. Unbuffered stream used for raw i/o, based on a class 
 hierarchy (which I have pretty much written)
 2. Buffer like you have, based on a struct, with specific 
 primitives. It's job is to collect data from the underlying 
 stream, and present it to consumers as a random-access buffer.

If you have a struct-based buffer, how would you enlarge the buffer? Won't it suffer from AA syndrome?

It is the lazily initialized nature of AAs that causes the problem. Structs can, but don't have to, replicate the behaviour.
Jan 17 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 17 January 2014 at 09:33:41 UTC, Dmitry Olshansky 
wrote:
 What's the problem? I don't see how struct/class can change 
 there anything, it's a member field that is an array that we 
 surely can expand.

Ah, I thought one can copy the buffer, now I see you pass it by ref.
Jan 17 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 17 Jan 2014 05:01:35 -0500, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 BTW I haven't thought of writing into the buffer, but it works exactly  
 the same. It could be even read/write -"discarded" data is written to  
 underlying stream, freshly loaded is read from stream. Now in case of  
 input stream only writes are nop, for output-only reads are nops.

 With pinning it makes for cool multi-pass algorithms that actually  
 output stuff into the file.

In my experience, the code/process for a write buffer is significantly different than the code for a read buffer. A read/write buffer is very difficult to make, because you either need 2 file pointers, or need to constantly seek the single file pointer to overwrite the existing data.
 But as you have it now, isBuffer!OtherBuffer is false. Is this  
 necessary?

I think I should call it BufferRange from now on. And bring my docs in line. It may make sense to provide naked Buffer itself as a construct (it has simpler interface) and have generic BufferRange wrapper. I'm just not seeing user code using the former - too error prone and unwieldy.

In a sense, I agree. There aren't many things to do directly with a buffer (i.e. there will likely be few filters), and a buffer range provides enough primitives to make front/popFront/empty trivial.
 So we can implement buffers that are both ranges and buffers, and will
 work with std.algorithm without modification (and that's fine and
 expected by me), but do we need to *require* that? Are we
 over-specifying?

Random-access ranges had to require .front/.back - maybe that was over-specification too? Some stuff is easy to index but not to "drop an item off at either end". But now it really doesn't matter - if there are such things, they are not random-access ranges.
 Is there a possibility that someone can invent a buffer
 that satisfies the primitives of say a line-by-line reader, but cannot
 possibly be a forward range?

I hardly can see that:

front    --> lookahead(1)[0]
empty    --> lookahead(1).length != 0
popFront --> seek(1) or enforce(seek(1))
save     --> well there got to be a way to pin the data in the buffer?

And they surely could be better implemented inside of a specific buffer range.
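That mechanical mapping can be sketched as a tiny shim; ArrayBuffer, lookahead and seek are illustrative names here, not the actual buffer API:

```d
// Wrap any type with lookahead/seek as an input range, following the
// front/empty/popFront mapping described in the message above.
struct RangeOver(Buf)
{
    Buf* buf;

    @property auto front() { return buf.lookahead(1)[0]; }
    @property bool empty() { return buf.lookahead(1).length == 0; }
    void popFront()        { buf.seek(1); }
}

// Toy array-backed "buffer" to exercise the shim.
struct ArrayBuffer
{
    ubyte[] data;
    size_t cur;

    ubyte[] lookahead(size_t n)
    {
        return (data.length - cur >= n) ? data[cur .. cur + n] : null;
    }

    void seek(size_t n) { cur += n; }
}

void main()
{
    import std.algorithm : equal;

    auto b = ArrayBuffer([1, 2, 3]);
    auto r = RangeOver!ArrayBuffer(&b);
    assert(equal(r, [1, 2, 3]));
}
```

save is the part that doesn't fall out mechanically: it needs some way to pin the buffer contents, which is exactly the point under discussion.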

I'll have to take a closer look at your code to have a reasonable response. But I think this looks fine so far. -Steve
Jan 17 2014