www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Streaming transport interfaces: input

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around. interface TransportBase { /// True if stream has no more data. property bool empty(); /// True if the position property works property bool positionable(); /// Gives current position in stream property ulong position(); /// Moves stream to absolute position pos property void position(ulong pos); } interface InputBinaryTransport : TransportBase { /// Reads up to target.length bytes into target, /// returns size of data read. If target.length < 1 throws. size_t read(ubyte[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number /// of bytes read. size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter); } interface InputTextTransport : TransportBase { /// Returns the native character size: 1 for UTF-8, 2 for UTF-16, /// 4 for UTF-32, 0 for nonstandard native encodings. property size_t nativeUTFWidth(); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 4 throws. size_t read(char[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref char[] target, in char[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 2 throws. size_t read(wchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref wchar[] target, in wchar[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 1 throws. size_t read(dchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref dchar[] target, in dchar[] delimiter); } Please destroy. Andrei
Oct 14 2010
next sibling parent reply Pelle <pelle.mansson gmail.com> writes:
On 10/14/2010 05:34 PM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around. interface TransportBase { /// True if stream has no more data. property bool empty(); /// True if the position property works property bool positionable(); /// Gives current position in stream property ulong position(); /// Moves stream to absolute position pos property void position(ulong pos); } interface InputBinaryTransport : TransportBase { /// Reads up to target.length bytes into target, /// returns size of data read. If target.length < 1 throws. size_t read(ubyte[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number /// of bytes read. size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter); } interface InputTextTransport : TransportBase { /// Returns the native character size: 1 for UTF-8, 2 for UTF-16, /// 4 for UTF-32, 0 for nonstandard native encodings. property size_t nativeUTFWidth(); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 4 throws. size_t read(char[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref char[] target, in char[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 2 throws. size_t read(wchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref wchar[] target, in wchar[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 1 throws. size_t read(dchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref dchar[] target, in dchar[] delimiter); } Please destroy. Andrei

Why is positionable in TransportBase? Should it not be a separate interface? Also, shouldn't the reads accept any output range? appendDelim as well. This would do away with the two extra versions in the text input. Why does the reads throw on maybe-too-small arrays? I would expect them to read at best effort and return 0 if they fail. As they read 0 bytes. I am no expert on this subject. Disclaimer, etc :-)
Oct 14 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/2010 11:43 AM, Pelle wrote:
 Why is positionable in TransportBase? Should it not be a separate
 interface?

It could.
 Also, shouldn't the reads accept any output range? appendDelim as well.
 This would do away with the two extra versions in the text input.

Interfaces can't have overridable template methods.
 Why does the reads throw on maybe-too-small arrays? I would expect them
 to read at best effort and return 0 if they fail. As they read 0 bytes.

Because small arrays would make it (a) impossible to distinguish between no-op and end of file and (b) impossible to read one UTF code point. Andrei
Oct 14 2010
prev sibling next sibling parent reply anon <foo bar.com> writes:
Andrei Alexandrescu Wrote:

 Starting a new thread from Denis' question:
 
 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around. interface TransportBase { /// True if stream has no more data. property bool empty(); /// True if the position property works property bool positionable(); /// Gives current position in stream property ulong position(); /// Moves stream to absolute position pos property void position(ulong pos); } interface InputBinaryTransport : TransportBase { /// Reads up to target.length bytes into target, /// returns size of data read. If target.length < 1 throws. size_t read(ubyte[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number /// of bytes read. size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter); } interface InputTextTransport : TransportBase { /// Returns the native character size: 1 for UTF-8, 2 for UTF-16, /// 4 for UTF-32, 0 for nonstandard native encodings. property size_t nativeUTFWidth(); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 4 throws. size_t read(char[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref char[] target, in char[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 2 throws. size_t read(wchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref wchar[] target, in wchar[] delimiter); /// Reads complete UTF code points into target, without going /// beyond target.length. Returns number of code units read. /// If target.length < 1 throws. size_t read(dchar[] target); /// Appends to target up until delimiter is found or end of stream. /// The delimiter is included in the target. Returns number of /// code units read. size_t appendDelim(ref dchar[] target, in dchar[] delimiter); } Please destroy. Andrei

I can't understand why we need binary vs. text modes. Seems to me a silly design decision from the old days when we used 7-bit terminals or something. This always complicates the APIs for no reason. I'd like to see a two layer design: - lower UNTYPED layer that deals with transporting bits and bytes and deals with lower level concerns like byte order. - higher level extend-able TYPED layer with formatters/encoders. Also, the higher level should be compose-able such that I could read from a gzipped xml utf8 file
Oct 14 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/2010 12:15 PM, anon wrote:
 I can't understand why we need binary vs. text modes. Seems to me a
 silly design decision from the old days when we used 7-bit terminals
 or something. This always complicates the APIs for no reason.

A variety of Internet protocols are text-based. Also, for example I use a text console everyday.
 I'd like to see a two layer design: - lower UNTYPED layer that deals
 with transporting bits and bytes and deals with lower level concerns
 like byte order. - higher level extend-able TYPED layer with
 formatters/encoders.

This would be the untyped lower level layer. Andrei
Oct 14 2010
parent anon <foo bar.com> writes:
Andrei Alexandrescu Wrote:

 On 10/14/2010 12:15 PM, anon wrote:
 I can't understand why we need binary vs. text modes. Seems to me a
 silly design decision from the old days when we used 7-bit terminals
 or something. This always complicates the APIs for no reason.

A variety of Internet protocols are text-based. Also, for example I use a text console everyday.
 I'd like to see a two layer design: - lower UNTYPED layer that deals
 with transporting bits and bytes and deals with lower level concerns
 like byte order. - higher level extend-able TYPED layer with
 formatters/encoders.

This would be the untyped lower level layer. Andrei

All I'm saying is that instead of treating text-based protocols specially at the lower level, they should be on the higher level typed as UTF-8 or whatever. They should be treated like any other protocol. E.g. I see no difference between a text file encoded in UTF8 and an image file encoded via jpeg.
Oct 14 2010
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Now say that 10 times fast!

On 10/14/10, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 ..lower level layer.

 Andrei

Oct 14 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around. interface TransportBase { /// True if stream has no more data. property bool empty(); /// True if the position property works property bool positionable(); /// Gives current position in stream property ulong position(); /// Moves stream to absolute position pos property void position(ulong pos); }

Please, use the term "seek", and allow an anchor. Every OS allows this, it makes no sense not to provide it. Other than that, I like the idea (assuming output transport is forthcoming and implements TransportBase).
 interface InputBinaryTransport : TransportBase
 {
      /// Reads up to target.length bytes into target,
      /// returns size of data read. If target.length < 1 throws.
      size_t read(ubyte[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number
      /// of bytes read.
      size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter);
 }

I don't like appendDelim. We don't need to define that until we have buffering. The simple function of an input stream is to read data. With buffering you get all the goodies that you want, but the buffer should be in control of its data buffer. Basically, appendDelim can be defined outside this class, because the primitive read is enough.
 interface InputTextTransport : TransportBase
 {
      /// Returns the native character size: 1 for UTF-8, 2 for UTF-16,
      /// 4 for UTF-32, 0 for nonstandard native encodings.
       property size_t nativeUTFWidth();

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 4 throws.
      size_t read(char[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref char[] target, in char[] delimiter);

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 2 throws.
      size_t read(wchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref wchar[] target, in wchar[] delimiter);

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 1 throws.
      size_t read(dchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref dchar[] target, in dchar[] delimiter);
 }

Shouldn't the text transport be defined on top of the binary transport? And in any case, I'd expect buffering to go between the two. If all you are adding are the different widths of characters, I don't think you need this extra layer. It's going to make the buffering layer more difficult to implement (now it must handle both a text version and abinary version). -Steve
Oct 14 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100. Oh, that reminds me I need to provide length as a property as well. This would save us crap like seek(0, SEEK_END); ftell() to figure out the length of a file. I have no sympathy for seek and tell with anchors.
 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?
 The simple function of an input stream is to read data.

It does read data.
 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

I think the appendDelim method allows fast and simple implementations of a variety of patterns. As I (thought I) have shown elsethread, without appendDelim there's no way to efficiently implement a line-oriented stream on top of a block-oriented one.
 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra interface function is acceptable for fast I/O.
 Shouldn't the text transport be defined on top of the binary transport?

No, because there are transports that genuinely do not accept binary data.
 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a different interface?
 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

I don't understand this. Andrei
Oct 14 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 12:56 CDT, Denis Koroskin wrote:
 appendDelim *requires* buffering for to be implemented. No OS provides
 an API to read from a file (be it pipe, socket, whatever) to read up to
 some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 512 bytes or whatever), so ANY protocol that allows the user to set the maximum bytes to read will require buffering and copying. So how is appendDelim worse than read?
 As such, if you
 need to read until a delimeter, you need to fetch block to some internal
 buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.
 I've
 implemented that on top of chunked read interface, and it was 5% faster
 than getline()/getdelim() that GNU libc provides (despite you claming it
 to be "many times faster"). It's not.

Please post your code.
 Buffering requires and additional level of data copying, and this is bad
 for fast I/O.

Agreed. But then you define routines that also requires buffering. How do you reconcile your own requirement with your own interface?
 If you need fast I/O or must pull that out of the stream
 interface. Otherwise chunked read will be less efficient due to
 additional copies to and from buffers.

 On the contrary line-based reading can be implemented on top of the
 chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying. appendDelim implementation: 1. Low-level read in internal buffers 2. Search for delimiter (assume found for simplicity) 3. Resize user buffer 4. Copy That's one copy, with the necessary corner cases when the delimiter isn't found yet etc. (which increase copying ONLY if the buffer is actually moved when reallocated). The implementation in your message on 10/13/2010 21:20 CDT: 1. Low-level read in internal buffers 2. Copy from internal buffers into the internal buffer provided by your ByLine implementation 3. Copy from the internal buffer of ByLine into the user-supplied buffer That's two copies. Agreed? Andrei
Oct 14 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 14:00 CDT, Denis Koroskin wrote:
 On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:56 CDT, Denis Koroskin wrote:
 appendDelim *requires* buffering for to be implemented. No OS provides
 an API to read from a file (be it pipe, socket, whatever) to read up to
 some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 512 bytes or whatever), so ANY protocol that allows the user to set the maximum bytes to read will require buffering and copying. So how is appendDelim worse than read?
 As such, if you
 need to read until a delimeter, you need to fetch block to some internal
 buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.
 I've
 implemented that on top of chunked read interface, and it was 5% faster
 than getline()/getdelim() that GNU libc provides (despite you claming it
 to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248

I meant the baseline.
 Buffering requires and additional level of data copying, and this is bad
 for fast I/O.

Agreed. But then you define routines that also requires buffering. How do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when you need to buffer something, but in general you don't. My Stream interface is simply a thin portable layer on top of OS. See the code above for simple implementation that is built on top of fopen/fread (I used open/read initially but it gave 0 improvement so I went back to fopen/fread because GNU libc line-input uses them, too, so that would be the most fair comparison). It can't be any more efficient than that.
 If you need fast I/O or must pull that out of the stream
 interface. Otherwise chunked read will be less efficient due to
 additional copies to and from buffers.

 On the contrary line-based reading can be implemented on top of the
 chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying. appendDelim implementation: 1. Low-level read in internal buffers 2. Search for delimiter (assume found for simplicity) 3. Resize user buffer 4. Copy That's one copy, with the necessary corner cases when the delimiter isn't found yet etc. (which increase copying ONLY if the buffer is actually moved when reallocated). The implementation in your message on 10/13/2010 21:20 CDT: 1. Low-level read in internal buffers 2. Copy from internal buffers into the internal buffer provided by your ByLine implementation 3. Copy from the internal buffer of ByLine into the user-supplied buffer That's two copies. Agreed? Andrei

I'm not sure what message are you talking about (first or second one). Second one (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248) makes a chunked read to internal buffer (if not filled yet), then searches for a delimiter and then copies to a user-provided buffer. That's one copy in most cases. And that's what GNU libc does, too.

Your function calls fopen() and does not disable buffering by means of setvbuf(). By default fopen() opens in buffered mode. Does the existence of that buffer entail an extra copy? Andrei
Oct 14 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/2010 07:57 PM, Denis Koroskin wrote:
 In my original version there was a setbuf(f, null) call. I removed it
 because it had 0 impact on performance.
 I also tried using unistd open/read functions, that had zero impact, too

Yah, I looked more into the behavior of fread. At least on a couple of implementations that offer source, the implementation first copies whatever data is (if any) in the internal buffer, and then read the rest of the data straight into the remaining user-supplied buffer. This means, if you consistently fread into a buffer, the internal buffers are never touched. I stand corrected regarding the extra copy.
 (btw, opening file with O_DIRECT returned valid file descriptor, but
 read operations very failing with an invalid argument error).

I recall I've read a discussion where Linus was really down on O_DIRECT. Found it: http://kerneltrap.org/node/7563 Andrei
Oct 14 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

I confirm byLine is faster than getline after translating getline to D. byLine does crash intermittently so it has a bug somewhere, but I assume fixing it won't reduce its performance significantly. Results on my system on a 20x larger test file (obtained with repeat 20 { cat WindowsUpdate.log >> big.log } && mv big.log WindowsUpdate.log) are (median of 5): 1.759 vs. 1.516 seconds (14% improvement). Andrei
Oct 15 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/15/10 11:11 CDT, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 20:06:47 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

I confirm byLine is faster than getline after translating getline to D. byLine does crash intermittently so it has a bug somewhere, but I assume fixing it won't reduce its performance significantly. Results on my system on a 20x larger test file (obtained with repeat 20 { cat WindowsUpdate.log >> big.log } && mv big.log WindowsUpdate.log) are (median of 5): 1.759 vs. 1.516 seconds (14% improvement). Andrei

With introduction of consume (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119414) it got another 40% improvement for counting lines and chars (it doesn't copy anything at all now).

Then I guess... put me on the bandwagon! Thanks for the great job on designing, implementing, and benchmarking. Andrei
Oct 15 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100.

Um.. yuck. We need to use two system calls to seek 100 bytes?

seek and tell don't always issue system calls.
 Oh, that reminds me I need to provide length as a property as well.
 This would save us crap like seek(0, SEEK_END); ftell() to figure out
 the length of a file.

So now you need to do stream.position = stream.length to seek to the end of the file instead of stream.seek(0, Anchor.END)?

Yes.
 Plus, how will you
 implement length, probably like this:
 auto curpos = seek(0, SEEK_CUR);
 auto len = seek(0, SEEK_END);
 seek(curpos, SEEK_BEG);
 return len;

Depends. For files, you can just use stat.
 So that looks like 3 system calls instead of one, plus you just wasted
 time seeking back to the current position.

Well again they don't always issue system calls, but point taken. I do see a need for fast positioning at end of stream. Perhaps we could accommodate an enum equal to ulong.max such that this goes to the end of stream: stream.position = StreamBase.atEnd;
 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?

Because appendDelim deals with buffering. If I defined a buffered stream, I'd include a function like this: size_t read(bool delegate(T[] data) sink); which buffers data until sink returned false (passing each read chunk into sink), extending the buffer as necessary. Then it's trivial to implement readDelim on top of this.

Interesting. But that would still force readDelim to store leftover bytes.
 The simple function of an input stream is to read data.

It does read data.

I mean, that's *all* it should do. It should not be appending to buffers.

This comes from a practical need. I've often had a buffer and wanted to read one more line into it, keeping the existing content. It was impossible without extra allocation and copying.
 I think the appendDelim method allows fast and simple implementations
 of a variety of patterns. As I (thought I) have shown elsethread,
 without appendDelim there's no way to efficiently implement a
 line-oriented stream on top of a block-oriented one.

Um... the read system call is the same interface as the proposed block-oriented interface. How are you avoiding using system calls?

I think we don't have the same definition for "system call". For example by my definition fread is NOT a system call.
 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra interface function is acceptable for fast I/O.

No, you can define it without extra copying.

How? Denis' implementation has two copies in the mix. (I'm not counting .dup etc.) Anyhow, let's do this - write down your interfaces so I can comment on them. We talk "oh that's a buffering interface" and "that requires buffering" and "that's an extra copy" and so on but we have little concrete contenders. I put my cards on the table, you put yours.
 If you don't allow direct
 access to the buffer, then you have extra copying. But we don't have to
 mimic C here. We should not be encouraging constant reinventing of the
 buffer wheel here. Buffering is a well-defined task that can be
 implemented once.

 Just as a note, Tango does this, and it's very fast. There is certainly
 no extra copying there.

 Shouldn't the text transport be defined on top of the binary transport?

No, because there are transports that genuinely do not accept binary data.

I mean, a text transport uses a binary transport underneath. What text transport doesn't use a binary transport to do its dirty work? And what exactly does a text transport do so differently that it needs to be a separate interface?

A text transport does not accept raw binary data and requires e.g. Base64 encoding (e.g. mail attachments do that). The console is a text device - makes no sense to dump binary data on it. A JSON encoder is also a text transport.
 In other words, if 90% of the text transport duplicates the binary
 transport, I see an opportunity for consolidation.

Consolidation brings simplification, which is good. But I believe there exist text entities that do make the distinction worthwhile.
 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a different interface?

Yes, but if we expect to reuse code, I'd expect a buffered transport to use a primitive transport underneath for actually reading/writing data. If you have multiple versions of the class that actually reads/writes data (such as binary vs. text), then the buffer which uses it must support all of them. Text based processing to me seems to be a buffered activity (reading lines, ensuring you don't have sliced utf-8 data, etc.).

Yes. What may be not so obvious is that binary processing with user-imposed data lenghts is ALSO a buffer activity. This is because the low-level buffers do NOT come at arbitrary positions (alignment restrictions) and to NOT come at arbitrary lengths.
 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

I don't understand this.

buffer uses a transport. If you have two different transport interfaces, the buffer must support them both. And if the benefit is, one simply defines [w|d]char versions of read, then we haven't gained much for the trouble of having to support both.

I'll be looking forward to seeing your interfaces. Andrei
Oct 14 2010
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-10-14 16:47:13 -0400, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 stream.position = StreamBase.atEnd;

OK, that is acceptable. What about seeking to N bytes before the end? What about seeking N bytes ahead of the current position (as previously stated)?

Would be nice if we could do all this: stream.position = $; stream.position = $-10; stream.position += 10; The last one (+=) should be only one function call to avoid two system calls. All that's a little hard to implement though. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 14 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 21:58 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not
 counting .dup etc.) Anyhow, let's do this - write down your
 interfaces so I can comment on them. We talk "oh that's a buffering
 interface" and "that requires buffering" and "that's an extra copy"
 and so on but we have little concrete contenders. I put my cards on
 the table, you put yours.

I'll see if I can put something together.

Here's a rough outline:

Thanks!
 enum Anchor
 {
 Begin,
 Current,
 End
 }

 interface Seek
 {
 ulong seek(long delta, Anchor whence);
 final ulong tell() { return seek(0, Anchor.Current); }
 bool seekable(); // define as false if seeking is not supported, true if it
 // is supported (this doesn't necessarily mean a seek will
 // succeed).
 }

So far so good.
 interface InputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.
 }

No way to check for end of stream except by reading some of it?
 // defined to implement either a D buffered object or wrap a FILE *.
 //
 interface BufferedInputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.

Since this method has the same sig, why doesn't BufferedInputTransport inherit InputTransport?
 // read data into the buffer until the delegate returns other than ~0
 //
 // The delegate is passed the entire buffer so far, with the start of the
 // new data just read. It returns other than ~0 when it determines the end
 // of the data in question.
 //
 ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process);

How does the delegate say "you know what, I'm fine with the first 1000 bytes of the data; please take the rest of 1048 back"? Is that the result of the delegate? The process feels a bit odd.
 // same as readUntil except append to the given arr, Any excess
 // data will be pushed into the internal buffer.
 //
 size_t appendUntil(uint delegate(ubyte[] data, uint start) process, ref
 ubyte[] arr)

So indeed the delegate seems to return the length it wants to keep? And the rest would be copied back into the stream's internal buffers? I'm not sure I understand this API.
 // various buffer functions.
  property size_t bufsize();
  property size_t readable();
 // etc.

Can one set bufsize?
 }

 The way I see it working is, there are two implementations for
 BufferedInputTransport: FILEInputTransport and DBufferInputTransport.
 There are numerous implementations of InputTransport, each of which can
 be passed to the DBufferInputTransport, which uses its own buffer
 implementation. For example, a network socket, file, inter-thread
 stream, an array, etc.

 This way, you can play nice with C's stdio when necessary (i.e. for
 stdin/stdout/stderr) and avoid the FILE limitations and performance
 issues otherwise.

I'm a bit unclear on the delegate stuff, but it's promising because it could be quite flexible. But I wouldn't want to aggravate the users with an API that's difficult to use. Could you please give a few examples using delegates that implement common patterns - e.g. readline and readDelim? Andrei
Oct 14 2010
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
  On 14.10.2010 22:14, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100.

Um.. yuck. We need to use two system calls to seek 100 bytes?
 Oh, that reminds me I need to provide length as a property as well. 
 This would save us crap like seek(0, SEEK_END); ftell() to figure out 
 the length of a file.

So now you need to do stream.position = stream.length to seek to the end of the file instead of stream.seek(0, Anchor.END)? Plus, how will you implement length, probably like this: auto curpos = seek(0, SEEK_CUR); auto len = seek(0, SEEK_END); seek(curpos, SEEK_BEG); return len;

point of most native OS I/O functions, at very least, Windows and Posix have a separate function for determining file length. [snip]
 -Steve

-- Dmitry Olshansky
Oct 14 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
With generic element type:

interface InputBinaryTransport(T) : TransportBase
if (!hasPointers!T && !is(T == class))
{
     /// Reads up to target.length items into target,
     /// returns number of items read. If target.length < 1 throws.
     size_t read(T[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number
     /// of items read.
     size_t appendDelim(ref T[] target, in T[] delimiter);
}

The rest of the transport interfaces stays the same.


Andrei
Oct 14 2010
parent "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 14 Oct 2010 21:42:04 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 With generic element type:

 interface InputBinaryTransport(T) : TransportBase
 if (!hasPointers!T && !is(T == class))
 {
      /// Reads up to target.length items into target,
      /// returns number of items read. If target.length < 1 throws.
      size_t read(T[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number
      /// of items read.
      size_t appendDelim(ref T[] target, in T[] delimiter);
 }

 The rest of the transport interfaces stays the same.


 Andrei

This interface requires buffering and thus needs to be built on top of the lower-level stream interface because Stream shouldn't do buffering internally. But anyway, why would you need to read exactly one type of data from a stream? Why can't "read" be a template? Why does InputBinaryTransport have to be an interface and not a concrete implementation? It can be implemented uniformally on top of a generic stream.
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 14 Oct 2010 21:39:03 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100. Oh, that reminds me I need to provide length as a property as well. This would save us crap like seek(0, SEEK_END); ftell() to figure out the length of a file. I have no sympathy for seek and tell with anchors.
 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?
 The simple function of an input stream is to read data.

It does read data.
 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

I think the appendDelim method allows fast and simple implementations of a variety of patterns. As I (thought I) have shown elsethread, without appendDelim there's no way to efficiently implement a line-oriented stream on top of a block-oriented one.
 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra interface function is acceptable for fast I/O.
 Shouldn't the text transport be defined on top of the binary transport?

No, because there are transports that genuinely do not accept binary data.
 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a different interface?
 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

I don't understand this. Andrei

appendDelim *requires* buffering for to be implemented. No OS provides an API to read from a file (be it pipe, socket, whatever) to read up to some abstract delimiter. It *always* reads in blocks. As such, if you need to read until a delimeter, you need to fetch block to some internal buffer, MANUALLY search through it and THEN copy to output string. I've implemented that on top of chunked read interface, and it was 5% faster than getline()/getdelim() that GNU libc provides (despite you claming it to be "many times faster"). It's not. Buffering requires and additional level of data copying, and this is bad for fast I/O. If you need fast I/O or must pull that out of the stream interface. Otherwise chunked read will be less efficient due to additional copies to and from buffers. On the contrary line-based reading can be implemented on top of the chunked read without sacrificing a tiny bit of efficiency.
Oct 14 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100.

Um.. yuck. We need to use two system calls to seek 100 bytes?
 Oh, that reminds me I need to provide length as a property as well. This  
 would save us crap like seek(0, SEEK_END); ftell() to figure out the  
 length of a file.

So now you need to do stream.position = stream.length to seek to the end of the file instead of stream.seek(0, Anchor.END)? Plus, how will you implement length, probably like this: auto curpos = seek(0, SEEK_CUR); auto len = seek(0, SEEK_END); seek(curpos, SEEK_BEG); return len; So that looks like 3 system calls instead of one, plus you just wasted time seeking back to the current position. Now, I do like the syntax better, but we are not going for syntax here, we are going for performance. If you can find a way to merge the two, I'd be happy with it.
 I have no sympathy for seek and tell with anchors.

Sympathy has nothing to do with it. It's the simple fact that you have to deal with the OS, and munging your interface on top of it means more system calls and less performance.
 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?

Because appendDelim deals with buffering. If I defined a buffered stream, I'd include a function like this: size_t read(bool delegate(T[] data) sink); which buffers data until sink returned false (passing each read chunk into sink), extending the buffer as necessary. Then it's trivial to implement readDelim on top of this.
 The simple function of an input stream is to read data.

It does read data.

I mean, that's *all* it should do. It should not be appending to buffers.
 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

I think the appendDelim method allows fast and simple implementations of a variety of patterns. As I (thought I) have shown elsethread, without appendDelim there's no way to efficiently implement a line-oriented stream on top of a block-oriented one.

Um... the read system call is the same interface as the proposed block-oriented interface. How are you avoiding using system calls?
 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra interface function is acceptable for fast I/O.

No, you can define it without extra copying. If you don't allow direct access to the buffer, then you have extra copying. But we don't have to mimic C here. We should not be encouraging constant reinventing of the buffer wheel here. Buffering is a well-defined task that can be implemented once. Just as a note, Tango does this, and it's very fast. There is certainly no extra copying there.
 Shouldn't the text transport be defined on top of the binary transport?

No, because there are transports that genuinely do not accept binary data.

I mean, a text transport uses a binary transport underneath. What text transport doesn't use a binary transport to do its dirty work? And what exactly does a text transport do so differently that it needs to be a separate interface? In other words, if 90% of the text transport duplicates the binary transport, I see an opportunity for consolidation.
 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a different interface?

Yes, but if we expect to reuse code, I'd expect a buffered transport to use a primitive transport underneath for actually reading/writing data. If you have multiple versions of the class that actually reads/writes data (such as binary vs. text), then the buffer which uses it must support all of them. Text based processing to me seems to be a buffered activity (reading lines, ensuring you don't have sliced utf-8 data, etc.).
 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

I don't understand this.

buffer uses a transport. If you have two different transport interfaces, the buffer must support them both. And if the benefit is, one simply defines [w|d]char versions of read, then we haven't gained much for the trouble of having to support both. -Steve
Oct 14 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 13:56:19 -0400, Denis Koroskin <2korden gmail.com>  
wrote:

 Buffering requires and additional level of data copying, and this is bad  
 for fast I/O. If you need fast I/O or must pull that out of the stream  
 interface. Otherwise chunked read will be less efficient due to  
 additional copies to and from buffers.

No it doesn't. Tango's buffer does not require extra copying. -Steve
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:56 CDT, Denis Koroskin wrote:
 appendDelim *requires* buffering for to be implemented. No OS provides
 an API to read from a file (be it pipe, socket, whatever) to read up to
 some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 512 bytes or whatever), so ANY protocol that allows the user to set the maximum bytes to read will require buffering and copying. So how is appendDelim worse than read?
 As such, if you
 need to read until a delimeter, you need to fetch block to some internal
 buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.
 I've
 implemented that on top of chunked read interface, and it was 5% faster
 than getline()/getdelim() that GNU libc provides (despite you claming it
 to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248
 Buffering requires and additional level of data copying, and this is bad
 for fast I/O.

Agreed. But then you define routines that also requires buffering. How do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when you need to buffer something, but in general you don't. My Stream interface is simply a thin portable layer on top of OS. See the code above for simple implementation that is built on top of fopen/fread (I used open/read initially but it gave 0 improvement so I went back to fopen/fread because GNU libc line-input uses them, too, so that would be the most fair comparison). It can't be any more efficient than that.
 If you need fast I/O or must pull that out of the stream
 interface. Otherwise chunked read will be less efficient due to
 additional copies to and from buffers.

 On the contrary line-based reading can be implemented on top of the
 chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying. appendDelim implementation: 1. Low-level read in internal buffers 2. Search for delimiter (assume found for simplicity) 3. Resize user buffer 4. Copy That's one copy, with the necessary corner cases when the delimiter isn't found yet etc. (which increase copying ONLY if the buffer is actually moved when reallocated). The implementation in your message on 10/13/2010 21:20 CDT: 1. Low-level read in internal buffers 2. Copy from internal buffers into the internal buffer provided by your ByLine implementation 3. Copy from the internal buffer of ByLine into the user-supplied buffer That's two copies. Agreed? Andrei

I'm not sure what message are you talking about (first or second one). Second one (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars. &article_id=119248) makes a chunked read to internal buffer (if not filled yet), then searches for a delimiter and then copies to a user-provided buffer. That's one copy in most cases. And that's what GNU libc does, too.
Oct 14 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows  
 this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows seek/tell with anchors allows ALL anchors, and always allows either both or none of seek and tell. So I decided to cut through the crap and simplify. You want to seek 100 bytes from here, you write stream.position = stream.position + 100.

Um.. yuck. We need to use two system calls to seek 100 bytes?

seek and tell don't always issue system calls.

seek *is* a system call (lseek64). tell is simply seeking with (0, SEEK_CUR). Are we talking about the same thing?
 Oh, that reminds me I need to provide length as a property as well.
 This would save us crap like seek(0, SEEK_END); ftell() to figure out
 the length of a file.

So now you need to do stream.position = stream.length to seek to the end of the file instead of stream.seek(0, Anchor.END)?

Yes.
 Plus, how will you
 implement length, probably like this:
 auto curpos = seek(0, SEEK_CUR);
 auto len = seek(0, SEEK_END);
 seek(curpos, SEEK_BEG);
 return len;

Depends. For files, you can just use stat.

True, but still, 2 system calls (stat and then seek).
 So that looks like 3 system calls instead of one, plus you just wasted
 time seeking back to the current position.

Well again they don't always issue system calls, but point taken. I do see a need for fast positioning at end of stream. Perhaps we could accommodate an enum equal to ulong.max such that this goes to the end of stream: stream.position = StreamBase.atEnd;

OK, that is acceptable. What about seeking to N bytes before the end? What about seeking N bytes ahead of the current position (as previously stated)?
 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?

Because appendDelim deals with buffering. If I defined a buffered stream, I'd include a function like this: size_t read(bool delegate(T[] data) sink); which buffers data until sink returned false (passing each read chunk into sink), extending the buffer as necessary. Then it's trivial to implement readDelim on top of this.

Interesting. But that would still force readDelim to store leftover bytes.

What does readDelim do with them in your implementation? It must read data via a block read, that's the only thing the OS provides (via read system call), so what do you do, just return the extra data or throw it away? You need to buffer it somewhere.
 The simple function of an input stream is to read data.

It does read data.

I mean, that's *all* it should do. It should not be appending to buffers.

This comes from a practical need. I've often had a buffer and wanted to read one more line into it, keeping the existing content. It was impossible without extra allocation and copying.

This can be accomodated via a buffer type. Buffering provides everything you need to implement readDelim.
 I think the appendDelim method allows fast and simple implementations
 of a variety of patterns. As I (thought I) have shown elsethread,
 without appendDelim there's no way to efficiently implement a
 line-oriented stream on top of a block-oriented one.

Um... the read system call is the same interface as the proposed block-oriented interface. How are you avoiding using system calls?

I think we don't have the same definition for "system call". For example by my definition fread is NOT a system call.

OK, now I think I see the issue :) You are assuming we are implementing all of this on top of FILE *. FILE * provides buffering already, so you are not avoiding buffering at all and you are unnecessarily using an extremely outdated interface. BTW, fread eventually calls read, there's no way around it. I think we can provide a version of the BufferedStream interface which uses C's FILE * for stdout/stdin/stderr (to play nice with C), but we should avoid FILE * for everything else.
 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra interface function is acceptable for fast I/O.

No, you can define it without extra copying.

How? Denis' implementation has two copies in the mix. (I'm not counting .dup etc.) Anyhow, let's do this - write down your interfaces so I can comment on them. We talk "oh that's a buffering interface" and "that requires buffering" and "that's an extra copy" and so on but we have little concrete contenders. I put my cards on the table, you put yours.

I'll see if I can put something together.
 If you don't allow direct
 access to the buffer, then you have extra copying. But we don't have to
 mimic C here. We should not be encouraging constant reinventing of the
 buffer wheel here. Buffering is a well-defined task that can be
 implemented once.

 Just as a note, Tango does this, and it's very fast. There is certainly
 no extra copying there.

 Shouldn't the text transport be defined on top of the binary  
 transport?

No, because there are transports that genuinely do not accept binary data.

I mean, a text transport uses a binary transport underneath. What text transport doesn't use a binary transport to do its dirty work? And what exactly does a text transport do so differently that it needs to be a separate interface?

A text transport does not accept raw binary data and requires e.g. Base64 encoding (e.g. mail attachments do that). The console is a text device - makes no sense to dump binary data on it. A JSON encoder is also a text transport.

-- that writes to/reads from a binary transport. A file is a binary transport. My opinion is that a text reader/writer should be a class/struct that uses a binary transport. If you are going to implement the binary transport stuff also inside the text transport, then it seems like an unnecessary duplication.
 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a different interface?

Yes, but if we expect to reuse code, I'd expect a buffered transport to use a primitive transport underneath for actually reading/writing data. If you have multiple versions of the class that actually reads/writes data (such as binary vs. text), then the buffer which uses it must support all of them. Text based processing to me seems to be a buffered activity (reading lines, ensuring you don't have sliced utf-8 data, etc.).

Yes. What may be not so obvious is that binary processing with user-imposed data lenghts is ALSO a buffer activity. This is because the low-level buffers do NOT come at arbitrary positions (alignment restrictions) and to NOT come at arbitrary lengths.

You can still avoid copying. I'll see if I can show you. -Steve
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 15 Oct 2010 02:01:55 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 14:00 CDT, Denis Koroskin wrote:
 On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:56 CDT, Denis Koroskin wrote:
 appendDelim *requires* buffering for to be implemented. No OS provides
 an API to read from a file (be it pipe, socket, whatever) to read up  
 to
 some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 512 bytes or whatever), so ANY protocol that allows the user to set the maximum bytes to read will require buffering and copying. So how is appendDelim worse than read?
 As such, if you
 need to read until a delimeter, you need to fetch block to some  
 internal
 buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.
 I've
 implemented that on top of chunked read interface, and it was 5%  
 faster
 than getline()/getdelim() that GNU libc provides (despite you claming  
 it
 to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248

I meant the baseline.
 Buffering requires and additional level of data copying, and this is  
 bad
 for fast I/O.

Agreed. But then you define routines that also requires buffering. How do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when you need to buffer something, but in general you don't. My Stream interface is simply a thin portable layer on top of OS. See the code above for simple implementation that is built on top of fopen/fread (I used open/read initially but it gave 0 improvement so I went back to fopen/fread because GNU libc line-input uses them, too, so that would be the most fair comparison). It can't be any more efficient than that.
 If you need fast I/O or must pull that out of the stream
 interface. Otherwise chunked read will be less efficient due to
 additional copies to and from buffers.

 On the contrary line-based reading can be implemented on top of the
 chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying. appendDelim implementation: 1. Low-level read in internal buffers 2. Search for delimiter (assume found for simplicity) 3. Resize user buffer 4. Copy That's one copy, with the necessary corner cases when the delimiter isn't found yet etc. (which increase copying ONLY if the buffer is actually moved when reallocated). The implementation in your message on 10/13/2010 21:20 CDT: 1. Low-level read in internal buffers 2. Copy from internal buffers into the internal buffer provided by your ByLine implementation 3. Copy from the internal buffer of ByLine into the user-supplied buffer That's two copies. Agreed? Andrei

I'm not sure what message are you talking about (first or second one). Second one (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248) makes a chunked read to internal buffer (if not filled yet), then searches for a delimiter and then copies to a user-provided buffer. That's one copy in most cases. And that's what GNU libc does, too.

Your function calls fopen() and does not disable buffering by means of setvbuf(). By default fopen() opens in buffered mode. Does the existence of that buffer entail an extra copy? Andrei

In my original version there was a setbuf(f, null) call. I removed it because it had 0 impact on performance. I also tried using unistd open/read functions, that had zero impact, too (btw, opening file with O_DIRECT returned valid file descriptor, but read operations very failing with an invalid argument error).
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 07:57 PM, Denis Koroskin wrote:
 In my original version there was a setbuf(f, null) call. I removed it
 because it had 0 impact on performance.
 I also tried using unistd open/read functions, that had zero impact, too

Yah, I looked more into the behavior of fread. At least on a couple of implementations that offer source, the implementation first copies whatever data is (if any) in the internal buffer, and then read the rest of the data straight into the remaining user-supplied buffer. This means, if you consistently fread into a buffer, the internal buffers are never touched. I stand corrected regarding the extra copy.
 (btw, opening file with O_DIRECT returned valid file descriptor, but
 read operations very failing with an invalid argument error).

I recall I've read a discussion where Linus was really down on O_DIRECT. Found it: http://kerneltrap.org/node/7563 Andrei

Yeah, I've read that. Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best of 3 results for parsing 33Mb log file: using getline (C++ -O2): real 0m0.081s user 0m0.068s sys 0m0.016s using byLine (D -O -release -inline): real 0m0.067s user 0m0.056s sys 0m0.012s That's a 20% difference. Source and test files here: http://rapidshare.com/files/425154119/tests.7z 309k http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no 7z installed)
Oct 14 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not counting  
 .dup etc.) Anyhow, let's do this - write down your interfaces so I can  
 comment on them. We talk "oh that's a buffering interface" and "that  
 requires buffering" and "that's an extra copy" and so on but we have  
 little concrete contenders. I put my cards on the table, you put yours.

I'll see if I can put something together.

Here's a rough outline: enum Anchor { Begin, Current, End } interface Seek { ulong seek(long delta, Anchor whence); final ulong tell() { return seek(0, Anchor.Current); } bool seekable(); // define as false if seeking is not supported, true if it // is supported (this doesn't necessarily mean a seek will // succeed). } interface InputTransport : Seek { size_t read(ubyte[] data); // returns 0 on EOF. } // defined to implement either a D buffered object or wrap a FILE *. // interface BufferedInputTransport : Seek { size_t read(ubyte[] data); // returns 0 on EOF. // // read data into the buffer until the delegate returns other than ~0 // // The delegate is passed the entire buffer so far, with the start of the // new data just read. It returns other than ~0 when it determines the end // of the data in question. // ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process); // // same as readUntil except append to the given arr, Any excess // data will be pushed into the internal buffer. // size_t appendUntil(uint delegate(ubyte[] data, uint start) process, ref ubyte[] arr) // various buffer functions. property size_t bufsize(); property size_t readable(); // etc. } The way I see it working is, there are two implementations for BufferedInputTransport: FILEInputTransport and DBufferInputTransport. There are numerous implementations of InputTransport, each of which can be passed to the DBufferInputTransport, which uses its own buffer implementation. For example, a network socket, file, inter-thread stream, an array, etc. This way, you can play nice with C's stdio when necessary (i.e. for stdin/stdout/stderr) and avoid the FILE limitations and performance issues otherwise. -Steve
Oct 14 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 14 Oct 2010 19:30:21 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2010-10-14 16:47:13 -0400, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 stream.position = StreamBase.atEnd;

What about seeking N bytes ahead of the current position (as previously stated)?

Would be nice if we could do all this: stream.position = $; stream.position = $-10; stream.position += 10; The last one (+=) should be only one function call to avoid two system calls. All that's a little hard to implement though.

You could get close to this with some nifty type construction, but I'm unsure it's worth all that when the whole thing can be had via one function call (seek). -Steve
Oct 14 2010
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:
 
 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around.

I suggest having read and write with offset style apis. See also: man pread The value here is that you can safely concurrently access a file descriptor and not have to worry about atomicity of seek+read or seek+write at the app layer.
Oct 14 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/14/10 23:37 CDT, Brad Roberts wrote:
 On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around.

I suggest having read and write with offset style apis. See also: man pread The value here is that you can safely concurrently access a file descriptor and not have to worry about atomicity of seek+read or seek+write at the app layer.

This is very interesting - I had no idea about pread and pwrite. http://linux.die.net/man/2/pread Andrei
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 15 Oct 2010 08:37:24 +0400, Brad Roberts <braddr puremagic.com>  
wrote:

 On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for formatting - it just moves stuff around.

I suggest having read and write with offset style apis. See also: man pread The value here is that you can safely concurrently access a file descriptor and not have to worry about atomicity of seek+read or seek+write at the app layer.

In my original API proposal I included reading from offsets: interface SeekableInputStream : SeekableStream, InputStream { ... // reads up to buffer.length bytes from a stream // returns number of bytes read // throws on error size_t read(ubyte[] buffer, long offset); // reads up to buffer.length bytes from a specified offset AsyncReadFromOffsetRequest readAsync(ubyte[] buffer, long offset, Mailbox* mailbox = null); ... } interface SeekableOutputStream : SeekableStream, OutputStream { ... // returns number of bytes written // throws on error size_t write(const(ubyte)[] buffer, long offset); // writes from from a specified offset AsyncWriteFromOffsetRequest writeAsync(const(ubyte)[] buffer, long offset, Mailbox* mailbox = null); ... } Although it complicates an API quite a bit, it has its advantages, too. However, not all Stream types support reading/writing from an offset (e.g. SocketStream doesn't).
Oct 14 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Fri, 15 Oct 2010 20:06:47 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

I confirm byLine is faster than getline after translating getline to D. byLine does crash intermittently so it has a bug somewhere, but I assume fixing it won't reduce its performance significantly. Results on my system on a 20x larger test file (obtained with repeat 20 { cat WindowsUpdate.log >> big.log } && mv big.log WindowsUpdate.log) are (median of 5): 1.759 vs. 1.516 seconds (14% improvement). Andrei

With introduction of consume (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars. &article_id=119414) it got another 40% improvement for counting lines and chars (it doesn't copy anything at all now).
Oct 15 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 15 Oct 2010 00:37:55 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 21:58 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not
 counting .dup etc.) Anyhow, let's do this - write down your
 interfaces so I can comment on them. We talk "oh that's a buffering
 interface" and "that requires buffering" and "that's an extra copy"
 and so on but we have little concrete contenders. I put my cards on
 the table, you put yours.

I'll see if I can put something together.

Here's a rough outline:

Thanks!
 enum Anchor
 {
 Begin,
 Current,
 End
 }

 interface Seek
 {
 ulong seek(long delta, Anchor whence);
 final ulong tell() { return seek(0, Anchor.Current); }
 bool seekable(); // define as false if seeking is not supported, true  
 if it
 // is supported (this doesn't necessarily mean a seek will
 // succeed).
 }

So far so good.
 interface InputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.
 }

No way to check for end of stream except by reading some of it?

This is often the only way in the low level interface, and since we have no buffer at this point, yes, it's required to read. How do you implement EOF without a buffer to hold the data you tried to read to see if you were at EOF? It might be feasible to ask for EOF on the buffered version, but I still think it's not necessary.
 // defined to implement either a D buffered object or wrap a FILE *.
 //
 interface BufferedInputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.

Since this method has the same sig, why doesn't BufferedInputTransport inherit InputTransport?

I thought of that, but then a buffered input class could accept a buffered input transport interface as its low-level implementation, so then you have unnecessarily double-buffered streams.
 // read data into the buffer until the delegate returns other than ~0
 //
 // The delegate is passed the entire buffer so far, with the start of  
 the
 // new data just read. It returns other than ~0 when it determines the  
 end
 // of the data in question.
 //
 ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process);

How does the delegate say "you know what, I'm fine with the first 1000 bytes of the data; please take the rest of 1048 back"? Is that the result of the delegate? The process feels a bit odd.

You are using the internal buffer, no copying is necessary. So all that happens is the read position is moved up 1000 bytes and the first 1000 bytes is returned.
 // same as readUntil except append to the given arr, Any excess
 // data will be pushed into the internal buffer.
 //
 size_t appendUntil(uint delegate(ubyte[] data, uint start) process, ref
 ubyte[] arr)

So indeed the delegate seems to return the length it wants to keep? And the rest would be copied back into the stream's internal buffers? I'm not sure I understand this API.

Yes, this involves a copy of the data you aren't interested in, but how else could you do it? You can't know "hey this data is not going to satisfy the condition, so I'll preemptively read it into the buffer instead".
 // various buffer functions.
  property size_t bufsize();
  property size_t readable();
 // etc.

Can one set bufsize?

Probably in the etc. functions ;) I left that up in the air, because I haven't given full thought to a buffer implementation.
 }

 The way I see it working is, there are two implementations for
 BufferedInputTransport: FILEInputTransport and DBufferInputTransport.
 There are numerous implementations of InputTransport, each of which can
 be passed to the DBufferInputTransport, which uses its own buffer
 implementation. For example, a network socket, file, inter-thread
 stream, an array, etc.

 This way, you can play nice with C's stdio when necessary (i.e. for
 stdin/stdout/stderr) and avoid the FILE limitations and performance
 issues otherwise.

I'm a bit unclear on the delegate stuff, but it's promising because it could be quite flexible. But I wouldn't want to aggravate the users with an API that's difficult to use. Could you please give a few examples using delegates that implement common patterns - e.g. readline and readDelim?

Sure, readline is probably easiest. Note, I'll assume for this example that \n signifies a line, windows would be slightly more difficult but doesn't really improve the example clarity, just assume it can also be done: char[] readline(BufferedInputTransport trans, bool makeCopy=false) { uint checkForNL(ubyte[] data, uint start) { char[] d = cast(char[])data; // need to switch to utf8 foreach(i, dchar d; d[start..$]) { if(d == '\n') return i + 1 + start; // consume including the newline } return ~0; } ubyte[] result; if(makeCopy) trans.read(&checkForNL, result); else result = trans.read(&checkForNL); auto cresult = cast(char[])result; // don't include any newline read if(cresult[$-1] == '\n') cresult = cresult[0..$-1]; return result; } If you specify makeCopy is true, the resulting data is unique and can be used wherever. Otherwise, the resulting data is actually the buffer of the BufferedInputTransport stream, and shouldn't be saved as it may be reused. One can easily make a range based on this as well. -Steve
Oct 15 2010