digitalmars.D - Streaming transport interfaces: input

Andrei Alexandrescu (56/57) Oct 14 2010 Here's the input transport layer. Transport has no concern for

Pelle (7/64) Oct 14 2010 Why is positionable in TransportBase? Should it not be a separate interf...

Andrei Alexandrescu (6/12) Oct 14 2010 Interfaces can't have overridable template methods.

anon (6/74) Oct 14 2010 I can't understand why we need binary vs. text modes. Seems to me a sill...

Andrei Alexandrescu (5/12) Oct 14 2010 A variety of Internet protocols are text-based. Also, for example I use

anon (3/19) Oct 14 2010 All I'm saying is that instead of treating text-based protocols speciall...
Andrej Mitrovic (2/4) Oct 14 2010

Steven Schveighoffer (18/73) Oct 14 2010 Please, use the term "seek", and allow an anchor. Every OS allows this,...

Andrei Alexandrescu (23/42) Oct 14 2010 I've always thought that's a crappy appendix. Every OS that ever allows

Denis Koroskin (16/59) Oct 14 2010 appendDelim *requires* buffering for to be implemented. No OS provides a...

Andrei Alexandrescu (26/43) Oct 14 2010 Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires

Denis Koroskin (16/60) Oct 14 2010 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars....

Andrei Alexandrescu (6/88) Oct 14 2010 Your function calls fopen() and does not disable buffering by means of

Denis Koroskin (7/106) Oct 14 2010 In my original version there was a setbuf(f, null) call. I removed it

Andrei Alexandrescu (11/16) Oct 14 2010 Yah, I looked more into the behavior of fread. At least on a couple of

Denis Koroskin (17/33) Oct 14 2010 Yeah, I've read that.

Andrei Alexandrescu (8/24) Oct 15 2010 I confirm byLine is faster than getline after translating getline to D.

Denis Koroskin (6/33) Oct 15 2010 With introduction of consume

Andrei Alexandrescu (4/42) Oct 15 2010 Then I guess... put me on the bandwagon! Thanks for the great job on

Steven Schveighoffer (4/8) Oct 14 2010 No it doesn't. Tango's buffer does not require extra copying.

Steven Schveighoffer (52/94) Oct 14 2010 Um.. yuck. We need to use two system calls to seek 100 bytes?

Andrei Alexandrescu (32/124) Oct 14 2010 Yes.

Steven Schveighoffer (32/158) Oct 14 2010 seek *is* a system call (lseek64). tell is simply seeking with (0,

Michel Fortin (12/18) Oct 14 2010 Would be nice if we could do all this:

Steven Schveighoffer (6/18) Oct 14 2010 You could get close to this with some nifty type construction, but I'm

Steven Schveighoffer (59/66) Oct 14 2010 Here's a rough outline:

Andrei Alexandrescu (19/80) Oct 14 2010 So far so good.

Steven Schveighoffer (53/141) Oct 15 2010 This is often the only way in the low level interface, and since we have...

Dmitry Olshansky (8/34) Oct 14 2010 Actually C Run-time library provides something very limited from the

Andrei Alexandrescu (14/14) Oct 14 2010 With generic element type:

Denis Koroskin (9/23) Oct 14 2010 This interface requires buffering and thus needs to be built on top of t...

Brad Roberts (4/11) Oct 14 2010 I suggest having read and write with offset style apis. See also: man ...

Andrei Alexandrescu (4/15) Oct 14 2010 This is very interesting - I had no idea about pread and pwrite.
Denis Koroskin (30/45) Oct 14 2010 In my original API proposal I included reading from offsets:

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

Here's the input transport layer. Transport has no concern for 
formatting - it just moves stuff around.

interface TransportBase
{
     /// True if stream has no more data.
      property bool empty();
     /// True if the position property works
      property bool positionable();
     /// Gives current position in stream
      property ulong position();
     /// Moves stream to absolute position pos
      property void position(ulong pos);
}

interface InputBinaryTransport : TransportBase
{
     /// Reads up to target.length bytes into target,
     /// returns size of data read. If target.length < 1 throws.
     size_t read(ubyte[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number
     /// of bytes read.
     size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter);
}

interface InputTextTransport : TransportBase
{
     /// Returns the native character size: 1 for UTF-8, 2 for UTF-16,
     /// 4 for UTF-32, 0 for nonstandard native encodings.
      property size_t nativeUTFWidth();

     /// Reads complete UTF code points into target, without going
     /// beyond target.length. Returns number of code units read.
     /// If target.length < 4 throws.
     size_t read(char[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number of
     /// code units read.
     size_t appendDelim(ref char[] target, in char[] delimiter);

     /// Reads complete UTF code points into target, without going
     /// beyond target.length. Returns number of code units read.
     /// If target.length < 2 throws.
     size_t read(wchar[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number of
     /// code units read.
     size_t appendDelim(ref wchar[] target, in wchar[] delimiter);

     /// Reads complete UTF code points into target, without going
     /// beyond target.length. Returns number of code units read.
     /// If target.length < 1 throws.
     size_t read(dchar[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number of
     /// code units read.
     size_t appendDelim(ref dchar[] target, in dchar[] delimiter);
}

Please destroy.


Andrei

Oct 14 2010

Pelle <pelle.mansson gmail.com> writes:

On 10/14/2010 05:34 PM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

 Here's the input transport layer. Transport has no concern for
 formatting - it just moves stuff around.

 interface TransportBase
 {
 /// True if stream has no more data.
  property bool empty();
 /// True if the position property works
  property bool positionable();
 /// Gives current position in stream
  property ulong position();
 /// Moves stream to absolute position pos
  property void position(ulong pos);
 }

 interface InputBinaryTransport : TransportBase
 {
 /// Reads up to target.length bytes into target,
 /// returns size of data read. If target.length < 1 throws.
 size_t read(ubyte[] target);
 /// Appends to target up until delimiter is found or end of stream.
 /// The delimiter is included in the target. Returns number
 /// of bytes read.
 size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter);
 }

 interface InputTextTransport : TransportBase
 {
 /// Returns the native character size: 1 for UTF-8, 2 for UTF-16,
 /// 4 for UTF-32, 0 for nonstandard native encodings.
  property size_t nativeUTFWidth();

 /// Reads complete UTF code points into target, without going
 /// beyond target.length. Returns number of code units read.
 /// If target.length < 4 throws.
 size_t read(char[] target);
 /// Appends to target up until delimiter is found or end of stream.
 /// The delimiter is included in the target. Returns number of
 /// code units read.
 size_t appendDelim(ref char[] target, in char[] delimiter);

 /// Reads complete UTF code points into target, without going
 /// beyond target.length. Returns number of code units read.
 /// If target.length < 2 throws.
 size_t read(wchar[] target);
 /// Appends to target up until delimiter is found or end of stream.
 /// The delimiter is included in the target. Returns number of
 /// code units read.
 size_t appendDelim(ref wchar[] target, in wchar[] delimiter);

 /// Reads complete UTF code points into target, without going
 /// beyond target.length. Returns number of code units read.
 /// If target.length < 1 throws.
 size_t read(dchar[] target);
 /// Appends to target up until delimiter is found or end of stream.
 /// The delimiter is included in the target. Returns number of
 /// code units read.
 size_t appendDelim(ref dchar[] target, in dchar[] delimiter);
 }

 Please destroy.


 Andrei

Why is positionable in TransportBase? Should it not be a separate interface?

Also, shouldn't the reads accept any output range? appendDelim as well. 
This would do away with the two extra versions in the text input.

Why does the reads throw on maybe-too-small arrays? I would expect them 
to read at best effort and return 0 if they fail. As they read 0 bytes.

I am no expert on this subject. Disclaimer, etc :-)

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/2010 11:43 AM, Pelle wrote:
 Why is positionable in TransportBase? Should it not be a separate
 interface?

It could.

 Also, shouldn't the reads accept any output range? appendDelim as well.
 This would do away with the two extra versions in the text input.

Interfaces can't have overridable template methods.

 Why does the reads throw on maybe-too-small arrays? I would expect them
 to read at best effort and return 0 if they fail. As they read 0 bytes.

Because small arrays would make it (a) impossible to distinguish between 
no-op and end of file and (b) impossible to read one UTF code point.


Andrei

Oct 14 2010

anon <foo bar.com> writes:

Andrei Alexandrescu Wrote:

 Starting a new thread from Denis' question:
 
 Can we outline basic Stream interface now so that we could move on?

 
 Here's the input transport layer. Transport has no concern for 
 formatting - it just moves stuff around.
 
 interface TransportBase
 {
      /// True if stream has no more data.
       property bool empty();
      /// True if the position property works
       property bool positionable();
      /// Gives current position in stream
       property ulong position();
      /// Moves stream to absolute position pos
       property void position(ulong pos);
 }
 
 interface InputBinaryTransport : TransportBase
 {
      /// Reads up to target.length bytes into target,
      /// returns size of data read. If target.length < 1 throws.
      size_t read(ubyte[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number
      /// of bytes read.
      size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter);
 }
 
 interface InputTextTransport : TransportBase
 {
      /// Returns the native character size: 1 for UTF-8, 2 for UTF-16,
      /// 4 for UTF-32, 0 for nonstandard native encodings.
       property size_t nativeUTFWidth();
 
      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 4 throws.
      size_t read(char[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref char[] target, in char[] delimiter);
 
      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 2 throws.
      size_t read(wchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref wchar[] target, in wchar[] delimiter);
 
      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 1 throws.
      size_t read(dchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref dchar[] target, in dchar[] delimiter);
 }
 
 Please destroy.
 
 
 Andrei

I can't understand why we need binary vs. text modes. Seems to me a silly
design decision from the old days when we used 7-bit terminals or something.
This always complicates the APIs for no reason. 

I'd like to see a two layer design:
 - lower UNTYPED layer that deals with transporting bits and bytes and deals
with lower level concerns like byte order. 
 - higher level extend-able TYPED layer with formatters/encoders. 

Also, the higher level should be compose-able such that I could read from a
gzipped xml utf8 file

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/2010 12:15 PM, anon wrote:
 I can't understand why we need binary vs. text modes. Seems to me a
 silly design decision from the old days when we used 7-bit terminals
 or something. This always complicates the APIs for no reason.

A variety of Internet protocols are text-based. Also, for example I use 
a text console everyday.

 I'd like to see a two layer design: - lower UNTYPED layer that deals
 with transporting bits and bytes and deals with lower level concerns
 like byte order. - higher level extend-able TYPED layer with
 formatters/encoders.

This would be the untyped lower level layer.

Andrei

Oct 14 2010

anon <foo bar.com> writes:

Andrei Alexandrescu Wrote:

 On 10/14/2010 12:15 PM, anon wrote:
 I can't understand why we need binary vs. text modes. Seems to me a
 silly design decision from the old days when we used 7-bit terminals
 or something. This always complicates the APIs for no reason.

 
 A variety of Internet protocols are text-based. Also, for example I use 
 a text console everyday.
 
 I'd like to see a two layer design: - lower UNTYPED layer that deals
 with transporting bits and bytes and deals with lower level concerns
 like byte order. - higher level extend-able TYPED layer with
 formatters/encoders.

 
 This would be the untyped lower level layer.
 
 Andrei

All I'm saying is that instead of treating text-based protocols specially at
the lower level, they should be on the higher level typed as UTF-8 or whatever.
They should be treated like any other protocol. 
E.g. I see no difference between a text file encoded in UTF8 and an image file
encoded via jpeg.

Oct 14 2010

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Now say that 10 times fast!

On 10/14/10, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 ..lower level layer.

 Andrei

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

 Here's the input transport layer. Transport has no concern for  
 formatting - it just moves stuff around.

 interface TransportBase
 {
      /// True if stream has no more data.
       property bool empty();
      /// True if the position property works
       property bool positionable();
      /// Gives current position in stream
       property ulong position();
      /// Moves stream to absolute position pos
       property void position(ulong pos);
 }

Please, use the term "seek", and allow an anchor.  Every OS allows this,  
it makes no sense not to provide it.

Other than that, I like the idea (assuming output transport is forthcoming  
and implements TransportBase).

 interface InputBinaryTransport : TransportBase
 {
      /// Reads up to target.length bytes into target,
      /// returns size of data read. If target.length < 1 throws.
      size_t read(ubyte[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number
      /// of bytes read.
      size_t appendDelim(ref ubyte[] target, in ubyte[] delimiter);
 }

I don't like appendDelim.  We don't need to define that until we have  
buffering.  The simple function of an input stream is to read data.  With  
buffering you get all the goodies that you want, but the buffer should be  
in control of its data buffer.

Basically, appendDelim can be defined outside this class, because the  
primitive read is enough.

 interface InputTextTransport : TransportBase
 {
      /// Returns the native character size: 1 for UTF-8, 2 for UTF-16,
      /// 4 for UTF-32, 0 for nonstandard native encodings.
       property size_t nativeUTFWidth();

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 4 throws.
      size_t read(char[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref char[] target, in char[] delimiter);

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 2 throws.
      size_t read(wchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref wchar[] target, in wchar[] delimiter);

      /// Reads complete UTF code points into target, without going
      /// beyond target.length. Returns number of code units read.
      /// If target.length < 1 throws.
      size_t read(dchar[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number of
      /// code units read.
      size_t appendDelim(ref dchar[] target, in dchar[] delimiter);
 }

Shouldn't the text transport be defined on top of the binary transport?   
And in any case, I'd expect buffering to go between the two.  If all you  
are adding are the different widths of characters, I don't think you need  
this extra layer.  It's going to make the buffering layer more difficult  
to implement (now it must handle both a text version and abinary version).

-Steve

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

I've always thought that's a crappy appendix. Every OS that ever allows 
seek/tell with anchors allows ALL anchors, and always allows either both 
or none of seek and tell. So I decided to cut through the crap and 
simplify. You want to seek 100 bytes from here, you write 
stream.position = stream.position + 100.

Oh, that reminds me I need to provide length as a property as well. This 
would save us crap like seek(0, SEEK_END); ftell() to figure out the 
length of a file.

I have no sympathy for seek and tell with anchors.

 I don't like appendDelim. We don't need to define that until we have
 buffering.

Why?

 The simple function of an input stream is to read data.

It does read data.

 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

I think the appendDelim method allows fast and simple implementations of 
a variety of patterns. As I (thought I) have shown elsethread, without 
appendDelim there's no way to efficiently implement a line-oriented 
stream on top of a block-oriented one.

 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

You can only define it if you accept extra copying. I'd say one extra 
interface function is acceptable for fast I/O.

 Shouldn't the text transport be defined on top of the binary transport?

No, because there are transports that genuinely do not accept binary data.

 And in any case, I'd expect buffering to go between the two.

How do you define buffering? Would a buffered transport implement a 
different interface?

 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

I don't understand this.


Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 14 Oct 2010 21:39:03 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

 I've always thought that's a crappy appendix. Every OS that ever allows  
 seek/tell with anchors allows ALL anchors, and always allows either both  
 or none of seek and tell. So I decided to cut through the crap and  
 simplify. You want to seek 100 bytes from here, you write  
 stream.position = stream.position + 100.

 Oh, that reminds me I need to provide length as a property as well. This  
 would save us crap like seek(0, SEEK_END); ftell() to figure out the  
 length of a file.

 I have no sympathy for seek and tell with anchors.

 I don't like appendDelim. We don't need to define that until we have
 buffering.

 Why?

 The simple function of an input stream is to read data.

 It does read data.

 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

 I think the appendDelim method allows fast and simple implementations of  
 a variety of patterns. As I (thought I) have shown elsethread, without  
 appendDelim there's no way to efficiently implement a line-oriented  
 stream on top of a block-oriented one.

 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

 You can only define it if you accept extra copying. I'd say one extra  
 interface function is acceptable for fast I/O.

 Shouldn't the text transport be defined on top of the binary transport?

 No, because there are transports that genuinely do not accept binary  
 data.

 And in any case, I'd expect buffering to go between the two.

 How do you define buffering? Would a buffered transport implement a  
 different interface?

 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

 I don't understand this.


 Andrei

appendDelim *requires* buffering for to be implemented. No OS provides an  
API to read from a file (be it pipe, socket, whatever) to read up to some  
abstract delimiter. It *always* reads in blocks. As such, if you need to  
read until a delimeter, you need to fetch block to some internal buffer,  
MANUALLY search through it and THEN copy to output string. I've  
implemented that on top of chunked read interface, and it was 5% faster  
than getline()/getdelim() that GNU libc provides (despite you claming it  
to be "many times faster"). It's not.

Buffering requires and additional level of data copying, and this is bad  
for fast I/O. If you need fast I/O or must pull that out of the stream  
interface. Otherwise chunked read will be less efficient due to additional  
copies to and from buffers.

On the contrary line-based reading can be implemented on top of the  
chunked read without sacrificing a tiny bit of efficiency.

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 12:56 CDT, Denis Koroskin wrote:
 appendDelim *requires* buffering for to be implemented. No OS provides
 an API to read from a file (be it pipe, socket, whatever) to read up to
 some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires 
buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 
512 bytes or whatever), so ANY protocol that allows the user to set the 
maximum bytes to read will require buffering and copying. So how is 
appendDelim worse than read?

 As such, if you
 need to read until a delimeter, you need to fetch block to some internal
 buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.

 I've
 implemented that on top of chunked read interface, and it was 5% faster
 than getline()/getdelim() that GNU libc provides (despite you claming it
 to be "many times faster"). It's not.

Please post your code.

 Buffering requires and additional level of data copying, and this is bad
 for fast I/O.

Agreed. But then you define routines that also requires buffering. How 
do you reconcile your own requirement with your own interface?

 If you need fast I/O or must pull that out of the stream
 interface. Otherwise chunked read will be less efficient due to
 additional copies to and from buffers.

 On the contrary line-based reading can be implemented on top of the
 chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying.

appendDelim implementation:

1. Low-level read in internal buffers

2. Search for delimiter (assume found for simplicity)

3. Resize user buffer

4. Copy

That's one copy, with the necessary corner cases when the delimiter 
isn't found yet etc. (which increase copying ONLY if the buffer is 
actually moved when reallocated).

The implementation in your message on 10/13/2010 21:20 CDT:

1. Low-level read in internal buffers

2. Copy from internal buffers into the internal buffer provided by your 
ByLine implementation

3. Copy from the internal buffer of ByLine into the user-supplied buffer

That's two copies. Agreed?


Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

On 10/14/10 12:56 CDT, Denis Koroskin wrote:
appendDelim *requires* buffering for to be implemented. No OS provides
an API to read from a file (be it pipe, socket, whatever) to read up to
some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO requires
buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at
512 bytes or whatever), so ANY protocol that allows the user to set the
maximum bytes to read will require buffering and copying. So how is
appendDelim worse than read?

As such, if you
need to read until a delimeter, you need to fetch block to some internal
buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.

I've
implemented that on top of chunked read interface, and it was 5% faster
than getline()/getdelim() that GNU libc provides (despite you claming it
to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248

Buffering requires and additional level of data copying, and this is bad
for fast I/O.

Agreed. But then you define routines that also requires buffering. How
do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when
you need to buffer something, but in general you don't. My Stream
interface is simply a thin portable layer on top of OS. See the code above
for simple implementation that is built on top of fopen/fread (I used
open/read initially but it gave 0 improvement so I went back to
fopen/fread because GNU libc line-input uses them, too, so that would be
the most fair comparison). It can't be any more efficient than that.

If you need fast I/O or must pull that out of the stream
interface. Otherwise chunked read will be less efficient due to
additional copies to and from buffers.

On the contrary line-based reading can be implemented on top of the
chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying.

appendDelim implementation:

1. Low-level read in internal buffers

2. Search for delimiter (assume found for simplicity)

3. Resize user buffer

4. Copy

That's one copy, with the necessary corner cases when the delimiter
isn't found yet etc. (which increase copying ONLY if the buffer is
actually moved when reallocated).

The implementation in your message on 10/13/2010 21:20 CDT:

1. Low-level read in internal buffers

2. Copy from internal buffers into the internal buffer provided by your
ByLine implementation

3. Copy from the internal buffer of ByLine into the user-supplied buffer

That's two copies. Agreed?

Andrei

I'm not sure what message are you talking about (first or second one).
Second one
(http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.
&article_id=119248)
makes a chunked read to internal buffer (if not filled yet), then searches
for a delimiter and then copies to a user-provided buffer. That's one copy
in most cases. And that's what GNU libc does, too.

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 14:00 CDT, Denis Koroskin wrote:
On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

Clear. What may be not so clear is that read(ubyte[] buf) ALSO
requires buffering. Disk I/O comes in fixed buffer sizes (sometimes
aligned at 512 bytes or whatever), so ANY protocol that allows the
user to set the maximum bytes to read will require buffering and
copying. So how is appendDelim worse than read?

As such, if you
need to read until a delimeter, you need to fetch block to some internal
buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.

I've
implemented that on top of chunked read interface, and it was 5% faster
than getline()/getdelim() that GNU libc provides (despite you claming it
to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248

I meant the baseline.

Buffering requires and additional level of data copying, and this is bad
for fast I/O.

Agreed. But then you define routines that also requires buffering. How
do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when
you need to buffer something, but in general you don't. My Stream
interface is simply a thin portable layer on top of OS. See the code
above for simple implementation that is built on top of fopen/fread (I
used open/read initially but it gave 0 improvement so I went back to
fopen/fread because GNU libc line-input uses them, too, so that would be
the most fair comparison). It can't be any more efficient than that.

If you need fast I/O or must pull that out of the stream
interface. Otherwise chunked read will be less efficient due to
additional copies to and from buffers.

On the contrary line-based reading can be implemented on top of the
chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying.

appendDelim implementation:

1. Low-level read in internal buffers

2. Search for delimiter (assume found for simplicity)

3. Resize user buffer

4. Copy

That's one copy, with the necessary corner cases when the delimiter
isn't found yet etc. (which increase copying ONLY if the buffer is
actually moved when reallocated).

The implementation in your message on 10/13/2010 21:20 CDT:

1. Low-level read in internal buffers

2. Copy from internal buffers into the internal buffer provided by
your ByLine implementation

3. Copy from the internal buffer of ByLine into the user-supplied buffer

That's two copies. Agreed?

Andrei

I'm not sure what message are you talking about (first or second one).
Second one
(http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248)
makes a chunked read to internal buffer (if not filled yet), then
searches for a delimiter and then copies to a user-provided buffer.
That's one copy in most cases. And that's what GNU libc does, too.

Your function calls fopen() and does not disable buffering by means of
setvbuf(). By default fopen() opens in buffered mode. Does the existence
of that buffer entail an extra copy?

Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 15 Oct 2010 02:01:55 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

On 10/14/10 14:00 CDT, Denis Koroskin wrote:
On Thu, 14 Oct 2010 22:22:07 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

On 10/14/10 12:56 CDT, Denis Koroskin wrote:
appendDelim *requires* buffering for to be implemented. No OS provides
an API to read from a file (be it pipe, socket, whatever) to read up
to
some abstract delimiter. It *always* reads in blocks.

Clear. What may be not so clear is that read(ubyte[] buf) ALSO
requires buffering. Disk I/O comes in fixed buffer sizes (sometimes
aligned at 512 bytes or whatever), so ANY protocol that allows the
user to set the maximum bytes to read will require buffering and
copying. So how is appendDelim worse than read?

As such, if you
need to read until a delimeter, you need to fetch block to some
internal
buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.

I've
implemented that on top of chunked read interface, and it was 5%
faster
than getline()/getdelim() that GNU libc provides (despite you claming
it
to be "many times faster"). It's not.

Please post your code.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119248

I meant the baseline.

Buffering requires and additional level of data copying, and this is
bad
for fast I/O.

Agreed. But then you define routines that also requires buffering. How
do you reconcile your own requirement with your own interface?

My interface doesn't require any additional copying. You only copy when
you need to buffer something, but in general you don't. My Stream
interface is simply a thin portable layer on top of OS. See the code
above for simple implementation that is built on top of fopen/fread (I
used open/read initially but it gave 0 improvement so I went back to
fopen/fread because GNU libc line-input uses them, too, so that would be
the most fair comparison). It can't be any more efficient than that.

If you need fast I/O or must pull that out of the stream
interface. Otherwise chunked read will be less efficient due to
additional copies to and from buffers.

On the contrary line-based reading can be implemented on top of the
chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying.

appendDelim implementation:

1. Low-level read in internal buffers

2. Search for delimiter (assume found for simplicity)

3. Resize user buffer

4. Copy

That's one copy, with the necessary corner cases when the delimiter
isn't found yet etc. (which increase copying ONLY if the buffer is
actually moved when reallocated).

The implementation in your message on 10/13/2010 21:20 CDT:

1. Low-level read in internal buffers

2. Copy from internal buffers into the internal buffer provided by
your ByLine implementation

3. Copy from the internal buffer of ByLine into the user-supplied
buffer

That's two copies. Agreed?

Andrei

Your function calls fopen() and does not disable buffering by means of
setvbuf(). By default fopen() opens in buffered mode. Does the existence
of that buffer entail an extra copy?

Andrei

In my original version there was a setbuf(f, null) call. I removed it
because it had 0 impact on performance.
I also tried using unistd open/read functions, that had zero impact, too
(btw, opening file with O_DIRECT returned valid file descriptor, but read
operations very failing with an invalid argument error).

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/2010 07:57 PM, Denis Koroskin wrote:
 In my original version there was a setbuf(f, null) call. I removed it
 because it had 0 impact on performance.
 I also tried using unistd open/read functions, that had zero impact, too

Yah, I looked more into the behavior of fread. At least on a couple of 
implementations that offer source, the implementation first copies 
whatever data is (if any) in the internal buffer, and then read the rest 
of the data straight into the remaining user-supplied buffer.

This means, if you consistently fread into a buffer, the internal 
buffers are never touched.

I stand corrected regarding the extra copy.

 (btw, opening file with O_DIRECT returned valid file descriptor, but
 read operations very failing with an invalid argument error).

I recall I've read a discussion where Linus was really down on O_DIRECT. 
Found it: http://kerneltrap.org/node/7563


Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 07:57 PM, Denis Koroskin wrote:
 In my original version there was a setbuf(f, null) call. I removed it
 because it had 0 impact on performance.
 I also tried using unistd open/read functions, that had zero impact, too

 Yah, I looked more into the behavior of fread. At least on a couple of  
 implementations that offer source, the implementation first copies  
 whatever data is (if any) in the internal buffer, and then read the rest  
 of the data straight into the remaining user-supplied buffer.

 This means, if you consistently fread into a buffer, the internal  
 buffers are never touched.

 I stand corrected regarding the extra copy.

 (btw, opening file with O_DIRECT returned valid file descriptor, but
 read operations very failing with an invalid argument error).

 I recall I've read a discussion where Linus was really down on O_DIRECT.  
 Found it: http://kerneltrap.org/node/7563


 Andrei

Yeah, I've read that.

Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best of  
3 results for parsing 33Mb log file:

using getline (C++ -O2):
real 0m0.081s
user 0m0.068s
sys 0m0.016s

using byLine (D -O -release -inline):
real 0m0.067s
user 0m0.056s
sys 0m0.012s

That's a 20% difference. Source and test files here:
http://rapidshare.com/files/425154119/tests.7z  309k
http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no  
7z installed)

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

I confirm byLine is faster than getline after translating getline to D. 
byLine does crash intermittently so it has a bug somewhere, but I assume 
fixing it won't reduce its performance significantly.

Results on my system on a 20x larger test file (obtained with repeat 20 
{ cat WindowsUpdate.log >> big.log } && mv big.log WindowsUpdate.log) 
are (median of 5): 1.759 vs. 1.516 seconds (14% improvement).


Andrei

Oct 15 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 15 Oct 2010 20:06:47 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

 I confirm byLine is faster than getline after translating getline to D.  
 byLine does crash intermittently so it has a bug somewhere, but I assume  
 fixing it won't reduce its performance significantly.

 Results on my system on a 20x larger test file (obtained with repeat 20  
 { cat WindowsUpdate.log >> big.log } && mv big.log WindowsUpdate.log)  
 are (median of 5): 1.759 vs. 1.516 seconds (14% improvement).


 Andrei

With introduction of consume  
(http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.
&article_id=119414)  
it got another 40% improvement for counting lines and chars (it doesn't  
copy anything at all now).

Oct 15 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/15/10 11:11 CDT, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 20:06:47 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/2010 09:01 PM, Denis Koroskin wrote:
 On Fri, 15 Oct 2010 05:49:05 +0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Btw, I've re-run my tests, O_DIRECT works but also has no effect. Best
 of 3 results for parsing 33Mb log file:

 using getline (C++ -O2):
 real 0m0.081s
 user 0m0.068s
 sys 0m0.016s

 using byLine (D -O -release -inline):
 real 0m0.067s
 user 0m0.056s
 sys 0m0.012s

 That's a 20% difference. Source and test files here:
 http://rapidshare.com/files/425154119/tests.7z 309k
 http://rapidshare.com/files/425154408/tests.zip 6Mb (in case you have no
 7z installed)

 I confirm byLine is faster than getline after translating getline to
 D. byLine does crash intermittently so it has a bug somewhere, but I
 assume fixing it won't reduce its performance significantly.

 Results on my system on a 20x larger test file (obtained with repeat
 20 { cat WindowsUpdate.log >> big.log } && mv big.log
 WindowsUpdate.log) are (median of 5): 1.759 vs. 1.516 seconds (14%
 improvement).


 Andrei

 With introduction of consume
 (http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119414)
 it got another 40% improvement for counting lines and chars (it doesn't
 copy anything at all now).

Then I guess... put me on the bandwagon! Thanks for the great job on 
designing, implementing, and benchmarking.

Andrei

Oct 15 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 13:56:19 -0400, Denis Koroskin <2korden gmail.com>  
wrote:

 Buffering requires and additional level of data copying, and this is bad  
 for fast I/O. If you need fast I/O or must pull that out of the stream  
 interface. Otherwise chunked read will be less efficient due to  
 additional copies to and from buffers.

No it doesn't.  Tango's buffer does not require extra copying.

-Steve

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

 I've always thought that's a crappy appendix. Every OS that ever allows  
 seek/tell with anchors allows ALL anchors, and always allows either both  
 or none of seek and tell. So I decided to cut through the crap and  
 simplify. You want to seek 100 bytes from here, you write  
 stream.position = stream.position + 100.

Um.. yuck.   We need to use two system calls to seek 100 bytes?

 Oh, that reminds me I need to provide length as a property as well. This  
 would save us crap like seek(0, SEEK_END); ftell() to figure out the  
 length of a file.

So now you need to do stream.position = stream.length to seek to the end  
of the file instead of stream.seek(0, Anchor.END)?  Plus, how will you  
implement length, probably like this:

auto curpos = seek(0, SEEK_CUR);
auto len = seek(0, SEEK_END);
seek(curpos, SEEK_BEG);
return len;

So that looks like 3 system calls instead of one, plus you just wasted  
time seeking back to the current position.

Now, I do like the syntax better, but we are not going for syntax here, we  
are going for performance.  If you can find a way to merge the two, I'd be  
happy with it.

 I have no sympathy for seek and tell with anchors.

Sympathy has nothing to do with it.  It's the simple fact that you have to  
deal with the OS, and munging your interface on top of it means more  
system calls and less performance.

 I don't like appendDelim. We don't need to define that until we have
 buffering.

 Why?

Because appendDelim deals with buffering.  If I defined a buffered stream,  
I'd include a function like this:

size_t read(bool delegate(T[] data) sink);

which buffers data until sink returned false (passing each read chunk into  
sink), extending the buffer as necessary.

Then it's trivial to implement readDelim on top of this.

 The simple function of an input stream is to read data.

 It does read data.

I mean, that's *all* it should do.  It should not be appending to buffers.

 With
 buffering you get all the goodies that you want, but the buffer should
 be in control of its data buffer.

 I think the appendDelim method allows fast and simple implementations of  
 a variety of patterns. As I (thought I) have shown elsethread, without  
 appendDelim there's no way to efficiently implement a line-oriented  
 stream on top of a block-oriented one.

Um... the read system call is the same interface as the proposed  
block-oriented interface.  How are you avoiding using system calls?

 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

 You can only define it if you accept extra copying. I'd say one extra  
 interface function is acceptable for fast I/O.

No, you can define it without extra copying.  If you don't allow direct  
access to the buffer, then you have extra copying.  But we don't have to  
mimic C here.  We should not be encouraging constant reinventing of the  
buffer wheel here.  Buffering is a well-defined task that can be  
implemented once.

Just as a note, Tango does this, and it's very fast.  There is certainly  
no extra copying there.

 Shouldn't the text transport be defined on top of the binary transport?

 No, because there are transports that genuinely do not accept binary  
 data.

I mean, a text transport uses a binary transport underneath.  What text  
transport doesn't use a binary transport to do its dirty work?  And what  
exactly does a text transport do so differently that it needs to be a  
separate interface?

In other words, if 90% of the text transport duplicates the binary  
transport, I see an opportunity for consolidation.

 And in any case, I'd expect buffering to go between the two.

 How do you define buffering? Would a buffered transport implement a  
 different interface?

Yes, but if we expect to reuse code, I'd expect a buffered transport to  
use a primitive transport underneath for actually reading/writing data.   
If you have multiple versions of the class that actually reads/writes data  
(such as binary vs. text), then the buffer which uses it must support all  
of them.

Text based processing to me seems to be a buffered activity (reading  
lines, ensuring you don't have sliced utf-8 data, etc.).

 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

 I don't understand this.

buffer uses a transport.  If you have two different transport interfaces,  
the buffer must support them both.  And if the benefit is, one simply  
defines [w|d]char versions of read, then we haven't gained much for the  
trouble of having to support both.

-Steve

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

 I've always thought that's a crappy appendix. Every OS that ever
 allows seek/tell with anchors allows ALL anchors, and always allows
 either both or none of seek and tell. So I decided to cut through the
 crap and simplify. You want to seek 100 bytes from here, you write
 stream.position = stream.position + 100.

 Um.. yuck. We need to use two system calls to seek 100 bytes?

seek and tell don't always issue system calls.

 Oh, that reminds me I need to provide length as a property as well.
 This would save us crap like seek(0, SEEK_END); ftell() to figure out
 the length of a file.

 So now you need to do stream.position = stream.length to seek to the end
 of the file instead of stream.seek(0, Anchor.END)?

Yes.

 Plus, how will you
 implement length, probably like this:
 auto curpos = seek(0, SEEK_CUR);
 auto len = seek(0, SEEK_END);
 seek(curpos, SEEK_BEG);
 return len;

Depends. For files, you can just use stat.

 So that looks like 3 system calls instead of one, plus you just wasted
 time seeking back to the current position.

Well again they don't always issue system calls, but point taken. I do 
see a need for fast positioning at end of stream. Perhaps we could 
accommodate an enum equal to ulong.max such that this goes to the end of 
stream:

stream.position = StreamBase.atEnd;

 I don't like appendDelim. We don't need to define that until we have
 buffering.

 Why?

 Because appendDelim deals with buffering. If I defined a buffered
 stream, I'd include a function like this:

 size_t read(bool delegate(T[] data) sink);

 which buffers data until sink returned false (passing each read chunk
 into sink), extending the buffer as necessary.

 Then it's trivial to implement readDelim on top of this.

Interesting. But that would still force readDelim to store leftover bytes.

 The simple function of an input stream is to read data.

 It does read data.

 I mean, that's *all* it should do. It should not be appending to buffers.

This comes from a practical need. I've often had a buffer and wanted to 
read one more line into it, keeping the existing content. It was 
impossible without extra allocation and copying.

 I think the appendDelim method allows fast and simple implementations
 of a variety of patterns. As I (thought I) have shown elsethread,
 without appendDelim there's no way to efficiently implement a
 line-oriented stream on top of a block-oriented one.

 Um... the read system call is the same interface as the proposed
 block-oriented interface. How are you avoiding using system calls?

I think we don't have the same definition for "system call". For example 
by my definition fread is NOT a system call.

 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

 You can only define it if you accept extra copying. I'd say one extra
 interface function is acceptable for fast I/O.

 No, you can define it without extra copying.

How? Denis' implementation has two copies in the mix. (I'm not counting 
.dup etc.) Anyhow, let's do this - write down your interfaces so I can 
comment on them. We talk "oh that's a buffering interface" and "that 
requires buffering" and "that's an extra copy" and so on but we have 
little concrete contenders. I put my cards on the table, you put yours.

 If you don't allow direct
 access to the buffer, then you have extra copying. But we don't have to
 mimic C here. We should not be encouraging constant reinventing of the
 buffer wheel here. Buffering is a well-defined task that can be
 implemented once.

 Just as a note, Tango does this, and it's very fast. There is certainly
 no extra copying there.

 Shouldn't the text transport be defined on top of the binary transport?

 No, because there are transports that genuinely do not accept binary
 data.

 I mean, a text transport uses a binary transport underneath. What text
 transport doesn't use a binary transport to do its dirty work? And what
 exactly does a text transport do so differently that it needs to be a
 separate interface?

A text transport does not accept raw binary data and requires e.g. 
Base64 encoding (e.g. mail attachments do that). The console is a text 
device - makes no sense to dump binary data on it. A JSON encoder is 
also a text transport.

 In other words, if 90% of the text transport duplicates the binary
 transport, I see an opportunity for consolidation.

Consolidation brings simplification, which is good. But I believe there 
exist text entities that do make the distinction worthwhile.

 And in any case, I'd expect buffering to go between the two.

 How do you define buffering? Would a buffered transport implement a
 different interface?

 Yes, but if we expect to reuse code, I'd expect a buffered transport to
 use a primitive transport underneath for actually reading/writing data.
 If you have multiple versions of the class that actually reads/writes
 data (such as binary vs. text), then the buffer which uses it must
 support all of them.

 Text based processing to me seems to be a buffered activity (reading
 lines, ensuring you don't have sliced utf-8 data, etc.).

Yes. What may be not so obvious is that binary processing with 
user-imposed data lenghts is ALSO a buffer activity. This is because the 
low-level buffers do NOT come at arbitrary positions (alignment 
restrictions) and to NOT come at arbitrary lengths.

 If all you
 are adding are the different widths of characters, I don't think you
 need this extra layer. It's going to make the buffering layer more
 difficult to implement (now it must handle both a text version and
 abinary version).

 I don't understand this.

 buffer uses a transport. If you have two different transport interfaces,
 the buffer must support them both. And if the benefit is, one simply
 defines [w|d]char versions of read, then we haven't gained much for the
 trouble of having to support both.

I'll be looking forward to seeing your interfaces.


Andrei

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows  
 this,
 it makes no sense not to provide it.

 I've always thought that's a crappy appendix. Every OS that ever
 allows seek/tell with anchors allows ALL anchors, and always allows
 either both or none of seek and tell. So I decided to cut through the
 crap and simplify. You want to seek 100 bytes from here, you write
 stream.position = stream.position + 100.

 Um.. yuck. We need to use two system calls to seek 100 bytes?

 seek and tell don't always issue system calls.

seek *is* a system call (lseek64).  tell is simply seeking with (0,  
SEEK_CUR).

Are we talking about the same thing?

 Oh, that reminds me I need to provide length as a property as well.
 This would save us crap like seek(0, SEEK_END); ftell() to figure out
 the length of a file.

 So now you need to do stream.position = stream.length to seek to the end
 of the file instead of stream.seek(0, Anchor.END)?

 Yes.

 Plus, how will you
 implement length, probably like this:
 auto curpos = seek(0, SEEK_CUR);
 auto len = seek(0, SEEK_END);
 seek(curpos, SEEK_BEG);
 return len;

 Depends. For files, you can just use stat.

True, but still, 2 system calls (stat and then seek).

 So that looks like 3 system calls instead of one, plus you just wasted
 time seeking back to the current position.

 Well again they don't always issue system calls, but point taken. I do  
 see a need for fast positioning at end of stream. Perhaps we could  
 accommodate an enum equal to ulong.max such that this goes to the end of  
 stream:

 stream.position = StreamBase.atEnd;

OK, that is acceptable.  What about seeking to N bytes before the end?

What about seeking N bytes ahead of the current position (as previously  
stated)?

 I don't like appendDelim. We don't need to define that until we have
 buffering.

 Why?

 Because appendDelim deals with buffering. If I defined a buffered
 stream, I'd include a function like this:

 size_t read(bool delegate(T[] data) sink);

 which buffers data until sink returned false (passing each read chunk
 into sink), extending the buffer as necessary.

 Then it's trivial to implement readDelim on top of this.

 Interesting. But that would still force readDelim to store leftover  
 bytes.

What does readDelim do with them in your implementation?  It must read  
data via a block read, that's the only thing the OS provides (via read  
system call), so what do you do, just return the extra data or throw it  
away?  You need to buffer it somewhere.

 The simple function of an input stream is to read data.

 It does read data.

 I mean, that's *all* it should do. It should not be appending to  
 buffers.

 This comes from a practical need. I've often had a buffer and wanted to  
 read one more line into it, keeping the existing content. It was  
 impossible without extra allocation and copying.

This can be accomodated via a buffer type.  Buffering provides everything  
you need to implement readDelim.

 I think the appendDelim method allows fast and simple implementations
 of a variety of patterns. As I (thought I) have shown elsethread,
 without appendDelim there's no way to efficiently implement a
 line-oriented stream on top of a block-oriented one.

 Um... the read system call is the same interface as the proposed
 block-oriented interface. How are you avoiding using system calls?

 I think we don't have the same definition for "system call". For example  
 by my definition fread is NOT a system call.

OK, now I think I see the issue :)  You are assuming we are implementing  
all of this on top of FILE *.  FILE * provides buffering already, so you  
are not avoiding buffering at all and you are unnecessarily using an  
extremely outdated interface.

BTW, fread eventually calls read, there's no way around it.

I think we can provide a version of the BufferedStream interface which  
uses C's FILE * for stdout/stdin/stderr (to play nice with C), but we  
should avoid FILE * for everything else.

 Basically, appendDelim can be defined outside this class, because the
 primitive read is enough.

 You can only define it if you accept extra copying. I'd say one extra
 interface function is acceptable for fast I/O.

 No, you can define it without extra copying.

 How? Denis' implementation has two copies in the mix. (I'm not counting  
 .dup etc.) Anyhow, let's do this - write down your interfaces so I can  
 comment on them. We talk "oh that's a buffering interface" and "that  
 requires buffering" and "that's an extra copy" and so on but we have  
 little concrete contenders. I put my cards on the table, you put yours.

I'll see if I can put something together.

 If you don't allow direct
 access to the buffer, then you have extra copying. But we don't have to
 mimic C here. We should not be encouraging constant reinventing of the
 buffer wheel here. Buffering is a well-defined task that can be
 implemented once.

 Just as a note, Tango does this, and it's very fast. There is certainly
 no extra copying there.

 Shouldn't the text transport be defined on top of the binary  
 transport?

 No, because there are transports that genuinely do not accept binary
 data.

 I mean, a text transport uses a binary transport underneath. What text
 transport doesn't use a binary transport to do its dirty work? And what
 exactly does a text transport do so differently that it needs to be a
 separate interface?

 A text transport does not accept raw binary data and requires e.g.  
 Base64 encoding (e.g. mail attachments do that). The console is a text  
 device - makes no sense to dump binary data on it. A JSON encoder is  
 also a text transport.

-- that writes to/reads from a binary transport.  A file is a binary  
transport.

My opinion is that a text reader/writer should be a class/struct that uses  
a binary transport.  If you are going to implement the binary transport  
stuff also inside the text transport, then it seems like an unnecessary  
duplication.

 And in any case, I'd expect buffering to go between the two.

 How do you define buffering? Would a buffered transport implement a
 different interface?

 Yes, but if we expect to reuse code, I'd expect a buffered transport to
 use a primitive transport underneath for actually reading/writing data.
 If you have multiple versions of the class that actually reads/writes
 data (such as binary vs. text), then the buffer which uses it must
 support all of them.

 Text based processing to me seems to be a buffered activity (reading
 lines, ensuring you don't have sliced utf-8 data, etc.).

 Yes. What may be not so obvious is that binary processing with  
 user-imposed data lenghts is ALSO a buffer activity. This is because the  
 low-level buffers do NOT come at arbitrary positions (alignment  
 restrictions) and to NOT come at arbitrary lengths.

You can still avoid copying.  I'll see if I can show you.

-Steve

Oct 14 2010

Michel Fortin <michel.fortin michelf.com> writes:

On 2010-10-14 16:47:13 -0400, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 stream.position = StreamBase.atEnd;

 
 OK, that is acceptable.  What about seeking to N bytes before the end?
 
 What about seeking N bytes ahead of the current position (as previously 
  stated)?

Would be nice if we could do all this:

	stream.position = $;
	stream.position = $-10;
	stream.position += 10;

The last one (+=) should be only one function call to avoid two system 
calls. All that's a little hard to implement though.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 19:30:21 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2010-10-14 16:47:13 -0400, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 stream.position = StreamBase.atEnd;

  OK, that is acceptable.  What about seeking to N bytes before the end?
  What about seeking N bytes ahead of the current position (as  
 previously  stated)?

 Would be nice if we could do all this:

 	stream.position = $;
 	stream.position = $-10;
 	stream.position += 10;

 The last one (+=) should be only one function call to avoid two system  
 calls. All that's a little hard to implement though.

You could get close to this with some nifty type construction, but I'm  
unsure it's worth all that when the whole thing can be had via one  
function call (seek).

-Steve

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not counting  
 .dup etc.) Anyhow, let's do this - write down your interfaces so I can  
 comment on them. We talk "oh that's a buffering interface" and "that  
 requires buffering" and "that's an extra copy" and so on but we have  
 little concrete contenders. I put my cards on the table, you put yours.

 I'll see if I can put something together.

Here's a rough outline:

enum Anchor
{
     Begin,
     Current,
     End
}

interface Seek
{
     ulong seek(long delta, Anchor whence);
     final ulong tell() { return seek(0, Anchor.Current); }
     bool seekable(); // define as false if seeking is not supported, true  
if it
                      // is supported (this doesn't necessarily mean a seek  
will
                      // succeed).
}

interface InputTransport : Seek
{
     size_t read(ubyte[] data); // returns 0 on EOF.
}

// defined to implement either a D buffered object or wrap a FILE *.
//
interface BufferedInputTransport : Seek
{
     size_t read(ubyte[] data); // returns 0 on EOF.

     //
     // read data into the buffer until the delegate returns other than ~0
     //
     // The delegate is passed the entire buffer so far, with the start of  
the
     // new data just read.  It returns other than ~0 when it determines  
the end
     // of the data in question.
     //
     ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process);

     //
     // same as readUntil except append to the given arr, Any excess
     // data will be pushed into the internal buffer.
     //
     size_t appendUntil(uint delegate(ubyte[] data, uint start) process,  
ref ubyte[] arr)

     // various buffer functions.
      property size_t bufsize();
      property size_t readable();
     // etc.
}

The way I see it working is, there are two implementations for  
BufferedInputTransport:  FILEInputTransport and DBufferInputTransport.   
There are numerous implementations of InputTransport, each of which can be  
passed to the DBufferInputTransport, which uses its own buffer  
implementation.  For example, a network socket, file, inter-thread stream,  
an array, etc.

This way, you can play nice with C's stdio when necessary (i.e. for  
stdin/stdout/stderr) and avoid the FILE limitations and performance issues  
otherwise.

-Steve

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 21:58 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not
 counting .dup etc.) Anyhow, let's do this - write down your
 interfaces so I can comment on them. We talk "oh that's a buffering
 interface" and "that requires buffering" and "that's an extra copy"
 and so on but we have little concrete contenders. I put my cards on
 the table, you put yours.

 I'll see if I can put something together.

 Here's a rough outline:

Thanks!

 enum Anchor
 {
 Begin,
 Current,
 End
 }

 interface Seek
 {
 ulong seek(long delta, Anchor whence);
 final ulong tell() { return seek(0, Anchor.Current); }
 bool seekable(); // define as false if seeking is not supported, true if it
 // is supported (this doesn't necessarily mean a seek will
 // succeed).
 }

So far so good.

 interface InputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.
 }

No way to check for end of stream except by reading some of it?

 // defined to implement either a D buffered object or wrap a FILE *.
 //
 interface BufferedInputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.

Since this method has the same sig, why doesn't BufferedInputTransport 
inherit InputTransport?

 // read data into the buffer until the delegate returns other than ~0
 //
 // The delegate is passed the entire buffer so far, with the start of the
 // new data just read. It returns other than ~0 when it determines the end
 // of the data in question.
 //
 ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process);

How does the delegate say "you know what, I'm fine with the first 1000 
bytes of the data; please take the rest of 1048 back"? Is that the 
result of the delegate? The process feels a bit odd.

 // same as readUntil except append to the given arr, Any excess
 // data will be pushed into the internal buffer.
 //
 size_t appendUntil(uint delegate(ubyte[] data, uint start) process, ref
 ubyte[] arr)

So indeed the delegate seems to return the length it wants to keep? And 
the rest would be copied back into the stream's internal buffers? I'm 
not sure I understand this API.

 // various buffer functions.
  property size_t bufsize();
  property size_t readable();
 // etc.

Can one set bufsize?

 }

 The way I see it working is, there are two implementations for
 BufferedInputTransport: FILEInputTransport and DBufferInputTransport.
 There are numerous implementations of InputTransport, each of which can
 be passed to the DBufferInputTransport, which uses its own buffer
 implementation. For example, a network socket, file, inter-thread
 stream, an array, etc.

 This way, you can play nice with C's stdio when necessary (i.e. for
 stdin/stdout/stderr) and avoid the FILE limitations and performance
 issues otherwise.

I'm a bit unclear on the delegate stuff, but it's promising because it 
could be quite flexible. But I wouldn't want to aggravate the users with 
an API that's difficult to use. Could you please give a few examples 
using delegates that implement common patterns - e.g. readline and 
readDelim?


Andrei

Oct 14 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 15 Oct 2010 00:37:55 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 21:58 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 16:47:13 -0400, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu

 How? Denis' implementation has two copies in the mix. (I'm not
 counting .dup etc.) Anyhow, let's do this - write down your
 interfaces so I can comment on them. We talk "oh that's a buffering
 interface" and "that requires buffering" and "that's an extra copy"
 and so on but we have little concrete contenders. I put my cards on
 the table, you put yours.

 I'll see if I can put something together.

 Here's a rough outline:

 Thanks!

 enum Anchor
 {
 Begin,
 Current,
 End
 }

 interface Seek
 {
 ulong seek(long delta, Anchor whence);
 final ulong tell() { return seek(0, Anchor.Current); }
 bool seekable(); // define as false if seeking is not supported, true  
 if it
 // is supported (this doesn't necessarily mean a seek will
 // succeed).
 }

 So far so good.

 interface InputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.
 }

 No way to check for end of stream except by reading some of it?

This is often the only way in the low level interface, and since we have  
no buffer at this point, yes, it's required to read.  How do you implement  
EOF without a buffer to hold the data you tried to read to see if you were  
at EOF?

It might be feasible to ask for EOF on the buffered version, but I still  
think it's not necessary.

 // defined to implement either a D buffered object or wrap a FILE *.
 //
 interface BufferedInputTransport : Seek
 {
 size_t read(ubyte[] data); // returns 0 on EOF.

 Since this method has the same sig, why doesn't BufferedInputTransport  
 inherit InputTransport?

I thought of that, but then a buffered input class could accept a buffered  
input transport interface as its low-level implementation, so then you  
have unnecessarily double-buffered streams.

 // read data into the buffer until the delegate returns other than ~0
 //
 // The delegate is passed the entire buffer so far, with the start of  
 the
 // new data just read. It returns other than ~0 when it determines the  
 end
 // of the data in question.
 //
 ubyte[] readUntil(uint delegate(ubyte[] data, uint start) process);

 How does the delegate say "you know what, I'm fine with the first 1000  
 bytes of the data; please take the rest of 1048 back"? Is that the  
 result of the delegate? The process feels a bit odd.

You are using the internal buffer, no copying is necessary.  So all that  
happens is the read position is moved up 1000 bytes and the first 1000  
bytes is returned.

 // same as readUntil except append to the given arr, Any excess
 // data will be pushed into the internal buffer.
 //
 size_t appendUntil(uint delegate(ubyte[] data, uint start) process, ref
 ubyte[] arr)

 So indeed the delegate seems to return the length it wants to keep? And  
 the rest would be copied back into the stream's internal buffers? I'm  
 not sure I understand this API.

Yes, this involves a copy of the data you aren't interested in, but how  
else could you do it?  You can't know "hey this data is not going to  
satisfy the condition, so I'll preemptively read it into the buffer  
instead".

 // various buffer functions.
  property size_t bufsize();
  property size_t readable();
 // etc.

 Can one set bufsize?

Probably in the etc. functions ;)  I left that up in the air, because I  
haven't given full thought to a buffer implementation.

 }

 The way I see it working is, there are two implementations for
 BufferedInputTransport: FILEInputTransport and DBufferInputTransport.
 There are numerous implementations of InputTransport, each of which can
 be passed to the DBufferInputTransport, which uses its own buffer
 implementation. For example, a network socket, file, inter-thread
 stream, an array, etc.

 This way, you can play nice with C's stdio when necessary (i.e. for
 stdin/stdout/stderr) and avoid the FILE limitations and performance
 issues otherwise.

 I'm a bit unclear on the delegate stuff, but it's promising because it  
 could be quite flexible. But I wouldn't want to aggravate the users with  
 an API that's difficult to use. Could you please give a few examples  
 using delegates that implement common patterns - e.g. readline and  
 readDelim?

Sure, readline is probably easiest.  Note, I'll assume for this example  
that \n signifies a line, windows would be slightly more difficult but  
doesn't really improve the example clarity, just assume it can also be  
done:

char[] readline(BufferedInputTransport trans, bool makeCopy=false)
{
    uint checkForNL(ubyte[] data, uint start)
    {
       char[] d = cast(char[])data; // need to switch to utf8
       foreach(i, dchar d; d[start..$])
       {
          if(d == '\n')
             return i + 1 + start; // consume including the newline
       }
       return ~0;
    }

    ubyte[] result;
    if(makeCopy)
        trans.read(&checkForNL, result);
    else
        result = trans.read(&checkForNL);

    auto cresult = cast(char[])result;

    // don't include any newline read
    if(cresult[$-1] == '\n')
       cresult = cresult[0..$-1];
    return result;
}

If you specify makeCopy is true, the resulting data is unique and can be  
used wherever.  Otherwise, the resulting data is actually the buffer of  
the BufferedInputTransport stream, and shouldn't be saved as it may be  
reused.

One can easily make a range based on this as well.

-Steve

Oct 15 2010

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

  On 14.10.2010 22:14, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
 On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Please, use the term "seek", and allow an anchor. Every OS allows this,
 it makes no sense not to provide it.

 I've always thought that's a crappy appendix. Every OS that ever 
 allows seek/tell with anchors allows ALL anchors, and always allows 
 either both or none of seek and tell. So I decided to cut through the 
 crap and simplify. You want to seek 100 bytes from here, you write 
 stream.position = stream.position + 100.

 Um.. yuck.   We need to use two system calls to seek 100 bytes?

 Oh, that reminds me I need to provide length as a property as well. 
 This would save us crap like seek(0, SEEK_END); ftell() to figure out 
 the length of a file.

 So now you need to do stream.position = stream.length to seek to the 
 end of the file instead of stream.seek(0, Anchor.END)?  Plus, how will 
 you implement length, probably like this:

 auto curpos = seek(0, SEEK_CUR);
 auto len = seek(0, SEEK_END);
 seek(curpos, SEEK_BEG);
 return len;

Actually C Run-time library provides something very limited from the 
point of most native OS I/O functions,
at very least, Windows and Posix have a separate function for 
determining file length.
[snip]
 -Steve


-- 
Dmitry Olshansky

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

With generic element type:

interface InputBinaryTransport(T) : TransportBase
if (!hasPointers!T && !is(T == class))
{
     /// Reads up to target.length items into target,
     /// returns number of items read. If target.length < 1 throws.
     size_t read(T[] target);
     /// Appends to target up until delimiter is found or end of stream.
     /// The delimiter is included in the target. Returns number
     /// of items read.
     size_t appendDelim(ref T[] target, in T[] delimiter);
}

The rest of the transport interfaces stays the same.


Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 14 Oct 2010 21:42:04 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 With generic element type:

 interface InputBinaryTransport(T) : TransportBase
 if (!hasPointers!T && !is(T == class))
 {
      /// Reads up to target.length items into target,
      /// returns number of items read. If target.length < 1 throws.
      size_t read(T[] target);
      /// Appends to target up until delimiter is found or end of stream.
      /// The delimiter is included in the target. Returns number
      /// of items read.
      size_t appendDelim(ref T[] target, in T[] delimiter);
 }

 The rest of the transport interfaces stays the same.


 Andrei

This interface requires buffering and thus needs to be built on top of the  
lower-level stream interface because Stream shouldn't do buffering  
internally.

But anyway, why would you need to read exactly one type of data from a  
stream? Why can't "read" be a template? Why does InputBinaryTransport have  
to be an interface and not a concrete implementation? It can be  
implemented uniformally on top of a generic stream.

Oct 14 2010

Brad Roberts <braddr puremagic.com> writes:

On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:
 
 Can we outline basic Stream interface now so that we could move on?

 
 Here's the input transport layer. Transport has no concern for formatting - it
 just moves stuff around.
 

I suggest having read and write with offset style apis.  See also:  man pread

The value here is that you can safely concurrently access a file descriptor and
not have to worry about atomicity of seek+read or seek+write at the app layer.

Oct 14 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/14/10 23:37 CDT, Brad Roberts wrote:
 On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

 Here's the input transport layer. Transport has no concern for formatting - it
 just moves stuff around.

 I suggest having read and write with offset style apis.  See also:  man pread

 The value here is that you can safely concurrently access a file descriptor and
 not have to worry about atomicity of seek+read or seek+write at the app layer.

This is very interesting - I had no idea about pread and pwrite.

http://linux.die.net/man/2/pread


Andrei

Oct 14 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Fri, 15 Oct 2010 08:37:24 +0400, Brad Roberts <braddr puremagic.com>  
wrote:

 On 10/14/2010 8:34 AM, Andrei Alexandrescu wrote:
 Starting a new thread from Denis' question:

 Can we outline basic Stream interface now so that we could move on?

 Here's the input transport layer. Transport has no concern for  
 formatting - it
 just moves stuff around.

 I suggest having read and write with offset style apis.  See also:  man  
 pread

 The value here is that you can safely concurrently access a file  
 descriptor and
 not have to worry about atomicity of seek+read or seek+write at the app  
 layer.

In my original API proposal I included reading from offsets:

interface SeekableInputStream : SeekableStream, InputStream
{
	...

	// reads up to buffer.length bytes from a stream
	// returns number of bytes read
	// throws on error
	size_t read(ubyte[] buffer, long offset);

	// reads up to buffer.length bytes from a specified offset
	AsyncReadFromOffsetRequest readAsync(ubyte[] buffer, long offset,  
Mailbox* mailbox = null);
	
	...
}

interface SeekableOutputStream : SeekableStream, OutputStream
{
	...

	// returns number of bytes written
	// throws on error
	size_t write(const(ubyte)[] buffer, long offset);

	// writes from from a specified offset
	AsyncWriteFromOffsetRequest writeAsync(const(ubyte)[] buffer, long  
offset, Mailbox* mailbox = null);

	...
}

Although it complicates an API quite a bit, it has its advantages, too.  
However, not all Stream types support reading/writing from an offset (e.g.  
SocketStream doesn't).

Oct 14 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Streaming transport interfaces: input