www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Unbuffered range interface

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
The recent discussion initiated by Walter points to a problem known 
since a long time ago: ranges are well modeled by objects in memory 
(arrays, lists, other containers) but poorly by objects that need to 
load or construct elements on the fly.

Here are some scenarios that we need to accommodate:

1. Files and sockets

Fastest access is implemented at the OS level by means of read() that 
takes a user-provided buffer.

2. Compression, decompression, encryption, decryption

I think certain sizes would work better than others, but I'm not sure 
how that all goes. A good case study.

3. Of course character encoding, decoding, and transcoding.

What set of primitives would work best for all these scenarios and more?


Andrei
Mar 25 2014
next sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Wednesday, 26 March 2014 at 00:55:31 UTC, Andrei Alexandrescu 
wrote:
 The recent discussion initiated by Walter points to a problem 
 known since a long time ago: ranges are well modeled by objects 
 in memory (arrays, lists, other containers) but poorly by 
 objects that need to load or construct elements on the fly.
If you want to support web servers you should also consider scenarios where you retrieve ranges from a distributed database and the data arrive out-of-order. You can cut down on latency by processing map() etc out-of-order.
Mar 25 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wednesday, 26 March 2014 at 00:55:31 UTC, Andrei Alexandrescu 
wrote:
 The recent discussion initiated by Walter points to a problem 
 known since a long time ago: ranges are well modeled by objects 
 in memory (arrays, lists, other containers) but poorly by 
 objects that need to load or construct elements on the fly.

 Here are some scenarios that we need to accommodate:

 1. Files and sockets

 Fastest access is implemented at the OS level by means of 
 read() that takes a user-provided buffer.

 2. Compression, decompression, encryption, decryption

 I think certain sizes would work better than others, but I'm 
 not sure how that all goes. A good case study.

 3. Of course character encoding, decoding, and transcoding.

 What set of primitives would work best for all these scenarios 
 and more?
Have you seen Dmitry Olshansky's BufferedRange? http://forum.dlang.org/post/l9q66g$2he3$1 digitalmars.com
Mar 25 2014
prev sibling next sibling parent "Andrea Fontana" <nospam example.com> writes:
Similar scenario writing my mongodb (document based db) wrapper.


On Wednesday, 26 March 2014 at 00:55:31 UTC, Andrei Alexandrescu 
wrote:
 The recent discussion initiated by Walter points to a problem 
 known since a long time ago: ranges are well modeled by objects 
 in memory (arrays, lists, other containers) but poorly by 
 objects that need to load or construct elements on the fly.

 Here are some scenarios that we need to accommodate:

 1. Files and sockets

 Fastest access is implemented at the OS level by means of 
 read() that takes a user-provided buffer.

 2. Compression, decompression, encryption, decryption

 I think certain sizes would work better than others, but I'm 
 not sure how that all goes. A good case study.

 3. Of course character encoding, decoding, and transcoding.

 What set of primitives would work best for all these scenarios 
 and more?


 Andrei
Mar 26 2014
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Mar-2014 04:55, Andrei Alexandrescu пишет:
 The recent discussion initiated by Walter points to a problem known
 since a long time ago: ranges are well modeled by objects in memory
 (arrays, lists, other containers) but poorly by objects that need to
 load or construct elements on the fly.

 Here are some scenarios that we need to accommodate:
I've observed that ranges do mostly fine on top of buffering primitive. Said primitive provides direct access to underlying array. For cases listed below I've come to conclusion that we'd need a generic sliding-window (over whatever is the true source of data) abstraction. I call it a buffer, and a special range over said abstraction a buffer-range. Key observation on the level of ranges is that we need something that looks a lot like RA range, has slicing, but doesn't have length. There must be a way to do queries like "is there X elements available" and "please extend/move underlying window to have at least Y elements ahead available".
 1. Files and sockets

 Fastest access is implemented at the OS level by means of read() that
 takes a user-provided buffer.
Ranges would need at least buffering primitive below. It may or may not be backed by file/socket descriptor down below (say MM-file easily provides buffer interface by re-mapping different views of file).
 2. Compression, decompression, encryption, decryption
This is much better addressed at the level of buffering itself, i.e. making an adapter/filter for a buffer.
 I think certain sizes would work better than others, but I'm not sure
 how that all goes. A good case study.

 3. Of course character encoding, decoding, and transcoding.
Also seems to be decently addressed as filter for buffers. Depending on the kind of the trade-off between laziness vs throughput it may be postponed to ranges. -- Dmitry Olshansky
Mar 27 2014
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 26/03/14 01:55, Andrei Alexandrescu wrote:
 The recent discussion initiated by Walter points to a problem known since a
long
 time ago: ranges are well modeled by objects in memory (arrays, lists, other
 containers) but poorly by objects that need to load or construct elements on
the
 fly.
Is it important to distinguish here between stuff that is loaded/constructed on the fly and may differ, versus stuff that is constructed on the fly but deterministically, so that it's _as if_ the value was actually stored?
Mar 27 2014
prev sibling next sibling parent reply "QAston" <qaston gmail.com> writes:
 What set of primitives would work best for all these scenarios 
 and more?
class Silver { enum bullet = true; }
Mar 28 2014
parent "QAston" <qaston gmail.com> writes:
On Friday, 28 March 2014 at 19:21:51 UTC, QAston wrote:
 What set of primitives would work best for all these scenarios 
 and more?
class Silver { enum bullet = true; }
Woops, I should have posted golden hammer, sorry.
Mar 28 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
I see it like Dimitry and have actually implemented what he
describes a while ago. It is a one consumer, one producer
circular buffer which offers a sliding window into the
content.
As typical for a stream(!) it operates on variable blocks of
bytes and producer and consumer never actually negotiate a
fixed block size. Instead the producer puts as many bytes into
the buffer as it thinks is a good trade-off between the count
of system calls and the delay before the consumer can start
reading. The consumer on the other hand asks for as many bytes
as it needs at minimum and gets blocked if this requirement
is yet unmet.
When there are enough bytes available for consumption the
consumer may also decide to map *all* of the currently
available data if that further improves processing performance.

Source:
https://github.com/mleise/piped/blob/master/src/piped/circularbuffer.d

The public primitives (to come back to the topic) are the
following as exposed by a "get" and a "put" pointer into the
buffer:

commit(<byte count or data type>)
  Tells the buffer that the sliding window can be moved by X
  bytes, after they have been processed (read from or written
  to).
mapAvailable()
  For the consumer: Returns a window spanning all of the
  buffer that the producer has committed already.
  For the producer: Returns all of the free memory in the
  buffer that can be written to.
mapAtLeast(<byte count>)
  Same as mapAvailable() but blocks until a certain amount of
  bytes are ready.
map(<byte count>)
  Same as mapAtLeast, but shortens the resulting slice to
  match the byte count it has actually been asked for:
  mapAtLeast(count)[0 .. count];
map(T)()
  Treats the start of the sliding window as data of type T and
  returns a pointer to it. (When enough bytes are available.)
finish()
  This basically tells the counterpart that we are done with
  the stream. For the producer this is like setting an EOF
  flag. No more data will be written. All queries for more
  will result in an exception.

In this buffer I replaced EOF checks with throwing exceptions
on attempts to read past the end of the stream. In my limited
use cases the expected length of the stream could always be
established on the fly, for example by reading file header
fields.
There are also primitives for reading and writing bit runs
mostly to accommodate compressed streams:

peekBits(<bit count>)
  Return ubyte, uint, ulong, ... of the next bits at the start
  of the sliding window, but don't remove them yet.
  This is required for some compression algorithms to decide on
  the next steps to take.
skipBits(<bit count>)
  For performance reasons, if you just peeked the bits and
  cached the result you can just skip them. It can also be
  useful for other cases where you have no use for some bits
  in the stream.
commitBits(<bit count>)
  As above for integral bytes. The committed buffer bits
  become available to the counterpart.
readBit()
  Read a single bit. Optimized.
readBits(<bit count>)
  Works like peekBits() followed by skipBits().
skipBitsToNextByte()
  Very typical in compression. Skips the remainder of a byte
  until we are at an integral position again.

-- 
Marco
Mar 29 2014