
digitalmars.D - Lazy caching map()?

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Today I found myself needing a lazy, caching version of map() on an
array.  More precisely, given an array `T[] src` of source data and a
function func(T) that's pretty expensive to compute, return an object
`result` such that:

- result[i] == func(src[i]), for 0 ≤ i < src.length.

- If result[j] is never actually used, func(src[j]) is never invoked
  (lazy).

- If result[j] is referenced multiple times, a cached value is returned
  after the first time, i.e., the result of func(src[j]) is cached, and
  func(src[j]) is never invoked more than once.

I couldn't figure out how to build this using Phobos primitives, so I
wrote my own implementation of it.  Pretty simple, really: it's just a
wrapper struct that lazily initializes an array of Nullable!T, populates
it with func(src[j]) the first time opIndex(j) is invoked, and just
returns the cached value if opIndex(j) has been invoked before.
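
For illustration, a minimal sketch of what such a wrapper might look
like (not the poster's actual code; the names CachedMap/cachedMap are
made up, not Phobos names):

import std.typecons : Nullable;

// Illustrative sketch only -- CachedMap/cachedMap are made-up names.
struct CachedMap(alias func, T)
{
    T[] src;
    Nullable!(typeof(func(T.init)))[] cache;

    auto opIndex(size_t j)
    {
        if (cache.length == 0)
            cache.length = src.length;   // allocate the cache itself lazily
        if (cache[j].isNull)
            cache[j] = func(src[j]);     // first access: compute and store
        return cache[j].get;             // later accesses: cached value
    }

    size_t length() const { return src.length; }
}

auto cachedMap(alias func, T)(T[] src)
{
    return CachedMap!(func, T)(src);
}

// usage:
//   auto result = [1, 2, 3].cachedMap!(x => x * 10);
//   auto y = result[1];   // computes 20 on first access, cached afterwards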

Can this be done using current Phobos primitives?


T

-- 
Don't throw out the baby with the bathwater. Use your hands...
Mar 09 2018
next sibling parent reply aliak <something something.com> writes:
On Friday, 9 March 2018 at 19:41:46 UTC, H. S. Teoh wrote:
 Today I found myself needing a lazy, caching version of map() 
 on an array.  More precisely, given an array `T[] src` of 
 source data and a function func(T) that's pretty expensive to 
 compute, return an object `result` such that:

 - result[i] == func(src[i]), for 0 ≤ i < src.length.

 - If result[j] is never actually used, func(src[j]) is never invoked
   (lazy).

 - If result[j] is referenced multiple times, a cached value is returned
   after the first time, i.e., the result of func(src[j]) is cached, and
   func(src[j]) is never invoked more than once.

 I couldn't figure out how to build this using Phobos 
 primitives, so I wrote my own implementation of it.  Pretty 
 simple, really, it's just a wrapper struct that lazily 
 initializes an array of Nullable!T and lazily populates it with 
 func(j) when opIndex(j) is invoked, and just returns the cached 
 value if opIndex(j) has been invoked before.

 Can this be done using current Phobos primitives?


 T
Unless I'm misunderstanding your requirements, are you looking for
std.functional.memoize?

import std.algorithm : map;
import std.functional : memoize;
import std.stdio : writeln;

auto fun(int a) {
    writeln("here");
    return a + 1;
}

void main() {
    auto a = [1, 2, 3];
    auto b = a.map!(memoize!fun);

    writeln(b[1]);
    writeln(b[1]);
    writeln(b[1]);
}

This will print "here" just once.
Mar 10 2018
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 10, 2018 at 08:54:16PM +0000, aliak via Digitalmars-d wrote:
 On Friday, 9 March 2018 at 19:41:46 UTC, H. S. Teoh wrote:
 [...]  More precisely, given an array `T[] src` of source data and a
 function func(T) that's pretty expensive to compute, return an
 object `result` such that:
 
 - result[i] == func(src[i]), for 0 ≤ i < src.length.
 
 - If result[j] is never actually used, func(src[j]) is never invoked
   (lazy).
 
 - If result[j] is referenced multiple times, a cached value is
   returned after the first time, i.e., the result of func(src[j]) is
   cached, and func(src[j]) is never invoked more than once.
[...]
 Can this be done using current Phobos primitives?
[...]
 Unless I'm misunderstanding your requirements, are you looking for
 std.functional.memoize?
 
 auto fun(int a) {
     writeln("here");
     return a + 1;
 }
 
 void main() {
     auto a = [1, 2, 3];
     auto b = a.map!(memoize!fun);
 
     writeln(b[1]);
     writeln(b[1]);
     writeln(b[1]);
 }
 
 will print "here" just once.
Very nice. Using memoize did occur to me, but I needed an array
interface to it.  Didn't think of using memoize with map, for some
reason.  Thanks for the idea!

However, looking at the implementation of memoize, it seems that it's
caching the results globally, with very little control over the cache
except its size.  I wonder if there's a way to control it better, i.e.,
free the cache once there are no more references to it, and have the
cache size depend on the data size.  Basically, in my use case, once I
map an array it's not predictable in advance how many elements will be
accessed (i.e., it's hard to decide on an optimal cache size for
memoize), but this access will happen pretty soon afterwards, and then
the array will be discarded (i.e., there's no point holding on to old
entries in the cache).  I would prefer that the cache be cleared once
the mapped array has been GC'd, but memoize() seems to hold on to the
cached results indefinitely.


T

-- 
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen
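
For concreteness, a small demo of that behaviour: the cache belongs to
the memoize!fun instantiation (a static, thread-local table), not to any
particular mapped range, so entries outlive the range. The counter
variable `calls` is added here purely for illustration:

import std.algorithm : map;
import std.functional : memoize;
import std.stdio : writeln;

int calls;                      // demo-only counter
int fun(int a) { ++calls; return a + 1; }

void main()
{
    auto first = [1, 2, 3].map!(memoize!fun);
    writeln(first[0]);          // fun(1) computed here: calls == 1

    // A completely separate range over different data...
    auto second = [1, 5, 9].map!(memoize!fun);
    writeln(second[0]);         // ...hits the same static cache: calls stays 1

    writeln(calls);             // prints 1 -- the entry outlived the first range
}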
Mar 13 2018
parent reply aliak <something something.com> writes:
On Tuesday, 13 March 2018 at 17:24:35 UTC, H. S. Teoh wrote:
 Very nice. Using memoize did occur to me, but I needed an array 
 interface to it.  Didn't think of using memoize with map, for 
 some reason.  Thanks for the idea!

 However, looking at the implementation of memoize, it seems 
 that it's caching the result globally, with very little control 
 over the cache except the size. I wonder if there's a way to 
 control it better, i.e., free the cache once there are no more 
 references to it, and have the cache size depend on the data 
 size.  Basically, in my use case, once I map an array it's not 
 predictable in advance how many elements will be accessed 
 (i.e., it's hard to decide on an optimal cache size for 
 memoize), but this access will happen pretty soon afterwards, 
 and then the array will be discarded (i.e., no point holding on 
 old entries in the cache).  I would prefer that the cache will 
 be cleared once the mapped array has been GC'd, but memoize() 
 seems to hold on to the cached results indefinitely.


 T
No worries! And ugh, can't think off the top of my head how to improve
it other than to make it a type and give it scope so that it can die at
some point. But that would make it uglier to use as well.

Btw, I just saw someone posted a link to an old forum post of yours:

https://forum.dlang.org/post/mailman.2562.1403196857.2907.digitalmars-d puremagic.com

Mind if I add that (or a version of it) to a library I'm writing? (It's
an optional type on dub.)

Cheers - Ali
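
A rough sketch of that "make it a type and give it scope" idea, using a
hypothetical name ScopedMemoize (the extra lambda at the call site is
what makes it uglier to use):

import std.traits : Parameters, ReturnType;
import std.typecons : Tuple, tuple;

// Hypothetical: like std.functional.memoize, but the cache is an
// instance member, so it is freed together with whatever owns it.
struct ScopedMemoize(alias fun)
{
    private ReturnType!fun[Tuple!(Parameters!fun)] cache;

    ReturnType!fun opCall(Parameters!fun args)
    {
        auto key = tuple(args);
        if (auto p = key in cache)
            return *p;              // cached result
        auto r = fun(args);
        cache[key] = r;
        return r;
    }
}

// usage:
//   ScopedMemoize!fun mfun;
//   auto b = src.map!(x => mfun(x));  // cache dies with mfun's scope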
Mar 13 2018
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 14, 2018 at 12:12:44AM +0000, aliak via Digitalmars-d wrote:
[...]
 Btw, I just saw someone posted a link to an old forum post of yours:
https://forum.dlang.org/post/mailman.2562.1403196857.2907.digitalmars-d puremagic.com
 
 Mind if I add that (or a version of it) to a library I'm writing?
 (it's an optional type on dub)
[...]

Sure, go right ahead!  Glad you found it useful!


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90
degrees and try again."
Mar 13 2018
prev sibling next sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Friday, 9 March 2018 at 19:41:46 UTC, H. S. Teoh wrote:
 Today I found myself needing a lazy, caching version of map() 
 on an array.  More precisely, given an array `T[] src` of 
 source data and a function func(T) that's pretty expensive to 
 compute, return an object `result` such that:

 - result[i] == func(src[i]), for 0 ≤ i < src.length.

 - If result[j] is never actually used, func(src[j]) is never invoked
   (lazy).

 - If result[j] is referenced multiple times, a cached value is returned
   after the first time, i.e., the result of func(src[j]) is cached, and
   func(src[j]) is never invoked more than once.

 Can this be done using current Phobos primitives?


 T
A Phobos implementation based on memoize would use a hash map
internally, which is slow for this use case.

mir-algorithm v0.9.5 (the DUB package will be updated soon) has `cached`
and `cachedGC` topologies. Example:

// Mir's map is required
import mir.ndslice: cachedGC, map, sliced;

auto ar = new int[5];
// init ar ...
auto cachedLazyMap = ar.map!fun.cachedGC;

`cachedGC` [1] internally allocates a vector of caches and a packed bit
array. `cached` [2] allows reusing already allocated buffers. The result
supports all random access range primitives, slicing, etc. The ndslice
architecture makes it easy to add new random access topologies.

[1]
[2]

Best regards,
Ilya Yaroshenko
Mar 18 2018
prev sibling parent reply Graham Fawcett <fawcett uwindsor.ca> writes:
On Friday, 9 March 2018 at 19:41:46 UTC, H. S. Teoh wrote:
 Today I found myself needing a lazy, caching version of map() 
 on an array.  More precisely, given an array `T[] src` of 
 source data and a function func(T) that's pretty expensive to 
 compute, return an object `result` such that:

 [...]
Map the sources to std.parallelism.Task instances, and yieldForce() the
instances when you need the results?

Graham
Mar 19 2018
parent Graham Fawcett <fawcett uwindsor.ca> writes:
On Monday, 19 March 2018 at 16:47:16 UTC, Graham Fawcett wrote:
 On Friday, 9 March 2018 at 19:41:46 UTC, H. S. Teoh wrote:
 Today I found myself needing a lazy, caching version of map() 
 on an array.  More precisely, given an array `T[] src` of 
 source data and a function func(T) that's pretty expensive to 
 compute, return an object `result` such that:

 [...]
Map the sources to std.parallelism.Task instances, and yieldForce() the instances when you need the results?
On second thought, that won't work. AFAIK there's no way to take an
unscheduled task and force it to execute on the current thread
immediately and yield the result. But with such an operation, you could
have a thread-local future/promise type that would meet your needs.

Graham
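
For what it's worth, a minimal sketch of the kind of lazily-forced,
thread-local value described above (the name Lazy and its interface are
made up here, not std.parallelism, and expensiveFunc in the usage
comment is a hypothetical stand-in):

import std.typecons : Nullable;

// Made-up type: a value computed on the current thread the first time
// force() is called, then cached for later calls.
struct Lazy(T)
{
    private T delegate() thunk;
    private Nullable!T value;

    this(T delegate() thunk) { this.thunk = thunk; }

    T force()
    {
        if (value.isNull)
            value = thunk();    // run the computation here, on this thread
        return value.get;
    }
}

// usage:
//   int x = 7;
//   auto fut = Lazy!int(() => expensiveFunc(x));  // captures x; nothing runs yet
//   auto a = fut.force();   // expensiveFunc(7) computed now
//   auto b = fut.force();   // cached, not recomputed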
Mar 19 2018