
digitalmars.D.learn - How to resume iteration from previous point in associative array

reply "Tarman" <tarman chilon.net> writes:
Hi,

We're doing some "super computing" "big data" style stuff with D. 
We have a system where we're comparing associative arrays with 
billions of entries.

However in this system we need to fairly consider possible 
solutions for many "units" at a time within a single thread.

So we'd like to... say, first iterate over the first 10 million 
for each "unit" then iterate over the next 10 million for the 
next unit so that each unit gets a fair share of CPU time.

However in this case we can't:

int count = 0;
foreach (item; associativeArray) {
   if (++count == 10_000_000) {
     // here we wish to somehow save the iteration state
   }
}

Then on the next iteration:

foreach (resume from previous iteration point) {
}

We've tried copying the keys into a non-associative array and 
sure this works, but it is far far far less optimal than an 
equivalent C++ solution we wrote where we use an 
std::unordered_set and can simply store the iterator.
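
Roughly, the key-copying workaround looks like this (only a sketch; the
double[string] type, the names and the chunk size are placeholders):

import std.algorithm : min;
import std.array : array;

// one "turn" for a unit: process the next chunkSize keys starting at `start`
// and return the new position -- the only state that has to be saved
size_t processChunk(double[string] aa, string[] keys, size_t start,
                    size_t chunkSize = 10_000_000)
{
    foreach (key; keys[start .. min(start + chunkSize, keys.length)])
    {
        auto value = aa[key];   // costs an extra hash lookup per entry
        // ... comparison work using value ...
    }
    return min(start + chunkSize, keys.length);
}

void main()
{
    double[string] aa;                  // the real one has billions of entries
    auto keys = aa.byKey.array;         // the expensive snapshot of the keys
    size_t pos = 0;
    pos = processChunk(aa, keys, pos);  // first 10 million for this unit
    // ... other units get their turn ...
    pos = processChunk(aa, keys, pos);  // resume where we left off
}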
Feb 19 2014
next sibling parent "Tarman" <tarman chilon.net> writes:
I guess what might work here is to allow some kind of argument to 
"byKey" that will let us resume so maybe something like:

foreach (item; 
massiveAssociativeArray.byKey(indexOfKeyToStartFrom)) {
}

If something like this isn't possible, then would one of our 
engineers be able to submit a patch to libphobos to provide the 
feature?
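
In the meantime, something like std.range.drop over byKey might approximate
this (an untested sketch; it still pops and discards the first N entries one
by one, and it assumes the array isn't modified between passes, so it doesn't
really fix the cost):

import std.range : drop;

void main()
{
    double[string] massiveAssociativeArray;     // placeholder type
    size_t indexOfKeyToStartFrom = 10_000_000;

    // drop() consumes the first N keys, then iteration continues from there
    foreach (item; massiveAssociativeArray.byKey.drop(indexOfKeyToStartFrom))
    {
        // ... process the next batch ...
    }
}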
Feb 19 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Tarman:

 We're doing some "super computing" "big data" style stuff with 
 D. We have a system where we're comparing associative arrays 
 with billions of entries.
Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit.
 We've tried copying the keys into a non-associative array and 
 sure this works, but it is far far far less optimal than an 
 equivalent C++ solution we wrote where we use an 
 std::unordered_set and can simply store the iterator.
Have you tried the simplest thing, to let std.parallelism chunk an AA.byKey.zip(AA.byValue)? Bye, bearophile
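
P.S. Something like this, untested, and assuming byKey and byValue visit the
entries in the same order (the AA type and the work unit size are just
placeholders):

import std.parallelism : parallel;
import std.range : zip;

void main()
{
    double[string] aa;      // placeholder for the real billion-entry AA

    // zip pairs each key with its value; parallel() hands out the range
    // in work units of 100_000 pairs across the task pool
    foreach (pair; parallel(zip(aa.byKey, aa.byValue), 100_000))
    {
        auto key = pair[0];
        auto value = pair[1];
        // ... comparison work ...
    }
}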
Feb 19 2014
parent reply "Nicolas Sicard" <dransic gmail.com> writes:
On Wednesday, 19 February 2014 at 09:28:27 UTC, bearophile wrote:
 Tarman:

 We're doing some "super computing" "big data" style stuff with 
 D. We have a system where we're comparing associative arrays 
 with billions of entries.
Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit.
 We've tried copying the keys into a non-associative array and 
 sure this works, but it is far far far less optimal than an 
 equivalent C++ solution we wrote where we use an 
 std::unordered_set and can simply store the iterator.
Have you tried the simplest thing, to let std.parallelism chunk an AA.byKey.zip(AA.byValue)? Bye, bearophile
Are byKey and byValue guaranteed to iterate the associative array in the same order?
Feb 19 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Nicolas Sicard:

 Are byKey and byValue guaranteed to iterate the associative 
 array in the same order?
Nope, or not yet, sorry. Bye, bearophile
Feb 19 2014
parent Ali Çehreli <acehreli yahoo.com> writes:
On 02/19/2014 04:11 AM, bearophile wrote:
 Nicolas Sicard:

 Are byKey and byValue guaranteed to iterate the associative array in
 the same order?
Nope, or not yet, sorry. Bye, bearophile
Is there a bug report about that? I can't imagine how they can iterate in any other order. Ali
Feb 19 2014
prev sibling next sibling parent "simendsjo" <simendsjo gmail.com> writes:
On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote:
 Hi,

 We're doing some "super computing" "big data" style stuff with 
 D. We have a system where we're comparing associative arrays 
 with billions of entries.

 However in this system we need to fairly consider possible 
 solutions for many "units" at a time within a single thread.

 So we'd like to... say, first iterate over the first 10 million 
 for each "unit" then iterate over the next 10 million for the 
 next unit so that each unit gets a fair share of CPU time.

 However in this case we can't:

 int count = 0;
 foreach (item; associativeArray) {
   if (++count == 10_000_000) {
     // here we wish to somehow save the iteration state
   }
 }

 Then on the next iteration:

 foreach (resume from previous iteration point) {
 }

 We've tried copying the keys into a non-associative array and 
 sure this works, but it is far far far less optimal than an 
 equivalent C++ solution we wrote where we use an 
 std::unordered_set and can simply store the iterator.
Probably not an answer to your question, but if possible, you can use a fiber and yield when you should give control to another fiber.
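
Roughly like this (an untested sketch; the types, names and batch size are
placeholders):

import core.thread : Fiber;

// each "unit" runs in its own fiber; the iteration state lives on the
// fiber's stack, and Fiber.yield() hands control back after each batch
void processUnit(double[string] aa)
{
    size_t count;
    foreach (key, value; aa)
    {
        // ... comparison work ...
        if (++count % 10_000_000 == 0)
            Fiber.yield();
    }
}

void main()
{
    double[string] aa;              // placeholder for the real data
    auto units = [new Fiber({ processUnit(aa); }),
                  new Fiber({ processUnit(aa); })];

    // simple round-robin scheduler: call each live fiber in turn
    bool anyAlive = true;
    while (anyAlive)
    {
        anyAlive = false;
        foreach (f; units)
            if (f.state != Fiber.State.TERM)
            {
                f.call();
                anyAlive = true;
            }
    }
}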
Feb 19 2014
prev sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote:
 Hi,

 We're doing some "super computing" "big data" style stuff with 
 D. We have a system where we're comparing associative arrays 
 with billions of entries.

 However in this system we need to fairly consider possible 
 solutions for many "units" at a time within a single thread.

 So we'd like to... say, first iterate over the first 10 million 
 for each "unit" then iterate over the next 10 million for the 
 next unit so that each unit gets a fair share of CPU time.

 However in this case we can't:

 int count = 0;
 foreach (item; associativeArray) {
   if (++count == 10_000_000) {
     // here we wish to somehow save the iteration state
   }
 }

 Then on the next iteration:

 foreach (resume from previous iteration point) {
 }

 We've tried copying the keys into a non-associative array and 
 sure this works, but it is far far far less optimal than an 
 equivalent C++ solution we wrote where we use an 
 std::unordered_set and can simply store the iterator.
You could iterate over byValue/byKey/zip(byKey, byValue), which gives you a range to start again from. You'd need to iterate by hand (with while/for) though, since your range won't get consumed by foreach, I guess. Or use RefRange.

Not tested:

import std.range : refRange;

auto r = associativeArray.byValue;  // refRange needs an lvalue, not a temporary
auto values = refRange(&r);

foreach (item; values)
{
    if (someday)
        break;
}

// here values should have been advanced accordingly
// and you can just continue your iteration
foreach (item; values)
{
}
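
And a small complete example of the idea (note that the element you break on
is still the front of the wrapped range, so it comes around again on resume
unless you pop it yourself):

import std.range : refRange;
import std.stdio : writeln;

void main()
{
    int[string] aa = ["a" : 1, "b" : 2, "c" : 3, "d" : 4];

    auto keyRange = aa.byKey;           // the lvalue that refRange wraps
    auto saved = refRange(&keyRange);

    size_t count;
    foreach (key; saved)
    {
        if (++count == 2)
        {
            saved.popFront();           // skip past the element we stopped on
            break;
        }
    }

    // the wrapped range kept its position, so this prints only the
    // two keys that were not visited above
    foreach (key; saved)
        writeln(key);
}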
Feb 19 2014
parent "Tarman" <tarman chilon.net> writes:
 bearophile:

Thanks for your suggestion! Yes, we've tried this, but 
unfortunately the performance doesn't work out for us, maybe 
because all our CPUs are quite saturated.

 simendsjo:

Thanks also for your suggestion. We are using tasks and yield in 
our code, but in this case we felt that using them just to resume 
iteration was perhaps a misuse of the feature, although we would 
be interested to see what the overhead is.

 Mr Pankrath:

Thanks that looks like what we need!
Feb 19 2014