www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - foreach (i; taskPool.parallel(0..2_000_000)

reply Paul <phshaffer gmail.com> writes:
Thanks in advance for any assistance.

As the subject line suggests can I do something like? :
```d
foreach (i; taskPool.parallel(0..2_000_000))
```
Obviously this exact syntax doesn't work but I think it expresses 
the gist of my challenge.
Apr 01 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/1/23 2:25 PM, Paul wrote:
 Thanks in advance for any assistance.
 
 As the subject line suggests can I do something like? :
 ```d
 foreach (i; taskPool.parallel(0..2_000_000))
 ```
 Obviously this exact syntax doesn't work but I think it expresses the 
 gist of my challenge.
```d import std.range; foreach(; iota(0, 2_000_000).parallel) ``` -Steve
Apr 01 2023
next sibling parent Paul <phshaffer gmail.com> writes:
Thanks Steve.
Apr 01 2023
prev sibling next sibling parent reply Paul <phshaffer gmail.com> writes:
 ```d
 import std.range;

 foreach(; iota(0, 2_000_000).parallel)
 ```

 -Steve
Is there a way to verify that it split up the work in to tasks/threads ...? The example you gave me works...compiles w/o errors but the execution time is the same as the non-parallel version. They both take about 6 secs to execute. totalCPUs tells me I have 8 CPUs available.
Apr 01 2023
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 4/1/23 15:30, Paul wrote:

 Is there a way to verify that it split up the work in to tasks/threads
 ...?
It is hard to see the difference unless there is actual work in the loop that takes time. You can add a Thread.sleep call. (Commented-out in the following program.) Another option is to monitor a task manager like 'top' on unix based systems. It should multiple threads for the same program. However, I will do something unspeakably wrong and take advantage of undefined behavior below. :) Since iteration count is an even number, the 'sum' variable should come out as 0 in the end. With .parallel it doesn't because multiple threads are stepping on each other's toes (values): import std; void main() { long sum; foreach(i; iota(0, 2_000_000).parallel) { // import core.thread; // Thread.sleep(1.msecs); if (i % 2) { ++sum; } else { --sum; } } if (sum == 0) { writeln("We highly likely worked serially."); } else { writefln!"We highly likely worked in parallel because %s != 0."(sum); } } If you remove .parallel, 'sum' will always be 0. Ali
Apr 01 2023
next sibling parent reply Salih Dincer <salihdb hotmail.com> writes:
On Saturday, 1 April 2023 at 22:48:46 UTC, Ali Çehreli wrote:
 On 4/1/23 15:30, Paul wrote:

 Is there a way to verify that it split up the work in to
tasks/threads
 ...?
It is hard to see the difference unless there is actual work in the loop that takes time.
I always use the Rowland Sequence for such experiments. At least it's better than the Fibonacci Range: ```d struct RowlandSequence { import std.numeric : gcd; import std.format : format; import std.conv : text; long b, r, a = 3; enum empty = false; string[] front() { string result = format("%s, %s", b, r); return [text(a), result]; } void popFront() { long result = 1; while(result == 1) { result = gcd(r++, b); b += result; } a = result; } } enum BP { f = 1, b = 7, r = 2, a = 1, /* f = 109, b = 186837516, r = 62279173, //*/ s = 5 } void main() { RowlandSequence rs; long start, skip; with(BP) { rs = RowlandSequence(b, r); start = f; skip = s; } rs.popFront(); import std.stdio, std.parallelism; import std.range : take; auto rsFirst128 = rs.take(128); foreach(r; rsFirst128.parallel) { if(r[0].length > skip) { start.writeln(": ", r); } start++; } } /* PRINTS: 46: ["121403", "364209, 121404"] 48: ["242807", "728421, 242808"] 68: ["486041", "1458123, 486042"] 74: ["972533", "2917599, 972534"] 78: ["1945649", "5836947, 1945650"] 82: ["3891467", "11674401, 3891468"] 90: ["7783541", "23350623, 7783542"] 93: ["15567089", "46701267, 15567090"] 102: ["31139561", "93418683, 31139562"] 108: ["62279171", "186837513, 62279172"] */ ``` The operation is simple, again multiplication, addition, subtraction and module, i.e. So four operations but enough to overrun the CPU! I haven't seen rsFirst256 until now because I don't have a fast enough processor. Maybe you'll see it, but the first 108 is fast anyway. **PS:** Decrease value of the `skip` to see the entire sequence. In cases where your processor power is not enough, you can create skip points. Check out BP... SDB 79
Apr 01 2023
parent Salih Dincer <salihdb hotmail.com> writes:
On Sunday, 2 April 2023 at 04:34:40 UTC, Salih Dincer wrote:
 I haven't seen rsFirst256 until now...
**Edit:** I saw, I saw :) I am struck with consternation! I've never seen these results before. Interesting, there is such a thing as parallel threading :) Here are my skipPoints: ```d enum BP : long { //f, a, r, b = 7, /* <- beginning f = 113, r = 62279227, b = 186837678, // f = 146, r = 249134971, b = 747404910, // f = 161, r = 498270808, b = 1494812421, // f = 178, r = 1993083484, b = 5979250449, // f = 210, r = 3986167363, b = 11958502086, //*/ s = 5 } /* PRINTS: eLab pico:~/Projeler$ ./RownlandSequence_v2 122: ["124559610, 373678827"] 128: ["249120240, 747360717"] */ ``` It looks like there are 5 total skipPoints until 256 where it loops for a long time. (while passing 1's). SDB 79
Apr 02 2023
prev sibling parent reply Salih Dincer <salihdb hotmail.com> writes:
On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer 
wrote:
 So for example, if you have:

 ```d
 foreach(i; iota(0, 2_000_000).parallel)
 {
    runExpensiveTask(i);
 }
 ```

 The foreach is run on the main thread, gets a `0`, then hands 
 off to a task thread `runExpensiveTask(0)`. Then it gets a `1`, 
 and hands off to a task thread `runExpensiveTask(1)`, etc. The 
 iteration is not expensive, and is not done in parallel.

 On the other hand, what you *shouldn't* do is:

 ```d
 foreach(i; iota(0, 2_000_000).map!(x => 
 runExpensiveTask(x)).parallel)
 {
 }
 ```

 as this will run the expensive task *before* running any tasks.
I don't understand what `foreach()` does :) ```d import std.range, std.algorithm : map; import std.stdio, std.parallelism; //import sdb.sequences : RowlandSequence_v2;/* struct RowlandSequence_v2 { import std.numeric : gcd; long b, r, a = 3; enum empty = false; auto front() => a; void popFront() { long result = 1; while(result == 1) { result = gcd(r++, b); b += result; } a = result; } }//*/ enum BP : long { // s, f, r, b = 7, /* <- beginning s = 178, r = 1993083484, b = 5979250449,//*/ len = 190 } void main() { with(BP) { long[len] arr; auto range = RowlandSequence_v2(b, r); s.iota(len) .map!((a){ range.popFront(); return arr[a] = range.front(); } ) .parallel .writeln; } } /* PRINTS: ParallelForeach!(MapResult!(__lambda3, Result))(std.parallelism.TaskPool, [5, 3, 73, 157, 7, 5, 3, 13, 3986167223, 3, 7, 73], 1) */ ``` Is it necessary to enclose the code in `foreach()`? I invite Ali to tell me! Please explain why parallel isn't running. "Ben anlamıyor, foreach ne yapıyor 😀 Kodu `foreach()` içine almak şart mı? Ali'yi davet ediyorum, bana anlatması için! Paralel() niye çalışmıyor, lütfen açıklayın hocam. Mümkünse Türkçe!" in Turkish. SDB 79
Apr 04 2023
next sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 4/4/23 02:24, Salih Dincer wrote:

 I don't understand what `foreach()` does :)
Hm. I forgot whether 'parallel' works only with 'foreach'. But there are various other algorithms in std.parallelism that may be more useful with range algorithm chains: https://dlang.org/phobos/std_parallelism.html
 in Turkish
I don't have time to experiment more at this time but I have the following chapters, which includes some of those other algorithms as well: http://ddili.org/ders/d/kosut_islemler.html http://ddili.org/ders/d.en/parallelism.html Ali
Apr 04 2023
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/4/23 5:24 AM, Salih Dincer wrote:

 Is it necessary to enclose the code in `foreach()`? I invite Ali to tell 
 me! Please explain why parallel isn't running.
parallel is a shortcut to `TaskPool.parallel`, which is indeed a foreach-only construct, it does not return a range. I think what you want is `TaskPool.map`: ```d // untested, just looking at the taskPool.map!(/* your map function here */) (s.iota(len)).writeln; ``` Can't use pipelining with it, because it is a member function. https://dlang.org/phobos/std_parallelism.html#.TaskPool.map -Steve
Apr 04 2023
parent reply Salih Dincer <salihdb hotmail.com> writes:
On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer 
wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is indeed 
 a foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```
I tried, thanks but it goes into infinite loop. For example, the first 50 of the sequence should have been printed to the screen immediately without waiting. ```d long[50] arr; RowlandSequence_v2 range; auto next(long a) { range.popFront(); return arr[a] = range.front(); } void main() { range = RowlandSequence_v2(7, 2); taskPool.map!next(iota(50))/* s.iota(50) .map!next .parallel//*/ .writeln; } ``` On Tuesday, 4 April 2023 at 13:18:01 UTC, Ali Çehreli wrote:
 I don't have time to experiment more at this time but I have 
 the following chapters, which includes some of those other 
 algorithms as well:

   http://ddili.org/ders/d/kosut_islemler.html
I read it, thanks... SDB 79
Apr 04 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/4/23 11:34 AM, Salih Dincer wrote:
 On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is indeed a 
 foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```
I tried, thanks but it goes into infinite loop. For example, the first 50 of the sequence should have been printed to the screen immediately without waiting. ```d long[50] arr; RowlandSequence_v2 range; auto next(long a) {   range.popFront();   return arr[a] = range.front(); } void main() {     range = RowlandSequence_v2(7, 2);     taskPool.map!next(iota(50))/*     s.iota(50)      .map!next      .parallel//*/      .writeln; } ```
Keep in mind that `arr` and `range` are thread-local, and so will be different states for different tasks. Though I don't really know what you are doing there. -Steve
Apr 04 2023
parent Salih Dincer <salihdb hotmail.com> writes:
On Tuesday, 4 April 2023 at 16:22:29 UTC, Steven Schveighoffer 
wrote:
 On 4/4/23 11:34 AM, Salih Dincer wrote:
 On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer 
 wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is 
 indeed a foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```
I tried, thanks but it goes into infinite loop. For example, the first 50 of the sequence should have been printed to the screen immediately without waiting. ```d long[50] arr; RowlandSequence_v2 range; auto next(long a) {   range.popFront();   return arr[a] = range.front(); } void main() {     range = RowlandSequence_v2(7, 2);     taskPool.map!next(iota(50))/*     s.iota(50)      .map!next      .parallel//*/      .writeln; } ```
Keep in mind that `arr` and `range` are thread-local, and so will be different states for different tasks.
I guess the reason it goes into an infinite loop is because gcd() a recursive function (gcd). This is the only solution I can think of about this: ```d import std.range, std.algorithm : map; import std.stdio, std.parallelism; //import std.numeric : gcd; /* struct Vector { long x, y, result; alias result this; } Vector gcd(long a, long b) { if(b == 0) return Vector(1, 0, a); auto pre = gcd(b, a % b); auto tmp = (a / b) * pre.y; return Vector(pre.y, pre.x - tmp, pre.result); }//*/ struct RowlandSequence_v3 { long b, r, n, a = 3, limit; bool empty() { return n == limit; } auto front() { return a; } void popFront() { long result = 1; while(result == 1) { result = gcd(r++, b); b += result; } a = result; } long gcd(long a, long b) { long c; while(b != 0) { c = a % b; a = b; b = c; } return a; } } auto next(ref RowlandSequence_v3 range) { with(range) { if(empty) return [n, front]; popFront(); return [n++, front]; } } long[179] arr; void main() { // initialization: RowlandSequence_v3[4] r = [ RowlandSequence_v3(7 , 2, 0, 3, 112), RowlandSequence_v3(186837678, 62279227, 112, 3, 145), RowlandSequence_v3(747404910, 249134971, 145, 6257, 160), RowlandSequence_v3(1494812421, 498270808, 160, 11, 177) ]; auto tasks = [ task(&next, r[0]), task(&next, r[1]), task(&next, r[2]), task(&next, r[3]) ]; // quad parallel operation: foreach(_; 0..r[0].limit) { foreach(p, ref task; tasks) { task.executeInNewThread; auto t = task.workForce; arr[t[0]] = t[1]; } } // prints... foreach(x, n; arr) { switch(x + 1) { case 112, 145, 160: n.writeln; break; default: n.write(", "); } } } /* PRINTS: user debian:~/Documents$ dmd -O rowlandSequence.d -release user debian:~/Documents$ time ./rowlandSequence 5, 3, 11, 3, 23, 3, 47, 3, 5, 3, 101, 3, 7, 11, 3, 13, 233, 3, 467, 3, 5, 3, 941, 3, 7, 1889, 3, 3779, 3, 7559, 3, 13, 15131, 3, 53, 3, 7, 30323, 3, 60647, 3, 5, 3, 101, 3, 121403, 3, 242807, 3, 5, 3, 19, 7, 5, 3, 47, 3, 37, 5, 3, 17, 3, 199, 53, 3, 29, 3, 486041, 3, 7, 421, 23, 3, 972533, 3, 577, 7, 1945649, 3, 163, 7, 3891467, 3, 5, 3, 127, 443, 3, 31, 7783541, 3, 7, 15567089, 3, 19, 29, 3, 5323, 7, 5, 3, 31139561, 3, 41, 3, 5, 3, 62279171, 3, 7, 83, 3 29, 3, 1103, 3, 5, 3, 13, 7, 124559609, 3, 107, 3, 911, 3, 249120239, 3, 11, 3, 7, 61, 37, 179, 3, 31, 19051, 7, 3793, 23, 3, 5, 3, 6257, 3 3, 11, 3, 13, 5, 3, 739, 37, 5, 3, 498270791, 3, 19, 11, 3 3, 3, 5, 3, 996541661, 3, 7, 37, 5, 3, 67, 1993083437, 3, 5, 3, 83, 3, 3, 0, real 7m54.093s user 7m54.062s sys 0m0.024s */ ``` However, parallel processing for 4-digit sequence elements is not promising at least for the Rowland Sequence. SDB 79
Apr 05 2023
prev sibling parent reply Paul <phshaffer gmail.com> writes:
On Saturday, 1 April 2023 at 18:30:32 UTC, Steven Schveighoffer 
wrote:
 On 4/1/23 2:25 PM, Paul wrote:

 ```d
 import std.range;

 foreach(; iota(0, 2_000_000).parallel)
 ```

 -Steve
Is there a way to tell if the parallelism actually divided up the work? Both versions of my program run in the same time ~6 secs.
Apr 01 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/1/23 6:32 PM, Paul wrote:
 On Saturday, 1 April 2023 at 18:30:32 UTC, Steven Schveighoffer wrote:
 On 4/1/23 2:25 PM, Paul wrote:

 ```d
 import std.range;

 foreach(i; iota(0, 2_000_000).parallel)
 ```
Is there a way to tell if the parallelism actually divided up the work? Both versions of my program run in the same time ~6 secs.
It's important to note that parallel doesn't iterate the range in parallel, it just runs the body in parallel limited by your CPU count. If your `foreach` body takes a global lock (like `writeln(i);`), then it's not going to run any faster (probably slower actually). If you can disclose more about what you are trying to do, it would be helpful. Also make sure you have more than one logical CPU. -Steve
Apr 02 2023
parent reply Paul <phshaffer gmail.com> writes:
On Sunday, 2 April 2023 at 15:32:05 UTC, Steven Schveighoffer 
wrote:

 It's important to note that parallel doesn't iterate the range 
 in parallel, it just runs the body in parallel limited by your 
 CPU count.
**?!?**
 If your `foreach` body takes a global lock (like 
 `writeln(i);`), then it's not going to run any faster (probably 
 slower actually).
**Ok I did have some debug writelns I commented out.**
 If you can disclose more about what you are trying to do, it 
 would be helpful.
**This seems like it would be a lot of code and explaining but let me think about how to summarize.**
 Also make sure you have more than one logical CPU.
**I have 8.**
Apr 03 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/3/23 6:02 PM, Paul wrote:
 On Sunday, 2 April 2023 at 15:32:05 UTC, Steven Schveighoffer wrote:
 
 It's important to note that parallel doesn't iterate the range in 
 parallel, it just runs the body in parallel limited by your CPU count.
**?!?**
So for example, if you have: ```d foreach(i; iota(0, 2_000_000).parallel) { runExpensiveTask(i); } ``` The foreach is run on the main thread, gets a `0`, then hands off to a task thread `runExpensiveTask(0)`. Then it gets a `1`, and hands off to a task thread `runExpensiveTask(1)`, etc. The iteration is not expensive, and is not done in parallel. On the other hand, what you *shouldn't* do is: ```d foreach(i; iota(0, 2_000_000).map!(x => runExpensiveTask(x)).parallel) { } ``` as this will run the expensive task *before* running any tasks.
 
 If your `foreach` body takes a global lock (like `writeln(i);`), then 
 it's not going to run any faster (probably slower actually).
**Ok I did have some debug writelns I commented out.**
And did it help? Another thing that takes a global lock is memory allocation.
 Also make sure you have more than one logical CPU.
**I have 8.**
It's dependent on the work being done, but you should see a roughly 8x speedup as long as the overhead of distributing tasks is not significant compared to the work being done. -Steve
Apr 03 2023
parent reply Paul <phshaffer gmail.com> writes:
On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer 
wrote:

 
 If your `foreach` body takes a global lock (like 
 `writeln(i);`), then it's not going to run any faster 
 (probably slower actually).
**Ok I did have some debug writelns I commented out.**
And did it help?
**No** My program is about 140 lines Steven. Its just one of the Advent of Code challenges. Could I past the whole program here and see what you think? Thanks for your assistance...much appreciated.
Apr 03 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/3/23 6:56 PM, Paul wrote:
 On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer wrote:
 
 If your `foreach` body takes a global lock (like `writeln(i);`), 
 then it's not going to run any faster (probably slower actually).
**Ok I did have some debug writelns I commented out.**
And did it help?
**No** My program is about 140 lines Steven.  Its just one of the Advent of Code challenges.  Could I past the whole program here and see what you think?
Yeah, please post. -Steve
Apr 03 2023
parent reply Paul <phshaffer gmail.com> writes:
On Monday, 3 April 2023 at 23:13:58 UTC, Steven Schveighoffer 
wrote:
 Yeah, please post.
```d module aoc2215b2; import std.stdio; import std.file: readText; import std.conv: to; import std.math: abs; import std.traits; import std.parallelism; import std.range; import core.time: MonoTime; // Timed main() vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv void main(string[] args) { auto progStartTime = MonoTime.currTime; //----------------------------------------------------------------------------- string filename = args.length > 1 ? args[1] : "aoc2215a.data"; CommPair[] commPair; ulong x,y; // read file that has data sets in the form of x,y coordinate pairs // for each sensor-beacon pair. Create on array of structs to hold // this information. loadFileDataIntoArrayOfStructs(commPair, filename); foreach(int lineOfInterest;parallel(iota(0,4_000_001))) { Span[] span; // A section of line-of-interest coordinates where no other beacons are present. const spanReserve = span.reserve(50); createSpansOfNoBeacons(lineOfInterest,commPair,span); // if spans overlap, combine them into a single span and mark // the other spans !inUse. combineOverlappingSpans(span); // look for a line that doesn't have 4,000,001 locations accounted for if(beaconFreeLocations(span) < 4_000_001) { // find the location that is not accounted for foreach(ulong i;0..4_000_000) { bool found = false; foreach(sp;span) { if(i >= sp.xLow && i <= sp.xHigh) { found = true; break; } } if(!found) { x = i; y = lineOfInterest; break; } } } } writeln(x," ",y); //----------------------------------------------------------------------------- auto progEndTime = MonoTime.currTime; writeln(progEndTime - progStartTime); } // Timed main() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ struct CommPair { int sx,sy,bx,by; int manhattanDistance; } void loadFileDataIntoArrayOfStructs(ref CommPair[] commPair, string filename) { import std.regex; auto s = readText(filename); auto ctr = ctRegex!(`x=(-*\d+), y=(-*\d+):.*x=(-*\d+), y=(-*\d+)`); CommPair cptemp; foreach (c; matchAll(s, ctr)) { cptemp.sx = to!int(c[1]); cptemp.sy = to!int(c[2]); cptemp.bx = to!int(c[3]); cptemp.by = to!int(c[4]); cptemp.manhattanDistance = abs(cptemp.sx-cptemp.bx) + abs(cptemp.sy-cptemp.by); commPair ~= cptemp; } } struct Span { int xLow, xHigh; bool inUse = true; } void createSpansOfNoBeacons(int lineOfInterest, CommPair[] commPair,ref Span[] span) { foreach(size_t i,cp;commPair) { int distanceToLineOfInterest = abs(cp.sy - lineOfInterest); if(cp.manhattanDistance >= distanceToLineOfInterest) { int xLow = cp.sx - (cp.manhattanDistance - distanceToLineOfInterest); int xHigh = cp.sx + (cp.manhattanDistance - distanceToLineOfInterest); span ~= Span(xLow,xHigh); } } } void combineOverlappingSpans(ref Span[] span) { bool combinedSpansThisCycle = true; while(combinedSpansThisCycle) { combinedSpansThisCycle = false; for(size_t i=0; i < span.length-1; i++) { if(!span[i].inUse) continue; for(size_t j=i+1; j < span.length; j++) { if(!span[j].inUse) continue; // if one span overlaps with the other, combine them into one span if(spanIContainedInSpanJ(span[i],span[j]) || spanJContainedInSpanI(span[i],span[j])) { span[i].xLow = span[i].xLow < span[j].xLow ? span[i].xLow : span[j].xLow; span[i].xHigh = span[i].xHigh > span[j].xHigh ? span[i].xHigh : span[j].xHigh; span[j].inUse = false; combinedSpansThisCycle = true; // after combining two spans, perform bounds checking // 15 part b limits the search between 0 and 4,000,000 span[i].xLow = span[i].xLow < 0 ? 0 : span[i].xLow; span[i].xHigh = span[i].xHigh > 4_000_000 ? 4_000_000 : span[i].xHigh; } } } } } bool spanIContainedInSpanJ (Span spanI, Span spanJ) { return (spanI.xLow >= spanJ.xLow && spanI.xLow <= spanJ.xHigh) || (spanI.xHigh>= spanJ.xLow && spanI.xHigh<= spanJ.xHigh); } bool spanJContainedInSpanI (Span spanI, Span spanJ) { return (spanJ.xLow >= spanI.xLow && spanJ.xLow <= spanI.xHigh) || (spanJ.xHigh>= spanI.xLow && spanJ.xHigh<= spanI.xHigh); } int beaconFreeLocations(ref Span[] span) { int noBeaconCount = 0; foreach(sp;span) if(sp.inUse) noBeaconCount += sp.xHigh - sp.xLow + 1; return noBeaconCount; } ```
Apr 03 2023
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/3/23 7:22 PM, Paul wrote:
 ```d
 // Timed main() 
 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 void main(string[] args) {
 auto progStartTime = MonoTime.currTime;
 //-----------------------------------------------------------------------------
      string filename = args.length > 1 ? args[1] : "aoc2215a.data";
      CommPair[] commPair;
      ulong x,y;
 
      // read file that has data sets in the form of x,y coordinate pairs
      // for each sensor-beacon pair.  Create on array of structs to hold
      // this information.
      loadFileDataIntoArrayOfStructs(commPair, filename);
 
      foreach(int lineOfInterest;parallel(iota(0,4_000_001))) {
          Span[] span; // A section of line-of-interest coordinates
where 
 no other beacons are present.
          const spanReserve = span.reserve(50);
          createSpansOfNoBeacons(lineOfInterest,commPair,span);
 
          // if spans overlap, combine them into a single span and mark
          // the other spans !inUse.
          combineOverlappingSpans(span);
 
          // look for a line that doesn't have 4,000,001 locations 
 accounted for
          if(beaconFreeLocations(span) < 4_000_001) {
 
              // find the location that is not accounted for
              foreach(ulong i;0..4_000_000) {
                  bool found = false;
                  foreach(sp;span) {
                      if(i >= sp.xLow && i <= sp.xHigh) {
                          found = true;
                          break;
                      }
                  }
                  if(!found) {
                      x = i; y = lineOfInterest;
                      break;
                  }
              }
          }
      }
 writeln(x," ",y);
 
 ```
So I just quoted your main loop. I am assuming that this O(n^2) algorithm doesn't actually run for all iterations, because that wouldn't be feasible (16 trillion iterations is a lot). This means that I'm assuming a lot of cases do not run the second loop. Everything you do besides prune the second loop is mostly allocating an array of `Span` types. This means most of the parallel loops are allocating, and doing nothing else. As I said earlier, allocations need a global lock of the GC. What you need to do probably, is to avoid these allocations per loop. The easiest thing I can think of is to store the Span array as a static array of the largest array you need (i.e. the length of `commPair`), and then slice it instead of appending. So what you need is inside `createSpansOfNoBeacons`, take as a reference a `ref Span[MAX_SPANS]`, and have it return a `Span[]` that is a slice of that which was "alocated". See if this helps. FWIW, I did the AoC 2022 as well, and I never needed parallel execution. Looking at my solution comment in reddit, this one I sort of punted by knowing I could exit as soon as the answer is found (my solution runs in 2.5s on my input). But I recommend (once you are done), reading this post, it is a really cool way to look at it: https://www.reddit.com/r/adventofcode/comments/zmcn64/2022_day_15_solutions/j0hl19a/?context=8&depth=9 -Steve
Apr 03 2023
parent reply Paul <phshaffer gmail.com> writes:
On Monday, 3 April 2023 at 23:50:48 UTC, Steven Schveighoffer 
wrote:
 So what you need is inside `createSpansOfNoBeacons`, take as a 
 reference a `ref Span[MAX_SPANS]`, and have it return a 
 `Span[]` that is a slice of that which was "alocated".

 See if this helps.
Well Steven just making the change you said reduced the execution time from ~6-7 secs to ~3 secs. Then, including the 'parallel' in the foreach statement took it down to ~1 sec. Boy lesson learned in appending-to and zeroing dynamic arrays in a hot loop!
Apr 04 2023
parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Tue, Apr 04, 2023 at 09:35:29PM +0000, Paul via Digitalmars-d-learn wrote:
[...]
 Well Steven just making the change you said reduced the execution time
 from ~6-7 secs to ~3 secs.  Then, including the 'parallel' in the
 foreach statement took it down to ~1 sec.
 
 Boy lesson learned in appending-to and zeroing dynamic arrays in a hot
 loop!
Best practices for arrays in hot loops: - Avoid appending if possible; instead, pre-allocate outside the loop. - Where possible, reuse existing arrays instead of discarding old ones and allocating new ones. - Use slices where possible instead of making copies of subarrays (this esp. applies to strings). - Where possible, prefer sequential access over random access (take advantage of the CPU cache hierarchy). T -- Famous last words: I *think* this will work...
Apr 04 2023
parent reply Paul <phshaffer gmail.com> writes:
On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:

 Best practices for arrays in hot loops:
 - Avoid appending if possible; instead, pre-allocate outside 
 the loop.
 - Where possible, reuse existing arrays instead of discarding 
 old ones
   and allocating new ones.
 - Use slices where possible instead of making copies of 
 subarrays (this
   esp. applies to strings).
 - Where possible, prefer sequential access over random access 
 (take
   advantage of the CPU cache hierarchy).
Thanks for sharing Teoh! Very helpful. would this be random access? for(size_t i; i<arr.length; i++) using indices? ...and this be sequential foreach(a;arr) ? or would they have to be completely different kinds of containers? a dlang 'range' vs arr[]?
Apr 05 2023
next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/5/23 6:34 PM, Paul wrote:
 On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
 
 Best practices for arrays in hot loops:
 - Avoid appending if possible; instead, pre-allocate outside the loop.
 - Where possible, reuse existing arrays instead of discarding old ones
   and allocating new ones.
 - Use slices where possible instead of making copies of subarrays (this
   esp. applies to strings).
 - Where possible, prefer sequential access over random access (take
   advantage of the CPU cache hierarchy).
Thanks for sharing Teoh!  Very helpful. would this be random access? for(size_t i; i<arr.length; i++) using indices? ...and this be sequential foreach(a;arr) ?
No, random access is access out of sequence. Those two lines are pretty much equivalent, and even a naive compiler is going to produce exactly the same generated code from both of them. A classic example is processing a 2d array: ```d for(int i = 0; i < arr[0].length; ++i) for(int j = 0; j < arr.length; ++j) arr[j][i]++; // vs for(int j = 0; j < arr.length; ++j) for(int i = 0; i < arr[0].length; ++i) arr[j][i]++; ``` The first accesses elements *by column*, which means that the array data is accessed non-linearly in memory. To be fair, both are "linear" in terms of algorithm, but one is going to be faster because of cache coherency (you are accessing sequential *hardware addresses*). -Steve
Apr 05 2023
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Wed, Apr 05, 2023 at 10:34:22PM +0000, Paul via Digitalmars-d-learn wrote:
 On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
 
 Best practices for arrays in hot loops:
[...]
 - Where possible, prefer sequential access over random access (take
   advantage of the CPU cache hierarchy).
Thanks for sharing Teoh! Very helpful. would this be random access? for(size_t i; i<arr.length; i++) using indices? ...and this be sequential foreach(a;arr) ? or would they have to be completely different kinds of containers? a dlang 'range' vs arr[]?
[...] The exact syntactic construct you use is not important; under the hood, for(i; i<arr.length; i++) and foreach(a;arr) are exactly the same thing. What's important is the actual pattern of memory access. Looping from the first element of an array to the last element is sequential access; doing binary search on an array is random access (because it's unpredictable where the next access will be). Traversing a linked list is also random access, because in general individual nodes will be stored in different places in memory. So doing a hashtable lookup. Basically, modern CPUs have a cache hierarchy and memory prefetching controlled by the access predictor. When the predictor sees memory being accessed sequentially, it can speed up future accesses by preloading subsequent blocks of memory into cache lines so that when the CPU next tries to read from that location it's already in the cache and it doesn't have to wait for a fetch cycle from RAM, which is slow. The bad thing about things like hashtables is because it's unpredictable where the next access will be, so the CPU's predictor won't be able to preload the next access for you. So you'll incur a fetch cycle from RAM every time. Of course, this doesn't mean that hashtable (or other data structure) lookups are necessarily bad. If the alternative is linear scan through very large arrays, then hashtables will speed up your inner loop in spite of the cache misses. So it depends on your exact use case. But if your arrays are not overly large, scanning through the entire array may sometimes be faster than doing many hashtable lookups. Some modern hashtable implementations are also adapted to take advantage of the cache hierarchy, so it may not be as bad as a naïve implementation. Nevertheless, the general principle is that locality (adjacent memory accesses are close to each other) is good, and should generally be preferred over strategies that require jumping around in memory. So your data structures and algorithms should be designed in a way that takes advantage of linear access where possible. T -- In theory, there is no difference between theory and practice.
Apr 05 2023
parent reply Paul <phshaffer gmail.com> writes:
On Wednesday, 5 April 2023 at 23:06:54 UTC, H. S. Teoh wrote:

 So your data structures and algorithms should be designed in a 
 way that takes advantage of linear access where possible.


 T
Yes I understand, basically, what's going on in hardware. I just wasn't sure if the access type was linked to the container type. It seems obvious now, since you've both made it clear, that it also depends on how I'm accessing my container. Having said all of this, isn't a D 'range' fundamentally a sequential access container (i.e popFront) ?
Apr 05 2023
parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Apr 06, 2023 at 01:20:28AM +0000, Paul via Digitalmars-d-learn wrote:
[...]
 Yes I understand, basically, what's going on in hardware.  I just
 wasn't sure if the access type was linked to the container type.  It
 seems obvious now, since you've both made it clear, that it also
 depends on how I'm accessing my container.
 
 Having said all of this, isn't a D 'range' fundamentally a sequential
 access container (i.e popFront) ?
D ranges are conceptually sequential, but the actual underlying memory access patterns depends on the concrete type at runtime. An array's elements are stored sequentially in memory, and arrays are ranges. But a linked-list can also have a range interface, yet its elements may be stored in non-consecutive memory locations. So the concrete type matters here; the range API only gives you conceptual sequentiality, it does not guarantee physically sequential memory access. T -- Many open minds should be closed for repairs. -- K5 user
Apr 05 2023
parent Paul <phshaffer gmail.com> writes:
On Thursday, 6 April 2023 at 01:44:15 UTC, H. S. Teoh wrote:

 D ranges are conceptually sequential, but the actual underlying 
 memory access patterns depends on the concrete type at runtime. 
 An array's elements are stored sequentially in memory, and 
 arrays are ranges.  But a linked-list can also have a range 
 interface, yet its elements may be stored in non-consecutive 
 memory locations.  So the concrete type matters here; the range 
 API only gives you conceptual sequentiality, it does not 
 guarantee physically sequential memory access.
Very helpful Teoh. Thanks again.
Apr 06 2023