digitalmars.D.learn - foreach (i; taskPool.parallel(0..2_000

digitalmars.D.learn - foreach (i; taskPool.parallel(0..2_000_000)

Paul (7/7) Apr 01 2023 Thanks in advance for any assistance.

Steven Schveighoffer (6/14) Apr 01 2023 ```d

Paul (1/1) Apr 01 2023 Thanks Steve.
Paul (5/10) Apr 01 2023 Is there a way to verify that it split up the work in to

=?UTF-8?Q?Ali_=c3=87ehreli?= (31/33) Apr 01 2023 It is hard to see the difference unless there is actual work in the loop...

Salih Dincer (71/77) Apr 01 2023 I always use the Rowland Sequence for such experiments. At least

Salih Dincer (25/26) Apr 02 2023 **Edit:** I saw, I saw :)

Salih Dincer (53/72) Apr 04 2023 I don't understand what `foreach()` does :)

=?UTF-8?Q?Ali_=c3=87ehreli?= (10/12) Apr 04 2023 Hm. I forgot whether 'parallel' works only with 'foreach'. But there are...
Steven Schveighoffer (12/14) Apr 04 2023 parallel is a shortcut to `TaskPool.parallel`, which is indeed a

Salih Dincer (26/38) Apr 04 2023 I tried, thanks but it goes into infinite loop. For example, the

Steven Schveighoffer (5/42) Apr 04 2023 Keep in mind that `arr` and `range` are thread-local, and so will be

Salih Dincer (108/149) Apr 05 2023 I guess the reason it goes into an infinite loop is because gcd()

Paul (4/10) Apr 01 2023 Is there a way to tell if the parallelism actually divided up the

Steven Schveighoffer (9/21) Apr 02 2023 It's important to note that parallel doesn't iterate the range in

Paul (7/16) Apr 03 2023 **?!?**

Steven Schveighoffer (25/37) Apr 03 2023 So for example, if you have:

Paul (7/13) Apr 03 2023 **No**

Steven Schveighoffer (3/16) Apr 03 2023 Yeah, please post.

Paul (146/147) Apr 03 2023 ```d

Steven Schveighoffer (25/73) Apr 03 2023 So I just quoted your main loop.

Paul (7/11) Apr 04 2023 Well Steven just making the change you said reduced the execution

H. S. Teoh (13/19) Apr 04 2023 Best practices for arrays in hot loops:

Paul (7/19) Apr 05 2023 Thanks for sharing Teoh! Very helpful.

Steven Schveighoffer (20/36) Apr 05 2023 No, random access is access out of sequence. Those two lines are pretty
H. S. Teoh (37/50) Apr 05 2023 [...]

Paul (7/10) Apr 05 2023 Yes I understand, basically, what's going on in hardware. I just

H. S. Teoh (12/19) Apr 05 2023 D ranges are conceptually sequential, but the actual underlying memory

Paul (2/10) Apr 06 2023 Very helpful Teoh. Thanks again.

Paul <phshaffer gmail.com> writes:

Thanks in advance for any assistance.

As the subject line suggests can I do something like? :
```d
foreach (i; taskPool.parallel(0..2_000_000))
```
Obviously this exact syntax doesn't work but I think it expresses 
the gist of my challenge.

Apr 01 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/1/23 2:25 PM, Paul wrote:
 Thanks in advance for any assistance.
 
 As the subject line suggests can I do something like? :
 ```d
 foreach (i; taskPool.parallel(0..2_000_000))
 ```
 Obviously this exact syntax doesn't work but I think it expresses the 
 gist of my challenge.


```d
import std.range;

foreach(; iota(0, 2_000_000).parallel)
```

-Steve

Apr 01 2023

Paul <phshaffer gmail.com> writes:

Thanks Steve.

Apr 01 2023

Paul <phshaffer gmail.com> writes:

 ```d
 import std.range;

 foreach(; iota(0, 2_000_000).parallel)
 ```

 -Steve

Is there a way to verify that it split up the work in to 
tasks/threads ...?  The example you gave me works...compiles w/o 
errors but the execution time is the same as the non-parallel 
version.  They both take about 6 secs to execute.  totalCPUs 
tells me I have 8 CPUs available.

Apr 01 2023

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 4/1/23 15:30, Paul wrote:

 Is there a way to verify that it split up the work in to tasks/threads
 ...?

It is hard to see the difference unless there is actual work in the loop 
that takes time. You can add a Thread.sleep call. (Commented-out in the 
following program.)

Another option is to monitor a task manager like 'top' on unix based 
systems. It should multiple threads for the same program.

However, I will do something unspeakably wrong and take advantage of 
undefined behavior below. :) Since iteration count is an even number, 
the 'sum' variable should come out as 0 in the end. With .parallel it 
doesn't because multiple threads are stepping on each other's toes (values):

import std;

void main() {
     long sum;

     foreach(i; iota(0, 2_000_000).parallel) {
         // import core.thread;
         // Thread.sleep(1.msecs);

         if (i % 2) {
             ++sum;

         } else {
             --sum;
         }
     }

     if (sum == 0) {
         writeln("We highly likely worked serially.");

     } else {
         writefln!"We highly likely worked in parallel because %s != 
0."(sum);
     }
}

If you remove .parallel, 'sum' will always be 0.

Ali

Apr 01 2023

Salih Dincer <salihdb hotmail.com> writes:

On Saturday, 1 April 2023 at 22:48:46 UTC, Ali Çehreli wrote:
 On 4/1/23 15:30, Paul wrote:

 Is there a way to verify that it split up the work in to

 tasks/threads
 ...?

 It is hard to see the difference unless there is actual work in 
 the loop that takes time.

I always use the Rowland Sequence for such experiments.  At least 
it's better than the Fibonacci Range:

```d
struct RowlandSequence {
   import std.numeric : gcd;
   import std.format : format;
   import std.conv : text;

   long b, r, a = 3;
   enum empty = false;

   string[] front() {
     string result = format("%s, %s", b, r);
     return [text(a), result];
   }

   void popFront() {
     long result = 1;
     while(result == 1) {
       result = gcd(r++, b);
       b += result;
     }
     a = result;
   }
}

enum BP {
   f = 1, b = 7, r = 2, a = 1, /*
   f = 109, b = 186837516, r = 62279173, //*/
   s = 5
}

void main()
{
   RowlandSequence rs;
   long start, skip;

   with(BP) {
     rs = RowlandSequence(b, r);
     start = f;
     skip = s;
   }
   rs.popFront();

   import std.stdio, std.parallelism;
   import std.range : take;

   auto rsFirst128 = rs.take(128);
   foreach(r; rsFirst128.parallel)
   {
     if(r[0].length > skip)
     {
       start.writeln(": ", r);
     }
     start++;
   }
} /* PRINTS:

46: ["121403", "364209, 121404"]
48: ["242807", "728421, 242808"]
68: ["486041", "1458123, 486042"]
74: ["972533", "2917599, 972534"]
78: ["1945649", "5836947, 1945650"]
82: ["3891467", "11674401, 3891468"]
90: ["7783541", "23350623, 7783542"]
93: ["15567089", "46701267, 15567090"]
102: ["31139561", "93418683, 31139562"]
108: ["62279171", "186837513, 62279172"]

*/
```

The operation is simple, again multiplication, addition, 
subtraction and module, i.e. So four operations but enough to 
overrun the CPU! I haven't seen rsFirst256 until now because I 
don't have a fast enough processor. Maybe you'll see it, but the 
first 108 is fast anyway.

**PS:** Decrease value of the `skip` to see the entire sequence. 
In cases where your processor power is not enough, you can create 
skip points.  Check out BP...

SDB 79

Apr 01 2023

Salih Dincer <salihdb hotmail.com> writes:

On Sunday, 2 April 2023 at 04:34:40 UTC, Salih Dincer wrote:
 I haven't seen rsFirst256 until now...

**Edit:** I saw, I saw :)

I am struck with consternation! I've never seen these results 
before. Interesting, there is such a thing as parallel threading 
:)

Here are my skipPoints:

```d
enum BP : long {
   //f, a, r, b = 7, /* <- beginning
    f = 113, r =   62279227, b =   186837678,
   // f = 146, r =  249134971, b =   747404910,
   // f = 161, r =  498270808, b =  1494812421,
   // f = 178, r = 1993083484, b =  5979250449,
   // f = 210, r = 3986167363, b = 11958502086,
   //*/
   s = 5
} /* PRINTS:
eLab pico:~/Projeler$ ./RownlandSequence_v2
122: ["124559610, 373678827"]
128: ["249120240, 747360717"]
*/
```

It looks like there are 5 total skipPoints until 256 where it 
loops for a long time. (while passing 1's).

SDB 79

Apr 02 2023

Salih Dincer <salihdb hotmail.com> writes:

On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer 
wrote:
 So for example, if you have:

 ```d
 foreach(i; iota(0, 2_000_000).parallel)
 {
    runExpensiveTask(i);
 }
 ```

 The foreach is run on the main thread, gets a `0`, then hands 
 off to a task thread `runExpensiveTask(0)`. Then it gets a `1`, 
 and hands off to a task thread `runExpensiveTask(1)`, etc. The 
 iteration is not expensive, and is not done in parallel.

 On the other hand, what you *shouldn't* do is:

 ```d
 foreach(i; iota(0, 2_000_000).map!(x => 
 runExpensiveTask(x)).parallel)
 {
 }
 ```

 as this will run the expensive task *before* running any tasks.

I don't understand what `foreach()` does :)
```d
import std.range, std.algorithm : map;
import std.stdio, std.parallelism;
//import sdb.sequences : RowlandSequence_v2;/*
struct RowlandSequence_v2 {
   import std.numeric : gcd;

   long b, r, a = 3;
   enum empty = false;
   auto front() => a;
   void popFront() {
     long result = 1;
     while(result == 1) {
       result = gcd(r++, b);
       b += result;
     }
     a = result;
   }
}//*/

enum BP : long {
    // s, f, r, b = 7, /* <- beginning
    s = 178, r = 1993083484, b =  5979250449,//*/
    len = 190
}

void main()
{
   with(BP) {
     long[len] arr;
     auto range = RowlandSequence_v2(b, r);

     s.iota(len)
      .map!((a){
                 range.popFront();
                 return arr[a] = range.front();
               }
           )
      .parallel
      .writeln;
   }
} /* PRINTS:

ParallelForeach!(MapResult!(__lambda3, 
Result))(std.parallelism.TaskPool, [5, 3, 73, 157, 7, 5, 3, 13, 
3986167223, 3, 7, 73], 1)

*/
```

Is it necessary to enclose the code in `foreach()`? I invite Ali 
to tell me! Please explain why parallel isn't running.

"Ben anlamıyor, foreach ne yapıyor 😀 Kodu `foreach()` içine almak 
şart mı? Ali'yi davet ediyorum, bana anlatması için! Paralel() 
niye çalışmıyor, lütfen açıklayın hocam. Mümkünse Türkçe!" in 
Turkish.

SDB 79

Apr 04 2023

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 4/4/23 02:24, Salih Dincer wrote:

 I don't understand what `foreach()` does :)

Hm. I forgot whether 'parallel' works only with 'foreach'. But there are 
various other algorithms in std.parallelism that may be more useful with 
range algorithm chains:

   https://dlang.org/phobos/std_parallelism.html

 in Turkish

I don't have time to experiment more at this time but I have the 
following chapters, which includes some of those other algorithms as well:

   http://ddili.org/ders/d/kosut_islemler.html

   http://ddili.org/ders/d.en/parallelism.html

Ali

Apr 04 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/4/23 5:24 AM, Salih Dincer wrote:

 Is it necessary to enclose the code in `foreach()`? I invite Ali to tell 
 me! Please explain why parallel isn't running.

parallel is a shortcut to `TaskPool.parallel`, which is indeed a 
foreach-only construct, it does not return a range.

I think what you want is `TaskPool.map`:

```d
// untested, just looking at the
taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
```

Can't use pipelining with it, because it is a member function.

https://dlang.org/phobos/std_parallelism.html#.TaskPool.map

-Steve

Apr 04 2023

Salih Dincer <salihdb hotmail.com> writes:

On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer 
wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is indeed 
 a foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```


I tried, thanks but it goes into infinite loop. For example, the 
first 50 of the sequence should have been printed to the screen 
immediately without waiting.

```d
long[50] arr;
RowlandSequence_v2 range;

auto next(long a)
{
   range.popFront();
   return arr[a] = range.front();
}

void main()
{
     range = RowlandSequence_v2(7, 2);
     taskPool.map!next(iota(50))/*
     s.iota(50)
      .map!next
      .parallel//*/
      .writeln;
}
```

On Tuesday, 4 April 2023 at 13:18:01 UTC, Ali Çehreli wrote:
 I don't have time to experiment more at this time but I have 
 the following chapters, which includes some of those other 
 algorithms as well:

   http://ddili.org/ders/d/kosut_islemler.html

I read it, thanks...

SDB 79

Apr 04 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/4/23 11:34 AM, Salih Dincer wrote:
 On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is indeed a 
 foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```

 
 
 I tried, thanks but it goes into infinite loop. For example, the first 
 50 of the sequence should have been printed to the screen immediately 
 without waiting.
 
 ```d
 long[50] arr;
 RowlandSequence_v2 range;
 
 auto next(long a)
 {
    range.popFront();
    return arr[a] = range.front();
 }
 
 void main()
 {
      range = RowlandSequence_v2(7, 2);
      taskPool.map!next(iota(50))/*
      s.iota(50)
       .map!next
       .parallel//*/
       .writeln;
 }
 ```

Keep in mind that `arr` and `range` are thread-local, and so will be 
different states for different tasks.

Though I don't really know what you are doing there.

-Steve

Apr 04 2023

Salih Dincer <salihdb hotmail.com> writes:

On Tuesday, 4 April 2023 at 16:22:29 UTC, Steven Schveighoffer 
wrote:
 On 4/4/23 11:34 AM, Salih Dincer wrote:
 On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer 
 wrote:
 parallel is a shortcut to `TaskPool.parallel`, which is 
 indeed a foreach-only construct, it does not return a range.

 I think what you want is `TaskPool.map`:

 ```d
 // untested, just looking at the
 taskPool.map!(/* your map function here */)
    (s.iota(len)).writeln;
 ```

 
 
 I tried, thanks but it goes into infinite loop. For example, 
 the first 50 of the sequence should have been printed to the 
 screen immediately without waiting.
 
 ```d
 long[50] arr;
 RowlandSequence_v2 range;
 
 auto next(long a)
 {
    range.popFront();
    return arr[a] = range.front();
 }
 
 void main()
 {
      range = RowlandSequence_v2(7, 2);
      taskPool.map!next(iota(50))/*
      s.iota(50)
       .map!next
       .parallel//*/
       .writeln;
 }
 ```

 Keep in mind that `arr` and `range` are thread-local, and so 
 will be different states for different tasks.

I guess the reason it goes into an infinite loop is because gcd() 
a recursive function (gcd). This is the only solution I can think 
of about this:

```d
import std.range, std.algorithm : map;
import std.stdio, std.parallelism;
//import std.numeric : gcd;
/*
struct Vector {
   long x, y, result;
   alias result this;
}

Vector gcd(long a, long b) {
   if(b == 0) return Vector(1, 0, a);

   auto pre = gcd(b, a % b);
   auto tmp = (a / b) * pre.y;

   return Vector(pre.y, pre.x - tmp, pre.result);
}//*/

struct RowlandSequence_v3 {
   long b, r, n, a = 3, limit;

   bool empty() { return n == limit; }

   auto front() { return a; }

   void popFront() {
     long result = 1;
     while(result == 1) {
       result = gcd(r++, b);
       b += result;
     }
     a = result;
   }

   long gcd(long a, long b) {
     long c;
     while(b != 0) {
       c = a % b;
       a = b;
       b = c;
     }
     return a;
   }
}

auto next(ref RowlandSequence_v3 range) {
   with(range) {
     if(empty) return [n, front];
     popFront();
     return [n++, front];
   }
}

long[179] arr;

void main()
{
   // initialization:
   RowlandSequence_v3[4] r = [
   RowlandSequence_v3(7 , 2, 0, 3, 112),
   RowlandSequence_v3(186837678, 62279227, 112, 3, 145),
   RowlandSequence_v3(747404910, 249134971, 145, 6257, 160),
   RowlandSequence_v3(1494812421, 498270808, 160, 11, 177)
   ];

   auto tasks = [ task(&next, r[0]),
                  task(&next, r[1]),
                  task(&next, r[2]),
                  task(&next, r[3])
                ];

   // quad parallel operation:
   foreach(_; 0..r[0].limit)
   {
     foreach(p, ref task; tasks)
     {
         task.executeInNewThread;
         auto t = task.workForce;
         arr[t[0]] = t[1];
     }
   }

   // prints...
   foreach(x, n; arr) {
     switch(x + 1) {
       case 112, 145, 160:
         n.writeln; break;
       default:
         n.write(", ");
     }
   }
} /* PRINTS:
user debian:~/Documents$ dmd -O rowlandSequence.d -release
user debian:~/Documents$ time ./rowlandSequence
5, 3, 11, 3, 23, 3, 47, 3, 5, 3, 101, 3, 7, 11, 3, 13, 233, 3, 
467, 3, 5, 3, 941, 3, 7, 1889, 3, 3779, 3, 7559, 3, 13, 15131, 3, 
53, 3, 7, 30323, 3, 60647, 3, 5, 3, 101, 3, 121403, 3, 242807, 3, 
5, 3, 19, 7, 5, 3, 47, 3, 37, 5, 3, 17, 3, 199, 53, 3, 29, 3, 
486041, 3, 7, 421, 23, 3, 972533, 3, 577, 7, 1945649, 3, 163, 7, 
3891467, 3, 5, 3, 127, 443, 3, 31, 7783541, 3, 7, 15567089, 3, 
19, 29, 3, 5323, 7, 5, 3, 31139561, 3, 41, 3, 5, 3, 62279171, 3, 
7, 83, 3
29, 3, 1103, 3, 5, 3, 13, 7, 124559609, 3, 107, 3, 911, 3, 
249120239, 3, 11, 3, 7, 61, 37, 179, 3, 31, 19051, 7, 3793, 23, 
3, 5, 3, 6257, 3
3, 11, 3, 13, 5, 3, 739, 37, 5, 3, 498270791, 3, 19, 11, 3
3, 3, 5, 3, 996541661, 3, 7, 37, 5, 3, 67, 1993083437, 3, 5, 3, 
83, 3, 3, 0,
real	7m54.093s
user	7m54.062s
sys	0m0.024s
*/
```

However, parallel processing for 4-digit sequence elements is not 
promising at least for the Rowland Sequence.

SDB 79

Apr 05 2023

Paul <phshaffer gmail.com> writes:

On Saturday, 1 April 2023 at 18:30:32 UTC, Steven Schveighoffer 
wrote:
 On 4/1/23 2:25 PM, Paul wrote:

 ```d
 import std.range;

 foreach(; iota(0, 2_000_000).parallel)
 ```

 -Steve

Is there a way to tell if the parallelism actually divided up the 
work?  Both versions of my program run in the same time ~6 secs.

Apr 01 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/1/23 6:32 PM, Paul wrote:
 On Saturday, 1 April 2023 at 18:30:32 UTC, Steven Schveighoffer wrote:
 On 4/1/23 2:25 PM, Paul wrote:

 ```d
 import std.range;

 foreach(i; iota(0, 2_000_000).parallel)
 ```

 
 Is there a way to tell if the parallelism actually divided up the work?  
 Both versions of my program run in the same time ~6 secs.

It's important to note that parallel doesn't iterate the range in 
parallel, it just runs the body in parallel limited by your CPU count.

If your `foreach` body takes a global lock (like `writeln(i);`), then 
it's not going to run any faster (probably slower actually).

If you can disclose more about what you are trying to do, it would be 
helpful.

Also make sure you have more than one logical CPU.

-Steve

Apr 02 2023

Paul <phshaffer gmail.com> writes:

On Sunday, 2 April 2023 at 15:32:05 UTC, Steven Schveighoffer 
wrote:

 It's important to note that parallel doesn't iterate the range 
 in parallel, it just runs the body in parallel limited by your 
 CPU count.

**?!?**

 If your `foreach` body takes a global lock (like 
 `writeln(i);`), then it's not going to run any faster (probably 
 slower actually).

**Ok I did have some debug writelns I commented out.**

 If you can disclose more about what you are trying to do, it 
 would be helpful.

**This seems like it would be a lot of code and explaining but 
let me think about how to summarize.**

 Also make sure you have more than one logical CPU.

**I have 8.**

Apr 03 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/3/23 6:02 PM, Paul wrote:
 On Sunday, 2 April 2023 at 15:32:05 UTC, Steven Schveighoffer wrote:
 
 It's important to note that parallel doesn't iterate the range in 
 parallel, it just runs the body in parallel limited by your CPU count.

 **?!?**

So for example, if you have:

```d
foreach(i; iota(0, 2_000_000).parallel)
{
    runExpensiveTask(i);
}
```

The foreach is run on the main thread, gets a `0`, then hands off to a 
task thread `runExpensiveTask(0)`. Then it gets a `1`, and hands off to 
a task thread `runExpensiveTask(1)`, etc. The iteration is not 
expensive, and is not done in parallel.

On the other hand, what you *shouldn't* do is:

```d
foreach(i; iota(0, 2_000_000).map!(x => runExpensiveTask(x)).parallel)
{
}
```

as this will run the expensive task *before* running any tasks.

 
 If your `foreach` body takes a global lock (like `writeln(i);`), then 
 it's not going to run any faster (probably slower actually).

 **Ok I did have some debug writelns I commented out.**

And did it help? Another thing that takes a global lock is memory 
allocation.

 Also make sure you have more than one logical CPU.

 **I have 8.**

It's dependent on the work being done, but you should see a roughly 8x 
speedup as long as the overhead of distributing tasks is not significant 
compared to the work being done.

-Steve

Apr 03 2023

Paul <phshaffer gmail.com> writes:

On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer 
wrote:

 
 If your `foreach` body takes a global lock (like 
 `writeln(i);`), then it's not going to run any faster 
 (probably slower actually).

 **Ok I did have some debug writelns I commented out.**

 And did it help?

**No**

My program is about 140 lines Steven.  Its just one of the Advent 
of Code challenges.  Could I past the whole program here and see 
what you think?

Thanks for your assistance...much appreciated.

Apr 03 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/3/23 6:56 PM, Paul wrote:
 On Monday, 3 April 2023 at 22:24:18 UTC, Steven Schveighoffer wrote:
 
 If your `foreach` body takes a global lock (like `writeln(i);`), 
 then it's not going to run any faster (probably slower actually).

 **Ok I did have some debug writelns I commented out.**

 And did it help?

 **No**
 
 My program is about 140 lines Steven.  Its just one of the Advent of 
 Code challenges.  Could I past the whole program here and see what you 
 think?

Yeah, please post.

-Steve

Apr 03 2023

Paul <phshaffer gmail.com> writes:

On Monday, 3 April 2023 at 23:13:58 UTC, Steven Schveighoffer 
wrote:
 Yeah, please post.

```d
module aoc2215b2;

import std.stdio;
import std.file: readText;
import std.conv: to;
import std.math: abs;
import std.traits;
import std.parallelism;
import std.range;
import core.time: MonoTime;

// Timed main() 
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
void main(string[] args) {
auto progStartTime = MonoTime.currTime;
//-----------------------------------------------------------------------------
	string filename = args.length > 1 ? args[1] : "aoc2215a.data";
	CommPair[] commPair;
	ulong x,y;

	// read file that has data sets in the form of x,y coordinate 
pairs
	// for each sensor-beacon pair.  Create on array of structs to 
hold
	// this information.
	loadFileDataIntoArrayOfStructs(commPair, filename);
	
	foreach(int lineOfInterest;parallel(iota(0,4_000_001))) {
		Span[] span; // A section of line-of-interest coordinates where 
no other beacons are present.
		const spanReserve = span.reserve(50);
		createSpansOfNoBeacons(lineOfInterest,commPair,span);

		// if spans overlap, combine them into a single span and mark
		// the other spans !inUse.
		combineOverlappingSpans(span);

		// look for a line that doesn't have 4,000,001 locations 
accounted for
		if(beaconFreeLocations(span) < 4_000_001) {

			// find the location that is not accounted for
			foreach(ulong i;0..4_000_000) {
				bool found = false;
				foreach(sp;span) {
					if(i >= sp.xLow && i <= sp.xHigh) {
						found = true;
						break;
					}
				}
				if(!found) {
					x = i; y = lineOfInterest;
					break;
				}
			}
		}
	}
writeln(x," ",y);

//-----------------------------------------------------------------------------
auto progEndTime = MonoTime.currTime;
writeln(progEndTime - progStartTime);
}
// Timed main() 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^



struct CommPair {
	int sx,sy,bx,by;
	int manhattanDistance;
}

void loadFileDataIntoArrayOfStructs(ref CommPair[] commPair, 
string filename) {
	import std.regex;
	auto s = readText(filename);
	auto ctr = ctRegex!(`x=(-*\d+), y=(-*\d+):.*x=(-*\d+), 
y=(-*\d+)`);
	CommPair cptemp;
	foreach (c; matchAll(s, ctr)) {
		cptemp.sx = to!int(c[1]);
		cptemp.sy = to!int(c[2]);
		cptemp.bx = to!int(c[3]);
		cptemp.by = to!int(c[4]);
		cptemp.manhattanDistance = abs(cptemp.sx-cptemp.bx) + 
abs(cptemp.sy-cptemp.by);
		commPair ~= cptemp;
	}
}

struct Span {
	int xLow, xHigh;
	bool inUse = true;
}

void createSpansOfNoBeacons(int lineOfInterest, CommPair[] 
commPair,ref Span[] span) {
	foreach(size_t i,cp;commPair) {
		int distanceToLineOfInterest = abs(cp.sy - lineOfInterest);
		if(cp.manhattanDistance >= distanceToLineOfInterest) {
			int xLow  = cp.sx - (cp.manhattanDistance - 
distanceToLineOfInterest);
			int xHigh = cp.sx + (cp.manhattanDistance - 
distanceToLineOfInterest);
			span ~= Span(xLow,xHigh);
		}
	}
}

void combineOverlappingSpans(ref Span[] span) {
	bool combinedSpansThisCycle = true;
	while(combinedSpansThisCycle) {
		combinedSpansThisCycle = false;
		for(size_t i=0; i < span.length-1; i++) {
			if(!span[i].inUse) continue;
		
			for(size_t j=i+1; j < span.length; j++) {
				if(!span[j].inUse) continue;

				// if one span overlaps with the other, combine them into one 
span
				if(spanIContainedInSpanJ(span[i],span[j]) || 
spanJContainedInSpanI(span[i],span[j])) {
					span[i].xLow  = span[i].xLow  < span[j].xLow  ? span[i].xLow 
  : span[j].xLow;
					span[i].xHigh = span[i].xHigh > span[j].xHigh ? 
span[i].xHigh : span[j].xHigh;
					span[j].inUse = false;
					combinedSpansThisCycle = true;

					// after combining two spans, perform bounds checking
					// 15 part b limits the search between 0 and 4,000,000
					span[i].xLow  = span[i].xLow  < 0         ? 0         : 
span[i].xLow;
					span[i].xHigh = span[i].xHigh > 4_000_000 ? 4_000_000 : 
span[i].xHigh;
				}
			}
		}
	}
}

bool spanIContainedInSpanJ (Span spanI, Span spanJ) {
	return 	(spanI.xLow >= spanJ.xLow && spanI.xLow <= spanJ.xHigh) 
||
			(spanI.xHigh>= spanJ.xLow && spanI.xHigh<= spanJ.xHigh);
}

bool spanJContainedInSpanI (Span spanI, Span spanJ) {
	return 	(spanJ.xLow >= spanI.xLow && spanJ.xLow <= spanI.xHigh) 
||
			(spanJ.xHigh>= spanI.xLow && spanJ.xHigh<= spanI.xHigh);
}

int beaconFreeLocations(ref Span[] span) {
	int noBeaconCount = 0;
	foreach(sp;span) if(sp.inUse) noBeaconCount += sp.xHigh - 
sp.xLow + 1;

	return noBeaconCount;
}
```

Apr 03 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/3/23 7:22 PM, Paul wrote:
 ```d
 // Timed main() 
 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 void main(string[] args) {
 auto progStartTime = MonoTime.currTime;
 //-----------------------------------------------------------------------------
      string filename = args.length > 1 ? args[1] : "aoc2215a.data";
      CommPair[] commPair;
      ulong x,y;
 
      // read file that has data sets in the form of x,y coordinate pairs
      // for each sensor-beacon pair.  Create on array of structs to hold
      // this information.
      loadFileDataIntoArrayOfStructs(commPair, filename);
 
      foreach(int lineOfInterest;parallel(iota(0,4_000_001))) {
          Span[] span; // A section of line-of-interest coordinates
where 
 no other beacons are present.
          const spanReserve = span.reserve(50);
          createSpansOfNoBeacons(lineOfInterest,commPair,span);
 
          // if spans overlap, combine them into a single span and mark
          // the other spans !inUse.
          combineOverlappingSpans(span);
 
          // look for a line that doesn't have 4,000,001 locations 
 accounted for
          if(beaconFreeLocations(span) < 4_000_001) {
 
              // find the location that is not accounted for
              foreach(ulong i;0..4_000_000) {
                  bool found = false;
                  foreach(sp;span) {
                      if(i >= sp.xLow && i <= sp.xHigh) {
                          found = true;
                          break;
                      }
                  }
                  if(!found) {
                      x = i; y = lineOfInterest;
                      break;
                  }
              }
          }
      }
 writeln(x," ",y);
 
 ```

So I just quoted your main loop.

I am assuming that this O(n^2) algorithm doesn't actually run for all 
iterations, because that wouldn't be feasible (16 trillion iterations is 
a lot).

This means that I'm assuming a lot of cases do not run the second loop. 
Everything you do besides prune the second loop is mostly allocating an 
array of `Span` types. This means most of the parallel loops are 
allocating, and doing nothing else. As I said earlier, allocations need 
a global lock of the GC.

What you need to do probably, is to avoid these allocations per loop.

The easiest thing I can think of is to store the Span array as a static 
array of the largest array you need (i.e. the length of `commPair`), and 
then slice it instead of appending.

So what you need is inside `createSpansOfNoBeacons`, take as a reference 
a `ref Span[MAX_SPANS]`, and have it return a `Span[]` that is a slice 
of that which was "alocated".

See if this helps.

FWIW, I did the AoC 2022 as well, and I never needed parallel execution. 
Looking at my solution comment in reddit, this one I sort of punted by 
knowing I could exit as soon as the answer is found (my solution runs in 
2.5s on my input). But I recommend (once you are done), reading this 
post, it is a really cool way to look at it:

https://www.reddit.com/r/adventofcode/comments/zmcn64/2022_day_15_solutions/j0hl19a/?context=8&depth=9

-Steve

Apr 03 2023

Paul <phshaffer gmail.com> writes:

On Monday, 3 April 2023 at 23:50:48 UTC, Steven Schveighoffer 
wrote:
 So what you need is inside `createSpansOfNoBeacons`, take as a 
 reference a `ref Span[MAX_SPANS]`, and have it return a 
 `Span[]` that is a slice of that which was "alocated".

 See if this helps.

Well Steven just making the change you said reduced the execution 
time from ~6-7 secs to ~3 secs.  Then, including the 'parallel' 
in the foreach statement took it down to ~1 sec.

Boy lesson learned in appending-to and zeroing dynamic arrays in 
a hot loop!

Apr 04 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Tue, Apr 04, 2023 at 09:35:29PM +0000, Paul via Digitalmars-d-learn wrote:
[...]
 Well Steven just making the change you said reduced the execution time
 from ~6-7 secs to ~3 secs.  Then, including the 'parallel' in the
 foreach statement took it down to ~1 sec.
 
 Boy lesson learned in appending-to and zeroing dynamic arrays in a hot
 loop!

Best practices for arrays in hot loops:
- Avoid appending if possible; instead, pre-allocate outside the loop.
- Where possible, reuse existing arrays instead of discarding old ones
  and allocating new ones.
- Use slices where possible instead of making copies of subarrays (this
  esp. applies to strings).
- Where possible, prefer sequential access over random access (take
  advantage of the CPU cache hierarchy).


T

-- 
Famous last words: I *think* this will work...

Apr 04 2023

Paul <phshaffer gmail.com> writes:

On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:

 Best practices for arrays in hot loops:
 - Avoid appending if possible; instead, pre-allocate outside 
 the loop.
 - Where possible, reuse existing arrays instead of discarding 
 old ones
   and allocating new ones.
 - Use slices where possible instead of making copies of 
 subarrays (this
   esp. applies to strings).
 - Where possible, prefer sequential access over random access 
 (take
   advantage of the CPU cache hierarchy).

Thanks for sharing Teoh!  Very helpful.

would this be random access? for(size_t i; i<arr.length; i++) 
using indices?
...and this be sequential foreach(a;arr) ?

or would they have to be completely different kinds of containers?
a dlang 'range' vs arr[]?

Apr 05 2023

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/5/23 6:34 PM, Paul wrote:
 On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
 
 Best practices for arrays in hot loops:
 - Avoid appending if possible; instead, pre-allocate outside the loop.
 - Where possible, reuse existing arrays instead of discarding old ones
   and allocating new ones.
 - Use slices where possible instead of making copies of subarrays (this
   esp. applies to strings).
 - Where possible, prefer sequential access over random access (take
   advantage of the CPU cache hierarchy).

 
 Thanks for sharing Teoh!  Very helpful.
 
 would this be random access? for(size_t i; i<arr.length; i++) using 
 indices?
 ...and this be sequential foreach(a;arr) ?

No, random access is access out of sequence. Those two lines are pretty 
much equivalent, and even a naive compiler is going to produce exactly 
the same generated code from both of them.

A classic example is processing a 2d array:

```d
for(int i = 0; i < arr[0].length; ++i)
    for(int j = 0; j < arr.length; ++j)
      arr[j][i]++;

// vs
for(int j = 0; j < arr.length; ++j)
    for(int i = 0; i < arr[0].length; ++i)
      arr[j][i]++;
```

The first accesses elements *by column*, which means that the array data 
is accessed non-linearly in memory.

To be fair, both are "linear" in terms of algorithm, but one is going to 
be faster because of cache coherency (you are accessing sequential 
*hardware addresses*).

-Steve

Apr 05 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Wed, Apr 05, 2023 at 10:34:22PM +0000, Paul via Digitalmars-d-learn wrote:
 On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
 
 Best practices for arrays in hot loops:


[...]
 - Where possible, prefer sequential access over random access (take
   advantage of the CPU cache hierarchy).

 
 Thanks for sharing Teoh!  Very helpful.
 
 would this be random access? for(size_t i; i<arr.length; i++) using
 indices?  ...and this be sequential foreach(a;arr) ?
 
 or would they have to be completely different kinds of containers?  a
 dlang 'range' vs arr[]?

[...]

The exact syntactic construct you use is not important; under the hood,
for(i; i<arr.length; i++) and foreach(a;arr) are exactly the same thing.
What's important is the actual pattern of memory access.  Looping from
the first element of an array to the last element is sequential access;
doing binary search on an array is random access (because it's
unpredictable where the next access will be). Traversing a linked list
is also random access, because in general individual nodes will be
stored in different places in memory. So doing a hashtable lookup.

Basically, modern CPUs have a cache hierarchy and memory prefetching
controlled by the access predictor. When the predictor sees memory being
accessed sequentially, it can speed up future accesses by preloading
subsequent blocks of memory into cache lines so that when the CPU next
tries to read from that location it's already in the cache and it
doesn't have to wait for a fetch cycle from RAM, which is slow.  The bad
thing about things like hashtables is because it's unpredictable where
the next access will be, so the CPU's predictor won't be able to preload
the next access for you. So you'll incur a fetch cycle from RAM every
time.

Of course, this doesn't mean that hashtable (or other data structure)
lookups are necessarily bad.  If the alternative is linear scan through
very large arrays, then hashtables will speed up your inner loop in
spite of the cache misses. So it depends on your exact use case. But if
your arrays are not overly large, scanning through the entire array may
sometimes be faster than doing many hashtable lookups.  Some modern
hashtable implementations are also adapted to take advantage of the
cache hierarchy, so it may not be as bad as a na�ve implementation.

Nevertheless, the general principle is that locality (adjacent memory
accesses are close to each other) is good, and should generally be
preferred over strategies that require jumping around in memory. So your
data structures and algorithms should be designed in a way that takes
advantage of linear access where possible.


T

-- 
In theory, there is no difference between theory and practice.

Apr 05 2023

Paul <phshaffer gmail.com> writes:

On Wednesday, 5 April 2023 at 23:06:54 UTC, H. S. Teoh wrote:

 So your data structures and algorithms should be designed in a 
 way that takes advantage of linear access where possible.


 T

Yes I understand, basically, what's going on in hardware.  I just 
wasn't sure if the access type was linked to the container type.  
It seems obvious now, since you've both made it clear, that it 
also depends on how I'm accessing my container.

Having said all of this, isn't a D 'range' fundamentally a 
sequential access container (i.e popFront) ?

Apr 05 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Thu, Apr 06, 2023 at 01:20:28AM +0000, Paul via Digitalmars-d-learn wrote:
[...]
 Yes I understand, basically, what's going on in hardware.  I just
 wasn't sure if the access type was linked to the container type.  It
 seems obvious now, since you've both made it clear, that it also
 depends on how I'm accessing my container.
 
 Having said all of this, isn't a D 'range' fundamentally a sequential
 access container (i.e popFront) ?

D ranges are conceptually sequential, but the actual underlying memory
access patterns depends on the concrete type at runtime. An array's
elements are stored sequentially in memory, and arrays are ranges.  But
a linked-list can also have a range interface, yet its elements may be
stored in non-consecutive memory locations.  So the concrete type
matters here; the range API only gives you conceptual sequentiality, it
does not guarantee physically sequential memory access.


T

-- 
Many open minds should be closed for repairs. -- K5 user

Apr 05 2023

Paul <phshaffer gmail.com> writes:

On Thursday, 6 April 2023 at 01:44:15 UTC, H. S. Teoh wrote:

 D ranges are conceptually sequential, but the actual underlying 
 memory access patterns depends on the concrete type at runtime. 
 An array's elements are stored sequentially in memory, and 
 arrays are ranges.  But a linked-list can also have a range 
 interface, yet its elements may be stored in non-consecutive 
 memory locations.  So the concrete type matters here; the range 
 API only gives you conceptual sequentiality, it does not 
 guarantee physically sequential memory access.

Very helpful Teoh.  Thanks again.

Apr 06 2023

D Programming

C/C++ Programming

Other

digitalmars.D.learn - foreach (i; taskPool.parallel(0..2_000_000)