
digitalmars.D.learn - First time using Parallel

reply Era Scarecrow <rtcvb32 yahoo.com> writes:
  This is curious. I was up for trying to parallelize my code, 
specifically having a block of code calculate some polynomials 
(*related to Reed Solomon stuff*). So I cracked open std.parallelism 
and looked over how I would manage this all.

  To my surprise I found ParallelForeach, which gives the example 
of:

```d
foreach (value; taskPool.parallel(range)) { code }
```

Since my code doesn't require any memory management, shared 
resources or race conditions (*other than stdout*), I plugged in 
an iota and gave it a go. To my amazement no compiling issues, 
and all my cores are in heavy use and it's outputting results!
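
  Roughly what I plugged in looks like this (a trimmed-down 
sketch; `search` here just stands in for my actual calculation):

```d
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

// Stand-in for the real per-bitwidth search.
ulong search(int bitwidth) { return bitwidth; }

void main()
{
    // Each bit width is handed out to a worker thread from the default pool.
    foreach (bitwidth; taskPool.parallel(iota(7, 16)))
        writefln("\t0x%X,\t/*%s*/", search(bitwidth), bitwidth);
}
```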

  Now said results are out of order (*and early results are 
garbage from stdout*), but I'd included a bitwidth comment so 
sorting should be easy.
```d
         0x3,    /*7*/
         0x11,   /*9*/
         0x9,    /*10*/
         0x1D,   /*8*/
         0x5,    /*11*/
         0x3,    /*15*/
         0x53,   /*12*/
         0x1B,   /*13*/
         0x2B,   /*14*/
```
etc etc.

  Years ago I remember having to make a struct and then pass a 
function and a bunch of state from within the struct; it often 
broke and was hard to get working at all, so I hardly touched 
this stuff. This makes outputting data MUCH faster and so easily; 
well, at least on a beefy computer and not just the Chromebook 
I'm programming on so it can all be on the go.


  So I suppose: is there anything I need to know about shared 
resources, or how to wait until all threads are done?
Dec 25 2021
next sibling parent reply max haughton <maxhaton gmail.com> writes:
On Sunday, 26 December 2021 at 06:10:03 UTC, Era Scarecrow wrote:
  This is curious. I was up for trying to parallelize my code, 
 specifically having a block of code calculate some polynomials 
 (*related to Reed Solomon stuff*). So I cracked open 
 std.parallelism and looked over how I would manage this all.

  To my surprise I found ParallelForeach, which gives the 
 example of:

 ```d
 foreach (value; taskPool.parallel(range)) { code }
 ```

 Since my code doesn't require any memory management, shared 
 resources or race conditions (*other than stdout*), I plugged 
 in an iota and gave it a go. To my amazement no compiling 
 issues, and all my cores are in heavy use and it's outputting 
 results!

  Now said results are out of order (*and early results are 
 garbage from stdout*), but I'd included a bitwidth comment so 
 sorting should be easy.
 ```d
         0x3,    /*7*/
         0x11,   /*9*/
         0x9,    /*10*/
         0x1D,   /*8*/
         0x5,    /*11*/
         0x3,    /*15*/
         0x53,   /*12*/
         0x1B,   /*13*/
         0x2B,   /*14*/
 ```
 etc etc.

  Years ago I remember having to make a struct and then pass a 
 function and a bunch of state from within the struct; it often 
 broke and was hard to get working at all, so I hardly touched 
 this stuff. This makes outputting data MUCH faster and so easily; 
 well, at least on a beefy computer and not just the Chromebook 
 I'm programming on so it can all be on the go.


  So I suppose: is there anything I need to know about shared 
 resources, or how to wait until all threads are done?
Parallel programming is one of the deepest rabbit holes you can 
actually get to use in practice. Your question at the moment 
doesn't really have much context to it, so it's difficult to 
suggest where you should go directly.

I would start by removing the use of stdout in your loop kernel - 
I'm not familiar with what you are calculating, but if you can 
basically have the (parallel) loop operate from (say) one array 
directly into another, then you can get extremely good parallel 
scaling with almost no effort.

Not using stdout in the actual loop should make the code faster 
even without threads, because having a function call in the hot 
code means the compiler's optimizer will give up on certain 
transformations - i.e. do all the work as compactly as possible, 
then output the data in one step at the end.
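
A minimal sketch of that shape (names are made up; `compute` 
stands in for whatever you're actually calculating):

```d
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writefln;

// Hypothetical stand-in for the per-element calculation.
ulong compute(size_t bitwidth) { return bitwidth; }

void main()
{
    enum size_t start = 7, end = 16;
    auto results = new ulong[end - start];

    // Each iteration writes only to its own slot: no locks, no stdout in the loop.
    foreach (i; parallel(iota(results.length)))
        results[i] = compute(start + i);

    // The parallel foreach blocks until all work units are done, so output once here.
    foreach (i, r; results)
        writefln("\t0x%X,\t/*%s*/", r, start + i);
}
```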
Dec 26 2021
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 27/12/2021 12:10 AM, max haughton wrote:
 I would start by removing the use of stdout in your loop kernel - I'm 
 not familiar with what you are calculating, but if you can basically 
 have the (parallel) loop operate from (say) one array directly into 
 another then you can get extremely good parallel scaling with almost no 
 effort.
 
 Not using stdout in the actual loop should make the code faster 
 even without threads, because having a function call in the hot 
 code means the compiler's optimizer will give up on certain 
 transformations - i.e. do all the work as compactly as possible, 
 then output the data in one step at the end.
It'll speed it up significantly.

Standard IO has locks in it. So you end up with all calculations 
grinding to a halt waiting for another thread to finish doing 
something.
Dec 26 2021
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 26 December 2021 at 11:24:54 UTC, rikki cattermole 
wrote:
 I would start by removing the use of stdout in your loop kernel 
 - I'm not familiar with what you are calculating, but if you 
 can basically have the (parallel) loop operate from (say) one 
 array directly into another then you can get extremely good 
 parallel scaling with almost no effort.
I'm basically generating a default list of LFSRs for my Reed 
Solomon codes. LFSRs can be used for pseudo-random numbers, but in 
this case it's to build a Galois field for error correction. Using 
one is simple: you need to know a binary number that, when xored 
in whenever a 1 bit exits the range, will cycle through the 
maximum number of values (*excluding zero*). So if we do 4 bits 
(xor of 3) you'd get:

```
0 0001 -- initial
0 0010
0 0100
0 1000
1 0011 <- 0000
0 0110
0 1100
1 1011 <- 1000
1 0101 <- 0110
0 1010
1 0111 <- 0100
0 1110
1 1111 <- 1100
1 1101 <- 1110
1 1001 <- 1010
1 0001 <- 0010 -- back to our initial value
```

As such the bulk of the work is done in this function. The other 
functions leading up to it mostly figure out what the value should 
be according to some rules I set before trying it (*quite a few 
only need 2 bits on*).

```d
bool testfunc(ulong value, ulong bitswide) {
    ulong cnt = 1, lfsr = 2, up = 1UL << bitswide;
    value |= up;    // eliminates the need to AND the result

    while (cnt < up && lfsr != 1) {
        lfsr <<= 1;
        if (lfsr & up)
            lfsr ^= value;
        cnt++;
    }

    return cnt == up - 1;
}

// within main, cyclebits will call testfunc when a value is calculated
foreach (bitwidth; taskPool.parallel(iota(start, end))) {
    for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
        ulong v = cyclebits(bitwidth, bitson, &testfunc);
        if (v) {
            // only place IO takes place
            writeln("\t0x", cast(void*) v, ",\t/*", bitwidth, "*/");
            break;
        }
    }
}
```
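
If I took that advice literally, I guess the loop body would 
store into a slot instead of printing; a rough, untested sketch 
reusing `start`, `end`, `cyclebits` and `testfunc` from above:

```d
// One slot per bitwidth, filled independently by whichever thread handles it.
ulong[] results = new ulong[end - start];

foreach (bitwidth; taskPool.parallel(iota(start, end))) {
    for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
        ulong v = cyclebits(bitwidth, bitson, &testfunc);
        if (v) {
            results[bitwidth - start] = v;  // each thread owns its slot, so no lock
            break;
        }
    }
}

// The parallel foreach has joined by here, so print everything in order.
foreach (i, v; results) {
    if (v)
        writeln("\t0x", cast(void*) v, ",\t/*", start + i, "*/");
}
```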
max haughton wrote:
 Your question at the moment doesn't really have much context to 
 it, so it's difficult to suggest where you should go directly.
I suppose if I started doing work where I'm sharing resources 
(*probably memory*) I would have to go with semaphores and locks. 
I remember trying to read up on how to use threads in C/C++ years 
ago, and it was such a headache to set up that I just gave up. I 
assume it's best to divide the work up so it can be completed 
without competing for resources or race conditions, and make each 
chunk hefty enough to be worth the cost of spinning up the thread 
in the first place.

So aside from the library documentation, is there a good source 
for learning/using parallel and best practices? I'd love to be 
using more of this in the future if it isn't as big a blunder as 
it's made out to be.
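
From skimming the std.parallelism docs it looks like `parallel` 
also takes an optional work unit size, which I'm guessing is how 
you tune how many iterations each thread grabs at a time:

```d
// A work unit size of 1 hands out one bitwidth at a time, which seems
// sensible here since iterations vary wildly in cost (my guess, untested).
foreach (bitwidth; taskPool.parallel(iota(start, end), 1)) {
    // ... same body as before ...
}
```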
 Not using stdout in the actual loop should make the code faster 
 even without threads, because having a function call in the hot 
 code means the compiler's optimizer will give up on certain 
 transformations - i.e. do all the work as compactly as possible, 
 then output the data in one step at the end.
In this case I'm not sure how long each step takes, so I'm hoping 
intermediate results I can copy by hand will work (*a result may 
take a second or several minutes*). If this weren't a brute-force 
elimination of so many combinations, I'm sure a different approach 
would work.

rikki cattermole wrote:
 It'll speed it up significantly.

 Standard IO has locks in it. So you end up with all 
 calculations grinding to a halt waiting for another thread to 
 finish doing something.
I assume that's only when they are actually trying to use it? 
Early in the cycles (*under 30*) they were outputting quickly, but 
after 31 it can be minutes between results, and each thread (*if 
I'm right*) is working on a different number. So the ones found 
with 3, 5, or 9 come pretty fast, while all the others have a lot 
of failures before I get a good result.
Dec 26 2021
prev sibling parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Sunday, 26 December 2021 at 06:10:03 UTC, Era Scarecrow wrote:

[...]

 ```d
 foreach(value; taskPool.parallel(range) ){code}
 ```
[...]
  Now said results are out of order
[...]
  So I suppose: is there anything I need to know about shared 
 resources, or how to wait until all threads are done?
Have a look at `taskPool.workerLocalStorage`. I learned about this 
in [a post by data 
pulverizer](https://forum.dlang.org/post/ddgxqoitxoaljfwnlogc forum.dlang.org), 
who gives this example (slightly modified):

```d
import std.traits : isFloatingPoint;

auto dot(T)(T[] x, T[] y) if (isFloatingPoint!T)
in (y.length == x.length)
{
    import std.range : iota;
    import std.parallelism : parallel, taskPool;

    auto sums = taskPool.workerLocalStorage(0.0L);
    foreach (i; parallel(iota(x.length)))
        sums.get += x[i] * y[i];

    T result = 0.0;
    foreach (threadResult; sums.toRange)
        result += threadResult;
    return result;
}

void main()
{
    double[] x = [1, 2, 3, 4, 5];
    double[] y = [6, 7, 8, 9, 10];
    assert(dot(x, y) == 130);
}
```

(https://run.dlang.io/is/Ia8A0k)

So if you use `workerLocalStorage` to give each thread an 
`appender!string` to write output to, and afterwards write those 
to `stdout`, you'll get your output in order without sorting.

--Bastiaan.
Dec 26 2021
parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Sunday, 26 December 2021 at 15:20:09 UTC, Bastiaan Veelo wrote:
 So if you use `workerLocalStorage` to give each thread an 
 `appender!string` to write output to, and afterwards write 
 those to `stdout`, you'll get your output in order without 
 sorting.
Scratch that, I misunderstood the example. It doesn't solve 
ordering. The example works because order does not matter for 
addition. Sorry for spreading wrong information.

-- Bastiaan.
Dec 26 2021
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 26 December 2021 at 15:36:54 UTC, Bastiaan Veelo wrote:
 On Sunday, 26 December 2021 at 15:20:09 UTC, Bastiaan Veelo 
 wrote:
 So if you use `workerLocalStorage` ... you'll get your output 
 in order without  sorting.
 Scratch that, I misunderstood the example. It doesn't solve 
 ordering. The example works because order does not matter for 
 addition. Sorry for spreading wrong information.
Maybe. I did notice early on that a bunch of output was getting 
mixed up:

```
0x 0x 0x 0x 0x 0x 0x 0x35 /*33, /*, /*, /*115, /*, /*3, / *9, /*3410*/*/
```

which I assume is several small write calls from different threads 
landing at the same time. So if I built up an appender string and 
then output the whole thing as a single block, that would likely 
go away, though it wouldn't help with ordering.

**IF** I didn't have to wait so long to get results and wanted 
them all at once in order, I would write the results to the 
offsets of an array and then output it all at once at the end 
(*and since each thread would have its own offset to write to, you 
don't need to lock*).
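
Something like this is what I mean by a single block per result 
(just a sketch; `v` and `bitwidth` are from the loop earlier):

```d
// inside the parallel foreach, instead of passing several pieces to writeln:
import std.format : format;
import std.stdio : write;

string line = format("\t0x%X,\t/*%s*/\n", v, bitwidth);  // build the whole line first
write(line);  // one call to stdout, so lines can't intermix mid-line
```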
Dec 26 2021