www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - By ref and by pointer kills performance.

reply claptrap <clap trap.com> writes:
I was refactoring some code and changed a parameter from by 
value, to by pointer, and saw the performance drop by 50%. This 
is a highly reduced example of what I found, but basically 
passing something into a function by reference or pointer seems 
to make the compilers (it affects both DMD and LDC) treat it as 
if its volatile and must be loaded from memory on every use. This 
also inhibits the auto-vectorization of code by LDC.

https://d.godbolt.org/z/oonq1drd9

```d
void fillBP(uint* value, uint* dest)
{
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}
```
codegen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     ECX,[RSI]
                 mov     [RDI],ECX
                 mov     EDX,[RSI]
                 mov     4[RDI],EDX
                 mov     R8D,[RSI]
                 mov     8[RDI],R8D
                 mov     R9D,[RSI]
                 mov     0Ch[RDI],R9D
                 pop     RBP
                 ret

codgen LDC -->

         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 4], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 8], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 12], eax
         ret
```d
void fillBV(uint value, uint* dest)
{
     dest[0] = value;
     dest[1] = value;
     dest[2] = value;
     dest[3] = value;
}
```
codgen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     [RDI],ESI
                 mov     4[RDI],ESI
                 mov     8[RDI],ESI
                 mov     0Ch[RDI],ESI
                 pop     RBP
                 ret

codegen LDC -->

         movd    xmm0, edi
         pshufd  xmm0, xmm0, 0
         movdqu  xmmword ptr [rsi], xmm0
         ret

Interestingly if you do this...
```d
void fillBP(uint* value, uint* dest)
{
     uint tmp = *value;
     dest[0] = tmp;
     dest[1] = tmp;
     dest[2] = tmp;
     dest[3] = tmp;
}
```
You get identical code to the by value versions. (except the load 
from memory)

I'm not a compiler guy so maybe there's some rationale for this 
that I don't know but it seems like the compiler should be able 
to read "*value" once and cache it.
Feb 12
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

```d
void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
}
```

I expected that to not do the extra loads, but it did.

```d
void fillBP(immutable(uint*) value, uint* dest) {
	dest[0 .. 4][] = *value;
}
```

And that certainly should not be doing it either.
Even if it wasn't immutable.

For your code, because it is not immutable and therefore can be changed 
externally on another thread, the fact that the compiler has to do the 
loads is correct. This isn't a bug.
Feb 12
next sibling parent reply Johan <j j.nl> writes:
On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.
I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that `immutable(uint*)` points to _mutable_ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (`__monitor`), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it). In the case of `fillBP` there are no side-effects possible between the reads, so it appears that indeed the optimization could be done. But a different thread might write to the data. I don't know how that data-race is then defined... For the general case, side-effects are possible (e.g. a function call) so it is not possible to simply assume that `immutable` reference arguments never alias to other reference arguments; this complicates implementing the desired optimization. I'm not saying it is impossible, it's just extra effort (and proof is needed). -Johan
Feb 13
next sibling parent reply Johan <j j.nl> writes:
On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.
I hope someone can find the link to some DConf talk (me or Andrei)
Found it: https://youtu.be/-0jcE9B5kjs?list=PL9a7lgBtSQb-YCVj96v5vn1tEXPjKOPuB&t=403
Feb 13
parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
Wow, OOP people ruining performance for everybody, who would have 
thought..

That makes me worried now.. -betterC doesn't seem to change 
anything about optimization, or rather, lack of optimization..
Feb 13
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/02/2024 4:20 AM, ryuukk_ wrote:
 Wow, OOP people ruining performance for everybody, who would have thought..
 
 That makes me worried now.. -betterC doesn't seem to change anything 
 about optimization, or rather, lack of optimization..
No, it isn't limited to classes, the same issue appears with reference counting. Immutable simply isn't providing the guarantee of immutable.
Feb 13
prev sibling next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/02/2024 2:30 AM, Johan wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.
I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that `immutable(uint*)` points to _mutable_ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (`__monitor`), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it). In the case of `fillBP` there are no side-effects possible between the reads, so it appears that indeed the optimization could be done. But a different thread might write to the data. I don't know how that data-race is then defined... For the general case, side-effects are possible (e.g. a function call) so it is not possible to simply assume that `immutable` reference arguments never alias to other reference arguments; this complicates implementing the desired optimization. I'm not saying it is impossible, it's just extra effort (and proof is needed). -Johan
If you don't like the problem, change the problem. After thinking about it, I can agree for D classes immutable can't be known if it truely applies or not. A lock could still be taken in const code. So the problem is: For a type that is not a D class (COM, extern(C++) are not included in that definition), mark as immutable to LLVM IR so optimizations can be used wrt. immutable. After all, to call an immutable this, method it must also be marked as const or immutable. The this pointer won't be allowed to be mutated unless someone did something bad (again, not the compiler's fault that the user went against it). I don't think much thought needs to go into this.
Feb 13
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
After even more thinking, I'm reminded of an operator overload that I 
want for reference counting.

``opGoingIntoROM``.

The reason for reference counting is obvious, to turn off reference 
counting.

But it also applies here.

Having this knowledge on the monitor could turn off the actual 
lock/unlock mutation.

Okay this is compelling enough to report it.

https://issues.dlang.org/show_bug.cgi?id=24393

This is no longer an "I want". It's a genuine type guarantee issue of 
something we currently support.
Feb 13
prev sibling parent reply Kagamin <spam here.lot> writes:
On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 I hope someone can find the link to some DConf talk (me or 
 Andrei) or forum post where I talk about why LDC assumes that 
 `immutable(uint*)` points to _mutable_ (nota bene) data. The 
 reason is the mutable thread synchronization field in immutable 
 class variable storage (`__monitor`), combined with casting an 
 immutable class to an array of immutable bytes.

 Side-effects in-between immutable(uint*) lookup could run into 
 a synchronization event on the immutable data (i.e. mutating 
 it).
Shouldn't it be okay for immutable(uint*) to not access `__monitor`? I think, `__monitor` should work outside of type safety, like reference counting, then immutability won't matter. ``` shared(Object.Monitor*) getMonitor(immutable Object o) { size_t addr=cast(size_t)cast(void*)&o.__monitor; return cast(shared(Object.Monitor*))(addr-20+20); } ``` Is it legal for optimizer to cast away `restrict` qualifier like this?
Feb 16
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/02/2024 10:34 PM, Kagamin wrote:
 On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 
     I hope someone can find the link to some DConf talk (me or Andrei)
     or forum post where I talk about why LDC assumes that
     |immutable(uint*)| points to /mutable/ (nota bene) data. The reason
     is the mutable thread synchronization field in immutable class
     variable storage (|__monitor|), combined with casting an immutable
     class to an array of immutable bytes.
 
     Side-effects in-between immutable(uint*) lookup could run into a
     synchronization event on the immutable data (i.e. mutating it).
 
 Shouldn't it be okay for immutable(uint*) to not access |__monitor|? I 
 think, |__monitor| should work outside of type safety, like reference 
 counting, then immutability won't matter.
That covers immutable, but not immutable that has become const. The only way I can see it working reliably is to simply turn off the mutex when it is immutable.
Feb 16
parent reply Kagamin <spam here.lot> writes:
On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 That covers immutable, but not immutable that has become const.
That's fine, const can't be restrict qualified unconditionally, but immutable can.
Feb 16
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/02/2024 11:23 PM, Kagamin wrote:
 On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 That covers immutable, but not immutable that has become const.
That's fine, const can't be restrict qualified unconditionally, but immutable can.
It is not fine. Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide. Write to it, and see how the program segfaults. Just to prove a point of what immutable enables a compiler to do: ```d import std.stdio; import core.sys.posix.sys.mman; import core.sys.posix.unistd : sysconf, _SC_PAGESIZE; import core.sys.posix.stdlib; void main() { auto pagesize = sysconf(_SC_PAGESIZE); int* temp; posix_memalign(cast(void**)&temp, pagesize, 4 * pagesize); data = temp[0 .. pagesize]; data[0 .. 2] = [1, 2]; mprotect(data.ptr, int.sizeof * data.length, PROT_READ); writeln(data[0 .. 2]); data[0 .. 2] = [3, 4]; writeln(data[0 .. 2]); } static int[] data; ``` ``` [1, 2] Error: program killed by signal 11 ``` Read only memory like this is useful to kernels, it enables them to map the same ram ranges to multiple processes. It provides a very strong guarantee, well beyond const and right now that guarantee isn't honored in the language and needs to be fixed (otherwise why have it at all?).
Feb 16
parent reply Kagamin <spam here.lot> writes:
On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is 
 allowed to modify the contents process wide.
Then you won't be able to create a monitor for them.
Feb 16
next sibling parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 17/02/2024 4:12 AM, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is allowed 
 to modify the contents process wide.
Then you won't be able to create a monitor for them.
Code will still call into the null monitor. Or we could have a way to disable it at runtime instead by setting a field to change behavior to cover all use cases (which is what I proposed ;) ).
Feb 16
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is 
 allowed to modify the contents process wide.
Then you won't be able to create a monitor for them.
I don't get this point. The monitor field can just be the single pointer that is not transitively immutable? I mean. The compiler knows *exactly* when you access the monitor field. It's never accessed manually! This is a pointer access that can be generated at exactly line of code in the compiler. Just drop immutable there.
Feb 18
parent FeepingCreature <feepingcreature gmail.com> writes:
On Sunday, 18 February 2024 at 13:41:42 UTC, FeepingCreature 
wrote:
 On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing 
 is allowed to modify the contents process wide.
Then you won't be able to create a monitor for them.
I don't get this point. The monitor field can just be the single pointer that is not transitively immutable? I mean. The compiler knows *exactly* when you access the monitor field. It's never accessed manually! This is a pointer access that can be generated at exactly line of code in the compiler. Just drop immutable there.
 I hope someone can find the link to some DConf talk (me or 
 Andrei) or forum post where I talk about why LDC assumes that 
 immutable(uint*) points to mutable (nota bene) data. The reason 
 is the mutable thread synchronization field in immutable class 
 variable storage (__monitor), combined with casting an 
 immutable class to an array of immutable bytes.
Okay I think I overread this, probably because it's absurd. Why is this idiom worth protecting? You're hardcasting *anyways*, it's forbidden in safe *anyways*, just cast __monitor to uint* instead of immutable(uint*). If you're interacting with classes at such a deep level, asking you to remember that there's a mutable field in there even in immutable classes doesn't seem like a big ask.
Feb 18
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
 	dest[0 .. 4][] = *value;
 }
 ```

 And that certainly should not be doing it either.
 Even if it wasn't immutable.

 For your code, because it is not immutable and therefore can be 
 changed externally on another thread, the fact that the 
 compiler has to do the loads is correct. This isn't a bug.
Is not a thread issue. The memory the pointers point to only needs to overlap and the loads are required to get the "right" result.
Feb 13
next sibling parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/02/2024 4:14 AM, Patrick Schluter wrote:
 Is not a thread issue. The memory the pointers point to only needs to 
 overlap and the loads are required to get the "right" result.
It can be a thread issue, but yes overlapping is another potential situation where it can get over written. I prefer talking about threads, because they bring an entirely unknown element to the discussion and shutdown any localized assumptions. In the case of immutable it should not be possible for immutable memory to be changed unless someone did something bad. Immutable is a very strong guarantee at the process level, that there are no mutable pointers pointing at immutable memory. Therefore the compiler is free to make that optimization without concern. If it breaks, it's the users fault for misusing the language. "The second way is to cast data to immutable. When doing so, it is up to the programmer to ensure that any mutable references to the same data are not used to modify the data after the cast." https://dlang.org/spec/const3.html#creating_immutable_data
Feb 13
prev sibling parent Basile B. <b2.temp gmx.com> writes:
On Tuesday, 13 February 2024 at 15:14:06 UTC, Patrick Schluter 
wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
 	dest[0 .. 4][] = *value;
 }
 ```

 And that certainly should not be doing it either.
 Even if it wasn't immutable.

 For your code, because it is not immutable and therefore can 
 be changed externally on another thread, the fact that the 
 compiler has to do the loads is correct. This isn't a bug.
Is not a thread issue. The memory the pointers point to only needs to overlap and the loads are required to get the "right" result.
It's been proven that the main problem, i.e what defeated CSE ( the Common Sub Expression optim pass), is the possible overlap between the two parameter. However there'is still a risk on the value, i.e it can be modified by another thread, like mentioned Johan IIRC. The destination of the pointer you pass as "value" might change between one of the fourth assignment. To be frank I thought this was the problem.
Feb 13
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.
I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
Feb 13
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/02/2024 5:49 PM, Walter Bright wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.
I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
I'm aware that what dmd is doing is right. But it isn't using SIMD either, so there is some performance left on the table. A backend today vs 15 years ago is very different in terms of what they can do. Let alone one that last saw significant active development 30 years ago. I suggest that you have a chat with Bruce about this during the next meetup. We've been trying to explain to you just how backends have changed over the years from a users perspective. The ones today can only be described with one word: magical. But I will say this: 20 years ago you needed inline assembly, 10 years ago you needed intrinsics, today you may not need to do anything special in D, although will in C++.
Feb 13
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
I'll give an example for something I wrote up in a reply for a Reddit post:

Using ldc I was able to get this code to do 8.8 million vectors in 
around 1 second.

```d
module strategy;

int run(byte[] a, byte[] b) {
     int sum = 0;

     for(size_t i; i < a.length; i += 64) {
         static foreach(j; 0 .. 64) {
             sum += cast(short)a[i + j] * cast(short)b[i + j];
         }
     }

     return sum;
}
```

To achieve the same thing using Go, they had to write assembly (see 
DotVNNI example).

https://sourcegraph.com/blog/slow-to-simd

LDC is able to get to almost the maximum a CPU can do, with what looks 
to be almost naive looking code. DMD cannot compete with this, nor 
should it aim to.

Here is what actual naive code looks like for this problem:

```d
module strategy;

int run(byte[] a, byte[] b) {
     assert(a.length == b.length);

     int sum = 0;

     foreach(i; 0 .. a.length) {
         sum += cast(short)a[i] * cast(short)b[i];
     }

     return sum;
}
```

7 million vectors per second.

They had to write assembly to get that speed. I got it, without doing 
anything special...
Feb 13
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 14.02.24 05:49, Walter Bright wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.
I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. ...
Well, my understanding is that storing a value that was read from a `uint*` cannot invalidate a value read from the same `uint*` unless unaligned reads/writes are allowed or the pointer can point to itself. Are unaligned reads/writes allowed in D? Can a `uint*` point to itself? Is there something else I missed? How to make the following assertion fail by filling the `...` holes with code without invoking UB? ```d void fillBP1(uint* value, uint* dest)pure{ dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } void fillBP2(uint* value, uint* dest)pure{ auto tmp = *value; dest[0] = tmp; dest[1] = tmp; dest[2] = tmp; dest[3] = tmp; } uint[4] distinguish(alias f)()pure{ ...; // TODO uint[4] dest = ...; // TODO uint* value = ...; // TODO f(value,dest.ptr); return dest; } void main(){ import std.stdio; assert(distinguish!fillBP1()==distinguish!fillBP2()); writeln(test!fillBP1); writeln(test!fillBP2); } ```
 C99 tried to address this with __restrict, but few people use it or 
 understand it. D didn't bother with it because people will inevitably 
 misuse __restrict and get their data mysteriously corrupted.
 
My understanding is the case in the OB does not require `__restrict`. `value` can point to any of the 4 fields of `dest`, it is okay to only read it once even in that case. Of course, maybe the data flow analysis is more conservative than that, but I wouldn't necessarily point to this case as one where data flow analysis is working "correctly".
Feb 14
prev sibling next sibling parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright 
wrote:
 This isn't a case of buggy DFA. It's a case of doing DFA 
 correctly.

 The issue is pointer aliasing. A pointer can point to anything, 
 including const data. Therefore, storing through a pointer can 
 alter any value that is reachable via a pointer. Therefore, 
 storing through a pointer invalidates any cached value already 
 read.

 This is what you're seeing.

 C99 tried to address this with __restrict, but few people use 
 it or understand it. D didn't bother with it because people 
 will inevitably misuse __restrict and get their data 
 mysteriously corrupted.
This, here, is the answer. Moral of the story, use (gdc, ldc) vendor attributes if you really care. ``` import core.attribute; void fillBP( restrict uint* dest, uint* value) { //... ```
Feb 14
prev sibling parent reply max haughton <maxhaton gmail.com> writes:
On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright 
wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.
I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
It's probably true that not many can rattle off the legalese that defines restrict semantics correctly (I know I can't, although have to say I've never been bitten by it), but within a roughly C-like programming model its where a lot of performance comes from when at the bleeding edge so I don't know if not including it at all is particularly good either. Thinking about it, it's actually probably mainly an issue for library users rather than the author who actually wrote restrict to begin with - I have wondered in the past whether some kind of `isolated` style type qualifier could actually help here i.e. one could provide the compiler good information and make it very obvious to the programmer.
Feb 18
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 19/02/2024 2:58 AM, max haughton wrote:
 Thinking about it, it's actually probably mainly an issue for library 
 users rather than the author who actually wrote restrict to begin with - 
 I have wondered in the past whether some kind of |isolated| style type 
 qualifier could actually help here i.e. one could provide the compiler 
 good information and make it very obvious to the programmer.
A couple of days ago I had a read of Joe Duffy's blog. I was very impressed with what he had to say, enough to immediately want to buy his concurrency book and start adding his blog articles as references to my existing proposals. He was one of the developers Midori and had a lot to say about exception mechanisms and concurrency. Turns out we both have some similar conclusions, only he gained his about 15 years prior to me (which may or may not have contributed to my impression). The concept of isolated was one such concept that he talked about in regards to concurrency and its safety. I had not thought about it, that it would be a useful tool for making memory ownership safe in this context. But it also applies here consider: ```d isolated(int[]) a, b; func(a, b); void func(scope int[] c, scope int[] d) safe {} ``` Should that be callable with a and b? Yes. The graph doesn't matter, it doesn't escape. Now what about the case where it does? ```d int[] func(return scope int[] c, scope int[] d) safe { return c; } ``` Again yes. Only one parameter contributes to return, the return would therefore be part of ``a``'s graph. Another interesting possibility is that if one parameter is isolated, that means the second parameter ``d`` is not in ``c``'s graph, because isolated is a form of transfer of ownership, except because ``c`` is scope, the transfer doesn't occur from the outside. As a result both ``c`` and ``d`` would have `` restrict`` added. This is on my todo list to research, of course how exactly it plays out with DIP1000 and the this pointer is unknown to me at this stage.
Feb 18
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 13.02.24 04:31, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.
 
 Now for ldc:
 
 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
       dest[0] = *value;
       dest[1] = *value;
       dest[2] = *value;
       dest[3] = *value;
 }
 ```
 
 I expected that to not do the extra loads, but it did.
 
 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0 .. 4][] = *value;
 }
 ```
 
 And that certainly should not be doing it either.
 Even if it wasn't immutable.
 
 For your code, because it is not immutable and therefore can be changed 
 externally on another thread, the fact that the compiler has to do the 
 loads is correct. This isn't a bug.
An unsynchronized read of a location that is modified by another thread is a race condition and a race condition is UB, so this scenario is not an optimization blocker. I think with this specific code reading back the pointers is unnecessary, because assigning `*value` to itself does not produce an observable difference in behavior given that it is also read and assigned to a different location at least once. (Unless LDC defines the behavior of unaligned reads/writes?) Of course, if the code instead assigned `*value+1` to the entries of `dest`, the reads would become necessary. In general, I think LDC and GDC are pretty conservative about optimizing based on UB, which I think has benefits for security.
Feb 14
prev sibling next sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
 I was refactoring some code and changed a parameter from by 
 value, to by pointer, and saw the performance drop by 50%. This 
 is a highly reduced example of what I found, but basically 
 passing something into a function by reference or pointer seems 
 to make the compilers (it affects both DMD and LDC) treat it as 
 if its volatile and must be loaded from memory on every use. 
 This also inhibits the auto-vectorization of code by LDC.

 https://d.godbolt.org/z/oonq1drd9

 ...

 I'm not a compiler guy so maybe there's some rationale for this 
 that I don't know but it seems like the compiler should be able 
 to read "*value" once and cache it.
To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the restrict attribute as seen here: https://d.godbolt.org/z/z9vYndWqP When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.
Feb 12
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
 [snip]

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping indexed 
 arrays I've seen ldc go through the proof and do two versions 
 of the code with a branch.
As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
Feb 13
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:
 On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal 
 wrote:
 [...]
As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
Sorry, I should note that while `fillRestricted3` compiles, It has similar codegen as `fillBP`.
Feb 13
prev sibling parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:
 On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal 
 wrote:
 [snip]

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping 
 indexed arrays I've seen ldc go through the proof and do two 
 versions of the code with a branch.
As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
There was a little discussion a while back about restrict being generalized to slices but, as you note, it's not there yet. I typically use restrict on slice .ptr fields passed into nested functions within an trusted function. Yes, that's a little clumsy but it lets you do bounds checks and the like before dropping into the hot loops. If you're willing to tolerate the code bloat, and your code is simple, ldc will do the proof and auto-vec things for you. restrict also seems to keep the auto-vec optimizer engaged in not-strictly-vanilla formulations. If you have multiple sources and a single destination array note that you only need restrict on the destination. You're just trying to give the back end a clear view of dependencies. As always, godbolt is your friend when optimizing known bottlenecks of this sort.
Feb 13
parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 February 2024 at 15:56:59 UTC, Bruce Carneal wrote:
 [snip]

 There was a little discussion a while back about  restrict 
 being generalized to slices but, as you note, it's not there 
 yet.
Thanks for the detailed response!
 [snip]

 As always, godbolt is your friend when optimizing known 
 bottlenecks of this sort.
godbolt is definitely your friend when you can use it.
Feb 13
prev sibling next sibling parent claptrap <clap trap.com> writes:
On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
 On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping indexed 
 arrays I've seen ldc go through the proof and do two versions 
 of the code with a branch.
Ah OK makes sense. The restrict attribute will help with what I'm doing. thanks.
Feb 13
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 13.02.24 07:02, Bruce Carneal wrote:
 
 To reuse the value the compiler would have to prove that the memory 
 locations do not overlap.
Not really, it only has to show that the value in the memory location has not changed since the last read.
Feb 14
parent Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 14 February 2024 at 11:46:24 UTC, Timon Gehr wrote:
 On 13.02.24 07:02, Bruce Carneal wrote:
 
 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.
Not really, it only has to show that the value in the memory location has not changed since the last read.
True. I jumped to the general case where I'd seen sub optimal code gen. I should have commented on both the given special, stationary, case and the more general.
Feb 14
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
 I was refactoring some code and changed a parameter from by 
 value, to by pointer, and saw the performance drop by 50%. This 
 is a highly reduced example of what I found, but basically 
 passing something into a function by reference or pointer seems 
 to make the compilers (it affects both DMD and LDC) treat it as 
 if its volatile and must be loaded from memory on every use. 
 This also inhibits the auto-vectorization of code by LDC.

 https://d.godbolt.org/z/oonq1drd9

 ```d
 void fillBP(uint* value, uint* dest)
 {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
 }
 ```
 codegen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     ECX,[RSI]
                 mov     [RDI],ECX
                 mov     EDX,[RSI]
                 mov     4[RDI],EDX
                 mov     R8D,[RSI]
                 mov     8[RDI],R8D
                 mov     R9D,[RSI]
                 mov     0Ch[RDI],R9D
                 pop     RBP
                 ret

 codgen LDC -->

         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 4], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 8], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 12], eax
         ret
 ```d
Yes, that's normal. The compiler cannot know from the declaration alone if your pointer overlaps. In C you can declare the pointers with restrict which will tell the compiler that the pointers don't overlap. I don't know why D doesn't support restrict.
Feb 13
parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 13 February 2024 at 15:11:58 UTC, Patrick Schluter 
wrote:
 Yes, that's normal. The compiler cannot know from the 
 declaration alone if your pointer overlaps. In C you can 
 declare the pointers with restrict which will tell the compiler 
 that the pointers don't overlap. I don't know why D doesn't 
 support restrict.
It's a major footgun without much benefit. In this case, one could load the integers in a local variable and use the value from there, and you'd get teh same effet.
Feb 13