www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Dynamic array comparison optimization: check for equal pointers?

reply Johan <j j.nl> writes:
Hi all,
   I thought that we optimized dynamic array comparison with a 
check to see if the pointers are equal, but I don't see it in our 
array comparison implementation in druntime. Quick checking of 
glibc memcmp also showed that there is no short circuit for when 
the pointers of the slices are equal.

I was expecting something like:
```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
```

I am surprised that memcmp itself does not implement the pointers 
are equal check. Is there a big impact of adding the shortcircuit 
check to prevent scanning memory?

cheers,
   Johan
Jun 02 2020
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 Hi all,
   I thought that we optimized dynamic array comparison with a 
 check to see if the pointers are equal, but I don't see it in 
 our array comparison implementation in druntime. Quick checking 
 of glibc memcmp also showed that there is no short circuit for 
 when the pointers of the slices are equal.

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```

 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?

 cheers,
   Johan
I would have expected this optimization to be present as well ... It seems like a no-brainer.
Jun 02 2020
prev sibling next sibling parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```
That's, worst case, two branches and a scan, and best case one branch which may or may not be slower than the actual scan.
 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?
There can't be a definitive answer for this, it's dependent on architecture as well as size of data. As far as array comparison goes, if a compiler (talking about ldc/gdc here) can infer pointer/length information, it'll be free to throw away the scan. In all other cases, I'd rather see the comparison function, well, do the comparison, and leave branches up to the user. It's much better than to inflict the branches on everyone unconditionally (pun intended): if your data shows that you can benefit from the short-circuit, you can always do just that. Short of rethinking the design, of course ;) I mean, I can't imagine a scenario where a program would have to routinely compare identical memory ranges. Doesn't mean there aren't such scenarios though.
Jun 02 2020
parent reply Johan <j j.nl> writes:
On Tuesday, 2 June 2020 at 20:54:39 UTC, Stanislav Blinov wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```
That's, worst case, two branches and a scan, and best case one branch which may or may not be slower than the actual scan.
The "actual scan" is a function call, which will be super slow (note that in druntime's implementation it cannot be optimized to a tail call) compared to the comparison and early return.
 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?
There can't be a definitive answer for this, it's dependent on architecture as well as size of data. As far as array comparison goes, if a compiler (talking about ldc/gdc here) can infer pointer/length information, it'll be free to throw away the scan. In all other cases, I'd rather see the comparison function, well, do the comparison, and leave branches up to the user. It's much better than to inflict the branches on everyone unconditionally (pun intended): if your data shows that you can benefit from the short-circuit, you can always do just that. Short of rethinking the design, of course ;) I mean, I can't imagine a scenario where a program would have to routinely compare identical memory ranges. Doesn't mean there aren't such scenarios though.
You are assuming that an if-statement costs significant time, but that doesn't have to be the case. The pointer values already must be read from memory (otherwise you cannot do the memory scan), and with speculative execution the comparison+branch only costs very little (codegen could be coaxed to make if(false) be the default execution path) and potentially saves many many memory reads. Could someone take the effort and check the performance difference (for example the compiler itself) with this change? Thanks! Johan
Jun 02 2020
parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Tuesday, 2 June 2020 at 22:01:20 UTC, Johan wrote:
 On Tuesday, 2 June 2020 at 20:54:39 UTC, Stanislav Blinov wrote:
 That's, worst case, two branches and a scan, and best case one 
 branch which may or may not be slower than the actual scan.
The "actual scan" is a function call, which will be super slow (note that in druntime's implementation it cannot be optimized to a tail call) compared to the comparison and early return.
LDC (and I'm assuming GDC as well) is well aware of memcmp, it should not be a function call. At the very least in -Ox builds.
 You are assuming that an if-statement costs significant time, 
 but that doesn't have to be the case. The pointer values 
 already must be read from memory (otherwise you cannot do the 
 memory scan), and with speculative execution the 
 comparison+branch only costs very little (codegen could be 
 coaxed to make if(false) be the default execution path) and 
 potentially saves many many memory reads.
I'm not assuming anything. The branch(es) may, or may not, be more costly than just comparing some words. Depending on architecture and the amount of data to compare. That's literally what I wrote :) May, or may not - implementation has no business assuming either way. That's why such a check absolutely should *not* be there. It should be where it belongs - in user code, backed up by profiling as necessary. Including it there may be beneficial for you specifically, but detrimental for everyone else.
Jun 02 2020
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/2/20 6:20 PM, Stanislav Blinov wrote:
 On Tuesday, 2 June 2020 at 22:01:20 UTC, Johan wrote:
 On Tuesday, 2 June 2020 at 20:54:39 UTC, Stanislav Blinov wrote:
 That's, worst case, two branches and a scan, and best case one branch 
 which may or may not be slower than the actual scan.
The "actual scan" is a function call, which will be super slow (note that in druntime's implementation it cannot be optimized to a tail call) compared to the comparison and early return.
LDC (and I'm assuming GDC as well) is well aware of memcmp, it should not be a function call. At the very least in -Ox builds.
Let's assume LDC knows how to optimize this. How do you think it would do so? I'd imagine very similarly to how Johan wrote it. What makes you think LDC can't optimize out the check if there's a better way? Then the check is there in case the optimizer you are dealing with is going to call an opaque memcmp in all cases. -Steve
Jun 02 2020
parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Tuesday, 2 June 2020 at 22:44:24 UTC, Steven Schveighoffer 
wrote:

 Let's assume LDC knows how to optimize this. How do you think 
 it would do so? I'd imagine very similarly to how Johan wrote 
 it.

 What makes you think LDC can't optimize out the check if 
 there's a better way? Then the check is there in case the 
 optimizer you are dealing with is going to call an opaque 
 memcmp in all cases.
That actually is irrelevant. Whether it gets inlined, whether even it's loop is unrolled by the compiler - that is all beside the point. If you hardcode the check in the library, that check is there for everyone using that library. On all hardware and for all data sizes. Irrespective of whether it can, on a given platform with a given array size, be actually faster to just do the call/loop/etc. than to do a check first. Irrespective of what a given compiler is capable of. There cannot be a universal optimization here. Users can always write a compare_arrays_ptrcheck if so required, but they cannot "unwrite" druntime. Same goes for actual memcmp. So let them do the minimal work required, and let users make higher-level decisions at higher level. If a program is attempting to "compare" identical memory ranges so often as to warrant a test - it's up to that program to deal with it, not the whole community.
Jun 02 2020
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/2/20 7:35 PM, Stanislav Blinov wrote:
 On Tuesday, 2 June 2020 at 22:44:24 UTC, Steven Schveighoffer wrote:
 
 Let's assume LDC knows how to optimize this. How do you think it would 
 do so? I'd imagine very similarly to how Johan wrote it.

 What makes you think LDC can't optimize out the check if there's a 
 better way? Then the check is there in case the optimizer you are 
 dealing with is going to call an opaque memcmp in all cases.
That actually is irrelevant. Whether it gets inlined, whether even it's loop is unrolled by the compiler - that is all beside the point. If you hardcode the check in the library, that check is there for everyone using that library. On all hardware and for all data sizes. Irrespective of whether it can, on a given platform with a given array size, be actually faster to just do the call/loop/etc. than to do a check first. Irrespective of what a given compiler is capable of. There cannot be a universal optimization here. Users can always write a compare_arrays_ptrcheck if so required, but they cannot "unwrite" druntime.
I'm saying the compiler can unwrite druntime, if it knows a better way. Happens all the time with LDC, it will remove loops that it knows don't do anything. And the upside is that compilers that don't do the "smarter" thing will not needlessly call memcmp. -Steve
Jun 02 2020
parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 3 June 2020 at 00:04:43 UTC, Steven Schveighoffer 
wrote:

 Users can always write a compare_arrays_ptrcheck if so 
 required, but they cannot "unwrite" druntime.
I'm saying the compiler can unwrite druntime, if it knows a better way. Happens all the time with LDC, it will remove loops that it knows don't do anything. And the upside is that compilers that don't do the "smarter" thing will not needlessly call memcmp.
...Yes? And that will fall under a hypothetical case when the check just gets removed (or heck, inserted if it's faster). Leaving it in for all other cases when the compiler *can't* remove it. Make the best case better but the worst case worse? No thanks. For the third time, and let me be a little more explicit this time - if a program gives memcmp identical pointers so much as to warrant a test (which is found out by measuring) - the program is doing something wrong. *It* should do a test and not bother with memcmp at all. Or its logic needs to change so as to not have this problem in the first place. Same translates onto druntime's arrays. There's no definitive answer to Johan's original question. It depends. On hardware. On data. Therefore, that check shouldn't be there, it should be delegated to someone capable of making such decision - to the user.
Jun 02 2020
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/2/20 8:35 PM, Stanislav Blinov wrote:
 On Wednesday, 3 June 2020 at 00:04:43 UTC, Steven Schveighoffer wrote:
 
 Users can always write a compare_arrays_ptrcheck if so required, but 
 they cannot "unwrite" druntime.
I'm saying the compiler can unwrite druntime, if it knows a better way. Happens all the time with LDC, it will remove loops that it knows don't do anything. And the upside is that compilers that don't do the "smarter" thing will not needlessly call memcmp.
....Yes? And that will fall under a hypothetical case when the check just gets removed (or heck, inserted if it's faster). Leaving it in for all other cases when the compiler *can't* remove it. Make the best case better but the worst case worse? No thanks.
The worst case is opaque memcmp is called in all cases. No thanks. I'll take a few extra cycles over that any day. -Steve
Jun 02 2020
parent Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 3 June 2020 at 01:10:55 UTC, Steven Schveighoffer 
wrote:

 ....Yes? And that will fall under a hypothetical case when the 
 check just gets removed (or heck, inserted if it's faster). 
 Leaving it in for all other cases when the compiler *can't* 
 remove it. Make the best case better but the worst case worse? 
 No thanks.
The worst case is opaque memcmp is called in all cases. No thanks. I'll take a few extra cycles over that any day.
With the check in place, the worst case would be a useless comparison plus the memcmp, or a failed branch on the off chance that you do try to compare identical ranges. Or a useless comparison and a failed branch + memcmp. Depends on your program. Whether the effect would be significant or immeasurable - also depends on your program, and your data, and your hardware. "I'm not sure what my program is doing, so let's make everyone else pay for it". That's what including that test, in memcmp or in druntime, effectively would be.
Jun 02 2020
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/2/20 3:38 PM, Johan wrote:
 Hi all,
    I thought that we optimized dynamic array comparison with a check to 
 see if the pointers are equal, but I don't see it in our array 
 comparison implementation in druntime. Quick checking of glibc memcmp 
 also showed that there is no short circuit for when the pointers of the 
 slices are equal.
 
 I was expecting something like:
 ```
    if ( sizes are not equal )
       return false;
    if ( pointers are equal  )
       return true;
    return memcmp;
 ```
 
 I am surprised that memcmp itself does not implement the pointers are 
 equal check. Is there a big impact of adding the shortcircuit check to 
 prevent scanning memory?
 
I thought I remembered this being true, but maybe it was for class instances? Looking back in history, it seems prior to 2017, it was done by the compiler generating the code instead of a template, but seems like it did not include the pointer check. -Steve
Jun 02 2020
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, June 2, 2020 3:18:40 PM MDT Steven Schveighoffer via Digitalmars-d 
wrote:
 On 6/2/20 3:38 PM, Johan wrote:
 Hi all,

    I thought that we optimized dynamic array comparison with a check to

 see if the pointers are equal, but I don't see it in our array
 comparison implementation in druntime. Quick checking of glibc memcmp
 also showed that there is no short circuit for when the pointers of the
 slices are equal.

 I was expecting something like:
 ```

    if ( sizes are not equal )
       return false;
    if ( pointers are equal  )
       return true;
    return memcmp;

 ```

 I am surprised that memcmp itself does not implement the pointers are
 equal check. Is there a big impact of adding the shortcircuit check to
 prevent scanning memory?
I thought I remembered this being true, but maybe it was for class instances? Looking back in history, it seems prior to 2017, it was done by the compiler generating the code instead of a template, but seems like it did not include the pointer check.
The free function opEquals for classes does check the references first and has for years (as well doing checks like whether either reference is null). I've never looked at what the array implementation does though. - Jonathna M Davis
Jun 02 2020
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 6/2/20 6:48 PM, Jonathan M Davis wrote:
 On Tuesday, June 2, 2020 3:18:40 PM MDT Steven Schveighoffer via Digitalmars-d
 wrote:
 I thought I remembered this being true, but maybe it was for class
 instances? Looking back in history, it seems prior to 2017, it was done
 by the compiler generating the code instead of a template, but seems
 like it did not include the pointer check.
The free function opEquals for classes does check the references first and has for years (as well doing checks like whether either reference is null). I've never looked at what the array implementation does though.
Sorry, to be clear I meant the array implementation prior to 2017 implemented opEquals by generating code with the compiler. After that, it used the linked template. -Steve
Jun 02 2020
prev sibling next sibling parent reply Dominikus Dittes Scherkl <dominikus scherkl.de> writes:
On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 Hi all,
   I thought that we optimized dynamic array comparison with a 
 check to see if the pointers are equal, but I don't see it in 
 our array comparison implementation in druntime. Quick checking 
 of glibc memcmp also showed that there is no short circuit for 
 when the pointers of the slices are equal.

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```

 I am surprised that memcmp itself does not implement the 
 pointers are equal check.
I also would expect this kind of optimization in memcmp and not in the functions using it. Anything at higher level should be unaware of pointers wherever possible. And how do you know this optimization is NOT in memcmp? It's extern C and the source may be not available - have you examined it? There are many implementations of memcmp available - maybe on you're platform the cost for this extra check is too high? (especially given the very low probability that it is ever called with src==dst).
Jun 02 2020
next sibling parent reply Johan <j j.nl> writes:
On Tuesday, 2 June 2020 at 21:27:44 UTC, Dominikus Dittes Scherkl 
wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 Hi all,
   I thought that we optimized dynamic array comparison with a 
 check to see if the pointers are equal, but I don't see it in 
 our array comparison implementation in druntime. Quick 
 checking of glibc memcmp also showed that there is no short 
 circuit for when the pointers of the slices are equal.

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```

 I am surprised that memcmp itself does not implement the 
 pointers are equal check.
I also would expect this kind of optimization in memcmp and not in the functions using it. Anything at higher level should be unaware of pointers wherever possible.
Pointers are nothing to be scared of.
 And how do you know this optimization is NOT in memcmp? It's 
 extern C and the source may be not available - have you 
 examined it?
That's what I wrote, yes.
 There are many implementations of memcmp available - maybe on 
 you're platform the cost for this extra check is too high? 
 (especially given the very low probability that it is ever 
 called with src==dst).
glibc is a very commonly used library for memcmp. It does many other checks before starting the comparison (e.g. optimizing for pointer alignment), so the cost of the extra check is perhaps not even measurable. -Johan
Jun 02 2020
parent NaN <divide by.zero> writes:
On Tuesday, 2 June 2020 at 22:05:06 UTC, Johan wrote:
 On Tuesday, 2 June 2020 at 21:27:44 UTC, Dominikus Dittes 
 Scherkl wrote:

 glibc is a very commonly used library for memcmp. It does many 
 other checks before starting the comparison (e.g. optimizing 
 for pointer alignment), so the cost of the extra check is 
 perhaps not even measurable.
Getting two pointers to the same memory location may be such a rare occurrence that it's not worth having the check for it. IE, Adding a tiny overhead on 1000 calls can still cost more than a very large overhead on 1 call.
Jun 02 2020
prev sibling parent Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Tuesday, 2 June 2020 at 21:27:44 UTC, Dominikus Dittes Scherkl 
wrote:

 I also would expect this kind of optimization in memcmp and not 
 in the functions using it. Anything at higher level should be 
 unaware of pointers wherever possible.
How do you pass pointers to memcmp if you're not aware of them? ;)
Jun 02 2020
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 Hi all,
   I thought that we optimized dynamic array comparison with a 
 check to see if the pointers are equal, but I don't see it in 
 our array comparison implementation in druntime. Quick checking 
 of glibc memcmp also showed that there is no short circuit for 
 when the pointers of the slices are equal.

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```

 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?
memcmp() doesn't implement the check to avoid memcmp(NULL, NULL, length) returning with 0 instead of segfaulting as it does if any of the 2 pointers in NULL. It's a case of consistant behaviour, either you tolerate NULL pointers in all cases or you segfault in all cases.
Jun 02 2020
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 3 June 2020 at 05:48:56 UTC, Patrick Schluter wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 [...]
memcmp() doesn't implement the check to avoid memcmp(NULL, NULL, length) returning with 0 instead of segfaulting as it does if any of the 2 pointers in NULL. It's a case of consistant behaviour, either you tolerate NULL pointers in all cases or you segfault in all cases.
https://stackoverflow.com/questions/16362925/can-i-pass-a-null-pointer-to-memcmp
Jun 02 2020
prev sibling next sibling parent Johan <j j.nl> writes:
On Wednesday, 3 June 2020 at 05:48:56 UTC, Patrick Schluter wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?
memcmp() doesn't implement the check to avoid memcmp(NULL, NULL, length) returning with 0 instead of segfaulting as it does if any of the 2 pointers in NULL. It's a case of consistant behaviour, either you tolerate NULL pointers in all cases or you segfault in all cases.
Interesting argument! But "undefined behavior" does not mean the program has to segfault. It's perfectly fine to return 0 for the UB case. -Johan
Jun 03 2020
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/3/20 1:48 AM, Patrick Schluter wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 Hi all,
   I thought that we optimized dynamic array comparison with a check to 
 see if the pointers are equal, but I don't see it in our array 
 comparison implementation in druntime. Quick checking of glibc memcmp 
 also showed that there is no short circuit for when the pointers of 
 the slices are equal.

 I was expecting something like:
 ```
   if ( sizes are not equal )
      return false;
   if ( pointers are equal  )
      return true;
   return memcmp;
 ```

 I am surprised that memcmp itself does not implement the pointers are 
 equal check. Is there a big impact of adding the shortcircuit check to 
 prevent scanning memory?
memcmp() doesn't implement the check to avoid memcmp(NULL, NULL, length) returning with 0 instead of segfaulting as it does if any of the 2 pointers in NULL. It's a case of consistant behaviour, either you tolerate NULL pointers in all cases or you segfault in all cases.
Affirmative: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#301 My take on the original matter is - adding an easily-predictable branch is unlikely to affect speed in the most naive speculative execution hardware. However, prediction does take resources that would be otherwise available for other branches, and the branch adds to code size. Both of these are important in large applications. So inserting a test will look good in microbenchmarks but may be actually a negative in real-world use.
Jun 03 2020
parent Johan <j j.nl> writes:
On Wednesday, 3 June 2020 at 13:46:00 UTC, Andrei Alexandrescu 
wrote:
 On 6/3/20 1:48 AM, Patrick Schluter wrote:
 On Tuesday, 2 June 2020 at 19:38:04 UTC, Johan wrote:
 I am surprised that memcmp itself does not implement the 
 pointers are equal check. Is there a big impact of adding the 
 shortcircuit check to prevent scanning memory?
memcmp() doesn't implement the check to avoid memcmp(NULL, NULL, length) returning with 0 instead of segfaulting as it does if any of the 2 pointers in NULL. It's a case of consistant behaviour, either you tolerate NULL pointers in all cases or you segfault in all cases.
Affirmative: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#301 My take on the original matter is - adding an easily-predictable branch is unlikely to affect speed in the most naive speculative execution hardware. However, prediction does take resources that would be otherwise available for other branches, and the branch adds to code size. Both of these are important in large applications. So inserting a test will look good in microbenchmarks but may be actually a negative in real-world use.
It's such a simple optimization idea that I think it must have been tested by many many people already; presumably the outcome was that the gain is not worth the cost. In case the data is _not_ equal, I guess the inequality actually shows up quite early so only a fraction of memory is scanned. In case the data _is_ equal, that's where pointer comparison can cut a significant amount of memory read, but perhaps in most cases the data being equal is still stored in different memory locations. Again, I'm very interested to see a perf comparison in a real application. Perhaps I can test this at Weka. -Johan
Jun 03 2020