
digitalmars.D - Optimization problem: bulk Boolean operations on vectors

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
An interesting problem to look at:

https://github.com/dlang/dmd/pull/6352

Andrei
Dec 23 2016
next sibling parent reply hardreset <invalid email.address> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:

 https://github.com/dlang/dmd/pull/6352

 Andrei
Ok, some hand-written assembler to and-assign ints...

enum SIZE = 100000000;

void foo3(int* a, int* b)
{
    asm
    {
        mov   EAX,a;
        mov   EDX,b;
        sub   EDX,EAX;
        mov   ECX,EAX;
        add   ECX,SIZE*4;
    l1:;
        cmp   EAX,ECX;
        jae   done;
        mov   EBX,[EAX];
        and   EBX,[EAX+EDX];
        mov   [EAX],EBX;
        add   EAX,4;
        jmp   l1;
    done:;
    }
}

I get..

456ms for unconditional
505ms for conditional
268ms for assembler

So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code, not the source, that's the problem.
Dec 23 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
  dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs
 ldmd -O -release -inline -run foo.d
84 ms and 719 μs
Dec 23 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
 On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
  dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs
 ldmd -O -release -inline -run foo.d
84 ms and 719 μs
This is indeed already done.
Dec 23 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
We considered using gdc (or ldc) to compensate for the slowdown after the ddmd transition, but nobody complained enough for that to happen ;). Also, Windows users were stuck with dmd at that point.

Using a proper optimizer, profiles, and LTO it was fairly simple to speed up dmd by some 20+% (IIRC). Unfortunately it wasn't considered before 2.069.0, but that's one of the main reasons why we set up Travis-CI to test gdc/ldc builds.

https://trello.com/c/OT6jlFNa/85-rebench-ddmd-vs-dmd-compiler-speed
Dec 25 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 10:03 AM, hardreset wrote:
 enum SIZE = 100000000;

 void foo3(int* a, int* b)
 {
     asm
     {
         mov   EAX,a;
         mov   EDX,b;
         sub   EDX,EAX;
         mov   ECX,EAX;
         add   ECX,SIZE*4;
     l1:;
         cmp   EAX,ECX;
         jae   done;
         mov   EBX,[EAX];
         and   EBX,[EAX+EDX];
         mov   [EAX],EBX;
         add   EAX,4;
         jmp l1;
     done:;
     }
 }

 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the alternatives. The point
 being that it's the generated code not the source that's the problem.
For this D code:

enum SIZE = 100000000;

void foo(int* a, int* b)
{
    int* atop = a + 1000;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
        *a &= *(a + offset);
}

The following asm is generated by DMD:

                push    EBX
                mov     EBX,8[ESP]
                sub     EAX,EBX
                push    ESI
                cdq
                and     EDX,3
                add     EAX,EDX
                sar     EAX,2
                lea     ECX,0FA0h[EBX]
                mov     ESI,EAX
                cmp     EBX,ECX
                jae     L2C
L20:            mov     EDX,[ESI*4][EBX]
                and     [EBX],EDX
                add     EBX,4
                cmp     EBX,ECX
                jb      L20
L2C:            pop     ESI
                pop     EBX
                ret     4

The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
Dec 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/2016 05:11 PM, Walter Bright wrote:
 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei
Dec 23 2016
prev sibling next sibling parent reply Johan Engelen <j j.nl> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
This code is not equivalent to the plain foreach loop. Execution is different when a > (size_t.max - SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded by a check for potential overflow first).

LDC:
https://godbolt.org/g/YcCJdZ
(note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

GDC does something more complex:
https://godbolt.org/g/3XeI9p

Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop.
Dec 23 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 3:35 PM, Johan Engelen wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE).
The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.
 Thus the "ptrdiff" loop cannot be vectorized (or can
 only be vectorized when guarded with a check for potential overflow first).
 LDC:
 https://godbolt.org/g/YcCJdZ
 (note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

 GDC does something more complex:
 https://godbolt.org/g/3XeI9p

 Just for info. Don't know which is faster, but I'm guessing the vectorized
 foreach loop.
The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.)
Dec 23 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 24 December 2016 at 00:04:48 UTC, Walter Bright 
wrote:
 On 12/23/2016 3:35 PM, Johan Engelen wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright 
 wrote:
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE).
The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.
Of course, but it's not codified in the source, and thus the function is different from the foreach version.

More importantly, I think it is broken. I find it very hard to reason about the ptrdiff_t version, because of all the potential overflows and the mixing of unsigned and signed math. So I wrote a little helper program to print out pointer values.

```
// file a.d
import std.stdio;

void main()
{
    int* a = cast(int*) (size_t.max - 4000000LU + 2000000LU);
    int* b = cast(int*) (1000000LU);
    // ptrdiff_t cannot represent the large difference between
    // arrays at the start and end of memory.
    ptrdiff_t offset = b - a;
    writeln("a          = ", a);
    writeln("b          = ", b);
    writeln("a + offset = ", a + offset);
}
```

Then:

```
❯ dmd -run a.d
a          = FFFFFFFFFFC2F6FF
b          = F4240
a + offset = F423F
```

a + offset is not equal to b for the given pointer values. Whoops.

I find it amazing that the GCC backend manages to come up with a bunch of checks for different overflow cases and creates a vectorized inner loop.

-Johan
Dec 24 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/24/2016 2:28 AM, Johan Engelen wrote:
 Of course, but it's not codified in the source and thus the function is
 different from the foreach version.

 More importantly, I think it is broken. I find it very hard to reason about the
 ptrdiff_t version, because of all the potential overflows and the mixing of
 unsigned and signed math.
You are pedantically correct. It's possible someone would write a bizarre loop, and compiler optimizers need to take that into account when doing loop transformations.

This is one of the reasons why D's array syntax and semantics are superior - they guarantee no overflow.
Dec 24 2016
prev sibling next sibling parent reply hardreset <invalid email.address> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 On 12/23/2016 10:03 AM, hardreset wrote:

 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }

 The following asm is generated by DMD:

                 push    EBX
                 mov     EBX,8[ESP]
                 sub     EAX,EBX
                 push    ESI
                 cdq
                 and     EDX,3
                 add     EAX,EDX
                 sar     EAX,2
                 lea     ECX,0FA0h[EBX]
                 mov     ESI,EAX
                 cmp     EBX,ECX
                 jae     L2C
 L20:            mov     EDX,[ESI*4][EBX]
                 and     [EBX],EDX
                 add     EBX,4
                 cmp     EBX,ECX
                 jb      L20
 L2C:            pop     ESI
                 pop     EBX
                 ret     4

 The inner loop is 5 instructions, whereas the one you wrote is 
 7 instructions (I didn't benchmark it). With some more source 
 code manipulation the divide can be eliminated, but that is 
 irrelevant to the inner loop.
I patched up the prolog code and timed it, and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm compiling with...

rdmd test.d -O -release -inline

Am I missing something? How do I get the asm output?
Dec 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/2016 08:06 PM, hardreset wrote:
 rdmd test.d -O -release -inline
This passes the three flags to test.d. Place test.d at the end.
Dec 23 2016
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 5:06 PM, hardreset wrote:
 I patched up the prolog code and timed it and it came out identical to my asm.
 I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm
 compiling with...

 rdmd test.d -O -release -inline

 Am I missing something? How do I get the asm output?
dmd test -O
obj2asm test.obj
Dec 23 2016
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2016-12-24 02:06, hardreset wrote:

 How do I get the asm output?
You cannot get the assembly output from DMD since it directly outputs object code. To get the assembly you need to use a disassembler.

-- 
/Jacob Carlborg
Dec 24 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
Is subtraction of pointers which do not belong to the same array defined behavior in D?
Dec 23 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 24 December 2016 at 01:38:24 UTC, safety0ff wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright 
 wrote:
 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
Is subtraction of pointers which do not belong to the same array defined behavior in D?
Yes, I think so. There are no crazy rules making pointer math UB.
Dec 23 2016
prev sibling next sibling parent safety0ff <safety0ff.dev gmail.com> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:
The foreach macro (src/tk/vec.h#L62) looks like low-hanging fruit for optimization as well.
Dec 23 2016
prev sibling parent Observer <spurious.address yahoo.com> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:

 https://github.com/dlang/dmd/pull/6352
With respect to bulk operations on vectors, of course I recognize the desire to use high-level code which is portable across platforms. But I also wonder whether this is the sort of thing that is really best handled these days by using the multi-media instruction set instead of the general-purpose instruction set. There's a lot of special-purpose arithmetic provided by the CPU in such a context, and it seems a bit silly to ignore that possibility.

Also, more generally regarding low-level stuff like this, perhaps this reference might be of interest:

https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685
Dec 27 2016