
digitalmars.D - Optimization problem: bulk Boolean operations on vectors

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
An interesting problem to look at:

https://github.com/dlang/dmd/pull/6352

Andrei
Dec 23 2016
next sibling parent reply hardreset <invalid email.address> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:

 https://github.com/dlang/dmd/pull/6352

 Andrei
Ok, some hand-written assembler to and-assign ints...

enum SIZE = 100000000;

void foo3(int* a, int* b)
{
    asm
    {
        mov   EAX,a;
        mov   EDX,b;
        sub   EDX,EAX;
        mov   ECX,EAX;
        add   ECX,SIZE*4;
    l1:;
        cmp   EAX,ECX;
        jae   done;
        mov   EBX,[EAX];
        and   EBX,[EAX+EDX];
        mov   [EAX],EBX;
        add   EAX,4;
        jmp   l1;
    done:;
    }
}

I get..

456ms for unconditional
505ms for conditional
268ms for assembler

So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code, not the source, that's the problem.
Dec 23 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
  dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs
 ldmd -O -release -inline -run foo.d
84 ms and 719 μs
Dec 23 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
 On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
  dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs
 ldmd -O -release -inline -run foo.d
84 ms and 719 μs
This is indeed already done.
Dec 23 2016
prev sibling parent Martin Nowak <code dawg.eu> writes:
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
 So the asm is about 40% faster than the best of the 
 alternatives. The point being that it's the generated code not 
 the source that's the problem.
Has anyone considered using ldc for DMD release compilations?
We considered using gdc (or ldc) to compensate for the slowdown after the ddmd transition, but nobody complained enough for that to happen ;). Also, Windows users were stuck with dmd at that point.

Using a proper optimizer, profiles, and LTO it was fairly simple to speed up dmd by some 20+% (IIRC). Unfortunately it wasn't considered before 2.069.0, but that's one of the main reasons why we set up Travis-CI to test gdc/ldc builds.

https://trello.com/c/OT6jlFNa/85-rebench-ddmd-vs-dmd-compiler-speed
Dec 25 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 10:03 AM, hardreset wrote:
 enum SIZE = 100000000;

 void foo3(int* a, int* b)
 {
     asm
     {
         mov   EAX,a;
         mov   EDX,b;
         sub   EDX,EAX;
         mov   ECX,EAX;
         add   ECX,SIZE*4;
     l1:;
         cmp   EAX,ECX;
         jae   done;
         mov   EBX,[EAX];
         and   EBX,[EAX+EDX];
         mov   [EAX],EBX;
         add   EAX,4;
         jmp l1;
     done:;
     }
 }

 I get..

 456ms for unconditional
 505ms for conditional
 268ms for assembler

 So the asm is about 40% faster than the best of the alternatives. The point
 being that it's the generated code not the source that's the problem.
For this D code:

enum SIZE = 100000000;

void foo(int* a, int* b)
{
    int* atop = a + 1000;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
        *a &= *(a + offset);
}

The following asm is generated by DMD:

                push    EBX
                mov     EBX,8[ESP]
                sub     EAX,EBX
                push    ESI
                cdq
                and     EDX,3
                add     EAX,EDX
                sar     EAX,2
                lea     ECX,0FA0h[EBX]
                mov     ESI,EAX
                cmp     EBX,ECX
                jae     L2C
L20:            mov     EDX,[ESI*4][EBX]
                and     [EBX],EDX
                add     EBX,4
                cmp     EBX,ECX
                jb      L20
L2C:            pop     ESI
                pop     EBX
                ret     4

The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
Dec 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/2016 05:11 PM, Walter Bright wrote:
 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei
Dec 23 2016
prev sibling next sibling parent reply Johan Engelen <j j.nl> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
This code is not equivalent to the plain foreach loop. Execution is different when a > (size_t.max - SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded by a check for potential overflow first).

LDC:
https://godbolt.org/g/YcCJdZ
(note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

GDC does something more complex:
https://godbolt.org/g/3XeI9p

Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop.
Dec 23 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 3:35 PM, Johan Engelen wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE).
The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.
 Thus the "ptrdiff" loop cannot be vectorized (or can
 only be vectorized when guarded with a check for potential overflow first).
 LDC:
 https://godbolt.org/g/YcCJdZ
 (note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

 GDC does something more complex:
 https://godbolt.org/g/3XeI9p

 Just for info. Don't know which is faster, but I'm guessing the vectorized
 foreach loop.
The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.)
Dec 23 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 24 December 2016 at 00:04:48 UTC, Walter Bright 
wrote:
 On 12/23/2016 3:35 PM, Johan Engelen wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright 
 wrote:
 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000; // should be `a + SIZE`, right?
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
     *a &= *(a + offset);
 }
This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE).
The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.
Of course, but it's not codified in the source, and thus the function is different from the foreach version.

More importantly, I think it is broken. I find it very hard to reason about the ptrdiff_t version, because of all the potential overflows and the mixing of unsigned and signed math. So I wrote a little helper program to print out pointer values.

```
// file a.d
import std.stdio;

void main()
{
    int* a = cast(int*) (size_t.max - 4000000LU + 2000000LU);
    int* b = cast(int*) (1000000LU);
    // ptrdiff_t cannot represent the large difference between
    // arrays at the start and end of memory.
    ptrdiff_t offset = b - a;
    writeln("a          = ", a);
    writeln("b          = ", b);
    writeln("a + offset = ", a + offset);
}
```

Then:

```
❯ dmd -run a.d
a          = FFFFFFFFFFC2F6FF
b          = F4240
a + offset = F423F
```

a + offset is not equal to b for the given pointer values. Whoops.

I find it amazing that the GCC backend manages to come up with a bunch of checks for different overflow cases and creates a vectorized inner loop.

-Johan
Dec 24 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/24/2016 2:28 AM, Johan Engelen wrote:
 Of course, but it's not codified in the source and thus the function is
 different from the foreach version.

 More importantly, I think it is broken. I find it very hard to reason about the
 ptrdiff_t version, because of all the potential overflows and the mixing of
 unsigned and signed math.
You are pedantically correct. It's possible someone would write a bizarre loop, and compiler optimizers need to take that into account when doing loop transformations.

This is one of the reasons why D's array syntax and semantics are superior - they guarantee no overflow.
Dec 24 2016
prev sibling next sibling parent reply hardreset <invalid email.address> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 On 12/23/2016 10:03 AM, hardreset wrote:

 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }

 The following asm is generated by DMD:

                 push    EBX
                 mov     EBX,8[ESP]
                 sub     EAX,EBX
                 push    ESI
                 cdq
                 and     EDX,3
                 add     EAX,EDX
                 sar     EAX,2
                 lea     ECX,0FA0h[EBX]
                 mov     ESI,EAX
                 cmp     EBX,ECX
                 jae     L2C
 L20:            mov     EDX,[ESI*4][EBX]
                 and     [EBX],EDX
                 add     EBX,4
                 cmp     EBX,ECX
                 jb      L20
 L2C:            pop     ESI
                 pop     EBX
                 ret     4

 The inner loop is 5 instructions, whereas the one you wrote is 
 7 instructions (I didn't benchmark it). With some more source 
 code manipulation the divide can be eliminated, but that is 
 irrelevant to the inner loop.
I patched up the prolog code and timed it, and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm compiling with...

rdmd test.d -O -release -inline

Am I missing something? How do I get the asm output?
Dec 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/2016 08:06 PM, hardreset wrote:
 rdmd test.d -O -release -inline
This passes the three flags to test.d. Place test.d at the end.
Dec 23 2016
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/23/2016 5:06 PM, hardreset wrote:
 I patched up the prolog code and timed it and it came out identical to my asm.
 I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm
 compiling with...

 rdmd test.d -O -release -inline

 Am I missing something? How do I get the asm output?
dmd test -O
obj2asm test.obj
Dec 23 2016
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2016-12-24 02:06, hardreset wrote:

 How do I get the asm output?
You cannot get the assembly output from DMD since it directly outputs object code. To get the assembly you need to use a disassembler.

-- 
/Jacob Carlborg
Dec 24 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
Is subtraction of pointers which do not belong to the same array defined behavior in D?
Dec 23 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 24 December 2016 at 01:38:24 UTC, safety0ff wrote:
 On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright 
 wrote:
 For this D code:

 enum SIZE = 100000000;

 void foo(int* a, int* b) {
     int* atop = a + 1000;
     ptrdiff_t offset = b - a;
     for (; a < atop; ++a)
 	*a &= *(a + offset);
 }
Is subtraction of pointers which do not belong to the same array defined behavior in D?
Yes, I think so. There are no crazy rules making pointer math UB.
Dec 23 2016
prev sibling next sibling parent safety0ff <safety0ff.dev gmail.com> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:
The foreach macro (src/tk/vec.h#L62) looks like low-hanging fruit for optimization as well.
Dec 23 2016
prev sibling parent Observer <spurious.address yahoo.com> writes:
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu 
wrote:
 An interesting problem to look at:

 https://github.com/dlang/dmd/pull/6352
With respect to bulk operations on vectors, of course I recognize the desire to use high-level code which is portable across platforms. But I also wonder whether this is the sort of thing that is really best handled these days by using the multi-media instruction set instead of the general-purpose instruction set. There's a lot of special-purpose arithmetic provided by the CPU in such a context, and it seems a bit silly to ignore that possibility.

Also, more generally regarding low-level stuff like this, perhaps this reference might be of interest:

https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685
Dec 27 2016