digitalmars.D.learn - Rather Bizarre slow downs using Complex!float with avx (ldc).
- james.p.leblanc (78/78) Sep 30 2021 D-Ers,
- Johan (15/19) Sep 30 2021 This could be an template instantiation culling problem. If the
- james.p.leblanc (95/101) Oct 01 2021 Johan,
- Guillaume Piolat (7/10) Oct 02 2021 Maybe something related to:
D-Ers,

I have been getting counterintuitive results in avx/no-avx timing experiments. Storyline to date (notes at end, speed comparisons):

First experiment:
a) moving from non-avx --> avx shows a non-realistic speed-up of 15-25X.
b) this is weird, but the story continues ...

Second experiment:
a) moving from non-avx --> avx again shows amazing gains, but maybe this looks plausible?

Third experiment:
a) now **going from non-avx to avx shows a serious performance LOSS** of 40%, to breaking even at best. What is happening here?

Fourth experiment:
a) non-avx --> avx shows performance gains again of about 2X (so the gains appear to be reasonable).

The main question I have is: **"What is going on with the Complex!float performance?"**

One might expect floats to have better performance than doubles, as we saw with the real-valued data (because of vector packing, memory bandwidth, etc.). But, **Complex!float shows MUCH WORSE avx performance than Complex!double (by a factor of almost 4).**

```
// Table of Computation Times
//
//                   self math                        std math
//           explicit      no-explicit        explicit      no-explicit
//            align           align             align          align
//
//   with AVX
//   without AVX
```

Notes:

1) Based on forum hints from ldc experts, I got good guidance on enabling avx (i.e. compiling modules on the command line, using --ffast-math and -mcpu=haswell on the command line).

2) From Mir-glas experts I received hints to try implementing my own version of the complex math (this is what the "self math" column refers to).

I understand that the details of the computations are not included here (I can do that if there is interest, and if I figure out an effective way to present them in a forum). But I thought I might begin with a simple question: **"Is there some well-known issue that I am missing here? Have others been down this road as well?"**

Thanks for any and all input.

Best Regards,
James

PS Sorry for the inelegant table ... I do not believe there is a way to include beautiful bar charts on this forum. Please correct me if there is a way.
Sep 30 2021
On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:

> D-Ers, I have been getting counterintuitive results in avx/no-avx timing experiments.

This could be a template instantiation culling problem. If the compiler is able to determine that `Complex!float` is already instantiated (codegen) inside Phobos, then it may decide not to codegen it again when you are compiling your code with AVX+fastmath enabled. This could explain why you don't see improvement for `Complex!float`, but do see improvement with `Complex!double`. It does not explain the worse performance with AVX+fastmath versus without it.

Generally, for performance issues like this you need to study the assembly output (`--output-s`) or LLVM IR (`--output-ll`). The first thing I would look out for is function inlining, yes/no.

cheers,
Johan
Sep 30 2021
On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:

> Generally, for performance issues like this you need to study the assembly output (`--output-s`) or LLVM IR (`--output-ll`). The first thing I would look out for is function inlining, yes/no.

Johan,

Thanks kindly for your reply. As suggested, I have looked at the assembly output. Strangely, the fused multiply-adds are indeed there in the avx version, but the example still runs slower for the **Complex!float** data type. I have stripped the code down to a minimum, which demonstrates the weird result:

```d
import ldc.attributes;  // with or without this line makes no difference
import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";
/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2);  // dummy values to fill arrays
auto beta = cast(T) complex(-0.7, 0.6);

auto dotprod(T[] x, T[] y)
{
    auto sum = cast(T) 0;
    foreach (size_t i; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}

void main()
{
    int nEle = 1000;
    int nIter = 2000;

    auto startTime = MonoTime.currTime;
    auto dur = cast(double) (MonoTime.currTime - startTime).total!"usecs";

    T[] x, y;
    x.length = nEle;
    y.length = nEle;
    T z;

    x[] = alpha;
    y[] = beta;

    startTime = MonoTime.currTime;
    foreach (i; 0 .. nIter) {
        foreach (j; 0 .. nIter) {
            z = dotprod(x, y);
        }
    }
    auto etime = cast(double) (MonoTime.currTime - startTime).total!"msecs" / 1.0e3;

    writef(" result: % 5.2f%+5.2fi   comp time: %5.2f \n", z.re, z.im, etime);
}
```

For convenience, I include the bash script used to compile, run, generate the assembly code, and grep:

```bash
echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
./question
ldc2 --output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s

echo
echo "Without AVX"
ldc2 -O3 -release question.d
./question
ldc2 --output-s -O3 -release question.d
mv question.s question_without_avx.s

echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null
```

Here is the output when run on my machine:

```console
With AVX:
 result: -190.00+80.00i   comp time:  6.45

Without AVX
 result: -190.00+80.00i   comp time:  5.74

fused multiply adds are found in avx code (as desired)
question_with_avx.s: vfmadd231ss %xmm2, %xmm5, %xmm3
question_with_avx.s: vfmadd231ss %xmm0, %xmm2, %xmm3
question_with_avx.s: vfmadd231ss %xmm2, %xmm4, %xmm1
question_with_avx.s: vfmadd231ss %xmm3, %xmm5, %xmm1
question_with_avx.s: vfmadd231ss %xmm3, %xmm1, %xmm0
```

Repeating the experiment after changing the data type to Complex!double shows the AVX code to be twice as fast (perhaps more aligned with expectations). **I admit my confusion as to why Complex!float is misbehaving.**

Does anyone have insight into what is happening?

Thanks,
James
Oct 01 2021
On Friday, 1 October 2021 at 08:32:14 UTC, james.p.leblanc wrote:

> Does anyone have insight into what is happening? Thanks, James

Maybe something related to:
https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 ?

AVX is not always a clear win in terms of performance. Processing 8 floats at once may not do anything if you are memory-bound, etc.
Oct 02 2021