## digitalmars.D - FFT in D (using SIMD) and benchmarks

- "a" <a a.com> Jan 24 2012
- Iain Buclaw <ibuclaw ubuntu.com> Jan 24 2012
- "a" <a a.com> Jan 24 2012
- bearophile <bearophileHUGS lycos.com> Jan 24 2012
- "a" <a a.com> Jan 24 2012
- bearophile <bearophileHUGS lycos.com> Jan 24 2012
- "a" <a a.com> Jan 24 2012
- "Marco Leise" <Marco.Leise gmx.de> Jan 25 2012
- "Robert Jacques" <sandford jhu.edu> Jan 25 2012
- Manu <turkeyman gmail.com> Jan 25 2012
- Iain Buclaw <ibuclaw ubuntu.com> Jan 25 2012
- Manu <turkeyman gmail.com> Jan 25 2012
- Trass3r <un known.com> Jan 25 2012
- Jonathan M Davis <jmdavisProg gmx.com> Jan 25 2012
- "a" <a a.com> Jan 25 2012

Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here: https://github.com/jerro/pfft

Because dmd currently doesn't have an intrinsic for the SHUFPS instruction, I've included a version block with some GDC-specific code (this gave me a speedup of up to 80%). I've benchmarked the scalar and SSE versions of the code compiled with both DMD and GDC, as well as the C++ code using SSE. The results are below. The left column is the base-two logarithm of the array size and the right column is GFLOPS, defined as the number of floating point operations that the most basic FFT algorithm would perform divided by the time taken (the algorithm I used performs slightly fewer operations):

GFLOPS = 5 n log2(n) / (time for one FFT in nanoseconds)

(I took that definition from http://www.fftw.org/speed/ )

Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

Results (GFLOPS by log2 of the array size):

| log2(n) | GDC SSE | G++ SSE | GDC scalar | DMD SSE | DMD scalar |
|--------:|--------:|--------:|-----------:|--------:|-----------:|
| 2 | 0.833648 | 1.65933 | 0.808422 | 0.497946 | 0.478998 |
| 3 | 1.23383 | 1.96071 | 1.20835 | 0.773551 | 0.772341 |
| 4 | 6.92712 | 7.09683 | 2.66921 | 3.79912 | 1.6106 |
| 5 | 8.93348 | 9.66308 | 2.81166 | 3.78027 | 1.68516 |
| 6 | 10.9212 | 11.1498 | 2.99551 | 3.85155 | 1.7083 |
| 7 | 11.9306 | 11.9315 | 3.26423 | 4.06491 | 1.70625 |
| 8 | 12.5338 | 12.5712 | 3.61477 | 4.30895 | 1.68684 |
| 9 | 13.4025 | 13.4241 | 3.90741 | 4.53038 | 1.66931 |
| 10 | 13.5835 | 13.4907 | 4.04009 | 4.61006 | 1.66125 |
| 11 | 13.6992 | 13.6524 | 4.20405 | 4.82098 | 1.63756 |
| 12 | 13.493 | 13.4215 | 4.21491 | 4.7455 | 1.61885 |
| 13 | 12.7082 | 12.6472 | 4.30896 | 4.85332 | 1.60459 |
| 14 | 9.32621 | 9.62755 | 3.79835 | 3.37768 | 1.402 |
| 15 | 9.15256 | 9.24289 | 3.80497 | 3.44962 | 1.39665 |
| 16 | 9.31431 | 9.64412 | 3.94784 | 3.54049 | 1.37894 |
| 17 | 8.38154 | 8.88006 | 3.98417 | 3.40236 | 1.36306 |
| 18 | 8.267 | 8.66819 | 3.58506 | 3.47339 | 1.27189 |
| 19 | 7.61852 | 8.28623 | 3.33992 | 3.40212 | 1.21033 |
| 20 | 7.14305 | 7.74581 | 3.42309 | 3.15997 | 1.25719 |
| 21 | 7.01786 | 7.6395 | 3.21923 | 3.32644 | 1.21315 |
| 22 | 6.58934 | 7.33506 | 3.25673 | 3.22767 | 1.21606 |

SIMD gives between a 2x and 3.5x speedup for GDC-compiled code and between 2.5x and 3x for DMD. Code compiled with GDC is just a little bit slower than G++ (and only for some values of n), which is really nice.
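As a side note on the metric, here is a minimal sketch in D of the GFLOPS figure as defined above (the function name and the numbers in the unittest are only for illustration, not taken from the benchmark code):

```d
import std.math : log2;

/// GFLOPS as defined above: 5 n log2(n) floating point operations
/// (the operation count of a basic FFT) divided by the measured
/// time for one transform, given in nanoseconds.
double gflops(size_t n, double nanosecondsPerFFT)
{
    return 5.0 * n * log2(cast(double) n) / nanosecondsPerFFT;
}

unittest
{
    // e.g. a 2^20-point transform taking about 14.7 ms comes out at
    // roughly 7.1 GFLOPS, in line with the table above.
    assert(gflops(1 << 20, 14.7e6) > 7.0);
    assert(gflops(1 << 20, 14.7e6) < 7.5);
}
```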

Jan 24 2012

On 25 January 2012 00:04, a <a a.com> wrote: Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here: https://github.com/jerro/pfft [...]

That is quite interesting, and really cool at the same time. :) Did you run into any issues with GDC's implementation of vectors? -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 24 2012

Did you run into any issues with GDC's implementation of vectors?

No, GDC vectors worked fine. It was a pretty straightforward port from the C++ version, which uses intrinsics from emmintrin.h.

Jan 24 2012

a: Because dmd currently doesn't have an intrinsic for the SHUFPS instruction I've included a version block with some GDC-specific code (this gave me a speedup of up to 80%).

It seems an instruction worth having in dmd too.

a: Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

I know your code is relatively simple, so it's not meant to be the fastest around, but in your nice graph I'd like to see a line for FFTW too, _as a reference point_. Such a line would show us how close or how far all this is from industry-standard performance. (And if possible I'd like to see two lines for the LDC2 compiler too.) Bye, bearophile

Jan 24 2012

On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote: I know your code is relatively simple, so it's not meant to be the fastest around, but in your nice graph I'd like to see a line for FFTW too, as a reference point. [...]

"bench" program in the fftw test directory gives this when run in a loop: 2 Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7 3 Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3 4 Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9 5 Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108 6 Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225 7 Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717 8 Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501 9 Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958 10 Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690 11 Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693 12 Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622 13 Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784 14 Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913 15 Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659 16 Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409 17 Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200 18 Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338 19 Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4 20 Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8 21 Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2 22 Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8 This was with fftw compiled for single precision and with SSE, but without AVX support. When I compiled fftw with AVX support, the peak was at about 30 GFLOPS, IIRC. It is possible that it would be even faster if I configured it in a different way. The C++ version of my FFT also supports AVX and gets to about 24 GFLOPS when using it. If AVX types will be added to D, I will port that part too.

Jan 24 2012

a: I have updated the graph now.

Thank you, it's a very nice graph. The reference lines for FFTW help. From how much AVX improves the situation I see Intel engineers know their work; I didn't know AVX could bring such a performance improvement (in carefully written code).

a: The C++ version of my FFT also supports AVX and gets to about 24 GFLOPS when using it.

So your FFT code is rather good, despite being so much simpler and shorter than the FFTW code :-) It seems a good candidate to replace Phobos std.numeric.fft(), if you ever want to donate your FFT code to Phobos.

a: If AVX types are added to D, I will port that part too.

Walter has added support for YMM registers too in D/DMD, so I presume having AVX1 instructions is quite an option, and that we will get them too. But ask Walter for more details on this. And hopefully we'll see good implementations of this algorithm too :-) http://web.mit.edu/newsoffice/2012/faster-fourier-transforms-0118.html?tmpl=component&print=1 Bye, bearophile

Jan 24 2012

On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote: I know your code is relatively simple, so it's not meant to be the fastest around, but in your nice graph I'd like to see a line for FFTW too, as a reference point. [...]

I have updated the graph now.

Jan 24 2012

If you want to, you can take a look at the link in this forum post: http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT It looks like MIT has a new FFT algorithm.

Jan 25 2012

On Wed, 25 Jan 2012 05:58:52 -0600, Marco Leise <Marco.Leise gmx.de> wrote: If you want to, you can take a look at the link in this forum post: http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT It looks like MIT has a new FFT algorithm.

For those who want the short version: MIT has developed a faster method of doing a sparse FFT. Since many applications (like compression) work by throwing away the 'little' frequencies, those applications can get a practical speed-up of ~10x, depending on the compression level, by switching from the FFT's O(n log n) algorithm to MIT's new O(k log n) algorithm. The new algorithm is still slower for computing all n frequencies, so it's not going to replace all FFT usage, just lossy FFT usage.
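A back-of-the-envelope reading of where that factor comes from, assuming the run time is dominated by the operation count and k is the number of frequencies actually kept (an illustration, not a figure from the MIT paper):

$$
\frac{n \log n}{k \log n} = \frac{n}{k}, \qquad \text{e.g. } k = \frac{n}{10} \;\Rightarrow\; \text{roughly a } 10\times \text{ reduction in work.}
$$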

Jan 25 2012

Can you paste disassemblies of the GDC code and the G++ code? I imagine there's something trivial in the scheduler that GDC has missed. Like the other day I noticed GDC was unnecessarily generating a stack frame for leaf functions, which Iain already fixed.

I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll work well.

On 25 January 2012 02:04, a <a a.com> wrote: Since SIMD types were added to D I've ported an FFT that I was writing in C++ to D. The code is here: https://github.com/jerro/pfft [...]


Jan 25 2012

On 25 January 2012 12:54, Manu <turkeyman gmail.com> wrote: Can you paste disassemblies of the GDC code and the G++ code? [...] I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll work well.

That's right, go on and give that GDC maintainer guy more work to do. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 25 2012

On 25 January 2012 20:50, Iain Buclaw <ibuclaw ubuntu.com> wrote: That's right, go on and give that GDC maintainer guy more work to do.

How can I resist, he does such a fast and awesome job of it! :P

Jan 25 2012

For some reason your messages never have the proper position in the message tree (i.e. they aren't connected to the posts they respond to).

Jan 25 2012

On Wednesday, January 25, 2012 22:37:41 Trass3r wrote: For some reason your messages never have the proper position in the message tree (i.e. they aren't connected to the posts they respond to).

I believe that it has something to do with newsgroup vs. mailing list posts, but I'm not sure. It could also be your client. Some clients are smarter about threading posts than others. - Jonathan M Davis

Jan 25 2012

On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote: Can you paste disassemblies of the GDC code and the G++ code? I imagine there's something trivial in the scheduler that GDC has missed.

Like the other day I noticed GDC was unnecessarily generating a stack frame for leaf functions, which Iain already fixed.

I've uploaded it here: https://github.com/downloads/jerro/pfft/disassembly.zip

For g++ there is the disassembly of the entire object file. For gdc that included parts of the standard library and was about 170k lines, so I copied just the functions that took the most time. I have also included percentages of time taken by functions for "./bench 22". I noticed that 2% of the time is taken by memset. This could be caused by static array initialization in fft_passes_stride, so I'll void-initialize the static arrays there. For very small sizes the code compiled with g++ is probably faster because it's more aggressively inlined.
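Void initialization, for reference, looks like this (a minimal generic sketch, not the actual fft_passes_stride code):

```d
void example()
{
    // "= void" skips the default initialization of the stack buffer,
    // avoiding the implicit memset; this is safe only because every
    // element is written before it is read.
    float[64] buffer = void;
    foreach (i, ref x; buffer)
        x = cast(float) i;
}
```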

I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll work well.

For that you probably only need to replace the code in sse.d (create stdsimd.d or something). The type that you pass as the first parameter of the FFT template should define at least the following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum vec_size, which is the number of scalars in a vector
- a function "static vec scalar_to_vector(T)", which takes a scalar and returns a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1, T * p2, T * p3, size_t i1, size_t i2)"
- a function "static void bit_reverse_16(T * p0, T * p1, T * p2, T * p3, size_t i)"

You can start by using Scalar!T.bit_reverse_16 and Scalar!T.bit_reverse_swap_16 from fft_impl.d, but that won't be very fast. Those two functions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i, four from p1 + i, four from p2 + i and four from p3 + i. Then it "bit reverses" them - this means that for each pair of indices j and k such that k has the reversed bits of j, it swaps the element at j and the element at k. I define the index here so that the element that was read from address pn + i + m has index n * 4 + m. After bit reversing them, it writes the elements back. You can assume that (p0 + i), (p1 + i), (p2 + i) and (p3 + i) are all aligned to 4*T.sizeof.

bit_reverse_swap_16: This one reads 16 elements from (p0 + i), (p1 + i), (p2 + i) and (p3 + i) and bit reverses them, and also the elements from (p0 + j), (p1 + j), (p2 + j) and (p3 + j), and bit reverses those. Then it writes the elements that were read from offsets i to offsets j and vice versa. You can assume that (p0 + i), (p1 + i), (p2 + i), (p3 + i), (p0 + j), (p1 + j), (p2 + j) and (p3 + j) are all aligned to 4*T.sizeof.

If you want to speed up the fft for large sizes (greater than or equal to 1 << Options.largeLimit, which is roughly those that do not fit in the L1 cache) a bit, you can also write these functions:

- static void interleave(int n)(vec a, vec b, ref vec c, ref vec d)

Here n will be a power of two larger than 1 and less than or equal to vec_size. The function breaks vector a into n equally sized parts, does the same for b, then interleaves those parts and writes them to c, d. For example: for vec_size = 4 and n = 4 it should write (a0, b0, a1, b1) to c and (a2, b2, a3, b3) to d. For vec_size = 4 and n = 2 it should write (a0, a1, b0, b1) to c and (a2, a3, b2, b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)

This is the inverse of interleave!n(a, b, c, d).

- static void complex_array_to_real_imag_vec(int n)(float * arr, ref vec rr, ref vec ri)

This one reads n pairs from arr. It repeats the first element of each pair vec_size / n times and writes them to rr. It also repeats the second element of each pair vec_size / n times and writes them to ri. For example: if vec_size is 4 and n is 2, then it should write (arr[0], arr[0], arr[2], arr[2]) to rr and (arr[1], arr[1], arr[3], arr[3]) to ri. You can assume that arr is aligned to 2*n*T.sizeof.

The functions interleave, deinterleave and complex_array_to_real_imag_vec are called in the inner loop, so if you can't make them very fast, you should just omit them. The algorithm will then fall back (it checks for the presence of interleave!2) to a scalar one for the passes where these functions would otherwise be used.
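To make the required shape concrete, here is a minimal scalar sketch of such a backend type, derived only from the description above (the struct name and the function bodies are illustrative guesses, not code from pfft; pfft's own scalar fallback is the Scalar!T template in fft_impl.d). The optional inner-loop hooks (interleave, deinterleave, complex_array_to_real_imag_vec) are omitted, so the algorithm would use its scalar fallback for the passes that need them:

```d
struct ScalarBackend(T_)
{
    alias T   = T_;
    alias vec = T_;          // "vector" of a single element
    enum  vec_size = 1;

    static vec scalar_to_vector(T a) { return a; }

    // Reverse the 4 bits of an index in 0 .. 16.
    private static size_t reverse4(size_t j)
    {
        return ((j & 1) << 3) | ((j & 2) << 1) | ((j & 4) >> 1) | ((j & 8) >> 3);
    }

    // Read 4 elements from each of p0 + i .. p3 + i (the element from
    // pn + i + m gets index n * 4 + m), swap each element with the one
    // at the bit-reversed index, and write everything back.
    static void bit_reverse_16(T* p0, T* p1, T* p2, T* p3, size_t i)
    {
        T*[4] p = [p0 + i, p1 + i, p2 + i, p3 + i];
        foreach (size_t j; 0 .. 16)
        {
            immutable k = reverse4(j);
            if (k > j)
            {
                auto tmp        = p[j / 4][j % 4];
                p[j / 4][j % 4] = p[k / 4][k % 4];
                p[k / 4][k % 4] = tmp;
            }
        }
    }

    // Bit-reverse the 16 elements at offset i and the 16 at offset j,
    // then write each group to the other group's location.
    static void bit_reverse_swap_16(T* p0, T* p1, T* p2, T* p3,
                                    size_t i, size_t j)
    {
        T*[4] bases = [p0, p1, p2, p3];
        T[16] a, b;
        foreach (size_t n; 0 .. 4)
            foreach (size_t m; 0 .. 4)
            {
                a[reverse4(n * 4 + m)] = bases[n][i + m];
                b[reverse4(n * 4 + m)] = bases[n][j + m];
            }
        foreach (size_t n; 0 .. 4)
            foreach (size_t m; 0 .. 4)
            {
                bases[n][i + m] = b[n * 4 + m];  // group read from offset j
                bases[n][j + m] = a[n * 4 + m];  // group read from offset i
            }
    }
}
```

An SSE backend would presumably do the same with a four-wide float vector for vec, vec_size = 4, and shuffle operations in place of the element-by-element moves.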

Jan 25 2012