
digitalmars.D - SIMD on Windows

reply "Jonathan Dunlap" <jadit2 gmail.com> writes:
In D 2.063.2 on Windows 7:
Error: SIMD vector types not supported on this platform

Should I file a bug for this or is this currently on a roadmap? 
I'm SUPER excited to get into SIMD development with D. :D
Jun 21 2013
next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Btw, is it possible to check for SIMD support as a compilation 
condition? Ideally I'm looking to 'polyfill' SIMD if it's not 
supported on the platform.
Jun 21 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a roadmap? I'm SUPER
 excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.
Jun 21 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/21/2013 10:54 PM, Jonathan Dunlap wrote:
 Alright, I installed VC2010 (with x64 libs) and added the -m64 option to the
 compiler. Sadly the compiler dies with the below message. Should I file a bug or
 did I miss something?

Anytime you see a message like "Internal error" it's a compiler bug and should be reported to bugzilla. To work around, try replacing -gc with -g.
 -----

 Building: Easy (Debug)

 Performing main compilation...

 Current dictionary: C:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\

 C:\D\dmd2\windows\bin\dmd.exe -debug -gc "main.d" "SIMDTests.d"
 "-IC:\D\dmd2\src\druntime\src" "-IC:\D\dmd2\src\phobos" "-odobj\Debug"
 "-ofC:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\bin\Debug\SIMDTests.exe"
-m64

 Internal error: ..\ztc\cgcv.c 2162

 Exit code 1

 Build complete -- 1 error, 0 warnings

Jun 21 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/22/2013 1:10 AM, Manu wrote:
 You can't use SIMD and symbolic debuginfo. The compiler will crash.

I didn't know that. Bugzilla?
Jun 22 2013
prev sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 22.06.2013 02:07, Manu wrote:
 It would certainly be nice in Win32, but I tend to think Win32 COFF
 should be much higher priority.

I have removed the dust from these patches and pushed them successfully through the test suite and unittests:

https://github.com/rainers/dmd/tree/coff32
https://github.com/rainers/druntime/tree/coff32
https://github.com/rainers/phobos/tree/coff32

Compile dmd as usual, but druntime and phobos with something like

druntime:
  make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
phobos:
  make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"

COFF32 files are generated when -m32ms is used on the command line.

If you put the resulting libraries into the lib folder, using a standard installation of VS2010 might work, but I recommend adding a new section to sc.ini and adjust paths there. Mine looks like this:

[Environment32ms]
PATH=c:\l\vs9\Common7\IDE;%PATH%
LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
LINKCMD=c:\l\vs9\vc\bin\link.exe

BTW: I also found some bugs in the Win64 along the way, I'll create pull requests for these.
Jun 23 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-06-23 15:33, Rainer Schuetze wrote:

 I have removed the dust from these patches and pushed them successfully
 through the test suite and unittests:

 https://github.com/rainers/dmd/tree/coff32
 https://github.com/rainers/druntime/tree/coff32
 https://github.com/rainers/phobos/tree/coff32

 Compile dmd as usual, but druntime and phobos with something like

 druntime:
   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
 phobos:
   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
 "AR=<path-to-32bit-lib>"

 COFF32 files are generated when -m32ms is used on the command line.

So, you have implemented support for COFF 32bit? How long have you been hiding this :) Although I'm not a Windows user I consider it great news. -- /Jacob Carlborg
Jun 23 2013
parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 23.06.2013 20:24, Jacob Carlborg wrote:
 On 2013-06-23 15:33, Rainer Schuetze wrote:

 COFF32 files are generated when -m32ms is used on the command line.

So, you have implemented support for COFF 32bit? How long have you been hiding this :) Although I'm not a Windows user I consider it great news.

I experimented with it a few times, but it didn't work well enough until this weekend.
Jun 23 2013
prev sibling next sibling parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 23.06.2013 15:33, Rainer Schuetze wrote:
 BTW: I also found some bugs in the Win64 along the way, I'll create pull
 requests for these.

https://github.com/D-Programming-Language/dmd/pull/2253 https://github.com/D-Programming-Language/dmd/pull/2254
Jun 23 2013
prev sibling parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 23.06.2013 21:55, Michael wrote:
 Cool)))

 Any chances to see it [coff32] in official build?

Let's see if Walter approves. There is one maybe disruptive change: with two different C runtimes available for Win32, versioning on Win32/Win64 no longer works. I added versions CRuntime_DigitalMars and CRuntime_Microsoft (and CRuntime_GNU for anything else), and adapting to this accounts for most of the changes in druntime and phobos.
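To illustrate the versioning change described above: code that used to pick C-runtime-specific behaviour by testing Win32/Win64 would instead test the C runtime identifiers. This is only a minimal sketch, assuming the identifiers keep the names given in the post (`CRuntime_DigitalMars`, `CRuntime_Microsoft`); `cRuntimeName` is a hypothetical helper, not something from the patches.

```d
// Hypothetical helper: reports which C runtime the program was built against.
// The version identifiers are the ones named in the post above; a compiler
// that does not set them simply takes the final `else` branch.
string cRuntimeName()
{
    version (CRuntime_DigitalMars)
        return "Digital Mars C runtime";
    else version (CRuntime_Microsoft)
        return "Microsoft C runtime";
    else
        return "other C runtime";
}
```

Calling `cRuntimeName()` from `main` and printing the result shows which branch a given build selected.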
Jun 24 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

On 22 June 2013 09:04, Walter Bright <newshound2 digitalmars.com> wrote:

 On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:

 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a roadmap? I'm SUPER
 excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

It would certainly be nice in Win32, but I tend to think Win32 COFF should be much higher priority.

And to OP: There's a version(D_SIMD) you can test.
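A compile-time check like this can be combined with a scalar fallback, which is roughly the "polyfill" asked about earlier in the thread. The following is a sketch only, assuming the predefined version identifier `D_SIMD` from the D specification and the `float4` type from `core.simd`; `mul4` is a hypothetical helper name, and the code has not been checked against DMD 2.063.2 specifically.

```d
// Element-wise multiply of four floats, using a hardware vector when the
// compiler reports SIMD support and plain array code otherwise.
float[4] mul4(float[4] a, float[4] b)
{
    version (D_SIMD)
    {
        import core.simd : float4;

        float4 va, vb;
        va.array = a;          // copy the inputs into vector variables
        vb.array = b;
        float4 vr = va * vb;   // a single vector multiply
        return vr.array;
    }
    else
    {
        // Scalar fallback for targets where D_SIMD is not set.
        float[4] r;
        r[] = a[] * b[];       // D array operation, applied element-wise
        return r;
    }
}
```

Callers see the same `float[4]` interface on every platform, so only this one function needs to know whether vector types are available.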
Jun 21 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 It's not a bug, and there are currently no plans to support 
 SIMD on Win32. However, it is supported for Win64 compilations.

LDC2 supports SIMD on Win32. Bye, bearophile
Jun 21 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
 On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a 
 roadmap? I'm SUPER
 excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

How do you compile for Win64? The only package for Windows I see is i386 which doesn't seem to support Win64 offhand... does the compiler require a flag? (yes, I'm on a Win64 OS/system)
Jun 21 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
 On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
 On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a 
 roadmap? I'm SUPER
 excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

How do you compile for Win64? The only package for Windows I see is i386 which doesn't seem to support Win64 offhand... does the compiler require a flag? (yes, I'm on a Win64 OS/system)

The i386 DMD can produce Win64 COFF object files which can then be linked by the free MSVC toolchain. Basically you just need to install that toolchain and DMD -m64 should just work in my experience. Someone probably has a link handy for the MSVC toolchain. I'm not sure because I have Visual Studio installed for work so I've always already got it installed.
Jun 21 2013
prev sibling next sibling parent "Geancarlo Rocha" <asdf mailinator.com> writes:
If just installing VC++ doesn't work... 
http://forum.dlang.org/post/mailman.2800.1355837582.5162.digitalmars-d puremagic.com
On Saturday, 22 June 2013 at 04:28:57 UTC, Jonathan Dunlap wrote:
 On Friday, 21 June 2013 at 23:04:10 UTC, Walter Bright wrote:
 On 6/21/2013 3:43 PM, Jonathan Dunlap wrote:
 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a 
 roadmap? I'm SUPER
 excited to get into SIMD development with D. :D

It's not a bug, and there are currently no plans to support SIMD on Win32. However, it is supported for Win64 compilations. You can, however, file an enhancement request on Bugzilla for it.

How do you compile for Win64? The only package for Windows I see is i386 which doesn't seem to support Win64 offhand... does the compiler require a flag? (yes, I'm on a Win64 OS/system)

Jun 21 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Alright, I installed VC2010 (with x64 libs) and added the -m64 
option to the compiler. Sadly the compiler dies with the below 
message. Should I file a bug or did I miss something?
-----

Building: Easy (Debug)

Performing main compilation...

Current dictionary: 
C:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\

C:\D\dmd2\windows\bin\dmd.exe -debug -gc "main.d" "SIMDTests.d"  
"-IC:\D\dmd2\src\druntime\src" "-IC:\D\dmd2\src\phobos" 
"-odobj\Debug" 
"-ofC:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\bin\Debug\SIMDTests.exe" 
-m64

Internal error: ..\ztc\cgcv.c 2162

Exit code 1

Build complete -- 1 error, 0 warnings
Jun 21 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Also tried VC2012 Express (with x64 libs)... received the same 
compiler error.
Jun 21 2013
prev sibling next sibling parent "Michael" <pr m1xa.com> writes:
Double-check where your 64-bit tools are installed. The paths 
differ somewhat between Win8, VS2010 and VS2012 Express installations.
Jun 21 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

On 22 June 2013 15:54, Jonathan Dunlap <jadit2 gmail.com> wrote:

 Alright, I installed VC2010 (with x64 libs) and added the -m64 option to
 the compiler. Sadly the compiler dies with the below message. Should I file
 a bug or did I miss something?
 -----

 Building: Easy (Debug)

 Performing main compilation...

 Current dictionary: C:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\

 C:\D\dmd2\windows\bin\dmd.exe -debug -gc "main.d" "SIMDTests.d"
  "-IC:\D\dmd2\src\druntime\src" "-IC:\D\dmd2\src\phobos" "-odobj\Debug"
 "-ofC:\Users\dunlap\Documents\GitHub\CodeEval\Dlang\bin\Debug\SIMDTests.exe"
 -m64

 Internal error: ..\ztc\cgcv.c 2162

 Exit code 1

 Build complete -- 1 error, 0 warnings

You can't use SIMD and symbolic debuginfo. The compiler will crash.
Jun 22 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

Pretty sure it's in there...
Here it is: http://d.puremagic.com/issues/show_bug.cgi?id=10224

On 22 June 2013 18:36, Walter Bright <newshound2 digitalmars.com> wrote:

 On 6/22/2013 1:10 AM, Manu wrote:

 You can't use SIMD and symbolic debuginfo. The compiler will crash.

I didn't know that. Bugzilla?

Jun 22 2013
prev sibling next sibling parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
Am 22.06.2013 00:43, schrieb Jonathan Dunlap:
 In D 2.063.2 on Windows 7:
 Error: SIMD vector types not supported on this platform

 Should I file a bug for this or is this currently on a roadmap? I'm
 SUPER excited to get into SIMD development with D. :D

In its current state you don't want to be using SIMD with dmd, because the generated assembly will be significantly slower than if you just use the default FPU math. If you need SIMD, you will need to write inline assembler. This will then also work on 32-bit Windows, but you have to use unaligned loads/stores because the compiler will not guarantee alignment (on 32 bit).

More details on the underperforming generated assembly can be found here: http://d.puremagic.com/issues/show_bug.cgi?id=10226

Kind Regards
Benjamin Thaut
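The inline-assembler route with unaligned loads and stores could look roughly like the sketch below. It assumes DMD-style inline assembly (the `D_InlineAsm_X86` and `D_InlineAsm_X86_64` version identifiers) and an SSE-capable CPU; `mul4Unaligned` is a hypothetical name and the code is illustrative rather than tuned.

```d
// Multiplies four floats from `a` and `b` element-wise and writes them to `dst`.
// Only MOVUPS is used, so none of the pointers has to be 16-byte aligned.
void mul4Unaligned(float* dst, const(float)* a, const(float)* b)
{
    version (D_InlineAsm_X86)
    {
        asm
        {
            mov EAX, a;
            mov ECX, b;
            mov EDX, dst;
            movups XMM0, [EAX];   // unaligned load of a[0 .. 4]
            movups XMM1, [ECX];   // unaligned load of b[0 .. 4]
            mulps XMM0, XMM1;     // four multiplications at once
            movups [EDX], XMM0;   // unaligned store to dst[0 .. 4]
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm
        {
            mov RAX, a;
            mov RCX, b;
            mov RDX, dst;
            movups XMM0, [RAX];
            movups XMM1, [RCX];
            mulps XMM0, XMM1;
            movups [RDX], XMM0;
        }
    }
    else
    {
        // Portable fallback when DMD-style inline assembly is unavailable.
        foreach (i; 0 .. 4)
            dst[i] = a[i] * b[i];
    }
}
```

With aligned data, MOVAPS could replace MOVUPS, but as noted above that alignment is not guaranteed on 32 bit.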
Jun 22 2013
next sibling parent Benjamin Thaut <code benjamin-thaut.de> writes:
Am 22.06.2013 15:53, schrieb jerro:
 In its current state you don't want to be using SIMD with dmd because
 the generated assembly will be significantly slower than if you just
 use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64 bit DMD using SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png This benchmark was run on a core i5 2500K on 64 bit Debian Wheezy.

Well, but judging from the assembly it generates, it could be even faster. What exactly is pfft? Does it use dmd's __simd intrinsics? Or does it only do primitive operations (* / - +) on simd types? Kind Regards Benjamin Thaut
Jun 22 2013
prev sibling parent Benjamin Thaut <code benjamin-thaut.de> writes:
Am 22.06.2013 15:53, schrieb jerro:
 In its current state you don't want to be using SIMD with dmd because
 the generated assembly will be significantly slower than if you just
 use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64 bit DMD using SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png This benchmark was run on a core i5 2500K on 64 bit Debian Wheezy.

OK, I saw that you wrote quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

Kind Regards
Benjamin Thaut
Jun 22 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 In its current state you don't want to be using SIMD with dmd 
 because the generated assembly will be significantly slower 
 than if you just use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64 bit DMD using SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png This benchmark was run on a core i5 2500K on 64 bit Debian Wheezy.
Jun 22 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Saturday, 22 June 2013 at 15:41:43 UTC, Benjamin Thaut wrote:
 Am 22.06.2013 15:53, schrieb jerro:
 In its current state you don't want to be using SIMD with dmd 
 because
 the generated assembly will be significantly slower than if 
 you just
 use the default FPU math.

That may be true for some kinds of code, but it isn't true in general. For example, see the comparison of pfft's performance when built with 64 bit DMD using SIMD and without SIMD: http://i.imgur.com/kYYI9R9.png This benchmark was run on a core i5 2500K on 64 bit Debian Wheezy.

OK, I saw that you wrote quite a few critical functions in inline assembly. Not really a good argument for dmd's codegen with SIMD intrinsics.

Kind Regards
Benjamin Thaut

I have actually run that benchmark with the code from this branch: https://github.com/jerro/pfft/tree/experimental The only function in sse_float.d on that branch that uses inline assembly is scalar_to_vector. The reason why I used more inline assembly in the master branch is that DMD didn't have intrinsics for some instructions such as shufps at the time. I'm not really arguing for DMD's codegen with SIMD intrinsics. It's more that, from what I've seen, it doesn't produce very good scalar floating point code either (at least when compared to LDC or GDC). Whether I use scalar floating point or SIMD, pfft is about two times slower if I compile it with DMD than it is if I compile it with GDC.
Jun 22 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 Well, but judging from the assembly it generates, it could be 
 even faster. What exactly is pfft? Does it use dmd's __simd 
 intrinsics?
 Or does it only do primitive operations (* / - +) on simd types?

It's an FFT implementation. It does most of the work using +, - and *. There's one part of the algorithm that uses mostly shufps, and that part takes about 10% of the time (for sizes around 2 ^^ 10 when using SSE).
Jun 22 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

I've said it before, but this man is a genius! :)

On 23 June 2013 23:33, Rainer Schuetze <r.sagitario gmx.de> wrote:

 On 22.06.2013 02:07, Manu wrote:

 It would certainly be nice in Win32, but I tend to think Win32 COFF
 should be much higher priority.

 I have removed the dust from these patches and pushed them successfully
 through the test suite and unittests:

 https://github.com/rainers/dmd/tree/coff32
 https://github.com/rainers/druntime/tree/coff32
 https://github.com/rainers/phobos/tree/coff32

 Compile dmd as usual, but druntime and phobos with something like

 druntime:
   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>"
 phobos:
   make -f win64.mak MODEL=32ms "CC=<path-to-32bit-cl>" "AR=<path-to-32bit-lib>"

 COFF32 files are generated when -m32ms is used on the command line.

 If you put the resulting libraries into the lib folder, using a standard
 installation of VS2010 might work, but I recommend adding a new section to
 sc.ini and adjust paths there. Mine looks like this:

 [Environment32ms]
 PATH=c:\l\vs9\Common7\IDE;%PATH%
 LIB="%@P%\..\..\lib32";c:\l\vs9\vc\lib;c:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib"
 DFLAGS=%DFLAGS% -L/nologo -L/INCREMENTAL:NO
 LINKCMD=c:\l\vs9\vc\bin\link.exe

 BTW: I also found some bugs in the Win64 along the way, I'll create pull
 requests for these.

Jun 23 2013
prev sibling next sibling parent "Michael" <pr m1xa.com> writes:
Cool)))

Any chances to see it [coff32] in official build?
Jun 23 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Alright, I'm now officially building for Windows x64 (amd64). 
I've created this early benchmark http://dpaste.dzfl.pl/eae0233e 
to explore SIMD performance. As you can see below, on my machine 
there is almost zero difference. Am I missing something?

//===SIMD===
0 1.#INF 5 1.#INF <-- vector result
hnsecs: 100006 <-- duration time
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 90006
//===SCALAR===
0 1.#INF 5 1.#INF
hnsecs: 90005
0 1.#INF 5 1.#INF
hnsecs: 100005
0 1.#INF 5 1.#INF
hnsecs: 100006
Jun 29 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
 Alright, I'm now officially building for Windows x64 (amd64). 
 I've created this early benchmark 
 http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As 
 you can see below, on my machine there is almost zero 
 difference. Am I missing something?

 //===SIMD===
 0 1.#INF 5 1.#INF <-- vector result
 hnsecs: 100006 <-- duration time
 0 1.#INF 5 1.#INF
 hnsecs: 90005
 0 1.#INF 5 1.#INF
 hnsecs: 90006
 //===SCALAR===
 0 1.#INF 5 1.#INF
 hnsecs: 90005
 0 1.#INF 5 1.#INF
 hnsecs: 100005
 0 1.#INF 5 1.#INF
 hnsecs: 100006

First of all, calcSIMD and calcScalar are virtual functions, so they can't be inlined, which prevents any further optimization. It also seems that the fact that g, s, i and d are class fields and that g is a static array makes DMD load them from memory and store them back on every iteration, even when calcSIMD and calcScalar are inlined.

But even if I make the class final and build it with gdc -O3 -finline-functions -frelease -march=native (in which case GDC generates assembly that looks optimal to me), the scalar version is still a bit faster than the vector version. The main reason for that is that even with scalar code, the processor can do multiple operations in parallel. On Sandy Bridge CPUs, for example, a floating point multiplication takes 5 cycles to complete, but the processor can start one multiplication per cycle. So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel.

That would explain the scalar code being equally fast, but not why it is faster than the vector code. The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.
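The point about virtual functions can be sketched as follows. In D, public member functions of a class are virtual by default; marking the class `final` (or using a struct) removes the need for a vtable call. The type and function names below are hypothetical stand-ins, not the code from the linked benchmark.

```d
// Ordinary class: step() is virtual by default, so a call through a
// class reference normally goes via the vtable.
class Calc
{
    float s = 1;
    void step(float m) { s *= m; }
}

// final class: step() cannot be overridden, so the compiler is free
// to call it directly and to inline it.
final class FinalCalc
{
    float s = 1;
    void step(float m) { s *= m; }
}

// struct: no vtable at all.
struct StructCalc
{
    float s = 1;
    void step(float m) { s *= m; }
}

// Generic driver so the same loop can be run against each variant.
float runSteps(C)(C calc, float m, int n)
{
    foreach (_; 0 .. n)
        calc.step(m);
    return calc.s;
}
```

All three variants compute the same result; whether the call is actually inlined depends on the compiler and its flags, so it is worth checking the generated assembly.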
Jun 29 2013
prev sibling next sibling parent "Kiith-Sa" <kiithsacmp gmail.com> writes:
See Manu's talk and google how to use it. If you don't know what 
you're doing you are unlikely to see performance improvements.

I'm not even sure if you're benchmarking SIMD performance or 
function call overhead there.
Jun 29 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
I've updated the project with your suggestions at 
http://dpaste.dzfl.pl/fce2d93b but still get the same 
performance. Vectors defined in the benchmark function body, no 
function calling overhead, etc. See some of my comments below btw:

 First of all, calcSIMD and calcScalar are virtual functions so 
 they can't be inlined, which prevents any further optimization.

From the dlang docs: Member functions which are private or package are never virtual, and hence cannot be overridden.
 So my guess is that the first four multiplications and the 
 second four multiplications in calcScalar are done in parallel. 
 ... The reason it's faster is that gdc replaces multiplication 
 by 2 with addition and omits multiplication by 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
Jun 29 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 29 June 2013 18:57, Jonathan Dunlap <jadit2 gmail.com> wrote:
 I've updated the project with your suggestions at
 http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors
 defined in the benchmark function body, no function calling overhead, etc.
 See some of my comments below btw:


 First of all, calcSIMD and calcScalar are virtual functions so they can't
 be inlined, which prevents any further optimization.

From the dlang docs: Member functions which are private or package are never virtual, and hence cannot be overridden.
 So my guess is that the first four multiplications and the second four
 multiplications in calcScalar are done in parallel. ... The reason it's
 faster is that gdc replaces multiplication by 2 with addition and omits
 multiplication by 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.

s/class/struct/

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Jun 29 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
 I've updated the project with your suggestions at 
 http://dpaste.dzfl.pl/fce2d93b but still get the same 
 performance. Vectors defined in the benchmark function body, no 
 function calling overhead, etc. See some of my comments below 
 btw:

 First of all, calcSIMD and calcScalar are virtual functions so 
 they can't be inlined, which prevents any further optimization.

From the dlang docs: Member functions which are private or package are never virtual, and hence cannot be overridden.
 So my guess is that the first four multiplications and the 
 second four multiplications in calcScalar are done in 
 parallel. ... The reason it's faster is that gdc replaces 
 multiplication by 2 with addition and omits multiplication by 
 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.

The multiples 2 and 1 were the reason why the scalar code performs a little bit better than SIMD code when compiled with GDC. The main reason why scalar code isn't much slower than SIMD code is instruction level parallelism. Because the first four operations in calcScalar are independent (none of them depends on the result of any of the other three), modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput. That's why it doesn't really make a difference that the scalar version does four times as many operations.

You can also take advantage of instruction level parallelism when using SIMD. For example, I get about the same number of iterations per second for the following two functions (when using GDC):

    import gcc.attribute;

    @attribute("forceinline") void calcSIMD1()
    {
        s0 = s0 * i0;
        s0 = s0 * d0;
        s1 = s1 * i1;
        s1 = s1 * d1;
        s2 = s2 * i2;
        s2 = s2 * d2;
        s3 = s3 * i3;
        s3 = s3 * d3;
    }

    @attribute("forceinline") void calcSIMD2()
    {
        s0 = s0 * i0;
        s0 = s0 * d0;
    }

By the way, if performance is very important to you, you should try GDC (or LDC, but I don't think LDC is currently fully usable on Windows).
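The latency-versus-throughput argument can be made concrete with two scalar loops. This is a sketch with hypothetical function names: in `oneChain` every multiplication depends on the previous one, while in `fourChains` the four multiplications inside one iteration are independent of each other, so a pipelined processor can overlap them.

```d
// One dependency chain: each multiply has to wait for the previous result,
// so the loop is bound by the latency of the multiply instruction.
float oneChain(float x, float m, int n)
{
    foreach (_; 0 .. n)
        x *= m;
    return x;
}

// Four independent chains: within an iteration no multiply depends on
// another, so they can be in flight at the same time.
float[4] fourChains(float[4] x, float m, int n)
{
    foreach (_; 0 .. n)
    {
        x[0] *= m;
        x[1] *= m;
        x[2] *= m;
        x[3] *= m;
    }
    return x;
}
```

If the reasoning above holds on a given CPU, `fourChains` should take about as long per iteration as `oneChain` despite doing four times the work; that is something to measure rather than assume.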
Jun 29 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 From the dlang docs: Member functions which are private or 
 package are never virtual, and hence cannot be overridden.

The call to calcScalar compiles to this:

    mov    rax,QWORD PTR [r12]
    rex.W call QWORD PTR [rax+0x40]

so I think the implementation doesn't conform to the spec in this case.
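Independently of how a given compiler treats private methods, marking a class method `final` is the explicit way to ask for a non-virtual call. The class below is a hypothetical sketch, not the code from the paste:

```d
class Bench
{
    float acc = 1;

    // Public class methods are virtual by default in D,
    // so this call may go through the vtable.
    void stepVirtual(float k) { acc *= k; }

    // A final method that overrides nothing cannot be overridden,
    // so it can be called directly and is a candidate for inlining.
    final void stepFinal(float k) { acc *= k; }
}

unittest
{
    auto b = new Bench;
    b.stepVirtual(2);
    b.stepFinal(3);
    assert(b.acc == 6);
}
```

Both methods compute the same thing; only the call mechanism differs. Checking the generated asm, as above, is the way to confirm what a particular compiler emits.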
Jun 29 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
modern x86-64 processors can execute them in parallel. Because 
of that, the speed of your program is limited by instruction 
latency and not throughput.


It seems like auto-vectorization to SIMD code may be an ideal strategy (e.g. Java) since it seems that the conditions to get any performance improvement have to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?
Jun 29 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 It seems like auto-vectorization to SIMD code may be an ideal 
 strategy (e.g. Java) since it seems that the conditions to get 
 any performance improvement have to be very particular and 
 situational... which is something the compiler may be best 
 suited to handle. Thoughts?

The thing is that using SIMD efficiently often requires you to organize your data and your algorithm differently, which is something that the compiler can't do for you. Another problem is that the compiler doesn't know how often different code paths will be executed, so it can't know how to use SIMD in the best way (that could be solved with profile-guided optimization, though). Alignment restrictions are another thing that can cause problems. For those reasons auto-vectorization only works in the simplest of cases.

But if you want auto-vectorization, GDC and LDC already do it. I recommend watching Manu's talk (as Kiith-Sa has already suggested): http://youtube.com/watch?v=q_39RnxtkgM
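One common example of "organizing your data differently" is switching from an array of structs to a struct of arrays. The sketch below is only an illustration with made-up type and function names; nothing in the thread says the benchmark used this layout.

```d
// Array-of-structs: the x values of neighbouring particles are
// 16 bytes apart, so one vector load cannot fetch four of them.
struct ParticleAoS { float x, y, z, w; }

void scaleXAoS(ParticleAoS[] ps, float k)
{
    foreach (ref p; ps)
        p.x *= k;
}

// Struct-of-arrays: all x values are contiguous, which is the
// layout a vectorizer (or hand-written SIMD) wants.
struct ParticlesSoA { float[] x, y, z, w; }

void scaleXSoA(ref ParticlesSoA ps, float k)
{
    ps.x[] *= k; // D array operation over the whole slice
}

unittest
{
    auto aos = [ParticleAoS(1, 0, 0, 0), ParticleAoS(2, 0, 0, 0)];
    scaleXAoS(aos, 3);
    assert(aos[0].x == 3 && aos[1].x == 6);

    ParticlesSoA soa;
    soa.x = [1.0f, 2.0f];
    scaleXSoA(soa, 3);
    assert(soa.x == [3.0f, 6.0f]);
}
```

The compiler will not make this transformation for you, because it changes the memory layout that the rest of the program sees.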
Jun 29 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
I did watch Manu's talk a few days ago, which inspired me to start 
this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, 
I'm still a bit clueless as to why there is almost zero 
performance difference, considering that it seems like an ideal 
setup to benefit from SIMD. I feel that if I can't see gains 
here, I shouldn't bother using SIMD in practice, where sometimes 
non-ideal operations must be done.
Jun 29 2013
prev sibling next sibling parent "Michael" <pr m1xa.com> writes:
 versioning on Win32/Win64 no longer works.

Jun 29 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
You should probably watch my talk again ;)
The relevant points are mostly the ones I make towards the end, where I
claim that "almost everyone who tries to use SIMD will see the same or
slower performance, and the reason is they have simply revealed other
bottlenecks".
I also made the point that "only by strictly applying ALL of the points I
demonstrated, will you see significant performance improvement".

The problem with your code is that it doesn't do any real work. Your
operations are all dependent on the result of the previous operation. The
scalar operations have a shorter latency than the SIMD operations, and they
all execute in parallel.
This is exactly the pathological worst-case comparison that basically
everyone new to SIMD tries to write and wonders why it's slow.
I guess I should have demonstrated this point more clearly in my talk. It
was very rushed (actually, the script was basically on the spot), sorry
about that!

There's not enough code in those loops. You're basically profiling loop
iteration performance and the latency of a float opcode vs a simd opcode...
not any significant work.
You should see a big difference if you unroll the loop 4-8 times (or more
for such a short loop, depending on the CPU).
I also made the point that you should always avoid doing SIMD profiling on
an x86, and certainly not an x64, since it is both the most forgiving
(it yields the smallest wins of any arch) and the hardest to predict;
the performance difference you see will almost certainly not be the same on
someone else's chip.

Look again to my points about latency, reducing the overall pipeline length
(demonstrated with the addition sequence), and unrolling the loops.
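As a rough sketch of what unrolling and shortening the dependency chain look like in code, here is a sum over an array unrolled four times with four separate accumulators. Plain floats are used so it builds anywhere; the function name is hypothetical and this is not the code from the talk.

```d
float sumUnrolled(const(float)[] data)
{
    // Four accumulators give four independent add chains
    // instead of one chain as long as the array.
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    // Main loop, unrolled 4x.
    for (; i + 4 <= data.length; i += 4)
    {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    // Leftover elements when the length is not a multiple of 4.
    float tail = 0;
    for (; i < data.length; ++i)
        tail += data[i];

    // Combine pairwise so the final adds also form a short chain.
    return (s0 + s1) + (s2 + s3) + tail;
}

unittest
{
    assert(sumUnrolled([1.0f, 2, 3, 4, 5, 6, 7, 8, 9, 10]) == 55.0f);
    assert(sumUnrolled([]) == 0.0f);
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum. The same structure applies when the accumulators are SIMD vectors instead of floats.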


On 30 June 2013 06:34, Jonathan Dunlap <jadit2 gmail.com> wrote:

 I did watch Manu's a few days ago which inspired me to start this project.
 With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit
 clueless as to why there is almost zero performance difference...
 considering that is seems like an ideal setup to benefit from SIMD. I feel
 that if I can't see gains here: that I shouldn't bother using them in
 practice, where sometimes non-ideal operations must be done.

Jun 29 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Thanks Manu, I think I understand. Quick questions, so I've 
updated my test to allow for loop unrolling 
http://dpaste.dzfl.pl/12933bc8 as the calculation is done over an 
array of elements and does not depend on the last operation. My 
problem is that the program reports using 0 time. However, as 
soon as I start printing out elements the time then jumps to 
looking more realistic. However, even if I print the elements of 
the list after I print the calculation operation, I still get 
zero seconds. Like:

1: calc time
2: do operations
3: print time delta (result:0 time)
4: print all values from operation

1: calc time
2: do operations
3: print all values from operation
4: print time delta (result:large time delta actually shown)

Is D performing operations lazily by default or am I missing 
something?
Jul 01 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Monday, 1 July 2013 at 17:19:02 UTC, Jonathan Dunlap wrote:
 Thanks Manu, I think I understand. Quick questions, so I've 
 updated my test to allow for loop unrolling 
 http://dpaste.dzfl.pl/12933bc8

The loop body in testSimd doesn't do anything. This line:

    auto di = d[i];

copies the vector; it does not reference it.
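The difference between copying an element and referencing it can be sketched as follows. Floats stand in for the vector elements and the function names are made up, since this is not the code from the paste:

```d
// Works on a local copy: the array itself is never modified.
void doubleByCopy(float[] d)
{
    foreach (i; 0 .. d.length)
    {
        auto di = d[i]; // copies the element
        di *= 2;        // changes only the copy
    }
}

// Works on the element in place through a ref loop variable.
void doubleByRef(float[] d)
{
    foreach (ref di; d)
        di *= 2;
}

unittest
{
    float[] a = [1.0f, 2.0f, 3.0f];

    doubleByCopy(a);
    assert(a == [1.0f, 2.0f, 3.0f]); // unchanged

    doubleByRef(a);
    assert(a == [2.0f, 4.0f, 6.0f]); // updated in place
}
```

Writing `d[i] *= 2;` directly, or taking a pointer with `auto p = &d[i];`, also modifies the array in place.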
Jul 01 2013
prev sibling next sibling parent "Jonathan Dunlap" <jadit2 gmail.com> writes:
Thanks Jerro, I went ahead and used a pointer reference to ensure 
it's being saved back into the array 
(http://dpaste.dzfl.pl/52710926). Two things:
1) still showing zero time delta
2) On Windows 7 x64, using a SAMPLE_AT size of 30000 or higher 
will cause the program to immediately quit with no output at all. 
Even the first statement of writeln in the constructor doesn't 
execute.
Jul 01 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
Maybe make the arrays public? It's conceivable the optimiser could
eliminate all that code, since it can prove the results are never
referenced...
I doubt that's the problem though, just a first guess.
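If dead-code elimination were the cause, one way to rule it out is to make the timed work feed a value that the program actually uses. The sketch below assumes a current Phobos with `std.datetime.stopwatch`; the struct and function names are hypothetical.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

struct Timed
{
    Duration elapsed; // time spent in the measured loop
    float checksum;   // depends on every element that was written
}

Timed timeScale(float[] data, float k, size_t passes)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (p; 0 .. passes)
        foreach (ref x; data)
            x *= k;
    sw.stop();

    // Reduce the results to one value the caller can print or test,
    // so the writes above cannot be proven unused.
    float checksum = 0;
    foreach (x; data)
        checksum += x;

    return Timed(sw.peek(), checksum);
}

unittest
{
    auto r = timeScale([1.0f, 2.0f, 3.0f], 2.0f, 2);
    assert(r.checksum == 24.0f); // 4 + 8 + 12
    assert(r.elapsed >= Duration.zero);
}
```

Printing `r.checksum` next to `r.elapsed` in the caller keeps the work observable. This only addresses the optimiser; it would not explain a zero reading caused by a loop body that writes nothing back.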

On 2 July 2013 09:14, Jonathan Dunlap <jadit2 gmail.com> wrote:

 Thanks Jerro, I went ahead and used a pointer reference to ensure it's
 being saved back into the array
 (http://dpaste.dzfl.pl/52710926).
 Two things:
 1) still showing zero time delta
 2) On Windows 7 x64, using a SAMPLE_AT size of 30000 or higher will cause
 the program to immediately quit with no output at all. Even the first
 statement of writeln in the constructor doesn't execute.

Jul 01 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Jonathan Dunlap:

 Thanks Jerro, I went ahead and used a pointer reference to 
 ensure it's being saved back into the array 
 (http://dpaste.dzfl.pl/52710926). Two things:
 1) still showing zero time delta
 2) On Windows 7 x64, using a SAMPLE_AT size of 30000 or higher 
 will cause the program to immediately quit with no output at 
 all. Even the first statement of writeln in the constructor 
 doesn't execute.

Have you taken a look at the asm?

Bye,
bearophile
Jul 01 2013
prev sibling next sibling parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Saturday, 22 June 2013 at 16:04:26 UTC, jerro wrote:

 I have actually run that benchmark with the code from this 
 branch:

 https://github.com/jerro/pfft/tree/experimental

Hello, did you propose your pfft library as a replacement in std.numeric ?
Jul 03 2013
prev sibling parent "jerro" <a a.com> writes:
 Hello, did you propose your pfft library as a replacement in 
 std.numeric ?

I have thought about it, but haven't gotten around to doing it yet. I'd like to finish support for multidimensional transforms first.
Jul 04 2013