
digitalmars.D - 64-bit and SSE

reply dsimcha <dsimcha yahoo.com> writes:
Given that Walter has indicated that 64-bit support is on the agenda for after
D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2)
support in DMD in the relatively near future?  If so, will it be exposed as a
compiler option even when compiling in 32-bit mode?

I've realized that this is kind of important for me since Intel deprecated x87
on its Core 2 and Pentium 4 chips, meaning any old school floating point code
runs painfully slow compared to, say, an AMD chip that still has a decent x87.
Mar 02 2010
next sibling parent reply retard <re tard.com.invalid> writes:
SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?
Mar 02 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"retard" <re tard.com.invalid> wrote in message 
news:hmjmjd$15uj$1 digitalmars.com...
SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?
Yes. The ones who enjoy arbitrarily shrinking their potential user base.
Mar 02 2010
parent reply retard <re tard.com.invalid> writes:
Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:

Yes. The ones who enjoy arbitrarily shrinking their potential user base.
Why not dynamic code path selection:

if (cpu_capabilities & SSE4_2)
  run_fast_method();
else if (cpu_capabilities & SSE2)
  run_medium_fast_method();
else
  run_slow_method();

One could also use higher level design patterns like abstract factories here.
Mar 02 2010
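For reference, the dispatch described in the post above might look roughly like this in D. This is only a sketch under stated assumptions: the `Cap` bits, `detectCaps` and `pickPath` are hypothetical names made up for illustration, and the feature queries come from druntime's `core.cpuid` module as it exists today, not from anything available in DMD at the time of the thread.

```d
import cpuid = core.cpuid;
import std.stdio : writeln;

// Hypothetical capability bits; the names mirror the post's pseudocode.
enum Cap : uint { none = 0, sse2 = 1 << 0, sse42 = 1 << 1 }

// Build the mask once from what the running CPU reports.
uint detectCaps()
{
    uint caps = Cap.none;
    if (cpuid.sse2)  caps |= Cap.sse2;
    if (cpuid.sse42) caps |= Cap.sse42;
    return caps;
}

// Bitwise & tests a single bit; logical && would be true for any
// pair of nonzero operands and would always pick the first branch.
string pickPath(uint caps)
{
    if (caps & Cap.sse42)
        return "fast";
    else if (caps & Cap.sse2)
        return "medium";
    else
        return "slow";
}

void main()
{
    writeln(pickPath(detectCaps()));
}
```

Keeping `pickPath` separate from detection means the selection logic can be exercised with any mask, independent of the machine it runs on.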
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from retard (re tard.com.invalid)'s article
 Why not dynamic code path selection:

 if (cpu_capabilities & SSE4_2)
   run_fast_method();
 else if (cpu_capabilities & SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.
Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow. At the bottom end, who the heck still uses machines that don't support SSE2? I agree with Nick to some degree that developers shouldn't assume their audiences have the latest and greatest, but SSE2 has been supported by AMD for about 7 years and Intel for about 9. I'm pretty sure that's at least a few standard deviations longer than the average lifetime of computer equipment. You have to draw the line somewhere or we'd all be tweaking our programs to fit in 640k of address space.
Mar 02 2010
next sibling parent reply retard <re tard.com.invalid> writes:
Tue, 02 Mar 2010 19:41:17 +0000, dsimcha wrote:

Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow. At the bottom end, who the heck still uses machines that don't support SSE2? I agree with Nick to some degree that developers shouldn't assume their audiences have the latest and greatest, but SSE2 has been supported by AMD for about 7 years and Intel for about 9. I'm pretty sure that's at least a few standard deviations longer than the average lifetime of computer equipment. You have to draw the line somewhere or we'd all be tweaking our programs to fit in 640k of address space.
At least popular video decoders like mplayer seem to have dynamic cpu detection. No idea, how it exactly affects the instructions used in the code. Well, there are all kinds of x86 clones out there. I don't remember what instructions they all support, but they have more variation than the pentium/core line of processors.
Mar 02 2010
parent retard <re tard.com.invalid> writes:
Tue, 02 Mar 2010 19:46:20 +0000, retard wrote:

 At least popular video decoders like mplayer seem to have dynamic cpu
 detection. No idea, how it exactly affects the instructions used in the
 code. Well, there are all kinds of x86 clones out there. I don't
 remember what instructions they all support, but they have more
 variation than the pentium/core line of processors.
Also the code example was badly crafted. It's not even my opinion that we should support all legacy pre-586 CPUs.
Mar 02 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"dsimcha" <dsimcha yahoo.com> wrote in message 
news:hmjpkt$1i20$1 digitalmars.com...
Two reasons: At the top end it's more trouble than it's worth unless the code is **really** performance critical, and a lot of code with really performance critical floating point is scientific computing code that may only have to run on one arch anyhow.
I'd think that kind of branching could be done automatically by a reasonably intelligent optimizer. Or there's the possibility of compiling-upon-installation that could just detect the CPU being used (although that admittedly comes with a few difficulties and potential issues). I guess I was only assuming that retard was suggesting requiring > SSE2. I'm not sure if he really did mean it that way.
 At the bottom end, who the heck still uses machines that don't support 
 SSE2?
My Linux box is an AMD without SSE2.
 You have to draw the line
 somewhere or we'd all be tweaking our programs to fit in 640k of address 
 space.
Certainly true.
Mar 02 2010
parent retard <re tard.com.invalid> writes:
Tue, 02 Mar 2010 15:13:10 -0500, Nick Sabalausky wrote:

I'd think that kind of branching could be done automatically by a reasonably intelligent optimizer. Or there's the possibility of compiling-upon-installation that could just detect the CPU being used (although that admittedly comes with a few difficulties and potential issues). I guess I was only assuming that retard was suggesting requiring > SSE2. I'm not sure if he really did mean it that way.
You can compile two versions of e.g. mplayer: one with the architecture fixed at compile time and one with dynamic CPU detection. The latter is rather useful for free Linux live CDs, when you really can't guess all the target machines beforehand.
Mar 02 2010
prev sibling parent reply Don <nospam nospam.com> writes:
retard wrote:
 Why not dynamic code path selection:

 if (cpu_capabilities & SSE4_2)
   run_fast_method();
 else if (cpu_capabilities & SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.
The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.
Mar 02 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Don" <nospam nospam.com> wrote in message 
news:hmk01v$1u32$1 digitalmars.com...
The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.
You can still just increase the grain-size as needed. For instance, take this example of code that is too fine-grained:

-------------------------------------------
void fineGrainedA(Param p) {
    if(supports_SSE4)      { /* Use SSE4 */ }
    else if(supports_SSE2) { /* Use SSE2 */ }
    else                   { /* Use Default */ }
}

void fineGrainedB(Param p) {
    if(supports_SSE4)      { /* Use SSE4 */ }
    else if(supports_SSE2) { /* Use SSE2 */ }
    else                   { /* Use Default */ }
}

void foo() {
    foreach(thing; bunchOThings) {
        fineGrainedA(thing);
        fineGrainedB(thing);
    }
}
-------------------------------------------

That can be turned into this (and a smart optimizer could probably do it automatically, especially if it's the compiler that's internally generating 'fineGrainedA' and 'fineGrainedB' in the first place):

-------------------------------------------
enum CPUVer { SSE4, SSE2, Default }

void fineGrainedA(CPUVer ver)(Param p) {
    static if(ver == CPUVer.SSE4)      { /* Use SSE4 */ }
    else static if(ver == CPUVer.SSE2) { /* Use SSE2 */ }
    else                               { /* Use Default */ }
}

void fineGrainedB(CPUVer ver)(Param p) {
    static if(ver == CPUVer.SSE4)      { /* Use SSE4 */ }
    else static if(ver == CPUVer.SSE2) { /* Use SSE2 */ }
    else                               { /* Use Default */ }
}

void fooImpl(CPUVer ver)() {
    foreach(thing; bunchOThings) {
        fineGrainedA!(ver)(thing);
        fineGrainedB!(ver)(thing);
    }
}

void foo() {
    if(supports_SSE4)
        fooImpl!(CPUVer.SSE4)();
    else if(supports_SSE2)
        fooImpl!(CPUVer.SSE2)();
    else
        fooImpl!(CPUVer.Default)();
}
-------------------------------------------

And if foo gets called a lot, like in some loop, you can just take things another level out.
Mar 02 2010
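A self-contained variant of the pattern from the post above, as a sketch only. The names `square`, `sumSquaresImpl` and `sumSquares` are hypothetical, all three branches compute the same thing (a real version would put different instruction sequences in each), and the runtime checks assume druntime's `core.cpuid`.

```d
import cpuid = core.cpuid;

enum CPUVer { SSE4, SSE2, Default }

// Fine-grained kernel: the variant is fixed at compile time,
// so there is no CPU check per call.
float square(CPUVer ver)(float x)
{
    static if (ver == CPUVer.SSE4)
        return x * x;   // SSE4-specific code would go here
    else static if (ver == CPUVer.SSE2)
        return x * x;   // SSE2-specific code would go here
    else
        return x * x;   // portable fallback
}

// Coarse-grained loop, instantiated once per CPU version.
float sumSquaresImpl(CPUVer ver)(const(float)[] xs)
{
    float total = 0;    // explicit, since float.init is NaN in D
    foreach (x; xs)
        total += square!ver(x);
    return total;
}

// The only runtime check: one branch per call to the outer function.
float sumSquares(const(float)[] xs)
{
    if (cpuid.sse41)
        return sumSquaresImpl!(CPUVer.SSE4)(xs);
    else if (cpuid.sse2)
        return sumSquaresImpl!(CPUVer.SSE2)(xs);
    else
        return sumSquaresImpl!(CPUVer.Default)(xs);
}

void main()
{
    assert(sumSquares([1.0f, 2.0f, 3.0f]) == 14.0f);
}
```

The cost is that every templated function is compiled once per `CPUVer` value, which is the code-size trade-off raised later in the thread.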
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 3/2/2010 14:28, Don wrote:
 retard wrote:
 Why not dynamic code path selection:

 if (cpu_capabilities & SSE4_2)
   run_fast_method();
 else if (cpu_capabilities & SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.
The method needs to be fairly large for that to be beneficial. For fine-grained stuff, like basic operations on 3D vectors, it doesn't work at all. And that's one of the primary use cases for SSE.
Why not do it at the largest possible level of granularity?

int main() {
  if (cpu_capabilities & SSE4_2) {
    return run_fast_main();
  } else if (cpu_capabilities & SSE2) {
    return run_medium_fast_main();
  } else {
    return run_slow_main();
  }
}

The compiler should be able to do this automatically by compiling every single function in the program N times with N different code generation settings. Executable size will skyrocket, but it won't matter because executable size is rarely a significant concern.

-- 
Rainer Deyke - rainerd eldwood.com
Mar 02 2010
next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 02 Mar 2010 23:01:01 -0500, Rainer Deyke <rainerd eldwood.com>  
wrote:

 Why not do it at the largest possible level of granularity?

 int main() {
   if (cpu_capabilities & SSE4_2) {
     return run_fast_main();
   } else if (cpu_capabilities & SSE2) {
     return run_medium_fast_main();
   } else {
     return run_slow_main();
   }
 }

 The compiler should be able to do this automatically by compiling every
 single function in the program N times with N different code generation
 settings. Executable size will skyrocket, but it won't matter because
 executable size is rarely a significant concern.
That's great until you start linking, or worse, dynamically linking. Then you run into some major problems.
Mar 02 2010
prev sibling parent Don <nospam nospam.com> writes:
Rainer Deyke wrote:
 Why not do it at the largest possible level of granularity?

 int main() {
   if (cpu_capabilities & SSE4_2) {
     return run_fast_main();
   } else if (cpu_capabilities & SSE2) {
     return run_medium_fast_main();
   } else {
     return run_slow_main();
   }
 }

 The compiler should be able to do this automatically by compiling every
 single function in the program N times with N different code generation
 settings. Executable size will skyrocket, but it won't matter because
 executable size is rarely a significant concern.
I don't think that ever makes sense. I'd just compile multiple executables with different settings, and select which one to use at install time.
Mar 03 2010
prev sibling next sibling parent reply Don <nospam nospam.com> writes:
dsimcha wrote:
 Given that Walter has indicated that 64-bit support is on the agenda for after
 D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2)
 support in DMD in the relatively near future?  If so, will it be exposed as a
 compiler option even when compiling in 32-bit mode?
I think the way to do this will be, as a first step, to use SSE for short vector operations. There's some really low-hanging fruit there. To get the full benefit from SSE we need to use SSE registers for parameter passing, but I think that'll only be possible with the 64-bit ABI.
 I've realized that this is kind of important for me since Intel deprecated x87
 on its Core 2 and Pentium 4 chips, meaning any old school floating point code
 runs painfully slow compared to, say, an AMD chip that still has a decent x87.
x87 is only slow on P4. But everything is slow on P4. AFAIK 80 bit loads and stores are the only things which are slower on Core2 and i7 than on Pentium 3 (4 vs 2 cycles). And they're actually faster than AMD. So I don't think this is such a big issue. In fact, some of the x87 transcendental operations are faster on Core2 than on any earlier processor. So they still have a decent x87 :-). Of course, in the occasions when SSE lets you do 4 operations at once, you get nearly a 4X speedup...
Mar 02 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:
 To get the full benefit from SSE we need to use SSE registers for
 parameter passing, but I think that'll only be possible with the 64-bit ABI.
Can you explain this to me a bit better? Why can't D use SSE registers in 32-bit code too?

Bye,
bearophile
Mar 02 2010
parent Don <nospam nospam.com> writes:
bearophile wrote:
 Don:
 To get the full benefit from SSE we need to use SSE registers for
 parameter passing, but I think that'll only be possible with the 64-bit ABI.
 Can you explain this to me a bit better? Why can't D use SSE registers
 in 32-bit code too?
It can, but not in the ABI. So support in 64-bit can be better.
Mar 02 2010
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Don (nospam nospam.com)'s article
 Of course, in the occasions when SSE lets you do 4 operations at once,
 you get nearly a 4X speedup...
Is SSE(2) inherently faster (at least in real-world implementations) than x87, even when you don't vectorize? Would I be able to expect any speedup from going from x87 to SSE(2) for code that has a decent amount of implicit instruction level parallelism but wasn't explicitly vectorized either by me or the compiler?
Mar 02 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?
sqrt, for example, is fast, and there are other high-level instructions (for video decoding, cryptography, etc.). But you have to think about how much time has passed since the design of the C language. CPUs when C was designed were profoundly different from the ones available now. If D has some success, future CPUs will surely be different from the current ones. I think SSE registers will be kind of obsolete when AVX is out, about next year. Do you need to change the ABI of D3 again for AVX?

Bye,
bearophile
Mar 02 2010
prev sibling parent Don <nospam nospam.com> writes:
dsimcha wrote:
 == Quote from Don (nospam nospam.com)'s article
 Of course, in the occasions when SSE lets you do 4 operations at once,
 you get nearly a 4X speedup...
 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?
No. (Except on Pentium 4, where SSE was basically the only part of the CPU that wasn't crippled).
 Would I be able to expect any speedup from going from x87 to SSE(2) for
 code that has a decent amount of implicit instruction level parallelism
 but wasn't explicitly vectorized either by me or the compiler?
I doubt it. The only time that you get an easy benefit is when you have a mix of serial and parallel calculations.

float[4] x, y;
float z = some_calculation;
x[] += z*y[];

If you're using SSE for all your calculations, z will already be in an SSE register, so it makes setting up the parallel calculation a bit quicker. And the compiler might be better at scheduling SSE code than x87. But that's not really a processor thing.
Mar 03 2010
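The array-operation snippet in the post above can be made runnable as follows. This is a small sketch with made-up values, where `z` stands in for the result of the serial calculation; whether the compiler actually emits packed SSE instructions for the array operation depends on the compiler and its settings.

```d
void main()
{
    float[4] x = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] y = [10.0f, 20.0f, 30.0f, 40.0f];
    float z = 0.5f;     // scalar result of some serial calculation

    // Element-wise array operation: x[i] += z * y[i] for each i.
    x[] += z * y[];

    float[4] expected = [6.0f, 12.0f, 18.0f, 24.0f];
    assert(x == expected);
}
```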
prev sibling parent #ponce <ponce deleteme.adinpsz.org> writes:
 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?  Would I be able to expect any speedup
 from going from x87 to SSE(2) for code that has a decent amount of implicit
 instruction level parallelism but wasn't explicitly vectorized either by me
 or the compiler?
There are a couple of interesting scalar instructions in SSE:
- cvttss2si: float-to-int truncation (the same as floorf for non-negative inputs) without modifying the rounding mode (SSE2)
- 32-bit float square root and inverse square root
- min, max

SSE doesn't suffer from denormalization, which can be very useful.

I personally don't mind whether the compiler uses them or not, provided one can use inline assembly :)
Mar 04 2010
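The scalar operations listed in the post above have direct source-level counterparts in D, sketched below. Which machine instructions a given compiler emits for them is an assumption that varies with compiler and target; the point is only that truncation and floor differ for negative inputs.

```d
import std.algorithm.comparison : max, min;
import std.math : floor, sqrt;

void main()
{
    // cast(int) truncates toward zero, the same rounding cvttss2si performs.
    assert(cast(int) 2.7f == 2);
    assert(cast(int) -2.7f == -2);

    // floor rounds toward negative infinity, so it differs for negatives.
    assert(floor(-2.7f) == -3.0f);

    // Scalar square root, min and max on 32-bit floats.
    assert(sqrt(16.0f) == 4.0f);
    assert(min(1.5f, -0.5f) == -0.5f);
    assert(max(1.5f, -0.5f) == 1.5f);
}
```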