digitalmars.D - 64-bit and SSE

dsimcha <dsimcha yahoo.com> writes:

Given that Walter has indicated that 64-bit support is on the agenda for after
D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2)
support in DMD in the relatively near future?  If so, will it be exposed as a
compiler option even when compiling in 32-bit mode?

I've realized that this is kind of important for me since Intel deprecated x87
on its Core 2 and Pentium 4 chips, meaning any old school floating point code
runs painfully slow compared to, say, an AMD chip that still has a decent x87.

Mar 02 2010

retard <re tard.com.invalid> writes:

Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda for
 after D2 is finished and x87 is deprecated in 64-bit mode, will we also
 see SSE(2) support in DMD in the relatively near future?  If so, will it
 be exposed as a compiler option even when compiling in 32-bit mode?
 
 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old school
 floating point code runs painfully slow compared to, say, an AMD chip
 that still has a decent x87.

SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

Mar 02 2010

"Nick Sabalausky" <a a.a> writes:

"retard" <re tard.com.invalid> wrote in message 
news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda for
 after D2 is finished and x87 is deprecated in 64-bit mode, will we also
 see SSE(2) support in DMD in the relatively near future?  If so, will it
 be exposed as a compiler option even when compiling in 32-bit mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old school
 floating point code runs painfully slow compared to, say, an AMD chip
 that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

Yes. The ones who enjoy arbitrarily shrinking their potential user base.

Mar 02 2010

retard <re tard.com.invalid> writes:

Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:

 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda
 for after D2 is finished and x87 is deprecated in 64-bit mode, will we
 also see SSE(2) support in DMD in the relatively near future?  If so,
 will it be exposed as a compiler option even when compiling in 32-bit
 mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old
 school floating point code runs painfully slow compared to, say, an
 AMD chip that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 
 Yes. The ones who enjoy arbitrarily shrinking their potential user base.

Why not dynamic code path selection:

if (cpu_capabilities & SSE4_2)
  run_fast_method();
else if (cpu_capabilities & SSE2)
  run_medium_fast_method();
else
  run_slow_method();

One could also use higher level design patterns like abstract factories 
here.
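
A capability test against a bit mask needs the bitwise `&` operator rather than the logical `&&`. As a rough sketch of the idea in D, assuming druntime's `core.cpuid` module for detection (the flag constants and the `selectPath` helper are made up for illustration, not part of any library):

```d
import core.cpuid : sse2, sse42;

// Hypothetical flag values for a capability bit mask.
enum uint SSE2   = 1 << 0;
enum uint SSE4_2 = 1 << 1;

/// Build the capability mask from what the running CPU reports.
uint detectCapabilities()
{
    uint caps = 0;
    if (sse2)  caps |= SSE2;
    if (sse42) caps |= SSE4_2;
    return caps;
}

/// Pick a code path, best one first. Note the bitwise `&`.
string selectPath(uint caps)
{
    if (caps & SSE4_2)
        return "fast";
    else if (caps & SSE2)
        return "medium";
    else
        return "slow";
}

void main()
{
    import std.stdio : writeln;
    writeln("selected path: ", selectPath(detectCapabilities()));
}
```

Returning a name instead of calling the method directly just keeps the sketch easy to check; a real program would call the matching routine.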

Mar 02 2010

dsimcha <dsimcha yahoo.com> writes:

== Quote from retard (re tard.com.invalid)'s article
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda
 for after D2 is finished and x87 is deprecated in 64-bit mode, will we
 also see SSE(2) support in DMD in the relatively near future?  If so,
 will it be exposed as a compiler option even when compiling in 32-bit
 mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old
 school floating point code runs painfully slow compared to, say, an
 AMD chip that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user base.

 Why not dynamic code path selection:
 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();
 One could also use higher level design patterns like abstract factories
 here.

Two reasons:  At the top end it's more trouble than it's worth unless the code
is **really** performance critical, and a lot of code with really performance
critical floating point is scientific computing code that may only have to run
on one arch anyhow.

At the bottom end, who the heck still uses machines that don't support SSE2?  I
agree with Nick to some degree that developers shouldn't assume their audiences
have the latest and greatest, but SSE2 has been supported by AMD for about 7
years and Intel for about 9.  I'm pretty sure that's at least a few standard
deviations longer than the average lifetime of computer equipment.  You have to
draw the line somewhere or we'd all be tweaking our programs to fit in 640k of
address space.

Mar 02 2010

retard <re tard.com.invalid> writes:

Tue, 02 Mar 2010 19:41:17 +0000, dsimcha wrote:

 == Quote from retard (re tard.com.invalid)'s article
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the
 agenda for after D2 is finished and x87 is deprecated in 64-bit
 mode, will we also see SSE(2) support in DMD in the relatively near
 future?  If so, will it be exposed as a compiler option even when
 compiling in 32-bit mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old
 school floating point code runs painfully slow compared to, say, an
 AMD chip that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user
 base.

 Why not dynamic code path selection:
 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();
 One could also use higher level design patterns like abstract factories
 here.

 
 Two reasons:  At the top end it's more trouble than it's worth unless
 the code is **really** performance critical, and a lot of code with
 really performance critical floating point is scientific computing code
 that may only have to run on one arch anyhow.
 
 At the bottom end, who the heck still uses machines that don't support
 SSE2?  I agree with Nick to some degree that developers shouldn't assume
 their audiences have the latest and greatest, but SSE2 has been
 supported by AMD for about 7 years and Intel for about 9.  I'm pretty
 sure that's at least a few standard deviations longer than the average
 lifetime of computer equipment.  You have to draw the line somewhere or
 we'd all be tweaking our programs to fit in 640k of address space.

At least popular video decoders like mplayer seem to have dynamic CPU 
detection. I have no idea how exactly it affects the instructions used in the 
code. Well, there are all kinds of x86 clones out there. I don't remember 
what instructions they all support, but they have more variation than the 
Pentium/Core line of processors.

Mar 02 2010

retard <re tard.com.invalid> writes:

Tue, 02 Mar 2010 19:46:20 +0000, retard wrote:

 At least popular video decoders like mplayer seem to have dynamic cpu
 detection. No idea, how it exactly affects the instructions used in the
 code. Well, there are all kinds of x86 clones out there. I don't
 remember what instructions they all support, but they have more
 variation than the pentium/core line of processors.

Also the code example was badly crafted. It's not even my opinion that we 
should support all legacy pre-586 CPUs.

Mar 02 2010

"Nick Sabalausky" <a a.a> writes:

"dsimcha" <dsimcha yahoo.com> wrote in message 
news:hmjpkt$1i20$1 digitalmars.com...
 == Quote from retard (re tard.com.invalid)'s article
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user 
 base.

 Why not dynamic code path selection:
 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();
 One could also use higher level design patterns like abstract factories
 here.

 Two reasons:  At the top end it's more trouble than it's worth unless the
 code is **really** performance critical, and a lot of code with really
 performance critical floating point is scientific computing code that may
 only have to run on one arch anyhow.

I'd think that kind of branching could be done automatically by a reasonably 
intelligent optimizer. Or there's the possibility of 
compiling-upon-installation that could just detect the CPU being used 
(although that admittedly comes with a few difficulties and potential 
issues). I guess I was only assuming that retard was suggesting requiring > 
SSE2. I'm not sure if he really did mean it that way.

 At the bottom end, who the heck still uses machines that don't support 
 SSE2?

My Linux box is an AMD without SSE2.

 You have to draw the line
 somewhere or we'd all be tweaking our programs to fit in 640k of address 
 space.

Certainly true.

Mar 02 2010

retard <re tard.com.invalid> writes:

Tue, 02 Mar 2010 15:13:10 -0500, Nick Sabalausky wrote:

 "dsimcha" <dsimcha yahoo.com> wrote in message
 news:hmjpkt$1i20$1 digitalmars.com...
 == Quote from retard (re tard.com.invalid)'s article
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user
 base.

 Why not dynamic code path selection:
 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();
 One could also use higher level design patterns like abstract
 factories here.

 Two reasons:  At the top end it's more trouble than it's worth unless
 the code is **really** performance critical, and a lot of code with
 really performance critical floating point is scientific computing code
 that may only have to run on one arch anyhow.

 I'd think that kind of branching could be done automatically by a
 reasonably intelligent optimizer. Or there's the possibility of
 compiling-upon-installation that could just detect the CPU being used
 (although that admittedly comes with a few difficulties and potential
 issues). I guess I was only assuming that retard was suggesting
 requiring > SSE2. I'm not sure if he really did mean it that way.

You can compile two versions of e.g. mplayer: one with the architecture 
fixed at compile time and one with dynamic CPU detection. The latter 
is rather useful for free Linux live CDs, where you really can't guess all 
the target machines beforehand.

Mar 02 2010

Don <nospam nospam.com> writes:

retard wrote:
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:
 
 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda
 for after D2 is finished and x87 is deprecated in 64-bit mode, will we
 also see SSE(2) support in DMD in the relatively near future?  If so,
 will it be exposed as a compiler option even when compiling in 32-bit
 mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old
 school floating point code runs painfully slow compared to, say, an
 AMD chip that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user base.

 
 Why not dynamic code path selection:
 
 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();
 
 One could also use higher level design patterns like abstract factories 
 here.

The method needs to be fairly large for that to be beneficial. For 
fine-grained stuff, like basic operations on 3D vectors, it doesn't work 
at all. And that's one of the primary use cases for SSE.

Mar 02 2010

"Nick Sabalausky" <a a.a> writes:

"Don" <nospam nospam.com> wrote in message 
news:hmk01v$1u32$1 digitalmars.com...
 retard wrote:
 Tue, 02 Mar 2010 14:17:12 -0500, Nick Sabalausky wrote:

 "retard" <re tard.com.invalid> wrote in message
 news:hmjmjd$15uj$1 digitalmars.com...
 Tue, 02 Mar 2010 15:49:02 +0000, dsimcha wrote:

 Given that Walter has indicated that 64-bit support is on the agenda
 for after D2 is finished and x87 is deprecated in 64-bit mode, will we
 also see SSE(2) support in DMD in the relatively near future?  If so,
 will it be exposed as a compiler option even when compiling in 32-bit
 mode?

 I've realized that this is kind of important for me since Intel
 deprecated x87 on its Core 2 and Pentium 4 chips, meaning any old
 school floating point code runs painfully slow compared to, say, an
 AMD chip that still has a decent x87.

 SSE(2) ? Don't people already use SSE 4.2 and prepare for AVX?

 Yes. The ones who enjoy arbitrarily shrinking their potential user base.

 Why not dynamic code path selection:

 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract factories 
 here.

 The method needs to be fairly large for that to be beneficial. For 
 fine-grained stuff, like basic operations on 3D vectors, it doesn't work 
 at all. And that's one of the primary use cases for SSE.

You can still just increase the grain-size as needed. For instance, take 
this example of code that is too fine-grained:

-------------------------------------------
void fineGrainedA(Param p)
{
    if(supports_SSE4)
        { /* Use SSE4 */ }
    else if(supports_SSE2)
        { /* Use SSE2 */ }
    else
        { /* Use Default */ }
}

void fineGrainedB(Param p)
{
    if(supports_SSE4)
        { /* Use SSE4 */ }
    else if(supports_SSE2)
        { /* Use SSE2 */ }
    else
        { /* Use Default */ }
}

void foo()
{
    foreach(thing; bunchOThings)
    {
        fineGrainedA(thing);
        fineGrainedB(thing);
    }
}
-------------------------------------------

That can be turned into this (and a smart optimizer could probably do it 
automatically, especially if it's the compiler that's internally generating 
'fineGrainedA' and 'fineGrainedB' in the first place):

-------------------------------------------
enum CPUVer { SSE4, SSE2, Default }

void fineGrainedA(CPUVer ver)(Param p)
{
    static if(ver == CPUVer.SSE4)
        { /* Use SSE4 */ }
    else static if(ver == CPUVer.SSE2)
        { /* Use SSE2 */ }
    else
        { /* Use Default */ }
}

void fineGrainedB(CPUVer ver)(Param p)
{
    static if(ver == CPUVer.SSE4)
        { /* Use SSE4 */ }
    else static if(ver == CPUVer.SSE2)
        { /* Use SSE2 */ }
    else
        { /* Use Default */ }
}

// The run-time check is hoisted out of the loop: each instantiation
// of fooImpl contains only the code for one CPU version.
void fooImpl(CPUVer ver)()
{
    foreach(thing; bunchOThings)
    {
        fineGrainedA!(ver)(thing);
        fineGrainedB!(ver)(thing);
    }
}

void foo()
{
    if(supports_SSE4)
        fooImpl!(CPUVer.SSE4)();
    else if(supports_SSE2)
        fooImpl!(CPUVer.SSE2)();
    else
        fooImpl!(CPUVer.Default)();
}
-------------------------------------------

And if foo gets called a lot, like in some loop, you can just take things 
another level out.
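
Taken to the extreme, the check can run once at program start and the result can be stored in a function pointer. This is only a sketch under assumptions: it uses druntime's `core.cpuid` for detection, and the three `sum*` functions are hypothetical stand-ins whose "SSE" bodies are placeholders, not real SSE code.

```d
import core.cpuid : sse2, sse42;

// Hypothetical stand-ins for three builds of the same kernel.
float sumDefault(const(float)[] v)
{
    float s = 0;
    foreach (x; v) s += x;
    return s;
}
float sumSSE2(const(float)[] v) { return sumDefault(v); } // placeholder body
float sumSSE4(const(float)[] v) { return sumDefault(v); } // placeholder body

alias SumFn = float function(const(float)[]);

/// Choose an implementation from the detected features, best first.
SumFn chooseSum(bool hasSSE42, bool hasSSE2)
{
    if (hasSSE42) return &sumSSE4;
    if (hasSSE2)  return &sumSSE2;
    return &sumDefault;
}

// Set once before main() runs; callers never branch on the CPU again.
__gshared SumFn sumImpl;

shared static this()
{
    sumImpl = chooseSum(sse42, sse2);
}

void main()
{
    import std.stdio : writeln;
    writeln(sumImpl([1.0f, 2.0f, 3.0f]));
}
```

The catch is that every call now goes through a pointer and cannot be inlined, so this only pays off when the selected function does a decent amount of work per call.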

Mar 02 2010

Rainer Deyke <rainerd eldwood.com> writes:

On 3/2/2010 14:28, Don wrote:
 retard wrote:
 Why not dynamic code path selection:

 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.

 
 The method needs to be fairly large for that to be beneficial. For
 fine-grained stuff, like basic operations on 3D vectors, it doesn't work
 at all. And that's one of the primary use cases for SSE.

Why not do it at the largest possible level of granularity?

int main() {
  if (cpu_capabilities & SSE4_2) {
    return run_fast_main();
  } else if (cpu_capabilities & SSE2) {
    return run_medium_fast_main();
  } else {
    return run_slow_main();
  }
}

The compiler should be able to do this automatically by compiling every
single function in the program N times with N different code generation
settings.  Executable size will skyrocket, but it won't matter because
executable size is rarely a significant concern.


-- 
Rainer Deyke - rainerd eldwood.com

Mar 02 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Tue, 02 Mar 2010 23:01:01 -0500, Rainer Deyke <rainerd eldwood.com>  
wrote:

 On 3/2/2010 14:28, Don wrote:
 retard wrote:
 Why not dynamic code path selection:

 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.

 The method needs to be fairly large for that to be beneficial. For
 fine-grained stuff, like basic operations on 3D vectors, it doesn't work
 at all. And that's one of the primary use cases for SSE.

 Why not do it at the largest possible level of granularity?

 int main() {
   if (cpu_capabilities && SSE4_2) {
     return run_fast_main();
   } else if (cpu_capabilities && SSE2) {
     return run_medium_fast_main();
   } else {
     return run_slow_main();
   }
 }

 The compiler should be able to do this automatically by compiling every
 single function in the program N times with N different code generation
 setting.  Executable size will skyrocket, but it won't matter because
 executable size is rarely a significant concern.

That's great until you start linking, or worse, dynamically linking. Then  
you run into some major problems.

Mar 02 2010

Don <nospam nospam.com> writes:

Rainer Deyke wrote:
 On 3/2/2010 14:28, Don wrote:
 retard wrote:
 Why not dynamic code path selection:

 if (cpu_capabilities && SSE4_2)
   run_fast_method();
 else if (cpu_capabilities && SSE2)
   run_medium_fast_method();
 else
   run_slow_method();

 One could also use higher level design patterns like abstract
 factories here.

 The method needs to be fairly large for that to be beneficial. For
 fine-grained stuff, like basic operations on 3D vectors, it doesn't work
 at all. And that's one of the primary use cases for SSE.

 
 Why not do it at the largest possible level of granularity?
 
 int main() {
   if (cpu_capabilities && SSE4_2) {
     return run_fast_main();
   } else if (cpu_capabilities && SSE2) {
     return run_medium_fast_main();
   } else {
     return run_slow_main();
   }
 }
 
 The compiler should be able to do this automatically by compiling every
 single function in the program N times with N different code generation
 setting.  Executable size will skyrocket, but it won't matter because
 executable size is rarely a significant concern.

I don't think that ever makes sense. I'd just compile multiple 
executables with different settings, and select which one to use at 
install time.

Mar 03 2010

Don <nospam nospam.com> writes:

dsimcha wrote:
 Given that Walter has indicated that 64-bit support is on the agenda for after
 D2 is finished and x87 is deprecated in 64-bit mode, will we also see SSE(2)
 support in DMD in the relatively near future?  If so, will it be exposed as a
 compiler option even when compiling in 32-bit mode?

I think the way to do this will be, as a first step, to use SSE for 
short vector operations. There's some really low-hanging fruit there.
To get the full benefit from SSE we need to use SSE registers for 
parameter passing, but I think that'll only be possible with the 64 bit ABI.

 I've realized that this is kind of important for me since Intel deprecated x87
 on its Core 2 and Pentium 4 chips, meaning any old school floating point code
 runs painfully slow compared to, say, an AMD chip that still has a decent x87.

x87 is only slow on P4. But everything is slow on P4.
AFAIK 80 bit loads and stores are the only things which are slower on 
Core2 and i7 than on Pentium 3 (4 vs 2 cycles). And they're actually 
faster than AMD. So I don't think this is such a big issue.
In fact, some of the x87 transcendental operations are faster on Core2 
than on any earlier processor. So they still have a decent x87 :-).

Of course, in the occasions when SSE lets you do 4 operations at once, 
you get nearly a 4X speedup...

Mar 02 2010

bearophile <bearophileHUGS lycos.com> writes:

Don:
 To get the full benefit from SSE we need to use SSE registers for 
 parameter passing, but I think that'll only be possible with the 64 bit ABI.

Can you explain this to me a bit better? Why can't D use SSE registers in 32 bit
code too?

Bye,
bearophile

Mar 02 2010

Don <nospam nospam.com> writes:

bearophile wrote:
 Don:
 To get the full benefit from SSE we need to use SSE registers for 
 parameter passing, but I think that'll only be possible with the 64 bit ABI.

 
 Can you explain this to me a bit better? Why can't D use SSE registers in 32 bit
 code too?
 
 Bye,
 bearophile

It can, but not in the ABI. So support in 64-bit can be better.

Mar 02 2010

dsimcha <dsimcha yahoo.com> writes:

== Quote from Don (nospam nospam.com)'s article
 Of course, in the occasions when SSE lets you do 4 operations at once,
 you get nearly a 4X speedup...

Is SSE(2) inherently faster (at least in real-world implementations) than
x87, even when you don't vectorize?  Would I be able to expect any speedup from
going from x87 to SSE(2) for code that has a decent amount of implicit
instruction-level parallelism but wasn't explicitly vectorized either by me or
the compiler?

Mar 02 2010

bearophile <bearophileHUGS lycos.com> writes:

dsimcha:
 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?

sqrt, for example, is fast, and there are other high-level instructions (for
video decoding, cryptography, etc.).
But you have to consider how much time has passed since the C language was
designed. The CPUs of that time were profoundly different from the ones
available now. If D has some success, future CPUs will surely be different from
the current ones. I think SSE registers will be somewhat obsolete when AVX
comes out, about next year. Will the ABI need to change again in D3 for AVX?

Bye,
bearophile

Mar 02 2010

Don <nospam nospam.com> writes:

dsimcha wrote:
 == Quote from Don (nospam nospam.com)'s article
 Of course, in the occasions when SSE lets you do 4 operations at once,
 you get nearly a 4X speedup...

 
 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?

No. (Except on Pentium 4, where SSE was basically the only part of the 
CPU that wasn't crippled).

 Would I be able to expect any speedup from
 going from x87 to SSE(2) for code that has a decent amount of implicit
 instruction-level parallelism but wasn't explicitly vectorized either by me
 or the compiler?

I doubt it.  The only time that you get an easy benefit is when you have 
a mix of serial and parallel calculations.

float[4] x, y;

float z = some_calculation;
x[] += z*y[];

If you're using SSE for all your calculations, z will already be in an 
SSE register, so it makes setting up the parallel calculation a bit quicker.

And the compiler might be better at scheduling SSE code than x87 code. But 
that's not really a processor thing.
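
For anyone who wants to try the snippet above, here is a minimal runnable version. The `scaleAdd` wrapper and the input values are made up for the example. Whether the compiler actually emits SSE instructions for the array operation depends on the compiler and target, so take this as an illustration of the source-level pattern only.

```d
import std.stdio : writeln;

/// x[i] += z * y[i] for all four elements, written as a D array operation.
void scaleAdd(ref float[4] x, const ref float[4] y, float z)
{
    x[] += z * y[];
}

void main()
{
    float[4] x = [1, 2, 3, 4];
    float[4] y = [10, 20, 30, 40];
    float z = 0.5f;      // result of the "serial" part of the calculation
    scaleAdd(x, y, z);   // the "parallel" part
    writeln(x);          // x is now [6, 12, 18, 24]
}
```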

Mar 03 2010

#ponce <ponce deleteme.adinpsz.org> writes:

 Is SSE(2) inherently faster (at least in real-world implementations) than
 x87, even when you don't vectorize?  Would I be able to expect any speedup from
 going from x87 to SSE(2) for code that has a decent amount of implicit
 instruction-level parallelism but wasn't explicitly vectorized either by me
 or the compiler?

There are a couple of interesting scalar instructions in SSE:

- cvttss2si : float-to-int truncation without modifying the rounding mode
  (equals floorf only for non-negative inputs; the double version, cvttsd2si,
  is SSE2)
- 32-bit float square root and approximate inverse square root
- min, max

SSE can also avoid the denormal slowdown by flushing denormals to zero, which
can be very useful.
I personally don't mind whether the compiler uses them or not, provided one can
use inline assembly :)
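
One caveat worth spelling out: cvttss2si truncates toward zero, so it only matches floorf for non-negative inputs. A small sketch of the difference, using D's `cast(int)` (which truncates toward zero by language definition; whether the compiler emits cvttss2si for it is a code generation detail I'm assuming, not guaranteeing). The helper names are made up for the example.

```d
import std.math : floor;

/// Truncating conversion: rounds toward zero, like cvttss2si.
int truncToInt(float f) { return cast(int) f; }

/// Floor conversion: rounds toward negative infinity.
int floorToInt(float f) { return cast(int) floor(f); }

void main()
{
    import std.stdio : writefln;
    // The two agree for 2.7 and differ for -2.7.
    foreach (f; [2.7f, -2.7f])
        writefln("%s: trunc=%s floor=%s", f, truncToInt(f), floorToInt(f));
}
```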

Mar 04 2010
