digitalmars.D - Vectorization examples


"bearophile" <bearophileHUGS lycos.com> writes:

"Utilizing the other 80% of your system's performance: Starting 
with Vectorization" by Ulrich Drepper:

https://www.youtube.com/watch?v=DXPfE2jGqg0

It shows two still missing parts of the D type system: a way to 
define strongly typed byte alignments for arrays (something 
better than the aligned() shown here, because I prefer the 
alignment to be part of the type), and a way to tell the type 
system that some array slices are fully distinct (the __restrict 
seen here, I think this information doesn't need to be part of a 
type).

Bye,
bearophile

Apr 20 2015

"Panke" <tobias pankrath.net> writes:

On Monday, 20 April 2015 at 09:41:09 UTC, bearophile wrote:
 "Utilizing the other 80% of your system's performance: Starting 
 with Vectorization" by Ulrich Drepper:

 https://www.youtube.com/watch?v=DXPfE2jGqg0

 It shows two still missing parts of the D type system: a way to 
 define strongly typed byte alignments for arrays (something 
 better than the aligned() shown here, because I prefer the 
 alignment to be part of the type), and a way to tell the type 
 system that some array slices are fully distinct (the 
 __restrict seen here, I think this information doesn't need to 
 be part of a type).

 Bye,
 bearophile

Aren't unaligned loads as fast as aligned loads on modern x86?

Apr 20 2015

"finalpatch" <fengli gmail.com> writes:

On Monday, 20 April 2015 at 11:01:28 UTC, Panke wrote:
 On Monday, 20 April 2015 at 09:41:09 UTC, bearophile wrote:
 "Utilizing the other 80% of your system's performance: 
 Starting with Vectorization" by Ulrich Drepper:

 https://www.youtube.com/watch?v=DXPfE2jGqg0

 It shows two still missing parts of the D type system: a way 
 to define strongly typed byte alignments for arrays (something 
 better than the aligned() shown here, because I prefer the 
 alignment to be part of the type), and a way to tell the type 
 system that some array slices are fully distinct (the 
 __restrict seen here, I think this information doesn't need to 
 be part of a type).

 Bye,
 bearophile

 Aren't unaligned loads as fast as aligned loads on modern x86?

No, that's not true. On modern x86 processors, using unaligned 
load instructions on aligned data does not incur additional 
overhead, so you can always use unaligned loads for 
everything, but loading unaligned data is still slower than 
loading aligned data.
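
The distinction between the load instruction and the alignment of the data can be sketched in C with SSE intrinsics. This is only an illustration, assuming an x86 target with SSE; the helper names `sum4_unaligned_load` and `sum4_aligned_load` are hypothetical. `_mm_loadu_ps` (MOVUPS) accepts any address, while `_mm_load_ps` (MOVAPS) requires a 16-byte aligned one.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */
#endif

/* Hypothetical helper: sums p[0..3] using the unaligned-load
   instruction, so p may have any alignment. */
static float sum4_unaligned_load(const float *p)
{
#if defined(__SSE__)
    float lanes[4];
    __m128 v = _mm_loadu_ps(p);      /* valid for any address */
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return p[0] + p[1] + p[2] + p[3]; /* scalar fallback elsewhere */
#endif
}

/* Hypothetical helper: same sum, but uses the aligned-load
   instruction, so p must be 16-byte aligned. */
static float sum4_aligned_load(const float *p)
{
    assert(((uintptr_t)p & 15u) == 0); /* precondition of _mm_load_ps */
#if defined(__SSE__)
    float lanes[4];
    __m128 v = _mm_load_ps(p);
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Both helpers return the same result on aligned data. Only the first may be called on a misaligned pointer such as `buf + 1`, and that is the case where the post above says a slowdown remains.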

Apr 20 2015

"Panke" <tobias pankrath.net> writes:

 No that's not true. On modern x86 processors using unaligned 
 loading instructions on aligned data does not incur additional 
 overhead, therefore you can always use unaligned load for 
 everything, but loading unaligned data is still slower than 
 aligned data.

Thanks for clarifying.

Apr 20 2015

"Luc Bourhis" <ljbo nowhere.com> writes:

On Monday, 20 April 2015 at 11:15:48 UTC, finalpatch wrote:
 On Monday, 20 April 2015 at 11:01:28 UTC, Panke wrote:
 Aren't unaligned loads as fast as aligned loads on modern x86?

 No that's not true. On modern x86 processors using unaligned 
 loading instructions on aligned data does not incur additional 
 overhead, therefore you can always use unaligned load for 
 everything, but loading unaligned data is still slower than 
 aligned data.

According to [1, sections 7.13 and 8.13], the overhead was 
particularly bad for Core2, but this is no longer a major issue 
for either Nehalem or Sandy Bridge. Do you have data contradicting 
him?

[1] Agner Fog, 3. The microarchitecture of Intel, AMD and VIA 
CPUs, Tech. report, Copenhagen University College of Engineering, 
February 2012. http://www.agner.org/optimize/

Apr 24 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/20/2015 2:41 AM, bearophile wrote:
 "Utilizing the other 80% of your system's performance: Starting with
 Vectorization" by Ulrich Drepper:

 https://www.youtube.com/watch?v=DXPfE2jGqg0

 It shows two still missing parts of the D type system: a way to define strongly
 typed byte alignments for arrays (something better than the aligned() shown
 here, because I prefer the alignment to be part of the type),

Use arrays of double2, float4, int4, etc., declared in core.simd. Those will be 
aligned appropriately.
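
core.simd is a D module, so it cannot be shown directly in C. As a rough analogue only, the GCC/Clang `vector_size` extension also yields a vector type whose alignment is part of the type. The sketch below assumes one of those compilers; the `float4` typedef and the `scale` function are hypothetical names chosen to mirror the D ones.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (an assumption of this sketch): a 16-byte vector
   of four floats. The type itself carries 16-byte alignment. */
typedef float float4 __attribute__((vector_size(16)));

/* Hypothetical example: multiply every lane of every vector by k. */
static void scale(float4 *v, size_t n, float k)
{
    float4 kk = {k, k, k, k};      /* broadcast k into all four lanes */
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * kk;          /* element-wise vector multiply */
}
```

Every element of an array of `float4` is then 16-byte aligned without any separate alignment annotation on the array.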


 and a way to tell
 the type system that some array slices are fully distinct (the __restrict seen
 here, I think this information doesn't need to be part of a type).

A runtime test is sufficient.

Apr 20 2015

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 Use arrays of double2, float4, int4, etc., declared in 
 core.simd. Those will be aligned appropriately.

Is the GC able to give memory aligned to 32 bytes for new 
architectures with 512 bits wide SIMD?


 and a way to tell
 the type system that some array slices are fully distinct (the 
 __restrict seen
 here, I think this information doesn't need to be part of a 
 type).

 A runtime test is sufficient.

One of the points of having a type system is to rule out certain 
classes of bugs caused by programmers. The compiler could use the 
type system to add those runtime tests where needed. Better still, 
static information inserted in the code can sometimes avoid the 
time spent on runtime tests altogether, as shown in that video 
(he shows assembly code that contains runtime tests).

Another example of missing static information in D is shown near 
the end of the video, where he shows an annotation to compile 
functions for different CPUs, where the compiler updates function 
pointers inside the binary according to the CPU you are using, 
making the code safe and efficient.

Bye,
bearophile

Apr 20 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/20/2015 1:09 PM, bearophile wrote:
 Walter Bright:

 Use arrays of double2, float4, int4, etc., declared in core.simd. Those will
 be aligned appropriately.

 Is the GC able to give memory aligned to 32 bytes for new architectures with
 512 bits wide SIMD?

When the CPU requires 32 byte alignment, the compiler/GC will support it.

And even if it doesn't, it is trivial to manually align things.
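
One common way to align memory by hand is to over-allocate and round the address up to the next multiple of the alignment. The following is a minimal C sketch of that idea, not what the D runtime does; `align_up` and `alloc_aligned` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of alignment.
   alignment must be a nonzero power of two. */
static uintptr_t align_up(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
}

/* Hypothetical helper: returns a pointer to at least `size` usable bytes
   aligned to `alignment`. The pointer that must later be passed to
   free() is stored in *raw. Returns NULL if the allocation fails. */
static void *alloc_aligned(size_t size, size_t alignment, void **raw)
{
    *raw = malloc(size + alignment - 1);  /* room for the worst-case shift */
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, alignment);
}
```

The caller keeps both pointers: the aligned one for the data and the raw one for `free`.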

 and a way to tell
 the type system that some array slices are fully distinct (the __restrict seen
 here, I think this information doesn't need to be part of a type).

 A runtime test is sufficient.

 One of the points of having a type system is to rule out certain classes of
 bugs caused by programmers. The compiler could use the type system to add those
 runtime tests where needed. And even better sometimes is to avoid the time used
 by run time tests, as shown in that video, using the static information
 inserted in the code (he shows assembly code that contains run time tests).

"this information doesn't need to be part of a type"

Besides, you can create a 'restrict' template that checks for overlap at 
runtime; that check can be turned on and off at compile time (i.e. assert). 
The runtime check overhead should be insignificant if using large arrays.
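
The overlap check described here can be sketched in C, where `assert` plays the same role: it disappears when the code is compiled with `NDEBUG`. This is an illustration rather than the D template itself; `ranges_overlap` and `add_arrays` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte ranges [a, a+na) and [b, b+nb) share at least one byte. */
static bool ranges_overlap(const void *a, size_t na, const void *b, size_t nb)
{
    uintptr_t a0 = (uintptr_t)a, b0 = (uintptr_t)b;
    if (na == 0 || nb == 0)
        return false;              /* an empty range overlaps nothing */
    return a0 < b0 + nb && b0 < a0 + na;
}

/* dst[i] = x[i] + y[i]. The restrict qualifiers promise the compiler that
   dst does not alias the inputs; the asserts verify that promise in
   debug builds and cost nothing when NDEBUG is defined. */
static void add_arrays(float *restrict dst, const float *restrict x,
                       const float *restrict y, size_t n)
{
    assert(!ranges_overlap(dst, n * sizeof *dst, x, n * sizeof *x));
    assert(!ranges_overlap(dst, n * sizeof *dst, y, n * sizeof *y));
    for (size_t i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

The check runs once per call, so its cost does not grow with the array length.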


 Another example of missing static information in D is shown near the end of the
 video, where he shows an annotation to compile functions for different CPUs,
 where the compiler updates function pointers inside the binary according to the
 CPU you are using, making the code safe and efficient.

Come on, bearophile. I've done that stuff in C based on the runtime CPU. No 
compiler support is needed.
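
A minimal C sketch of this kind of manual dispatch follows. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` query and falls back to the scalar path elsewhere. All function names are hypothetical, and `sum_wide` is only a stand-in for a real SIMD kernel.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Baseline implementation that runs on any CPU. */
static float sum_scalar(const float *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stand-in for a CPU-specific kernel: four independent accumulators,
   the shape a hand-written SIMD version would typically take. */
static float sum_wide(const float *p, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++)             /* leftover elements */
        s0 += p[i];
    return (s0 + s1) + (s2 + s3);
}

/* Runtime CPU query. Assumption: GCC/Clang builtins on x86. */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Choose an implementation once; callers then go through the pointer. */
static sum_fn select_sum(void)
{
    return cpu_has_avx2() ? sum_wide : sum_scalar;
}
```

A program would call `select_sum()` once at startup and keep the returned pointer, which is the manual equivalent of the compiler-patched function pointers shown in the talk.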

Apr 20 2015
