
D - Optimising

reply nicO <nicolas.boulay ifrance.com> writes:
I'm from the F-CPU project (f-cpu.org); the goal is to create a "free
CPU".

One of our main problems is the compiler. I wrote some algorithms in
C to improve matrix multiplication (using gcc on a P3). I used loop
unrolling and software pipelining, broke a lot of false
dependencies, and removed calculations from the inner loop. I reordered
the loops to align data accesses and did strip mining. Most of this is
processor dependent and should be done by the compiler. I gained an
average of 4 times compared to the usual algorithm; my best gain was on
a 1.2 GHz Athlon processor with a 512*512 matrix: x25!
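The strip mining (blocking) described above can be sketched in C; the matrix size, block size, and function names below are illustrative assumptions, not the poster's actual code:

```c
#include <stddef.h>

#define N  64   /* matrix dimension (illustrative) */
#define BS 16   /* block (strip) size, tuned to the cache */

/* Naive triple loop: C += A * B, row-major. */
void matmul_naive(const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Strip-mined (blocked) version: each loop is split into blocks of BS
 * so the working set of A, B, and C stays resident in cache while it
 * is reused, which is where most of the speedup on a P3/Athlon comes
 * from. The k summation order per element is unchanged, so the result
 * is bit-identical to the naive loop. */
void matmul_blocked(const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++)
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

A real implementation would also unroll the innermost loop and hoist `A[i * N + k]` into a scalar, as the post describes.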

One of the main problems is the aliasing problem induced by pointers. In my
tests, I lose 25% of the performance by using pointers instead of arrays.
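As an aside not in the original post: C99 later standardized the `restrict` qualifier for exactly this, a per-pointer promise of no aliasing that lets the compiler hoist loads and reorder stores. A minimal sketch (the function and names are invented for illustration):

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst may alias src and
 * reload src[i] after every store to dst[i]. With restrict, the
 * programmer promises the arrays do not overlap, so loads can be
 * hoisted, the loop unrolled, and stores reordered freely. */
void scale_sum(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i] + dst[i];
}
```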

So what will D do to improve the use of new instructions? All
new CPUs have conditional moves (to avoid emptying the pipeline) and vector
instructions (like MMX and SSE). We could also add the option of compiling
for 2 or more processors.

What do you think ?

nicO
Aug 17 2001
parent reply "Walter" <walter digitalmars.com> writes:
Since D supports rectangular arrays, much more aggressive array optimization
becomes possible.

Aug 17 2001
next sibling parent reply "LuigiG" <ask.if iyou.need.it> writes:
BTW, Microsoft's compiler has a "/Oa assume no aliasing" switch.
Kind of playing with fire, I assume ;)

"Walter" <walter digitalmars.com> wrote in message
news:9lkk0g$2qdq$2 digitaldaemon.com...
 Since D supports rectangular arrays, much more aggressive array optimization
 becomes possible.

Aug 18 2001
parent reply "Walter" <walter digitalmars.com> writes:
LuigiG wrote in message <9ll862$5ch$1 digitaldaemon.com>...
BTW, Microsoft's compiler has a
"/Oa assume no aliasing"
switch.
kind of playing with fire I assume ;)
I decided long ago not to support such a switch. It would really only be useful to someone who understood exactly how the internal compiler optimizations really worked, which is likely nobody <g>.
Aug 18 2001
parent "LuigiG" <ask.if iyou.need.it> writes:
"Walter" <walter digitalmars.com> wrote in message
news:9lmlj0$111a$3 digitaldaemon.com...
 LuigiG wrote in message <9ll862$5ch$1 digitaldaemon.com>...
BTW, Microsoft's compiler has a
"/Oa assume no aliasing"
switch.
kind of playing with fire I assume ;)
I decided long ago not to support such a switch. It would really only be useful to someone who understood exactly how the internal compiler optimizations really worked, which is likely nobody <g>.
Yep: basically, if you know so much about the compiler that you can use the noalias switch, you don't need the noalias switch anymore.
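A sketch of why such a switch is playing with fire: the function below is legal C whose result depends on aliasing, so compiling it under an assume-no-aliasing option can silently change its answer (the example is invented for illustration):

```c
/* sum_into adds src[0..n-1] into *acc. If the caller passes a pointer
 * into src itself, the accumulator aliases the array, and each store
 * changes later loads. An "assume no aliasing" switch licenses the
 * compiler to keep *acc in a register for the whole loop, which for
 * the aliased call in the usage example would yield 8 instead of 9. */
static int sum_into(int *acc, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        *acc += src[i];
    return *acc;
}
```

With `int a[3] = {1, 2, 3}`, the aliased call `sum_into(&a[1], a, 3)` must return 9 under standard C semantics, because the store at i=0 feeds the load of `src[1]` at i=1.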
Aug 19 2001
prev sibling parent reply nicO <nicolas.boulay ifrance.com> writes:
Walter wrote:

 Since D supports rectangular arrays, much more aggressive array optimization
 becomes possible.

Is that enough to introduce strip mining? Or to change the order of the loops?
Aug 18 2001
parent reply "Walter" <walter digitalmars.com> writes:
nicO wrote in message <3B7EB156.6D6761E6 ifrance.com>...
Walter wrote:
 Since D supports rectangular arrays, much more aggressive array optimization
 becomes possible.
Is that enough to introduce strip mining? Or to change the order of the loops?
I think so.
Aug 18 2001
next sibling parent nicO <nicolas.boulay ifrance.com> writes:
Walter wrote:

 nicO wrote in message <3B7EB156.6D6761E6 ifrance.com>...
Walter wrote:
 Since D supports rectangular arrays, much more aggressive array optimization
 becomes possible.
Is that enough to introduce strip mining? Or to change the order of the loops?
 I think so.

Good news!
Aug 19 2001
prev sibling parent reply nicO <nicolas.boulay ifrance.com> writes:
Walter wrote:

 nicO wrote in message <3B7EB156.6D6761E6 ifrance.com>...
Walter wrote:
 Since D supports rectangular arrays, much more aggressive array optimization
 becomes possible.
Is that enough to introduce strip mining? Or to change the order of the loops?
 I think so.

And what about vector computing? (for the use of MMX and SSE)
Aug 19 2001
parent reply "Walter" <walter digitalmars.com> writes:
nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
And what about vector computing? (for the use of MMX and SSE)
I don't know.
Aug 19 2001
parent reply nicO <nicolas.boulay ifrance.com> writes:
Walter wrote:

 nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
And what about vector computing? (for the use of MMX and SSE)
 I don't know.

To work well with current CPUs, any compiler should use SIMD instructions. The difference in performance is great.

nicO
Aug 19 2001
parent reply "Sean L. Palmer" <spalmer iname.com> writes:
I think the key concept to take advantage of SSE, MMX, 3DNow!, etc. SIMD
instruction sets is that the language needs to recognize the data types
manipulated by these instruction sets (usually short, power-of-two constant
sized vectors of int or float of various sizes) and have the language
provide operators to work with these "extra" basic types: vector add and
subtract, vector componentwise multiply, vector-by-scalar multiply, square
root of one or all components, reciprocal square root, dot product, and
matrix multiply (SSE/MMX don't provide much help for these last two);
conditional testing (for every component in A, if less than the corresponding
component in B, set the result component to 1, else set it to 0);
bitwise ops for integer SIMD; component swizzling/shuffling/rearrangement.

If you're keen on understanding how SIMD programming works, download the
Microsoft Processor Pack for MSVC 6 and install it... it provides new __m64
and __m128 data types and a bunch of new intrinsic C functions that work on
these types, as well as __declspec(aligned(16)) so you can have a prayer of
getting the compiler to help you meet these types' heavy memory alignment
restrictions.

Perhaps if a few operators were added, constant-sized arrays of basic int
and float types could automatically take advantage of these types of
instruction sets.  Operators that work on two arrays and take an array as
result.  This really ties into the operator overloading discussion.
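The "extra basic types" idea can be emulated in portable C today; this is a sketch only, the `vec4` type and function names are invented for illustration, and a real compiler would map them to an SSE register (`__m128`) instead of a struct:

```c
/* A 4-wide float vector as a plain struct; with hardware support the
 * compiler would keep this in one SSE register. */
typedef struct { float c[4]; } vec4;

/* Componentwise add: one SSE addps instruction in hardware. */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.c[i] = a.c[i] + b.c[i];
    return r;
}

/* Componentwise multiply: one SSE mulps instruction. */
static vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.c[i] = a.c[i] * b.c[i];
    return r;
}

/* Componentwise less-than test yielding 1.0 or 0.0 per lane, as
 * described above (SSE's cmpltps yields all-ones/all-zeros bit masks
 * instead, but the idea is the same). */
static vec4 vec4_lt(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.c[i] = (a.c[i] < b.c[i]) ? 1.0f : 0.0f;
    return r;
}
```

With operators on such types built into the language, the expression `a = b * b + c` would compile to three or four SIMD instructions instead of a scalar loop.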

Sean

"nicO" <nicolas.boulay ifrance.com> wrote in message
news:3B807A0C.33576068 ifrance.com...
 Walter a écrit :
 nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
And what about vector computing ? (for the use of MMX and SSE)
I don't know.
To work better with current CPU, any compiler should use SIMD instruction. The difference in performance is great. nicO
Oct 29 2001
parent reply "Walter" <walter digitalmars.com> writes:
You're right. -Walter

"Sean L. Palmer" <spalmer iname.com> wrote in message
news:9rj7js$2sq8$1 digitaldaemon.com...
 I think the key concept to take advantage of SSE and MMX and 3DNow etc
SIMD
 instruction sets is that the language needs to recognize the data types
 manipulated by these instruction sets (usually short, power-of-two
constant
 sized vectors of int or float of various sizes) and have the language
 provide operators to work with these "extra" basic types.  Vector add,
 subtract, vector componentwise multiply, vector-by-scalar multiply, square
 root of one or all components, reciprocal square root, dot product, and
 matrix multiply (SSE/MMX don't provide much help for these last two).
 Conditional testing (for every component in A, if less than corresponding
 component in B, set result component to 1, else set result component to
0).
 Bitwise ops for integer SIMD.  Component
swizzling/shuffling/rearrangement.
 If you're keen on understanding how SIMD programming works, download the
 Microsoft Processor Pack for MSVC 6 and install it... it provides new
__m64
 and __m128 data types and a bunch of new intrinsic C functions that work
on
 these types, as well as __declspec(aligned(16)) so you can have a prayer
of
 getting the compiler to help you meet these types' heavy memory alignment
 restrictions.

 Perhaps if a few operators were added, constant-sized arrays of basic int
 and float types could automatically take advantage of these types of
 instruction sets.  Operators that work on two arrays and take an array as
 result.  This really ties into the operator overloading discussion.

 Sean

 "nicO" <nicolas.boulay ifrance.com> wrote in message
 news:3B807A0C.33576068 ifrance.com...
 Walter a écrit :
 nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
And what about vector computing ? (for the use of MMX and SSE)
I don't know.
To work better with current CPU, any compiler should use SIMD instruction. The difference in performance is great. nicO
Jan 09 2002
parent reply "Sean L. Palmer" <spalmer iname.com> writes:
If D had good native support for SIMD programming, that would seem to be a
great big selling point for D.  "D:  Language of the future of parallel
programming".  ;)   That's what SIMD programming is, parallel programming.
You do several operations in parallel usually and mix the results together
occasionally.  Modern processors can do several operations in parallel
usually but we usually have to resort to assembly language to program that
way.  D could abstract the whole concept of that kind of data structure and
emulate it if the actual hardware isn't available  (you could choose to have
the compiler emit SSE or 3DNow instructions, or emulate using the FPU.)

Not just 3D graphics, but a *lot* of higher order math and physics
calculations are done with vectors and matrices.  This stuff needs to run as
fast as possible, and a compiler is much better suited to the job of
reducing stalls and such than a human is.  (because we humans are lazy,
mainly, and it's tedious, repetitive work... work computers are well suited
to do)  If you don't have language support you have to do it thru the
assembly language pathway, which is littered with minefields since usually
even inline assembly causes all registers to be saved and restored or at
least causes some kind of inefficiency with the optimizer.  Non-inline
assembly (not every platform supports inline assembly) involves call/return
statements and more register save/load bullshit.  Not the kind of stuff you
want happening all over in your tight, fast inner loops where 99% of the
time in the application is spent... the parts that benefit most from vector
computing.

You could even provide convenience features that may not exist in hardware,
such as inner product (which isn't available on Intel SSE instruction set,
yet definitely belongs in any hardware vector processor IMO) or cross
product (no instruction sets provide this to my knowledge, it takes like 3
or 4 instructions to emulate).
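For reference, the dot and cross products under discussion take only a handful of scalar operations to emulate; a sketch in C (the `vec3` type and names are invented for illustration):

```c
typedef struct { float x, y, z; } vec3;

/* Dot product: three multiplies and two adds. */
static float dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Cross product: six multiplies and three subtracts; on a SIMD unit it
 * is typically emulated with a few shuffles, two multiplies, and a
 * subtract. */
static vec3 cross3(vec3 a, vec3 b)
{
    vec3 r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}
```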

I think all the 3D engine programmers would switch to D overnight.  ;)

But not only that, the code generator could support using these instructions
to optimize other parts of the code too.  It could unroll this, for
instance:

float a[4],b[4],c[4];

for (int i = 0; i<4; ++i)
  a[i] = b[i]*b[i] + c[i];

Into this:

a[] = b[] * b[] + c[];

Or perhaps with some Unicode operator (dot and cross product symbols both
exist in Unicode).
Let's change that pesky "evaluates right-to-left" assignment into something
that reads more left-to-right (or, in this case, top-to-bottom):

b[] * b[] + c[] ==> a[];

Which breaks down to this:

b[0] * b[0] + c[0] => a[0];
b[1] * b[1] + c[1] => a[1];
b[2] * b[2] + c[2] => a[2];
b[3] * b[3] + c[3] => a[3];

Let's just turn that line sideways, rotate it 90 degrees to the right, and
drop the square brackets:

b0 b1 b2 b3
*  *  *  *
b0 b1 b2 b3
+  +  +  +
c0 c1 c2 c3
=> => => =>
a0 a1 a2 a3

Which would essentially boil down to this kind of assembly language:

fmov4 xr0, b
fmov4 xr1, c
fmul4 xr2,xr0,xr0
fadd4 xr2,xr2,xr1
fmov4 a, xr2

or even this:

fmov4 xr0, b
fmad4 xr2,xr0,xr0,c ; multiply each component of xr0 by itself and add the
corresponding component of c, store all 4 into xr2
fmov4 a, xr2

Check out the instruction set for D3D's vertex or pixel shaders sometime.
;)  All new video cards have that kind of vector processor.  Your kid's PS2
has two of those vector units.  Every PC since the Pentium MMX has at least
one that can do very fast operations on bytes.   Yet I can't name a
programming language above assembly that inherently supports such things.

Sean

"Walter" <walter digitalmars.com> wrote in message
news:a1j6gs$2b02$1 digitaldaemon.com...
 You're right. -Walter

 "Sean L. Palmer" <spalmer iname.com> wrote in message
 news:9rj7js$2sq8$1 digitaldaemon.com...
 I think the key concept to take advantage of SSE and MMX and 3DNow etc
SIMD
 instruction sets is that the language needs to recognize the data types
 manipulated by these instruction sets (usually short, power-of-two
constant
 sized vectors of int or float of various sizes) and have the language
 provide operators to work with these "extra" basic types.  Vector add,
 subtract, vector componentwise multiply, vector-by-scalar multiply,
square
 root of one or all components, reciprocal square root, dot product, and
 matrix multiply (SSE/MMX don't provide much help for these last two).
 Conditional testing (for every component in A, if less than
corresponding
 component in B, set result component to 1, else set result component to
0).
 Bitwise ops for integer SIMD.  Component
swizzling/shuffling/rearrangement.
 If you're keen on understanding how SIMD programming works, download the
 Microsoft Processor Pack for MSVC 6 and install it... it provides new
__m64
 and __m128 data types and a bunch of new intrinsic C functions that work
on
 these types, as well as __declspec(aligned(16)) so you can have a prayer
of
 getting the compiler to help you meet these types' heavy memory
alignment
 restrictions.

 Perhaps if a few operators were added, constant-sized arrays of basic
int
 and float types could automatically take advantage of these types of
 instruction sets.  Operators that work on two arrays and take an array
as
 result.  This really ties into the operator overloading discussion.

 Sean

 "nicO" <nicolas.boulay ifrance.com> wrote in message
 news:3B807A0C.33576068 ifrance.com...
 Walter a écrit :
 nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
And what about vector computing ? (for the use of MMX and SSE)
I don't know.
To work better with current CPU, any compiler should use SIMD instruction. The difference in performance is great. nicO
Jan 10 2002
parent reply "Walter" <walter digitalmars.com> writes:
Actually, I think D does, with its array operation semantics. They aren't
implemented :-(, but they are in the spec. -Walter

"Sean L. Palmer" <spalmer iname.com> wrote in message
news:a1jj23$2jl8$1 digitaldaemon.com...
 If D had good native support for SIMD programming, that would seem to be a
 great big selling point for D.  "D:  Language of the future of parallel
 programming".  ;)   That's what SIMD programming is, parallel programming.
 You do several operations in parallel usually and mix the results together
 occasionally.  Modern processors can do several operations in parallel
 usually but we usually have to resort to assembly language to program that
 way.  D could abstract the whole concept of that kind of data structure
and
 emulate it if the actual hardware isn't available  (you could choose to
have
 the compiler emit SSE or 3DNow instructions, or emulate using the FPU.)

 Not just 3D graphics, but a *lot* of higher order math and physics
 calculations are done with vectors and matrices.  This stuff needs to run
as
 fast as possible, and a compiler is much better suited to the job of
 reducing stalls and such than a human is.  (because we humans are lazy,
 mainly, and it's tedious, repetitive work... work computers are well
suited
 to do)  If you don't have language support you have to do it thru the
 assembly language pathway, which is littered with minefields since usually
 even inline assembly causes all registers to be saved and restored or at
 least causes some kind of inefficiency with the optimizer.  Non-inline
 assembly (not every platform supports inline assembly) involves
call/return
 statements and more register save/load bullshit.  Not the kind of stuff
you
 want happening all over in your tight, fast inner loops where 99% of the
 time in the application is spent... the parts that benefit most from
vector
 computing.

 You could even provide convenience features that may not exist in
hardware,
 such as inner product (which isn't available on Intel SSE instruction set,
 yet definitely belongs in any hardware vector processor IMO) or cross
 product (no instruction sets provide this to my knowledge, it takes like 3
 or 4 instructions to emulate).

 I think all the 3D engine programmers would switch to D overnight.  ;)

 But not only that, the code generator could support using these
instructions
 to optimize other parts of the code too.  It could unroll this, for
 instance:

 float a[4],b[4],c[4];

 for (int i = 0; i<4; ++i)
   a[i] = b[i]*b[i] + c[i];

 Into this:

 a[] = b[] * b[] + c[];

 Or perhaps with some Unicode operator (Dot and Cross product symbols both
 exist in Unicode)
 Let's change that pesky "evaluates right-to-left" assignment into
something
 that reads more left-to-right (or in this case,

 b[] * b[] + c[] ==> a[];

 Which breaks down to this:

 b[0] * b[0] + c[0] => a[0];
 b[1] * b[1] + c[1] => a[1];
 b[2] * b[2] + c[2] => a[2];
 b[3] * b[3] + c[3] => a[3];

 Let's just turn that line sideways, rotate it 90 degrees to the right, and
 drop the square brackets:

 b0 b1 b2 b3
 *  *  *  *
 b0 b1 b2 b3
 +  +  +  +
 c0 c1 c2 c3
 => => => =>
 a0 a1 a2 a3

 Which would essentially boil down to this kind of assembly language:

 fmov4 xr0, b
 fmov4 xr1, c
 fmul4 xr2,xr0,xr0
 fadd4 xr2,xr2,xr1
 fmov4 a, xr2

 or even this:

 fmov4 xr0, b
 fmad4 xr2,xr0,xr0,c ; multiply each component of xr0 by itself and add to
 corresponding component of xr1, store all 4 into xr2
 fmov4 a, xr2

 Check out the instruction set for D3D's vertex or pixel shaders sometime.
 ;)  All new video cards have that kind of vector processor.  Your kid's
PS2
 has two of those vector units.  Every PC since the Pentium MMX has at
least
 one that can do very fast operations on bytes.   Yet I can't name a
 programming language above assembly that inherently supports such things.

 Sean
Jan 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a1jlrk$2lgs$1 digitaldaemon.com...

 Actually, I think D does with its array operation semantics. They aren't
 implemented :-(, but they are in the spec. -Walter
But there are no dot and cross products for vectors.
Jan 10 2002
parent reply "Walter" <walter digitalmars.com> writes:
True.

"Pavel Minayev" <evilone omen.ru> wrote in message
news:a1k1lr$2shq$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a1jlrk$2lgs$1 digitaldaemon.com...

 Actually, I think D does with its array operation semantics. They aren't
 implemented :-(, but they are in the spec. -Walter
But there is no dot and cross product for vectors.
Jan 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a1kmio$2hu$2 digitaldaemon.com...

 True.
Then why not add them? As well as matrix multiply? And optimize using MMX/3DNow!/SSE?
Jan 10 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a1knkr$d57$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a1kmio$2hu$2 digitaldaemon.com...

 True.
Then why not add them? As well as matrix multiply? And optimize using MMX/3DNow!/SSE?
Because I need to finish what I've got first!
Jan 11 2002
next sibling parent reply "Robert W. Cunningham" <rcunning acm.org> writes:
Walter wrote:

 "Pavel Minayev" <evilone omen.ru> wrote in message
 news:a1knkr$d57$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a1kmio$2hu$2 digitaldaemon.com...

 True.
Then why not add them? As well as matrix multiply? And optimize using MMX/3DNow!/SSE?
Because I need to finish what I've got first!
I nominate Walter to be the first human to be cloned...

With enough Walters, and if they all work the same hours and all think alike, then we may have a SIMD-based language author and compiler writer!

-BobC
Jan 12 2002
parent "Pavel Minayev" <evilone omen.ru> writes:
"Robert W. Cunningham" <rcunning acm.org> wrote in message
news:3C408D59.DE2F85AB acm.org...

 With enough Walters, and if they all work the same hours and all think
 alike, then we may have an SIMD-based language author and compiler
 writer!
Yes, but who will hold the copyright? Also, imagine all those Walters arguing here on the newsgroup... =)
Jan 12 2002
prev sibling parent "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a1oifr$brs$2 digitaldaemon.com...

 Because I need to finish what I've got first!
I don't mean "now". I mean "in general" =)
Jan 12 2002