digitalmars.D - __restrict, architecture intrinsics vs asm, consoles, and other stuff
- Manu <turkeyman gmail.com> Sep 21 2011
- Trass3r <un known.com> Sep 21 2011
- Walter Bright <newshound2 digitalmars.com> Sep 21 2011
- a <a a.com> Sep 21 2011
- Don <nospam nospam.com> Sep 21 2011
- a <a a.com> Sep 22 2011
- Walter Bright <newshound2 digitalmars.com> Sep 22 2011
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Sep 22 2011
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Sep 22 2011
- Manu Evans <turkeyman gmail.com> Sep 23 2011
- bearophile <bearophileHUGS lycos.com> Sep 23 2011
- Manu Evans <turkeyman gmail.com> Sep 23 2011
- Don <nospam nospam.com> Sep 23 2011
- Peter Alexander <peter.alexander.au gmail.com> Sep 22 2011
- Don <nospam nospam.com> Sep 23 2011
- bearophile <bearophileHUGS lycos.com> Sep 24 2011
- Benjamin Thaut <code benjamin-thaut.de> Sep 21 2011
- Walter Bright <newshound2 digitalmars.com> Sep 22 2011
- Benjamin Thaut <code benjamin-thaut.de> Sep 22 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 22 2011
- Peter Alexander <peter.alexander.au gmail.com> Sep 22 2011
- "Marco Leise" <Marco.Leise gmx.de> Sep 22 2011
- so <so so.so> Sep 22 2011
- so <so so.so> Sep 22 2011
- so <so so.so> Sep 22 2011
- so <so so.so> Sep 23 2011
- Manu Evans <turkeyman gmail.com> Sep 23 2011
- Iain Buclaw <ibuclaw ubuntu.com> Sep 24 2011
- Max Klyga <max.klyga gmail.com> Sep 24 2011
- so <so so.so> Sep 24 2011
- Manu <turkeyman gmail.com> Sep 24 2011
- so <so so.so> Sep 24 2011
- Iain Buclaw <ibuclaw ubuntu.com> Sep 21 2011
- Kagamin <spam here.lot> Sep 22 2011
Hello D community.

I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I look into it, the more I realise this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs.

Anyway, I work in the games industry, 10 years in cross platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.

Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases, many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.

C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm, the reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

Is the D assembler a macro assembler? (ie, assigns registers automatically and manages loads/stores intelligently?) I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists. Are there PowerPC or ARM examples anywhere?

As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.

In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware. How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so I can't really help here.

What about Android (or iPhone, but apple's 'x-code policy' prevents that)? I'd REALLY love to write an android project in D... the toolchain is GCC, I see no reason why it shouldn't be possible to write an android app if an appropriate toolchain was available?

Sorry it's a bit long, thanks for reading this far! I'm looking forward to a brighter future writing lots of D code :P But I need to know that basically all these questions are addressed before I could consider it for serious commercial game dev.
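For readers who have not met the aliasing problem described in this post, here is a minimal sketch in C++ rather than D. It assumes a compiler that accepts the non-standard `__restrict` spelling (GCC, Clang and MSVC do); the function names `scale_add` and `scale_add_restrict` are made up for illustration and do not come from the thread.

```cpp
#include <cassert>
#include <cstddef>

// Plain pointers: the compiler has to assume that a store to dst[i] might
// change *scale, so it may reload *scale on every iteration.
void scale_add(float* dst, const float* src, const float* scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i] * *scale;
}

// __restrict is a promise made by the caller that the three pointers do not
// overlap. The compiler is then free to keep *scale in a register and to
// reorder or vectorise the loads and stores.
void scale_add_restrict(float* __restrict dst, const float* __restrict src,
                        const float* __restrict scale, std::size_t n) {
    // Debug-build check of the most obvious way to break the promise.
    assert(dst != src && dst != scale);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i] * *scale;
}
```

Both functions compute the same result when the arguments really do not overlap; the difference is only in what the optimiser is allowed to assume, and whether it uses that freedom depends on the compiler and target.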
Sep 21 2011
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists. Are there PowerPC or ARM examples anywhere?
Well DMD only supports x86 including inline asm so that's the only thing that's tested. You need to try LDC or GDC for most of the things you request. http://dsource.org/projects/ldc/wiki/InlineAsmExpressions https://bitbucket.org/goshawk/gdc/wiki/UserDocumentation#!extended-assembler Some guys already managed to compile cross-compilers for ARM and ran some basic code on e.g. Nintendo DS. For anything serious you would need to make druntime work though. It's just nobody has done the dirty work yet.
Sep 21 2011
On 9/21/2011 3:55 PM, Manu wrote:
Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases, many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.
D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax:
a[] += b[] * c;
where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.
C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm, the reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?
D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.
Is the D assembler a macro assembler?
No. It's what-you-write-is-what-you-get.
(ie, assigns registers automatically and manage loads/stores intelligently?)
No. It's intended to be a low level assembler for those who want to precisely control things.
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists.
I enjoy writing x86 inline assembler :-)
Are there PowerPC or ARM examples anywhere?
The intention is for other CPU targets to employ the syntax used in their respective CPU manual datasheets.
As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?
The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.
Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).
In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware.
Yes, I agree.
How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so sadly I can't really help here.
I don't know much about GDC's capabilities.
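As a rough illustration of two points in the reply above (the element-wise meaning of `a[] += b[] * c`, and fixed-size arrays being passed by value), here is a C++ analogue. It is only a sketch: `Vec4`, `mul_add` and `mul_add_example` are hypothetical names, `std::array` stands in for D's `float[4]`, and nothing here says how a D compiler actually generates the code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // stand-in for D's float[4]

// `a` is taken by value, so the function works on its own copy and cannot
// alias the caller's storage.
Vec4 mul_add(Vec4 a, const Vec4& b, float c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i] * c;  // element-wise, like D's  a[] += b[] * c;
    return a;
}

void mul_add_example() {
    Vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    const Vec4 b = {{1.0f, 1.0f, 1.0f, 1.0f}};
    const Vec4 r = mul_add(a, b, 2.0f);
    assert(r[0] == 3.0f && r[3] == 6.0f);  // the returned copy was updated
    assert(a[0] == 1.0f);                  // the caller's array is untouched
}
```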
Sep 21 2011
How would one do something like this without intrinsics (the code is c++ using
gcc vector extensions):
template <class V>
struct Fft
{
    typedef typename V::T T;
    typedef typename V::vec vec;
    static const int VecSize = V::Size;
    ...
    template <int Interleaved>
    static NOINLINE void fft_pass_interleaved(
        vec * __restrict pr,
        vec *__restrict pi,
        vec *__restrict pend,
        T *__restrict table)
    {
        for(; pr < pend; pr += 2, pi += 2, table += 2*Interleaved)
        {
            vec tmpr, ti, ur, ui, wr, wi;
            V::template expandComplexArrayToRealImagVec<Interleaved>(table, wr, wi);
            V::template deinterleave<Interleaved>(pr[0],pr[1], ur, tmpr);
            V::template deinterleave<Interleaved>(pi[0],pi[1], ui, ti);
            vec tr = tmpr*wr - ti*wi;
            ti = tmpr*wi + ti*wr;
            V::template interleave<Interleaved>(ur + tr, ur - tr, pr[0], pr[1]);
            V::template interleave<Interleaved>(ui + ti, ui - ti, pi[0], pi[1]);
        }
    }
    ...
Here vector elements need to be shuffled around when they are loaded and
stored. This is platform dependent and cannot be expressed through vector
operations (or gcc vector extensions). Here I abstracted platform dependent
functionality in member functions of V, which are implemented using
intrinsics. The assembly generated for SSE single precision and
Interleaved=4 is:
0000000000000000 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf>:
0: 48 39 d7 cmp %rdx,%rdi
3: 0f 83 9c 00 00 00 jae a5
<_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0xa5>
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10: 0f 28 19 movaps (%rcx),%xmm3
13: 0f 28 41 10 movaps 0x10(%rcx),%xmm0
17: 48 83 c1 20 add $0x20,%rcx
1b: 0f 28 f3 movaps %xmm3,%xmm6
1e: 0f 28 2f movaps (%rdi),%xmm5
21: 0f c6 d8 dd shufps $0xdd,%xmm0,%xmm3
25: 0f c6 f0 88 shufps $0x88,%xmm0,%xmm6
29: 0f 28 e5 movaps %xmm5,%xmm4
2c: 0f 28 47 10 movaps 0x10(%rdi),%xmm0
30: 0f 28 4e 10 movaps 0x10(%rsi),%xmm1
34: 0f c6 e0 88 shufps $0x88,%xmm0,%xmm4
38: 0f c6 e8 dd shufps $0xdd,%xmm0,%xmm5
3c: 0f 28 06 movaps (%rsi),%xmm0
3f: 0f 28 d0 movaps %xmm0,%xmm2
42: 0f c6 c1 dd shufps $0xdd,%xmm1,%xmm0
46: 0f c6 d1 88 shufps $0x88,%xmm1,%xmm2
4a: 0f 28 cd movaps %xmm5,%xmm1
4d: 0f 28 f8 movaps %xmm0,%xmm7
50: 0f 59 ce mulps %xmm6,%xmm1
53: 0f 59 fb mulps %xmm3,%xmm7
56: 0f 59 c6 mulps %xmm6,%xmm0
59: 0f 59 dd mulps %xmm5,%xmm3
5c: 0f 5c cf subps %xmm7,%xmm1
5f: 0f 58 c3 addps %xmm3,%xmm0
62: 0f 28 dc movaps %xmm4,%xmm3
65: 0f 5c d9 subps %xmm1,%xmm3
68: 0f 58 cc addps %xmm4,%xmm1
6b: 0f 28 e1 movaps %xmm1,%xmm4
6e: 0f 15 cb unpckhps %xmm3,%xmm1
71: 0f 14 e3 unpcklps %xmm3,%xmm4
74: 0f 29 4f 10 movaps %xmm1,0x10(%rdi)
78: 0f 28 ca movaps %xmm2,%xmm1
7b: 0f 29 27 movaps %xmm4,(%rdi)
7e: 0f 5c c8 subps %xmm0,%xmm1
81: 48 83 c7 20 add $0x20,%rdi
85: 0f 58 c2 addps %xmm2,%xmm0
88: 0f 28 d0 movaps %xmm0,%xmm2
8b: 0f 15 c1 unpckhps %xmm1,%xmm0
8e: 0f 14 d1 unpcklps %xmm1,%xmm2
91: 0f 29 46 10 movaps %xmm0,0x10(%rsi)
95: 0f 29 16 movaps %xmm2,(%rsi)
98: 48 83 c6 20 add $0x20,%rsi
9c: 48 39 fa cmp %rdi,%rdx
9f: 0f 87 6b ff ff ff ja 10
<_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0x10>
a5: f3 c3 repz retq
Would something like that be possible with D inline assembly or would there be
additional loads and stores for each call of V::interleave, V::deinterleave
and V::expandComplexArrayToRealImagVec?
Sep 21 2011
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
[snip]
At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop.
Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
which compiles to a single shufps instruction.
That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need.
The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example.
A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
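The swizzle idea in this reply (a compile-time selector such as "cdcd" that is checked and expanded before code generation) can be sketched in C++ with template parameters. This is an illustration only: it is not the D CTFE implementation mentioned in the post, the names `Vec4`, `lane` and `swizzle` are hypothetical, and it is ordinary scalar code, so whether a compiler turns it into a single shuffle instruction is not guaranteed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Map a selector letter ('a'..'d') to a lane index at compile time.
constexpr std::size_t lane(char c) { return static_cast<std::size_t>(c - 'a'); }

// swizzle<'c','d','c','d'>(y) plays the role of y[].swizzle!"cdcd"() in the
// post: the selector is part of the type, so a bad selector is rejected
// with a readable message when the code is compiled.
template <char A, char B, char C, char D>
Vec4 swizzle(const Vec4& v) {
    static_assert(A >= 'a' && A <= 'd' && B >= 'a' && B <= 'd' &&
                  C >= 'a' && C <= 'd' && D >= 'a' && D <= 'd',
                  "swizzle selectors must each be one of a, b, c, d");
    return Vec4{{v[lane(A)], v[lane(B)], v[lane(C)], v[lane(D)]}};
}

void swizzle_example() {
    const Vec4 y = {{10.0f, 20.0f, 30.0f, 40.0f}};
    const Vec4 x = swizzle<'c', 'd', 'c', 'd'>(y);
    // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
    assert(x[0] == 30.0f && x[1] == 40.0f && x[2] == 30.0f && x[3] == 40.0f);
}
```

The `static_assert` is the C++ counterpart of the compile-time error messages the post mentions: `swizzle<'c','d','c','e'>(y)` fails to compile instead of misbehaving at run time.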
Sep 21 2011
which compiles to a single shufps instruction.
Doesn't it often require additional needless movaps instructions? For example, the following:
asm {
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}
asm {
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}
compiles to
movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)
movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)
Is it possible to avoid needless loading and storing of values when calling multiple functions that use asm blocks? It also seems that the compiler doesn't inline functions containing asm.
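For comparison, the same two additions written against GCC/Clang vector extensions (the extension used in the C++ code earlier in the thread) look like the sketch below. The names `v4sf`, `add` and `add_twice` are chosen for illustration. Because the vectors are ordinary values rather than opaque asm operands, the compiler is allowed to inline `add` and keep the intermediate result in a register; whether it does so depends on the compiler and optimisation level, so this is not a claim about any particular code generator.

```cpp
#include <cassert>

// GCC/Clang extension: a 16-byte vector of four floats.
typedef float v4sf __attribute__((vector_size(16)));

// A tiny wrapper like the asm-based functions discussed above, except that
// the operation is visible to the optimiser.
inline v4sf add(v4sf a, v4sf b) { return a + b; }

// Two chained additions. Nothing forces the intermediate value to be
// stored to memory and reloaded between the two calls.
v4sf add_twice(v4sf a, v4sf b) { return add(add(a, b), b); }

void add_twice_example() {
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {1.0f, 1.0f, 1.0f, 1.0f};
    v4sf r = add_twice(a, b);
    assert(r[0] == 3.0f && r[3] == 6.0f);
}
```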
Sep 22 2011
On 9/22/2011 5:11 AM, a wrote:
It also seems that the compiler doesn't inline functions containing asm.
That's correct, it currently does not.
Sep 22 2011
On 9/22/11 1:39 AM, Don wrote:
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
[snip]
At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.
Andrei
Sep 22 2011
On 9/22/11 6:00 PM, so wrote:
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and returns T[n].
Andrei
Sep 22 2011
On 9/22/11 9:11 PM, so wrote:
On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
On 9/22/11 6:00 PM, so wrote:
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and returns T[n]. Andrei
Something like this?
Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. Andrei
Sep 22 2011
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
On 9/22/11 1:39 AM, Don wrote:
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
[snip]
At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.
Andrei
This sounds really dangerous to me. I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea.

Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting it as a candidate for hardware vector operations? I think that's a wrong decision in itself, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks...

If float[4] is considered a hardware vector by the compiler,
- How do I define an ACTUAL float[4]?
- How can I be confident that it actually WILL be a hardware vector?

Hardware vectors are NOT float[4]'s, they are a reference to a 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register, do you intend to special case fixed length arrays of all those types to support the hardware functionality for those?

Hardware vectors are NOT floats, they can not interact with the floating point unit, dereferencing of this style 'float x = myVector[0]' is NOT supported by the hardware and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vectors in the first place. Allowing easy access to individual floats within a hardware vector breaks the language's stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function.

float[4] is not even a particularly conveniently sized vector for most inexperienced programmers, the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of results of operations like dot product and magnitude as being scalar values, but they are not, they are a scalar value repeated across all 4 components of a vector 4, and this should be explicit too.

....

I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead:

Add a hardware vector type, let's call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to a hardware vector register explicitly, as vector register function arguments, or as arguments to inline asm blocks.

Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and ideally expandable by hardware vendors or platform holders.

You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types.

Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation. And if so, at least leave the capability of implementing intrinsic architecture-specific permutation by the hardware vendors/platform holders.

At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc.

....

The reason I am so adamant about this, is in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly end up with slower code than if you just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes.

As a minimum the programmer needs to be able to explicitly address the vector registers, pass them to and from functions, and perform explicit (probably IHV supplied) functions on them. The code generator and optimiser need all the information possible, and as explicit as possible, so IHV's can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support.

Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie, making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs.

I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point...

Don: I'd love to hear counter arguments to justify float[4] as a reasonable solution. Currently no matter how I slice it, I just can't see it.

Criticism welcome?
Cheers!
- Manu
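To make the proposal above concrete, here is one possible shape for a typed wrapper, sketched in C++. Everything in it is an assumption made for illustration: the class name `float4` and the functions `load`, `store` and `dot` are hypothetical, and plain scalar loops stand in for whatever vendor intrinsics a real implementation would use. The points it tries to show are the ones argued in the post: the storage is aligned, there is deliberately no `operator[]` for per-lane access, and `dot` returns its scalar result broadcast across all four lanes.

```cpp
#include <cassert>

class float4 {
public:
    // Values enter and leave only through explicit whole-vector load/store.
    static float4 load(const float* p) {
        assert(p != nullptr);
        float4 r;
        for (int i = 0; i < 4; ++i) r.lanes_[i] = p[i];
        return r;
    }
    void store(float* p) const {
        assert(p != nullptr);
        for (int i = 0; i < 4; ++i) p[i] = lanes_[i];
    }

    friend float4 operator+(const float4& a, const float4& b) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.lanes_[i] = a.lanes_[i] + b.lanes_[i];
        return r;
    }
    friend float4 operator*(const float4& a, const float4& b) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.lanes_[i] = a.lanes_[i] * b.lanes_[i];
        return r;
    }
    // The dot product stays in the vector domain: the sum is repeated in
    // every lane instead of being returned as a float.
    friend float4 dot(const float4& a, const float4& b) {
        float s = 0.0f;
        for (int i = 0; i < 4; ++i) s += a.lanes_[i] * b.lanes_[i];
        float4 r;
        for (int i = 0; i < 4; ++i) r.lanes_[i] = s;
        return r;
    }

private:
    alignas(16) float lanes_[4];  // no public members, no operator[]
};
```

A caller writes `float4 d = dot(a, b);` and only pays for a trip through memory when it explicitly calls `store`, which is the "path of least resistance" property the post asks for.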
Sep 23 2011
Manu Evans:
I appreciate your efforts. My answer to the OP is that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic things to implement or debug, like tuple syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss, even now, D design ideas that will allow that future high performance).
Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.
What do you want to do when CPUs with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types? How many things do you want to add to D in the next 15 years of CPU evolution?
Bye,
bearophile
Sep 23 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
Manu Evans: I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).
I make the point because, while I agree the topics you mention are of greater immediate importance, the previous posts in this thread suggest there is already experimentation/implementation of these features happening in the language now, and if they are defined now, and defined incorrectly, it's always very difficult to go back on these decisions.
Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.
What do you want to do when CPU with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types?
Yes. I don't think it's likely to progress as you suggest though. I foresee perhaps a 4 component 64bit-word vector (256bit), and a hardware matrix. I can't see it being any less appropriate to implement a v256 in addition to v128 than a long in addition to an int. A matrix is a fundamentally different concept, and surely worthy of its own type.
How many things do you want to add to D in the next 15 years of CPU evolution?
As many things as are universally accepted by computer hardware as a normal/standard feature. Hardware vectors definitely fit this bill. We've had hardware vector support in virtually every architecture for 10-15 years now, and yet there is still no language that really supports it.
Sep 23 2011
On 24.09.2011 00:47, Manu Evans wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
>> On 9/22/11 1:39 AM, Don wrote:
>>> On 22.09.2011 05:24, a wrote:
>>>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
>>>> [snip]
>>> At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
>>> At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
>>> float[4] x, y;
>>> x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
>>> which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
>> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
>>> A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
>> [...] (b) continuing Don's great work on making CTFE a solid [...]
>> Andrei
This sounds really dangerous to me. I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea. Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting as a candidate for hardware vector operations?
No, it's completely unrelated. It has nothing in common.
> I think that's a wrong decision in itself, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks... If float[4] is considered a hardware vector by the compiler,
> - How do I define an ACTUAL float[4]?
> - How can I be confident that it actually WILL be a hardware vector?
float[4] is not considered to be a hardware vector. It is only passed as one. To pass it the C++ way, declare the parameter as float[], or pass by ref. Everything after that is the responsibility of the compiler/optimizer. A big difference compared to C++ is that, generally, it's pretty strange to pass fixed-length arrays as value parameters. At this stage we don't have any way of forcing it to be a hardware vector. We've just introduced the parameter passing and the vector operations to make it easier for the compiler to use hardware registers. Very little else is decided at this stage. You make some excellent points.
> Hardware vectors are NOT float[4]'s, they are a reference to a 128-bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128-bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register; do you intend to special-case fixed-length arrays of all those types to support the hardware functionality for those?
> Hardware vectors are NOT floats, they cannot interact with the floating point unit. Dereferencing of this style, 'float x = myVector[0]', is NOT supported by the hardware, and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vectors in the first place. Allowing easy access to individual floats within a hardware vector breaks the language's stated premise that the path of least resistance should also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function.
> float[4] is not even a particularly conveniently sized vector for most inexperienced programmers, the majority will want float[3].
> This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of results of operations like dot product and magnitude as being scalar values, but they are not, they are a scalar value repeated across all 4 components of a vector 4, and this should be explicit too.
> ....
> I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead:
> Add a hardware vector type, let's call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks.
> Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and ideally expandable by hardware vendors or platform holders.
> You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types.
> Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation. And if so, at least leave the capability of implementing intrinsic architecture-specific permutation to the hardware vendors/platform holders.
> At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc.
> ....
> The reason I am so adamant about this is that in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly end up with slower code than if you just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes. As a minimum the programmer needs to be able to explicitly address the vector registers, pass them to and from functions, and perform explicit (probably IHV supplied) functions on them. The code generator and optimiser need all the information possible, and as explicit as possible, so IHV's can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support.
> Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie, making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs.
> I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point...
> Don: I'd love to hear counter arguments to justify float[4] as a reasonable solution. Currently no matter how I slice it, I just can't see it. Criticism welcome?
> Cheers! - Manu
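The typed-wrapper idea in the quoted proposal can be sketched in plain D. This is only an illustration under stated assumptions: `V128` and `Float4` are hypothetical names, the storage is an ordinary aligned array rather than a real hardware register, and nothing here says what code a compiler would generate. It shows the shape of the API: an opaque 16-byte aligned value, whole-vector operators, no per-lane indexing, and a dot product whose result is broadcast across all four lanes.

```d
/// Hypothetical stand-in for the proposed `v128`: 16 aligned bytes, opaque to users.
align(16) struct V128
{
    private float[4] lanes;
}

static assert(V128.sizeof == 16);
static assert(V128.alignof == 16);

/// Hypothetical typed view over V128. Deliberately offers no opIndex,
/// so scalar lane access is not the path of least resistance.
struct Float4
{
    private V128 v;

    this(float x, float y, float z, float w)
    {
        v.lanes[0] = x; v.lanes[1] = y; v.lanes[2] = z; v.lanes[3] = w;
    }

    /// Broadcasts one scalar into all four lanes.
    static Float4 splat(float s) { return Float4(s, s, s, s); }

    /// Element-wise +, -, *, / on whole vectors, written with D array operations.
    Float4 opBinary(string op)(Float4 rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        Float4 r;
        mixin("r.v.lanes[] = v.lanes[] " ~ op ~ " rhs.v.lanes[];");
        return r;
    }

    /// Dot product; the result stays a vector with the value repeated in every lane.
    Float4 dot(Float4 rhs) const
    {
        float s = 0;
        foreach (i; 0 .. 4)
            s += v.lanes[i] * rhs.v.lanes[i];
        return splat(s);
    }

    /// Explicit, visibly named escape hatch back to memory.
    float[4] toArray() const { return v.lanes; }
}

void main()
{
    auto a = Float4(1, 2, 3, 4);
    auto b = Float4.splat(2);
    assert((a + b).toArray == [3f, 4f, 5f, 6f]);
    assert((a * b).toArray == [2f, 4f, 6f, 8f]);
    assert(a.dot(b).toArray == [20f, 20f, 20f, 20f]);
}
```

Leaving out `opIndex` is the design choice that matters here: reading a single lane requires the explicit `toArray` call, which makes the round trip through memory visible in the source.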
Sep 23 2011
On 22/09/11 7:39 AM, Don wrote:
> On 22.09.2011 05:24, a wrote:
>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
>> [snip]
> At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
> At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
> float[4] x, y;
> x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
> which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
Sep 22 2011
On 22.09.2011 20:19, Marco Leise wrote:
> On 22.09.2011 at 19:26, Peter Alexander <peter.alexander.au gmail.com> wrote:
>> On 22/09/11 7:39 AM, Don wrote:
>>> On 22.09.2011 05:24, a wrote:
>>>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.
Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.
Sep 23 2011
Don:
> Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.
Is it possible to solve some of those problems adding something like this to D/DMD: http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions And then, what changes/work is needed to allow inlining of some functions that contain asm? I mean something like this allow_inline? http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline (I have asked similar questions four times in the last two years, with no answers or comments.) Bye, bearophile
Sep 24 2011
On 22.09.2011 02:38, Walter Bright wrote:
>> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
I recently tried that, and I couldn't do it because D has no way of aligning structs on the stack. Manually allocating the necessary aligned memory is also not always possible, because it cannot be done for compiler temporaries:
    vec4 v1 = func1();
    vec4 v2 = func2();
    vec4 result = (v1 + v2) * 0.5f;
Even if I manually allocate v1, v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you cannot hide them away. Does DMC++ have __declspec(align(16)) support?
--
Kind Regards
Benjamin Thaut
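For named variables, the manual workaround mentioned above can be sketched as follows: over-allocate a byte buffer and round the pointer up to the next 16-byte boundary. The helper name `alignUp` and the `Vec4` struct are assumptions made for this example. As the post points out, this does nothing for compiler-generated temporaries such as the intermediate value of `(v1 + v2)`, which is exactly the case that cannot be handled by hand.

```d
/// Rounds `p` up to the next multiple of `alignment` (must be a power of two).
/// Hypothetical helper, not a library function.
void* alignUp(void* p, size_t alignment) nothrow @nogc
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    immutable addr = cast(size_t) p;
    return cast(void*) ((addr + alignment - 1) & ~(alignment - 1));
}

struct Vec4
{
    float[4] data;
}

void main()
{
    // 15 spare bytes guarantee that a 16-byte aligned Vec4 fits somewhere inside.
    ubyte[Vec4.sizeof + 15] storage;
    auto v = cast(Vec4*) alignUp(storage.ptr, 16);
    v.data = [1f, 2f, 3f, 4f];

    assert((cast(size_t) v & 15) == 0);                                   // aligned
    assert(cast(ubyte*) v + Vec4.sizeof <= storage.ptr + storage.length); // in bounds
    assert(v.data == [1f, 2f, 3f, 4f]);
}
```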
Sep 21 2011
On 9/21/2011 10:56 PM, Benjamin Thaut wrote:
> Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?
No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
> On 9/21/2011 10:56 PM, Benjamin Thaut wrote:
>> Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?
> No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Unfortunately there is no 64-bit DMD on Windows.
Sep 22 2011
On 22.09.2011 at 08:39, Don <nospam nospam.com> wrote:
> On 22.09.2011 05:24, a wrote:
>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
>> [snip]
> At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
> At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
> float[4] x, y;
> x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
> which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
> A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
That's a nice fresh approach to intrinsics. I bet if other languages had the CTFE capabilities, they'd probably do the same. Sure, it is ideal if the compiler works magic here, but it takes longer to implement the right code generation in the compiler, than to write an isolated piece of library code and extensions can be added by anyone, especially since there will already be some examples to look at. Thumbs up!
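To make the "tiny DSL" idea concrete, here is a minimal sketch of a selector string that is parsed and validated at compile time. It is not the CTFE implementation referred to in the thread, and it makes no claim about compiling to a single shufps; the names `isValidSelector`, `selectorIndex` and this particular `swizzle` signature are assumptions for illustration. It shows only the part that needs no compiler support: a malformed selector is rejected at compile time, not at run time.

```d
/// True if `s` is exactly four characters, each one of a, b, c, d.
/// Ordinary function; the template constraint below runs it at compile time.
bool isValidSelector(string s) pure nothrow @nogc @safe
{
    if (s.length != 4)
        return false;
    foreach (c; s)
        if (c < 'a' || c > 'd')
            return false;
    return true;
}

/// Maps a selector character to a lane index: a->0, b->1, c->2, d->3.
size_t selectorIndex(char c) pure nothrow @nogc @safe
{
    return c - 'a';
}

/// Hypothetical swizzle: returns a copy of `v` reordered according to `sel`.
float[4] swizzle(string sel)(const float[4] v) pure nothrow @nogc @safe
    if (isValidSelector(sel))
{
    float[4] r;
    static foreach (i; 0 .. 4)          // unrolled at compile time
        r[i] = v[selectorIndex(sel[i])];
    return r;
}

void main()
{
    float[4] y = [1, 2, 3, 4];
    float[4] x = y.swizzle!"cdcd"();    // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
    assert(x == [3f, 4f, 3f, 4f]);

    // An invalid selector does not compile.
    static assert(!__traits(compiles, y.swizzle!"cdxe"()));
    static assert(!__traits(compiles, y.swizzle!"abc"()));
}
```

A real implementation along the lines described would presumably use the same compile-time information to emit architecture-specific code, which is where the objections about register allocation raised elsewhere in the thread come in.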
Sep 22 2011
On 22/09/11 1:38 AM, Walter Bright wrote:
> D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.
It's used for vector stuff, but I wouldn't say mostly. Just about any performance intensive piece of code involving pointers can benefit from __restrict. I use it in a VM for example.
>> As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?
> The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
I don't see how this would be possible without intrinsics, or at least some form of language extension. Would DMD just *always* put float[4] in XMM registers (assuming they are available)? That doesn't seem like a good idea if you don't want to use it as a vector. BTW, if you want to get a good idea of how game programmers use vector intrinsics on current hardware, there is a good blog post about it here: http://altdevblogaday.com/2011/01/31/vectiquette/
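For reference, the vector operation syntax quoted above looks like this in use. The sketch shows only the language-level semantics (element-wise evaluation over whole slices, with operands that must not overlap); whether a given compiler turns it into SIMD instructions or a plain loop is the open question in this thread.

```d
void main()
{
    float[] a = [1, 2, 3, 4];
    float[] b = [10, 20, 30, 40];
    float c = 0.5f;

    // Element-wise: a[i] += b[i] * c for every i. a and b must not overlap.
    a[] += b[] * c;
    assert(a == [6f, 12f, 18f, 24f]);

    // The same syntax applies to fixed-size arrays such as float[4].
    float[4] x = [1, 2, 3, 4], y = [4, 3, 2, 1], z;
    z[] = x[] + y[];
    assert(z == [5f, 5f, 5f, 5f]);
}
```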
Sep 22 2011
On 22.09.2011 at 19:26, Peter Alexander <peter.alexander.au gmail.com> wrote:
> On 22/09/11 7:39 AM, Don wrote:
>> On 22.09.2011 05:24, a wrote:
>>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.
Sep 22 2011
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
You mean some helper functions to be used in user structures? Because I don't know of any structure in std.numeric that could use it. We first need to improve opDispatch. Currently I think no one knows how it works or how it was intended to work. It refuses to except a few things which I think it should. For example:
    struct A { auto opDispatch(string s)(); auto opDispatch(string s)() const; }
    A a, b;
    a.fun = b.run; // This should be perfectly fine.
Sep 22 2011
On Fri, 23 Sep 2011 02:00:50 +0300, so <so so.so> wrote:
> It refuses to except a few things which i think it should.
accept...
Sep 22 2011
On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/22/11 6:00 PM, so wrote:
>> On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>>> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and return T[n]. Andrei
Something like this? ------------8ABWX90qerM6Eisyp8Guo2 Content-Disposition: attachment; filename=test.d Content-Type: application/octet-stream; name=test.d Content-Transfer-Encoding: Base64 bW9kdWxlIHN3aXp6bGU7CmltcG9ydCBzdGQuc3RkaW8sIHN0ZC5jb252OwppbXBv cnQgc3RkLnRyYWl0cywgc3RkLmFsZ29yaXRobTsKCnByaXZhdGUgZW51bSBwcm9w ZXJ0aWVzID0gWyJ4eXp3IiwgInJnYmEiXTsKCnByaXZhdGUgdGVtcGxhdGUgaW5k ZXhPZlByb3BlcnR5KGNoYXIgYywgc2l6ZV90IGk9MCkKewogICAgc3RhdGljIGFz c2VydChwcm9wZXJ0aWVzLmxlbmd0aCA+IDAsICJwcm9wZXJ0eSBsaXN0IGVtcHR5 Iik7CgogICAgc3RhdGljIGlmKGNvdW50VW50aWwocHJvcGVydGllc1tpXSwgYykg IT0gLTEpCiAgICAgICAgZW51bSBzaXplX3QgaW5kZXhPZlByb3BlcnR5ID0gY291 bnRVbnRpbChwcm9wZXJ0aWVzW2ldLCBjKTsKICAgIGVsc2Ugc3RhdGljIGlmKGkg PCBwcm9wZXJ0aWVzLmxlbmd0aCAtIDEpCiAgICAgICAgZW51bSBzaXplX3QgaW5k ZXhPZlByb3BlcnR5ID0gaW5kZXhPZlByb3BlcnR5IShjLCBpICsgMSk7CiAgICBl bHNlIHN0YXRpYyBhc3NlcnQoMCwgInVuYWJsZSB0byBsb2NhdGUgaW5kZXg6ICIg fiBjKTsKfQoKcHJpdmF0ZSB0ZW1wbGF0ZSBpc1Byb3BlcnRpZXNTb3J0ZWQoc3Ry aW5nIHMsIHNpemVfdCBpPTApCnsKICAgIHN0YXRpYyBpZihjb3VudFVudGlsKHBy b3BlcnRpZXNbaV0sIHMpICE9IC0xKQogICAgICAgIGVudW0gaXNQcm9wZXJ0aWVz U29ydGVkID0gdHJ1ZTsKICAgIGVsc2Ugc3RhdGljIGlmKGkgPCBwcm9wZXJ0aWVz Lmxlbmd0aCAtIDEpCiAgICAgICAgZW51bSBpc1Byb3BlcnRpZXNTb3J0ZWQgPSBp c1Byb3BlcnRpZXNTb3J0ZWQhKHMsIGkgKyAxKTsKICAgIGVsc2Ugc3RhdGljIGFz c2VydCgwLCBzIH4gIiBub3Qgc29ydGVkIG9yIG5vdCBpbiB0aGUgbGlzdCBvZiBw cm9wZXJ0aWVzIik7Cn0KCnZvaWQgc3dpenpsZUNvcHkoc3RyaW5nIHMsIHNpemVf dCBpLCBzaXplX3QgQU4sIHNpemVfdCBCTiwgQSwgQikocmVmIEEgYSwgcmVmIGNv bnN0IEIgYikKewoJZW51bSBhaSA9IGktMTsKCWVudW0gYmkgPSBpbmRleE9mUHJv cGVydHkhKHNbYWldKTsKCglzdGF0aWMgYXNzZXJ0KGFpIDwgQU4sIHNbYWldIH4g IjogaW5kZXggb3V0IG9mIGJvdW5kcyBBICIgfnRvIXN0cmluZyhhaSl+ICIgPj0g IiB+dG8hc3RyaW5nKEFOKSk7CglzdGF0aWMgYXNzZXJ0KGJpIDwgQk4sIHNbYWld IH4gIjogaW5kZXggb3V0IG9mIGJvdW5kcyBCICIgfnRvIXN0cmluZyhiaSl+ICIg Pj0gIiB+dG8hc3RyaW5nKEJOKSk7CgoJYVthaV0gPSBiW2JpXTsKCQoJc3RhdGlj IGlmKGkgPiAxKQoJCXN3aXp6bGVDb3B5IShzLCBhaSwgQU4sIEJOKShhLCBiKTsK 
fQoKcHJpdmF0ZSBzdHJ1Y3QgVihULCBzaXplX3QgTikgewoJCgl0aGlzKFQgYSkg ewoJCWZvcmVhY2gocmVmIGU7IHJhdykKCQkJZSA9IGEgKz0gMTsKCX0KCQovLyAJ YXV0byBvcERpc3BhdGNoKHN0cmluZyBzKSgpIGNvbnN0CglhdXRvIHN3aXp6bGVS KHN0cmluZyBzKSgpIGNvbnN0Cgl7CgkJViEoVCwgcy5sZW5ndGgpIHI7CgkJc3dp enpsZUNvcHkhKHMsIHMubGVuZ3RoLCBzLmxlbmd0aCwgTikoci5yYXcsIHJhdyk7 CgkJcmV0dXJuIHI7Cgl9CgovLyAJcmVmIGF1dG8gb3BEaXNwYXRjaChzdHJpbmcg cykoKQoJcmVmIGF1dG8gc3dpenpsZUwoc3RyaW5nIHMpKCkKCQlpZihpc1Byb3Bl cnRpZXNTb3J0ZWQhcykKCXsKCQllbnVtIGluZGV4ID0gaW5kZXhPZlByb3BlcnR5 IShzWzBdKTsKCQlzdGF0aWMgYXNzZXJ0KGluZGV4ICsgcy5sZW5ndGggPD0gTiwg ImluZGV4IG91dCBvZiBib3VuZHMiKTsKCgkJcmV0dXJuICooY2FzdChWIShULCBz Lmxlbmd0aCkqKQoJCQkmcmF3LnB0cltpbmRleF0pOwoJfQoKCVRbTl0gcmF3Owp9 Cgpwcml2YXRlIHZvaWQgdGVzdCgpCnsKCWFsaWFzIFYhKGZsb2F0LCAzKSBWMzsK CWFsaWFzIFYhKGZsb2F0LCA0KSBWNDsKCWFsaWFzIFYhKGZsb2F0LCA1KSBWNTsK CglWMyBhID0gVjMoMyk7CglWNCBiID0gVjQoMTMpOwoJVjUgYyA9IFY1KDQyKTsK Cgl3cml0ZWxuKGEucmF3KTsKCXdyaXRlbG4oYi5yYXcpOwoJd3JpdGVsbihjLnJh dyk7Cgl3cml0ZWxuKCItLS0tLSIpOwoKCWIuc3dpenpsZUwhInh5eiIgPSBhLnN3 aXp6bGVSISJnYnIiOwoJYy5zd2l6emxlTCEienciICA9IGEuc3dpenpsZVIhInJi IjsKCi8vIAliLnh5eiA9IGEuZ2JyOwovLyAJYy56dyAgPSBhLnJiOwoKCXdyaXRl bG4oYS5yYXcpOwoJd3JpdGVsbihiLnJhdyk7Cgl3cml0ZWxuKGMucmF3KTsKfQoK dm9pZCBtYWluKCkKewoJdGVzdCgpOwp9Cg== ------------8ABWX90qerM6Eisyp8Guo2--
Sep 22 2011
On Fri, 23 Sep 2011 06:44:44 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 9/22/11 9:11 PM, so wrote:
>> On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>>> On 9/22/11 6:00 PM, so wrote:
>>>> On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>>>>> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and return T[n]. Andrei
Something like this?
Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. Andrei
It was there to show how it should be used in user code, and for testing. Swizzle is not just an rvalue operation; there is also an lvalue part to it which plays a bit differently (hence swizzleR and swizzleL). We could take care of it with an overload, but D doesn't act quite like what I expected (like C++); I don't understand why it won't differentiate "fun()" from "fun() const".
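The rvalue half of this can be sketched with opDispatch alone. The following is an independent, simplified illustration, not the attached test.d: the struct name `Vec`, the helpers `laneOf` and `isSelector`, and the restriction to the letters x, y, z, w are all assumptions made for the example. It covers only the read side (`a.zyx` yields a new value); the lvalue side discussed above (`b.xyz = ...`) is intentionally left out.

```d
struct Vec(size_t N)
{
    float[N] raw;

    /// Maps a component letter to a lane index; size_t.max marks "unknown".
    private static size_t laneOf(char c) pure nothrow @nogc @safe
    {
        switch (c)
        {
            case 'x': return 0;
            case 'y': return 1;
            case 'z': return 2;
            case 'w': return 3;
            default:  return size_t.max;
        }
    }

    /// A selector is valid if every letter names a lane that exists in this Vec.
    private static bool isSelector(string s) pure nothrow @nogc @safe
    {
        foreach (c; s)
            if (laneOf(c) >= N)
                return false;
        return s.length > 0;
    }

    /// Rvalue swizzle: `v.zyx` is rewritten to `v.opDispatch!"zyx"()` and
    /// returns a new Vec whose length equals the selector's length.
    auto opDispatch(string sel)() const
        if (isSelector(sel))
    {
        Vec!(sel.length) r;
        static foreach (i; 0 .. sel.length)
            r.raw[i] = raw[laneOf(sel[i])];
        return r;
    }
}

void main()
{
    auto v = Vec!3([1f, 2f, 3f]);

    auto r = v.zyx;
    assert(r.raw == [3f, 2f, 1f]);

    auto d = v.xx;                                // shorter selector, smaller result
    static assert(is(typeof(d) == Vec!2));
    assert(d.raw == [1f, 1f]);

    static assert(!__traits(compiles, v.xw));     // 'w' does not exist in a Vec!3
}
```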
Sep 23 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
> D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.
Use of __restrict is certainly not limited to your example. It's applicable basically anywhere that a pointer is dereferenced on either side of a write through any other pointer, or of a function call (since it could potentially do anything): the resident value from the previous dereference is invalidated and must be reloaded needlessly unless the pointer is explicitly marked restrict.
http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html
For RISC architectures in particular, __restrict is mandatory when optimising certain hot functions without making a mess of your code (declaring stack locals all over the place), and I think I've run into cases where even that's not enough.
> D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.
I hadn't thought of using compile-time functions for that, that's really nice. I'm not sure if that'll be enough to generate good code in all cases, but I'll do some experiments and see where it goes. The main problem with writing (intelligently generated) inline asm vs using intrinsics is that, in the context of the C (or D) source code, you don't have enough context to know about the state of the register assignment, or to produce the appropriate loads/stores. Also, the opcodes selected to perform the operation may change with context. (Again, specific examples are hard to fabricate, but I've had them consistently pop up over the years.) Also, I think someone else said that you couldn't inline functions with inline asm? Is that correct? If so, I assume that's intended to be fixed?
>> As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?
> The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
Are you referring to the comment about special casing a float[4]? I can see why one might reach for that as a solution, but it sounds like a really bad idea to me...
>> Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.
> [...] or C++).
Do you mean that like a memcpy to the stack, or somehow intuitively using the hardware vector registers to pass arguments to the function properly?
>> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined? How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.
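The aliasing problem described in this post can be shown with a small D sketch. The function names are made up for the example. Both functions look equivalent, but when the two pointers alias they produce different results, which is why a compiler that cannot prove the absence of aliasing has to keep reloading `*scale` in the first version, and why the "stack locals" workaround in the second version changes the meaning of the code and is not a pure optimisation.

```d
/// Reads *scale on every iteration. Because dst[i] might be the very
/// location that scale points to, the value must be assumed to change
/// after each store.
void scaleReload(float* dst, const(float)* scale, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] *= *scale;
}

/// The manual workaround: copy *scale into a local once, up front.
/// This is what a no-alias promise would let the compiler do on its own.
void scaleLocal(float* dst, const(float)* scale, size_t n)
{
    immutable s = *scale;
    foreach (i; 0 .. n)
        dst[i] *= s;
}

void main()
{
    float[3] a = [2, 3, 4];
    float[3] b = [2, 3, 4];

    // Deliberate aliasing: the scale factor lives in element 0 of the destination.
    scaleReload(a.ptr, a.ptr, 3);
    scaleLocal(b.ptr, b.ptr, 3);

    assert(a == [4f, 12f, 16f]);   // scale became 4 after the first store
    assert(b == [4f, 6f, 8f]);     // scale was captured as 2

    // Without aliasing the two versions agree.
    float[3] c = [2, 3, 4], d = [2, 3, 4];
    float k = 2;
    scaleReload(c.ptr, &k, 3);
    scaleLocal(d.ptr, &k, 3);
    assert(c == d);
}
```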
Sep 23 2011
== Quote from Manu Evans (turkeyman gmail.com)'s article
>>> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
>> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
> But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined?
> How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so...
> How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.
GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute though (it is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)
Stock example:
    pragma(attribute, vector_size()) typedef float __v4sf_t;
    union __v4sf { float[4] f; __v4sf_t v; }
    __v4sf a = {[1,2,3,4]}, b = {[1,2,3,4]}, c;
    c.v = a.v + b.v;
    assert(c.f == [2,4,6,8]);
The assignment compiles down to ~5 instructions:
    movaps -0x88(%ebp),%xmm1
    movaps -0x78(%ebp),%xmm0
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x68(%ebp)
    flds   -0x68(%ebp)
And is far quicker than c[] = a[] + b[] due to it being inlined, and not an external library call.
Regards
Iain
Sep 24 2011
On 2011-09-24 16:50:39 +0300, Manu said:
> Is there an IRC channel, or anywhere for realtime D discussion?
There is a #d channel for general D discussions and #d.gdc for GDC related themes on irc.freenode.org
Sep 24 2011
On Fri, 23 Sep 2011 16:09:31 +0300, so <so so.so> wrote:
> It was there to show how it should be used in user code, and testing. Swizzle is not just a rvalue operation, there is also a lvalue part to it which plays a bit differently (hence, swizzleR and swizzleL). We could take care of it with an overload but D doesn't act quite like what i expected (like C++), i don't understand why it won't differentiate "fun()" from "fun() const".
Sorry about the nonsense. It is now with opDispatch (attached).
To make a generic "swizzle" function we need to introduce a few traits, but if all you want is support for T[N], that is easy.
[Attachment: test.d]
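As a rough illustration of the opDispatch approach mentioned above, here is a minimal sketch of an rvalue swizzle. It is not the attached test.d: the Vec type, the indexOf and isSwizzle helpers, and the "xyzw"/"rgba" letter mapping are assumptions made for this example, and the lvalue side (assigning through a swizzle) is left out.

```d
// Hypothetical minimal vector wrapper for illustration only.
struct Vec(size_t N)
{
    float[N] raw;

    // Map a component letter from "xyzw" or "rgba" to its index;
    // unknown letters map to size_t.max.
    private static size_t indexOf(char c)
    {
        switch (c)
        {
            case 'x', 'r': return 0;
            case 'y', 'g': return 1;
            case 'z', 'b': return 2;
            case 'w', 'a': return 3;
            default: return size_t.max;
        }
    }

    // True when every letter names a component that exists in this vector.
    private static bool isSwizzle(string s)
    {
        if (s.length == 0) return false;
        foreach (c; s)
            if (indexOf(c) >= N) return false;
        return true;
    }

    // Rvalue swizzle: v.gbr returns a new Vec!3 built from components g, b, r.
    // The constraint keeps opDispatch from swallowing unrelated member names.
    auto opDispatch(string s)() const
        if (isSwizzle(s))
    {
        Vec!(s.length) r;
        foreach (i, c; s)
            r.raw[i] = raw[indexOf(c)];
        return r;
    }
}

void main()
{
    Vec!3 a;
    a.raw = [4.0f, 5.0f, 6.0f];

    auto g = a.gbr;                       // components 1, 2, 0
    assert(g.raw == [5.0f, 6.0f, 4.0f]);

    auto p = a.rb;                        // components 0, 2
    assert(p.raw == [4.0f, 6.0f]);

    // 'w' does not exist in a 3-component vector, so this is rejected.
    static assert(!__traits(compiles, a.xyzw));
}
```

Because the swizzle string is a template argument, each distinct swizzle gets its own instantiation and the component indices are known at compile time.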
Sep 24 2011
On 24 September 2011 15:37, Iain Buclaw <ibuclaw ubuntu.com> wrote:
== Quote from Manu Evans (turkeyman gmail.com)'s article
> > > How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
> > Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
> But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too.
> I think I read in another post too that functions containing inline asm will not be inlined?
> How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so...
> How does GDC compare to DMD? Does it do a good job?
> I really need to take the weekend and do a lot of experiments I think.

GDC is just the same as DMD (same runtime library implementation for vector array operations).

You can define vector types in the language through use of GCC's attribute though (is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)

Stock example:

pragma(attribute, vector_size())
 typedef float __v4sf_t

union __v4sf {
 float[4] f;
 __v4sf_t v;
}

__v4sf a = {[1,2,3,4]}
      b = {[1,2,3,4]}
      c;

c.v = a.v + b.v;
assert(c.f == [2,4,6,8]);

The assignment compiles down to ~5 instructions:

movaps -0x88(%ebp),%xmm1
movaps -0x78(%ebp),%xmm0
addps  %xmm1,%xmm0
movaps %xmm0,-0x68(%ebp)
flds   -0x68(%ebp)

And is far quicker than c[] = a[] + b[] due to it being inlined, and not an external library call.

Regards
Iain
Nice!
Is there an IRC channel, or anywhere for realtime D discussion?
I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list...
Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...
Sep 24 2011
On Sat, 24 Sep 2011 16:50:39 +0300, Manu <turkeyman gmail.com> wrote:
> Nice! Is there an IRC channel, or anywhere for realtime D discussion? I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...
We all know where that leads: first you ask for IM, then for a phone number, and finally ...
Sep 24 2011
== Quote from Manu (turkeyman gmail.com)'s article
> Hello D community. I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I looked into it, the more I realise, this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs. Anyway, I work in the games industry, 10 years in cross platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.
> Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases, many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.
> C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm, the reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

The DMD compiler has some basic intrinsics; other compilers build upon this using their own backends. For example, GCC has hundreds of builtins, including some target builtins where intrinsic types are mappable to D types (__float80 -> real).

> Is the D assembler a macro assembler? (ie, assigns registers automatically and manage loads/stores intelligently?) I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists. Are there PowerPC or ARM examples anywhere?
> As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible. In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware.
> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I would imagine it should now be possible to use GCC vector builtins with the GDC compiler, provided that I manage to get round to turning these routines on. :~)

> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so sadly I can't really help here.
> What about Android (or iPhone, but apple's 'x-code policy' prevents that)? I'd REALLY love to write an android project in D... the toolchain is GCC, I see no reason why it shouldn't be possible to write an android app if an appropriate toolchain was available?
> Sorry it's a bit long, thanks for reading this far! I'm looking forward to a brighter future writing lots of D code :P But I need to know basically all these questions are addressed before I could consider it for serious commercial game dev.

Someone has recently confirmed D working just fine on the Alpha platform. For D2, your biggest showstopper is the runtime library. There are many gaps to fill to port druntime to your preferred architecture.
Regards
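To make the "basic intrinsics" remark concrete, here is a small sketch using a few functions from druntime's core.bitop (bsf, bsr, bswap). The assumption, based on the core.bitop documentation rather than anything stated in this thread, is that the compiler treats these as intrinsics and can emit the matching machine instruction instead of a library call; the wrapper names below are made up for illustration.

```d
import core.bitop : bsf, bsr, bswap;

// Index of the highest set bit (v must be non-zero).
int highestSetBit(uint v)
{
    assert(v != 0);
    return bsr(v);
}

// Index of the lowest set bit (v must be non-zero).
int lowestSetBit(uint v)
{
    assert(v != 0);
    return bsf(v);
}

// Reverse the byte order of a 32-bit value, e.g. for endian conversion.
uint swapEndian32(uint v)
{
    return bswap(v);
}

void main()
{
    assert(highestSetBit(0x80u) == 7);
    assert(lowestSetBit(0x28u) == 3);          // 0x28 is binary 101000
    assert(swapEndian32(0x11223344u) == 0x44332211u);
}
```

Unlike an inline asm block, calls like these stay visible to the optimiser, which is the advantage of intrinsics raised in the original question.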
Sep 21 2011
Manu Wrote:
> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so sadly I can't really help here.
http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html
Maybe this man can be of some help to you.
Sep 22 2011









Trass3r <un known.com> 