digitalmars.D - Multi-architecture binaries

Jascha Wetzel <"[firstname]" mainia.de> writes:
A thought that came up in the VM discussion...

Suppose someday we have language support for vector operations. We want
to ship binaries that support but do not require extensions like SSE. We
do not want to ship multiple binaries and wrappers that switch between
them or installers that decide which one to use, because it's more work
and we'd be shipping a lot of redundant code.

Ideally we wouldn't have to write additional code either. The compiler
could emit code for multiple targets on a per-function basis (e.g. with
the target architecture mangled into the function name). The runtime
would check at startup which version will be used and "link" the
appropriate function.
Here is a small proof-of-concept implementation of this detection and
linking mechanism.

Comments?
May 01 2007
Chad J <gamerChad _spamIsBad_gmail.com> writes:
I've thought about this myself, and really like the idea.  In the VM 
discussion Don mentioned benchmarking different codepaths to find which 
one works best on the current CPU, then linking the best one in.  This 
makes a lot of sense to me, since CPUs seem to have different 
performance characteristics, even regardless of instruction set 
differences.
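
As a rough illustration of the benchmark-then-link idea, here is a minimal sketch in current D2 (not the D1 of this thread). The two variants, the names pickFastest and sum, and the sample size are all hypothetical; a real runtime would want a more careful timing method than a single stopwatch pass.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

alias SumFn = float function(const(float)[]);

// Two hypothetical code paths for the same routine.
float sumSimple(const(float)[] a)
{
    float s = 0;
    foreach (x; a) s += x;
    return s;
}

float sumUnrolled(const(float)[] a)
{
    float s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < a.length; i += 2)
    {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < a.length) s0 += a[i];   // odd element left over
    return s0 + s1;
}

// Times every candidate on the sample data and returns the fastest one.
SumFn pickFastest(SumFn[] candidates, const(float)[] sample, int reps = 100)
{
    SumFn best = candidates[0];
    Duration bestTime = Duration.max;
    foreach (fn; candidates)
    {
        auto sw = StopWatch(AutoStart.yes);
        float sink = 0;
        foreach (_; 0 .. reps) sink += fn(sample);
        sw.stop();
        if (sw.peek() < bestTime)
        {
            bestTime = sw.peek();
            best = fn;
        }
    }
    return best;
}

__gshared SumFn sum;   // callers always go through this pointer

shared static this()
{
    // Runs once at startup; this is the cost that should be switchable.
    auto sample = new float[](4096);
    sample[] = 1.0f;
    sum = pickFastest([&sumSimple, &sumUnrolled], sample);
}

void main()
{
    writeln(sum([1.0f, 2.0f, 3.0f]));   // prints 6 whichever variant won
}
```

The startup cost is one timing loop per candidate, which is the trade-off discussed below for small command line apps.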

I was once benchmarking an algorithm on my notebook computer with a more 
modern processor, and my desktop computer with an older processor.  The 
algo ran faster on the notebook of course, but branching had an 
especially reduced cost.  That is, branching on the more modern 
processor was less expensive relative to other instructions than it was 
on the previous processor.  This was with the same D binary on both of 
them.

That is the sort of stuff that I think the JITCs want to leverage, but 
I have to wonder if using a strategy like this and covering enough 
permutations of costly algorithms would give exactly the same benefit, 
with a massively reduced startup time for applications.  Of course, it 
would also be nice to be able to turn it off, because it will cost SOME 
startup time as well as executable size, which are not worthwhile costs 
for some apps like simple command line apps that need to be snappy and 
small.  It would rock for games though ;)

I really can't wait to see D's performance some day when/if it gets cool 
tricks like this, low-d vector primitives, array operations, etc.

Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.
 
 Ideally we wouldn't have to write additional code either. The compiler
 could emit code for multiple targets on a per-function basis (e.g. with
 the target architecture mangled into the function name). The runtime
 would check at startup which version will be used and "link" the
 appropriate function.
 Here is a small proof-of-concept implementation of this detection and
 linking mechanism.
 
 Comments?
 
 
 ------------------------------------------------------------------------
 
 import std.cpuid;
 import std.stdio;
 
 //-----------------------------------------------------------------------------
 //  This code goes into the runtime library
 
 const uint  CPU_NO_EXTENSION    = 0,
             CPU_MMX             = 1,
             CPU_SSE             = 2,
             CPU_SSE2            = 4,
             CPU_SSE3            = 8;
 
 /******************************************************************************
     A function pointer with a bitmask for its required extensions
 ******************************************************************************/
 struct MultiTargetVariant
 {
     static MultiTargetVariant opCall(uint ext, void* func)
     {
         MultiTargetVariant mtv;
         mtv.ext = ext;
         mtv.func = func;
         return mtv;
     }
 
     uint    ext;
     void*   func;
 }
 
 /******************************************************************************
     Chooses the first matching MTV
     and saves its FP to the dummy entry in the VTBL
 ******************************************************************************/
 void LinkMultiTarget(ClassInfo ci, void* dummy_ptr, MultiTargetVariant[] multi_target_variants)
 {
     uint extensions;
     if ( mmx )  extensions |= CPU_MMX;
     if ( sse )  extensions |= CPU_SSE;
     if ( sse2 ) extensions |= CPU_SSE2;
     if ( sse3 ) extensions |= CPU_SSE3;
 
     foreach ( i, inout vp; ci.vtbl )
     {
         if ( vp is dummy_ptr )
         {
             foreach ( variant; multi_target_variants )
             {
                 if ( (variant.ext & extensions) == variant.ext )
                 {
                     vp = variant.func;
                     break;
                 }
             }
             assert(vp !is dummy_ptr);
             break;
         }
     }
 }
 
 
 //-----------------------------------------------------------------------------
 //  This is application code
 
 /******************************************************************************
     A class with a multi-target function
 ******************************************************************************/
 class MyMultiTargetClass
 {
     // The following 3 functions could be generated automatically by the compiler
     // with different targets enabled. For example, when we have language support for
     // vector operations, the compiler could generate multiple versions for different
     // SIMD extensions. Then there would be only one extension independent implementation.
 
     char[] multi_target_sse2()
     {
         return "using SSE2";
     }
 
     char[] multi_target_sse_mmx()
     {
         return "using SSE and MMX";
     }
 
     char[] multi_target_noext()
     {
         return "using no extension";
     }
 
     // The following code could be generated by the compiler if there are multi-target
     // functions
 
     char[] multi_target() { return null; }
     static this()
     {
         MultiTargetVariant[] variants = [
             MultiTargetVariant(CPU_SSE2, &multi_target_sse2),
             MultiTargetVariant(CPU_SSE|CPU_MMX, &multi_target_sse_mmx),
             MultiTargetVariant(CPU_NO_EXTENSION, &multi_target_noext)
         ];
         LinkMultiTarget(this.classinfo, &multi_target, variants);
     }
 }
 
 /******************************************************************************
     Finally, the usage is completely opaque and there is no runtime overhead
     besides the detection at startup.
 ******************************************************************************/
 void main()
 {
     MyMultiTargetClass t = new MyMultiTargetClass;
     writefln("%s", t.multi_target);
 }
May 01 2007
Anders F Björklund <afb algonet.se> writes:
Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.
On a totally unrelated note we are using GDC to build Universal Binaries
for Mac OS X, that is: objects with both i386 (=i686) and ppc (=powerpc)
code. They are however twice as big as when building for just one arch.

$ file hello
hello: Mach-O universal binary with 2 architectures
hello (for architecture ppc):   Mach-O executable ppc
hello (for architecture i386):  Mach-O executable i386

The GCC driver automatically runs two compilation steps and lipos them,
so it's pretty straight-forward to use (unrelated to vector ops, though)

gdc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch ppc -arch i386 ...

It only does one variant for each architecture, so no help for a "JIT".

--anders
May 01 2007
Lutger <lutger.blijdestijn gmail.com> writes:
I've seen some games with multiple executables compiled for different 
architectures. Since the executable size is dwarfed by resources, this 
is no problem for this kind of application.

How much of a negative impact would your suggested approach have on 
compiler optimizations? (inlining and that sort of thing)

Another thing, what are the benefits of the compiler doing this over 
libraries?

On a related note it may be worth mentioning liboil which implements 
exactly this in a library: http://liboil.freedesktop.org/wiki/
May 01 2007
Jascha Wetzel <"[firstname]" mainia.de> writes:
Lutger wrote:
 I've seen some games with multiple executables compiled for different
 architectures. Since the executable size is dwarfed by resources, this
 is no problem for these kind of applications.
yeah, the size issue isn't that important. it's actually more about not doing anything but adding a compiler switch to get multiple versions.
 How much of a negative impact would your suggested approach have on
 compiler optimizations? (inlining and that sort of thing)
the smallest unit for this approach would be a non-inlined function. any function that gets inlined within the multi-arch function would be compiled with the appropriate target as well. all intraprocedural optimizations work as usual. only optimizations that change the calling convention are affected. those have to be equal for all versions of the function because the caller never knows which version it calls. in the example where only virtual functions are supported this is not an issue, since virtual functions have that requirement anyway. for static functions this has to be ensured explicitly.
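
A minimal sketch of what the static-function case might look like in current D2, assuming the variants are reached through a plain function pointer. The names ScaleFn, scaleGeneric, scaleSse2 and scale are made up for illustration, and scaleSse2 is only a stand-in for a body the compiler would build with SSE2 enabled. The point is that every variant has exactly the same signature, so the caller never needs to know which one it gets.

```d
import std.stdio : writeln;
static import core.cpuid;

// Every variant shares one signature and therefore one calling convention.
alias ScaleFn = void function(float[] data, float factor);

// Baseline version that runs on any CPU.
void scaleGeneric(float[] data, float factor)
{
    foreach (ref x; data) x *= factor;
}

// Stand-in for the version compiled with SSE2 as its target.
void scaleSse2(float[] data, float factor)
{
    data[] *= factor;
}

__gshared ScaleFn scale;   // the single entry point that callers use

shared static this()
{
    // One check at startup; afterwards each call is a plain indirect call.
    scale = core.cpuid.sse2 ? &scaleSse2 : &scaleGeneric;
}

void main()
{
    auto v = [1.0f, 2.0f, 3.0f];
    scale(v, 2.0f);
    writeln(v);   // prints [2, 4, 6]
}
```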
 Another thing, what are the benefits of the compiler doing this over
 libraries?
using libraries means that you have to at least compile multiple versions of each library and have code that loads the appropriate version. with compiler support it's a lot more convenient and less error prone, since you do not have to write any additional code or have more complex build scripts.
 On a related note it may be worth mentioning liboil which implements
 exactly this in a library: http://liboil.freedesktop.org/wiki/
yep, it has the same goal and looking at the source shows how much work it is. of course that's also because everything is manually optimized.
May 02 2007
janderson <askme me.com> writes:
Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.
 
 Ideally we wouldn't have to write additional code either. The compiler
 could emit code for multiple targets on a per-function basis (e.g. with
 the target architecture mangled into the function name). The runtime
 would check at startup which version will be used and "link" the
 appropriate function.
 Here is a small proof-of-concept implementation of this detection and
 linking mechanism.
 
 Comments?
 
This is fine when you have a small subset of target architectures; however, if you want to be really optimal, the code needs to be optimized for the target architecture. Michael Abrash tried this for Pixomatic, but the size of the executable grew too large (it's an exponential thing, because you want to avoid branching, so things must be inlined).

http://www.ddj.com/184405765
http://www.ddj.com/184405807
http://www.ddj.com/184405848

I'm not saying it's not a good start; however, I think the compiler would need to perform some sort of compression and optimize the function for the architecture at startup (even the order of instructions can make a huge difference to efficiency). I guess that's kind of a JITC; however, I guess it could be a load of tiny code segments that are pre-built, then rearranged and added together just before build. (Kinda like Pixomatic.)

-Joel
May 01 2007
Jascha Wetzel <"[firstname]" mainia.de> writes:
the granularity isn't as fine as it could be, of course. but the effort
to make it happen is pretty small and it's better than compiling the
whole program multiple times and switching manually.
it's not a replacement for JITC or methods like Abrash's welding.

May 02 2007
janderson <askme me.com> writes:
Jascha Wetzel wrote:
 the granularity isn't as fine as it could be, of course. but the effort
 to make it happen is pretty small and it's better than compiling the
 whole program multiple times and switching manually.
 it's not a replacement for JITC or methods like Abrash's welding.
 
I agree, it's a good start.
May 02 2007
Jascha Wetzel <"[firstname]" mainia.de> writes:
here is a much simpler version that works with templates. what it boils
down to is choosing one template instance at startup that will replace a
function pointer.

now the only compiler support required would be a pragma or similar to
select the target architecture.
this could also be used to manage multiple versions of BLADE code.
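
The attachment for this post is not in the archive. The following is only a guess at the general shape of the template approach, written in current D2: one template, one instance per target, and a static constructor that stores the chosen instance in a function pointer. The names Arch, describe and multiTarget are hypothetical, and nothing here actually changes the code generation target per instance; that is the part that would need the pragma.

```d
import std.stdio : writeln;
static import core.cpuid;

enum Arch { generic, sse, sse2 }

// One template, instantiated once per target. With compiler support each
// instance would be compiled for its own architecture.
string describe(Arch arch)()
{
    static if (arch == Arch.sse2)
        return "using SSE2";
    else static if (arch == Arch.sse)
        return "using SSE";
    else
        return "using no extension";
}

__gshared string function() multiTarget;   // replaced once at startup

shared static this()
{
    // All desired instances are named here, so all of them get emitted.
    if (core.cpuid.sse2)
        multiTarget = &describe!(Arch.sse2);
    else if (core.cpuid.sse)
        multiTarget = &describe!(Arch.sse);
    else
        multiTarget = &describe!(Arch.generic);
}

void main()
{
    writeln(multiTarget());
}
```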
May 02 2007
Don Clugston <dac nospam.com.au> writes:
Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.
 
 now the only compiler support required would be a pragma or similar to
 select the target architecture.
A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).
 this could also be used to manage multiple versions of BLADE code.
It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!)

Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. That way, you even get a direct function call, instead of an indirect one.

I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like:

asm {
  naked;
  mov eax, CPU_TYPE;
  mov eax, FUNCPOINTERS[eax];
  mov ecx, [esp-4]; // get the return address
  mov [ecx-4], eax; // patch the call address, so this thunk never gets called again.
  jmp [eax];
}

But I think a modern OS would go nuts if you try this?
(It's been a long time since I wrote self modifying code).
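
To make the name-based fixup walk concrete, here is a toy model in current D2. It does not use DDL or its API: the export table and fixup list are ordinary associative arrays invented for the example, and blur, blurGeneric and blurSse2 are hypothetical. Unlike the scheme described above it patches a function pointer, so the call stays indirect; patching real call sites would need linker-level access.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;

alias Fn = string function();

string blurGeneric() { return "generic blur"; }
string blurSse2()    { return "SSE2 blur"; }

__gshared Fn blur = &blurGeneric;   // slot that the fixup pass may patch

void main()
{
    // Hypothetical export table: symbol name -> implementation.
    Fn[string] exports = [
        "__cpu_SSE2_blur": &blurSse2,
    ];
    // Hypothetical fixup list: import name -> slot to patch.
    Fn*[string] fixups = [
        "__cpu_fixup_blur": &blur,
    ];

    enum prefix = "__cpu_fixup_";
    immutable cpu = "SSE2";   // assumed; would come from CPU detection

    foreach (name, slot; fixups)
    {
        if (!name.startsWith(prefix)) continue;
        // __cpu_fixup_XXX -> __cpu_SSE2_XXX
        auto key = "__cpu_" ~ cpu ~ "_" ~ name[prefix.length .. $];
        if (auto impl = key in exports)
            *slot = *impl;   // no matching export: keep the generic version
    }

    writeln(blur());   // prints SSE2 blur
}
```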
May 02 2007
janderson <askme me.com> writes:
Don Clugston wrote:
 Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.

 now the only compiler support required would be a pragma or similar to
 select the target architecture.
A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).
 this could also be used to manage multiple versions of BLADE code.
 It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!)
 
 Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. That way, you even get a direct function call, instead of an indirect one.
 
 I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like:
 
 asm {
   naked;
   mov eax, CPU_TYPE;
   mov eax, FUNCPOINTERS[eax];
   mov ecx, [esp-4]; // get the return address
   mov [ecx-4], eax; // patch the call address, so this thunk never gets called again.
   jmp [eax];
 }
 
 But I think a modern OS would go nuts if you try this?
 (It's been a long time since I wrote self modifying code).

That may be the case. Also, if the code is only called once, it would cause a huge cache miss that would last for many nanoseconds. If this happens a lot, the code would keep spiking all over the place (for the first few seconds of the app and then when you hit code that hasn't been used before).

A better approach would be to figure them out in large batches, perhaps at the per-module level. That way you get fewer cache misses.

Nice idea though.

-Joel
May 02 2007
Pragma <ericanderton yahoo.removeme.com> writes:
Don Clugston wrote:
 Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.

 now the only compiler support required would be a pragma or similar to
 select the target architecture.
A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).
 this could also be used to manage multiple versions of BLADE code.
It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!) Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the the fixup list. That way, you even get a direct function call, instead of an indirect one.
I was thinking about this. What would be nice is if D had reflection annotations/attributes to flag methods and functions, rather than kludging more information into the symbol name.

pragma(attr, CPUOptionFixup(CPU.SSE2, &myFunction))
void myFunction_SSE2() { /* do something with SSE2 */ }

void myFunction() { /* use vanilla code here */ }

During, or after link-time, you just walk the reflection metadata and patch things up as appropriate.

Now DDL already has a metadata capability via .ddl wrapper support - your imagination is the limit on how that is done post-build (a D front-end comes to mind). Once I get this next revision of DDL out, it should be possible to publish DDL metadata directly from a module via a static hashmap, instead of relying on a post-build process.

Either way, it's just a matter of walking that metadata as it's exposed from each DynamicModule and DynamicLibrary, and patching the symbol tables during runtime linking.
 
 I wonder if it's possible to pop ESP off the stack, and write back into 
 the code that called you, without the operating system triggering a 
 security alert -- in that case, the function you call could be a little 
 thunk, something like:
 
 asm {
   naked;
   mov eax, CPU_TYPE;
   mov eax, FUNCPOINTERS[eax];
   mov ecx, [esp-4]; // get the return address
   mov [ecx-4], eax; // patch the call address, so this thunk never gets called again.
   jmp [eax];
 }
 
 But I think a modern OS would go nuts if you try this?
 (It's been a long time since I wrote self modifying code).
If I'm not mistaken, this should be doable thanks to D adopting a *very* flat memory model (at least on win32). All the segment registers have the same base address in memory. So just as long as you read/write against ES/DS/FS/GS/SS and read/call against CS, you should be good to go.

IIRC, Windows does provide some stronger code-segment write protection (I forget what it was actually called), but it has to be enabled explicitly.

At some point in the future, Don, I'd like to pick your brain about using trampolines like this for DDL. I'd like to see cross-OS binaries become possible by thunking the exception-handling mechanisms between *nix and Win32 at link time, but I'm not sure how to pull that off just yet.

-- 
- EricAnderton at yahoo
May 02 2007
Jascha Wetzel <"[firstname]" mainia.de> writes:
 It's a nice idea, but I don't know how it could generate the class to
 put the 'this()' function into (we don't want a memory alloc every time
 we enter that function!)
i'm not sure what you mean. i thought of something like this:

void foo(uint arch)()
{
    auto p = Vec!(arch)([3.5, 1.1, 3.8]);
    auto r = Vec!(arch)([17.0f, 28.25, 1]);
    p *= dot(p,r);
}

the template parameter to Vec could choose the target used by BLADE (x87 or SSE for example). the result is a class with multiple instances of foo (since all desired instances appear in the static c'tor). the static c'tor chooses one of the instances (depending on hardware availability or benchmarks) and copies its address to the init-data in the classinfo. every time the class is instantiated, the function-pointer will automatically be initialized with the chosen pointer - no self-modifying code necessary.
instead of changing the init-data, we can also modify the VTBL in the classinfo (that's what the first version of this example did).
May 02 2007