digitalmars.D - Multi-architecture binaries

Jascha Wetzel (14/14) May 01 2007 A thought that came up in the VM discussion...

Chad J (24/157) May 01 2007 I've thought about this myself, and really like the idea. In the VM
Anders F Björklund (13/20) May 01 2007 On a totally unrelated note we are using GDC to build Universal Binaries
Lutger (9/9) May 01 2007 I've seen some games with multiple executables compiled for different

Jascha Wetzel (20/29) May 02 2007 yeah, the size issue isn't that important. it's actually more about not

janderson (16/34) May 01 2007 This is fine when you have a small sub-set of target architectures,

Jascha Wetzel (5/44) May 02 2007 the granularity isn't as fine as it could be, of course. but the effort

janderson (2/46) May 02 2007

Jascha Wetzel (6/6) May 02 2007 here is a much simpler version that works with templates. what it boils

Don Clugston (27/34) May 02 2007 A pragma would only be required as a size optimisation. Probably not

janderson (10/51) May 02 2007 That may be the case. Also if the code is only called once, it would
Pragma (22/63) May 02 2007 I was thinking about this. What would be nice is if D had reflection an...
Jascha Wetzel (18/62) May 02 2007 i'm not sure what you mean. i thought of something like this:

Jascha Wetzel <"[firstname]" mainia.de> writes:

A thought that came up in the VM discussion...

Suppose someday we have language support for vector operations. We want
to ship binaries that support but do not require extensions like SSE. We
do not want to ship multiple binaries and wrappers that switch between
them or installers that decide which one to use, because it's more work
and we'd be shipping a lot of redundant code.

Ideally we wouldn't have to write additional code either. The compiler
could emit code for multiple targets on a per-function basis (e.g. with
the target architecture mangled into the function name). The runtime
would check at startup which version to use and "link" the
appropriate function.
Here is a small proof-of-concept implementation of this detection and
linking mechanism.

Comments?

May 01 2007

Chad J <gamerChad _spamIsBad_gmail.com> writes:

I've thought about this myself, and really like the idea.  In the VM 
discussion Don mentioned benchmarking different codepaths to find which 
one works best on the current CPU, then linking the best one in.  This 
makes a lot of sense to me, since CPUs seem to have different 
performance characteristics, even regardless of instruction set 
differences.
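A minimal sketch of the benchmark-then-link idea, written in present-day D rather than the 2007 dialect used elsewhere in this thread. The two `sum` variants, `pickFastest`, and the round count are made up for illustration; a real system would time compiler-generated variants on representative input.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

alias SumFn = long function(const(int)[] data);

// Two hypothetical implementations of the same operation.
long sumSimple(const(int)[] data)
{
    long total = 0;
    foreach (x; data)
        total += x;
    return total;
}

long sumUnrolled(const(int)[] data)
{
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < data.length; i += 2)
    {
        a += data[i];
        b += data[i + 1];
    }
    if (i < data.length)
        a += data[i];
    return a + b;
}

// Times every candidate on the sample input and returns the fastest one.
SumFn pickFastest(SumFn[] candidates, const(int)[] sample, int rounds = 100)
{
    SumFn best = candidates[0];
    Duration bestTime = Duration.max;
    foreach (fn; candidates)
    {
        auto sw = StopWatch(AutoStart.yes);
        long sink = 0;
        foreach (_; 0 .. rounds)
            sink += fn(sample);
        sw.stop();
        if (sw.peek() < bestTime)
        {
            bestTime = sw.peek();
            best = fn;
        }
    }
    return best;
}

SumFn sum; // "linked" once at startup, then called like a normal function

void main()
{
    auto sample = new int[](4096);
    foreach (i, ref x; sample)
        x = cast(int) i;
    sum = pickFastest([&sumSimple, &sumUnrolled], sample);
    assert(sum(sample) == 4095L * 4096 / 2);
}
```

The timing loop is where the startup cost mentioned further down comes from, so the number of rounds and candidates is the knob to turn (or to switch off entirely).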

I was once benchmarking an algorithm on my notebook computer with a more 
modern processor, and my desktop computer with an older processor.  The 
algo ran faster on the notebook of course, but branching had an 
especially reduced cost.  That is, branching on the more modern 
processor was less expensive relative to other instructions than it was 
on the previous processor.  This was with the same D binary on both of 
them.

That is the sort of stuff that I think the JITC's want to leverage, but 
I have to wonder if using a strategy like this and covering enough 
permutations of costly algorithms would give exactly the same benefit, 
with a massively reduced startup time for applications.  Of course, it 
would also be nice to be able to turn it off, because it will cost SOME 
startup time as well as executable size, which are not worthwhile costs 
for some apps like simple command line apps that need to be snappy and 
small.  It would rock for games though ;)

I really can't wait to see D's performance some day when/if it gets cool 
tricks like this, low-d vector primitives, array operations, etc.

Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.
 
 Ideally we wouldn't have to write additional code either. The compiler
 could emit code for multiple targets on a per-function basis (e.g. with
 the target architecture mangled into the function name). The runtime
 would check at startup which version to use and "link" the
 appropriate function.
 Here is a small proof-of-concept implementation of this detection and
 linking mechanism.
 
 Comments?
 
 
 ------------------------------------------------------------------------
 
 import std.cpuid;
 import std.stdio;
 
 //-----------------------------------------------------------------------------
 //  This code goes into the runtime library
 
 const uint  CPU_NO_EXTENSION    = 0,
             CPU_MMX             = 1,
             CPU_SSE             = 2,
             CPU_SSE2            = 4,
             CPU_SSE3            = 8;
 
 /******************************************************************************
     A function pointer with a bitmask for its required extensions
 ******************************************************************************/
 struct MultiTargetVariant
 {
     static MultiTargetVariant opCall(uint ext, void* func)
     {
         MultiTargetVariant mtv;
         mtv.ext = ext;
         mtv.func = func;
         return mtv;
     }
 
     uint    ext;
     void*   func;
 }
 
 /******************************************************************************
     Chooses the first matching MTV
     and saves its FP to the dummy entry in the VTBL
 ******************************************************************************/
 void LinkMultiTarget(ClassInfo ci, void* dummy_ptr,
                      MultiTargetVariant[] multi_target_variants)
 {
     uint extensions;
     if ( mmx )  extensions |= CPU_MMX;
     if ( sse )  extensions |= CPU_SSE;
     if ( sse2 ) extensions |= CPU_SSE2;
     if ( sse3 ) extensions |= CPU_SSE3;
 
     foreach ( i, inout vp; ci.vtbl )
     {
         if ( vp is dummy_ptr )
         {
             foreach ( variant; multi_target_variants )
             {
                 if ( (variant.ext & extensions) == variant.ext )
                 {
                     vp = variant.func;
                     break;
                 }
             }
             assert(vp !is dummy_ptr);
             break;
         }
     }
 }
 
 
 //-----------------------------------------------------------------------------
 //  This is application code
 
 /******************************************************************************
     A class with a multi-target function
 ******************************************************************************/
 class MyMultiTargetClass
 {
     // The following 3 functions could be generated automatically by the
     // compiler with different targets enabled. For example, when we have
     // language support for vector operations, the compiler could generate
     // multiple versions for different SIMD extensions. Then there would be
     // only one extension-independent implementation.
 
     char[] multi_target_sse2()
     {
         return "using SSE2";
     }
 
     char[] multi_target_sse_mmx()
     {
         return "using SSE and MMX";
     }
 
     char[] multi_target_noext()
     {
         return "using no extension";
     }
 
     // The following code could be generated by the compiler if there are
     // multi-target functions
 
     char[] multi_target() { return null; }
     static this()
     {
         MultiTargetVariant[] variants = [
             MultiTargetVariant(CPU_SSE2, &multi_target_sse2),
             MultiTargetVariant(CPU_SSE|CPU_MMX, &multi_target_sse_mmx),
             MultiTargetVariant(CPU_NO_EXTENSION, &multi_target_noext)
         ];
         LinkMultiTarget(this.classinfo, &multi_target, variants);
     }
 }
 
 /******************************************************************************
     Finally, the usage is completely opaque and there is no runtime overhead
     besides the detection at startup.
 ******************************************************************************/
 void main()
 {
     MyMultiTargetClass t = new MyMultiTargetClass;
     writefln("%s", t.multi_target);
 }
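The selection rule in `LinkMultiTarget` above is "first variant whose required bits are all available", so the order of the table encodes preference. A small sketch in present-day D isolates just that rule; `Cpu`, `TargetVariant` and `firstMatch` are illustrative names, not part of the proof of concept.

```d
enum Cpu : uint { none = 0, mmx = 1, sse = 2, sse2 = 4, sse3 = 8 }

// A variant name together with the extension bits it requires.
struct TargetVariant
{
    uint required;
    string name;
}

// Returns the name of the first variant whose required bits are all
// present in `available`, or null if none matches.
string firstMatch(const(TargetVariant)[] variants, uint available)
{
    foreach (v; variants)
        if ((v.required & available) == v.required)
            return v.name;
    return null;
}

void main()
{
    // Most demanding variant first, the catch-all last.
    auto variants = [
        TargetVariant(Cpu.sse2, "sse2"),
        TargetVariant(Cpu.sse | Cpu.mmx, "sse+mmx"),
        TargetVariant(Cpu.none, "noext"),
    ];
    assert(firstMatch(variants, Cpu.mmx | Cpu.sse | Cpu.sse2) == "sse2");
    assert(firstMatch(variants, Cpu.mmx | Cpu.sse) == "sse+mmx");
    assert(firstMatch(variants, Cpu.mmx) == "noext");
}
```

A variant requiring no extensions matches every CPU, so it has to come last; listed first, it would shadow all the others.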

May 01 2007

Anders F Björklund <afb algonet.se> writes:

Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.

On a totally unrelated note we are using GDC to build Universal Binaries
for Mac OS X, that is: objects with both i386 (=i686) and ppc (=powerpc)
code. They are however twice as big as when building for just one arch.

$ file hello
hello: Mach-O universal binary with 2 architectures
hello (for architecture ppc):   Mach-O executable ppc
hello (for architecture i386):  Mach-O executable i386

The GCC driver automatically runs two compilation steps and lipos them,
so it's pretty straight-forward to use (unrelated to vector ops, though)
gdc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch ppc -arch i386 ...

It only does one variant for each architecture, so no help for a "JIT".

--anders

May 01 2007

Lutger <lutger.blijdestijn gmail.com> writes:

I've seen some games with multiple executables compiled for different 
architectures. Since the executable size is dwarfed by resources, this 
is no problem for this kind of application.

How much of a negative impact would your suggested approach have on 
compiler optimizations? (inlining and that sort of thing)

Another thing, what are the benefits of the compiler doing this over 
libraries?

On a related note it may be worth mentioning liboil which implements 
exactly this in a library: http://liboil.freedesktop.org/wiki/

May 01 2007

Jascha Wetzel <"[firstname]" mainia.de> writes:

Lutger wrote:
 I've seen some games with multiple executables compiled for different
 architectures. Since the executable size is dwarfed by resources, this
 is no problem for this kind of application.

yeah, the size issue isn't that important. it's actually more about not
doing anything but adding a compiler switch to get multiple versions.

 How much of a negative impact would your suggested approach have on
 compiler optimizations? (inlining and that sort of thing)

the smallest unit for this approach would be a non-inlined function. any
function that gets inlined within the multi-arch function would be
compiled with the appropriate target as well.
all intraprocedural optimizations work as usual. only optimizations that
change the calling convention are affected. those have to be equal for
all versions of the function because the caller never knows which
version it calls.
in the example where only virtual functions are supported this is not an
issue, since virtual functions have that requirement anyway. for static
functions this has to be ensured explicitly.
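The "same calling convention for every version" requirement can be checked at compile time. A hedged sketch in present-day D: the `scale*` functions and the `interchangeable` helper are hypothetical, and comparing function pointer types is only one possible way to express the check.

```d
// Hypothetical variants of one function, as if built for different targets.
float scaleGeneric(float x, float factor) { return x * factor; }
float scaleSse(float x, float factor) { return x * factor; }

// A function with the same parameters but C linkage.
extern (C) float scaleC(float x, float factor) { return x * factor; }

// True when both functions have the same function pointer type, i.e. the
// same return type, parameters, linkage and attributes, so a caller cannot
// tell them apart.
enum bool interchangeable(alias a, alias b) = is(typeof(&a) == typeof(&b));

static assert(interchangeable!(scaleGeneric, scaleSse));
static assert(!interchangeable!(scaleGeneric, scaleC));

alias ScaleFn = typeof(&scaleGeneric);

void main()
{
    // Either D-linkage variant can sit behind the same pointer.
    ScaleFn scale = &scaleSse;
    assert(scale(2f, 3f) == 6f);
}
```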

 Another thing, what are the benefits of the compiler doing this over
 libraries?

using libraries means that you have to at least compile multiple
versions of each library and have code that loads the appropriate version.
with compiler support it's a lot more convenient and less error prone,
since you do not have to write any additional code or have more complex
build scripts.

 On a related note it may be worth mentioning liboil which implements
 exactly this in a library: http://liboil.freedesktop.org/wiki/

yep, it has the same goal and looking at the source shows how much work
it is. of course that's also because everything is manually optimized.

May 02 2007

janderson <askme me.com> writes:

Jascha Wetzel wrote:
 A thought that came up in the VM discussion...
 
 Suppose someday we have language support for vector operations. We want
 to ship binaries that support but do not require extensions like SSE. We
 do not want to ship multiple binaries and wrappers that switch between
 them or installers that decide which one to use, because it's more work
 and we'd be shipping a lot of redundant code.
 
 Ideally we wouldn't have to write additional code either. The compiler
 could emit code for multiple targets on a per-function basis (e.g. with
 the target architecture mangled into the function name). The runtime
 would check at startup which version to use and "link" the
 appropriate function.
 Here is a small proof-of-concept implementation of this detection and
 linking mechanism.
 
 Comments?
 

This is fine when you have a small sub-set of target architectures; 
however, if you want to be really optimal it needs to be optimized for 
the target architecture.  Michael Abrash tried this for Pixomatic, 
however the size of the executable grew too large (it's an exponential 
thing because you want to avoid branching so things must be inlined).

http://www.ddj.com/184405765
http://www.ddj.com/184405807
http://www.ddj.com/184405848

I'm not saying it's not a good start, however I think the compiler would 
need to perform some sort of compression and optimize the function for 
the architecture (even the order of instructions can make a huge 
difference to efficiency) at startup. I guess that's kind of a JITC.

However, I guess it could be a load of tiny code segments that are pre-built 
and rearranged and added together just before build. (Kinda like Pixomatic)

-Joel

May 01 2007

Jascha Wetzel <"[firstname]" mainia.de> writes:

the granularity isn't as fine as it could be, of course. but the effort
to make it happen is pretty small and it's better than compiling the
whole program multiple times and switching manually.
it's not a replacement for JITC or methods like Abrash's welding.


May 02 2007

janderson <askme me.com> writes:

Jascha Wetzel wrote:
 the granularity isn't as fine as it could be, of course. but the effort
 to make it happen is pretty small and it's better than compiling the
 whole program multiple times and switching manually.
 it's not a replacement for JITC or methods like Abrash's welding.
 

I agree, it's a good start.



May 02 2007

Jascha Wetzel <"[firstname]" mainia.de> writes:

here is a much simpler version that works with templates. what it boils
down to is choosing one template instance at startup that will replace a
function pointer.

now the only compiler support required would be a pragma or similar to
select the target architecture.
this could also be used to manage multiple versions of BLADE code.
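The attachment with the template version is not part of this archive. What follows is only a sketch of the mechanism as described, in present-day D: one template instance per target, and a function pointer that is set once at startup. The `Arch` enum, `dot`, `selectDot` and `dotImpl` are invented for illustration, and because no target pragma exists, both instances compile to the same code here.

```d
import core.cpuid : cpuHasSse2 = sse2;

enum Arch { generic, sse2 }

// One source-level function, one instance per target. A compiler pragma
// would switch code generation per instance; in this sketch the template
// parameter only tags the instance.
float dot(Arch arch)(const(float)[] a, const(float)[] b)
{
    float total = 0;
    foreach (i; 0 .. a.length)
        total += a[i] * b[i];
    return total;
}

alias DotFn = float function(const(float)[], const(float)[]);

// Returns the instance that matches the reported CPU features.
DotFn selectDot(bool hasSse2)
{
    return hasSse2 ? &dot!(Arch.sse2) : &dot!(Arch.generic);
}

// Callers go through this pointer instead of naming an instance.
__gshared DotFn dotImpl;

shared static this()
{
    // Runs once before main(); core.cpuid has been initialised by then
    // because this module imports it.
    dotImpl = selectDot(cpuHasSse2);
}

void main()
{
    assert(dotImpl([1f, 2f, 3f], [4f, 5f, 6f]) == 32f);
}
```

Every call pays for one indirection through `dotImpl`; the DDL fixup idea discussed below would turn that into a direct call.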

May 02 2007

Don Clugston <dac nospam.com.au> writes:

Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.
 
 now the only compiler support required would be a pragma or similar to
 select the target architecture.

A pragma would only be required as a size optimisation. Probably not 
worth worrying about (We have enough version information already).

 this could also be used to manage multiple versions of BLADE code.

It's a nice idea, but I don't know how it could generate the class to 
put the 'this()' function into (we don't want a memory alloc every time 
we enter that function!)

Interestingly DDL could be fantastic for this. At startup, walk through 
the symbol fixup table, and look for any import symbols marked 
__cpu_fixup_XXX.
When you find them, look for an export symbol called __cpu_SSE2_XXX, and 
patch them into everything in the fixup list. That way, you even get 
a direct function call, instead of an indirect one.
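To make the fixup walk concrete, here is a simulation in present-day D. It does not use DDL's actual API, which is not shown in this thread: `Fixup`, `applyCpuFixups` and the `scale*` functions are stand-ins, and the "call sites" are plain function pointers rather than patched machine code.

```d
import std.algorithm.searching : startsWith;

alias Fn = int function();

int scaleGeneric() { return 1; }
int scaleSse2() { return 2; }

// Stand-in for a loader's fixup entry: an imported symbol name and the
// call slots that must receive its final address.
struct Fixup
{
    string importName;
    Fn*[] slots;
}

// Patches every slot of each "__cpu_fixup_XXX" import with the export
// named "__cpu_<cpuTag>_XXX". Returns the number of slots patched.
size_t applyCpuFixups(Fixup[] fixups, Fn[string] exports, string cpuTag)
{
    enum prefix = "__cpu_fixup_";
    size_t patched = 0;
    foreach (ref fixup; fixups)
    {
        if (!fixup.importName.startsWith(prefix))
            continue;
        const wanted = "__cpu_" ~ cpuTag ~ "_"
            ~ fixup.importName[prefix.length .. $];
        if (auto target = wanted in exports)
        {
            foreach (slot; fixup.slots)
            {
                *slot = *target;
                ++patched;
            }
        }
    }
    return patched;
}

void main()
{
    Fn callSite = &scaleGeneric;
    auto fixups = [Fixup("__cpu_fixup_scale", [&callSite])];
    Fn[string] exports = ["__cpu_SSE2_scale": &scaleSse2];
    assert(applyCpuFixups(fixups, exports, "SSE2") == 1);
    assert(callSite() == 2);
}
```

A real loader would write the resolved address into the call instructions themselves, which is what yields the direct call; the pointer slots here only model the bookkeeping.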

I wonder if it's possible to pop ESP off the stack, and write back into 
the code that called you, without the operating system triggering a 
security alert -- in that case, the function you call could be a little 
thunk, something like:

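asm {
   naked;
   mov eax, CPU_TYPE;
   mov eax, FUNCPOINTERS[eax];
   mov ecx, [esp];   // get the return address (top of the stack on entry)
   sub eax, ecx;     // a near call stores its target relative to that address
   mov [ecx-4], eax; // patch the call operand, so this thunk never gets called again
   add eax, ecx;     // back to the absolute target address
   jmp eax;
}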

But I think a modern OS would go nuts if you try this?
(It's been a long time since I wrote self modifying code).

May 02 2007

janderson <askme me.com> writes:

Don Clugston wrote:
 Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.

 now the only compiler support required would be a pragma or similar to
 select the target architecture.

 
 A pragma would only be required as a size optimisation. Probably not 
 worth worrying about (We have enough version information already).
 
 this could also be used to manage multiple versions of BLADE code.

 
 It's a nice idea, but I don't know how it could generate the class to 
 put the 'this()' function into (we don't want a memory alloc every time 
 we enter that function!)
 
 Interestingly DDL could be fantastic for this. At startup, walk through 
 the symbol fixup table, and look for any import symbols marked 
 __cpu_fixup_XXX.
 When you find them, look for an export symbol called __cpu_SSE2_XXX, and 
 patch them into everything in the fixup list. That way, you even get 
 a direct function call, instead of an indirect one.
 
 I wonder if it's possible to pop ESP off the stack, and write back into 
 the code that called you, without the operating system triggering a 
 security alert -- in that case, the function you call could be a little 
 thunk, something like:
 
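 asm {
   naked;
   mov eax, CPU_TYPE;
   mov eax, FUNCPOINTERS[eax];
   mov ecx, [esp];   // get the return address (top of the stack on entry)
   sub eax, ecx;     // a near call stores its target relative to that address
   mov [ecx-4], eax; // patch the call operand, so this thunk never gets called again
   add eax, ecx;     // back to the absolute target address
   jmp eax;
 }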
 
 But I think a modern OS would go nuts if you try this?
 (It's been a long time since I wrote self modifying code).

That may be the case.  Also, if the code is only called once, it would 
cause a huge cache miss that would last for many nanoseconds.

If this happens a lot, the code would keep spiking all over the place 
(for the first few seconds of the app and then when you hit code that 
hasn't been used before).

A better approach would be to figure them out in large batches, perhaps 
per-module level.  That way you get less cache-misses.

Nice idea though.

-Joel

May 02 2007

Pragma <ericanderton yahoo.removeme.com> writes:

Don Clugston wrote:
 Jascha Wetzel wrote:
 here is a much simpler version that works with templates. what it boils
 down to is choosing one template instance at startup that will replace a
 function pointer.

 now the only compiler support required would be a pragma or similar to
 select the target architecture.

 
 A pragma would only be required as a size optimisation. Probably not 
 worth worrying about (We have enough version information already).
 
 this could also be used to manage multiple versions of BLADE code.

 
 It's a nice idea, but I don't know how it could generate the class to 
 put the 'this()' function into (we don't want a memory alloc every time 
 we enter that function!)
 
 Interestingly DDL could be fantastic for this. At startup, walk through 
 the symbol fixup table, and look for any import symbols marked 
 __cpu_fixup_XXX.
 When you find them, look for an export symbol called __cpu_SSE2_XXX, and 
 patch them into everything in the fixup list. That way, you even get 
 a direct function call, instead of an indirect one.

I was thinking about this.  What would be nice is if D had reflection
annotations/attributes to flag methods and functions, rather than kludging
more information into the symbol name.

pragma(attr,CPUOptionFixup(CPU.SSE2,&myFunction))
void myFunction_SSE2(){ /* do something with SSE2 */ }

void myFunction(){ /* use vanilla code here */ }

During or after link time, you just walk the reflection metadata and patch
things up as appropriate.

Now DDL already has a metadata capability via .ddl wrapper support - your
imagination is the limit on how that is done post-build (a D front-end comes
to mind).  Once I get this next revision of DDL out, it should be possible to
publish DDL metadata directly from a module via a static hashmap, instead of
relying on a post-build process.

Either way, it's just a matter of walking that metadata as it's exposed from
each DynamicModule and DynamicLibrary, and patching the symbol tables during
runtime linking.

 
 I wonder if it's possible to pop ESP off the stack, and write back into 
 the code that called you, without the operating system triggering a 
 security alert -- in that case, the function you call could be a little 
 thunk, something like:
 
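 asm {
   naked;
   mov eax, CPU_TYPE;
   mov eax, FUNCPOINTERS[eax];
   mov ecx, [esp];   // get the return address (top of the stack on entry)
   sub eax, ecx;     // a near call stores its target relative to that address
   mov [ecx-4], eax; // patch the call operand, so this thunk never gets called again
   add eax, ecx;     // back to the absolute target address
   jmp eax;
 }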
 
 But I think a modern OS would go nuts if you try this?
 (It's been a long time since I wrote self modifying code).

If I'm not mistaken, this should be doable thanks to D adopting a *very* flat
memory model (at least on win32).  All the segment registers have the same
base address in memory.  So just as long as you read/write against
ES/DS/FS/GS/SS and read/call against CS, you should be good to go.

IIRC, Windows does provide some stronger code-segment write protection (I
forget what it was actually called), but it has to be enabled explicitly.

At some point in the future, Don, I'd like to pick your brain about using
trampolines like this for DDL.  I'd like to see cross-OS binaries become
possible by thunking the exception-handling mechanisms between *nix and Win32
at link time, but I'm not sure how to pull that off just yet.

-- 
- EricAnderton at yahoo

May 02 2007

Jascha Wetzel <"[firstname]" mainia.de> writes:

 It's a nice idea, but I don't know how it could generate the class to
 put the 'this()' function into (we don't want a memory alloc every time
 we enter that function!)

i'm not sure what you mean. i thought of something like this:

void foo(uint arch)()
{
  auto p = Vec!(arch)([3.5, 1.1, 3.8]);
  auto r = Vec!(arch)([17.0f, 28.25, 1]);
  p *= dot(p,r);
}

the template parameter to Vec could choose the target used by BLADE (x87
or SSE for example). the result is a class with multiple instances of
foo (since all desired instances appear in the static c'tor).
the static c'tor chooses one of the instances (depending on hardware
availability or benchmarks) and copies its address to the init-data in
the classinfo. every time the class is instantiated, the function-pointer
will automatically be initialized with the chosen pointer - no
self-modifying code necessary.
instead of changing the init-data, we can also modify the VTBL in the
classinfo (that's what the first version of this example did).


May 02 2007
