www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - goto [variable], and address of code labels

reply Manu <turkeyman gmail.com> writes:
--0016e6db2af3bcb3c504b05e67f9
Content-Type: text/plain; charset=UTF-8

Hi people.

I'd like to propose support for taking the address of code labels, and
supporting variable goto statements.
This is a feature I have found extremely useful, implemented as a GCC
specific extension.

I've used this to get great speedups and simplify code while writing
emulators/vm's. Perhaps also useful in efficient state machine type code
too...


Simple example:

void *opcodes[] = { &OP_ADDA, &OP_SUBA, &OP_MOVA, &OP_JMPA, &OP_etc... };

void exec()
{
  // begin execution
  goto opcodes[ pProgram[regs.PC++] ];

OP_ADDA:
  regs.A += pProgram[regs.PC++];
  goto opcodes[ pProgram[regs.PC++] ];

OP_SUBA:
  regs.A -= pProgram[regs.PC++];
  goto opcodes[ pProgram[regs.PC++] ];

OP_MOVA:
  regs.A = pProgram[regs.PC++];
  goto opcodes[ pProgram[regs.PC++] ];

OP_JMPA:
  regs.PC = regs.A;
  goto opcodes[ pProgram[regs.PC++] ];

OP_etc:
  ...
  goto opcodes[ pProgram[PC++] ];
}

Notice how this structure completely eliminates branch logic, control
statements, etc.

Note, GCC code labels are void*, but in D, perhaps some special code label
pointer type should be invented for type safety...
One may also expect that function pointers may also be implicitly cast to
this generalised code pointer type, which might be useful in some code where
a naked function pointer is used to implement some custom calling
convention.

--0016e6db2af3bcb3c504b05e67f9
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi people.<div><br></div><div>I&#39;d like to propose support for taking th=
e address of code labels, and supporting variable goto statements.</div><di=
v>This is a feature I have found extremely useful, implemented as a GCC spe=
cific extension.</div>
<div><br></div><div>I&#39;ve used this to get great speedups and simplify c=
ode while writing emulators/vm&#39;s. Perhaps also useful in efficient stat=
e machine type code too...</div><div><br></div><div><br></div><div>Simple e=
xample:</div>
<div><br></div><div>void *opcodes[] =3D { &amp;OP_ADDA, &amp;OP_SUBA, &amp;=
OP_MOVA, &amp;OP_JMPA, &amp;OP_etc... };</div><div><br></div><div>void exec=
()</div><div>{</div><div>=C2=A0 // begin execution</div><div>=C2=A0 goto=C2=
=A0opcodes[=C2=A0pProgram[regs.PC++] ];</div>
<div><br></div><div>OP_ADDA:</div><div><div>=C2=A0 regs.A +=3D=C2=A0pProgra=
m[regs.PC++];</div><div>=C2=A0 goto=C2=A0opcodes[=C2=A0pProgram[regs.PC++] =
];</div></div><div><br></div><div>OP_SUBA:</div><div><div>=C2=A0 regs.A -=
=3D=C2=A0pProgram[regs.PC++];</div></div>
<div>=C2=A0 goto=C2=A0opcodes[=C2=A0pProgram[regs.PC++] ];</div><div><div><=
br></div><div>OP_MOVA:</div></div><div>=C2=A0 regs.A =3D=C2=A0pProgram[regs=
.PC++];</div><div>=C2=A0 goto=C2=A0opcodes[=C2=A0pProgram[regs.PC++] ];</di=
v><div><div><br></div><div>OP_JMPA:</div>
</div><div>=C2=A0 regs.PC =3D regs.A;</div><div><div>=C2=A0 goto=C2=A0opcod=
es[=C2=A0pProgram[regs.PC++] ];</div><div></div><div><br></div></div><div><=
div>OP_etc:</div></div><div>=C2=A0 ...</div><div><div>=C2=A0 goto=C2=A0opco=
des[=C2=A0pProgram[PC++] ];</div></div>
<div>}</div><div><br></div><div>Notice how this structure completely elimin=
ates branch logic, control statements, etc.</div><div><br></div><div>Note, =
GCC code labels are void*, but in D, perhaps some special code label pointe=
r type should be invented for type safety...</div>
<div>One may also expect that function pointers may also be implicitly cast=
 to this generalised code pointer type, which might be useful in some code =
where a naked function pointer is used to implement some custom calling con=
vention.</div>

--0016e6db2af3bcb3c504b05e67f9--
Oct 28 2011
next sibling parent Norbert Nemec <Norbert Nemec-online.de> writes:
Have you considered how this mechanism should handle crossing of scope 
boundaries?


On 28.10.2011 18:30, Manu wrote:
 Hi people.

 I'd like to propose support for taking the address of code labels, and
 supporting variable goto statements.
 This is a feature I have found extremely useful, implemented as a GCC
 specific extension.

 I've used this to get great speedups and simplify code while writing
 emulators/vm's. Perhaps also useful in efficient state machine type code
 too...


 Simple example:

 void *opcodes[] = {&OP_ADDA,&OP_SUBA,&OP_MOVA,&OP_JMPA,&OP_etc... };

 void exec()
 {
    // begin execution
    goto opcodes[ pProgram[regs.PC++] ];

 OP_ADDA:
    regs.A += pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_SUBA:
    regs.A -= pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_MOVA:
    regs.A = pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_JMPA:
    regs.PC = regs.A;
    goto opcodes[ pProgram[regs.PC++] ];

 OP_etc:
    ...
    goto opcodes[ pProgram[PC++] ];
 }

 Notice how this structure completely eliminates branch logic, control
 statements, etc.

 Note, GCC code labels are void*, but in D, perhaps some special code label
 pointer type should be invented for type safety...
 One may also expect that function pointers may also be implicitly cast to
 this generalised code pointer type, which might be useful in some code where
 a naked function pointer is used to implement some custom calling
 convention.

Oct 28 2011
prev sibling next sibling parent reply ponce <spam spam.org> writes:
Provided some hairy conditions, the switch instruction will optimize to 
a jump table in GCC and probably most C compilers.

In ICC, some static analysis is even used to optimize out the test 
before the switch.

In D, final switch might enable such an optimization with statically 
checking for out-of-enum values.
Oct 28 2011
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28.10.2011 22:57, ponce wrote:
 Provided some hairy conditions, the switch instruction will optimize to
 a jump table in GCC and probably most C compilers.

They do, you have no guaranties. It will either fly or crawl, depending on sparseness of values. Even so it *usually* makes table, just check on it from time to time ;).
 In ICC, some static analysis is even used to optimize out the test
 before the switch.

The thing is this tiny bit at the end of every instruction code: .... goto opcodes[pProgram[regs.PC++]]; This is instruction dispatch, the trick in the branch prediction that operates on per branch basis, thus single switch-jump based VM dispatch will mispredict jumps most of the time. I seen claims of up to 99% on average. If you place a whole switch at the end of every instruction code I'm not sure compiler will find it's way out of this mess, or even optimize it. I tried that with DMD, the problem is it have no idea how to use the *same* jump *table* for all of identical switches.
 In D, final switch might enable such an optimization with statically
 checking for out-of-enum values.

-- Dmitry Olshansky
Oct 28 2011
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29.10.2011 3:15, Manu wrote:
     This is instruction dispatch, the trick in the branch prediction
     that operates on per branch basis, thus single switch-jump based VM
     dispatch will mispredict jumps most of the time. I seen claims of up
     to 99% on average.
     If you place a whole switch at the end of every instruction code I'm
     not sure compiler will find it's way out of this mess, or even
     optimize it. I tried that with DMD, the problem is it have no idea
     how to use the *same* jump *table* for all of identical switches.


 I don't quite follow what you mean here... so forgive me if I'm way off,
 or if you're actually agreeing with me :)
 Yes, this piece of code is effectively identical to the switch that may
 fall at the top of an execute loop (assuming the switch compiles a jump
 table), but there are some subtle differences...
 As said before, there's no outer loop, which saves some code and a
 branch,

It doesn't save code - you replace unconditional jump to start of switch, with an extra load from table + indirect jump per VM opcode. but more importantly, on architectures that have branch target
 registers, the compiler can schedule the load to the branch target
 earlier (while the rest of the 'opcode' is being executed), which will
 give the processor more time to preload the instruction stream from the
 branch destination, which will reduce the impact of the branch
 misprediction.

I was thinking x86 here, but there are other processors that do not have branch target register. And depending on the pipeline length prediction it should be done earlier then loading of destination address.
 I'm not sure what you're trying to say about the branch prediction, but
 this isn't a 'branch' (nor is a switch that produces a jump table), it's
 a jump, and it will never predict since it doesn't have a binary target.

I'm not sure about terminology, but branch === jump, and both can be conditional. By switch-jump I mean tabulated indirect jump, that is target is loaded from table. It's an indirect jump and as such can go wherever it wants to. And yes it's still predicted, mostly on basis of "it will go to the same address as last time". Actually, branch benediction mechanisms I heard of don't know if some jump is conditional at all (that would complicate it), the end result is just a block of "branch/jump at address X is predicted to go to address Y".
 Many architectures attempt to alleviate this sort of unknown target
 penalty by introducing a branch target register, which once loaded, will
 immediately start fetching opcodes from the target.. the key for the
 processor is to load the target address into that register as early as
 possible to hide the instruction fetch latency. Code written in the
 style I illustrate will give the best possible outcome to that end.

Nice to know, which ones by the way? -- Dmitry Olshansky
Oct 29 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29.10.2011 15:16, Manu wrote:
 On 29 October 2011 10:55, Dmitry Olshansky <dmitry.olsh gmail.com
 <mailto:dmitry.olsh gmail.com>> wrote:

     On 29.10.2011 3:15, Manu wrote:


             This is instruction dispatch, the trick in the branch prediction
             that operates on per branch basis, thus single switch-jump
         based VM
             dispatch will mispredict jumps most of the time. I seen
         claims of up
             to 99% on average.
             If you place a whole switch at the end of every instruction
         code I'm
             not sure compiler will find it's way out of this mess, or even
             optimize it. I tried that with DMD, the problem is it have
         no idea
             how to use the *same* jump *table* for all of identical
         switches.


         I don't quite follow what you mean here... so forgive me if I'm
         way off,
         or if you're actually agreeing with me :)
         Yes, this piece of code is effectively identical to the switch
         that may
         fall at the top of an execute loop (assuming the switch compiles
         a jump
         table), but there are some subtle differences...
         As said before, there's no outer loop, which saves some code and a
         branch,


     It doesn't save code - you replace unconditional jump to start of
     switch, with an extra load from table + indirect jump per VM opcode.


 True, it may appear there is more code, but the pipeline is far more
 overlapped, it can load the jump target earlier, also there is still
 less work whichever way you slice it.

Yes, I was just arguing that code size is *strictly speaking* larger. Yet it's faster, but saving on extra uncoditional jump is not the only reason.
     but more importantly, on architectures that have branch target

         registers, the compiler can schedule the load to the branch target
         earlier (while the rest of the 'opcode' is being executed),
         which will
         give the processor more time to preload the instruction stream
         from the
         branch destination, which will reduce the impact of the branch
         misprediction.


     I was thinking x86 here, but there are other processors that do not
     have branch target register. And depending on the pipeline length
     prediction it should be done earlier then loading of destination
     address.


 I'm not sure what you mean here about pipeline length prediction...

Ouch. It should have been: depending on the pipeline length, prediction should have started earlier then loading destination address. but
 with respect to x86, and other performance orientated architectures, I
 don't see how this could disadvantage the processor in any way. While an
 out-of-order processor like x86 may be able to look ahead, resolve an
 unconditional jump back to the top of a switch, and begin resolving the
 next target ahead of time, having the dispatch inline just means there
 is no unconditional jump, a couple less instructions to process.

No objections here, it is faster.
         I'm not sure what you're trying to say about the branch
         prediction, but
         this isn't a 'branch' (nor is a switch that produces a jump
         table), it's
         a jump, and it will never predict since it doesn't have a binary
         target.


     I'm not sure about terminology, but branch === jump, and both can be
     conditional. By switch-jump I mean tabulated indirect jump, that is
     target is loaded from table.
     It's an indirect jump and as such can go wherever it wants to. And
     yes it's still predicted, mostly on basis of "it will go to the same
     address as last time". Actually, branch benediction mechanisms I
     heard of don't know if some jump is conditional at all (that would
     complicate it), the end result is just a block of "branch/jump at
     address X is predicted to go to address Y".


 I would say branch != jump; a branch is conditional (I think the term
 'branch' implies conditionality), and is a candidate for prediction.
 Branch opcodes are more often than not to absolute targets too. A jump
 is unconditional, the branch prediction hardware has no effect on jump
 opcodes, and naturally it may be an absolute jump (performs more or less
 like a noop), or an indirect one, which is the worst case.

Indirect jumps are still being predicted, that's what I've been arguing. Anyway, it's a well known stuff called threaded code, and I'm sure there are better sources then my mumbling about optimizing it: http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf cpu's will
 typically try and optimise this by prefetching the code at the earliest
 moment it can know the target. The mechanics of which are quite
 different for out-of-order processors, and architectures like PPC, which
 have a specific register indirect jump targets.
 An x86 or some other OoO processor may look ahead far enough to find the
 target in advance, but I'm not sure about that. Simplier architectures
 will not be able to do that (intel atom/ppc/arm/mips).


         Many architectures attempt to alleviate this sort of unknown target
         penalty by introducing a branch target register, which once
         loaded, will
         immediately start fetching opcodes from the target.. the key for the
         processor is to load the target address into that register as
         early as
         possible to hide the instruction fetch latency. Code written in the
         style I illustrate will give the best possible outcome to that end.


     Nice to know, which ones by the way?


 PowerPC being the most important one (only one that's relevant these
 days), ie, all games consoles. But also some popular microcontrollers.
 But I'm fairly sure most popular architectures will do the same thing
 where the jump target comes from a general register (ARM). The sooner
 the target is known, the better.


 I don't see any real disadvantage to this syntax. It's optional, has no
 side effects, and allows you to express something that can't be
 expressed any other way.
 I imagined it would be easy to implement, and it's definitely useful in
 certain circumstances, a handy tool to have, particularly when working
 on embedded platforms where compiler backends/optimisers are always weak
 and immature, and these sorts of performance considerations are more
 important (ie, games consoles, microcontrollers).

-- Dmitry Olshansky
Oct 29 2011
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
ponce:

 Provided some hairy conditions, the switch instruction will optimize to 
 a jump table in GCC and probably most C compilers.
 
 In ICC, some static analysis is even used to optimize out the test 
 before the switch.
 
 In D, final switch might enable such an optimization with statically 
 checking for out-of-enum values.

Yeah yeah yeah, in D we'll see that in ten years, maybe. In the meantime give me good computed gotos. Bye, bearophile
Oct 28 2011
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28.10.2011 20:30, Manu wrote:
 Hi people.

 I'd like to propose support for taking the address of code labels, and
 supporting variable goto statements.
 This is a feature I have found extremely useful, implemented as a GCC
 specific extension.

 I've used this to get great speedups and simplify code while writing
 emulators/vm's. Perhaps also useful in efficient state machine type code
 too...

Yes, mostly efficient VMs. Maybe for JIT, though it would depend on some hacks. But custom made state machines usually do fixed jumps anyway.
 Simple example:

 void *opcodes[] = { &OP_ADDA, &OP_SUBA, &OP_MOVA, &OP_JMPA, &OP_etc... };

I gather this should be somewhere inside exec, or it will be *extremely* unsafe.
 void exec()
 {

i.e. : enum opcodes[] = { ... };
    // begin execution
    goto opcodes[ pProgram[regs.PC++] ];

 OP_ADDA:
    regs.A += pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_SUBA:
    regs.A -= pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_MOVA:
    regs.A = pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_JMPA:
    regs.PC = regs.A;
    goto opcodes[ pProgram[regs.PC++] ];

 OP_etc:
    ...
    goto opcodes[ pProgram[PC++] ];
 }

 Notice how this structure completely eliminates branch logic, control
 statements, etc.

Yes, but you'd still have to check correctness, maybe before executing the whole VM program.
 Note, GCC code labels are void*, but in D, perhaps some special code
 label pointer type should be invented for type safety...

Or just call them void function(void), there are cases where jumping to function is OK. It's some pretty low-level stuff, that most people do in assembly though.
 One may also expect that function pointers may also be implicitly cast
 to this generalised code pointer type, which might be useful in some
 code where a naked function pointer is used to implement some custom
 calling convention.

Overall, I'd personally welcome this kind of extension, but it's should be doable in inline asm even today, which somewhat diminishes it's impact. The advantage of computed gotos compared to asm is portability, the disadvantage is complicating backend for a special case. -- Dmitry Olshansky
Oct 28 2011
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--00163646d448aff7eb04b063bd5c
Content-Type: text/plain; charset=UTF-8

On 28 October 2011 22:16, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 On 28.10.2011 20:30, Manu wrote:

 Hi people.

 I'd like to propose support for taking the address of code labels, and
 supporting variable goto statements.
 This is a feature I have found extremely useful, implemented as a GCC
 specific extension.

 I've used this to get great speedups and simplify code while writing
 emulators/vm's. Perhaps also useful in efficient state machine type code
 too...

hacks. But custom made state machines usually do fixed jumps anyway.
 Simple example:

 void *opcodes[] = { &OP_ADDA, &OP_SUBA, &OP_MOVA, &OP_JMPA, &OP_etc... };

unsafe. void exec()
 {

i.e. : enum opcodes[] = { ... }; // begin execution
   goto opcodes[ pProgram[regs.PC++] ];

 OP_ADDA:
   regs.A += pProgram[regs.PC++];
   goto opcodes[ pProgram[regs.PC++] ];

 OP_SUBA:
   regs.A -= pProgram[regs.PC++];
   goto opcodes[ pProgram[regs.PC++] ];

 OP_MOVA:
   regs.A = pProgram[regs.PC++];
   goto opcodes[ pProgram[regs.PC++] ];

 OP_JMPA:
   regs.PC = regs.A;
   goto opcodes[ pProgram[regs.PC++] ];

 OP_etc:
   ...
   goto opcodes[ pProgram[PC++] ];
 }

 Notice how this structure completely eliminates branch logic, control
 statements, etc.

Yes, but you'd still have to check correctness, maybe before executing the whole VM program.
 Note, GCC code labels are void*, but in D, perhaps some special code
 label pointer type should be invented for type safety...

Or just call them void function(void), there are cases where jumping to function is OK. It's some pretty low-level stuff, that most people do in assembly though. One may also expect that function pointers may also be implicitly cast
 to this generalised code pointer type, which might be useful in some
 code where a naked function pointer is used to implement some custom
 calling convention.

Overall, I'd personally welcome this kind of extension, but it's should be doable in inline asm even today, which somewhat diminishes it's impact. The advantage of computed gotos compared to asm is portability, the disadvantage is complicating backend for a special case.

Regarding your points about my example, I agree re the scope of my array, it was just a 2 second example. I would expect a D implementation to be more strict (probably with regard to scope/etc), and typesafe as I suggested. And you nailed it, probably is do-able in asm (although I can't think of a way to produce an array of code addresses?), but this syntax is portable. How would it complicate the backend an awful lot? The ability to goto to a variable? I would have though this would be quite trivial. It should surely be just about as easy to take a variable rather than an immediate target...? The ability to assign the address of a code label to a const variable can't be any more difficult than assigning it as the constant target of a goto? Anyway, I don't know the details of implementation, but it seemed simple to me in theory :) Just thought I'd throw it out there.. I'd find this useful, as I have in the past on numerous occasions. --00163646d448aff7eb04b063bd5c Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 28 October 2011 22:16, Dmitry Olshansky <span= dir=3D"ltr">&lt;<a href=3D"mailto:dmitry.olsh gmail.com">dmitry.olsh gmail= .com</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"ma= rgin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div class=3D"im">On 28.10.2011 20:30, Manu wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> Hi people.<br> <br> I&#39;d like to propose support for taking the address of code labels, and<= br> supporting variable goto statements.<br> This is a feature I have found extremely useful, implemented as a GCC<br> specific extension.<br> <br> I&#39;ve used this to get great speedups and simplify code while writing<br=

e<br> too...<br> <br> </blockquote> <br></div> Yes, mostly efficient VMs. Maybe for JIT, though it would depend on some ha= cks.<br> But custom made state machines usually do fixed jumps anyway.<br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> <br> Simple example:<br> <br><div class=3D"im"> void *opcodes[] =3D { &amp;OP_ADDA, &amp;OP_SUBA, &amp;OP_MOVA, &amp;OP_JMP= A, &amp;OP_etc... };<br> <br> </div></blockquote> <br> I gather this should be somewhere inside exec, or it will be *extremely* un= safe.<br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> void exec()<br> {<br> </blockquote> <br> i.e. :<br> =C2=A0 =C2=A0 enum opcodes[] =3D { ... };<div class=3D"im"><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> =C2=A0 // begin execution<br> =C2=A0 goto opcodes[ pProgram[regs.PC++] ];<br> <br> OP_ADDA:<br> =C2=A0 regs.A +=3D pProgram[regs.PC++];<br> =C2=A0 goto opcodes[ pProgram[regs.PC++] ];<br> <br> OP_SUBA:<br> =C2=A0 regs.A -=3D pProgram[regs.PC++];<br> =C2=A0 goto opcodes[ pProgram[regs.PC++] ];<br> <br> OP_MOVA:<br> =C2=A0 regs.A =3D pProgram[regs.PC++];<br> =C2=A0 goto opcodes[ pProgram[regs.PC++] ];<br> <br> OP_JMPA:<br> =C2=A0 regs.PC =3D regs.A;<br> =C2=A0 goto opcodes[ pProgram[regs.PC++] ];<br> <br> OP_etc:<br> =C2=A0 ...<br> =C2=A0 goto opcodes[ pProgram[PC++] ];<br> }<br> <br> Notice how this structure completely eliminates branch logic, control<br> statements, etc.<br> </blockquote> <br></div> Yes, but you&#39;d still have to check correctness, maybe before executing = the whole VM program.<div class=3D"im"><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> <br> Note, GCC code labels are void*, but in D, perhaps some special code<br> label pointer type should be invented for type safety...<br> </blockquote> <br></div> Or just call them void function(void), there are cases where jumping to fun= ction is OK. It&#39;s some pretty low-level stuff, that most people do in a= ssembly though.<div class=3D"im"><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> One may also expect that function pointers may also be implicitly cast<br> to this generalised code pointer type, which might be useful in some<br> code where a naked function pointer is used to implement some custom<br> calling convention.<br> </blockquote> <br></div> Overall, I&#39;d personally welcome this kind of extension, but it&#39;s sh= ould be doable in inline asm even today, which somewhat diminishes it&#39;s= impact. The advantage of computed gotos compared to asm is portability, th= e disadvantage is complicating =C2=A0backend for a special case.</blockquot= e> <div><br></div><div>Regarding your points about my example, I agree re the = scope of my array, it was just a 2 second example. I would expect a D imple= mentation to be more strict (probably with regard to scope/etc), and typesa= fe as I suggested.</div> <div><br></div><div>And you nailed it, probably is do-able in asm (although= I can&#39;t think of a way to produce an array of code addresses?), but th= is syntax is portable.</div><div>How would it complicate the backend an awf= ul lot?=C2=A0The ability to goto to a variable? I would have though this wo= uld be quite trivial. It should surely be just about as easy to take a vari= able rather than an immediate target...?</div> <div>The ability to assign the address of a code label to a const variable = can&#39;t be any more difficult than assigning it as the constant target of= a goto?</div><div><br></div><div>Anyway, I don&#39;t know the details of i= mplementation, but it seemed simple to me in theory :)</div> <div>Just thought I&#39;d throw it out there.. I&#39;d find this useful, as= I have in the past on numerous occasions.</div></div> --00163646d448aff7eb04b063bd5c--
Oct 28 2011
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--0016e648ffa473163004b0641094
Content-Type: text/plain; charset=UTF-8

On 28 October 2011 22:30, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 On 28.10.2011 22:57, ponce wrote:

 Provided some hairy conditions, the switch instruction will optimize to
 a jump table in GCC and probably most C compilers.

sparseness of values. Even so it *usually* makes table, just check on it from time to time ;).

It does, sure, but that switch must then be wrapped in looping logic, which adds extra code and branching to the loop. In ICC, some static analysis is even used to optimize out the test
 before the switch.

  But switch-jump alone is not enough to get efficient VM interpreter.

.... goto opcodes[pProgram[regs.PC++]]; This is instruction dispatch, the trick in the branch prediction that operates on per branch basis, thus single switch-jump based VM dispatch will mispredict jumps most of the time. I seen claims of up to 99% on average. If you place a whole switch at the end of every instruction code I'm not sure compiler will find it's way out of this mess, or even optimize it. I tried that with DMD, the problem is it have no idea how to use the *same* jump *table* for all of identical switches.

I don't quite follow what you mean here... so forgive me if I'm way off, or if you're actually agreeing with me :) Yes, this piece of code is effectively identical to the switch that may fall at the top of an execute loop (assuming the switch compiles a jump table), but there are some subtle differences... As said before, there's no outer loop, which saves some code and a branch, but more importantly, on architectures that have branch target registers, the compiler can schedule the load to the branch target earlier (while the rest of the 'opcode' is being executed), which will give the processor more time to preload the instruction stream from the branch destination, which will reduce the impact of the branch misprediction. I'm not sure what you're trying to say about the branch prediction, but this isn't a 'branch' (nor is a switch that produces a jump table), it's a jump, and it will never predict since it doesn't have a binary target. Many architectures attempt to alleviate this sort of unknown target penalty by introducing a branch target register, which once loaded, will immediately start fetching opcodes from the target.. the key for the processor is to load the target address into that register as early as possible to hide the instruction fetch latency. Code written in the style I illustrate will give the best possible outcome to that end. --0016e648ffa473163004b0641094 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 28 October 2011 22:30, Dmitry Olshansky <span= dir=3D"ltr">&lt;<a href=3D"mailto:dmitry.olsh gmail.com">dmitry.olsh gmail= .com</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"ma= rgin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div class=3D"im">On 28.10.2011 22:57, ponce wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> Provided some hairy conditions, the switch instruction will optimize to<br> a jump table in GCC and probably most C compilers.<br> <br> </blockquote> <br></div> They do, you have no guaranties. It will either fly or crawl, depending on = sparseness of values. Even so it *usually* makes table, just check on it fr= om time to time ;).</blockquote><div><br></div><div>It does, sure, but that= switch must then be wrapped in looping logic, which adds extra code and br= anching to the loop.</div> <div>=C2=A0</div><div><br></div><blockquote class=3D"gmail_quote" style=3D"= margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class= =3D"im"> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> In ICC, some static analysis is even used to optimize out the test<br> before the switch.<br> <br> </blockquote></div> But switch-jump alone is not enough to get efficient VM interpreter.<br> The thing is this tiny bit at the end of every instruction code:<br> ....<br> =C2=A0goto opcodes[pProgram[regs.PC++]];<br> <br> This is instruction dispatch, the trick in the branch prediction that opera= tes on per branch basis, thus single switch-jump based VM dispatch will mis= predict jumps most of the time. I seen claims of up to 99% on average.<br> If you place a whole switch at the end of every instruction code I&#39;m no= t sure compiler will find it&#39;s way out of this mess, or even optimize i= t. I tried that with DMD, the problem is it have no idea how to use the *sa= me* jump *table* for all of identical switches.</blockquote> <div><br></div><div>I don&#39;t quite follow what you mean here... so forgi= ve me if I&#39;m way off, or if you&#39;re actually agreeing with me :)</di= v><div>Yes, this piece of code is effectively identical to the switch that = may fall at the top of an execute loop (assuming the switch compiles a jump= table), but there are some subtle differences...</div> <div>As said before, there&#39;s no outer loop, which saves some code and a= branch, but more importantly, on architectures that have branch target reg= isters, the compiler can schedule the load to the branch target earlier (wh= ile the rest of the &#39;opcode&#39; is being executed), which will give th= e processor more time to preload the instruction stream from the branch des= tination, which will reduce the impact of the branch misprediction.</div> <div><br></div><div>I&#39;m not sure what you&#39;re trying to say about th= e branch prediction, but this isn&#39;t a &#39;branch&#39; (nor is a switch= that produces a jump table), it&#39;s a jump, and it will never predict si= nce it doesn&#39;t have a binary target. Many architectures attempt to alle= viate this sort of unknown target penalty by introducing a branch target re= gister, which once loaded, will immediately start fetching opcodes from the= target.. the key for the processor is to load the target address into that= register as early as possible to hide the instruction fetch latency. Code = written in the style I illustrate will give the best possible outcome to th= at end.</div> </div> --0016e648ffa473163004b0641094--
Oct 28 2011
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--001636427114661ce704b06e2177
Content-Type: text/plain; charset=UTF-8

On 29 October 2011 10:55, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 On 29.10.2011 3:15, Manu wrote:

    This is instruction dispatch, the trick in the branch prediction
    that operates on per branch basis, thus single switch-jump based VM
    dispatch will mispredict jumps most of the time. I seen claims of up
    to 99% on average.
    If you place a whole switch at the end of every instruction code I'm
    not sure compiler will find it's way out of this mess, or even
    optimize it. I tried that with DMD, the problem is it have no idea
    how to use the *same* jump *table* for all of identical switches.


 I don't quite follow what you mean here... so forgive me if I'm way off,
 or if you're actually agreeing with me :)
 Yes, this piece of code is effectively identical to the switch that may
 fall at the top of an execute loop (assuming the switch compiles a jump
 table), but there are some subtle differences...
 As said before, there's no outer loop, which saves some code and a
 branch,

It doesn't save code - you replace unconditional jump to start of switch, with an extra load from table + indirect jump per VM opcode.

True, it may appear there is more code, but the pipeline is far more overlapped, it can load the jump target earlier, also there is still less work whichever way you slice it.
 but more importantly, on architectures that have branch target

 registers, the compiler can schedule the load to the branch target
 earlier (while the rest of the 'opcode' is being executed), which will
 give the processor more time to preload the instruction stream from the
 branch destination, which will reduce the impact of the branch
 misprediction.

I was thinking x86 here, but there are other processors that do not have branch target register. And depending on the pipeline length prediction it should be done earlier then loading of destination address.

I'm not sure what you mean here about pipeline length prediction... but with respect to x86, and other performance orientated architectures, I don't see how this could disadvantage the processor in any way. While an out-of-order processor like x86 may be able to look ahead, resolve an unconditional jump back to the top of a switch, and begin resolving the next target ahead of time, having the dispatch inline just means there is no unconditional jump, a couple less instructions to process. I'm not sure what you're trying to say about the branch prediction, but
 this isn't a 'branch' (nor is a switch that produces a jump table), it's
 a jump, and it will never predict since it doesn't have a binary target.

I'm not sure about terminology, but branch === jump, and both can be conditional. By switch-jump I mean tabulated indirect jump, that is target is loaded from table. It's an indirect jump and as such can go wherever it wants to. And yes it's still predicted, mostly on basis of "it will go to the same address as last time". Actually, branch benediction mechanisms I heard of don't know if some jump is conditional at all (that would complicate it), the end result is just a block of "branch/jump at address X is predicted to go to address Y".

I would say branch != jump; a branch is conditional (I think the term 'branch' implies conditionality), and is a candidate for prediction. Branch opcodes are more often than not to absolute targets too. A jump is unconditional, the branch prediction hardware has no effect on jump opcodes, and naturally it may be an absolute jump (performs more or less like a noop), or an indirect one, which is the worst case. cpu's will typically try and optimise this by prefetching the code at the earliest moment it can know the target. The mechanics of which are quite different for out-of-order processors, and architectures like PPC, which have a specific register indirect jump targets. An x86 or some other OoO processor may look ahead far enough to find the target in advance, but I'm not sure about that. Simplier architectures will not be able to do that (intel atom/ppc/arm/mips). Many architectures attempt to alleviate this sort of unknown target
 penalty by introducing a branch target register, which once loaded, will
 immediately start fetching opcodes from the target.. the key for the
 processor is to load the target address into that register as early as
 possible to hide the instruction fetch latency. Code written in the
 style I illustrate will give the best possible outcome to that end.

Nice to know, which ones by the way?

PowerPC being the most important one (only one that's relevant these days), ie, all games consoles. But also some popular microcontrollers. But I'm fairly sure most popular architectures will do the same thing where the jump target comes from a general register (ARM). The sooner the target is known, the better. I don't see any real disadvantage to this syntax. It's optional, has no side effects, and allows you to express something that can't be expressed any other way. I imagined it would be easy to implement, and it's definitely useful in certain circumstances, a handy tool to have, particularly when working on embedded platforms where compiler backends/optimisers are always weak and immature, and these sorts of performance considerations are more important (ie, games consoles, microcontrollers). --001636427114661ce704b06e2177 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 29 October 2011 10:55, Dmitry Olshansky <span= dir=3D"ltr">&lt;<a href=3D"mailto:dmitry.olsh gmail.com">dmitry.olsh gmail= .com</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"ma= rgin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div class=3D"im">On 29.10.2011 3:15, Manu wrote:<br> </div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-l= eft:1px #ccc solid;padding-left:1ex"><div class=3D"im"> <br> =C2=A0 =C2=A0This is instruction dispatch, the trick in the branch predict= ion<br> =C2=A0 =C2=A0that operates on per branch basis, thus single switch-jump ba= sed VM<br> =C2=A0 =C2=A0dispatch will mispredict jumps most of the time. I seen claim= s of up<br> =C2=A0 =C2=A0to 99% on average.<br> =C2=A0 =C2=A0If you place a whole switch at the end of every instruction c= ode I&#39;m<br> =C2=A0 =C2=A0not sure compiler will find it&#39;s way out of this mess, or= even<br> =C2=A0 =C2=A0optimize it. I tried that with DMD, the problem is it have no= idea<br> =C2=A0 =C2=A0how to use the *same* jump *table* for all of identical switc= hes.<br> <br> <br></div><div class=3D"im"> I don&#39;t quite follow what you mean here... so forgive me if I&#39;m way= off,<br> or if you&#39;re actually agreeing with me :)<br> Yes, this piece of code is effectively identical to the switch that may<br> fall at the top of an execute loop (assuming the switch compiles a jump<br> table), but there are some subtle differences...<br> As said before, there&#39;s no outer loop, which saves some code and a<br> branch,<br> </div></blockquote> <br> It doesn&#39;t save code - you replace unconditional jump to start of switc= h, with an extra load from table + indirect jump per VM opcode.</blockquote=
<div><br></div><div>True, it may appear there is more code, but the pipeli=

is still less work whichever way you slice it.</div> <div><br></div><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"= margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class= =3D"im"> but more importantly, on architectures that have branch target<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> registers, the compiler can schedule the load to the branch target<br> earlier (while the rest of the &#39;opcode&#39; is being executed), which w= ill<br> give the processor more time to preload the instruction stream from the<br> branch destination, which will reduce the impact of the branch<br> misprediction.<br> </blockquote> <br></div> I was thinking x86 here, but there are other processors that do not have br= anch target register. And depending on the pipeline length prediction it sh= ould be done earlier then loading of destination address.</blockquote> <div><br></div><div>I&#39;m not sure what you mean here about pipeline leng= th prediction... but with respect to x86, and other performance orientated = architectures, I don&#39;t see how this could disadvantage the processor in= any way. While an out-of-order processor like x86 may be able to look ahea= d, resolve an unconditional jump back to the top of a switch, and begin res= olving the next target ahead of time, having the dispatch inline just means= there is no unconditional jump, a couple less instructions to process.</di= v> <div><br></div><div><br></div><blockquote class=3D"gmail_quote" style=3D"ma= rgin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class=3D= "im"><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-le= ft:1px #ccc solid;padding-left:1ex"> I&#39;m not sure what you&#39;re trying to say about the branch prediction,= but<br> this isn&#39;t a &#39;branch&#39; (nor is a switch that produces a jump tab= le), it&#39;s<br> a jump, and it will never predict since it doesn&#39;t have a binary target= .<br> </blockquote> <br></div> I&#39;m not sure about terminology, but branch =3D=3D=3D jump, and both can= be conditional. By switch-jump I mean tabulated indirect jump, that is tar= get is loaded from table.<br> It&#39;s an indirect jump and as such can go wherever it wants to. And yes = it&#39;s still predicted, mostly on basis of &quot;it will go to the same a= ddress as last time&quot;. Actually, branch benediction mechanisms I heard = of don&#39;t know if some jump is conditional at all (that would complicate= it), the end result is just a block of &quot;branch/jump at address X is p= redicted to go to address Y&quot;.</blockquote> <div><br></div><div>I would say branch !=3D jump; a branch is conditional (= I think the term &#39;branch&#39; implies conditionality), and is a candida= te for prediction. Branch opcodes are more often than not to absolute targe= ts too. A jump is unconditional, the branch prediction hardware has no effe= ct on jump opcodes, and naturally it may be an absolute jump (performs more= or less like a noop), or an indirect one, which is the worst case. cpu&#39= ;s will typically try and optimise this by prefetching the code at the earl= iest moment it can know the target. The mechanics of which are quite differ= ent for out-of-order processors, and architectures like PPC, which have a s= pecific register indirect jump targets.</div> <div>An x86 or some other OoO processor may look ahead far enough to find t= he target in advance, but I&#39;m not sure about that. Simplier architectur= es will not be able to do that (intel atom/ppc/arm/mips).</div><div><br> </div><div><br></div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 = 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class=3D"im"> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> Many architectures attempt to alleviate this sort of unknown target<br> penalty by introducing a branch target register, which once loaded, will<br=

processor is to load the target address into that register as early as<br> possible to hide the instruction fetch latency. Code written in the<br> style I illustrate will give the best possible outcome to that end.<br> </blockquote> <br></div> Nice to know, which ones by the way?<br></blockquote><div><br></div><div>Po= werPC being the most important one (only one that&#39;s relevant these days= ), ie, all games consoles. But also some popular microcontrollers.</div> <div>But I&#39;m fairly sure most popular architectures will do the same th= ing where the jump target comes from a general register (ARM). The sooner t= he target is known, the better.</div><div><br></div><div><br></div><div> I don&#39;t see any real disadvantage to this syntax. It&#39;s optional, ha= s no side effects, and allows you to express something that can&#39;t be ex= pressed any other way.</div><div>I imagined it would be easy to implement, = and it&#39;s definitely useful in certain circumstances, a handy tool to ha= ve, particularly when working on embedded platforms where compiler backends= /optimisers are always weak and immature, and these sorts of performance co= nsiderations are more important (ie, games consoles, microcontrollers).</di= v> </div> --001636427114661ce704b06e2177--
Oct 29 2011
prev sibling parent bcs <bcs example.com> writes:
On 10/28/2011 09:30 AM, Manu wrote:
 Hi people.

 I'd like to propose support for taking the address of code labels, and
 supporting variable goto statements.
 This is a feature I have found extremely useful, implemented as a GCC
 specific extension.

 I've used this to get great speedups and simplify code while writing
 emulators/vm's. Perhaps also useful in efficient state machine type code
 too...


 Simple example:

 void *opcodes[] = { &OP_ADDA, &OP_SUBA, &OP_MOVA, &OP_JMPA, &OP_etc... };

 void exec()
 {
    // begin execution
    goto opcodes[ pProgram[regs.PC++] ];

 OP_ADDA:
    regs.A += pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_SUBA:
    regs.A -= pProgram[regs.PC++];
    goto opcodes[  ];

 OP_MOVA:
    regs.A = pProgram[regs.PC++];
    goto opcodes[ pProgram[regs.PC++] ];

 OP_JMPA:
    regs.PC = regs.A;
    goto opcodes[ pProgram[regs.PC++] ];

 OP_etc:
    ...
    goto opcodes[ pProgram[PC++] ];
 }

 Notice how this structure completely eliminates branch logic, control
 statements, etc.

 Note, GCC code labels are void*, but in D, perhaps some special code
 label pointer type should be invented for type safety...
 One may also expect that function pointers may also be implicitly cast
 to this generalised code pointer type, which might be useful in some
 code where a naked function pointer is used to implement some custom
 calling convention.

For the given example, this could be re-cases via a switch statement that ends each case with a "goto case pProgram[regs.PC++];". That is assuming non-const expression are allowed as the expression. The end result should be identical.
Oct 29 2011