digitalmars.D - Computed gotos on Reddit

bearophile (23/23) Jul 22 2012 A discussion appeared few days ago on Reddit, about computed

deadalnix (6/9) Jul 22 2012 The presented example is insanely unsafe. By giving invalid input, you

Dmitry Olshansky (13/29) Jul 22 2012 The trick is to check the whole bytecode first then execute. The bulk
=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (6/21) Jul 23 2012 You should always verify your bytecode before executing it anyway.
bearophile (36/41) Jul 23 2012 Thank you for your answer. We have not yet designed how D

deadalnix (2/41) Jul 24 2012 That would make sense.

Tobias Pankrath (2/4) Jul 23 2012 For another use case: dotat.at/tmp/gll.pdf
Walter Bright (2/5) Jul 23 2012 D already has it: http://dlang.org/statement.html#FinalSwitchStatement

bearophile (4/6) Jul 23 2012 Do you have a proof? (Show me asm code)

bearophile (104/105) Jul 23 2012 Just to be more clear, what I mean is that given a D program
Walter Bright (7/10) Jul 23 2012 Since the final switch does not allow a 'default' case, the check can be...

bearophile (26/31) Jul 23 2012 I understand the difference between what compilers are able to do

Walter Bright (14/27) Jul 23 2012 It is easy. Note that the compiler already generates a jump table, this ...

bearophile (4/5) Jul 23 2012 Thank you for your answers Walter :-)

Walter Bright (2/5) Jul 23 2012 You're welcome.

Dmitry Olshansky (44/56) Jul 24 2012 Sorry, but it's just wrong. The trick is that there is no need for jump

Walter Bright (4/17) Jul 24 2012 I believe you can do this with:

Dmitry Olshansky (13/34) Jul 24 2012 And how is pc is supposed to be an opcode? It's a counter after all...

Walter Bright (9/23) Jul 24 2012 switch (pc++)

Dmitry Olshansky (24/48) Jul 24 2012 It's not. Let's get to assembly then. It's an illustration as I'm no

Walter Bright (12/42) Jul 24 2012 jmp jumptable[ecx]

Dmitry Olshansky (132/161) Jul 24 2012 Damn, it's not the same. If in the above ecx is pc++, here it it

Walter Bright (10/170) Jul 24 2012 Sorry, I have no idea why it is not the same. jumptable is a static arra...

Dmitry Olshansky (35/82) Jul 24 2012 Code was assumed to be in ebx obviously. BTW Static array for jump table...

Walter Bright (16/101) Jul 24 2012 The jump table can be in the code segment, which is not possible for a u...

Dmitry Olshansky (15/58) Jul 24 2012 Yes, one read + one jump.

Walter Bright (6/10) Jul 25 2012 All right, that's the piece that was missing.

Dmitry Olshansky (8/20) Jul 25 2012 While it may sound possible I'm certain that in most interesting cases
Don Clugston (13/25) Jul 25 2012 Another interesting optimization with "final switch" would be if each

Dmitry Olshansky (6/36) Jul 25 2012 Could be interesting if some other simple progressions could be

Don Clugston (4/43) Jul 25 2012 And an unconditional branch takes no time (only one 1 uop) on modern

Walter Bright (14/15) Jul 25 2012 Rethinking your idea a bit...

Don Clugston (3/18) Jul 25 2012 Very nice. The jumps in the jump table take effectively zero cycles.

Dmitry Olshansky (5/30) Jul 25 2012 Looks neat. I'd more then willing to test how it affects my tiny VM in

Walter Bright (2/30) Jul 25 2012 Is it possible you could code it up and test it using inline asm?

Dmitry Olshansky (26/60) Jul 25 2012 Mm... I could try. So the trick is to add say this:

Walter Bright (3/27) Jul 25 2012 I wouldn't worry about it. EAX is good.

Dmitry Olshansky (20/26) Jul 25 2012 OK. I'm almost there, here is what I have:

Walter Bright (4/30) Jul 25 2012 I was afraid of that. You may have to approximate it by loading the addr...

Dmitry Olshansky (13/50) Jul 25 2012 like this ?

Dmitry Olshansky (4/44) Jul 25 2012 Corrected typos. Still it doesn't take off.

Dmitry Olshansky (6/12) Jul 25 2012 I failed to load it in any register or interact with it in any way.

Walter Bright (14/24) Jul 25 2012 How to get an address:

Don Clugston (7/36) Jul 26 2012 But doing that screws up the CPU"s stack prediction so badly that it

Dmitry Olshansky (36/42) Jul 26 2012 BTW this seems to be a roundabout way to get address of label

Walter Bright (3/6) Jul 26 2012 It can be done, it's just that nobody has done the implementation in the...

Dmitry Olshansky (4/10) Jul 26 2012 Great! I guess I should file another enhancement request?

Dmitry Olshansky (4/15) Jul 26 2012 Filed: http://d.puremagic.com/issues/show_bug.cgi?id=8448

Walter Bright (2/3) Jul 26 2012 Thank you.

=?UTF-8?B?QWxpIMOHZWhyZWxp?= (3/9) Jul 25 2012 I hope it is just that typo! :p

Walter Bright (76/77) Jul 25 2012 I dummied up some code to do it:

Walter Bright (3/9) Jul 25 2012 I tried replacing these with 2 byte jmp instructions, instead of 5 byte ...
Dmitry Olshansky (5/82) Jul 26 2012 Do the above in loop. And more cases of course. Something around 40

Walter Bright (28/44) Jul 26 2012 Here's my entire test program. It runs a consistent 5 to 10% slower with...

Dmitry Olshansky (4/25) Jul 26 2012 Thanks. I'll play with it a bit if time permits.

David Piepgrass (7/15) Jul 25 2012 Man this has been frustrating to read. I understood what Dmitry

"bearophile" <bearophileHUGS lycos.com> writes:

A discussion appeared few days ago on Reddit, about computed 
gotos:

http://www.reddit.com/r/programming/comments/wld04/eli_benderskys_website_computed_goto_for/

http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables/

Computed gotos are useful to write interpreters. Interpreters are 
a niche but important purpose of system languages like D. 
Computed gotos are also useful to implement certain finite state 
machines (like some used in computational biology).

The GCC back-end supports computed gotos very well, and recent 
versions of LLVM support them decently (but not perfectly). This 
means implementing those gotos in LDC2 and GDC is probably not 
too much hard. DMD back-end probably don't support them (and who 
knows how much work it takes to implement that).

So maybe someday people will add a nonstandard extension to D, 
for GDC and/or LDC to support computed gotos. I hope they will 
use the same syntax, but generally nonstandard extension are a 
portability pain, and some people (purists, language lawyers, 
etc) seem to hate them.

So I have suggested to define a standard syntax for computed 
gotos in D, so LDC and GDC will avoid inventing a different 
syntax, so only DMD is the nonsupporting compiler.

Bye,
bearophile

Jul 22 2012

deadalnix <deadalnix gmail.com> writes:

On 23/07/2012 01:35, bearophile wrote:
 A discussion appeared few days ago on Reddit, about computed gotos:

 http://www.reddit.com/r/programming/comments/wld04/eli_benderskys_website_computed_goto_for/


 http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables/


The presented example is insanely unsafe. By giving invalid input, you 
can basically branch randomly.

The check added by the switch is what is required to make it safe, so it 
isn't faster at the end.

The case in the article isn't very strong.

Jul 22 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 23-Jul-12 07:05, deadalnix wrote:
 On 23/07/2012 01:35, bearophile wrote:
 A discussion appeared few days ago on Reddit, about computed gotos:

 http://www.reddit.com/r/programming/comments/wld04/eli_benderskys_website_computed_goto_for/



 http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables/


 The presented example is insanely unsafe. By giving invalid input, you
 can basically branch randomly.

 The check added by the switch is what is required to make it safe, so it
 isn't faster at the end.

 The case in the article isn't very strong.

The  trick is to check the whole bytecode first then execute. The bulk 
of time is spent in loops anyway.
But yes it's not particularly safe.

 The case in the article isn't very strong.

 From a glance it's a well known case of threaded code interpreters. 
That is the only fast *portable* interpreters as of now.

So the case is strong but hardly anything new ;)

P.S. sorry for mailing you directly, updated firefox changed interface 
*again* :)

-- 
Dmitry Olshansky


-- 
Dmitry Olshansky

Jul 22 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <alex lycus.org> writes:

On 23-07-2012 05:05, deadalnix wrote:
 On 23/07/2012 01:35, bearophile wrote:
 A discussion appeared few days ago on Reddit, about computed gotos:

 http://www.reddit.com/r/programming/comments/wld04/eli_benderskys_website_computed_goto_for/



 http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables/


 The presented example is insanely unsafe. By giving invalid input, you
 can basically branch randomly.

 The check added by the switch is what is required to make it safe, so it
 isn't faster at the end.

 The case in the article isn't very strong.

You should always verify your bytecode before executing it anyway.

-- 
Alex Rønne Petersen
alex lycus.org
http://lycus.org

Jul 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

deadalnix:

 The presented example is insanely unsafe. By giving invalid 
 input, you can basically branch randomly.

 The check added by the switch is what is required to make it 
 safe, so it isn't faster at the end.

 The case in the article isn't very strong.

Thank you for your answer. We have not yet designed how D 
computed gotos are, both in syntax and semantics. This means that 
maybe there are ways to make them a little safer, in non-release 
mode.

Part of the GCC-C code in the article was:

static void* dispatch_table[] = {
         &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
         &&do_div2, &&do_add7, &&do_neg};
#define DISPATCH() goto *dispatch_table[code[pc++]]

In D there are no preprocessor macros and the array bounds are 
enforced at run-time, this avoids some possible bugs of similar 
code.

Also, instead of void* maybe a different type can be introduced, 
so only label addresses of the current function can be put in 
dispatch_table and pointer variables given to goto. This 
statically avoids some other possible bugs.

It's true that in the end computed gotos are unsafe, you 
generally need to validate the bytecode/state transition before 
feeding them to the computed goto, and in D (unlike GCC-C) you 
are able to forbid the usage of computed gotos in  safe 
functions/methods. This doesn't avoid bugs, but helps contain 
them in delimited places. We have the  safe/ trusted/ system for 
a purpose.

In D modules like std.reflection suggested by Andrei are useful, 
because D is usable to write lot of high-level code too, I use a 
highly functional D style in some little D "scripts". But D is 
also a system language, and sometimes I desire to use it for 
tasks where C is used. GCC has computed gotos since many years 
(and Clang supports them) and as you see in the linked article 
some very common language interpreters you see around use 
computed gotos if the C compiler supports them. This is not a 
must-have feature for D, but dismissing it just because it's not 
safe is a bad idea.

Bye,
bearophile

Jul 23 2012

deadalnix <deadalnix gmail.com> writes:

Le 23/07/2012 18:56, bearophile a écrit :
 deadalnix:

 The presented example is insanely unsafe. By giving invalid input, you
 can basically branch randomly.

 The check added by the switch is what is required to make it safe, so
 it isn't faster at the end.

 The case in the article isn't very strong.

 Thank you for your answer. We have not yet designed how D computed gotos
 are, both in syntax and semantics. This means that maybe there are ways
 to make them a little safer, in non-release mode.

 Part of the GCC-C code in the article was:

 static void* dispatch_table[] = {
 &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
 &&do_div2, &&do_add7, &&do_neg};
 #define DISPATCH() goto *dispatch_table[code[pc++]]

 In D there are no preprocessor macros and the array bounds are enforced
 at run-time, this avoids some possible bugs of similar code.

 Also, instead of void* maybe a different type can be introduced, so only
 label addresses of the current function can be put in dispatch_table and
 pointer variables given to goto. This statically avoids some other
 possible bugs.

 It's true that in the end computed gotos are unsafe, you generally need
 to validate the bytecode/state transition before feeding them to the
 computed goto, and in D (unlike GCC-C) you are able to forbid the usage
 of computed gotos in  safe functions/methods. This doesn't avoid bugs,
 but helps contain them in delimited places. We have the
  safe/ trusted/ system for a purpose.

 In D modules like std.reflection suggested by Andrei are useful, because
 D is usable to write lot of high-level code too, I use a highly
 functional D style in some little D "scripts". But D is also a system
 language, and sometimes I desire to use it for tasks where C is used.
 GCC has computed gotos since many years (and Clang supports them) and as
 you see in the linked article some very common language interpreters you
 see around use computed gotos if the C compiler supports them. This is
 not a must-have feature for D, but dismissing it just because it's not
 safe is a bad idea.

 Bye,
 bearophile

That would make sense.

Jul 24 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Sunday, 22 July 2012 at 23:35:34 UTC, bearophile wrote:
 A discussion appeared few days ago on Reddit, about computed 
 gotos:

For another use case: dotat.at/tmp/gll.pdf

Jul 23 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/22/2012 4:35 PM, bearophile wrote:
 Computed gotos are useful to write interpreters. Interpreters are a niche but
 important purpose of system languages like D. Computed gotos are also useful to
 implement certain finite state machines (like some used in computational
biology).

D already has it: http://dlang.org/statement.html#FinalSwitchStatement

Jul 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 D already has it: 
 http://dlang.org/statement.html#FinalSwitchStatement

Do you have a proof? (Show me asm code)

Bye,
bearophile

Jul 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

 Do you have a proof? (Show me asm code)

Just to be more clear, what I mean is that given a D program 
equivalent to the C code shown in the article:


int interp_cgoto(unsigned char* code, int initval) {
     static const void* dispatch_table[] = {
         &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
         &&do_div2, &&do_add7, &&do_neg};

     #define DISPATCH() goto *dispatch_table[code[pc++]]

     int pc = 0;
     int val = initval;

     DISPATCH();

     do_halt:
         return val;
     do_inc:
         val++;
         DISPATCH();
     do_dec:
         val--;
         DISPATCH();
     do_mul2:
         val *= 2;
         DISPATCH();
     do_div2:
         val /= 2;
         DISPATCH();
     do_add7:
         val += 7;
         DISPATCH();
     do_neg:
         val = -val;
         DISPATCH();
}

int main() {return 0;}


Is a 32 bit D compiler producing asm (with performance) similar 
to:


_interp_cgoto:
     movl    4(%esp), %edx
     movzbl  (%edx), %eax
     addl    $1, %edx
     movl    _dispatch_table.1363(,%eax,4), %ecx
     movl    8(%esp), %eax
     jmp *%ecx
     .p2align 4,,7
L3:
     rep
     ret
     .p2align 4,,7
L4:
     movzbl  (%edx), %ecx
     addl    $1, %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     .p2align 4,,7
L5:
     addl    $1, %edx
     jmp *%ecx
     .p2align 4,,7
L6:
     movzbl  (%edx), %ecx
     subl    $1, %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     addl    $1, %edx
     jmp *%ecx
     .p2align 4,,7
L7:
     movzbl  (%edx), %ecx
     addl    %eax, %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     addl    $1, %edx
     jmp *%ecx
     .p2align 4,,7
L8:
     movl    %eax, %ecx
     shrl    $31, %ecx
     addl    %ecx, %eax
     movzbl  (%edx), %ecx
     sarl    %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     addl    $1, %edx
     jmp *%ecx
     .p2align 4,,7
L9:
     movzbl  (%edx), %ecx
     addl    $7, %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     addl    $1, %edx
     jmp *%ecx
     .p2align 4,,7
L10:
     movzbl  (%edx), %ecx
     negl    %eax
     movl    _dispatch_table.1363(,%ecx,4), %ecx
     addl    $1, %edx
     jmp *%ecx

     .section .rdata,"dr"
     .align 4
_dispatch_table.1363:
     .long   L3
     .long   L4
     .long   L6
     .long   L7
     .long   L8
     .long   L9
     .long   L10


Bye,
bearophile

Jul 23 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/23/2012 2:25 PM, bearophile wrote:
 Walter Bright:

 D already has it: http://dlang.org/statement.html#FinalSwitchStatement

 Do you have a proof? (Show me asm code)

Since the final switch does not allow a 'default' case, the check can be 
omitted, and the generated code is a simple index-jump, just like the computed 
goto example.

dmd currently doesn't do that, but that's not the language's fault, it's a 
quality of implementation issue.

The point is, computed goto offers nothing that final switch doesn't.

Jul 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 dmd currently doesn't do that, but that's not the language's 
 fault, it's a quality of implementation issue.

I understand the difference between what compilers are able to do 
(today or in future), and what the language specs allow compiler 
writers to do. So in theory I agree.

In practice there is also the well known fallacy of the 
"sufficiently smart compiler":
http://c2.com/cgi/wiki?SufficientlySmartCompiler

This fallacy implies that if you want to actually see a compiler 
able to perform a certain optimization, such optimization must be 
rather "easy", this means it must be easy for the compiler to 
infer as true all the conditions necessary to apply that 
optimization (and then you need someone to actually implement it, 
in a community as small as the D one optimizations can't be top 
priority).

The other problem with optimizations is that often if you can't 
rely on them, that means you can't be certain they are used in 
the code you are writing, then it's like they don't exist. A good 
example of this is the Scheme standard requiring all Scheme 
compilers to implement the tail call optimization.


 Since the final switch does not allow a 'default' case, the 
 check can be omitted, and the generated code is a simple 
 index-jump, just like the computed goto example.

You have seen the asm I have shown in the next post. You see 
those "jmp *%ecx" at the end of each case. Computed gotos in this 
case are not just a single index-jump, there is an index-jump at 
the end of each case. Is your future hypothetical D compiler able 
to do that?

Bye,
bearophile

Jul 23 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/23/2012 3:23 PM, bearophile wrote:
 This fallacy implies that if you want to actually see a compiler able to
perform
 a certain optimization, such optimization must be rather "easy", this means it
 must be easy for the compiler to infer as true all the conditions necessary to
 apply that optimization (and then you need someone to actually implement it, in
 a community as small as the D one optimizations can't be top priority).

It is easy. Note that the compiler already generates a jump table, this would 
just leave off the default check. In fact, the compiler could do a better job 
doing data flow analysis with final switch than computed goto, because the jump 
targets are constrained and are all known at compile time.

BTW, if computed gotos were put into the language, one would also require 
someone to implement it. You're not saving anything.


 The other problem with optimizations is that often if you can't rely on them,
 that means you can't be certain they are used in the code you are writing, then
 it's like they don't exist. A good example of this is the Scheme standard
 requiring all Scheme compilers to implement the tail call optimization.

Sorry, but I cannot buy this as an argument for loading in more language 
features. Even worse, requiring that a certain semantic construct be
implemented 
in a certain way precludes the implementer from finding an even better way to
do it.


 You see those "jmp *%ecx"
 at the end of each case. Computed gotos in this case are not just a single
 index-jump, there is an index-jump at the end of each case. Is your future
 hypothetical D compiler able to do that?

Of course. There's no magic technology there. It's just a minor variation on
the 
well known loop rotation technique (which dmd does do). Computed gotos are a 
rather hackish design that goes around the horn rather than doing what is
needed 
directly.

Jul 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 Of course. There's no magic technology there.

Thank you for your answers Walter :-)

Bye,
bearophile

Jul 23 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/23/2012 5:05 PM, bearophile wrote:
 Walter Bright:

 Of course. There's no magic technology there.

 Thank you for your answers Walter :-)

You're welcome.

Jul 23 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 24-Jul-12 02:06, Walter Bright wrote:
 On 7/23/2012 2:25 PM, bearophile wrote:
 Walter Bright:

 D already has it: http://dlang.org/statement.html#FinalSwitchStatement

 Do you have a proof? (Show me asm code)

 Since the final switch does not allow a 'default' case, the check can be
 omitted, and the generated code is a simple index-jump, just like the
 computed goto example.

Bounds check is actually not that important.

 dmd currently doesn't do that, but that's not the language's fault, it's
 a quality of implementation issue.

 The point is, computed goto offers nothing that final switch doesn't.


Sorry, but it's just wrong. The trick is that there is no need for jump 
table at all. That saves one memory access to read jump table entry and 
saves on cache (need only one "table" - bytecode itself).

Now to the heart of it all - threaded interpreter looks like this (I'll 
use switches for clarity):

switch(opcode){
case OP1:
	... //do instruction 1
	//+ decode next
	opcode = code[pc++];
	switch(opcode){
	case OP1:
		// jump to case OP1 above
	... etc.
	}
//no break as it will jump away
case OP2:
	... //do instruction 2
	//+ decode next
	opcode = code[pc++];
	switch(opcode){
	case OP1:
		// jump to case OP1 above e.g. by planting label on it
	... etc.
	}
....
}

Now if I use final switches there is still:

A) jump table per switch, or maybe less but there is no guarantees
  (= depend on another optimization that's not here)
B) it's an ugly and a roundabout way to do this

However I think that always requiring tail call optimization or 
providing syntax to enforce it would work:

void op_1()
{
	...//some code for instruction 1
	opcode = cast(function void ())code[pc++];
	goto opcode(); //opcode is pointer to function op_xx
}
//can do without goto fn() iff tail call is GUARANTEED

-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 12:58 AM, Dmitry Olshansky wrote:
 Now if I use final switches there is still:

 A) jump table per switch, or maybe less but there is no guarantees
   (= depend on another optimization that's not here)
 B) it's an ugly and a roundabout way to do this

 However I think that always requiring tail call optimization or providing
syntax
 to enforce it would work:

 void op_1()
 {
      ...//some code for instruction 1
      opcode = cast(function void ())code[pc++];
      goto opcode(); //opcode is pointer to function op_xx
 }
 //can do without goto fn() iff tail call is GUARANTEED

I believe you can do this with:

     switch (pc++)

and there are the same number of indirections.

Jul 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 24-Jul-12 14:04, Walter Bright wrote:
 On 7/24/2012 12:58 AM, Dmitry Olshansky wrote:
 Now if I use final switches there is still:

 A) jump table per switch, or maybe less but there is no guarantees
   (= depend on another optimization that's not here)
 B) it's an ugly and a roundabout way to do this

 However I think that always requiring tail call optimization or
 providing syntax
 to enforce it would work:

 void op_1()
 {
      ...//some code for instruction 1
      opcode = cast(function void ())code[pc++];
      goto opcode(); //opcode is pointer to function op_xx
 }
 //can do without goto fn() iff tail call is GUARANTEED

 I believe you can do this with:

      switch (pc++)

 and there are the same number of indirections.

And how is pc is supposed to be an opcode? It's a counter after all...

The trick is that it must be switch(code[pc++])...
It's just if code contains function pointers (or goto jump pointers) 
then separate jump table table is not needed:

code = [ &op_1, &op_2, &op_1, ... ]; //generated somewhere
pc = 0;
code[pc]();
// op_1 increments ( or changes somehow) pc, decodes next op and jumps to it


So I still of opinion that enforced tail call is the cleanest way to 
support this idiom.

-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 6:58 AM, Dmitry Olshansky wrote:
 void op_1()
 {
      ...//some code for instruction 1
      opcode = cast(function void ())code[pc++];
      goto opcode(); //opcode is pointer to function op_xx
 }
 //can do without goto fn() iff tail call is GUARANTEED

 I believe you can do this with:

      switch (pc++)

 and there are the same number of indirections.

 And how is pc is supposed to be an opcode? It's a counter after all...

     switch (pc++)

and
     goto code[pc++]

are the same (same as in same number of indirections).

     switch (code[pc++])

and

     goto code[pc++]()

are the same, too.

Jul 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 24-Jul-12 21:03, Walter Bright wrote:
 On 7/24/2012 6:58 AM, Dmitry Olshansky wrote:
 void op_1()
 {
      ...//some code for instruction 1
      opcode = cast(function void ())code[pc++];
      goto opcode(); //opcode is pointer to function op_xx
 }
 //can do without goto fn() iff tail call is GUARANTEED

 I believe you can do this with:

      switch (pc++)

 and there are the same number of indirections.

 And how is pc is supposed to be an opcode? It's a counter after all...

      switch (pc++)

 and
      goto code[pc++]

 are the same (same as in same number of indirections).

      switch (code[pc++])

 and

      goto code[pc++]()

 are the same, too.

It's not. Let's get to assembly then. It's an illustration as I'm no 
expert and may have made some illegal shortcuts in this listing.

goto code[pc++]() is roughly:

mov ecx, [ebx]
inc ebx
jmp [ecx]

switch(code[pc++]) is:

mov ecx, [ebx]
inc ebx
mov ecx, [edx+ecx] // assuming jump table is at edx
jump [ecx]

If you count only jumps, then yes, the same number of indirect jumps. 
BUT note the use of extra register to point to the table & extra read of 
jump table contents. (BTW I assumed jump table address is loaded in 
register, a luxurious assumption esp. on 32bit).

Again, the biggest practical limitation of switches (loosing some 
performance hurts but not show stopper) is that last time I checked dmd 
doesn't try to merge equivalent jump tables.

Thus I can't put VM dispatch switch at the of each branch of main opcode 
switch (see my earlier posts) to help branch predictor. It just spawns 
ton of new tables, of course it has lower performance and wastes data cache.

-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
 are the same (same as in same number of indirections).

      switch (code[pc++])

 and

      goto code[pc++]()

 are the same, too.

 It's not. Let's get to assembly then. It's an illustration as I'm no expert and
 may have made some illegal shortcuts in this listing.

 goto code[pc++]() is roughly:

 mov ecx, [ebx]
 inc ebx
 jmp [ecx]

    jmp code[ecx]


 switch(code[pc++]) is:

 mov ecx, [ebx]
 inc ebx
 mov ecx, [edx+ecx] // assuming jump table is at edx
 jump [ecx]

    jmp jumptable[ecx]


 If you count only jumps, then yes, the same number of indirect jumps. BUT note
 the use of extra register to point to the table & extra read of jump table
 contents. (BTW I assumed jump table address is loaded in register, a luxurious
 assumption esp. on 32bit).

You don't need an extra register for the jump table address - and if you did, 
you'd need it for both, as the table needs to be referenced somehow.

Addressing modes have long been "for free" in turns of runtime cost, so

    [ECX]

and

    offset[ECX]

are the same cost.


 Again, the biggest practical limitation of switches (loosing some performance
 hurts but not show stopper) is that last time I checked dmd doesn't try to
merge
 equivalent jump tables.

I can't think of an example of this.


 Thus I can't put VM dispatch switch at the of each branch of main opcode switch
 (see my earlier posts) to help branch predictor. It just spawns ton of new
 tables, of course it has lower performance and wastes data cache.

Please post source code example so I understand what you mean.

Jul 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 01:40, Walter Bright wrote:
 On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
 are the same (same as in same number of indirections).

      switch (code[pc++])

 and

      goto code[pc++]()

 are the same, too.

 It's not. Let's get to assembly then. It's an illustration as I'm no
 expert and
 may have made some illegal shortcuts in this listing.

 goto code[pc++]() is roughly:

 mov ecx, [ebx]
 inc ebx
 jmp [ecx]

     jmp code[ecx]

There is no code just jump [ecx]. The offset is included already.

 switch(code[pc++]) is:

 mov ecx, [ebx]
 inc ebx
 mov ecx, [edx+ecx] // assuming jump table is at edx
 jump [ecx]

     jmp jumptable[ecx]

Damn, it's not the same. If in the above ecx is pc++, here it it 
code[pc++] i.e. needs to be loaded. The whole point. And yes, I *didn't* 
object that it's still 1 JUMP. I'm more focused on extra work being done.

 Please post source code example so I understand what you mean.

OK. It sure gets confusing.
Here is an article that shows the big picture of to what I want to do: 
http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
It contains some advanced techniques that are out of scope of current 
topic but Introduction & Background are short and full of relevant facts.

And for the sample code here it is, casts are ommited for brevity.

First one is "if I had a way to have enforced tail call to function or 
take address of label".

Let's assume labels:

//GLOBALS
size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's 
DMDscript ;)

bool jitted = false;

void run(size_t pc)
{
	//so here bytecode is normal bytecode if jitted != true
	//for simplicity sake it starts with some number e.g.:
	// 0 - op_1
	// 1 - op_2, etc.
	if(!jitted)
	{
		int i=0;
		//1:1 map of each label that servers specific opcode
		static tabulated_ops[] =
			[ &L_op1, &L_op2, &L_op3, ... ];

		while(notEndOfProgram(bytecode[i]) )
		{
			size_t n = bytecode[i];
			//store real target
			bytecode[i] = tabulated_ops[n];
			//advance by some size, skipping operands etc.
			i += opSize(n);
		}
	}

	//interpret:
	goto bytecode[pc];
	//<---- never gets here
L_op1:
	//do my op1 thing
	pc += x;
//DISPATH:
	goto bytecode[pc]; // load address at pc & jump to it
L_op2:
	//do some other thing
	pc += y;
//DISPATH:

	goto bytecode[pc]; // load address at pc & jump to it
L_op3:
	//my other op, jumps back
	pc -= ...;
//DISPATH:

	goto bytecode[pc];  // load address at pc & jump to it
...
L_opN:
	...
	goto bytecode[pc];  // load address at pc & jump to it
}



Now the same thing with switches.
And before you ask -
NO! Single switch won't do as it would have successful branch prediction 
rate of as bad as about 2% (see the article!). The more opcode types the 
worse prediction is.

Anyway here it goes:

//GLOBALS
size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's 
DMDscript ;)

void run(size_t pc)
{
	//here we don't JIT to real addresses beforehand
	//as jump tables somehow should be good enough
	// Okay...
	
	//interpret:
	switch(bytecode[pc])
	{
	L_op1:
	case 0:
		//do my op1 thing
		pc += x;
	//DISPATCH:
	//here comes trouble - 1st of N nearly identical tables??
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	L_op2:
	case 1:
		//do some other thing
		pc += y;
	//DISPATCH:
	//here comes trouble - 2nd table?
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	L_op3:
	case 2:
		//my other op, jumps back
		pc -= ...;
	//DISPATCH:
	//here comes trouble - 3rd table?
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
...
	L_opN:
	case N-1:
	...
	//DISPATCH:
	//here comes trouble Nth table ... time to count overhead
		switch(bytecode[pc])
		{
		case 0:	goto L_op1;
		case 1: goto L_op2;
		...
		}
	}//end of giant switch
}


-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 01:40, Walter Bright wrote:
 On 7/24/2012 12:50 PM, Dmitry Olshansky wrote:
 are the same (same as in same number of indirections).

      switch (code[pc++])

 and

      goto code[pc++]()

 are the same, too.

 It's not. Let's get to assembly then. It's an illustration as I'm no
 expert and
 may have made some illegal shortcuts in this listing.

 goto code[pc++]() is roughly:

 mov ecx, [ebx]
 inc ebx
 jmp [ecx]

     jmp code[ecx]

 There is no code just jump [ecx]. The offset is included already.

I'm not seeing where "code" is in the asm code you presented.

 switch(code[pc++]) is:

 mov ecx, [ebx]
 inc ebx
 mov ecx, [edx+ecx] // assuming jump table is at edx
 jump [ecx]

     jmp jumptable[ecx]

 Damn, it's not the same. If in the above ecx is pc++, here it it code[pc++]
i.e.
 needs to be loaded. The whole point. And yes, I *didn't* object that it's still
 1 JUMP. I'm more focused on extra work being done.

Sorry, I have no idea why it is not the same. jumptable is a static array, and 
so does not need to be loaded into a register. And code[] needs to be loaded 
from somewhere, and it looks like it's omitted from your example.



 Please post source code example so I understand what you mean.

 OK. It sure gets confusing.
 Here is an article that shows the big picture of to what I want to do:
 http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
 It contains some advanced techniques that are out of scope of current topic but
 Introduction & Background are short and full of relevant facts.

 And for the sample code here it is, casts are ommited for brevity.

 First one is "if I had a way to have enforced tail call to function or take
 address of label".

 Let's assume labels:

 //GLOBALS
 size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)

 bool jitted = false;

 void run(size_t pc)
 {
      //so here bytecode is normal bytecode if jitted != true
      //for simplicity sake it starts with some number e.g.:
      // 0 - op_1
      // 1 - op_2, etc.
      if(!jitted)
      {
          int i=0;
          //1:1 map of each label that servers specific opcode
          static tabulated_ops[] =
              [ &L_op1, &L_op2, &L_op3, ... ];

          while(notEndOfProgram(bytecode[i]) )
          {
              size_t n = bytecode[i];
              //store real target
              bytecode[i] = tabulated_ops[n];
              //advance by some size, skipping operands etc.
              i += opSize(n);
          }
      }

      //interpret:
      goto bytecode[pc];
      //<---- never gets here
 L_op1:
      //do my op1 thing
      pc += x;
 //DISPATH:
      goto bytecode[pc]; // load address at pc & jump to it
 L_op2:
      //do some other thing
      pc += y;
 //DISPATH:

      goto bytecode[pc]; // load address at pc & jump to it
 L_op3:
      //my other op, jumps back
      pc -= ...;
 //DISPATH:

      goto bytecode[pc];  // load address at pc & jump to it
 ...
 L_opN:
      ...
      goto bytecode[pc];  // load address at pc & jump to it
 }



 Now the same thing with switches.
 And before you ask -
 NO! Single switch won't do as it would have successful branch prediction rate
of
 as bad as about 2% (see the article!). The more opcode types the worse
 prediction is.

I see what you mean, but this could be done with final switch by the compiler, 
as I explained to bearophile.


 Anyway here it goes:

 //GLOBALS
 size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's DMDscript ;)

 void run(size_t pc)
 {
      //here we don't JIT to real addresses beforehand
      //as jump tables somehow should be good enough
      // Okay...

      //interpret:
      switch(bytecode[pc])
      {
      L_op1:
      case 0:
          //do my op1 thing
          pc += x;
      //DISPATCH:
      //here comes trouble - 1st of N nearly identical tables??
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
      L_op2:
      case 1:
          //do some other thing
          pc += y;
      //DISPATCH:
      //here comes trouble - 2nd table?
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
      L_op3:
      case 2:
          //my other op, jumps back
          pc -= ...;
      //DISPATCH:
      //here comes trouble - 3rd table?
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
 ...
      L_opN:
      case N-1:
      ...
      //DISPATCH:
      //here comes trouble Nth table ... time to count overhead
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
      }//end of giant switch
 }

I see what you mean here, too. Thanks for the explanation. It never occurred to 
me that one could write code like that, but I see the point, and doing jump 
table merging could be done fairly easily. No new language feature is required.

Jul 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 03:31, Walter Bright wrote:
 On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
 There is no code just jump [ecx]. The offset is included already.

 I'm not seeing where "code" is in the asm code you presented.

 Sorry, I have no idea why it is not the same. jumptable is a static
 array, and so does not need to be loaded into a register. And code[]
 needs to be loaded from somewhere, and it looks like it's omitted from
 your example.

Code was assumed to be in ebx obviously. BTW Static array for jump table 
is all good and well
  but does this trick work with PIC code? And last but not least - the 
jump target has to be _read_ from jump table
  and then jumped to it isn't it?

OK I've taken your comments into account.
Now I think I finally got it right:

mov ecx, [ebx] ; ecx = code[pc]
inc ebx ; pc ++
jmp ecx ; goto code[pc], as ecx is already a pointer

vs

mov ecx, [ebx] ; ecx = code[pc]
inc ebx ; ; inc pc
jump jump_table[ecx]; ; switch jump to it

or in english, ommiting PC increment:
1.
read x from array
jump to it
2.
  read x from array
  read jump address from 2nd static array at offset x
  jump to it

So to sum it all up: 1 vs 2 memory accesses, same register use, same 
(macro) instruction count.
Still I think that not having to touch extra static table is bonus and 
jump jump_table[ecx] could be less efficient on some processors then 
plain jump ecx (need to checks this).

 Now the same thing with switches.
 And before you ask -
 NO! Single switch won't do as it would have successful branch
 prediction rate of
 as bad as about 2% (see the article!). The more opcode types the worse
 prediction is.

 I see what you mean, but this could be done with final switch by the
 compiler, as I explained to bearophile.

Great but I still try to show that they are less efficient, see above.


 //GLOBALS
 size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's
 DMDscript ;)

 void run(size_t pc)
 {
      //here we don't JIT to real addresses beforehand
      //as jump tables somehow should be good enough
      // Okay...

      //interpret:
      switch(bytecode[pc])
      {
      L_op1:
      case 0:


[snip]
      L_opN:
      case N-1:
      ...
      //DISPATCH:
      //here comes trouble, Nth table ... time to count overhead
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
      }//end of giant switch
 }

 I see what you mean here, too. Thanks for the explanation. It never
 occurred to me that one could write code like that, but I see the point,
 and doing jump table merging could be done fairly easily. No new
 language feature is required.

Superb! I actually tried the code above, generating common things with a 
help of string mixins, of course currently it only gets slightly slower.

Should I file an enhancement request?


-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 10:04 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 03:31, Walter Bright wrote:
 On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
 There is no code just jump [ecx]. The offset is included already.

 I'm not seeing where "code" is in the asm code you presented.

 Sorry, I have no idea why it is not the same. jumptable is a static
 array, and so does not need to be loaded into a register. And code[]
 needs to be loaded from somewhere, and it looks like it's omitted from
 your example.

 Code was assumed to be in ebx obviously.

It's got to load it somehow.

 BTW Static array for jump table is all
 good and well but does this trick work with PIC code?

The jump table can be in the code segment, which is not possible for a user 
generated array.

 And last but not least - the jump
 target has to be _read_ from jump table
   and then jumped to it isn't it?

And it has to be read from code[] and jumped to. No difference.


 OK I've taken your comments into account.
 Now I think I finally got it right:

 mov ecx, [ebx] ; ecx = code[pc]
 inc ebx ; pc ++
 jmp ecx ; goto code[pc], as ecx is already a pointer

Nope, ecx is an opcode, not a pointer. You need another indirection.

 vs

 mov ecx, [ebx] ; ecx = code[pc]
 inc ebx ; ; inc pc
 jump jump_table[ecx]; ; switch jump to it

 or in english, ommiting PC increment:
 1.
 read x from array
 jump to it

It's

     pc => opcode => address

not

     pc => address

 2.
   read x from array
   read jump address from 2nd static array at offset x
   jump to it

 So to sum it all up: 1 vs 2 memory accesses, same register use, same (macro)
 instruction count.
 Still I think that not having to touch extra static table is bonus and jump
 jump_table[ecx] could be less efficient on some processors then plain jump ecx
 (need to checks this).

In both cases, you must pull the address out of an array and jump to it.


 Now the same thing with switches.
 And before you ask -
 NO! Single switch won't do as it would have successful branch
 prediction rate of
 as bad as about 2% (see the article!). The more opcode types the worse
 prediction is.

 I see what you mean, but this could be done with final switch by the
 compiler, as I explained to bearophile.

 Great but I still try to show that they are less efficient, see above.

No, it is the same code for each. The trouble is you're omitting one of the 
indirections needed for the computed goto case. You must:

    programcounter => opcode => address

2 indirections required. You keep skipping one of them!


 //GLOBALS
 size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's
 DMDscript ;)

 void run(size_t pc)
 {
      //here we don't JIT to real addresses beforehand
      //as jump tables somehow should be good enough
      // Okay...

      //interpret:
      switch(bytecode[pc])
      {
      L_op1:
      case 0:


 [snip]
      L_opN:
      case N-1:
      ...
      //DISPATCH:
      //here comes trouble, Nth table ... time to count overhead
          switch(bytecode[pc])
          {
          case 0:    goto L_op1;
          case 1: goto L_op2;
          ...
          }
      }//end of giant switch
 }

 I see what you mean here, too. Thanks for the explanation. It never
 occurred to me that one could write code like that, but I see the point,
 and doing jump table merging could be done fairly easily. No new
 language feature is required.

 Superb! I actually tried the code above, generating common things with a help
of
 string mixins, of course currently it only gets slightly slower.

 Should I file an enhancement request?

For the jump table merging, yes please.

Jul 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 10:11, Walter Bright wrote:
 On 7/24/2012 10:04 PM, Dmitry Olshansky wrote:
 BTW Static array for jump table is all
 good and well but does this trick work with PIC code?

 The jump table can be in the code segment, which is not possible for a
 user generated array.

OK, so it works - good to know.

 And last but not least - the jump
 target has to be _read_ from jump table
   and then jumped to it isn't it?

 And it has to be read from code[] and jumped to. No difference.

Yes, one read + one jump.

 OK I've taken your comments into account.
 Now I think I finally got it right:

 mov ecx, [ebx] ; ecx = code[pc]
 inc ebx ; pc ++
 jmp ecx ; goto code[pc], as ecx is already a pointer

 Nope, ecx is an opcode, not a pointer. You need another indirection.

Great, I think we finally got to the heart of it.
The trick is that we already pre-processed our bytecode. Now it contains 
real addresses. See my computed goto example again.
(even I myself made a mistake of writing [ecx] previously)

 vs

 mov ecx, [ebx] ; ecx = code[pc]
 inc ebx ; ; inc pc
 jump jump_table[ecx]; ; switch jump to it

 or in english, ommiting PC increment:
 1.
 read x from array
 jump to it

 It's

      pc => opcode => address

 not

      pc => address

It's pc => address because one can first preprocess all of byte code 
doing opcode => address rewrites. But you can't do it unless taking 
address of labels is possible.

 2 indirections required. You keep skipping one of them!

That's how you make things fast! :)

 I see what you mean here, too. Thanks for the explanation. It never
 occurred to me that one could write code like that, but I see the point,
 and doing jump table merging could be done fairly easily. No new
 language feature is required.

 Superb! I actually tried the code above, generating common things with
 a help of
 string mixins, of course currently it only gets slightly slower.

 Should I file an enhancement request?

 For the jump table merging, yes please.

Done:
http://d.puremagic.com/issues/show_bug.cgi?id=8431

-- 
Dmitry Olshansky

Jul 24 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
 It's pc => address because one can first preprocess all of byte code doing
 opcode => address rewrites. But you can't do it unless taking address of labels
 is possible.

All right, that's the piece that was missing.

I suppose it is possible for the compiler to recognize that the opcode=>address 
array is invariant, and optimize it out, but that would be a novel
optimization. 
I don't know how hard it would be.


 Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

Thanks!

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 11:37, Walter Bright wrote:
 On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
 It's pc => address because one can first preprocess all of byte code
 doing
 opcode => address rewrites. But you can't do it unless taking address
 of labels
 is possible.

 All right, that's the piece that was missing.

 I suppose it is possible for the compiler to recognize that the
 opcode=>address array is invariant, and optimize it out, but that would
 be a novel optimization. I don't know how hard it would be.

While it may sound possible I'm certain that in most interesting cases 
it's not possible for compiler to do it in advance, as array of opcodes 
ultimately comes from some external source e.g. parsed script file.

So opcode=>address transform happens at run-time after that it indeed 
becomes invariant.

  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

 Thanks!


-- 
Dmitry Olshansky

Jul 25 2012

Don Clugston <dac nospam.com> writes:

On 25/07/12 09:37, Walter Bright wrote:
 On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
 It's pc => address because one can first preprocess all of byte code
 doing
 opcode => address rewrites. But you can't do it unless taking address
 of labels
 is possible.

 All right, that's the piece that was missing.

 I suppose it is possible for the compiler to recognize that the
 opcode=>address array is invariant, and optimize it out, but that would
 be a novel optimization. I don't know how hard it would be.


  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

 Thanks!

Another interesting optimization with "final switch" would be if each 
case has about the same code length.

final switch(x) {
case C1:  ...
case C2:  ...
case C3:  ...
case C4:  ...
}
then if &(case c2) - &(case C1) == &(case C3) - &(case C2)

change it to
goto (&case C1) + x *( &(case c2) - &(case C1) );

so that there is no lookup table, just a multiply.

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 11:51, Don Clugston wrote:
 On 25/07/12 09:37, Walter Bright wrote:
 On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
 It's pc => address because one can first preprocess all of byte code
 doing
 opcode => address rewrites. But you can't do it unless taking address
 of labels
 is possible.

 All right, that's the piece that was missing.

 I suppose it is possible for the compiler to recognize that the
 opcode=>address array is invariant, and optimize it out, but that would
 be a novel optimization. I don't know how hard it would be.


  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

 Thanks!

 Another interesting optimization with "final switch" would be if each
 case has about the same code length.

 final switch(x) {
 case C1:  ...
 case C2:  ...
 case C3:  ...
 case C4:  ...
 }
 then if &(case c2) - &(case C1) == &(case C3) - &(case C2)

 change it to
 goto (&case C1) + x *( &(case c2) - &(case C1) );

 so that there is no lookup table, just a multiply.

Could be interesting if some other simple progressions could be 
hardcoded, if say branches are be sorted by length. Also modern CPU seem 
to be exceptionally fast at skipping NOPs so a bit of padding won't hurt.

-- 
Dmitry Olshansky

Jul 25 2012

Don Clugston <dac nospam.com> writes:

On 25/07/12 09:55, Dmitry Olshansky wrote:
 On 25-Jul-12 11:51, Don Clugston wrote:
 On 25/07/12 09:37, Walter Bright wrote:
 On 7/24/2012 11:46 PM, Dmitry Olshansky wrote:
 It's pc => address because one can first preprocess all of byte code
 doing
 opcode => address rewrites. But you can't do it unless taking address
 of labels
 is possible.

 All right, that's the piece that was missing.

 I suppose it is possible for the compiler to recognize that the
 opcode=>address array is invariant, and optimize it out, but that would
 be a novel optimization. I don't know how hard it would be.


  > Done: http://d.puremagic.com/issues/show_bug.cgi?id=8431

 Thanks!

 Another interesting optimization with "final switch" would be if each
 case has about the same code length.

 final switch(x) {
 case C1:  ...
 case C2:  ...
 case C3:  ...
 case C4:  ...
 }
 then if &(case c2) - &(case C1) == &(case C3) - &(case C2)

 change it to
 goto (&case C1) + x *( &(case c2) - &(case C1) );

 so that there is no lookup table, just a multiply.

 Could be interesting if some other simple progressions could be
 hardcoded, if say branches are be sorted by length.

Ooh, that's an interesting idea too.

 Also modern CPU seem
 to be exceptionally fast at skipping NOPs so a bit of padding won't hurt.

And an unconditional branch takes no time (only one 1 uop) on modern 
CPUs too.

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 12:51 AM, Don Clugston wrote:
 so that there is no lookup table, just a multiply.

Rethinking your idea a bit...

Suppose the switch jump_address[] array was really an array of hardcoded jmp 
instructions, 5 bytes each:

   jmp_table:
     jmp Lcase1;
     jmp Lcase2;
     jmp Lcase3;
     ...

and then the switch(EBX) would be:

     lea EAX,jmp_table[EBX][EBX*4]
     jmp EAX

is that kick-ass or what?

(There'd be some additional complication for PIC code.)

Jul 25 2012

Don Clugston <dac nospam.com> writes:

On 25/07/12 12:11, Walter Bright wrote:
 On 7/25/2012 12:51 AM, Don Clugston wrote:
 so that there is no lookup table, just a multiply.

 Rethinking your idea a bit...

 Suppose the switch jump_address[] array was really an array of hardcoded
 jmp instructions, 5 bytes each:

    jmp_table:
      jmp Lcase1;
      jmp Lcase2;
      jmp Lcase3;
      ...

 and then the switch(EBX) would be:

      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX

 is that kick-ass or what?

 (There'd be some additional complication for PIC code.)

Very nice. The jumps in the jump table take effectively zero cycles. 
That looks quite doable.

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 15:14, Don Clugston wrote:
 On 25/07/12 12:11, Walter Bright wrote:
 On 7/25/2012 12:51 AM, Don Clugston wrote:
 so that there is no lookup table, just a multiply.

 Rethinking your idea a bit...

 Suppose the switch jump_address[] array was really an array of hardcoded
 jmp instructions, 5 bytes each:

    jmp_table:
      jmp Lcase1;
      jmp Lcase2;
      jmp Lcase3;
      ...

 and then the switch(EBX) would be:

      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX

 is that kick-ass or what?

 (There'd be some additional complication for PIC code.)

 Very nice. The jumps in the jump table take effectively zero cycles.
 That looks quite doable.

Looks neat. I'd more then willing to test how it affects my tiny VM in 
std.regex.

-- 
Dmitry Olshansky

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 4:26 AM, Dmitry Olshansky wrote:
 On 25-Jul-12 15:14, Don Clugston wrote:
 On 25/07/12 12:11, Walter Bright wrote:
 On 7/25/2012 12:51 AM, Don Clugston wrote:
 so that there is no lookup table, just a multiply.

 Rethinking your idea a bit...

 Suppose the switch jump_address[] array was really an array of hardcoded
 jmp instructions, 5 bytes each:

    jmp_table:
      jmp Lcase1;
      jmp Lcase2;
      jmp Lcase3;
      ...

 and then the switch(EBX) would be:

      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX

 is that kick-ass or what?

 (There'd be some additional complication for PIC code.)

 Very nice. The jumps in the jump table take effectively zero cycles.
 That looks quite doable.

 Looks neat. I'd more then willing to test how it affects my tiny VM in
std.regex.

Is it possible you could code it up and test it using inline asm?

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 21:19, Walter Bright wrote:
 On 7/25/2012 4:26 AM, Dmitry Olshansky wrote:
 On 25-Jul-12 15:14, Don Clugston wrote:
 On 25/07/12 12:11, Walter Bright wrote:
 On 7/25/2012 12:51 AM, Don Clugston wrote:
 so that there is no lookup table, just a multiply.

 Rethinking your idea a bit...

 Suppose the switch jump_address[] array was really an array of
 hardcoded
 jmp instructions, 5 bytes each:

    jmp_table:
      jmp Lcase1;
      jmp Lcase2;
      jmp Lcase3;
      ...

 and then the switch(EBX) would be:

      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX

 is that kick-ass or what?

 (There'd be some additional complication for PIC code.)

 Very nice. The jumps in the jump table take effectively zero cycles.
 That looks quite doable.

 Looks neat. I'd more then willing to test how it affects my tiny VM in
 std.regex.

 Is it possible you could code it up and test it using inline asm?

Mm... I could try. So the trick is to add say this:
Dispatch:
asm{
       ...
      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX
jmp_table:
       jmp Lcase1;
       jmp Lcase2;
       jmp Lcase3;
}
Lcase1:
	...
	goto Dispatch
instead of current switch and replace case with labels. Sounds not half bad.

Then I could even replace that one goto Dispatch with same
      lea EAX,jmp_table[EBX][EBX*4]
      jmp EAX

I'll give it a shot. The only thing that worries me is that I will step 
on compiler's toes breaking his register allocation scheme (it would 
have to work around my inline asm).
Any tips on which spare registers to use (I guess ecx is no go, as there 
is 'this' pointer present) ?

-- 
Dmitry Olshansky

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
 Is it possible you could code it up and test it using inline asm?

 Mm... I could try. So the trick is to add say this:
 Dispatch:
 asm{
        ...
       lea EAX,jmp_table[EBX][EBX*4]
       jmp EAX
 jmp_table:
        jmp Lcase1;
        jmp Lcase2;
        jmp Lcase3;
 }
 Lcase1:
      ...
      goto Dispatch
 instead of current switch and replace case with labels. Sounds not half bad.

 Then I could even replace that one goto Dispatch with same
       lea EAX,jmp_table[EBX][EBX*4]
       jmp EAX

 I'll give it a shot. The only thing that worries me is that I will step on
 compiler's toes breaking his register allocation scheme (it would have to work
 around my inline asm).

Use the same register for both schemes, and it should then give comparable
results.

 Any tips on which spare registers to use (I guess ecx is no go, as there is
 'this' pointer present) ?

I wouldn't worry about it. EAX is good.

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 21:47, Walter Bright wrote:
 On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
 Is it possible you could code it up and test it using inline asm?



...


 Any tips on which spare registers to use (I guess ecx is no go, as
 there is
 'this' pointer present) ?

 I wouldn't worry about it. EAX is good.

OK. I'm almost there, here is what I have:
	//my byte code sets 8-bit to discern code/data
	 uint c = re.ir[t.pc].code - 128;
	 //no idea how to code the above in asm
	 //.code is { return x & mask; } property
	 asm{
               mov EAX, c;
               lea EAX, L_jumptable[EAX][EAX*4];
               jmp EAX;
             }
         L_jumptable:
          mixin(`asm{`
               ~ genJumpTable()
               ~ `} `);

So I have proper table generated and it all goes fine untill I get:

std\regex.d(5118): Error: undefined identifier 'L_jumptable'

-- 
Dmitry Olshansky

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 21:47, Walter Bright wrote:
 On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
 Is it possible you could code it up and test it using inline asm?



 ...


 Any tips on which spare registers to use (I guess ecx is no go, as
 there is
 'this' pointer present) ?

 I wouldn't worry about it. EAX is good.

 OK. I'm almost there, here is what I have:
      //my byte code sets 8-bit to discern code/data
       uint c = re.ir[t.pc].code - 128;
       //no idea how to code the above in asm
       //.code is { return x & mask; } property
       asm{
                mov EAX, c;
                lea EAX, L_jumptable[EAX][EAX*4];
                jmp EAX;
              }
          L_jumptable:
           mixin(`asm{`
                ~ genJumpTable()
                ~ `} `);

 So I have proper table generated and it all goes fine untill I get:

 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

I was afraid of that. You may have to approximate it by loading the address of 
L_jumptable into a register and adding it in instead of using the addressing 
mode. Then, add those extra instructions as dummies into the other path.

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25-Jul-12 23:58, Walter Bright wrote:
 On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 21:47, Walter Bright wrote:
 On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
 Is it possible you could code it up and test it using inline asm?



 ...


 Any tips on which spare registers to use (I guess ecx is no go, as
 there is
 'this' pointer present) ?

 I wouldn't worry about it. EAX is good.

 OK. I'm almost there, here is what I have:
      //my byte code sets 8-bit to discern code/data
       uint c = re.ir[t.pc].code - 128;
       //no idea how to code the above in asm
       //.code is { return x & mask; } property
       asm{
                mov EAX, c;
                lea EAX, L_jumptable[EAX][EAX*4];
                jmp EAX;
              }
          L_jumptable:
           mixin(`asm{`
                ~ genJumpTable()
                ~ `} `);

 So I have proper table generated and it all goes fine untill I get:

 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of using
 the addressing mode.

like this ?
	mov EDX, L_jumpable
	move EAX, EDX[EAX][EAX*4]
doesn't work. Seems like label is nonexistent anywhere but jump instruction.

Will this one do it:
	lea EAX, $[EAX+5][EAX*4];
         jmp EAX;

Compiles. Maybe I miscalculated though. Indirect jump has size of ?

Then, add those extra instructions as dummies into
 the other path.

Ehm? Other path like what? You mean compare it with jump table that uses 
plain addresses? Please expand a bit.


-- 
Dmitry Olshansky

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26-Jul-12 00:06, Dmitry Olshansky wrote:
 On 25-Jul-12 23:58, Walter Bright wrote:
 On 7/25/2012 12:52 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 21:47, Walter Bright wrote:
 On 7/25/2012 10:29 AM, Dmitry Olshansky wrote:
 Is it possible you could code it up and test it using inline asm?



 ...


 Any tips on which spare registers to use (I guess ecx is no go, as
 there is
 'this' pointer present) ?

 I wouldn't worry about it. EAX is good.

 OK. I'm almost there, here is what I have:
      //my byte code sets 8-bit to discern code/data
       uint c = re.ir[t.pc].code - 128;
       //no idea how to code the above in asm
       //.code is { return x & mask; } property
       asm{
                mov EAX, c;
                lea EAX, L_jumptable[EAX][EAX*4];
                jmp EAX;
              }
          L_jumptable:
           mixin(`asm{`
                ~ genJumpTable()
                ~ `} `);

 So I have proper table generated and it all goes fine untill I get:

 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of using
 the addressing mode.

 like this ?
      mov EDX, L_jumptable
      mov EAX, EDX[EAX][EAX*4]

Corrected typos. Still it doesn't take off.
-- 
Dmitry Olshansky

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of using
 the addressing mode.



I failed to load it in any register or interact with it in any way.
I think I've stalled. There has to be a way to get label address 
somehow, I got tired of guess and shot :(

BTW that would allow us to do computed gotos but only in inline asm.

-- 
Dmitry Olshansky

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 2:55 PM, Dmitry Olshansky wrote:
 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of using
 the addressing mode.



 I failed to load it in any register or interact with it in any way.
 I think I've stalled. There has to be a way to get label address somehow, I got
 tired of guess and shot :(

 BTW that would allow us to do computed gotos but only in inline asm.

How to get an address:

      call jump_table
L1:

      ...
jump_table:
      pop EAX
      jmp L1
      .. the rest of the jump table ...

Yes, it's awful, but we're just trying to take some measurements here, so it 
doesn't matter.

What I meant was add these extra instructions in to the switch version as 
dummies in order to make the extra time they take irrelevant to looking at the 
difference.

Jul 25 2012

Don Clugston <dac nospam.com> writes:

On 26/07/12 00:46, Walter Bright wrote:
 On 7/25/2012 2:55 PM, Dmitry Olshansky wrote:
 std\regex.d(5118): Error: undefined identifier 'L_jumptable'

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of
 using
 the addressing mode.



 I failed to load it in any register or interact with it in any way.
 I think I've stalled. There has to be a way to get label address
 somehow, I got
 tired of guess and shot :(

 BTW that would allow us to do computed gotos but only in inline asm.

 How to get an address:

       call jump_table
 L1:

       ...
 jump_table:
       pop EAX
       jmp L1
       .. the rest of the jump table ...

 Yes, it's awful, but we're just trying to take some measurements here,
 so it doesn't matter.

 What I meant was add these extra instructions in to the switch version
 as dummies in order to make the extra time they take irrelevant to
 looking at the difference.

But doing that screws up the CPU"s stack prediction so badly that it 
will dominate the timing
At least do something like:

jump_table:
       move EAX, [ESP]
       ret

Jul 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26-Jul-12 11:56, Don Clugston wrote:
 But doing that screws up the CPU"s stack prediction so badly that it
 will dominate the timing
 At least do something like:

 jump_table:
        move EAX, [ESP]
        ret

BTW this seems to be a roundabout way to get address of label
that I can use do a threaded code interpreter.
Basically every branch is:

L_opcode_1:
	asm { mov EAX, [ESP];
		ret;
	}
	... real code here
L_opcode_2:
	asm { mov EAX, [ESP];
		ret;
	}
	... real code here

Then "compile" step looks like this:
while(not_last_opcode(code[pc]){
	size_t c = code[pc];
	switch(c){
	case op_1:
		asm{ call L_opcode_1; add EAX, 4; mov c, EAX; }
		break;
	case op_2:
		...
	}
	code[pc] = c; //now we have label address
	pc++;
}

//interpret:
pc = 0;
size_t op = code[pc];
asm { mov EAX, op; jump eax; } //here we go, almost computed goto

Obviously it's backwards and awful. Makes me wonder why can't we take it 
directly, what's limitation ?
How about allowing it, at least in inline assembly?


-- 
Dmitry Olshansky

Jul 26 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/26/2012 1:14 PM, Dmitry Olshansky wrote:
 Obviously it's backwards and awful. Makes me wonder why can't we take it
 directly, what's limitation ?
 How about allowing it, at least in inline assembly?

It can be done, it's just that nobody has done the implementation in the inline 
assembler.

Jul 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 27-Jul-12 00:31, Walter Bright wrote:
 On 7/26/2012 1:14 PM, Dmitry Olshansky wrote:
 Obviously it's backwards and awful. Makes me wonder why can't we take it
 directly, what's limitation ?
 How about allowing it, at least in inline assembly?

 It can be done, it's just that nobody has done the implementation in the
 inline assembler.

Great! I guess I should file another enhancement request?

-- 
Dmitry Olshansky

Jul 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 27-Jul-12 00:50, Dmitry Olshansky wrote:
 On 27-Jul-12 00:31, Walter Bright wrote:
 On 7/26/2012 1:14 PM, Dmitry Olshansky wrote:
 Obviously it's backwards and awful. Makes me wonder why can't we take it
 directly, what's limitation ?
 How about allowing it, at least in inline assembly?

 It can be done, it's just that nobody has done the implementation in the
 inline assembler.

 Great! I guess I should file another enhancement request?

Filed: http://d.puremagic.com/issues/show_bug.cgi?id=8448

-- 
Dmitry Olshansky

Jul 26 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/26/2012 2:17 PM, Dmitry Olshansky wrote:
 Filed: http://d.puremagic.com/issues/show_bug.cgi?id=8448


Thank you.

Jul 26 2012

=?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:

On 07/25/2012 01:06 PM, Dmitry Olshansky wrote:
 On 25-Jul-12 23:58, Walter Bright wrote:

 I was afraid of that. You may have to approximate it by loading the
 address of L_jumptable into a register and adding it in instead of using
 the addressing mode.

 like this ?
 mov EDX, L_jumpable

I hope it is just that typo! :p

Ali

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 10:19 AM, Walter Bright wrote:
 Is it possible you could code it up and test it using inline asm?

I dummied up some code to do it:

int test(int i)
{
     switch (i)
     {
         case 3: i += 3; break;
         case 4: i += 4; break;
         case 5: i += 5; break;
         case 6: i += 6; break;
         case 7: i += 7; break;
         case 8: i += 8; break;
         default: i += 100; break;
     }
     return i;
}

                 enter   4,0
                 push    EBX
                 mov     -4[EBP],EAX
                 mov     EBX,EAX
                 sub     EBX,3
                 cmp     EBX,5
                 ja      L5D
                 lea     ECX,_D3foo4testFiZi[01Bh][EBX*4][EBX]
                 jmp     ECX
                 jmp     near ptr L39
                 jmp     near ptr L3F
                 jmp     near ptr L45
                 jmp     near ptr L4B
                 jmp     near ptr L51
                 jmp     near ptr L57
L39:            add     dword ptr -4[EBP],3
                 jmp short       L61
L3F:            add     dword ptr -4[EBP],4
                 jmp short       L61
L45:            add     dword ptr -4[EBP],5
                 jmp short       L61
L4B:            add     dword ptr -4[EBP],6
                 jmp short       L61
L51:            add     dword ptr -4[EBP],7
                 jmp short       L61
L57:            add     dword ptr -4[EBP],8
                 jmp short       L61
L5D:            add     dword ptr -4[EBP],064h
L61:            mov     EAX,-4[EBP]
                 pop     EBX
                 leave
                 ret

===============================================
Sadly, it's significantly slower than:

                 enter   4,0
                 push    EBX
                 mov     -4[EBP],EAX
                 mov     EBX,EAX
                 sub     EBX,3
                 cmp     EBX,5
                 ja      L3D
                 jmp     dword ptr FLAT:_DATA[00h][EBX*4]
                 add     dword ptr -4[EBP],3
                 jmp short       L41
                 add     dword ptr -4[EBP],4
                 jmp short       L41
                 add     dword ptr -4[EBP],5
                 jmp short       L41
                 add     dword ptr -4[EBP],6
                 jmp short       L41
                 add     dword ptr -4[EBP],7
                 jmp short       L41
                 add     dword ptr -4[EBP],8
                 jmp short       L41
L3D:            add     dword ptr -4[EBP],064h
L41:            mov     EAX,-4[EBP]
                 pop     EBX
                 leave
                 ret

Maybe the branch prediction logic doesn't work for JMP ECX.

Jul 25 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/25/2012 11:24 PM, Walter Bright wrote:
                  jmp     near ptr L39
                  jmp     near ptr L3F
                  jmp     near ptr L45
                  jmp     near ptr L4B
                  jmp     near ptr L51
                  jmp     near ptr L57

I tried replacing these with 2 byte jmp instructions, instead of 5 byte ones, 
but no improvement.

Jul 25 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26-Jul-12 10:24, Walter Bright wrote:
 On 7/25/2012 10:19 AM, Walter Bright wrote:
 Is it possible you could code it up and test it using inline asm?

 I dummied up some code to do it:

 int test(int i)
 {
      switch (i)
      {
          case 3: i += 3; break;
          case 4: i += 4; break;
          case 5: i += 5; break;
          case 6: i += 6; break;
          case 7: i += 7; break;
          case 8: i += 8; break;
          default: i += 100; break;
      }
      return i;
 }

Do the above in loop. And more cases of course. Something around 40 
should do. I'm still figuring out why my inline asm version segfaults :)

                  enter   4,0
                  push    EBX
                  mov     -4[EBP],EAX
                  mov     EBX,EAX
                  sub     EBX,3
                  cmp     EBX,5
                  ja      L5D
                  lea     ECX,_D3foo4testFiZi[01Bh][EBX*4][EBX]
                  jmp     ECX
                  jmp     near ptr L39
                  jmp     near ptr L3F
                  jmp     near ptr L45
                  jmp     near ptr L4B
                  jmp     near ptr L51
                  jmp     near ptr L57
 L39:            add     dword ptr -4[EBP],3
                  jmp short       L61
 L3F:            add     dword ptr -4[EBP],4
                  jmp short       L61
 L45:            add     dword ptr -4[EBP],5
                  jmp short       L61
 L4B:            add     dword ptr -4[EBP],6
                  jmp short       L61
 L51:            add     dword ptr -4[EBP],7
                  jmp short       L61
 L57:            add     dword ptr -4[EBP],8
                  jmp short       L61
 L5D:            add     dword ptr -4[EBP],064h
 L61:            mov     EAX,-4[EBP]
                  pop     EBX
                  leave
                  ret

 ===============================================
 Sadly, it's significantly slower than:

                  enter   4,0
                  push    EBX
                  mov     -4[EBP],EAX
                  mov     EBX,EAX
                  sub     EBX,3
                  cmp     EBX,5
                  ja      L3D
                  jmp     dword ptr FLAT:_DATA[00h][EBX*4]
                  add     dword ptr -4[EBP],3
                  jmp short       L41
                  add     dword ptr -4[EBP],4
                  jmp short       L41
                  add     dword ptr -4[EBP],5
                  jmp short       L41
                  add     dword ptr -4[EBP],6
                  jmp short       L41
                  add     dword ptr -4[EBP],7
                  jmp short       L41
                  add     dword ptr -4[EBP],8
                  jmp short       L41
 L3D:            add     dword ptr -4[EBP],064h
 L41:            mov     EAX,-4[EBP]
                  pop     EBX
                  leave
                  ret

 Maybe the branch prediction logic doesn't work for JMP ECX.


-- 
Dmitry Olshansky

Jul 26 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 7/26/2012 8:55 AM, Dmitry Olshansky wrote:
 int test(int i)
 {
      switch (i)
      {
          case 3: i += 3; break;
          case 4: i += 4; break;
          case 5: i += 5; break;
          case 6: i += 6; break;
          case 7: i += 7; break;
          case 8: i += 8; break;
          default: i += 100; break;
      }
      return i;
 }

 Do the above in loop. And more cases of course. Something around 40 should do.


Here's my entire test program. It runs a consistent 5 to 10% slower with the
new 
method compared with the old. Color me very disappointed.
=======================================================
import core.stdc.stdio;

int test(int i)
{
     switch (i)
     {
         case 3: i += 3; break;
         case 4: i += 4; break;
         case 5: i += 5; break;
         case 6: i += 6; break;
         case 7: i += 7; break;
         case 8: i += 8; break;
         default: i += 100; break;
     }
     return i;
}

void main()
{
     for (int i = 0; i < 100000000; i++)
     {
         for (int j = 0; j < 10; j++)
             test(j);
     }
     printf("%d\n", test(6));
}

Jul 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 27-Jul-12 00:38, Walter Bright wrote:
 On 7/26/2012 8:55 AM, Dmitry Olshansky wrote:
 int test(int i)
 {
      switch (i)
      {
          case 3: i += 3; break;
          case 4: i += 4; break;
          case 5: i += 5; break;
          case 6: i += 6; break;
          case 7: i += 7; break;
          case 8: i += 8; break;
          default: i += 100; break;
      }
      return i;
 }

 Do the above in loop. And more cases of course. Something around 40
 should do.


 Here's my entire test program. It runs a consistent 5 to 10% slower with
 the new method compared with the old. Color me very disappointed.

Thanks. I'll play with it a bit if time permits.



-- 
Dmitry Olshansky

Jul 26 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 OK I've taken your comments into account.
 Now I think I finally got it right:

 mov ecx, [ebx] ; ecx = code[pc]
 inc ebx ; pc ++
 jmp ecx ; goto code[pc], as ecx is already a pointer

 Nope, ecx is an opcode, not a pointer. You need another 
 indirection.

Man this has been frustrating to read. I understood what Dmitry 
was talking about over at least dozen posts ago, and that's 
without actually reading the article about interpreters (I did 
write a SNES emulator once, but it didn't use this cool 
technique. I did, however, have to write it in assembly because 
the C version was dog-slow because e.g. I couldn't capture the 
overflow/negative/zero flags in C.)

Jul 25 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Computed gotos on Reddit