digitalmars.D.learn - Can I get a more in-depth guide about the inline assembler?


ZILtoid1991 <ziltoidtheomnicent gmail.com> writes:

Here's the assembly code for my alpha-blending routine:

ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
asm{
	//moving the values to their destinations
	movd	MM0, p;
	movd	MM1, src;
	movq	MM5, alpha;
	movq	MM7, alphaMMXmul_const1;
	movq	MM6, alphaMMXmul_const2;
	punpcklbw	MM2, MM0;
	punpcklbw	MM3, MM1;

	paddw	MM6, MM5;	//1 + alpha
	psubw	MM7, MM5;	//256 - alpha

	pmulhuw	MM2, MM6;	//src * (1 + alpha)
	pmulhuw	MM3, MM7;	//dest * (256 - alpha)
	paddw	MM3, MM2;	//(src * (1 + alpha)) + (dest * (256 - alpha))
	psrlw	MM3, 8;		//(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
	//moving the result to its place
	packuswb	MM4, MM3;
	movd	p, MM4;
	emms;
}

The two constants referred to here:
static immutable ushort[4] alphaMMXmul_const1 = [256,256,256,256];
static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];

alpha is a ushort[4] containing the alpha value four times.

After some debugging, I found out that the p pointer becomes null 
at the end instead of pointing to a value. I have no experience 
with inline assemblers (although I made a few Hello World 
programs for MS-DOS with a stand-alone assembler), so I don't 
know when and how the compiler will interpret the types from D.

Jun 01 2016

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
 After some debugging, I found out that the p pointer becomes 
 null at the end instead of pointing to a value. I have no 
 experience with using in-line assemblers (although I made a few 
 Hello World programs for MS-Dos with a stand-alone assembler), 
 so I don't know when and how the compiler will interpret the 
 types from D.

  In the inline assembler, a variable name actually becomes just 
the offset to where that variable sits on the stack relative to 
EBP. So if you want what the pointer points to, you need to load 
the pointer into a register first and then address memory through 
that register. So... this should be correct:

//unless you are going to actually use ubyte[4] here, just making a pointer will work instead, so cast(uint*) probably
ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
asm{
	//moving the values to their destinations
	movd	ESI, src[EBP];	//get source pointer
	movd	EDI, p[EBP];	//get destination pointer
	movd	MM0, [EDI];	//use directly
	movd	MM1, [ESI];
	movq	MM5, alpha;
	movq	MM7, alphaMMXmul_const1;
	movq	MM6, alphaMMXmul_const2;

	<snip>

	movd	[EDI], MM4;
}
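
A plain-D sketch of the difference may help here. It is only a model: 
the helper names are hypothetical, and the byte order assumes the 
little-endian layout that movd uses on x86. The first function mimics 
what "movd p, MM4" does (the value lands in the pointer variable 
itself, which would explain p ending up null if the low half of MM4 
was zero), the second mimics a store through the pointer. Note also 
that general-purpose registers such as ESI and EDI are loaded with 
mov, as the later posts in this thread do; movd needs an MMX register 
on one side.

```d
// Hypothetical helpers: plain D standing in for the two kinds of asm store.

/// Models `movd p, MM4`: the 32-bit value overwrites the pointer variable.
/// The pixel itself is never touched; the new pointer value is returned.
ubyte[4]* storeIntoPointerVariable(ubyte[4]* p, uint value)
{
    p = cast(ubyte[4]*) cast(size_t) value;
    return p;
}

/// Models `mov EBX, p; movd [EBX], MM4`: the value is written to the
/// memory the pointer refers to, lowest byte first (little-endian).
void storeThroughPointer(ubyte[4]* p, uint value)
{
    ubyte[4] bytes = [
        cast(ubyte) value,
        cast(ubyte)(value >> 8),
        cast(ubyte)(value >> 16),
        cast(ubyte)(value >> 24)
    ];
    *p = bytes;
}
```

With a zero value, the first helper returns null and leaves the pixel 
as it was; the second changes the pixel and leaves the pointer alone.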

Jun 01 2016

ZILtoid1991 <ziltoidtheomnicent gmail.com> writes:

On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:
 On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
 After some debugging, I found out that the p pointer becomes 
 null at the end instead of pointing to a value. I have no 
 experience with using in-line assemblers (although I made a 
 few Hello World programs for MS-Dos with a stand-alone 
 assembler), so I don't know when and how the compiler will 
 interpret the types from D.

  In the assembler the variable names actually become just the 
 offset to where they are in the stack in relation to BP. So if 
 you want the full pointer you actually need to convert it into 
 a register first and then just use that register instead. 
 So.... This should be correct.

 //unless you are going to actually use ubyte[4] here, just 
 making a pointer will work instead, so cast(uint*) probably
 ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
 ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + 
 offsetY);
 asm{	//moving the values to their destinations

  movd   ESI, src[EBP]; //get source pointer
  movd   EDI,   p[EBP]; //get destination pointer
  movd	MM0,    [EDI]; //use directly
  movd	MM1,    [ESI];
 movq	MM5, alpha;
 movq	MM7, alphaMMXmul_const1;
 movq	MM6, alphaMMXmul_const2;

 <snip>

  movd	[EDI], MM4;
 }

I got the code partly working after replacing pmulhuw with 
pmullw, but due to integer overflow I get a glitched image. I'm 
trying to work around the fact that pmulhuw stores the high bits 
of the result, either with multiplication or with bit shifting.
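
To make the high/low distinction concrete, here is a small sketch in 
plain D that models one 16-bit lane of each instruction (the function 
names are made up for illustration). With an 8-bit channel in the low 
byte of the lane and a factor of at most 256, the product fits in 16 
bits, so the high half is always zero and the low half is the whole 
product. An overflow would then point at the operands rather than the 
multiply, for example a channel byte sitting in the upper half of its 
word.

```d
/// One lane of pmulhuw: upper 16 bits of the unsigned 32-bit product.
ushort mulHigh(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}

/// One lane of pmullw: lower 16 bits of the 32-bit product.
ushort mulLow(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}
```

For example, mulHigh(255, 256) is 0 while mulLow(255, 256) is 65280; 
if the channel byte is shifted into the high byte (0xFF00), mulLow 
with 256 gives 0 because the product no longer fits.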

Jun 01 2016

ZILtoid1991 <ziltoidtheomnicent gmail.com> writes:

On Thursday, 2 June 2016 at 00:51:15 UTC, ZILtoid1991 wrote:
 On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:
 On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:


 I could get the code working with a bug after replacing pmulhuw 
 with pmullw, but due to integer overflow I get a glitched 
 image. I try to get around the fact that pmulhuw stores the 
 high bits of the result either with multiplication or with bit 
 shifting.

I forgot to mention that I had to make pointers for the arrays I 
used in order to be able to load them.

Jun 01 2016

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Thursday, 2 June 2016 at 00:52:48 UTC, ZILtoid1991 wrote:
 On Thursday, 2 June 2016 at 00:51:15 UTC, ZILtoid1991 wrote:
 I could get the code working with a bug after replacing 
 pmulhuw with pmullw, but due to integer overflow I get a 
 glitched image. I try to get around the fact that pmulhuw 
 stores the high bits of the result either with multiplication 
 or with bit shifting.

 I forgot to mention that I had to make pointers for the arrays 
 I used in order to be able to load them.

  I'm not familiar with the MMX instruction set, however glancing 
at the source again I notice the constants are (of course) 
arrays. Where you load those two constants you should probably 
create/convert pointers as appropriate, or if they are enum 
values I think they will be dropped in and work fine (the values 
being 0x0100_0100_0100_0100 and 0x0001_0001_0001_0001, I believe, 
based on the ushort layout).

  Maybe you already had that solved or the compiler does something 
I don't know...
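
The two 64-bit values can be checked with a short D sketch. The 
helper packLanes and the enum names are assumptions for illustration, 
not part of the original code; lane 0 is taken to be the lowest 16 
bits, as in an MMX register loaded with movq on x86. Whether the 
inline assembler accepts such an enum as a movq operand is not 
verified here.

```d
/// Packs four 16-bit lanes into one 64-bit value, lane 0 in the low bits.
ulong packLanes(const ushort[4] lanes)
{
    ulong q = 0;
    foreach (i, lane; lanes)
        q |= cast(ulong) lane << (16 * i);
    return q;
}

// Hypothetical enum equivalents of the two ushort[4] constants.
enum ulong alphaConst256 = 0x0100_0100_0100_0100; // [256,256,256,256]
enum ulong alphaConst1   = 0x0001_0001_0001_0001; // [1,1,1,1]
```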

Jun 01 2016

Johan Engelen <j j.nl> writes:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
 Here's the assembly code for my alpha-blending routine:

Could you also paste the D version of your code? Perhaps the 
compiler (LDC, GDC) will generate similarly vectorized code that 
is inlinable, etc.

-Johan

Jun 02 2016

ZILtoid1991 <ziltoidtheomnicent gmail.com> writes:

On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
 On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
 Here's the assembly code for my alpha-blending routine:

 Could you also paste the D version of your code? Perhaps the 
 compiler (LDC, GDC) will generate similarly vectorized code 
 that is inlinable, etc.

 -Johan

ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - 
src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - 
src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - 
src[0]))>>8);
*p = dest2;
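
For reference, the formula above can be wrapped into a small 
self-contained function. This is a sketch with hypothetical names 
(blendChannel, blendPixel), keeping alpha in element 0 as in the 
snippet. It uses a plain cast instead of to!ubyte: the sum is at most 
255 * 257 = 65535, so the shifted result always fits in a byte, 
whereas std.conv's to performs a range check on every call.

```d
/// Blends one 8-bit channel: (src * (alpha + 1) + dest * (256 - alpha)) >> 8.
ubyte blendChannel(ubyte src, ubyte dest, ubyte alpha)
{
    return cast(ubyte)((src * (alpha + 1) + dest * (256 - alpha)) >> 8);
}

/// Blends the three color channels of a pixel; element 0 of src is the
/// alpha value, and element 0 of dest is left unchanged.
ubyte[4] blendPixel(ubyte[4] src, ubyte[4] dest)
{
    ubyte[4] result = dest;
    foreach (i; 1 .. 4)
        result[i] = blendChannel(src[i], dest[i], src[0]);
    return result;
}
```

With alpha = 255 the result is the source channel, with alpha = 0 it 
is the destination channel.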

The main problem with this is that it's much slower, even if I 
calculate the alpha blending values only once. The assembly code 
does not seem to cost more than the "replace if alpha = 255" 
algorithm:

if(src[0] == 255){
*p = src;
}

It also seems I have quite a few problems with the assembly code, 
mostly with the pmulhuw instruction (it returns the higher 16 
bits of the result, I need the lower 16 bits as unsigned), and 
also with the pointers, as the reads and write-backs don't land 
in their correct places, sometimes resulting in a flickering 
screen or wrong colors affecting neighboring pixels. Current 
assembly code:

//ushort[4] alpha = [src[0],src[0],src[0],src[0]];	//replace it if there's a faster method for this
ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
	//moving the values to their destinations
	mov	ESI, p2[EBP];
	mov	EDI, p[EBP];
	movd	MM0, [ESI];
	movd	MM1, [EDI];
	mov	ESI, p3[EBP];
	movq	MM5, [ESI];
	mov	ESI, pc_256[EBP];
	movq	MM7, [ESI];
	mov	ESI, pc_1[EBP];
	movq	MM6, [ESI];
	punpcklbw	MM2, MM0;
	punpcklbw	MM3, MM1;

	paddw	MM6, MM5;	//1 + alpha
	psubw	MM7, MM5;	//256 - alpha

	//psllw	MM2, 2;
	//psllw	MM3, 2;
	psrlw	MM6, 1;
	psrlw	MM7, 1;
	pmullw	MM2, MM6;	//src * (1 + alpha)
	pmullw	MM3, MM7;	//dest * (256 - alpha)
	paddw	MM3, MM2;	//(src * (1 + alpha)) + (dest * (256 - alpha))
	psrlw	MM3, 8;		//(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
	//moving the result to its place
	packuswb	MM4, MM3;
	movd	[EDI-3], MM4;

	emms;
}

Tried to get the correct result with trial and error, but there's 
no real improvement.

Jun 02 2016

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:
 On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
 Could you also paste the D version of your code? Perhaps the 
 compiler (LDC, GDC) will generate similarly vectorized code 
 that is inlinable, etc.

 ubyte[4] dest2 = *p;
 dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - 
 src[0]))>>8);
 dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - 
 src[0]))>>8);
 dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - 
 src[0]))>>8);
 *p = dest2;

 The main problem with this is that it's much slower, even if I 
 would calculate the alpha blending values once. The assembly 
 code does not seem to have higher impact than the "replace if 
 alpha = 255" algorithm:

 if(src[0] == 255){
 *p = src;
 }

 It also seems I have a quite few problems with the assembly 
 code, mostly with the pmulhuw command (it returns the higher 16 
 bit of the result, I need the lower 16 bit as unsigned), also 
 with the pointers, as the read outs and write backs doesn't 
 land to their correct places, sometimes resulting in a 
 flickering screen or wrong colors affecting neighboring pixels. 
 Current assembly code:

  I'd say the major portion of your speedup comes from doing 3 
things at once. More specifically, because you're working with 3 
8-bit colors, you have 24 bits of data to work with, and by 
adding 8 bits to each for fixed-point arithmetic you can do 4 
small multiplies in a single instruction.

  You'd probably get a similar effect from bit shifting before and 
after the multiplies, since you're working with 3 colors and the 
alpha/multiplier. This assumes you do it without MMX (it reduces 
6 multiplies to a mere 2).

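// each channel gets its own 16-bit field; the cast is needed because
// a ubyte promotes to int, which cannot be shifted left by 32
ulong tmp1 = (cast(ulong)src[1] << 32) | (src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong)dest2[1] << 32) | (dest2[2] << 16) | dest2[3];

tmp1 *= src[0]+1;
tmp1 += tmp2*(256 - src[0]);

// the result of each field is in its upper 8 bits
src[3] = (tmp1 >> 8) & 0xff;
src[2] = (tmp1 >> 24) & 0xff;
src[1] = (tmp1 >> 40) & 0xff;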
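
As a self-contained sketch of the same packing idea (the function 
name blendPacked is an assumption, and unlike the snippet above it 
returns a new pixel instead of overwriting src): each channel sits in 
its own 16-bit field of a ulong, so two 64-bit multiplies cover all 
three colors. No field can carry into its neighbour because each 
per-channel sum is at most 255 * 257 = 65535.

```d
/// Blends three color channels with two 64-bit multiplies.
/// Element 0 of src is the alpha value; element 0 of dest is kept.
ubyte[4] blendPacked(ubyte[4] src, ubyte[4] dest)
{
    // 16-bit fields: channel 1 at bit 32, channel 2 at bit 16, channel 3 at bit 0.
    ulong s = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
    ulong d = (cast(ulong) dest[1] << 32) | (cast(ulong) dest[2] << 16) | dest[3];

    // One multiply per side handles all three channels at once.
    ulong sum = s * (src[0] + 1) + d * (256 - src[0]);

    // The blended value is the upper byte of each 16-bit field.
    ubyte[4] result = dest;
    result[3] = cast(ubyte)(sum >> 8);
    result[2] = cast(ubyte)(sum >> 24);
    result[1] = cast(ubyte)(sum >> 40);
    return result;
}
```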


  You could also increase the bit precision so that if you decided 
to do further adds or other calculations there would be more 
headroom, but not much. Say you gave yourself 20 bits per 
variable rather than 16: the values can then go 16x higher, for 
getting, say, the average of x values at no cost (if x is a power 
of 2) other than a little difference in how you write it :)

  Although you might still get a better result from MMX 
instructions if you have them in the right order. Don't forget 
though that MMX uses the same register space as floating point, 
so mixing the two is a big no-no.

Jun 02 2016

ZILtoid1991 <ziltoidtheomnicent gmail.com> writes:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
 Here's the assembly code for my alpha-blending routine:
 ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
 ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + 
 offsetY);
 asm{	//moving the values to their destinations
 movd	MM0, p;
 movd	MM1, src;
 movq	MM5, alpha;
 movq	MM7, alphaMMXmul_const1;
 movq	MM6, alphaMMXmul_const2;
 									punpcklbw	MM2, MM0;
 punpcklbw	MM3, MM1;

 paddw	MM6, MM5;	//1 + alpha
 psubw	MM7, MM5;	//256 - alpha

 pmulhuw	MM2, MM6;	//src * (1 + alpha)
 pmulhuw MM3, MM7;	//dest * (256 - alpha)
 paddw	MM3, MM2;	//(src * (1 + alpha)) + (dest * (256 - alpha))
 psrlw	MM3, 8;		//(src * (1 + alpha)) + (dest * (256 - alpha)) / 
 256
 									//moving the result to its place;
 									packuswb	MM4, MM3;
 movd	p, MM4;
 emms;
 }

 The two constants being referred here:
 static immutable ushort[4] alphaMMXmul_const1 = 
 [256,256,256,256];
 static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];

 alpha is a ushort[4] containing the alpha value four times.

 After some debugging, I found out that the p pointer becomes 
 null at the end instead of pointing to a value. I have no 
 experience with using in-line assemblers (although I made a few 
 Hello World programs for MS-Dos with a stand-alone assembler), 
 so I don't know when and how the compiler will interpret the 
 types from D.

Problem solved. Current assembly code:

asm{
	//moving the values to their destinations
	mov	EBX, p[EBP];
	movd	MM0, src;
	movd	MM1, [EBX];

	movq	MM5, alpha;
	movq	MM7, alphaMMXmul_const256;
	movq	MM6, alphaMMXmul_const1;
	pxor	MM2, MM2;
	punpcklbw	MM0, MM2;
	punpcklbw	MM1, MM2;

	paddusw	MM6, MM5;	//1 + alpha
	psubusw	MM7, MM5;	//256 - alpha

	pmullw	MM0, MM6;	//src * (1 + alpha)
	pmullw	MM1, MM7;	//dest * (256 - alpha)
	paddusw	MM0, MM1;	//(src * (1 + alpha)) + (dest * (256 - alpha))
	psrlw	MM0, 8;		//(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
	//moving the result to its place
	//pxor	MM2, MM2;
	packuswb	MM0, MM2;

	movd	[EBX], MM0;

	emms;
}
The actual problem was the poor documentation of the MMX 
instructions, as MMX never really caught on, and the 
disappearance of assembly programming from the mainstream. The 
end result is a quick alpha-blending algorithm that has barely 
any extra performance penalty compared to just copying the 
pixels. I currently have no plans to translate the whole sprite 
displaying algorithm to assembly; instead I'll work on the editor 
for the game engine.
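
For readers without MMX experience, the instruction sequence above 
can be modelled lane by lane in plain D. This is only a sketch: the 
helper names are invented, a single ushort stands in for the 
ushort[4] alpha, and it follows the asm in blending all four bytes of 
the pixel, including element 0.

```d
/// One lane of paddusw: unsigned 16-bit add that saturates at 0xFFFF.
ushort addSat(ushort a, ushort b)
{
    uint r = cast(uint) a + b;
    return cast(ushort)(r > 0xFFFF ? 0xFFFF : r);
}

/// One lane of psubusw: unsigned 16-bit subtract that saturates at 0.
ushort subSat(ushort a, ushort b)
{
    return cast(ushort)(a > b ? a - b : 0);
}

/// One lane of packuswb: signed 16-bit word saturated to an unsigned byte.
ubyte packSat(ushort w)
{
    short s = cast(short) w;
    return cast(ubyte)(s < 0 ? 0 : (s > 255 ? 255 : s));
}

/// Models the whole asm block for one pixel.
ubyte[4] blendLanes(ubyte[4] src, ubyte[4] dest, ushort alpha)
{
    ubyte[4] result;
    foreach (i; 0 .. 4)
    {
        ushort s = src[i];                 // punpcklbw MM0, MM2 (MM2 = 0)
        ushort d = dest[i];                // punpcklbw MM1, MM2
        ushort fs = addSat(1, alpha);      // paddusw MM6, MM5
        ushort fd = subSat(256, alpha);    // psubusw MM7, MM5
        ushort ps = cast(ushort)(s * fs);  // pmullw MM0, MM6 (low 16 bits)
        ushort pd = cast(ushort)(d * fd);  // pmullw MM1, MM7
        ushort sum = addSat(ps, pd);       // paddusw MM0, MM1
        result[i] = packSat(cast(ushort)(sum >> 8)); // psrlw 8, packuswb
    }
    return result;
}
```

The key change from the first version is the operand order of 
punpcklbw: unpacking the pixel against a zeroed register puts each 
byte in the low half of its 16-bit lane, so pmullw yields the full 
product.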

Jun 03 2016

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Saturday, 4 June 2016 at 01:44:38 UTC, ZILtoid1991 wrote:
 Problem solved. Current assembly code:

 //moving the values to their destinations
 mov	EBX, p[EBP];
 movd	MM0, src;
 movd	MM1, [EBX];

 <snip>

 The actual problem was the poor documentation of MMX 
 instructions as it never really caught on, and the 
 disappearance of assembly programming from the mainstream. The 
 end result was a quick alpha-blending algorithm that barely has 
 any extra performance penalty compared to just copying the 
 pixels. I currently have no plans on translating the whole 
 sprite displaying algorithm to assembly, instead I'll work on 
 the editor for the game engine.

  So... Why did you need to dereference the pointer for p and move 
it to EBX, but didn't need to do it for src (no [])?

  Maybe you should explain your experiences with the MMX 
instruction set, follies and what you succeeded on? Where does 
the documentation fail? And are we talking about the Intel 
manuals and instruction sets or another source?

Jun 03 2016
