
digitalmars.D.learn - Optimization tips for alpha blending / rasterization loop

reply "Mikko Ronkainen" <mikoro iki.fi> writes:
I'm trying to learn some software rasterization stuff. Here's 
what I'm doing:

32-bit DMD on 64-bit Windows
Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR 
(this seems fastest to my CPU + GPU)
Framebuffer is thrown as-is to OpenGL and rendered as a textured quad.

Here's a simple rectangle drawing algorithm that also does alpha 
blending. I tried quite a few variations (for example without 
the byte casting, using ints and shifting instead), but none was 
as fast as this:

class Framebuffer
{
   int[] data;
   int width;
   int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y, int 
width, int height, int color)
{
   foreach (i; y .. y + height)
   {
     int start = x + i * framebuffer.width;

     foreach(j; 0 .. width)
     {
       byte* bg = cast(byte*)&framebuffer.data[start + j];
       byte* fg = cast(byte*)&color;

       int alpha = (fg[3] & 0xff) + 1;
       int inverseAlpha = 257 - alpha;

       bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * 
(bg[0] & 0xff)) >> 8);
       bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * 
(bg[1] & 0xff)) >> 8);
       bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * 
(bg[2] & 0xff)) >> 8);
       bg[3] = cast(byte)0xff;
     }
   }
}

I would like to make this as fast as possible as it is done for 
almost every pixel every frame.

Am I doing something stupid that is slowing things down? Cache 
thrashing, or even branch prediction errors? :)
Is this kind of algorithm + data even a candidate for SIMD usage?
Even though fg is of type byte, fg[0] can produce a value greater 
than 0xff. It needs to be (fg[0] & 0xff) to make things work. I 
wonder why?
Nov 21 2013
parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
wrote:
 I'm trying to learn some software rasterization stuff. Here's 
 what I'm doing:
 [...]
       byte* bg = cast(byte*)&framebuffer.data[start + j];
       byte* fg = cast(byte*)&color;

       int alpha = (fg[3] & 0xff) + 1;
       int inverseAlpha = 257 - alpha;
 [...]
 Even though fg is of type byte, fg[0] can produce a value 
 greater than 0xff. It needs to be (fg[0] & 0xff) to make things 
 work. I wonder why?
Do you want to use a ubyte instead of a byte here?

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range 
that can be stored in a ubyte.

Craig
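A minimal, self-contained sketch of the sign-extension issue being 
pointed at here (assuming the little-endian x86 layout the original 
code relies on; this example is an illustration, not code from the 
thread):

import std.stdio;

void main()
{
    uint color = 0xFF0080FF; // 0xAABBGGRR, so byte 1 is G = 0x80

    byte* sb = cast(byte*)&color;   // signed view
    ubyte* ub = cast(ubyte*)&color; // unsigned view

    // byte is signed: 0x80 sign-extends to -128 when promoted to
    // int, so arithmetic goes wrong without the & 0xff mask.
    writeln(sb[1]);        // -128
    writeln(sb[1] & 0xff); // 128
    writeln(ub[1]);        // 128 -- ubyte zero-extends, no mask needed
}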
Nov 21 2013
next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh 
wrote:
 On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
 wrote:
  [...]

 Do you want to use a ubyte instead of a byte here?

 Also, for your alpha channel:

 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 If fg[3] = 0 then inverseAlpha = 256, which is out of the range 
 that can be stored in a ubyte.

 Craig
If I'm right, all of these lines:

byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

are loop-invariant, so you can move them outside both foreach 
loops using an enum. You can also pre-calculate 
alpha * (fg[0] & 0xff) before the foreach.
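A minimal sketch of the hoisting suggested above (the names 
afg0..afg2 for the pre-multiplied channels are made up here for 
illustration):

// Loop-invariant: depends only on the color argument, not on i or j.
immutable ubyte* fg = cast(immutable ubyte*)&color;
immutable uint alpha = fg[3] + 1;
immutable uint invAlpha = 257 - alpha;

// Pre-multiplied foreground channels, one per color component.
immutable uint afg0 = alpha * fg[0];
immutable uint afg1 = alpha * fg[1];
immutable uint afg2 = alpha * fg[2];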
Nov 22 2013
parent "Andrea Fontana" <nospam example.com> writes:
On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
 On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh 
 wrote:
  [...]

 If I'm right, all of these lines:

 byte* fg = cast(byte*)&color;
 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 are loop-invariant, so you can move them outside both foreach 
 loops using an enum. You can also pre-calculate 
 alpha * (fg[0] & 0xff) before the foreach.
Of course I mean immutable, not enum :)
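For readers unfamiliar with the distinction: enum in D declares a 
compile-time manifest constant, so it cannot hold a value derived 
from the runtime color argument, while immutable is initialized 
once at runtime and never reassigned. A minimal sketch:

// enum must be computable at compile time -- this line would not
// compile, because fg points into a runtime argument:
// enum alpha = (fg[3] & 0xff) + 1;

// immutable is computed once at runtime and never changes:
immutable int alpha = (fg[3] & 0xff) + 1;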
Nov 22 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Craig Dillabaugh:

 Do you want to use a ubyte instead of a byte here?
See: http://d.puremagic.com/issues/show_bug.cgi?id=3850

Bye,
bearophile
Nov 22 2013
parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
 Craig Dillabaugh:

 Do you want to use a ubyte instead of a byte here?
 See: http://d.puremagic.com/issues/show_bug.cgi?id=3850

 Bye,
 bearophile
Yes, it is pretty easy to mix those up. A lot of my work is with 
images with single-byte pixels, so I am pretty used to using 
ubyte now. I can't remember if I ever used byte, but if I did it 
was likely a bug.

Craig
Nov 22 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Craig Dillabaugh:

 Yes, it is pretty easy to mix those up. A lot of my work is with
 images with single-byte pixels, so I am pretty used to using
 ubyte now. I can't remember if I ever used byte, but if I did it
 was likely a bug.
Vote for the bug if you like it :-)

Bye,
bearophile
Nov 22 2013
prev sibling parent "Mikko Ronkainen" <mikoro iki.fi> writes:
 Do you want to use a ubyte instead of a byte here?
Yes, that was a silly mistake. It seems that fixing it removed the need for all the masking operations, which gave the biggest speedup.
 Also, for your alpha channel:

 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 If fg[3] = 0 then inverseAlpha = 256, which is out of the range
 that can be stored in a ubyte.
I think my logic should be correct. The calculations are done with 
ints, and the result is then just cast down to a ubyte. The reason 
for the +1 is the >> 8, which divides by 256: with alpha in 1..256 
and invAlpha = 257 - alpha, an alpha byte of 0xff gives pure 
foreground and 0x00 gives pure background, without needing a 
division by 255.

class Framebuffer
{
   uint[] data;
   uint width;
   uint height;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint 
width, uint height, uint color)
{
   immutable ubyte* fg = cast(immutable ubyte*)&color;

   immutable uint alpha = fg[3] + 1;
   immutable uint invAlpha = 257 - alpha;

   immutable uint afg0 = alpha * fg[0];
   immutable uint afg1 = alpha * fg[1];
   immutable uint afg2 = alpha * fg[2];

   foreach (i; y .. y + height)
   {
     uint start = x + i * framebuffer.width;

     foreach (j; 0 .. width)
     {
       ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

       bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
       bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
       bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
       bg[3] = 0xff;
     }
   }
}

Can this be made faster with SIMD? (I don't know much about it; 
maybe the data and algorithm don't fit it?) Can this be 
parallelized with any real gains?
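On the parallelization question: one avenue worth measuring is 
std.parallelism, which can spread the row loop over the default 
task pool. Each row touches a disjoint slice of the framebuffer, 
so the rows can be blended independently; whether it actually wins 
depends on rectangle size, since for small rectangles the task 
overhead dominates. A hedged sketch, reusing the Framebuffer class 
above (drawRectangleParallel is a name made up here, and this is 
untested and unmeasured):

import std.parallelism : parallel;
import std.range : iota;

void drawRectangleParallel(Framebuffer framebuffer, uint x, uint y,
                           uint width, uint height, uint color)
{
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;
  immutable uint invAlpha = 257 - alpha;
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];

  // Rows are independent, so hand each one to a worker thread.
  foreach (i; parallel(iota(y, y + height)))
  {
    uint start = x + i * framebuffer.width;

    foreach (j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}

As for SIMD: core.simd provides vector types, but x86 has no packed 
8-bit multiply, so each channel would have to be widened to 16 bits 
first -- a much larger rewrite than the sketch above.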
Nov 22 2013