
digitalmars.D.learn - Optimization tips for alpha blending / rasterization loop

reply "Mikko Ronkainen" <mikoro iki.fi> writes:
I'm trying to learn some software rasterization stuff. Here's 
what I'm doing:

32-bit DMD on 64-bit Windows
Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR 
(this seems fastest to my CPU + GPU)
Framebuffer is thrown as-is to OpenGL and rendered as a textured quad.

Here's a simple rectangle drawing algorithm that also does alpha 
blending. I tried quite a few variations (for example without 
the byte casting, using ints and shifting instead), but none was 
as fast as this:

class Framebuffer
{
   int[] data;
   int width;
   int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y, int 
width, int height, int color)
{
   foreach (i; y .. y + height)
   {
     int start = x + i * framebuffer.width;

     foreach(j; 0 .. width)
     {
       byte* bg = cast(byte*)&framebuffer.data[start + j];
       byte* fg = cast(byte*)&color;

       int alpha = (fg[3] & 0xff) + 1;
       int inverseAlpha = 257 - alpha;

       bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * 
(bg[0] & 0xff)) >> 8);
       bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * 
(bg[1] & 0xff)) >> 8);
       bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * 
(bg[2] & 0xff)) >> 8);
       bg[3] = cast(byte)0xff;
     }
   }
}

I would like to make this as fast as possible as it is done for 
almost every pixel every frame.

Am I doing something stupid that is slowing things down? Cache 
thrashing, or even branch prediction errors? :)
Is this kind of algorithm + data even a candidate for SIMD usage?
Even though fg is of type byte, fg[0] can produce a value greater 
than 0xff. It needs to be (fg[0] & 0xff) to make things work. I 
wonder why?
Nov 21 2013
parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
wrote:
 I'm trying to learn some software rasterization stuff. Here's 
 what I'm doing:
 [...]
       byte* bg = cast(byte*)&framebuffer.data[start + j];
       byte* fg = cast(byte*)&color;

       int alpha = (fg[3] & 0xff) + 1;
       int inverseAlpha = 257 - alpha;
 [...]
 Even though fg is of type byte, fg[0] can produce a value 
 greater than 0xff. It needs to be (fg[0] & 0xff) to make things 
 work. I wonder why?
Do you want to use a ubyte instead of a byte here?

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range 
that can be stored in a ubyte.

Craig
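A minimal, self-contained sketch of the sign-extension issue being 
pointed at here (assuming the little-endian x86 layout the original 
code relies on; this example is an illustration, not code from the 
thread):

import std.stdio;

void main()
{
    uint color = 0xFF0080FF; // 0xAABBGGRR, so byte 1 is G = 0x80

    byte* sb = cast(byte*)&color;   // signed view
    ubyte* ub = cast(ubyte*)&color; // unsigned view

    // byte is signed: 0x80 sign-extends to -128 when promoted to
    // int, so arithmetic goes wrong without the & 0xff mask.
    writeln(sb[1]);        // -128
    writeln(sb[1] & 0xff); // 128
    writeln(ub[1]);        // 128 -- ubyte zero-extends, no mask needed
}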
Nov 21 2013
next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh 
wrote:
 On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
 wrote:
  [...]

 Do you want to use a ubyte instead of a byte here?

 Also, for your alpha channel:

 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 If fg[3] = 0 then inverseAlpha = 256, which is out of the range 
 that can be stored in a ubyte.

 Craig
If I'm right, all of these lines:

byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

are loop-invariant, so you can move them outside both foreach 
loops using an enum. You can also pre-calculate 
alpha * (fg[0] & 0xff) before the foreach.
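A minimal sketch of the hoisting suggested above (the names 
afg0..afg2 for the pre-multiplied channels are made up here for 
illustration):

// Loop-invariant: depends only on the color argument, not on i or j.
immutable ubyte* fg = cast(immutable ubyte*)&color;
immutable uint alpha = fg[3] + 1;
immutable uint invAlpha = 257 - alpha;

// Pre-multiplied foreground channels, one per color component.
immutable uint afg0 = alpha * fg[0];
immutable uint afg1 = alpha * fg[1];
immutable uint afg2 = alpha * fg[2];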
Nov 22 2013
parent "Andrea Fontana" <nospam example.com> writes:
On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
 On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh 
 wrote:
  [...]

 If I'm right, all of these lines:

 byte* fg = cast(byte*)&color;
 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 are loop-invariant, so you can move them outside both foreach 
 loops using an enum. You can also pre-calculate 
 alpha * (fg[0] & 0xff) before the foreach.
Of course I mean immutable, not enum :)
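For readers unfamiliar with the distinction: enum in D declares a 
compile-time manifest constant, so it cannot hold a value derived 
from the runtime color argument, while immutable is initialized 
once at runtime and never reassigned. A minimal sketch:

// enum must be computable at compile time -- this line would not
// compile, because fg points into a runtime argument:
// enum alpha = (fg[3] & 0xff) + 1;

// immutable is computed once at runtime and never changes:
immutable int alpha = (fg[3] & 0xff) + 1;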
Nov 22 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Craig Dillabaugh:

 Do you want to use a ubyte instead of a byte here?
See: http://d.puremagic.com/issues/show_bug.cgi?id=3850

Bye,
bearophile
Nov 22 2013
parent reply "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
 Craig Dillabaugh:

 Do you want to use a ubyte instead of a byte here?
 See: http://d.puremagic.com/issues/show_bug.cgi?id=3850

 Bye,
 bearophile
Yes, it is pretty easy to mix those up. A lot of my work is with 
images with single-byte pixels, so I am pretty used to using 
ubyte now. I can't remember if I ever used byte, but if I did it 
was likely a bug.

Craig
Nov 22 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Craig Dillabaugh:

 Yes, it is pretty easy to mix those up. A lot of my work is with
 images with single-byte pixels, so I am pretty used to using
 ubyte now. I can't remember if I ever used byte, but if I did it
 was likely a bug.
Vote for the bug if you like it :-)

Bye,
bearophile
Nov 22 2013
prev sibling parent "Mikko Ronkainen" <mikoro iki.fi> writes:
 Do you want to use a ubyte instead of a byte here?
Yes, that was a silly mistake. It seems that fixing it removed the need for all the masking operations, which gave the biggest speedup.
 Also, for your alpha channel:

 int alpha = (fg[3] & 0xff) + 1;
 int inverseAlpha = 257 - alpha;

 If fg[3] = 0 then inverseAlpha = 256, which is out of the range
 that can be stored in a ubyte.
I think my logic should be correct. The calculations are done with 
ints, and the result is then just cast down to a ubyte. The reason 
for the +1 is the >> 8, which divides by 256: with alpha in 1..256 
and invAlpha = 257 - alpha, an alpha byte of 0xff gives pure 
foreground and 0x00 gives pure background, without needing a 
division by 255.

class Framebuffer
{
   uint[] data;
   uint width;
   uint height;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint 
width, uint height, uint color)
{
   immutable ubyte* fg = cast(immutable ubyte*)&color;

   immutable uint alpha = fg[3] + 1;
   immutable uint invAlpha = 257 - alpha;

   immutable uint afg0 = alpha * fg[0];
   immutable uint afg1 = alpha * fg[1];
   immutable uint afg2 = alpha * fg[2];

   foreach (i; y .. y + height)
   {
     uint start = x + i * framebuffer.width;

     foreach (j; 0 .. width)
     {
       ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

       bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
       bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
       bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
       bg[3] = 0xff;
     }
   }
}

Can this be made faster with SIMD? (I don't know much about it; 
maybe the data and algorithm don't fit it?) Can this be 
parallelized with any real gains?
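On the parallelization question: one avenue worth measuring is 
std.parallelism, which can spread the row loop over the default 
task pool. Each row touches a disjoint slice of the framebuffer, 
so the rows can be blended independently; whether it actually wins 
depends on rectangle size, since for small rectangles the task 
overhead dominates. A hedged sketch, reusing the Framebuffer class 
above (drawRectangleParallel is a name made up here, and this is 
untested and unmeasured):

import std.parallelism : parallel;
import std.range : iota;

void drawRectangleParallel(Framebuffer framebuffer, uint x, uint y,
                           uint width, uint height, uint color)
{
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;
  immutable uint invAlpha = 257 - alpha;
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];

  // Rows are independent, so hand each one to a worker thread.
  foreach (i; parallel(iota(y, y + height)))
  {
    uint start = x + i * framebuffer.width;

    foreach (j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}

As for SIMD: core.simd provides vector types, but x86 has no packed 
8-bit multiply, so each channel would have to be widened to 16 bits 
first -- a much larger rewrite than the sketch above.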
Nov 22 2013