digitalmars.D - Fast 2D matrix of bits

bearophile (5/5) Sep 19 2011 In Phobos (beside fast 2D/3D/4D vectors) I'd like a fast 2D matrix of bi...

Timon Gehr (13/22) Sep 19 2011 The implementation uses int and uint everywhere. I think it is not 64

bearophile (11/20) Sep 19 2011 For me a huge unsigned value that I can't catch in the ctor precondition...

Josh Simmons (20/40) Sep 19 2011 is worse than a negative value that I am able to catch. I don't need to ...

bearophile (4/16) Sep 20 2011 My version with bsr is faster.

Denis Shelomovskij (35/51) Sep 20 2011 Is it? Every conditional branch can break CPU command queue and slow

bearophile (7/10) Sep 22 2011 Hours ago Don has written a comment in one of my enhancement requests th...

Josh Simmons (5/8) Sep 20 2011 Is that science or guessing?
Josh Simmons (7/18) Sep 20 2011 Ah, when the one I gave was slower it wasn't being unrolled by gcc,

Josh Simmons (2/3) Sep 19 2011 Which one exactly? Sometimes I wonder...

Paul D. Anderson (3/7) Sep 20 2011 There is another problem with bitmanip and 64-bit integers. See Bug #668...

bearophile <bearophileHUGS lycos.com> writes:

In Phobos (beside fast 2D/3D/4D vectors) I'd like a fast 2D matrix of bits. I
have added an implementation in attach in this enhancement request:

http://d.puremagic.com/issues/show_bug.cgi?id=6697

It's a common enough data structure, and despite being simple it's not obvious,
especially if you want it to be fast. A naive implementation that uses a
BitArray is not fast enough.

Bye,
bearophile

Sep 19 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/20/2011 12:19 AM, bearophile wrote:
 In Phobos (beside fast 2D/3D/4D vectors) I'd like a fast 2D matrix of bits. I
have added an implementation in attach in this enhancement request:

 http://d.puremagic.com/issues/show_bug.cgi?id=6697

 It's a common enough data structure, and despite being simple it's not
obvious, especially if you want it to be fast. A naive implementation that uses
a BitArray is not fast enough.

 Bye,
 bearophile

The implementation uses int and uint everywhere. I think it is not 64 
bit aware?

 // sizex_, sizey_ are signed to catch negative arguments bugs.

There would be no negative arguments if they were unsigned.


 // opIndex not used because it's slower on LDC

Why is it slower? The structure should imho certainly use opIndex(Assign).


Since you want it to be really fast, this occurs multiple times:

 immutable int index = x + (y << this.pow2); // 1D bit index
 this._bits[index / nbits_item] |= 1 << (index % nbits_item);

nbits_item is a power of two. So the divisions/modulo have the potential 
to be really fast. But the signedness of index makes some 
straightforward optimizations very hard to carry out for the compiler. 
(it will probably use bit shifts and logical and, but it has to emit 
something more to cope with (im)possible negative values) Try making 
index unsigned in those cases, it'll probably save some machine 
instructions. (size_t is always your best bet for indexes)

Sep 19 2011

bearophile <bearophileHUGS lycos.com> writes:

Timon Gehr:

 The implementation uses int and uint everywhere. I think it is not 64 
 bit aware?

I have not even tried it on a 64 bit system.


  > // sizex_, sizey_ are signed to catch negative arguments bugs.
 
 There would be no negative arguments if they were unsigned.

For me a huge unsigned value that I can't catch in the ctor precondition is
worse than a negative value that I am able to catch. I don't need to create a
2D matrix with sides bigger than int.max.
I agree that all the other methods are probably better with size_t arguments.
I'll fix it.


 Why is it slower?

Ask it to LDC1 developers :-)
Adding that method is easy, if you want it. And I agree it's handy, with syntax
mat[x,y] = true/false.
But note the preferred methods to use this matrix are set/reset, because they
are faster than the assign. I'll think about it.


 But the signedness of index makes some 
 straightforward optimizations very hard to carry out for the compiler. 
 [...] but it has to emit something more to cope with (im)possible negative
values)

Right, index needs to be uint or size_t. I'll fix it.

Thank you for your suggestions and notes,
bye,
bearophile

Sep 19 2011

Josh Simmons <simmons.44 gmail.com> writes:

On Tue, Sep 20, 2011 at 10:24 AM, bearophile <bearophileHUGS lycos.com> wro=
te:
 Timon Gehr:

 The implementation uses int and uint everywhere. I think it is not 64
 bit aware?

 I have not even tried it on a 64 bit system.


 =C2=A0> // sizex_, sizey_ are signed to catch negative arguments bugs.

 There would be no negative arguments if they were unsigned.

 For me a huge unsigned value that I can't catch in the ctor precondition =

is worse than a negative value that I am able to catch. I don't need to cre=
ate a 2D matrix with sides bigger than int.max.
 I agree that all the other methods are probably better with size_t argume=

nts. I'll fix it.
 Why is it slower?

 Ask it to LDC1 developers :-)
 Adding that method is easy, if you want it. And I agree it's handy, with =

syntax mat[x,y] =3D true/false.
 But note the preferred methods to use this matrix are set/reset, because =

they are faster than the assign. I'll think about it.
 But the signedness of index makes some
 straightforward optimizations very hard to carry out for the compiler.
 [...] but it has to emit something more to cope with (im)possible negati=


ve values)
 Right, index needs to be uint or size_t. I'll fix it.

 Thank you for your suggestions and notes,
 bye,
 bearophile

Zero is a power of two too :(

You could also just...

uint32_t next_pow_2(uint32_t n)
{
    n -=3D 1;
    n |=3D n >> 1;
    n |=3D n >> 2;
    n |=3D n >> 4;
    n |=3D n >> 8;
    n |=3D n >> 16;
    return n + 1;
}

Sep 19 2011

bearophile <bearophileHUGS lycos.com> writes:

Josh Simmons:

 You could also just...
 
 uint32_t next_pow_2(uint32_t n)
 {
     n -= 1;
     n |= n >> 1;
     n |= n >> 2;
     n |= n >> 4;
     n |= n >> 8;
     n |= n >> 16;
     return n + 1;
 }

My version with bsr is faster.

Bye,
bearophile

Sep 20 2011

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

20.09.2011 10:22, bearophile �����:
 Josh Simmons:

 You could also just...

 uint32_t next_pow_2(uint32_t n)
 {
      n -= 1;
      n |= n>>  1;
      n |= n>>  2;
      n |= n>>  4;
      n |= n>>  8;
      n |= n>>  16;
      return n + 1;
 }

 My version with bsr is faster.

 Bye,
 bearophile

Is it? Every conditional branch can break CPU command queue and slow 
down execution. Anyway, I saw same problem not far-away (2^n sized 
OpenGL textures). IMHO (too lazy to make tests) it is the fastest:

---
bool isPow2(int n)
in { assert(n > 0); }
body { return !((n - 1) & n); }

unittest {
	assert(!isPow2(0));
	assert(isPow2(1));
	assert(isPow2(2));
	assert(!isPow2(3));
	assert(isPow2(4));
	assert(!isPow2(5));
}

int toPow2(int n)
in { assert(n > 0); }
body {
	return 1 << (bsr(n) + !isPow2(n));
}

unittest {
	assert(toPow2(1) == 1);
	assert(toPow2(2) == 2);
	assert(toPow2(3) == 4);
	assert(toPow2(4) == 4);
	assert(toPow2(5) == 8);
}
---


A short version (AFAIK, !! is one lexem for dmd):
---
int toPow2(int n) {
	return 1 << (bsr(n) + !!((n - 1) & n));
}
---

Sep 20 2011

bearophile <bearophileHUGS lycos.com> writes:

Denis Shelomovskij:

 Is it? Every conditional branch can break CPU command queue and slow 
 down execution. Anyway, I saw same problem not far-away (2^n sized 
 OpenGL textures).

Hours ago Don has written a comment in one of my enhancement requests that is
related to this topic:
http://d.puremagic.com/issues/show_bug.cgi?id=4046

I will do better benchmarks, and if necessary I'll update that helper function
in the bit matrix code in Bugzilla (I have already fixed most of the other
suggestions given by Timon Gehr).

If Don is right, this asks for some deprecation in core.bitops.

Bye,
bearophile

Sep 22 2011

Josh Simmons <simmons.44 gmail.com> writes:

On Tue, Sep 20, 2011 at 5:22 PM, bearophile <bearophileHUGS lycos.com> wrote:
 My version with bsr is faster.

 Bye,
 bearophile

Is that science or guessing?

My horribly unscientific test shows the opposite to be true, I'm
looking over the assembly output to see if there's an extraneous
factor.

Sep 20 2011

Josh Simmons <simmons.44 gmail.com> writes:

On Tue, Sep 20, 2011 at 6:36 PM, Josh Simmons <simmons.44 gmail.com> wrote:
 On Tue, Sep 20, 2011 at 5:22 PM, bearophile <bearophileHUGS lycos.com> wrote:
 My version with bsr is faster.

 Bye,
 bearophile

 Is that science or guessing?

 My horribly unscientific test shows the opposite to be true, I'm
 looking over the assembly output to see if there's an extraneous
 factor.

Ah, when the one I gave was slower it wasn't being unrolled by gcc,
when yours was slower my trivial loop was being vectorised. When
they're both handled the same the results are the same not favoring
either.

Cool.

I do like that the propagate-right version can be vectorised though.

Sep 20 2011

Josh Simmons <simmons.44 gmail.com> writes:

On Tue, Sep 20, 2011 at 12:03 PM, Josh Simmons <simmons.44 gmail.com> wrote:
 Zero is a power of two too :(

Which one exactly? Sometimes I wonder...

Sep 19 2011

Paul D. Anderson <paul.d.removethis.anderson comcast.andthis.net> writes:

Timon Gehr Wrote:

 
 The implementation uses int and uint everywhere. I think it is not 64 
 bit aware?
 



Paul

Sep 20 2011

D Programming

C/C++ Programming

Other

digitalmars.D - Fast 2D matrix of bits