
digitalmars.D.learn - core.bitop.bt not faster than & ?

reply "Trollgeir" <asgeir.berland gmail.com> writes:
I've been benchmarking the bt function 
(http://dlang.org/phobos/core_bitop.html) against a regular 
bitwise &-operator, and I'm a bit confused by the timings.

I'd expect the bt function to be up to 32 times faster, as I 
thought it only compared two bits rather than the entire length of 
bits in the uint. The results I got indicated that there was no 
difference between that and just a regular & condition. In most 
cases, the & was actually a tad faster!

Ex.

bt(&element, 16) vs (element & 0b00000000000000010000000000000000)

Can anyone confirm this?


I used a simple std.datetime benchmark template iterating over a 
static array[100000], trying both random values for each entry 
and fixed ones.
Dec 17 2014
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote:
 I'd expect the bt function to be up to 32 times faster as I 
 thought it only compared two bits, and not the entire length of 
 bits in the uint.
The processor doesn't work in terms of bits like that - it still needs to look at the whole integer. In fact, according to my (OLD) asm reference, the bt instruction is slower than the and instruction at the cpu level. I think it has to do a wee bit more work, translating the 16 into a mask, then moving the result into the flag... then moving the flag back into a register to return the value.

(That last step could probably be skipped if you do an if() on it and the compiler optimizes the branch, and the first step might be skipped too if it is a constant, since the compiler can rewrite the instruction. So given that, I'd expect what you saw: no difference when they are optimized to the same thing or when the CPU's stars align right, and & a bit faster when bt isn't optimized.)

bt() and friends are special instructions for specialized use cases. Probably useful for threading and stuff.
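That index-to-mask work can be sketched in C (an illustration of the idea only, not the actual core.bitop implementation, which operates on size_t words):

```c
#include <stddef.h>
#include <stdint.h>

/* A sketch of the extra work bt has to do: turn a bit number into a
   word offset plus a mask before the actual one-instruction test. */
static int bt_sketch(const uint32_t *p, size_t bitnum) {
    size_t   word = bitnum / 32;          /* which word holds the bit */
    uint32_t mask = 1u << (bitnum % 32);  /* index -> mask: the extra work */
    return (p[word] & mask) != 0;         /* the test itself is one AND */
}
```

With a constant bit number an optimizer can fold the first two lines away, which would explain seeing no difference in the benchmark.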
Dec 17 2014
parent reply "Trollgeir" <asgeir.berland gmail.com> writes:
On Wednesday, 17 December 2014 at 14:58:13 UTC, Adam D. Ruppe 
wrote:
 On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote:
 I'd expect the bt function to be up to 32 times faster as I 
 thought it only compared two bits, and not the entire length 
 of bits in the uint.
 The processor doesn't work in terms of bits like that - it still needs to look at the whole integer. In fact, according to my (OLD) asm reference, the bt instruction is slower than the and instruction at the cpu level. I think it has to do a wee bit more work, translating the 16 into a mask, then moving the result into the flag... then moving the flag back into a register to return the value.

 (That last step could probably be skipped if you do an if() on it and the compiler optimizes the branch, and the first step might be skipped too if it is a constant, since the compiler can rewrite the instruction. So given that, I'd expect what you saw: no difference when they are optimized to the same thing or when the CPU's stars align right, and & a bit faster when bt isn't optimized.)

 bt() and friends are special instructions for specialized use cases. Probably useful for threading and stuff.
Thanks for the explanation, I suspected it might work something like that.

For my implementation, I have bits shifting to the right every update, and I want to check whether they have reached certain markers. Hence, I felt it was really inefficient to check every single bit in the uint when I was only interested in some specific ones. Is an alternative (optimized) version of bt even possible?
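A hedged C sketch of that use case (the marker positions are invented): because the interesting positions are fixed, they can be folded into one precomputed mask, so each update is tested with a single shift and a single AND no matter how many markers there are.

```c
#include <stdint.h>

/* Hypothetical sketch: a word of bits drifting right once per update,
   tested against fixed marker positions chosen for illustration. */
#define MARKERS ((1u << 0) | (1u << 7) | (1u << 15))

static uint32_t update(uint32_t bits) {
    return bits >> 1;              /* everything shifts one slot right */
}

static int hit_marker(uint32_t bits) {
    /* One AND tests all 32 positions in parallel; the cost does not
       grow with the number of markers. */
    return (bits & MARKERS) != 0;
}
```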
Dec 17 2014
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Wed, Dec 17, 2014 at 04:08:40PM +0000, Trollgeir via Digitalmars-d-learn
wrote:
 On Wednesday, 17 December 2014 at 14:58:13 UTC, Adam D. Ruppe wrote:
On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote:
I'd expect the bt function to be up to 32 times faster as I thought
it only compared two bits, and not the entire length of bits in the
uint.
 The processor doesn't work in terms of bits like that - it still needs to look at the whole integer. In fact, according to my (OLD) asm reference, the bt instruction is slower than the and instruction at the cpu level. I think it has to do a wee bit more work, translating the 16 into a mask, then moving the result into the flag... then moving the flag back into a register to return the value.

 (That last step could probably be skipped if you do an if() on it and the compiler optimizes the branch, and the first step might be skipped too if it is a constant, since the compiler can rewrite the instruction. So given that, I'd expect what you saw: no difference when they are optimized to the same thing or when the CPU's stars align right, and & a bit faster when bt isn't optimized.)

 bt() and friends are special instructions for specialized use cases. Probably useful for threading and stuff.
 Thanks for the explanation, I suspected it might work something like that.

 For my implementation - I have bits shifting to the right every update, and I want to check if it has reached certain markers. Hence, I felt it was really inefficient to check every single bit in the uint when I was only interested in some specific ones. Is an alternative (optimized) version of bt even possible?
That's an inaccurate understanding of how the CPU works. It does not check the bits "one by one"; the & operation translates to a single asm instruction that performs the bitwise-and of *all* bits in parallel. There is no faster implementation.

What causes the slowdown is probably the conversion of the bit index into a mask. You could alleviate this by precomputing the mask: if you're testing bits in the same position in many values, you pay the cost of computing the mask once and reuse it many times thereafter, instead of computing it every single time.

T

-- 
Computers shouldn't beep through the keyhole.
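The precomputation described above might look like this in C (the helper name and loop are illustrative, not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* When many values are tested for the same bit position, build the
   mask once outside the loop instead of deriving it from the index on
   every iteration. */
static size_t count_bit_set(const uint32_t *a, size_t n, unsigned bit) {
    uint32_t mask = 1u << bit;      /* computed once, reused n times */
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (a[i] & mask) != 0; /* a single AND per element */
    return hits;
}
```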
Dec 17 2014
prev sibling parent "Dicebot" <public dicebot.lv> writes:
Having a quick read through 
http://en.wikipedia.org/wiki/Word_%28computer_architecture%29 may 
help recalibrate the way you think about bit operations and 
optimization.
Dec 18 2014