digitalmars.D - Counting bits in a ulong

H. S. Teoh (97/97) Jun 03 2020 So recently I needed to implement a function to count the number of 1

Paul Backus (5/14) Jun 03 2020 Interesting stuff. Did you compare your implementations with the
ketmar (5/17) Jun 03 2020 actually, that branch has very good prediction rate. it should be predic...
Patrick Schluter (8/14) Jun 04 2020 You should try to use the popcnt instruction of the cpu (in asm

H. S. Teoh (11/24) Jun 04 2020 [...]

Patrick Schluter (14/41) Jun 04 2020 Cool, I just wanted to comment as I have myself a Phenom II X6

wjoe (10/33) Jun 04 2020 Did you run the benchmark on different CPU architectures ?
H. S. Teoh (29/38) Jun 04 2020 On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalma...
Dominikus Dittes Scherkl (14/44) Jun 04 2020 wikipedia recommends this:
ttk (25/27) Jun 05 2020 This made me wonder how well a lookup table approach would

H. S. Teoh (29/59) Jun 05 2020 You might be running into alignment issues, possibly. For one thing, why
H. S. Teoh (157/170) Jun 05 2020 [...]
sebasg (3/7) Jun 05 2020 Jesus. Shouldn't you be able to generate that instead of

ttk (9/17) Jun 08 2020 A fair point. Duplicating it in emacs was literally five

H. S. Teoh (11/23) Jun 08 2020 No templates needed, just use static foreach. ;-) (See my other reply

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop.  There's of course the
na�ve method (`val` is the input ulong):

        uint count;
        foreach (i; 0 .. 64)
        {
            if (val & (1L << i))
                count++;
        }
        return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit
counter based on the idea from [1]:

        // Parallel bit count.
        // We count upper/lower halves separately, interspersed to maximize
        // hyperthreading. >:-)
        uint v1 = cast(uint)(val >> 32);
        uint v2 = cast(uint)val & 0xFFFF_FFFF;

        v1 = v1 - ((v1 >> 1) & 0x5555_5555);
        v2 = v2 - ((v2 >> 1) & 0x5555_5555);
        v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
        v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
        v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
        v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
        v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
        v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
        v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
        v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

        return cast(ulong)v1 + cast(ulong)v2;

[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's,
then in 8's, and so on, inside the ulong itself.  I furthermore split it
into upper and lower halves and interleaved the two halves to cut the
data dependency between instructions and (hopefully) increase the
opportunity for hyperthreading / increase out-of-order instruction
execution.  This is completely branchless and in theory should maximize
throughput. Unfortunately it comes at the cost of high instruction
count.

The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:

        uint count;
        for (count = 0; val; count++)
        {
            val &= val - 1;
        }
        return count;

(How it does this is left as an exercise for the reader. It's really
quite clever.)  This comes with the advantage of a very low instruction
count, but it does have a branch that isn't easily predictable, so
actual execution time is marginally slower.


I did some measurements with real data (not just a micro benchmark, this
is the actual algorithm with the bitcount function in its hotspot), and
as expected the na�ve bitcount performed the worst, about 2x slower.
However, between Kernighan and the parallel bit counter, the results
depend on the kind of input given.  For program inputs where there is a
large number of 1's in the average ulong to be counted, the parallel bit
counter performs better. Here's one example of such an input case:

	Na�ve:
	real	0m4.159s
	user	0m4.172s
	sys	0m0.024s

	Kernighan:
	real	0m2.114s
	user	0m2.129s
	sys	0m0.020s

	Parallel:
	real	0m2.024s
	user	0m2.039s
	sys	0m0.028s

As you can see, the advantage of the parallel count is only slightly
(albeit consistently) better than Kernighan.

However, when the input tends to generate ulongs with a low saturation
of 1's, Kernighan wins out by quite a large margin:

	Na�ve:
	real	0m5.661s
	user	0m5.706s
	sys	0m0.054s

	Kernighan:
	real	0m3.080s
	user	0m3.123s
	sys	0m0.058s

	Parallel:
	real	0m3.805s
	user	0m3.844s
	sys	0m0.056s

This illustrates how optimization is often input-dependent, and what
works well in one case may work not so well in another case. And even an
overall better choice (in this case Kernighan's trick) will in some
niche cases perform less well.

This goes to reinforce the point Andrei recently made, that sometimes if
you optimize for a micro-benchmark, you can end up with code that looks
good on the benchmark but not so good on real-world data -- you end up
optimizing for the benchmark rather than for the real-world use case.
Real-world optimization isn't so straightforward as minimizing a single
number; it's often a multi-dimensional problem. :-P


T

-- 
Amateurs built the Ark; professionals built the Titanic.

Jun 03 2020

Paul Backus <snarwin gmail.com> writes:

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):

[...]
 This illustrates how optimization is often input-dependent, and 
 what works well in one case may work not so well in another 
 case. And even an overall better choice (in this case 
 Kernighan's trick) will in some niche cases perform less well.

Interesting stuff. Did you compare your implementations with the 
popcnt function in core.bitop? There could be a potential 
improvement here.

Jun 03 2020

ketmar <ketmar ketmar.no-ip.org> writes:

H. S. Teoh wrote:

 The third alternative is known as Kernighan's trick, which involves a
 loop that executes exactly as many times as the number of 1 bits:

         uint count;
         for (count = 0; val; count++)
         {
             val &= val - 1;
         }
         return count;

 (How it does this is left as an exercise for the reader. It's really
 quite clever.)  This comes with the advantage of a very low instruction
 count, but it does have a branch that isn't easily predictable, so
 actual execution time is marginally slower.

actually, that branch has very good prediction rate. it should be predicted 
most of the time, and when predictor failed, it is mostly doesn't matter, 
because we're done with the loop. coincidentally, this is what your 
benchmarking results demonstrates. ;-)

Jun 03 2020

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):

 [...]

You should try to use the popcnt instruction of the cpu (in asm 
or with intrinsics). It's at least 3 or 4 order of magnitude 
(i.e. between 1000 and 10000 times) faster than any of these 
tricks. I made the change last year in my work codebase since 
we've left the SPARC servers (they had a very slow implementation 
of popcount).
All modern CPU's have the instruction.

Jun 04 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
 On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
 [...]
 Interesting stuff. Did you compare your implementations with the
 popcnt function in core.bitop? There could be a potential improvement
 here.

 
 On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d
wrote:
 [...]
 You should try to use the popcnt instruction of the cpu (in asm or
 with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
 between 1000 and 10000 times) faster than any of these tricks.


[...]
 As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
 instruction. :-(

[...]

Actually, I was wrong, my CPU *does* have this instruction, but I needed
to pass the right -mcpu flag to LDC before it will emit it.  After
turning it on, I got a huge performance boost; it completely moved the
function that calls popcnt out of the hotspot into the background! Now
it's other functions that have become the bottleneck.  Cool stuff!


T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about
all those buttons marked "submit"??

Jun 04 2020

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote:
 On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via 
 Digitalmars-d wrote:
 On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via 
 Digitalmars-d wrote: [...]
 Interesting stuff. Did you compare your implementations with 
 the popcnt function in core.bitop? There could be a 
 potential improvement here.

 
 On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via 
 Digitalmars-d wrote: [...]
 You should try to use the popcnt instruction of the cpu (in 
 asm or
 with intrinsics). It's at least 3 or 4 order of magnitude 
 (i.e.
 between 1000 and 10000 times) faster than any of these 
 tricks.


 [...]
 As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't 
 support the
 instruction. :-(

 [...]

 Actually, I was wrong, my CPU *does* have this instruction, but 
 I needed to pass the right -mcpu flag to LDC before it will 
 emit it.  After turning it on, I got a huge performance boost; 
 it completely moved the function that calls popcnt out of the 
 hotspot into the background! Now it's other functions that have 
 become the bottleneck.  Cool stuff!

Cool, I just wanted to comment as I have myself a Phenom II X6 
1090T and was wondering.
Btw, if you want to use the Kernighan function for values with a 
lot of set bits, count the inverted value:

That's what I had in my code (gcc C)

DEFINE_INLINE  int PopCountFewClr(uint64_t w)
{
#ifdef __POPCNT__
   return __builtin_popcountll(w);
#else
   return CHAR_BIT*sizeof w - PopCountFewSet(~w);
#endif
}

Jun 04 2020

wjoe <invalid example.com> writes:

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 I did some measurements with real data (not just a micro 
 benchmark, this is the actual algorithm with the bitcount 
 function in its hotspot), and as expected the naïve bitcount 
 performed the worst, about 2x slower. However, between 
 Kernighan and the parallel bit counter, the results depend on 
 the kind of input given.  For program inputs where there is a 
 large number of 1's in the average ulong to be counted, the 
 parallel bit counter performs better. Here's one example of 
 such an input case:

 	Naïve:
 	real	0m4.159s
 	user	0m4.172s
 	sys	0m0.024s

 	Kernighan:
 	real	0m2.114s
 	user	0m2.129s
 	sys	0m0.020s

 	Parallel:
 	real	0m2.024s
 	user	0m2.039s
 	sys	0m0.028s

 As you can see, the advantage of the parallel count is only 
 slightly (albeit consistently) better than Kernighan.


Did you run the benchmark on different CPU architectures ?

I recently benchmarked code which was performing a calculation 
one of 2 ways.
1st way was to calculate for real, the other used look up tables.
I ran the test on an i7 haswell and an i5 skylake.
The i5 was performing faster with look up tables, the i7 
performance was worse using look up tables.
The performance impact was similar to your 1st benchmark Parallel 
vs Kernighan.

Jun 04 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
 Interesting stuff. Did you compare your implementations with the
 popcnt function in core.bitop? There could be a potential improvement
 here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d
wrote:
[...]
 You should try to use the popcnt instruction of the cpu (in asm or
 with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
 between 1000 and 10000 times) faster than any of these tricks. I made
 the change last year in my work codebase since we've left the SPARC
 servers (they had a very slow implementation of popcount).
 All modern CPU's have the instruction.

Haha, I learn something new every day. Didn't even know there was such a
CPU instruction as popcnt. :-)

But what's more interesting is the result of using core.bitop.popcnt: as
it turns out, it runs faster than Kernighan when the input ulongs have
high bit count, but *slower* than Kernighan when the inputs have low bit
density. This was very puzzling to me, since a CPU instruction is
supposed to be orders of magnitude faster, as Patrick said, so I looked
at the disassembly.

As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
instruction. :-(  So the LDC intrinsic reverted to a software
implementation, which is essentially the parallel bitcount I wrote,
except better. Evidently, I miscalculated the impact of data dependency
between instructions, or more likely, the AMD CPU doesn't have
hyperthreading so splitting the ulong into upper/lower halves did more
harm than good for the calculation.

But nonetheless, even the better implementation of parallel bitcount
loses out to Kernighan when the input does not have a high bit density.

I do have a VPS that runs on an Intel CPU, but apparently also an older
one that doesn't support SSE4 (either that, or the hypervisor filtered
that out for whatever reason, but I doubt this).

So again we see the complex landscape of optimization: not only the
optimality depend on input data, it also depends on the hardware. :-)


T

-- 
Век живи - век учись. А дураком помрёшь.

Jun 04 2020

Dominikus Dittes Scherkl <dominikus scherkl.de> writes:

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):

         uint count;
         foreach (i; 0 .. 64)
         {
             if (val & (1L << i))
                 count++;
         }
         return count;

 Knowing that this code is inside a hotspot, I wrote a parallel 
 bit counter based on the idea from [1]:

         // Parallel bit count.
         // We count upper/lower halves separately, interspersed 
 to maximize
         // hyperthreading. >:-)
         uint v1 = cast(uint)(val >> 32);
         uint v2 = cast(uint)val & 0xFFFF_FFFF;

         v1 = v1 - ((v1 >> 1) & 0x5555_5555);
         v2 = v2 - ((v2 >> 1) & 0x5555_5555);
         v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
         v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
         v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
         v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
         v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
         v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
         v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
         v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

wikipedia recommends this:
     x -= (x >> 1) & m1;
     x = (x & m2) + ((x >> 2) & m2);
     x = (x + (x >> 4)) & m4;
     x += x >>  8;  //put count of each 16 bits into their lowest 
8 bits
     x += x >> 16;  //put count of each 32 bits into their lowest 
8 bits
     x += x >> 32;  //put count of each 64 bits into their lowest 
8 bits
     return x & 0x7f;
so in the last 4 lines you do too much work.
better use the intrinsics: popcnt(val>>32)+popcnt(val & uint.max)

Jun 04 2020

ttk <ttk ciar.org> writes:

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would 
compare to these logical methods (and it was an excuse to blow 
off work a bit to fiddle with D).

The popcnt instruction is going to blow away all of these, of 
course, but it still seems like a worthwhile exercise, if only to 
get a feel for how purely in-register logic performance compares 
to cache access performance on modern "big" CPUs.

ttk kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive:         5596740 iter/sec
Parallel:      166352970 iter/sec
Kernighan:     47668800 iter/sec
Lookup 8-bit:  467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to 
2.0x the performance of the 8-bit lookup, since both lookup 
tables fit well within the L1 cache.

Source is here .. a bit large, since I manually unrolled loops 
100x to minimize the impact of looping logic (particularly the 
now() call):

http://ciar.org/h/popcount.d

I ran this on an Intel Core i7-9750M   2.6GHz, with 192KB L1 / 
1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O 
popcount.d" on Slackware-current.

Jun 05 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
 On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the number of
 1 bits in a ulong inside a (very) hot inner loop.

 
 This made me wonder how well a lookup table approach would compare to
 these logical methods (and it was an excuse to blow off work a bit to
 fiddle with D).

Hooray for having an excuse to play with D! ;-)


 The popcnt instruction is going to blow away all of these, of course,
 but it still seems like a worthwhile exercise, if only to get a feel
 for how purely in-register logic performance compares to cache access
 performance on modern "big" CPUs.
 
 ttk kirov:/home/ttk/prog/D$ ./popcount
 Starting tests, which take about 10 seconds each
 Naive:         5596740 iter/sec
 Parallel:      166352970 iter/sec
 Kernighan:     47668800 iter/sec
 Lookup 8-bit:  467161540 iter/sec
 Lookup 16-bit: 826179570 iter/sec
 
 It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the
 performance of the 8-bit lookup, since both lookup tables fit well
 within the L1 cache.

You might be running into alignment issues, possibly. For one thing, why
does LOOKUP_1 have 257 elements instead of 256? That last byte is never
accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
element is redundant.


 Source is here .. a bit large, since I manually unrolled loops 100x to
 minimize the impact of looping logic (particularly the now() call):
 
 http://ciar.org/h/popcount.d

There's no need to manually unroll the code: just use static
foreach, since the code is identical every iteration! :-P


 I ran this on an Intel Core i7-9750M   2.6GHz, with 192KB L1 / 1.5MB
 L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
 popcount.d" on Slackware-current.

I don't trust performance results with dmd, so I tested it on `ldc2 -O3`
instead:

	Starting tests, which take about 10 seconds each
	Naive:         25665230 iter/sec
	Parallel:      124692900 iter/sec
	Kernighan:     40680010 iter/sec
	Lookup 8-bit:  1532505390 iter/sec
	Lookup 16-bit: 1690195340 iter/sec

Something is up with the 16-bit lookups. Why is it only barely faster
than 8-bit lookup?

Disassembling shows that the code was *not* duplicated 100 times;
apparently LDC's aggressive optimizer noticed that `count` is reassigned
every unrolled iteration, so all except the last are no-ops. You need to
do something with `count` that isn't immediately killed by the next
unrolled iteration, otherwise your painstakingly copy-n-pasted
iterations will all be elided by the optimizer. :-P

OR'ing `count` into ACCUMULATOR after every count is probably what you
need to do here.


T

-- 
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a
primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on
Microsoft's strategy" -- Bill G., 1999.

Jun 05 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jun 05, 2020 at 12:45:36PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 You might be running into alignment issues, possibly. For one thing, why
 does LOOKUP_1 have 257 elements instead of 256? That last byte is never
 accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
 element is redundant.

[...]
 Disassembling shows that the code was *not* duplicated 100 times;
 apparently LDC's aggressive optimizer noticed that `count` is reassigned
 every unrolled iteration, so all except the last are no-ops. You need to
 do something with `count` that isn't immediately killed by the next
 unrolled iteration, otherwise your painstakingly copy-n-pasted
 iterations will all be elided by the optimizer. :-P
 
 OR'ing `count` into ACCUMULATOR after every count is probably what you
 need to do here.

[...]

So I made the above changes, and here are the new measurements:

	Starting tests, which take about 10 seconds each
	Naive:         25272530 iter/sec
	Parallel:      133259870 iter/sec
	Kernighan:     36447830 iter/sec
	Lookup 8-bit:  133837870 iter/sec
	Lookup 16-bit: 283343300 iter/sec

Now the 16-bit lookup makes a lot more sense! :-D  (I also confirmed via
disassembly that the code is indeed unrolled 100 times, and not just
elided.)

Interestingly, the parallel version seems to perform almost just as well
as the 8-bit lookup.


Here's the fixed code (it's a lot shorter :-P):

------------------------------------------------------
import std.stdio;
import std.datetime.systime;

ubyte [257]   LOOKUP_1; // 8-bit lookup table
ubyte [65537] LOOKUP_2; // 16-bit lookup table
const ulong STEPPER = 0x110D0B0705030201;  // val increment step, to minimize
cache effects
ulong ACCUMULATOR = 0;  // might keep optimizer from noticing we never use
count and tossing it
double TEST_DURATION = 10.0;

// High-resolution time()-like, copied from VS_Common_D:
double now() {
    double tm = Clock.currStdTime();
    return (tm / 10000000.0) - 62135596800.0;
}

ubyte popcount_naive(ulong val) {
    ubyte count;
    foreach (i; 0 .. 64) if (val & (1L << i)) count++;
    return count;
}

ulong microbench_naive() {
    ulong iter_count = 0;
    ulong val = 0;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    foreach (i; 0 .. 64) if (val & (1L << i)) count++; val += STEPPER;
	    ACCUMULATOR ^= count;
	}
	iter_count += 100;
    }
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_parallel() {
    ulong iter_count = 0;
    ulong val = 0;
    uint  count = 0;
    uint  v1;
    uint  v2;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    v1 = cast(uint)(val >> 32);
	    v2 = cast(uint)val & 0xFFFF_FFFF;
	    v1 = v1 - ((v1 >> 1) & 0x5555_5555);
	    v2 = v2 - ((v2 >> 1) & 0x5555_5555);
	    v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
	    v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
	    v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
	    v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
	    v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
	    v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
	    v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
	    v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
	    count += cast(ulong)v1 + cast(ulong)v2;
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_kernighan() {
    ulong iter_count = 0;
    ulong val = 0;
    ulong tval = 0;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    for (count = 0; val; count++) val &= val - 1;  val = tval += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_8bit() {
    ulong iter_count = 0;
    ulong val = 0;
    ubyte *av = cast(ubyte*) &val;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	// writeln("top: iter = ", iter_count, ", val = ", val, ", av = ", av, ", av0
= ", av[0], ", av1 = ", av[1], ", av2 = ", av[2], ", av3 = ", av[3], ", av4 =
", av[4], ", av5 = ", av[5], ", av6 = ", av[6], ", av7 = ", av[7]);
	static foreach (_; 0 .. 100) {
	    count = LOOKUP_1[av[0]] + LOOKUP_1[av[1]] + LOOKUP_1[av[2]] +
LOOKUP_1[av[3]] + LOOKUP_1[av[4]] + LOOKUP_1[av[5]] + LOOKUP_1[av[6]] +
LOOKUP_1[av[7]];
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_16bit() {
    ulong iter_count = 0;
    ulong val = 0;
    ushort *av = cast(ushort*) &val;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    count = LOOKUP_2[av[0]] + LOOKUP_2[av[1]] + LOOKUP_2[av[2]] +
LOOKUP_2[av[3]];
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

int main() {
    // initialize the lookup tables:
    foreach (i; 0 ..   256) LOOKUP_1[i] = popcount_naive(i);
    foreach (i; 0 .. 65536) LOOKUP_2[i] = popcount_naive(i);

    writeln("Starting tests, which take about ", TEST_DURATION, " seconds
each");
    writeln("Naive:         ", microbench_naive(),        " iter/sec");
    writeln("Parallel:      ", microbench_parallel(),     " iter/sec");
    writeln("Kernighan:     ", microbench_kernighan(),    " iter/sec");
    writeln("Lookup 8-bit:  ", microbench_lookup_8bit(),  " iter/sec");
    writeln("Lookup 16-bit: ", microbench_lookup_16bit(), " iter/sec");

    // trick optimizer into thinking ACCUMULATOR, and thus count, is important:
    return ACCUMULATOR / STEPPER;
}
------------------------------------------------------


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi

Jun 05 2020

sebasg <bcs6wg3tvw 4dh24.anonbox.net> writes:

On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
 Source is here .. a bit large, since I manually unrolled loops 
 100x to minimize the impact of looping logic (particularly the 
 now() call):

 http://ciar.org/h/popcount.d

Jesus. Shouldn't you be able to generate that instead of
copy-pasting? It's D, after all.

Jun 05 2020

ttk <ttk ciar.org> writes:

On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
 On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
 Source is here .. a bit large, since I manually unrolled loops 
 100x to minimize the impact of looping logic (particularly the 
 now() call):

 http://ciar.org/h/popcount.d

 Jesus. Shouldn't you be able to generate that instead of
 copy-pasting? It's D, after all.

A fair point.  Duplicating it in emacs was literally five 
keystrokes, but I should rewrite it to use templates anyway (more 
excuse to fiddle with D!).

Also, I figured out why the 16-bit lookup wasn't more closely 
2.0x the performance of the 8-bit lookup.  The L1 cache is 192KB, 
but that's divided across its six cores, so this single-threaded 
program only got 32KB of L1 to play with.  The 8-bit lookup table 
fit in that, but only half of the 16-bit lookup table did.

Jun 08 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Jun 08, 2020 at 06:08:55PM +0000, ttk via Digitalmars-d wrote:
 On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:

[...]
 Jesus. Shouldn't you be able to generate that instead of
 copy-pasting? It's D, after all.

 
 A fair point.  Duplicating it in emacs was literally five keystrokes,
 but I should rewrite it to use templates anyway (more excuse to fiddle
 with D!).

No templates needed, just use static foreach. ;-)  (See my other reply
to your code.)


 Also, I figured out why the 16-bit lookup wasn't more closely 2.0x the
 performance of the 8-bit lookup.  The L1 cache is 192KB, but that's
 divided across its six cores, so this single-threaded program only got
 32KB of L1 to play with.  The 8-bit lookup table fit in that, but only
 half of the 16-bit lookup table did.

I think your results are skewed; your copy-n-pasted block of count
repeatedly overwrites `count` without assigning its value anywhere else,
so the optimizer deleted all except the last block of code!  See my
other reply.


T

-- 
Tech-savvy: euphemism for nerdy.

Jun 08 2020

D Programming

C/C++ Programming

Other

digitalmars.D - Counting bits in a ulong