## digitalmars.D - Counting bits in a ulong

• H. S. Teoh (97/97) Jun 03 So recently I needed to implement a function to count the number of 1
• Paul Backus (5/14) Jun 03 Interesting stuff. Did you compare your implementations with the
• ketmar (5/17) Jun 03 actually, that branch has very good prediction rate. it should be predic...
• Patrick Schluter (8/14) Jun 04 You should try to use the popcnt instruction of the cpu (in asm
• wjoe (10/33) Jun 04 Did you run the benchmark on different CPU architectures ?
• H. S. Teoh (29/38) Jun 04 On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalma...
• Dominikus Dittes Scherkl (14/44) Jun 04 wikipedia recommends this:
• ttk (25/27) Jun 05 This made me wonder how well a lookup table approach would
• H. S. Teoh (29/59) Jun 05 You might be running into alignment issues, possibly. For one thing, why
• H. S. Teoh (157/170) Jun 05 [...]
• sebasg (3/7) Jun 05 Jesus. Shouldn't you be able to generate that instead of
• ttk (9/17) Jun 08 A fair point. Duplicating it in emacs was literally five
• H. S. Teoh (11/23) Jun 08 No templates needed, just use static foreach. ;-) (See my other reply
```So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop.  There's of course the
naïve method (`val` is the input ulong):

uint count;
foreach (i; 0 .. 64)
{
if (val & (1L << i))
count++;
}
return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit
counter based on the idea from the Bit Twiddling Hacks page:

// Parallel bit count.
// We count upper/lower halves separately, interspersed to
// maximize instruction-level parallelism.
uint v1 = cast(uint)(val >> 32);
uint v2 = cast(uint)val & 0xFFFF_FFFF;

v1 = v1 - ((v1 >> 1) & 0x5555_5555);
v2 = v2 - ((v2 >> 1) & 0x5555_5555);
v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

return cast(ulong)v1 + cast(ulong)v2;

https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's,
then in 8's, and so on, inside the ulong itself.  I furthermore split it
into upper and lower halves and interleaved the two halves to cut the
data dependency between instructions and (hopefully) increase the
opportunity for hyperthreading / out-of-order execution.  This is
completely branchless and in theory should maximize throughput.
Unfortunately, it comes at the cost of a high instruction count.

The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:

uint count;
for (count = 0; val; count++)
{
val &= val - 1;
}
return count;

(How it does this is left as an exercise for the reader. It's really
quite clever.)  This comes with the advantage of a very low instruction
count, but it does have a branch that isn't easily predictable, so
actual execution time is marginally slower.

I did some measurements with real data (not just a micro benchmark, this
is the actual algorithm with the bitcount function in its hotspot), and
as expected the naïve bitcount performed the worst, about 2x slower.
However, between Kernighan and the parallel bit counter, the results
depend on the kind of input given.  For program inputs where there is a
large number of 1's in the average ulong to be counted, the parallel bit
counter performs better. Here's one example of such an input case:

Naïve:
real	0m4.159s
user	0m4.172s
sys	0m0.024s

Kernighan:
real	0m2.114s
user	0m2.129s
sys	0m0.020s

Parallel:
real	0m2.024s
user	0m2.039s
sys	0m0.028s

As you can see, the parallel count is only slightly (albeit
consistently) better than Kernighan.

However, when the input tends to generate ulongs with a low saturation
of 1's, Kernighan wins out by quite a large margin:

Naïve:
real	0m5.661s
user	0m5.706s
sys	0m0.054s

Kernighan:
real	0m3.080s
user	0m3.123s
sys	0m0.058s

Parallel:
real	0m3.805s
user	0m3.844s
sys	0m0.056s

This illustrates how optimization is often input-dependent, and what
works well in one case may work not so well in another case. And even an
overall better choice (in this case Kernighan's trick) will in some
niche cases perform less well.

This goes to reinforce the point Andrei recently made, that sometimes if
you optimize for a micro-benchmark, you can end up with code that looks
good on the benchmark but not so good on real-world data -- you end up
optimizing for the benchmark rather than for the real-world use case.
Real-world optimization isn't so straightforward as minimizing a single
number; it's often a multi-dimensional problem. :-P

T

--
Amateurs built the Ark; professionals built the Titanic.
```
Jun 03
```On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
So recently I needed to implement a function to count the number
of 1 bits in a ulong inside a (very) hot inner loop.  There's of
course the naïve method (`val` is the input ulong):

[...]
This illustrates how optimization is often input-dependent, and
what works well in one case may work not so well in another
case. And even an overall better choice (in this case
Kernighan's trick) will in some niche cases perform less well.

Interesting stuff. Did you compare your implementations with the
popcnt function in core.bitop? There could be a potential
improvement here.
```
Jun 03
```H. S. Teoh wrote:

The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:

uint count;
for (count = 0; val; count++)
{
val &= val - 1;
}
return count;

(How it does this is left as an exercise for the reader. It's really
quite clever.)  This comes with the advantage of a very low instruction
count, but it does have a branch that isn't easily predictable, so
actual execution time is marginally slower.

actually, that branch has a very good prediction rate. it should be
predicted correctly most of the time, and when the predictor fails, it
mostly doesn't matter, because we're done with the loop. coincidentally,
this is what your benchmarking results demonstrate. ;-)
```
Jun 03
```On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
So recently I needed to implement a function to count the number
of 1 bits in a ulong inside a (very) hot inner loop.  There's of
course the naïve method (`val` is the input ulong):

[...]

You should try to use the popcnt instruction of the cpu (in asm
or with intrinsics). It's at least 3 or 4 orders of magnitude
(i.e. between 1000 and 10000 times) faster than any of these
tricks. I made the change last year in my work codebase since
we've left the SPARC servers (they had a very slow implementation
of popcount).
All modern CPUs have the instruction.
```
Jun 04
```On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
Interesting stuff. Did you compare your implementations with the
popcnt function in core.bitop? There could be a potential improvement
here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d
wrote:
[...]
You should try to use the popcnt instruction of the cpu (in asm or
with intrinsics). It's at least 3 or 4 orders of magnitude (i.e.
between 1000 and 10000 times) faster than any of these tricks.

[...]
As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
instruction. :-(

[...]

Actually, I was wrong, my CPU *does* have this instruction, but I needed
to pass the right -mcpu flag to LDC before it will emit it.  After
turning it on, I got a huge performance boost; it completely moved the
function that calls popcnt out of the hotspot into the background! Now
it's other functions that have become the bottleneck.  Cool stuff!

T

--
What do you mean the Internet isn't filled with subliminal messages? What about
all those buttons marked "submit"??
```
Jun 04
```On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote:
On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via
Digitalmars-d wrote:
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via
Digitalmars-d wrote: [...]
Interesting stuff. Did you compare your implementations with
the popcnt function in core.bitop? There could be a
potential improvement here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via
Digitalmars-d wrote: [...]
You should try to use the popcnt instruction of the cpu (in asm
or with intrinsics). It's at least 3 or 4 orders of magnitude
(i.e. between 1000 and 10000 times) faster than any of these
tricks.

[...]
As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support
the instruction. :-(

[...]

Actually, I was wrong, my CPU *does* have this instruction, but
I needed to pass the right -mcpu flag to LDC before it will
emit it.  After turning it on, I got a huge performance boost;
it completely moved the function that calls popcnt out of the
hotspot into the background! Now it's other functions that have
become the bottleneck.  Cool stuff!

Cool, I just wanted to comment as I have myself a Phenom II X6
1090T and was wondering.
Btw, if you want to use the Kernighan function for values with a
lot of set bits, count the inverted value:

That's what I had in my code (gcc C)

DEFINE_INLINE  int PopCountFewClr(uint64_t w)
{
#ifdef __POPCNT__
return __builtin_popcountll(w);
#else
return CHAR_BIT*sizeof w - PopCountFewSet(~w);
#endif
}
```
Jun 04
```On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
I did some measurements with real data (not just a micro
benchmark, this is the actual algorithm with the bitcount
function in its hotspot), and as expected the naïve bitcount
performed the worst, about 2x slower. However, between
Kernighan and the parallel bit counter, the results depend on
the kind of input given.  For program inputs where there is a
large number of 1's in the average ulong to be counted, the
parallel bit counter performs better. Here's one example of
such an input case:

Naïve:
real	0m4.159s
user	0m4.172s
sys	0m0.024s

Kernighan:
real	0m2.114s
user	0m2.129s
sys	0m0.020s

Parallel:
real	0m2.024s
user	0m2.039s
sys	0m0.028s

As you can see, the parallel count is only slightly (albeit
consistently) better than Kernighan.

Did you run the benchmark on different CPU architectures ?

I recently benchmarked code which was performing a calculation
one of 2 ways.
1st way was to calculate for real, the other used look up tables.
I ran the test on an i7 haswell and an i5 skylake.
The i5 was performing faster with look up tables, the i7
performance was worse using look up tables.
The performance impact was similar to your 1st benchmark Parallel
vs Kernighan.
```
Jun 04
```On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
Interesting stuff. Did you compare your implementations with the
popcnt function in core.bitop? There could be a potential improvement
here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d
wrote:
[...]
You should try to use the popcnt instruction of the cpu (in asm or
with intrinsics). It's at least 3 or 4 orders of magnitude (i.e.
between 1000 and 10000 times) faster than any of these tricks. I made
the change last year in my work codebase since we've left the SPARC
servers (they had a very slow implementation of popcount).
All modern CPUs have the instruction.

Haha, I learn something new every day. Didn't even know there was such a
CPU instruction as popcnt. :-)

But what's more interesting is the result of using core.bitop.popcnt: as
it turns out, it runs faster than Kernighan when the input ulongs have
high bit count, but *slower* than Kernighan when the inputs have low bit
density. This was very puzzling to me, since a CPU instruction is
supposed to be orders of magnitude faster, as Patrick said, so I looked
at the disassembly.

As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
instruction. :-(  So the LDC intrinsic reverted to a software
implementation, which is essentially the parallel bitcount I wrote,
except better. Evidently, I miscalculated the impact of data dependency
between instructions, or more likely, the AMD CPU doesn't have
hyperthreading so splitting the ulong into upper/lower halves did more
harm than good for the calculation.

But nonetheless, even the better implementation of parallel bitcount
loses out to Kernighan when the input does not have a high bit density.

I do have a VPS that runs on an Intel CPU, but apparently also an older
one that doesn't support SSE4 (either that, or the hypervisor filtered
that out for whatever reason, but I doubt this).

So again we see the complex landscape of optimization: not only
does the optimal choice depend on the input data, it also depends
on the hardware. :-)

T

--
Live a century, learn a century -- and still you'll die a fool. (Russian proverb: "Век живи - век учись. А дураком помрёшь.")
```
Jun 04
```On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
So recently I needed to implement a function to count the number
of 1 bits in a ulong inside a (very) hot inner loop.  There's of
course the naïve method (`val` is the input ulong):

uint count;
foreach (i; 0 .. 64)
{
if (val & (1L << i))
count++;
}
return count;

Knowing that this code is inside a hotspot, I wrote a parallel
bit counter based on the idea from the Bit Twiddling Hacks page:

// Parallel bit count.
// We count upper/lower halves separately, interspersed to
// maximize instruction-level parallelism.
uint v1 = cast(uint)(val >> 32);
uint v2 = cast(uint)val & 0xFFFF_FFFF;

v1 = v1 - ((v1 >> 1) & 0x5555_5555);
v2 = v2 - ((v2 >> 1) & 0x5555_5555);
v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

wikipedia recommends this:

x -= (x >> 1) & m1;
x = (x & m2) + ((x >> 2) & m2);
x = (x + (x >> 4)) & m4;
x += x >>  8;  // put count of each 16 bits into their lowest 8 bits
x += x >> 16;  // put count of each 32 bits into their lowest 8 bits
x += x >> 32;  // put count of each 64 bits into their lowest 8 bits
return x & 0x7f;

so in the last 4 lines you do too much work.
better to use the intrinsics: popcnt(val >> 32) + popcnt(val & uint.max)
```
Jun 04
```On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
So recently I needed to implement a function to count the
number of 1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would
compare to these logical methods (and it was an excuse to blow
off work a bit to fiddle with D).

The popcnt instruction is going to blow away all of these, of
course, but it still seems like a worthwhile exercise, if only to
get a feel for how purely in-register logic performance compares
to cache access performance on modern "big" CPUs.

ttk kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive:         5596740 iter/sec
Parallel:      166352970 iter/sec
Kernighan:     47668800 iter/sec
Lookup 8-bit:  467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to
2.0x the performance of the 8-bit lookup, since both lookup
tables fit well within the L1 cache.

Source is here .. a bit large, since I manually unrolled loops
100x to minimize the impact of looping logic (particularly the
now() call):

http://ciar.org/h/popcount.d

I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 /
1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
popcount.d" on Slackware-current.
```
Jun 05
```On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
So recently I needed to implement a function to count the number of
1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would compare to
these logical methods (and it was an excuse to blow off work a bit to
fiddle with D).

Hooray for having an excuse to play with D! ;-)

The popcnt instruction is going to blow away all of these, of course,
but it still seems like a worthwhile exercise, if only to get a feel
for how purely in-register logic performance compares to cache access
performance on modern "big" CPUs.

ttk kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive:         5596740 iter/sec
Parallel:      166352970 iter/sec
Kernighan:     47668800 iter/sec
Lookup 8-bit:  467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the
performance of the 8-bit lookup, since both lookup tables fit well
within the L1 cache.

You might be running into alignment issues, possibly. For one thing, why
does LOOKUP_1 have 257 elements instead of 256? That last byte is never
accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
element is redundant.

Source is here .. a bit large, since I manually unrolled loops 100x to
minimize the impact of looping logic (particularly the now() call):

http://ciar.org/h/popcount.d

There's no need to manually unroll the code: just use static
foreach, since the code is identical every iteration! :-P

I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB
L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
popcount.d" on Slackware-current.

I don't trust performance results with dmd, so I tested it on `ldc2 -O3`

Starting tests, which take about 10 seconds each
Naive:         25665230 iter/sec
Parallel:      124692900 iter/sec
Kernighan:     40680010 iter/sec
Lookup 8-bit:  1532505390 iter/sec
Lookup 16-bit: 1690195340 iter/sec

Something is up with the 16-bit lookups. Why is it only barely faster
than 8-bit lookup?

Disassembling shows that the code was *not* duplicated 100 times;
apparently LDC's aggressive optimizer noticed that `count` is reassigned
every unrolled iteration, so all except the last are no-ops. You need to
do something with `count` that isn't immediately killed by the next
unrolled iteration, otherwise your painstakingly copy-n-pasted
iterations will all be elided by the optimizer. :-P

OR'ing `count` into ACCUMULATOR after every count is probably what you
need to do here.

T

--
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a
primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on
Microsoft's strategy" -- Bill G., 1999.
```
Jun 05
```On Fri, Jun 05, 2020 at 12:45:36PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
You might be running into alignment issues, possibly. For one thing, why
does LOOKUP_1 have 257 elements instead of 256? That last byte is never
accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
element is redundant.

[...]
Disassembling shows that the code was *not* duplicated 100 times;
apparently LDC's aggressive optimizer noticed that `count` is reassigned
every unrolled iteration, so all except the last are no-ops. You need to
do something with `count` that isn't immediately killed by the next
unrolled iteration, otherwise your painstakingly copy-n-pasted
iterations will all be elided by the optimizer. :-P

OR'ing `count` into ACCUMULATOR after every count is probably what you
need to do here.

[...]

So I made the above changes, and here are the new measurements:

Starting tests, which take about 10 seconds each
Naive:         25272530 iter/sec
Parallel:      133259870 iter/sec
Kernighan:     36447830 iter/sec
Lookup 8-bit:  133837870 iter/sec
Lookup 16-bit: 283343300 iter/sec

Now the 16-bit lookup makes a lot more sense! :-D  (I also confirmed via
disassembly that the code is indeed unrolled 100 times, and not just
elided.)

Interestingly, the parallel version seems to perform almost just as well
as the 8-bit lookup.

Here's the fixed code (it's a lot shorter :-P):

------------------------------------------------------
import std.stdio;
import std.datetime.systime;

ubyte[256]   LOOKUP_1; // 8-bit lookup table
ubyte[65536] LOOKUP_2; // 16-bit lookup table
const ulong STEPPER = 0x110D0B0705030201;  // val increment step, to minimize cache effects
ulong ACCUMULATOR = 0;  // might keep optimizer from noticing we never use count and tossing it
double TEST_DURATION = 10.0;

// High-resolution time()-like, copied from VS_Common_D:
double now() {
double tm = Clock.currStdTime();
return (tm / 10000000.0) - 62135596800.0;
}

ubyte popcount_naive(ulong val) {
ubyte count;
foreach (i; 0 .. 64) if (val & (1L << i)) count++;
return count;
}

ulong microbench_naive() {
ulong iter_count = 0;
ulong val = 0;
uint  count = 0;
double elapsed;
double t0 = now();
while((elapsed = now() - t0) < TEST_DURATION) {
// unrolling 100x should be enough to amortize overhead of calling now():
static foreach (_; 0 .. 100) {
foreach (i; 0 .. 64) if (val & (1L << i)) count++; val += STEPPER;
ACCUMULATOR ^= count;
}
iter_count += 100;
}
return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_parallel() {
ulong iter_count = 0;
ulong val = 0;
uint  count = 0;
uint  v1;
uint  v2;
double elapsed;
double t0 = now();
while((elapsed = now() - t0) < TEST_DURATION) {
// unrolling 100x should be enough to amortize overhead of calling now():
static foreach (_; 0 .. 100) {
v1 = cast(uint)(val >> 32);
v2 = cast(uint)val & 0xFFFF_FFFF;
v1 = v1 - ((v1 >> 1) & 0x5555_5555);
v2 = v2 - ((v2 >> 1) & 0x5555_5555);
v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
count += cast(ulong)v1 + cast(ulong)v2;
val += STEPPER;
ACCUMULATOR ^= count;
}

iter_count += 100;
}
ACCUMULATOR += count;
return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_kernighan() {
ulong iter_count = 0;
ulong val = 0;
ulong tval = 0;
uint  count = 0;
double elapsed;
double t0 = now();
while((elapsed = now() - t0) < TEST_DURATION) {
// unrolling 100x should be enough to amortize overhead of calling now():
static foreach (_; 0 .. 100) {
for (count = 0; val; count++) val &= val - 1;  val = tval += STEPPER;
ACCUMULATOR ^= count;
}

iter_count += 100;
}
ACCUMULATOR += count;
return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_8bit() {
ulong iter_count = 0;
ulong val = 0;
ubyte *av = cast(ubyte*) &val;
uint  count = 0;
double elapsed;
double t0 = now();
while((elapsed = now() - t0) < TEST_DURATION) {
// unrolling 100x should be enough to amortize overhead of calling now():
// writeln("top: iter = ", iter_count, ", val = ", val,
//     ", av[0] = ", av[0], ", av[1] = ", av[1], ", av[2] = ", av[2],
//     ", av[3] = ", av[3], ", av[4] = ", av[4], ", av[5] = ", av[5],
//     ", av[6] = ", av[6], ", av[7] = ", av[7]);
static foreach (_; 0 .. 100) {
count = LOOKUP_1[av[0]] + LOOKUP_1[av[1]] + LOOKUP_1[av[2]] +
LOOKUP_1[av[3]] + LOOKUP_1[av[4]] + LOOKUP_1[av[5]] + LOOKUP_1[av[6]] +
LOOKUP_1[av[7]];
val += STEPPER;
ACCUMULATOR ^= count;
}

iter_count += 100;
}
ACCUMULATOR += count;
return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_16bit() {
ulong iter_count = 0;
ulong val = 0;
ushort *av = cast(ushort*) &val;
uint  count = 0;
double elapsed;
double t0 = now();
while((elapsed = now() - t0) < TEST_DURATION) {
// unrolling 100x should be enough to amortize overhead of calling now():
static foreach (_; 0 .. 100) {
count = LOOKUP_2[av[0]] + LOOKUP_2[av[1]] + LOOKUP_2[av[2]] +
LOOKUP_2[av[3]];
val += STEPPER;
ACCUMULATOR ^= count;
}

iter_count += 100;
}
ACCUMULATOR += count;
return cast(ulong)(iter_count / elapsed);  // iterations per second
}

int main() {
// initialize the lookup tables:
foreach (i; 0 ..   256) LOOKUP_1[i] = popcount_naive(i);
foreach (i; 0 .. 65536) LOOKUP_2[i] = popcount_naive(i);

writeln("Starting tests, which take about ", TEST_DURATION, " seconds each");
writeln("Naive:         ", microbench_naive(),        " iter/sec");
writeln("Parallel:      ", microbench_parallel(),     " iter/sec");
writeln("Kernighan:     ", microbench_kernighan(),    " iter/sec");
writeln("Lookup 8-bit:  ", microbench_lookup_8bit(),  " iter/sec");
writeln("Lookup 16-bit: ", microbench_lookup_16bit(), " iter/sec");

// trick optimizer into thinking ACCUMULATOR, and thus count, is important:
return ACCUMULATOR / STEPPER;
}
------------------------------------------------------

T

--
In a world without fences, who needs Windows and Gates? -- Christian Surchi
```
Jun 05
```On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
Source is here .. a bit large, since I manually unrolled loops
100x to minimize the impact of looping logic (particularly the
now() call):

http://ciar.org/h/popcount.d

Jesus. Shouldn't you be able to generate that instead of
copy-pasting? It's D, after all.
```
Jun 05
```On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
Source is here .. a bit large, since I manually unrolled loops
100x to minimize the impact of looping logic (particularly the
now() call):

http://ciar.org/h/popcount.d

Jesus. Shouldn't you be able to generate that instead of
copy-pasting? It's D, after all.

A fair point.  Duplicating it in emacs was literally five
keystrokes, but I should rewrite it to use templates anyway (more
excuse to fiddle with D!).

Also, I figured out why the 16-bit lookup wasn't closer to 2.0x
the performance of the 8-bit lookup.  The L1 cache is 192KB,
but that's divided across its six cores, so this single-threaded
program only got 32KB of L1 to play with.  The 8-bit lookup table
fit in that, but only half of the 16-bit lookup table did.
```
Jun 08
```On Mon, Jun 08, 2020 at 06:08:55PM +0000, ttk via Digitalmars-d wrote:
On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:

[...]
Jesus. Shouldn't you be able to generate that instead of
copy-pasting? It's D, after all.

A fair point.  Duplicating it in emacs was literally five keystrokes,
but I should rewrite it to use templates anyway (more excuse to fiddle
with D!).

No templates needed, just use static foreach. ;-)  (See my other reply.)

Also, I figured out why the 16-bit lookup wasn't more closely 2.0x the
performance of the 8-bit lookup.  The L1 cache is 192KB, but that's
divided across its six cores, so this single-threaded program only got
32KB of L1 to play with.  The 8-bit lookup table fit in that, but only
half of the 16-bit lookup table did.
```
Jun 08