www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Counting bits in a ulong

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop.  There's of course the
nave method (`val` is the input ulong):

        uint count;
        foreach (i; 0 .. 64)
        {
            if (val & (1L << i))
                count++;
        }
        return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit
counter based on the idea from [1]:

        // Parallel bit count.
        // We count upper/lower halves separately, interspersed to maximize
        // hyperthreading. >:-)
        uint v1 = cast(uint)(val >> 32);
        uint v2 = cast(uint)val & 0xFFFF_FFFF;

        v1 = v1 - ((v1 >> 1) & 0x5555_5555);
        v2 = v2 - ((v2 >> 1) & 0x5555_5555);
        v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
        v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
        v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
        v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
        v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
        v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
        v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
        v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

        return cast(ulong)v1 + cast(ulong)v2;

[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's,
then in 8's, and so on, inside the ulong itself.  I furthermore split it
into upper and lower halves and interleaved the two halves to cut the
data dependency between instructions and (hopefully) increase the
opportunity for hyperthreading / increase out-of-order instruction
execution.  This is completely branchless and in theory should maximize
throughput. Unfortunately it comes at the cost of high instruction
count.

The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:

        uint count;
        for (count = 0; val; count++)
        {
            val &= val - 1;
        }
        return count;

(How it does this is left as an exercise for the reader. It's really
quite clever.)  This comes with the advantage of a very low instruction
count, but it does have a branch that isn't easily predictable, so
actual execution time is marginally slower.


I did some measurements with real data (not just a micro benchmark, this
is the actual algorithm with the bitcount function in its hotspot), and
as expected the nave bitcount performed the worst, about 2x slower.
However, between Kernighan and the parallel bit counter, the results
depend on the kind of input given.  For program inputs where there is a
large number of 1's in the average ulong to be counted, the parallel bit
counter performs better. Here's one example of such an input case:

	Nave:
	real	0m4.159s
	user	0m4.172s
	sys	0m0.024s

	Kernighan:
	real	0m2.114s
	user	0m2.129s
	sys	0m0.020s

	Parallel:
	real	0m2.024s
	user	0m2.039s
	sys	0m0.028s

As you can see, the advantage of the parallel count is only slightly
(albeit consistently) better than Kernighan.

However, when the input tends to generate ulongs with a low saturation
of 1's, Kernighan wins out by quite a large margin:

	Nave:
	real	0m5.661s
	user	0m5.706s
	sys	0m0.054s

	Kernighan:
	real	0m3.080s
	user	0m3.123s
	sys	0m0.058s

	Parallel:
	real	0m3.805s
	user	0m3.844s
	sys	0m0.056s

This illustrates how optimization is often input-dependent, and what
works well in one case may work not so well in another case. And even an
overall better choice (in this case Kernighan's trick) will in some
niche cases perform less well.

This goes to reinforce the point Andrei recently made, that sometimes if
you optimize for a micro-benchmark, you can end up with code that looks
good on the benchmark but not so good on real-world data -- you end up
optimizing for the benchmark rather than for the real-world use case.
Real-world optimization isn't so straightforward as minimizing a single
number; it's often a multi-dimensional problem. :-P


T

-- 
Amateurs built the Ark; professionals built the Titanic.
Jun 03
next sibling parent Paul Backus <snarwin gmail.com> writes:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):
[...]
 This illustrates how optimization is often input-dependent, and 
 what works well in one case may work not so well in another 
 case. And even an overall better choice (in this case 
 Kernighan's trick) will in some niche cases perform less well.
Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
Jun 03
prev sibling next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
H. S. Teoh wrote:

 The third alternative is known as Kernighan's trick, which involves a
 loop that executes exactly as many times as the number of 1 bits:

         uint count;
         for (count = 0; val; count++)
         {
             val &= val - 1;
         }
         return count;

 (How it does this is left as an exercise for the reader. It's really
 quite clever.)  This comes with the advantage of a very low instruction
 count, but it does have a branch that isn't easily predictable, so
 actual execution time is marginally slower.
actually, that branch has very good prediction rate. it should be predicted most of the time, and when predictor failed, it is mostly doesn't matter, because we're done with the loop. coincidentally, this is what your benchmarking results demonstrates. ;-)
Jun 03
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):

 [...]
You should try to use the popcnt instruction of the cpu (in asm or with intrinsics). It's at least 3 or 4 order of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks. I made the change last year in my work codebase since we've left the SPARC servers (they had a very slow implementation of popcount). All modern CPU's have the instruction.
Jun 04
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
 On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
 [...]
 Interesting stuff. Did you compare your implementations with the
 popcnt function in core.bitop? There could be a potential improvement
 here.
On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
 You should try to use the popcnt instruction of the cpu (in asm or
 with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
 between 1000 and 10000 times) faster than any of these tricks.
[...]
 As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
 instruction. :-(
[...] Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff! T -- What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"??
Jun 04
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote:
 On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via 
 Digitalmars-d wrote:
 On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via 
 Digitalmars-d wrote: [...]
 Interesting stuff. Did you compare your implementations with 
 the popcnt function in core.bitop? There could be a 
 potential improvement here.
On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
 You should try to use the popcnt instruction of the cpu (in 
 asm or
 with intrinsics). It's at least 3 or 4 order of magnitude 
 (i.e.
 between 1000 and 10000 times) faster than any of these 
 tricks.
[...]
 As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't 
 support the
 instruction. :-(
[...] Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff!
Cool, I just wanted to comment as I have myself a Phenom II X6 1090T and was wondering. Btw, if you want to use the Kernighan function for values with a lot of set bits, count the inverted value: That's what I had in my code (gcc C) DEFINE_INLINE int PopCountFewClr(uint64_t w) { #ifdef __POPCNT__ return __builtin_popcountll(w); #else return CHAR_BIT*sizeof w - PopCountFewSet(~w); #endif }
Jun 04
prev sibling next sibling parent wjoe <invalid example.com> writes:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 I did some measurements with real data (not just a micro 
 benchmark, this is the actual algorithm with the bitcount 
 function in its hotspot), and as expected the naïve bitcount 
 performed the worst, about 2x slower. However, between 
 Kernighan and the parallel bit counter, the results depend on 
 the kind of input given.  For program inputs where there is a 
 large number of 1's in the average ulong to be counted, the 
 parallel bit counter performs better. Here's one example of 
 such an input case:

 	Naïve:
 	real	0m4.159s
 	user	0m4.172s
 	sys	0m0.024s

 	Kernighan:
 	real	0m2.114s
 	user	0m2.129s
 	sys	0m0.020s

 	Parallel:
 	real	0m2.024s
 	user	0m2.039s
 	sys	0m0.028s

 As you can see, the advantage of the parallel count is only 
 slightly (albeit consistently) better than Kernighan.
Did you run the benchmark on different CPU architectures ? I recently benchmarked code which was performing a calculation one of 2 ways. 1st way was to calculate for real, the other used look up tables. I ran the test on an i7 haswell and an i5 skylake. The i5 was performing faster with look up tables, the i7 performance was worse using look up tables. The performance impact was similar to your 1st benchmark Parallel vs Kernighan.
Jun 04
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
 Interesting stuff. Did you compare your implementations with the
 popcnt function in core.bitop? There could be a potential improvement
 here.
On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
 You should try to use the popcnt instruction of the cpu (in asm or
 with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
 between 1000 and 10000 times) faster than any of these tricks. I made
 the change last year in my work codebase since we've left the SPARC
 servers (they had a very slow implementation of popcount).
 All modern CPU's have the instruction.
Haha, I learn something new every day. Didn't even know there was such a CPU instruction as popcnt. :-) But what's more interesting is the result of using core.bitop.popcnt: as it turns out, it runs faster than Kernighan when the input ulongs have high bit count, but *slower* than Kernighan when the inputs have low bit density. This was very puzzling to me, since a CPU instruction is supposed to be orders of magnitude faster, as Patrick said, so I looked at the disassembly. As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-( So the LDC intrinsic reverted to a software implementation, which is essentially the parallel bitcount I wrote, except better. Evidently, I miscalculated the impact of data dependency between instructions, or more likely, the AMD CPU doesn't have hyperthreading so splitting the ulong into upper/lower halves did more harm than good for the calculation. But nonetheless, even the better implementation of parallel bitcount loses out to Kernighan when the input does not have a high bit density. I do have a VPS that runs on an Intel CPU, but apparently also an older one that doesn't support SSE4 (either that, or the hypervisor filtered that out for whatever reason, but I doubt this). So again we see the complex landscape of optimization: not only the optimality depend on input data, it also depends on the hardware. :-) T -- Век живи - век учись. А дураком помрёшь.
Jun 04
prev sibling next sibling parent Dominikus Dittes Scherkl <dominikus scherkl.de> writes:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1
 bits in a ulong inside a (very) hot inner loop.  There's of 
 course the
 naïve method (`val` is the input ulong):

         uint count;
         foreach (i; 0 .. 64)
         {
             if (val & (1L << i))
                 count++;
         }
         return count;

 Knowing that this code is inside a hotspot, I wrote a parallel 
 bit counter based on the idea from [1]:

         // Parallel bit count.
         // We count upper/lower halves separately, interspersed 
 to maximize
         // hyperthreading. >:-)
         uint v1 = cast(uint)(val >> 32);
         uint v2 = cast(uint)val & 0xFFFF_FFFF;

         v1 = v1 - ((v1 >> 1) & 0x5555_5555);
         v2 = v2 - ((v2 >> 1) & 0x5555_5555);
         v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
         v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
         v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
         v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
         v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
         v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
         v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
         v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
wikipedia recommends this: x -= (x >> 1) & m1; x = (x & m2) + ((x >> 2) & m2); x = (x + (x >> 4)) & m4; x += x >> 8; //put count of each 16 bits into their lowest 8 bits x += x >> 16; //put count of each 32 bits into their lowest 8 bits x += x >> 32; //put count of each 64 bits into their lowest 8 bits return x & 0x7f; so in the last 4 lines you do too much work. better use the intrinsics: popcnt(val>>32)+popcnt(val & uint.max)
Jun 04
prev sibling parent reply ttk <ttk ciar.org> writes:
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the 
 number of 1 bits in a ulong inside a (very) hot inner loop.
This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D). The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs. ttk kirov:/home/ttk/prog/D$ ./popcount Starting tests, which take about 10 seconds each Naive: 5596740 iter/sec Parallel: 166352970 iter/sec Kernighan: 47668800 iter/sec Lookup 8-bit: 467161540 iter/sec Lookup 16-bit: 826179570 iter/sec It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache. Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call): http://ciar.org/h/popcount.d I ran this on an Intel Core i7-9750M 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current.
Jun 05
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
 On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
 So recently I needed to implement a function to count the number of
 1 bits in a ulong inside a (very) hot inner loop.
This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D).
Hooray for having an excuse to play with D! ;-)
 The popcnt instruction is going to blow away all of these, of course,
 but it still seems like a worthwhile exercise, if only to get a feel
 for how purely in-register logic performance compares to cache access
 performance on modern "big" CPUs.
 
 ttk kirov:/home/ttk/prog/D$ ./popcount
 Starting tests, which take about 10 seconds each
 Naive:         5596740 iter/sec
 Parallel:      166352970 iter/sec
 Kernighan:     47668800 iter/sec
 Lookup 8-bit:  467161540 iter/sec
 Lookup 16-bit: 826179570 iter/sec
 
 It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the
 performance of the 8-bit lookup, since both lookup tables fit well
 within the L1 cache.
You might be running into alignment issues, possibly. For one thing, why does LOOKUP_1 have 257 elements instead of 256? That last byte is never accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last element is redundant.
 Source is here .. a bit large, since I manually unrolled loops 100x to
 minimize the impact of looping logic (particularly the now() call):
 
 http://ciar.org/h/popcount.d
There's no need to manually unroll the code: just use static foreach, since the code is identical every iteration! :-P
 I ran this on an Intel Core i7-9750M   2.6GHz, with 192KB L1 / 1.5MB
 L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
 popcount.d" on Slackware-current.
I don't trust performance results with dmd, so I tested it on `ldc2 -O3` instead: Starting tests, which take about 10 seconds each Naive: 25665230 iter/sec Parallel: 124692900 iter/sec Kernighan: 40680010 iter/sec Lookup 8-bit: 1532505390 iter/sec Lookup 16-bit: 1690195340 iter/sec Something is up with the 16-bit lookups. Why is it only barely faster than 8-bit lookup? Disassembling shows that the code was *not* duplicated 100 times; apparently LDC's aggressive optimizer noticed that `count` is reassigned every unrolled iteration, so all except the last are no-ops. You need to do something with `count` that isn't immediately killed by the next unrolled iteration, otherwise your painstakingly copy-n-pasted iterations will all be elided by the optimizer. :-P OR'ing `count` into ACCUMULATOR after every count is probably what you need to do here. T -- "640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
Jun 05
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jun 05, 2020 at 12:45:36PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 You might be running into alignment issues, possibly. For one thing, why
 does LOOKUP_1 have 257 elements instead of 256? That last byte is never
 accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
 element is redundant.
[...]
 Disassembling shows that the code was *not* duplicated 100 times;
 apparently LDC's aggressive optimizer noticed that `count` is reassigned
 every unrolled iteration, so all except the last are no-ops. You need to
 do something with `count` that isn't immediately killed by the next
 unrolled iteration, otherwise your painstakingly copy-n-pasted
 iterations will all be elided by the optimizer. :-P
 
 OR'ing `count` into ACCUMULATOR after every count is probably what you
 need to do here.
[...] So I made the above changes, and here are the new measurements: Starting tests, which take about 10 seconds each Naive: 25272530 iter/sec Parallel: 133259870 iter/sec Kernighan: 36447830 iter/sec Lookup 8-bit: 133837870 iter/sec Lookup 16-bit: 283343300 iter/sec Now the 16-bit lookup makes a lot more sense! :-D (I also confirmed via disassembly that the code is indeed unrolled 100 times, and not just elided.) Interestingly, the parallel version seems to perform almost just as well as the 8-bit lookup. Here's the fixed code (it's a lot shorter :-P): ------------------------------------------------------ import std.stdio; import std.datetime.systime; ubyte [257] LOOKUP_1; // 8-bit lookup table ubyte [65537] LOOKUP_2; // 16-bit lookup table const ulong STEPPER = 0x110D0B0705030201; // val increment step, to minimize cache effects ulong ACCUMULATOR = 0; // might keep optimizer from noticing we never use count and tossing it double TEST_DURATION = 10.0; // High-resolution time()-like, copied from VS_Common_D: double now() { double tm = Clock.currStdTime(); return (tm / 10000000.0) - 62135596800.0; } ubyte popcount_naive(ulong val) { ubyte count; foreach (i; 0 .. 64) if (val & (1L << i)) count++; return count; } ulong microbench_naive() { ulong iter_count = 0; ulong val = 0; uint count = 0; double elapsed; double t0 = now(); while((elapsed = now() - t0) < TEST_DURATION) { // unrolling 100x should be enough to amortize overhead of calling now(): static foreach (_; 0 .. 100) { foreach (i; 0 .. 64) if (val & (1L << i)) count++; val += STEPPER; ACCUMULATOR ^= count; } iter_count += 100; } return cast(ulong)(iter_count / elapsed); // iterations per second } ulong microbench_parallel() { ulong iter_count = 0; ulong val = 0; uint count = 0; uint v1; uint v2; double elapsed; double t0 = now(); while((elapsed = now() - t0) < TEST_DURATION) { // unrolling 100x should be enough to amortize overhead of calling now(): static foreach (_; 0 .. 100) { v1 = cast(uint)(val >> 32); v2 = cast(uint)val & 0xFFFF_FFFF; v1 = v1 - ((v1 >> 1) & 0x5555_5555); v2 = v2 - ((v2 >> 1) & 0x5555_5555); v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333); v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333); v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F); v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F); v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF); v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF); v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF); v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF); count += cast(ulong)v1 + cast(ulong)v2; val += STEPPER; ACCUMULATOR ^= count; } iter_count += 100; } ACCUMULATOR += count; return cast(ulong)(iter_count / elapsed); // iterations per second } ulong microbench_kernighan() { ulong iter_count = 0; ulong val = 0; ulong tval = 0; uint count = 0; double elapsed; double t0 = now(); while((elapsed = now() - t0) < TEST_DURATION) { // unrolling 100x should be enough to amortize overhead of calling now(): static foreach (_; 0 .. 100) { for (count = 0; val; count++) val &= val - 1; val = tval += STEPPER; ACCUMULATOR ^= count; } iter_count += 100; } ACCUMULATOR += count; return cast(ulong)(iter_count / elapsed); // iterations per second } ulong microbench_lookup_8bit() { ulong iter_count = 0; ulong val = 0; ubyte *av = cast(ubyte*) &val; uint count = 0; double elapsed; double t0 = now(); while((elapsed = now() - t0) < TEST_DURATION) { // unrolling 100x should be enough to amortize overhead of calling now(): // writeln("top: iter = ", iter_count, ", val = ", val, ", av = ", av, ", av0 = ", av[0], ", av1 = ", av[1], ", av2 = ", av[2], ", av3 = ", av[3], ", av4 = ", av[4], ", av5 = ", av[5], ", av6 = ", av[6], ", av7 = ", av[7]); static foreach (_; 0 .. 100) { count = LOOKUP_1[av[0]] + LOOKUP_1[av[1]] + LOOKUP_1[av[2]] + LOOKUP_1[av[3]] + LOOKUP_1[av[4]] + LOOKUP_1[av[5]] + LOOKUP_1[av[6]] + LOOKUP_1[av[7]]; val += STEPPER; ACCUMULATOR ^= count; } iter_count += 100; } ACCUMULATOR += count; return cast(ulong)(iter_count / elapsed); // iterations per second } ulong microbench_lookup_16bit() { ulong iter_count = 0; ulong val = 0; ushort *av = cast(ushort*) &val; uint count = 0; double elapsed; double t0 = now(); while((elapsed = now() - t0) < TEST_DURATION) { // unrolling 100x should be enough to amortize overhead of calling now(): static foreach (_; 0 .. 100) { count = LOOKUP_2[av[0]] + LOOKUP_2[av[1]] + LOOKUP_2[av[2]] + LOOKUP_2[av[3]]; val += STEPPER; ACCUMULATOR ^= count; } iter_count += 100; } ACCUMULATOR += count; return cast(ulong)(iter_count / elapsed); // iterations per second } int main() { // initialize the lookup tables: foreach (i; 0 .. 256) LOOKUP_1[i] = popcount_naive(i); foreach (i; 0 .. 65536) LOOKUP_2[i] = popcount_naive(i); writeln("Starting tests, which take about ", TEST_DURATION, " seconds each"); writeln("Naive: ", microbench_naive(), " iter/sec"); writeln("Parallel: ", microbench_parallel(), " iter/sec"); writeln("Kernighan: ", microbench_kernighan(), " iter/sec"); writeln("Lookup 8-bit: ", microbench_lookup_8bit(), " iter/sec"); writeln("Lookup 16-bit: ", microbench_lookup_16bit(), " iter/sec"); // trick optimizer into thinking ACCUMULATOR, and thus count, is important: return ACCUMULATOR / STEPPER; } ------------------------------------------------------ T -- In a world without fences, who needs Windows and Gates? -- Christian Surchi
Jun 05
prev sibling parent reply sebasg <bcs6wg3tvw 4dh24.anonbox.net> writes:
On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
 Source is here .. a bit large, since I manually unrolled loops 
 100x to minimize the impact of looping logic (particularly the 
 now() call):

 http://ciar.org/h/popcount.d
Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.
Jun 05
parent reply ttk <ttk ciar.org> writes:
On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
 On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
 Source is here .. a bit large, since I manually unrolled loops 
 100x to minimize the impact of looping logic (particularly the 
 now() call):

 http://ciar.org/h/popcount.d
Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.
A fair point. Duplicating it in emacs was literally five keystrokes, but I should rewrite it to use templates anyway (more excuse to fiddle with D!). Also, I figured out why the 16-bit lookup wasn't more closely 2.0x the performance of the 8-bit lookup. The L1 cache is 192KB, but that's divided across its six cores, so this single-threaded program only got 32KB of L1 to play with. The 8-bit lookup table fit in that, but only half of the 16-bit lookup table did.
Jun 08
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 08, 2020 at 06:08:55PM +0000, ttk via Digitalmars-d wrote:
 On Saturday, 6 June 2020 at 04:19:44 UTC, sebasg wrote:
[...]
 Jesus. Shouldn't you be able to generate that instead of
 copy-pasting? It's D, after all.
A fair point. Duplicating it in emacs was literally five keystrokes, but I should rewrite it to use templates anyway (more excuse to fiddle with D!).
No templates needed, just use static foreach. ;-) (See my other reply to your code.)
 Also, I figured out why the 16-bit lookup wasn't more closely 2.0x the
 performance of the 8-bit lookup.  The L1 cache is 192KB, but that's
 divided across its six cores, so this single-threaded program only got
 32KB of L1 to play with.  The 8-bit lookup table fit in that, but only
 half of the 16-bit lookup table did.
I think your results are skewed; your copy-n-pasted block of count repeatedly overwrites `count` without assigning its value anywhere else, so the optimizer deleted all except the last block of code! See my other reply. T -- Tech-savvy: euphemism for nerdy.
Jun 08