
digitalmars.D.learn - Strange counter-performance in an alternative `decimalLength9`

reply Basile B. <b2.temp gmx.com> writes:
So after reading the translation of RYU I was interested to see 
if the decimalLength() function can be written to be faster, as 
it cascades up to 8 CMPs.

After struggling with bad ideas I finally found something that 
looks nice:

- count the leading bits of the input (actually `bsr(input) + 1`, 
the position of the most significant set bit plus one)
- switch() on this count so that in the worst case only 1 CMP 
remains, for the counts where the digit count can change (e.g. 
crossing from 9 to 10, when `bsr(input) + 1 == 4`)

After writing out all the boundary values, to identify the cases 
where a comparison is necessary...

   min2 = 0b10
   max2 = 0b11
   min3 = 0b100
   max3 = 0b111
   ...
   ...
   min32 = 0b100.......0
   max32 = 0b111.......1

...I finally wrote the "nice" thing.

---

import std.stdio;

ubyte decimalLength9(const uint v)
{
     if (v >= 100000000) { return 9; }
     if (v >= 10000000) { return 8; }
     if (v >= 1000000) { return 7; }
     if (v >= 100000) { return 6; }
     if (v >= 10000) { return 5; }
     if (v >= 1000) { return 4; }
     if (v >= 100) { return 3; }
     if (v >= 10) { return 2; }
     return 1;
}

ubyte fdecimalLength9(const uint v) pure nothrow
{
     import core.bitop;
     const ubyte lzc = cast(ubyte) (bsr(v) + 1); // NB: bsr(0) is undefined; the benchmark only passes v >= 1
     ubyte result;
     switch (lzc)
     {
         case 0 :
         case 1 :
         case 2 :
         case 3 : result =  1; break;
         case 4 : result =  v >= 10  ? 2 : 1; break;
         case 5 :
         case 6 : result =  2; break;
         case 7 : result =  v >= 100  ? 3 : 2; break;
         case 8 :
         case 9 : result =  3; break;
         case 10: result =  v >= 1000  ? 4 : 3; break;
         case 11:
         case 12:
         case 13: result =  4; break;
         case 14: result =  v >= 10000  ? 5 : 4; break;
         case 15:
         case 16: result =  5; break;
         case 17: result =  v >= 100000  ? 6 : 5; break;
         case 18:
         case 19: result =  6; break;
         case 20: result =  v >= 1000000  ? 7 : 6; break;
         case 21:
         case 22:
         case 23: result =  7; break;
         case 24: result =  v >= 10000000  ? 8 : 7; break;
         case 25:
         case 26: result =  8; break;
         case 27: result =  v >= 100000000  ? 9 : 8; break;
         case 28:
         case 29:
         case 30:
         case 31:
         case 32: result =  9; break;
         default: assert(false);
     }
     return result;
}

private ubyte ffdecimalLength9(const uint v) pure nothrow
{
     import core.bitop;
     const ubyte lzc = cast(ubyte) (bsr(v) + 1);
     static immutable pure nothrow ubyte function(uint)[33] tbl =
     [
     0 : (uint a) => ubyte(1),
     1 : (uint a) => ubyte(1),
     2 : (uint a) => ubyte(1),
     3 : (uint a) => ubyte(1),
     4 : (uint a) => a >= 10  ? ubyte(2) : ubyte(1),
     5 : (uint a) => ubyte(2),
     6 : (uint a) => ubyte(2),
     7 : (uint a) => a >= 100  ? ubyte(3) : ubyte(2),
     8 : (uint a) => ubyte(3),
     9 : (uint a) => ubyte(3),
     10: (uint a) => a >= 1000  ? ubyte(4) : ubyte(3),
     11: (uint a) => ubyte(4),
     12: (uint a) => ubyte(4),
     13: (uint a) => ubyte(4),
     14: (uint a) => a >= 10000  ? ubyte(5) : ubyte(4),
     15: (uint a) => ubyte(5),
     16: (uint a) => ubyte(5),
     17: (uint a) => a >= 100000  ? ubyte(6) : ubyte(5),
     18: (uint a) => ubyte(6),
     19: (uint a) => ubyte(6),
     20: (uint a) => a >= 1000000  ? ubyte(7) : ubyte(6),
     21: (uint a) => ubyte(7),
     22: (uint a) => ubyte(7),
     23: (uint a) => ubyte(7),
     24: (uint a) => a >= 10000000  ? ubyte(8) : ubyte(7),
     25: (uint a) => ubyte(8),
     26: (uint a) => ubyte(8),
     27: (uint a) => a >= 100000000  ? ubyte(9) : ubyte(8),
     28: (uint a) => ubyte(9),
     29: (uint a) => ubyte(9),
     30: (uint a) => ubyte(9),
     31: (uint a) => ubyte(9),
     32: (uint a) => ubyte(9),
     ];
     return tbl[lzc](v);
}

void main()
{
     import std.datetime.stopwatch, std.range, std.algorithm;

     int s1, s2, s3;
     benchmark!({ iota(100000000u).each!(a => s1 += decimalLength9(a+1)); })(10).writeln;
     benchmark!({ iota(100000000u).each!(a => s2 += fdecimalLength9(a+1)); })(10).writeln;
     benchmark!({ iota(100000000u).each!(a => s3 += ffdecimalLength9(a+1)); })(10).writeln;
     assert(s1 == s2);
     assert(s1 == s3);
}
---

Then, bad surprise. Even with ldmd (so basically ldc2) fed with 
the args from the script line, the fdecimalLength9 version is 
maybe slightly faster. Only *slightly*. Good news: I've wasted my 
time. So I tried an alternative version that uses a table of 
delegates instead of a switch (ffdecimalLength9) and surprise, 
"tada", it is like **100x** slower than the two others.

How is that possible?
Feb 25 2020
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via Digitalmars-d-learn
wrote:
[...]

[...] TBH, I'm skeptical of any performance results using dmd. I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers. [...]
 Then bad surprise. Even with ldmd (so ldc2 basically) feeded with the
 args from the script line. Maybe the fdecimalLength9 version is
 slightly faster.  Only *slightly*. Good news, I've lost my time. So I
 try an alternative version that uses a table of delegates instead of a
 switch (ffdecimalLength9) and surprise, "tada", it is like **100x**
 slower then the two others.
 
 How is that possible ?
Did you check the assembly output to see what the difference is?

Delegates involve a function call, which involves function call overhead, which includes a CPU pipeline hazard. Worse yet, it's an indirect call, meaning you're defeating the CPU branch predictor and invalidating the instruction cache. And on top of that, delegates involve allocating a context, and you *really* do not want allocations inside an inner loop.

And of course, normal function calls are easier for compilers to inline, because the destination is fixed. Indirect calls involving delegates are hard to predict, and the optimizer is more liable to just give up.

These are just my educated guesses, of course. For the real answer, look at the assembly output. :-D

T

-- 
What are you when you run out of Monet? Baroque.
Feb 25 2020
parent Basile B. <b2.temp gmx.com> writes:
On Wednesday, 26 February 2020 at 01:10:07 UTC, H. S. Teoh wrote:
 On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via 
 Digitalmars-d-learn wrote: [...]

[...] TBH, I'm skeptical of any performance results using dmd. I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers.
I didn't use DMD. The script line is actually interpreted by the IDE: it drops the compiler name, just parses the args, and passes them to a compiler that's defined by non-overridable options. In my case I used LDMD (this is LDC, you see, but with a DMD-like options syntax).
Feb 25 2020
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 [...]
It can be made faster using binary search. Not by much though. You'll get ceil(log2(numberofplaces)) cmps.
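Stefan's suggestion can be sketched as follows (a hedged illustration of mine, not code from the thread; the function name and the exact split points are assumptions). Halving the set of candidate digit counts at each comparison gives at most ceil(log2(9)) = 4 CMPs on any path:

```d
ubyte decimalLength9_bsearch(const uint v) pure nothrow @nogc
{
    // binary search over the 9 possible digit counts:
    // at most 4 comparisons on any path instead of up to 8
    if (v >= 100_000) // 6..9 digits
    {
        if (v >= 10_000_000)
            return v >= 100_000_000 ? 9 : 8;
        return v >= 1_000_000 ? 7 : 6;
    }
    // 1..5 digits
    if (v >= 100)
    {
        if (v >= 10_000) return 5;
        return v >= 1_000 ? 4 : 3;
    }
    return v >= 10 ? 2 : 1;
}
```

As with the original, 10-digit values are lumped in with the 9-digit result.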
Feb 26 2020
prev sibling next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 How is that possible ?
It turns out that there's a problem with the benchmarking method. 
With a command line argument the different optimization passes of 
LLVM can't constant-fold the literal constants away. It appears 
that none of the alternatives based on the "most significant bit 
trick" are faster (at least when covering a decent range of 
numbers):

---
-mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2

import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0: case 1: case 2: result = 1; break;
        case 3: result = v >= 10 ? 2 : 1; break;
        case 4: case 5: result = 2; break;
        case 6: result = v >= 100 ? 3 : 2; break;
        case 7: case 8: result = 3; break;
        case 9: result = v >= 1000 ? 4 : 3; break;
        case 10: case 11: case 12: result = 4; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 14: case 15: result = 5; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 17: case 18: result = 6; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 20: case 21: case 22: result = 7; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 24: case 25: result = 8; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        case 27: case 28: case 29: case 30: case 31: result = 9; break;
        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
        0 : (uint a) => ubyte(1),
        1 : (uint a) => ubyte(1),
        2 : (uint a) => ubyte(1),
        3 : (uint a) => a >= 10 ? ubyte(2) : ubyte(1),
        4 : (uint a) => ubyte(2),
        5 : (uint a) => ubyte(2),
        6 : (uint a) => a >= 100 ? ubyte(3) : ubyte(2),
        7 : (uint a) => ubyte(3),
        8 : (uint a) => ubyte(3),
        9 : (uint a) => a >= 1000 ? ubyte(4) : ubyte(3),
        10: (uint a) => ubyte(4),
        11: (uint a) => ubyte(4),
        12: (uint a) => ubyte(4),
        13: (uint a) => a >= 10000 ? ubyte(5) : ubyte(4),
        14: (uint a) => ubyte(5),
        15: (uint a) => ubyte(5),
        16: (uint a) => a >= 100000 ? ubyte(6) : ubyte(5),
        17: (uint a) => ubyte(6),
        18: (uint a) => ubyte(6),
        19: (uint a) => a >= 1000000 ? ubyte(7) : ubyte(6),
        20: (uint a) => ubyte(7),
        21: (uint a) => ubyte(7),
        22: (uint a) => ubyte(7),
        23: (uint a) => a >= 10000000 ? ubyte(8) : ubyte(7),
        24: (uint a) => ubyte(8),
        25: (uint a) => ubyte(8),
        26: (uint a) => a >= 100000000 ? ubyte(9) : ubyte(8),
        27: (uint a) => ubyte(9),
        28: (uint a) => ubyte(9),
        29: (uint a) => ubyte(9),
        30: (uint a) => ubyte(9),
        31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
        0 : ubyte(1),  1 : ubyte(1),  2 : ubyte(1),  3 : ubyte(10),
        4 : ubyte(2),  5 : ubyte(2),  6 : ubyte(11), 7 : ubyte(3),
        8 : ubyte(3),  9 : ubyte(12), 10: ubyte(4),  11: ubyte(4),
        12: ubyte(4),  13: ubyte(12), 14: ubyte(5),  15: ubyte(5),
        16: ubyte(13), 17: ubyte(6),  18: ubyte(6),  19: ubyte(14),
        20: ubyte(7),  21: ubyte(7),  22: ubyte(7),  23: ubyte(15),
        24: ubyte(8),  25: ubyte(8),  26: ubyte(16), 27: ubyte(9),
        28: ubyte(9),  29: ubyte(9),  30: ubyte(9),  31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector < 10)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result = v >= 10 ? 2 : 1; break;
        case 6 : result = v >= 100 ? 3 : 2; break;
        case 9 : result = v >= 1000 ? 4 : 3; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}

void main(string[] args)
{
    uint sum;
    uint count;
    ubyte func;
    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func);
    static const funcs = [&decimalLength9_0, &decimalLength9_1,
                          &decimalLength9_2, &decimalLength9_3];
    foreach (i; 0 .. count)
        sum += funcs[func](i);
    writeln(sum);
}
---

Results:

---
[mr_x pc1 temp]$ time ./declen --count=100000000 --func=0
788888890

real    0m0,207s
user    0m0,203s
sys     0m0,003s

[mr_x pc1 temp]$ time ./declen --count=100000000 --func=1
788888890

real    0m0,369s
user    0m0,366s
sys     0m0,002s

[mr_x pc1 temp]$ time ./declen --count=100000000 --func=2
788888890

real    0m0,271s
user    0m0,264s
sys     0m0,003s

[mr_x pc1 temp]$ time ./declen --count=100000000 --func=3
788888890

real    0m0,253s
user    0m0,250s
sys     0m0,002s
---

That's pathetic ;)
Feb 26 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 ...
     foreach (i; 0 .. count)
         sum += funcs[func](i);
The input stream is highly predictable and strongly skewed towards higher digits. The winning function implementation lines up with that distribution. It would not fare as well with higher entropy input.
Feb 26 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal 
wrote:
 On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
 wrote:
 ...
     foreach (i; 0 .. count)
         sum += funcs[func](i);
The input stream is highly predictable and strongly skewed towards higher digits. The winning function implementation lines up with that distribution. It would not fare as well with higher entropy input.
Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%. After shuffling the input, branchless wins by 2.4X (240%).
Feb 26 2020
parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
wrote:
 The winning function implementation lines up with that 
 distribution.  It would not fare as well with higher entropy 
 input.
Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%. After shuffling the input, branchless wins by 2.4X (240%).
I've replaced the input by the front of a rndGen (that pops count times, starting from a custom seed) and I never see decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better. Maybe you talked about another implementation of decimalLength9?
Feb 26 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
 On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
 wrote:
 After shuffling the input, branchless wins by 2.4X (240%).
I've replaced the input by the front of a rndGen (that pops for count times and starting with a custom seed) and I never see the decimalLength9_3 (which seems to be the closest to the original in term of performances) doing better.
The uniform random uint distribution yields a highly skewed digits distribution. Note, for example, that only 10 of the 2**32 possible outputs from a uint PRNG can be represented by a single decimal digit. Your current winner will be very hard to beat if the inputs are uniform random. That implementation's high-to-low cascade of early exit ifs aligns with PRNG output. If you aren't sure what the input distribution is, or want guaranteed good worst case behavior, then branchless may be a better way to go.
 Maybe you talked about another implementation of decimalLength9 
 ?
Yes. It's one I wrote after I saw your post. Pseudo-code here:

   auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }

Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return. No branching.

Let me know if the above is unclear or insufficient.
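The pseudo-code above can be filled in as a minimal runnable D sketch (the cast, underscores, and the demo main are my additions, not from the post). Each `bool` comparison participates in arithmetic as 0 or 1, which is what lets LDC emit cmp/seta pairs instead of branches:

```d
import std.stdio : writeln;

ubyte d9_branchless(const uint v) pure nothrow @nogc
{
    // each (v >= N) yields 0 or 1; summing them counts how many
    // decimal thresholds v has crossed (10-digit values lump into 9)
    return cast(ubyte)
        (1 + (v >= 10) + (v >= 100) + (v >= 1_000)
           + (v >= 10_000) + (v >= 100_000) + (v >= 1_000_000)
           + (v >= 10_000_000) + (v >= 100_000_000));
}

void main()
{
    writeln(d9_branchless(9), " ", d9_branchless(10), " ",
            d9_branchless(uint.max)); // prints "1 2 9"
}
```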
Feb 26 2020
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:
 On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
 On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
 wrote:
 After shuffling the input, branchless wins by 2.4X (240%).
snip Let me know if the above is unclear or insufficient.
The 2.4X improvement came when using a shuffled uniform digits distribution: equal numbers of 1 digit values, 2 digit values, ... put into an array and shuffled. Branchless *loses* by 8.5% or so on my Zen1 when the array is not shuffled (when branch predictions are nearly perfect).
Feb 26 2020
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:
 Maybe you talked about another implementation of 
 decimalLength9 ?
Yes. It's one I wrote after I saw your post. Psuedo-code here: auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... } Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return. No branching. Let me know if the above is unclear or insufficient.
No thanks, it's crystal clear now.
Feb 26 2020
parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
 wrote:
 Maybe you talked about another implementation of 
 decimalLength9 ?
Yes. It's one I wrote after I saw your post. Psuedo-code here: auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... } Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return. No branching. Let me know if the above is unclear or insufficient.
No thanks, it's crystal clear now.
although I don't see the performance gain. Now for me a hybrid 
version based on a LUT and on the branchless one is the fastest 
(decimalLength9_5), although still slower than the original.

Updated program, incl. your branchless version (decimalLength9_4):

---
-mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2

import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;
import std.random;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0: case 1: case 2: result = 1; break;
        case 3: result = v >= 10 ? 2 : 1; break;
        case 4: case 5: result = 2; break;
        case 6: result = v >= 100 ? 3 : 2; break;
        case 7: case 8: result = 3; break;
        case 9: result = v >= 1000 ? 4 : 3; break;
        case 10: case 11: case 12: result = 4; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 14: case 15: result = 5; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 17: case 18: result = 6; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 20: case 21: case 22: result = 7; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 24: case 25: result = 8; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        case 27: case 28: case 29: case 30: case 31: result = 9; break;
        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
        0 : (uint a) => ubyte(1),
        1 : (uint a) => ubyte(1),
        2 : (uint a) => ubyte(1),
        3 : (uint a) => a >= 10 ? ubyte(2) : ubyte(1),
        4 : (uint a) => ubyte(2),
        5 : (uint a) => ubyte(2),
        6 : (uint a) => a >= 100 ? ubyte(3) : ubyte(2),
        7 : (uint a) => ubyte(3),
        8 : (uint a) => ubyte(3),
        9 : (uint a) => a >= 1000 ? ubyte(4) : ubyte(3),
        10: (uint a) => ubyte(4),
        11: (uint a) => ubyte(4),
        12: (uint a) => ubyte(4),
        13: (uint a) => a >= 10000 ? ubyte(5) : ubyte(4),
        14: (uint a) => ubyte(5),
        15: (uint a) => ubyte(5),
        16: (uint a) => a >= 100000 ? ubyte(6) : ubyte(5),
        17: (uint a) => ubyte(6),
        18: (uint a) => ubyte(6),
        19: (uint a) => a >= 1000000 ? ubyte(7) : ubyte(6),
        20: (uint a) => ubyte(7),
        21: (uint a) => ubyte(7),
        22: (uint a) => ubyte(7),
        23: (uint a) => a >= 10000000 ? ubyte(8) : ubyte(7),
        24: (uint a) => ubyte(8),
        25: (uint a) => ubyte(8),
        26: (uint a) => a >= 100000000 ? ubyte(9) : ubyte(8),
        27: (uint a) => ubyte(9),
        28: (uint a) => ubyte(9),
        29: (uint a) => ubyte(9),
        30: (uint a) => ubyte(9),
        31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    enum ubyte doSwitch = ubyte(0);
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
        0 : ubyte(1),  1 : ubyte(1),  2 : ubyte(1),  3 : doSwitch,
        4 : ubyte(2),  5 : ubyte(2),  6 : doSwitch,  7 : ubyte(3),
        8 : ubyte(3),  9 : doSwitch,  10: ubyte(4),  11: ubyte(4),
        12: ubyte(4),  13: doSwitch,  14: ubyte(5),  15: ubyte(5),
        16: doSwitch,  17: ubyte(6),  18: ubyte(6),  19: doSwitch,
        20: ubyte(7),  21: ubyte(7),  22: ubyte(7),  23: doSwitch,
        24: ubyte(8),  25: ubyte(8),  26: doSwitch,  27: ubyte(9),
        28: ubyte(9),  29: ubyte(9),  30: ubyte(9),  31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector != doSwitch)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result = v >= 10 ? 2 : 1; break;
        case 6 : result = v >= 100 ? 3 : 2; break;
        case 9 : result = v >= 1000 ? 4 : 3; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}

ubyte decimalLength9_4(const uint v) pure nothrow
{
    return 1 + (v >= 10) + (v >= 100) + (v >= 1000)
             + (v >= 10000) + (v >= 100000) + (v >= 1000000)
             + (v >= 10000000) + (v >= 100000000);
}

ubyte decimalLength9_5(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    enum ubyte doBranchlessWay = ubyte(0);
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
        0 : ubyte(1),       1 : ubyte(1),       2 : ubyte(1),  3 : doBranchlessWay,
        4 : ubyte(2),       5 : ubyte(2),       6 : doBranchlessWay, 7 : ubyte(3),
        8 : ubyte(3),       9 : doBranchlessWay, 10: ubyte(4), 11: ubyte(4),
        12: ubyte(4),       13: doBranchlessWay, 14: ubyte(5), 15: ubyte(5),
        16: doBranchlessWay, 17: ubyte(6),       18: ubyte(6), 19: doBranchlessWay,
        20: ubyte(7),       21: ubyte(7),       22: ubyte(7),  23: doBranchlessWay,
        24: ubyte(8),       25: ubyte(8),       26: doBranchlessWay, 27: ubyte(9),
        28: ubyte(9),       29: ubyte(9),       30: ubyte(9),  31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector != doBranchlessWay)
        result = resultOrSelector;
    else
    {
        result = 1 + (v >= 10) + (v >= 100) + (v >= 1000)
                   + (v >= 10000) + (v >= 100000) + (v >= 1000000)
                   + (v >= 10000000) + (v >= 100000000);
    }
    return result;
}

void main(string[] args)
{
    uint sum;
    ulong count;
    uint seed;
    ubyte func;
    GC.disable;
    getopt(args, config.passThrough, "c|count", &count,
           "f|func", &func, "s|seed", &seed);
    static const funcs = [&decimalLength9_0, &decimalLength9_1,
                          &decimalLength9_2, &decimalLength9_3,
                          &decimalLength9_4, &decimalLength9_5];
    rndGen.seed(seed);
    foreach (const ulong i; 0 .. count)
    {
        sum += funcs[func](rndGen.front());
        rndGen.popFront();
    }
    writeln(sum);
}
---

Could you share your benchmarking method maybe?
Feb 27 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
 wrote:
 Maybe you talked about another implementation of 
 decimalLength9 ?
Yes. It's one I wrote after I saw your post. Psuedo-code here: auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... } Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return. No branching. Let me know if the above is unclear or insufficient.
No thanks, it's crystal clear now.
although I don't see the performance gain. snip
 ubyte decimalLength9_0(const uint v)
 {
     if (v >= 100000000) { return 9; }
     if (v >= 10000000) { return 8; }
     if (v >= 1000000) { return 7; }
     if (v >= 100000) { return 6; }
     if (v >= 10000) { return 5; }
     if (v >= 1000) { return 4; }
     if (v >= 100) { return 3; }
     if (v >= 10) { return 2; }
     return 1;
 }

 snip
 Could you share your benchmarking method maybe ?
OK. I'm working with equi-probable digits. You're working with equi-probable values. The use-case may be either of these or, more likely, a third as yet unknown/unspecified distribution.

The predictability of the test input sequence clearly influences the performance of the branching implementations. The uniform random input is highly predictable wrt digits (there are *many* more high digit numbers than low digit numbers).

Take the implementation above as an example. If, as in the random number (equi-probable value) scenario, your inputs skew *strongly* toward a higher number of digits, then the code should perform very well.

If you believe the function will see uniform random values as input, then testing with uniform random inputs is the way to go. If you believe some digits distribution is what will be seen, then you should alter your inputs to match. (Testing uniform random in addition to the supposed target distribution is almost always a good idea for sensitivity analysis.)

My testing methodology: to obtain an equi-digit distribution, I fill an array with N 1-digit values, followed by N 2-digit values, ... followed by N 9-digit values. I use N == 1000 and loop over the total array until I've presented about 1 billion values to the function under test. I test against that equi-digit array before shuffling (this favors branching implementations) and then again after shuffling (this favors branchless).

If you wish to work with equi-digit distributions I'd prefer that you implement the functionality anew. This is some small duplicated effort but will help guard against my having screwed up even that simple task (less than 6 LOC IIRC). I will post my code if there is any meaningful difference in your subsequent results.
Feb 27 2020
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:
 big snip
TL;DR for the snipped: unsurprisingly, different inputs will lead to different timing results.

The equi-probable values supplied by a standard PRNG differ significantly from an equi-probable digit input. In particular, the PRNG input is much more predictable. Note also that a uint PRNG will produce numbers that need 10 decimal digits. If my arithmetic is correct, you'll see "9 or better" out of a uint PRNG 97.79% of the time.

The predictability of the full set of inputs is, secondarily but importantly, also influenced by the ability of the branch predictor to ferret out combinations. I lack confidence in my ability to reason from first principles when presented with a modern CPU branch predictor and a complex input. Much easier for me to just run tests against various distributions that I think span the input and reason from the results.
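The skew is easy to quantify with a few lines of D (a sketch of mine, assuming digit counts are capped at 9 as in decimalLength9, so 10-digit values count as 9). Each digit count owns a slice of the uint range, and the top slice dominates, which lets one sanity-check the percentage quoted above:

```d
import std.stdio;

void main()
{
    // fraction of the uint space that needs each digit count
    immutable ulong total = 1UL << 32;
    ulong lo = 0;
    foreach (digits; 1 .. 10)
    {
        // upper bound for this digit count; "9" absorbs everything above 10^9
        immutable ulong hi = digits == 9 ? total : 10UL ^^ digits;
        writefln("%s digit(s): %.6f%%", digits, 100.0 * (hi - lo) / total);
        lo = hi;
    }
}
```

Running this shows that well over 97% of uniform random uints land in the top bucket, so a high-to-low if-cascade almost always exits on its first comparison.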
Feb 27 2020
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:
 On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
 I will post my code if there is any meaningful difference in 
 your subsequent results.
Give me something I can compile and verify. I'm not here to steal; if you found something you can still propose it to the repos that would take advantage of the optimization.
Feb 27 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
 wrote:
 On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
 I will post my code if there is any meaningful difference in 
 your subsequent results.
give me something I can compile and verify. I'm not there to steal, if you found something you can still propose it to the repos that would take advantage of the optim.
I'm not at all concerned about theft of trivial code. I am concerned that a simple error in my code will live on in a copy/paste environment.

Regardless, I'll post the code once I get home. It may be the only way to communicate the central problem as I see it: imprecision in the test specification (the input specification).
Feb 27 2020
parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 17:17:32 UTC, Bruce Carneal 
wrote:
 On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
 wrote:
 On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. 
 wrote:
 I will post my code if there is any meaningful difference in 
 your subsequent results.
give me something I can compile and verify. I'm not there to steal, if you found something you can still propose it to the repos that would take advantage of the optim.
I'm not at all concerned about theft of trivial code. I am concerned that a simple error in my code will live on in a copy/paste environment. Regardless, I'll post the code once I get home. It may be the only way to communicate the central problem as I see it: imprecision in the test specification (the input specification).
Yes please, post the benchmark method. You see, the benchmarks I run with your version are always slowest. I'm aware that rndGen (and generally any uniform rnd func) is subject to a bias but I don't think this bias matters much in the case we talk about.
Feb 27 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Thursday, 27 February 2020 at 19:46:23 UTC, Basile B. wrote:
 Yes please, post the benchmark method. You see the benchmarks I 
 run with your version are always slowest. I'm aware that rndGen 
 (and generaly any uniform rnd func) is subject to a bias but I 
 dont thing this bias maters much in the case we talk about.
The code below is the test jig that I'm using currently. It is adapted from yours but adds the -d=distribution command line option. I have omitted the function bodies for the entries in the funcs[] table. You can cut and paste there at will but I wanted to spare any other readers the bulk. The timings I get from the below are entirely consistent with my earlier reporting and, I believe, with yours. I urge you to investigate the difference in timings between the original, and currently defaulted, distribution and the -d=edig distribution. If you don't understand what's going on there, or if you still don't believe input distributions matter, please get in touch. After all, it's possible that I'm the one with the misunderstanding. Have fun!

<code>
const funcs = [
    &decimalLength9_0, &decimalLength9_1, &decimalLength9_2,
    &decimalLength9_3, &decimalLength9_4, &decimalLength9_5
];

ulong runOriginalFuncCount(ubyte funcIdx, ulong count)
{
    GC.disable;
    uint sum;
    foreach (const ulong i; 0 .. count)
    {
        sum += funcs[funcIdx](rndGen.front());
        rndGen.popFront();
    }
    return sum;
}

ulong runFuncCountDist(ubyte funcIdx, ulong count, string dist)
{
    enum perDigit = 1000;
    // NOTE: this is a gross distortion given uint.max but it is "the spec"
    enum maxDigits = 9;
    uint[perDigit * maxDigits] samples = void;
    const chunks = count / samples.length;
    const leftOvers = count % samples.length;
    if (dist == "prng")
    {
        foreach (ref val; samples[])
        {
            val = rndGen.front();
            rndGen.popFront();
        }
    }
    else if (dist == "edig" || dist == "edigsorted")
    {
        // for reference, this is the "6 LOC IIRC"
        uint value = 1;
        foreach (dig; 0 .. maxDigits)
        {
            samples[dig * perDigit .. (dig + 1) * perDigit] = value;
            value *= 10;
        }
        if (auto shuffle = dist != "edigsorted")
            randomShuffle(samples[]);
    }
    else
    {
        assert(0, "dist option should be in {oprng, prng, edig, edigsorted}");
    }
    const func = funcs[funcIdx];
    if (count < 1) // late check gives us a handle on setup time
        return 0;
    // OK, enough with the setup, time to roll
    ulong sum = 0;
    foreach (chunk; 0 .. chunks)
        foreach (sample; samples[])
            sum += func(sample);
    foreach (sample; samples[$ - leftOvers .. $])
        sum += func(sample);
    return sum;
}

// This is a timing test jig for functions that accept a uint
// and return the number of decimal digits needed to represent
// that uint (except that 10 digit values are lumped in with
// 9 digit values currently).
//
// The command line options here are:
//   -c=count (the number of values to present to the function selected)
//   -s=seed (the value supplied to rndGen.seed)
//   -f=funcIndex (identifies the function under test, see source funcs[])
//   -d=distribution (one of: oprng, prng, edig, edigsorted)
//
// Distributions explained:
//   oprng: original prng distribution, a rndGen.popFront per function call
//   prng: prng distribution, also rndGen but coming from a large, pre-filled, array
//   edig: same number of 1-digit, 2-digit, ... numbers, shuffled
//   edigsorted: edig values but sorted to highlight the branch predictor impact
//
// This test jig could evolve in at least six ways:
//   1) the full digit range could (arguably should) be accounted for
//   2) additional distributions, like "favor small numbers", could be added
//   3) additional implementation ideas can be explored
//   4) the input range could be made variable
//   5) the number base could be templatized
//   6) inlined performance could be tested by aliasing Func (runtime switch to template)
//
// Final note: the main benefit of working on this problem is better understanding,
// not shaving nanoseconds on this particular function.
void main(string[] args)
{
    ulong count;
    uint seed;
    ubyte func;
    // NOTE: oprng writeln output will usually not equal that of -d=prng
    enum defaultDistribution = "oprng";
    string dist = defaultDistribution;
    GC.disable;
    getopt(args, config.passThrough,
        "c|count", &count,
        "f|func", &func,
        "s|seed", &seed,
        "d|distribution", &dist);
    rndGen.seed(seed);
    const original = dist == "oprng";
    const sum = original ? runOriginalFuncCount(func, count)
                         : runFuncCountDist(func, count, dist);
    writeln(sum);
}
</code>
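To see concretely why -d=edig and -d=edigsorted can time so differently, here is a minimal, self-contained sketch (my own simplified harness, not the jig above; function and variable names are mine): the same equal-digit-count samples are fed to the thread's branchy counter once sorted and once shuffled, so the work is identical and only the branch predictor's success rate differs between the two passes.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.random : randomShuffle;
import std.stdio : writeln;

// the branchy counter from the start of the thread
ubyte decimalLength9(const uint v) pure nothrow
{
    if (v >= 100000000) return 9;
    if (v >= 10000000) return 8;
    if (v >= 1000000) return 7;
    if (v >= 100000) return 6;
    if (v >= 10000) return 5;
    if (v >= 1000) return 4;
    if (v >= 100) return 3;
    if (v >= 10) return 2;
    return 1;
}

void main()
{
    enum perDigit = 1000, maxDigits = 9;
    uint[perDigit * maxDigits] samples;
    uint value = 1;
    foreach (dig; 0 .. maxDigits) // 1000 samples per digit count, ascending
    {
        samples[dig * perDigit .. (dig + 1) * perDigit] = value;
        value *= 10;
    }

    ulong timePass(uint[] data)
    {
        ulong sum;
        auto sw = StopWatch(AutoStart.yes);
        foreach (rep; 0 .. 1000)
            foreach (s; data)
                sum += decimalLength9(s);
        writeln(sw.peek, " (checksum ", sum, ")");
        return sum;
    }

    const sortedSum = timePass(samples[]);   // predictable branches
    randomShuffle(samples[]);
    const shuffledSum = timePass(samples[]); // mispredicted branches
    assert(sortedSum == shuffledSum);        // identical work either way
}
```

The checksums are asserted equal so the compiler cannot drop either loop; only the timing differs.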
Feb 27 2020
parent Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 21:46:08 UTC, Bruce Carneal 
wrote:
 On Thursday, 27 February 2020 at 19:46:23 UTC, Basile B. wrote:
 [...]
 The code below is the test jig that I'm using currently. It is adapted from yours but adds the -d=distribution command line option. [...]
Yes, I finally can see the branchless version beating the original with -d=edig. Thanks for your time.
Feb 28 2020
prev sibling next sibling parent reply Johan <j j.nl> writes:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 ...

 Then bad surprise. Even with ldmd (so ldc2 basically) fed 
 with the args from the script line. Maybe the fdecimalLength9 
 version is slightly faster. Only *slightly*. Good news: I've 
 lost my time. So I tried an alternative version that uses a table 
 of delegates instead of a switch (ffdecimalLength9) and 
 surprise, "tada", it is like **100x** slower than the two 
 others.

 How is that possible?
Hi Basile,

I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv

cheers,
  Johan
Feb 26 2020
next sibling parent Basile B. <b2.temp gmx.com> writes:
On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 [...]
 [...]
Yes, llvm-mca looks excellent, although I don't know if it's worth continuing... You see, this function is certainly not a bottleneck; it's just that I wanted to try to do better than the naive implementation. Fundamentally the problem is that:

1. the original is smaller and faster to decode
2. the alternatives (esp. the 3rd) are conceptually better, but the cost of the jump table + lzcnt cancels out the gain.
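For what it's worth, the switch can be avoided entirely: the classic trick (from the Bit Twiddling Hacks page, also used in several runtime libraries) approximates log10 from bsr with one multiply, then corrects with a single table lookup and compare. A hedged sketch, not code from this thread; `decimalLengthBsr` is my own name, the function assumes v >= 1 (as in RYU, where the argument is never 0) and, unlike decimalLength9, returns 10 for values >= 10^9:

```d
import core.bitop : bsr;

ubyte decimalLengthBsr(const uint v) pure nothrow @nogc
{
    // assumption: v >= 1, so bsr(v) is defined
    static immutable uint[10] p10 = [
        1, 10, 100, 1_000, 10_000,
        100_000, 1_000_000, 10_000_000, 100_000_000, 1_000_000_000
    ];
    // 1233/4096 approximates log10(2): t is the digit count minus 1,
    // possibly overshooting by exactly 1, which the compare corrects
    const t = ((bsr(v) + 1) * 1233) >> 12;
    return cast(ubyte)(t + 1 - (v < p10[t]));
}
```

One multiply, one shift, one load, and one compare per call, with no jump table, so there is nothing data-dependent for the branch predictor to get wrong.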
Feb 26 2020
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 [...]
Hi Basile, I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
Andrei gave a talk about this too a few years ago.
 It has some ideas that may help you make sure your measurements 
 are good and may give you ideas to find the performance 
 bottleneck or where to optimize.
 llvm-mca is featured on godbolt.org: 
 https://mca.godbolt.org/z/YWp3yv

 cheers,
   Johan
Feb 27 2020
parent Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 14:12:35 UTC, Basile B. wrote:
 On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
 wrote:
 [...]
Hi Basile, I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
Andrei gave a talk about this too a few years ago.
 It has some ideas that may help you make sure your 
 measurements are good and may give you ideas to find the 
 performance bottleneck or where to optimize.
 llvm-mca is featured on godbolt.org: 
 https://mca.godbolt.org/z/YWp3yv

 cheers,
   Johan
https://www.youtube.com/watch?v=Qq_WaiwzOtI
Feb 27 2020
prev sibling next sibling parent reply Dennis Cote <private secret.com> writes:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.
Perhaps you could try something like this.

<code>
int decimalDigitLength(ulong n)
{
    if (n < 10000)
        if (n < 100)
            return n < 10 ? 1 : 2;
        else
            return n < 1000 ? 3 : 4;
    else if (n < 100000000)
        if (n < 1000000)
            return n < 100000 ? 5 : 6;
        else
            return n < 10000000 ? 7 : 8;
    else if (n < 1000000000000)
        if (n < 10000000000)
            return n < 1000000000 ? 9 : 10;
        else
            return n < 100000000000 ? 11 : 12;
    else if (n < 10000000000000000)
        if (n < 100000000000000)
            return n < 10000000000000 ? 13 : 14;
        else
            return n < 1000000000000000 ? 15 : 16;
    else if (n < 1000000000000000000)
        return n < 100000000000000000 ? 17 : 18;
    else
        return n < 10000000000000000000 ? 19 : 20;
}
</code>

This uses at most 6 compares for any 64 bit number and only 3 for the most common small numbers less than 10000. I was glad to see that with ldc at run.dlang.io using the -O3 optimization I could change the function signature to match yours and the compiler eliminated all the unreachable dead code for larger values. The compiler produced the following assembler

<code>
	.section	.text.ubyte onlineapp.decimalLength9(uint),"axG", progbits,ubyte onlineapp.decimalLength9(uint),comdat
	.globl	ubyte onlineapp.decimalLength9(uint)
	.p2align	4, 0x90
	.type	ubyte onlineapp.decimalLength9(uint), function
ubyte onlineapp.decimalLength9(uint):
	.cfi_startproc
	cmpl	$9999, %edi
	ja	.LBB1_5
	cmpl	$99, %edi
	ja	.LBB1_4
	cmpl	$10, %edi
	movb	$2, %al
	sbbb	$0, %al
	retq
.LBB1_5:
	cmpl	$99999999, %edi
	ja	.LBB1_9
	cmpl	$999999, %edi
	ja	.LBB1_8
	cmpl	$100000, %edi
	movb	$6, %al
	sbbb	$0, %al
	retq
.LBB1_4:
	cmpl	$1000, %edi
	movb	$4, %al
	sbbb	$0, %al
	retq
.LBB1_9:
	cmpl	$1000000000, %edi
	movb	$10, %al
	sbbb	$0, %al
	retq
.LBB1_8:
	cmpl	$10000000, %edi
	movb	$8, %al
	sbbb	$0, %al
	retq
.Lfunc_end1:
	.size	ubyte onlineapp.decimalLength9(uint), .Lfunc_end1-ubyte onlineapp.decimalLength9(uint)
	.cfi_endproc
</code>

for the same body with the signature ubyte decimalLength9(uint n).
This may be faster than your sequential comparison function depending upon the distribution of numbers. In real applications, small numbers are far more common so the reduced number of compares for those values should be beneficial in most cases.
Feb 27 2020
parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 09:33:28 UTC, Dennis Cote wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.
 [...]
Sorry but no. I think that you have missed how this has changed since the first message.

1. The way it was tested initially was wrong, because LLVM was optimizing some stuff in some tests and not others, due to literal constants.
2. Apparently there is a branchless version that's fast when testing with unbiased input (to be verified). This version is:

---
ubyte decimalLength9_4(const uint v) pure nothrow
{
    return 1
        + (v >= 10)
        + (v >= 100)
        + (v >= 1000)
        + (v >= 10000)
        + (v >= 100000)
        + (v >= 1000000)
        + (v >= 10000000)
        + (v >= 100000000);
}
---

but I cannot see the improvement when using time on the test program and 100000000 calls fed with a random number.

See https://forum.dlang.org/post/ctidwrnxvwwkouprjszw forum.dlang.org for the latest evolution of the discussion.
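Shelling out to `time` buries a few nanoseconds per call under process startup and rndGen overhead, so an in-process comparison may show a difference that `time` cannot. A hedged sketch along the lines of the -d=edig distribution (my own simplified naming, not the jig's exact code):

```d
import std.datetime.stopwatch : benchmark;
import std.random : randomShuffle;
import std.stdio : writeln;

// the original cascade of compares
ubyte branchy(const uint v) pure nothrow
{
    if (v >= 100000000) return 9;
    if (v >= 10000000) return 8;
    if (v >= 1000000) return 7;
    if (v >= 100000) return 6;
    if (v >= 10000) return 5;
    if (v >= 1000) return 4;
    if (v >= 100) return 3;
    if (v >= 10) return 2;
    return 1;
}

// the fully branchless sum-of-compares version
ubyte branchless(const uint v) pure nothrow
{
    return cast(ubyte)(1 + (v >= 10) + (v >= 100) + (v >= 1000)
        + (v >= 10000) + (v >= 100000) + (v >= 1000000)
        + (v >= 10000000) + (v >= 100000000));
}

void main()
{
    // equal-digit-count samples, shuffled, like the jig's -d=edig
    uint[9000] samples;
    uint value = 1;
    foreach (dig; 0 .. 9)
    {
        samples[dig * 1000 .. (dig + 1) * 1000] = value;
        value *= 10;
    }
    randomShuffle(samples[]);

    ulong s1, s2;
    auto results = benchmark!(
        () { foreach (v; samples) s1 += branchy(v); },
        () { foreach (v; samples) s2 += branchless(v); })(1000);
    assert(s1 == s2); // both must count the same digits
    writeln("branchy:    ", results[0]);
    writeln("branchless: ", results[1]);
}
```

Both candidates run over the same shuffled array in the same process, so setup cost is shared and only the per-call difference is measured.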
Feb 27 2020
parent Basile B. <b2.temp gmx.com> writes:
On Thursday, 27 February 2020 at 09:41:20 UTC, Basile B. wrote:
 On Thursday, 27 February 2020 at 09:33:28 UTC, Dennis Cote 
 wrote:
 [...]
 [...]
Maybe just add your version to the test program and run

  time ./declen -c100000000 -f0 -s137  // original
  time ./declen -c100000000 -f4 -s137  // the 100% branchless
  time ./declen -c100000000 -f5 -s137  // the LUT + branchless for the bit numbers that need attention
  time ./declen -c100000000 -f6 -s137  // assumed to be your version

to see if it beats the original. Thing is that I cannot do it right now but otherwise I will try tomorrow.
Feb 27 2020
prev sibling parent reply 9il <ilyayaroshenko gmail.com> writes:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 [...]
bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get inlined code.
Feb 27 2020
parent reply Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 [...]
 bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get inlined code.
That's surprising. I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is) Under what conditions should I be guarding against an inlining failure?
Feb 28 2020
next sibling parent Bruce Carneal <bcarneal gmail.com> writes:
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
 On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
 wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 [...]
 bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get inlined code.
That's surprising. I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is) Under what conditions should I be guarding against an inlining failure?
Here's the code I used:

<code>
int main(string[] args)
{
    import core.bitop : bsr;
    return bsr(cast(uint) args.length);
}
</code>

BTW, I'm a huge fan of your performance work.
Feb 28 2020
prev sibling next sibling parent kinke <noone nowhere.com> writes:
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
 On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
 bsr can be done in one/two CPU operation, quite quick. But 
 core.bitop.bsr wouldn't be inlined. Instead, mir-core 
 (mir.bitop: ctlz) or LDC intrinsics llvm_ctlz can be used for 
 to get code with inlining.
That's surprising. I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native.
These tiny core.bitop wrappers around LLVM intrinsics are always inlined (`pragma(inline, true)`), i.e., you don't even need -O.
Feb 28 2020
prev sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:
 On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
 On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
 wrote:
 So after reading the translation of RYU I was interested too 
 see if the decimalLength() function can be written to be 
 faster, as it cascades up to 8 CMP.

 [...]
 bsr can be done in one or two CPU operations, so it's quite quick. But core.bitop.bsr wouldn't be inlined. Instead, mir-core (mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to get inlined code.
That's surprising. I just got ldc to inline core.bitop.bsr on run.dlang.io using ldc -O3 -mcpu=native. (not sure what the target CPU is)
Ah, my bad. It fails to inline with LDC <= 1.14: https://d.godbolt.org/z/iz9p-6
 Under what conditions should I be guarding against an inlining 
 failure?
Mark it with `pragma(inline, true)`. LDC also has cross-module inlining for non-templated functions.
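As a small illustration of that advice (a minimal sketch; `bitLength` is my own name, not a druntime symbol):

```d
import core.bitop : bsr;

// pragma(inline, true) asks the compiler to always inline this wrapper,
// independent of -O; with LDC it is then also inlinable across modules.
pragma(inline, true)
ubyte bitLength(const uint v) pure nothrow @nogc
{
    return cast(ubyte)(bsr(v | 1) + 1); // v | 1 keeps bsr defined at v == 0
}
```

With the pragma in place, a call like `bitLength(x)` in another module should collapse to a single bsr/lzcnt plus an add rather than a function call.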
Feb 28 2020