digitalmars.D - Reducing the cost of autodecoding

Andrei Alexandrescu (43/43) Oct 12 2016 So we've had a good run with making popFront smaller. In ASCII

Stefan Koch (5/48) Oct 12 2016 This will only work really efficiently with some state on the

Andrei Alexandrescu (4/5) Oct 12 2016 Remember the ASCII part is the bothersome one. There's only two

safety0ff (12/15) Oct 12 2016 My measurements:

Stefan Koch (2/18) Oct 12 2016 where did you apply the branch hints ?

safety0ff (2/3) Oct 12 2016 Code: http://pastebin.com/CFCpUftW

Kagamin (3/4) Oct 13 2016 Line 25 doesn't look trusted: reads past the end of an empty

safety0ff (3/7) Oct 13 2016 Length is checked in the loop that calls this function.

Andrei Alexandrescu (11/26) Oct 12 2016 Wait, so going through the bytes made almost no difference? Or did you

Stefan Koch (4/12) Oct 12 2016 I was about to suggest the same.

Stefan Koch (3/21) Oct 12 2016 We should probably introduce a new module for stuff like this.

Andrei Alexandrescu (5/7) Oct 12 2016 Yah, shouldn't go in object.d as it's fairly niche. On the other hand

Stefan Koch (6/14) Oct 12 2016 maybe core.intrinsics ?
Johan Engelen (31/39) Oct 13 2016 There could be some kind of "expect" theme, or a
Jacob Carlborg (9/13) Oct 13 2016 I think it should be a new module. I think core.intrinsics, as Stefan

safety0ff (4/6) Oct 12 2016 It made little difference: LDC compiled into AVX2 vectorized

safety0ff (6/8) Oct 12 2016 Measurements without -mcpu=native:

Andrei Alexandrescu (5/14) Oct 12 2016 So we should be able to reduce overhead by means of proper code

Stefan Koch (7/23) Oct 12 2016 AVX for ascii ?

Andrei Alexandrescu (3/25) Oct 12 2016 Oh ok, so it's that checksum in particular that got optimized. Bad

safety0ff (4/6) Oct 13 2016 Also, I suspect a benchmark with a larger loop body might not

Stefan Koch (40/43) Oct 14 2016 I disagree in longer loops code compactness is as important as in

Stefan Koch (41/85) Oct 14 2016 Disregard all that code.

Patrick Schluter (38/128) Oct 15 2016 Looks very verbose to me. I had found in the BSD codebase a very

Patrick Schluter (5/5) Oct 15 2016 Oooops, I should not post after drinking 2 glasses of
Andrei Alexandrescu (3/5) Oct 15 2016 https://ldc.acomirei.ru

Uplink_Coder (27/27) Oct 15 2016 It can also be written like this producing smaller code.

Patrick Schluter (4/31) Oct 15 2016 Just a question. Do encoding errors not have to be detected or is

safety0ff (4/6) Oct 15 2016 AFAIK they have to be detected, otherwise it would be a

Patrick Schluter (9/9) Oct 15 2016 At least with that lookup table below, you can detect isolated

Patrick Schluter (5/14) Oct 15 2016 192 and 193 can never appear in a UTF-8 text, they are overlongs
Uplink_Coder (4/13) Oct 15 2016 If you use 0 instead of 1 the length check will suffice for

Stefan Koch (43/59) Oct 15 2016 __gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2,

Patrick Schluter (11/73) Oct 15 2016 What does "debug if" do ? Because when I replace it with a simple

Patrick Schluter (46/46) Oct 16 2016 Here my version. It's probably not the shortest (100 ligns of

Uplink_Coder (5/52) Oct 16 2016 This looks quite slow.

NoBrainer (5/8) Oct 16 2016 A: What's 2 + 2?
Patrick Schluter (35/40) Oct 16 2016 I know but it has to be correct before being fast.

Patrick Schluter (5/46) Oct 16 2016 Of course, this line is wrong, should shift by 18 not 16 :
Patrick Schluter (47/49) Oct 16 2016 Ok now the looped version which doesn't need the lookup table.

ketmar (4/4) Oct 16 2016 On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:

safety0ff (7/16) Oct 14 2016 You must have misunderstood:

Ilya Yaroshenko (3/16) Oct 12 2016 For fair test line should be feet into L1 cache. --Ilya

Ilya Yaroshenko (3/23) Oct 12 2016 EDITL For fair test the line should be in the L1 cache. --Ilya

Johan Engelen (9/11) Oct 12 2016 "-O3 -enable-inlining" is synonymous with "-O3" :-)
Dmitry Olshansky (43/50) Oct 25 2016 I'm sooo late to the party. Still a neat trick with UTF lookup table is

Steven Schveighoffer (13/27) Oct 25 2016 Half of the entries in your table are 0 (every char that starts with

Stefam Koch (3/39) Oct 25 2016 The whole point of a LUT to begin with is to reduce instructions.

Steven Schveighoffer (4/9) Oct 25 2016 Yes, I know. But as I understand it, using 64-bit math on 32-bit systems...

safety0ff (37/38) Oct 25 2016 Some food for thought:

safety0ff (4/6) Oct 25 2016 Unfortunately it also changes the API of popFront to throw on

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. 
$]. Smaller functions make sure that the impact on instruction cache in 
larger applications is not high.

Now it's time to look at the end-to-end cost of autodecoding. I wrote 
this simple microbenchmark:

=====
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
     ulong checksum;
     if (args.length == 1)
     {
         while (line.length) {
             version(autodecode)
             {
                 checksum += line.myFront;
                 line.myPopFront;
             }
             else
             {
                 checksum += line[0];
                 line = line[1 .. $];
             }
         }
         version(autodecode)
             writeln("autodecode ", checksum);
         else
             writeln("bytes ", checksum);
     }
     else
         writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something 
like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with 
autodecoding.

Your mission, should you choose to accept it, is to define a combination 
front/popFront that reduces the gap.


Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from s 
 = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. I 
 wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
     ulong checksum;
     if (args.length == 1)
     {
         while (line.length) {
             version(autodecode)
             {
                 checksum += line.myFront;
                 line.myPopFront;
             }
             else
             {
                 checksum += line[0];
                 line = line[1 .. $];
             }
         }
         version(autodecode)
             writeln("autodecode ", checksum);
         else
             writeln("bytes ", checksum);
     }
     else
         writeln("overhead");
 }
 =====

 On my machine, with "ldc2 -release -O3 -enable-inlining" I get 
 something like 0.54s overhead, 0.81s with no autodecoding, and 
 1.12s with autodecoding.

 Your mission, should you choose to accept it, is to define a 
 combination front/popFront that reduces the gap.


 Andrei

This will only work really efficiently with some state on the 
stack.
If we are to support Unicode.

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 12:03 PM, Stefan Koch wrote:
 This will only work really efficiently with some state on the stack.

Remember the ASCII part is the bothersome one. There's only two 
comparisons, all with 100% predictability. We should be able to arrange 
matters so the loss is negligible. -- Andrei

Oct 12 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only two 
 comparisons, all with 100% predictability. We should be able to 
 arrange matters so the loss is negligible. -- Andrei

My measurements:
ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
ldc version 1.0.0

overhead 0.350s
bytes    0.385s
current autodecoding 0.915s (with new LUT popFront)
copy-pasting std.utf decoding functions into current file 0.840s
adding ASCII branch hints (llvm_expect) 0.770s

With the branch hints LDC moves the non-Ascii code outside of the 
loop and creates a really tight loop body.

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
 Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only 
 two comparisons, all with 100% predictability. We should be 
 able to arrange matters so the loss is negligible. -- Andrei

 My measurements:
 ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
 ldc version 1.0.0

 overhead 0.350s
 bytes    0.385s
 current autodecoding 0.915s (with new LUT popFront)
 copy-pasting std.utf decoding functions into current file 0.840s
 adding ASCII branch hints (llvm_expect) 0.770s

 With the branch hints LDC moves the non-Ascii code outside of 
 the loop and creates a really tight loop body.

where did you apply the branch hints ?

Oct 12 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
 where did you apply the branch hints ?

Code: http://pastebin.com/CFCpUftW

Oct 12 2016

Kagamin <spam here.lot> writes:

On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
 Code: http://pastebin.com/CFCpUftW

Line 25 doesn't look trusted: reads past the end of an empty 
string.

Oct 13 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Thursday, 13 October 2016 at 14:51:50 UTC, Kagamin wrote:
 On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
 Code: http://pastebin.com/CFCpUftW

 Line 25 doesn't look trusted: reads past the end of an empty 
 string.

Length is checked in the loop that calls this function.

In phobos length is only checked with an assertion,

Oct 13 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 04:02 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only two
 comparisons, all with 100% predictability. We should be able to
 arrange matters so the loss is negligible. -- Andrei

 My measurements:
 ldc -O3  -boundscheck=off -release -mcpu=native -enable-inlining
 ldc version 1.0.0

 overhead 0.350s
 bytes    0.385s

Wait, so going through the bytes made almost no difference? Or did you 
subtract the overhead already?

 current autodecoding 0.915s (with new LUT popFront)
 copy-pasting std.utf decoding functions into current file 0.840s
 adding ASCII branch hints (llvm_expect) 0.770s

 With the branch hints LDC moves the non-Ascii code outside of the loop
 and creates a really tight loop body.

I think we should define two aliases "likely" and "unlikely" with 
default implementations:

bool likely(bool b) { return b; }
bool unlikely(bool b) { return b; }

They'd go in druntime. Then implementers can hook them into their 
intrinsics.

Works?


Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:
 I think we should define two aliases "likely" and "unlikely" 
 with default implementations:

 bool likely(bool b) { return b; }
 bool unlikely(bool b) { return b; }

 They'd go in druntime. Then implementers can hook them into 
 their intrinsics.

 Works?


 Andrei

I was about to suggest the same.
I can prepare a PR.

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
 Alexandrescu wrote:
 I think we should define two aliases "likely" and "unlikely" 
 with default implementations:

 bool likely(bool b) { return b; }
 bool unlikely(bool b) { return b; }

 They'd go in druntime. Then implementers can hook them into 
 their intrinsics.

 Works?


 Andrei

 I was about to suggest the same.
 I can prepare a PR.

We should probably introduce a new module for stuff like this.
object.d is already filled with too much unrelated things.

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.

Yah, shouldn't go in object.d as it's fairly niche. On the other hand 
defining a new module for two functions seems excessive unless we have a 
good theme. On the third hand we may find an existing module that's 
topically close. Thoughts? -- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.

 Yah, shouldn't go in object.d as it's fairly niche. On the 
 other hand defining a new module for two functions seems 
 excessive unless we have a good theme. On the third hand we may 
 find an existing module that's topically close. Thoughts? -- 
 Andrei

maybe core.intrinsics ?
or code.codelayout ?

We can control the layout at object-file level.
We should be able to expose some of that functionality.

Oct 12 2016

Johan Engelen <j j.nl> writes:

On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.

 Yah, shouldn't go in object.d as it's fairly niche. On the 
 other hand defining a new module for two functions seems 
 excessive unless we have a good theme. On the third hand we may 
 find an existing module that's topically close. Thoughts? -- 
 Andrei

There could be some kind of "expect" theme, or a 
microoptimization theme. Functions that have no observable 
effects and that provide hints for optimization (possibly 
compiler-dependent implementations of those functions).

Besides providing the expected value of an expression, other 
"expect"/microopt functionality is checking explicitly for 
function pointer values (to inline a likely function), that I 
wrote about in [1]:
```
/// Calls `fptr(args)`, optimize for the case when fptr points to 
Likely().
pragma(inline, true)
auto is_likely(alias Likely, Fptr, Args...)(Fptr fptr, Args args) 
{
     return (fptr == &Likely) ? Likely(args) : fptr(args);
}
// ...
void function() fptr = get_function_ptr();
fptr.is_likely!likely_function();
```

A similar function can be made for "expecting" a class type for 
virtual calls [1].

Other microopt thingies that come to mind are:
- cache prefetching
- function attributes for hot/cold functions

cheers,
   Johan


[1] 
https://johanengelen.github.io/ldc/2016/04/13/PGO-in-LDC-virtual-calls.html

Oct 13 2016

Jacob Carlborg <doob me.com> writes:

On 2016-10-13 03:26, Andrei Alexandrescu wrote:

 Yah, shouldn't go in object.d as it's fairly niche. On the other hand
 defining a new module for two functions seems excessive unless we have a
 good theme. On the third hand we may find an existing module that's
 topically close. Thoughts? -- Andrei

I think it should be a new module. I think core.intrinsics, as Stefan 
suggested, sounds like a good idea. I don't think having a module with 
only two functions is a problem, assuming we expect more of these functions.

We already have that case with core.attribute [1], which only have _one_ 
attribute defined.

[1] https://github.com/dlang/druntime/blob/master/src/core/attribute.d#L54

-- 
/Jacob Carlborg

Oct 13 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:
 Wait, so going through the bytes made almost no difference? Or 
 did you subtract the overhead already?

It made little difference: LDC compiled into AVX2 vectorized 
addition (vpmovzxbq & vpaddq.)

Oct 12 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized 
 addition (vpmovzxbq & vpaddq.)

Measurements without -mcpu=native:
overhead 0.336s
bytes    0.610s
without branch hints 0.852s
code pasted 0.766s

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized addition
 (vpmovzxbq & vpaddq.)

 Measurements without -mcpu=native:
 overhead 0.336s
 bytes    0.610s
 without branch hints 0.852s
 code pasted 0.766s

So we should be able to reduce overhead by means of proper code 
arrangement and interplay of inlining and outlining. The prize, however, 
would be to get the AVX instructions for ASCII going. Is that possible? 
-- Andrei

Oct 12 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized 
 addition
 (vpmovzxbq & vpaddq.)

 Measurements without -mcpu=native:
 overhead 0.336s
 bytes    0.610s
 without branch hints 0.852s
 code pasted 0.766s

 So we should be able to reduce overhead by means of proper code 
 arrangement and interplay of inlining and outlining. The prize, 
 however, would be to get the AVX instructions for ASCII going. 
 Is that possible? -- Andrei

AVX for ascii ?
What are you referring to ?
Most text processing is terribly incompatible with simd.
sse 4.2 has a few instructions that do help, but as far as I am 
aware it is not yet too far spread.

Oct 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/12/2016 09:35 PM, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:
 On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized addition
 (vpmovzxbq & vpaddq.)

 Measurements without -mcpu=native:
 overhead 0.336s
 bytes    0.610s
 without branch hints 0.852s
 code pasted 0.766s

 So we should be able to reduce overhead by means of proper code
 arrangement and interplay of inlining and outlining. The prize,
 however, would be to get the AVX instructions for ASCII going. Is that
 possible? -- Andrei

 AVX for ascii ?
 What are you referring to ?
 Most text processing is terribly incompatible with simd.
 sse 4.2 has a few instructions that do help, but as far as I am aware it
 is not yet too far spread.

Oh ok, so it's that checksum in particular that got optimized. Bad 
benchmark! Bad! -- Andrei

Oct 12 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Thursday, 13 October 2016 at 01:36:44 UTC, Andrei Alexandrescu 
wrote:
 Oh ok, so it's that checksum in particular that got optimized. 
 Bad benchmark! Bad! -- Andrei

Also, I suspect a benchmark with a larger loop body might not 
benefit as significantly from branch hints as this one.

Oct 13 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei

 Also, I suspect a benchmark with a larger loop body might not 
 benefit as significantly from branch hints as this one.

I disagree in longer loops code compactness is as important as in 
small ones.

This is about the smallest inline version of decode I could come 
up with :

__gshared static immutable ubyte[] charWidthTab = [
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
];

dchar myFront(ref char[] str) pure nothrow
{
     dchar c = cast(dchar) str[0];
     if ((c & 128))
     {
         if (c & 64)
         	final switch(charWidthTab[c - 192])
         {
             case 2 :
                 c |= ((str[1] & 0x80) >> 5);
             break;
             case 3 :
                c |= ((str[1] & 0x80) >> 4);
                c |= ((str[2] & 0x80) >> 10);
             break;
             case 4 :
                c |= ((str[1] & 0x80) >> 3);
                c |= ((str[2] & 0x80) >> 9);
                c |= ((str[3] & 0x80) >> 15);
             break;
             case 5,6,1 :
               goto Linvalid;
         }
         else
         Linvalid :
         	c = dchar.init;

     }
	return c;
}

Oct 14 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei

 Also, I suspect a benchmark with a larger loop body might not 
 benefit as significantly from branch hints as this one.

 I disagree in longer loops code compactness is as important as 
 in small ones.

 This is about the smallest inline version of decode I could 
 come up with :

 __gshared static immutable ubyte[] charWidthTab = [
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
 ];

 dchar myFront(ref char[] str) pure nothrow
 {
     dchar c = cast(dchar) str[0];
     if ((c & 128))
     {
         if (c & 64)
         	final switch(charWidthTab[c - 192])
         {
             case 2 :
                 c |= ((str[1] & 0x80) >> 5);
             break;
             case 3 :
                c |= ((str[1] & 0x80) >> 4);
                c |= ((str[2] & 0x80) >> 10);
             break;
             case 4 :
                c |= ((str[1] & 0x80) >> 3);
                c |= ((str[2] & 0x80) >> 9);
                c |= ((str[3] & 0x80) >> 15);
             break;
             case 5,6,1 :
               goto Linvalid;
         }
         else
         Linvalid :
         	c = dchar.init;

     }
 	return c;
 }

Disregard all that code.
It is horribly wrong!

This is more correct : (Tough for some reason it does not pass 
the unittests)

dchar myFront(ref char[] str) pure
{
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             auto l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;

             final switch (l)
             {
             case 2:
                 c = ((c & ~(64 | 128)) << 6);
                 c |= (str.ptr[1] & ~0x80);
                 break;
             case 3:
                 c = ((c & ~(32 | 64 | 128)) << 12);
                 c |= ((str.ptr[1] & ~0x80) << 6);
                 c |= ((str.ptr[2] & ~0x80));
                 break;
             case 4:
                 c = ((c & ~(16 | 32 | 64 | 128)) << 18);
                 c |= ((str.ptr[1] & ~0x80) << 12);
                 c |= ((str.ptr[2] & ~0x80) << 6);
                 c |= ((str.ptr[3] & ~0x80));
                 break;
             case 5, 6, 1:
                 goto Linvalid;
             }
         }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
}

Oct 14 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 15 October 2016 at 00:50:08 UTC, Stefan Koch wrote:
 On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei

 Also, I suspect a benchmark with a larger loop body might not 
 benefit as significantly from branch hints as this one.

 I disagree in longer loops code compactness is as important as 
 in small ones.

 This is about the smallest inline version of decode I could 
 come up with :

 __gshared static immutable ubyte[] charWidthTab = [
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1
 ];

 dchar myFront(ref char[] str) pure nothrow
 {
     dchar c = cast(dchar) str[0];
     if ((c & 128))
     {
         if (c & 64)
         	final switch(charWidthTab[c - 192])
         {
             case 2 :
                 c |= ((str[1] & 0x80) >> 5);
             break;
             case 3 :
                c |= ((str[1] & 0x80) >> 4);
                c |= ((str[2] & 0x80) >> 10);
             break;
             case 4 :
                c |= ((str[1] & 0x80) >> 3);
                c |= ((str[2] & 0x80) >> 9);
                c |= ((str[3] & 0x80) >> 15);
             break;
             case 5,6,1 :
               goto Linvalid;
         }
         else
         Linvalid :
         	c = dchar.init;

     }
 	return c;
 }

 Disregard all that code.
 It is horribly wrong!

 This is more correct : (Tough for some reason it does not pass 
 the unittests)

 dchar myFront(ref char[] str) pure
 {
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             auto l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;

             final switch (l)
             {
             case 2:
                 c = ((c & ~(64 | 128)) << 6);
                 c |= (str.ptr[1] & ~0x80);
                 break;
             case 3:
                 c = ((c & ~(32 | 64 | 128)) << 12);
                 c |= ((str.ptr[1] & ~0x80) << 6);
                 c |= ((str.ptr[2] & ~0x80));
                 break;
             case 4:
                 c = ((c & ~(16 | 32 | 64 | 128)) << 18);
                 c |= ((str.ptr[1] & ~0x80) << 12);
                 c |= ((str.ptr[2] & ~0x80) << 6);
                 c |= ((str.ptr[3] & ~0x80));
                 break;
             case 5, 6, 1:
                 goto Linvalid;
             }
         }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
 }

Looks very verbose to me. I had found in the BSD codebase a very 
clever utf-8 conversion function in C, maybe it can be used here. 
Sorry if I do not participate on the testing as I don't have a 
proper compilation environment here at home. Here the routine I 
use at work (it's in C), put that here for inspiration.

DEFINE_INLINE uint_t xctomb(char *r, wchar_t wc)
{
uint_t u8l = utf8len(wc);

   switch(u8l) {
     /* Note: code falls through cases! */
     case 4: r[3] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x10000;
     case 3: r[2] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x800;
     case 2: r[1] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0xc0;
     case 1: r[0] = wc;
   }
   return u8l;
}

utf8len being

DEFINE_INLINE uint_t utf8len(wchar_t wc)
{
   if(wc < 0x80)
     return 1;
   else if(wc < 0x800)
     return 2;
   else
     if(wc < 0x10000)
       return 3;
     else
       return 4;
}


The code generated on SPARC with gcc 3.4.6 was really good. On 
x86_64 with gcc 5.1 was also not bad. I have not tried a lot of 
alternatives as UTF-8 coding is not a bottle neck on our project. 
There's also no check for length 5 and 6 as they are not possible 
on our system, but for here it has to be added. (the 
DEFINE_INLINE macro is either extern inline or inline depending 
on some macro magic that is not of importance here).

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

Oooops, I should not post after drinking 2 glasses of 
Châteauneuf-du-pape. That function does exactly the contrary of 
what popFront does. This one is conversion from dchar to 
multibyte not multibyte to dchar as you did.
Sorry for the inconvenience.

Oct 15 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/15/2016 12:42 PM, Patrick Schluter wrote:
 Sorry if I do not participate on the testing as I don't have a proper
 compilation environment here at home.

https://ldc.acomirei.ru

Andrei

Oct 15 2016

Uplink_Coder <uplink.coder googlemail.com> writes:

It can also be written like this producing smaller code.
But it the cost of slower decoding.

dchar myFront(ref char[] str) pure
{
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;
             c = 0;
           l--;
             while(l) {
               l--;
                c |= str.ptr[idx++];
                c <<= 6;
             }
             c |= str.ptr[idx];

        }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
}

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
 It can also be written like this producing smaller code.
 But it the cost of slower decoding.

 dchar myFront(ref char[] str) pure
 {
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;
             c = 0;
           l--;
             while(l) {
               l--;
                c |= str.ptr[idx++];
                c <<= 6;
             }
             c |= str.ptr[idx];

        }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
 }

Just a question. Do encoding errors not have to be detected or is 
validity of the string guaranteed? Wrong continuation bytes or 
overlong encodings are not detected by this routine.

Oct 15 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Saturday, 15 October 2016 at 19:00:12 UTC, Patrick Schluter 
wrote:
 Just a question. Do encoding errors not have to be detected or 
 is validity of the string guaranteed?

AFAIK they have to be detected, otherwise it would be a 
regression.

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

At least with that lookup table below, you can detect isolated 
continuation bytes (192 and 193) and invalid codes (above 244).

__gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
];

length 5 and 6 need not to be tested specifically for your goto.

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).

192 and 193 can never appear in a UTF-8 text, they are overlongs 
not continuation bytes. Continuation are characters between 128 
and 191 and thos are not allowed, so should be checked.

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your goto.

Oct 15 2016

Uplink_Coder <uplink.coder googlemail.com> writes:

On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your goto.

If you use 0 instead of 1 the length check will suffice for 
throwing on invalid.

Oct 15 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder wrote:
 On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
 wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your 
 goto.

 If you use 0 instead of 1 the length check will suffice for 
 throwing on invalid.

__gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2, 
2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 3,
     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 
4, 4, 4, 0, 0, 0, 0,
     0, 0, 0, 0];

dchar myFront2(ref char[] str) pure
{
     auto c1 = str.ptr[0];
     if (c1 & 128)
     {
         if (c1 & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c1 - 192];
             if (str.length < l)
                 goto Linvalid;
             dchar c = 0;
             l--;
             while (l)
             {
                 l--;
                 immutable cc = str.ptr[idx++];
                 debug if (cc & 64) goto Linvalid;
                 c |= cc;
                 c <<= 6;
             }
             c |= str.ptr[idx];
             return c;

         }
     Linvalid:
         throw new Exception("yadayada");

     }
     else
     {
         return c1;
     }
}

This code proofs to be the fastest so far.
On UTF and non-UTF text.
It's also fairly small.

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 15 October 2016 at 21:21:22 UTC, Stefan Koch wrote:
 On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder 
 wrote:
 On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
 wrote:
 At least with that lookup table below, you can detect 
 isolated continuation bytes (192 and 193) and invalid codes 
 (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your 
 goto.

 If you use 0 instead of 1 the length check will suffice for 
 throwing on invalid.

 __gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 
 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
 2, 2, 2, 3,
     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 
 4, 4, 4, 0, 0, 0, 0,
     0, 0, 0, 0];

 dchar myFront2(ref char[] str) pure
 {
     auto c1 = str.ptr[0];
     if (c1 & 128)
     {
         if (c1 & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c1 - 192];
             if (str.length < l)
                 goto Linvalid;
             dchar c = 0;
             l--;
             while (l)
             {
                 l--;
                 immutable cc = str.ptr[idx++];
                 debug if (cc & 64) goto Linvalid;
                 c |= cc;
                 c <<= 6;
             }
             c |= str.ptr[idx];
             return c;

         }
     Linvalid:
         throw new Exception("yadayada");

     }
     else
     {
         return c1;
     }
 }

 This code proofs to be the fastest so far.
 On UTF and non-UTF text.
 It's also fairly small.

What does "debug if" do ? Because when I replace it with a simple 
"if" the code generated by "LDC 1.1.0-beta2 -release -O3 
-boundscheck=off" is 15 lines shorter.
That was my first question but I think I see the issue now. By 
removing the debug, the condition is compiled in and the compiler 
short circuits the whole loop and goes to the Linvalid.
The error is that cc is loaded the first time with the same value 
as c1 because idx=0 and it is post incremented. It should be 
preincremented and c would have to be initialised with the data 
bits of c1 not 0.

Oct 15 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

Here my version. It's probably not the shortest (100 ligns of 
assembly with LDC) but it is correct and has following properties:
- Performance proportional to the encoding length
- Detects Invalid byte sequences
- Detects Overlong encodings
- Detects Invalid code points

I put the exception to be comparable to other routines but 
Unicode specifies that it is preferable to not abort on encoding 
errors (to avoid denial of service attacks).

dchar myFront2(ref char[] str)
{
   dchar c0 = str.ptr[0];
   if(c0 < 0x80) {
     return c0;
   }
   else if(str.length > 1) {
     dchar c1 = str.ptr[1];
     if(c0 < 0xE0 && (c1 & 0xC0) == 0x80) {
       c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
       if(c1 < 0x80) goto Linvalid;
       return c1;
     }
     else if(str.length > 2) {
       dchar c2 = str.ptr[2];
       if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 0x80) 
{
         c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
         if(c2 < 0x800) goto Linvalid;
         return c2;
       }
       else if(str.length > 3) {
         dchar c3 = str.ptr[3];
         if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
0x80 && (c3 & 0xC0) == 0x80) {
           c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 
0x3F) << 6)|(c3 & 0x3F);
           if(c3 < 0x10000  || c3 > 0x10ffff) goto Linvalid;
           return c3;
         }
       }
     }
   }
   Linvalid:
      throw new Exception("yadayada");
//assert(myFront2(['\xC2','\xA2'])==0xA3);
}

Oct 16 2016

Uplink_Coder <uplink.coder googlemail.com> writes:

On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
wrote:
 Here my version. It's probably not the shortest (100 ligns of 
 assembly with LDC) but it is correct and has following 
 properties:
 - Performance proportional to the encoding length
 - Detects Invalid byte sequences
 - Detects Overlong encodings
 - Detects Invalid code points

 I put the exception to be comparable to other routines but 
 Unicode specifies that it is preferable to not abort on 
 encoding errors (to avoid denial of service attacks).

 dchar myFront2(ref char[] str)
 {
   dchar c0 = str.ptr[0];
   if(c0 < 0x80) {
     return c0;
   }
   else if(str.length > 1) {
     dchar c1 = str.ptr[1];
     if(c0 < 0xE0 && (c1 & 0xC0) == 0x80) {
       c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
       if(c1 < 0x80) goto Linvalid;
       return c1;
     }
     else if(str.length > 2) {
       dchar c2 = str.ptr[2];
       if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
 0x80) {
         c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
         if(c2 < 0x800) goto Linvalid;
         return c2;
       }
       else if(str.length > 3) {
         dchar c3 = str.ptr[3];
         if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
 0x80 && (c3 & 0xC0) == 0x80) {
           c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 
 0x3F) << 6)|(c3 & 0x3F);
           if(c3 < 0x10000  || c3 > 0x10ffff) goto Linvalid;
           return c3;
         }
       }
     }
   }
   Linvalid:
      throw new Exception("yadayada");
 //assert(myFront2(['\xC2','\xA2'])==0xA3);
 }

This looks quite slow.
We already have a correct version in utf.decodeImpl.
The goal here was to find a small and fast alternative.

Oct 16 2016

NoBrainer <nobrainer dlang.org> writes:

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.

A: What's 2 + 2?
B: 3
A: That's wrong.
B: But incredibly fast.

Oct 16 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
 wrote:

 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.

I know but it has to be correct before being fast.
The code is simple and the checks can easily be removed. Here the 
version without overlong, invalid sequence and codepoint check.

  dchar myFront3(ref char[] str)
  {
    dchar c0 = str.ptr[0];
    if(c0 < 0x80) {
      return c0;
    }
    else if(str.length > 1) {
      dchar c1 = str.ptr[1];
      if(c0 < 0xE0) {
        return ((c0 & 0x1F) << 6)|(c1 & 0x3F);
      }
      else if(str.length > 2) {
        dchar c2 = str.ptr[2];
        if(c0 < 0xF0) {
           return ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 &  
0x3F);
        }
        else if(str.length > 3) {
          dchar c3 = str.ptr[3];
          if(c0 < 0xF5) {
           return((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 &  
0x3F) << 6)|(c3 & 0x3F);
          }
        }
      }
    }
    Linvalid:
       throw new Exception("yadayada");
  }

Next step will be to loop for length 2,3,4, with or without your 
table.

Oct 16 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter 
wrote:
 On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
 wrote:

 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.

 I know but it has to be correct before being fast.
 The code is simple and the checks can easily be removed. Here 
 the version without overlong, invalid sequence and codepoint 
 check.

  dchar myFront3(ref char[] str)
  {
    dchar c0 = str.ptr[0];
    if(c0 < 0x80) {
      return c0;
    }
    else if(str.length > 1) {
      dchar c1 = str.ptr[1];
      if(c0 < 0xE0) {
        return ((c0 & 0x1F) << 6)|(c1 & 0x3F);
      }
      else if(str.length > 2) {
        dchar c2 = str.ptr[2];
        if(c0 < 0xF0) {
           return ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 &  
 0x3F);
        }
        else if(str.length > 3) {
          dchar c3 = str.ptr[3];
          if(c0 < 0xF5) {
           return((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 &  
 0x3F) << 6)|(c3 & 0x3F);

Of course, this line is wrong, should shift by 18 not 16 :
            return((c0 & 0x07) << 18)|((c1 & 0x3F) << 12)|((c2 &
  0x3F) << 6)|(c3 & 0x3F);

          }
        }
      }
    }
    Linvalid:
       throw new Exception("yadayada");
  }

 Next step will be to loop for length 2,3,4, with or without 
 your table.

Oct 16 2016

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter 
wrote:
 Next step will be to loop for length 2,3,4, with or without 
 your table.

Ok now the looped version which doesn't need the lookup table. 
This one assembles in 72 lines of assembly (25 lines only for the 
exception code).

dchar myFront5(ref char[] str)
  {
    byte c0 = str[0];
    if(c0 >= 0) {
      return c0;
    }
    else {
      if(((c0!=-64) & (c0!=-63)) && c0 <= -12 ) {
	 auto l = str.length;

      int idx = 1;
      dchar mask0 = 0;
      dchar c1=c0&0x3f;
      int lim = -64;
      while(l--) {
          if(c0<lim) {
            if(c1 >0x10FFFF) break;
            return c1;
          }
          lim >>= 1 ;
          c1 = ((c1<<6) | (str.ptr[idx++]&0x3F)) & ~mask0;
          mask0 = 1 << (idx*6 + 7-idx);
        }
      }
    }
    Linvalid:
       throw new Exception("yadayada");
  }

Explanations of the tricks used:
1. the first character is read as signed byte with sign 
extension, this allows to compare it to the lim variable which is 
used to do mainly what the lookup table was doing.
2. there 2 loop escapes, if l, the variable holding the length of 
the string, is 0 which means that the string is too short or when 
the lim is bigger than the 1st character (see 1.)
3. Using the same 0x3f mask to extract the data bits from all 
bytes in the loop has the drawback of adding spurious bits coming 
from the 1st byte for 3 and 4 long sequences. The mask0 variable 
is used to "shoot" that bit in the next loop.
4. 2 simple tests allow to eliminate a lot of invalid cases 
(overlongs on 3 and 4 sequences are not tested yet though but I 
think there's a simple way of doing it but I'm too tired now to 
exlore it).

Oct 16 2016

ketmar <ketmar ketmar.no-ip.org> writes:

On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
have you seen this[1] link? it is almost what you're doing, but 
with some more nice properties.


[1] http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Oct 16 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei

 Also, I suspect a benchmark with a larger loop body might not 
 benefit as significantly from branch hints as this one.

 I disagree in longer loops code compactness is as important as 
 in small ones.

You must have misunderstood:

My thought was simply that with a larger loop body, LLVM might 
not make such dramatic rearrangement of the basic blocks.

Take your straw man elsewhere :-/

 This is more correct : (Tough for some reason it does not pass 
 the unittests)

You're only validating the first byte, current code validates all 
of them.

Oct 14 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from s 
 = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. I 
 wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;

For fair test line should be feet into L1 cache. --Ilya

Oct 12 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko 
wrote:
 On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
 Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from 
 s = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. 
 I wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;

 For fair test line should be feet into L1 cache. --Ilya

EDITL For fair test the line should be in the L1 cache. --Ilya

Oct 12 2016

Johan Engelen <j j.nl> writes:

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 
 On my machine, with "ldc2 -release -O3 -enable-inlining"

"-O3 -enable-inlining" is synonymous with "-O3"  :-)

With LDC 1.1.0-beta3, you can try with 
"-enable-cross-module-inlining". It won't cross-module inline 
everything (notably: nested functions), but it's a start. If you 
see large performance differences, let me know.

cheers,
   Johan

Oct 12 2016

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
 $]. Smaller functions make sure that the impact on instruction cache in
 larger applications is not high.

[snip]
 Your mission, should you choose to accept it, is to define a combination
 front/popFront that reduces the gap.

I'm sooo late to the party. Still a neat trick with UTF lookup table is 
to cram it into a single word with bit packing. Since we only ever 
consider top 5 bits of leading byte of which the top one can be safely 
discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry 
table, fits nicely in a 64-bit word with 4bits per entry.

// construct in-word lookup table
ulong bitTable()
{
         import std.utf;
	ulong table;
	for(size_t i = 128; i<256; i+= 8)
	{
		char[1] c = [ cast(char)i ];
		try{
			ulong s = std.utf.stride(c[0..1], 0);
			table |= s<<(4*((i-128)>>3));
		}
		catch(Exception){
			// keep zeros
		}
	}
	return table;
}

uint myStride()(char c)
{
	enum table = bitTable;
	return (table >> ((c & 0x7f) >> 3)) & 0xF;
}

void myPopFront()(ref char[] s)
{
	char c = s[0];
	if(c < 0x80)
		s = s[1..$];
	else {
		uint step = myStride(c);
		if(!step) throw new Exception("blah");
		s = s[step..$];
	}
}

Accordingly myFront could be modified to use the same technique.
Can't check perf with LDC at the moment sadly.

 Andrei

Oct 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
 On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
 $]. Smaller functions make sure that the impact on instruction cache in
 larger applications is not high.

 [snip]
 Your mission, should you choose to accept it, is to define a combination
 front/popFront that reduces the gap.

 I'm sooo late to the party. Still a neat trick with UTF lookup table is
 to cram it into a single word with bit packing. Since we only ever
 consider top 5 bits of leading byte of which the top one can be safely
 discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry
 table, fits nicely in a 64-bit word with 4bits per entry.

Half of the entries in your table are 0 (every char that starts with 
10). In addition, you only need 2 bits to store the length (max byte 
sequence is 4, subtract 1 from sequence to get 2 bits).

So I think you can fit this into a 16-bit word (8 entries of 2 bits 
each). You just need to pre-check for the invalid sequences (you no 
longer have to check for 0 result):

if(c < 0x80) s = s[1 .. $];
else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
else s = s[1 + myStride(c) .. $];

Should behave better on 32-bit system.

You could also store 3-bits to avoid extra addition.

-Steve

Oct 25 2016

Stefam Koch <uplink.coder googlemail.com> writes:

On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer 
wrote:
 On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
 On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from 
 s = s[1 ..
 $]. Smaller functions make sure that the impact on 
 instruction cache in
 larger applications is not high.

 [snip]
 Your mission, should you choose to accept it, is to define a 
 combination
 front/popFront that reduces the gap.

 I'm sooo late to the party. Still a neat trick with UTF lookup 
 table is
 to cram it into a single word with bit packing. Since we only 
 ever
 consider top 5 bits of leading byte of which the top one can 
 be safely
 discarded with a quick < 0x80 check. This leaves us at 4bits 
 -> 16 entry
 table, fits nicely in a 64-bit word with 4bits per entry.

 Half of the entries in your table are 0 (every char that starts 
 with 10). In addition, you only need 2 bits to store the length 
 (max byte sequence is 4, subtract 1 from sequence to get 2 
 bits).

 So I think you can fit this into a 16-bit word (8 entries of 2 
 bits each). You just need to pre-check for the invalid 
 sequences (you no longer have to check for 0 result):

 if(c < 0x80) s = s[1 .. $];
 else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
 else s = s[1 + myStride(c) .. $];

 Should behave better on 32-bit system.

 You could also store 3-bits to avoid extra addition.

 -Steve

The whole point of a LUT to begin with is to reduce instructions.

Oct 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 10/25/16 3:32 PM, Stefam Koch wrote:
 On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:

 Should behave better on 32-bit system.

 You could also store 3-bits to avoid extra addition.

 The whole point of a LUT to begin with is to reduce instructions.

Yes, I know. But as I understand it, using 64-bit math on 32-bit systems 
is expensive also. Maybe not in this case...

-Steve

Oct 25 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 Now it's time to look at the end-to-end cost of autodecoding.

Some food for thought:

- front necessarily needs to compute the number of bytes to 
advance.
- We can't change the API to share data between front and 
popFront, however we can create a situation where a pure function 
gets duplicate calls removed by the compiler.

Since we require that the ascii test gets inlined into the caller 
of front/popFront to improve ascii performance, we have a 
situation similar to this:

alias Result = Tuple!("codepoint",dchar,"advance",int);
auto decode(const char[] str) pure
{ pragma(inline,true);
   if (str[0] < 0x80) return Result(str[0],1);
   else return decodeNonAscii(str);
}

dchar front(const char[] str) pure
{ pragma(inline,true);
   return str.decode.codepoint;
}

void popFront(ref const(char)[] str)
{ pragma(inline,true);
   return str[str.decode.advance..$];
}

When used in front/popFront pairs, the duplicated decode calls 
get merged and we don't do any duplicate work (unlike the current 
situation.)

Unfortunately, it's not possible to achieve the best code 
generation due to missed optimizations by the compilers (I 
haven't tried GDC.)

LDC subforum.

Once we have this is possible only the decodeNonAscii, perhaps 
using the DFA method linked by ketmar.

P.S. I am aware that this pessimises popFront for code which only 
counts codepoints without inspecting them.

Oct 25 2016

safety0ff <safety0ff.dev gmail.com> writes:

On Tuesday, 25 October 2016 at 21:46:30 UTC, safety0ff wrote:
 P.S. I am aware that this pessimises popFront for code which 
 only counts codepoints without inspecting them.

Unfortunately it also changes the API of popFront to throw on 
invalid characters.

So the example would need to be reworked.

Oct 25 2016

D Programming

C/C++ Programming

Other

digitalmars.D - Reducing the cost of autodecoding