www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Reducing the cost of autodecoding

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
So we've had a good run with making popFront smaller. In ASCII 
microbenchmarks with ldc, the speed is indistinguishable from s = s[1 .. 
$]. Smaller functions make sure that the impact on instruction cache in 
larger applications is not high.

Now it's time to look at the end-to-end cost of autodecoding. I wrote 
this simple microbenchmark:

=====
import std.range;

alias myPopFront = std.range.popFront;
alias myFront = std.range.front;

void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
     ulong checksum;
     if (args.length == 1)
     {
         while (line.length) {
             version(autodecode)
             {
                 checksum += line.myFront;
                 line.myPopFront;
             }
             else
             {
                 checksum += line[0];
                 line = line[1 .. $];
             }
         }
         version(autodecode)
             writeln("autodecode ", checksum);
         else
             writeln("bytes ", checksum);
     }
     else
         writeln("overhead");
}
=====

On my machine, with "ldc2 -release -O3 -enable-inlining" I get something 
like 0.54s overhead, 0.81s with no autodecoding, and 1.12s with 
autodecoding.

Your mission, should you choose to accept it, is to define a combination 
front/popFront that reduces the gap.


Andrei
Oct 12 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from s 
 = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. I 
 wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
     ulong checksum;
     if (args.length == 1)
     {
         while (line.length) {
             version(autodecode)
             {
                 checksum += line.myFront;
                 line.myPopFront;
             }
             else
             {
                 checksum += line[0];
                 line = line[1 .. $];
             }
         }
         version(autodecode)
             writeln("autodecode ", checksum);
         else
             writeln("bytes ", checksum);
     }
     else
         writeln("overhead");
 }
 =====

 On my machine, with "ldc2 -release -O3 -enable-inlining" I get 
 something like 0.54s overhead, 0.81s with no autodecoding, and 
 1.12s with autodecoding.

 Your mission, should you choose to accept it, is to define a 
 combination front/popFront that reduces the gap.


 Andrei
This will only work really efficiently with some state on the stack. If we are to support Unicode.
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 12:03 PM, Stefan Koch wrote:
 This will only work really efficiently with some state on the stack.
Remember the ASCII part is the bothersome one. There's only two comparisons, all with 100% predictability. We should be able to arrange matters so the loss is negligible. -- Andrei
Oct 12 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only two 
 comparisons, all with 100% predictability. We should be able to 
 arrange matters so the loss is negligible. -- Andrei
My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s current autodecoding 0.915s (with new LUT popFront) copy-pasting std.utf decoding functions into current file 0.840s adding ASCII branch hints (llvm_expect) 0.770s With the branch hints LDC moves the non-Ascii code outside of the loop and creates a really tight loop body.
Oct 12 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 20:02:16 UTC, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei 
 Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only 
 two comparisons, all with 100% predictability. We should be 
 able to arrange matters so the loss is negligible. -- Andrei
My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s current autodecoding 0.915s (with new LUT popFront) copy-pasting std.utf decoding functions into current file 0.840s adding ASCII branch hints (llvm_expect) 0.770s With the branch hints LDC moves the non-Ascii code outside of the loop and creates a really tight loop body.
where did you apply the branch hints ?
Oct 12 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Wednesday, 12 October 2016 at 20:07:19 UTC, Stefan Koch wrote:
 where did you apply the branch hints ?
Code: http://pastebin.com/CFCpUftW
Oct 12 2016
parent reply Kagamin <spam here.lot> writes:
On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
 Code: http://pastebin.com/CFCpUftW
Line 25 doesn't look trusted: reads past the end of an empty string.
Oct 13 2016
parent safety0ff <safety0ff.dev gmail.com> writes:
On Thursday, 13 October 2016 at 14:51:50 UTC, Kagamin wrote:
 On Wednesday, 12 October 2016 at 20:24:54 UTC, safety0ff wrote:
 Code: http://pastebin.com/CFCpUftW
Line 25 doesn't look trusted: reads past the end of an empty string.
Length is checked in the loop that calls this function. In phobos length is only checked with an assertion,
Oct 13 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 04:02 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:24:19 UTC, Andrei Alexandrescu wrote:
 Remember the ASCII part is the bothersome one. There's only two
 comparisons, all with 100% predictability. We should be able to
 arrange matters so the loss is negligible. -- Andrei
My measurements: ldc -O3 -boundscheck=off -release -mcpu=native -enable-inlining ldc version 1.0.0 overhead 0.350s bytes 0.385s
Wait, so going through the bytes made almost no difference? Or did you subtract the overhead already?
 current autodecoding 0.915s (with new LUT popFront)
 copy-pasting std.utf decoding functions into current file 0.840s
 adding ASCII branch hints (llvm_expect) 0.770s

 With the branch hints LDC moves the non-Ascii code outside of the loop
 and creates a really tight loop body.
I think we should define two aliases "likely" and "unlikely" with default implementations: bool likely(bool b) { return b; } bool unlikely(bool b) { return b; } They'd go in druntime. Then implementers can hook them into their intrinsics. Works? Andrei
Oct 12 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:
 I think we should define two aliases "likely" and "unlikely" 
 with default implementations:

 bool likely(bool b) { return b; }
 bool unlikely(bool b) { return b; }

 They'd go in druntime. Then implementers can hook them into 
 their intrinsics.

 Works?


 Andrei
I was about to suggest the same. I can prepare a PR.
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 23:59:15 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
 Alexandrescu wrote:
 I think we should define two aliases "likely" and "unlikely" 
 with default implementations:

 bool likely(bool b) { return b; }
 bool unlikely(bool b) { return b; }

 They'd go in druntime. Then implementers can hook them into 
 their intrinsics.

 Works?


 Andrei
I was about to suggest the same. I can prepare a PR.
We should probably introduce a new module for stuff like this. object.d is already filled with too much unrelated things.
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.
Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
Oct 12 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.
Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
maybe core.intrinsics ? or code.codelayout ? We can control the layout at object-file level. We should be able to expose some of that functionality.
Oct 12 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 13 October 2016 at 01:26:17 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:11 PM, Stefan Koch wrote:
 We should probably introduce a new module for stuff like this.
 object.d is already filled with too much unrelated things.
Yah, shouldn't go in object.d as it's fairly niche. On the other hand defining a new module for two functions seems excessive unless we have a good theme. On the third hand we may find an existing module that's topically close. Thoughts? -- Andrei
There could be some kind of "expect" theme, or a microoptimization theme. Functions that have no observable effects and that provide hints for optimization (possibly compiler-dependent implementations of those functions). Besides providing the expected value of an expression, other "expect"/microopt functionality is checking explicitly for function pointer values (to inline a likely function), that I wrote about in [1]: ``` /// Calls `fptr(args)`, optimize for the case when fptr points to Likely(). pragma(inline, true) auto is_likely(alias Likely, Fptr, Args...)(Fptr fptr, Args args) { return (fptr == &Likely) ? Likely(args) : fptr(args); } // ... void function() fptr = get_function_ptr(); fptr.is_likely!likely_function(); ``` A similar function can be made for "expecting" a class type for virtual calls [1]. Other microopt thingies that come to mind are: - cache prefetching - function attributes for hot/cold functions cheers, Johan [1] https://johanengelen.github.io/ldc/2016/04/13/PGO-in-LDC-virtual-calls.html
Oct 13 2016
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2016-10-13 03:26, Andrei Alexandrescu wrote:

 Yah, shouldn't go in object.d as it's fairly niche. On the other hand
 defining a new module for two functions seems excessive unless we have a
 good theme. On the third hand we may find an existing module that's
 topically close. Thoughts? -- Andrei
I think it should be a new module. I think core.intrinsics, as Stefan suggested, sounds like a good idea. I don't think having a module with only two functions is a problem, assuming we expect more of these functions. We already have that case with core.attribute [1], which only have _one_ attribute defined. [1] https://github.com/dlang/druntime/blob/master/src/core/attribute.d#L54 -- /Jacob Carlborg
Oct 13 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Wednesday, 12 October 2016 at 23:47:45 UTC, Andrei 
Alexandrescu wrote:
 Wait, so going through the bytes made almost no difference? Or 
 did you subtract the overhead already?
It made little difference: LDC compiled into AVX2 vectorized addition (vpmovzxbq & vpaddq.)
Oct 12 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized 
 addition (vpmovzxbq & vpaddq.)
Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized addition
 (vpmovzxbq & vpaddq.)
Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s
So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu 
wrote:
 On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized 
 addition
 (vpmovzxbq & vpaddq.)
Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s
So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
AVX for ascii ? What are you referring to ? Most text processing is terribly incompatible with simd. sse 4.2 has a few instructions that do help, but as far as I am aware it is not yet too far spread.
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 09:35 PM, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 01:27:35 UTC, Andrei Alexandrescu wrote:
 On 10/12/2016 08:41 PM, safety0ff wrote:
 On Thursday, 13 October 2016 at 00:32:36 UTC, safety0ff wrote:
 It made little difference: LDC compiled into AVX2 vectorized addition
 (vpmovzxbq & vpaddq.)
Measurements without -mcpu=native: overhead 0.336s bytes 0.610s without branch hints 0.852s code pasted 0.766s
So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
AVX for ascii ? What are you referring to ? Most text processing is terribly incompatible with simd. sse 4.2 has a few instructions that do help, but as far as I am aware it is not yet too far spread.
Oh ok, so it's that checksum in particular that got optimized. Bad benchmark! Bad! -- Andrei
Oct 12 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Thursday, 13 October 2016 at 01:36:44 UTC, Andrei Alexandrescu 
wrote:
 Oh ok, so it's that checksum in particular that got optimized. 
 Bad benchmark! Bad! -- Andrei
Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
Oct 13 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei
Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
I disagree in longer loops code compactness is as important as in small ones. This is about the smallest inline version of decode I could come up with : __gshared static immutable ubyte[] charWidthTab = [ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1 ]; dchar myFront(ref char[] str) pure nothrow { dchar c = cast(dchar) str[0]; if ((c & 128)) { if (c & 64) final switch(charWidthTab[c - 192]) { case 2 : c |= ((str[1] & 0x80) >> 5); break; case 3 : c |= ((str[1] & 0x80) >> 4); c |= ((str[2] & 0x80) >> 10); break; case 4 : c |= ((str[1] & 0x80) >> 3); c |= ((str[2] & 0x80) >> 9); c |= ((str[3] & 0x80) >> 15); break; case 5,6,1 : goto Linvalid; } else Linvalid : c = dchar.init; } return c; }
Oct 14 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei
Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
I disagree in longer loops code compactness is as important as in small ones. This is about the smallest inline version of decode I could come up with : __gshared static immutable ubyte[] charWidthTab = [ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1 ]; dchar myFront(ref char[] str) pure nothrow { dchar c = cast(dchar) str[0]; if ((c & 128)) { if (c & 64) final switch(charWidthTab[c - 192]) { case 2 : c |= ((str[1] & 0x80) >> 5); break; case 3 : c |= ((str[1] & 0x80) >> 4); c |= ((str[2] & 0x80) >> 10); break; case 4 : c |= ((str[1] & 0x80) >> 3); c |= ((str[2] & 0x80) >> 9); c |= ((str[3] & 0x80) >> 15); break; case 5,6,1 : goto Linvalid; } else Linvalid : c = dchar.init; } return c; }
Disregard all that code. It is horribly wrong! This is more correct : (Tough for some reason it does not pass the unittests) dchar myFront(ref char[] str) pure { dchar c = cast(dchar) str.ptr[0]; if (c & 128) { if (c & 64) { auto l = charWidthTab.ptr[c - 192]; if (str.length < l) goto Linvalid; final switch (l) { case 2: c = ((c & ~(64 | 128)) << 6); c |= (str.ptr[1] & ~0x80); break; case 3: c = ((c & ~(32 | 64 | 128)) << 12); c |= ((str.ptr[1] & ~0x80) << 6); c |= ((str.ptr[2] & ~0x80)); break; case 4: c = ((c & ~(16 | 32 | 64 | 128)) << 18); c |= ((str.ptr[1] & ~0x80) << 12); c |= ((str.ptr[2] & ~0x80) << 6); c |= ((str.ptr[3] & ~0x80)); break; case 5, 6, 1: goto Linvalid; } } else Linvalid : throw new Exception("yadayada"); } return c; }
Oct 14 2016
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 15 October 2016 at 00:50:08 UTC, Stefan Koch wrote:
 On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei
Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
I disagree in longer loops code compactness is as important as in small ones. This is about the smallest inline version of decode I could come up with : __gshared static immutable ubyte[] charWidthTab = [ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1 ]; dchar myFront(ref char[] str) pure nothrow { dchar c = cast(dchar) str[0]; if ((c & 128)) { if (c & 64) final switch(charWidthTab[c - 192]) { case 2 : c |= ((str[1] & 0x80) >> 5); break; case 3 : c |= ((str[1] & 0x80) >> 4); c |= ((str[2] & 0x80) >> 10); break; case 4 : c |= ((str[1] & 0x80) >> 3); c |= ((str[2] & 0x80) >> 9); c |= ((str[3] & 0x80) >> 15); break; case 5,6,1 : goto Linvalid; } else Linvalid : c = dchar.init; } return c; }
Disregard all that code. It is horribly wrong! This is more correct : (Tough for some reason it does not pass the unittests) dchar myFront(ref char[] str) pure { dchar c = cast(dchar) str.ptr[0]; if (c & 128) { if (c & 64) { auto l = charWidthTab.ptr[c - 192]; if (str.length < l) goto Linvalid; final switch (l) { case 2: c = ((c & ~(64 | 128)) << 6); c |= (str.ptr[1] & ~0x80); break; case 3: c = ((c & ~(32 | 64 | 128)) << 12); c |= ((str.ptr[1] & ~0x80) << 6); c |= ((str.ptr[2] & ~0x80)); break; case 4: c = ((c & ~(16 | 32 | 64 | 128)) << 18); c |= ((str.ptr[1] & ~0x80) << 12); c |= ((str.ptr[2] & ~0x80) << 6); c |= ((str.ptr[3] & ~0x80)); break; case 5, 6, 1: goto Linvalid; } } else Linvalid : throw new Exception("yadayada"); } return c; }
Looks very verbose to me. I had found in the BSD codebase a very clever utf-8 conversion function in C, maybe it can be used here. Sorry if I do not participate on the testing as I don't have a proper compilation environment here at home. Here the routine I use at work (it's in C), put that here for inspiration. DEFINE_INLINE uint_t xctomb(char *r, wchar_t wc) { uint_t u8l = utf8len(wc); switch(u8l) { /* Note: code falls through cases! */ case 4: r[3] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x10000; case 3: r[2] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0x800; case 2: r[1] = 0x80 | (wc & 0x3f); wc >>= 6; wc |= 0xc0; case 1: r[0] = wc; } return u8l; } utf8len being DEFINE_INLINE uint_t utf8len(wchar_t wc) { if(wc < 0x80) return 1; else if(wc < 0x800) return 2; else if(wc < 0x10000) return 3; else return 4; } The code generated on SPARC with gcc 3.4.6 was really good. On x86_64 with gcc 5.1 was also not bad. I have not tried a lot of alternatives as UTF-8 coding is not a bottle neck on our project. There's also no check for length 5 and 6 as they are not possible on our system, but for here it has to be added. (the DEFINE_INLINE macro is either extern inline or inline depending on some macro magic that is not of importance here).
Oct 15 2016
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
Oooops, I should not post after drinking 2 glasses of 
Châteauneuf-du-pape. That function does exactly the contrary of 
what popFront does. This one is conversion from dchar to 
multibyte not multibyte to dchar as you did.
Sorry for the inconvenience.
Oct 15 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/15/2016 12:42 PM, Patrick Schluter wrote:
 Sorry if I do not participate on the testing as I don't have a proper
 compilation environment here at home.
https://ldc.acomirei.ru Andrei
Oct 15 2016
parent reply Uplink_Coder <uplink.coder googlemail.com> writes:
It can also be written like this producing smaller code.
But it the cost of slower decoding.

dchar myFront(ref char[] str) pure
{
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;
             c = 0;
           l--;
             while(l) {
               l--;
                c |= str.ptr[idx++];
                c <<= 6;
             }
             c |= str.ptr[idx];

        }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
}
Oct 15 2016
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
 It can also be written like this producing smaller code.
 But it the cost of slower decoding.

 dchar myFront(ref char[] str) pure
 {
     dchar c = cast(dchar) str.ptr[0];
     if (c & 128)
     {
         if (c & 64)
         {
             int idx = 0;
             int l = charWidthTab.ptr[c - 192];
             if (str.length < l)
                 goto Linvalid;
             c = 0;
           l--;
             while(l) {
               l--;
                c |= str.ptr[idx++];
                c <<= 6;
             }
             c |= str.ptr[idx];

        }
         else
     Linvalid : throw new Exception("yadayada");

     }
     return c;
 }
Just a question. Do encoding errors not have to be detected or is validity of the string guaranteed? Wrong continuation bytes or overlong encodings are not detected by this routine.
Oct 15 2016
parent safety0ff <safety0ff.dev gmail.com> writes:
On Saturday, 15 October 2016 at 19:00:12 UTC, Patrick Schluter 
wrote:
 Just a question. Do encoding errors not have to be detected or 
 is validity of the string guaranteed?
AFAIK they have to be detected, otherwise it would be a regression.
Oct 15 2016
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
At least with that lookup table below, you can detect isolated 
continuation bytes (192 and 193) and invalid codes (above 244).

__gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
];

length 5 and 6 need not to be tested specifically for your goto.
Oct 15 2016
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).
192 and 193 can never appear in a UTF-8 text, they are overlongs not continuation bytes. Continuation are characters between 128 and 191 and thos are not allowed, so should be checked.
 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your goto.
Oct 15 2016
prev sibling parent reply Uplink_Coder <uplink.coder googlemail.com> writes:
On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your goto.
If you use 0 instead of 1 the length check will suffice for throwing on invalid.
Oct 15 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder wrote:
 On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
 wrote:
 At least with that lookup table below, you can detect isolated 
 continuation bytes (192 and 193) and invalid codes (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your 
 goto.
If you use 0 instead of 1 the length check will suffice for throwing on invalid.
__gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0]; dchar myFront2(ref char[] str) pure { auto c1 = str.ptr[0]; if (c1 & 128) { if (c1 & 64) { int idx = 0; int l = charWidthTab.ptr[c1 - 192]; if (str.length < l) goto Linvalid; dchar c = 0; l--; while (l) { l--; immutable cc = str.ptr[idx++]; debug if (cc & 64) goto Linvalid; c |= cc; c <<= 6; } c |= str.ptr[idx]; return c; } Linvalid: throw new Exception("yadayada"); } else { return c1; } } This code proofs to be the fastest so far. On UTF and non-UTF text. It's also fairly small.
Oct 15 2016
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 15 October 2016 at 21:21:22 UTC, Stefan Koch wrote:
 On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder 
 wrote:
 On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
 wrote:
 At least with that lookup table below, you can detect 
 isolated continuation bytes (192 and 193) and invalid codes 
 (above 244).

 __gshared static immutable ubyte[] charWidthTab = [
             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 ];

 length 5 and 6 need not to be tested specifically for your 
 goto.
If you use 0 instead of 1 the length check will suffice for throwing on invalid.
__gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0]; dchar myFront2(ref char[] str) pure { auto c1 = str.ptr[0]; if (c1 & 128) { if (c1 & 64) { int idx = 0; int l = charWidthTab.ptr[c1 - 192]; if (str.length < l) goto Linvalid; dchar c = 0; l--; while (l) { l--; immutable cc = str.ptr[idx++]; debug if (cc & 64) goto Linvalid; c |= cc; c <<= 6; } c |= str.ptr[idx]; return c; } Linvalid: throw new Exception("yadayada"); } else { return c1; } } This code proofs to be the fastest so far. On UTF and non-UTF text. It's also fairly small.
What does "debug if" do ? Because when I replace it with a simple "if" the code generated by "LDC 1.1.0-beta2 -release -O3 -boundscheck=off" is 15 lines shorter. That was my first question but I think I see the issue now. By removing the debug, the condition is compiled in and the compiler short circuits the whole loop and goes to the Linvalid. The error is that cc is loaded the first time with the same value as c1 because idx=0 and it is post incremented. It should be preincremented and c would have to be initialised with the data bits of c1 not 0.
Oct 15 2016
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
Here my version. It's probably not the shortest (100 ligns of 
assembly with LDC) but it is correct and has following properties:
- Performance proportional to the encoding length
- Detects Invalid byte sequences
- Detects Overlong encodings
- Detects Invalid code points

I put the exception to be comparable to other routines but 
Unicode specifies that it is preferable to not abort on encoding 
errors (to avoid denial of service attacks).

dchar myFront2(ref char[] str)
{
   dchar c0 = str.ptr[0];
   if(c0 < 0x80) {
     return c0;
   }
   else if(str.length > 1) {
     dchar c1 = str.ptr[1];
     if(c0 < 0xE0 && (c1 & 0xC0) == 0x80) {
       c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
       if(c1 < 0x80) goto Linvalid;
       return c1;
     }
     else if(str.length > 2) {
       dchar c2 = str.ptr[2];
       if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 0x80) 
{
         c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
         if(c2 < 0x800) goto Linvalid;
         return c2;
       }
       else if(str.length > 3) {
         dchar c3 = str.ptr[3];
         if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
0x80 && (c3 & 0xC0) == 0x80) {
           c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 
0x3F) << 6)|(c3 & 0x3F);
           if(c3 < 0x10000  || c3 > 0x10ffff) goto Linvalid;
           return c3;
         }
       }
     }
   }
   Linvalid:
      throw new Exception("yadayada");
//assert(myFront2(['\xC2','\xA2'])==0xA3);
}
Oct 16 2016
parent reply Uplink_Coder <uplink.coder googlemail.com> writes:
On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
wrote:
 Here my version. It's probably not the shortest (100 ligns of 
 assembly with LDC) but it is correct and has following 
 properties:
 - Performance proportional to the encoding length
 - Detects Invalid byte sequences
 - Detects Overlong encodings
 - Detects Invalid code points

 I put the exception to be comparable to other routines but 
 Unicode specifies that it is preferable to not abort on 
 encoding errors (to avoid denial of service attacks).

 dchar myFront2(ref char[] str)
 {
   dchar c0 = str.ptr[0];
   if(c0 < 0x80) {
     return c0;
   }
   else if(str.length > 1) {
     dchar c1 = str.ptr[1];
     if(c0 < 0xE0 && (c1 & 0xC0) == 0x80) {
       c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
       if(c1 < 0x80) goto Linvalid;
       return c1;
     }
     else if(str.length > 2) {
       dchar c2 = str.ptr[2];
       if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
 0x80) {
         c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
         if(c2 < 0x800) goto Linvalid;
         return c2;
       }
       else if(str.length > 3) {
         dchar c3 = str.ptr[3];
         if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
 0x80 && (c3 & 0xC0) == 0x80) {
           c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 
 0x3F) << 6)|(c3 & 0x3F);
           if(c3 < 0x10000  || c3 > 0x10ffff) goto Linvalid;
           return c3;
         }
       }
     }
   }
   Linvalid:
      throw new Exception("yadayada");
 //assert(myFront2(['\xC2','\xA2'])==0xA3);
 }
This looks quite slow. We already have a correct version in utf.decodeImpl. The goal here was to find a small and fast alternative.
Oct 16 2016
next sibling parent NoBrainer <nobrainer dlang.org> writes:
On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.
A: What's 2 + 2? B: 3 A: That's wrong. B: But incredibly fast.
Oct 16 2016
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
 wrote:
 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.
I know but it has to be correct before being fast. The code is simple and the checks can easily be removed. Here the version without overlong, invalid sequence and codepoint check. dchar myFront3(ref char[] str) { dchar c0 = str.ptr[0]; if(c0 < 0x80) { return c0; } else if(str.length > 1) { dchar c1 = str.ptr[1]; if(c0 < 0xE0) { return ((c0 & 0x1F) << 6)|(c1 & 0x3F); } else if(str.length > 2) { dchar c2 = str.ptr[2]; if(c0 < 0xF0) { return ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F); } else if(str.length > 3) { dchar c3 = str.ptr[3]; if(c0 < 0xF5) { return((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F); } } } } Linvalid: throw new Exception("yadayada"); } Next step will be to loop for length 2,3,4, with or without your table.
Oct 16 2016
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter 
wrote:
 On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
 On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
 wrote:
 This looks quite slow.
 We already have a correct version in utf.decodeImpl.
 The goal here was to find a small and fast alternative.
I know but it has to be correct before being fast. The code is simple and the checks can easily be removed. Here the version without overlong, invalid sequence and codepoint check. dchar myFront3(ref char[] str) { dchar c0 = str.ptr[0]; if(c0 < 0x80) { return c0; } else if(str.length > 1) { dchar c1 = str.ptr[1]; if(c0 < 0xE0) { return ((c0 & 0x1F) << 6)|(c1 & 0x3F); } else if(str.length > 2) { dchar c2 = str.ptr[2]; if(c0 < 0xF0) { return ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F); } else if(str.length > 3) { dchar c3 = str.ptr[3]; if(c0 < 0xF5) { return((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);
Of course, this line is wrong, should shift by 18 not 16 : return((c0 & 0x07) << 18)|((c1 & 0x3F) << 12)|((c2 & 0x3F) << 6)|(c3 & 0x3F);
          }
        }
      }
    }
    Linvalid:
       throw new Exception("yadayada");
  }

 Next step will be to loop for length 2,3,4, with or without 
 your table.
Oct 16 2016
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter 
wrote:
 Next step will be to loop for length 2,3,4, with or without 
 your table.
Ok now the looped version which doesn't need the lookup table. This one assembles in 72 lines of assembly (25 lines only for the exception code). dchar myFront5(ref char[] str) { byte c0 = str[0]; if(c0 >= 0) { return c0; } else { if(((c0!=-64) & (c0!=-63)) && c0 <= -12 ) { auto l = str.length; int idx = 1; dchar mask0 = 0; dchar c1=c0&0x3f; int lim = -64; while(l--) { if(c0<lim) { if(c1 >0x10FFFF) break; return c1; } lim >>= 1 ; c1 = ((c1<<6) | (str.ptr[idx++]&0x3F)) & ~mask0; mask0 = 1 << (idx*6 + 7-idx); } } } Linvalid: throw new Exception("yadayada"); } Explanations of the tricks used: 1. the first character is read as signed byte with sign extension, this allows to compare it to the lim variable which is used to do mainly what the lookup table was doing. 2. there 2 loop escapes, if l, the variable holding the length of the string, is 0 which means that the string is too short or when the lim is bigger than the 1st character (see 1.) 3. Using the same 0x3f mask to extract the data bits from all bytes in the loop has the drawback of adding spurious bits coming from the 1st byte for 3 and 4 long sequences. The mask0 variable is used to "shoot" that bit in the next loop. 4. 2 simple tests allow to eliminate a lot of invalid cases (overlongs on 3 and 4 sequences are not tested yet though but I think there's a simple way of doing it but I'm too tired now to exlore it).
Oct 16 2016
prev sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:
have you seen this[1] link? it is almost what you're doing, but 
with some more nice properties.


[1] http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Oct 16 2016
prev sibling parent safety0ff <safety0ff.dev gmail.com> writes:
On Friday, 14 October 2016 at 20:47:39 UTC, Stefan Koch wrote:
 On Thursday, 13 October 2016 at 21:49:22 UTC, safety0ff wrote:
 Bad benchmark! Bad! -- Andrei
Also, I suspect a benchmark with a larger loop body might not benefit as significantly from branch hints as this one.
I disagree in longer loops code compactness is as important as in small ones.
You must have misunderstood: My thought was simply that with a larger loop body, LLVM might not make such dramatic rearrangement of the basic blocks. Take your straw man elsewhere :-/
 This is more correct : (Tough for some reason it does not pass 
 the unittests)
You're only validating the first byte, current code validates all of them.
Oct 14 2016
prev sibling next sibling parent reply Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from s 
 = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. I 
 wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
For fair test line should be feet into L1 cache. --Ilya
Oct 12 2016
parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Wednesday, 12 October 2016 at 16:07:39 UTC, Ilya Yaroshenko 
wrote:
 On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
 Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII 
 microbenchmarks with ldc, the speed is indistinguishable from 
 s = s[1 .. $]. Smaller functions make sure that the impact on 
 instruction cache in larger applications is not high.

 Now it's time to look at the end-to-end cost of autodecoding. 
 I wrote this simple microbenchmark:

 =====
 import std.range;

 alias myPopFront = std.range.popFront;
 alias myFront = std.range.front;

 void main(string[] args) {
     import std.algorithm, std.array, std.stdio;
     char[] line = "0123456789".dup.repeat(50_000_000).join;
For fair test line should be feet into L1 cache. --Ilya
EDITL For fair test the line should be in the L1 cache. --Ilya
Oct 12 2016
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 
 On my machine, with "ldc2 -release -O3 -enable-inlining"
"-O3 -enable-inlining" is synonymous with "-O3" :-) With LDC 1.1.0-beta3, you can try with "-enable-cross-module-inlining". It won't cross-module inline everything (notably: nested functions), but it's a start. If you see large performance differences, let me know. cheers, Johan
Oct 12 2016
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
 $]. Smaller functions make sure that the impact on instruction cache in
 larger applications is not high.
[snip]
 Your mission, should you choose to accept it, is to define a combination
 front/popFront that reduces the gap.
I'm sooo late to the party. Still a neat trick with UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider top 5 bits of leading byte of which the top one can be safely discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry table, fits nicely in a 64-bit word with 4bits per entry. // construct in-word lookup table ulong bitTable() { import std.utf; ulong table; for(size_t i = 128; i<256; i+= 8) { char[1] c = [ cast(char)i ]; try{ ulong s = std.utf.stride(c[0..1], 0); table |= s<<(4*((i-128)>>3)); } catch(Exception){ // keep zeros } } return table; } uint myStride()(char c) { enum table = bitTable; return (table >> ((c & 0x7f) >> 3)) & 0xF; } void myPopFront()(ref char[] s) { char c = s[0]; if(c < 0x80) s = s[1..$]; else { uint step = myStride(c); if(!step) throw new Exception("blah"); s = s[step..$]; } } Accordingly myFront could be modified to use the same technique. Can't check perf with LDC at the moment sadly.
 Andrei
Oct 25 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
 On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
 $]. Smaller functions make sure that the impact on instruction cache in
 larger applications is not high.
[snip]
 Your mission, should you choose to accept it, is to define a combination
 front/popFront that reduces the gap.
I'm sooo late to the party. Still a neat trick with UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider top 5 bits of leading byte of which the top one can be safely discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry table, fits nicely in a 64-bit word with 4bits per entry.
Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (max byte sequence is 4, subtract 1 from sequence to get 2 bits). So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for 0 result): if(c < 0x80) s = s[1 .. $]; else if(c < 0xc0 || c > 0xf7) throw new Exception("blah"); else s = s[1 + myStride(c) .. $]; Should behave better on 32-bit system. You could also store 3-bits to avoid extra addition. -Steve
Oct 25 2016
parent reply Stefam Koch <uplink.coder googlemail.com> writes:
On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer 
wrote:
 On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
 On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
 So we've had a good run with making popFront smaller. In ASCII
 microbenchmarks with ldc, the speed is indistinguishable from 
 s = s[1 ..
 $]. Smaller functions make sure that the impact on 
 instruction cache in
 larger applications is not high.
[snip]
 Your mission, should you choose to accept it, is to define a 
 combination
 front/popFront that reduces the gap.
I'm sooo late to the party. Still a neat trick with UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider top 5 bits of leading byte of which the top one can be safely discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry table, fits nicely in a 64-bit word with 4bits per entry.
Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (max byte sequence is 4, subtract 1 from sequence to get 2 bits). So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for 0 result): if(c < 0x80) s = s[1 .. $]; else if(c < 0xc0 || c > 0xf7) throw new Exception("blah"); else s = s[1 + myStride(c) .. $]; Should behave better on 32-bit system. You could also store 3-bits to avoid extra addition. -Steve
The whole point of a LUT to begin with is to reduce instructions.
Oct 25 2016
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 10/25/16 3:32 PM, Stefam Koch wrote:
 On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
 Should behave better on 32-bit system.

 You could also store 3-bits to avoid extra addition.
The whole point of a LUT to begin with is to reduce instructions.
Yes, I know. But as I understand it, using 64-bit math on 32-bit systems is expensive also. Maybe not in this case... -Steve
Oct 25 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei 
Alexandrescu wrote:
 Now it's time to look at the end-to-end cost of autodecoding.
Some food for thought: - front necessarily needs to compute the number of bytes to advance. - We can't change the API to share data between front and popFront, however we can create a situation where a pure function gets duplicate calls removed by the compiler. Since we require that the ascii test gets inlined into the caller of front/popFront to improve ascii performance, we have a situation similar to this: alias Result = Tuple!("codepoint",dchar,"advance",int); auto decode(const char[] str) pure { pragma(inline,true); if (str[0] < 0x80) return Result(str[0],1); else return decodeNonAscii(str); } dchar front(const char[] str) pure { pragma(inline,true); return str.decode.codepoint; } void popFront(ref const(char)[] str) { pragma(inline,true); return str[str.decode.advance..$]; } When used in front/popFront pairs, the duplicated decode calls get merged and we don't do any duplicate work (unlike the current situation.) Unfortunately, it's not possible to achieve the best code generation due to missed optimizations by the compilers (I haven't tried GDC.) LDC subforum. Once we have this is possible only the decodeNonAscii, perhaps using the DFA method linked by ketmar. P.S. I am aware that this pessimises popFront for code which only counts codepoints without inspecting them.
Oct 25 2016
parent safety0ff <safety0ff.dev gmail.com> writes:
On Tuesday, 25 October 2016 at 21:46:30 UTC, safety0ff wrote:
 P.S. I am aware that this pessimises popFront for code which 
 only counts codepoints without inspecting them.
Unfortunately it also changes the API of popFront to throw on invalid characters. So the example would need to be reworked.
Oct 25 2016