www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Can you shrink it further?

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Had a bit of a good time with popFront for UTF, got to this: 
https://github.com/dlang/phobos/pull/4848. I suspect there are cleverer 
things that can be done. Can anyone make it tighter? Take a look at the 
disassembly referred to in the pull request. Thanks! -- Andrei
Oct 09 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu 
wrote:
 Had a bit of a good time with popFront for UTF, got to this: 
 https://github.com/dlang/phobos/pull/4848. I suspect there are 
 cleverer things that can be done. Can anyone make it tighter? 
 Take a look at the disassembly referred to in the pull request. 
 Thanks! -- Andrei
With -Oz both functions look almost the same on ldc and gdc
Oct 09 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/09/2016 07:05 PM, Stefan Koch wrote:
 On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:
 Had a bit of a good time with popFront for UTF, got to this:
 https://github.com/dlang/phobos/pull/4848. I suspect there are
 cleverer things that can be done. Can anyone make it tighter? Take a
 look at the disassembly referred to in the pull request. Thanks! --
 Andrei
With -Oz both functions look almost the same on ldc and gdc
The big ldc link I posted: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 uses -O3, not -Oz. Changing the flag to -Oz seems to cause no change in the generated code. I see different code generated for popFront1 (proposed) and popFront2 (candidate), e.g. popFront1 has 27 asm lines whereas popFront2 has 34. (BTW: is there an easy way to see the size of the generated functions?) Under what conditions did you notice almost identical code? Thanks, Andrei
Oct 09 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu 
wrote:
 I suspect there are cleverer things that can be done.
Less clever seemed to work for me: void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80 || c >= 0xFE) { s = s.ptr[1 .. s.length]; } else { import core.bitop; size_t i = 7u - bsr(~c); // N.B. changed uint to size_t import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } }
Oct 09 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/09/2016 10:55 PM, safety0ff wrote:
 On Sunday, 9 October 2016 at 22:11:50 UTC, Andrei Alexandrescu wrote:
 I suspect there are cleverer things that can be done.
Less clever seemed to work for me: void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80 || c >= 0xFE) { s = s.ptr[1 .. s.length]; } else { import core.bitop; size_t i = 7u - bsr(~c); // N.B. changed uint to size_t import std.algorithm; s = s.ptr[min(i, s.length) .. s.length]; } }
Oh, forgot to mention: the initial/short path should only check for ASCII, i.e. c < 0x80. -- Andrei
Oct 09 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu 
wrote:
 Oh, forgot to mention: the initial/short path should only check 
 for ASCII, i.e. c < 0x80. -- Andrei
void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (!(c & 0x80)) { goto Lend; } else { import core.bitop; uint i = 7u - bsr(~c | 1u); import std.algorithm; if (i > 6u) goto Lend; char_length = min(i, s.length); } Lend : s = s.ptr[char_length .. s.length]; } This one removes one unnecessary ret. It will also probably be better for the branch-predictor.
Oct 10 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 10 October 2016 at 15:17:05 UTC, Stefan Koch wrote:
 On Monday, 10 October 2016 at 03:55:17 UTC, Andrei Alexandrescu 
 wrote:
 Oh, forgot to mention: the initial/short path should only 
 check for ASCII, i.e. c < 0x80. -- Andrei
Since in this case stability of min is concern, you can shave of another 2 instructions by writing the comparison by hand void popFront1(ref char[] s) trusted pure nothrow { immutable c = s[0]; size_t char_length = 1; if (!(c & 0x80)) { goto Lend; } else { import core.bitop; uint i = 7u - bsr(~c | 1u); import std.algorithm; if (i > 6u) goto Lend; char_length = i < s.length ? i : s.length; } Lend : s = s.ptr[char_length .. s.length]; }
Oct 10 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 10 October 2016 at 15:37:09 UTC, Stefan Koch wrote:
 Since in this case stability of min is concern, you can shave 
 of another 2 instructions by writing the comparison by hand
In this case the stability of min is +NO+ concern.
Oct 10 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
This version has 24 instructions but these have a smaller 
encoding then and are generally inexpensive
With inline asm and conditional moves it would be possible to 
reduce this even further
to ~20 instructions.

void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (c < 127)
   {
     goto Lend;
   } else {
     if ((c & 0b1100_0000) == 0b1000_0000)
     {
       // This is invalid as a first char
       goto Lerror;
     }
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
       goto Lend;
     }

     //These characters are also no longer valid
     Lerror : assert(0);
   }
   Lend :
   s = s.ptr[char_length .. s.length];
}
Oct 10 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
That looks good. I'm just worried about the jump forward - ideally the 
case c < 127 would simply entail a quick return. I tried a fix, but it 
didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just 
skip one byte. Also, are we right to not worry about 5- and 6-byte 
sequences? The docs keep on threatening with it, and then immediately 
mention those are not valid.

void popFront3(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1;
   if (c < 127)
   {
   Lend :
     s = s.ptr[char_length .. s.length];
   } else {
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
     }
     goto Lend;
   }
}


Andrei

On 10/10/16 9:39 PM, Stefan Koch wrote:
 This version has 24 instructions but these have a smaller encoding then
 and are generally inexpensive
 With inline asm and conditional moves it would be possible to reduce
 this even further
 to ~20 instructions.

 void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   size_t char_length = 1;
   if (c < 127)
   {
     goto Lend;
   } else {
     if ((c & 0b1100_0000) == 0b1000_0000)
     {
       // This is invalid as a first char
       goto Lerror;
     }
     if (c < 192)
     {
       char_length = 2;
       goto Lend;
     }
     if (c < 240) {
       char_length = 3;
       goto Lend;
     }
     if (c < 248) {
        char_length = 4;
       goto Lend;
     }

     //These characters are also no longer valid
     Lerror : assert(0);
   }
   Lend :
   s = s.ptr[char_length .. s.length];
 }
Oct 10 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu 
wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the case c < 127 would simply entail a quick return. I 
 tried a fix, but it didn't do what I wanted in ldc. We 
 shouldn't assert(0) if wrong - just skip one byte. Also, are we 
 right to not worry about 5- and 6-byte sequences? The docs keep 
 on threatening with it, and then immediately mention those are 
 not valid.

 [ ... ]

 Andrei
If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Oct 10 2016
next sibling parent reply Lurker <lurker gmail.com> writes:
On Tuesday, 11 October 2016 at 03:00:45 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the case c < 127 would simply entail a quick return. I 
 tried a fix, but it didn't do what I wanted in ldc. We 
 shouldn't assert(0) if wrong - just skip one byte. Also, are 
 we right to not worry about 5- and 6-byte sequences? The docs 
 keep on threatening with it, and then immediately mention 
 those are not valid.

 [ ... ]

 Andrei
If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Pardon me asking, but why all these gotos instead of else ifs: if (c < 192) { char_length = 2; } else if (c < 240) { char_length = 3; } else if (...) { } Does it have any effect on generated code (I don't think it should)?
Oct 10 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 03:18:24 UTC, Lurker wrote:
 Pardon me asking, but why all these gotos instead of else ifs:

   if (c < 192)
   {
     char_length = 2;
   }
   else if (c < 240)
   {
     char_length = 3;
   } else if (...) {
   }

 Does it have any effect on generated code (I don't think it 
 should)?
No it does not. I wrote the gotos because that is how I am used to thinking about code. If else should work fine here.
Oct 10 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - ideally the
 case c < 127 would simply entail a quick return. I tried a fix, but it
 didn't do what I wanted in ldc. We shouldn't assert(0) if wrong - just
 skip one byte. Also, are we right to not worry about 5- and 6-byte
 sequences? The docs keep on threatening with it, and then immediately
 mention those are not valid.

 [ ... ]

 Andrei
If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- Andrei
Oct 10 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei Alexandrescu 
wrote:
 On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
 That looks good. I'm just worried about the jump forward - 
 ideally the
 case c < 127 would simply entail a quick return. I tried a 
 fix, but it
 didn't do what I wanted in ldc. We shouldn't assert(0) if 
 wrong - just
 skip one byte. Also, are we right to not worry about 5- and 
 6-byte
 sequences? The docs keep on threatening with it, and then 
 immediately
 mention those are not valid.

 [ ... ]

 Andrei
If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- Andrei
It was not identical. ((c & b01100_0000) == 0b1000_0000)) Can be true in all of the 3 following cases. If we do not do a jmp to return here, we cannot guarantee that we will not skip over the next valid char. Thereby corrupting already corrupt strings even more. For best performance we need to leave the gotos in there.
Oct 10 2016
parent reply Matthias Bentrup <matthias.bentrup googlemail.com> writes:
On Tuesday, 11 October 2016 at 04:05:47 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 03:58:59 UTC, Andrei 
 Alexandrescu wrote:
 On 10/10/16 11:00 PM, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 02:48:22 UTC, Andrei 
 Alexandrescu wrote:
[...]
If you want to skip a byte it's easy to do as well. void popFront3(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 127) { Lend : s = s.ptr[char_length .. s.length]; } else { if ((c & b01100_0000) == 0b1000_0000) { //just skip one in case this is not the beginning of a code-point char goto Lend; } if (c < 192) { char_length = 2; goto Lend; } if (c < 240) { char_length = 3; goto Lend; } if (c < 248) { char_length = 4; goto Lend; } } }
Affirmative. That's identical to the code in "[ ... ]" :o). Generated code still does a jmp forward though. -- Andrei
It was not identical. ((c & b01100_0000) == 0b1000_0000)) Can be true in all of the 3 following cases. If we do not do a jmp to return here, we cannot guarantee that we will not skip over the next valid char. Thereby corrupting already corrupt strings even more. For best performance we need to leave the gotos in there.
A branch-free version: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[char_length .. s.length]; } Theoretically the char_length could be computed with three sub and addc instructions, but no compiler is smart enough to detect that.
Oct 11 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc instructions, but no compiler is smart enough to 
 detect that.
You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Oct 11 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
 wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc instructions, but no compiler is smart enough to 
 detect that.
You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Also the code produces conditional set instructions which have a higher latency. And worse throughput.
Oct 11 2016
next sibling parent reply Temtaime <temtaime gmail.com> writes:
On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:03:40 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 07:30:26 UTC, Matthias Bentrup 
 wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three 
 sub and addc instructions, but no compiler is smart enough to 
 detect that.
You still need to special case c < 128 as well as the follow chars. also smaller c's are more common the bigger ones making the branching version faster on average.
Also the code produces conditional set instructions which have a higher latency. And worse throughput.
void popFront1(ref char[] s) trusted pure nothrow { import core.bitop, std.algorithm; auto v = bsr(~s[0] | 1); s = s[clamp(v, 1, v > 6 ? 1 : $)..$]; } Seems to be less if i'm not wrong.
Oct 11 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.
Yours runs with 790 us best time. bsr is a real timetaker :)
Oct 11 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.
Yours runs with 790 us best time. bsr is a real timetaker :)
CORRECTION this is not bsr's fault. It's most likely clamp. I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.
Oct 11 2016
parent reply Temtaime <temtaime gmail.com> writes:
On Tuesday, 11 October 2016 at 09:13:10 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
 void popFront1(ref char[] s)  trusted pure nothrow
 {
   import core.bitop, std.algorithm;
   auto v = bsr(~s[0] | 1);
   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
 }

 Seems to be less if i'm not wrong.
Yours runs with 790 us best time. bsr is a real timetaker :)
CORRECTION this is not bsr's fault. It's most likely clamp. I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.
Sorry this was also a type in the code. void popFront7(ref char[] s) trusted pure nothrow { import core.bitop; auto v = 7 - bsr(~s[0] | 1); s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$]; } Please check this.
Oct 11 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
 Sorry this was also a type in the code.

 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
 }

 Please check this.
162 us
Oct 11 2016
parent Ethan Watson <gooberman gmail.com> writes:
On Tuesday, 11 October 2016 at 10:01:41 UTC, Stefan Koch wrote:
 On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
 Sorry this was also a type in the code.

 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 
 1)..$];
 }

 Please check this.
162 us
The branching, it hurts my eyes! Something like the following should give correct (assuming I haven't written bad logic) branchless results with architecture-optimised max calls. Note that the minus/plus 1 operation on the third line will ensure with the sign multiplication that values of 7 will map to 1, whereas for all other values it's an extra operation. But the advantage is that you're not sticking three branches in close proximity to each other, so you will never get a branch predictor fail. (Of note, any performance test for these functions should test with data designed to fail the branching code I quoted, keeping in mind that desktop Intel processors have a four-state branch predictor. I've not performance tested it myself, but this will certainly run faster on the AMD Jaguar processors than a version with branching checks.) int v = 7 - bsr( ~s[0] | 1 ); int sign = ( (v - 7) >> 31 ); v = ( v - 1 ) * sign + 1; str = str[ min( v, s.length ) .. $ ];
Oct 11 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/11/2016 05:45 AM, Temtaime wrote:
 void popFront7(ref char[] s)  trusted pure nothrow
 {
   import core.bitop;
   auto v = 7 - bsr(~s[0] | 1);
   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
 }

 Please check this.
Thanks. This does a lot of work on the frequent path c < 0x80: pure nothrow trusted void example.popFront7(ref char[]): movq 8(%rdi), %rax movzbl (%rax), %ecx xorq $254, %rcx orq $1, %rcx bsrq %rcx, %rcx notl %ecx addl $8, %ecx cmpl $6, %ecx jg .LBB0_2 testl %ecx, %ecx je .LBB0_2 movslq %ecx, %rdx movq (%rdi), %rcx cmpq %rcx, %rdx cmovaq %rcx, %rdx jmp .LBB0_3 .LBB0_2: movq (%rdi), %rcx movl $1, %edx .LBB0_3: addq %rdx, %rax subq %rdx, %rcx movq %rcx, (%rdi) movq %rax, 8(%rdi) retq So I changed it to: void popFront7(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { import core.bitop; uint v = 7 - bsr(~c | (c > 0xfd) << 6u); s = s.ptr[v > s.length ? s.length : v .. s.length]; } } That's about as large as the baseline. Andrei
Oct 11 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/11/2016 04:57 AM, Stefan Koch wrote:
 Yours runs with 790 us best time.
 bsr is a real timetaker :)
What inputs did you test it on? Here's what I think would be a good set of requirements: * The ASCII case should be short and fast: a comparison and a branch, followed by return. This would improve a very common case and address the main issue with autodecoding. * For the multibyte case, the main requirement is the code must be small. This is because it gets inlined all over the place and seldom used. * For the multibyte case, the fewer bytes in the encoding the less work. This is because more frequent multi-byte characters have generally lower codes. Currently front() - the other time spender in autodecoding - issues a function call on the multibyte case. That makes the code of front() itself small, at the cost of more expensive multibyte handling. Andrei
Oct 11 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 14:16:54 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 04:57 AM, Stefan Koch wrote:
 Yours runs with 790 us best time.
 bsr is a real timetaker :)
What inputs did you test it on?
https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt
 Here's what I think would be a good set of requirements:

 * The ASCII case should be short and fast: a comparison and a 
 branch, followed by return. This would improve a very common 
 case and address the main issue with autodecoding.
Already done
 * For the multibyte case, the main requirement is the code must 
 be small. This is because it gets inlined all over the place 
 and seldom used.

 * For the multibyte case, the fewer bytes in the encoding the 
 less work. This is because more frequent multi-byte characters 
 have generally lower codes.
That is why I had the branches, generally only the first one is taken
 Currently front() - the other time spender in autodecoding - 
 issues a function call on the multibyte case. That makes the 
 code of front() itself small, at the cost of more expensive 
 multibyte handling.
I think at some point we have to cache the length of the last decoded char, Otherwise we are throwing work away. However that will only work within a RangeWrapper-Struct
Oct 11 2016
prev sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 08:17:52 UTC, Stefan Koch wrote:
 Also the code produces conditional set instructions which have 
 a higher latency.
 And worse throughput.
On my arguably a bit dated laptop: popFront3 performs 109 us best time and popFront4 performs with 265 us best time Testcode : void main() { import std.datetime : StopWatch; import std.stdio; foreach(_;0 .. 255) { char[] test1 = (import("blns.txt")).dup; StopWatch sw; sw.start; while(test1.length) popFront(test1); sw.stop; writeln("pf1 took ", sw.peek.usecs, "us"); sw.reset(); } } blns.txt is taken from https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt
Oct 11 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub and addc
 instructions, but no compiler is smart enough to detect that.
Interesting. 0x80 should be special-cased and you forgot to check the bounds. So: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
Oct 11 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }
void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
It's much slower. Because of the flag-storing instructions.
Oct 11 2016
prev sibling parent reply Matthias Bentrup <matthias.bentrup googlemail.com> writes:
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
 A branch-free version:

 void popFront4(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
   s = s.ptr[char_length .. s.length];
 }

 Theoretically the char_length could be computed with three sub 
 and addc
 instructions, but no compiler is smart enough to detect that.
Interesting. 0x80 should be special-cased and you forgot to check the bounds. So: void popFront4(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s.ptr[1 .. s.length]; } else { uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248); s = s.ptr[l <= s.length ? l : s.length .. s.length]; } } This generated 27 instructions, i.e. same as the baseline. See: http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3 But it doesn't seem to check for all errors. Andrei
This is the result I'd like to get, but I can't find a way to write it without inline assembly :( void popFrontAsmIntel(ref char[] s) trusted pure nothrow { immutable c = s[0]; if (c < 0x80) { s = s[1 .. $]; } else { uint l = void; asm pure nothrow nogc { mov EAX, 1; mov BL, 0xf8-1; sub BL, c; cmp BL, 0xf8-0xc0; adc EAX, 0; cmp BL, 0xf8-0xe0; adc EAX, 0; cmp BL, 0xf8-0xf0; adc EAX, 0; mov l, EAX; } s = s[l <= $ ? l : $ .. $]; } }
Oct 11 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 14:49:28 UTC, Matthias Bentrup 
wrote:
 This is the result I'd like to get, but I can't find a way to 
 write it without inline assembly :(

 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }
This takes 180us. Baseline takes 124us.
Oct 11 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }
Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei
Oct 11 2016
parent reply Matthias Bentrup <matthias.bentrup googlemail.com> writes:
On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
 void popFrontAsmIntel(ref char[] s)  trusted pure nothrow {
   immutable c = s[0];
   if (c < 0x80) {
     s = s[1 .. $];
   } else {
     uint l = void;
     asm pure nothrow  nogc {
       mov EAX, 1;
       mov BL, 0xf8-1;
       sub BL, c;
       cmp BL, 0xf8-0xc0;
       adc EAX, 0;
       cmp BL, 0xf8-0xe0;
       adc EAX, 0;
       cmp BL, 0xf8-0xf0;
       adc EAX, 0;
       mov l, EAX;
     }
     s = s[l <= $ ? l : $ .. $];
   }
 }
Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei
Here are three branch-less variants that use the sign instead of the carry bit. The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch. void popFront1(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) { s = s[1 .. $]; } else if (c < -8) { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } else { s = s[1 .. $]; } } void popFront1a(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= 0) {Three s = s[1 .. $]; } else { uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16
 31)) & (c + 8 >> 31));
import std.algorithm; s = s[min(i, $) .. $]; } } void popFront1b(ref char[] s) trusted pure nothrow { immutable c = cast(byte)s[0]; if (c >= -8) { s = s[1 .. $]; } else { uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31); import std.algorithm; s = s[min(i, $) .. $]; } }
Oct 12 2016
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
wrote:
 Here are three branch-less variants that use the sign instead 
 of the carry bit.

 The last one is the fastest on my machine, although it mixes 
 the rare error case and the common 1-byte case into one branch.

 void popFront1(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {
     s = s[1 .. $];
   } else if (c < -8) {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 
 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   } else {
     s = s[1 .. $];
   }
 }

 void popFront1a(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= 0) {Three
     s = s[1 .. $];
   } else {
     uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 
 16 >> 31)) & (c + 8 >> 31));
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }

 void popFront1b(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= -8) {
     s = s[1 .. $];
   } else {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 
 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }
All three are slower than baseline, for my test-case. What did you test it against.
Oct 12 2016
next sibling parent reply Matthias Bentrup <matthias.bentrup googlemail.com> writes:
On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup 
 wrote:
 [...]
All three are slower than baseline, for my test-case. What did you test it against.
The blns.txt file mentioned upthread.
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup 
wrote:
 On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch 
 wrote:
 On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias 
 Bentrup wrote:
 [...]
All three are slower than baseline, for my test-case. What did you test it against.
The blns.txt file mentioned upthread.
And what were your timings ? BTW. the code you posted would not be a proper replacement for utf8 popFront since our UTF8-popFront does handle depreacted sequences correctly. making it equivalent to this code : void pfnew(ref char[] s) trusted pure nothrow { immutable c = s[0]; uint char_length = 1; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; } else { if (c < 192+32) { char_length = 2; } else if (c < 192+32+16) { char_length = 3; } else if (c < 192+32+16+8) { char_length = 4; } else if (c < 192+32+16+8+4) { char_length = 5; } else if (c < 192+32+16+8+4+2) { char_length = 6; } char_length = char_length > s.length ? s.length : char_length; goto Lend; } }
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
I just confirmed that branching version is faster then 
table-lookup.

please test it our for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 06:56 AM, Stefan Koch wrote:
 I just confirmed that branching version is faster then table-lookup.

 please test it our for yourself
 http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

 The table-lookup does produce the smallest code though.
Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 06:56 AM, Stefan Koch wrote:
 I just confirmed that branching version is faster then 
 table-lookup.

 please test it our for yourself
 http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

 The table-lookup does produce the smallest code though.
Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern: size_t char_length = 1; immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[char_length .. s.length]; return ; } as opposed to: immutable c = s[0]; if (c < 192) { Lend : s = s.ptr[1 .. s.length]; return ; } In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
btw. We could get rid of of a sub if we changed slices from holding a pointer and a length to holding two pointers.
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei 
 Alexandrescu wrote:
 In the second case, the compiler generates an inc for bumping 
 the pointer and a dec for decreasing the length (small 
 instructions). If the variable char_length is used, add/sub 
 must be used (larger). -- Andrei
When I write code so i can keep the str = str[1 .. str.length] It leads to lager code overall and decreases performance. both in table lookup and in the multi-branch code. update: for ldc the table-lookup code is faster 27% then the current popFront. on dmd the table-lookup is 2% faster then the current popFront. for ldc the branching code is 23% faster then the current popFront for dmd the branching code is 20% faster then the current popFront
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 09:39 AM, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:
 In the second case, the compiler generates an inc for bumping the
 pointer and a dec for decreasing the length (small instructions). If
 the variable char_length is used, add/sub must be used (larger). --
 Andrei
When I write code so i can keep the str = str[1 .. str.length] It leads to lager code overall and decreases performance. both in table lookup and in the multi-branch code. update: for ldc the table-lookup code is faster 27% then the current popFront. on dmd the table-lookup is 2% faster then the current popFront. for ldc the branching code is 23% faster then the current popFront for dmd the branching code is 20% faster then the current popFront
Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 09:39 AM, Stefan Koch wrote:
 Thanks! I'd say make sure there is exactly 0% loss on 
 performance compared to the popFront in the ASCII case, and if 
 so make a PR with the table version. -- Andrei
I measured again. The table version has a DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block. I cannot currently measure on gdc. I'd appreciate more perf data.
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 10:39 AM, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:
 On 10/12/2016 09:39 AM, Stefan Koch wrote:
 Thanks! I'd say make sure there is exactly 0% loss on performance
 compared to the popFront in the ASCII case, and if so make a PR with
 the table version. -- Andrei
I measured again. The table version has a DECREASES the performance for dmd by 1%. I think we should keep performance for dmd in mind. I could add the table version in a version (LDC) block.
No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei
Oct 12 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei 
Alexandrescu wrote:
 No need. 1% for dmd is negligible. 25% would raise an eyebrow. 
 -- Andrei
Alright then PR: https://github.com/dlang/phobos/pull/4849
Oct 12 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 05:23 AM, Stefan Koch wrote:
 All three are slower than baseline, for my test-case.
 What did you test it against.
I'd say: (a) test for speed of ASCII-only text; (b) make it small. That's all we need. Nobody worries about 10-20% in multibyte-heavy text. -- Andrei
Oct 12 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 04:56 AM, Matthias Bentrup wrote:
 void popFront1b(ref char[] s)  trusted pure nothrow {
   immutable c = cast(byte)s[0];
   if (c >= -8) {
     s = s[1 .. $];
   } else {
     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
     import std.algorithm;
     s = s[min(i, $) .. $];
   }
 }
This is really small although it does two jumps on the frequent path. Perhaps an improvement over the current implementation. Nice work! -- Andrei
Oct 12 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/10/2016 11:00 PM, Stefan Koch wrote:
 void popFront3(ref char[] s)  trusted pure nothrow {
    immutable c = s[0];
    uint char_length = 1;
    if (c < 127)
    {
    Lend :
      s = s.ptr[char_length .. s.length];
    } else {
      if ((c & b01100_0000) == 0b1000_0000)
      {
        //just skip one in case this is not the beginning of a code-point
 char
        goto Lend;
      }
      if (c < 192)
      {
        char_length = 2;
        goto Lend;
      }
      if (c < 240)
      {
        char_length = 3;
        goto Lend;
      }
      if (c < 248)
      {
        char_length = 4;
        goto Lend;
      }
    }
  }
Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops. Andrei
Oct 11 2016
next sibling parent David Nadlinger <code klickverbot.at> writes:
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu 
wrote:
 Looked at this, still seems to generate a jump forward with ldc.
ldc.intrinsics.llvm_expect might help to influence basic block layout. — David
Oct 11 2016
prev sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu 
wrote:
 On 10/10/2016 11:00 PM, Stefan Koch wrote:
 [...]
Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops.
I forgot that the fall-trough did no longer end in Lend; That forward jump to Lend is a very common and therefore predicted branch. I will now run this problem through STOKE. Let's see what it comes up with :)
Oct 11 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/11/16 11:15 AM, Stefan Koch wrote:
 I will now run this problem through STOKE.
 Let's see what it comes up with :)
http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei
Oct 11 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 11 October 2016 at 16:13:45 UTC, Andrei Alexandrescu 
wrote:
 On 10/11/16 11:15 AM, Stefan Koch wrote:
 I will now run this problem through STOKE.
 Let's see what it comes up with :)
http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei
Yep I mean that one. It will take a while to work out the right cost-functions. I'll do a PR as soon as this bears fruit.
Oct 11 2016
prev sibling parent reply safety0ff <safety0ff.dev gmail.com> writes:
My current favorites:

void popFront(ref char[] s)  trusted pure nothrow {
   immutable byte c = s[0];
   if (c >= -2) {
     s = s.ptr[1 .. s.length];
   } else {
     import core.bitop;
     size_t i = 7u - bsr(~c);
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
}

I also experimented with explicit speculation:

void popFront(ref char[] s)  trusted pure nothrow {
   immutable byte c = s[0];
   s = s.ptr[1 .. s.length];
   if (c < -2) {
     import core.bitop;
     size_t i = 6u - bsr(~c);
     import std.algorithm;
     s = s.ptr[min(i, s.length) .. s.length];
   }
}


LDC and GDC both compile these to 23 instructions.
DMD does worse than with my other code.

You can influence GDC's block layout with __builtin_expect.

I notice that many other snippets posted use uint instead of 
size_t in the multi-byte branch. This generates extra 
instructions for me.
Oct 12 2016
parent reply safety0ff <safety0ff.dev gmail.com> writes:
On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]
Didn't see the LUT implementation, nvm!
Oct 12 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]
Didn't see the LUT implementation, nvm!
Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
Oct 12 2016
parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei 
Alexandrescu wrote:
 On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
 [Snip]
Didn't see the LUT implementation, nvm!
Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
The first results from stoke are in. It turns out stoke likes to produce garbage :( It's smallest result so far has around 100 instructions. However it might get better if I give it a few more hours to explore.
Oct 13 2016
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 14 October 2016 at 04:21:28 UTC, Stefan Koch wrote:
 On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei 
 Alexandrescu wrote:
 On 10/12/2016 01:05 PM, safety0ff wrote:
 On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff 
 wrote:
 [Snip]
Didn't see the LUT implementation, nvm!
Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
The first results from stoke are in. It turns out stoke likes to produce garbage :( It's smallest result so far has around 100 instructions. However it might get better if I give it a few more hours to explore.
Also I doubt that it is correct :( testb $0x8, 0x200aa9(%rip) movl $0x6, %eax prefetchnta 0x200a9d(%rip) je .L_400650 mulb -0x4(%rsp) movb $0xfa, -0x5(%rsp) vmovd (%rax), %xmm6 pmovzxbd -0x5(%rsp), %xmm11 1 psrad $0xf9, %xmm6 movl $0xef, %esp pextrd $0xfe, %xmm6, (%rax) .L_4005b0: vrsqrtps 0x200a69(%rip), %ymm13 vzeroall incl %edi cmpb %ah, %dl cmpq %rdi, %rdi jbe .L_400640 ja .L_4005f0 pcmpeqq -0x4(%rsp), %xmm10 sbbb %ah, 0x200a4d(%rip) jmpq .L_400643 .L_4005f0: ja .L_40060c jmpq .L_400643 .L_40060c: ja .L_400628 minsd 0x200a3c(%rip), %xmm10 jmpq .L_400643 .L_400628: vmovsldup %ymm3, %ymm3 vrcpps %ymm12, %ymm7 vrsqrtps -0x4(%rsp), %xmm0 fldl2t vmovmskpd %xmm8, %r10 vrcpps %xmm6, %xmm13 rcrw $0xf7, %ax jbe .L_400643 sbbq $0x40, %rax xorb $0xfe, 0x200a0d(%rip) adcw $0xf0, %r10w .L_400640: vmaskmovpd %xmm4, %xmm10, 0x2009ff(%rip) pabsb %xmm12, %xmm15 .L_400643: jne .L_4005b0 .L_400650: retq I am not quite sure what this does. But I am certain it has nothing to do with UTF-8 decoding :) Oh btw using an end pointer instead of a length reduces the table version to 12 instructions.
Oct 13 2016