www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Challenge: write a really really small front() for UTF8

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei
Mar 23 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

I thought it would detect broken encoding and do a substitution at least.
 Andrei

-- Dmitry Olshansky
Mar 23 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

I thought it would detect broken encoding and do a substitution at least.

That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below. Andrei
Mar 23 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2014 01:34, Andrei Alexandrescu пишет:
 On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

I thought it would detect broken encoding and do a substitution at least.

That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below.

Just how much you are willing to assert? You don't even check length. In short - what are the specs of this primitive and where you see it being used.
 Andrei

-- Dmitry Olshansky
Mar 23 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 3:10 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:34, Andrei Alexandrescu пишет:
 On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

I thought it would detect broken encoding and do a substitution at least.

That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below.

Just how much you are willing to assert? You don't even check length.

Array bounds checking takes care of that.
 In short - what are the specs of this primitive and where you see it
 being used.

A replacement for front() in arrays of char and wchar. Andrei
Mar 23 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 4:28 PM, Anonymous wrote:
 dchar front(char[] s) {
      uint c = s[0];
      ubyte p = ~s[0];
      if (p>>7)
        return c;
      c = c<<8 | s[1];
      if (p>>5)
        return c;
      c = c<<8 | s[2];
      if (p>>4)
        return c;
      return c<<8 | s[3];
 }

That's smaller but doesn't seem to do the same! Andrei
Mar 23 2014
prev sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-23 22:58:32 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Array bounds checking takes care of that.

If you want the smallest assembly size with array bound checking, make the function safe and see the disaster it causes to the assembly size. That's what you have to optimize. If you're going to optimize while looking at the assembly, better check for bounds manually: dchar front(char[] s) { if (s.length < 1) return dchar.init; size_t bytesize; dchar result; switch (s[0]) { case 0b00000000: .. case 0b01111111: return s[0]; case 0b11000000: .. case 0b11011111: result = s[0] & 0b00011111; bytesize = 2; break; case 0b11100000: .. case 0b11101111: result = s[0] & 0b00001111; bytesize = 3; break; case 0b11110000: .. case 0b11110111: result = s[0] & 0b00000111; bytesize = 4; default: return dchar.init; } if (s.length < bytesize) return dchar.init; foreach (i; 1..bytesize) result = (result << 6) | (s[i] & 0b00111111); return result; } -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
prev sibling next sibling parent "Anonymous" <anon ymous.net> writes:
dchar front(char[] s) {
     uint c = s[0];
     ubyte p = ~s[0];
     if (p>>7)
       return c;
     c = c<<8 | s[1];
     if (p>>5)
       return c;
     c = c<<8 | s[2];
     if (p>>4)
       return c;
     return c<<8 | s[3];
}
Mar 23 2014
prev sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 23 March 2014 at 22:58:53 UTC, Andrei Alexandrescu 
wrote:
 On 3/23/14, 3:10 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:34, Andrei Alexandrescu пишет:
 On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

I thought it would detect broken encoding and do a substitution at least.

That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below.

Just how much you are willing to assert? You don't even check length.

Array bounds checking takes care of that.

Since strings are often user input, if there is chance that bad input can cause undefined behavior, then the checks must be done unconditionally.
Mar 23 2014
prev sibling next sibling parent reply "Mike" <none none.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

This example only considers encodings of up to 4 bytes, but UTF-8 can encode code points in as many as 6 bytes. Is that not a concern? Mike
Mar 23 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/23/2014 5:32 PM, Mike wrote:
 This example only considers encodings of up to 4 bytes, but UTF-8 can encode
 code points in as many as 6 bytes.  Is that not a concern?

It's not anymore. The 5 and 6 byte encodings are now illegal.
Mar 23 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2014 04:44, Simen Kjærås пишет:
 On 2014-03-24 00:32, Mike wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

This example only considers encodings of up to 4 bytes, but UTF-8 can encode code points in as many as 6 bytes. Is that not a concern? Mike

RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to conform to constraints in UTF-16, removing all 5- and 6-byte sequences.

More importantly Unicode standard explicitly fixed the range of code points to that of representable in UTF-16. Starting with the 5th version of the standard if memory serves me right. -- Dmitry Olshansky
Mar 24 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/25/2014 4:00 AM, Iain Buclaw wrote:
 On 25 March 2014 00:04, Daniel N <ufo orbiting.us> wrote:
 On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
 I'm currently too busy to submit a complete solution, but please feel free
 to use my idea if you think it sounds promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window.

4) It's tied to one piece of hardware. No Thankee.

bool supportCpuFeatureX; void main() { supportCpuFeatureX = detectCpuFeatureX(); doStuff(); } void doStuff() { if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); }
 dmd -inline blah.d

Mar 25 2014
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 25.03.2014 11:38, schrieb Nick Sabalausky:
 On 3/25/2014 4:00 AM, Iain Buclaw wrote:
 On 25 March 2014 00:04, Daniel N <ufo orbiting.us> wrote:
 On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
 I'm currently too busy to submit a complete solution, but please feel free
 to use my idea if you think it sounds promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window.

4) It's tied to one piece of hardware. No Thankee.

if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d

the extra branch could kill the performance benefit if doStuff is too small
Mar 25 2014
parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
W dniu 2014-03-25 11:42, dennis luehring pisze:
 Am 25.03.2014 11:38, schrieb Nick Sabalausky:
 On 3/25/2014 4:00 AM, Iain Buclaw wrote:
 On 25 March 2014 00:04, Daniel N <ufo orbiting.us> wrote:
 On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
 I'm currently too busy to submit a complete solution, but please
 feel free
 to use my idea if you think it sounds promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window.

4) It's tied to one piece of hardware. No Thankee.

if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d

the extra branch could kill the performance benefit if doStuff is too small

void function() doStuff; void main() { auto supportCpuFeatureX = detectCpuFeatureX(); if (supportCpuFeatureX) doStuff = &doStuff_FeatureX; else doStuff = &doStuff_Fallback; }
Mar 25 2014
prev sibling next sibling parent =?UTF-8?B?U2ltZW4gS2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On 2014-03-24 00:32, Mike wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

This example only considers encodings of up to 4 bytes, but UTF-8 can encode code points in as many as 6 bytes. Is that not a concern? Mike

RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to conform to constraints in UTF-16, removing all 5- and 6-byte sequences. -- Simen
Mar 23 2014
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Here's a baseline: http://goo.gl/91vIGc. Destroy!

Optimizing for smallest assembly size: dchar front(char[] s) { size_t bytesize; dchar result; switch (s[0]) { case 0b00000000: .. case 0b01111111: return s[0]; case 0b11000000: .. case 0b11011111: return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111); case 0b11100000: .. case 0b11101111: result = s[0] & 0b00001111; bytesize = 3; break; case 0b11110000: .. case 0b11110111: result = s[0] & 0b00000111; bytesize = 4; default: return dchar.init; } foreach (i; 1..bytesize) result = (result << 6) | (s[i] & 0b00111111); return result; } -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 6:53 PM, Michel Fortin wrote:
 On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 Here's a baseline: http://goo.gl/91vIGc. Destroy!

Optimizing for smallest assembly size: dchar front(char[] s) { size_t bytesize; dchar result; switch (s[0]) { case 0b00000000: .. case 0b01111111: return s[0]; case 0b11000000: .. case 0b11011111: return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111); case 0b11100000: .. case 0b11101111: result = s[0] & 0b00001111; bytesize = 3; break; case 0b11110000: .. case 0b11110111: result = s[0] & 0b00000111; bytesize = 4; default: return dchar.init; } foreach (i; 1..bytesize) result = (result << 6) | (s[i] & 0b00111111); return result; }

Nice, thanks! I'd hope for a short path for the ASCII subset, could you achieve that? Andrei
Mar 23 2014
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 02:25:17 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 3/23/14, 6:53 PM, Michel Fortin wrote:
 On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

Optimizing for smallest assembly size: dchar front(char[] s) { size_t bytesize; dchar result; switch (s[0]) { case 0b00000000: .. case 0b01111111: return s[0]; case 0b11000000: .. case 0b11011111: return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111); case 0b11100000: .. case 0b11101111: result = s[0] & 0b00001111; bytesize = 3; break; case 0b11110000: .. case 0b11110111: result = s[0] & 0b00000111; bytesize = 4; default: return dchar.init; } foreach (i; 1..bytesize) result = (result << 6) | (s[i] & 0b00111111); return result; }

Nice, thanks! I'd hope for a short path for the ASCII subset, could you achieve that?

Unfortunately, there's a bug in the above. A missing "break" results in a fallthrough to default case. That's why the optimizer is so good, it just omits the four-byte branch entirely. I noticed something was wrong by looking at the assembly. I really wish D had no implicit fallthrough. But try this instead, the result is even shorter: dchar front(char[] s) { if (s[0] < 0b1000000) return s[0]; // ASCII // pattern indicator tailLength // 0b100xxxxx 0b00 (0) 1 // 0b101xxxxx 0b01 (1) 1 == indicator // 0b110xxxxx 0b10 (2) 2 == indicator // 0b111xxxxx 0b11 (3) 3 == indicator // note: undefined result for illegal 0b1111xxxx case auto indicator = (s[0] >> 5) & 0b11; auto tailLength = indicator ? indicator : 1; dchar result = s[0] & (0b00111111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b00111111); return result; } (Disclaimer: not tested, but I did check that all the expected code paths are present in the assembly this time.) -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 04:39:08 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Michel Fortin:
 
 I really wish D had no implicit fallthrough.

Isn't your wish already granted?

Maybe. I don't have a D compiler installed at the moment to check. I'm just playing with d.godbolt.org and it accepts implicit fallthrough. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 03/23/2014 09:44 PM, Michel Fortin wrote:
 On 2014-03-24 04:39:08 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Michel Fortin:

 I really wish D had no implicit fallthrough.

Isn't your wish already granted?

Maybe. I don't have a D compiler installed at the moment to check. I'm just playing with d.godbolt.org and it accepts implicit fallthrough.

That compiler is very old. :) import std.compiler; pragma(msg, version_major); pragma(msg, version_minor); The output: 2u 55u Ali
Mar 23 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 9:37 PM, Michel Fortin wrote:
 On 2014-03-24 02:25:17 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 3/23/14, 6:53 PM, Michel Fortin wrote:
 On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 Here's a baseline: http://goo.gl/91vIGc. Destroy!

Optimizing for smallest assembly size: dchar front(char[] s) { size_t bytesize; dchar result; switch (s[0]) { case 0b00000000: .. case 0b01111111: return s[0]; case 0b11000000: .. case 0b11011111: return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111); case 0b11100000: .. case 0b11101111: result = s[0] & 0b00001111; bytesize = 3; break; case 0b11110000: .. case 0b11110111: result = s[0] & 0b00000111; bytesize = 4; default: return dchar.init; } foreach (i; 1..bytesize) result = (result << 6) | (s[i] & 0b00111111); return result; }

Nice, thanks! I'd hope for a short path for the ASCII subset, could you achieve that?

Unfortunately, there's a bug in the above. A missing "break" results in a fallthrough to default case. That's why the optimizer is so good, it just omits the four-byte branch entirely. I noticed something was wrong by looking at the assembly. I really wish D had no implicit fallthrough.

-w Probably time to move that from a warning to an error. In the short time we've used D at Facebook (forgot to add "-w" to the implicit flags) we've already have not one, but two bugs related to switch's implicit fallthrough.
 But try this instead, the result is even shorter:

 dchar front(char[] s)
 {
   if (s[0] < 0b1000000)
     return s[0]; // ASCII

   // pattern     indicator  tailLength
   // 0b100xxxxx  0b00 (0)   1
   // 0b101xxxxx  0b01 (1)   1 == indicator
   // 0b110xxxxx  0b10 (2)   2 == indicator
   // 0b111xxxxx  0b11 (3)   3 == indicator
   // note: undefined result for illegal 0b1111xxxx case

   auto indicator = (s[0] >> 5) & 0b11;
   auto tailLength = indicator ? indicator : 1;

   dchar result = s[0] & (0b00111111 >> tailLength);
   foreach (i; 0..tailLength)
       result = (result << 6) | (s[1+i] & 0b00111111);
   return result;
 }

 (Disclaimer: not tested, but I did check that all the expected code
 paths are present in the assembly this time.)

Nicely done. I collected what we have at http://goo.gl/W9V6E7. Andrei
Mar 23 2014
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 04:37:22 +0000, Michel Fortin <michel.fortin michelf.ca> said:

 But try this instead, the result is even shorter:

Oops, messed up my patterns. Here's a hopefully fixed front(): dchar front(char[] s) { if (s[0] < 0b1000000) return s[0]; // ASCII // pattern indicator tailLength // 0b1100xxxx 0b00 (0) 1 // 0b1101xxxx 0b01 (1) 1 == indicator // 0b1110xxxx 0b10 (2) 2 == indicator // 0b1111xxxx 0b11 (3) 3 == indicator // note: undefined result for illegal 0b11111xxx case auto indicator = (s[0] >> 4) & 0b11; auto tailLength = indicator ? indicator : 1; dchar result = s[0] & (0b00111111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b00111111); return result; } -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 9:49 PM, Michel Fortin wrote:
 dchar front(char[] s)
 {
   if (s[0] < 0b1000000)
     return s[0]; // ASCII

   // pattern     indicator  tailLength
   // 0b1100xxxx  0b00 (0)   1
   // 0b1101xxxx  0b01 (1)   1 == indicator
   // 0b1110xxxx  0b10 (2)   2 == indicator
   // 0b1111xxxx  0b11 (3)   3 == indicator
   // note: undefined result for illegal 0b11111xxx case

   auto indicator = (s[0] >> 4) & 0b11;
   auto tailLength = indicator ? indicator : 1;

   dchar result = s[0] & (0b00111111 >> tailLength);
   foreach (i; 0..tailLength)
       result = (result << 6) | (s[1+i] & 0b00111111);
   return result;
 }

http://goo.gl/y9EVFr Andrei
Mar 23 2014
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 05:09:15 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 3/23/14, 9:49 PM, Michel Fortin wrote:
 

Andrei

As I said earlier, that front2 version is broken. It's missing a break. Adding the break makes it non-interesting from an instruction count point of view. Also, there's still an issue in my code above (found by safetyOff): the first "if" for ASCII should use 0b1000_0000, not 0b1000000 (missing one zero). Here's the corrected (again) version of front1: dchar front1(char[] s) { if (s[0] < 0b1000_0000) return s[0]; // ASCII // pattern indicator tailLength // 0b1100xxxx 0b00 (0) 1 // 0b1101xxxx 0b01 (1) 1 == indicator // 0b1110xxxx 0b10 (2) 2 == indicator // 0b1111xxxx 0b11 (3) 3 == indicator // note: undefined result for illegal 0b11111xxx case auto indicator = (s[0] >> 4) & 0b11; auto tailLength = indicator ? indicator : 1; dchar result = s[0] & (0b0011_1111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b0011_1111); return result; } And I'm going to suggest a variant of the above with one less branch (but three more instructions): look at how tailLenght is computed by or'ing with the negative of bit 2 of indicator. I suspect it'll be faster with non-ASCII input, unless it gets inlined less. dchar front2(char[] s) { if (s[0] < 0b1000_0000) return s[0]; // ASCII // pattern indicator tailLength // 0b1100xxxx 0b00 (0) 1 // 0b1101xxxx 0b01 (1) 1 // 0b1110xxxx 0b10 (2) 2 // 0b1111xxxx 0b11 (3) 3 // note: undefined result for illegal 0b11111xxx case auto indicator = ((s[0] >> 4) & 0b11); auto tailLength = indicator | ((~indicator >> 1) & 0b1); dchar result = s[0] & (0b0011_1111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b0011_1111); return result; } -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 06:08:30 +0000, "safety0ff" <safety0ff.dev gmail.com> said:

 Everything seems to work in your corrected versions:
 http://dpaste.dzfl.pl/3b32535559ba
 
 Andrei's version doesn't compile on recent compiler versions due to 
 goto skipping initialization.

Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: http://goo.gl/QjnA2k implementation line count of main in assembly andralex 285 - 9 labels = 276 instructions MFortin1 135 - 13 labels = 122 instructions MFortin2 150 - 14 labels = 136 instructions safety0ff -- not inlined (too big?) -- dnspies 161 - 15 labels = 146 instructions CWilliams 233 - 25 labels = 208 instructions For compactness, my first implementation seems to be the winner, both with and without inlining. That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations: if (1+tailLength >= s.length) return dchar.init; Now lets see what happens with those changes: http://goo.gl/XPCGYE implementation line count of main in assembly MFortin1Check 103 - 11 labels = 92 instructions MFortin2Check 135 - 13 labels = 122 instructions Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: http://goo.gl/E2Q0Yu implementation line count of main in assembly andralex 384 - 9 labels = 375 instructions MFortin1 173 - 19 labels = 154 instructions MFortin2 182 - 18 labels = 164 instructions safety0ff -- not inlined -- dnspies -- not inlined -- CWilliams 229 - 23 labels = 206 instructions MFortin1Check 211 - 24 labels = 187 instructions MFortin2Check 218 - 21 labels = 197 instructions So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 24 2014
parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 13:39:45 +0000, Michel Fortin <michel.fortin michelf.ca> said:

 That said, my test "main" function isn't very representative of the 
 general case because the length is statically known by the optimizer. 
 Let's see what it does when the length is not statically known:

Oops, wrong link, was from an intermediary test. This one will give you the results in the table: http://goo.gl/lwU4hv
 implementation   line count of main in assembly
 andralex         384 -  9 labels = 375 instructions
 MFortin1         173 - 19 labels = 154 instructions
 MFortin2         182 - 18 labels = 164 instructions
 safety0ff         -- not inlined --
 dnspies           -- not inlined --
 CWilliams        229 - 23 labels = 206 instructions
 MFortin1Check    211 - 24 labels = 187 instructions
 MFortin2Check    218 - 21 labels = 197 instructions

-- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 24 2014
prev sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-24 04:58:08 +0000, "safety0ff" <safety0ff.dev gmail.com> said:

 0b1000000 is missing a zero: 0b1000_0000

Indeed, thanks.
 Fixing that, I still get a range violation from "s[1+i]".

That "auto indicator = (s[0] >> 5) & 0b11;" line is wrong too. s[0] needs a shift by 4, not by 5. No doubt it crashes your test program. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 23 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Michel Fortin:

 I really wish D had no implicit fallthrough.

Isn't your wish already granted? Bye, bearophile
Mar 23 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 -w

 Probably time to move that from a warning to an error. In the 
 short time we've used D at Facebook (forgot to add "-w" to the 
 implicit flags) we've already have not one, but two bugs 
 related to switch's implicit fallthrough.

Lot of people in D.learn don't use warnings. What I'd like is to have (informational) warning active on default, and to disable them on request with a compiler switch. Bye, bearophile
Mar 23 2014
prev sibling next sibling parent "safety0ff" <safety0ff.dev gmail.com> writes:
On Monday, 24 March 2014 at 04:37:23 UTC, Michel Fortin wrote:
 dchar front(char[] s)
 {
  if (s[0] < 0b1000000)
    return s[0]; // ASCII
  auto indicator = (s[0] >> 5) & 0b11;
  auto tailLength = indicator ? indicator : 1;

  dchar result = s[0] & (0b00111111 >> tailLength);
  foreach (i; 0..tailLength)
      result = (result << 6) | (s[1+i] & 0b00111111);
  return result;
 }

 (Disclaimer: not tested, but I did check that all the expected 
 code paths are present in the assembly this time.)

0b1000000 is missing a zero: 0b1000_0000 Fixing that, I still get a range violation from "s[1+i]". ----- Test program ----- void main() { foreach (ubyte b0; 0..0x80) { char[] s = [b0]; assert(front(s)==front2(s)); } writeln("Single byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x20) { char[] s = [0xC0|b1, 0x80|b0]; assert(front(s)==front2(s)); } writeln("Double byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x40) foreach (ubyte b2; 0..0x10) { char[] s = [0xE0|b2, 0x80|b1, 0x80|b0]; assert(front(s)==front2(s)); } writeln("Triple byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x40) foreach (ubyte b2; 0..0x40) foreach (ubyte b3; 0..0x08) { char[] s = [0xF0|b3, 0x80|b2, 0x80|b1, 0x80|b0]; assert(front(s)==front2(s)); } }
Mar 23 2014
prev sibling next sibling parent "safety0ff" <safety0ff.dev gmail.com> writes:
On Monday, 24 March 2014 at 05:27:43 UTC, Michel Fortin wrote:
 Here's the corrected (again) version of front1:
 [Snip]

 And I'm going to suggest a variant of the above with one less 
 branch (but three more instructions):
 [Snip]

Everything seems to work in your corrected versions: http://dpaste.dzfl.pl/3b32535559ba Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization.
Mar 23 2014
prev sibling next sibling parent "dnspies" <dspies ualberta.ca> writes:
It seems like mine wasn't being inlined because I had carelessly 
replaced char[] with const(char)[] in the function signature (I 
don't know why that should make a difference, but it does)

Going back to my original version (with Andre's trick to avoid 
letting the loop unroll), I get
dnspies_orig   159 - 16 labels = 143 instructions

However, MFortin1 can be made to do better by taking successive 
slices instead of indexing (and commenting out the check, since 
it no longer helps optimize).
MFortin1_slice 137 - 13 labels = 124 instructions

http://goo.gl/JAgVK1

On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
 On 2014-03-24 06:08:30 +0000, "safety0ff" 
 <safety0ff.dev gmail.com> said:

 Everything seems to work in your corrected versions:
 http://dpaste.dzfl.pl/3b32535559ba
 
 Andrei's version doesn't compile on recent compiler versions 
 due to goto skipping initialization.

Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: http://goo.gl/QjnA2k implementation line count of main in assembly andralex 285 - 9 labels = 276 instructions MFortin1 135 - 13 labels = 122 instructions MFortin2 150 - 14 labels = 136 instructions safety0ff -- not inlined (too big?) -- dnspies 161 - 15 labels = 146 instructions CWilliams 233 - 25 labels = 208 instructions For compactness, my first implementation seems to be the winner, both with and without inlining. That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations: if (1+tailLength >= s.length) return dchar.init; Now lets see what happens with those changes: http://goo.gl/XPCGYE implementation line count of main in assembly MFortin1Check 103 - 11 labels = 92 instructions MFortin2Check 135 - 13 labels = 122 instructions Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: http://goo.gl/E2Q0Yu implementation line count of main in assembly andralex 384 - 9 labels = 375 instructions MFortin1 173 - 19 labels = 154 instructions MFortin2 182 - 18 labels = 164 instructions safety0ff -- not inlined -- dnspies -- not inlined -- CWilliams 229 - 23 labels = 206 instructions MFortin1Check 211 - 24 labels = 187 instructions MFortin2Check 218 - 21 labels = 197 instructions So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check.

Mar 24 2014
prev sibling parent "dnspies" <dspies ualberta.ca> writes:
Whoops, wrong "original" version.  That was the one before I 
understood the game. Here mine is fixed (but up to 180 lines - 16 
labels = 164 instructions):

http://goo.gl/wjIAVm

On Monday, 24 March 2014 at 17:14:11 UTC, dnspies wrote:
 It seems like mine wasn't being inlined because I had 
 carelessly replaced char[] with const(char)[] in the function 
 signature (I don't know why that should make a difference, but 
 it does)

 Going back to my original version (with Andre's trick to avoid 
 letting the loop unroll), I get
 dnspies_orig   159 - 16 labels = 143 instructions

 However, MFortin1 can be made to do better by taking successive 
 slices instead of indexing (and commenting out the check, since 
 it no longer helps optimize).
 MFortin1_slice 137 - 13 labels = 124 instructions

 http://goo.gl/JAgVK1

 On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
 On 2014-03-24 06:08:30 +0000, "safety0ff" 
 <safety0ff.dev gmail.com> said:

 Everything seems to work in your corrected versions:
 http://dpaste.dzfl.pl/3b32535559ba
 
 Andrei's version doesn't compile on recent compiler versions 
 due to goto skipping initialization.

Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: http://goo.gl/QjnA2k implementation line count of main in assembly andralex 285 - 9 labels = 276 instructions MFortin1 135 - 13 labels = 122 instructions MFortin2 150 - 14 labels = 136 instructions safety0ff -- not inlined (too big?) -- dnspies 161 - 15 labels = 146 instructions CWilliams 233 - 25 labels = 208 instructions For compactness, my first implementation seems to be the winner, both with and without inlining. That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations: if (1+tailLength >= s.length) return dchar.init; Now lets see what happens with those changes: http://goo.gl/XPCGYE implementation line count of main in assembly MFortin1Check 103 - 11 labels = 92 instructions MFortin2Check 135 - 13 labels = 122 instructions Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: http://goo.gl/E2Q0Yu implementation line count of main in assembly andralex 384 - 9 labels = 375 instructions MFortin1 173 - 19 labels = 154 instructions MFortin2 182 - 18 labels = 164 instructions safety0ff -- not inlined -- dnspies -- not inlined -- CWilliams 229 - 23 labels = 206 instructions MFortin1Check 211 - 24 labels = 187 instructions MFortin2Check 218 - 21 labels = 197 instructions So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check.


Mar 24 2014
prev sibling next sibling parent reply "safety0ff" <safety0ff.dev gmail.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's my attempt: http://goo.gl/alJ9JS I interpreted the challenge slightly differently, I went for fewer lines of code and fewer branches. I verified the output for all valid inputs.
Mar 23 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/23/14, 8:36 PM, safety0ff wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's my attempt: http://goo.gl/alJ9JS I interpreted the challenge slightly differently, I went for fewer lines of code and fewer branches. I verified the output for all valid inputs.

Very interesting! -- Andrei
Mar 23 2014
prev sibling next sibling parent reply "Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

http://goo.gl/TaZTNB
Mar 24 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 12:53 AM, Chris Williams wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

http://goo.gl/TaZTNB

Nice! Why assert(ret != 0)? Andrei
Mar 24 2014
prev sibling next sibling parent reply "dnspies" <dspies ualberta.ca> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); }
Mar 24 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 1:06 AM, dnspies wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); }

http://goo.gl/HBSv5T - looks nice! For minimum size replace the assert(false) with one inside the loop: http://goo.gl/eWs0ft Andrei
Mar 24 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 1:56 AM, dnspies wrote:
 On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); }

Sorry, I misunderstood. We only want the x's in the output. Here's my revised solution http://goo.gl/PL729J

That turned out quite large. -- Andrei
Mar 24 2014
prev sibling next sibling parent "dnspies" <dspies ualberta.ca> writes:
On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); }

Sorry, I misunderstood. We only want the x's in the output. Here's my revised solution http://goo.gl/PL729J dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) return res; dchar cover = 0b0100_0000; c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); cover <<= 5; res = ((res << 6) & (cover - 1)) | (d & 0b0011_1111); c <<= 1; if(c >= 0) return res; } assert(false); }
Mar 24 2014
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.
Mar 24 2014
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 24.03.2014 10:02, schrieb monarch_dodra:
 Having a fast "front" is fine and all, but if it means your
 program asserting in release (or worst, silently corrupting
 memory) just because the client was trying to read a bad text
 file, I'm unsure this is acceptable.

it would be great to habe a "basic" form of this range that error behavior could be extended policy based / templates - the phobos version could extend the basic range with "prefered" error behavior - but the basic range is still able to read without checking - for example if i know that my input is 100% valid and i need the speed etc.
Mar 24 2014
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 24.03.2014 13:51, schrieb w0rp:
 On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise to at least offer an option, possibly via a template parameter, for turning error handling on or off, similar to how Python handles decoding. Examples below in Python 3. b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError b"\255".decode("utf-8", errors="replace") # replacement character used b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed. All three strategies are useful from time to time. I mainly reach for option three when I'm trying to get some text data out of some old broken databases or similar. We may consider leaving the error checking on in -release for the 'strict' decoding, but throwing an Error instead of an exception so the function can be nothrow. This would prevent memory corruption in release code. assert vs throw Error is up for debate.

+1
Mar 24 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 2:02 AM, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences?

I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error. Andrei
Mar 24 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 5:51 AM, w0rp wrote:
 On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise to at least offer an option

Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness. Andrei
Mar 24 2014
parent dennis luehring <dl.soluz gmx.net> writes:
Am 24.03.2014 17:44, schrieb Andrei Alexandrescu:
 On 3/24/14, 5:51 AM, w0rp wrote:
 On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise to at least offer an option

Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness. Andrei

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError b"\255".decode("utf-8", errors="replace") # replacement character used b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed. i think there should be a base range for UTF8 iteration - with policy based error extension (like in python) and some variants that defer this base UTF8 range with different error behavior - and one of these become the phobos standard = default parameter so its still switchable
Mar 24 2014
prev sibling next sibling parent "JR" <zorael gmail.com> writes:
On Monday, 24 March 2014 at 07:53:11 UTC, Chris Williams wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

http://goo.gl/TaZTNB

  for (int i = 1; i < len; i++) {

size_t, ++i! :> (optimized away but still, nitnit pickpick)
Mar 24 2014
prev sibling next sibling parent "Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

As you'll note from the assembler, asserts are compiled out of release builds. Though if the goal of this little project is to create useful code for inclusion in a library, you would want to switch many of the asserts in the examples into "if (x) return 0xfffd;". (Mine needs an assert/if in the loop to check for the proper bit markers).
Mar 24 2014
prev sibling next sibling parent reply "dnspies" <dspies ualberta.ca> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Ok, I managed to make it smaller. I think this is the smallest so far with only 23 instructions (no loop unrolling in this one): http://goo.gl/RKF5Vm
Mar 24 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/14, 3:30 AM, dnspies wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Ok, I managed to make it smaller. I think this is the smallest so far with only 23 instructions (no loop unrolling in this one): http://goo.gl/RKF5Vm

This is nice because even the checks are relatively cheap. It's quite difficult to understand tho :o). -- Andrei
Mar 24 2014
prev sibling next sibling parent "Daniel N" <ufo orbiting.us> writes:
On Monday, 24 March 2014 at 11:48:00 UTC, Dmitry Olshansky wrote:
 RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 
 to
 conform to constraints in UTF-16, removing all 5- and 6-byte 
 sequences.

More importantly Unicode standard explicitly fixed the range of code points to that of representable in UTF-16. Starting with the 5th version of the standard if memory serves me right.

I did some hacks using C at work with _pext_u32, it's an absolutely wonderful instruction(pext) with the corresponding pdep. http://software.intel.com/sites/landingpage/IntrinsicsGuide/ And ridiculously fast according to Agner(Latency 3, Throughput 1): http://www.agner.org/optimize/instruction_tables.pdf I think we should add this as an intrinsic to D as well(if it isn't already, but I couldn't find it)... it could do wonders for utf decoding. I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.
Mar 24 2014
prev sibling next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worst, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise to at least offer an option, possibly via a template parameter, for turning error handling on or off, similar to how Python handles decoding. Examples below in Python 3. b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError b"\255".decode("utf-8", errors="replace") # replacement character used b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed. All three strategies are useful from time to time. I mainly reach for option three when I'm trying to get some text data out of some old broken databases or similar. We may consider leaving the error checking on in -release for the 'strict' decoding, but throwing an Error instead of an exception so the function can be nothrow. This would prevent memory corruption in release code. assert vs throw Error is up for debate.
Mar 24 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

I've probably missed something, but here's 15 instructions: http://goo.gl/d3Mgj0 Also, I contacted the owner of d.godbolt.org and hopefully we'll have a more up-to-date gdc on there soon. Perhaps ldc as well :)
Mar 24 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 24 March 2014 at 15:32:09 UTC, John Colvin wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

I've probably missed something, but here's 15 instructions: http://goo.gl/d3Mgj0

woops yeah, that's completely broken.
Mar 24 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

http://forum.dlang.org/post/fsgdviklnbugdzjsgfsk forum.dlang.org
Mar 24 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu 
wrote:
 On 3/24/14, 2:02 AM, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences?

I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error.

Ignoring UTF errors is a lossy operation and has the potential problem of irreversible data loss. For example, consider a program which needs to process Windows-1251 files: it would need to read the file from disk, convert to UTF-8, process it, convert back to Windows-1251, and save it back to disk. If a bug causes the UTF-8 conversion step to be accidentally skipped, then all Unicode data in that file is lost. I think UTF-8 decoding operations should, by default, throw on UTF-8 errors. Ignoring UTF-8 errors should only be done explicitly, with the programmer's consent.
Mar 24 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

On a bigendian machine with loose alignment requirements (1 byte), you can do this, which is down to 13 instructions on x86 (which is of course meaningless, what with it being the wrong endianess): uint front(char[] s) { if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII assert(s[0] & 0b0100_0000); if(s[0] & 0b0010_0000) { if(s[0] & 0b0001_0000) { assert(s.length >=4 && !(s[0] & 0b1000) && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111); return *(cast(dchar*)(s.ptr)); } assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111); return *(cast(dchar*)(s.ptr)) >> 8; } assert(s.length >= 2 && s[1] <= 0b1011_1111); return *(cast(wchar*)(s.ptr)); } http://goo.gl/Kf6RZJ There may be architectures that can benefit from this.
Mar 24 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 24 March 2014 at 16:41:02 UTC, John Colvin wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

On a bigendian machine with loose alignment requirements (1 byte), you can do this

Same again, but for little-endian, 18 instructions: http://goo.gl/jlrweQ uint front(char[] s) { if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII assert(s[0] & 0b0100_0000); if(s[0] & 0b0010_0000) { if(s[0] & 0b0001_0000) { assert(s.length >=4 && !(s[0] & 0b1000) && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111); return swapEndian(*(cast(dchar*)(s.ptr))); } assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111); return swapEndian(*(cast(dchar*)(s.ptr))) >> 8; } assert(s.length >= 2 && s[1] <= 0b1011_1111); return swapEndian(*(cast(wchar*)(s.ptr))); }
Mar 24 2014
prev sibling next sibling parent "Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Monday, 24 March 2014 at 16:17:52 UTC, Andrei Alexandrescu 
wrote:
 On 3/24/14, 12:53 AM, Chris Williams wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

http://goo.gl/TaZTNB

Nice! Why assert(ret != 0)? Andrei

It should be -1. I wanted to make sure that bsr() wasn't called on zero, but of course that's not the correct way to test it when I'm about to ~ it.
Mar 24 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

I had to join the party at some point. This seems like 25 instructions: http://goo.gl/N7sHtK -- Dmitry Olshansky
Mar 24 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2014 23:53, Dmitry Olshansky пишет:
 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

I had to join the party at some point. This seems like 25 instructions: http://goo.gl/N7sHtK

Interestingly gdc-4.8 produces better results. http://goo.gl/1R7GMs -- Dmitry Olshansky
Mar 25 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu 
wrote:
 On 3/24/14, 2:02 AM, monarch_dodra wrote:
 On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
 wrote:
 Here's a baseline: http://goo.gl/91vIGc. Destroy!

 Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences?

I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error. Andrei

That sounds good to me (I think). Better than a straight up assert anyways... ...which is what the code you submitted does. Also, I think relying on bounds checking is wrong: Disabling bounds checking mean you are OK for memory corruption for *wrong code*. I don't think invalid string is wrong code. 'front' should do an actual "if", and "assert(0)" when needed. This should have no impact on performance BTW, if done with pointer extraction ("slice.ptr[1]"), and in a trusted context. Of course, this is all assuming we are asserting at all.
Mar 24 2014
prev sibling next sibling parent "Daniel N" <ufo orbiting.us> writes:
On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
 I'm currently too busy to submit a complete solution, but 
 please feel free to use my idea if you think it sounds 
 promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window. #include <x86intrin.h> char32_t front(const char* inp) { static const uint32_t mask[32] = { 0b01111111'00000000'00000000'00000000, 0b00000000'00000000'00000000'00000000, 0b00011111'00111111'00000000'00000000, 0b00001111'00111111'00111111'00000000, 0b00000111'00111111'00111111'00111111, }; uint32_t rev = __builtin_bswap32(reinterpret_cast<const uint32_t*>(inp)[0]); uint32_t len = __lzcnt32(~rev); return _pext_u32(rev, mask[len]); } This is what clang 3.4 generated: ## BB#0: pushq %rbp Ltmp2: .cfi_def_cfa_offset 16 Ltmp3: .cfi_offset %rbp, -16 movq %rsp, %rbp Ltmp4: .cfi_def_cfa_register %rbp movl (%rdi), %eax bswapl %eax movl %eax, %ecx notl %ecx lzcntl %ecx, %ecx leaq __ZZ5frontPKcE4mask(%rip), %rdx pextl (%rdx,%rcx,4), %eax, %eax popq %rbp ret
Mar 24 2014
prev sibling next sibling parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On 25 March 2014 00:04, Daniel N <ufo orbiting.us> wrote:
 On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
 I'm currently too busy to submit a complete solution, but please feel free
 to use my idea if you think it sounds promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window.

4) It's tied to one piece of hardware. No Thankee.
Mar 25 2014
prev sibling next sibling parent "Daniel N" <ufo orbiting.us> writes:
On Tuesday, 25 March 2014 at 10:42:59 UTC, dennis luehring wrote:
 void doStuff() {
      if(supportCpuFeatureX)
          doStuff_FeatureX();
      else
          doStuff_Fallback();
 }

  > dmd -inline blah.d

the extra branch could kill the performance benefit if doStuff is too small

you'd simply have to hoist the condition outside the inner loop. Furthermore the branch prediction would never fail, only unpredictable branches are terrible.
Mar 25 2014
prev sibling next sibling parent "anonymous" <anon ymous.net> writes:
dchar front(char[] s) {
   dchar c = s[0];
   if (!(c & 0x80))
     return c;
   byte b = (c >> 4) & 3;
   b += !b;
   c &= 63 >> b;
   char *p = s.ptr;
   do {
     p++;
     c = c << 6 | *p & 63;
   } while(--b);
   return c;
}
Mar 25 2014
prev sibling next sibling parent "Temtaime" <temtaime gmail.com> writes:
http://goo.gl/4RSWhr

Only 3 ifs.
Mar 26 2014
prev sibling parent "Temtaime" <temtaime gmail.com> writes:
Oups, in last return forgot "+ s[3] "
Mar 26 2014