www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Switch ignores case (?)

reply Chris <wendlec tcd.ie> writes:
Only one of the two cases is considered. What am I doing wrong?

`
import std.array;
import std.conv;
import std.stdio;

void main()
{
   auto tokens = to!(dchar[][])(["D"d, "’"d, "Addario"d, "'"d]);
   // Or use this below:
   //~ dstring[] tokens = ["D"d, "’"d, "Addario"d, "'"d];
   while (!tokens.empty)
   {
     switch (tokens[0])
     {
       case "\u2019"d:
         writeln("Apostrophe smart " ~ tokens[0]);
         break;
       case "\u0027"d:
         writeln("Apostrophe straight " ~ tokens[0]);
         break;
       default:
         writeln("Other " ~ tokens[0]);
         break;
     }
     tokens = tokens[1..$];
   }
}
`
Prints:

Other D
Apostrophe smart ’
Other Addario
Other '

I would have expected:

Other D
Apostrophe smart ’
Other Addario
Apostrophe straight '  <== expected
Nov 23 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/23/16 11:28 AM, Chris wrote:
 Only one of the two cases is considered. What am I doing wrong?

 `
 import std.array;
 import std.conv;
 import std.stdio;

 void main()
 {
   auto tokens = to!(dchar[][])(["D"d, "’"d, "Addario"d, "'"d]);
   // Or use this below:
   //~ dstring[] tokens = ["D"d, "’"d, "Addario"d, "'"d];
   while (!tokens.empty)
   {
     switch (tokens[0])
     {
       case "\u2019"d:
         writeln("Apostrophe smart " ~ tokens[0]);
         break;
       case "\u0027"d:
         writeln("Apostrophe straight " ~ tokens[0]);
         break;
       default:
         writeln("Other " ~ tokens[0]);
         break;
     }
     tokens = tokens[1..$];
   }
 }
 `
 Prints:

 Other D
 Apostrophe smart ’
 Other Addario
 Other '

 I would have expected:

 Other D
 Apostrophe smart ’
 Other Addario
 Apostrophe straight '  <== expected
I tested this locally with different ideas. This definitely looks like a codegen bug. I was able to reduce it to: void main() { switch("'"d) { case "'"d: writeln("a"); break; case "’"d: writeln("b"); break; default: writeln("default"); } } prints "default" What seems to fix it is removing the case statement for the "smart apostrophe". So I'd look there for where the bug is triggering. -Steve
Nov 23 2016
parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 23 November 2016 at 17:33:04 UTC, Steven 
Schveighoffer wrote:

 I tested this locally with different ideas. This definitely 
 looks like a codegen bug.

 I was able to reduce it to:

 void main()
 {
     switch("'"d)
     {
     case "'"d:
         writeln("a");
         break;
     case "’"d:
         writeln("b");
         break;
     default:
         writeln("default");
     }
 }

 prints "default"

 What seems to fix it is removing the case statement for the 
 "smart apostrophe". So I'd look there for where the bug is 
 triggering.

 -Steve
Yep, removing one of the two cases works. I tried it with different versions of dmd back to 2.070.0, and it always gives me the same (wrong) result. I haven't tried ldc yet.
Nov 23 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/23/16 1:03 PM, Chris wrote:
 On Wednesday, 23 November 2016 at 17:33:04 UTC, Steven Schveighoffer wrote:

 I tested this locally with different ideas. This definitely looks like
 a codegen bug.

 I was able to reduce it to:

 void main()
 {
     switch("'"d)
     {
     case "'"d:
         writeln("a");
         break;
     case "’"d:
         writeln("b");
         break;
     default:
         writeln("default");
     }
 }

 prints "default"

 What seems to fix it is removing the case statement for the "smart
 apostrophe". So I'd look there for where the bug is triggering.
Yep, removing one of the two cases works. I tried it with different versions of dmd back to 2.070.0, and it always gives me the same (wrong) result. I haven't tried ldc yet.
Please file here: https://issues.dlang.org/enter_bug.cgi I think this has been there forever. Happens in 2.040 too (the earliest dmd I have on my system). -Steve
Nov 23 2016
parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 23 November 2016 at 18:34:28 UTC, Steven 
Schveighoffer wrote:

 Please file here: https://issues.dlang.org/enter_bug.cgi

 I think this has been there forever. Happens in 2.040 too (the 
 earliest dmd I have on my system).

 -Steve
Here it is: https://issues.dlang.org/show_bug.cgi?id=16739 (I think I marked it as "P1" (default option), maybe it's "P2", didn't know what to choose) It has something to do with the smart quote, e.g.: import std.array; import std.conv; import std.stdio; void main() { auto tokens = to!(dchar[][])(["D"d, "’"d, "Addario"d, "'"d, "."d]); // Or use this below: //~ dstring[] tokens = ["D"d, "’"d, "Addario"d, "'"d]; while (!tokens.empty) { switch (tokens[0]) { case "\u2019"d: writeln("Apostrophe smart " ~ tokens[0]); break; case "\u0027"d: writeln("Apostrophe straight " ~ tokens[0]); break; case "D"d: writeln("Letter 'D'"); break; case "."d: writeln("fullstop " ~ tokens[0]); break; default: writeln("Other " ~ tokens[0]); break; } tokens = tokens[1..$]; } } prints: Letter 'D' Other ’ // <== not expected Other Addario Apostrophe straight ' fullstop . Both LDC and DMD produce the same error. I discovered this while writing a tokenizer. It is a bit upsetting, because it is really an essential thing. The workaround for now would be if (token[0] == "\u2019"d) goto Wherever; which works, by the way.
Nov 23 2016
next sibling parent Jonathan M Davis via Digitalmars-d-learn writes:
On Wednesday, November 23, 2016 19:07:49 Chris via Digitalmars-d-learn 
wrote:
 (I think I marked it as "P1" (default option), maybe it's "P2",
 didn't know what to choose)
AFAIK, there are no devs that pay any attention to those. Some attention is paid to regression vs enhancement vs etc. But the priority field really doesn't mean anything for how D bugs get fixed or who fixes what when. - Jonathan M Davis
Nov 23 2016
prev sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote:
 It has something to do with the smart quote, e.g.:
it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays.
Nov 23 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 23 November 2016 at 19:30:01 UTC, ketmar wrote:
 On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote:
 It has something to do with the smart quote, e.g.:
it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays.
Actually, I came across a compiler message that gave me something like \x19\x20 which I found odd. This sure needs fixing. After all, it's quite a basic feature. So it's back to the old `if` again (for now).
Nov 23 2016
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 23 November 2016 at 20:49:01 UTC, Chris wrote:
 Actually, I came across a compiler message that gave me 
 something like \x19\x20 which I found odd. This sure needs 
 fixing. After all, it's quite a basic feature. So it's back to 
 the old `if` again (for now).
yeah, until this is fixed, `switch` over wstring/dstring can be considered completely broken, and better be avoided. `switch` over normal string is unaffected, of course.
Nov 23 2016
prev sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
quickfix:


diff --git a/src/rt/switch_.d b/src/rt/switch_.d
index ec124e3..83572fe 100644
--- a/src/rt/switch_.d
+++ b/src/rt/switch_.d
   -27,6 +27,32    private import core.stdc.string;

  extern (C):

+import core.stdc.wchar_ : wchar_t, wmemcmp;
+
+
+static if (wchar_t.sizeof == wchar.sizeof) {
+  alias memcmpw = wmemcmp;
+  int memcmpd (const(void)* s0, const(void)* s1, size_t len) {
+    while (len--) {
+      if (int r = *cast(const(int)*)s0-*cast(const(int)*)s1) 
return (r < 0 ? -1 : 1);
+      s0 += dchar.sizeof;
+      s1 += dchar.sizeof;
+    }
+    return 0;
+  }
+} else static if (wchar_t.sizeof == dchar.sizeof) {
+  int memcmpw (const(void)* s0, const(void)* s1, size_t len) {
+    while (len--) {
+      if (int r = 
*cast(const(ushort)*)s0-*cast(const(ushort)*)s1) return (r < 0 ? 
-1 : 1);
+      s0 += wchar.sizeof;
+      s1 += wchar.sizeof;
+    }
+    return 0;
+  }
+  alias memcmpd = wmemcmp;
+} else static assert(0, "wut?!");
+
+
  int _d_switch_string(char[][] table, char[] ca)
  in
  {
   -189,7 +215,7    in
          {
              int c;

-            c = memcmp(table[j - 1].ptr, table[j].ptr, len1 * 
wchar.sizeof);
+            c = memcmpw(table[j - 1].ptr, table[j].ptr, len1);
              assert(c < 0);  // c==0 means a duplicate
          }
      }
   -205,7 +231,7    out (result)
          for (auto i = 0u; i < table.length; i++)
          {
              if (table[i].length == ca.length)
-            {   c = memcmp(table[i].ptr, ca.ptr, ca.length * 
wchar.sizeof);
+            {   c = memcmpw(table[i].ptr, ca.ptr, ca.length);
                  assert(c != 0);
              }
          }
   -218,7 +244,7    out (result)
              assert(i < table.length);
              if (table[i].length == ca.length)
              {
-                c = memcmp(table[i].ptr, ca.ptr, ca.length * 
wchar.sizeof);
+                c = memcmpw(table[i].ptr, ca.ptr, ca.length);
                  if (c == 0)
                  {
                      assert(i == result);
   -253,7 +279,7    body
          auto c = cast(sizediff_t)(ca.length - pca.length);
          if (c == 0)
          {
-            c = memcmp(ca.ptr, pca.ptr, ca.length * 
wchar.sizeof);
+            c = memcmpw(ca.ptr, pca.ptr, ca.length);
              if (c == 0)
              {   //printf("found %d\n", mid);
                  return cast(int)mid;
   -317,7 +343,7    in
          assert(len1 <= len2);
          if (len1 == len2)
          {
-            auto c = memcmp(table[j - 1].ptr, table[j].ptr, len1 
* dchar.sizeof);
+            auto c = memcmpd(table[j - 1].ptr, table[j].ptr, 
len1);
              assert(c < 0);  // c==0 means a duplicate
          }
      }
   -331,7 +357,7    out (result)
          for (auto i = 0u; i < table.length; i++)
          {
              if (table[i].length == ca.length)
-            {   auto c = memcmp(table[i].ptr, ca.ptr, ca.length 
* dchar.sizeof);
+            {   auto c = memcmpd(table[i].ptr, ca.ptr, 
ca.length);
                  assert(c != 0);
              }
          }
   -344,7 +370,7    out (result)
              assert(i < table.length);
              if (table[i].length == ca.length)
              {
-                auto c = memcmp(table[i].ptr, ca.ptr, ca.length 
* dchar.sizeof);
+                auto c = memcmpd(table[i].ptr, ca.ptr, 
ca.length);
                  if (c == 0)
                  {
                      assert(i == result);
   -379,7 +405,7    body
          auto c = cast(sizediff_t)(ca.length - pca.length);
          if (c == 0)
          {
-            c = memcmp(ca.ptr, pca.ptr, ca.length * 
dchar.sizeof);
+            c = memcmpd(ca.ptr, pca.ptr, ca.length);
              if (c == 0)
              {   //printf("found %d\n", mid);
                  return cast(int)mid;
--
2.9.2
Nov 23 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/23/16 2:30 PM, ketmar wrote:
 On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote:
 It has something to do with the smart quote, e.g.:
it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays.
Oh wow, so this is really an endian issue. On a big endian machine, the code would work. Interesting! I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. Thanks for finding this. -Steve
Nov 23 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/23/16 4:31 PM, Steven Schveighoffer wrote:
 On 11/23/16 2:30 PM, ketmar wrote:
 this can be fixed either by using slow char-by-char comparisons in
 druntime, or by fixing codegen, so it would sort strings as byte arrays.
 I think it makes the most sense to remove the memcmp, and do binary
 search based on actual char values.
On second thought, this is compiler-generated data, will never change. We may as well make it as efficient as possible. So the better way is to sort based on byte array, and just use memcmp for everything. -Steve
Nov 23 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven 
Schveighoffer wrote:
 So the better way is to sort based on byte array, and just use 
 memcmp for everything.
i am not completely sure that this is really better. sorting and generating tables is done in glue layer, which can be different for different codegens. and `.compare` method, that is used for sorting strings may be used in other places too. by introducing something like `switchCompare()` we will add unnecessary complexity to the compiler, will make sort order less obvious, and won't really get significant speedup, as binary search doesn't really do alot of comparisons anyway. i thing it is better to be fixed in druntime.
Nov 23 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/23/16 4:44 PM, ketmar wrote:
 On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven Schveighoffer wrote:
 So the better way is to sort based on byte array, and just use memcmp
 for everything.
i am not completely sure that this is really better. sorting and generating tables is done in glue layer, which can be different for different codegens. and `.compare` method, that is used for sorting strings may be used in other places too. by introducing something like `switchCompare()` we will add unnecessary complexity to the compiler, will make sort order less obvious, and won't really get significant speedup, as binary search doesn't really do alot of comparisons anyway.
I'm not certain of how difficult this will be, or how much risk there is. What I am certain of is that the user doesn't care that the compiler really isn't sorting the strings alphabetically, and that he wants the fastest code possible for a switch statement. I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) I also don't actually know that memcmp is faster than wmemcmp, so maybe there is even an advantage to changing behavior for dstring searching.
 i thing it is better to be fixed in druntime.
This can be a solution, for sure. And in reality, it's not *that* much slower -- you are still doing a binary search. I wonder if there is a "binary search among arrays" algorithm that can be optimized for this. -Steve
Nov 23 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven 
Schveighoffer wrote:
 I can't see why you need to deal with the glue layer at all -- 
 just tell the glue layer that it's a list of strings and not 
 dstrings ;)
'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-)
Nov 23 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote:
 On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven 
 Schveighoffer wrote:
 I can't see why you need to deal with the glue layer at all -- 
 just tell the glue layer that it's a list of strings and not 
 dstrings ;)
'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-)
Great job, ketmar! I'm only surprised that this bug wasn't discovered earlier, I mean it goes back to (at least) dmd 2.040.
Nov 24 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Thursday, 24 November 2016 at 09:56:25 UTC, Chris wrote:
 Great job, ketmar! I'm only surprised that this bug wasn't 
 discovered earlier, I mean it goes back to (at least) dmd 2.040.
thanks. tbh, i am surprised myself -- it is completely fubared, but nobody noticed. maybe that says something about real-world useability of dstring/wstring... ;-)
Nov 24 2016
parent Chris <wendlec tcd.ie> writes:
On Thursday, 24 November 2016 at 10:12:40 UTC, ketmar wrote:
 thanks. tbh, i am surprised myself -- it is completely fubared, 
 but nobody noticed. maybe that says something about real-world 
 useability of dstring/wstring... ;-)
Well, I came across it, because I wanted to work around autodecode, our old "friend" (who need enemies, if you have a friend like that?). Except, it doesn't work as expected. For my needs, I can work around it with `(c == "\u2019")`, because switch works for the rest of the (common) cases like full stop, question mark etc. The whole string handling issue in D needs to be fixed asap, because text processing is one of _the_ big things in IT these days. Think of all the data harvesting and evaluation and whatnot.
Nov 24 2016
prev sibling parent reply Antonio Corbi <amcb ggmail.com> writes:
On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote:
 On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven 
 Schveighoffer wrote:
 I can't see why you need to deal with the glue layer at all -- 
 just tell the glue layer that it's a list of strings and not 
 dstrings ;)
'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-)
Could it be possible to ping M. Nowak to include the fix for this bug (due to its importance) in the final 2.072.1 release? Antonio
Nov 24 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Thursday, 24 November 2016 at 11:40:24 UTC, Antonio Corbi 
wrote:
 Could it be possible to ping M. Nowak to include the fix for 
 this bug (due to its importance) in the final 2.072.1 release?
there is no real fix yet. what i provided is a quick hack, not tested on anything except GNU/Linux, and without unitests. somebody should create a real fix first. and it seems that we doesn't even have a bug opened in bugzilla yet! ;-)
Nov 24 2016
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/24/16 7:16 AM, ketmar wrote:
 On Thursday, 24 November 2016 at 11:40:24 UTC, Antonio Corbi wrote:
 Could it be possible to ping M. Nowak to include the fix for this bug
 (due to its importance) in the final 2.072.1 release?
there is no real fix yet. what i provided is a quick hack, not tested on anything except GNU/Linux, and without unitests. somebody should create a real fix first. and it seems that we doesn't even have a bug opened in bugzilla yet! ;-)
Yeah, there is: https://issues.dlang.org/show_bug.cgi?id=16739 Having a fix is obviously possible, we can do it very slowly on Windows :) slow is better than incorrect! -Steve
Nov 25 2016
prev sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 23 November 2016 at 21:31:08 UTC, Steven 
Schveighoffer wrote:
 I think it makes the most sense to remove the memcmp, and do 
 binary search based on actual char values.
yeah. there is wmemcmp, which can be used to speed up one of the cases ('cause wchar_t has different size on windows and GNU/Linux), as i did in my hackfix. yet i am not sure that wmemcmp is really there on windows in all C runtimes (hence HACKfix ;-).
Nov 23 2016