digitalmars.D.learn - utf.d codeLength asserts false on certain input

Anonymouse <asdf asdf.net> writes:
My IRC bot is suddenly seeing crashes. It reads characters from a 
Socket into a ubyte[] array, then idups parts of that (full 
lines) into strings for parsing. Parsing involves slicing such 
strings into meaningful segments: sender, event type, target 
channel/user, message content, etc. I can assume all of them to 
be char[]-compliant except for the content field.

Running it in a debugger I see I'm tripping an assert in utf.d[1] 
when calling stripRight on a content slice[2].

 /++
     Returns the number of code units that are required to encode
     the code point $(D c) when $(D C) is the character type used
     to encode it.
   +/
 ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
 if (isSomeChar!C)
 {
     static if (C.sizeof == 1)
     {
         if (c <= 0x7F) return 1;
         if (c <= 0x7FF) return 2;
         if (c <= 0xFFFF) return 3;
         if (c <= 0x10FFFF) return 4;
         assert(false);  // <--
     }
     // ...
This trips it:
 import std.string;

 void main()
 {
     string s = "\355\342\256 \342\245\341⮢\256\245 
 ᮮ\241饭\250\245".stripRight;  // <-- asserts false
 }
The real backtrace:

 /usr/include/dlang/dmd/std/utf.d:2530

 _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi
 (this=0x7fffffff99c0, __applyArg1= 0x7fffffff9978: 26663461,
 __applyArg0= 0x7fffffff9970: 17) at
 /usr/include/dlang/dmd/std/string.d:2918

 _aApplyRcd2 () from
 /usr/lib/libphobos2.so.0.78

 _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...) at
 /usr/include/dlang/dmd/std/string.d:2915

 _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv
 (slice=..., event=...,parser=...) at
 source/kameloso/irc.d:1184

Should that not be an Exception, as it's based on input? I'm not 
sure where the character 26663461 came from. Even so, should it 
assert?

I don't know what to do right now. I'd like to avoid sanitizing 
all lines. I could catch an Exception but not so much an 
AssertError.

[1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
[2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184
Mar 27 2018
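
On the question of getting a catchable exception rather than an AssertError, here is a minimal sketch. It assumes it is acceptable to leave malformed content unstripped. std.utf.validate throws a UTFException on malformed UTF-8, so calling it first keeps bad input away from stripRight and codeLength. The helper name stripRightIfValid and the sample bytes are hypothetical, not taken from kameloso or Phobos.

```d
import std.string : stripRight;
import std.utf : UTFException, validate;

/// Hypothetical helper: strips trailing whitespace only when
/// `content` is well-formed UTF-8, otherwise returns it unchanged.
string stripRightIfValid(string content)
{
    try
    {
        validate(content);          // throws UTFException on malformed UTF-8
        return content.stripRight;
    }
    catch (UTFException e)
    {
        return content;             // leave malformed content untouched
    }
}

void main()
{
    // Bytes as they might arrive from a socket (made up for this
    // example); 0xFF can never occur in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x68, 0x69, 0xFF, 0x20, 0x20];
    string bad = cast(string) raw;

    assert(stripRightIfValid("hello  ") == "hello");
    assert(stripRightIfValid(bad) == bad);
}
```

Only the content field would need to go through such a helper, so the other segments need no sanitizing.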
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, March 27, 2018 23:29:57 Anonymouse via Digitalmars-d-learn 
wrote:
 Should that not be an Exception, as it's based on input? I'm not
 sure where the character 26663461 came from. Even so, should it
 assert?

 I don't know what to do right now. I'd like to avoid sanitizing
 all lines. I could catch an Exception but not so much an
 AssertError.

 [1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
 [2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184

It means that codeLength requires that dchar be a valid code point, 
though the documentation doesn't say that. It probably should. It was 
probably assumed that no one would try to pass it an invalid code 
point - especially since it's usually called with well-known values 
rather than data from some place like a socket.

Regardless, the way to work around it would be to call isValidDchar 
on the dchar before passing it to codeLength so that you can handle 
the invalid code point rather than calling codeLength on it.

- Jonathan M Davis
Mar 27 2018
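
The isValidDchar check suggested above could look roughly like this. It is only a sketch: the helper name codeLengthOrZero is hypothetical, and returning 0 for an invalid code point is an arbitrary choice for illustration. Note that isValidDchar also rejects surrogates (U+D800 to U+DFFF), which codeLength itself would not assert on.

```d
import std.utf : codeLength, isValidDchar;

/// Hypothetical helper: number of UTF-8 code units needed for `c`,
/// or 0 when `c` is not a valid code point.
ubyte codeLengthOrZero(dchar c)
{
    if (!isValidDchar(c))
        return 0;               // handle bad input here instead of asserting
    return codeLength!char(c);
}

void main()
{
    assert(codeLengthOrZero('a') == 1);
    assert(codeLengthOrZero('\u00E9') == 2);        // U+00E9
    assert(codeLengthOrZero('\u20AC') == 3);        // U+20AC
    assert(codeLengthOrZero('\U0001F600') == 4);    // U+1F600

    // The value from the backtrace is above 0x10FFFF, so it is rejected.
    assert(codeLengthOrZero(cast(dchar) 26663461) == 0);
}
```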