digitalmars.D.learn - utf code unit sequence validity (non-)checking
- spir (24/24) Dec 01 2010 Hello,
- Steven Schveighoffer (4/25) Dec 01 2010 I agree, the compiler should verify all string literals are valid utf. ...
Hello,

I just noted that D's builtin *string types do not behave the same way when faced with invalid code unit sequences. For instance:

void main () {
    assert("hæ?" == "\x68\xc3\xa6\x3f");
    // Note: removing \xa6 thus makes invalid utf8.
    string s1 = "\x68\xc3\x3f";
    // ==> OK, accepted -- but write-ing indeed produces "h�?".
    dstring s4 = "\x68\xc3\x3f";
    // ==> compile-time Error: invalid UTF-8 sequence
}

I guess this is because, while converting from string to dstring, meaning while decoding code units to code points, D is forced to check sequence validity. But this is not needed, and not done, for a utf8 string. Am I right on this?

If yes, isn't it risky to leave utf8 strings (and wstrings?) unchecked? I mean, to have a concrete safety difference with dstrings? I know there are utf checking routines in the std lib, but for dstrings one does not need to call them explicitly.
(Note that this checking is done at compile-time for source code literals.)

denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Dec 01 2010
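For readers who want the explicit check mentioned in the post, here is a minimal sketch using std.utf.validate, which throws a UTFException on a malformed sequence. It assumes a current Phobos; the helper name isValidUtf8 is hypothetical and not from the thread. The bytes are built at run time so the example does not depend on how a given compiler version treats invalid string literals.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not from the thread): returns true if `s`
/// is a well-formed UTF-8 code unit sequence.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Same byte sequences as in the post, assembled at run time.
    immutable(ubyte)[] good = [0x68, 0xC3, 0xA6, 0x3F]; // "hæ?"
    immutable(ubyte)[] bad  = [0x68, 0xC3, 0x3F];       // \xa6 removed

    assert(isValidUtf8(cast(string) good));
    assert(!isValidUtf8(cast(string) bad));
}
```

A check like this is mainly useful at the boundary where bytes enter the program (file, socket, C API), since a string obtained by such a cast carries no validity guarantee of its own.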
On Wed, 01 Dec 2010 07:35:15 -0500, spir <denis.spir gmail.com> wrote:

> Hello,
>
> I just noted that D's builtin *string types do not behave the same way when faced with invalid code unit sequences. For instance:
>
> void main () {
>     assert("hæ?" == "\x68\xc3\xa6\x3f");
>     // Note: removing \xa6 thus makes invalid utf8.
>     string s1 = "\x68\xc3\x3f";
>     // ==> OK, accepted -- but write-ing indeed produces "h�?".
>     dstring s4 = "\x68\xc3\x3f";
>     // ==> compile-time Error: invalid UTF-8 sequence
> }
>
> I guess this is because, while converting from string to dstring, meaning while decoding code units to code points, D is forced to check sequence validity. But this is not needed, and not done, for a utf8 string. Am I right on this?
>
> If yes, isn't it risky to leave utf8 strings (and wstrings?) unchecked? I mean, to have a concrete safety difference with dstrings? I know there are utf checking routines in the std lib, but for dstrings one does not need to call them explicitly.
> (Note that this checking is done at compile-time for source code literals.)

I agree, the compiler should verify that all string literals are valid UTF. Can you file a Bugzilla enhancement request if there isn't already one?

-Steve
Dec 01 2010