digitalmars.D.learn - utf code unit sequence validity (non-)checking

reply spir <denis.spir gmail.com> writes:
Hello,


I just noticed that D's builtin *string types do not behave the same way
when faced with invalid code unit sequences. For instance:

void main () {
    assert("hæ?" == "\x68\xc3\xa6\x3f");
    // Note: removing \xa6 thus makes invalid utf8.

    string s1 = "\x68\xc3\x3f";
    // ==> OK, accepted -- but write-ing indeed produces "h�?".

    dstring s4 = "\x68\xc3\x3f";
    // ==> compile-time Error: invalid UTF-8 sequence
}

I guess this is because converting from string to dstring, that is, decoding
code units into code points, forces D to check sequence validity, whereas this
is not needed, and not done, for a UTF-8 string. Am I right on this?
If so, isn't it risky to leave UTF-8 strings (and wstrings?) unchecked, so
that there is a concrete safety difference from dstrings? I know there are UTF
checking routines in the std lib, but for dstrings one does not need to call
them explicitly.
(Note that this checking is done at compile time for source code literals.)
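
For reference, here is a minimal sketch of the explicit run-time check mentioned above, assuming a current Phobos where std.utf provides validate and UTFException. The helper name isValidUtf8 is hypothetical, made up for this illustration. The invalid string is built from raw bytes at run time so that the example does not depend on how the compiler treats invalid literals.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    string good = "\x68\xc3\xa6\x3f"; // "hæ?", well-formed

    // Same bytes with \xa6 removed: \xc3 is a lead byte with no
    // continuation byte after it.
    immutable(ubyte)[] raw = [0x68, 0xc3, 0x3f];
    string bad = cast(string) raw;

    assert(isValidUtf8(good));
    assert(!isValidUtf8(bad));
    writeln("checks passed");
}
```

Such a check has to be called by hand wherever untrusted bytes are turned into a string, which is exactly the asymmetry with dstring described above.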


denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 01 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 07:35:15 -0500, spir <denis.spir gmail.com> wrote:

 Hello,


 I just noticed that D's builtin *string types do not behave the same
 way when faced with invalid code unit sequences. For instance:

 void main () {
     assert("hæ?" == "\x68\xc3\xa6\x3f");
     // Note: removing \xa6 thus makes invalid utf8.

     string s1 = "\x68\xc3\x3f";
     // ==> OK, accepted -- but write-ing indeed produces "h�?".

     dstring s4 = "\x68\xc3\x3f";
     // ==> compile-time Error: invalid UTF-8 sequence
 }

 I guess this is because converting from string to dstring, that is,
 decoding code units into code points, forces D to check sequence
 validity, whereas this is not needed, and not done, for a UTF-8
 string. Am I right on this?
 If so, isn't it risky to leave UTF-8 strings (and wstrings?) unchecked,
 so that there is a concrete safety difference from dstrings? I know
 there are UTF checking routines in the std lib, but for dstrings one
 does not need to call them explicitly.
 (Note that this checking is done at compile time for source code
 literals.)

I agree, the compiler should verify that all string literals are valid UTF. Can you file a Bugzilla enhancement request if there isn't already one?

-Steve
Dec 01 2010