digitalmars.D.bugs - UTF-32 bug

It seems incredible, to me, that toUTF8(dchar[]) can ever /return/ invalid
UTF-8. But it does, when given invalid input! The following code


#    import std.utf;
#    
#    dchar[] s = [ 0xD800 ];	// Invalid UTF-32
#    
#    int main()
#    {
#        printf("1\n");
#        char[] t = toUTF8(s);
#        printf("2\n");
#        dchar[] u = toUTF32(t);
#        printf("3\n");
#        return 0;
#    }


compiles successfully. The output is:

#    1
#    2
#    Error: invalid UTF-8 sequence


The problem is that the output SHOULD be...

#    1
#    Error: invalid UTF-32 sequence


Fortunately, the fix is very simple. All you have to do is modify
toUTF8(dchar[]) to verify that std.utf.isValidDchar() returns true for every
dchar in the input, and to report an invalid UTF-32 sequence if it does not.
(Observe that isValidDchar(0xD800) correctly returns false.)
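
A minimal sketch of that check, written as a wrapper around toUTF8 rather than as a patch to std.utf itself. The name checkedToUTF8 is hypothetical, the plain Exception and its message are assumptions chosen to match the expected output above, and the code assumes a present-day D compiler, where toUTF8 returns string rather than char[]:

```d
import std.utf : isValidDchar, toUTF8;

// Hypothetical wrapper (not part of Phobos): rejects invalid UTF-32
// before any conversion to UTF-8 is attempted.
string checkedToUTF8(const(dchar)[] s)
{
    foreach (c; s)
    {
        // isValidDchar returns false for surrogates such as 0xD800
        // and for values above 0x10FFFF.
        if (!isValidDchar(c))
            throw new Exception("invalid UTF-32 sequence");
    }
    // Every code point has been checked, so the input is valid UTF-32.
    return toUTF8(s);
}
```

With a wrapper like this, the example program would fail at the toUTF8 step, right after printing 1, instead of handing invalid UTF-8 on to toUTF32.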

Arcane Jill
Jul 09 2004