digitalmars.D.bugs - toUTFindex

Derek Parnell <derek psych.ward> writes:
When a UCS index of zero is supplied to the 'toUTFindex' function, and the
supplied string does not have a valid UTF-8 sequence at offset zero, the
function fails to throw an exception. Instead it returns zero, implying
that the supplied string is valid up to that point.

This bug may exist in other similar functions too.

The following code illustrates the issue.

<code>
import std.utf;
import std.stdio;

void main()
{
   char[] B;

   B = "\xFF\xFF\xFF"; // Not a valid UTF-8 string

   writefln("Index 0=%d", std.utf.toUTFindex(B, 0)); // should throw, but prints 0
   writefln("Index 1=%d", std.utf.toUTFindex(B, 1)); // throws UtfError as expected
}
</code>

Suggested fix :
<code>
size_t toUTFindex(char[] s, size_t n)
{
    size_t i;
    size_t r;

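    do
    {
        // The requested index lies past the end of the string.
        if (i >= s.length)
            throw new UtfError("3invalid UCS index", i);
        // Check the lead byte before stepping over the sequence.
        size_t j = std.utf.UTF8stride[s[i]];
        if (j == 0xFF)
            throw new UtfError("3invalid UTF-8 sequence", i);
        r = i;  // byte offset of the current character
        i += j;
    } while (n--);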

    return r;
}
</code>


Also, I note that the UTF8stride table has entries for 5 and 6 byte
sequences. I was under the impression that these are no longer valid UTF-8
sequences.
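
For reference, RFC 3629 limits UTF-8 to sequences of at most four bytes (code points up to U+10FFFF), so lead bytes 0xF5 to 0xFF can never start a valid sequence. Below is a minimal sketch of a stricter stride lookup, written for a current D compiler rather than the 2005 one. `strictStride` is a hypothetical name, not a Phobos function, and returning 0 for an invalid lead byte (where the UTF8stride table uses 0xFF) is an assumption made for illustration.

```d
// Hypothetical helper, not part of Phobos: length in bytes of the UTF-8
// sequence introduced by lead byte `b`, or 0 if `b` cannot start one.
size_t strictStride(ubyte b)
{
    if (b < 0x80) return 1;               // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2; // 2-byte lead (0xC0/0xC1 would be overlong)
    if (b >= 0xE0 && b <= 0xEF) return 3; // 3-byte lead
    if (b >= 0xF0 && b <= 0xF4) return 4; // 4-byte lead, capped at U+10FFFF
    // Continuation bytes (0x80..0xBF), 0xC0/0xC1, and 0xF5..0xFF,
    // which includes the old 5- and 6-byte lead bytes.
    return 0;
}
```

This only classifies the lead byte; a full validator would also have to check the continuation bytes that follow it.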

-- 
Derek
Melbourne, Australia
25/05/2005 11:57:02 AM
May 24 2005
"Uwe Salomon" <post uwesalomon.de> writes:
> Also, I note that the UTF8stride table has entries for 5 and 6 byte
> sequences. I was under the impression that these are no longer valid
> UTF-8 sequences.
I have already changed that in my modified std.utf module (posted some days ago). The toUtfX() functions were also changed to reject any invalid encodings. Regrettably, I have not heard anything about it, and I don't know whether Walter will include the changed code in Phobos (I don't think so...).

As I said in that posting, I would also rework the other functions in std.utf. But I am not sure what to do about toUCSindex()/toUTFindex(): they are very inefficient if used the wrong way...

Ciao
uwe
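
One way these functions become expensive, and presumably the misuse meant here, is calling toUTFindex(s, k) for each successive k: every call rescans from the start of the string, so visiting all characters costs quadratic time. Below is a rough sketch of a single-pass alternative, assuming a current D compiler and Phobos' `std.utf.stride`. `codePointOffsets` is a hypothetical helper, not part of std.utf.

```d
import std.utf : stride;

// Hypothetical helper: byte offset of every code point in `s`,
// collected in one left-to-right pass instead of one rescan per index.
size_t[] codePointOffsets(const(char)[] s)
{
    size_t[] offsets;
    // stride(s, i) gives the length of the sequence starting at byte i.
    for (size_t i = 0; i < s.length; i += stride(s, i))
        offsets ~= i;
    return offsets;
}
```

Since `stride` throws a UTFException when the byte at the given offset cannot start a sequence, this loop would also reject a string whose first byte is 0xFF, rather than silently reporting offset zero.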
May 24 2005