digitalmars.D.bugs - [Issue 16090] New: popFront generates out-of-bounds array index on
- via Digitalmars-d-bugs (56/56) May 28 2016 https://issues.dlang.org/show_bug.cgi?id=16090
https://issues.dlang.org/show_bug.cgi?id=16090

          Issue ID: 16090
           Summary: popFront generates out-of-bounds array index on corrupted
                    utf-8 strings
           Product: D
           Version: D2
          Hardware: x86
                OS: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P1
         Component: phobos
          Assignee: nobody puremagic.com
          Reporter: jrdemail2000-dlang yahoo.com

If a UTF-8 string is chopped (terminated) in the middle of a multi-byte UTF-8 character, popFront will generate an out-of-bounds array index. When compiled with -boundscheck=on, popFront throws a core.exception.RangeError. With -boundscheck=off, the behavior is undefined. In the program below, in my tests the while loop kept running until it generated a bus error.

    void main(string[] args)
    {
        import std.stdio;
        import std.range;

        auto s = "aä";
        auto corrupted = s[0 .. $-1];
        auto n = 0;
        while (!corrupted.empty)
        {
            corrupted.popFront;
            n++;
        }
        writeln(n);
    }

In this program, the 'ä' character is a two-byte UTF-8 sequence. Dropping the last byte leaves an incomplete UTF-8 code point.

The reason this is so problematic is that string processing often involves corrupted strings, in particular strings read at run time from input sources. In the sample program above it can be said that this is a programmer error. However, if the string is read from an outside source, the program needs to be able to defend against corrupted strings.

It appears this problem arises from this code in popFront (isNarrowString), currently line 2076 in std/range/primitives.d:

    import core.bitop : bsr;
    auto msbs = 7 - bsr(~c);
    if ((msbs < 2) | (msbs > 6))
    {
        //Invalid UTF-8
        msbs = 1;
    }
    str = str[msbs .. $];

The msbs variable holds the length of the UTF-8 code point as indicated by the first byte. The 'str[msbs .. $]' expression assumes the string is long enough to hold the full code point.

Besides being problematic for practical applications, this is inconsistent with other auto-decoding behavior. The 'front' routine will throw a std.utf.UTFException in this situation. And popFront itself handles the case of an invalid first byte differently, by simply moving past it.

--
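One way to make the slice safe is to clamp the advance to the remaining length. The code below is only a sketch of that idea, not the actual Phobos fix. safePopFront is a hypothetical helper, not a Phobos function. It derives the sequence length with bit masks instead of bsr, and it treats only 2- to 4-byte lead bytes as valid, whereas the excerpt above accepts lengths up to 6.

```d
import std.algorithm.comparison : min;

/// Hypothetical helper (not part of Phobos): advances past one UTF-8
/// sequence without ever slicing beyond the end of the string.
void safePopFront(ref string str)
{
    assert(str.length > 0, "safePopFront called on an empty string");

    immutable uint c = str[0];
    size_t len;

    // Sequence length as announced by the lead byte.
    if (c < 0x80)
        len = 1;                    // ASCII
    else if ((c & 0xE0) == 0xC0)
        len = 2;                    // 110xxxxx
    else if ((c & 0xF0) == 0xE0)
        len = 3;                    // 1110xxxx
    else if ((c & 0xF8) == 0xF0)
        len = 4;                    // 11110xxx
    else
        len = 1;                    // invalid lead byte: step over it

    // Clamp so that a truncated sequence cannot index past the end.
    str = str[min(len, str.length) .. $];
}

void main()
{
    import std.stdio : writeln;

    string s = "aä";
    string corrupted = s[0 .. $ - 1];   // drops the second byte of 'ä'

    size_t n = 0;
    while (corrupted.length > 0)
    {
        safePopFront(corrupted);
        n++;
    }
    writeln(n);                         // prints 2
}
```

With the clamp in place, the sample loop terminates after two steps. Whether a truncated sequence should be skipped silently like this or reported, as 'front' does with UTFException, is a separate design decision.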