digitalmars.D - [Fix] std.utf bad conversions from UTF-16
- Stewart Gordon (77/77) Jul 12 2004 Using DMD 0.95, Windows 98SE.
- Arcane Jill (8/13) Jul 12 2004 The values U+FFFE and U+FFFF are not illegal either in UTF-16 or UTF-32....
Using DMD 0.95, Windows 98SE.
I've just been experimenting with std.utf. Two separate bugs cropped
up, both in functions that have no unit tests:
1. toUTF32(wchar[]) runs into an infinite loop when it encounters a
non-ASCII single-word character. The problem is in decode - a missing
else block means that the counter doesn't get incremented.
2. toUTF8(wchar[]) also fails on strings containing surrogate pairs. The
problem is that each wchar is cast to a dchar, one by one, instead of
decoding the UTF-16 string.
The fixed functions are below.
Stewart.
----------
dchar decode(wchar[] s, inout size_t idx)
    in
    {
        assert(idx >= 0 && idx < s.length);
    }
    out (result)
    {
        assert(isValidDchar(result));
    }
    body
    {
        char[] msg;
        dchar V;
        size_t i = idx;
        uint u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF)
        {
            uint u2;

            if (i + 1 == s.length)
            {
                msg = "surrogate UTF-16 high value past end of string";
                goto Lerr;
            }
            u2 = s[i + 1];
            if (u2 < 0xDC00 || u2 > 0xDFFF)
            {
                msg = "surrogate UTF-16 low value out of range";
                goto Lerr;
            }
            u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
            i += 2;
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)
        {
            msg = "unpaired surrogate UTF-16 value";
            goto Lerr;
        }
        else if (u == 0xFFFE || u == 0xFFFF)
        {
            msg = "illegal UTF-16 value";
            goto Lerr;
        }
        // default: single-word character (0x0000 to 0xD7FF, 0xE000 to 0xFFFD)
        // SG fixed bug - previous if (u <= 0x7F) becomes redundant
        else
        {
            i++;
        }

        idx = i;
        return cast(dchar)u;

      Lerr:
        throw new UtfError(msg, i);
    }

char[] toUTF8(wchar[] s)
{
    char[] r;

    for (size_t i = 0; i < s.length; )
    {
        encode(r, decode(s, i));
    }
    return r;
}
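
The surrogate arithmetic in decode above is compact, so a worked example may help. The constant 0xD7C0 is 0xD800 - 0x40, and 0x40 << 10 is 0x10000, so subtracting 0xD7C0 strips the high-surrogate base and adds the 0x10000 offset in one step. The following is a minimal sketch for a current D compiler (not DMD 0.95); combineSurrogates is a hypothetical helper written for illustration and is not part of std.utf.

```d
// Hypothetical helper, not part of std.utf: combines a valid surrogate
// pair with the same arithmetic used in the decode function of the post.
dchar combineSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "not a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "not a low surrogate");

    // (hi - 0xD7C0) equals (hi - 0xD800) + 0x40; shifting left by 10
    // turns the extra 0x40 into the 0x10000 supplementary-plane offset.
    uint u = ((hi - 0xD7C0) << 10) + (lo - 0xDC00);
    return cast(dchar) u;
}

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) is D834 DD1E in UTF-16:
    // (0xD834 - 0xD7C0) << 10 = 0x1D000, plus (0xDD1E - 0xDC00) = 0x11E.
    assert(combineSurrogates(cast(wchar) 0xD834, cast(wchar) 0xDD1E) == 0x1D11E);

    // Lowest and highest supplementary code points.
    assert(combineSurrogates(cast(wchar) 0xD800, cast(wchar) 0xDC00) == 0x10000);
    assert(combineSurrogates(cast(wchar) 0xDBFF, cast(wchar) 0xDFFF) == 0x10FFFF);
}
```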
--
My e-mail is valid but not my primary mailbox, aside from its being the
unfortunate victim of intensive mail-bombing at the moment. Please keep
replies on the 'group where everyone may benefit.
Jul 12 2004
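
Since both bugs were in functions without unit tests, a small regression check along the following lines might have caught them. This is only a sketch: it is written against the std.utf of a current D compiler (wstring, dstring and string types), not the DMD 0.95 library discussed in the post, and the sample text is an arbitrary choice.

```d
import std.utf : toUTF8, toUTF32;

void main()
{
    // U+00E9 is a non-ASCII character that fits in one UTF-16 word
    // (the case behind bug 1); U+1D11E needs the surrogate pair
    // D834 DD1E (the case behind bug 2).
    wstring w = "caf\u00E9 \U0001D11E"w;
    assert(w.length == 7); // 5 single-word characters + 1 surrogate pair

    // UTF-16 to UTF-32: must terminate and yield 6 code points.
    dstring d = toUTF32(w);
    assert(d == "caf\u00E9 \U0001D11E"d);
    assert(d.length == 6);

    // UTF-16 to UTF-8: the surrogate pair must be decoded to one code
    // point first, giving 3 + 2 + 1 + 4 = 10 bytes.
    string s = toUTF8(w);
    assert(s == "caf\u00E9 \U0001D11E");
    assert(s.length == 10);
}
```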
In article <ccto20$18bq$1 digitaldaemon.com>, Stewart Gordon says...
Cool. Excellent. Like it. Apart from these lines...
    else if (u == 0xFFFE || u == 0xFFFF)
    {
        msg = "illegal UTF-16 value";
        goto Lerr;
    }
The values U+FFFE and U+FFFF are not illegal either in UTF-16 or UTF-32. They
are noncharacters, that is, code points which are permanently reserved and will
never be assigned to a character, that's all. There are in total 66 Unicode code
points which have this property, of which U+FFFE and U+FFFF are but two
examples. Walter has now changed isValidDchar() to return true for U+FFFE and
U+FFFF.
Arcane Jill
Jul 12 2004
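
To make the distinction concrete, the set of noncharacters can be written down as a simple predicate. The sketch below assumes the definition in the Unicode Standard (U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes); isNoncharacter is a hypothetical name chosen for illustration and is not a std.utf function. It targets a current D compiler.

```d
// Hypothetical helper, not part of std.utf: true for code points that
// the Unicode Standard designates as noncharacters.
bool isNoncharacter(dchar c)
{
    if (c > 0x10FFFF)
        return false;                 // outside the Unicode code space
    if (c >= 0xFDD0 && c <= 0xFDEF)
        return true;                  // contiguous block of 32 in the BMP
    return (c & 0xFFFE) == 0xFFFE;    // U+xxFFFE and U+xxFFFF in every plane
}

void main()
{
    size_t count = 0;
    foreach (uint cp; 0 .. 0x110000)
    {
        if (isNoncharacter(cast(dchar) cp))
            ++count;
    }
    // 32 in U+FDD0..U+FDEF plus 2 per plane * 17 planes.
    assert(count == 66);

    assert(isNoncharacter(cast(dchar) 0xFFFE));
    assert(isNoncharacter(cast(dchar) 0xFFFF));
    assert(isNoncharacter(cast(dchar) 0x10FFFF));
    assert(!isNoncharacter(cast(dchar) 0xFFFD)); // REPLACEMENT CHARACTER
}
```

A decoder that wants to flag these values can do so with a separate check of this kind, rather than rejecting them as malformed UTF-16.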