digitalmars.D - [Fix] std.utf bad conversions from UTF-16

Stewart Gordon <smjg_1998 yahoo.com> writes:
Using DMD 0.95, Windows 98SE.

I've just been experimenting with std.utf.  Two separate bugs cropped 
up, both in functions that have no unit tests:

1. toUTF32(wchar[]) runs into an infinite loop when it encounters a 
non-ASCII single-word character.  The problem is in decode - a missing 
else block means that the counter doesn't get incremented.

2. toUTF8(wchar[]) also tends to fail.  The problem is that each wchar 
is cast to a dchar, one by one, instead of decoding the UTF-16 string.
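
A minimal sketch of what goes wrong in point 2, written for a current D 
compiler rather than DMD 0.95.  The helper name combineSurrogates is 
hypothetical and not part of std.utf; it assumes its arguments already 
form a valid surrogate pair.

```d
import std.stdio;

// Hypothetical helper (not in std.utf): combine a valid surrogate pair
// into the code point it represents.
dchar combineSurrogates(wchar hi, wchar lo)
{
    // Equivalent to ((hi - 0xD7C0) << 10) + (lo - 0xDC00).
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

void main()
{
    // U+1D11E is stored in UTF-16 as the two words D834 DD1E.
    wstring s = "\U0001D11E"w;

    // Word-by-word cast: each 16-bit word is treated as a code point.
    foreach (w; s)
        writefln("cast:   U+%04X", cast(uint) w);

    // Decoding: the pair is combined into a single code point.
    writefln("decode: U+%04X", cast(uint) combineSurrogates(s[0], s[1]));
}
```

Casting word by word yields the two surrogate values U+D834 and U+DD1E, 
neither of which is a character, whereas decoding yields the single code 
point U+1D11E.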


The fixed functions are below.

Stewart.

----------
dchar decode(wchar[] s, inout size_t idx)
in
{
	assert(idx >= 0 && idx < s.length);
}
out (result)
{
	assert(isValidDchar(result));
}
body
{
	char[] msg;
	dchar V;
	size_t i = idx;
	uint u = s[i];

	if (u >= 0xD800 && u <= 0xDBFF)
	{
		uint u2;

		if (i + 1 == s.length)
		{
			msg = "surrogate UTF-16 high value past end of string";
			goto Lerr;
		}
		u2 = s[i + 1];
		if (u2 < 0xDC00 || u2 > 0xDFFF)
		{
			msg = "surrogate UTF-16 low value out of range";
			goto Lerr;
		}
		u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
		i += 2;
	}
	else if (u >= 0xDC00 && u <= 0xDFFF)
	{
		msg = "unpaired surrogate UTF-16 value";
		goto Lerr;
	}
	else if (u == 0xFFFE || u == 0xFFFF)
	{
		msg = "illegal UTF-16 value";
		goto Lerr;
	}
	//	default: single-word character (0x0000 to 0xD7FF, 0xE000 to 0xFFFD)
	//	SG fixed bug - previous if (u <= 0x7F) becomes redundant
	else
	{
		i++;
	}

	idx = i;
	return cast(dchar)u;

   Lerr:
	throw new UtfError(msg, i);
}


char[] toUTF8(wchar[] s)
{
	char[] r;

	for (size_t i = 0; i < s.length; )
	{
		encode(r, decode(s, i));
	}
	return r;
}

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.
Jul 12 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ccto20$18bq$1 digitaldaemon.com>, Stewart Gordon says...

Cool. Excellent. Like it. Apart from these lines...

	else if (u == 0xFFFE || u == 0xFFFF)
	{
		msg = "illegal UTF-16 value";
		goto Lerr;
	}

The values U+FFFE and U+FFFF are not illegal in either UTF-16 or UTF-32. They are permanently unassigned code points (Unicode calls them noncharacters), that's all. There are in total 66 Unicode code points which have this property, of which U+FFFE and U+FFFF are but two examples. Walter has now changed isValidDchar() to return true for U+FFFE and U+FFFF.

Arcane Jill
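
For illustration only, here is a rough sketch of how the permanently unassigned values could be told apart from genuinely invalid ones, written for a current D compiler. The name isNoncharacter is hypothetical and not part of std.utf; the ranges used are U+FDD0 to U+FDEF plus the last two code points of every plane.

```d
import std.stdio;

// Hypothetical helper (not in std.utf): true for code points that are
// permanently unassigned, i.e. U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF
// in each of the 17 planes.
bool isNoncharacter(dchar c)
{
    if (c > 0x10FFFF)
        return false;               // outside the Unicode code space
    if (c >= 0xFDD0 && c <= 0xFDEF)
        return true;
    return (c & 0xFFFE) == 0xFFFE;  // last two code points of a plane
}

void main()
{
    writeln(isNoncharacter(cast(dchar) 0xFFFE));  // true
    writeln(isNoncharacter(cast(dchar) 0xFFFD));  // false
}
```

A decoder can use such a test to flag these values if it wants to, but it should not reject them as malformed UTF-16.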
Jul 12 2004