
digitalmars.D.bugs - std.string.maketrans and std.string.translate not unicode aware

Sam McCall <tunah.d tunah.net> writes:
The std.string.maketrans and translate functions are meant to create and 
apply a character translation table, respectively. However at the moment 
they create and apply a byte translation table.
This causes translation errors, and it trips an assertion if you try to 
replace an ASCII character with a non-ASCII character, for example, 
because the two arrays then have different lengths.
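
To make the length mismatch concrete, here is a minimal sketch. It assumes a current D compiler and std.utf; the helper names fitsByteTable and fitsCharTable are hypothetical and exist only for this illustration.

```d
import std.utf : toUTF32;

// Sketch only; both helpers are hypothetical, not part of std.string.

/// True if from/to have the same number of UTF-8 code units (bytes),
/// which is what a byte-indexed maketrans effectively requires.
bool fitsByteTable(const(char)[] from, const(char)[] to)
{
    return from.length == to.length;   // .length counts bytes for char[]
}

/// True if from/to have the same number of characters (code points),
/// which is what the caller actually means.
bool fitsCharTable(const(char)[] from, const(char)[] to)
{
    return toUTF32(from).length == toUTF32(to).length;
}
```

For from = "e" and to = "\u00E9" (e with an acute accent), fitsCharTable is true but fitsByteTable is false, because the accented character is encoded as the two bytes 0xC3 0xA9.
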
Unfortunately it's not possible to fix this without changing the 
function signatures, because a lookup table indexed by character would 
be too big. Something like this should work...
Sam

/************************************
  * Construct translation table for translate().
  */

dchar[dchar] maketrans(dchar[] from, dchar[] to)
     in
     {
	assert(from.length == to.length);
     }
     body
     {
	dchar[dchar] t;

	for (int i=0; i<from.length; i++)
	    t[from[i]] = to[i];

	return t;
     }

/******************************************
  * Translate characters in s[] using table created by maketrans().
  * Delete chars in delchars[].
  */

dchar[] translate(dchar[] s, dchar[dchar] transtab, dchar[] delchars) {
	dchar[] r;
	int i;
	int count;
	bit[dchar] deltab;

	for (i = 0; i < delchars.length; i++)
	    deltab[delchars[i]] = true;

	count = 0;
	foreach(dchar d; s)
		if(!(d in deltab))
			count++;

	r = new dchar[count];
	count = 0;
	// characters with no entry in transtab are copied unchanged
	foreach(dchar d; s)
		if(!(d in deltab))
			r[count++] = (d in transtab) ? transtab[d] : d;

	return r;
}


import std.utf;	// for toUTF8()

/******************************************
  * UTF-8 overload: translate characters in s[] using table created by
  * maketrans(). Delete chars in delchars[]. The result is re-encoded
  * with toUTF8().
  */

char[] translate(char[] s, dchar[dchar] transtab, dchar[] delchars) {
	dchar[] r;
	int i;
	int count;
	bit[dchar] deltab;

	for (i = 0; i < delchars.length; i++)
	    deltab[delchars[i]] = true;

	count = 0;
	foreach(dchar d; s)	// iterates properly over characters
		if(!(d in deltab))
			count++;

	r = new dchar[count];
	count = 0;
	// characters with no entry in transtab are copied unchanged
	foreach(dchar d; s)
		if(!(d in deltab))
			r[count++] = (d in transtab) ? transtab[d] : d;

	return toUTF8(r);
}
Jun 30 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbttmg$1u68$1 digitaldaemon.com>, Sam McCall says...
The std.string.maketrans and translate functions are meant to create and 
apply a character translation table, respectively. However at the moment 
they create and apply a byte translation table.

I agree with Sam here, but I should point out that the bug is actually much more serious than Sam suggests. It's not just a matter of missing features; it's a matter of serious UTF-8 corruption.

The current implementation allows users to modify char values in the range 0x80 to 0xFF. These bytes have specific meaning in terms of UTF-8. Allowing users to modify such values with a translate() routine is DANGEROUS, and is pretty much guaranteed to result in a string containing invalid UTF-8.

Sam suggests a number of ways of making these functions dchar-based instead of char-based. But if you want to keep them char-based, then you absolutely must disallow the modification of any char >0x7F, and document such functions as ASCII-only.

Arcane Jill
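
As a rough sketch of what the ASCII-only variant could look like, assuming a current D compiler: the names isAsciiOnly, maketransAscii and translateAscii are hypothetical, and this is not the existing std.string code. The table refuses any char above 0x7F, and the translation step never touches such bytes, so multi-byte sequences pass through intact.

```d
// Sketch only; all three functions are hypothetical, ASCII-only helpers.

/// True if every UTF-8 code unit in s is plain ASCII (<= 0x7F).
bool isAsciiOnly(const(char)[] s)
{
    foreach (char c; s)          // iterates code units, not characters
        if (c > 0x7F)
            return false;
    return true;
}

/// Build a 256-entry byte table that only ever remaps ASCII values.
char[256] maketransAscii(const(char)[] from, const(char)[] to)
{
    assert(from.length == to.length);
    assert(isAsciiOnly(from) && isAsciiOnly(to));

    char[256] t;
    foreach (i, ref c; t)
        c = cast(char) i;        // start from the identity mapping
    foreach (i, c; from)
        t[c] = to[i];
    return t;
}

/// Apply the table to ASCII bytes only; bytes 0x80..0xFF are copied
/// unchanged, so existing multi-byte UTF-8 sequences stay intact.
char[] translateAscii(const(char)[] s, const char[256] table)
{
    char[] r = s.dup;
    foreach (ref c; r)
        if (c <= 0x7F)
            c = table[c];
    return r;
}
```

With a table built from "hl" to "HL", the input "h\u00E9llo" becomes "H\u00E9LLo": the two bytes of the accented character are left alone. Deletion (delchars) is left out of this sketch.
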
Jun 30 2004