
digitalmars.D - dchar undefined behaviour

reply tsbockman <thomas.bockman gmail.com> writes:
While working on updating and improving Lionello Lunesu's 
proposed fix for DMD issue #259, I have come across a value range 
propagation related issue with the dchar type.

The patch adds VRP-based compile-time evaluation of integer type 
comparisons, where possible. This caused the following issue:

The compiler will now optimize out attempts to handle invalid, 
out-of-range dchar values. For example:

dchar c = cast(dchar) uint.max;
if(c > 0x10FFFF)
     writeln("invalid");
else
     writeln("OK");

With constant folding for integer comparisons, the above will 
print "OK" rather than "invalid", as it should. The predicate (c 
> 0x10FFFF) is simply *assumed* to be false, because the current 
starting range.imax for a dchar expression is dchar.max.

So, this leads to the question: is making use of dchar values 
greater than dchar.max considered undefined behaviour, or not?

1. If it is UB, then there is quite a lot of D code (including 
std.uni) which must be corrected to use uint instead of dchar 
when dealing with values which could possibly fall outside the 
officially supported range.

2. If it is not UB, then the compiler needs to be updated to stop 
assuming that dchar values greater than dchar.max are impossible. 
This basically just means removing some of dchar's special 
treatment, and running it through more of the same code paths as 
uint.

At the moment, I strongly prefer #2, but I suppose #1 could make 
sense if people think code which might have to deal with invalid 
code points can be isolated sufficiently from other unicode 
processing.
Oct 22 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/22/2015 6:31 PM, tsbockman wrote:
 So, this leads to the question: is making use of dchar values greater than
 dchar.max considered undefined behaviour, or not?

 1. If it is UB, then there is quite a lot of D code (including std.uni) which
 must be corrected to use uint instead of dchar when dealing with values which
 could possibly fall outside the officially supported range.

 2. If it is not UB, then the compiler needs to be updated to stop assuming that
 dchar values greater than dchar.max are impossible. This basically just means
 removing some of dchar's special treatment, and running it through more of the
 same code paths as uint.
I think that ship has sailed. Illegal values in a dchar are not 
UB. Making it UB would result in the surprising behavior you've 
noted.

Also, this segues into what to do about string, wstring, and 
dstring with invalid sequences in them. Currently, functions 
define what they do with invalid sequences. Making it UB would be 
a burden to programmers.
Oct 23 2015
parent tsbockman <thomas.bockman gmail.com> writes:
On Friday, 23 October 2015 at 12:17:22 UTC, Walter Bright wrote:
 I think that ship has sailed. Illegal values in a dchar are not 
 UB. Making it UB would result in surprising behavior which 
 you've noted. Also, this segues into what to do about string, 
 wstring, and dstring with invalid sequences in them. Currently, 
 functions define what they do with invalid sequences. Making 
 it UB would be a burden to programmers.
That makes sense to me. I think the language would have to work a lot harder to block the creation of invalid dchar values to justify the current VRP assumption. Fixing the compiler in accordance with option #2 looks like it should be easy, so far.
Oct 23 2015
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 23 October 2015 at 01:31:47 UTC, tsbockman wrote:
 dchar c = cast(dchar) uint.max;
 if(c > 0x10FFFF)
     writeln("invalid");
 else
     writeln("OK");

 With constant folding for integer comparisons, the above will 
 print "OK" rather than "invalid", as it should. The predicate 
 (c > 0x10FFFF) is simply *assumed* to be false, because the 
 current starting range.imax for a dchar expression is dchar.max.
That doesn't sound right. In fact, this puts into question why dchar.max is at the value it is now. It might be the current maximum at the current version of Unicode, but this seems like a completely pointless restriction that breaks forward-compatibility with future Unicode versions, meaning that D programs compiled today may be unable to work with Unicode text in the future because of a pointless artificial limitation.
Oct 23 2015
parent reply Anon <a a.a> writes:
On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev 
wrote:
 That doesn't sound right. In fact, this puts into question why 
 dchar.max is at the value it is now. It might be the current 
 maximum at the current version of Unicode, but this seems like 
 a completely pointless restriction that breaks 
 forward-compatibility with future Unicode versions, meaning 
 that D programs compiled today may be unable to work with 
 Unicode text in the future because of a pointless artificial 
 limitation.
Unless UTF-16 is deprecated and completely removed from all 
systems everywhere, there is no way for the Unicode Consortium to 
increase the limit beyond U+10FFFF. That limit is not arbitrary, 
but based on the technical limitations of what UTF-16 can 
actually represent. UTF-8 and UTF-32 both have room for 
expansion, but have been defined to match UTF-16's limitations.
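The arithmetic behind that limit can be checked directly: a 
UTF-16 surrogate pair carries 10 payload bits in each half, 
offset by 0x10000 (a sketch of the encoding math, nothing 
library-specific):

```d
import std.stdio;

void main()
{
    // A high surrogate (0xD800-0xDBFF) and a low surrogate
    // (0xDC00-0xDFFF) each contribute 10 payload bits, and the
    // decoded value is offset by 0x10000 past the BMP.
    enum uint offset = 0x10000;
    enum uint payloadBits = 20; // 10 bits from each surrogate
    immutable uint maxEncodable = offset + (1u << payloadBits) - 1;
    writefln("0x%X", maxEncodable); // prints 0x10FFFF
}
```

So U+10FFFF is exactly the largest value a surrogate pair can 
encode; anything higher is unreachable in UTF-16.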
Oct 23 2015
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24-Oct-2015 02:45, Anon wrote:
 On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
 That doesn't sound right. In fact, this puts into question why
 dchar.max is at the value it is now. It might be the current maximum
 at the current version of Unicode, but this seems like a completely
 pointless restriction that breaks forward-compatibility with future
 Unicode versions, meaning that D programs compiled today may be unable
 to work with Unicode text in the future because of a pointless
 artificial limitation.
Unless UTF-16 is deprecated and completely removed from all systems everywhere, there is no way for Unicode Consortium to increase the limit beyond U+10FFFF. That limit is not arbitrary, but based on the technical limitations of what UTF-16 can actually represent. UTF-8 and UTF-32 both have room for expansion, but have been defined to match UTF-16's limitations.
Exactly. Unicode officially limited UTF-8 to U+10FFFF in Unicode 
6.0 or so. Previously it was expected to (maybe) expand beyond 
that, but it was decided to stay with U+10FFFF pretty much 
indefinitely because of UTF-16.

Also, only ~114k code points have assigned meaning; we are 
looking at 900K+ unassigned values reserved today.

-- 
Dmitry Olshansky
Oct 24 2015