
digitalmars.D - dchar undefined behaviour

reply tsbockman <thomas.bockman gmail.com> writes:
While working on updating and improving Lionello Lunesu's 
proposed fix for DMD issue #259, I have come across a value range 
propagation related issue with the dchar type.

The patch adds VRP-based compile-time evaluation of integer type 
comparisons, where possible. This caused the following issue:

The compiler will now optimize out attempts to handle invalid, 
out-of-range dchar values. For example:

dchar c = cast(dchar) uint.max;
if(c > 0x10FFFF)
     writeln("invalid");
else
     writeln("OK");

With constant folding for integer comparisons, the above will 
print "OK" rather than "invalid", as it should. The predicate (c 
> 0x10FFFF) is simply *assumed* to be false, because the current 
starting range.imax for a dchar expression is dchar.max.

So, this leads to the question: is making use of dchar values 
greater than dchar.max considered undefined behaviour, or not?

1. If it is UB, then there is quite a lot of D code (including 
std.uni) which must be corrected to use uint instead of dchar 
when dealing with values which could possibly fall outside the 
officially supported range.

2. If it is not UB, then the compiler needs to be updated to stop 
assuming that dchar values greater than dchar.max are impossible. 
This basically just means removing some of dchar's special 
treatment, and running it through more of the same code paths as 
uint.

At the moment, I strongly prefer #2, but I suppose #1 could make 
sense if people think code which might have to deal with invalid 
code points can be isolated sufficiently from other unicode 
processing.
Oct 22 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/22/2015 6:31 PM, tsbockman wrote:
 So, this leads to the question: is making use of dchar values greater than
 dchar.max considered undefined behaviour, or not?

 1. If it is UB, then there is quite a lot of D code (including std.uni) which
 must be corrected to use uint instead of dchar when dealing with values which
 could possibly fall outside the officially supported range.

 2. If it is not UB, then the compiler needs to be updated to stop assuming that
 dchar values greater than dchar.max are impossible. This basically just means
 removing some of dchar's special treatment, and running it through more of the
 same code paths as uint.
I think that ship has sailed. Illegal values in a dchar are not 
UB. Making it UB would result in the surprising behavior you've 
noted.

Also, this segues into what to do about string, wstring, and 
dstring with invalid sequences in them. Currently, functions 
define what they do with invalid sequences. Making it UB would be 
a burden to programmers.
Oct 23 2015
parent tsbockman <thomas.bockman gmail.com> writes:
On Friday, 23 October 2015 at 12:17:22 UTC, Walter Bright wrote:
 I think that ship has sailed. Illegal values in a dchar are not 
 UB. Making it UB would result in surprising behavior which 
 you've noted. Also, this segues into what to do about string, 
 wstring, and dstring with invalid sequences in them. Currently, 
 functions define what they do with invalid sequences. Making 
 it UB would be a burden to programmers.
That makes sense to me. I think the language would have to work a lot harder to block the creation of invalid dchar values to justify the current VRP assumption. Fixing the compiler in accordance with option #2 looks like it should be easy, so far.
Oct 23 2015
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 23 October 2015 at 01:31:47 UTC, tsbockman wrote:
 dchar c = cast(dchar) uint.max;
 if(c > 0x10FFFF)
     writeln("invalid");
 else
     writeln("OK");

 With constant folding for integer comparisons, the above will 
 print "OK" rather than "invalid", as it should. The predicate 
 (c > 0x10FFFF) is simply *assumed* to be false, because the 
 current starting range.imax for a dchar expression is dchar.max.
That doesn't sound right. In fact, this puts into question why dchar.max is at the value it is now. It might be the current maximum at the current version of Unicode, but this seems like a completely pointless restriction that breaks forward-compatibility with future Unicode versions, meaning that D programs compiled today may be unable to work with Unicode text in the future because of a pointless artificial limitation.
Oct 23 2015
parent reply Anon <a a.a> writes:
On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev 
wrote:
 That doesn't sound right. In fact, this puts into question why 
 dchar.max is at the value it is now. It might be the current 
 maximum at the current version of Unicode, but this seems like 
 a completely pointless restriction that breaks 
 forward-compatibility with future Unicode versions, meaning 
 that D programs compiled today may be unable to work with 
 Unicode text in the future because of a pointless artificial 
 limitation.
Unless UTF-16 is deprecated and completely removed from all 
systems everywhere, there is no way for the Unicode Consortium to 
increase the limit beyond U+10FFFF. That limit is not arbitrary, 
but based on the technical limitations of what UTF-16 can 
actually represent. UTF-8 and UTF-32 both have room for 
expansion, but have been defined to match UTF-16's limitations.
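The arithmetic behind that limit can be checked directly: a 
UTF-16 surrogate pair carries 10 payload bits in each half, 
offset by 0x10000 (a sketch of the encoding math, nothing 
library-specific):

```d
import std.stdio;

void main()
{
    // A high surrogate (0xD800-0xDBFF) and a low surrogate
    // (0xDC00-0xDFFF) each contribute 10 payload bits, and the
    // decoded value is offset by 0x10000 past the BMP.
    enum uint offset = 0x10000;
    enum uint payloadBits = 20; // 10 bits from each surrogate
    immutable uint maxEncodable = offset + (1u << payloadBits) - 1;
    writefln("0x%X", maxEncodable); // prints 0x10FFFF
}
```

So U+10FFFF is exactly the largest value a surrogate pair can 
encode; anything higher is unreachable in UTF-16.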
Oct 23 2015
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24-Oct-2015 02:45, Anon wrote:
 On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
 That doesn't sound right. In fact, this puts into question why
 dchar.max is at the value it is now. It might be the current maximum
 at the current version of Unicode, but this seems like a completely
 pointless restriction that breaks forward-compatibility with future
 Unicode versions, meaning that D programs compiled today may be unable
 to work with Unicode text in the future because of a pointless
 artificial limitation.
Unless UTF-16 is deprecated and completely removed from all systems everywhere, there is no way for Unicode Consortium to increase the limit beyond U+10FFFF. That limit is not arbitrary, but based on the technical limitations of what UTF-16 can actually represent. UTF-8 and UTF-32 both have room for expansion, but have been defined to match UTF-16's limitations.
Exactly. Unicode officially limited UTF-8 to U+10FFFF in Unicode 
6.0 or so. Previously it was expected to (maybe) expand beyond 
that, but it was decided to stay with U+10FFFF pretty much 
indefinitely because of UTF-16.

Also, only ~114k code points have assigned meaning; we are 
looking at 900K+ unassigned values reserved today.

-- 
Dmitry Olshansky
Oct 24 2015