digitalmars.D.bugs - isValidDchar error

Arcane Jill <Arcane_member pathlink.com> writes:
The source code for the function isValidDchar reads as follows:

#    bit isValidDchar(dchar c)
#    {
#        return c < 0xD800 ||
#         (c > 0xDFFF && c <= 0x10FFFF && c != 0xFFFE && c != 0xFFFF);
#    }

I believe that this implementation is incorrect. You see, the codepoints U+FFFE
and U+FFFF are not, in fact, invalid. They are simply permanently unassigned.
Their General Category is Cn. They are legal codepoints (though unassigned
characters).

Now, MOST unassigned characters are only temporarily unassigned. There is always
the possibility that what is unassigned today may be assigned tomorrow. However,
there is a small list of codepoints for which this is not the case (see below).
For this small list of codepoints, the Unicode Consortium guarantee that they
will remain permanently unassigned. U+FFFE and U+FFFF are in this list. (In
fact, this was the reason that U+FFFF was chosen as the value for wchar.init and
dchar.init.)

To quote from the definitive Unicode Standard code chart UFFF0.pdf (the Specials
block), available from the Unicode website:

Noncharacters:

/These codes are intended for process internal uses, but are not permitted for
interchange/

FFFE - <not a character>
* the value FFFE is guaranteed not to be a Unicode character at all
* may be used to detect byte order by contrast with FEFF which is a character.

FFFF - <not a character>
* the value FFFF is guaranteed not to be a Unicode character at all


Now, the key sentence here is the one which reads: "These codes are intended for
process internal uses, but are not permitted for interchange". In other words,
they are PERMITTED for application-internal use (such as delimiting a piece of
Unicode text, or as an appropriate value for dchar.init) but PROHIBITED in
interchanged text, where they never represent a character.
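
As a rough illustration of such process-internal use, here is a minimal sketch in D. The names DELIM and firstField are hypothetical and appear nowhere in Phobos; the sketch only assumes that U+FFFF never occurs in real text, which is what the guarantee above provides. It can be compiled with the -unittest and -main switches.

```d
// Hypothetical in-process delimiter: U+FFFF never represents a character,
// so it cannot collide with real text held in the buffer.
enum dchar DELIM = cast(dchar) 0xFFFF;

// Returns the part of buf before the first DELIM, or all of buf if
// no delimiter is present.
const(dchar)[] firstField(const(dchar)[] buf)
{
    foreach (i, c; buf)
        if (c == DELIM)
            return buf[0 .. i];
    return buf;
}

unittest
{
    // wchar.init and dchar.init are both U+FFFF in D.
    assert(wchar.init == 0xFFFF);
    assert(dchar.init == 0xFFFF);

    dstring packed = "alpha"d ~ DELIM ~ "beta"d;
    assert(firstField(packed) == "alpha"d);
    assert(firstField("no delimiter"d) == "no delimiter"d);
}
```

The delimiter must be stripped before the text leaves the process, since noncharacters are not permitted for interchange.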

In addition, the equally definitive PropList.txt file lists properties for these
codepoints, as follows:

FDD0..FDEF    ; Noncharacter_Code_Point # Cn  [32] 
FFFE..FFFF    ; Noncharacter_Code_Point # Cn   [2] 
1FFFE..1FFFF  ; Noncharacter_Code_Point # Cn   [2] 
2FFFE..2FFFF  ; Noncharacter_Code_Point # Cn   [2] 
3FFFE..3FFFF  ; Noncharacter_Code_Point # Cn   [2] 
4FFFE..4FFFF  ; Noncharacter_Code_Point # Cn   [2] 
5FFFE..5FFFF  ; Noncharacter_Code_Point # Cn   [2] 
6FFFE..6FFFF  ; Noncharacter_Code_Point # Cn   [2] 
7FFFE..7FFFF  ; Noncharacter_Code_Point # Cn   [2] 
8FFFE..8FFFF  ; Noncharacter_Code_Point # Cn   [2] 
9FFFE..9FFFF  ; Noncharacter_Code_Point # Cn   [2] 
AFFFE..AFFFF  ; Noncharacter_Code_Point # Cn   [2] 
BFFFE..BFFFF  ; Noncharacter_Code_Point # Cn   [2] 
CFFFE..CFFFF  ; Noncharacter_Code_Point # Cn   [2] 
DFFFE..DFFFF  ; Noncharacter_Code_Point # Cn   [2] 
EFFFE..EFFFF  ; Noncharacter_Code_Point # Cn   [2] 
FFFFE..FFFFF  ; Noncharacter_Code_Point # Cn   [2] 
10FFFE..10FFFF; Noncharacter_Code_Point # Cn   [2]

The etc.unicode function isNoncharacterCodePoint() will return true for every
codepoint in this list, and for no others. FFFE and FFFF may be noncharacter
codepoints, but so is FDD0, so is FDE0, and so on. That doesn't make them
invalid dchars - it just makes them noncharacter codepoints.

(On the other hand, if you're going to treat FFFE and FFFF as invalid, you
should treat all 66 codepoints listed above as equally invalid.)
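
The PropList.txt ranges above reduce to a short test, and the figure of 66 can be checked by brute force. The following is only a sketch: isNoncharacter is a hypothetical helper written for this illustration, not the etc.unicode function, and it uses bool because bit is not available in current D compilers.

```d
// Hypothetical helper: true for exactly the codepoints that PropList.txt
// lists as Noncharacter_Code_Point.
bool isNoncharacter(dchar c)
{
    return (c >= 0xFDD0 && c <= 0xFDEF) ||            // the block of 32
           (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE); // last two of each plane
}

unittest
{
    size_t count = 0;
    foreach (uint cp; 0 .. 0x110000)
        if (isNoncharacter(cast(dchar) cp))
            ++count;

    // 32 codepoints in FDD0..FDEF, plus 2 at the end of each of 17 planes.
    assert(count == 32 + 17 * 2);
    assert(count == 66);
}
```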

I suggest that you rewrite isValidDchar() as follows:

#    bit isValidDchar(dchar c)
#    {
#        return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
#    }
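
To make the difference concrete, the sketch below places the original and the suggested bodies side by side under the hypothetical names isValidDcharOriginal and isValidDcharProposed, and checks them at the boundaries. It uses bool rather than bit, on the assumption that it is built with a current D compiler (with the -unittest and -main switches).

```d
// The implementation quoted at the top of this post.
bool isValidDcharOriginal(dchar c)
{
    return c < 0xD800 ||
        (c > 0xDFFF && c <= 0x10FFFF && c != 0xFFFE && c != 0xFFFF);
}

// The suggested replacement: reject only surrogates and out-of-range values.
bool isValidDcharProposed(dchar c)
{
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

unittest
{
    // Surrogates and values above 0x10FFFF are rejected by both versions.
    assert( isValidDcharProposed(cast(dchar) 0xD7FF));
    assert(!isValidDcharProposed(cast(dchar) 0xD800));
    assert(!isValidDcharProposed(cast(dchar) 0xDFFF));
    assert( isValidDcharProposed(cast(dchar) 0xE000));
    assert( isValidDcharProposed(cast(dchar) 0x10FFFF));
    assert(!isValidDcharProposed(cast(dchar) 0x110000));

    // The two versions differ only on U+FFFE and U+FFFF.
    assert(!isValidDcharOriginal(cast(dchar) 0xFFFE));
    assert(!isValidDcharOriginal(cast(dchar) 0xFFFF));
    assert( isValidDcharProposed(cast(dchar) 0xFFFE));
    assert( isValidDcharProposed(cast(dchar) 0xFFFF));
}
```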

Arcane Jill
Jul 09 2004

"Walter" <newshound digitalmars.com> writes:
You make a compelling argument. I'll make the change you suggest.
Jul 09 2004