digitalmars.D.bugs - [Issue 1357] New: Cannot use FFFF and FFFE in Unicode escape sequences.

d-bugmail puremagic.com (15/15) Jul 20 2007 http://d.puremagic.com/issues/show_bug.cgi?id=1357


d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357

           Summary: Cannot use FFFF and FFFE in Unicode escape sequences.
           Product: D
           Version: 1.017
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: aziz.kerim gmail.com


Escape sequences \uFFFF, \uFFFE, \U0000FFFF, \U0000FFFE are deemed invalid by
the compiler.


--

Jul 20 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






I think they are invalid as Unicode. I'd have to check.


--

Jul 20 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






Sorry, I should have mentioned that these codepoints are valid because they are
specifically allowed for internal use by the Unicode Standard. I think Walter
already knows this. Phobos has an isValidDchar() function which returns true for
\uFFFE and \uFFFF.


--

Jul 20 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






0xFFFE and 0xFFFF are - unlike the private use blocks - not for normal
private use. The only place they are allowed is inside a text processing
system, but they should never be used for data exchange between programs. As
such it would be a reasonable safety measure to disallow \uFFFF and \uFFFE
literals and force the use of e.g. \xFF \xFE or x"FFFE".


--

Jul 23 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357


smjg iname.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smjg iname.com





\x is for UTF-8 fragments, not for splitting UTF-16 into bytes.


--

Jul 25 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






As I wrote my own encoding/decoding functions for Unicode characters I found
out that certain Unicode codepoints are not allowed to be encoded as UTF-8
sequences. I'm quoting from here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

"Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well
as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8
decoders should treat them like malformed or overlong sequences for safety
reasons."

So the behaviour of the compiler is actually correct, and Phobos has a bug.


--

Sep 30 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






When that talks of "decoders", is it talking about:
(a) decoding Unicode text from files?
(b) translating data used internally by an application?

If (a), then obviously it should reject U+FFFE and U+FFFF.  If (b), then it
should allow them.  The std.utf.toUTF* functions accept these codepoints for
this reason, and the behaviour of isValidDchar is by the same design:

/*******************************
 * Test if c is a valid UTF-32 character.
 *
 * \uFFFE and \uFFFF are considered valid by this function,
 * as they are permitted for internal use by an application,
 * but they are not allowed for interchange by the Unicode standard.
 *
 * Returns: true if it is, false if not.
 */

So it's not a bug in Phobos.  Just an omission - of a function to check that a
Unicode string is valid for data interchange (i.e. contains no U+FFFE or U+FFFF
codepoints as well as being otherwise valid).
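
The missing interchange check described above could be sketched roughly as follows. This is only an illustration written against a current D2 compiler: the names isScalarValue, isInterchangeForbidden and isValidForInterchange are hypothetical and not part of Phobos, and only the two code points discussed so far in this report are rejected.

```d
// Hypothetical helpers for illustration; none of these names exist in Phobos.

/// True if c is a Unicode scalar value (not a surrogate, not above U+10FFFF).
bool isScalarValue(dchar c)
{
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

/// True for the two code points this report is about.
bool isInterchangeForbidden(dchar c)
{
    return c == 0xFFFE || c == 0xFFFF;
}

/// True if every code point in text is acceptable for open interchange.
bool isValidForInterchange(const(dchar)[] text)
{
    foreach (c; text)
    {
        if (!isScalarValue(c) || isInterchangeForbidden(c))
            return false;
    }
    return true;
}

void main()
{
    // A string that uses U+FFFF as an internal sentinel.
    dchar[] withSentinel = [cast(dchar) 'a', cast(dchar) 0xFFFF];
    assert(!isValidForInterchange(withSentinel));
    assert(isValidForInterchange("plain text"d));
}
```

An application would call such a check once on data entering or leaving the program, and leave the internal conversion functions permissive.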

It certainly ought to be possible to include U+FFFE and U+FFFF in string
literals by some means or another, as such things are necessarily being put
there for internal use by the application being developed.


--

Sep 30 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357







 When that talks of "decoders", is it talking about:
 (a) decoding Unicode text from files?
 (b) translating data used internally by an application?

I'm not sure but I guess it's (a).
 
 If (a), then obviously it should reject U+FFFE and U+FFFF.  If (b), then it
 should allow them.  The std.utf.toUTF* functions accept these codepoints for
 this reason, and the behaviour of isValidDchar is by the same design:

You are right. The Phobos Unicode functions are designed not to fire an
exception when, for example, decoding a UTF-8 sequence resulting in the
codepoint U+FFFF or U+FFFE. But the problem is that the average guy/gal doesn't
have the slightest clue about the technicalities of Unicode, and so would
assume that it's perfectly fine to use those functions for normal, non-internal
purposes. So in effect programs would accept illegal input and also produce
output with illegal UTF-8 and UTF-16 sequences as well as UTF-32 strings.
 
 /*******************************
  * Test if c is a valid UTF-32 character.
  *
  * \uFFFE and \uFFFF are considered valid by this function,
  * as they are permitted for internal use by an application,
  * but they are not allowed for interchange by the Unicode standard.
  *
  * Returns: true if it is, false if not.
  */
 
 So it's not a bug in Phobos.  Just an omission - of a function to check that a
 Unicode string is valid for data interchange (i.e. contains no U+FFFE or U+FFFF
 codepoints as well as being otherwise valid).

Yes, but the encoding and decoding functions in Phobos use isValidDchar() to
verify if the character to be encoded or the character that was decoded is a
valid dchar. I'm not sure what the solution could be though. Two separate
modules maybe, one that is safe for data interchange and the other one for
internal data processing. Or perhaps add a function isEncodable() which is like
isValidDchar but excludes U+FFFE and U+FFFF. This new function should be used
to completely disallow U+FFFE and U+FFFF to be encoded as UTF-8 or UTF-16.
 
 It certainly ought to be possible to include U+FFFE and U+FFFF in string
 literals by some means or another, as such things are necessarily being put
 there for internal use by the application being developed.
 

Maybe we should not allow the programmer to use the escape sequences \uFFFE
\uFFFF, \U0000FFFF etc. Instead one could do the following as Thomas suggested:

char[] str = "\xFF\xFFasdf";
char[] str = x"FFFF""asdf"; // Adjacent strings are concatenated implicitly.


--

Oct 01 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357







 You are right.  The Phobos Unicode functions are designed not to 
 fire an exception when, for example, decoding a UTF-8 sequence 
 resulting in the codepoint U+FFFF or U+FFFE.  But the problem is 
 that the average guy/gal doesn't have the slightest clue about the 
 technicalities of Unicode, and so would assume that it's perfectly 
 fine to use those functions for normal, non-internal purposes.  So 
 in effect programs would accept illegal input and also produce 
 output with illegal UTF-8 and UTF-16 sequences as well as UTF-32 
 strings.

I think the best place to deal with this is in documentation.  The 
need to check for U+FFFF and U+FFFE exists only when processing 
input.  It would be inefficient to keep checking for these codepoints 
every time an internal conversion is performed.  It is therefore 
sensible to keep validation separate from encoding/decoding, and 
inform the library user that such validation is necessary.

 Yes, but the encoding and decoding functions in Phobos use 
 isValidDchar() to verify if the character to be encoded or the 
 character that was decoded is a valid dchar.  I'm not sure what the 
 solution could be though.  Two separate modules maybe, one that is 
 safe for data interchange and the other one for internal data 
 processing.

Or an 'internal' parameter on the translation functions.  This raises 
the question: Should this parameter be optional, and if so, what 
should the default be?
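
One possible shape for such an 'internal' switch, as a sketch only. The function name isAcceptable is hypothetical and nothing like it is claimed to exist in Phobos. The parameter is deliberately left without a default here, since the question of what the default should be is open.

```d
/// Hypothetical validity check with an 'internal' switch; not a Phobos API.
/// internal == true : U+FFFE and U+FFFF are accepted (e.g. as sentinels).
/// internal == false: they are rejected, as for data interchange.
bool isAcceptable(dchar c, bool internal)
{
    if (c >= 0xD800 && c <= 0xDFFF)
        return false;                       // surrogate code point
    if (c > 0x10FFFF)
        return false;                       // beyond the Unicode range
    if (!internal && (c == 0xFFFE || c == 0xFFFF))
        return false;                       // not allowed in interchange
    return true;
}

void main()
{
    assert(isAcceptable(cast(dchar) 0xFFFF, true));
    assert(!isAcceptable(cast(dchar) 0xFFFF, false));
    assert(isAcceptable('A', false));
}
```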

 Or perhaps add a function isEncodable() which is like isValidDchar 
 but excludes U+FFFE and U+FFFF.  This new function should be used 
 to completely disallow U+FFFE and U+FFFF to be encoded as UTF-8 or 
 UTF-16.

Uh, if we're going to use those names, ISTM the definitions should be 
the other way round.  But maybe an 'internal' parameter is the best 
solution here as well.

 It certainly ought to be possible to include U+FFFE and U+FFFF in 
 string literals by some means or another, as such things are 
 necessarily being put there for internal use by the application 
 being developed.

 
 Maybe we should not allow the programmer to use the escape 
 sequences \uFFFE \uFFFF, \U0000FFFF etc.  Instead one could do the 
 following as Thomas suggested:
 
 char[] str = "\xFF\xFFasdf";
 char[] str = x"FFFF""asdf"; // Adjacent strings are concatenated implicitly.

But U+FFFF isn't "\xFF\xFF".  It's "\xEF\xBF\xBF".
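
The byte sequence given above can be checked by applying the three-byte UTF-8 pattern by hand. The helper below is a hypothetical illustration (the name encodeThreeByte is made up for this note) and deliberately does not use std.utf, so it does not depend on what the library accepts.

```d
/// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
/// Hypothetical helper for illustration; surrogates are not rejected here.
ubyte[3] encodeThreeByte(dchar c)
{
    assert(c >= 0x0800 && c <= 0xFFFF);
    ubyte[3] b;
    b[0] = cast(ubyte)(0xE0 | (c >> 12));          // 1110xxxx
    b[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));  // 10xxxxxx
    b[2] = cast(ubyte)(0x80 | (c & 0x3F));         // 10xxxxxx
    return b;
}

void main()
{
    ubyte[3] expectedFFFF = [0xEF, 0xBF, 0xBF];
    ubyte[3] expectedFFFE = [0xEF, 0xBF, 0xBE];
    assert(encodeThreeByte(cast(dchar) 0xFFFF) == expectedFFFF);
    assert(encodeThreeByte(cast(dchar) 0xFFFE) == expectedFFFE);
}
```

The bytes 0xFF and 0xFE never appear in any UTF-8 sequence, so "\xFF\xFF" in a char[] is not an encoding of any code point.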

I guess we should have whole new escapes specifically for these codepoints.


--

Oct 01 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






Ok, the best thing we can do to solve this problem is actually read the Unicode
5.0 standard and determine what it actually has to say about this. I did read
the relevant parts of the standard and here is what I found out:

First of all, U+FFFE and U+FFFF are not the only code points that are intended
for internal use only.

Quoting from ch02.pdf page 27:

 Noncharacters. Sixty-six code points are not used to encode characters.
 Noncharacters consist of U+FDD0..U+FDEF and any code point ending in the
 value FFFE₁₆ or FFFF₁₆ - that is, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
 U+10FFFE, U+10FFFF. (See Section 16.7, Noncharacters.)

A function testing for a noncharacter could look like this:
// Returns true if d is one of the 66 Unicode noncharacters.
bool isNoncharacter(dchar d)
{
  return (0xFDD0 <= d && d <= 0xFDEF) || // 32 code points
         (d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE); // 34 code points
}
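
As a sanity check, the test can be run over the whole code space to confirm that it matches exactly the sixty-six code points the standard mentions. The sketch below repeats the function so that it compiles on its own; the helper countNoncharacters is hypothetical and exists only for this check.

```d
bool isNoncharacter(dchar d)
{
    return (0xFDD0 <= d && d <= 0xFDEF) ||              // 32 code points
           (d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE);   // 2 per plane, 17 planes
}

/// Count the code points in U+0000..U+10FFFF that the test accepts.
size_t countNoncharacters()
{
    size_t count = 0;
    foreach (uint cp; 0 .. 0x110000)
    {
        if (isNoncharacter(cast(dchar) cp))
            ++count;
    }
    return count;
}

void main()
{
    assert(countNoncharacters() == 66);
}
```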

Let us read a bit further. Quoting from ch02.pdf page 28:

 • Noncharacter code points are reserved for internal use, such as for
   sentinel values. They should never be interchanged. They do, however, have
   well-formed representations in Unicode encoding forms and survive
   conversions between encoding forms. This allows sentinel values to be
   preserved internally across Unicode encoding forms, even though they are
   not designed to be used in open interchange.

So it says that noncharacters can be encoded in UTF-8 and UTF-16. This is good
news, because this tells us that escape sequences not higher than U+10FFFF and
which are not surrogate code points (U+D800 - U+DFFF) can be encoded as UTF-8
or UTF-16. Therefore I think we should allow programmers to define such escape
sequences, even if they are noncharacters.
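
To illustrate that noncharacters survive conversion between encoding forms, here is a hand-written round trip through UTF-16 for the supplementary noncharacters. Both helper names are hypothetical and the code avoids std.utf on purpose, so it makes no claim about what Phobos accepts.

```d
/// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
/// surrogate pair. Hypothetical helper for illustration.
wchar[2] toSurrogatePair(dchar c)
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    uint v = c - 0x10000;                         // 20-bit value
    wchar[2] pair;
    pair[0] = cast(wchar)(0xD800 + (v >> 10));    // high surrogate: top 10 bits
    pair[1] = cast(wchar)(0xDC00 + (v & 0x3FF));  // low surrogate: bottom 10 bits
    return pair;
}

/// Recombine a surrogate pair into the code point it encodes.
dchar fromSurrogatePair(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

void main()
{
    // U+10FFFF is a noncharacter, yet it has a well-formed UTF-16 form.
    wchar[2] p = toSurrogatePair(cast(dchar) 0x10FFFF);
    assert(p[0] == 0xDBFF && p[1] == 0xDFFF);
    assert(fromSurrogatePair(p[0], p[1]) == 0x10FFFF);
}
```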

The next problem we need to think about is, what to do with noncharacters if
they appear as encoded characters in UTF-8 or UTF-16 source text or as code
points in UTF-32 source text.

The Unicode standard says in ch16.pdf at page 549:

 Applications are free to use any of these noncharacter code points internally
 but should never attempt to exchange them. If a noncharacter is received in
 open interchange, an application is not required to interpret it in any way.
 It is good practice, however, to recognize it as a noncharacter and to take
 appropriate action, such as removing it from the text. Note that Unicode
 conformance freely allows the removal of these characters. (See conformance
 clause C7 in Section 3.2, Conformance Requirements.)

I guess Walter has to decide what a D lexer should do in case it encounters a
noncharacter in the source text. My suggestion would be to ignore noncharacters
in favour of a faster lexer (although probably not many people are going to
stuff their source text with unialpha identifiers and comments/strings with
Unicode characters.)


--

Oct 02 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






Testing for being under 0x10FFFF is redundant. dchar.max already is 0x10FFFF:
Unicode neither contains nor allows any value larger than that. All of Unicode
can be represented as UTF-32, UTF-16, or UTF-8, with the exception of the
UTF-16 surrogate code points, which are only allowed in UTF-16.


--

Oct 02 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357







 I guess Walter has to decide what a D lexer should do in case it 
 encounters a noncharacter in the source text.  My suggestion would 
 be to ignore noncharacters in favour of a faster lexer (although 
 probably not many people are going to stuff their source text with 
 unialpha identifiers and comments/strings with Unicode characters.)

That's a little off-topic to this issue.  Handling of actual non-characters in
the source code is a quite different matter from handling of escaped
representations of non-characters.


 Testing for being under 0x10FFFF is redundant. dchar.max already is 0x10FFFF:

That doesn't follow.  It's perfectly possible for values greater than 0x10FFFF
to find their way into a file or a piece of memory intended to contain UTF-32
text.  .max doesn't constrain the contents of memory in any way.
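
A small sketch of that point, assuming a current D2 compiler: a dchar occupies 32 bits, so a cast from raw data can leave it holding a value above dchar.max, and an explicit range test is the only thing that catches it. The helper inUnicodeRange is a hypothetical name used for this illustration.

```d
/// Hypothetical range test; this is the comparison that dchar.max alone
/// does not make redundant.
bool inUnicodeRange(dchar c)
{
    return c <= 0x10FFFF;
}

void main()
{
    uint raw = 0x110000;          // e.g. a corrupt value read from a file
    dchar c = cast(dchar) raw;    // the cast performs no range check

    assert(dchar.max == 0x10FFFF);
    assert(c > dchar.max);        // the variable holds an out-of-range value
    assert(!inUnicodeRange(c));
}
```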


--

Oct 02 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






You're basically right, that's just my attitude towards types: if it can be
outside the [type.min,type.max] range it shouldn't be stored in type. It's like
storing 119 in a bool just because it's a byte and not a bit of data. You can
do it, but you shouldn't. If there's a possibility that the data is malformed,
you should store it in a meaning-agnostic type like ubyte/uint.

Much of the problem is D's character types, which really should be called
something like "utf8", "utf16", and "utf32". It annoys me to no end that the C
standard library purportedly understands something about UTF-8: the C string
type should be ubyte*, not char*. But that's just me.

Regarding std.utf correctness, see also Issue 978.


--

Oct 03 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357







 You're basically right, that's just my attitude towards types: if 
 it can be outside the [type.min,type.max] range it shouldn't be 
 stored in type.  It's like storing 119 in a bool just because it's 
 a byte and not a bit of data.  You can do it, but you shouldn't.  
 If there's a possibility that the data is malformed, you should 
 store it in a meaning-agnostic type like ubyte/uint.

True up to a point.  But out-of-range data could just as easily be due to a bug
in the program - it makes little sense to use a meaning-agnostic type just to
steer clear of this possibility.  Half the point of the UTF validation
functions is to check for bugs.

 Much of the problem is D's character types, which really should be 
 called something like "utf8", "utf16", and "utf32".  It annoys me 
 to no end that the C standard library purportedly understands 
 something about UTF-8: the C string type should be ubyte*, not 
 char*.  But that's just me.

If we're going to change this, toStringz should return a ubyte* as well.


--

Oct 03 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357








 Much of the problem is D's character types, which really should be 
 called something like "utf8", "utf16", and "utf32".  It annoys me 
 to no end that the C standard library purportedly understands 
 something about UTF-8: the C string type should be ubyte*, not 
 char*.  But that's just me.

 
 If we're going to change this, toStringz should return a ubyte* as well.

There's very little chance that such a change will occur. To make it useful,
char (or preferably 'utf8') should implicitly cast to ubyte, or using e.g.
string literals would be a pain. Many programs and libraries, in particular
Phobos and Tango, would also have to make a lot of changes just to compile.
Plus, it'd be another inconsistency between C and D: C 'char' would map to D
'ubyte'.

toStringz's return type would be only one of 100+ changes required.


--

Oct 03 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357







 If we're going to change this, toStringz should return a ubyte* as well.

 
 There's very little chance that such a change will occur. To make it useful,
 char (or preferably 'utf8') should implicitly cast to ubyte, or using e.g.
 string literals would be a pain.

Only when trying to call C functions.  But even then, we wouldn't need to go as
far as that.  Just add ubyte* to the list of types that a string literal can
serve as.

 Many programs and libraries, in particular
 Phobos and Tango, would also have to make a lot of changes just to compile.
 Plus, it'd be another inconsistency between C and D: C 'char' would map to D
 'ubyte'.

Which, as I read it, is what you were already proposing in comment 12.

But is it really an inconsistency?  Really, all that's happened is that C's
signed char has been renamed as byte, and C's unsigned char as ubyte.  It's no
more inconsistent than unsigned int being renamed uint, and long long being
renamed long.

The names 'byte' and 'ubyte' better reflect how C's char types tend to be used:
- as a code unit in an arbitrary 8-bit character encoding
- to hold a byte-sized integer value of arbitrary semantics (though APIs that
do this often define an alias of char to make this clearer)
which is more or less how D programmers are using byte/ubyte, and how ISTM you
think they should be used.


--

Oct 03 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357








 If we're going to change this, toStringz should return a ubyte* as well.

 
 There's very little chance that such a change will occur. To make it useful,
 char (or preferably 'utf8') should implicitly cast to ubyte, or using e.g.
 string literals would be a pain.

 
 Only when trying to call C functions.  But even then, we wouldn't need to go as
 far as that.  Just add ubyte* to the list of types that a string literal can
 serve as.

It'd have to go beyond just string literals:

string foo = "asdf";
int i = strlen(foo.ptr);

Bad example, I know (who needs strlen?), but the above should work without
having to cast foo.ptr from char* (or invariant(char)* if that's what it is in
2.0) to ubyte*. Passing through toStringz at every call may not be an option.

 But is it really an inconsistency?  Really, all that's happened is that C's
 signed char has been renamed as byte, and C's unsigned char as ubyte.  It's no
 more inconsistent than unsigned int being renamed uint, and long long being
 renamed long.

No, not really, but Walter seems to think it important that code that looks
like C should work like it does in C. I agree with that sentiment to a point,
and thus minimizing such inconsistencies is a good idea. In this case, however,
I'd rather have the inconsistency.

 The names 'byte' and 'ubyte' better reflect how C's char types tend to be used:
 - as a code unit in an arbitrary 8-bit character encoding
 - to hold a byte-sized integer value of arbitrary semantics (though APIs that
 do this often define an alias of char to make this clearer)
 which is more or less how D programmers are using byte/ubyte, and how ISTM you
 think they should be used.

I agree.


--

Oct 04 2007

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=1357






This is getting very off-topic for this bug report.  I'll start a thread on
digitalmars.D where we could continue the discussion.


--

Oct 04 2007
