digitalmars.D.bugs - [Issue 1357] New: Cannot use FFFF and FFFE in Unicode escape sequences.
http://d.puremagic.com/issues/show_bug.cgi?id=1357

           Summary: Cannot use FFFF and FFFE in Unicode escape sequences.
           Product: D
           Version: 1.017
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: aziz.kerim gmail.com

Escape sequences \uFFFF, \uFFFE, \U0000FFFF, \U0000FFFE are deemed invalid by
the compiler.

--
Jul 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #1 from shro8822 vandals.uidaho.edu  2007-07-20 16:02 -------
I think they are invalid as Unicode. I'd have to check.

--
Jul 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #2 from aziz.kerim gmail.com  2007-07-20 16:08 -------
Sorry, I should have mentioned that these codepoints are valid because they
are specifically allowed for internal use by the Unicode Standard. I think
Walter already knows this. Phobos has an isValidDchar() function which
returns true for \uFFFE and \uFFFF.

--
Jul 20 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #3 from thomas-dloop kuehne.cn  2007-07-23 14:57 -------
0xFFFE and 0xFFFF are - unlike the private use blocks - not for normal
private use. The only place they are allowed is inside a text processing
system; they should never be used for data exchange between programs. As such
it would be a reasonable safety measure to disallow \uFFFF and \uFFFE
literals and force the use of e.g. \xFF \xFE or x"FFFE".

--
Jul 23 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

smjg iname.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smjg iname.com

------- Comment #4 from smjg iname.com  2007-07-25 20:27 -------
\x is for UTF-8 fragments, not for splitting UTF-16 into bytes.

--
Jul 25 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #5 from aziz.kerim gmail.com  2007-09-30 15:07 -------
As I wrote my own encoding/decoding functions for Unicode characters, I found
out that certain Unicode codepoints are not allowed to be encoded as UTF-8
sequences. I'm quoting from here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

"Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as
well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8
decoders should treat them like malformed or overlong sequences for safety
reasons."

So the behaviour of the compiler is actually correct, and Phobos has a bug.

--
Sep 30 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #6 from smjg iname.com  2007-09-30 16:31 -------
When that talks of "decoders", is it talking about:
(a) decoding Unicode text from files?
(b) translating data used internally by an application?

If (a), then obviously it should reject U+FFFE and U+FFFF. If (b), then it
should allow them. The std.utf.toUTF* functions accept these codepoints for
this reason, and the behaviour of isValidDchar is by the same design:

/*******************************
 * Test if c is a valid UTF-32 character.
 *
 * \uFFFE and \uFFFF are considered valid by this function,
 * as they are permitted for internal use by an application,
 * but they are not allowed for interchange by the Unicode standard.
 *
 * Returns: true if it is, false if not.
 */

So it's not a bug in Phobos, just an omission - of a function to check that a
Unicode string is valid for data interchange (i.e. contains no U+FFFE or
U+FFFF codepoints as well as being otherwise valid).

It certainly ought to be possible to include U+FFFE and U+FFFF in string
literals by some means or another, as such things are necessarily being put
there for internal use by the application being developed.

--
Sep 30 2007
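[Editor's note: the "omission" comment #6 describes could be sketched roughly
as below. This is a hypothetical helper, not part of Phobos; the name
isValidForInterchange is made up here, and it assumes std.utf.decode throwing
a UtfException on malformed input.]

```d
import std.utf; // decode() advances an index and throws on malformed UTF-8

/// Hypothetical helper (not in Phobos): true if s is well-formed UTF-8
/// and contains no codepoints disallowed for open interchange.
bool isValidForInterchange(char[] s)
{
    size_t i = 0;
    while (i < s.length)
    {
        dchar c;
        try
        {
            c = std.utf.decode(s, i);
        }
        catch (UtfException e)
        {
            return false; // malformed UTF-8 sequence
        }
        if (c == 0xFFFE || c == 0xFFFF)
            return false; // valid internally, but not for interchange
    }
    return true;
}
```

Such a function would sit alongside std.utf.validate, which only checks
well-formedness, not interchange suitability.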
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #7 from aziz.kerim gmail.com  2007-10-01 04:53 -------
(In reply to comment #6)
> When that talks of "decoders", is it talking about:
> (a) decoding Unicode text from files?
> (b) translating data used internally by an application?

I'm not sure, but I guess it's (a).

> If (a), then obviously it should reject U+FFFE and U+FFFF. If (b), then it
> should allow them. The std.utf.toUTF* functions accept these codepoints for
> this reason, and the behaviour of isValidDchar is by the same design:

You are right. The Phobos Unicode functions are designed not to throw an
exception when, for example, decoding a UTF-8 sequence results in the
codepoint U+FFFF or U+FFFE. But the problem is that the average guy/gal
doesn't have the slightest clue about the technicalities of Unicode, and so
would assume that it's perfectly fine to use those functions for normal,
non-internal purposes. So in effect programs would accept illegal input and
also produce output with illegal UTF-8 and UTF-16 sequences as well as UTF-32
strings.

> So it's not a bug in Phobos, just an omission - of a function to check that
> a Unicode string is valid for data interchange (i.e. contains no U+FFFE or
> U+FFFF codepoints as well as being otherwise valid).

Yes, but the encoding and decoding functions in Phobos use isValidDchar() to
verify that the character to be encoded or the character that was decoded is
a valid dchar. I'm not sure what the solution could be, though. Two separate
modules maybe: one that is safe for data interchange and one for internal
data processing. Or perhaps add a function isEncodable() which is like
isValidDchar but excludes U+FFFE and U+FFFF. This new function could be used
to completely disallow U+FFFE and U+FFFF from being encoded as UTF-8 or
UTF-16.

> It certainly ought to be possible to include U+FFFE and U+FFFF in string
> literals by some means or another, as such things are necessarily being put
> there for internal use by the application being developed.

Maybe we should not allow the programmer to use the escape sequences \uFFFE,
\uFFFF, \U0000FFFF etc. Instead one could do the following, as Thomas
suggested:

char[] str = "\xFF\xFFasdf";
char[] str = x"FFFF""asdf"; // Adjacent strings are concatenated implicitly.

--
Oct 01 2007
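[Editor's note: the isEncodable() function comment #7 proposes could be as
small as the sketch below. The name and behaviour are hypothetical - this is
not in Phobos - and it leans on the existing std.utf.isValidDchar.]

```d
import std.utf; // for isValidDchar

/// Hypothetical: like isValidDchar, but additionally rejects the two
/// codepoints that the compiler currently refuses in escape sequences.
bool isEncodable(dchar d)
{
    return isValidDchar(d) && d != 0xFFFE && d != 0xFFFF;
}
```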
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #8 from smjg iname.com  2007-10-01 06:24 -------
(In reply to comment #7)
> You are right. The Phobos Unicode functions are designed not to throw an
> exception when, for example, decoding a UTF-8 sequence results in the
> codepoint U+FFFF or U+FFFE. [...]

I think the best place to deal with this is in documentation. The need to
check for U+FFFF and U+FFFE exists only when processing input. It would be
inefficient to keep checking for these codepoints every time an internal
conversion is performed. It is therefore sensible to keep validation separate
from encoding/decoding, and to inform the library user that such validation
is necessary.

> Yes, but the encoding and decoding functions in Phobos use isValidDchar()
> to verify that the character to be encoded or the character that was
> decoded is a valid dchar. I'm not sure what the solution could be, though.
> Two separate modules maybe: one that is safe for data interchange and one
> for internal data processing.

Or an 'internal' parameter on the translation functions. This raises the
question: should this parameter be optional, and if so, what should the
default be?

> Or perhaps add a function isEncodable() which is like isValidDchar but
> excludes U+FFFE and U+FFFF.

Uh, if we're going to use those names, ISTM the definitions should be the
other way round. But maybe an 'internal' parameter is the best solution here
as well.

> Maybe we should not allow the programmer to use the escape sequences
> \uFFFE, \uFFFF, \U0000FFFF etc. Instead one could do the following, as
> Thomas suggested:
>
> char[] str = "\xFF\xFFasdf";
> char[] str = x"FFFF""asdf"; // Adjacent strings are concatenated implicitly.

But U+FFFF isn't "\xFF\xFF". It's "\xEF\xBF\xBF". I guess we should have
whole new escapes specifically for these codepoints.

--
Oct 01 2007
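[Editor's note: comment #8's byte values are easy to verify by hand. U+FFFF
falls in the three-byte UTF-8 range (U+0800..U+FFFF), so it encodes as the
bit pattern 1110xxxx 10xxxxxx 10xxxxxx. A minimal sketch of that encoding
step, with a made-up helper name:]

```d
/// Encode a codepoint from the three-byte UTF-8 range (U+0800 .. U+FFFF).
char[] encode3(dchar c)
{
    char[] r = new char[3];
    r[0] = cast(char)(0xE0 | (c >> 12));          // 1110xxxx: top 4 bits
    r[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));  // 10xxxxxx: middle 6 bits
    r[2] = cast(char)(0x80 | (c & 0x3F));         // 10xxxxxx: low 6 bits
    return r;
}
```

For c = 0xFFFF this yields the bytes EF BF BF, which is why "\xFF\xFF" is not
a UTF-8 representation of U+FFFF.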
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #9 from aziz.kerim gmail.com  2007-10-02 07:47 -------
Ok, the best thing we can do to solve this problem is to actually read the
Unicode 5.0 standard and determine what it has to say about this. I did read
the relevant parts of the standard, and here is what I found out:

First of all, U+FFFE and U+FFFF are not the only code points that are
intended for internal use only. Quoting from ch02.pdf, page 27:

"Noncharacters. Sixty-six code points are not used to encode characters.
Noncharacters consist of U+FDD0..U+FDEF and any code point ending in the
value FFFE or FFFF - that is, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE,
U+10FFFF. (See Section 16.7, Noncharacters.)"

A function testing for a noncharacter could look like this:

bool isNoncharacter(dchar d)
{
    return 0xFDD0 <= d && d <= 0xFDEF ||            // 32 code points
           d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE; // 34 code points
}

Let us read a bit further. Quoting from ch02.pdf, page 28:

"Noncharacter code points are reserved for internal use, such as for sentinel
values. They should never be interchanged. They do, however, have well-formed
representations in Unicode encoding forms and survive conversions between
encoding forms. This allows sentinel values to be preserved internally across
Unicode encoding forms, even though they are not designed to be used in open
interchange."

So it says that noncharacters can be encoded in UTF-8 and UTF-16. This is
good news, because it tells us that escape sequences not higher than U+10FFFF
and which are not surrogate code points (U+D800 - U+DFFF) can be encoded as
UTF-8 or UTF-16. Therefore I think we should allow programmers to define such
escape sequences, even if they are noncharacters.

The next problem we need to think about is what to do with noncharacters if
they appear as encoded characters in UTF-8 or UTF-16 source text, or as code
points in UTF-32 source text. The Unicode standard says in ch16.pdf, page 549:

"Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them. If a noncharacter is
received in open interchange, an application is not required to interpret it
in any way. It is good practice, however, to recognize it as a noncharacter
and to take appropriate action, such as removing it from the text. Note that
Unicode conformance freely allows the removal of these characters. (See
conformance clause C7 in Section 3.2, Conformance Requirements.)"

I guess Walter has to decide what a D lexer should do in case it encounters a
noncharacter in the source text. My suggestion would be to ignore
noncharacters in favour of a faster lexer (although probably not many people
are going to stuff their source text with unialpha identifiers and
comments/strings with Unicode characters).

--
Oct 02 2007
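[Editor's note: comment #9's isNoncharacter can be sanity-checked against the
standard's own count. Scanning the whole codespace should find exactly 66
noncharacters: 32 in U+FDD0..U+FDEF, plus the last two codepoints of each of
the 17 planes. The countNoncharacters helper below is added here purely for
illustration.]

```d
// Comment #9's predicate, reproduced verbatim.
bool isNoncharacter(dchar d)
{
    return 0xFDD0 <= d && d <= 0xFDEF ||            // 32 code points
           d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE; // 34 code points
}

// Hypothetical check: count noncharacters over the entire codespace.
int countNoncharacters()
{
    int n = 0;
    for (uint c = 0; c <= 0x10FFFF; c++)
    {
        if (isNoncharacter(cast(dchar)c))
            n++;
    }
    return n; // expected: 32 + 17 * 2 = 66
}
```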
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #10 from matti.niemenmaa+dbugzilla iki.fi  2007-10-02 08:57 -------
Testing for being under 0x10FFFF is redundant. dchar.max already is 0x10FFFF:
Unicode neither contains nor allows any value larger than that.

All of Unicode can be represented as UTF-32, UTF-16, or UTF-8, with the
exception of the UTF-16 surrogate code points, which are only allowed in
UTF-16.

--
Oct 02 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #11 from smjg iname.com  2007-10-02 12:48 -------
(In reply to comment #9)
> I guess Walter has to decide what a D lexer should do in case it encounters
> a noncharacter in the source text. My suggestion would be to ignore
> noncharacters in favour of a faster lexer [...]

That's a little off-topic to this issue. Handling of actual noncharacters in
the source code is quite a different matter from handling of escaped
representations of noncharacters.

(In reply to comment #10)
> Testing for being under 0x10FFFF is redundant. dchar.max already is
> 0x10FFFF.

That doesn't follow. It's perfectly possible for values greater than 0x10FFFF
to find their way into a file or a piece of memory intended to contain UTF-32
text. .max doesn't constrain the contents of memory in any way.

--
Oct 02 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #12 from matti.niemenmaa+dbugzilla iki.fi  2007-10-03 08:04 -------
You're basically right; that's just my attitude towards types: if a value can
be outside the [type.min, type.max] range, it shouldn't be stored in that
type. It's like storing 119 in a bool just because it's a byte and not a bit
of data. You can do it, but you shouldn't. If there's a possibility that the
data is malformed, you should store it in a meaning-agnostic type like
ubyte/uint.

Much of the problem is D's character types, which really should be called
something like "utf8", "utf16", and "utf32". It annoys me to no end that the
C standard library purportedly understands something about UTF-8: the C
string type should be ubyte*, not char*. But that's just me.

Regarding std.utf correctness, see also Issue 978.

--
Oct 03 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #13 from smjg iname.com  2007-10-03 08:57 -------
(In reply to comment #12)
> You're basically right; that's just my attitude towards types: if a value
> can be outside the [type.min, type.max] range, it shouldn't be stored in
> that type. [...] If there's a possibility that the data is malformed, you
> should store it in a meaning-agnostic type like ubyte/uint.

True up to a point. But out-of-range data could just as easily be due to a
bug in the program - it makes little sense to use a meaning-agnostic type
just to steer clear of this possibility. Half the point of the UTF validation
functions is to check for bugs.

> Much of the problem is D's character types, which really should be called
> something like "utf8", "utf16", and "utf32". It annoys me to no end that
> the C standard library purportedly understands something about UTF-8: the C
> string type should be ubyte*, not char*. But that's just me.

If we're going to change this, toStringz should return a ubyte* as well.

--
Oct 03 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #14 from matti.niemenmaa+dbugzilla iki.fi  2007-10-03 11:43 -------
(In reply to comment #13)
> If we're going to change this, toStringz should return a ubyte* as well.

There's very little chance that such a change will occur. To make it useful,
char (or preferably 'utf8') would have to implicitly cast to ubyte, or using
e.g. string literals would be a pain. Many programs and libraries, in
particular Phobos and Tango, would also have to make a lot of changes just to
compile. Plus, it'd be another inconsistency between C and D: C 'char' would
map to D 'ubyte'. toStringz's return type would be only one of 100+ changes
required.

--
Oct 03 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #15 from smjg iname.com  2007-10-03 16:44 -------
(In reply to comment #14)
> There's very little chance that such a change will occur. To make it
> useful, char (or preferably 'utf8') would have to implicitly cast to ubyte,
> or using e.g. string literals would be a pain.

Only when trying to call C functions. But even then, we wouldn't need to go
as far as that. Just add ubyte* to the list of types that a string literal
can serve as.

> Many programs and libraries, in particular Phobos and Tango, would also
> have to make a lot of changes just to compile. Plus, it'd be another
> inconsistency between C and D: C 'char' would map to D 'ubyte'.

As I read from comment 12, that's what you were already proposing. But is it
really an inconsistency? Really, all that's happened is that C's signed char
has been renamed as byte, and C's unsigned char as ubyte. It's no more
inconsistent than unsigned int being renamed uint, and long long being
renamed long.

The names 'byte' and 'ubyte' better reflect how C's char types tend to be
used:
- as a code unit in an arbitrary 8-bit character encoding
- to hold a byte-sized integer value of arbitrary semantics (though APIs that
  do this often define an alias of char to make this clearer)
which is more or less how D programmers are using byte/ubyte, and how ISTM
you think they should be used.

--
Oct 03 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #16 from matti.niemenmaa+dbugzilla iki.fi  2007-10-04 03:27 -------
(In reply to comment #15)
> Only when trying to call C functions. But even then, we wouldn't need to go
> as far as that. Just add ubyte* to the list of types that a string literal
> can serve as.

It'd have to go beyond just string literals:

string foo = "asdf";
int i = strlen(foo.ptr);

Bad example, I know (who needs strlen?), but the above should work without
having to cast foo.ptr from char* (or invariant(char)* if that's what it is
in 2.0) to ubyte*. Passing through toStringz at every call may not be an
option.

> But is it really an inconsistency? Really, all that's happened is that C's
> signed char has been renamed as byte, and C's unsigned char as ubyte. It's
> no more inconsistent than unsigned int being renamed uint, and long long
> being renamed long.

No, not really, but Walter seems to think it important that code that looks
like C should work like it does in C. I agree with that sentiment to a point,
and thus minimizing such inconsistencies is a good idea. In this case,
however, I'd rather have the inconsistency.

> The names 'byte' and 'ubyte' better reflect how C's char types tend to be
> used [...]

I agree.

--
Oct 04 2007
http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #17 from smjg iname.com  2007-10-04 07:38 -------
This is getting very off-topic for this bug report. I'll start a thread on
digitalmars.D where we could continue the discussion.

--
Oct 04 2007