digitalmars.D - Handling of U+2028 and U+2029 in source code
- kdevel (17/17) Oct 15 2023 According to [1] U+2028 and U+2029 are considered end-of-line
- Richard (Rikki) Andrew Cattermole (2/2) Oct 15 2023 Based upon how the identifier tokenization occurs greedily, yes it makes...
- kdevel (19/21) Oct 16 2023 The error message is confusing, compare with this code, using
- Richard (Rikki) Andrew Cattermole (4/4) Oct 16 2023 https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
- Walter Bright (2/7) Oct 17 2023 Yes, please file a bugzilla!
- Richard (Rikki) Andrew Cattermole (1/1) Oct 17 2023 https://issues.dlang.org/show_bug.cgi?id=24190
- deadalnix (6/10) Oct 17 2023 I've noticed that in the past, but this is clearly wrong. It's
- Walter Bright (2/3) Oct 17 2023 I blame my parents.
According to [1] U+2028 and U+2029 are considered end-of-line characters. Does this make sense?

```d
$ cat lsps.d
void main ()
{
   enum b = 8;
   mixin ("enum a1 =\u2028b; pragma (msg, a1);");
   mixin ("enum a2\u2028= b; pragma (msg, a2);");
   mixin ("enum\u2028a3 = b; pragma (msg, a3);");
}
$ dmd lsps.d
8
lsps.d-mixin-5(5): Error: char 0x2028 not allowed in identifier
lsps.d-mixin-6(6): Error: char 0x2028 not allowed in identifier
```

[1] https://dlang.org/spec/lex.html#end_of_line
Oct 15 2023
Based upon how the identifier tokenization occurs greedily, yes it makes sense.
Oct 15 2023
On Monday, 16 October 2023 at 02:45:17 UTC, Richard (Rikki) Andrew Cattermole wrote:
> Based upon how the identifier tokenization occurs greedily, yes it makes sense.

The error message is confusing. Compare with this code, which uses carriage returns:

```
$ cat ret.d
void main ()
{
   enum b = 8;
   mixin ("enum a1 =\rb; pragma (msg, a1);");
   mixin ("enum a2\r= b; pragma (msg, a2);");
   mixin ("enum\ra3 = b; pragma (msg, a3);");
}
$ dmd ret.d
8
8
8
```

Why are U+2028 and U+2029 handled unlike \r and \n?
Oct 16 2023
https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

Basically, when the lexer is at a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.
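To make the missing step concrete, here is a minimal sketch of what recognising every end-of-line form from the spec could look like on raw UTF-8 input. This is not dmd's code: `endOfLineLength` is a hypothetical helper invented for illustration, and the real lexer is structured differently.

```d
/// Hypothetical helper (not part of dmd): returns the number of bytes taken
/// by an end-of-line sequence starting at src[i], or 0 if there is none.
size_t endOfLineLength(const(char)[] src, size_t i) pure nothrow @nogc @safe
{
    if (i >= src.length)
        return 0;

    const c = src[i];
    if (c == '\n')
        return 1;
    if (c == '\r')
    {
        // Treat \r\n as a single end of line.
        return (i + 1 < src.length && src[i + 1] == '\n') ? 2 : 1;
    }
    // U+2028 is E2 80 A8 and U+2029 is E2 80 A9 in UTF-8.
    if (c == 0xE2 && i + 2 < src.length && src[i + 1] == 0x80
        && (src[i + 2] == 0xA8 || src[i + 2] == 0xA9))
        return 3;

    return 0;
}

unittest
{
    assert(endOfLineLength("a\u2028b", 1) == 3);
    assert(endOfLineLength("a\u2029b", 1) == 3);
    assert(endOfLineLength("a\r\nb", 1) == 2);
    assert(endOfLineLength("ab", 0) == 0);
}
```

It can be tried with `dmd -main -unittest -run` on a file containing the snippet. The idea is only that a check like this would have to run before the identifier path sees the multi-byte character.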
Oct 16 2023
On 10/16/2023 5:37 PM, Richard (Rikki) Andrew Cattermole wrote:
> https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
>
> Basically, when the lexer is at a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.

Yes, please file a bugzilla!
Oct 17 2023
https://issues.dlang.org/show_bug.cgi?id=24190
Oct 17 2023
On Tuesday, 17 October 2023 at 00:37:41 UTC, Richard (Rikki) Andrew Cattermole wrote:
> https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
>
> Basically, when the lexer is at a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.

I've noticed that in the past, but this is clearly wrong. It's not just whitespace; it's also punctuation, emoji, and a ton of other stuff that simply does not belong in identifiers. The lexer should match the proper character set for the start of an identifier.
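As a rough illustration of that idea, the sketch below classifies a non-ASCII code point before committing to the identifier path. Everything here is an assumption made for the example: `NonAsciiKind` and `classifyNonAscii` are hypothetical names, and `std.uni.isAlpha` is only a stand-in for whatever identifier table the compiler actually uses.

```d
import std.uni : isAlpha;

/// Hypothetical classification of a non-ASCII code point met by a lexer.
enum NonAsciiKind { endOfLine, identifierStart, invalid }

/// Hypothetical sketch, not dmd's implementation.
NonAsciiKind classifyNonAscii(dchar c) pure nothrow @nogc @safe
{
    // The spec lists U+2028 and U+2029 as end-of-line characters.
    if (c == '\u2028' || c == '\u2029')
        return NonAsciiKind.endOfLine;

    // Assumption: alphabetic code points may start an identifier.
    if (isAlpha(c))
        return NonAsciiKind.identifierStart;

    // Punctuation, emoji and the like are rejected up front.
    return NonAsciiKind.invalid;
}

unittest
{
    assert(classifyNonAscii('\u2028') == NonAsciiKind.endOfLine);
    assert(classifyNonAscii('\u00E9') == NonAsciiKind.identifierStart); // é
    assert(classifyNonAscii('\U0001F600') == NonAsciiKind.invalid);     // emoji
}
```

With a split like this, the "not allowed in identifier" diagnostic would only be reached for code points that are neither line separators nor identifier characters.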
Oct 17 2023
On 10/17/2023 4:18 PM, deadalnix wrote:
> this is clearly wrong.

I blame my parents.
Oct 17 2023