
digitalmars.D - Handling of U+2028 and U+2029 in source code

reply kdevel <kdevel vogtner.de> writes:
According to [1] U+2028 and U+2029 are considered end-of-line 
characters. Does this make sense?

```d
$ cat lsps.d
void main ()
{
    enum b = 8;
    mixin ("enum a1 =\u2028b; pragma (msg, a1);");
    mixin ("enum a2\u2028= b; pragma (msg, a2);");
    mixin ("enum\u2028a3 = b; pragma (msg, a3);");
}
$ dmd lsps.d
8
lsps.d-mixin-5(5): Error: char 0x2028 not allowed in identifier
lsps.d-mixin-6(6): Error: char 0x2028 not allowed in identifier
```

[1] https://dlang.org/spec/lex.html#end_of_line
Oct 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Based upon how the identifier tokenization occurs greedily, yes it makes 
sense.
Oct 15 2023
parent reply kdevel <kdevel vogtner.de> writes:
On Monday, 16 October 2023 at 02:45:17 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 Based upon how the identifier tokenization occurs greedily, yes 
 it makes sense.
The error message is confusing, compare with this code, using 
returns:

```
$ cat ret.d
void main ()
{
    enum b = 8;
    mixin ("enum a1 =\rb; pragma (msg, a1);");
    mixin ("enum a2\r= b; pragma (msg, a2);");
    mixin ("enum\ra3 = b; pragma (msg, a3);");
}
$ dmd ret.d
8
8
8
```

Why are U+2028 and U+2029 handled unlike \r and \n?
Oct 16 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

Basically, when it's in a multi-byte UTF-8 character, it checks whether 
it's in the non-ASCII character ranges. No special handling of newlines 
is provided, but there probably should be.
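
To illustrate the ordering issue, here is a minimal sketch, not dmd's actual lexer code. The `Kind` enum and the `classify` function are hypothetical names invented for this example, and the "any non-ASCII code point is an identifier character" rule is an assumed simplification of the behaviour described above. The point it shows is that testing for U+2028 and U+2029 right after decoding keeps them from ever reaching the non-ASCII fallback.

```d
import std.utf : decode;

/// Hypothetical token classes for this sketch; these are not dmd's names.
enum Kind { endOfLine, identifierChar, other }

/// Decode the code point at `src[index]`, advance `index` past it,
/// and classify it. End-of-line characters are tested before the
/// non-ASCII fallback so that U+2028 and U+2029 never reach it.
Kind classify(string src, ref size_t index)
{
    const dchar c = decode(src, index);

    if (c == '\r' || c == '\n' || c == '\u2028' || c == '\u2029')
        return Kind.endOfLine;

    // Assumed simplification of the behaviour described above:
    // any other non-ASCII code point counts as part of an identifier.
    if (c >= 0x80)
        return Kind.identifierChar;

    const isAsciiIdent = c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9');
    return isAsciiIdent ? Kind.identifierChar : Kind.other;
}

unittest
{
    size_t i = 0;
    assert(classify("\u2028b", i) == Kind.endOfLine);
    assert(i == 3); // U+2028 occupies three bytes in UTF-8
    assert(classify("\u2028b", i) == Kind.identifierChar);
}
```

With a check like this in place, the three mixins from the first post would presumably behave like their `\r` counterparts.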
Oct 16 2023
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/16/2023 5:37 PM, Richard (Rikki) Andrew Cattermole wrote:
 https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
 
 Basically, when it's in a multi-byte UTF-8 character, it checks whether it's in 
 the non-ASCII character ranges. No special handling of newlines is provided, but 
 there probably should be.
Yes, please file a bugzilla!
Oct 17 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
https://issues.dlang.org/show_bug.cgi?id=24190
Oct 17 2023
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Tuesday, 17 October 2023 at 00:37:41 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

 Basically, when it's in a multi-byte UTF-8 character, it checks 
 whether it's in the non-ASCII character ranges. No special 
 handling of newlines is provided, but there probably should be.
I've noticed that in the past, but this is clearly wrong. It's not just whitespace; it's also punctuation, emoji, and a ton of other stuff that just doesn't belong in identifiers. The lexer should match against the proper charset at the start of a character.
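
As a rough illustration of matching a proper charset, here is a small sketch that uses `std.uni.isAlpha` as a stand-in for the identifier-start set. The helper name `isIdentifierStart` is hypothetical, and `isAlpha` is only an approximation chosen for this example: the set of non-ASCII characters the D specification actually allows in identifiers is defined separately and may differ.

```d
import std.uni : isAlpha;

/// Hypothetical helper: may the code point `c` begin an identifier?
/// Approximation only: Unicode alphabetic code points plus underscore.
bool isIdentifierStart(dchar c)
{
    return c == '_' || isAlpha(c);
}

unittest
{
    assert(isIdentifierStart('a'));
    assert(isIdentifierStart('\u00E9'));      // Latin small letter e with acute
    assert(!isIdentifierStart('\u2028'));     // line separator
    assert(!isIdentifierStart('\U0001F600')); // an emoji
}
```

A predicate along these lines rejects whitespace, punctuation, and emoji up front, rather than accepting every non-ASCII code point and reporting an identifier error later.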
Oct 17 2023
parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/17/2023 4:18 PM, deadalnix wrote:
 this is clearly wrong.
I blame my parents.
Oct 17 2023