digitalmars.D - Handling of U+2028 and U+2029 in source code

kdevel <kdevel vogtner.de> writes:

According to [1] U+2028 and U+2029 are considered end-of-line 
characters. Does this make sense?

```
$ cat lsps.d
void main ()
{
    enum b = 8;
    mixin ("enum a1 =\u2028b; pragma (msg, a1);");
    mixin ("enum a2\u2028= b; pragma (msg, a2);");
    mixin ("enum\u2028a3 = b; pragma (msg, a3);");
}
$ dmd lsps.d
8
lsps.d-mixin-5(5): Error: char 0x2028 not allowed in identifier
lsps.d-mixin-6(6): Error: char 0x2028 not allowed in identifier
```

[1] https://dlang.org/spec/lex.html#end_of_line

Oct 15 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

Based upon how the identifier tokenization occurs greedily, yes it makes 
sense.

Oct 15 2023

kdevel <kdevel vogtner.de> writes:

On Monday, 16 October 2023 at 02:45:17 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 Based upon how the identifier tokenization occurs greedily, yes 
 it makes sense.

The error message is confusing. Compare with this code, which uses 
carriage returns instead:

```
$ cat ret.d
void main ()
{
    enum b = 8;
    mixin ("enum a1 =\rb; pragma (msg, a1);");
    mixin ("enum a2\r= b; pragma (msg, a2);");
    mixin ("enum\ra3 = b; pragma (msg, a3);");
}
$ dmd ret.d
8
8
8
```

Why are U+2028 and U+2029 handled unlike \r and \n?

Oct 16 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

Basically, when the lexer is in a multi-byte UTF-8 character, it checks 
whether the character is in the non-ASCII character ranges. No special 
handling of newlines is provided, but it probably should be.
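
To illustrate what such special handling might look like, here is a minimal sketch. It is not the code in lexer.d: `isUnicodeEndOfLine` and `startsWithEndOfLine` are hypothetical names, and the only library call is `std.utf.decode`. The idea is to test a decoded code point against the spec's end-of-line characters before it can reach the identifier check.

```d
import std.utf : decode;

/// Hypothetical helper: true for the two non-ASCII code points that the
/// spec's EndOfLine rule lists (line separator and paragraph separator).
bool isUnicodeEndOfLine(dchar c) pure nothrow @nogc @safe
{
    return c == '\u2028' || c == '\u2029';
}

/// Hypothetical stand-in for the lexer's per-character dispatch: decode
/// one code point starting at `index` and report whether it ends a line.
/// `decode` advances `index` past the code point it consumed.
bool startsWithEndOfLine(const(char)[] src, ref size_t index) pure @safe
{
    const dchar c = decode(src, index);
    return c == '\n' || c == '\r' || isUnicodeEndOfLine(c);
}
```

With a check like this placed ahead of the identifier branch, `"enum\u2028a3"` would presumably split into `enum` and `a3` the same way `"enum\ra3"` does.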

Oct 16 2023

Walter Bright <newshound2 digitalmars.com> writes:

On 10/16/2023 5:37 PM, Richard (Rikki) Andrew Cattermole wrote:
 https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
 
 Basically, when the lexer is in a multi-byte UTF-8 character, it checks 
 whether the character is in the non-ASCII character ranges. No special 
 handling of newlines is provided, but it probably should be.

Yes, please file a bugzilla!

Oct 17 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

https://issues.dlang.org/show_bug.cgi?id=24190

Oct 17 2023

deadalnix <deadalnix gmail.com> writes:

On Tuesday, 17 October 2023 at 00:37:41 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

 Basically, when the lexer is in a multi-byte UTF-8 character, it 
 checks whether the character is in the non-ASCII character ranges. 
 No special handling of newlines is provided, but it probably should be.

I've noticed that in the past, but this is clearly wrong. It's 
not just whitespace, it's also punctuation, emoji, a ton of stuff 
that just doesn't belong in identifiers.

The lexer should match the proper charset as a character start.
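
As a rough sketch of that idea, assuming "letter or underscore" as the rule for an identifier start: `isIdentifierStart` and `isIdentifierContinue` are hypothetical names, and `std.uni.isAlpha` only stands in for whatever character table the compiler would actually use.

```d
import std.uni : isAlpha;

/// Hypothetical predicate: may `c` begin an identifier?
/// Assumption: an underscore or any code point that std.uni.isAlpha
/// classifies as alphabetic. A real lexer may use a different table.
bool isIdentifierStart(dchar c) pure nothrow @nogc @safe
{
    return c == '_' || isAlpha(c);
}

/// Hypothetical predicate: may `c` continue an identifier?
/// Assumption: anything allowed at the start, plus ASCII digits.
bool isIdentifierContinue(dchar c) pure nothrow @nogc @safe
{
    return isIdentifierStart(c) || (c >= '0' && c <= '9');
}
```

Under these assumptions, separators such as U+2028, punctuation, and emoji are all rejected at the start of an identifier instead of producing a "not allowed in identifier" error halfway through one.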

Oct 17 2023

Walter Bright <newshound2 digitalmars.com> writes:

On 10/17/2023 4:18 PM, deadalnix wrote:
 this is clearly wrong.

I blame my parents.

Oct 17 2023
