digitalmars.D - [dox] Fixing the lexical rule for BinaryInteger

reply "Andre Artus" <andre.artus gmail.com> writes:
The documentation on the lexical rules for BinaryInteger 
(http://dlang.org/lex.html#BinaryInteger) has a few issues:

 BinaryInteger:
    BinPrefix BinaryDigits

The nonterminal BinaryDigits does not exist.

 BinaryDigitsUS:
    BinaryDigitUS
    BinaryDigitUS BinaryDigitsUS

The construction for BinaryDigitsUS currently allows for the following: _(_)*, e.g. 0b_, 0b__, 0b___ etc., which is clearly not allowed by the compiler.

I have put up a change on GitHub [1], but there is a clear problem. The DMD compiler allows for any of the following (reduced cases):

a. 0b__1
b. 0b_1_
c. 0b1__

My change disallows the second case (b), but is in line with how the other integers are specified.

This is a specification problem (limitation of BNF), not an implementation problem. In plain English one would just say that the BinaryDigitsUS sequence should contain at least one BinaryDigit character.

I'm busy working on the HexadecimalInteger, which has related issues.

1. https://github.com/andre-artus/dlang.org/blob/LexBinaryDigit/lex.dd
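
The plain-English rule ("at least one BinaryDigit somewhere in the sequence") is short when written as code. A minimal sketch in D, assuming a hypothetical helper name, isBinaryInteger, that is not part of DMD or Phobos; it only mirrors the behaviour described in this post and is not taken from the compiler source:

```d
import std.stdio;

/// Hypothetical helper: true if `s` is "0b" or "0B" followed by a run of
/// '0', '1' and '_' that contains at least one binary digit.
bool isBinaryInteger(string s)
{
    if (s.length < 3 || s[0] != '0' || (s[1] != 'b' && s[1] != 'B'))
        return false;

    bool sawDigit = false;
    foreach (char c; s[2 .. $])
    {
        if (c == '0' || c == '1')
            sawDigit = true;
        else if (c != '_')
            return false; // any other character is not part of the literal
    }
    return sawDigit;
}

void main()
{
    // The three reduced cases from the post, plus two underscore-only inputs.
    foreach (lit; ["0b__1", "0b_1_", "0b1__", "0b_", "0b"])
        writefln("%-6s -> %s", lit, isBinaryInteger(lit));
}
```

The single sawDigit flag is the whole "at least one digit" constraint; the difficulty discussed here is only in expressing that same constraint in the BNF used by the spec.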
Aug 16 2013
next sibling parent "Brian Schott" <briancschott gmail.com> writes:
I've been doing some work with the language grammar 
specification. You may find these resources useful:

http://d.puremagic.com/issues/show_bug.cgi?id=10233
https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
Aug 16 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

You have done impressive work on your grammar; I just have some small issues.

1. I run into a number of errors trying to generate the Java code; I'm using ANTLR 4.1.

2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

Same with HexadecimalInteger.

3. The imports don't allow for all cases.

4. How are you handling the scope attribute specifier in the "attribute ':'" case, e.g. "public:"?

There seem to be a few more places where it diverges a bit from what the compiler currently accepts.

I'm not arguing for the wisdom of writing code as I am about to show, but the following compiles with the current release build of DMD, yet may not parse with DGrammar; it will quite likely balk in the scanner:

module main;

public:
static:
import std.stdio;

int main(string[] argv)
{
    auto myBin = 0b0011_1101;

    writefln("%1$x\t%1$.8b\t%1$s", myBin);

    auto myBin2 = 0b_______1;

    writefln("%1$x\t%1$.8b\t%1$s", myBin2);

    auto myBin3 = 0b____1___;

    writefln("%1$x\t%1$.8b\t%1$s", myBin3);

    auto myHex1 = 0x1__;
    writefln("%1$x\t%1$.8b\t%1$s", myHex1);

    auto myHex2 = 0x_1_;
    writefln("%1$x\t%1$.8b\t%1$s", myHex2);

    auto myHex3 = 0x__1;
    writefln("%1$x\t%1$.8b\t%1$s", myHex3);

    return 0;
}
Aug 16 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
 On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

You have done impressive work on your grammar; I just have some small issues. 1. I run into a number of errors trying to generate the Java code, I'm using ANTLR 4.1

I'm aware of that. If you're able to get ANTLR to actually produce a working parser for D I'd be happy to merge your pull request. I haven't been able to get any parser generators to work for D.
 2. Your BinaryInteger and HexadecimalInteger only allow for one 
 of the following (reduced) cases:

 0b1__ : works
 0b_1_ : fails
 0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.
 Same with HexadecimalInteger.

 3. The imports don't allow for all cases.

https://github.com/Hackerpilot/DGrammar/issues
 4. how are you handling the scope attribute specifier in the 
 "attribute ':'" case, e.g. "public:"?

 There seems to be a few more places where it diverges a bit 
 from what the compiler currently accepts.

 I'm not arguing for the wisdom of writing code as I am about to 
 show, but the following compiles with the current release build 
 of DMD, but may not parse with DGrammar, quite likely balk in 
 the scanner:

 module main;

 public:
 static:
 import std.stdio;

 int main(string[] argv)
 {
 	auto myBin = 0b0011_1101;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin);

 	auto myBin2 = 0b_______1;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin2);

 	auto myBin3 = 0b____1___;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin3);

 	auto myHex1 = 0x1__;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex1);

 	auto myHex2 = 0x_1_;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex2);

 	auto myHex3 = 0x__1;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex3);

 	
 	return 0;
 }

I wrote that grammar as part of my work on DCD and DScanner. My lexer, parser, and AST library need some more testing. Please download DScanner and run it with either the --ast or --syntaxCheck options. If you find issues, please report them on Github.
Aug 16 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

I have fixed up a few issues in DGrammar.g4; I will put them up on GitHub if you are interested.

According to the Definitive ANTLR Reference, the following words are reserved in ANTLR grammars: import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens. The two I marked above caused problems when generating.

I don't know whether you are in the middle of trying to fix the indirect left recursion issue, but I see that terminals tied to "unaryExpression" are duplicated all over the place. I can fix the recursion issue and clean up the dups if that would help you.
Aug 16 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 16 August 2013 at 23:07:38 UTC, Andre Artus wrote:
 On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

I have fixed up a few issues in DGrammar.g4; I will put them up on GitHub if you are interested.

According to the Definitive ANTLR Reference, the following words are reserved in ANTLR grammars: import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens. The two I marked above caused problems when generating.

I must have missed those when I pulled the grammar out of my parser's DDOC comments.
 I don't know whether you are in the middle of trying to fix the 
 indirect left recursion issue but I see that terminals tied to 
 "unaryExpression" are duplicated all over the place. I can fix 
 the recursion issue, and clean up the dups if that would help 
 you.

It would. I'm not actively working on that grammar.
Aug 16 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
-- SNIP --

 I wrote that grammar as part of my work on DCD and DScanner. My 
 lexer, parser, and AST library need some more testing. Please 
 download DScanner and run it with either the --ast or 
 --syntaxCheck options. If you find issues, please report them 
 on Github.

I forked just under an hour ago; am I on old bits? I have fixed all but one of the build issues, and I would like to fix the last one before I commit, as I don't like to leave my repos in a broken state. I'll continue the discussion on GitHub.
Aug 16 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
 On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

2. Your BinaryInteger and HexadecimalInteger only allow for one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal. But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row. You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.

I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest. But if you want to accept "strange" literals like 0b__1__, you could do something like:

<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>

<underscoreBinaryDigits> ::= "_"
		| "_" <underscoreBinaryDigits>
		| <binaryDigit>
		| <binaryDigit> <underscoreBinaryDigits>
		| ""

<binaryDigit> ::= "0"
		| "1"

The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.


T

--
That's not a bug; that's a feature!
Aug 16 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2013 at 05:50:24PM -0700, H. S. Teoh wrote:
[...]
 <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
 
 <binaryDigits> ::= <binaryDigit> <binaryDigits>
 		| <binaryDigit>
 
 <underscoreBinaryDigits> ::= ""
 		| "_" <binaryDigits>
 		| "_" <binaryDigits> <underscoreBinaryDigits>
 
 <binaryDigit> ::= "0"
 		| "1"

Regex equivalent: 0b(0|1)(0|1)*(_(0|1)(0|1)*)*

[...]
 <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
<underscoreBinaryDigits>
 
 <underscoreBinaryDigits> ::= "_"
 		| "_" <underscoreBinaryDigits>
 		| <binaryDigit>
 		| <binaryDigit> <underscoreBinaryDigits>
 		| ""
 
 <binaryDigit> ::= "0"
 		| "1"

Regex equivalent: 0b(0|1|_)*(0|1)(0|1|_)*


T

--
"How are you doing?" "Doing what?"
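
For anyone who wants to try the two regex forms side by side, here is a small sketch using Phobos' std.regex. The helper names matchesStrict and matchesPermissive are made up for illustration, and the ^ and $ anchors are an addition so that the whole string has to match rather than just a prefix:

```d
import std.regex;
import std.stdio;

/// Strict form: underscores only as single separators between digits.
bool matchesStrict(string s)
{
    return !matchFirst(s, regex(r"^0b(0|1)(0|1)*(_(0|1)(0|1)*)*$")).empty;
}

/// Permissive form: any mix of digits and underscores, at least one digit.
bool matchesPermissive(string s)
{
    return !matchFirst(s, regex(r"^0b(0|1|_)*(0|1)(0|1|_)*$")).empty;
}

void main()
{
    foreach (lit; ["0b0011_1101", "0b1__", "0b_1_", "0b__1", "0b_", "0b"])
        writefln("%-12s strict=%-5s permissive=%s",
                 lit, matchesStrict(lit), matchesPermissive(lit));
}
```

The regex is rebuilt on every call to keep the sketch short; a real scanner would compile each pattern once.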
Aug 16 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
 On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
 On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

2. Your BinaryInteger and HexadecimalInteger only allow for 
one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal. But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.
 You can also make your parser only
 pick up <binaryDigit> when performing semantic on binary 
 literals, so
 the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).
 I'd be surprised if there's any D code out there that doesn't 
 fit this
 spec, to be honest.

It's not what I would call best practice, but the following is possible in the current compiler:
 	auto myBin1 = 0b0011_1101; // Sane
 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed, as they potentially defeat the goal of improving clarity. But that's a breaking change.
 But if you want to accept "strange" literals like 0b__1__, you 
 could do
 something like:

 <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> 
 <underscoreBinaryDigits>

 <underscoreBinaryDigits> ::= "_"
 		| "_" <underscoreBinaryDigits>
 		| <binaryDigit>
 		| <binaryDigit> <underscoreBinaryDigits>
 		| ""

 <binaryDigit> ::= "0"
 		| "1"

 The odd form of the rule for <binaryLiteral> is to ensure that 
 there's
 at least one binary digit in the string, whereas
 <underscoreBinaryDigits> is just a wildcard anything-goes rule 
 that
 takes any combination of 0, 1, and _, including the empty 
 string.

The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.

BinaryLiteral : '0b' [_01]* [01] [_01]* ;

I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

It's not a problem implementing the rule; I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what's being used.
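
To show how little code the hand-written version needs, and why 0b_______1 and 0b____1___ both come out as 1, here is a sketch of a scanner helper in D. The name scanBinaryLiteral is hypothetical and this is not DMD's lexer; it simply follows the ANTLR4 rule above and skips underscores while accumulating the value:

```d
import std.stdio;

/// Hypothetical helper: accepts '0b' [_01]* [01] [_01]* and stores the
/// literal's value in `value`. Overflow past 64 bits is not checked.
bool scanBinaryLiteral(string s, out ulong value)
{
    if (s.length < 3 || s[0 .. 2] != "0b")
        return false;

    bool sawDigit = false;
    ulong acc = 0;
    foreach (char c; s[2 .. $])
    {
        if (c == '_')
            continue;             // separators carry no value
        if (c != '0' && c != '1')
            return false;
        acc = (acc << 1) | (c - '0');
        sawDigit = true;
    }
    if (!sawDigit)
        return false;             // "0b" and "0b___" are rejected
    value = acc;
    return true;
}

void main()
{
    foreach (lit; ["0b0011_1101", "0b_______1", "0b____1___", "0b_"])
    {
        ulong v;
        if (scanBinaryLiteral(lit, v))
            writefln("%s == %s", lit, v);
        else
            writefln("%s is rejected", lit);
    }
}
```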
Aug 16 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 17, 2013 at 04:02:40AM +0200, Andre Artus wrote:
 On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

2. Your BinaryInteger and HexadecimalInteger only allow for
one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal. But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.
You can also make your parser only pick up <binaryDigit> when
performing semantic on binary literals, so the other stuff is ignored
and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

I didn't mean to push it up to the parser. I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.
I'd be surprised if there's any D code out there that doesn't fit
this spec, to be honest.

It's not what I would call best practice, but the following is possible in the current compiler:
	auto myBin1 = 0b0011_1101; // Sane
	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed, as they potentially defeat the goal of improving clarity. But that's a breaking change.

I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

The issue here is that when specs / DMD / TDPL don't agree, then it's not always clear which among the three are wrong. Perhaps *all* of them are wrong. Just because DMD accepts invalid code doesn't mean it should be part of the specs, for example. It could constitute a DMD bug.
But if you want to accept "strange" literals like 0b__1__, you could
do something like:

<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
<underscoreBinaryDigits>

<underscoreBinaryDigits> ::= "_"
		| "_" <underscoreBinaryDigits>
		| <binaryDigit>
		| <binaryDigit> <underscoreBinaryDigits>
		| ""

<binaryDigit> ::= "0"
		| "1"

The odd form of the rule for <binaryLiteral> is to ensure that
there's at least one binary digit in the string, whereas
<underscoreBinaryDigits> is just a wildcard anything-goes rule that
takes any combination of 0, 1, and _, including the empty string.

The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.

BinaryLiteral : '0b' [_01]* [01] [_01]* ;

I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is equivalent to the regex I posted later.
 It's not a problem implementing the rule, I am more concerned with
 documenting it in a clear and unambiguous way so that people
 building tools from it can get it right. BNF isn't always the
 easiest way to do so, but it's what's being used.

Well, you could bug Walter about what *should* be accepted, and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD. Again, I seriously doubt that such a change would cause any code breakage, because writing 0b1 as 0b____1____ is just so ridiculous that any such code *should* be broken.


T

--
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com
Aug 17 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
 Andre Artus wrote:
 2. Your BinaryInteger and HexadecimalInteger only allow for
 one of the following (reduced) cases:
 
 0b1__ : works
 0b_1_ : fails
 0b__1 : fails





 Brian Schott wrote:
 It's my opinion that the compiler should reject all of these 
 because I think of the underscore as a separator between 
 digits,
 but I'm constantly fighting the "spec, dmd, and idiom all 
 disagree" issue.




I agree with you, Brian, all three of these constructions go contrary to the goal of making the code clearer. I would not be too surprised if a significant number of programmers would see those as different numbers, at least until they paused to take it in.
 [...]
 H. S. Teoh wrote:
 
 I remember reading this part of the spec on dlang.org, and I 
 wonder if it was worded the way it is just for simplicity, 
 because to specify something like "_ must appear between 
 digits" involves some complicated BNF rules, which maybe 
 seems like overkill for a single literal.



I think that you are right.
 H. S. Teoh wrote:
 
 But sometimes it is good to be precise, if we want to enforce
 "proper" conventions for underscores:
 
 <binaryLiteral>  ::= "0b" <binaryDigits>
                     <underscoreBinaryDigits>
 
 <binaryDigits>   ::= <binaryDigit> <binaryDigits>
                    | <binaryDigit>
 
 <underscoreBinaryDigits>
                  ::= ""
                    | "_" <binaryDigits>
                    | "_" <binaryDigits> 
 <underscoreBinaryDigits>
 
 <binaryDigit>    ::= "0"
                    | "1"
 
 This BNF spec forces "_" to only appear between two binary 
 digits, and never more than a single _ in a row.



 Andre Artus wrote:
 Yup, that's the issue. Coding the actual behaviour by hand, or 
 doing it with a regular expression, is close to trivial.


 H. S. Teoh wrote:
 You can also make your parser only pick up <binaryDigit> when
 performing semantic on binary literals, so the other stuff is 
 ignored and only serves to enforce syntax.



 Andre Artus wrote:
 Pushing it up to the parser is an option in implementation, 
 but I don't see that making the specification easier (it's 
 3:40 in the morning here, so I am very likely not thinking too 
 clearly about this).


 H. S. Teoh wrote:
 I didn't mean to push it up to the parser.

Sorry I misunderstood, I had been up for over 21 hours at the time I wrote, so it was getting a bit difficult for me to concentrate. I got the impression you were saying that the parser would be responsible for extracting the binary digits.
 H. S. Teoh wrote:
 I was just using BNF to show that it's possible to specify the 
 behaviour precisely.
 And also that it's rather convoluted just for something as 
 intuitively straightforward as an integer literal. Which is a 
 likely reason why the current specs are a bit blurry about what 
 should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF before; most often a form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.
 H. S. Teoh wrote:
 I'd be surprised if there's any D code out there that doesn't 
 fit this spec, to be honest.



 Andre Artus wrote:
 It's not what I would call best practice, but the following is
 possible in the current compiler:
 
 	auto myBin1 = 0b0011_1101; // Sane
 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed, as they potentially defeat the goal of improving clarity. But that's a breaking change.


 H. S. Teoh wrote:
 I know that, but I'm saying that hardly *any* code would break 
 if we made DMD reject things like this. I don't think anybody 
 in their right mind would write code like that. (Unless they 
 were competing in the IODCC... :-P)

I agree that the compiler should probably break that code, I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".
 H. S. Teoh wrote:
 The issue here is that when specs / DMD / TDPL don't agree, 
 then it's not always clear which among the three are wrong. 
 Perhaps *all* of them are wrong. Just because DMD accepts 
 invalid code doesn't mean it should be part of the specs, for 
 example. It could constitute a DMD bug.

It would be good to get some clarification on this.
 H. S. Teoh wrote:
 But if you want to accept "strange" literals like 0b__1__, 
 you could do something like:
 
 <binaryLiteral>          ::= "0b" <underscoreBinaryDigits>
                              <binaryDigit>
                              <underscoreBinaryDigits>
 
 <underscoreBinaryDigits> ::= "_"
                            | "_" <underscoreBinaryDigits>
                            | <binaryDigit>
                            | <binaryDigit> 
 <underscoreBinaryDigits>
                            | ""
 
 <binaryDigit>            ::= "0"
                            | "1"
 
 The odd form of the rule for <binaryLiteral> is to ensure that
 there's at least one binary digit in the string, whereas
 <underscoreBinaryDigits> is just a wildcard anything-goes 
 rule that takes any combination of 0, 1, and _, including the
 empty string.



 Andre Artus wrote:
 The rule that matches the DMD compiler is actually very easy 
 to do in ANTLR4, i.e.
 
 BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
 
 I'm a bit too tired to fully pay attention, but it seems you 
 are saying that "0b" (no additional numbers) should match, 
 which I believe it should not (although I admit to not testing 
 this). If it does then I would consider that a bug.


 H. S. Teoh wrote:
 No, the BNF rules I wrote are equivalent to your ANTLR4 spec. 
 Which is equivalent to the regex I posted later.

I should have paid better attention, as I missed the <binaryDigit> in <binaryLiteral>. To be honest I was having a hard time focusing due to lack of sleep and a pervading stench of paint fumes seeping in from the adjacent building.
 Andre Artus wrote:
 It's not a problem implementing the rule, I am more concerned 
 with documenting it in a clear and unambiguous way so that 
 people building tools from it can get it right. BNF isn't 
 always the easiest way to do so, but it's what's being used.


 H. S. Teoh wrote:
 Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.
 H. S. Teoh wrote:
 and if he agrees to restrict it to having _ only between two 
 digits, then you'd file a bug against DMD.

Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD. The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Possible regex alternatives (note I do not include the sign, as per current spec).

(0|[1-9]([_]*[0-9])*)

or arguably better

(0|[1-9]([_]?[0-9])*)
 H. S. Teoh wrote:
 Again, I seriously doubt that such a change would cause any 
 code breakage, because writing 0b1 as 0b____1____ is just so 
 ridiculous that any such code *should* be broken.

Agreed.
Aug 17 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 17, 2013 at 11:29:03PM +0200, Andre Artus wrote:
[...]
H. S. Teoh wrote:
I was just using BNF to show that it's possible to specify the
behaviour precisely.  And also that it's rather convoluted just for
something as intuitively straightforward as an integer literal. Which
is a likely reason why the current specs are a bit blurry about what
should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF before; most often a form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.

You're right, I think the D specs page on literals using BNF is a bit of an overkill. Maybe it should be rewritten using regexen. It would be easier to understand, for one thing.

[...]
H. S. Teoh wrote:
I know that, but I'm saying that hardly *any* code would break if
we made DMD reject things like this. I don't think anybody in
their right mind would write code like that. (Unless they were
competing in the IODCC... :-P)

I agree that the compiler should probably break that code, I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

Walter is someone who believes that compilers should only have errors, not warnings. :)

[...]
Andre Artus wrote:
It's not a problem implementing the rule, I am more concerned
with documenting it in a clear and unambiguous way so that
people building tools from it can get it right. BNF isn't always
the easiest way to do so, but it's what's being used.


H. S. Teoh wrote:
Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.

Email him and ask? :)
H. S. Teoh wrote:
and if he agrees to restrict it to having _ only between two
digits, then you'd file a bug against DMD.

Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD. The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Yeah that sounds like a bug in the specs.
 Possible regex alternatives (note I do not include the sign, as per
 current spec).
 
 (0|[1-9]([_]*[0-9])*)
 
 or arguably better
 (0|[1-9]([_]?[0-9])*)

I think it should be:

(0|[1-9]([0-9]*(_[0-9]+)*)?)

That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)).

This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.


T

--
Blunt statements really don't have a point.
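
A quick way to check the proposed decimal regex against the examples named in this post is a few lines of D with std.regex. This is only a sketch: the helper name isStrictDecimal is made up, and the ^ and $ anchors are added so the entire input must match:

```d
import std.regex;
import std.stdio;

/// Hypothetical helper wrapping the decimal regex proposed above.
bool isStrictDecimal(string s)
{
    auto re = regex(r"^(0|[1-9]([0-9]*(_[0-9]+)*)?)$");
    return !matchFirst(s, re).empty;
}

void main()
{
    // 12_34 should be accepted; 12_____34 and 123___ should not.
    foreach (lit; ["0", "7", "1234", "12_34", "12_____34", "123___", "_12"])
        writefln("%-10s -> %s", lit, isStrictDecimal(lit));
}
```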
Aug 17 2013
prev sibling next sibling parent "Andre Artus" <andre.artus gmail.com> writes:
 [...]
 H. S. Teoh wrote:
 I was just using BNF to show that it's possible to specify 
 the behaviour precisely.  And also that it's rather 
 convoluted just for something as intuitively straightforward 
 as an integer literal. Which is a likely reason why the 
 current specs are a bit blurry about what should/shouldn't be 
 allowed.



 Andre Artus wrote:
 I don't think I've seen lexemes defined using (a variant of) 
 BNF before; most often a form of regular expression is used. 
 One could cut down and clarify the page describing the lexical 
 syntax significantly employing simple regular expressions.


 H. S. Teoh wrote:
 You're right, I think the D specs page on literals using BNF is 
 a bit of an overkill. Maybe it should be rewritten using 
 regexen.
 It would be easier to understand, for one thing.

I would not mind doing this, I'll see what Walter says. It would also be quite easy to generate syntax diagrams from a reg-expr.
 [...]
 H. S. Teoh wrote:
 I know that, but I'm saying that hardly *any* code would 
 break if
 we made DMD reject things like this. I don't think anybody in
 their right mind would write code like that. (Unless they were
 competing in the IODCC... :-P)

I agree that the compiler should probably break that code, I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".


 H. S. Teoh wrote:
 Walter is someone who believes that compilers should only have 
 errors, not warnings. :)

That can go both ways, but I suspect you mean that in the good way.
 [...]
 Andre Artus wrote:
 It's not a problem implementing the rule, I am more concerned
 with documenting it in a clear and unambiguous way so that
 people building tools from it can get it right. BNF isn't 
 always the easiest way to do so, but it's what's being used.


 H. S. Teoh wrote:
 Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.


 H. S. Teoh wrote:
 Email him and ask? :)

I'll try that.
 H. S. Teoh wrote:
 and if he agrees to restrict it to having _ only between two
 digits, then you'd file a bug against DMD.

Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD. The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.


 H. S. Teoh wrote:
 Yeah that sounds like a bug in the specs.

Yes, I believe so. The same issues are under "Floating Point Literals". Should be easy to fix.
 Possible regex alternatives (note I do not include the sign, 
 as per current spec).
 
 (0|[1-9]([_]*[0-9])*)
 
 or arguably better
 (0|[1-9]([_]?[0-9])*)

I think it should be:

(0|[1-9]([0-9]*(_[0-9]+)*)?)

That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)).

This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.

I concur with your assessment. I believe my second reg-ex is functionally equivalent to the one you propose (test results below), although I would concede that yours may be easier to grok.

The following match my regex (assuming it's whitespace delimited):

1
1_1
1_2_3_4_5_6_7_8_9_0
1234_45_15
1234567_8_90
123456789_0
1_234567890
12_34567890
123_4567890
1234_567890
12345_67890
123456_7890
1234567_890
12345678_90
123456789_0
123_45_6_789012345_67890

Whereas these do not:

_1
1_
_1_
1______1
-12_34
-1234
123_45_6__789012345_67890
1234567890_
_1234567890_
_1234567890
1234567890_
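
The comparison can be repeated mechanically. The sketch below runs both patterns over a subset of the inputs listed above and reports whether they agree; fullMatch and the two enum names are hypothetical, and agreement on sample inputs is of course evidence, not a proof of equivalence:

```d
import std.regex;
import std.stdio;

// The second regex from the earlier post and the one proposed in the reply.
enum optionalUnderscore = r"(0|[1-9]([_]?[0-9])*)";
enum groupedDigits      = r"(0|[1-9]([0-9]*(_[0-9]+)*)?)";

/// Hypothetical helper: true if `pattern` matches all of `s`.
bool fullMatch(string s, string pattern)
{
    return !matchFirst(s, regex("^" ~ pattern ~ "$")).empty;
}

void main()
{
    auto samples = ["1", "1_1", "1_2_3_4_5_6_7_8_9_0", "1234_45_15",
                    "123_45_6_789012345_67890",
                    "_1", "1_", "_1_", "1______1", "-12_34", "-1234",
                    "123_45_6__789012345_67890", "1234567890_"];
    foreach (s; samples)
    {
        immutable a = fullMatch(s, optionalUnderscore);
        immutable b = fullMatch(s, groupedDigits);
        writefln("%-28s %-5s %-5s %s", s, a, b, a == b ? "agree" : "DISAGREE");
    }
}
```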
Aug 17 2013
prev sibling parent "Andre Artus" <andre.artus gmail.com> writes:
[...]

For fun I made a scanner rule that forces BinaryInteger to 
conform to a power of 2 grouping of nibbles. I think it loses 
its clarity after 16 bits.

I made the underscore optional between nibbles, but required for 
groups of 2 bytes and above.

Some passing cases from my test inputs.

0b00010001
0b0001_0001
0b00010001_0001_0001
0b00010001_00010001
0b0001_0001_00010001
0b00010001_00010001
0b00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_0001_0001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001

It loses some of the value of arbitrary grouping, specifically the 
ability to group bits in a bitmask by function.
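
The rule itself is not shown here, so the following is only one guess at a pattern that accepts the passing cases listed above: a byte is two nibbles with an optional underscore between them, and consecutive bytes must be separated by an underscore. The helper name isGroupedBinary is hypothetical, and the sketch does not check that the number of bytes is a power of 2:

```d
import std.regex;
import std.stdio;

/// Guessed grouping rule, not the actual scanner rule from this post:
/// nibble '_'? nibble, repeated with a mandatory '_' between bytes.
bool isGroupedBinary(string s)
{
    auto re = regex(r"^0b[01]{4}_?[01]{4}(_[01]{4}_?[01]{4})*$");
    return !matchFirst(s, re).empty;
}

void main()
{
    foreach (lit; ["0b00010001", "0b0001_0001", "0b00010001_0001_0001",
                   "0b0001_0001_00010001", "0b0001000100010001"])
        writefln("%-24s -> %s", lit, isGroupedBinary(lit));
}
```

The last input runs two bytes together without a separator, which this reading of the rule rejects.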
Aug 17 2013