
digitalmars.D - Official D Grammar

reply "Brian Schott" <briancschott gmail.com> writes:
I've pretty much finished up my work on the std.d.lexer module. I 
am waiting for the review queue to make some progress on the 
other (three?) modules being reviewed before starting a thread on 
it.

In the meantime I've started some work on an AST module for 
Phobos that contains the data types necessary to build up a 
parser module so that we can have a standard set of code build D 
dev tools off of. I decided to work directly from the standard on 
dlang.org for this to make sure that my module is correct and 
that the standard is actually correct.

I've seen several threads on this newsgroup complaining about the 
state of the standard and unfortunately this will be another one.

1) Grammar defined in terms of things that aren't tokens. Take, 
for example, PropertyDeclaration. It's defined as an "@" token 
followed by... what? "safe"? It's not a real token. It's an 
identifier. You can't parse this based on checking the token 
type. You have to check the type and the value.
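What "check the type and the value" amounts to can be sketched like this (a minimal illustration with hypothetical token and attribute names, not the actual std.d.lexer API):

```python
# Hypothetical token shape; names here are illustrative, not from std.d.lexer.
from collections import namedtuple

Token = namedtuple("Token", ["type", "text"])

# "@safe" lexes as an AT token followed by a plain identifier, so the parser
# cannot dispatch on token type alone: it must also compare the identifier text.
PROPERTY_IDENTIFIERS = {"safe", "trusted", "system", "property", "disable"}

def is_property_declaration(tokens, i):
    return (tokens[i].type == "at"
            and i + 1 < len(tokens)
            and tokens[i + 1].type == "identifier"
            and tokens[i + 1].text in PROPERTY_IDENTIFIERS)
```

With tokens `@ safe` this returns True, but with `@ foo` it returns False, which is exactly the type-plus-value check the grammar cannot currently express.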

2) Grammar references rules that don't exist. 
UserDefinedAttribute is defined in terms of CallExpression, but 
CallExpression doesn't exist elsewhere in the grammar. 
BaseInterfaceList is defined in terms of InterfaceClasses, but 
that rule is never defined.

3) Unnecessary rules. KeyExpression, ValueExpression, 
ScopeBlockStatement, DeclarationStatement, ThenStatement, 
ElseStatement, Test, Increment, Aggregate, LwrExpression, 
UprExpression, FirstExp, LastExp, StructAllocator, 
StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, 
ConstraintExpression, MixinIdentifier, etc... are all defined in 
terms of only one other rule.

I think that we need to be able to create a grammar description 
that:
* Fits in to a single file, so that a tool implementer does not 
need to collect bits of the grammar from the various pages on 
dlang.org.
* Can be verified to be correct by an existing tool such as 
Bison, Goldie, JavaCC, <your favorite here> with a small number 
of changes.
* Is part of the dmd/dlang repositories on github and gets 
updated every time the language changes.

I'm willing to work on this if there's a good chance it will 
actually be implemented. Thoughts?
Apr 01 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2013 4:18 PM, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am waiting
for
 the review queue to make some progress on the other (three?) modules being
 reviewed before starting a thread on it.

 In the meantime I've started some work on an AST module for Phobos that
contains
 the data types necessary to build up a parser module so that we can have a
 standard set of code build D dev tools off of. I decided to work directly from
 the standard on dlang.org for this to make sure that my module is correct and
 that the standard is actually correct.

 I've seen several threads on this newsgroup complaining about the state of the
 standard and unfortunately this will be another one.

 1) Grammar defined in terms of things that aren't tokens. Take, for example,
 PropertyDeclaration. It's defined as an "@" token followed by... what? "safe"?
 It's not a real token. It's an identifier. You can't parse this based on
 checking the token type. You have to check the type and the value.
True, do you have a suggestion?
 2) Grammar references rules that don't exist. UserDefinedAttribute is defined
in
 terms of CallExpression, but CallExpression doesn't exist elsewhere in the
 grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that
 rule is never defined.
Yes, this needs to be fixed.
 3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement,
 DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate,
 LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator,
 StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression,
 MixinIdentifier, etc... are all defined in terms of only one other rule.
Using these makes documentation easier, and I don't think it harms anything.
 I think that we need to be able to create a grammar description that:
 * Fits in to a single file, so that a tool implementer does not need to collect
 bits of the grammar from the various pages on dlang.org.
 * Can be verified to be correct by an existing tool such as Bison, Goldie,
 JavaCC, <your favorite here> with a small number of changes.
 * Is part of the dmd/dlang repositories on github and gets updated every time
 the language changes.

 I'm willing to work on this if there's a good chance it will actually be
 implemented. Thoughts?
I suggest doing this as a sequence of pull requests, not doing just one big one.
Apr 01 2013
next sibling parent reply Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 02/04/2013 03:13, Walter Bright wrote:
 1) Grammar defined in terms of things that aren't tokens. Take, for
 example,
 PropertyDeclaration. It's defined as an "@" token followed by... what?
 "safe"?
 It's not a real token. It's an identifier. You can't parse this based on
 checking the token type. You have to check the type and the value.
True, do you have a suggestion?
I don't think that kind of grammar issue is too annoying, since it's easy to understand what the intended behavior is (in this case at least). But to fix it, well, we can have just that: have the grammar say it should parse an identifier after the @, and then issue a semantic error of sorts if the value is not one of the expected special values (safe, etc.). Parsing an identifier here is in any case the best error recovery strategy anyway.

A bit more annoying is the case with the extern declaration, with the C++ parameter:

   extern(C++)

Here you have to look at a special identifier (the C, D, PASCAL part) and see if there is a ++ token ahead; it's a bit more special-casing in the parser. Here I think it would have been better to change the language itself and use "CPP" instead of "C++". A minor simplification.

-- 
Bruno Medeiros - Software Engineer
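The one-token lookahead being described can be sketched roughly like this (an illustrative token list and return convention, not DMD's actual parser):

```python
# Illustrative sketch of the extern(C++) special case: the linkage is a plain
# identifier plus one token of lookahead for "++". Token representation and
# return convention here are assumptions, not DMD's actual code.
def parse_linkage(tokens, i):
    """tokens[i] is the identifier right after 'extern(' (e.g. 'C', 'D')."""
    ident = tokens[i]
    if ident == "C" and i + 1 < len(tokens) and tokens[i + 1] == "++":
        return "C++", i + 2    # consume the extra '++' token
    return ident, i + 1
```

So `["C", "++", ")"]` yields the linkage "C++" while `["C", ")"]` yields plain "C"; with a "CPP" spelling the identifier alone would decide and the lookahead would disappear.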
Apr 09 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/9/2013 3:20 AM, Bruno Medeiros wrote:
 A bit more annoying is the case with the extern declaration, with the C++
 parameter:
    extern(C++)
 here you have to look at a special identifier (the C, D, PASCAL part) and see
if
 there is a ++ token ahead, it's a bit more of special-casing in the parser.
Here
 I think it would have been better to change the the language itself and use
 "CPP" instead of "C++". A minor simplification.
I think the special case is worth it.
Apr 20 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 04/20/2013 08:36 PM, Walter Bright wrote:
 On 4/9/2013 3:20 AM, Bruno Medeiros wrote:
 A bit more annoying is the case with the extern declaration, with the C++
 parameter:
    extern(C++)
 here you have to look at a special identifier (the C, D, PASCAL part)
 and see if
 there is a ++ token ahead, it's a bit more of special-casing in the
 parser. Here
 I think it would have been better to change the the language itself
 and use
 "CPP" instead of "C++". A minor simplification.
I think the special case is worth it.
Without doubt.
Apr 20 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-04-02 04:13, Walter Bright wrote:

 1) Grammar defined in terms of things that aren't tokens. Take, for
 example,
 PropertyDeclaration. It's defined as an "@" token followed by... what?
 "safe"?
 It's not a real token. It's an identifier. You can't parse this based on
 checking the token type. You have to check the type and the value.
True, do you have a suggestion?
Just define that @safe should be a token?

-- 
/Jacob Carlborg
Apr 09 2013
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
On 02/04/2013 03:13, Walter Bright wrote:
 On 4/1/2013 4:18 PM, Brian Schott wrote:
<snip>
 1) Grammar defined in terms of things that aren't tokens. Take,
 for example, PropertyDeclaration. It's defined as an "@" token
 followed by... what? "safe"? It's not a real token. It's an
 identifier. You can't parse this based on checking the token type.
 You have to check the type and the value.
True, do you have a suggestion?
<snip>
Er, do away with the meaningless @? Since nobody seems to have succeeded in explaining how the @-attributes differ from the rest, it seems the right way to go.

Stewart.

-- 
My email address is valid but not my primary mailbox and not checked regularly. Please keep replies on the 'group where everybody may benefit.
Apr 23 2013
next sibling parent reply "Diggory" <diggsey googlemail.com> writes:
If at some point custom attributes will be allowed, @ would be a 
nice syntax for that.

Things like @safe could be attributes implemented by the standard 
library - would cut down on the number of keywords.
Apr 23 2013
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, April 24, 2013 00:34:54 Diggory wrote:
 If at some point custom attributes will be allowed, @ would be a
 nice syntax for that.
 
 Things like @safe could be attributes implemented by the standard
 library - would cut down on the number of keywords.
We already have custom attributes using @. They're called User Defined Attributes and were added a couple releases ago IIRC.

- Jonathan M Davis
Apr 23 2013
prev sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, April 23, 2013 21:51:54 Stewart Gordon wrote:
 Er, do away with the meaningless @? Since nobody seems to have
 succeeded in explaining how the @-attributes differ from the rest, it
 seems the right way to go.
That would mean creating more keywords, which would break code. By using @, we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got @ on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have @ on them for the same reason.

- Jonathan M Davis
Apr 23 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-04-24 03:40, Jonathan M Davis wrote:

 That would mean creating more keywords, which would break code. By using @, we
 avoid having to create new keywords, which I believe was the whole point in
 the first place. Which attributes got @ on them was fairly arbitrary, but they
 do definitely serve a purpose. And any new attributes in the future will
 probably have @ on them for the same reason.
You can't use the same reasoning anymore since we have UDA. Those will conflict with new keywords starting with an @.

-- 
/Jacob Carlborg
Apr 23 2013
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, April 24, 2013 08:39:21 Jacob Carlborg wrote:
 On 2013-04-24 03:40, Jonathan M Davis wrote:
 That would mean creating more keywords, which would break code. By using
 @, we avoid having to create new keywords, which I believe was the whole
 point in the first place. Which attributes got @ on them was fairly
 arbitrary, but they do definitely serve a purpose. And any new attributes
 in the future will probably have @ on them for the same reason.
You can't use the same reasoning anymore since we have UDA. Those will conflict with new keywords starting with an @.
True. Adding @blah now risks conflicts where it didn't before, but it still risks fewer conflicts than adding a keyword, since a keyword could be used anywhere. Adding blah as a keyword would conflict with all uses of blah - including @blah - whereas adding @blah would just conflict with @blah.

There was also some discussion on how to make it so that @blah might not cause conflicts (via namespacing or somesuch), but I don't recall if anything came of that or how much of it applies to the actual UDA implementation.

So, it makes more sense to go with adding an attribute than a keyword if we have to add a new attribute, because it'll conflict with less, but it's definitely true that we no longer have a way to avoid creating conflicts when adding attributes without altering the grammar in a manner similar to when we made it possible to have attributes starting with @. So, it's that much more important that we only add new attributes when we really need them.

- Jonathan M Davis
Apr 25 2013
prev sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 24 April 2013 at 01:40:52 UTC, Jonathan M Davis 
wrote:
 On Tuesday, April 23, 2013 21:51:54 Stewart Gordon wrote:
 Er, do away with the meaningless @? Since nobody seems to have
 succeeded in explaining how the @-attributes differ from the 
 rest, it
 seems the right way to go.
That would mean creating more keywords, which would break code. By using @, we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got @ on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have @ on them for the same reason.

- Jonathan M Davis
Please don't.

This was the approach taken by Java with @overload and friends, which leads to ugly code in my opinion.

Once I worked in a project which took a similar approach; we would have annotations as a way to extend the core grammar. Each team eventually grew its own set of annotations, to the point the amount of annotations became bigger than the core grammar itself, with overlapping meaning.

-- 
Paulo
Apr 24 2013
prev sibling next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
On 02/04/2013 00:18, Brian Schott wrote:
<snip>
 I think that we need to be able to create a grammar description that:
 * Fits in to a single file, so that a tool implementer does not need to
 collect bits of the grammar from the various pages on dlang.org.
 * Can be verified to be correct by an existing tool such as Bison,
 Goldie, JavaCC, <your favorite here> with a small number of changes.
 * Is part of the dmd/dlang repositories on github and gets updated every
 time the language changes.
<snip>
Indeed, the published grammar needs to be thoroughly checked against what DMD is actually doing, and any discrepancies fixed (or filed in Bugzilla to be fixed in due course). And then they need to be kept in sync.

Has the idea of using a parser generator to build D's parsing code been rejected in the past, or is hand-coding just the way Walter decided to do it? Is the code any more efficient than what a typical parser generator would generate?

And all disambiguation rules (such as "if it's parseable as a DeclarationStatement, it's a DeclarationStatement") need to be made explicit as part of the grammar. I suppose this is where using Bison or similar would help, as it would point out any ambiguities in the grammar that need rules to resolve them.

Stewart.
Apr 02 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-04-02 15:21, Stewart Gordon wrote:

 Indeed, the published grammar needs to be thoroughly checked against
 what DMD is actually doing, and any discrepancies fixed (or filed in
 Bugzilla to be fixed in due course).  And then they need to be kept in
 sync.

 Has the idea of using a parser generator to build D's parsing code been
 rejected in the past, or is hand-coding just the way Walter decided to
 do it?  Is the code any more efficient than what a typical parser
 generator would generate?

 And all disambiguation rules (such as "if it's parseable as a
 DeclarationStatement, it's a DeclarationStatement") need to be made
 explicit as part of the grammar.  I suppose this is where using Bison or
 similar would help, as it would point out any ambiguities in the grammar
 that need rules to resolve them.
I'm wondering if it's possible to mechanically check that what's in the grammar is how DMD behaves.

-- 
/Jacob Carlborg
Apr 02 2013
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 I'm wondering if it's possible to mechanically check that 
 what's in the grammar is how DMD behaves.
Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a "parse only, don't check semantics" flag, though.

This will not check if the strings are parsed correctly by DMD, nor if invalid strings are rejected. But it would be a start.
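The generate-and-check idea can be sketched like this (a toy stand-in grammar, not the D grammar; production names are made up):

```python
import random

# Toy stand-in grammar (NOT the D grammar): nonterminals are dict keys,
# anything else is emitted as a terminal token.
grammar = {
    "Stmt": [["if", "(", "Expr", ")", "Stmt"], ["Expr", ";"]],
    "Expr": [["Expr", "+", "Expr"], ["id"], ["num"]],
}

def generate(symbol, depth=0):
    if symbol not in grammar:                 # terminal: emit as-is
        return [symbol]
    productions = grammar[symbol]
    # Past a depth limit, always pick the shortest production so that
    # generation is guaranteed to terminate.
    prod = random.choice(productions) if depth < 6 else min(productions, key=len)
    out = []
    for sym in prod:
        out.extend(generate(sym, depth + 1))
    return out

random.seed(0)
print(" ".join(generate("Stmt")))   # one random sentence to feed the compiler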
Apr 02 2013
parent reply "Christopher Bergqvist" <spambox0 digitalpoetry.se> writes:
On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:
 I'm wondering if it's possibly to mechanically check that 
 what's in the grammar is how DMD behaves.
Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a parse only don't check semantics flag, though. This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can both be used as input for the parser and as input for a tool that generates the documentation.

Theoretically possible, :) just look at Philippe Sigaud's Pegged.
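A minimal sketch of that single-source-of-truth idea (rule names borrowed from the thread; the table shape and renderer are assumptions, not Pegged's API):

```python
# Sketch: one grammar table used both to drive a parser and to render the
# documentation, so the two cannot drift apart. The table layout and the
# renderer below are illustrative assumptions.
GRAMMAR = {
    "BaseInterfaceList": [[":", "InterfaceList"]],
    "InterfaceList": [["Interface"], ["Interface", ",", "InterfaceList"]],
}

def render_bnf(grammar):
    """Render the table in the dlang.org grammar-page style."""
    lines = []
    for rule, productions in grammar.items():
        lines.append(rule + ":")
        for prod in productions:
            lines.append("    " + " ".join(prod))
        lines.append("")
    return "\n".join(lines)

print(render_bnf(GRAMMAR))
```

Because the documentation is rendered from the same table the parser consumes, a rule like InterfaceClasses could never be referenced in the docs without being defined for the parser.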
Apr 08 2013
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Monday, 8 April 2013 at 21:50:12 UTC, Christopher Bergqvist 
wrote:
 On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:
 I'm wondering if it's possible to mechanically check 
 what's in the grammar is how DMD behaves.
Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a parse only don't check semantics flag, though. This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can both be used as input for the parser and as input for a tool that generates the documentation. Theoretically possible, :) just look at Philippe Sigaud's Pegged.
I know, but the parser is currently hand-written and I think Walter will only accept an auto-generated parser if it is as fast as the current solution.

However, in an old discussion someone said that the D grammar isn't LALR(1) or LR(1), so I don't think that is possible with current D parser generators. Do we have a Pegged grammar for D?

Another thing is that for documentation purposes you might want to have a more readable but ambiguous grammar than your generator of choice will accept.
Apr 08 2013
parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 However in an old discussion someone said that the D grammar isn't LALR(1)
 or LR(1), so I don't think that is possible with current D parser
 generators. Do we have a pegged grammar for D?
Yes, it comes with the project. But it's still buggy (sometimes due to my own mistakes, sometimes due to plain errors in the online D grammar). And the generated parser is quite slow, alas.
Apr 08 2013
prev sibling next sibling parent Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.

 In the meantime I've started some work on an AST module for Phobos that
 contains the data types necessary to build up a parser module so that we
 can have a standard set of code build D dev tools off of. I decided to
 work directly from the standard on dlang.org for this to make sure that
 my module is correct and that the standard is actually correct.

 I've seen several threads on this newsgroup complaining about the state
 of the standard and unfortunately this will be another one.

 1) Grammar defined in terms of things that aren't tokens. Take, for
 example, PropertyDeclaration. It's defined as an "@" token followed
 by... what? "safe"? It's not a real token. It's an identifier. You can't
 parse this based on checking the token type. You have to check the type
 and the value.

 2) Grammar references rules that don't exist. UserDefinedAttribute is
 defined in terms of CallExpression, but CallExpression doesn't exist
 elsewhere in the grammar. BaseInterfaceList is defined in terms of
 InterfaceClasses, but that rule is never defined.

 3) Unnecessary rules. KeyExpression, ValueExpression,
 ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement,
 Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp,
 LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType,
 EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all
 defined in terms of only one other rule.

 I think that we need to be able to create a grammar description that:
 * Fits in to a single file, so that a tool implementer does not need to
 collect bits of the grammar from the various pages on dlang.org.
 * Can be verified to be correct by an existing tool such as Bison,
 Goldie, JavaCC, <your favorite here> with a small number of changes.
 * Is part of the dmd/dlang repositories on github and gets updated every
 time the language changes.

 I'm willing to work on this if there's a good chance it will actually be
 implemented. Thoughts?
Interesting thread.

I've been working on a hand-written D parser (in Java, for the DDT IDE) and I too have found a slew of grammar spec issues, some of them more serious than the ones you mentioned above. In some cases it's actually not clear, or downright wrong, what the grammar spec says. For example, here's one off of my notes:

void func(int foo() { } );

The spec says that is parsable (basically a function declaration in the parameter list), which makes no sense, and DMD doesn't accept it. Some cases are a bit trickier, since it's not clear if the syntax should be accepted or not (sometimes they might make sense but not be allowed).

These issues make things a bit harder for the development of tools that require D language parsers. But the whole grammar spec is so messy, I've been unsure whether it's worth filing bug reports or not (would they be addressed?). There is also the problem that even if those issues are fixed now, the spec could very easily fall out of date in the future, unless we have some system to test the spec.

Like you mentioned, ideally we would have a grammar spec for a grammar/PG tool so that correctness could more easily be verified. (It doesn't guarantee no spec bugs, but it makes it much harder for them to be there.)

-- 
Bruno Medeiros - Software Engineer
Apr 06 2013
prev sibling next sibling parent reply Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.
BTW, even in the lexer spec I've found an issue. How does this parse:

5.blah

According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.

-- 
Bruno Medeiros - Software Engineer
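The special case DMD actually implements can be sketched like this (heavily simplified: no underscores, exponents, or suffixes; not the actual DMD lexer):

```python
# Simplified sketch (not the actual DMD lexer): after the integer part, a '.'
# is only folded into a float literal when it is NOT followed by a second '.'
# or an identifier start, so "5.blah" stays INTEGER DOT IDENTIFIER.
def lex_number(src, i):
    j = i
    while j < len(src) and src[j].isdigit():
        j += 1
    starts_fraction = (j < len(src) and src[j] == "."
                       and j + 1 < len(src)
                       and not (src[j + 1].isalpha() or src[j + 1] in "._"))
    if starts_fraction:
        j += 1
        while j < len(src) and src[j].isdigit():
            j += 1
        return ("float", src[i:j]), j
    return ("integer", src[i:j]), j

assert lex_number("5.blah", 0) == (("integer", "5"), 1)   # 123.init works
assert lex_number("5.25", 0) == (("float", "5.25"), 4)
assert lex_number("1..2", 0) == (("integer", "1"), 1)     # slice-friendly
```

Note the exception to maximal munch lives in the lookahead condition, which is exactly the kind of rule the published grammar currently leaves implicit.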
Apr 06 2013
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Saturday, April 06, 2013 16:21:12 Bruno Medeiros wrote:
 On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.
 BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
It would definitely have to be INTEGER DOT IDENTIFIER due to UFCS, so it sounds like the spec wasn't updated like it should have been. - Jonathan M Davis
Apr 06 2013
prev sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 04/06/13 17:21, Bruno Medeiros wrote:
 On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.
 BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.

Actually, that last case is even more "interesting"; http://dlang.org/lex.html has "1_2_3_4_5_6_._5_6_7_8" as a valid example, which of course it's not ("_5_6_7_8" is a valid identifier), but there is no reason to disallow "1_2_3_4_5_6_.5_6_7_8".
 that should be clarified.
These are just grammar bugs that could easily be fixed. Then there are some things that can be less obvious, but shouldn't really be controversial, like allowing empty HexString literals.

Then there's the enhancement category. Looking through my comments, I think the only deliberate change from dlang.org that I have is in DelimitedString -- there is no reason to forbid q"/abc/def/"; there are no back-compat issues, as it couldn't have existed in legacy D code.

artur
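The underscore point can be checked mechanically; a small sketch (these regexes are simplified approximations of the Identifier and DecimalFloat rules, not the full spec):

```python
import re

# Why "1_2_3_4_5_6_._5_6_7_8" cannot be a single float literal: '_' may start
# an identifier, so maximal munch lexes "_5_6_7_8" as an identifier token.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Simplified approximation of DecimalFloat: the character right after '.' must
# be a digit (exponents and suffixes are omitted from this sketch).
DECIMAL_FLOAT = re.compile(r"[0-9][0-9_]*\.[0-9][0-9_]*")
```

Under these rules `1_2_3_4_5_6_.5_6_7_8` matches as one float while `1_2_3_4_5_6_._5_6_7_8` does not, since `_5_6_7_8` matches the identifier rule instead.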
Apr 06 2013
parent reply Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 06/04/2013 20:52, Artur Skawina wrote:
 On 04/06/13 17:21, Bruno Medeiros wrote:
 On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.
 BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.
The "1..2" is actually mentioned in the spec:

"An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer."

so it's there, even if it can be missed. But unless I missed it, the spec is incorrect for the "1.ident" or "1_2_3_4_5_6_._5_6_7_8" cases, as there is no exception mentioned there... and it's not always 100% obvious to a human how these should be handled. Or maybe that's just me :)

-- 
Bruno Medeiros - Software Engineer
Apr 06 2013
parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 04/07/13 00:35, Bruno Medeiros wrote:
 On 06/04/2013 20:52, Artur Skawina wrote:
 On 04/06/13 17:21, Bruno Medeiros wrote:
 On 02/04/2013 00:18, Brian Schott wrote:
 I've pretty much finished up my work on the std.d.lexer module. I am
 waiting for the review queue to make some progress on the other (three?)
 modules being reviewed before starting a thread on it.
 BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.
The "1..2" is actually mentioned in the spec: "An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer." so it's there, even if it can be missed.
I know, but documenting a (grammar) bug does not make it go away.
 But unless I missed it, the spec is incorrect for the "1.ident" or
"1_2_3_4_5_6_._5_6_7_8" cases as there is no exception mentioned there... and
it's not always 100% obvious to a human how these should be handled. Or maybe
that's just me :)
What does the "spec" currently say about ".001"?..

It's been a while since I did a D lexer based on the dlang grammar - it (the lexer) was supposed to be dmd-compatible. Took a closer look at the actual dlang.org rules today while writing this message...

Will try to find some time to clean up and convert a working D lexical grammar to PEG (what I have should be 1:1 translatable, except one rule, DelimitedString) and put it on the wiki. Maybe it will help someone avoid these issues.

artur
Apr 07 2013
parent reply Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 07/04/2013 16:14, Artur Skawina wrote:
 The "1..2" is actually mentioned in the spec:
"An exception to this rule is that a .. embedded inside what looks like two
floating point literals, as in 1..2, is interpreted as if the .. was separated
by a space from the first integer."
so it's there, even if it can be missed.
I know, but documenting a (grammar) bug does not make it go away.
Who says it's a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in slice expressions:

foo[1..2] // It would be silly to have to put a space after the 1

At most you could make a case that "1." shouldn't ever parse as float, that the decimal part should be required if the dot is present.

-- 
Bruno Medeiros - Software Engineer
Apr 09 2013
parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/09/13 12:24, Bruno Medeiros wrote:
 On 07/04/2013 16:14, Artur Skawina wrote:
 The "1..2" is actually mentioned in the spec:
"An exception to this rule is that a .. embedded inside what looks like two
floating point literals, as in 1..2, is interpreted as if the .. was separated
by a space from the first integer."
so it's there, even if it can be missed.
I know, but documenting a (grammar) bug does not make it go away.
Who says it's a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in slice expressions: foo[1..2] // It would be silly to have to put a space after the 1 At most you could make a case that "1." shouldn't ever parse as float, that the decimal part should be required if the dot is present.
It's a bug, because the grammar does not correctly describe the rules. A parser/lexer based on the grammar alone will not work. Documenting the exception(s) helps the human, but doesn't make the grammar correct.

I've started the PEG conversion of my lexer rules and the relevant one looks like this:

DecimalFloat:
    (LeadingDecimal "." !"." !IdentifierStart DecimalDigitsNoSingleUS DecimalExponent)
    / (LeadingDecimal "." !"." !IdentifierStart DigitUS*)
    / ("." LeadingDecimal DecimalExponent?)

This works as-is, without any extra info - the working lexer is mechanically generated from this rule. (It differs from the dlang definition in at least four ways - "1..2", "1.ident", "1.000_1" and ".001".)

I'll try to finish the conversion and post the whole lexical grammar in a couple of days (have almost no D-time right now).

artur
Apr 09 2013
prev sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
I've moved my work on the grammar to the following location on 
Github:

https://github.com/Hackerpilot/DGrammar

This uses ANTLR, as the other parser generators can't handle D's 
grammar. Several rules from the official grammar were removed, 
and several others were added (such as an actual rule for a 
function declaration...). I also tried to fix any inaccuracies or 
omissions I came across in the online documentation.

Comments, issues, and pull requests welcome.
Apr 20 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
20-Apr-2013 12:31, Brian Schott wrote:
 I've moved my work on the grammar to the following location on Github:

 https://github.com/Hackerpilot/DGrammar

 This uses ANTLR, as the other parser generators can't handle D's
 grammar.
Great. IMHO ANTLR is one of the sanest.
 Several rules from the official grammar were removed, and
 several others were added (such as an actual rule for a function
 declaration...). I also tried to fix any inaccuracies or omissions I came
 across in the online documentation.

 Comments, issues, and pull requests welcome.
Bookmarked for now ;) -- Dmitry Olshansky
Apr 20 2013
prev sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:
 This uses ANTLR, as the other parser generators can't handle 
 D's grammar.
I'm beginning to think that ANTLR is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
Apr 26 2013
next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:
 I'm beginning to think that ANTLR is not up to the task either. 
 I've somehow managed to get the grammar to the point where it 
 correctly parses several phobos modules but takes a half hour 
 to do so.
I think I'm done with this grammar for now. If there's any interest, I can use it to create syntax diagrams for inclusion on dlang.org. I'll also be filing a series of bug reports over the next few days. (Two recent examples: 0.0L is not valid according to the lexical standard, and deprecated("string") is not documented anywhere)
Apr 28 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/28/13 8:16 PM, Brian Schott wrote:
 On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:
 I'm beginning to think that ANTLR is not up to the task either. I've
 somehow managed to get the grammar to the point where it correctly
 parses several phobos modules but takes a half hour to do so.
I think I'm done with this grammar for now. If there's any interest, I can use it to create syntax diagrams for inclusion on dlang.org.
Yes! This is awesome. Thanks for this work.
 I'll also be filing a series of bug reports over the next few days. (Two
 recent examples: 0.0L is not valid according to the lexical standard,
 and deprecated("string") is not documented anywhere)
Great. Andrei
Apr 28 2013
prev sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
I wrote a small program to take the grammar and generate input to 
the dot program from Graphviz. You can generate the diagrams with 
some scripts available on Github or view a prebuilt version here:

http://hackerpilot.org/d/dgrammar.zip

Please take a look at this and let me know if you find any 
mistakes.
May 12 2013
parent reply "Brian Schott" <briancschott gmail.com> writes:
As threatened at DConf, I've started filing bugs against the 
grammar specification. Anyone interested can track bug 10233[1], 
which I've marked as blocked by the various issues I've been 
finding.

As usual, my best guess at D's actual grammar[2] is located 
here[3].

Top secret project is hinted at here[4].

[1] http://d.puremagic.com/issues/show_bug.cgi?id=10233
[2] I disclaim all responsibility for injuries sustained while 
looking at the functionDefinition rule.
[3] https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
[4] 
http://hackerpilot.github.io/experimental/std_lexer/phobos/parser.html
Jun 03 2013
parent reply "Brian Schott" <briancschott gmail.com> writes:
Status update:

My parser is at a point where it can parse all of phobos (well, 
excluding c-style array declarations, which I'm convinced are 
broken and should be removed from the language).

I've been updating the grammar definition as I worked on the 
parser. An HTML version is available for download here: 
https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html

Lexer, parser, and AST code is available here: 
https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. 
Bug reports and pull requests are welcome.
Jul 19 2013
next sibling parent reply "Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Friday, 19 July 2013 at 20:49:27 UTC, Brian Schott wrote:
 Status update:

 My parser is at a point where it can parse all of phobos (well, 
 excluding c-style array declarations, which I'm convinced are 
 broken and should be removed from the language).

 I've been updating the grammar definition as I worked on the 
 parser. An HTML version is available for download here: 
 https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html

 Lexer, parser, and AST code is available here: 
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. 
 Bug reports and pull requests are welcome.
https://rawgithub.com/Hackerpilot/DGrammar/master/grammar.html ;)
Jul 19 2013
parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 19 July 2013 at 20:52:33 UTC, Mr. Anonymous wrote:
 https://rawgithub.com/Hackerpilot/DGrammar/master/grammar.html
 ;)
I learn something new every day.
Jul 19 2013
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jul 19, 2013 at 10:49:25PM +0200, Brian Schott wrote:
 Status update:
 
 My parser is at a point where it can parse all of phobos (well,
 excluding c-style array declarations, which I'm convinced are broken
 and should be removed from the language).
Really? Those still exist in Phobos? Shouldn't they be rewritten to the "native" style?
 I've been updating the grammar definition as I worked on the parser.
 An HTML version is available for download here:
 https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html
 
 Lexer, parser, and AST code is available here:
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d.
 Bug reports and pull requests are welcome.
Wonderful! I'll have to look into this sometime. I've been dreaming about writing a D pretty-printer, and this may be just the thing I need to get it going. :)

(Well actually, I *wrote* a D pretty-printer, but I only got as far as lexing, with some token-matching rules for the pretty-printing. Needless to say, the results were rather underwhelming, since I didn't have an AST to work with.)

T -- Written on the window of a clothing store: No shirt, no shoes, no service.
Jul 19 2013
parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 19 July 2013 at 20:57:14 UTC, H. S. Teoh wrote:
 Really? Those still exist in Phobos? Shouldn't they be 
 rewritten to the
 "native" style?
Yes. https://github.com/D-Programming-Language/phobos/pull/1412/files
Jul 19 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 19 Jul 2013 22:49:25 +0200
schrieb "Brian Schott" <briancschott gmail.com>:

 Status update:
 
 My parser is at a point where it can parse all of phobos (well, 
 excluding c-style array declarations, which I'm convinced are 
 broken and should be removed from the language).
 
 I've been updating the grammar definition as I worked on the 
 parser. An HTML version is available for download here: 
 https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html
 
 Lexer, parser, and AST code is available here: 
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. 
 Bug reports and pull requests are welcome.
Nice work! I guess you already tried druntime, but did you also run your lexer and/or parser against the dmd test suite? The test suite probably has quite some old legacy stuff like c-style array declarations, so it might be a pita to run those tests, but it might show some real problems as well.

If you wanted to do more testing you could also run it on some well-maintained big D projects (GTKd, derelict3, ...):
http://wiki.dlang.org/Libraries_and_Frameworks
http://wiki.dlang.org/Open_Source_Projects

It's also nice that we may finally get a correct and complete grammar.

BTW: I think it might be a good idea to start a new thread for this update. This stuff is quite important for D/Phobos, and old threads tend to be ignored or get less attention.
Jul 19 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-07-19 22:49, Brian Schott wrote:
 Status update:

 My parser is at a point where it can parse all of phobos (well,
 excluding c-style array declarations, which I'm convinced are broken and
 should be removed from the language).

 I've been updating the grammar definition as I worked on the parser. An
 HTML version is available for download here:
 https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html

 Lexer, parser, and AST code is available here:
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d.
 Bug reports and pull requests are welcome.
Now for the million dollar question: does it work during compile time? -- /Jacob Carlborg
Jul 20 2013
prev sibling parent "Michael" <pr m1xa.com> writes:
On Friday, 19 July 2013 at 20:49:27 UTC, Brian Schott wrote:
 Status update:
 excluding c-style array declarations, which I'm convinced are 
 broken and should be removed from the language
+1. I've found for myself that "native"-style arrays are more natural.
Jul 20 2013
prev sibling next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:
 On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:
 This uses ANTLR, as the other parser generators can't handle 
 D's grammar.
I'm beginning to think that ANTLR is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
Do you know which parts of the grammar make the parser so slow? Do you use any suspicious features of ANTLR?
May 01 2013
parent Russel Winder <russel winder.org.uk> writes:
On Wed, 2013-05-01 at 17:34 +0200, Tobias Pankrath wrote:
 On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:
 On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:
 This uses ANTLR, as the other parser generators can't handle 
 D's grammar.
Which other parser generators? cf. for example, http://en.wikipedia.org/wiki/Comparison_of_parser_generators
 I'm beginning to think that ANTLR is not up to the task either. 
 I've somehow managed to get the grammar to the point where it 
 correctly parses several phobos modules but takes a half hour 
 to do so.
LL(k) can sometimes do this. As can LALR(1) of course.
 Do you know which parts of the grammar make the parser so slow?
 Do you use any suspicious features of ANTLR?
There isn't just one ANTLR: ANTLR 2, ANTLR 3, and ANTLR 4 are very different beasties. Not many of the parser generators generate D code :-(

-- 
Russel.
Dr Russel Winder  t: +44 20 7585 2200  voip: sip:russel.winder ekiga.net
41 Buckmaster Road  m: +44 7770 465 077  xmpp: russel winder.org.uk
London SW11 1EN, UK  w: www.russel.org.uk  skype: russel_winder
May 01 2013
prev sibling parent reply Bruno Medeiros <brunodomedeiros+dng gmail.com> writes:
On 26/04/2013 21:44, Brian Schott wrote:
 On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:
 This uses ANTLR, as the other parser generators can't handle D's grammar.
I'm beginning to think that ANTLR is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
Half hour??? That seems like a massive amount of time, even if the generated parser is very suboptimal and is doing a lot of unnecessary backtracking. I mean, even if the parser has quadratic performance over typical source files, it seems like a lot.

To be honest, that's one of the reasons that put me off working with ANTLR. It seems easy to create a parser with ANTLR, but to create an efficient, well-behaved parser it looks quite complicated, in the sense that you can't abstract yourself from what is happening under the hood... you have to read a lot of theory and documentation to learn the innards of ANTLR, and understand what kind of code it's actually generating, and how it processes input. (At moments it feels like you have to take a degree to learn how to use it effectively...)

-- 
Bruno Medeiros - Software Engineer
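For anyone wondering how backtracking drives the cost up this far: the classic fix is to memoize each (rule, position) result, which is what packrat/PEG parsers do. A toy Python sketch of that idea (deliberately simplified, and nothing to do with ANTLR's actual prediction machinery):

```python
from functools import lru_cache

# Toy illustration (not ANTLR's algorithm): a backtracking recognizer for a
# deliberately ambiguous grammar. Without memoization the same positions get
# re-parsed over and over; caching each (rule, position) result bounds the
# work, which is the idea behind packrat/PEG parsers.

def make_recognizer(text):
    @lru_cache(maxsize=None)
    def expr(i):
        # expr -> term '+' expr | term
        j = term(i)
        if j is not None and j < len(text) and text[j] == '+':
            k = expr(j + 1)
            if k is not None:
                return k
        return j  # fall back to the shorter alternative

    @lru_cache(maxsize=None)
    def term(i):
        # term -> a single digit
        if i < len(text) and text[i].isdigit():
            return i + 1
        return None

    # recognize iff expr consumes the whole input
    return lambda: expr(0) == len(text)
```

Each (rule, position) pair is computed at most once, so the recognizer stays linear in the input length even though the grammar is written with backtracking alternatives.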
May 02 2013
next sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Thu, 2013-05-02 at 17:44 +0100, Bruno Medeiros wrote:
[…]
 To be honest, that's one of the reasons that put me off working with 
 ANTLR. It seems easy to create a parser with ANTLR, but to create an 
 efficient, well-behaved parser it looks quite complicated, in the sense 
 that you can't abstract yourself from what is happening under the 
 hood... you have to read a lot of theory and documentation to learn the 
 innards of ANTLR, and understand what kind of code it's actually 
 generating, and how it processes input. (at moments it feels like you 
 have to take a degree to learn how to use it effectively...)
The Groovy parser is an ANTLR 2.7.7 grammar, it works well and quickly. I think the trick is to work with the LL(k) idioms and avoid letting LALR(1) thoughts creep in.

-- 
Russel.
May 02 2013
parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Thursday, 2 May 2013 at 17:13:46 UTC, Russel Winder wrote:
 On Thu, 2013-05-02 at 17:44 +0100, Bruno Medeiros wrote:
 […]
 To be honest, that's one of the reasons that put me off 
 working with ANTLR. It seems easy to create a parser with 
 ANTLR, but to create an efficient, well-behaved parser it 
 looks quite complicated, in the sense that you can't abstract 
 yourself from what is happening under the hood... you have to 
 read a lot of theory and documentation to learn the innards of 
 ANTLR, and understand what kind of code it's actually 
 generating, and how it processes input. (at moments it feels 
 like you have to take a degree to learn how to use it 
 effectively...)
The Groovy parser is an ANTLR 2.7.7 grammar, it works well and quickly. I think the trick is to work with the LL(k) idioms and avoid letting LALR(1) thoughts creep in.
LALR(1) is O(n) last time I looked. Both LL(k) and LALR(k) should never backtrack. Dunno what ANTLR is doing.
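For reference, the reason LL(k)/LALR(k) parsers stay linear is that a fixed amount of lookahead selects each production, so every token is consumed exactly once. A tiny hand-written LL(1) sketch in Python (illustrative grammar, not D's):

```python
# Minimal LL(1) sketch (illustrative grammar, not D's): the single token of
# lookahead returned by peek() selects each production, so the input is
# consumed strictly left to right with no backtracking -- O(n) overall.
# Grammar: list -> '[' items ']' ; items -> NUM (',' NUM)* | (empty)

def parse_list(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t!r} at position {pos}")
        pos += 1

    expect('[')
    items = []
    if peek() == 'NUM':              # lookahead decides: items or empty
        expect('NUM')
        items.append('NUM')
        while peek() == ',':         # lookahead decides: another item or stop
            expect(',')
            expect('NUM')
            items.append('NUM')
    expect(']')
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return items
```

Every branch is decided by peek() alone, so no token is ever examined twice; that single-token-decision property is exactly what an ANTLR grammar loses when it falls back to deep prediction or backtracking.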
May 03 2013
prev sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 2 May 2013 at 16:44:44 UTC, Bruno Medeiros wrote:
 (at moments it feels like you have to take a degree to learn 
 how to use it effectively...)
That or buy the documentation. Or both...
May 02 2013