
digitalmars.D - Parsing D Maybe Not Such a Good Idea <_<;

reply cy <dlang verge.info.tm> writes:
So I was toying with the idea of writing a D parser, and this 
happened.

const(/*
				D is kind of hard to parse. /*
			/**/
			int//) foo(T//
			) foo(T//
						)(T /* I mean,
										 seriously
										 */ bar)
	if ( is (T == // how on earth do they do it?
		 int) ) {
		return
			cast /+  where does the function name /+ even start? +/
						+/
			( const (int) )
			bar//;}
			;}
			
void main() {
	import std.stdio;
	writeln(foo(42));
}

I don't think I'm going to write a D parser.
Jun 14 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser, and this 
 happened.

 const(/*
 				D is kind of hard to parse. /*
 			/**/
 			int//) foo(T//
 			) foo(T//
 						)(T /* I mean,
 										 seriously
 										 */ bar)
 	if ( is (T == // how on earth do they do it?
 		 int) ) {
 		return
 			cast /+  where does the function name /+ even start? +/
 						+/
 			( const (int) )
 			bar//;}
 			;}
 			
 void main() {
 	import std.stdio;
 	writeln(foo(42));
 }

 I don't think I'm going to write a D parser.
That's hard to read for a human, but there is nothing in that example that is especially difficult for a machine to muddle through.
Jun 14 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 04:25:15 UTC, deadalnix wrote:
 there is nothing in that example that is especially difficult 
 for a machine to muddle through.
Nested comments: you have to keep track of nesting depth specifically for those. Single-line comments give the newline two totally different semantics depending on context, which the parser has to account for.

Comments can go absolutely anywhere, so you can never advance by simply incrementing your character pointer, regardless of the statement you're parsing. You have to test for the presence of comments and pass over them, and if you want to do any backtracking, you have to be able to skip backwards over them too.

Type definitions can have parentheses in them, so the parser can't search for '(' either forwards or backwards to find the beginning of a function definition. It has to analyze each parenthetical grouping in its own context, taking into account all the possible keywords, special phrases like extern, properties, and template arguments, both for parameters and return values. And that's just to parse function declarations!
Jun 14 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Wed, Jun 15, 2016 at 04:40:07AM +0000, cy via Digitalmars-d wrote:
 On Wednesday, 15 June 2016 at 04:25:15 UTC, deadalnix wrote:
 there is nothing in that example that is especially difficult for a
 machine to muddle through.
Nested comments, have to keep track of nesting depth specifically for those. Single line comments, which give newline two totally different semantics based on context, that the parser has to account for. Comments can go absolutely anywhere, so you can never advance by simply incrementing your character pointer, regardless of the statement you're parsing. You have to do a complex test for the presence of comments and pass over them, and if you want to do any backtracking, you have to be able to skip backwards over them.
IMHO, you're thinking about this at the wrong level of abstraction. The first order of business, before you even think about parsing, should be to tokenize (or lex) the input. This is the stage where you break up the input into atomic chunks: keywords, parentheses, commas, identifiers, etc. Comments can simply be scanned once and discarded. The result of this process should be a stream of tokens consisting of the logical units of the language: keywords, punctuation, identifiers, numbers, literals, etc. At this stage you could optionally store, say, the last n tokens for some fixed n for backtracking purposes, and they don't have to be anything more than slices of the input string(s).

Once you have the token stream, then we can talk about parsing. Generally, for a language of D's complexity you'd want to use a parser generator like yacc or bison, or some kind of hand-crafted recursive descent parser with backtracking. But you're basically looking at an LALR(1) grammar (or, in D's case, maybe LALR(n)), meaning that you have to keep track of multiple possible parses as you scan the token stream, until you reach a synchronization point where it narrows down to one of the possibilities. IIRC, function declarations require backtracking because of the compile-time argument syntax, which makes certain constructs ambiguous until you've finished scanning at least the first argument list. This makes it more complicated for a human to implement, but nothing scary as far as machine-generated parsers go -- C++ is far more insane to parse, for example. (Some have quipped that C++ is a language that must be parsed before it can be lexed. :-P)

T

-- 
Кто везде - тот нигде.
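[A minimal sketch of that tokenize-first stage. The Token type, the kind names, and the lex function are invented for illustration, not libdparse's API; nesting /+ +/ comments are omitted to keep the sketch short.]

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

struct Token { string kind; string text; } // `text` is a slice of the input

// Break the input into identifiers/keywords, numbers, and punctuation;
// comments are scanned once here and discarded, so the parser never sees them.
Token[] lex(string src)
{
    Token[] toks;
    size_t i = 0;
    while (i < src.length)
    {
        if (isWhite(src[i])) { ++i; continue; }
        if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '/')
        {   // line comment: the newline terminates it
            while (i < src.length && src[i] != '\n') ++i;
            continue;
        }
        if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '*')
        {   // block comment: skip to the matching */
            i += 2;
            while (i + 1 < src.length && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            i += 2;
            continue;
        }
        auto start = i;
        if (isAlpha(src[i]) || src[i] == '_')
        {
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
            toks ~= Token("ident", src[start .. i]);
        }
        else if (isDigit(src[i]))
        {
            while (i < src.length && isDigit(src[i])) ++i;
            toks ~= Token("number", src[start .. i]);
        }
        else
        {
            ++i;
            toks ~= Token("punct", src[start .. i]);
        }
    }
    return toks;
}
```

With comments gone, a fragment like `int//) junk` lexes to just the `int` token; the confusing example at the top of the thread stops being confusing once it is a token stream.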
Jun 14 2016
parent reply Shachar Shemesh <shachar weka.io> writes:
On 15/06/16 08:27, H. S. Teoh via Digitalmars-d wrote:
 IMHO, you're thinking about this at the wrong level of abstraction.
I tend to agree.
 The first order of business, before you even think about parsing, should
 be to tokenize (or lex) the input. This is the stage where you break up
 the input into atomic chunks: keywords, parentheses, commas,
 identifiers, etc.. Comments can simply be scanned once and discarded.
No. The lexer is, typically, a regular language, and /+ +/ comments are not parsable using a regular language.

Shachar
Jun 15 2016
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15-Jun-2016 12:37, Shachar Shemesh wrote:
 On 15/06/16 08:27, H. S. Teoh via Digitalmars-d wrote:
 IMHO, you're thinking about this at the wrong level of abstraction.
I tend to agree.
 The first order of business, before you even think about parsing, should
 be to tokenize (or lex) the input. This is the stage where you break up
 the input into atomic chunks: keywords, parentheses, commas,
 identifiers, etc.. Comments can simply be scanned once and discarded.
No. The lexer is, typically, a regular language. /+ +/ comments are not parsable using a regular language.
Scanner can easily be non-regular.
 Shachar
-- Dmitry Olshansky
Jun 15 2016
parent deadalnix <deadalnix gmail.com> writes:
On Wednesday, 15 June 2016 at 19:20:51 UTC, Dmitry Olshansky 
wrote:
 Scanner can easily be non-regular.
Especially in that case, when the irregularity is as trivial as incrementing/decrementing a counter.
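[That counter is literally all the extra state needed. A hedged sketch; the function name and its contract are invented here, not taken from any real lexer:]

```d
// Skip a D-style /+ ... +/ comment starting at `i`, which must point at the
// opening "/+". Returns the index just past the matching "+/". The depth
// counter is the one bit of state a regular language cannot express.
size_t skipNestingComment(string src, size_t i)
{
    assert(i + 1 < src.length && src[i] == '/' && src[i + 1] == '+');
    int depth = 0;
    while (i + 1 < src.length)
    {
        if (src[i] == '/' && src[i + 1] == '+') { ++depth; i += 2; }
        else if (src[i] == '+' && src[i + 1] == '/')
        {
            --depth; i += 2;
            if (depth == 0) return i;
        }
        else ++i;
    }
    return src.length; // unterminated comment: consume the rest
}
```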
Jun 15 2016
prev sibling next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser
I suggest you start with a lexer. It'd enormously simplify the problem.
Jun 14 2016
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser, and this 
 happened.

 const(/*
 				D is kind of hard to parse. /*
 			/**/
 			int//) foo(T//
 			) foo(T//
 						)(T /* I mean,
 										 seriously
 										 */ bar)
 	if ( is (T == // how on earth do they do it?
 		 int) ) {
 		return
 			cast /+  where does the function name /+ even start? +/
 						+/
 			( const (int) )
 			bar//;}
 			;}
 			
 void main() {
 	import std.stdio;
 	writeln(foo(42));
 }

 I don't think I'm going to write a D parser.
After lexing you can remove all the tokComment tokens and everything becomes simple.

I had the same issue in Coedit, because it has an editor command that turns every version(none) into version(all). But version /*bla*/(/*bla*/none/*bla*/) is valid. A version is easy to parse, but only after removing all the comments ;) otherwise you have too complex a stack of tokens to analyze and some crazy branches, e.g.

if (tok[1].kind == tkVersion && tok[2].kind != tkComment && ...)

That's not sane. If you remove the comments, the sequence of lexical tokens to detect is always the same.
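[The two-step approach described here can be sketched like this; the token kinds and function names are made up for illustration and are not Coedit's actual code:]

```d
import std.algorithm : filter;
import std.array : array;

enum Kind { comment, versionKw, lparen, ident, rparen, other }
struct Tok { Kind kind; string text; }

// Step 1: drop the comment tokens from the lexed list.
Tok[] stripComments(Tok[] toks)
{
    return toks.filter!(t => t.kind != Kind.comment).array;
}

// Step 2: with comments gone, `version ( none )` is always the same fixed
// four-token sequence, no matter how many /*bla*/ were interleaved.
bool isVersionNone(Tok[] toks, size_t i)
{
    return i + 3 < toks.length
        && toks[i].kind == Kind.versionKw
        && toks[i + 1].kind == Kind.lparen
        && toks[i + 2].kind == Kind.ident && toks[i + 2].text == "none"
        && toks[i + 3].kind == Kind.rparen;
}
```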
Jun 14 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 04:59:59 UTC, Basile B. wrote:
 After lexing you can remove all the tokComment and everything 
 becomes simple.
Well, let's see. Using this libdparse thing, now I have a function declaration, with a name, parameters and a return type. Good so far. Now I want to modify the name, and output it along with the parameters and the return type. So, let's just serialize the return type then.

Oh, a Type has a list of type... constructors that are used somehow. And a list of type suffixes. Let's see, each type suffix can have a delegate, a star, and another Type recursively, plus expressions for low and high, and a list of member function attributes. Each member function attribute has an identifier (I think? IdType is secretly imported from somewhere) and an attribute. Okay, so attributes have an argument list, as well as maybe a template instance, and an identifier token. An argument list has a list of items, each one an expression. So there are 17 different types derived from Expression that start with the letter A alone. Understandable, considering that D allows arbitrary expressions in a type definition, and arbitrary means a-n-y-t-h-i-n-g.

So, assuming I can magically represent an expression, let's go back to finish up serializing the Type object. Oh... there's a Type2 object inside the Type object. It has another one of those mysterious IdTypes, a symbol (for something?), a typeof expression, an identifier, or something called a template chain, a type constructor, and... a recursive Type object inside it.

I think we have different definitions of the word "simple." D syntax is hella complex. (Either that or different definitions of the word "everything.")
Jun 14 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 15 June 2016 at 05:51:53 UTC, cy wrote:
 On Wednesday, 15 June 2016 at 04:59:59 UTC, Basile B. wrote:
 After lexing you can remove all the tokComment and everything 
 becomes simple.
Well, let's see. Using this libdparse thing, now I have a [...] I think we have different definitions of the word "simple." D syntax is hella complex. (Either that or different definitions of the word "everything.")
You're right, it's not so simple, and you're also right about "everything"; my "everything" was not used adequately... It depends on the grammatical construct you want to parse. But it's already much simpler when the comments are removed from the lexical token list.
Jun 15 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 07:16:31 UTC, Basile B. wrote:

 You're right it's not so simple and you're also right about 
 "everything", my "everything" is not used adequatly...
Sorry, I don't mean to complain. Actually the work has already all been done, rather elegantly in fact. If libdparse can get through a significant subset of D2 code, I have to say I'm pretty impressed with the project, and can't praise it enough.

https://github.com/Hackerpilot/libdparse // disclaimer: this link not endorsed by the hackerpilot org ltd

It already has a D formatter in it, which dumps (prettified!) D code to any sort of output range, and there's a case in it for every single kind of node in the AST. (Speaking of which, when are we getting static switch statements?)

What I meant by "D is not simple" isn't that I'm up a creek without a paddle, but that the paddle is really complex, and I'd have no hope of tackling it if it wasn't already done. The complexity of D's syntax is not so much a problem here as a spectacle.
 It depends on the grammatical construct you want to parse. But 
 it's already much more simple when the comments are removed 
 from the lexical token list.
I suppose. What's complicated is the shoving of expressions everywhere, since those spider out to all possible forms of construct. That means the difficulty of parsing does NOT depend on the grammatical construct you want to parse, except for a few, very minor constructs, only the ones that don't even *potentially* include expressions in their grammar.

So, regardless of what you're doing, you pretty much have to handle every single kind of construct, but if "handle" means "transform, then output" and you can separate those two steps, then if someone does all the output for you, the "transform" step can be very simple and specific. Not because you can remove the comment nodes, but because you can ignore ALL nodes that you're not interested in transforming.
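[The "handle one node kind, ignore the rest" split is the classic visitor default. A toy sketch; the AST here is invented and vastly smaller than libdparse's, though the `alias visit` overload-set idiom is the same one such visitors use:]

```d
// A toy AST: the base visitor's default visit methods just recurse,
// so a transformer only overrides the one node kind it cares about.
class Node
{
    Node[] children;
    void accept(Visitor v) { v.visit(this); }
}

class FuncDecl : Node
{
    string name;
    this(string n) { name = n; }
    override void accept(Visitor v) { v.visit(this); }
}

class Visitor
{
    void visit(Node n) { foreach (c; n.children) c.accept(this); }
    void visit(FuncDecl f) { visit(cast(Node) f); } // default: just recurse
}

// The "transform" step: rename functions, ignore everything else.
class Renamer : Visitor
{
    alias visit = Visitor.visit; // keep the base overload set visible
    override void visit(FuncDecl f)
    {
        f.name = "renamed_" ~ f.name;
        super.visit(f);
    }
}
```

If someone else (here, the base visitor; in practice, a library's formatter) already does the traversal and output, the transform really is that small.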
Jun 16 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 16 June 2016 at 17:20:39 UTC, cy wrote:
 On Wednesday, 15 June 2016 at 07:16:31 UTC, Basile B. wrote:

 You're right it's not so simple and you're also right about 
 "everything", my "everything" is not used adequatly...
Sorry, I don't mean to complain. Actually the work has already all been done, rather elegantly in fact. If libdparse can get through a significant subset of D2 code, I have to say I'm pretty impressed with the project, and can't praise it enough. https://github.com/Hackerpilot/libdparse // disclaimer: this link not endorsed by the hackerpilot org ltd It already has a D formatter in it, which dumps (prettified!) D code to any sort of output range, and there's a case in it for every single kind of node in the AST.
Yes, libdparse is the reference, and when someone has to parse D code he really should use it. Among all the D libraries it's the one I know best. I use it to build CE's symbol list (it's an AST visitor) and to detect the "TODO comments" ;)

But sometimes it's too much (I speak for myself here), for example if you need to parse only simple constructs. In CE the **only** constructs that are parsed directly in the IDE (the two other cases mentioned previously are done in external tools) are ModuleDeclaration and VersionCondition. For them libdparse is not mandatory; they can be detected by hand in the token list.
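[The by-hand ModuleDeclaration detection mentioned here might look like this over a comment-stripped token list; the kinds and names are invented for the sketch, not CE's code:]

```d
enum Kind { moduleKw, ident, dot, semicolon, other }
struct Tok { Kind kind; string text; }

// After comments are stripped, a module declaration is just
// `module` ident (`.` ident)* `;` at the start of the token list.
// Returns the dotted module name, or null if there is none.
string moduleName(Tok[] toks)
{
    if (toks.length < 3 || toks[0].kind != Kind.moduleKw) return null;
    string name;
    size_t i = 1;
    while (i + 1 < toks.length && toks[i].kind == Kind.ident)
    {
        name ~= toks[i].text;
        ++i;
        if (toks[i].kind == Kind.semicolon) return name;
        if (toks[i].kind != Kind.dot) return null;
        name ~= ".";
        ++i;
    }
    return null;
}
```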
 (speaking of which, when are we getting static switch 
 statements?)

 What I meant by "D is not simple" isn't that I'm up a creek, 
 without a paddle, but that the paddle is really complex, and 
 I'd have no hope of tackling it if it wasn't already done. The 
 complexity of D's syntax is not so much a problem here, as a 
 spectacle.

 It depends on the grammatical construct you want to parse. But 
 it's already much more simple when the comments are removed 
 from the lexical token list.
I suppose. What's complicated is the shoving of expressions everywhere, since those spider out to all possible forms of construct. That means the difficulty of parsing does NOT depend on the grammatical construct you want to parse, except for a few, very minor constructs, only the ones that don't even *potentially* include expressions in their grammar. So, regardless of what you're doing, you pretty much have to handle every single kind of construct,
No, simple constructs can be detected in a token list. But if I understand correctly, you've started the topic because you wished to detect a FunctionDeclaration, right? Obviously here you need the AST. Function declarations can be disabled via a VersionCondition or enabled by a static if, injected by a mixin template, injected by a string... They cannot be accurately detected by picking 4 or 5 tokens in a list.
 but if "handle" means "transform, then output" and you can 
 separate those two steps, then if someone does all the output 
 for you, the "transform" step can be very simple and specific. 
 Not because you can remove the comment nodes, but because you 
 can ignore ALL nodes that you're not interested in transforming.
Jun 17 2016
parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Fri, 2016-06-17 at 08:07 +0000, Basile B. via Digitalmars-d wrote:

[…]
 Yes, libdparse is the reference and when someone has to parse D 
 code he really should use it. Among all the D libraries it's the 
 one I know the more. I use it to build the CE's symbol list (it's 
 an AST visitor) and to detect the "TODO comments" ;)
[…]

One can only wonder why this is not the reference parser, the one that DMD and all other tools can then use.

-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jun 17 2016