
digitalmars.D - Parsing D Maybe Not Such a Good Idea <_<;

reply cy <dlang verge.info.tm> writes:
So I was toying with the idea of writing a D parser, and this 
happened.

const(/*
				D is kind of hard to parse. /*
			/**/
			int//) foo(T//
			) foo(T//
						)(T /* I mean,
										 seriously
										 */ bar)
	if ( is (T == // how on earth do they do it?
		 int) ) {
		return
			cast /+  where does the function name /+ even start? +/
						+/
			( const (int) )
			bar//;}
			;}
			
void main() {
	import std.stdio;
	writeln(foo(42));
}

I don't think I'm going to write a D parser.
Jun 14 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser, and this 
 happened.

 const(/*
 				D is kind of hard to parse. /*
 			/**/
 			int//) foo(T//
 			) foo(T//
 						)(T /* I mean,
 										 seriously
 										 */ bar)
 	if ( is (T == // how on earth do they do it?
 		 int) ) {
 		return
 			cast /+  where does the function name /+ even start? +/
 						+/
 			( const (int) )
 			bar//;}
 			;}
 			
 void main() {
 	import std.stdio;
 	writeln(foo(42));
 }

 I don't think I'm going to write a D parser.
That's hard to read for a human, but there is nothing in that example that is especially difficult for a machine to muddle through.
Jun 14 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 04:25:15 UTC, deadalnix wrote:
 there is nothing in that example that is especially difficult 
 for a machine to muddle through.
Nested comments: you have to keep track of nesting depth specifically for those. Single-line comments give the newline two totally different semantics depending on context, which the parser has to account for.

Comments can go absolutely anywhere, so you can never advance by simply incrementing your character pointer, regardless of the statement you're parsing. You have to test for the presence of comments and pass over them, and if you want to do any backtracking, you have to be able to skip backwards over them too.

Type definitions can have parentheses in them, so the parser can't search for '(' either forwards or backwards to find the beginning of a function definition. It has to analyze each parenthetical grouping in its own context, taking into account all the possible keywords, special phrases like extern, properties, and template arguments, both for parameters and return values. And that's just to parse function declarations!
Jun 14 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Wed, Jun 15, 2016 at 04:40:07AM +0000, cy via Digitalmars-d wrote:
 On Wednesday, 15 June 2016 at 04:25:15 UTC, deadalnix wrote:
 there is nothing in that example that is especially difficult for a
 machine to muddle through.
Nested comments, have to keep track of nesting depth specifically for those. Single line comments, which give newline two totally different semantics based on context, that the parser has to account for. Comments can go absolutely anywhere, so you can never advance by simply incrementing your character pointer, regardless of the statement you're parsing. You have to do a complex test for the presence of comments and pass over them, and if you want to do any backtracking, you have to be able to skip backwards over them.
IMHO, you're thinking about this at the wrong level of abstraction. The first order of business, before you even think about parsing, should be to tokenize (or lex) the input. This is the stage where you break up the input into atomic chunks: keywords, parentheses, commas, identifiers, etc. Comments can simply be scanned once and discarded. The result of this process should be a stream of tokens consisting of the logical units of the language: keywords, punctuation, identifiers, numbers, literals, etc. At this stage you could optionally store, say, the last n tokens for some fixed n for backtracking purposes, and they don't have to be anything more than slices of the input string(s).

Once you have the token stream, then we can talk about parsing. Generally, for a language of D's complexity you'd want to use a parser generator like yacc or bison, or some kind of hand-crafted recursive descent parser with backtracking. But you're basically looking at an LALR(1) grammar (or, in D's case, maybe LALR(n)), meaning that you have to keep track of multiple possible parses as you scan the token stream, until you reach a synchronization point where it narrows down to one of the possibilities. IIRC, function declarations require backtracking because of the compile-time argument syntax, which makes certain constructs ambiguous until you've finished scanning at least the first argument list. This makes it more complicated for a human to implement, but nothing scary as far as machine-generated parsers go -- C++ is far more insane to parse, for example. (Some have quipped that C++ is a language that must be parsed before it can be lexed. :-P)

T

-- 
Кто везде - тот нигде.
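[A minimal sketch of that tokenize-first stage. The Token type, the kind names, and the lex function are invented for illustration, not libdparse's API; nesting /+ +/ comments are omitted to keep the sketch short.]

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

struct Token { string kind; string text; } // `text` is a slice of the input

// Break the input into identifiers/keywords, numbers, and punctuation;
// comments are scanned once here and discarded, so the parser never sees them.
Token[] lex(string src)
{
    Token[] toks;
    size_t i = 0;
    while (i < src.length)
    {
        if (isWhite(src[i])) { ++i; continue; }
        if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '/')
        {   // line comment: the newline terminates it
            while (i < src.length && src[i] != '\n') ++i;
            continue;
        }
        if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '*')
        {   // block comment: skip to the matching */
            i += 2;
            while (i + 1 < src.length && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            i += 2;
            continue;
        }
        auto start = i;
        if (isAlpha(src[i]) || src[i] == '_')
        {
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
            toks ~= Token("ident", src[start .. i]);
        }
        else if (isDigit(src[i]))
        {
            while (i < src.length && isDigit(src[i])) ++i;
            toks ~= Token("number", src[start .. i]);
        }
        else
        {
            ++i;
            toks ~= Token("punct", src[start .. i]);
        }
    }
    return toks;
}
```

With comments gone, a fragment like `int//) junk` lexes to just the `int` token; the confusing example at the top of the thread stops being confusing once it is a token stream.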
Jun 14 2016
parent reply Shachar Shemesh <shachar weka.io> writes:
On 15/06/16 08:27, H. S. Teoh via Digitalmars-d wrote:
 IMHO, you're thinking about this at the wrong level of abstraction.
I tend to agree.
 The first order of business, before you even think about parsing, should
 be to tokenize (or lex) the input. This is the stage where you break up
 the input into atomic chunks: keywords, parentheses, commas,
 identifiers, etc.. Comments can simply be scanned once and discarded.
No. The lexer is, typically, a regular language, and /+ +/ comments are not parsable using a regular language.

Shachar
Jun 15 2016
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15-Jun-2016 12:37, Shachar Shemesh wrote:
 On 15/06/16 08:27, H. S. Teoh via Digitalmars-d wrote:
 IMHO, you're thinking about this at the wrong level of abstraction.
I tend to agree.
 The first order of business, before you even think about parsing, should
 be to tokenize (or lex) the input. This is the stage where you break up
 the input into atomic chunks: keywords, parentheses, commas,
 identifiers, etc.. Comments can simply be scanned once and discarded.
No. The lexer is, typically, a regular language. /+ +/ comments are not parsable using a regular language.
Scanner can easily be non-regular.
 Shachar
-- Dmitry Olshansky
Jun 15 2016
parent deadalnix <deadalnix gmail.com> writes:
On Wednesday, 15 June 2016 at 19:20:51 UTC, Dmitry Olshansky 
wrote:
 Scanner can easily be non-regular.
Especially in that case, when the irregularity is as trivial as incrementing/decrementing a counter.
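[That counter is literally all the extra state needed. A hedged sketch; the function name and its contract are invented here, not taken from any real lexer:]

```d
// Skip a D-style /+ ... +/ comment starting at `i`, which must point at the
// opening "/+". Returns the index just past the matching "+/". The depth
// counter is the one bit of state a regular language cannot express.
size_t skipNestingComment(string src, size_t i)
{
    assert(i + 1 < src.length && src[i] == '/' && src[i + 1] == '+');
    int depth = 0;
    while (i + 1 < src.length)
    {
        if (src[i] == '/' && src[i + 1] == '+') { ++depth; i += 2; }
        else if (src[i] == '+' && src[i + 1] == '/')
        {
            --depth; i += 2;
            if (depth == 0) return i;
        }
        else ++i;
    }
    return src.length; // unterminated comment: consume the rest
}
```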
Jun 15 2016
prev sibling next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser
I suggest you start with a lexer. It'd enormously simplify the problem.
Jun 14 2016
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 15 June 2016 at 03:59:43 UTC, cy wrote:
 So I was toying with the idea of writing a D parser, and this 
 happened.

 const(/*
 				D is kind of hard to parse. /*
 			/**/
 			int//) foo(T//
 			) foo(T//
 						)(T /* I mean,
 										 seriously
 										 */ bar)
 	if ( is (T == // how on earth do they do it?
 		 int) ) {
 		return
 			cast /+  where does the function name /+ even start? +/
 						+/
 			( const (int) )
 			bar//;}
 			;}
 			
 void main() {
 	import std.stdio;
 	writeln(foo(42));
 }

 I don't think I'm going to write a D parser.
After lexing you can remove all the tokComment tokens and everything becomes simple.

I had the same issue in Coedit, because it has an editor command that turns every version(none) into version(all). But version /*bla*/(/*bla*/none/*bla*/) is valid. A version is easy to parse, but only after removing all the comments ;) otherwise you have too complex a stack of tokens to analyze and some crazy branches, e.g.

if (tok[1].kind == tkVersion && tok[2].kind != tkComment && ...)

That's not sane. If you remove the comments, the sequence of lexical tokens to detect is always the same.
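[The two-step approach described here can be sketched like this; the token kinds and function names are made up for illustration and are not Coedit's actual code:]

```d
import std.algorithm : filter;
import std.array : array;

enum Kind { comment, versionKw, lparen, ident, rparen, other }
struct Tok { Kind kind; string text; }

// Step 1: drop the comment tokens from the lexed list.
Tok[] stripComments(Tok[] toks)
{
    return toks.filter!(t => t.kind != Kind.comment).array;
}

// Step 2: with comments gone, `version ( none )` is always the same fixed
// four-token sequence, no matter how many /*bla*/ were interleaved.
bool isVersionNone(Tok[] toks, size_t i)
{
    return i + 3 < toks.length
        && toks[i].kind == Kind.versionKw
        && toks[i + 1].kind == Kind.lparen
        && toks[i + 2].kind == Kind.ident && toks[i + 2].text == "none"
        && toks[i + 3].kind == Kind.rparen;
}
```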
Jun 14 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 04:59:59 UTC, Basile B. wrote:
 After lexing you can remove all the tokComment and everything 
 becomes simple.
Well, let's see. Using this libdparse thing, now I have a function declaration, with a name, parameters and a return type. Good so far. Now I want to modify the name, and output it along with the parameters and the return type. So, let's just serialize the return type then.

Oh, a Type has a list of type... constructors that are used somehow. And a list of type suffixes. Let's see, each type suffix can have a delegate, a star, and another Type recursively, plus expressions for low and high, and a list of member function attributes. Each member function attribute has an identifier (I think? IdType is secretly imported from somewhere) and an attribute. Okay, so attributes have an argument list, as well as maybe a template instance, and an identifier token. An argument list has a list of items, each one an expression. So there are 17 different types derived from Expression that start with the letter A alone. Understandable, considering that D allows arbitrary expressions in a type definition, and arbitrary means a-n-y-t-h-i-n-g.

So, assuming I can magically represent an expression, let's go back to finish up serializing the Type object. Oh... there's a Type2 object inside the Type object. It has another one of those mysterious IdTypes, a symbol (for something?), a typeof expression, an identifier, or something called a template chain, a type constructor, and... a recursive Type object inside it.

I think we have different definitions of the word "simple." D syntax is hella complex. (Either that or different definitions of the word "everything.")
Jun 14 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 15 June 2016 at 05:51:53 UTC, cy wrote:
 On Wednesday, 15 June 2016 at 04:59:59 UTC, Basile B. wrote:
 After lexing you can remove all the tokComment and everything 
 becomes simple.
Well, let's see. Using this libdparse thing, now I have a [...] I think we have different definitions of the word "simple." D syntax is hella complex. (Either that or different definitions of the word "everything.")
You're right, it's not so simple, and you're also right about "everything"; my "everything" was not used adequately... It depends on the grammatical construct you want to parse. But it's already much simpler when the comments are removed from the lexical token list.
Jun 15 2016
parent reply cy <dlang verge.info.tm> writes:
On Wednesday, 15 June 2016 at 07:16:31 UTC, Basile B. wrote:

 You're right it's not so simple and you're also right about 
 "everything", my "everything" is not used adequatly...
Sorry, I don't mean to complain. Actually the work has already all been done, rather elegantly in fact. If libdparse can get through a significant subset of D2 code, I have to say I'm pretty impressed with the project, and can't praise it enough.

https://github.com/Hackerpilot/libdparse // disclaimer: this link not endorsed by the hackerpilot org ltd

It already has a D formatter in it, which dumps (prettified!) D code to any sort of output range, and there's a case in it for every single kind of node in the AST. (Speaking of which, when are we getting static switch statements?)

What I meant by "D is not simple" isn't that I'm up a creek without a paddle, but that the paddle is really complex, and I'd have no hope of tackling it if it wasn't already done. The complexity of D's syntax is not so much a problem here as a spectacle.
 It depends on the grammatical construct you want to parse. But 
 it's already much more simple when the comments are removed 
 from the lexical token list.
I suppose. What's complicated is the shoving of expressions everywhere, since those spider out to all possible forms of construct. That means the difficulty of parsing does NOT depend on the grammatical construct you want to parse, except for a few, very minor constructs, only the ones that don't even *potentially* include expressions in their grammar.

So, regardless of what you're doing, you pretty much have to handle every single kind of construct, but if "handle" means "transform, then output" and you can separate those two steps, then if someone does all the output for you, the "transform" step can be very simple and specific. Not because you can remove the comment nodes, but because you can ignore ALL nodes that you're not interested in transforming.
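[The "handle one node kind, ignore the rest" split is the classic visitor default. A toy sketch; the AST here is invented and vastly smaller than libdparse's, though the `alias visit` overload-set idiom is the same one such visitors use:]

```d
// A toy AST: the base visitor's default visit methods just recurse,
// so a transformer only overrides the one node kind it cares about.
class Node
{
    Node[] children;
    void accept(Visitor v) { v.visit(this); }
}

class FuncDecl : Node
{
    string name;
    this(string n) { name = n; }
    override void accept(Visitor v) { v.visit(this); }
}

class Visitor
{
    void visit(Node n) { foreach (c; n.children) c.accept(this); }
    void visit(FuncDecl f) { visit(cast(Node) f); } // default: just recurse
}

// The "transform" step: rename functions, ignore everything else.
class Renamer : Visitor
{
    alias visit = Visitor.visit; // keep the base overload set visible
    override void visit(FuncDecl f)
    {
        f.name = "renamed_" ~ f.name;
        super.visit(f);
    }
}
```

If someone else (here, the base visitor; in practice, a library's formatter) already does the traversal and output, the transform really is that small.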
Jun 16 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 16 June 2016 at 17:20:39 UTC, cy wrote:
 On Wednesday, 15 June 2016 at 07:16:31 UTC, Basile B. wrote:

 You're right it's not so simple and you're also right about 
 "everything", my "everything" is not used adequatly...
Sorry, I don't mean to complain. Actually the work has already all been done, rather elegantly in fact. If libdparse can get through a significant subset of D2 code, I have to say I'm pretty impressed with the project, and can't praise it enough. https://github.com/Hackerpilot/libdparse // disclaimer: this link not endorsed by the hackerpilot org ltd It already has a D formatter in it, which dumps (prettified!) D code to any sort of output range, and there's a case in it for every single kind of node in the AST.
Yes, libdparse is the reference, and when someone has to parse D code he really should use it. Among all the D libraries it's the one I know best. I use it to build CE's symbol list (it's an AST visitor) and to detect the "TODO comments" ;)

But sometimes it's too much (I speak for myself here), for example if you need to parse only simple constructs. In CE the **only** constructs that are parsed directly in the IDE (the two other cases mentioned previously are done in external tools) are ModuleDeclaration and VersionCondition. For them libdparse is not mandatory; they can be detected by hand in the token list.
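[The by-hand ModuleDeclaration detection mentioned here might look like this over a comment-stripped token list; the kinds and names are invented for the sketch, not CE's code:]

```d
enum Kind { moduleKw, ident, dot, semicolon, other }
struct Tok { Kind kind; string text; }

// After comments are stripped, a module declaration is just
// `module` ident (`.` ident)* `;` at the start of the token list.
// Returns the dotted module name, or null if there is none.
string moduleName(Tok[] toks)
{
    if (toks.length < 3 || toks[0].kind != Kind.moduleKw) return null;
    string name;
    size_t i = 1;
    while (i + 1 < toks.length && toks[i].kind == Kind.ident)
    {
        name ~= toks[i].text;
        ++i;
        if (toks[i].kind == Kind.semicolon) return name;
        if (toks[i].kind != Kind.dot) return null;
        name ~= ".";
        ++i;
    }
    return null;
}
```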
 (speaking of which, when are we getting static switch 
 statements?)

 What I meant by "D is not simple" isn't that I'm up a creek, 
 without a paddle, but that the paddle is really complex, and 
 I'd have no hope of tackling it if it wasn't already done. The 
 complexity of D's syntax is not so much a problem here, as a 
 spectacle.

 It depends on the grammatical construct you want to parse. But 
 it's already much more simple when the comments are removed 
 from the lexical token list.
I suppose. What's complicated is the shoving of expressions everywhere, since those spider out to all possible forms of construct. That means the difficulty of parsing does NOT depend on the grammatical construct you want to parse, except for a few, very minor constructs, only the ones that don't even *potentially* include expressions in their grammar. So, regardless of what you're doing, you pretty much have to handle every single kind of construct,
No, simple constructs can be detected in a token list. But if I understand correctly, you've started the topic because you wished to detect a FunctionDeclaration, right? Obviously here you need the AST. Function declarations can be disabled via a VersionCondition or enabled by a static if, injected by a mixin template, injected by a string... They cannot be accurately detected by picking 4 or 5 tokens in a list.
 but if "handle" means "transform, then output" and you can 
 separate those two steps, then if someone does all the output 
 for you, the "transform" step can be very simple and specific. 
 Not because you can remove the comment nodes, but because you 
 can ignore ALL nodes that you're not interested in transforming.
Jun 17 2016
parent Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Fri, 2016-06-17 at 08:07 +0000, Basile B. via Digitalmars-d wrote:

[…]
 Yes, libdparse is the reference and when someone has to parse D 
 code he really should use it. Among all the D libraries it's the 
 one I know the more. I use it to build the CE's symbol list (it's 
 an AST visitor) and to detect the "TODO comments" ;)
[…]

One can only wonder why this is not the reference parser, the one that DMD and all other tools can then use.

-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Jun 17 2016