
digitalmars.D - [spec] Phases of translation

reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
Hi,

In the introduction there is a section named 'Phases of 
Compilation'. Following the convention used in the C standard, I am 
calling this 'Translation Phases'; translation is a more generic 
term too. See https://github.com/dlang/dlang.org/pull/2646/files 
for some initial rewording.

The recent post 
https://forum.dlang.org/post/xnoeazcqnysodqnjphhc forum.dlang.org 
got me thinking that actually we should expand the 'Translation 
Phases' section to describe more formally how D code is parsed. I 
had a quick look at the D parsing code. It seems that following 
an initial parse - presumably this constructs the syntax tree - 
there are three deferred semantic parses. Are any details 
available on what actually happens in each pass? I can read the 
code of course but it would be helpful to get a high level 
picture.

Thanks and Regards
Dibyendu
May 19 2019
next sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Sunday, 19 May 2019 at 10:07:37 UTC, Dibyendu Majumdar wrote:
 I had a quick look at the D parsing code. It seems that 
 following an initial parse - presumably this constructs the 
 syntax tree - there are three deferred semantic parses. Are any 
 details available on what actually happens in each pass? I can 
 read the code of course but it would be helpful to get a high 
 level picture.

 Thanks and Regards
 Dibyendu
Walter Bright's DConf 2016 talk, "Spelunking D Compiler Internals", is a good resource for this: https://www.youtube.com/watch?v=l_96Crl998E
May 19 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Sunday, 19 May 2019 at 11:39:02 UTC, Paul Backus wrote:
 Walter Bright's DConf 2016 talk, "Spelunking D Compiler 
 Internals", is a good resource for this:

 https://www.youtube.com/watch?v=l_96Crl998E
Thank you - I watched it again just now. It was very helpful. I still need to understand more about the semantic analysis phase, but I don't yet know what questions to ask.

I guess to start with, one point seems to be that after the syntax analysis phase we simply have an abstract syntax tree that corresponds one-to-one with the input source - is that correct? So at this stage nothing more is known about the program - the input is merely in a different format that is more amenable to further analysis.

Secondly - I assume that the syntax tree "lowering" (or simplification) is done prior to the semantic analysis?

Thanks and Regards
Dibyendu
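To make sure I'm using the word the same way - by "lowering" I mean rewrites like the following (a sketch of my own, assuming scope(exit) is indeed rewritten into try/finally):

import std.stdio;

void main()
{
    // scope(exit) is, as I understand it, lowered into a try/finally
    scope(exit) writeln("cleanup");
    writeln("work");
    // i.e. it behaves as if written:
    // try { writeln("work"); } finally { writeln("cleanup"); }
}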
May 19 2019
parent reply Max Haughton <maxhaton gmail.com> writes:
On Sunday, 19 May 2019 at 14:26:34 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 11:39:02 UTC, Paul Backus wrote:
 [...]
 Thank you - I watched it again just now. It was very helpful. I still need to understand more about the semantic analysis phase, but I don't yet know what questions to ask.

 I guess to start with, one point seems to be that after the syntax analysis phase we simply have an abstract syntax tree that corresponds one-to-one with the input source - is that correct? So at this stage nothing more is known about the program - the input is merely in a different format that is more amenable to further analysis.

 Secondly - I assume that the syntax tree "lowering" (or simplification) is done prior to the semantic analysis?

 Thanks and Regards
 Dibyendu
Only use the code to work out what the current behaviour is: the specification needs to be totally independent of the current implementation (and its ideology).
May 19 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).
Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

Regards
May 19 2019
next sibling parent Basile B. <b2.temp gmx.com> writes:
On Sunday, 19 May 2019 at 16:52:10 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).

 Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

 Regards
The current compilation model of the front end is said to be "lazy", but an eager compilation would produce the same program without changing the semantics. I think that this is, for example, what Max Haughton meant. Saying that the semantic analysis is multi-pass due to certain language features, such as string mixins, template mixins, mutually dependent declarations, etc., would be enough. It just needs to be clear that a single pass, following the lexical order top to bottom, would not work at all.
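For concreteness, a minimal sketch (my own example, nothing normative) of why lexical order alone cannot work:

// A single pass in lexical order could not compile this module:
// `fun` calls `gun`, which is only declared later, and the string
// mixin only becomes a declaration during semantic analysis.
int fun() { return gun() + 1; }    // forward reference to gun
int gun() { return 41; }

mixin("immutable int x = fun();"); // a declaration produced from a string

void main()
{
    assert(x == 42);
}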
May 19 2019
prev sibling parent reply Max Haughton <maxhaton gmail.com> writes:
On Sunday, 19 May 2019 at 16:52:10 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).

 Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

 Regards
You're right, that is definitely an issue with the current specification. One of the really annoying issues is that dmd is understandable locally (e.g. I was able to add a little feature without looking much up), but the global structure of the code is quite disjoint, e.g. lots of the analysis code is lumped together and is fairly difficult to grok without watching a debugger go through it (unless you already know it).
May 20 2019
parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Monday, 20 May 2019 at 20:22:24 UTC, Max Haughton wrote:
 You're right, that is definitely an issue with the current 
 specification. One of the really annoying issues is that dmd is 
 understandable locally (e.g. I was able to add a little feature 
 without looking much up), but the global structure of the code 
 is quite disjoint, e.g. lots of the analysis code is lumped 
 together and is fairly difficult to grok without watching a 
 debugger go through it (unless you already know it).
As a newcomer here, I agree. After the parsing stage, things start to get a little bit unclear. While I agree that the spec is not supposed to be a tutorial, well, there isn't any tutorial either. I'm not suggesting making the spec a tutorial, but making some tutorial.

Compilers are of great interest to me and I think that DMD is great study material if you want to look at a non-trivial compiler. But at the same time, I think it is not attractive for new users. It may be that I'm not a compiler expert, but I think it's not easy for most people. The source code guide [1] did not help much; it was way too high-level for me to understand any important parts. I think that the video referenced above [2] was way more helpful.

In that video, you said that you could talk for a month about the compiler. Well, I would be glad to do that for you, with some help. :) After GSoC, I was planning to start diving into the compiler and writing about it. But I may write a lot of incorrect stuff. So, if any experienced compiler dev wants to help / review those writings, I would be very happy. And I think it would help other not-compiler-jedis understand it.

[1] https://wiki.dlang.org/DMD_Source_Guide
[2] https://www.youtube.com/watch?v=l_96Crl998E
May 20 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/19/2019 3:07 AM, Dibyendu Majumdar wrote:
 Hi,

 In the introduction there is a section named 'Phases of Compilation'.
 Following the convention used in the C standard, I am calling this
 'Translation Phases'; translation is a more generic term too. See
 https://github.com/dlang/dlang.org/pull/2646/files for some initial
 rewording.

 The recent post
 https://forum.dlang.org/post/xnoeazcqnysodqnjphhc forum.dlang.org
 got me thinking that actually we should expand the 'Translation
 Phases' section to describe more formally how D code is parsed. I had
 a quick look at the D parsing code. It seems that following an initial
 parse - presumably this constructs the syntax tree - there are three
 deferred semantic parses. Are any details available on what actually
 happens in each pass? I can read the code of course but it would be
 helpful to get a high level picture.
The Translation Phases are conceptual; the compiler may not actually do them that way, it is just supposed to behave "as if" they were done that way. For example, the Digital Mars C/C++ compilers merge several of the translation phases.
May 20 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 07:34:43 UTC, Walter Bright wrote:
 The Translation Phases are conceptual, the compiler may not 
 actually do them that way, it's just supposed to be "as if" 
 they were done that way.
Okay, thank you. From the spec point of view, I guess the important thing is to clarify any constraints this places on D. One constraint, obviously, is that parsing into a syntax tree requires arbitrary lookahead of tokens. Are there other constraints that need to be highlighted?

Regards
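To illustrate the lookahead point, a small sketch of my own (the exact decision procedure inside dmd is an assumption on my part):

struct S(T) { }

void main()
{
    // Only after scanning past `!(int)*[4]*` and reaching `d;` can the
    // parser tell this statement is a declaration, not an expression.
    S!(int)*[4]* d;
    assert(d is null);

    int a = 2, b = 3;
    int c = a * b;    // unambiguous: multiplication inside an initializer
    assert(c == 6);
}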
May 20 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/20/2019 4:51 AM, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 07:34:43 UTC, Walter Bright wrote:
 The Translation Phases are conceptual, the compiler may not actually do them 
 that way, it's just supposed to be "as if" they were done that way.
 Okay, thank you. From the spec point of view, I guess the important thing is to clarify any constraints this places on D. One constraint, obviously, is that parsing into a syntax tree requires arbitrary lookahead of tokens. Are there other constraints that need to be highlighted?
The spec isn't a tutorial, in that the consequences don't necessarily need to be spelled out.

The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis. I.e. the definition of a token does not change depending on what construct is being parsed, and the AST generated by the parser can be created without doing any semantic analysis (unlike C++).

These consequences fall out of the rest of the spec, hence they should be more of a clarification in the introduction. The idea is to head off attempts to add changes to D that introduce dependencies. Such proposals do crop up from time to time, for example, user-defined operator tokens.
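One way to see the split in practice (an illustrative sketch, not from the spec):

template NeverInstantiated(T)
{
    // `undefinedSymbol` exists nowhere, yet this parses fine: a template
    // body only needs to be syntactically valid until it is instantiated.
    auto f() { return undefinedSymbol + 1; }
}

void main() { }   // compiles: the body above is parsed but never analyzed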
May 20 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is independent 
 of parsing, and parsing is independent of semantic analysis.
I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
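To pin down the construct I mean, a tiny sketch (`decl` and `answer` are names of my own invention):

enum decl = "immutable int answer = 42;";
mixin(decl);                   // the string is lexed and parsed during semantics
static assert(answer == 42);   // afterwards it behaves like any declaration

void main() { }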
 I.e. the definition of a token does not change depending on 
 what construct is being parsed, and the AST generated by the 
 parser can be created without doing any semantic analysis 
 (unlike C++).

 These consequences fall out of the rest of the spec, hence they 
 should be more of a clarification in the introduction. The idea 
 is to head off attempts to add changes to D that introduce 
 dependencies. Such proposals do crop up from time to time, for 
 example, user-defined operator tokens.
I agree - hence I think we need to be explicit about what D requires of each phase. That way any change to the language can be subjected to a test: does it break any of the fundamental requirements for parsing or semantic analysis, etc.?

Thanks and Regards
Dibyendu
May 20 2019
next sibling parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of semantic 
 analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
For the first part, "tokenizing is independent of parsing": one way to think of it is that to write a lexer (the sub-program that does the tokenizing), you don't need a parser. You could write it as an independent program. That makes sense: to split any token from the input ('int', '+', '123', etc.) you don't ever need to build any AST. Put differently, you don't need to know any info about the structure of the program.

For the second part, a parser can build an AST (or more correctly, can decide which grammatical rule the input "falls" into) just by the kinds of the tokens (e.g. you have a number token, like 123, then '+', so you must be parsing a binary expression).

I think it was mentioned somewhere that C++ parsing is _dependent_ on the semantic analysis. To understand this better, think of semantic analysis as the act of giving meaning to the entities of the program (hence the term semantic). For example, if I write:

int a;

well, 'a' is a token whose kind we could say is an identifier, but it doesn't have any meaning. Knowing that 'a' is a variable with type 'int' does give us more info. We can now know whether it can be part of this expression, for example: a + 1. If 'a' was a string, that would be invalid, and we found the error because we knew the meaning of 'a'.

Now, what is an example of parsing being dependent on the meaning of tokens? Well, C (and C++) of course. Consider this expression:

(a)*b

If 'a' is a variable, then that is really this: a*b. But if I have done this:

typedef int a;

somewhere above, then now this expression is suddenly "cast the dereference of 'b' to the type 'a' (which is int in this case)". This is known as the lexer hack [1] (actually, the lexer hack is the solution to the problem). The important thing to understand here is that we can't parse the expression just by knowing what kind of token 'a' is. We have to know additional info about it, info that is provided by the semantic analysis (in the form of the symbol table, a table which contains info about the symbols of the program). Such grammars are known as context-sensitive (i.e. not context-free) because you have to have some context to deduce the grammatical rule. Note that now suddenly there is no clear distinction between the parsing phase and the semantic phase.

Last but not least, phases being clearly separated has other important implications beyond comprehensibility. For example, the compiler is parallelized more easily: while the lexer tokenizes file A, the parser could be parsing file B while semantic analysis runs on file C.

[1] https://en.wikipedia.org/wiki/The_lexer_hack
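For contrast, here is a small sketch (my own illustration) of how D side-steps this: as far as I can tell, the statement form `a * b;` is resolved purely syntactically, so the parser never needs a symbol table.

struct a { }

void main()
{
    a * b;           // always parsed as a declaration: b has type a*
    assert(b is null);

    int x = 3, y = 4;
    int z = x * y;   // multiplication: not in statement-initial position
    assert(z == 12);
}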
May 20 2019
prev sibling parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of semantic 
 analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

 For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
Currently the Intro says:

 The process of compiling is divided into multiple phases. Each phase
 has no dependence on subsequent phases. For example, the scanner is
 not perturbed by the semantic analyzer. This separation of the passes
 makes language tools like syntax directed editors relatively easy to
 produce. It also is possible to compress D source by storing it in
 ‘tokenized’ form.

I feel this description is unclear, and it might just reflect how DMD is implemented. I haven't implemented a C++ parser, but in the parsers I have worked with - such as for C - it is always the case that the lexer doesn't get impacted by the semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.

For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.

Thanks and Regards
Dibyendu
May 21 2019
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
 It is also conceivable that someone could create a "dumb" AST 
 first for C++, and as a subsequent phase add semantic meaning 
 to the AST, just as is done for D.
AFAIK, you should be able to parse C++ with a GLR parser.

Ola.
May 21 2019
prev sibling parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of 
 semantic analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

 For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
 [...]

 For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.
Hi Walter, please would you share any insights regarding the above?

Thanks and Regards
Dibyendu
Jun 01 2019
parent Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Saturday, 1 June 2019 at 13:34:07 UTC, Dibyendu Majumdar wrote:
 On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar 
 wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of 
 semantic analysis.
 [...]
 Hi Walter, please would you share any insights regarding the above?
Hi, I am still waiting to hear your views on the above. The pull request I submitted for revising the intro is stuck because of this.

Regards
Jun 10 2019