
digitalmars.D - [spec] Phases of translation

reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
Hi,

In the introduction there is a section named 'Phases of 
Compilation'. Following the convention used in the C standard, I am 
calling this 'Translation Phases'; translation is a more generic 
term too. See https://github.com/dlang/dlang.org/pull/2646/files 
for some initial rewording.

The recent post 
https://forum.dlang.org/post/xnoeazcqnysodqnjphhc forum.dlang.org 
got me thinking that actually we should expand the 'Translation 
Phases' section to describe more formally how D code is parsed. I 
had a quick look at the D parsing code. It seems that following 
an initial parse - presumably this constructs the syntax tree - 
there are three deferred semantic parses. Are any details 
available on what actually happens in each pass? I can read the 
code of course but it would be helpful to get a high level 
picture.

Thanks and Regards
Dibyendu
May 19 2019
next sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Sunday, 19 May 2019 at 10:07:37 UTC, Dibyendu Majumdar wrote:
 I had a quick look at the D parsing code. It seems that 
 following an initial parse - presumably this constructs the 
 syntax tree - there are three deferred semantic parses. Are any 
 details available on what actually happens in each pass? I can 
 read the code of course but it would be helpful to get a high 
 level picture.

 Thanks and Regards
 Dibyendu
Walter Bright's DConf 2016 talk, "Spelunking D Compiler Internals", is a good resource for this: https://www.youtube.com/watch?v=l_96Crl998E
May 19 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Sunday, 19 May 2019 at 11:39:02 UTC, Paul Backus wrote:
 Walter Bright's DConf 2016 talk, "Spelunking D Compiler 
 Internals", is a good resource for this:

 https://www.youtube.com/watch?v=l_96Crl998E
Thank you - I watched it again just now. It was very helpful. I still need to understand more about the semantic analysis phase, but I don't yet know what questions to ask.

I guess to start with, one point seems to be that after the syntax analysis phase we simply have an abstract syntax tree that corresponds one-to-one with the input source - is that correct? So at this stage nothing more is known about the program - the input is merely in a different format that is more amenable to further analysis.

Secondly - I assume that the syntax tree "lowering" (or simplification) is done prior to the semantic analysis?

Thanks and Regards
Dibyendu
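To make sure I'm using the word the same way - by "lowering" I mean rewrites like the following (a sketch of my own, assuming scope(exit) is indeed rewritten into try/finally):

import std.stdio;

void main()
{
    // scope(exit) is, as I understand it, lowered into a try/finally
    scope(exit) writeln("cleanup");
    writeln("work");
    // i.e. it behaves as if written:
    // try { writeln("work"); } finally { writeln("cleanup"); }
}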
May 19 2019
parent reply Max Haughton <maxhaton gmail.com> writes:
On Sunday, 19 May 2019 at 14:26:34 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 11:39:02 UTC, Paul Backus wrote:
 [...]
 Thank you - I watched it again just now. It was very helpful. I still need to understand more about the semantic analysis phase, but I don't yet know what questions to ask.

 I guess to start with, one point seems to be that after the syntax analysis phase we simply have an abstract syntax tree that corresponds one-to-one with the input source - is that correct? So at this stage nothing more is known about the program - the input is merely in a different format that is more amenable to further analysis.

 Secondly - I assume that the syntax tree "lowering" (or simplification) is done prior to the semantic analysis?

 Thanks and Regards
 Dibyendu
Only use the code to work out what the current behaviour is: the specification needs to be totally independent of the current implementation (and its ideology).
May 19 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).
Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

Regards
May 19 2019
next sibling parent Basile B. <b2.temp gmx.com> writes:
On Sunday, 19 May 2019 at 16:52:10 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).

 Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

 Regards
The current compilation model of the front end is said to be "lazy", but an eager compilation would produce the same program without changing the semantics. I think that this is, for example, what Max Haughton meant. Saying that the semantic analysis is multi-pass due to certain language features, such as string mixins, template mixins, mutually dependent declarations, etc., would be enough. It just needs to be clear that a single pass, following the lexical order top to bottom, would not work at all.
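For concreteness, a minimal sketch (my own example, nothing normative) of why lexical order alone cannot work:

// A single pass in lexical order could not compile this module:
// `fun` calls `gun`, which is only declared later, and the string
// mixin only becomes a declaration during semantic analysis.
int fun() { return gun() + 1; }    // forward reference to gun
int gun() { return 41; }

mixin("immutable int x = fun();"); // a declaration produced from a string

void main()
{
    assert(x == 42);
}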
May 19 2019
prev sibling parent reply Max Haughton <maxhaton gmail.com> writes:
On Sunday, 19 May 2019 at 16:52:10 UTC, Dibyendu Majumdar wrote:
 On Sunday, 19 May 2019 at 16:41:22 UTC, Max Haughton wrote:
 Only use the code to work out what the current behaviour is: 
 the specification needs to be totally independent of the 
 current implementation (and its ideology).

 Yes I understand that, but there may be some aspects implicit in the implementation that should be made more explicit in the specification. Maybe I need to come back to this topic after I have done the other sections, as I don't know what I don't know.

 Regards
You're right, that is definitely an issue with the current specification. One of the really annoying issues is that dmd is understandable locally (e.g. I was able to add a little feature without looking much up), but the global structure of the code is quite disjoint, e.g. lots of the analysis code is lumped together and is fairly difficult to grok without watching a debugger go through it (unless you already know it).
May 20 2019
parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Monday, 20 May 2019 at 20:22:24 UTC, Max Haughton wrote:
 You're right, that is definitely an issue with the current 
 specification. One of the really annoying issues is that dmd is 
 understandable locally (e.g. I was able to add a little feature 
 without looking much up), but the global structure of the code 
 is quite disjoint, e.g. lots of the analysis code is lumped 
 together and is fairly difficult to grok without watching a 
 debugger go through it (unless you already know it).
As a newcomer here, I agree. After the parsing stage, things start to get a little bit unclear. While I agree that the spec is not supposed to be a tutorial, well, there isn't any tutorial either. I'm not suggesting making the spec a tutorial, but making some tutorial.

Compilers are of great interest to me and I think that DMD is great study material if you want to look at a non-trivial compiler. But at the same time, I think it is not attractive for new users. It may be that I'm not a compiler expert, but I think it's not easy for most people. The source code guide [1] did not help much; it was way too high-level for me to understand any important parts. I think that the video referenced above [2] was way more helpful.

In that video, you said that you could talk for a month about the compiler. Well, I would be glad to do that for you, with some help. :) After GSoC, I was planning to start diving into the compiler and writing about it. But I may write a lot of incorrect stuff. So, if any experienced compiler dev wants to help / review those writings, I would be very happy. And I think it would help other not-compiler-jedis understand it.

[1] https://wiki.dlang.org/DMD_Source_Guide
[2] https://www.youtube.com/watch?v=l_96Crl998E
May 20 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/19/2019 3:07 AM, Dibyendu Majumdar wrote:
 Hi,

 In the introduction there is a section named 'Phases of Compilation'.
 Following the convention used in the C standard, I am calling this
 'Translation Phases'; translation is a more generic term too. See
 https://github.com/dlang/dlang.org/pull/2646/files for some initial
 rewording.

 The recent post
 https://forum.dlang.org/post/xnoeazcqnysodqnjphhc forum.dlang.org
 got me thinking that actually we should expand the 'Translation
 Phases' section to describe more formally how D code is parsed. I had
 a quick look at the D parsing code. It seems that following an initial
 parse - presumably this constructs the syntax tree - there are three
 deferred semantic parses. Are any details available on what actually
 happens in each pass? I can read the code of course but it would be
 helpful to get a high level picture.
The Translation Phases are conceptual; the compiler may not actually do them that way, it is just supposed to behave "as if" they were done that way. For example, the Digital Mars C/C++ compilers merge several of the translation phases.
May 20 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 07:34:43 UTC, Walter Bright wrote:
 The Translation Phases are conceptual, the compiler may not 
 actually do them that way, it's just supposed to be "as if" 
 they were done that way.
Okay, thank you. From the spec point of view, I guess the important thing is to clarify any constraints this places on D. One constraint, obviously, is that parsing into a syntax tree requires arbitrary lookahead of tokens. Are there other constraints that need to be highlighted?

Regards
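To illustrate the lookahead point, a small sketch of my own (the exact decision procedure inside dmd is an assumption on my part):

struct S(T) { }

void main()
{
    // Only after scanning past `!(int)*[4]*` and reaching `d;` can the
    // parser tell this statement is a declaration, not an expression.
    S!(int)*[4]* d;
    assert(d is null);

    int a = 2, b = 3;
    int c = a * b;    // unambiguous: multiplication inside an initializer
    assert(c == 6);
}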
May 20 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/20/2019 4:51 AM, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 07:34:43 UTC, Walter Bright wrote:
 The Translation Phases are conceptual, the compiler may not actually do them 
 that way, it's just supposed to be "as if" they were done that way.
 Okay, thank you. From the spec point of view, I guess the important thing is to clarify any constraints this places on D. One constraint, obviously, is that parsing into a syntax tree requires arbitrary lookahead of tokens. Are there other constraints that need to be highlighted?
The spec isn't a tutorial, in that the consequences don't necessarily need to be spelled out.

The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis. I.e. the definition of a token does not change depending on what construct is being parsed, and the AST generated by the parser can be created without doing any semantic analysis (unlike C++).

These consequences fall out of the rest of the spec, hence they should be more of a clarification in the introduction. The idea is to head off attempts to add changes to D that introduce dependencies. Such proposals do crop up from time to time, for example, user-defined operator tokens.
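One way to see the split in practice (an illustrative sketch, not from the spec):

template NeverInstantiated(T)
{
    // `undefinedSymbol` exists nowhere, yet this parses fine: a template
    // body only needs to be syntactically valid until it is instantiated.
    auto f() { return undefinedSymbol + 1; }
}

void main() { }   // compiles: the body above is parsed but never analyzed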
May 20 2019
parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is independent 
 of parsing, and parsing is independent of semantic analysis.
I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
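To pin down the construct I mean, a tiny sketch (`decl` and `answer` are names of my own invention):

enum decl = "immutable int answer = 42;";
mixin(decl);                   // the string is lexed and parsed during semantics
static assert(answer == 42);   // afterwards it behaves like any declaration

void main() { }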
 I.e. the definition of a token does not change depending on 
 what construct is being parsed, and the AST generated by the 
 parser can be created without doing any semantic analysis 
 (unlike C++).

 These consequences fall out of the rest of the spec, hence they 
 should be more of a clarification in the introduction. The idea 
 is to head off attempts to add changes to D that introduce 
 dependencies. Such proposals do crop up from time to time, for 
 example, user-defined operator tokens.
I agree - hence I think we need to be explicit about what D requires of each phase. That way any change to the language can be subjected to a test: does it break any of the fundamental requirements for parsing or semantic analysis, etc.?

Thanks and Regards
Dibyendu
May 20 2019
next sibling parent Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of semantic 
 analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
For the first part, "tokenizing is independent of parsing": one way to think of it is that to write a lexer (the sub-program that does the tokenizing), you don't need a parser. You could write it as an independent program. That makes sense: to split any token from the input ('int', '+', '123', etc.) you don't ever need to build any AST. Put differently, you don't need to know any info about the structure of the program.

For the second part, a parser can build an AST (or more correctly, can decide which grammatical rule the input "falls" into) just by the kinds of the tokens (e.g. you have a number token, like 123, then '+', so you must be parsing a binary expression).

I think it was mentioned somewhere that C++ parsing is _dependent_ on the semantic analysis. To understand this better, think of semantic analysis as the act of giving meaning to the entities of the program (hence the term semantic). For example, if I write:

int a;

well, 'a' is a token whose kind we could say is an identifier, but it doesn't have any meaning. Knowing that 'a' is a variable with type 'int' does give us more info. We can now know whether it can be part of this expression, for example: a + 1. If 'a' was a string, that would be invalid, and we found the error because we knew the meaning of 'a'.

Now, what is an example of parsing being dependent on the meaning of tokens? Well, C (and C++) of course. Consider this expression:

(a)*b

If 'a' is a variable, then that is really this: a*b. But if I have done this:

typedef int a;

somewhere above, then now this expression is suddenly "cast the dereference of 'b' to the type 'a' (which is int in this case)". This is known as the lexer hack [1] (actually, the lexer hack is the solution to the problem). The important thing to understand here is that we can't parse the expression just by knowing what kind of token 'a' is. We have to know additional info about it, info that is provided by the semantic analysis (in the form of the symbol table, a table which contains info about the symbols of the program). Such grammars are known as context-sensitive (i.e. not context-free) because you have to have some context to deduce the grammatical rule. Note that now suddenly there is no clear distinction between the parsing phase and the semantic phase.

Last but not least, phases being clearly separated has other important implications beyond comprehensibility. For example, the compiler is parallelized more easily: while the lexer tokenizes file A, the parser could be parsing file B while semantic analysis runs on file C.

[1] https://en.wikipedia.org/wiki/The_lexer_hack
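For contrast, here is a small sketch (my own illustration) of how D side-steps this: as far as I can tell, the statement form `a * b;` is resolved purely syntactically, so the parser never needs a symbol table.

struct a { }

void main()
{
    a * b;           // always parsed as a declaration: b has type a*
    assert(b is null);

    int x = 3, y = 4;
    int z = x * y;   // multiplication: not in statement-initial position
    assert(z == 12);
}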
May 20 2019
prev sibling parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of semantic 
 analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

 For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
Currently the Intro says:

 The process of compiling is divided into multiple phases. Each phase
 has no dependence on subsequent phases. For example, the scanner is
 not perturbed by the semantic analyzer. This separation of the passes
 makes language tools like syntax directed editors relatively easy to
 produce. It also is possible to compress D source by storing it in
 ‘tokenized’ form.

I feel this description is unclear, and it might just reflect how DMD is implemented. I haven't implemented a C++ parser, but in the parsers I have worked with - such as for C - it is always the case that the lexer doesn't get impacted by the semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.

For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.

Thanks and Regards
Dibyendu
May 21 2019
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
 It is also conceivable that someone could create a "dumb" AST 
 first for C++, and as a subsequent phase add semantic meaning 
 to the AST, just as is done for D.
AFAIK, you should be able to parse C++ with a GLR parser.

Ola.
May 21 2019
prev sibling parent reply Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of 
 semantic analysis.
 I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

 For example, the mixin declaration has to convert a string to AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build the AST while already in the semantic stage?
 [...]

 For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.
Hi Walter, please would you share any insights regarding the above?

Thanks and Regards
Dibyendu
Jun 01 2019
parent Dibyendu Majumdar <d.majumdar gmail.com> writes:
On Saturday, 1 June 2019 at 13:34:07 UTC, Dibyendu Majumdar wrote:
 On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar 
 wrote:
 On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

 The crucial thing to know is that the tokenizing is 
 independent of parsing, and parsing is independent of 
 semantic analysis.
 [...]
 Hi Walter, please would you share any insights regarding the above?
Hi, I am still waiting to hear your views on the above. The pull request I submitted for revising the intro is stuck because of this.

Regards
Jun 10 2019