
digitalmars.D - std.d.lexer: pre-voting review / discussion

reply "Dicebot" <public dicebot.lv> writes:
std.d.lexer is a standard module for lexing D code, written by 
Brian Schott

---- Input ----

Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

Documentation:
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

Initial discussion:
http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf forum.dlang.org

Usage example in real project:
https://github.com/Hackerpilot/Dscanner
(as stdx.d.lexer, Brian, please correct me if versions do not 
match)

---- Information for reviewers ----

(yes, I am mostly copy-pasting this :P)

The goal of this thread is to detect whether there are any outstanding
issues that need to be fixed before the formal "yes"/"no" voting
happens. If no critical objections arise, voting will begin next
week. Otherwise it depends on the time the module author needs to
implement suggestions.

Please take this part seriously: "If you identify problems along
the way, please note if they are minor, serious, or showstoppers."
(http://wiki.dlang.org/Review/Process). This information will later
be used to determine if the library is ready for voting.

If there are any frequent Phobos contributors / core developers
reading, please pay extra attention to the submission's code style
and how well it fits into the overall Phobos guidelines and structure.

The most important goal of this review is to uncover any API /
design problems. Internal implementation tweaks may still happen
after inclusion into Phobos, but it is important to ensure that no
breaking changes will be required soon after the module gets
wider usage.

---- Information request from module author ----

Performance was a major topic in the previous discussion thread. 
Could you please provide benchmarking data for the version 
currently undergoing review?
Sep 11 2013
next sibling parent "Tove" <tove fransson.se> writes:
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott

I remember reading there were some interesting hash advances in dmd recently:
http://forum.dlang.org/thread/kq7ov0$2o8n$1 digitalmars.com?page=1

Maybe it's worth benchmarking those hashes for std.d.lexer as well.
Sep 11 2013
prev sibling next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Wed, 11 Sep 2013 17:01:58 +0200
schrieb "Dicebot" <public dicebot.lv>:

 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott
 

Question / Minor issue:

As we already have a range based interface I'd love to have partial lexing / parsing, especially for IDEs. Say I have this source code:

--------------------------------------------
1: module a;
2:
3: void test(int a)
4: {
5:     [...]
6: }
7:
8: void test2()
9:     [...]
--------------------------------------------

Then I first do a full parse pass over the source. Now line 5 is being edited. I know from the full parse that line 5 is part of a FunctionDeclaration which starts at line 3 and ends at line 6. Now I'd like to re-parse only that part:

--------------------------------------------
FunctionDeclaration decl = document.getDeclByLine(5);
decl.reparse(/*start_line=*/ 3, docBuffer);
--------------------------------------------

I think these are the two critical points related to this for the proposed std.lexer:

* How can I tell the lexer to start lexing at line/character n? Of course the input could be sliced, but then line number and position information in the Token struct is wrong.

* I guess std.lexer slices the original input? This could make things difficult if the file buffer is edited in place. But this can probably be dealt with outside of std.lexer. (By reallocating all .value members)

(And once this is working, an example in the docs would be great)

But to be honest I'm not sure how important this really is. I think it should help for more responsive IDEs but maybe parsing is not a bottleneck at all?
Sep 11 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 11:43 AM, Johannes Pfau wrote:
 But to be honest I'm not sure how important this really is. I think it
 should help for more responsive IDEs but maybe parsing is not a
 bottleneck at all?

It is important, and I'm glad you brought it up. The LexerConfig can provide a spot to put a starting line/column value.
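A minimal sketch of how such a configuration knob might look. The field names `startLine` and `startColumn` are assumptions for illustration, not the reviewed module's actual API:

```d
// Hypothetical extension of LexerConfig: the field names here are
// assumptions for illustration, not part of the reviewed API.
struct LexerConfig
{
    uint startLine = 1;   // line number assigned to the first token
    uint startColumn = 1; // column number assigned to the first token
    // ... other configuration fields ...
}

void main()
{
    // An IDE re-lexing only a slice of the buffer would seed the
    // position so that token locations match the full document:
    LexerConfig config;
    config.startLine = 3; // the edited declaration starts at line 3
    assert(config.startLine == 3 && config.startColumn == 1);
}
```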
Sep 11 2013
prev sibling next sibling parent reply "qznc" <qznc web.de> writes:
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott

 Documentation:
 http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or".

This is nitpicking. Overall, I like the proposal. Great work!
Sep 11 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 11:45 AM, qznc wrote:
 On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

 Documentation:
 http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or".

Pedantically, it is just UTF-8 code units, which are a superset of ASCII.
Sep 11 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

Thank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example that is done better, '*' is called 'star', rather than 'dereference'.

5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

7. uint line; Should indicate that lines start with '1', not '0'. Ditto for columns.

8. 'default_' Again with the awful practice of appending _.

9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.
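Walter's point 2 can be illustrated with a sketch. The enum members and helper below are hypothetical names, not std.d.lexer's actual identifiers; the point is that a final switch over an enum compiles to a jump table rather than a chain of predicate calls:

```d
// Hypothetical token classification: the enum members are assumptions
// for illustration, not std.d.lexer's actual layout.
enum TokKind { identifier, kwImport, kwInt, plus, intLiteral }

bool isKeyword(TokKind k)
{
    final switch (k) // jump table instead of an if-then sequence
    {
        case TokKind.kwImport, TokKind.kwInt:
            return true;
        case TokKind.identifier, TokKind.plus, TokKind.intLiteral:
            return false;
    }
}

void main()
{
    assert(isKeyword(TokKind.kwImport));
    assert(!isKeyword(TokKind.plus));
}
```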
 Goal of this thread is to detect if there are any outstanding
 issues that need to fixed before formal "yes"/"no" voting
 happens. If no critical objections will arise, voting will begin
 starting with a next week. Otherwise it depends on time module author needs to
implement suggestions.

I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.
Sep 11 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 12:10 PM, Brian Schott wrote:
 The choice of ending token names with underscores was made according to the
 Phobos style guide.

 http://dlang.org/dstyle.html

I didn't realize that was in the style guide. I guess I can't complain about it, then :-)
Sep 11 2013
prev sibling next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version? -manfred
Sep 11 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 12:17 PM, Manfred Nowak wrote:
 Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?

Since the very beginning. One example is determining if something is a declaration or an expression.
Sep 11 2013
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright wrote:

 Since the very beginning.
 
 One example is determining if something is a declaration or an
 expression. 

I see now that you wrote about parsing---not about lexing.

Btw. I wrote an LALR(1)-parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.

-manfred
Sep 11 2013
parent Manfred Nowak <svv1999 hotmail.com> writes:
Robert Schadek wrote:

 especially with IdentifierList:

I see the shift/reduce conflict if you were indeed trying to syntactically differentiate between template identifiers and other identifiers.

-manfred
Sep 12 2013
prev sibling next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Jonathan M Davis wrote:

 You have to look ahead to figure out whether it's .. or a floating
 point literal.

This lookahead is introduced by using a petty grammar. Please reconsider that lexing searches for the leftmost longest pattern of the rest of the input. This means that introducing a pattern like

  `<int>\.\.'  return TokenType.INTDOTDOT;

would eliminate the lookahead in the lexer. In the parser an additional rule then has to be added:

  <range> ::= INT DOTDOT INT | INTDOTDOT INT

-manfred
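A toy sketch of that idea (the token names are Manfred's; the code itself is mine, not from std.d.lexer): the lexer consumes the digits and, if the next two characters are "..", emits a single combined token, so it never has to back up after seeing one '.':

```d
// Toy lexer fragment illustrating the <int>\.\. pattern: an integer
// immediately followed by ".." becomes one INTDOTDOT token, so the
// lexer needs no lookahead to distinguish "1..2" from "1.2".
import std.ascii : isDigit;

enum Tok { INT, INTDOTDOT, FLOAT }

Tok lexNumber(string s, ref size_t i)
{
    while (i < s.length && s[i].isDigit) ++i;        // consume digits
    if (i + 1 < s.length && s[i] == '.' && s[i + 1] == '.')
    {
        i += 2;                                      // consume ".."
        return Tok.INTDOTDOT;
    }
    if (i < s.length && s[i] == '.')
    {
        ++i;
        while (i < s.length && s[i].isDigit) ++i;    // fractional part
        return Tok.FLOAT;
    }
    return Tok.INT;
}

void main()
{
    size_t i = 0;
    assert(lexNumber("1..2", i) == Tok.INTDOTDOT && i == 3);
    i = 0;
    assert(lexNumber("1.5", i) == Tok.FLOAT);
}
```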
Sep 11 2013
prev sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Brian Schott wrote:

 Parsing D requires arbitrary lookahead.




Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.

-manfred
Sep 11 2013
prev sibling next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
On 11.09.2013 20:49, Walter Bright wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

 Thank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file:

 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

 8. 'default_' Again with the awful practice of appending _.

Delphi designers realized this problem years ago and they came up with a solution:
http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers

Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".
Sep 11 2013
next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
On 12.09.2013 01:22, Kapps wrote:
 On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:
 On 11.09.2013 20:49, Walter Bright wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian
 Schott

 Thank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file:

 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

 8. 'default_' Again with the awful practice of appending _.

 Delphi designers realized this problem years ago and they came up with a solution:
 http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers

 Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".

C# allows this as well. For example, you can have:

void MyVoid(string @int) { ... }

The name is actually int for purposes of reflection and the like; the @ just tells the compiler that it's not a keyword.

Yes, and this is important to actually have the real name without a '_' or any other prefix or suffix. Is there an enhancement proposal for this in the bugzilla?
Sep 11 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-09-11 21:57, Piotr Szturmaj wrote:

 Delphi designers realized this problem years ago and they came up with a
 solution:
 http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers


 Basically, Delphi allows escaping reserved identifiers with a '&'. I
 wonder how D solves that problem when interfacing to COM classes if they
 have for example a function named "scope".

Scala does it as well: `keyword` if I recall correctly. Seems like you can put basically anything between the backticks in Scala. -- /Jacob Carlborg
Sep 12 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 12:55 PM, H. S. Teoh wrote:
 3. I assumed TokenType is a type. But it's not, it's an enum. Even
 the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.
 5. The LexerConfig initialization should be a constructor rather than
 a sequence of assignments. LexerConfig documentation is awfully thin.
 For example, 'tokenStyle' is explained as being 'Token style',
 whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.

Yes, using a constructor :-)
 6. No clue how lookahead works with this. Parsing D requires arbitrary
 lookahead.

What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.
 Also, it's unclear what types of input is supported -- the code example
 only uses string input, but what about file input? Does it support
 byLine? Or does it need me to slurp the entire file contents into an
 in-memory buffer first?

The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.
Sep 11 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 3:02 PM, H. S. Teoh wrote:
 No, in typical D code, user-defined type names are capitalized, such as
 InputRange, Duration, Complex, ByLine, etc.. They do not have 'Type' in
 their names (we don't call them InputRangeType, for example). So this
 usage doesn't conflict at all.

It's not a type as is understood in D. It's an enumerated value.
 Constructors with many arguments are evil, because you can't just modify
 one setting and use defaults for the rest; you have to specify all of
 them up to the last default argument you have to change.

It's not that big a deal, and certainly not evil.
 6. No clue how lookahead works with this. Parsing D requires
 arbitrary lookahead.

What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

Then the parser should be fixed.

The parser works fine. It doesn't need fixing.
 The point of having the input be an InputRange is you DO NOT CARE if
 the input is coming from a string, file, etc.

Yes, but the docs currently doesn't say what form this input range must take. Must it be a range of characters?

It does say that.
Sep 11 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-09-12 00:02, H. S. Teoh wrote:

 Constructors with many arguments are evil, because you can't just modify
 one setting and use defaults for the rest; you have to specify all of
 them up to the last default argument you have to change.

Currently this is possible:

LexerConfig config = {
    iterStyle: IterationStyle.everything,
    tokenStyle: TokenStyle.source
};

I would really like the following to be possible as well:

void foo (LexerConfig config);

foo({
    iterStyle: IterationStyle.everything,
    tokenStyle: TokenStyle.source
});

--
/Jacob Carlborg
Sep 11 2013
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:

 6. No clue how lookahead works with this. Parsing D requires arbitrary 
 lookahead.

Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint.

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
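The mechanism Michel describes can be sketched generically. The token range below is a stand-in slice, not std.d.lexer's actual byToken range:

```d
// Generic sketch of parser lookahead over a forward range of tokens:
// save() the range before speculative parsing, and on failure resume
// from the saved copy (which re-lexes those tokens for a lazy range).
import std.range;

bool looksLikeDeclaration(R)(R tokens) if (isForwardRange!R)
{
    auto saved = tokens.save;   // checkpoint before speculating
    tokens.popFront();          // ... speculatively consume tokens ...
    // decide; on failure, fall back to the checkpoint:
    tokens = saved;
    return !tokens.empty;
}

void main()
{
    import std.algorithm.comparison : equal;
    auto toks = ["int", "x", ";"];         // a slice is a forward range
    assert(looksLikeDeclaration(toks));
    assert(toks.equal(["int", "x", ";"])); // caller's copy untouched
}
```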
Sep 11 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 3:09 PM, Michel Fortin wrote:
 On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:

 6. No clue how lookahead works with this. Parsing D requires arbitrary
lookahead.

 Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint.

 https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.

That's correct, but that implies re-lexing the tokens, which has negative performance implications.
Sep 11 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
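A sketch of such an intermediary layer, modeled on the description above rather than on dmd's actual code (all names are illustrative): peeked tokens are kept in a linked list and replayed before the lexer is asked for more, so nothing is lexed twice.

```d
// Sketch of a lookahead layer between parser and lexer. Single-token
// peek shown; deeper lookahead would append at the list's tail.
struct TokenStream
{
    string[] source;          // stand-in for the underlying lexer
    static struct Node { string tok; Node* next; }
    Node* pending;            // already-lexed lookahead tokens

    string next()
    {
        if (pending)          // replay buffered lookahead first
        {
            auto t = pending.tok;
            pending = pending.next;
            return t;
        }
        auto t = source[0];   // otherwise lex a fresh token
        source = source[1 .. $];
        return t;
    }

    string peek()             // lex one token ahead and remember it
    {
        auto t = next();
        pending = new Node(t, pending);
        return t;
    }
}

void main()
{
    auto ts = TokenStream(["int", "x", ";"]);
    assert(ts.peek() == "int"); // lookahead
    assert(ts.next() == "int"); // replayed, not re-lexed
    assert(ts.next() == "x");
}
```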
Sep 11 2013
next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 09/12/2013 03:39 AM, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.
Sep 11 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 7:21 PM, Martin Nowak wrote:
 On 09/12/2013 03:39 AM, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.

No, I don't have a rough idea, but it doesn't show up in the profile.
Sep 11 2013
prev sibling next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
deadalnix wrote:
 When you are at the caret position, you don't know

If one ever reaches that position one uses petty lexer/grammar definitions. -manfred
Sep 12 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Sep-2013 00:46, Robert Schadek wrote:
 On 09/12/2013 04:21 AM, Martin Nowak wrote:
 On 09/12/2013 03:39 AM, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.


 Maybe some fixed size stack vector with 64 elements or so and some
 linked list for the unusual case would help ..

-- Dmitry Olshansky
Sep 12 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 6:54 PM, deadalnix wrote:
 On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

But then, you have an extra step when looking up every tokens + memory management overhead. How big is the performance improvement ?

Not really - I use a dirty trick in that a static instance is always the start of the list. But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.
Sep 11 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 10:10 PM, deadalnix wrote:
 See my comment, it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows to save one branch per token.

I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.
Sep 11 2013
parent Martin Nowak <code dawg.eu> writes:
On 09/12/2013 08:17 AM, Walter Bright wrote:
 I don't believe that, because you can see about anything for tokens in
 lookahead and so have to duplicate nearly the whole lexer anyway for the
 'fast path', but you're free to try it out and prove me wrong.

Me neither, and if you have to duplicate the code for this it's a double loss. I think it would be more fruitful to batch lexing, i.e. fill a whole memory page (4Kb) with tokens, then go back to parsing until you need more.
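Martin's batching idea, roughly sketched (the buffer size and all names are my assumptions, not a proposed API): lex tokens into a fixed-size buffer in one tight loop, let the parser drain it, and refill only when it runs dry.

```d
// Sketch of batched lexing: fill a fixed-size token buffer (roughly a
// memory page worth of tokens) in one loop, then let the parser drain
// it before refilling. All names and sizes are illustrative.
struct BatchedLexer
{
    string[] source;                 // stand-in for the raw lexer
    string[64] batch;                // one batch of tokens
    size_t len, pos;

    private void refill()
    {
        len = pos = 0;
        while (len < batch.length && source.length)
        {
            batch[len++] = source[0];   // "lex" one token
            source = source[1 .. $];
        }
    }

    bool empty() { return pos == len && source.length == 0; }

    string front()
    {
        if (pos == len) refill();       // parser ran dry: batch again
        return batch[pos];
    }

    void popFront() { ++pos; }
}

void main()
{
    import std.array : array;
    auto lx = BatchedLexer(["a", "=", "1", ";"]);
    assert(lx.array == ["a", "=", "1", ";"]);
}
```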
Sep 12 2013
prev sibling parent Martin Nowak <code dawg.eu> writes:
On 09/12/2013 03:30 AM, deadalnix wrote:
 That's correct, but that implies re-lexing the tokens, which has
 negative performance implications.

Indeed. What solution do you have in mind ?

Buffering the tokens would work. There are some ways to promote input ranges to forward ranges, but there are also some pitfalls, like the implicit save on copy.

I have two prototypes for a generic input range buffer:
https://gist.github.com/dawgfoto/2187220 - uses a growing ring buffer
https://gist.github.com/dawgfoto/1257196 - uses ref-counted lookahead buffers in a singly linked list

The lexer itself has a ring buffer for input ranges:
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2278
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 12, 2013 at 06:09:41PM +0200, deadalnix wrote:
 On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
This can be handled by using an intermediate grammar rule. Reduce all
(...) into an intermediate type, say ArgList, so the reduction
happens something like this:

	int  foo   ()      ()      {}
	Type Ident ArgList ArgList ^

Then have the rule:

	CompileTimeArgs ::= ArgList
	RuntimeArgs ::= ArgList
	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
	FuncDecl ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed
whether they are compile-time arguments or runtime arguments. It's
only after you see the next '(' or '{' that you decide whether
ArgList should reduce to CompileTimeArgs or RuntimeArgs.

ArgList itself, of course, will accept all possible parameters (both
runtime and compile-time): types, expressions, symbols. Then when you
reduce it to RuntimeArgs, you reject stuff that can't be interpreted
as parameter declarations.

And then you got to backtrack the parsing instead of the lexing. You just moved the problem around. You'll have to create some temporary ast nodes that then will fix into what they really are.

No. You can just use ArgListItem for both runtime args and compile-time args. Once you decided which one it is, wrong arguments are rejected at semantic time (which you have to do anyway).

Let's take a concrete example. Say we're parsing this invalid code:

    int foo(alias A)(alias B) {}

You'd go through these steps:

1) Parse initial prefix of declaration:

    int foo(alias A)(alias B) {}
           ^
    AST: FuncDecl
    |--RetType: int
    |--Ident: foo
    \--[ being built ]

2) Parse first (...):

    int foo(alias A)(alias B) {}
                    ^
    AST: FuncDecl
    |--RetType: int
    |--Ident: foo
    |--ArgList
    |  \-- AliasArg
    |      \-- ident: A
    \--[ being built ]

I'm skipping the intermediate steps here, it's obvious how to construct AliasArg from the usual parsing process.

3) Parse second (...):

    int foo(alias A)(alias B) {}
                             ^
    AST: FuncDecl
    |--RetType: int
    |--Ident: foo
    |--ArgList
    |  \-- AliasArg
    |      \-- ident: A
    |--ArgList
    |  \-- AliasArg
    |      \-- ident: B
    \--[ being built ]

4) At this point, you now know the first ArgList is CompileTimeArgList, and the second is RuntimeArgList, so you can just change the type fields (along with narrowing FuncDecl to TemplateFuncDecl):

    AST: TemplateFuncDecl (was: FuncDecl)
    |--RetType: int
    |--Ident: foo
    |--CompileTimeArgList (was: ArgList)
    |  \-- AliasArg
    |      \-- ident: A
    |--RuntimeArgList (was: ArgList)
    |  \-- AliasArg
    |      \-- ident: B
    \--[ being built ]

Since you're still constructing FuncDecl, your current parsing context should still have a direct reference to the partially-constructed FuncDecl node, which in turn has a direct reference to both ArgList child nodes. So this is just dereferencing a couple of pointers. No backtracking.

5) Finish parsing the declaration:

    int foo(alias A)(alias B) {}
                                 ^
    AST: TemplateFuncDecl
    |--RetType: int
    |--Ident: foo
    |--CompileTimeArgList (was: ArgList)
    |  \-- AliasArg
    |      \-- ident: A
    |--RuntimeArgList (was: ArgList)
    |  \-- AliasArg
    |      \-- ident: B
    \--FuncBody
       \-- CompoundStatement
           \-- [empty body]

6) Run semantic:
   - Create local symbol table for foo, etc..
   - Run semantic on CompileTimeArgList:
     - Check AliasArg for validity
     - Run semantic on AliasArg: add A to function's local symbol table, etc.
   - Run semantic on RuntimeArgList:
     - Check AliasArg for validity: ERROR: cannot have alias parameter at runtime.
     - (Add B to local symbol table)(skipped due to previous error)
   - (Run semantic on FuncBody)(skipped due to previous error)
   - (Run semantic on RetType (verify return type match, etc.))(skipped due to previous error)
   - (Add function to parent scope symbol table)(skipped due to previous error)

So, no backtracking is necessary. Of course, it sounds like DMD's parser doesn't work this way, but that's a limitation of DMD's parser, not an *inherent* need for backtracking.

T

--
I see that you JS got Bach.
Sep 12 2013
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
We are talking about parameters here, not arguments. But overall, 
that isn't the point.

If I have int a = 3 as an argument in the first set of (), I 
still have no clue if it is a runtime or compile time parameter. 
But ref int a = 3 ?

Doing so, you end up with a lot of ambiguous nodes (like 
FunctionArgumentOrValueTemplateParameter) and have to manage that 
complexity, or have to backtrack on the parsing and fix ambiguous 
nodes when you have enough information.
Sep 12 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-09-11 22:37, Brian Schott wrote:

 It's possible, but I fear that it would make the code a mess.

Really? It seems you only need to change two places in the code:

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L1724

And

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L400

Although I just did a quick search for "line".

--
/Jacob Carlborg
Sep 12 2013
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 09/11/2013 08:49 PM, Walter Bright wrote:

 3. I assumed [an instance of] TokenType is a type.

(is(TokenType) is true).
 But it's not, it's an enum.
 Even the document says it's a 'type', but it's not a type.

It's a type tag. The tag uniquely determines the type. (As in 'the type of a token', as opposed to 'the type of an expression'.)
 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be awkward
 if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
 Ditto for the rest. For example that is done better, '*' is called
 'star', rather than 'dereference'.

FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.
 5. The LexerConfig initialization should be a constructor rather than a
sequence of assignments.

Using the { a:2, b:3 }-style initialization syntax?
 6. No clue how lookahead works with this.

Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out.

Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }

void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)

This approach is typically more efficient than using a free list on contemporary architectures.
Sep 12 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Sep-2013 19:39, Timon Gehr wrote:
 On 09/11/2013 08:49 PM, Walter Bright wrote:
 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be awkward
 if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
 Ditto for the rest. For example that is done better, '*' is called
 'star', rather than 'dereference'.

FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.

I like this. Not only does this have the benefit of not colliding with keywords, I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.
 6. No clue how lookahead works with this.

Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out. Lookahead is realized as follows in the parser: (assume 'code' is the circular buffer range.) auto saveState(){muteerr++; return code.pushAnchor();} // saves the state and mutes all error messages until the state is restored void restoreState(Anchor state){ muteerr--; code.popAnchor(state); } The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.) This approach is typically more efficient than using a free list on contemporary architectures.

This ^^ is how. In fact std.d.lexer internally does similar thing with non-RA ranges of bytes. -- Dmitry Olshansky
Sep 12 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/12/2013 06:40 PM, H. S. Teoh wrote:
 I vote for Tok!"..." to denote token types.

 Question: what's the implementation of Tok?  Does it fit into an enum?
 What's the underlying representation? I imagine some kind of canonical
 mapping into an integral type would be desired, to maximize runtime
 performance.

This is just a quick hack. One would maybe want to avoid the unwarranted copies and linear searches at compile time:

    import std.algorithm;

    private enum symbols = ["i",".","..",",","0",/+...+/];

    struct TokenType{
        private uint _tag; // or smaller type
        private this(int tag){ _tag = tag; }
        string toString(){ // (optional)
            static array(R)(R s){ string[] r; foreach(x;s) r~=x; return r; }
            static immutable strs = array(symbols.map!((a)=>`Tok!"`~a~`"`));
            return strs[_tag];
        }
    }

    template Tok(string name) if(symbols.countUntil(name)!=-1){
        enum Tok = TokenType(cast(uint)symbols.countUntil(name));
    }

    import std.stdio;
    void main(){
        enum x = Tok!"i";
        writeln(x);
        writeln(Tok!"0");
    }
Sep 12 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-09-12 17:39, Timon Gehr wrote:

 Using the { a:2, b:3 }-style initialization syntax?

Unfortunately that's currently not possible to pass to functions, see my other post: http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=6#post-l0ro6h:249mk:241:40digitalmars.com -- /Jacob Carlborg
Sep 12 2013
prev sibling parent reply Martin Nowak <code dawg.eu> writes:
On 09/12/2013 05:39 PM, Timon Gehr wrote:
 Lookahead is realized as follows in the parser:

 (assume 'code' is the circular buffer range.)

 auto saveState(){muteerr++; return code.pushAnchor();} // saves the
 state and mutes all error messages until the state is restored

 void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

 The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
 grows automatically to keep around tokens still reachable by an anchor.
 (The range only needs small constant space besides the buffer to support
 this functionality, though it is unable to detect usage errors.)

Do you think it's possible to use CircularBufferRange itself as the anchor? Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction, but the implicit save on copy is annoying because you end up with anchors from temporary copies.
Sep 12 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/12/2013 10:03 PM, Martin Nowak wrote:
 On 09/12/2013 05:39 PM, Timon Gehr wrote:
 Lookahead is realized as follows in the parser:

 (assume 'code' is the circular buffer range.)

 auto saveState(){muteerr++; return code.pushAnchor();} // saves the
 state and mutes all error messages until the state is restored

 void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

 The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
 grows automatically to keep around tokens still reachable by an anchor.
 (The range only needs small constant space besides the buffer to support
 this functionality, though it is unable to detect usage errors.)

Do you think it's possible to use CircularBufferRange itself as anchor.

Well, this implementation is geared towards the usage pattern found in the top-down parser. If the first anchor is not restored last, it will not generally work. This does not conform to the range interface. Implementing .save requires a more advanced and less efficient implementation.
 Then calling save() would return a CircularBufferRange and you could
 scratch the two functions above. I had some promising experiments in
 that direction,

What did you do? The way I think I'd approach it would be to maintain a binary min-heap using a few pointers in each range instance.
 but the implicit save on copy is annoying because you
 end up with anchors from temporary copies.

Is this referring to suboptimal efficiency? I guess that is rather hard to address in a safe way. Basically, you'd want to forget about "anchor management" in case it can be proven that there is another range on the same buffer that stays at most as progressed during the whole lifetime of the range under consideration. Of course, it is always possible to do it unsafely.
Sep 12 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
The choice of ending token names with underscores was made 
according to the Phobos style guide.

http://dlang.org/dstyle.html
Sep 11 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, September 11, 2013 11:49:44 Walter Bright wrote:
 1. I don't like the _ suffix for keywords. Just call it kwimport or
 something like that.

Appending _ is what we do in Phobos when we need to use a keyword, and it's even called out in the style guide: http://dlang.org/dstyle.html Now, if it makes sense to use something other than a keyword, that's probably better, but in cases where it really makes the most sense to use a keyword but we can't (e.g. std.traits.FunctionAttribute), we append _ to the keyword. What makes the most sense in this case, I don't know, because I haven't looked at either the documentation or the code yet, but if Brian is appending _ to keywords, he's following the official style guide. - Jonathan M Davis
Sep 11 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
 Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013. - Jonathan M Davis
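A minimal sketch of that ambiguity (a hypothetical helper, not std.d.lexer code): after a run of digits, a '.' cannot be classified until one more character is read, because "1..2" is an integer literal followed by the slice token while "1.5" is a single float literal.

```d
// Lex the leading number of a slice: needs two characters of lookahead
// after the digits to distinguish "1..2" from "1.5".
string[] lexNumberPrefix(string s)
{
    size_t i;
    while (i < s.length && s[i] >= '0' && s[i] <= '9') ++i;
    if (i + 1 < s.length && s[i] == '.' && s[i + 1] == '.')
        return [s[0 .. i], ".."];   // integer literal, then the .. token
    if (i < s.length && s[i] == '.')
    {
        size_t j = i + 1;
        while (j < s.length && s[j] >= '0' && s[j] <= '9') ++j;
        return [s[0 .. j]];         // floating point literal
    }
    return [s[0 .. i]];
}

void main()
{
    assert(lexNumberPrefix("1..2") == ["1", ".."]);
    assert(lexNumberPrefix("1.5") == ["1.5"]);
}
```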
Sep 11 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 11 September 2013 at 19:29:48 UTC, Jonathan M Davis 
wrote:
 On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
 Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013. - Jonathan M Davis

Yeah. D requires lookahead in both lexing and parsing. Some examples:

* String literals such as q{}
* If statements need to determine the difference between if (expression) and if (type identifier = expression)
* Determining if something is a lambda expression or a function literal expression
* Determining if something is a declaration or a statement or an expression
* Differentiating between (type).identifier and a lambda expression
* Differentiating between a type and an expression inside a typeid expression
* Differentiating between associative array key type and tuple slices in type suffixes
Sep 11 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, September 11, 2013 15:29:33 Jonathan M Davis wrote:
 On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
 Walter Bright wrote:
 Parsing D requires arbitrary lookahead.

Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

LOL. You were asking about _parsing_ requiring arbitrary lookahead, and I answered as to why lexing requires more than one character of lookahead. I should read more carefully... - Jonathan M Davis
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
std.d.lexer is standard module for lexing D code, written by Brian
Schott

Thank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

I agree.
 2. The example uses an if-then sequence of isBuiltType, isKeyword,
 etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like:

    enum TokenCategory { BuiltInType, Keyword, ... }

    // Convert TokenType into TokenCategory
    TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example:

    enum TokenCategory { BuiltInType, Keyword, ... }

    enum TokenType {
        IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
        FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
        ...
        FuncToken  = (TokenCategory.Keyword << 16) | 0x0001,
    }

Then the category function can be simply:

    TokenCategory category(TokenType tt) {
        return cast(TokenCategory)(tt >> 16);
    }

Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.
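For concreteness, the bit-packing idea can be turned into a small runnable sketch. The enum members below are illustrative, not std.d.lexer's actual token names:

```d
// Category lives in the high 16 bits of each token value, so
// classification is a single shift instead of a chain of if-tests.
enum TokenCategory : uint { BuiltInType, Keyword, Operator }

enum TokenType : uint
{
    int_   = (TokenCategory.BuiltInType << 16) | 0x0001,
    float_ = (TokenCategory.BuiltInType << 16) | 0x0002,
    scope_ = (TokenCategory.Keyword     << 16) | 0x0001,
    star   = (TokenCategory.Operator    << 16) | 0x0001,
}

TokenCategory category(TokenType tt)
{
    return cast(TokenCategory)(tt >> 16);
}

void main()
{
    assert(category(TokenType.int_)   == TokenCategory.BuiltInType);
    assert(category(TokenType.scope_) == TokenCategory.Keyword);

    // User code can now dispatch with one fast switch:
    final switch (category(TokenType.star))
    {
        case TokenCategory.BuiltInType: assert(false);
        case TokenCategory.Keyword:     assert(false);
        case TokenCategory.Operator:    break;
    }
}
```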
 3. I assumed TokenType is a type. But it's not, it's an enum. Even
 the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).
 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be
 awkward if .. took on new meanings in D. Calling it 'dotdot' would
 be clearer. Ditto for the rest. For example that is done better, '*'
 is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)
 5. The LexerConfig initialization should be a constructor rather than
 a sequence of assignments. LexerConfig documentation is awfully thin.
 For example, 'tokenStyle' is explained as being 'Token style',
 whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS):

    DLexer lex;
    lex.iterStyle = ...;
    // etc.

    foreach (token; lex.byToken()) {
        ...
    }

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?
 6. No clue how lookahead works with this. Parsing D requires arbitrary
 lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things. [...]
 8. 'default_' Again with the awful practice of appending _.

Yeah I didn't like that. Prefixing keywords with "kw" or "kw_" is much better, instead of the confusing way of appending _ to some token types but not others.
 9. Need to insert intra-page navigation links, such as when
 'byToken()' appears in the text, it should be link to where byToken
 is described.

Agreed. [...]
 I believe the state of the documentation is a showstopper, and needs
 to be extensively fleshed out before it can be considered ready for
 voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but changeable by the caller for parsing code fragments.

A minor concern I have about the current implementation (I didn't look at the code yet; this is just based on the documentation) is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information, and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)? Would it be possible to skip the additional computations required to populate these fields?

T -- MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 12:13:07PM -0700, Walter Bright wrote:
 On 9/11/2013 12:10 PM, Brian Schott wrote:
The choice of ending token names with underscores was made according
to the Phobos style guide.

http://dlang.org/dstyle.html

I didn't realize that was in the style guide. I guess I can't complain about it, then :-)

I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword. T -- One Word to write them all, One Access to find them, One Excel to count them all, And thus to Windows bind them. -- Mike Champion
Sep 11 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
 I disagree. I think it's more readable to use a consistent 
 prefix, like
 kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's 
 clear
 you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
I disagree. I think it's more readable to use a consistent prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now. T -- Being able to learn is a great learning; being able to unlearn is a greater learning.
Sep 11 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
 On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh 
 wrote:
I disagree. I think it's more readable to use a consistent 
prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that 
it's
clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

Phobos code must conform its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is worst of all options.
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
I disagree. I think it's more readable to use a consistent prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

Phobos code must conform its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is worst of all options.

This doesn't violate Phobos style guidelines:

    enum TokenType {
        kwInt,
        kwFloat,
        kwDouble,
        ...
        kwFunction,
        kwScope,
        ...
        // etc.
    }

The token representing a particular keyword is not the same thing as the keyword itself. There's nothing that says "a D lexer must use token type names that are derived from the D keywords it represents". In fact, using a kw prefix for tokens representing keywords is *more* consistent, because it doesn't require inserting _ for some enum values but not for others. I never said anything about changing Phobos style guidelines.

T -- I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Sep 11 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

One litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.
Sep 11 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/11/2013 1:44 PM, Brian Schott wrote:
 I'm not sure if you've followed the announce mailing list recently, but this
 lexer is already used in both the DScanner and DCD projects.

Awesome! This is just what I asked for.
Sep 11 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
 2. The example uses an if-then sequence of isBuiltType, 
 isKeyword,
 etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like: enum TokenCategory { BuiltInType, Keyword, ... } // Convert TokenType into TokenCategory TokenCategory category(TokenType tt) { ... } Then in user code, you call category() on the token type, and switch over that. This maximizes performance. Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example: enum TokenCategory { BuiltInType, Keyword, ... } enum TokenType { IntToken = (TokenCategory.BuiltInType << 16) | 0x0001, FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002, ... FuncToken = (TokenCategory.Keyword << 16) | 0x0001, } Then the category function can be simply: TokenCategory category(TokenType tt) { return cast(TokenCategory)(tt >> 16); } Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.
 4. When naming tokens like .. 'slice', it is giving it a
 syntactic/semantic name rather than a token name. This would be
 awkward if .. took on new meanings in D. Calling it 'dotdot' 
 would
 be clearer. Ditto for the rest. For example that is done 
 better, '*'
 is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.
 5. The LexerConfig initialization should be a constructor 
 rather than
 a sequence of assignments. LexerConfig documentation is 
 awfully thin.
 For example, 'tokenStyle' is explained as being 'Token style',
 whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS): DLexer lex; lex.iterStyle = ...; // etc. foreach (token; lex.byToken()) { ... } This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.
 6. No clue how lookahead works with this. Parsing D requires 
 arbitrary
 lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses such as syntax highlighting that only need one token at a time.
 9. Need to insert intra-page navigation links, such as when
 'byToken()' appears in the text, it should be link to where 
 byToken
 is described.

Agreed.

I'll work on this later this evening.
 [...]
 I believe the state of the documentation is a showstopper, and 
 needs
 to be extensively fleshed out before it can be considered 
 ready for
 voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

I'm not convinced (yet?).
 Also, it's unclear what types of input is supported -- the code 
 example
 only uses string input, but what about file input? Does it 
 support
 byLine? Or does it need me to slurp the entire file contents 
 into an
 in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.
 Now, somebody pointed out that there's currently no way to tell 
 it that
 the data you're feeding to it is a partial slice of the full 
 source. I
 think this should be an easy fix: LexerConfig (or DLexer after 
 the
 rename) can have a field for specifying the initial line number 
 / column
 number, defaulting to (1, 1) but can be changed by the caller 
 for
 parsing code fragments.

That's simple enough to add.
 A minor concern I have about the current implementation (I 
 didn't look
 at the code yet, but this is just based on the documentation), 
 is that
 there's no way to choose what kind of data you want in the 
 token stream.
 Right now, it appears that it always computes startIndex, line, 
 and
 column. What if I don't care about this information, and only 
 want, say,
 the token type and value (say I'm pretty-printing the source, 
 so I don't
 care how the original was formatted)?  Would it be possible to 
 skip the
 additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
Sep 11 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 11 September 2013 at 20:37:11 UTC, Walter Bright 
wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott

One litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.

I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects. There is a parser module that goes along with this lexer included in DScanner (and used by DCD as well). It's not ready for review though; I'm still working on improving its error recovery.

DScanner is a tool for D source that's able to generate syntax-highlighted HTML, perform syntax validation, generate the AST in XML format, generate CTAGS, list imports, and count lines of code.

DCD is an autocompletion client/server program that has integrations for several text editors.
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
 On 9/11/2013 12:55 PM, H. S. Teoh wrote:
3. I assumed TokenType is a type. But it's not, it's an enum. Even
the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc.. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.
5. The LexerConfig initialization should be a constructor rather
than a sequence of assignments. LexerConfig documentation is awfully
thin.  For example, 'tokenStyle' is explained as being 'Token
style', whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.

Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.
6. No clue how lookahead works with this. Parsing D requires
arbitrary lookahead.

What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been seen so far is the parser's job, not the lexer's. LALR(1) parsers, for example, deal with multiple similar-looking constructs without requiring more than a single token of lookahead. The trick is that you do not assume a particular grammar production until it has been fully disambiguated. The grammar, if not conducive to this, can usually be rearranged to make it possible. For example, if the grammar has two production rules:

    StatementA ::= TokenA TokenB TokenC TokenD
    StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell, given TokenA in the current input, which type of statement it's going to be. However, these rules can be rewritten like this:

    StatementABPrefix ::= TokenA TokenB TokenC
    StatementA ::= StatementABPrefix TokenD
    StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no backtracking is ever needed. Any information encoded by the first 3 tokens can be stored in the AST node of StatementABPrefix, so it's recoverable without needing to save previous tokens.

Unless you're saying that D grammar is more complex than LALR(1)... If so, I'd like to see an example of where this is the case.
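That left-factoring maps directly onto a recursive-descent fragment with single-token lookahead. A toy sketch with hypothetical token names:

```d
// After the common prefix A B C is parsed once, one token of
// lookahead is enough to pick the production; no backtracking.
enum Tok { A, B, C, D, E }

struct Parser
{
    Tok[] input;

    Tok peek() { return input[0]; }
    void eat(Tok t) { assert(input[0] == t); input = input[1 .. $]; }

    string parseStatement()
    {
        // StatementABPrefix ::= TokenA TokenB TokenC
        eat(Tok.A); eat(Tok.B); eat(Tok.C);
        // single-token lookahead decides between the two productions
        if (peek() == Tok.D) { eat(Tok.D); return "StatementA"; }
        else                 { eat(Tok.E); return "StatementB"; }
    }
}

void main()
{
    assert(Parser([Tok.A, Tok.B, Tok.C, Tok.D]).parseStatement() == "StatementA");
    assert(Parser([Tok.A, Tok.B, Tok.C, Tok.E]).parseStatement() == "StatementB");
}
```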
Also, it's unclear what types of input is supported -- the code
example only uses string input, but what about file input? Does it
support byLine? Or does it need me to slurp the entire file contents
into an in-memory buffer first?

The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters? Am I allowed to pass in a range of *strings* (representing source lines)? This needs to be clearly stated. The fact that tokens slice the input also needs to be clearly stated, since that precludes the use of transient ranges like byLine (your token values will be garbled by the time you get to them). T -- Let's call it an accidental feature. -- Larry Wall
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
 On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
2. The example uses an if-then sequence of isBuiltType, isKeyword,
etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like: enum TokenCategory { BuiltInType, Keyword, ... } // Convert TokenType into TokenCategory TokenCategory category(TokenType tt) { ... } Then in user code, you call category() on the token type, and switch over that. This maximizes performance. Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example: enum TokenCategory { BuiltInType, Keyword, ... } enum TokenType { IntToken = (TokenCategory.BuiltInType << 16) | 0x0001, FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002, ... FuncToken = (TokenCategory.Keyword << 16) | 0x0001, } Then the category function can be simply: TokenCategory category(TokenType tt) { return cast(TokenCategory)(tt >> 16); } Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.

Better hope it passes code review, though. :)
4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if .. took on new meanings in D. Calling it 'dotdot' would
be clearer. Ditto for the rest. For example that is done better, '*'
is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.
5. The LexerConfig initialization should be a constructor rather
than a sequence of assignments. LexerConfig documentation is awfully
thin.  For example, 'tokenStyle' is explained as being 'Token
style', whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS): DLexer lex; lex.iterStyle = ...; // etc. foreach (token; lex.byToken()) { ... } This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.
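As a minimal sketch of the UFCS point above (the byToken here is a stand-in, not the actual std.d.lexer signature), a free function whose first parameter is the input range can be called either as an ordinary function or as if it were a member:

```d
// Hypothetical sketch: a free function over a range works with UFCS.
// This is NOT the real std.d.lexer byToken, just an illustration.
auto byToken(R)(R sourceCode)
{
    // A real lexer would tokenize here; we just echo the input.
    return sourceCode;
}

void main()
{
    auto a = byToken("int x;");  // ordinary function call
    auto b = "int x;".byToken(); // same call via UFCS
    assert(a == b);
}
```

Moving byToken into a struct or class would make the second form impossible, which is the design tradeoff being discussed.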

Very good point. In that case, would it be possible to make byToken accept configuration parameters inline? That is:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to create a separate config object outside the chain just so you can use it inside. And writing LexerConfig(...) inline seems a bit cumbersome:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.
6. No clue how lookahead works with this. Parsing D requires
arbitrary lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses such as syntax highlighting that only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser doesn't need arbitrary slicing of tokens; its internal state should encode what tokens have been seen thus far so that it never needs to backtrack. [...]
I think the API can be improved. The LexerConfig -> DLexer rename is
an important readability issue IMO.

I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm that we should support, and the current API seems better suited for that purpose.
Also, it's unclear what types of input is supported -- the code
example only uses string input, but what about file input? Does it
support byLine? Or does it need me to slurp the entire file contents
into an in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :) (A signature constraint of the form `if (is(ElementType!R == char))` should be good enough, I think, since DDOC in DMD git HEAD will include signature constraints in the generated docs, and it should be clear enough what is expected. [...]
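A sketch of the kind of constraint being suggested (names assumed, not the module's real API). One nuance: for narrow strings, std.range auto-decodes, so `ElementType!string` is dchar; `ElementEncodingType` is the practical way to constrain on the underlying char type:

```d
import std.range.primitives : isInputRange, ElementEncodingType;
import std.traits : Unqual;

// Hypothetical sketch: constrain the input to a range whose encoding
// element is char, so the UTF-8 expectation shows up in generated docs.
// (ElementType!string would be dchar due to auto-decoding, hence
// ElementEncodingType here.)
auto byToken(R)(R range)
    if (isInputRange!R && is(Unqual!(ElementEncodingType!R) == char))
{
    return range; // stand-in body; the real lexer would tokenize here
}

void main()
{
    // string is a range of immutable(char), so it satisfies the constraint
    auto toks = byToken("void main() {}");
}
```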
A minor concern I have about the current implementation (I didn't
look at the code yet, but this is just based on the documentation),
is that there's no way to choose what kind of data you want in the
token stream.  Right now, it appears that it always computes
startIndex, line, and column. What if I don't care about this
information, and only want, say, the token type and value (say I'm
pretty-printing the source, so I don't care how the original was
formatted)?  Would it be possible to skip the additional computations
required to populate these fields?

It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it shouldn't be a big issue. T -- Once bitten, twice cry...
Sep 11 2013
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 09/11/2013 05:01 PM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

 Please take this part seriously: "If you identify problems along the
 way, please note if they are minor, serious, or showstoppers."
 (http://wiki.dlang.org/Review/Process). This information later
 will be used to determine if library is ready for voting.

Overall it leaves a good impression, i.e. it looks like a lexer written in D. I do like how you made the lexer configurable to serve the different needs. Also it was very easy to set up and use. The appended underscore for keywords is much nicer than messing with uppercase/lowercase, which was what I did.

One thing that bothered me was the usage of const(ubyte)[] as input. If you operate on UTF-8 it should take const(char)[]; you can cast it to const(ubyte)[] internally.

Also a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could as well fit somewhere else into phobos.
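The suggested API change above can be sketched as a thin wrapper (the wrapper name is hypothetical, not the module's actual signature):

```d
// Hypothetical sketch: accept const(char)[] at the API boundary and
// reinterpret it as const(ubyte)[] for the byte-oriented internals.
// The cast changes only the static type, not the memory.
auto byTokenChars(const(char)[] source)
{
    auto bytes = cast(const(ubyte)[]) source; // same memory, viewed as bytes
    return bytes; // stand-in: the real implementation would lex `bytes`
}

void main()
{
    auto b = byTokenChars("int x;");
    assert(b.length == 6);
    assert(b[0] == cast(ubyte) 'i');
}
```

This keeps the public interface honest about expecting UTF-8 text while letting the implementation keep its byte-level fast paths.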
 ---- Information request from module author ----

 Performance was a major discussed topic in previous thread. Could you
 please provide benchmarking data for version currently ongoing the review?

Nice too, it works as fast as my D lexer (https://github.com/dawgfoto/lexer) which is about as fast as DMD's lexer (when compiled with LDC).
Sep 11 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-09-12 00:36, Martin Nowak wrote:

 Also a convenience function that reads a file and processes UTF BOM
 marks would be nice (see toUtf8
 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429),
 but that could as well fit somewhere else into phobos.

Sounds like that would fit in std.file or similar. -- /Jacob Carlborg
Sep 12 2013
prev sibling next sibling parent "Kapps" <opantm2+spam gmail.com> writes:
On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj 
wrote:
 On 11.09.2013 20:49, Walter Bright wrote:
 On 9/11/2013 8:01 AM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott

Thank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

8. 'default_' Again with the awful practice of appending _.

Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".

C# allows this as well. For example, you can have:

    void MyVoid(string @int) { ... }

The name is actually int for purposes of reflection and the like; the @ just tells the compiler that it's not a keyword.
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 04:42:34PM -0700, Walter Bright wrote:
 On 9/11/2013 3:02 PM, H. S. Teoh wrote:

Yes, but the docs currently doesn't say what form this input range
must take. Must it be a range of characters?

It does say that.

But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*. T -- Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Sep 11 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
 But then the code example proceeds to pass byLine() to it. Is 
 that
 correct? If it is, then the docs need to be updated, because 
 last time I
 checked, byLine() isn't a range of char, but a range of char 
 *arrays*.


 T

The example doesn't pass the result of byLine to the byToken function directly.
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 11 September 2013 at 19:46:25 UTC, Brian Schott 
wrote:
 Yeah. D requires lookahead in both lexing and parsing. Some 
 examples:

 * String literals such as q{}
 * If statements need to determine the difference between if 
 (expression) and if (type identifier = expression)
 * Determining if something is a lamba expression or a function 
 literal expression
 * Determining if something is a declaration or a statement or 
 an expression.

Does not require lookahead.
 * Differentiating between (type).identifier and a lambda 
 expression

Yup !
 * Differentiating between a type and an expression inside a 
 typeid expression

Does not require lookahead.
 * Differentiating between associative array key type and tuple 
 slices in type suffixes

ditto.
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 11 September 2013 at 23:44:55 UTC, Walter Bright 
wrote:
 On 9/11/2013 3:09 PM, Michel Fortin wrote:
 On 2013-09-11 18:49:44 +0000, Walter Bright 
 <newshound2 digitalmars.com> said:

 6. No clue how lookahead works with this. Parsing D requires 
 arbitrary lookahead.

Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.

That's correct, but that implies re-lexing the tokens, which has negative performance implications.

Indeed. What solution do you have in mind ?
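The save()-based lookahead discussed above can be sketched as follows (Token and the helper names are made up for illustration; integers stand in for tokens):

```d
import std.range.primitives : isForwardRange;

// Illustrative sketch: backtracking over any forward range of tokens.
// `tryParse` speculatively consumes input; on failure the caller rewinds
// by restoring the saved copy -- at the cost of re-lexing those tokens
// when the range is lazy, which is the overhead pointed out above.
bool speculate(R, alias tryParse)(ref R tokens)
    if (isForwardRange!R)
{
    auto saved = tokens.save;   // O(1) snapshot of the position
    if (tryParse(tokens))
        return true;            // commit: keep the advanced range
    tokens = saved;             // rewind: lazy ranges re-lex on retry
    return false;
}

void main()
{
    int[] input = [1, 2, 3];
    auto r = input[];
    bool fail(ref int[] t) { t = t[1 .. $]; return false; }
    assert(!speculate!(int[], fail)(r));
    assert(r.length == 3); // position was restored after the failed attempt
}
```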
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 11 September 2013 at 20:28:06 UTC, H. S. Teoh wrote:
 On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
 On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh 
 wrote:
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh 
wrote:
I disagree. I think it's more readable to use a consistent 
prefix,
like kw... or kw_... (e.g. kw_int, kw_return, etc.), so 
that it's
clear you're referring to token types, not the actual 
keyword.

Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

Phobos code must conform to its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is the worst of all options.

This doesn't violate Phobos style guidelines:

    enum TokenType {
        kwInt, kwFloat, kwDouble,
        ...
        kwFunction, kwScope,
        ...
        // etc.
    }

Int, Function, Scope, Import are all valid identifiers. Random abbreviations like kw are really bad. It is even worse when it doesn't make anything shorter.
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright 
wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

But then, you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?
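The intermediary layer Walter describes can be sketched as a small lookahead buffer between lexer and parser; the shape below is an assumption for illustration, not dmd's actual code (dmd uses a linked list whose head is a static instance, so one-token lookahead never allocates):

```d
// Sketch (assumption, not dmd's implementation): a layer that caches
// tokens pulled from the lexer so the parser can peek ahead without
// re-lexing. Integers stand in for a real Token struct.
struct TokenStream(Lexer)
{
    Lexer lexer;
    int[] buffer; // already-lexed tokens not yet consumed

    // Peek n tokens ahead, lexing lazily into the buffer as needed.
    int peek(size_t n)
    {
        while (buffer.length <= n)
            buffer ~= lexer.next();
        return buffer[n];
    }

    // Consume one token from the front.
    int pop()
    {
        auto t = peek(0);
        buffer = buffer[1 .. $];
        return t;
    }
}

void main()
{
    struct FakeLexer { int i; int next() { return i++; } }
    auto ts = TokenStream!FakeLexer(FakeLexer(0));
    assert(ts.peek(2) == 2); // arbitrary lookahead, nothing consumed
    assert(ts.pop() == 0);   // consumption still starts at the front
}
```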
Sep 11 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
 Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H.S. Teoh would consider that a negative due to possible confusion with the keyword). - Jonathan M Davis
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis wrote:
 On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
 Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H.S. Teoh would consider that a negative due to possible confusion with the keyword).

Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. I'd rather have consistency than superficial resemblance to the keywords as typed. Either *all* of the enum values should end with _, or *none* of them should. Having a mixture of both is an eyesore, and leads to people wondering, should I add a _ at the end or not? If people insist that the 'default' keyword absolutely must be represented as TokenType.default_ (I really don't see why), then *all* TokenType values should end with _. But honestly, I find that really ugly. Writing something like kwDefault, or tokenTypeDefault, would be far better. Sigh, Andrei was right. Once the bikeshed is up for painting, even the rainbow won't suffice. :-P T -- MASM = Mana Ada Sistem, Man!
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 12, 2013 at 03:17:06AM +0200, Brian Schott wrote:
 On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
But then the code example proceeds to pass byLine() to it. Is that
correct? If it is, then the docs need to be updated, because last
time I checked, byLine() isn't a range of char, but a range of char
*arrays*.


T

The example doesn't pass the result of byLine to the byToken function directly.

*facepalm* You're right, it's calling join() on it. Nevermind what I said then. :-P Sorry for all the unnecessary noise, I don't know what got into me that I didn't see the join(). T -- May you live all the days of your life. -- Jonathan Swift
Sep 11 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 12, 2013 at 04:21:36AM +0200, Martin Nowak wrote:
 On 09/12/2013 03:39 AM, Walter Bright wrote:
On 9/11/2013 6:30 PM, deadalnix wrote:
Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.

I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does D grammar have tricky bits that cannot be handled this way, that I'm unaware of? T -- The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.
Sep 11 2013
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 11.09.2013 17:01, schrieb Dicebot:
 std.d.lexer is standard module for lexing D code, written by
 Brian Schott

a standard benchmark scenario is needed - to compare dmd and std.d.lexer with dmc_dmd, ldc, gdc and msc_dmd, just to be clear how close it comes to current dmd lexing speed
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:
 I still don't understand why backtracking is necessary in the 
 first
 place. I would've thought a modern parser should be well able 
 to encode
 seen tokens in its state so that backtracking is never 
 necessary. Or
 does D grammar have tricky bits that cannot be handled this 
 way, that
 I'm unaware of?

The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

    int foo()() {}
           ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol. In such a situation, it is much simpler to skip input to the closing parenthesis, check what's coming after, and act accordingly.

The alternative is to go for some ambiguous function/template parameter parsing and resolve at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode. Note that the example is not self-contained; for instance, both template parameters and arguments can imply template instantiation, which means ambiguous argument parsing.

SDC does a good deal of ambiguous parsing without backtracking (more than DMD does), but you've got to draw the line somewhere. What I'd like to see is a go-to-the-closing-token feature, which can eventually take a fast path to do so, or an efficient token buffering system (I'm not sure which feature would be the fastest, but I'm inclined to think it's the first one; however, that won't be suited for a dmd-style parser).
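The "skip to the closing parenthesis" step can be sketched with a simple depth counter over the token stream (tokens are represented as chars here for brevity; a real implementation would match on TokenType):

```d
// Sketch: return the index just past the ')' matching the '(' at
// position `start`. A depth counter handles nesting.
size_t skipParens(const(char)[] toks, size_t start)
{
    assert(toks[start] == '(');
    size_t depth = 0;
    foreach (i; start .. toks.length)
    {
        if (toks[i] == '(') ++depth;
        else if (toks[i] == ')' && --depth == 0)
            return i + 1;
    }
    return toks.length; // unbalanced input
}

void main()
{
    // "int foo()() {}" -- skipping the first parameter list lands on a
    // second '(', which tells us this is a template declaration.
    auto src = "int foo()() {}";
    auto next = skipParens(src, 7);
    assert(src[next] == '(');
}
```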
Sep 11 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 12 September 2013 at 04:57:50 UTC, Walter Bright 
wrote:
 On 9/11/2013 6:54 PM, deadalnix wrote:
 On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright 
 wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

But then, you have an extra step when looking up every tokens + memory management overhead. How big is the performance improvement ?

Not really - I use a dirty trick in that a static instance is always the start of the list.

If I understand you correctly, that means that a lookahead of one token does not trigger any allocation.
 But even so, an extra indirection is better than re-lexing. 
 Lexing is a clear bottleneck in the profiling. I've even been 
 thinking of special casing the code that scans comments to use 
 SIMD instructions.

See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.
Sep 11 2013
prev sibling next sibling parent "sclytrack" <sclytrack qlskdfjl.com> writes:
On Thursday, 12 September 2013 at 03:31:42 UTC, H. S. Teoh wrote:
 On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis 
 wrote:
 On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
 Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H.S. Teoh would consider that a negative due to possible confusion with the keyword).

Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. I'd rather have consistency than superficial resemblance to the keywords as typed. Either *all* of the enum values should end with _, or *none* of them should. Having a mixture of both is an eyesore, and leads to people wondering, should I add a _ at the end or not? If people insist that the 'default' keyword absolutely must be represented as TokenType.default_ (I really don't see why), then *all* TokenType values should end with _. But honestly, I find that really ugly. Writing something like kwDefault, or tokenTypeDefault, would be far better. Sigh, Andrei was right. Once the bikeshed is up for painting, even the rainbow won't suffice. :-P T

Delphi would use

    TokenType ttDefault
    MyType    mtDefault
Sep 11 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-09-11 17:01, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

Finally :)

* How does it handle errors, just returns TokenType.invalid?

* Personally I think the module is too big. I would go with:
  - std.d.lexer.token
  - std.d.lexer.tokentype
  - std.d.lexer.lexer - contains the rest
  - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig
  - CircularRange, StringCache, possibly put somewhere else. I assume this can be used for other things than lexing?
  - Trie related code, same as above

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.

  If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions, it makes it useless to catch them.

  I would also expect this LexException to contain a Token. It shouldn't be needed to parse the exception message to get line and column information.

* I like that you overall use clear and descriptive variable and function names. Except "sbox": https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L3265

* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?

* I would like to see a short description for each unit test, what it's testing. Personally I have started with this style:

    describe("byToken")
    {
        context("valid string literal")
        {
            it("should return a token with the type TokenType.stringLiteral")
            unittest
            {
                // test
            }

            it("should return a token with the correct lexeme")
            unittest
            {
                // test
            }
        }
    }

  Better formatted: http://pastebin.com/Dx78Vw6r

  People here might think that would be a bit too verbose. The following would be ok as well:

    describe("short description of the unit test")
    unittest
    {
    }

* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spelling errors:

* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

-- 
/Jacob Carlborg
Sep 12 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 12-Sep-2013 12:05, Jacob Carlborg wrote:
 On 2013-09-11 17:01, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

Finally :) * How does it handler errors, just returns TokenType.invalid? * Personally I think the module is too big. I would go with: - std.d.lexer.token - std.d.lexer.tokentype

These could be one module. There is really no meaningful way to use token type separately from token.
 - std.d.lexer.lexer - contains the rest
 - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig

Contrary I see this break down pointless - do you really want to use config without the lexer?
 - CircularRange, StringCache, possibly put somewhere else. I assume this
 can be used for other things than lexing?
 - Trie related code, same as above

The reason is that no good public interface has been defined. Basically the same as with Trie in the new std.uni module - it needs its own review.
 * I see that errorMessage throws an exception. Do we really want that? I
 would except it just returns an invalid token.

 If we do decide it should throw, it should absolutely _not_ throw a
 plain Exception. Create a new type, LexException or similar. I hate when
 code throws plain Exceptions, it makes it useless to catch them.

 I would also expect this LexException to contain a Token. It shouldn't
 be needed to parse the exception message to get line and column
 information.

Better yet to have a std exception hierarchy... so that all parsing modules can be tied to ParseException. So this needs to be resolved in a forward-compatible way. -- Dmitry Olshansky
Sep 12 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-09-12 18:21, Dmitry Olshansky wrote:

 Contrary I see this break down pointless - do you really want to use
 config without the lexer?

std.d.lexer.config might be a bit unnecessary. But yes, in general I would like to see smaller modules. -- /Jacob Carlborg
Sep 12 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 12 September 2013 at 06:17:04 UTC, Walter Bright 
wrote:
 On 9/11/2013 10:10 PM, deadalnix wrote:
 See my comment, it is possible, with increased parser 
 complexity, to handle many
 cases where you don't know what you are parsing yet. Doing so, 
 lookahead is only
 required to find matching closing token. I suspect that a fast 
 path in the lexer
 for that precise use case may be faster than buffering tokens, 
 as it allow to
 save one branch per token.

I don't believe that, because you can see just about anything in the lookahead tokens, and so you have to duplicate nearly the whole lexer anyway for the 'fast path'. But you're free to try it out and prove me wrong.

I plan to, but you know how it is: the best optimization is the one that goes from a non-working to a working state.
Sep 12 2013
prev sibling next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
I got some time to work on the lexer this evening. Changeset 
here: 
https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d

The DDoc page has moved here: 
http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

* There are a few more unit tests now
* bitAnd renamed to amp
* slice rename to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config
Sep 12 2013
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 12.09.2013 10:15, schrieb Brian Schott:
 I got some time to work on the lexer this evening. Changeset
 here:
 https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d

 The DDoc page has moved here:
 http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

 * There are a few more unit tests now
 * bitAnd renamed to amp
 * slice rename to dotdot
 * Much more cross-referencing in the doc comments
 * Start line and column can be specified in the lexer config

problem: many occurrences of the same string

You should use constants for the tokens (and others):

    string asm_token = "asm";
    ...
    immutable(string[TokenType.max + 1]) tokenValues = [
        ...
        asm_token
        ...
    ]

and reuse these constants in your "optimization".

Maybe you can replace these lines with something that gets fed asm_token and gives the same result, but without 'a' and "sm" as split-up and different magic values - maybe a nice template or subrange...

    case 'a':
        if (input[1..$].equal("sm"))
            return TokenType.asm_;
        else
            ...
        break;

and in your unit tests, for example on

    auto expected =
Sep 12 2013
parent dennis luehring <dl.soluz gmx.net> writes:
just an idea regarding your string[TokenType.max + 1] in

immutable(string[TokenType.max + 1]) tokenValues = [..]

it seems that you try to reduce the memory usage

wouldn't it be a nice idea to generate a combined immutable string at 
compile time like this one

"...pragmaexportpackageprivate..."

and generate string slice accesses?

immutable string big_one = generated_from(token_list);
immutable string export_token = big_one[10 .. 16];
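A compile-time sketch of that idea (all names hypothetical; a tiny token list stands in for the full one):

```d
// Hypothetical sketch: concatenate all token strings into one immutable
// blob at compile time and expose each token as a slice into it, so the
// per-token strings share storage instead of each being its own array.
enum string[] tokenList = ["asm", "export", "import"];

string buildBlob(const string[] toks)
{
    string blob;
    foreach (t; toks) blob ~= t;
    return blob;
}

enum bigOne = buildBlob(tokenList); // built via CTFE

immutable string asmToken    = bigOne[0 .. 3];
immutable string exportToken = bigOne[3 .. 9];

void main()
{
    assert(asmToken == "asm");
    assert(exportToken == "export");
    // all slices point into the same underlying blob
    assert(asmToken.ptr + 3 == exportToken.ptr);
}
```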
Sep 12 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/12/2013 1:15 AM, Brian Schott wrote:
 I got some time to work on the lexer this evening. Changeset here:
 https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d


 The DDoc page has moved here:
 http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

Great!
 * There are a few more unit tests now

I strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.
 * bitAnd renamed to amp
 * slice rename to dotdot
 * Much more cross-referencing in the doc comments
 * Start line and column can be specified in the lexer config

Sep 12 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/12/2013 4:57 PM, Brian Schott wrote:
 On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:
 I strongly recommend running the unit tests with -cov. A lexer should be able
 to get near 100% coverage with the unit tests.

Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in.

That's right.
 That being said, it's currently 70% covered. I'll be making another pass over
 the code later to fill in some of the gaps.

Thanks.
Sep 12 2013
prev sibling next sibling parent "qznc" <qznc web.de> writes:
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright 
wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

I think this is the right approach. It can probably be a separate function that we put into std.range and reuse for other lexers/parsers. The lexer and the parser should not be made more complex for this.
Sep 12 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
 Most important goal of this review is to determine any API / 
 design problems. Any internal implementation tweaks may happen 
 after inclusion to Phobos but it is important to assure that no 
 breaking changes will be required any time soon after module 
 will get wider usage.

One quick remark: we need some kind of value provider that can be reused across different lexing passes, and that can be used outside the lexer. If I process a module and have to kick off a new lexing phase because of a mixin, I want to generate identifiers out of the same pool.
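A minimal interning-pool sketch of what's being asked for (names assumed; the module's actual StringCache may differ):

```d
// Sketch (assumption): a shared identifier pool. Two lexing passes that
// intern into the same pool get back identical string instances, so
// identifiers from a mixin's source unify with the rest of the module.
struct IdentifierPool
{
    private string[const(ubyte)[]] pool;

    string intern(const(ubyte)[] text)
    {
        if (auto p = text in pool)
            return *p;                    // already interned: reuse it
        auto s = cast(string) text.idup;  // one canonical immutable copy
        pool[cast(immutable(ubyte)[]) s] = s;
        return s;
    }
}

void main()
{
    IdentifierPool ids;
    auto a = ids.intern(cast(const(ubyte)[]) "foo");
    auto b = ids.intern(cast(const(ubyte)[]) "foo".dup);
    assert(a is b); // same instance across separate intern calls
}
```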
Sep 12 2013
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:

On 09/12/2013 12:09 AM, Manfred Nowak wrote:
 Walter Bright wrote:

 Since the very beginning.

 One example is determining if something is a declaration or an
 expression. 

Btw. I wrote an LALR(1)-parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes. -manfred

IdentifierList:
    Identifier . IdentifierList
    TemplateInstance . IdentifierList

And Bison also complained.
Sep 12 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 12, 2013 at 07:00:09AM +0200, deadalnix wrote:
 On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:
I still don't understand why backtracking is necessary in the first
place. I would've thought a modern parser should be well able to
encode seen tokens in its state so that backtracking is never
necessary.  Or does D grammar have tricky bits that cannot be handled
this way, that I'm unaware of?

The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

    int foo()() {}
           ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol.

This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

	int  foo   ()      ()      {}
	Type Ident ArgList ArgList ^

Then have the rule:

	CompileTimeArgs ::= ArgList
	RuntimeArgs ::= ArgList
	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
	FuncDecl ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs.

ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.

T

-- 
There are two ways to write error-free programs; only the third one works.
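The deferred reduction described here could be sketched roughly as follows (a hypothetical fragment with made-up types, not code from any actual parser):

```d
// Hypothetical placeholder for a parsed but still-unclassified (...) group.
struct ArgList {}

enum DeclKind { func, templ }

// After "Type Ident", every (...) is reduced to a generic ArgList; only
// once we know how many there are do we commit to an interpretation,
// so the token stream is never backtracked.
DeclKind classify(ArgList[] lists)
{
    // two ArgLists => the first one holds compile-time parameters
    return lists.length == 2 ? DeclKind.templ : DeclKind.func;
}
```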
Sep 12 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
 This can be handled by using an intermediate grammar rule. 
 Reduce all
 (...) into an intermediate type, say ArgList, so the reduction
 happens something like this:

 	int  foo   ()      ()      {}
 	Type Ident ArgList ArgList ^

 Then have the rule:

 	CompileTimeArgs ::= ArgList
 	RuntimeArgs ::= ArgList
 	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
 	FuncDecl ::= Type Ident RuntimeArgs '{' ...

 So first, all (...) gets parsed to ArgList, but it's not yet 
 fixed
 whether they are compile-time arguments or runtime arguments. 
 It's only
 after you see the next '(' or '{' that you decide whether 
 ArgList should
 reduce to CompileTimeArgs or RuntimeArgs.

 ArgList itself, of course, will accept all possible parameters 
 (both
 runtime and compile-time): types, expressions, symbols. Then 
 when you
 reduce it to RuntimeArgs, you reject stuff that can't be 
 interpreted as
 parameter declarations.

And then you have to backtrack the parsing instead of the lexing; you have just moved the problem around. You'll have to create some temporary AST nodes that will later be fixed into what they really are.
Sep 12 2013
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/11/2013 05:01 PM, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by Brian Schott

 ---- Input ----

 Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

 Documentation:
 http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

 ...

(Commenting on what's visible in the documentation only for now.)

auto config = ...
... .byToken(config) ...

Seems to be a natural candidate for manual partial specialization.

enum config = ...
... .byToken!config() ...

uint line;
ushort column; // is there overflow checking?

"Check to see if the token is of the same type and has the same string representation as the given token."

Tokens with the same string representation always are of the same type, so this seems redundant. Furthermore, I'd expect (!a.opCmp(b)) === (a == b). Why provide the operator overloads at all? They don't implement essential or natural functionality.

"includeSpecialTokens". It's not clear what this flag does.

"If the input range supports slicing, the caching layer aliases itself away and the lexing process is much more efficient." It might be more sensible to require the user to manually wrap his range.

"pure nothrow bool isOperator(const TokenType t);
pure nothrow bool isOperator(ref const Token t);
pure nothrow bool isKeyword(const TokenType t);
pure nothrow bool isKeyword(ref const Token t);
..."

IMO we should get rid of these.

TokenType naming seems inconsistent. E.g.: & is amp, = is assign, == is equal, but &= is bitAndEqual and && is logicAnd. IMO better: & is and, = is assign, &= is andAssign and && is andAnd. Of course, it might be best to use a template instead: Tok!"&", Tok!"&=" and Tok!"&&".
Sep 12 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Sep 12, 2013 at 08:10:18PM +0400, Dmitry Olshansky wrote:
 12-Sep-2013 19:39, Timon Gehr пишет:
On 09/11/2013 08:49 PM, Walter Bright wrote:
4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if .. took on new meanings in D. Calling it 'dotdot' would
be clearer.  Ditto for the rest. For example that is done better,
'*' is called 'star', rather than 'dereference'.

FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.

I like this. Not only does this have the benefit of not colliding with keywords, I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.

+1. This is superior to both the ad hoc _ suffix and my ad hoc prefixing approach. Tok!"default" is maximally readable, and requires no silly convolutions like _ or 'kw' / 'tokenType' prefixes. I vote for Tok!"..." to denote token types.

Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance.

T

-- 
There are three kinds of people in the world: those who can count, and those who can't.
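One possible sketch (purely hypothetical; the token list and names are made up) maps each token spelling to a compact integral value at compile time, so Tok!"&&" is an ordinary integer constant at runtime:

```d
import std.algorithm.searching : countUntil;

// Hypothetical master list of token spellings; a real lexer would
// enumerate all of them. An entry's position in the array is its id.
enum string[] tokenStrings = ["i", "0", "&", "&=", "&&", "..", "default"];

template Tok(string s)
{
    // Evaluated at compile time; an unknown spelling is a compile error.
    enum long index = tokenStrings.countUntil(s);
    static assert(index >= 0, "unknown token: " ~ s);
    enum ushort Tok = cast(ushort) index;
}

static assert(Tok!"&&" != Tok!"&"); // distinct kinds, resolved at compile time
```

A switch over such ids costs the same as a switch over a hand-written enum, so there should be no runtime penalty for the nicer spelling.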
Sep 12 2013
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 09/12/2013 04:21 AM, Martin Nowak wrote:
 On 09/12/2013 03:39 AM, Walter Bright wrote:
 On 9/11/2013 6:30 PM, deadalnix wrote:
 Indeed. What solution do you have in mind ?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.

Maybe some fixed size stack vector with 64 elements or so and some linked list for the unusual case would help ..
Sep 12 2013
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
 Maybe some fixed size stack vector with 64 elements or so and some
 linked list for the unusual case would help ..

 And an extra branch to detect which one is currently the case? No,
 thanks ;)

I would think that branching and using the vector is faster than
constructing and linking nodes in the LL
Sep 12 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, September 12, 2013 23:55:23 Robert Schadek wrote:
 On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
 Maybe some fixed size stack vector with 64 elements or so and some
 linked list for the unusual case would help ..

And an extra branch to detect which one is currently the case? No, thanks ;)

I would think that branching and using the vector is faster than using the constructing and linking nodes in the LL

Well, it sounds like there are several ideas to try out in this thread. I guess that only benchmarking and profiling will really tell us which (if any) are better though. - Jonathan M Davis
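The fixed-size vector plus fallback being debated might look roughly like this (a sketch only, with a dynamic array rather than a linked list as the spill area; Token and all names are placeholders, not the dmd or std.d.lexer implementation):

```d
// Sketch: small fixed array for the common case, heap spill for deep
// lookahead. The extra branch Dmitry mentions is the `i < smallSize` test.
struct LookaheadBuffer(Token, size_t smallSize = 64)
{
    private Token[smallSize] small;  // no allocation in the common case
    private Token[] overflow;        // used only for unusually deep lookahead
    private size_t count;

    void push(Token t)
    {
        if (count < smallSize)
            small[count] = t;
        else
            overflow ~= t;
        ++count;
    }

    ref Token opIndex(size_t i)
    {
        assert(i < count);
        return i < smallSize ? small[i] : overflow[i - smallSize];
    }

    @property size_t length() const { return count; }
}
```

Only benchmarking would tell whether the branch is cheaper than pointer-chasing a list, as the thread concludes.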
Sep 12 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright 
wrote:
 I strongly recommend running the unit tests with -cov. A lexer 
 should be able to get near 100% coverage with the unit tests.

Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in. That being said, it's currently 70% covered. I'll be making another pass over the code later to fill in some of the gaps.
Sep 12 2013
prev sibling next sibling parent reply "Dicebot" <public dicebot.lv> writes:
Discussion has slowed down and thread is fading away.

I personally have not noticed any crucial objections (ones that 
can't be figured out during normal pull request review process). 
Brian, what is your state on this? Is there something you want to 
add/change before the voting gets initiated?
Sep 17 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-09-25 04:48, Brian Schott wrote:

 Changes since last time:

 https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master

I had some comments and a couple of minor things like spell errors:

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.
* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?
* Personally I would like to see a short description for each unit test, what it's testing.
* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spell errors:

* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

I guess these line numbers might be off now. My original comments were made September 12. For reference see:
http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=7#post-l0rsje:24jf9:241:40digitalmars.com

-- 
/Jacob Carlborg
Sep 25 2013
prev sibling parent Jos van Uden <usenet fwend.com> writes:
On 26-9-2013 17:41, Dominikus Dittes Scherkl wrote:
 Hello.

 I'm not sure if this belongs here, but I think there is a bug at the very start
of the Lexer chapter:

 Is U+001A really meant to end the source file?
 According to the Unicode specification this is a "replacement character", like
the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to
 end the source (this would fit, as it means "end of media").

 I don't know if anybody ever has ended his source in that way or if it was
tested.

 More important to me is, that all the Space-Characters beyond ASCII are not
 considered whitespace (starting with U+00A0 NBSP, the different wide spaces
 U+2000 to U+200B up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and
 the famous U+FEFF). Why?
 Ok, the set is much larger, but for the end-of-line also the unicode versions
(U+2028 and U+2029) are added. This seems inconsistent to me.

I imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile
Sep 26 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:
 Discussion has slowed down and thread is fading away.

 I personally have not noticed any crucial objections (ones that 
 can't be figured out during normal pull request review 
 process). Brian, what is your state on this? Is there something 
 you want to add/change before the voting gets initiated?

I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
 On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:
 Discussion has slowed down and thread is fading away.

 I personally have not noticed any crucial objections (ones 
 that can't be figured out during normal pull request review 
 process). Brian, what is your state on this? Is there 
 something you want to add/change before the voting gets 
 initiated?

I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.

Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.
Sep 17 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 17 September 2013 at 16:48:44 UTC, Dicebot wrote:
 On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
 On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:
 Discussion has slowed down and thread is fading away.

 I personally have not noticed any crucial objections (ones 
 that can't be figured out during normal pull request review 
 process). Brian, what is your state on this? Is there 
 something you want to add/change before the voting gets 
 initiated?

I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.

Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.

I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.
Sep 17 2013
prev sibling next sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
 std.d.lexer is standard module for lexing D code, written by 
 Brian Schott

Brian, have you got any plans to write common lexer/parser generator based on context-free grammar (CFG) or parsing expression grammar (PEG), not only for D grammar? Something like this: https://github.com/PhilippeSigaud/Pegged
Sep 17 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:
 I've been busy with things that aren't D-related recently, but 
 I should have time the rest of this week to address the lexer.

Ok, please notify me then once you feel ready for voting / extra review ( public at dicebot.lv ). Thanks for your work.
Sep 17 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
 I had some comments that nobody addressed. Mostly about firing 
 several instances of the lexer with the same identifier pool.

Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.
Sep 24 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:
 I've been busy with things that aren't D-related recently, but 
 I should have time the rest of this week to address the lexer.

Changes since last time: https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master Test coverage is up to 85% now and a few bugs have been fixed.
Sep 24 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 25 September 2013 at 02:23:36 UTC, Brian Schott 
wrote:
 On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
 I had some comments that nobody addressed. Mostly about firing 
 several instances of the lexer with the same identifier pool.

Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.

Yes, ideally as an alias parameter. This is a showstopper for many use cases.
Sep 24 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 25 September 2013 at 09:36:43 UTC, Jacob Carlborg
wrote:
 * I see that errorMessage throws an exception. Do we really 
 want that? I would except it just returns an invalid token.

This is the default behavior that happens when you don't configure an error callback.
 * Could we get some unit tests for string literals, comments 
 and identifies out side of the ASCII table

I've added one.
 * Could you remove debug code and other code that is commented 
 out:

Most of this is now gone.
 Spell errors:

These were fixed weeks ago.
Sep 25 2013
prev sibling next sibling parent "Jacob Carlborg" <doob me.com> writes:
On Wednesday, 25 September 2013 at 16:52:43 UTC, Brian Schott 
wrote:

 This is the default behavior that happens when you don't
 configure an error callback.

I see.
 I've added one.

Thanks.
 Most of this is now gone.

That's good.
 These were fixed weeks ago.

Great, I just never got a response to that. -- /Jacob Carlborg
Sep 25 2013
prev sibling next sibling parent "Dominikus Dittes Scherkl" writes:
Hello.

I'm not sure if this belongs here, but I think there is a bug at 
the very start of the Lexer chapter:

Is U+001A really meant to end the source file?
According to the Unicode specification this is a "replacement 
character", like the newer U+FFFC. Or is it simply a spelling 
error and U+0019 was intended to
end the source (this would fit, as it means "end of media").

I don't know if anybody ever has ended his source in that way or 
if it was tested.

More important to me is, that all the Space-Characters beyond 
ASCII are not
considered whitespace (starting with U+00A0 NBSP, the different 
wide spaces
U+2000 to U+200B up to the exotic stuff U+202F, U+205F, U+2060, 
U+3000 and
the famous U+FEFF). Why?
Ok, the set is much larger, but for the end-of-line also the 
unicode versions (U+2028 and U+2029) are added. This seems 
inconsistent to me.
Sep 26 2013
prev sibling next sibling parent "Dominikus Dittes Scherkl" writes:
On Thursday, 26 September 2013 at 16:47:09 UTC, Jos van Uden 
wrote:
 Is U+001A really meant to end the source file?
 According to the Unicode specification this is a "replacement 
 character", like the newer U+FFFC. Or is it simply a spelling 
 error and U+0019 was intended to
 end the source (this would fit, as it means "end of media").

 More important to me is, that all the Space-Characters beyond 
 ASCII are not considered whitespace

I imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile

I know. What I wanted to say is: the language specification has a bug here (at least it is strange to interpret a "replacement character" as end of file but not "end of media"), and the handling of Unicode space characters is not nice. If this is not the right place to discuss that matter, please point me to a better place.
Sep 27 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:
 The problem is that it can cause a exponential (and I literally 
 mean exponential here) amount of complexity.

 The alternative is to go for some ambiguous function/template 
 parameters parsing and resolve at the end, but as template 
 argument are themselves ambiguous type/expression/symbols, the 
 amount of complexity in the parser is doomed to explode.

Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
Sep 27 2013
prev sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Saturday, 28 September 2013 at 01:23:48 UTC, Mehrdad wrote:
 On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:
 The problem is that it can cause a exponential (and I 
 literally mean exponential here) amount of complexity.

 The alternative is to go for some ambiguous function/template 
 parameters parsing and resolve at the end, but as template 
 argument are themselves ambiguous type/expression/symbols, the 
 amount of complexity in the parser is doomed to explode.

Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...

I meant time... Though it's true for space too.
Sep 27 2013