
digitalmars.D - Formal review of std.lexer

reply "Dicebot" <public dicebot.lv> writes:
http://wiki.dlang.org/Review/std.lexer

This is a follow-up by Brian to his earlier proposal 
(http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
module focuses instead on generic lexer generation, as discussed 
in the matching voting thread.

Docs: 
http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Example topics to evaluate during review:
  - Feasibility of overall design and concept
  - Quality of documentation
  - Consistency with existing Phobos modules
  - Overall quality of implementation

Initial review will end on March the 8th.
Feb 21 2014
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

First of all, thank you for the great work. This is a very important project. I'll begin by reviewing the documentation.
 Summary

Some simple explanation of the terminology and concepts would be nice. At least a link to Wikipedia.
 Create the string array costants for your language.

Typo ("costants")
 Examples:

An inline complete example of a very simple language would be nice.
 A lexer for D is available here.

Although good to have, this is too much to take in all at once, for documentation purposes.
 A lexer for Lua is available here.

Nary a comment in sight. This might serve as the example lexer if only it were better commented. The comments can be copy-pasted from the module documentation; even that would make the code much easier to grok.
 Template Parameter Definitions

What does this mean? Parameters to what template? Can this section be moved to inside the documentation of Lexer, and Lexer be moved to the first documented symbol in the file?
 A function that serves as the default token lexing function. 
 For most languages this will be the identifier lexing function.

Should the function signature and contracts be explained here?
 This function must return bool and take a single size_t 
 argument representing the number of bytes to skip over before 
 looking for a separating character.

I think it's better to describe the signature in D syntax rather than English.
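For instance, the English description above could be rendered as D along these lines. This is a sketch with hypothetical names (ByteRange, isSeparating); the actual function generated for the module is presumably a member of the lexer and takes only the size_t offset:

```d
import std.ascii : isAlphaNum;

// Hypothetical byte range; the real lexer operates on a ubyte range
// with its own lookahead primitives.
struct ByteRange
{
    const(ubyte)[] bytes;
    size_t index;
}

// "Must return bool and take a single size_t argument representing the
// number of bytes to skip over before looking for a separating character":
bool isSeparating(ref ByteRange range, size_t offset)
{
    // Running past the end of input always separates the token.
    if (range.index + offset >= range.bytes.length)
        return true;
    immutable c = cast(char) range.bytes[range.index + offset];
    // A character that cannot continue an identifier ends the token.
    return !(isAlphaNum(c) || c == '_');
}

unittest
{
    auto r = ByteRange(cast(const(ubyte)[]) "for (x)", 0);
    assert(isSeparating(r, 3));  // the space after "for" separates
    assert(!isSeparating(r, 1)); // 'o' continues the identifier
}
```

A signature like this in the docs would answer both the "what does it take" and "what does true mean" questions at once.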
 A listing of tokens whose value is variable, such as 
 whitespace, identifiers, number literals, and string literals.

No mention of how the list is represented (is it an array? what type of elements should the array have? how are the array values used?), the reader is left to figure that out from the example below.
 Template for determining the type used for a token type. 
 Selects the smallest unsigned integral type that is able to 
 hold the value staticTokens.length + dynamicTokens.length + 
 possibleDefaultTokens.length. For example if there are 20 
 static tokens, 30 dynamic tokens, and 10 possible default 
 tokens, this template will alias itself to ubyte, as 20 + 30 + 
 10 < ubyte.max.

Should this be documented? I understand that this will be instantiated only once, by std.lexer. Utility declarations should preferably be at the end of the module, so that they appear last in the documentation.
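The selection being described amounts to something like the following (a simplified stand-in, not the module's actual TokenIdType implementation):

```d
// Pick the smallest unsigned integral type that can hold `count`
// distinct token ids plus the reserved sentinel values.
template SmallestIdType(size_t count)
{
    static if (count < ubyte.max)
        alias SmallestIdType = ubyte;
    else static if (count < ushort.max)
        alias SmallestIdType = ushort;
    else
        alias SmallestIdType = uint;
}

// The documentation's example: 20 + 30 + 10 = 60 < ubyte.max, so ubyte.
static assert(is(SmallestIdType!60 == ubyte));
static assert(is(SmallestIdType!1000 == ushort));
```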
 Generates the token type identifier for the given symbol. There 
 are two special cases:

Are these magic constants necessary? Why not declare them as enums?
  In all cases this template will alias itself to a constant of 
 type IdType. This template will fail at compile time if symbol 
 is not one of the staticTokens, dynamicTokens, or 
 possibleDefaultTokens.

Style nit: D symbols should be wrapped in the $D(...) macro.
== overload for the token type.

Is it really necessary to document opEquals? But since it's here: how does it interact with extraFields?
 The Column number at which this token occurs.

There was a lot of bikeshedding regarding the correct terminology to use when adding -vcolumns to DMD ( https://github.com/D-Programming-Language/dmd/pull/3077 ). I think the documentation should at least mention what exactly it is counting.
 A function that serves as the default token lexing function. 
 For most languages this will be the identifier lexing function.

What should the function's name be? How will it interact with Lexer? (It's not clear that this refers to the defaultTokenFunction parameter, especially since the previous list item, popFront, is a different piece of the puzzle.) The documentation for Lexer's arguments seems to be scattered all around the module. I suggest documenting them only once, all in Lexer's DDoc, adding example signatures, and moving Lexer to the top of the module.
 Examples:
 struct CalculatorLexer

I think this should be expanded into a full, well-documented example featured in the module DDoc.
 _popFront();

Where did that come from? Perhaps you meant Lexer's DDoc to say "which should call this mixin's _popFront()", and the DDoc escaped the _ character? If so, why not use a named mixin to disambiguate instead of _?
 struct LexerRange;
 struct StringCache;

These are thoroughly documented, but will they be used by anything other than std.lexer?
Feb 21 2014
prev sibling next sibling parent reply "Joseph Cassman" <jc7919 outlook.com> writes:
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is a follow-up by Brian to his earlier proposal 
 (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
 module focuses instead on generic lexer generation, as discussed 
 in the matching voting thread.

 Docs: 
 http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
 Code: 
 https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Thanks for all the work Brian. I read through the previous threads about the development of this code (links at the bottom) and I can see a lot of effort has gone into it. So the following comments may come across as uninformed, but hopefully they will be helpful.

1. StringCache is a custom hash table. It looks like its primary role is to reduce some sort of duplication. Hash tables, though, are difficult to get right. So perhaps a benchmark comparison could be made against the built-in hash table to show what savings it brings? Since it is in the public interface, should its payload also be public? Although it is built using GC.malloc, how about the in-the-works std.allocator module? Perhaps a version 1 could use GC.malloc, but if a later PR could make it possible to use a custom allocator, that would be nice.

2. I like the fact that a range interface is provided. I realize that the previous discussions stipulated the use of ubyte to avoid encoding work during scanning. The reasoning about performance makes sense to me. That being the case, could a code example be provided showing how to use this module to scan a UTF-8 encoded string? Even if this is going to focus only on scanning code files, the D language spec allows for arbitrary Unicode in a code file. How is this possible? (I have a general idea, just looking for some explicit code sample help.)

3. I tried to understand the reason for and usage of the "extraFields" parameter in "TokenStructure" but couldn't figure it out. Could some more explanation of its intent and usage be provided?

4. Do you want the commented-out pragma statement left over on line 601?

5. Should the template "TokenId" perhaps be something like "generateTokenId" instead? I am not sure what an "Id" for a token means. Is it an integral hash value? I had difficulty seeing how it ties in with the concept of "value" in the header documentation. If this is a numerical hash of a string token, why is the string still stored and used in "tokenStringRepresentation"? I am probably missing something big, but couldn't the number be used to represent the string everywhere, saving on time and space?

6. I tried but had difficulty understanding the difference between the four token categories -- "staticTokens", "dynamicTokens", "possibleDefaultTokens", "tokenHandlers" -- provided as arguments to "Lexer". What is a token whose value is variable versus a token whose value does not change? I am not sure where to put my token definitions.

7. Just thinking about using the module: I would like to use it to make a scanner for XML, JSON, CSV, C/C++, etc. I wasn't able to figure out how to do so, however. The initial code example is nice, but could some additional guidance be provided? Also, I wasn't sure how to make use of a lexer once created. The documentation focuses well on how to initialize a "Lexer", but could some guidance also be provided on how to use one past initialization?

8. Andrei's trie search (http://forum.dlang.org/thread/eeenynxifropasqcufdg forum.dlang.org?page=4#post-l2nm7m:2416e1:24 :40digitalmars.com) seemed like a really interesting idea, and I saw in that thread you continued with his ideas. Does this module incorporate that work? Or was it less performant in the end?

9. I ran "dmd -cov" against the module and got zero percent unit test coverage. Perhaps adding some test code would help clarify usage patterns?

You have put a lot of work into this code, so I apologize if the above comes across as picking it apart. Just some questions I had in trying to make use of the code. Hopefully some of it is helpful.

Joseph

Other related posts:
http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org
http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf forum.dlang.org
http://forum.dlang.org/thread/eeenynxifropasqcufdg forum.dlang.org
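For what it's worth, point 2 usually reduces to getting at the string's raw bytes. A sketch of what such a documentation example might show (my guess, not code from the module):

```d
import std.string : representation;

void main()
{
    string source = "auto naïve = 1;"; // arbitrary Unicode is legal in D source

    // representation() views the UTF-8 code units as bytes, no decoding.
    immutable(ubyte)[] bytes = source.representation;

    // ASCII delimiters (space, ';', '=', ...) occupy single bytes below 0x80,
    // while every byte of a multi-byte UTF-8 sequence is >= 0x80, so a
    // byte-level scanner never splits a token at the wrong place for an
    // ASCII-delimited grammar.
    assert(bytes.length == source.length); // string.length is already in bytes
}
```

That is, a ubyte-based lexer can consume UTF-8 text directly because non-ASCII sequences can only appear inside identifiers, strings, and comments.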
Feb 21 2014
parent reply Martin Nowak <code dawg.eu> writes:
On 02/22/2014 09:31 PM, "Marc Schütz" <schuetzm gmx.net> wrote:
 But that still doesn't explain why a custom hash table implementation is
 necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?

I'm also wondering what benefit this hash table provides.
Mar 16 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Mar-2014 02:13, Martin Nowak writes:
 On 02/22/2014 09:31 PM, "Marc Schütz" <schuetzm gmx.net> wrote:
 But that still doesn't explain why a custom hash table implementation is
 necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?

I'm also wondering what benefit this hash table provides.

Getting back to this. The custom hash map originally was a product of optimization; the benefits over built-in AAs are:
a) Allocation was amortized by allocating nodes in batches.
b) It allowed a custom hash function to be used with a built-in type (string).
Not sure how much of that stands today. -- Dmitry Olshansky
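Point (a) can be sketched like so (hypothetical code, not StringCache's actual layout): one array allocation stands in for many per-node allocations.

```d
// A node as a chaining hash table might store it.
struct Node
{
    immutable(ubyte)[] key;
    Node* next;
}

// Hand out nodes carved from GC-allocated blocks, so the GC sees one
// allocation per `batchSize` nodes instead of one per node.
struct NodeBatch(size_t batchSize = 64)
{
    private Node[] block;
    private size_t used;

    Node* allocate()
    {
        if (used == block.length)
        {
            block = new Node[batchSize]; // one allocation, many nodes
            used = 0;
        }
        return &block[used++];
    }
}

unittest
{
    NodeBatch!4 batch;
    auto a = batch.allocate();
    auto b = batch.allocate();
    assert(a !is b); // distinct nodes from the same block
}
```

Built-in AAs allocate each bucket node individually, which is exactly the overhead this amortizes.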
May 05 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/21/14, 2:12 PM, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is a follow-up by Brian to his earlier proposal
 (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module
 focuses instead on generic lexer generation, as discussed in the matching
 voting thread.

 Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
 Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

 Example topics to evaluate during review:
   - Feasibility of overall design and concept
   - Quality of documentation
   - Consistency with existing Phobos modules
   - Overall quality of implementation

 Initial review will end on March the 8th.

Can we please defer this by one week? Thanks, Andrei
Feb 21 2014
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2014-02-21 18:06, Andrei Alexandrescu wrote:

 Can we please defer this by one week?

Just make the review period one week longer. -- /Jacob Carlborg
Feb 21 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/21/14, 9:51 PM, Brian Schott wrote:
 Does this mean that you're finally getting approval to release your
 lexer generator?

Affirmative. It'll happen Real Soon Now(tm). Andrei
Feb 21 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 21 February 2014 at 17:15:41 UTC, Jacob Carlborg wrote:
 On 2014-02-21 18:06, Andrei Alexandrescu wrote:

 Can we please defer this by one week?

Just make the review period one week longer.

This. There is no real rationale behind the existing default review period, so extending it until March the 15th won't cause any issues.
Feb 21 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
Does this mean that you're finally getting approval to release 
your lexer generator?

On Friday, 21 February 2014 at 17:06:23 UTC, Andrei Alexandrescu 
wrote:
 Can we please defer this by one week?

 Thanks,

 Andrei

Feb 21 2014
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/21/14, Joseph Cassman <jc7919 outlook.com> wrote:
 1. StringCache is a custom hash table. It looks like it's primary
 role is to reduce some sort of duplication.

It's used for string interning.
 3. I tried to understand the reason for and usage of the
 "extraFields" parameter in "TokenStructure" but couldn't figure
 it out. Could some more explanation of its intent and usage be
 provided?

You could use it to inject a toString method. Here's what I did when I used it:

alias Token = TokenStructure!(TokID,
    q{
        /// Better string representation
        void toString(scope void delegate(const(char)[]) sink) const
        {
            import std.conv;
            import %s;

            sink("(");
            sink(this.line.to!string);
            sink(":");
            sink(this.column.to!string);
            sink(")");
            sink(": ");
            sink(this.text ? this.text : tokToString(this.type));
        }
    }.format(__MODULE__)
);

Side-note: Note how I've had to inject an import statement for my own module, because the string is mixed in at the declaration site of a module which I can't modify, where IIRC tokToString didn't exist because it's located in *my* module. It's interesting how you can use this feature in D, IOW:

-----
module mymodule;

import other_module;

void foo() { }

mixin_elsewhere!("import mymodule; foo();");
-----

-----
module other_module;

void mixin_elsewhere(string str)()
{
    mixin(str);
}
-----

P.S. The DLexer link is here: https://github.com/Hackerpilot/lexer-demo/

I've used it to learn how to use DLexer. Here's my more-documented version of the above if it helps anyone. It also takes the D parser's whitespace function and uses that in order to track column numbers, injects a toString like I've mentioned, and adds some static asserts for sanity checks:

https://github.com/AndrejMitrovic/lexer-demo
Feb 21 2014
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/21/14, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 I've used it to learn how to use DLexer, here's my more-documented
 version of the above if it helps anyone, which also takes the D
 parser's whitespace function and uses that in order to track column
 numbers, it injects a toString like I've mentioned, and adds some
 static asserts for sanity checks:

 https://github.com/AndrejMitrovic/lexer-demo

Hmm, maybe I should make a pull request upstream?
Feb 21 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 21 February 2014 at 22:14:36 UTC, Andrej Mitrovic 
wrote:
 Hmm, maybe I should make a pull request upstream?

You should.
Feb 21 2014
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 21 February 2014 at 22:13:30 UTC, Andrej Mitrovic 
wrote:
 On 2/21/14, Joseph Cassman <jc7919 outlook.com> wrote:
 1. StringCache is a custom hash table. It looks like it's 
 primary
 role is to reduce some sort of duplication.

It's used for string interning.

But that still doesn't explain why a custom hash table implementation is necessary. Maybe a lightweight wrapper around built-in AAs is sufficient? (I also apologize if this has already been asked and answered in the last review thread, which is unfortunately too long to read in a short time.)
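For comparison, the lightweight wrapper suggested above might look roughly like this (a sketch only; whether it can match StringCache's performance is exactly the open question):

```d
// Minimal string interning on top of the built-in associative array:
// every distinct byte sequence maps to one canonical string copy.
struct SimpleCache
{
    private string[const(ubyte)[]] interned;

    // Return the canonical copy of `bytes`, inserting it on first sight.
    string intern(const(ubyte)[] bytes)
    {
        if (auto existing = bytes in interned)
            return *existing; // cache hit: no allocation
        auto s = (cast(const(char)[]) bytes).idup; // own immutable copy
        interned[cast(immutable(ubyte)[]) s] = s;
        return s;
    }
}

unittest
{
    SimpleCache cache;
    auto a = cache.intern(cast(const(ubyte)[]) "identifier");
    auto b = cache.intern(cast(const(ubyte)[]) "identifier");
    assert(a is b); // both lookups yield the same interned string
}
```

The AA here allocates one GC node per entry and uses the default hash, which are the two things the custom table appears to be working around.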
Feb 22 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
Brian, do you have benchmarks for this proposal similar to the ones 
you provided in the old review threads? (vs the DMD frontend lexer)
Feb 24 2014
prev sibling next sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is a follow-up by Brian to his earlier proposal 
 (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed 
 module focuses instead on generic lexer generation, as discussed 
 in the matching voting thread.

 Docs: 
 http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
 Code: 
 https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

 Example topics to evaluate during review:
  - Feasibility of overall design and concept
  - Quality of documentation
  - Consistency with existing Phobos modules
  - Overall quality of implementation

 Initial review will end on March the 8th.

No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up. Naturally anyone should be encouraged to provide a better solution by submitting a pull request to Phobos developers... So far I haven't seen a better lexer for D source than Brian's std.lexer. If anyone has, please let me know.
Feb 24 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/24/14, 7:18 PM, Adam Wilson wrote:
 Note that, as badly as I want std.lexer to be included, I want it to
 pass a rigorous review. This review (and, if it passes, subsequent
 inclusion) has opened up an opportunity to start using D at work that I
 did not expect, and so I am kind of excited about it.

I think it would be great if we rigged things such that std.lexer simplifies flint and makes it faster. From what I saw Brian has a very similar approach. I'm hoping for a few improvements, which I'll share soon. Andrei
Feb 25 2014
prev sibling next sibling parent "Meta" <jared771 gmail.com> writes:
On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote:
 No criticism should stop this module being accepted, as we do 
 not have any other lexer in the runtime anyway. Therefore I 
 suggest we accept std.lexer until a better solution comes up. 
 Naturally anyone should be encouraged to provide a better 
 solution by submitting a pull request to Phobos developers...

The problem is that this is what has been done before, and now we are more or less stuck with outdated, sometimes poorly-written, often buggy modules (std.signals being one example).
Feb 24 2014
prev sibling next sibling parent "Adam Wilson" <flyboynw gmail.com> writes:
On Mon, 24 Feb 2014 14:32:53 -0800, Meta <jared771 gmail.com> wrote:

 On Monday, 24 February 2014 at 22:14:34 UTC, Dejan Lekic wrote:
 No criticism should stop this module being accepted, as we do not have  
 any other lexer in the runtime anyway. Therefore I suggest we accept  
 std.lexer until a better solution comes up. Naturally anyone should be  
 encouraged to provide a better solution by submitting a pull request to  
 Phobos developers...

The problem is that this is what has been done before, and now we are more or less stuck with outdated, sometimes poorly-written, often buggy modules (std.signals being one example).

Well, we keep voting down replacement candidates, which, incidentally, is exactly what happened with the std.signals replacement, so I view this as an orthogonal issue to whether or not it should be included after passing a review. I don't think the fact that a module might not be perfect after review should stop us from approving a module that offers completely new functionality AND passed a review. Handling the problems after inclusion is what bugzilla is for. -- Adam Wilson GitHub/IRC: LightBender Aurora Project Coordinator
Feb 24 2014
prev sibling next sibling parent "Meta" <jared771 gmail.com> writes:
On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote:
 Well, we keep voting down replacement candidates, which 
 incidentally, is exactly what happened with the std.signals 
 replacement, so I view this as an orthogonal issue to whether 
 or not it should be included after passing a review. I don't 
 think the fact that a module might not be perfect after review 
 should stop us from approving a module that offers completely 
 new functionality AND passed a review. Handling the problems 
 after inclusion is what bugzilla is for.

I guess std.signals was a bad example, as there *was* a proposed replacement. However, there were real problems with the replacement that made it not suitable for inclusion. If I recall, these were largely API issues, which are the hardest to change. If we had accepted the new std.signals despite the issues raised, several years down the road it might turn out to be as broken as the old implementation (no offense to the author, this is just for the sake of argument), and we would be unable to replace it for fear of breaking code. There are then 2 options: support the old API with its broken behaviour in the same module as the new API, or introduce std.signals2 or the like, which people have shown resistance to in the past. I think that it's very important to be careful as to what goes into Phobos at this point, as it's going to become increasingly difficult to change anything.
Feb 24 2014
prev sibling next sibling parent "Adam Wilson" <flyboynw gmail.com> writes:
On Mon, 24 Feb 2014 15:36:43 -0800, Meta <jared771 gmail.com> wrote:

 On Monday, 24 February 2014 at 23:07:07 UTC, Adam Wilson wrote:
 Well, we keep voting down replacement candidates, which incidentally,  
 is exactly what happened with the std.signals replacement, so I view  
 this as an orthogonal issue to whether or not it should be included  
 after passing a review. I don't think the fact that a module might not  
 be perfect after review should stop us from approving a module that  
 offers completely new functionality AND passed a review. Handling the  
 problems after inclusion is what bugzilla is for.

I guess std.signals was a bad example, as there *was* a proposed replacement. However, there were real problems with the replacement that made it not suitable for inclusion. If I recall, these were largely API issues, which are the hardest to change. If we had accepted the new std.signals despite the issues raised, several years down the road it might turn out to be as broken as the old implementation (no offense to the author, this is just for the sake of argument), and we would be unable to replace it for fear of breaking code. There are then 2 options: support the old API with its broken behaviour in the same module as the new API, or introduce std.signals2 or the like, which people have shown resistance to in the past. I think that it's very important to be careful as to what goes into Phobos at this point, as it's going to become increasingly difficult to change anything.

Ok. Then by that standpoint we have two remaining options. A) Don't ever change existing code, because you might break someone else's depending code by accident. Even without an API change, a change in how a function does its processing can result in incorrect behavior in down-stream code. B) Never introduce new modules, because we are terrified that the API might not be right in five years.

It is unrealistic in the extreme to demand that a new module in Phobos meet some arbitrary future compatibility bar. We routinely make changes to Phobos that break people's code in subtle ways that don't produce compiler errors, and we never hear about it, because no sane programmer expects a standard library to be a static thing that will never ever change in any way whatsoever so that they can expect the exact same behavior five years from now. No, they go fix it and move on.

"OMG! We can't add this! It might not be the right API in the future!" By that reasoning we should halt all work on Phobos ever. After all, we might break something, or a new API might not be perfect in five years after a new language feature allows an obviously better way of implementing that API. As reasoning goes, the best word I can come up with to describe it is "naive". It is a purely fear-based response based on some unspecified and unknown future circumstance. Newsflash: you can't predict the future, which, incidentally, is why APIs change!

The reason the std.signals replacement got voted down was that it wasn't API compatible with the original. This is despite the fact that nobody stood up to say that they use the current implementation and like it. Anybody who did use it said that they are desperate for a replacement. No, the current std.signals is universally reviled and almost completely unused, but we can't change it because of the mythical silent user who might be out there coding under a rock with no Internet and isn't voicing their opinion.

Well, guess what: if you don't speak up, you have no right to blame it on the people who did and decided to change it. You had your chance; stop whining and go update your code. We are programmers. It is our JOB to deal with changes in the API/ABI/ISA. We don't have the RIGHT to whine about change, because dealing with change is what we DO. If we aren't dealing with change, we're working on something that isn't used at all and therefore pointless. I'm all for pointless pursuits, but don't hamstring me and my non-pointless pursuits with them.

I wanted std.signals in the new form badly, but now I can't have it, and Aurora is going to suffer mightily for it. Now I get to go off and make my own substandard iteration of Signals, because some people whined about how the API wasn't backwards compatible, even though it's widely acknowledged that almost nobody uses it and those that do would've been happy to move to the new signals.

Note, I am not talking about voting down modules that are obviously poorly implemented; I am talking about modules that work well and can do the task they purport to do with as few bugs as possible. If the module can do what it says it can, and the API isn't horrible, then we should default to adding it. Griping about the API five years from now is just bikeshedding; write up a pull and get that reviewed, and chances are high it'll get rejected because the changes are worse anyway. The only difficulty with changing anything is in your head. In practice, programmers who actually need to get things done expect it. -- Adam Wilson GitHub/IRC: LightBender Aurora Project Coordinator
Feb 24 2014
prev sibling next sibling parent "Meta" <jared771 gmail.com> writes:
On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote:
[SNIP]

You're throwing what I said way out of proportion. I was replying to the statement: "No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up." I don't agree with this. Obviously std.lexer is well-written and has been through a few rounds of iteration, but that doesn't mean we should accept it without due diligence to ensure that we won't be regretting some overlooked, poorly-designed or badly-named piece of functionality down the road. "Good enough because we don't yet have anything better" is a bad idea. It seems to me that what Brian has written is much better than "good enough", but I don't think that it should be accepted into Phobos *solely* because we don't have anything else. If the community decides that it is a worthwhile addition, then great, but that must not happen *until* it has passed rigorous review, just like every other recent Phobos module.
Feb 24 2014
prev sibling next sibling parent "Adam Wilson" <flyboynw gmail.com> writes:
On Mon, 24 Feb 2014 17:22:32 -0800, Meta <jared771 gmail.com> wrote:

 On Tuesday, 25 February 2014 at 00:28:26 UTC, Adam Wilson wrote:
 [SNIP]

You're throwing what I said way out of proportion. I was replying to the statement: "No criticism should stop this module being accepted, as we do not have any other lexer in the runtime anyway. Therefore I suggest we accept std.lexer until a better solution comes up." I don't agree with this. Obviously std.lexer is well-written and has been through a few rounds of iteration, but that doesn't mean we should accept it without due diligence to ensure that we won't be regretting some overlooked, poorly-designed or badly-named piece of functionality down the road. "Good enough because we don't yet have anything better" is a bad idea. It seems to me that what Brian has written is much better than "good enough", but I don't think that it should be accepted into Phobos *solely* because we don't have anything else. If the community decides that it is a worthwhile addition, then great, but that must not happen *until* it has passed rigorous review, just like every other recent Phobos module.

Fair enough. I guess I am just still touchy after the way std.signals was shot down. There weren't great technical arguments for shooting it down, and so I feel that a good piece of code that would've been immediately useful and accepted by the community was rejected over some pretty silly fears. Note that, as badly as I want std.lexer to be included, I want it to pass a rigorous review. This review (and, if it passes, subsequent inclusion) has opened up an opportunity to start using D at work that I did not expect, and so I am kind of excited about it. -- Adam Wilson GitHub/IRC: LightBender Aurora Project Coordinator
Feb 24 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Feb 24, 2014 at 07:18:58PM -0800, Adam Wilson wrote:
[...]
 Note that, as badly as I want std.lexer to be included, I want it to
 pass a rigorous review. This review (and, if it passes, subsequent
 inclusion) has opened up an opportunity to start using D at work
 that I did not expect, and so I am kind of excited about it.

Me too, I'm looking forward to a well-designed lexer generator in Phobos. I will have lots of uses for it. Sadly, I haven't had the time to review the proposed code as closely as I'd like. T -- What doesn't kill me makes me stranger.
Feb 24 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Tuesday, 25 February 2014 at 03:52:01 UTC, H. S. Teoh wrote:
 Sadly, I haven't had the time to review the proposed code as 
 closely as
 I'd like.

Don't hesitate to ask for an extension of the review period if you need it. Making a good review is more important than meeting some arbitrary deadline.
Feb 25 2014
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2014-02-21 13:12, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is a follow-up by Brian to his earlier proposal
 (http://wiki.dlang.org/Review/std.d.lexer). This time the proposed module
 focuses instead on generic lexer generation, as discussed in the matching
 voting thread.

I don't like it being a new top level module. I would name it std.language.lexer or std.lang.lexer. -- /Jacob Carlborg
Feb 25 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/25/14, 10:58 PM, Brian Schott wrote:
 On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote:
 std.lexer could be the umbrella for a bunch of different lexers. Then
 we could have std.lexer.xml, std.lexer.json, etc.

I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d

I think we wouldn't want to add one more package for each language supported. Andrei
Feb 26 2014
parent Jacob Carlborg <doob me.com> writes:
On 2014-02-26 16:18, Andrei Alexandrescu wrote:

 I think we wouldn't want to add one more package for each language
 supported.

That's exactly what we want, preferably in a common package:

std.language.d.lexer
std.language.d.ast
std.language.d.parser
std.language.xml.lexer
std.language.xml.parser
std.language.xml.dom

What do you suggest, having multiple lexers for different languages in the same module? -- /Jacob Carlborg
Feb 26 2014
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2014-02-26 07:58, Brian Schott wrote:

 I think that's a bit backwards. I'd rather have

 std.d.lexer
 std.d.ast
 std.d.parser

 than

 std.lexer.d
 std.parser.d
 std.ast.d

I agree with Brian. Although I would have a common package for all languages:

std.language.d.lexer
std.language.d.ast
std.language.d.parser

-- /Jacob Carlborg
Feb 26 2014
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2014-02-26 00:25, Dicebot wrote:

 Don't know if it makes sense to introduce random package categorization.
 I'd love to see more hierarchy in Phobos too but we'd first need to
 agree to package separation principles then.

Then that's what we need to do. I don't want any more top level modules. There are already too many. -- /Jacob Carlborg
Feb 26 2014
prev sibling next sibling parent "Meta" <jared771 gmail.com> writes:
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg 
wrote:
 On 2014-02-21 13:12, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is follow-up by Brian to his earlier proposal
 (http://wiki.dlang.org/Review/std.d.lexer). This time proposed 
 module
 focuses instead on generic lexer generation as discussed in 
 matching
 voting thread.

I don't like it being a new top level module. I would name it std.language.lexer or std.lang.lexer.

std.lexer could be the umbrella for a bunch of different lexers. Then we could have std.lexer.xml, std.lexer.json, etc.
Feb 25 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Tuesday, 25 February 2014 at 20:48:08 UTC, Jacob Carlborg 
wrote:
 I don't like it being a new top level module. I would name it 
 std.language.lexer or std.lang.lexer.

Don't know if it makes sense to introduce random package categorization. I'd love to see more hierarchy in Phobos too, but we'd first need to agree on package separation principles.
Feb 25 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 25 February 2014 at 20:21:17 UTC, Andrei Alexandrescu 
wrote:
 I think it would be great if we rigged things such that 
 std.lexer simplifies flint and makes it faster. From what I saw 
 Brian has a very similar approach. I'm hoping for a few 
 improvements, which I'll share soon.

 Andrei

I'll be waiting for this list. In the meantime, I'm making a change to the string interning process to make it more threading friendly.
Feb 25 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 25 February 2014 at 23:17:56 UTC, Meta wrote:
 std.lexer could be the umbrella for a bunch of different 
 lexers. Then we could have std.lexer.xml, std.lexer.json, etc.

I think that's a bit backwards. I'd rather have

std.d.lexer
std.d.ast
std.d.parser

than

std.lexer.d
std.parser.d
std.ast.d
Feb 25 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote:
 Brian, do you have benchmarks for this proposal similar to ones 
 you have provided in old review threads? (vs DMD frontend lexer)

Not yet, but I can create them before the review period ends.
Feb 25 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Wednesday, 26 February 2014 at 07:00:33 UTC, Brian Schott 
wrote:
 On Monday, 24 February 2014 at 19:05:35 UTC, Dicebot wrote:
 Brian, do you have benchmarks for this proposal similar to 
 ones you have provided in old review threads? (vs DMD frontend 
 lexer)

Not yet, but I can create them before the review period ends.

Thanks, that will be very interesting information to consider in context of DDMD.
Feb 26 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
Bringing this back to the front page.
Mar 03 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
Reminder about benchmarks.

By the way, is the generated lexer usable at CTFE? Imaginary use case: easier DSL implementation.
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Wednesday, 26 February 2014 at 18:07:37 UTC, Jacob Carlborg 
wrote:
 On 2014-02-26 00:25, Dicebot wrote:

 Don't know if it makes sense to introduce random package 
 categorization.
 I'd love to see more hierarchy in Phobos too but we'd first 
 need to
 agree to package separation principles then.

Then that's what we need to do. I don't want any more top level modules. There are already too many.

As much as I hate to say it, such a hierarchy is worth a DIP. Once it is formalized, I can process it in the review queue as if it were a new module proposal.
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
Initial review has finished. Voting will be delayed because Brian is currently busy and Walter's ongoing scopebuffer proposal needs to be processed first (per agreement with both Brian and Walter).

Anyone late for review can still leave comments; I am sure Brian will take them into consideration when making last-minute changes before voting.
Mar 16 2014
prev sibling next sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 21/02/2014 12:12 PM, Dicebot wrote:
 http://wiki.dlang.org/Review/std.lexer

 This is follow-up by Brian to his earlier proposal
 (http://wiki.dlang.org/Review/std.d.lexer). This time proposed module
 focuses instead on generic lexer generation as discussed in matching
 voting thread.

 Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
 Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

 Example topics to evaluate during review:
   - Feasibility of overall design and concept
   - Quality of documentation
   - Consistency with existing Phobos modules
   - Overall quality of implementation

 Initial review will end on March the 8th.

I know the official review period is long past, but I'd not had a good look at this module until this past weekend.

Last year I had been working on my own xml lexer/parser, but as of yet I have nothing to show for it, so I took a look at this proposal with an eye towards using it to make my efforts easier.

Andrei's posts about the possible design of a generic lexer had also influenced me, so I was expecting to find similarities between this module and my own work, albeit with the added benefits of being generic (in the good way). I have, however, found it very difficult to understand much of it, which I entirely put down to my own deficiencies with templates and especially the use of mixins.

In the example Dlang lexer, the constructor takes a ubyte[] as input and wraps it in a LexerRange struct which defines the normal input range primitives as well as various functions for lookahead. It is not documented whether the lexer needs these extra features or if they are only provided for use within the tokenising functions that are supplied to the template by the user. If they are used by the core of the lexer then it would seem to preclude the use of any other type of input that cannot be coerced into a ubyte[], without effort on the part of the user to implement the same interface.

I think the description of the functionality required of the tokenSeparatingFunction that the user must supply needs to be much better. If I understand correctly, it is intended to differentiate between keywords and identifiers which begin with a keyword. The more I think about this, the less certain I am.

When the lexer dispatches to a token handler, the front of the range is left pointing to the beginning of the character sequence that was matched, allowing it to be included in the returned token. However, many of the handlers in the example Dlang lexer begin with a number of blind popFront calls to jump to the end of that match.
I am aware that in well-meaning code this is a case of the range being logically !empty, but I also wonder how often it might get overlooked when 2 matches of different lengths are dispatched to the same handler. (I had a similar situation in my own code, and my solution was to have a variable storing the .save of my inputRange and count how many chars were consumed since it was updated. This way I could either return the whole match or part of it in the token, or discard it and include only what came after it.) As there has been some contention about the correct use of the range primitives of late, I will refrain from making any other comment on their use in this module, especially as I am no longer sure that I have been using them correctly myself.

In the short time that I have been looking at the features of this lexer, I have not been able to figure out a way of writing a standards-compliant XML parser without having to lex some parts of the document at least twice, or subverting the token handlers to change behaviour according to context. Several non-compliant single-pass XML lexers would be possible, but they would not be able to process documents that used some (admittedly obscure and often overlooked) features. The only scalable technique that I can think of to allow XML to be lexed in a single pass in a fully spec-compliant way would be to allow handlers to return multiple tokens. I am not sure how feasible this would be or what mechanism would be best to implement it.

On the whole I think the overall design of the module shows promise, but it requires polish to make it both more idiomatically Dlang-y and easier for the user to build upon (both by documentation and interface).

On a side note related to the example lexer for Dlang, I believe the predicate function isDocComment will produce false positives for the following comment delimiters, which to my knowledge are not valid DDoc delimiters...

//* //+ /*+ /*/ /+* /+/

As the Dlang lexer is not part of the review proper I have not inspected it carefully; this function just happens to be the first one declared in that example.

Again, my apologies for the tardiness of this review.

A...
Apr 14 2014
parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 14/04/2014 10:34 PM, Brian Schott wrote:
 ubyte[] is required for the lexer to work quickly. That lexer range is
 designed to keep track of the column and line numbers.

I can understand that speed requires the input to be handled as bytes instead of chars, but the restriction to a ubyte[] over a random-access range seems to me un-Dlang-y. If the LexerRange is only there to optionally add line/column numbering support, then I think it needs a much more descriptive name and much better documentation.
 That function's purpose is to determine if the current code unit
 signifies the end of an identifier/keyword. When lexing "fortunate", the
 lexer would spot "for", and then call the separating function with the
 index of "u". In the case of D, it would say that this does NOT end a
 word, and "fortunate" should be lexed as an identifier. If it was called
 with an index of a space or parenthesis, it should return true.

Somehow I had skipped the salient part of the documentation for this, so my apologies (at first I thought that the DDoc output I was reading must have been out of sync with the source (it was, but not that much), but further investigations suggest some form of temporary blindness). This description squares with what I had surmised from reading the code, and I can see why it would be more efficient than the common technique of comparing every matched identifier to the list of keywords.

I do wonder, however, if there might be another way to attain the same efficiency without the need for the separating function (I should have replied to this last night when my ideas on the matter were clearer; sleep seems to have stolen my theoretical alternative ><).

I'm also curious about the decision that the signature of the separating function should take the offset to the character that needs to be tested. A more Dlang-y thing to do, in my mind, would be to pass a range that begins at the first character after the matched keyword.
 If more than one prefix is dispatched to the same handler, the handler
 cannot blindly call popFront or popFrontN. If the lexer author knows
 that the handler will only be called in certain places, than checking
 !empty is a waste of cycles. I don't think it's possible for the lexer
 generator to enforce this.

I don't think the lexer needs to do anything extra to ensure that a handler can begin its work without either repeating calls to .empty or blindly calling popFront. To do so requires that the input be treated as a forward range. Before reading a token, store the .save of the input, then advance the input as the token is matched, counting the consumed elements. When the handler is called it will have the option of including the token by adding to the count and returning a slice that begins at the .save'd position, or ignoring the length of the match and returning a slice that begins at the position after the matched token.

I strongly believe that a model that requires the user to reason about a library providing ranges that are "logically !empty" is a misstep.
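The scheme described above might be sketched like this, assuming the input is a forward range of ubyte. (Matcher, wholeMatch, and discardMatch are hypothetical names I'm using for illustration; none of them exist in the proposed std.lexer.)

```d
import std.range : isForwardRange, take;

// Hypothetical wrapper implementing the .save-and-count scheme above.
struct Matcher(R) if (isForwardRange!R)
{
    R mark;          // .save taken when the current token began
    R input;         // current read position
    size_t consumed; // elements popped since mark was last updated

    bool empty() { return input.empty; }
    auto front() { return input.front; }
    void popFront() { input.popFront(); ++consumed; }

    // Hand the handler the whole match, including the dispatched prefix...
    auto wholeMatch() { return mark.save.take(consumed); }

    // ...or forget the prefix so the returned token starts after it.
    void discardMatch() { mark = input.save; consumed = 0; }
}
```

With this, a handler shared by prefixes of different lengths never needs to guess how much was already consumed: it either slices from the mark or resets it.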
 XML doesn't seem to have very distinct lexing/parsing phases like JSON
 markup or Java code does. This lexer generator may not be the best way
 to deal with XML.

I don't really know how the grammar in the XML standard compares to others in general. It is certainly more convoluted than the published grammar for Dlang, but we all know that that doesn't quite match the language. Some criticise XML for being too complicated, but in some respects it is even more complicated than that; it seems ironic that it was supposed to be simple.

But I might have typed too soon when I wrote that this lexer might not be able to fully tokenize XML in a single pass. As I was drifting off to sleep last night I had the idea of using the extraFields of TokenStructure to add a token pointer (or equivalent construct) to the tokens, making a singly linked list and allowing me to return multiple tokens from a token handler. I just need to convince myself that it is a sane thing to do.
 If you have specific problems, please list them.

I think the calculator example should be improved or replaced so that the documentation's first example uses all the string[]s that can be passed to the lexer. Perhaps a simple symbolic calculator that accepts letter sequences as variables and reserves pi as a keyword?

I think that the section that describes the token handlers needs to fully document the primitives that the user has access to in order to lex a token, what state to expect them to be in when the handler is called, and what state they should be in when it returns.

If tracking lines and columns is optional and supported only when wrapping the input with LexerRange, then I think that by default Tokens should not contain fields for them. Perhaps an enum qstring that declares the required fields can be introduced, and an example shown where it is concatenated with user declarations to be passed in via extraFields.

I'll again restate that I think LexerRange needs a name that better reflects its functionality. From reading the code I see that it does do a little more than simply track line/column, by providing more efficient implementations of some of the range primitives. I think that the user should be able to take advantage of these savings without having to also commit to line/column tracking. I think there should also be the option of tracking only lines, as column information is not always as accurate or useful.

I think that is all my concerns for now; I'll leave my crazy ideas for "improvements" for another day.
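The enum-qstring suggestion might look something like this. (The name positionFields is hypothetical; TokenStructure, IdType, and the extraFields parameter are from the proposed std.lexer, so treat this as a sketch of the idea rather than working code.)

```d
// The library would declare the line/column fields once, as a
// reusable q{} string constant:
enum positionFields = q{
    size_t line;
    size_t column;
};

// A user who wants position tracking plus a field of their own would
// concatenate the two declaration strings and pass the result in as
// the extraFields template argument:
alias Token = TokenStructure!(IdType, positionFields ~ q{
    string comment;
});
```

Users who don't wrap their input in LexerRange would simply omit positionFields, and their Token would carry no dead weight.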
 If you do find issues with the lexer or parser, please file them here:
 https://github.com/Hackerpilot/Dscanner/issues

Will do! A...
Apr 15 2014
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 16 March 2014 at 22:13:16 UTC, Martin Nowak wrote:
 On 02/22/2014 09:31 PM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 But that still doesn't explain why a custom hash table 
 implementation is
 necessary. Maybe a lightweight wrapper around built-in AAs is 
 sufficient?

I'm also wondering what benefit this hash table provides.

The hash table was added by Dmitry a while back. I assume it's because he had numbers to back it up.
Apr 14 2014
prev sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Monday, 14 April 2014 at 20:43:58 UTC, Alix Pexton wrote:
 I know the official review period is long past but I'd not had 
 a good look at this module until this past weekend.

 Last year I had been working on my own xml lexer/parser but as 
 of yet I have nothing to show for it so I took a look at this 
 proposal with an eye towards using it to make my efforts easier.

 Andrei's posts about the possible design of a generic lexer had 
 also influenced me, so I was expecting to find similarities 
 between this module and my own work, albeit with the added 
 benefits of being generic (in the good way). I have, however, 
 found it very difficult to understand much of it, which I 
 entirely put down to my own deficiencies with templates and 
 especially the use of mixins.

 In the example Dlang lexer, the constructor takes a ubyte[] as 
 input and wraps it in a LexerRange struct which defines the 
 normal input range primitives as well as various functions for 
 lookahead. It is not documented whether the lexer needs these 
 extra features or if they are only provided for use within the 
 tokenising functions that are supplied to the template by the 
 user. If they are used by the core of the lexer then it would 
 seem to preclude the use of any other type of input that cannot 
 be coerced into a ubyte[] without the effort on the part of the 
 user to implement the same interface.

ubyte[] is required for the lexer to work quickly. That lexer range is designed to keep track of the column and line numbers.
 I think the description of the functionality required of the 
 tokenSeparatingFunction that the user must supply needs to be 
 much better. If I understand correctly, it is intended to 
 differentiate between keywords and identifiers which begin with 
 a keyword. The more I think about this the less certain I am.

That function's purpose is to determine if the current code unit signifies the end of an identifier/keyword. When lexing "fortunate", the lexer would spot "for", and then call the separating function with the index of "u". In the case of D, it would say that this does NOT end a word, and "fortunate" should be lexed as an identifier. If it was called with an index of a space or parenthesis, it should return true.
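A sketch of what such a separating function might look like for a D-style language. (The name isSeparating and the exact signature are illustrative; the proposed lexer passes the user-supplied function an index into its ubyte[] buffer.)

```d
// Answers: does the code unit at `offset` end the keyword that was
// just matched, or does the word continue past it?
bool isSeparating(const(ubyte)[] input, size_t offset)
{
    if (offset >= input.length)
        return true; // end of input ends the word
    immutable c = input[offset];
    // Identifier characters continue the word, so after matching
    // "for", the 'u' in "fortunate" means this is NOT the keyword
    // `for` and the whole word is lexed as an identifier instead.
    immutable isWordChar = (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9')
        || c == '_';
    return !isWordChar;
}
```

A space or parenthesis at `offset` falls through to `return true`, matching the behaviour described above.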
 When the lexer dispatches to a token handler, the front of the 
 range is left pointing to the beginning of the character 
 sequence that was matched, allowing it to be included in the 
 returned token. However, many of the handlers in the example 
 Dlang lexer begin with a number of blind popFront calls to jump 
 to the end of that match. I am aware that in well-meaning code 
 this is a case of the range being logically !empty, but I also 
 wonder how often it might get overlooked when 2 matches of 
 different lengths are dispatched to the same handler. (I had a 
 similar situation in my own code, and my solution was to have a 
 variable storing the .save of my inputRange and count how many 
 chars were consumed since it was updated. This way I could 
 either return the whole match or part of it in the token or 
 discard it and include only what came after it.) As there has 
 been some contention about the correct use of the range 
 primitives of late, I will refrain from making any other 
 comment on their use in this module, especially as I am no 
 longer sure that I have been using them correctly myself.

If more than one prefix is dispatched to the same handler, the handler cannot blindly call popFront or popFrontN. If the lexer author knows that the handler will only be called in certain places, then checking !empty is a waste of cycles. I don't think it's possible for the lexer generator to enforce this.
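To illustrate the trade-off being discussed (the handler body, `range`, and `tok!` usage here are hypothetical, modelled loosely on the example D lexer): a handler registered for exactly one prefix can skip it blindly, because dispatch already proved those bytes are present.

```d
// Handler registered only for the two-byte prefix "//". Since the
// lexer dispatched here, both bytes are known to exist, so checking
// !empty before skipping them would waste cycles.
Token lexLineComment()
{
    range.popFrontN(2); // safe: "//" was already matched by dispatch
    while (!range.empty && range.front != '\n')
        range.popFront();
    return Token(tok!"comment");
}
```

The hazard arises only when the same handler is registered for prefixes of different lengths; then the fixed popFrontN(2) would be wrong for one of them, and nothing in the generator catches it.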
 In the short time that I have been looking at the features of 
 this lexer I have not been able to figure out a way of writing 
 a standards compliant XML parser without having to lex some 
 parts of the document at least twice, or subverting the token 
 handlers to change behaviour according to context. Several non 
 compliant single pass XML lexers would be possible, but they 
 would not be able to process documents that used some 
 (admittedly obscure and often overlooked) features. The only 
 scalable technique that I can think of to allow XML to be lexed 
 in a single pass in a fully spec compliant way would be to 
 allow handlers to return multiple tokens. I am not sure how 
 feasible this would be or what mechanism would be best to 
 implement it.

XML doesn't seem to have very distinct lexing/parsing phases like JSON markup or Java code does. This lexer generator may not be the best way to deal with XML.
 On the whole I think the the overall design of the module shows 
 promise but requires polish to make it both more idiomatically 
 Dlang-y and easier for the user to build upon (both by 
 documentation and interface).

If you have specific problems, please list them.
 On a side note related to the example lexer for Dlang, I believe 
 the predicate function isDocComment will produce false 
 positives for the following comment delimiters which to my 
 knowledge are not valid DDoc delimiters...

 //* //+ /*+ /*/ /+* /+/

https://github.com/Hackerpilot/Dscanner/issues/163
 As the Dlang lexer is not part of the review proper I have not 
 inspected it carefully, this function just happens to be the 
 first one declared in that example.

If you do find issues with the lexer or parser, please file them here: https://github.com/Hackerpilot/Dscanner/issues
 Again, my apologies for the tardiness of this review.

 A...

Apr 14 2014