www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Let's stop parser Hell

reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
There are more and more projects requiring parsing D code (IDE plugins, 
DCT and dfmt now).

We definitely need a _standard_ fast D parser (suitable as tokenizer). 
We already have a good one at least in Visual D. Maybe dmd's parser is 
faster. If so, it can be (easily?) rewritten in D. We also already have 
some other parsers (in C# from Mono-D etc.).

What about to get one and call it standard?

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Jul 05 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 12:11:33 UTC, Denis Shelomovskij 
wrote:
 There are more and more projects requiring parsing D code (IDE 
 plugins, DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as 
 tokenizer). We already have a good one at least in Visual D. 
 Maybe dmd's parser is faster. If so, it can be (easily?) 
 rewritten in D. We also already have some other parsers (in C# 
 from Mono-D etc.).

 What about to get one and call it standard?

Visual-D is not Boost-licensed (I think this would be possible to change) Mono-D is written in C#, as you mentioned Pegged may eventually become standard, if it will be performance optimized and a bit more customizable Dscanner(https://github.com/Hackerpilot/Dscanner) from Brian Schott is pretty good, too. SDC is another nice option DIL (http://code.google.com/p/dil/) is very nice but GPL I plan to try using Pegged inside my DCT project. Probably that will require huge modifications though... Some more links from Pegged readme:
 Hisayuki Mima's CTPG(https://github.com/youkei/ctpg), very 
 similar, also done in D. Have a look!
 Nick Sabalausky's Goldie 
 (http://www.dsource.org/projects/goldie).
 Benjamin Shropshire's dparser 
 (http://dsource.org/projects/scrapple/browser/trunk/dparser).

 Martin Nowak put these gists on the D newsgroup:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

Jul 05 2012
next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
05.07.2012 16:30, Roman D. Boiko пишет:
 Forgot to add DDMD, which also has been forked and redesigned recently
 by someone (thus two more different compiler frontends).

 But overall I doubt that any project could become standard very soon.

Why? Even were they all almost equal we can select any one. The point is to select something (anything) to guide a coder who want to do something with this task to a good/up-to-date/supported parser. -- Денис В. Шеломовский Denis V. Shelomovskij
Jul 05 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-05 15:08, Roman D. Boiko wrote:

 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times. -- /Jacob Carlborg
Jul 05 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-05 18:04, Roman D. Boiko wrote:

 Well, we did something like that for DCT... but I doubt that it would
 fit general needs.

Why wouldn't it.
 If we had, why haven't they been analyzed, classified, discussed, etc.?
 Or have they?

I don't know. Here is what I wrote for DCT: * IDE integration * Refactoring tool * Static analysis * Compiler * Doc generating * Build tool * DI generating In general, use cases that can span several compile phases, i.e. lexing, parsing, semantic analysis and so on. Some of these use cases can be broken in to several new use cases at a lower level. Some examples: IDE integration: * Syntax highlighting * Code completion * Showing lex, syntax and semantic errors * Formatter Refactoring: * Cross-referencing symbols * Renaming of symbols * Extract local variable to instance variable * Extract variable to function/method * Extract a piece of code/method into a new class Build tool: * Tracking module dependencies Doc generating: * Associate a declaration and its documentation Original post: http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141 -- /Jacob Carlborg
Jul 05 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.

Aren't you voting on your own project :) -- /Jacob Carlborg
Jul 05 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 2:16 PM, Philippe Sigaud wrote:
 As much as I'm flattered by that, my current impression is Pegged is
 very far from being performant.

 As a proof-of-concept that, in D,  it's possible to parse a string and
 create a parse tree at compile-time and then generate code from this,
 it's also successful. Go D!

 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I
 learn by coding. Hey, I just received the Dragon Book (International
 Edition), I'm sure I'll learn many things in there.

I'll be glad to buy for you any book you might feel you need for this. Again, there are few things more important for D right now than exploiting its unmatched-by-competition features to great ends. I don't want the lack of educational material to hold you down. Please continue working on this and let me know of what you need.
 So, if anyone is willing to change the code generated by Pegged, I'm
 game. The results you showed me on keyword parsing are very interesting!

 But, my impression is that the need for a 'D'-only parser and lexer is
 far greater and much more imediate that the need for a parser generator.

I very strongly disagree. This is our chance to get things right instead of having a limited product that destroys competition (much like lex and yacc have been for years in the parser generator field).
 All the reasons advanced upthread ask for a D parser, not a generic
 generator. Parser generators are for those of us interested in having
 DSLs or macros in D.
 So Pegged or any other generator should *not* get the community focus
 right now.

Pegged should be the focus.
 My plan would be as follow:

 1- assemble a group of people knowing parsing. I don't think I'm exactly
 knowledgeable, but I'm ready to be part of such a group.
 2- create a github process.
 3- translate an existing parser / adapt a D parser for Phobos. I'm ready
 to be part of this (I'm sure I'll learn in the process)
 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
 5- submit it to Phobos and have it adopted.

Sounds good. Replace 1-2 years with 1-2 months. Andrei
Jul 05 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Jul-12 00:16, Roman D. Boiko wrote:
 isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / lookup].

comparison.
 isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].

 isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [ns /
 lookup].

character. Clearly a winner, but I think it can be improved further.

switch that takes a lot of pages (~72Kbytes). It's easily be perfect. Sorry, couldn't resist ;) And I'm not sure how much it could be optimized maybe some measly 10-30%.
 Total: 104183 identifiers + 17488 keywords.

Results are consistent for other files, too.

seen the same numbers :) -- Dmitry Olshansky
Jul 05 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 3:54 PM, Philippe Sigaud wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

Former sux latter rox. Andrei
Jul 05 2012
prev sibling next sibling parent Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
05.07.2012 20:14, Jacob Carlborg пишет:
 On 2012-07-05 18:04, Roman D. Boiko wrote:

 Well, we did something like that for DCT... but I doubt that it would
 fit general needs.

Why wouldn't it.
 If we had, why haven't they been analyzed, classified, discussed, etc.?
 Or have they?

I don't know. Here is what I wrote for DCT: * IDE integration * Refactoring tool * Static analysis * Compiler * Doc generating * Build tool * DI generating

Just want to mention I'm talking about parser only. Things like "Refactoring" have to work perfectly or are unusable so it require a full compiler at least as finished as current dmd. -- Денис В. Шеломовский Denis V. Shelomovskij
Jul 05 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 05/07/2012 18:32, Roman D. Boiko a écrit :
 On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
 Original post:

 http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

OK, fairly complete. Let it be the starting point. Why not put it on some wiki, then add some more, discuss, vote, etc.? Meanwhile create a matrix of features implemented in each of interesting standardization candidates. And pick up one of them as "standard" or "reference" parser. My vote would be for Pegged, I guess.

Why not program instead of doing bureaucracy ?
Jul 05 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 06/07/2012 00:21, Roman D. Boiko a écrit :
 On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
 Why not program instead of doing bureaucracy ?

To avoid programming things which are not needed or don't fit. I've thrown away several implementations already... time to reflect a little :) But, actually, for DCT I do know what I need to do. That suggestion was with respect to any candidate for becoming standard. Everybody understands that their way (likely differently than others).

Well you did the mistake to introduce an over complex mechanism in log(n) to get back to source code when an obvious one in O(1) exists. Let me doubt.
Jul 05 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?

It's been discussed before, but there are some obvious use cases such as syntax highlighting, code formatting, and manipulation of D source files (e.g. to strip out the unit tests). We need to have a lexer in Phobos which parsers D code and generates a range of tokens, and we need to have a parser in Phobos which operates on a range of tokens. As discussed previously, there is desire to have the lexer and parser ported from dmd's frontend to D for Phobos so that what we get is easily maintainable alongside dmd's frontend and produces the same results (and is fast). It's _also_ desirable that we get a lexer/parser generator into Phobos for generating lexers/parsers for whatever language you might want to generate them for. Pegged is a good example of what can be done, and I think that Andrei was trying to get Philippe to prepare a submission to Phobos from it (though I'm not sure), but regarldess of whether pegged (or anything based on it) makes it into Phobos, we definitely want something similar. So, we have a fair idea of some of the stuff that we want, but it's a question of time and effort. I keep intending to port dmd's lexer to D for submission to Phobos, but I've been too busy to do it. At least a couple of other people have also expressed interest in doing it, but no one has submitted anything for Phobos. So, it remains undone, and anything which would need to lex or parse D code has to find its own solution. As with most everything around here, it's a question of people being having the time and being willing to put in the effort to do it. It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until its done and in Phobos. - Jonathan M Davis
Jul 05 2012
next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
05.07.2012 20:28, Jonathan M Davis пишет:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because someone really
 wants it done, takes the time to do it, and sees it through until its done and
 in Phobos.

I didn't want for someone to code anything at all! I on the contrary want them to stop writing parsers because it results in the only consequence: one have to spend more time to find a better parser. -- Денис В. Шеломовский Denis V. Shelomovskij
Jul 05 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jul-12 22:23, Denis Shelomovskij wrote:
 05.07.2012 20:28, Jonathan M Davis пишет:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really
 wants it done, takes the time to do it, and sees it through until its
 done and
 in Phobos.

I didn't want for someone to code anything at all! I on the contrary want them to stop writing parsers because it results in the only consequence: one have to spend more time to find a better parser.

It doesn't work like that in OpenSource. No matter what you do people keep writing code ;) -- Dmitry Olshansky
Jul 05 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 12:39 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really wants it done, takes the time to do it, and sees it
 through until its done and in Phobos.

 - Jonathan M Davis

Resume: everybody is welcome to join effort of translating DMD front end, and improving Pegged. Also I would like to invite those interested in DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?

I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important. I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax. Andrei
Jul 05 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
 On 7/5/12 12:39 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really wants it done, takes the time to do it, and sees it
 through until its done and in Phobos.

 - Jonathan M Davis

Resume: everybody is welcome to join effort of translating DMD front end, and improving Pegged. Also I would like to invite those interested in DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?

I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.

Count me as interested. CTFE needs more correctness & speed though. So to put it blantly - no it's not possible right NOW. BUT it doesn't prevent us from planing and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.
 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

Well put. It shouldn't stop people from doing parsers, IMO the more the merrier. -- Dmitry Olshansky
Jul 05 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 2:33 PM, Dmitry Olshansky wrote:
 On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
 I'd really want to create a task force on this, it is of strategic
 importance to D. In Walter's own words, no new feature is going to push
 us forward since we're not really using the great goodies we've got, and
 CTFE technology is the most important.

Count me as interested. CTFE needs more correctness & speed though. So to put it blantly - no it's not possible right NOW. BUT it doesn't prevent us from planing and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.

Excellent point. Walter is 100% behind the general strategy and we can bring him to work on specific bug reports and enhancement requests if we make a case they are important for Pegged.
 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

Well put. It shouldn't stop people from doing parsers, IMO the more the merrier.

Exactly. Andrei
Jul 05 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 2:38 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
 Count me as interested.
 CTFE needs more correctness & speed though. So to put it blantly - no
 it's not possible right NOW.
 BUT it doesn't prevent us from planing and doing a proof of concept.
 Pegged seems a good starting point even if we end up re-writing it
 from scratch.

IMO, first priority is to make code generated by Pegged work fast, and second priority - make code generation itself fast.

Well I'd say there are things like supporting large, realistic grammars without blowing up or taking long compilation times is the first priority. Andrei
Jul 05 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-07-05 20:22, Andrei Alexandrescu wrote:
 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

I don't see why you are against this. I'm seeing this as basically the only change for DMD every being written in D. Sure I would much rather have a completely new compiler written in D, with a module approach and designed from scratch to be used as a library, but I'm trying to realistic here. -- /Jacob Carlborg
Jul 05 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 13:06, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I have
 an interest in parsers and familiarity with LL, LALR, PEGs, and even
 Pratt parsers (fun!), but I am still inexperienced.

 One thing that has always concerned me about PEGs is that they always
 say PEGs are slower than traditional two-phase LALR(1) or LL(k)
 parsers. However, I have never seen any benchmarks. Does anyone know
 exactly how much performance you lose in an (optimized) PEG compared
 to an (optimized) LALR/LL parser + LL/regex lexer?

I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.

I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway. -- Dmitry Olshansky
Jul 07 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.

Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem. -- Dmitry Olshansky
Jul 07 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
 On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.

Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative. How many semantics hacks does D's syntax need for a LL(*) parser? Andrei
Jul 07 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 22:25, Andrei Alexandrescu wrote:
 On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
 On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.

Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative. How many semantics hacks does D's syntax need for a LL(*) parser?

I believe that it may need very few _semantic_ predicates. But there are cases where infinite lookahead is a must. Can't recall which cases offhand.
 Andrei

-- Dmitry Olshansky
Jul 07 2012
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken. -- Dmitry Olshansky
Jul 08 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 14:48, Tobias Pankrath wrote:
 On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

Since GLR can parse any CFG the upper bound is in O(n^3), but the actual performance seems to depend on the grammar. From Elkhound [1]
 It should be competitive with Bison for LALR (fragements of) grammars,
 and degrade gracefully from there. On the scale of grammar
 nondeterminism, from none
 (LALR) to some to lots, "some" is the niche Elkhound is going after.
 These goals are driven by Elkhound's primary application, the Elsa C++
 Parser. In essence, Elkhound came about because I wanted to apply
 automatic parsing technology to parsing C++, but found exiting systems
 inadequate.

So it seems to me that it is not worse than PEG if you have a grammar with reasonable many ambiguities. Performance is one thing, but you should be able to express your language in the underlying formalism without too many hurdles.

The problem is that say LL(*) can be easily tweaked in place to run various semantic actions as it parse the source. I think GLR is harder to make do anything other then "parse & say it's fine". -- Dmitry Olshansky
Jul 08 2012
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 07/07/2012 12:05, Dmitry Olshansky wrote:
 On 07-Jul-12 13:06, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I have
 an interest in parsers and familiarity with LL, LALR, PEGs, and even
 Pratt parsers (fun!), but I am still inexperienced.

 One thing that has always concerned me about PEGs is that they always
 say PEGs are slower than traditional two-phase LALR(1) or LL(k)
 parsers. However, I have never seen any benchmarks. Does anyone know
 exactly how much performance you lose in an (optimized) PEG compared
 to an (optimized) LALR/LL parser + LL/regex lexer?

I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.

I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway.

D isn't 100% CFG. But it is close.
Jul 07 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
deadalnix , dans le message (digitalmars.D:171330), a crit:
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?
Jul 09 2012
parent reply deadalnix <deadalnix gmail.com> writes:
On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a crit :
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?

type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= again
Jul 10 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a crit :
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?

type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= again

'something' is context-free: something ::= type | expression.
Jul 10 2012
next sibling parent reply travert phare.normalesup.org (Christophe Travert) writes:
Timon Gehr , dans le message (digitalmars.D:171814), a crit:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a crit :
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?

type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= again

'something' is context-free: something ::= type | expression.

Do you have to know if something is a type or an expression for a simple parsing? The langage would better not require this, otherwise simple parsing is not possible without looking at all forward references and imported files.
Jul 11 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 07/11/2012 10:23 AM, Christophe Travert wrote:
 Timon Gehr , dans le message (digitalmars.D:171814), a crit :
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a crit :
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?

type[something]<= something can be a type or an expression. typeid(somethning)<= same here identifier!(something)<= again

'something' is context-free: something ::= type | expression.

Do you have to know if something is a type or an expression for a simple parsing?

No. Some token sequences can be both a type and an expression based on the context (the CFG is ambiguous), but that is the analysers business. Parsing D code does not require any kind of analysis.
 The langage would better not require this, otherwise simple
 parsing is not possible without looking at all forward references and
 imported files.

Jul 11 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 07/11/2012 11:55 PM, David Piepgrass wrote:
 On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a écrit :
 D isn't 100% CFG. But it is close.


typeid(somethning) <= same here identifier!(something) <= again

'something' is context-free: something ::= type | expression.

I don't see how "type | expression" is context free. The input "Foo" could be a type or expression, you can't tell which without looking at the context.

'Context free' means that all the production rule left hand sides do not have a context, i.e. they consist of a single non-terminal. What you are describing is the fact that the grammar is ambiguous. The parser can just assume whatever works and the analyzer takes care of the interpretation of the resulting AST. The parser does not have to give a meaning to the code. It just turns it into a representation that is easy to work with.
Jul 11 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/23/12 9:22 AM, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath<tobias pankrath.net>  wrote:

 If you are interested in parsing, than I wouldn't recommend the dragon book,
 but this one

 It really is an awesome book.

This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool! I'm playing with fully-backtracking recursive-descent parsers right now, to deal with ambiguous grammars and the small parser works beautifully. Hey, I just saw he gives the code on his website, nice! Now, on to LR. Nick, here I come. I'll at last know how Goldie works.

When will we see the fruit of that all in the form of D libraries? Andrei
Jul 23 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 6:05 AM, Dmitry Olshansky wrote:
 I can tell you that they are not slower then one another in principle.
 Quality of implementations trumps every theoretical aspect here. LALR is
 usually fast even if implemented by book but they are hard to optimize
 futher and quite restrictive on "semantic extensions".

Isn't it the case that PEG require more memory for certain grammars? Andrei
Jul 07 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-07 03:12, Jonathan M Davis wrote:

 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if we write a
 new lexer and parser specifically for Phobos which _doesn't_ port the lexer or
 parser from dmd, then that _would_ help drive making the spec match the
 compiler (or vice versa). So, I agree that could be a definite argument for
 writing a lexer and parser from scratch rather than porting the one from dmd,
 but I don't buy the bit about it smothering parser generators at all. I think
 that the use cases are completely different.

I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos. -- /Jacob Carlborg
Jul 07 2012
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 07/07/2012 13:05, Jacob Carlborg wrote:
 On 2012-07-07 03:12, Jonathan M Davis wrote:

 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would
 keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if
 we write a
 new lexer and parser specifically for Phobos which _doesn't_ port the
 lexer or
 parser from dmd, then that _would_ help drive making the spec match the
 compiler (or vice versa). So, I agree that could be a definite
 argument for
 writing a lexer and parser from scratch rather than porting the one
 from dmd,
 but I don't buy the bit about it smothering parser generators at all.
 I think
 that the use cases are completely different.

I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed.

I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.
Jul 07 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-07-07 18:49, deadalnix wrote:

 I tried that. This is almost impossible, dmd's parser and AST are very
 tightly mixed with dmd's internals.

I suspected that. -- /Jacob Carlborg
Jul 08 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-07 19:53, Jonathan M Davis wrote:

 There are multiple issues here. The one that Andrei is raising is the fact
 that D isn't properly formalized. Essentially, the compiler _is_ the spec, and
 while the online spec _mostly_ follows it, it doesn't entirely, and the online
 spec isn't always as precise as it needs to be regardless. With a fully
 formalized spec, it should be possible to fully implement a D compiler from
 the spec alone, and I don't think that that's currently possible.

That's been a problem for a long time.
 Writing a D lexer and parser (if not a full-blown compiler) from scratch would
 help highlight the places in the spec which are lacking, and having it be part
 of Phobos would arguably increase Walter's incentive to make sure that the
 spec is in line with the compiler (and vice versa) so that stuff _other_ than
 the compiler which is based on the spec would be able to match the compiler.

 Clang is in a _completely_ different situation. It's a C/C++ compiler, and both
 C and C++ already have official, formalized specs which Clang conforms to (or
is
 supposed to anyway). Clang has no control over the spec at all. It merely
 implements it. So, there is no issue of trying to keep other tools or
 compilers in line with Clang due to it being the language's spec like we
 effectively have with dmd. It may help the tools which use Clang to be fully in
 line with Clang and not have to worry about whether Clang implements the spec
 slightly differently, but in theory, if they all follow the spec correctly,
 that isn't in issue (though obviously in practice it can be).

That should be true for D as well when the spec is complete.
 In D's case, all of the major D compilers use the same frontend, which helps
 compatability but harms the spec, because there's less incentive to keep it
 precise and  up-to-date. So, from the perspective of the spec, implementing
 the D lexer and parser for Phobos separately from dmd would be of great
 benefit.

That might be true.
 IMHO, the reason that porting dmd's lexer and parser would be of great benefit
 is primarily maintenance. It makes it much easier to keep Phobos' lexer and
 parser in line with dmd, making discrepencies less likely, but it arguably
 reduces the incentive to improve the spec.

But then it sounds like the best solution would be not to have a lexer/parser based on DMD and instead making sure the spec is correct.
 The benefits of having a lexer and parser as a library (regardless of whether
 it's from scratch or a port from dmd) are primarly derived from the fact that
 it makes it much easier to create tools which use them. Such tools no longer
 have to write their own lexers or parsers.

Of course, that's the whole point.
 If the compiler uses the same library, it has the added benefit of making it so
 that any tool using the library will be in sync with the compiler, but if the
 spec is properly formalized and up-to-date, and the library is kep up-to-date
 with _it_, that shouldn't really be a problem. You still have the debate as to
 whether it's better to have a separate implementation based on the spec
 (thereby making it more likely that the spec is correct) or whether it's
 better to have the compiler share the implementation so that the library is
 guaranteed to match the compiler (though not necessarily the spec), but I
 think that that's a separate debate from whether we should have the lexer and
 parser as a library.

What keeps popping up in my head is the scenario where users are complaining about the frontend in Phobos not behaving as their compiler. If this is due to they are out of sync, bugs or not following the spec doesn't matter. I still thinks D is changing too much to have two separate implementations of the compiler and a library like this.
 In all honesty though, I would be surprised if you could convince Walter to
 switch dmd's frontend to Phobos' lexer and parser even once they've been
 implemented. So, while I agree that there are benefits in doing so, I'm not
 sure how much chance you have of ever getting any traction with that.

That is perhaps true.
 Another thing to consider is that that might make it so that gdc and ldc
 couldn't share the same frontend with dmd (assuming that they have to keep
 their frontends in C or C++ -  I don't know if they do) - but if so, that
 would increase the incentive for the spec to be correct if dmd ever started
 using a D frontend.

-- /Jacob Carlborg
Jul 08 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-08 22:05, Jonathan M Davis wrote:

 Well, for the lexer and parser, we're probably okay (or very close to it). As
 Daniel pointed out elsewhere in this thread, that part of the frontend doesn't
 change much (particularly the lexer). There's definitely still some churn, but
 it's nothing like it used to be.

It don't feel like that, didn't we get lambdas or UFCS in the last release? -- /Jacob Carlborg
Jul 08 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-07-09 08:39, Jonathan M Davis wrote:

 lambdas are a bit older than that IIRC, and I don't think that UFCS actually
 affects the lexer or parser. Yes, changes are still made, but they're
 increasingly rare, and most of the stuff being changed is on the semantic level
 (most of which is bug fixes). So, out of all of the parts of the compiler to
 duplicate, those are the least likely to have to have major changes made to
 them later (especially the lexer).

I'm pretty sure UFCS affects lexing or parsing. How else would this be legal: 4.foo(); -- /Jacob Carlborg
Jul 09 2012
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 07/09/2012 08:33 AM, Jacob Carlborg wrote:
 On 2012-07-08 22:05, Jonathan M Davis wrote:

 Well, for the lexer and parser, we're probably okay (or very close to
 it). As
 Daniel pointed out elsewhere in this thread, that part of the frontend
 doesn't
 change much (particularly the lexer). There's definitely still some
 churn, but
 it's nothing like it used to be.

It don't feel like that, didn't we get lambdas or UFCS in the last release?

Those changes are trivial to incorporate into a well written parser.
Jul 09 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-07-08 22:15, Roman D. Boiko wrote:

 Can you provide a *specific* example of performance optimization which a
 custom D compiler would have, but parser generator would be unlikely to
 catch up.

I'm not expert in this field but I've heard that it's harder to get good error reporting with a parser generator. -- /Jacob Carlborg
Jul 08 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

So far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.

Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.
 When I implemented recursive-descent parser by hand in one of early
 drafts of DCT, I strongly felt the need to generalize code in a way
 which in retrospect I would call PEG-like. The structure of my
 hand-written recursive-descent parser was a one-to-one mapping to an
 implemented subset of D specification, and I considered it problematic,
 because it was needed to duplicate the same structure in the resulting AST.

 PEG is basically a language that describes both, the implementation of
 parser, and the language syntax. It greatly reduces implicit code
 duplication.

 I think that generated code can be made almost as fast as a hand-written
 parser for a particular language (probably, a few percent slower).
 Especially if that language is similar to D (context-free, with
 fine-grained hierarchical grammar). Optimizations might require to
 forget about strictly following any theoretical algorithm, but that
 should be OK.

All that sounds really encouraging. I'm really looking forward to more work in that area. If you stumble upon bugs that block you, let us know and Walter agreed he'll boost their priority. Andrei
Jul 07 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
 On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

So far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.

Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.

I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.
 When I implemented recursive-descent parser by hand in one of early
 drafts of DCT, I strongly felt the need to generalize code in a way
 which in retrospect I would call PEG-like. The structure of my
 hand-written recursive-descent parser was a one-to-one mapping to an
 implemented subset of D specification, and I considered it problematic,
 because it was needed to duplicate the same structure in the resulting
 AST.

 PEG is basically a language that describes both, the implementation of
 parser, and the language syntax. It greatly reduces implicit code
 duplication.

 I think that generated code can be made almost as fast as a hand-written
 parser for a particular language (probably, a few percent slower).
 Especially if that language is similar to D (context-free, with
 fine-grained hierarchical grammar). Optimizations might require to
 forget about strictly following any theoretical algorithm, but that
 should be OK.

All that sounds really encouraging. I'm really looking forward to more work in that area. If you stumble upon bugs that block you, let us know and Walter agreed he'll boost their priority. Andrei

-- Dmitry Olshansky
Jul 07 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

Interesting. I'll have to ask for more details on why. Andrei
Jul 07 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 00:09, Andrei Alexandrescu wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

Interesting. I'll have to ask for more details on why.

I think I told that before but anyway the main point is: the reason lexer exists is performance. It's obvious that one can write grammar without ever using lexer. If the notation is good say regex or the one used in PEG (that is not quite the same it turns out) then token can be easily describe in grammar. Once we got lexer can give only 2 things: -manageability, it's jsut splitting grammar in 2 parts that could be maintained separately (in fact some "lexers" are not DFA nor regular) (sometimes it could go the opposite direction - it could be better to have one integrated grammar) - speed. The reason it could be faster is because lexer can use highly deterministic grammar. Now back to PEGs and packrats. What I see everywhere I look is essentially integration of a backtracking-NFA lexer, that is not fast nor is particularly convenient. The notation is the whole other matter. So my view on the matter: whether or not lexer is integrated is two-fold question: notation vs implementation. E.g. it may make regular lexer-like things a part of notation thus having rules like: Identifier -> Alpha (Alpha|Number|_)* Yet it may or may use the same engine for _all_ _rules_, in fact if implementation optimize things by ripping off regular parts of grammar to small DFA engines it would make some kind of lexer behind the scenes! Anyway highly hybrid parsers are all the rage now, and reasons are obvious - they runs fast and can be bend by some magic to quite convoluted grammars of modern languages ;) -- Dmitry Olshansky
Jul 07 2012
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 00:19, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

Interesting. I'll have to ask for more details on why. Andrei

+1. Personally I associate PEG with a parser that includes a __distributed__ lexer inside,

Well put yet it's still lexer. And most horribly it's not optimized to DFAs most of the time last time I've checked papers. which gives the advantage of having to
 check less alternatives at each step (that's similar to deterministic
 parsing). If lexer is separate, it seems to be more difficult to scan
 for only a subset of possible tokens.

thing many times over or ... spend a lot of RAM to remember not to redo it. I recall lexing takes even more then parsing itself. -- Dmitry Olshansky
Jul 07 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 00:50, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
 And given the backtracking nature of PEGs you'll do your distributed
 thing many times over or ... spend a lot of RAM to remember not to
 redo it. I recall lexing takes even more then parsing itself.

I think that your conclusions are about statistical evidences of PEG misuses and poor PEG parser implementations. My point was that there is nothing fundamentally worse in having lexer integrated with parser, but there are performance advantages of having to check less possible cases when the structural information is available (so that lexSmth could be called when Smth is expected, thus requiring less switch branches if switch is used).

You may misunderstood me as well, the point is still the same: there are 2 things - notation and implementation, the fact that lexer is integrated in notation like in PEGs is not related to the fact that PEGs in their classic definition never use term Token and do backtracking parsing essentially on character level.
 As for lexing multiple times, simply use a free list of terminals (aka
 tokens). I still assume that grammar is properly defined, so that there
 is only one way to split source into tokens.

Tokens.. there is no such term in use if we talk about 'pure' PEG. -- Dmitry Olshansky
Jul 07 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 01:24, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
 You may misunderstood me as well, the point is still the same:
 there are 2 things - notation and implementation, the fact that lexer
 is integrated in notation like in PEGs is not related to the fact that
 PEGs in their classic definition never use term Token and do
 backtracking parsing essentially on character level.

But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically the algorithm is to try option A, try option B, try option C... it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it. Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate. This could be avoided via disambiguation a-la LL(k) but that won't be true PEG anymore ;) So that's
 an implementation detail, or a specific algorithm.

Lexer separation
 seems to be independent of this.

 As for lexing multiple times, simply use a free list of terminals
 (aka tokens). I still assume that grammar is properly defined, so
 that there is only one way to split source into tokens.

Tokens.. there is no such term in use if we talk about 'pure' PEG.

Terminal symbols.

-- Dmitry Olshansky
Jul 07 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 01:49, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 01:24, Roman D. Boiko wrote:
 But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically the algorithm is to try option A, try option B, try option C...

higher priority than option B in expression A / B / C. But it doesn't forbid, for example, to analyze all in parallel (track state transitions) for each character consequently, as a proper DFA implementation would do, until some option succeeds and all previous fail.

That would not always work, but yes that's what we should do IMHO. It's not what packrats do however (when I say algorithm I mean specifically packrat). And PEGs are commonly associated with packrats. A greedy regex would also have to check all successor options (C
 in the exapmle above) to determine the longest one.

states* easily it's not so simple with general CFGs. *As in NFA states, I like to think in multiple NFA states vs single DFA state.
 it's obvious how backtracking comes in ... whan say A fails somewhere
 in the middle of it.

described above.

unlike in regular grammar. The problem is that state is not as simple as in regex for instance (even in regex that must capture some submatches DFA won't do BTW). Thus you can't merge seemingly identical states because they depend on (different) interpretation of previous part of input. That's why DFA in ANTLR is only used to do lookahead or lexing, it doesn't maintain path through states during parsing.
 Of course there is a fair amount of bookkeeping that reduces doing
 work all over again but it does ... allocate.

The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).

memory requirement for packrats?
 Tokens.. there is no such term in use if we talk about 'pure' PEG.

Terminal symbols.


terminal symbols, production rules, and a starting non-terminal. Terminals are not necessarily characters.

Yup. But I'm not talking definitions here I'm talking the notion of "integrated lexing" and lexing is certainly about characters or was it? -- Dmitry Olshansky
Jul 07 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On 07/07/2012 21:17, Dmitry Olshansky wrote:
 On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
 On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

So far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.

Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.

I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.

Many usages only need a lexer.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a écrit :
 D isn't 100% CFG. But it is close.


typeid(somethning) <= same here identifier!(something) <= again

'something' is context-free: something ::= type | expression.

I don't see how "type | expression" is context free. The input "Foo" could be a type or expression, you can't tell which without looking at the context.
Jul 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend the dragon book,
 but this one

 It really is an awesome book.

This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool! I'm playing with fully-backtracking recursive-descent parsers right now, to deal with ambiguous grammars and the small parser works beautifully. Hey, I just saw he gives the code on his website, nice! Now, on to LR. Nick, here I come. I'll at last know how Goldie works. Philippe
Jul 23 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath 
 <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend 
 the dragon book,
 but this one

 It really is an awesome book.

This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool!

Yes, I'm also reading it and find it amusingly high-quality.
Jul 23 2012
prev sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath 
 <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend 
 the dragon book,
 but this one

 It really is an awesome book.

This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool!

I'm glad that you like it.
Jul 23 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
Le 05/07/2012 18:28, Jonathan M Davis a écrit :
 On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?

It's been discussed before, but there are some obvious use cases such as syntax highlighting, code formatting, and manipulation of D source files (e.g. to strip out the unit tests). We need to have a lexer in Phobos which parsers D code and generates a range of tokens, and we need to have a parser in Phobos which operates on a range of tokens. As discussed previously, there is desire to have the lexer and parser ported from dmd's frontend to D for Phobos so that what we get is easily maintainable alongside dmd's frontend and produces the same results (and is fast).

You never looked at dmd frontend soure code don't you ? It is a horror museum, and if we want to have something in D, I certainly wouldn't copy that.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to 
 do it themselves, and in general, stuff around here gets done 
 because someone really wants it done, takes the time to do it, 
 and sees it through until its done and in Phobos.

 - Jonathan M Davis

Resume: everybody is welcome to join effort of translating DMD front end, and improving Pegged. Also I would like to invite those interested in DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
 Count me as interested.
 CTFE needs more correctness & speed though. So to put it 
 blantly - no it's not possible right NOW.
 BUT it doesn't prevent us from planing and doing a proof of 
 concept. Pegged seems a good starting point even if we end up 
 re-writing it from scratch.

IMO, first priority is to make code generated by Pegged work fast, and second priority - make code generation itself fast.
Jul 05 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 Resume: everybody is welcome to join effort of translating DMD 
 front end, and improving Pegged.

 Also I would like to invite those interested in DCT project to 
 help me with it. Right now I'm trying to understand whether it 
 is possible to incorporate Pegged inside without losing 
 anything critical (and I think it is very likely possible), 
 and how exactly to do that.

 Dmitry proposed to help improve Pegged (or some other 
 compiler's) speed.

 Anyone else?

I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.

Hi everybody! My name's David and I've been dreaming since around 1999 of making my own computer language, and never found the time for it. The first time I looked at D it was around 2004 or so, and it just looked like a "moderately better C++" which I forgot about, having more lofty ideas. When I found out about D2's metaprogramming facilities I instantly became much more interested, although I still wish to accomplish more than is possible ATM. I've been talking to my boss about reducing my working hours, mainly in order to have time to work on something related to D. My goal is to popularize a language that is efficient (as in runtime speed and size), expressive, safe, concise, readable, well-documented, easy-to-use, and good at finding errors in your code. In other words, I want a language that is literally all things to all people, a language that is effective for any task. I want to kill off this preconceived notion that most programmers seem to have, that fast code requires a language like C++ that is hard to use. The notion that Rapid Application Development is incompatible with an efficient executable is nonsense and I want to kill it :) To be honest I have some reservations about D, but of all the languages I know, D is currently closest to my ideal. This work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced. I also like writing documentation and articles, but I always find it hard to figure out how other people's code works well enough to document it. I'm having some trouble following this thread due to the acronyms: CTFE, DCT, AA. At least I managed to figure out that CTFE is Compile Time Function Execution. DCT and AA I already know as Discrete Cosine Transform and Anti-Aliasing, of course.... but what's it mean to you? One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer? Anyway, it's the weekend, during which I hope I can find a place to fit in with you guys.
Jul 06 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jul 07, 2012 at 02:45:35AM +0200, David Piepgrass wrote:
[...]
 I'm having some trouble following this thread due to the acronyms:
 CTFE, DCT, AA. At least I managed to figure out that CTFE is Compile
 Time Function Execution. DCT and AA I already know as Discrete Cosine
 Transform and Anti-Aliasing, of course.... but what's it mean to you?

Frankly, I don't know what DCT stands for (anyone?), but AA stands for Associative Array (aka hash, dictionary, etc.). [...]
 Anyway, it's the weekend, during which I hope I can find a place to
 fit in with you guys.

Welcome aboard!!! D still has some ways to go (there are still warts in some places), but I also have to admit that it's closest to my ideal language too. I'm so sick and fed up of the tedium of C and the convolutions of C++, that I just can't see myself doing any more C/C++ except for code maintenance. New projects I would do in D. (Personal projects, anyway... work projects, I don't get to choose.) Anyway, D still has some ways to go, meaning that more manpower is always more than welcome. T -- Elegant or ugly code as well as fine or rude sentences have something in common: they don't depend on the language. -- Luca De Vitis
Jul 06 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 01:47:21 UTC, H. S. Teoh wrote:
 Frankly, I don't know what DCT stands for (anyone?)

Compiler Tools. I've used this acronym in previous threads but didn't explain this time. Current state is very far from completion, it is in research phase. Link: https://github.com/roman-d-boiko/dct, but I'd suggest to contact me at rb d-coding.com if anyone is interested.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I 
 have an interest in parsers and familiarity with LL, LALR, 
 PEGs, and even Pratt parsers (fun!), but I am still 
 inexperienced.

 One thing that has always concerned me about PEGs is that they 
 always say PEGs are slower than traditional two-phase LALR(1) 
 or LL(k) parsers. However, I have never seen any benchmarks. 
 Does anyone know exactly how much performance you lose in an 
 (optimized) PEG compared to an (optimized) LALR/LL parser + 
 LL/regex lexer?

I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

So far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable. When I implemented recursive-descent parser by hand in one of early drafts of DCT, I strongly felt the need to generalize code in a way which in retrospect I would call PEG-like. The structure of my hand-written recursive-descent parser was a one-to-one mapping to an implemented subset of D specification, and I considered it problematic, because it was needed to duplicate the same structure in the resulting AST. PEG is basically a language that describes both, the implementation of parser, and the language syntax. It greatly reduces implicit code duplication. I think that generated code can be made almost as fast as a hand-written parser for a particular language (probably, a few percent slower). Especially if that language is similar to D (context-free, with fine-grained hierarchical grammar). Optimizations might require to forget about strictly following any theoretical algorithm, but that should be OK.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 10:05:30 UTC, Dmitry Olshansky wrote:
 I can tell you that they are not slower then one another in 
 principle. Quality of implementations trumps every theoretical 
 aspect here. LALR is usually fast even if implemented by book 
 but they are hard to optimize futher and quite restrictive on 
 "semantic extensions".

 Proper CFG parsers all are liner-time anyway.

Exactly, that is also my point. I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, 
 and there are no
 fundamental issues which would make it inherently slower than 
 LALR or a hand-written D-specific parser.

Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 11:33:18 UTC, Dmitry Olshansky wrote:
 Yup, LL(*) is my favorite so far. As for debugging and error 
 recovery they are always a problem.

It's interesting, that I also wanted to use DFA for non-terminals (similarly to LL(*)), but was considering their usage as pure performance optimization. Concerns raised in the article seem to be very reasonable, I will likely reconsider my previous plans... This forum is an endless source of high-quality feed-back for me.
Jul 07 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 I can tell you that they are not slower then one another in 
 principle. Quality of implementations trumps every theoretical 
 aspect here. LALR is usually fast even if implemented by book 
 but they are hard to optimize futher and quite restrictive on 
 "semantic extensions".

 Proper CFG parsers all are liner-time anyway.

To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs. However I do agree, that error handling and flexibility are more important than raw speed. I don't want to get a concrete syntax tree full of unneeded productions, like the ones you get when you encode precedence rules in your grammar.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:
 Proper CFG parsers all are liner-time anyway.

To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs.

non-context-free languages can be described with PEG.
Jul 07 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Saturday, 7 July 2012 at 15:39:52 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:
 Proper CFG parsers all are liner-time anyway.

To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs.

non-context-free languages can be described with PEG.

Interesting, I thought that PEG ⊂ CFG holds.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
 Interesting, I thought that PEG ⊂ CFG holds.

It contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.
Jul 07 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sat, Jul 7, 2012 at 6:25 PM, Roman D. Boiko <rb d-coding.com> wrote:
 On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
 Interesting, I thought that PEG =E2=8A=82 CFG holds.

See section 3.4 of http://bford.info/pub/lang/peg.pdf It contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.

Not that anyone is interested in such languages, but still :)
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu 
wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated 
 lexing is
 moot. It's more of liability then benefit. At very least it's 
 just
 implementation curiosity not advantage.

Interesting. I'll have to ask for more details on why. Andrei

+1. Personally I associate PEG with a parser that includes a __distributed__ lexer inside, which gives the advantage of having to check less alternatives at each step (that's similar to deterministic parsing). If lexer is separate, it seems to be more difficult to scan for only a subset of possible tokens.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
 And given the backtracking nature of PEGs you'll do your 
 distributed thing many times over or ... spend a lot of RAM to 
 remember not to redo it. I recall lexing takes even more then 
 parsing itself.

I think that your conclusions are about statistical evidences of PEG misuses and poor PEG parser implementations. My point was that there is nothing fundamentally worse in having lexer integrated with parser, but there are performance advantages of having to check less possible cases when the structural information is available (so that lexSmth could be called when Smth is expected, thus requiring less switch branches if switch is used). As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
 You may misunderstood me as well, the point is still the same:
 there are 2 things - notation and implementation, the fact that 
 lexer is integrated in notation like in PEGs is not related to 
 the fact that PEGs in their classic definition never use term 
 Token and do backtracking parsing essentially on character 
 level.

But PEG itself doesn't require backtracking parsing, does it? So that's an implementation detail, or a specific algorithm. Lexer separation seems to be independent of this.
 As for lexing multiple times, simply use a free list of 
 terminals (aka tokens). I still assume that grammar is 
 properly defined, so that there is only one way to split 
 source into tokens.

Tokens.. there is no such term in use if we talk about 'pure' PEG.

Terminal symbols.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote:
 So my view on the matter: whether or not lexer is integrated is 
 two-fold question: notation vs implementation.

 E.g. it may make regular lexer-like things a part of notation 
 thus having rules like:
 	Identifier  -> Alpha (Alpha|Number|_)*

 Yet it may or may use the same engine for _all_ _rules_, in 
 fact if implementation optimize things by ripping off regular 
 parts of grammar to small DFA engines it would make some kind 
 of lexer behind the scenes!

This is the implementation I have in mind as the only sane way to parse PEG efficiently. Plus specializing DFA to only check for those terminals which are allowed according to available structural information at the moment of their invocation.
 Anyway highly hybrid parsers are all the rage now, and reasons 
 are obvious - they runs fast and can be bend by some magic to 
 quite convoluted grammars of modern languages ;)

Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 01:24, Roman D. Boiko wrote:
 But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically the algorithm is to try option A, try option B, try option C...

A has higher priority than option B in expression A / B / C. But it doesn't forbid, for example, to analyze all in parallel (track state transitions) for each character consequently, as a proper DFA implementation would do, until some option succeeds and all previous fail. A greedy regex would also have to check all successor options (C in the exapmle above) to determine the longest one.
 it's obvious how backtracking comes in ... whan say A fails 
 somewhere in the middle of it.

as I described above.
 Of course there is a fair amount of bookkeeping that reduces 
 doing work all over again but it does ... allocate.

The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).
 Tokens.. there is no such term in use if we talk about 'pure' 
 PEG.

Terminal symbols.


symbols, terminal symbols, production rules, and a starting non-terminal. Terminals are not necessarily characters.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote:
 That would not always work, but yes that's what we should do 
 IMHO.

understanding. When that doesn't work in a particular case, we will need to find the problem and solve that.
 It's not what packrats do however (when I say algorithm I mean 
 specifically packrat). And PEGs are commonly associated with 
 packrats.

 it's obvious how backtracking comes in ... whan say A fails 
 somewhere in the middle of it.

parallel as I described above.

of cases unlike in regular grammar. The problem is that state is not as simple as in regex for instance (even in regex that must capture some submatches DFA won't do BTW). Thus you can't merge seemingly identical states because they depend on (different) interpretation of previous part of input.

simply doesn't match the problem and there's nothing we can do.
 That's why DFA in ANTLR is only used to do lookahead or lexing, 
 it doesn't maintain path through states during parsing.

DFA with additional states could be created for any particular situation of ambiguity, etc. This needs further research.
 Of course there is a fair amount of bookkeeping that reduces 
 doing
 work all over again but it does ... allocate.

The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).

an O(n) memory requirement for packrats?

finite. When you go to the next character, you don't need previous states any more.
 Yup. But I'm not talking definitions here I'm talking the 
 notion of "integrated lexing" and lexing is certainly about 
 characters or was it?

I thought integrated lexing was about doing it along with parsing and specifying its rules directly inside the grammar. From the former we get structural context to narrow-down available options, and from the latter we have a uniform grammar description.
Jul 07 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.
Jul 08 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

Since GLR can parse any CFG the upper bound is in O(n^3), but the actual performance seems to depend on the grammar. From Elkhound [1]
 It should be competitive with Bison for LALR (fragements of) 
 grammars, and degrade gracefully from there. On the scale of 
 grammar nondeterminism, from none
 (LALR) to some to lots, "some" is the niche Elkhound is going 
 after.
 These goals are driven by Elkhound's primary application, the 
 Elsa C++ Parser. In essence, Elkhound came about because I 
 wanted to apply automatic parsing technology to parsing C++, 
 but found exiting systems inadequate.

So it seems to me that it is not worse than PEG if you have a grammar with reasonable many ambiguities. Performance is one thing, but you should be able to express your language in the underlying formalism without too many hurdles.
Jul 08 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 From Elkhound [1]

http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html
Jul 08 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 The problem is that say LL(*) can be easily tweaked in place to 
 run various semantic actions as it parse the source. I think 
 GLR is harder to make do anything other then "parse & say it's 
 fine".

Did you look at Elkhound? According to the tutorial you can use actions in the same way as if you were using an (LA)LR parser. Dunno if it's sufficient, but with Elkhound an C++-Parser has been written. See also http://stackoverflow.com/questions/428892/what-parser-genera or-do-you-recommend where Ira Baxter (maybe you know him) makes a case for DMS, which uses GLR internally. Here is a google tech talk of Ira explaining the DMS http://www.youtube.com/watch?v=C-_dw9iEzhA . I'm not very experienced with parser generators. Just trying to put all options on the table.
Jul 08 2012
prev sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some disadvantages of GLR with respect to LL(*).
Jul 08 2012
prev sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/05/2012 08:28 AM, Roman D. Boiko wrote:
 Pegged may eventually become standard, if it will be performanceoptimized and
a bit more customizable

Interesting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability?
Jul 07 2012
next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:

 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the last
 rules was enought to get a slight speed improvement, buth that doing
 anymore sotrage was detrimental to the parser's overall efficiency.

That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.

Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.
Jul 07 2012
next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 01:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.

These were my thoughts exactly, although somewhat unsubstantiated in my case ;)
Jul 07 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 22:29, Chad J wrote:
 On 07/07/2012 01:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.

These were my thoughts exactly, although somewhat unsubstantiated in my case ;)

Yup the nice thing about ANTLR is the usage of DFA. e.g. it's used for disambiguating alternatives that LL(k) could never do. -- Dmitry Olshansky
Jul 07 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 20:56, Chad J wrote:
 On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:

 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the last
 rules was enought to get a slight speed improvement, buth that doing
 anymore sotrage was detrimental to the parser's overall efficiency.

That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.

Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.

Another idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions. -- Dmitry Olshansky
Jul 07 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Jul-12 23:35, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator precedence
 grammar to pegged. Of course it should fit naturally with traditional
 PEG, for instance taking responsibility for parsing expressions.

But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)

Yes, I've meant something completely different ;) http://en.wikipedia.org/wiki/Operator-precedence_grammar -- Dmitry Olshansky
Jul 07 2012
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 07/07/2012 09:35 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator precedence
 grammar to pegged. Of course it should fit naturally with traditional
 PEG, for instance taking responsibility for parsing expressions.

But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)

http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method
Jul 07 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 1:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.

Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the tokenizer? Andrei
Jul 07 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/7/12 4:15 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:
 Doesn't ANTLR use full-fledged character-level LL(*) parsing even in
 the tokenizer?

Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure).

This should help: http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers Andrei
Jul 07 2012
prev sibling next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 02:23 PM, David Piepgrass wrote:
 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I got disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.
 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging
 because of
 frequent backtracking, it has problems with error recovery, and
 typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of
 using
 Pegged. I will try to analyze whether the issues are so fundamental
 that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would
 fix these
 problems.


I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling. * (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.) I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something... I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.

I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should to be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops. My vision is to have code similar to this in the front-end: /+ Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) goto exitLoop statements; goto loopAgain exitLoop: +/ void lowerWhileStatement( SyntaxElement* syntaxNode ) { auto captures = syntaxNode.matchNodes( TOK_WHILE_NODE, OP_ENTER_NODE, OP_CAPTURE(0), OP_BEGIN, TOK_EXPRESSION, OP_END, OP_CAPTURE(1), OP_BEGIN, TOK_STATEMENT, OP_END, OP_LEAVE_NODE); if ( captures is null ) return; syntaxNode.replaceWith( LabelNode("loopAgain"), TOK_IF_STATEMENT, OP_INSERT, OP_BEGIN, TOK_NEGATE, OP_INSERT, OP_BEGIN, captures[0], // Expression OP_END, GotoStatement("exitLoop"), OP_END, captures[1], // statements GotoStatement("loopAgain"), LabelNode("exitLoop") ); } The specifics will easily change. One problem with the above code is that it could probably stand to use more templates and compile-time action to allow the front-end to merge patterns happening in the same pass to be merged together into one expression, thus preventing any unnecessary rescanning. It all becomes DFAs or DPDAs operating on syntax trees. In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST. This kind of architecture leads to other interesting benefits, like being able to assert which symbols a pattern is designed to handle or which symbols are allowed to exist in the AST at any point in time. Thus if you write a lowering that introduces nodes that a later pass can't handle, you'll know very quickly, at least in principle. I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. I really want a D compiler that can output ANSI C code that can be used with few or no OS/CPU dependencies. I would be willing to lose a lot of the nifty parallelism/concurrency stuff and deal with reference counting instead of full garbage collection, as long as it lets me EASILY target new systems (any phone, console platform, and some embedded microcontrollers). Then what I have is something that's as ubiquitous as C, but adds a lot of useful features like exception handling, dynamic arrays, templates, CTFE, etc etc. My ideas for how to deal with ASTs in pattern recognition and substitution followed from this. Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.
Jul 07 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 07/07/2012 08:55 PM, Chad J wrote:
 On 07/07/2012 02:23 PM, David Piepgrass wrote:
 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I got disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.
 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging
 because of
 frequent backtracking, it has problems with error recovery, and
 typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of
 using
 Pegged. I will try to analyze whether the issues are so fundamental
 that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would
 fix these
 problems.


I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling. * (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.) I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something... I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.

I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should to be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops. My vision is to have code similar to this in the front-end: /+ Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) goto exitLoop statements; goto loopAgain exitLoop: +/ void lowerWhileStatement( SyntaxElement* syntaxNode ) { auto captures = syntaxNode.matchNodes( TOK_WHILE_NODE, OP_ENTER_NODE, OP_CAPTURE(0), OP_BEGIN, TOK_EXPRESSION, OP_END, OP_CAPTURE(1), OP_BEGIN, TOK_STATEMENT, OP_END, OP_LEAVE_NODE); if ( captures is null ) return; syntaxNode.replaceWith( LabelNode("loopAgain"), TOK_IF_STATEMENT, OP_INSERT, OP_BEGIN, TOK_NEGATE, OP_INSERT, OP_BEGIN, captures[0], // Expression OP_END, GotoStatement("exitLoop"), OP_END, captures[1], // statements GotoStatement("loopAgain"), LabelNode("exitLoop") ); } The specifics will easily change.

I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);
Jul 07 2012
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 07/07/2012 10:26 PM, Timon Gehr wrote:
 On 07/07/2012 08:55 PM, Chad J wrote:
  ...

 The specifics will easily change.

I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);

Also, I'd like to point out that the transformation is not actually valid, because there are break/continue.
Jul 07 2012
prev sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 04:26 PM, Timon Gehr wrote:
 On 07/07/2012 08:55 PM, Chad J wrote:
 The specifics will easily change.

I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);

I wish. Can you make it happen?
Jul 07 2012
prev sibling next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:

 In this vision I do not use classes and inheritance for my AST.
 Instead I use structs that contain some kind of nodeType member that
 would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
 in the above code. Dynamic dispatch is instead performed by (very
 fast) DFAs recognizing parts of the AST.

Exactly. This idea first came to me in April after I implemented the first top-down recursive descent custom parser for a D subset. I tried Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).

From some of my earlier scribblings: enum : SyntaxElement { AST_EXPRESSION = 0x0001_0000_0000_0000, AST_UNARY_EXPR = 0x0000_0001_0000_0000 | AST_EXPRESSION, AST_NEGATE_EXPR = 0x0000_0000_0001_0000 | AST_UNARY_EXPR, AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 | AST_UNARY_EXPR, AST_POST_ADD_EXPR = 0x0000_0000_0003_0000 | AST_UNARY_EXPR, AST_POST_SUB_EXPR = 0x0000_0000_0004_0000 | AST_UNARY_EXPR, AST_PRE_ADD_EXPR = 0x0000_0000_0005_0000 | AST_UNARY_EXPR, AST_PRE_SUB_EXPR = 0x0000_0000_0006_0000 | AST_UNARY_EXPR, AST_BINARY_EXPR = 0x0000_0002_0000_0000 | AST_EXPRESSION, AST_AND_EXPR = 0x0000_0000_0001_0000 | AST_BINARY_EXPR, AST_OR_EXPR = 0x0000_0000_0002_0000 | AST_BINARY_EXPR, AST_XOR_EXPR = 0x0000_0000_0003_0000 | AST_BINARY_EXPR, AST_AND_AND_EXPR = 0x0000_0000_0004_0000 | AST_BINARY_EXPR, AST_OR_OR_EXPR = 0x0000_0000_0005_0000 | AST_BINARY_EXPR, AST_ADD_EXPR = 0x0000_0000_0006_0000 | AST_BINARY_EXPR, AST_TRINARY_EXPR = 0x0000_0003_0000_0000 | AST_EXPRESSION, AST_NARY_EXPR = 0x0000_0004_0000_0000 | AST_EXPRESSION, AST_STATEMENT = 0x0002_0000_0000_0000, } bool isA( SyntaxElement leafier, SyntaxElement rootier ) { SyntaxElement mask = 0; if ( rootier & 0x0000_0000_FFFF_FFFF ) { if ( rootier & 0x0000_0000_0000_FFFF ) mask = 0xFFFF_FFFF_FFFF_FFFF; else mask = 0xFFFF_FFFF_FFFF_0000; } else { if ( rootier & 0x0000_FFFF_FFFF_FFFF ) mask = 0xFFFF_FFFF_0000_0000; else mask = 0xFFFF_0000_0000_0000; } return (leafier & mask) == rootier; } unittest { assert( isA( AST_EXPRESSION, AST_EXPRESSION) ); assert( isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) ); assert( isA( AST_NEGATE_EXPR, AST_EXPRESSION) ); assert( isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) ); assert( !isA( AST_EXPRESSION, AST_STATEMENT) ); assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) ); assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) ); assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) ); assert(false); } This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table. I would expect the regex engine to call the isA function as one of it's operations. Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too.
Jul 07 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 06:18 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
 enum : SyntaxElement
 {
 AST_EXPRESSION = 0x0001_0000_0000_0000,
 AST_UNARY_EXPR = 0x0000_0001_0000_0000 |

This would cause wasting space (probably a lot). Definitely it would not be easy to implement in a parser generator, when various language properties are not known beforehand for fine-grained tuning.
 This approach of course has shameful nesting limitations, but I feel
 like these determinations could be fairly well optimized even for the
 general case. For example: another approach that I might be more
 inclined to take is to give each token/symbol a low-valued index into
 a small inheritance table.

Depending on implementation, that might introduce the multiplier overhead of table access per each comparison (and there would be many in case of searching for nodes of specific type).
 I would expect the regex engine to call the isA function as one of
 it's operations. Thus placing an AST_EXPRESSION into your expression
 would also match an AST_NEGATE_EXPR too.

But actually it is not so difficult to implement in a very similar way to what you described. I was thinking about a lookup table, but different from a traditional inheritance table. It would be indexed by AST node type (integral enum value), and store various classification information as bits. Maybe this is what you meant and I misunderstood you... Example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.

I see what you mean. I'm not sure that I buy that language properties are known before-hand. The front-end knows what the language grammar looks like and it knows what kind of things it can find in an AST. Thus you can make the regex DSL do this transformation and let the DFAs handle everything: expr -> (expr | unary_expr | negate_expr | binary_compliment | incr_expr | decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...) Because what we really want it to do is match any of those expression kinds when we place the expression symbol in our regex. I think the important point here though is that inheritance can be reduced to bit-twiddling optimizations in this case.
Jul 07 2012
parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 07:04 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
 I see what you mean.

When the memory hierarchy is deep, every level would require at least one bit position. Or even every base class would require a separate bit. (I think that the former + several bits to distinguish among hierarchies.) Otherwise it would not be easy to check by a bit-mask. Even if the above is incorrect (and that is likely since I didn't try to encode that compactly for the real grammar), I think that in general that information would only be possible to store in a fairly large integral. Especially if we try to generate parser from grammar, and thus can't do fine-tuning to pack the information tightly. This overhead would be paid per each AST node __instance__. But that is not necessary, since we could store information in a lookup table only once per node __type__.

Yep. Makes sense.
Jul 07 2012
prev sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 07/07/2012 04:26 PM, David Piepgrass wrote:
 auto captures = syntaxNode.matchNodes(
 TOK_WHILE_NODE,
 OP_ENTER_NODE,
 OP_CAPTURE(0),
 OP_BEGIN,
 TOK_EXPRESSION,
 OP_END,
 OP_CAPTURE(1),
 OP_BEGIN,
 TOK_STATEMENT,
 OP_END,
 OP_LEAVE_NODE);

I'm glad to hear you like the tree-parsing approach, Chad, although the particular syntax here looks pretty unfriendly :O -- does this represent something that you are working on right now?

Yes and yes. I didn't choose this because it because it's pretty. I chose it because: (1) It's easy to implement. (2) Both the implementation and syntax can be altered easily. I do not have time to write a language for the tree pattern recognition and substitution that is needed to do this in an aesthetically pleasing way. I've tried to sketch what it might look like before, and even then it is hard to make it nice, much less begin implementing the thing. I'd love to have such a language, but resource constraints exist. I also think that this approach would allow me to find out what my usage patterns look like before I commit to a more complicated architecture/tool. I really think the design of this regex/language/DSL thing should be dominated by its usage. This is a tricky chicken-and-egg thing because it's not currently used. The hacky syntax you see is the bootstrapping. Point (2) is important because, since we don't have existing usage patterns, this thing is going to change. It's going to change /a lot/. I want it to be easy to change. I think a complete DSL will be harder to change quickly. I also like how it doesn't require a lot of CTFE trickery or pushing DMD too far. D has really cool features, but I find that when I use things like CTFE aggressively then I lose productivity because I end up spending a lot of time finding compiler bugs. This leads to my current strategy: use the simpler features that work for sure, and only use the more advanced stuff when I really need to. I think my syntax fits this strategy and thus contributes to point (1). That said, it is good that even mostly-working CTFE exists and that a powerful template and metaprogramming system exists, because I don't think a compiler like this would be very practical to program otherwise. It would be doable in other languages, but could easily suffer from performance pessimizations due to being forced to compute everything at runtime. If anyone has an approach that shares the above strengths and looks nicer or is more powerful, I'd love to see it.
 5. It's risky 'cause I've never heard of anyone taking this approach
 before. Bring on the danger!

The danger is the fun part! <g>
 I wanted to make such a front-end so that I could easily make a C
 backend. I believe such a compiler would be able to do that with great
 ease.

 Needing to use D in places where it isn't available is a real
 pain-point for me right now, and I'll probably find ways to spend time
 on it eventually.

Yeah, with a tree-transforming parser, I imagine the same thing, except my current fetish is to convert a certain subset of D to multiple other languages automatically. Then I could write libraries that can easily be used by an astonishingly large audience. I certainly would like to see D targetting Android, but that's best done directly from D to ARM.

That does sound very cool. Possibly difficult though, due to having to cater to the lowest-common-denominator in all of your API designs. No templated functions or ranges in your API, that's for sure. I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing. As for D targeting Android, my intent is really to target X where X is any CPU/OS combo you can think of. I want to be able to get D, the language, not necessarily phobos or other niceties, to work on any platform, and to do so without much work. Cross-compiling to a new platform that has never been cross-compiled before should require zero coding. Perhaps it might be possible to have a text file with some key-value configuration that tells it certain common features are available on the target, thus allowing you to have more features with almost no effort involved. Still, I'll always take a crippled but existent D compiler that targets Android over a perfect but non-existent D compiler that targets Android. I think that the D-directly-to-ARM is the current approach for cross-compiling. I critique it for its underwhelming lack of results.
 Anyway, the devil's in the detail. Originally I wanted to do a parser
 generator and a "completely general AST" in C# and couldn't seem to work
 out the details, but D is more flexible and is likely better suited to
 the task.

I can easily see how this is the case. I don't think I'd be interested in doing a project like this in any other language. I imagined trying to do something like this in C or Java or C# and it just doesn't seem practical. For instance, I don't think the "use regular expressions to match AST structures" would work well in other cases because it would either (1) have a bunch of runtime overhead for compiling the expressions into DFAs at runtime or (2) suffer from integration problems if you try to compile the expressions in separate files before compiling the rest of the front-end.
Jul 07 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com>  wrote:


 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find the
 exact place of error (from parser's perspective, not user's) one only needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.
 It is also possible to rerun parsing with some additional heuristics after
 failing, thus enabling advanced error repair scenarios.

Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.

FWIW, this is what most HTML parsers are doing.
Jul 10 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-10 22:25, Jonathan M Davis wrote:

 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever fix.

I'm not sure but I think he was referring to a kind of error reporting technique used by compilers. Example: int foo () { int a = 3 // note the missing semicolon return a; } Instead of the parser going completely mad because of the missing semicolon. It will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error. -- /Jacob Carlborg
Jul 10 2012
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 07/10/2012 10:53 PM, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:
 On 2012-07-10 22:25, Jonathan M Davis wrote:
 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever
 fix.

technique used by compilers. Example: int foo () { int a = 3 // note the missing semicolon return a; } Instead of the parser going completely mad because of the missing semicolon. It will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error.

Well, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO.

This is actually precisely what many of the more recent curly-brace- and-semicolon languages have been doing with regard to semicolons.
Jul 10 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-07-10 22:53, Jonathan M Davis wrote:

 Well, giving an error, continuing to parse, and giving a partial result can be
 useful (and you give a prime example of that), but "fixing" the problem (e.g by
 inserting the semicolon) and not considering it to be an error would be a
 _huge_ mistake IMHO. And that's essentially what happens with HTML.

No, that is _not_ what happens with HTML. With HTML, the browser _do not_ output the error and continues as if it was valid could. As far as I know, up until HTML 5, the spec hasn't mentioned what should happen with invalid code. This is just a error handling strategy that is an implementation detail. It will not change what is and what isn't valid code. Are you preferring getting just the first error when compiling? Fix the error, compile again, get a new error and so on. -- /Jacob Carlborg
Jul 10 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-07-11 08:52, Jonathan M Davis wrote:

 ??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The
 browser doesn't spit out errors. It just guesses at what you really meant and
 displays that. It "fixes" the error for you, which is a horrible design IMHO.
 Obviously, we're stuck with it for HTML, but it should not be replicated with
 anything else.

 This is in contrast to your example of outputting an error and continuing to
 parse as best it can in order to provide more detail and more error messages
 but _not_ ultimately considering the parsing successful. _That_ is useful.
 HTML's behavior is not.

Ok, I see. It seems we're meaning the same thing. -- /Jacob Carlborg
Jul 11 2012
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 11-Jul-12 00:25, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:
 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find
 the
 exact place of error (from parser's perspective, not user's) one only
 needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.
 It is also possible to rerun parsing with some additional heuristics
 after
 failing, thus enabling advanced error repair scenarios.

Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.

FWIW, this is what most HTML parsers are doing.

Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.

BTW clang does this and even more of stuff on semantic level. It's known to won a legions of users because of that (well not only that but good diagnostic in general). -- Dmitry Olshansky
Jul 10 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
Forgot to add DDMD, which also has been forked and redesigned 
recently by someone (thus two more different compiler frontends).

But overall I doubt that any project could become standard very 
soon.
Jul 05 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jul-12 16:11, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).

Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;) -- Dmitry Olshansky
Jul 05 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, 
 fast and flexible that other stuff can be considered only as 
 toys.

 I might do one, but I'd rather just help other folks make it 
 faster ;)

Would you help to make Pegged faster?
Jul 05 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

Would you help to make Pegged faster?

Well why not. But first I'll need to deliver some stuff on my GSOC project. -- Dmitry Olshansky
Jul 05 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/5/12 9:05 AM, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

Would you help to make Pegged faster?

Well why not. But first I'll need to deliver some stuff on my GSOC project.

Good call :o). Note that we can discuss tweaking the scope of the project after the midterm. Ping me if you have some ideas. Andrei
Jul 05 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 05-Jul-12 18:29, Andrei Alexandrescu wrote:
 On 7/5/12 9:05 AM, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

Would you help to make Pegged faster?

Well why not. But first I'll need to deliver some stuff on my GSOC project.

Good call :o). Note that we can discuss tweaking the scope of the project after the midterm. Ping me if you have some ideas.

It's great that you are not opposed to adjusting scope of project. I have a ton of ideas, but let's discuss them after midterm. -- Dmitry Olshansky
Jul 05 2012
prev sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 13:05:41 UTC, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 Well why not.
 But first I'll need to deliver some stuff on my GSOC project.

I bet that after you finish with GSOC optimizing Pegged will not be less relevant than it is now :) And as for DCT, it will take another half a year (at least) until it will become usable for any real needs (except the most trivial).
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 12:53:02 UTC, Denis Shelomovskij 
wrote:
 05.07.2012 16:30, Roman D. Boiko пишет:
 Forgot to add DDMD, which also has been forked and redesigned 
 recently
 by someone (thus two more different compiler frontends).

 But overall I doubt that any project could become standard 
 very soon.

Why? Even were they all almost equal we can select any one. The point is to select something (anything) to guide a coder who want to do something with this task to a good/up-to-date/supported parser.

Well, I guess we'd like to select one and not change it later, don't we? I'm not sure we can decide which is the best currently and will stay the best in the future for all major use cases. Anyway I propose to enumerate major use cases first.
Jul 05 2012
prev sibling next sibling parent Caligo <iteronvexor gmail.com> writes:
Is the actual grammar available somewhere?  The online language
reference is all we got I guess? DMD doesn't seem to be using
yacc/bison either.

On Thu, Jul 5, 2012 at 7:11 AM, Denis Shelomovskij
<verylonglogin.reg gmail.com> wrote:
 There are more and more projects requiring parsing D code (IDE plugins, D=

 and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer). We
 already have a good one at least in Visual D. Maybe dmd's parser is faste=

 If so, it can be (easily?) rewritten in D. We also already have some othe=

 parsers (in C# from Mono-D etc.).

 What about to get one and call it standard?

 --
 =E4=C5=CE=C9=D3 =F7. =FB=C5=CC=CF=CD=CF=D7=D3=CB=C9=CA
 Denis V. Shelomovskij

Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 15:42:22 UTC, Caligo wrote:
 Is the actual grammar available somewhere?  The online language
 reference is all we got I guess? DMD doesn't seem to be using
 yacc/bison either.

https://github.com/roman-d-boiko/Pegged/blob/dct/pegged/examples/dgrammar.d I don't know about any others, though.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:

 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
 Original post:

 http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

OK, fairly complete. Let it be the starting point. Why not put it on some wiki, then add some more, discuss, vote, etc.? Meanwhile create a matrix of features implemented in each of interesting standardization candidates. And pick up one of them as "standard" or "reference" parser. My vote would be for Pegged, I guess.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 16:38:27 UTC, Jacob Carlborg wrote:
 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.

Aren't you voting on your own project :)

Well, I'm going to base parsing on Pegged, after tweaking it to my needs.
Jul 05 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--f46d042c6c4da7656c04c4192660
Content-Type: text/plain; charset=UTF-8

 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.



As much as I'm flattered by that, my current impression is Pegged is very far from being performant. As a proof-of-concept that, in D, it's possible to parse a string and create a parse tree at compile-time and then generate code from this, it's also successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition), I'm sure I'll learn many things in there. So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting! But, my impression is that the need for a 'D'-only parser and lexer is far greater and much more imediate that the need for a parser generator. All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now. My plan would be as follow: 1- assemble a group of people knowing parsing. I don't think I'm exactly knowledgeable, but I'm ready to be part of such a group. 2- create a github process. 3- translate an existing parser / adapt a D parser for Phobos. I'm ready to be part of this (I'm sure I'll learn in the process) 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding) 5- submit it to Phobos and have it adopted. much later: 6- see the way the parser code is organized and tweak a code generator (Pegged is a possibility if recursive parsing is OK) to produce an equivalent code when fed the D grammar. side-effect: maybe a std.tree or std.collection.tree to deal with trees in a generic way. Philippe --f46d042c6c4da7656c04c4192660 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <p><br> &gt;&gt; On 2012-07-05 18:32, Roman D. Boiko wrote:<br> &gt;&gt;<br> &gt;&gt;&gt; My vote would be for Pegged, I guess.<br></p> <p>As much as I&#39;m flattered by that, my current impression is Pegged is= very far from being performant. </p> <p>As a proof-of-concept that, in D,=C2=A0 it&#39;s possible to parse a str= ing and create a parse tree at compile-time and then generate code from thi= s, it&#39;s also successful. Go D!</p> <p>As a parser proper, Pegged is awful :-) Nothing I&#39;m ashamed of, as I= learn by coding. Hey, I just received the Dragon Book (International Editi= on), I&#39;m sure I&#39;ll learn many things in there.</p> <p>So, if anyone is willing to change the code generated by Pegged, I&#39;m= game. The results you showed me on keyword parsing are very interesting!</= p> <p>But, my impression is that the need for a &#39;D&#39;-only parser and le= xer is far greater and much more imediate that the need for a parser genera= tor. All the reasons advanced upthread ask for a D parser, not a generic ge= nerator. Parser generators are for those of us interested in having DSLs or= macros in D.<br> So Pegged or any other generator should *not* get the community focus right= now.</p> <p>My plan would be as follow:</p> <p>1- assemble a group of people knowing parsing. I don&#39;t think I&#39;m= exactly knowledgeable, but I&#39;m ready to be part of such a group. <br> 2- create a github process.<br> 3- translate an existing parser / adapt a D parser for Phobos. I&#39;m read= y to be part of this (I&#39;m sure I&#39;ll learn in the process)<br> 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)<br> 5- submit it to Phobos and have it adopted.</p> <p>much later:<br> 6- see the way the parser code is organized and tweak a code generator (Peg= ged is a possibility if recursive parsing is OK) to produce an equivalent c= ode when fed the D grammar.</p> <p>side-effect: maybe a std.tree or std.collection.tree to deal with trees = in a generic way.</p> <p>Philippe</p> --f46d042c6c4da7656c04c4192660--
Jul 05 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, July 05, 2012 22:23:00 Denis Shelomovskij wrote:
 05.07.2012 20:28, Jonathan M Davis =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do i=


 themselves, and in general, stuff around here gets done because som=


 really wants it done, takes the time to do it, and sees it through =


 its done and in Phobos.

I didn't want for someone to code anything at all! I on the contrary want them to stop writing parsers because it results in the only consequence: one have to spend more time to find a better parser.

Well, until a lexer and parser for D make it into Phobos, more people a= re=20 going to keep writing them, and even if/when they _do_ make it into Pho= bos,=20 people will keep writing them, because some people like to write that k= ind of=20 thing. Honestly though, if your complaint is that there's too much choice, I d= on't=20 have much sympathy for you. In general, we have too little D code out t= here,=20 not too much. If there's a problem with more parsers being written, I t= hink=20 that it's almost purely an issue of some people's time probably being b= etter=20 spent on other projects, but it's their right to work on whatever they = feel=20 like working on. - Jonathan M Davis
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.



As much as I'm flattered by that, my current impression is Pegged is very far from being performant. As a proof-of-concept that, in D, it's possible to parse a string and create a parse tree at compile-time and then generate code from this, it's also successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition), I'm sure I'll learn many things in there. So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting! But, my impression is that the need for a 'D'-only parser and lexer is far greater and much more imediate that the need for a parser generator. All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now.

I'm sure it can generate **much** faster code. I'm going to focus on its part that generates D parser (i.e., to make it significantly faster and able to efficiently parse-as-you-type). Actually, I'm sure it will be able to beat any other parser with respect to performance. :) 1. So my plan is the following: invite whoever would want to help. 2. Prove my claims above in practice. :-)))))
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 18:28:21 UTC, Andrei Alexandrescu 
wrote:
 On 7/5/12 2:16 PM, Philippe Sigaud wrote:
 So Pegged or any other generator should *not* get the 
 community focus right now.

Pegged should be the focus.

+10 (can I vote ten times?)
 My plan would be as follow:

 1- assemble a group of people knowing parsing. I don't think 
 I'm exactly
 knowledgeable, but I'm ready to be part of such a group.
 2- create a github process.
 3- translate an existing parser / adapt a D parser for Phobos. 
 I'm ready
 to be part of this (I'm sure I'll learn in the process)
 4- spend 1-2 years fighting over LR parsing and such :) (Just 
 kidding)
 5- submit it to Phobos and have it adopted.

Sounds good. Replace 1-2 years with 1-2 months.

Well, probably with 3-4 months... :)
Jul 05 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I'll be glad to buy for you any book you might feel you need for this.
 Again, there are few things more important for D right now than exploiting
 its unmatched-by-competition features to great ends. I don't want the lack
 of educational material to hold you down. Please continue working on this
 and let me know of what you need.

That's nice of you, if a bit extreme for a public mailing list :) Andrei, money is no problem :) Interest in the field of parsing is no problem. Interest in D future is no problem. Having a demanding job, and three children, is a problem. No, scratch that, you know what I mean. But hey, Roman is doing interesting things on keyword parsing right now, and changing the parser generated by Pegged is not difficult. We will see where this thread lead. (Roman, you should send your results here, because I'm still taken aback by the built-in AA speed compared to linear array look-up for 100 keywords). As Dmitry said, we might hit a CTFE wall: memory consumption, computation speed, ... (*channelling Andrei*: then we will correct whatever makes CTFE a problem. Right) Philippe (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
Jul 05 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, 
 as I learn
 by coding. Hey, I just received the Dragon Book (International 
 Edition),

If you are interested in parsing, than I wouldn't recommend the dragon book, but this one http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html It really is an awesome book.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 19:54:39 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I'll be glad to buy for you any book you might feel you need 
 for this.
 Again, there are few things more important for D right now 
 than exploiting
 its unmatched-by-competition features to great ends. I don't 
 want the lack
 of educational material to hold you down. Please continue 
 working on this
 and let me know of what you need.

That's nice of you, if a bit extreme for a public mailing list :) Andrei, money is no problem :) Interest in the field of parsing is no problem. Interest in D future is no problem. Having a demanding job, and three children, is a problem. No, scratch that, you know what I mean.

I have four, from 1 to 7 years old... Wouldn't call them a problem, though :)))
 But hey, Roman is doing interesting things on keyword parsing 
 right
 now, and changing the parser generated by Pegged is not 
 difficult. We
 will see where this thread lead. (Roman, you should send your 
 results
 here, because I'm still taken aback by the built-in AA speed 
 compared
 to linear array look-up for 100 keywords).

Well, I wouldn't call those "results" yet. Just some benchmarks. Here they are: isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / lookup]. isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / lookup]. isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 [nanosec / lookup]. isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec / lookup]. isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 [nanosec / lookup]. isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / lookup]. isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 [nanosec / lookup]. Total: 22949 identifiers + 5551 keywords. isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / lookup]. isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / lookup]. isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [nanosec / lookup]. isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec / lookup]. isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [nanosec / lookup]. isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / lookup]. isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [nanosec / lookup]. Total: 104183 identifiers + 17488 keywords.
 As Dmitry said, we might hit a CTFE wall: memory consumption,
 computation speed, ...
 (*channelling Andrei*: then we will correct whatever makes CTFE 
 a
 problem. Right)

 Philippe

 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
 isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / 
 lookup].

for comparison.
 isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].

lookup.
 isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 
 [ns / lookup].

by each character. Clearly a winner, but I think it can be improved further.
 isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [ns / 
 lookup].

 isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [ns / 
 lookup].

 isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [ns / lookup].

improved.
 isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [ns 
 / lookup].

not which exactly.
 Total: 104183 identifiers + 17488 keywords.

name.) Results are consistent for other files, too.
Jul 05 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrot=
e:
 On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I lea=


 by coding. Hey, I just received the Dragon Book (International Edition),

If you are interested in parsing, than I wouldn't recommend the dragon bo=

 but
 this one http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html

 It really is an awesome book.

Hey, good catch! I saw this one a few months ago and forgot about it. Having half a book being annotated bibliography (IIUC) scared me. Hmm 72 =E2=82=AC by Springer, 55 =E2=82=AC on Amazon. Something is not righ= t. Paperback vs perfect bound maybe? Is "Modern Compiler Design" by the same authors any good?
Jul 05 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Thu, Jul 5, 2012 at 10:02 PM, Roman D. Boiko <rb d-coding.com>
wrote (on children)
 I have four, from 1 to 7 years old... Wouldn't call them a problem, though
 :)))

Better not telling my wife. She's making noises about having a fourth.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 20:28:32 UTC, Philippe Sigaud wrote:
 Hmm 72 € by Springer, 55 € on Amazon. Something is not 
 right.
 Paperback vs perfect bound maybe?

http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf Not sure that it is legal, but definitely free.
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
 Why not program instead of doing bureaucracy ?

To avoid programming things which are not needed or don't fit. I've thrown away several implementations already... time to reflect a little :) But, actually, for DCT I do know what I need to do. That suggestion was with respect to any candidate for becoming standard. Everybody understands that their way (likely differently than others).
Jul 05 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 5 July 2012 at 22:32:29 UTC, deadalnix wrote:
 Well you did the mistake to introduce an over complex mechanism 
 in log(n) to get back to source code when an obvious one in 
 O(1) exists.

 Let me doubt.

I'm fine with that, let you doubt.
Jul 05 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, July 06, 2012 00:24:11 deadalnix wrote:
 Le 05/07/2012 18:28, Jonathan M Davis a =C3=A9crit :
 On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it would fit general needs. =20 If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?

It's been discussed before, but there are some obvious use cases su=


 syntax highlighting, code formatting, and manipulation of D source =


 (e.g. to strip out the unit tests).
=20
 We need to have a lexer in Phobos which parsers D code and generate=


 range of tokens, and we need to have a parser in Phobos which opera=


 a range of tokens. As discussed previously, there is desire to have=


 lexer and parser ported from dmd's frontend to D for Phobos so that=


 we get is easily maintainable alongside dmd's frontend and produces=


 same results (and is fast).

You never looked at dmd frontend soure code don't you ? It is a horro=

 museum, and if we want to have something in D, I certainly wouldn't c=

 that.

I have definitely looked at dmd's frontend's source code. The idea behi= nd=20 copying the lexer and parser from there is then that they'd match the=20= compiler, which would make it much easier to keep them in sync with cha= nges to=20 the compiler. Whether that's what someone would have written had they d= one it=20 purely in D initially is pretty much irrelevant. - Jonathan M Davis
Jul 05 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Thu, 5 Jul 2012 21:54:29 +0200
Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

FWIW, I've found this one to be pretty helpful: http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page It's the closest one I've seen to being more for programmers than mathematicians.
Jul 05 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky
<SeeWebsiteToContactMe semitwist.com> wrote:
 On Thu, 5 Jul 2012 21:54:29 +0200
 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

FWIW, I've found this one to be pretty helpful: http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page It's the closest one I've seen to being more for programmers than mathematicians.

Thanks Nick, I'll add it to my 'will buy one day' list (which grows quite rapidly) $122? Ouch.
Jul 06 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Fri, 6 Jul 2012 09:50:26 +0200
Philippe Sigaud <philippe.sigaud gmail.com> wrote:

 On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky
 <SeeWebsiteToContactMe semitwist.com> wrote:
 On Thu, 5 Jul 2012 21:54:29 +0200
 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

FWIW, I've found this one to be pretty helpful: http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page It's the closest one I've seen to being more for programmers than mathematicians.

Thanks Nick, I'll add it to my 'will buy one day' list (which grows quite rapidly) $122? Ouch.

Yea, it's from an academic publisher unfortunately, so $$$. I got ahold of it through the local library systems (All of Ohio's college libraries are connected in a system called OhioLINK, and books from it can be put on hold and shipped to most of the public libraries - or at least most of the public libraries around the Cleveland area.) Maybe there's something similar out your way?
Jul 06 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, July 05, 2012 14:22:35 Andrei Alexandrescu wrote:
 I also am actively opposed to a project of just translating D's 
 front-end to D and dropping it into Phobos because it would smother (a) 
 work on generic parser generators, and (b) strong, dependable 
 formalization of D's syntax.

I'm not even vaguely convinced that having a lexer/parser specifically for D in Phobos (regardless of whether it comes from dmd or not) will have a negative effect on work for making generic parser generators. People are going to want a good parser generator _regardless_ of what the situation with parsing D is. And I'd be very surprised if you couldn't make a hand-written parser for D which was better than one that you can generate. Parser generation is great, because it allows you to quickly and easily put together a parser from a grammar, but it's unlikely to be as fast as a hand-written one optimized for a particular language. However, as the recent discussion on popFront shows, only benchmarking of actual solutions would show that for sure. Now, the issue of a "strong, dependable formalization of D's syntax" is another thing entirely. Porting dmd's lexer and parser to Phobos would keep the Phobos implementation in line with dmd much more easily and avoid inconsistencies in the language definition and the like. However, if we write a new lexer and parser specifically for Phobos which _doesn't_ port the lexer or parser from dmd, then that _would_ help drive making the spec match the compiler (or vice versa). So, I agree that could be a definite argument for writing a lexer and parser from scratch rather than porting the one from dmd, but I don't buy the bit about it smothering parser generators at all. I think that the use cases are completely different. - Jonathan M Davis
Jul 06 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 15:42:30 UTC, Chad J wrote:
 On 07/05/2012 08:28 AM, Roman D. Boiko wrote:
 Pegged may eventually become standard, if it will be 
 performanceoptimized and a bit more customizable

Interesting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability?

There are many possible customization point which would make it a better fit for my project while also being useful in general. The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type). I also wanted to add ability to reuse parts of previously generated AST which correspond to source code that has not been changed, or to identical source code parsed previously. (This would allow incremental parsing, e.g., while user is typing.) The latter would require to track source code positions separately, or even generating them on demand (this is the feature actively criticized by deadalnix in some previous discussions but central to DCT architecture). AST would only hold information about widths of its nodes. I've also written some notes (10-15 suggestions) while studying Pegged code, which will be shared later. However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.
Jul 07 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sat, Jul 7, 2012 at 6:06 PM, Roman D. Boiko <rb d-coding.com> wrote:

 The most critical was the one you implemented: ability to define a custom
 parse tree. I also needed the ability to use immutable structures (almost)
 everywhere, while currently they must be mutable. Also it would be great to
 avoid UTF conversions (instead use functions and types templated by the
 string type).

I added dstrings because 1- at the time (a few months ago), the lists here were awash in UTF-32 discussions and I thought that'd be the way to go anyway 2- other D parsing libraries seemed to go to UTF32 also (CTPG) 3- I wanted to be able to parse mathematical notation like nabla, derivatives, etc. which all have UTF32 symbols.
 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging because of
 frequent backtracking, it has problems with error recovery, and typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of using
 Pegged. I will try to analyze whether the issues are so fundamental that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would fix these
 problems.

Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
 I added dstrings because

 1- at the time (a few months ago), the lists here were awash in 
 UTF-32
 discussions and I thought that'd be the way to go anyway
 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
 3- I wanted to be able to parse mathematical notation like 
 nabla,
 derivatives, etc. which all have UTF32 symbols.

I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.
 Note that PEG does not impose to use packrat parsing, even 
 though it was developed to use it. I think it's a historical 
 'accident' that put the two together: Bryan Ford thesis used 
 the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does 
 not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the 
 last rules was enought to get a slight speed improvement, buth 
 that doing anymore sotrage was detrimental to the parser's 
 overall efficiency.

That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote:
 I tried that. This is almost impossible, dmd's parser and AST 
 are very tightly mixed with dmd's internals.

+1
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced.  I had this 
 suspicion that the packrat parser is not necessarily the 
 best/fastest solution, mostly because of the large allocation 
 that has to happen before you get O(n) performance.  Thus I 
 figured that pegged might eventually use different parsing 
 strategies underneath it all, possibly with a lot of 
 special-casing and clever hand-tuned and profiled 
 optimizations.  At least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Jul 07 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, July 07, 2012 18:37:54 Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
 I added dstrings because
 
 1- at the time (a few months ago), the lists here were awash in
 UTF-32
 discussions and I thought that'd be the way to go anyway
 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
 3- I wanted to be able to parse mathematical notation like
 nabla,
 derivatives, etc. which all have UTF32 symbols.

I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.

I don't know about this particular case, because I haven't really looked at pegged, but in general, string parsing stuff should be taking ranges of dchar and then specializing on string type where appropriate for efficiency. - Jonathan M Davis
Jul 07 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, July 07, 2012 13:05:29 Jacob Carlborg wrote:
 On 2012-07-07 03:12, Jonathan M Davis wrote:
 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would
 keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if we
 write a new lexer and parser specifically for Phobos which _doesn't_ port
 the lexer or parser from dmd, then that _would_ help drive making the
 spec match the compiler (or vice versa). So, I agree that could be a
 definite argument for writing a lexer and parser from scratch rather than
 porting the one from dmd, but I don't buy the bit about it smothering
 parser generators at all. I think that the use cases are completely
 different.

I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos.

There are multiple issues here. The one that Andrei is raising is the fact that D isn't properly formalized. Essentially, the compiler _is_ the spec, and while the online spec _mostly_ follows it, it doesn't entirely, and the online spec isn't always as precise as it needs to be regardless. With a fully formalized spec, it should be possible to fully implement a D compiler from the spec alone, and I don't think that that's currently possible. Writing a D lexer and parser (if not a full-blown compiler) from scratch would help highlight the places in the spec which are lacking, and having it be part of Phobos would arguably increase Walter's incentive to make sure that the spec is in line with the compiler (and vice versa) so that stuff _other_ than the compiler which is based on the spec would be able to match the compiler. Clang is in a _completely_ different situation. It's a C/C++ compiler, and both C and C++ already have official, formalized specs which Clang conforms to (or is supposed to anyway). Clang has no control over the spec at all. It merely implements it. So, there is no issue of trying to keep other tools or compilers in line with Clang due to it being the language's spec like we effectively have with dmd. It may help the tools which use Clang to be fully in line with Clang and not have to worry about whether Clang implements the spec slightly differently, but in theory, if they all follow the spec correctly, that isn't in issue (though obviously in practice it can be). In D's case, all of the major D compilers use the same frontend, which helps compatability but harms the spec, because there's less incentive to keep it precise and up-to-date. So, from the perspective of the spec, implementing the D lexer and parser for Phobos separately from dmd would be of great benefit. IMHO, the reason that porting dmd's lexer and parser would be of great benefit is primarily maintenance. It makes it much easier to keep Phobos' lexer and parser in line with dmd, making discrepencies less likely, but it arguably reduces the incentive to improve the spec. The benefits of having a lexer and parser as a library (regardless of whether it's from scratch or a port from dmd) are primarly derived from the fact that it makes it much easier to create tools which use them. Such tools no longer have to write their own lexers or parsers. If the compiler uses the same library, it has the added benefit of making it so that any tool using the library will be in sync with the compiler, but if the spec is properly formalized and up-to-date, and the library is kep up-to-date with _it_, that shouldn't really be a problem. You still have the debate as to whether it's better to have a separate implementation based on the spec (thereby making it more likely that the spec is correct) or whether it's better to have the compiler share the implementation so that the library is guaranteed to match the compiler (though not necessarily the spec), but I think that that's a separate debate from whether we should have the lexer and parser as a library. In all honesty though, I would be surprised if you could convince Walter to switch dmd's frontend to Phobos' lexer and parser even once they've been implemented. So, while I agree that there are benefits in doing so, I'm not sure how much chance you have of ever getting any traction with that. Another thing to consider is that that might make it so that gdc and ldc couldn't share the same frontend with dmd (assuming that they have to keep their frontends in C or C++ - I don't know if they do) - but if so, that would increase the incentive for the spec to be correct if dmd ever started using a D frontend. - Jonathan M Davis
Jul 07 2012
parent "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.163.1341683651.31962.digitalmars-d puremagic.com...
 [snip]

 In all honesty though, I would be surprised if you could convince Walter 
 to
 switch dmd's frontend to Phobos' lexer and parser even once they've been
 implemented. So, while I agree that there are benefits in doing so, I'm 
 not
 sure how much chance you have of ever getting any traction with that.

 - Jonathan M Davis

A more realistic alternative is to modify dmd to optionally (at build/link time) use Phobos' lexer and/or parser, which would allow them to be run against the test suite. Parser and lexer changes are relatively rare in the compiler, so keeping two implementations in sync is not really that big a deal.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 Note that PEG does not impose to use packrat parsing, even 
 though it
 was developed to use it. I think it's a historical 'accident' 
 that put
 the two together: Bryan Ford thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I got disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.
 However, as I found a few hours ago, Packrat parsing 
 (typically used to
 handle PEG) has serious disadvantages: it complicates 
 debugging because of
 frequent backtracking, it has problems with error recovery, 
 and typically
 disallows to add actions with side effects (because of 
 possibility of
 backtracking). These are important enough to reconsider my 
 plans of using
 Pegged. I will try to analyze whether the issues are so 
 fundamental that I
 (or somebody else) will have to create an ANTLR-like parser 
 instead, or
 whether it is possible to introduce changes into Pegged that 
 would fix these
 problems.


I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling. * (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.) I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something... I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
 I was thinking the same thing.

 My intent is to create a kind of regular-expression-of-nodes 
 with push/pop operators to recognize ascent and descent on the 
 tree.  Such a regular expression would allow one to capture 
 subtrees out of generalized patterns and then place them into 
 new trees that then become the input for the next pattern or 
 set of patterns.  I think this is much closer to how I 
 conceptualize semantic analysis than how semantic analysis is 
 done in front ends like DMD: it should to be done with pattern 
 recognition and substitution, not with myriad nested 
 if-statements and while-loops.

and xpath for querying, searching and traversing AST with Dmitry a few weeks ago. A custom NDepend-like Code Query Language was the major alternative we considered. Discussion started on this forum and continued via email.
 In this vision I do not use classes and inheritance for my AST.
  Instead I use structs that contain some kind of nodeType 
 member that would be one of the tokens/symbols in the grammar, 
 like TOK_WHILE_NODE in the above code.  Dynamic dispatch is 
 instead performed by (very fast) DFAs recognizing parts of the 
 AST.

Exactly. This idea first came to me in April after I implemented the first top-down recursive descent custom parser for a D subset. I tried Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator 
 precedence grammar to pegged. Of course it should fit naturally 
 with traditional PEG, for instance taking responsibility for 
 parsing expressions.

But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:
 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method

OK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.)
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 19:58:37 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:
 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method

OK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.)

order to parse some expression is costly enough to consider OPP superior to a general PEG for expressions. At first I was surprised when found that D doesn't define operator precedence explicitly, but instead provides a hierarchy of expression types. Then I somehow concluded that the approaches are equivalent. (I started learning parsing techniques only in February '12.) Since then I never reconsidered my past conclusions.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu 
wrote:
 Doesn't ANTLR use full-fledged character-level LL(*) parsing 
 even in the tokenizer?

Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure). The idea to introduce DFA for both determining which rule to apply and lexing of terminal symbols appeared to me much earlier, and the suggestion to introduce them into Pegged is one of options which I think could extremely improve performance.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 	auto captures = syntaxNode.matchNodes(
 		TOK_WHILE_NODE,
 		OP_ENTER_NODE,
 			OP_CAPTURE(0),
 			OP_BEGIN,
 				TOK_EXPRESSION,
 			OP_END,
 			OP_CAPTURE(1),
 			OP_BEGIN,
 				TOK_STATEMENT,
 			OP_END,
 		OP_LEAVE_NODE);

I'm glad to hear you like the tree-parsing approach, Chad, although the particular syntax here looks pretty unfriendly :O -- does this represent something that you are working on right now?
 This kind of architecture leads to other interesting benefits, 
 like being able to assert which symbols a pattern is designed 
 to handle or which symbols are allowed to exist in the AST at 
 any point in time. Thus if you write a lowering that introduces 
 nodes that a later pass can't handle, you'll know very quickly, 
 at least in principle.

 I wanted to make such a front-end so that I could easily make a 
 C backend.  I believe such a compiler would be able to do that 
 with great ease.  I really want a D compiler that can output 
 ANSI C code that can be used with few or no OS/CPU 
 dependencies.  I would be willing to lose a lot of the nifty 
 parallelism/concurrency stuff and deal with reference counting 
 instead of full garbage collection, as long as it lets me 
 EASILY target new systems (any phone, console platform, and 
 some embedded microcontrollers).  Then what I have is something 
 that's as ubiquitous as C, but adds a lot of useful features 
 like exception handling, dynamic arrays, templates, CTFE, etc 
 etc.  My ideas for how to deal with ASTs in pattern recognition 
 and substitution followed from this.

I tend to agree that it would be better to have a "general" node class with the node type as a property rather than a subtype and rather than a myriad of independent types, although in the past I haven't been able to figure out how to make this approach simultaneously general, efficient, and easy to use. I'm kind of a perfectionist which perhaps holds me back sometimes :) I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse): 1. Lexer 2. Tree-ification 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later) The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following: A man (from a [very ugly] house in the suburbs) was quoted as saying { I saw Batman (and Robin) last night! } Is converted to a tree where everything parenthesized or braced gets to be a child: A man ( from a [ very ugly ] house in the suburbs ) was quoted as saying { ... } Some of the things I like about this approach are: 1. It's language-agnostic. Lots of languages and DSLs could re-use exactly the same code from stage 2. (Stage 1, also, is fairly similar between languages and I wonder if a parameterized standard lexer is a worthwhile pursuit.) 2. It mostly eliminates the need for arbitrary-length lookahead for things like D's template_functions(...)(...). Of course, the tokens will almost always end up getting scanned twice, but hey, at least you know you won't need to scan them more than twice, right? (er, of course the semantic analysis will scan it several times anyway. Maybe this point is moot.) 3. It is very efficient for tools that don't need to examine function bodies. Such tools can easily leave out that part of the parser simply by not invoking the function-body sub-parser. 4. It leaves open the door to supporting embedded DSLs in the future. It's trivial to just ignore a block of text in braces and hand it off to a DSL later. It is similar to the way PEGs allow several different parties to contribute parts of a grammar, except that this approach does not constrain all the parties to actually use PEGs; for instance if I am a really lazy DSL author and I already have a SQL parser laying around (whether it's LL(k), LALR, whatever), I can just feed the original input text to that parser (or, better, use the flat token stream, sans comments, that came out of the lexer.) 5. It's risky 'cause I've never heard of anyone taking this approach before. Bring on the danger! I have observed that most PLs (Programming Langs) use one of two versions of stage 2: (1) C-style, with structure indicated entirely with {}, (), [], and possibly <> (shudder), or (2) Python-style, with structure indicated by indentation instead of {}. My favorite is the Boo language, which combines these two, using Python style by default, but also having a WSA parsing mode (whitespace-agnostic) with braces, and switching to WSA mode inside a Python-style module whenever the user uses an opener ("(,{,[") (IIRC). So a single Boo-like stage 2 could be re-used for almost all PLs, and thus would be a reasonable candidate as part of the standard library or a "D Boost library" (there is not such a thing, is there?).
 I wanted to make such a front-end so that I could easily make a 
 C backend.  I believe such a compiler would be able to do that 
 with great ease.

 Needing to use D in places where it isn't available is a real 
 pain-point for me right now, and I'll probably find ways to 
 spend time on it eventually.

Yeah, with a tree-transforming parser, I imagine the same thing, except my current fetish is to convert a certain subset of D to multiple other languages automatically. Then I could write libraries that can easily be used by an astonishingly large audience. I certainly would like to see D targetting Android, but that's best done directly from D to ARM. Anyway, the devil's in the detail. Originally I wanted to do a parser generator and a "completely general AST" in C# and couldn't seem to work out the details, but D is more flexible and is likely better suited to the task.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
 I'd like to add that if we give tree parsing first-class 
 treatment, I believe the most logical approach to parsing has 
 three or more stages instead of the traditional two (lex+parse):

 1. Lexer
 2. Tree-ification
 3. Parsing to AST (which may itself use multiple stages, e.g. 
 parse the declarations first, then parse function bodies later)

 The new stage two simply groups things that are in parenthesis 
 and braces. So an input stream such as the following:

I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead. My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 Since I didn't understand your question I assume that my 
 statement was somehow incorrect (likely because I made some 
 wrong assumptions about ANTLR). I didn't know about its 
 existence until today and still don't understand it completely. 
 What I think I understood is that it uses DFA for deciding 
 which grammar rule to apply instead of doing backtracking. I 
 also think that it uses DFA for low-level scanning (I'm not 
 sure).

ANTLR 3 doesn't use a DFA unless it needs to. If unlimited lookahead is not called for, it uses standard LL(k) or perhaps it uses the modified (approximate? I forget the name) LL(k) from ANTLR 2. DFA comes into play, for instance, if you need to check what comes after an argument list (of, unlimited, length) before you can decide that it *is* an argument list and start the "real" parsing (The author says LL(k) is too inefficient so he used a restricted form of it; personally I'm not convinced, but I digress)
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
On Saturday, 7 July 2012 at 20:39:18 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
 I'd like to add that if we give tree parsing first-class 
 treatment, I believe the most logical approach to parsing has 
 three or more stages instead of the traditional two 
 (lex+parse):

 1. Lexer
 2. Tree-ification
 3. Parsing to AST (which may itself use multiple stages, e.g. 
 parse the declarations first, then parse function bodies later)

 The new stage two simply groups things that are in parenthesis 
 and braces. So an input stream such as the following:

I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead. My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.

Hmm, you've got a good point there, although simple tree-ification is clearly less work than standard parsing, since statements like "auto x = y + z;" would be quickly "blitted" into the same node in phase 2, but would become multiple separate nodes in phase 3. What I like about it is not its performance, but how it matches the way we think about languages. Humans tend to see overall structure first, and examine the fine details later. The tree parsing approach is similarly nonlinear and can be modularized in a way that might be more intuitive than traditional EBNF. On the other hand, one could argue it is /too/ flexible, admitting so many different approaches to parsing that a front-end based on this approach is not necessarily intuitive to follow; and of course, not using a standard EBNF-type grammar could be argued to be bad. Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 What I like about it is not its performance, but how it matches 
 the way we think about languages. Humans tend to see overall 
 structure first, and examine the fine details later. The tree 
 parsing approach is similarly nonlinear and can be modularized 
 in a way that might be more intuitive than traditional EBNF.

That reminds me, I forgot to write a another advantage I expected the three-phase approach to have, namely, that it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parenthesis are not matched up properly and then it can make reasonable guesses about where those missing braces/parenthesis were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it. Example: void f() { if (cond) x = y + 1; y = z + 1; } } // The error appears to be here, but it's really 4 lines up.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote:
 Still... it's a fun concept, and even if the initial parsing 
 ends up using the good-old lex-parse approach, semantic 
 analysis and lowering can benefit from a tree parser. Tree 
 parsing, of course, is just a generalization of linear parsing 
 and so a tree parser generator (TPG) could work equally well 
 for flat input.

Exactly. Semantic analysis is what would benefit from that, as well as client code (for example, refactoring tools, etc.) Parser would become quite complicated. Probably it would need to perform complex tree navigation (which might decrease performance proportionally to the average tree depth or even more, if parser algorithms were themselves fast). Any non-trivial query would require direct character manipulation (like comparing sub-strings of captures with string literals, etc.). Probably you would get frequent cache misses because of the need to jump throughout the (incomplete) AST. It would definitely be problematic to build an immutable AST model, thus (IMO) significantly and needlessly constraining you and users of your library. And likely you would have to deal with many more problems __without__ significant gains.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
 it seems easier to tell what the programmer "meant" with three 
 phases, in the face of errors. I mean, phase 2 can tell when 
 braces and parenthesis are not matched up properly and then it 
 can make reasonable guesses about where those missing 
 braces/parenthesis were meant to be, based on things like 
 indentation. That would be especially helpful when the parser 
 is used in an IDE, since if the IDE guesses the intention 
 correctly, it can still understand broken code and provide code 
 completion for it. And since phase 2 is a standard tool, 
 anybody's parser can use it.

There could be multiple errors that compensate each other and make your phase 2 succeed and prevent phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until parser is able to continue its work properly. All this would suffer for the same reason: you loose information.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
 enum : SyntaxElement
 {
   AST_EXPRESSION          = 0x0001_0000_0000_0000,
     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |

This would cause wasting space (probably a lot). Definitely it would not be easy to implement in a parser generator, when various language properties are not known beforehand for fine-grained tuning.
 This approach of course has shameful nesting limitations, but I 
 feel like these determinations could be fairly well optimized 
 even for the general case.  For example: another approach that 
 I might be more inclined to take is to give each token/symbol a 
 low-valued index into a small inheritance table.

Depending on implementation, that might introduce the multiplier overhead of table access per each comparison (and there would be many in case of searching for nodes of specific type).
 I would expect the regex engine to call the isA function as one 
 of it's operations.  Thus placing an AST_EXPRESSION into your 
 expression would also match an AST_NEGATE_EXPR too.

But actually it is not so difficult to implement in a very similar way to what you described. I was thinking about a lookup table, but different from a traditional inheritance table. It would be indexed by AST node type (integral enum value), and store various classification information as bits. Maybe this is what you meant and I misunderstood you... Example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
On Saturday, 7 July 2012 at 22:07:02 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
 it seems easier to tell what the programmer "meant" with three 
 phases, in the face of errors. I mean, phase 2 can tell when 
 braces and parenthesis are not matched up properly and then it 
 can make reasonable guesses about where those missing 
 braces/parenthesis were meant to be, based on things like 
 indentation. That would be especially helpful when the parser 
 is used in an IDE, since if the IDE guesses the intention 
 correctly, it can still understand broken code and provide 
 code completion for it. And since phase 2 is a standard tool, 
 anybody's parser can use it.

There could be multiple errors that compensate each other and make your phase 2 succeed and prevent phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until parser is able to continue its work properly. All this would suffer for the same reason: you loose information.

This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
 This is all true, but forgetting a brace commonly results in a 
 barrage of error messages anyway. Code that guesses what you 
 meant obviously won't work all the time, and phase 3 would have 
 to take care not to emit an error message about a "{" token 
 that doesn't actually exist (that was merely "guessed-in"). But 
 at least it's nice for a parser to be /able/ to guess what you 
 meant; for a typical parser it would be out of the question, 
 upon detecting an error, to back up four source lines, insert a 
 brace and try again.

So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
 I see what you mean.

When the memory hierarchy is deep, every level would require at least one bit position. Or even every base class would require a separate bit. (I think that the former + several bits to distinguish among hierarchies.) Otherwise it would not be easy to check by a bit-mask. Even if the above is incorrect (and that is likely since I didn't try to encode that compactly for the real grammar), I think that in general that information would only be possible to store in a fairly large integral. Especially if we try to generate parser from grammar, and thus can't do fine-tuning to pack the information tightly. This overhead would be paid per each AST node __instance__. But that is not necessary, since we could store information in a lookup table only once per node __type__.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 22:40:05 UTC, Andrei Alexandrescu
wrote:

 http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Thanks, nice article. Also found another post: http://www.antlr.org/wiki/display/~admin/2008/11/30/Example+tree+rewriting+with+patterns, which is related to some other discussion in this thread :)
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
On Saturday, 7 July 2012 at 22:35:37 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
 This is all true, but forgetting a brace commonly results in a 
 barrage of error messages anyway. Code that guesses what you 
 meant obviously won't work all the time, and phase 3 would 
 have to take care not to emit an error message about a "{" 
 token that doesn't actually exist (that was merely 
 "guessed-in"). But at least it's nice for a parser to be 
 /able/ to guess what you meant; for a typical parser it would 
 be out of the question, upon detecting an error, to back up 
 four source lines, insert a brace and try again.

So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.

I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...?
Jul 07 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 Yeah, with a tree-transforming parser, I imagine the same 
 thing, except
 my current [fantasy] is to convert a certain subset of D to 
 multiple other
 languages automatically. Then I could write libraries that can 
 easily be
 used by an astonishingly large audience. I certainly would 
 like to see D
 targetting Android, but that's best done directly from D to 
 ARM.

having to cater to the lowest-common-denominator in all of your API designs. No templated functions or ranges in your API, that's for sure. I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing.

Well, for templates, in general, it would be necessary to instantiate a particular set of templates and explicitly give them names in the target language. So for instance, I could define a Point!T struct in D, sure, but then I'd have to tell the language converter to create target-language specializations: in C#, PointD=Point!double, PointI=Point!int, etc. If the target were C++, the template could be translated to a C++ template, Point<T>, as long as there aren't any "static ifs" or other things that can't be translated. Notably, if a template P!T depends on another template Q!T, then P!T cannot be translated to a C++/C# P<T> unless Q!T was also translated as Q<T>. Adapting standard libraries could no doubt be a gigantic problem. I don't know how to begin to think about doing that. But for ranges in particular, I think the concept is too important to leave out of public interfaces. So I'd port the major range data structures to the target languages, most likely by hand, so that they could be used by converted code.
 As for D targeting Android, my intent is really to target X 
 where X is any CPU/OS combo you can think of.  I want to be 
 able to get D, the language, not necessarily phobos or other 
 niceties, to work on any platform, and to do so without much 
 work.  Cross-compiling to a new platform that has never been 
 cross-compiled before should require zero coding.

I understand. Conversion to C is an effective last resort. And, well, I hear a lot of compilers have even used it as a standard practice. I guess you'd be stuck with refcounting, though.
 I think that the D-directly-to-ARM is the current approach for 
 cross-compiling.  I critique it for its underwhelming lack of 
 results.

Yeah. I assume it involves weird object-file formats, calling conventions and ABIs. I guess very few want to get involved with that stuff, and very few have the slightest clue where to begin, myself included.
 (2) suffer from integration problems if you try to compile the 
 expressions in separate files before compiling the rest of the 
 front-end.

Absolutely, I love language-integrated metaprogramming. Without it you end up with complicated build environments, and I hate those, cuz there isn't a single standard build environment that everybody likes. I think people should be able to just load up their favorite IDE and add all source files to the project and It Just Works. Or on the command line, do dmd *.d or whatever. Oh, and the ability to run the same code at meta-compile-time, compile-time and run-time, also priceless.
Jul 07 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:
 I'm not seeing any tremendous error-handling difficulty in my 
 idea. Anyway, I missed the part about information being thrown 
 away...?

(I will use the word "you" to refer to some abstract person or compiler.) Error handling could mean either error reporting and stopping after the first error, or reporting several errors and continuing parsing + semantic analysis whenever possible, so that the user would get partial functionality like code completion, information about method overloads, or "go to definition" / find all usages, etc. The first method is the less powerful option, but still requires constructing a meaningful error message which could help the user. There are many possible error recovery strategies. For example, you might decide to insert some token according to the parsing information available up to the moment when error is discovered, if that would fix the problem and allow to continue parsing. Another option is to ignore a series of further tokens (treat them a whitespace, for example), until parser is able to continue its work. Also there are many heuristics for typical error situations. All of these can only be performed if parser gathers the syntax information which can be inferred from source code according to the grammar rules. In stage 2 you have only performed some basic analysis, like, e.g., matched braces to define some hierarchy. This means that at the time when you find a missing brace, for example, you cannot tell anything more than that braces don't match. Or, if the user inserts a brace in an incorrect location, it is only possible to say that it is either missing somewhere and somewhere else another brace is missing, or that it is missing in one place, and some other brace is redundant. In many cases you won't even notice that a brace is incorrectly placed, and pass the resulting tree to the 3rd stage. You don't take any hint from grammar about the exact locations where some token should be present. Now, stage 3 heavily depends on the output of stage 2. As I demonstrated in some examples, it could get the output which implies incorrect structure, even if that has not been found in the previous stage. You would need to analyze so much information attempting to find the real roots of a problem, that effectively it would involve duplicating (implicitly) the logic of previous stage, but with combinatorial complexity growth. The problems you would need to deal with are much more complex than I described here. So you wouldn't be able to deliver error recovery at all, or (if you could) it would be either trivial or would require needlessly complex logic. Breaking the system at the wrong boundary (when the number of dependencies that cross the boundary is large) always introduces artificial complexity. Described above is the illustration of what I call information loss. I may have described something not as clearly as needed, but I didn't have the goal to provide a thorough and verifiable analysis. I speculated and simplified a lot. If you decide to ignore this, it is not a problem and I don't state that you will fail any of your goals. Just admit that __for me__ this approach is not a viable option. Everything written above is IMHO and there may be many ways to resolve the problems with various degrees of success.
Jul 07 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, July 08, 2012 14:13:18 Jacob Carlborg wrote:
 On 2012-07-07 19:53, Jonathan M Davis wrote:
 There are multiple issues here. The one that Andrei is raising is the fact
 that D isn't properly formalized. Essentially, the compiler _is_ the spec,
 and while the online spec _mostly_ follows it, it doesn't entirely, and
 the online spec isn't always as precise as it needs to be regardless.
 With a fully formalized spec, it should be possible to fully implement a
 D compiler from the spec alone, and I don't think that that's currently
 possible.

That's been a problem for a long time.

True, but Andrei is basically arguing that porting dmd's lexer and parser to D would reduce the incentive to fix this, whereas having a new lexer/parser would encourage fixing it (though he's arguing for purely using a generative parser rather than writing a D-specific one).
 IMHO, the reason that porting dmd's lexer and parser would be of great
 benefit is primarily maintenance. It makes it much easier to keep Phobos'
 lexer and parser in line with dmd, making discrepencies less likely, but
 it arguably reduces the incentive to improve the spec.

But then it sounds like the best solution would be not to have a lexer/parser based on DMD and instead making sure the spec is correct.

That would be the question. Given enough time, what I'd probably want to do would be to port dmd's lexer and parser to D in order to properly figure it all out, updating the spec in the process, and _then_ go write one from scratch, possibly basing some of it on how dmd did it, possibly not. But I don't really have the time for any of that right now, and most of the focus right now from people interested in parsing seems to be on pegged and parser generators (which are very cool and in some ways much more interesting, but I seriously question that that's the performant way to go if you're looking to parse D specifically). So, who knows what we're going to get out of this and when we'll get it.
 What keeps popping up in my head is the scenario where users are
 complaining about the frontend in Phobos not behaving as their compiler.
 If this is due to they are out of sync, bugs or not following the spec
 doesn't matter.
 
 I still thinks D is changing too much to have two separate
 implementations of the compiler and a library like this.

Well, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be. - Jonathan M Davis
Jul 08 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:
 most of the focus right now from people interested in parsing 
 seems to be on pegged and parser generators (which are very 
 cool and in some ways much more interesting, but I seriously 
 question that that's the performant way to go if you're looking 
 to parse D specifically).

Can you provide a *specific* example of performance optimization which a custom D compiler would have, but parser generator would be unlikely to catch up.
Jul 08 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, July 08, 2012 22:15:26 Roman D. Boiko wrote:
 On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:
 most of the focus right now from people interested in parsing
 seems to be on pegged and parser generators (which are very
 cool and in some ways much more interesting, but I seriously
 question that that's the performant way to go if you're looking
 to parse D specifically).

Can you provide a *specific* example of performance optimization which a custom D compiler would have, but parser generator would be unlikely to catch up.

It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster. Also, look at dmd and dmc vs other compilers. They use hand-written parsers and are generally much faster than their competitors. One thing to remember about hand-written parsers vs generative ones though is that they usually are completely different in terms of the type of parser that you write (e.g. hand-written parsers are generally recursive-decent parser whereas generative ones usually use bottom-up parsers). So, that could have a large impact on performance as well (in either direction). To be clear though, I have _no_ problem with having a generative parser in Phobos (or having other 3rd party options available). Parsers like pegged are extremely cool and extremely useful. It's just that it's my understanding that well-written hand-written parsers are faster than generated ones, so I think that it would be benecial to have a hand-written parser for D in Phobos _in addition_ to a general, generative solution. But to fully prove that a hand-written one would be faster, we'd of course have to have actual solutions to compare. And if the API for a D-specific parser in Phobos is designed well enough, and it somehow proved that a generative solution was faster, then the hand-written one could be replaced by the generative one underneat the hood. - Jonathan M Davis
Jul 08 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
 It's been too long since I was actively working on parsers to 
 give any details, but it is my understanding that because a 
 hand-written parser is optimized for a specific grammar, it's 
 going to be faster.

My aim is to find out any potential bottlenecks and ensure that those are possible to get rid of. So, let's try. I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) code where necessary, including those needed to take advantage of some particular aspects of D grammar. Thus I claim that optimization for D grammar is possible.
 Also, look at dmd and dmc vs other compilers. They use 
 hand-written parsers and are generally much faster than their 
 competitors.

Due to which particular design or implementation decisions? Would it be possible to name a few which are relevant in this context?
 One thing to remember about hand-written parsers vs generative 
 ones though is that they usually are completely different in 
 terms of the type of parser that you write (e.g. hand-written 
 parsers are generally recursive-decent parser whereas 
 generative ones usually use bottom-up parsers).

So far discussion goes in favor of LL(*) parser like ANTLR, which is top-down recursive-descent. Either Pegged will be optimized with LL(*) algorithms, or a new parser generator created.
Jul 08 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
 I believe it would not hurt generality or quality of a parser 
 generator if it contained sews for inserting custom (optimized)

s/sews/seams
Jul 08 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:
 On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
 It's been too long since I was actively working on parsers to 
 give any details, but it is my understanding that because a 
 hand-written parser is optimized for a specific grammar, it's 
 going to be faster.

My aim is to find out any potential bottlenecks and ensure that those are possible to get rid of. So, let's try. I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) code where necessary, including those needed to take advantage of some particular aspects of D grammar. Thus I claim that optimization for D grammar is possible.

I'm convinced that the output of a parser generator (PG) can be very nearly as fast as hand-written code. ANTLR's output (last I checked) was not ideal, but the one I planned to make (a few years ago) would have produced faster code. By default the PG's output will not be the speed of hand-written code, but the user can optimize it. Assuming an ANTLR-like PG, the user can inspect the original output looking for inefficient lookahead, or cases where the parser looks for rare cases before common cases, and then improve the grammar and insert ... I forget all the ANTLR terminology ... syntactic predicates or whatever, to optimize the parser.
 So far discussion goes in favor of LL(*) parser like ANTLR, 
 which is top-down recursive-descent. Either Pegged will be 
 optimized with LL(*) algorithms, or a new parser generator 
 created.

Right, for instance I am interested in writing a top-down PG because I understand them better and prefer the top-down approach due to its flexibility (semantic actions, allowing custom code) and understandability (the user can realistically understand the output; in fact readability would be a specific goal of mine) Roman, regarding what you were saying to me earlier:
In stage 2 you have only performed some basic analysis, like, 
e.g., matched braces to define some hierarchy. This means that 
at the time when you find a missing brace, for example, you 
cannot tell anything more than that braces don't match.

Stage 2 actually can tell more than just "a brace is missing somewhere". Because so many languages are C-like. So given this situation: frob (c &% x) blip # gom; } It doesn't need to know what language this is to tell where the brace belongs. Even in a more nebulous case like: frob (c &% x) bar lic blip # gom; } probably the brace belongs at the end of the first line. Perhaps your point is that there are situations where a parser that knows the "entire" grammar could make a better guess about where the missing brace/paren belongs. That's certainly true. However, just because it can guess better, doesn't mean it can reinterpret the code based on that guess. I mean, I don't see any way to "back up" a parser by an arbitrary amount. A hypothetical stage 2 would probably be hand-written and could realistically back up and insert a brace/paren anywhere that the heuristics dictate, because it is producing a simple data structure and it doesn't need to do any semantic actions as it parses. A "full" parser, on the other hand, has done a lot of work that it can't undo, so the best it can do is report to the user "line 54: error: brace mismatch; did you forget a brace on line 13?" The heuristic is still helpful, but it has already parsed lines 13 to 54 in the wrong context (and, in some cases, has already split out a series of error messages that are unrelated to the user's actual mistake).
 As I demonstrated in some examples, it could get the output 
 which implies incorrect structure

I was unable to find the examples you refer to... this thread's getting a little unweildy :)
Jul 08 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
David, I would suggest you to create a proof-of-concept prototype 
and pay attention to some non-trivial use cases. This way you'd 
likely be able to reduce major risks significantly before making 
any serious time investments.
Jul 08 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, July 09, 2012 08:33:31 Jacob Carlborg wrote:
 On 2012-07-08 22:05, Jonathan M Davis wrote:
 Well, for the lexer and parser, we're probably okay (or very close to it).
 As Daniel pointed out elsewhere in this thread, that part of the frontend
 doesn't change much (particularly the lexer). There's definitely still
 some churn, but it's nothing like it used to be.

It don't feel like that, didn't we get lambdas or UFCS in the last release?

lambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have to have major changes made to them later (especially the lexer). - Jonathan M Davis
Jul 08 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, July 09, 2012 09:16:41 Jacob Carlborg wrote:
 On 2012-07-09 08:39, Jonathan M Davis wrote:
 lambdas are a bit older than that IIRC, and I don't think that UFCS
 actually affects the lexer or parser. Yes, changes are still made, but
 they're increasingly rare, and most of the stuff being changed is on the
 semantic level (most of which is bug fixes). So, out of all of the parts
 of the compiler to duplicate, those are the least likely to have to have
 major changes made to them later (especially the lexer).

I'm pretty sure UFCS affects lexing or parsing. How else would this be legal: 4.foo();

That definitely wouldn't affect lexing, because it doesn't affect the tokens at all. Whether it affects the parser would depend on the grammar. There's a good chance it would affect parsing, but it might not. It depends on how numeric literals are treated. In most cases though, UFCS definitely wouldn't affect parsing at all, because it's purely a matter of symbol resolution, so if changes had to be made to the parser, they would almost certainly have been minimal. Yes, some changes are still being made to the parser, but since the language is mostly frozen, almost all of the changes have gone into bug fixes, which are generally semantic issues and don't affect lexing or parsing at all. So, while there would still be some maintenance issues in keeping a separate lexer and parser in line with dmd, I really don't think that it would take much at this point. Most of the work would be in getting them to match dmd when they're first written. - Jonathan M Davis
Jul 09 2012
parent "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.190.1341818983.31962.digitalmars-d puremagic.com...
 I'm pretty sure UFCS affects lexing or parsing. How else would this be
 legal:

 4.foo();

That definitely wouldn't affect lexing, because it doesn't affect the tokens at all.

Not true. This used to be lexed as '4.f' 'oo'. (I think)
Jul 09 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 9 July 2012 at 06:35:38 UTC, Jacob Carlborg wrote:
 I'm not expert in this field but I've heard that it's harder to 
 get good error reporting with a parser generator.

Good point!
Jul 09 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 7 July 2012 at 16:37:56 UTC, Roman D. Boiko wrote:
 Note that PEG does not impose to use packrat parsing, even 
 though it was developed to use it. I think it's a historical 
 'accident' that put the two together: Bryan Ford thesis used 
 the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does 
 not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the 
 last rules was enought to get a slight speed improvement, buth 
 that doing anymore sotrage was detrimental to the parser's 
 overall efficiency.

That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.

One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed. It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios. Since Pegged doesn't use Packrat algorithm, this solution might be either not relevant or not applicable, but I doubt that there will be any fundamental problem with error recovery. Unpleasant debugging experience, however, should be relevant for any parser that uses backtracking heavily.
Jul 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko <rb d-coding.com> wrote:


 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find the
 exact place of error (from parser's perspective, not user's) one only needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.
 It is also possible to rerun parsing with some additional heuristics after
 failing, thus enabling advanced error repair scenarios.

Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.
Jul 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> wrote:

 Do people really what error-repairing parsers? I want my parsers to
 tell me something is bad, and, optionally to advance a possible
 repair, but definitely *not* to automatically repair a inferred error
 and continue happily.

FWIW, this is what most HTML parsers are doing.

Ah, right. I can get it for HTML/XML. JSON also, maybe. I was thinking of parsing a programming language (C, D, etc) Consider me half-convinced :)
Jul 10 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 10 July 2012 at 19:41:29 UTC, Philippe Sigaud wrote:
 On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> 
 wrote:

 Do people really what error-repairing parsers? I want my 
 parsers to
 tell me something is bad, and, optionally to advance a 
 possible
 repair, but definitely *not* to automatically repair a 
 inferred error
 and continue happily.

FWIW, this is what most HTML parsers are doing.

Ah, right. I can get it for HTML/XML. JSON also, maybe. I was thinking of parsing a programming language (C, D, etc) Consider me half-convinced :)

It would still generate errors. But would enable a lot of useful functionality: autocompletion, refactoring, symbol documentation in a tooltip, displaying method overloads with parameters as-you-type, go to definition, etc.
Jul 10 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:
 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find
 the
 exact place of error (from parser's perspective, not user's) one only
 needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.
 It is also possible to rerun parsing with some additional heuristics
 after
 failing, thus enabling advanced error repair scenarios.

Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.

FWIW, this is what most HTML parsers are doing.

Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix. - Jonathan M Davis
Jul 10 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 10 July 2012 at 20:25:12 UTC, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 FWIW, this is what most HTML parsers are doing.

Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.

Not having control over parser or source code causes problems. Ability to deliver useful functionality (see my post above) is a different use case.
Jul 10 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:
 On 2012-07-10 22:25, Jonathan M Davis wrote:
 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever
 fix.

technique used by compilers. Example: int foo () { int a = 3 // note the missing semicolon return a; } Instead of the parser going completely mad because of the missing semicolon. It will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error.

Well, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO. And that's essentially what happens with HTML. - Jonathan M Davis
Jul 10 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, July 11, 2012 08:41:53 Jacob Carlborg wrote:
 On 2012-07-10 22:53, Jonathan M Davis wrote:
 Well, giving an error, continuing to parse, and giving a partial result
 can be useful (and you give a prime example of that), but "fixing" the
 problem (e.g by inserting the semicolon) and not considering it to be an
 error would be a _huge_ mistake IMHO. And that's essentially what happens
 with HTML.

not_ output the error and continues as if it was valid could. As far as I know, up until HTML 5, the spec hasn't mentioned what should happen with invalid code. This is just a error handling strategy that is an implementation detail. It will not change what is and what isn't valid code. Are you preferring getting just the first error when compiling? Fix the error, compile again, get a new error and so on.

??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The browser doesn't spit out errors. It just guesses at what you really meant and displays that. It "fixes" the error for you, which is a horrible design IMHO. Obviously, we're stuck with it for HTML, but it should not be replicated with anything else. This is in contrast to your example of outputting an error and continuing to parse as best it can in order to provide more detail and more error messages but _not_ ultimately considering the parsing successful. _That_ is useful. HTML's behavior is not. - Jonathan M Davis
Jul 10 2012
prev sibling next sibling parent reply Kai Meyer <kai unixlords.com> writes:
On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have
 some other parsers (in C# from Mono-D etc.).

 What about to get one and call it standard?

I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :) I would love to receive some critical feedback on my approach as well as the actual code base. To build it, you'll need cmake, and cmaked2 from here: http://code.google.com/p/cmaked2/wiki/GettingStarted Or just dmd *.d :) I haven't tried to build it on Windows yet, but I don't see anything that immediately jumps out as not cross-platform. I've been working on it on both Fedora and CentOS. -Kai Meyer
Jul 31 2012
next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This 
 seems like a nice thread to toss myself into. I've started 
 working on a generic lexer that is based off of a defined 
 grammar.

Every helping hand is appreciated :-)
 As I read through the thread (I unfortunately don't have enough 
 time to read every post, but I skimmed through as many as I 
 could, and read the ones that seemed important), it seems like 
 we need a parser in D that can lex D, and provide a Range of 
 tokens that could be consumed.

Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem. There are already some working D-Parsers buried in this thread.
 With some very minor tweaks, and a properly written Grammar 
 class, I basically have it already done. D was going to be one 
 of the first languages I would have written a definition for.

 https://github.com/kai4785/firstfront

I've only glimpsed at your code. For most languages lexing is far more expensive then parsing and thus the lexer has to be very fast and I wouldn't pursue your approach and instead use something like ragel. It already has D output but needs a separate build step.
Jul 31 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 01-Aug-12 01:10, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:
 On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This seems like a
 nice thread to toss myself into. I've started working on a generic lexer
 that is based off of a defined grammar.

Every helping hand is appreciated :-)

Hi Kai, welcome here!
 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the ones
 that seemed important), it seems like we need a parser in D that can lex D,
 and provide a Range of tokens that could be consumed.

Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem.

I think we need a tokenizer generator and a parser generator. They can be grouped or not, but we need both, in Phobos. We also need to define what's needed in a token. Kai's approach is OK, but what's the _type field for an operator or and identifier? Also, a lexer can fill a symbol table and produce identifier tokens that refer to a particular entry in the symbol table.

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

String table is populated by lexer.
 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

Is that so?

It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).
 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It already has D
 output but needs a separate build step.


Have to agree here if anything a better DFA generator is needed, current std.regex can't get as good in this field because of: a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in ctRegex via codegen) b) full unicode features commitment (again expect some improvement here in near future) c) designed to take multiple captures from matched piece of text. I'm not sure when (or even if) std.regex will ever special case all of the above.
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for
D.

 The only problem I see in Kai's approach (which is the standard one, a
 prioritized list of regexes) is how to recognize tokens that are not
 regular (I mean, that cannot be encoded as a regex), like nesting
 comments?

See below.
 How does the lexer know when to stop producing a 'comment'
 token?

Call special function skipComment when lexer hits comment_start pseudotoken. Typically lexeres are regular as it allows them to be fast. -- Dmitry Olshansky
Jul 31 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Typically lexeres are regular as it allows them to be fast.


Regularity of the language is not required for speed.
 Hmm, but then it has to treat comments a one token. So no Ddoc love?

Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.
Jul 31 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 01-Aug-12 02:54, Timon Gehr wrote:
 On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Typically lexeres are regular as it allows them to be fast.


Regularity of the language is not required for speed.

That's why there is "allows" (not "required") the reason in general is that plain deterministic automation is insanely fast compared with just about any CFG parsing scheme. Yet there are certain grammars/cases where it doesn't matter much. Also tweaking DFA by "semantic" actions could make it handle some irregularities at no extra cost etc.
 Hmm, but then it has to treat comments a one token. So no Ddoc love?

Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.

Yup. -- Dmitry Olshansky
Jul 31 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

String table is populated by lexer.

The latter, I get. The former, not so much.
 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

Is that so?

It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).

I'd have thought that, seeing as 'the end result' is more complicated (richer, if I may say so) for parsing, then parsing is more expensive. I'm reading on GLR, and was wondering what the practical limits are. Do people use GLR to parse thousands of pages of text?
 Have to agree here if anything a better DFA generator is needed, current
 std.regex can't get as good in this field because of:

 a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in
 ctRegex via codegen)
 b) full unicode features commitment (again expect some improvement here in
 near future)
 c) designed to take multiple captures from matched piece of text.

 I'm not sure when (or even if) std.regex will ever special case all of the
 above.

Well, - for a lexer lookahead is sometimes useful (the Dragon book cite the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if)

Well while lookahead will help, there are simpler ways. e.g. regex ("IF\ZYOUR_LOOKAHEAD"); \Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but have chosen ECM-262 standard. Critical difference is that lookahead can be used like this: regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");
 - Unicode is needed to lex D correctly, no?

Not that much of unicode, e.g. it could skip decoding stuff most of the time.
 - multiple captures doesn't seem necessary *for a lexer*


 How does the lexer know when to stop producing a 'comment'
 token?

Call special function skipComment when lexer hits comment_start pseudotoken.

Ah, and this special function must somehow maintain a stack, to 'count' the comment_start and comment_end pseudotokens. So in a way, it quits the regular world to become temporarily more powerful.
 Typically lexeres are regular as it allows them to be fast.

Hmm, but then it has to treat comments a one token. So no Ddoc love?

-- Dmitry Olshansky
Jul 31 2012
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

String table is populated by lexer.

The latter, I get. The former, not so much.

Okay. Say lexer maps all unique strings that are not keywords to some ID. Then parser creates a stack of scoped symbol tables. These nested symbol tables use only IDs not strings themselves. (Though I expect that only semantic analysis require the use of these tables) -- Dmitry Olshansky
Jul 31 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-07-31 23:20, Jonathan M Davis wrote:

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see). Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

 - Jonathan M Davis

That's awesome. I really looking forward to this. Keep up the good work. -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-07-31 23:20, Jonathan M Davis wrote:

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see). Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

BTW, do you have to code online somewhere? -- /Jacob Carlborg
Aug 01 2012
prev sibling parent Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
01.08.2012 1:20, Jonathan M Davis пишет:
 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate. - Jonathan M Davis

Good. Will wait for announce. -- Денис В. Шеломовский Denis V. Shelomovskij
Aug 01 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:
 On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This seems like a
 nice thread to toss myself into. I've started working on a generic lexer
 that is based off of a defined grammar.

Every helping hand is appreciated :-)

Hi Kai, welcome here!
 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the ones
 that seemed important), it seems like we need a parser in D that can lex D,
 and provide a Range of tokens that could be consumed.

Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem.

I think we need a tokenizer generator and a parser generator. They can be grouped or not, but we need both, in Phobos. We also need to define what's needed in a token. Kai's approach is OK, but what's the _type field for an operator or and identifier? Also, a lexer can fill a symbol table and produce identifier tokens that refer to a particular entry in the symbol table. I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.
 There are already some working D-Parsers buried in this thread.

Yes. We also need a D parser (not generic), but no one pushed one for Phobos inclusion right now. Anyway, we can put generic parsers in Phobos too (LALR, LL(*), etc), but I'd say the first priority would be to have a D lexer (producing a range of tokens) and a D parser (consuming a range of tokens, and executing semantic actions, like the building of an AST). Generic parsers can come later on. Having a D parser in the standard distribution would create much goodwill. Many people want that.
 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

Is that so?
 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It already has D
 output but needs a separate build step.

Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D. The only problem I see in Kai's approach (which is the standard one, a prioritized list of regexes) is how to recognize tokens that are not regular (I mean, that cannot be encoded as a regex), like nesting comments? How does the lexer know when to stop producing a 'comment' token? Philippe
Jul 31 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 31-Jul-12 20:10, Kai Meyer wrote:
 On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have
 some other parsers (in C# from Mono-D etc.).

 What about to get one and call it standard?

I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :)

 I would love to receive some critical feedback on my approach as well as
 the actual code base.

Okay. Critics go as follows: First, as an author of std.regex I'm pleased that you find it suitable to use it in lexer. :) Still there is a BIG word of warning: Lexer based on trying out an array of regexes until one will match is NOT fast and not even close to fast ones. It is acceptable as proof of concept/prototype kind of thing only. To help this use case I though of making multi-regex matching a-la: match(text, regex1, regex2, regex3...); then lexing would be effectively a loop of form: foreach(tok; match(text, regex1, regex2, regex3...)){ switch(tok[0]){//suppose match returns tuple - N of regex matched + the usual piece of text ... } } (with some glitches on /+ ... +/ and other non-regular stuff). But until such time it's not to be used seriously in this context. The reason is that each call to match does have some overhead to setup regex engine, it's constant but it's huge compared to what usual lexers typically do. The other thing is that you should use ctRegex for this kind of job if you can (i.e. compiler doesn't explode on it). (keep in mind I only glimpsed the source, will post more feedback later)
 To build it, you'll need cmake, and cmaked2 from here:
 http://code.google.com/p/cmaked2/wiki/GettingStarted

 Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
 don't see anything that immediately jumps out as not cross-platform.
 I've been working on it on both Fedora and CentOS.

rdmd for the win! -- Dmitry Olshansky
Jul 31 2012
parent reply Kai Meyer <kai unixlords.com> writes:
On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:
 On 31-Jul-12 20:10, Kai Meyer wrote:
 On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have
 some other parsers (in C# from Mono-D etc.).

 What about to get one and call it standard?

I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :)

 I would love to receive some critical feedback on my approach as well as
 the actual code base.

Okay. Critics go as follows: First, as an author of std.regex I'm pleased that you find it suitable to use it in lexer. :) Still there is a BIG word of warning: Lexer based on trying out an array of regexes until one will match is NOT fast and not even close to fast ones. It is acceptable as proof of concept/prototype kind of thing only. To help this use case I though of making multi-regex matching a-la: match(text, regex1, regex2, regex3...); then lexing would be effectively a loop of form: foreach(tok; match(text, regex1, regex2, regex3...)){ switch(tok[0]){//suppose match returns tuple - N of regex matched + the usual piece of text ... } } (with some glitches on /+ ... +/ and other non-regular stuff). But until such time it's not to be used seriously in this context. The reason is that each call to match does have some overhead to setup regex engine, it's constant but it's huge compared to what usual lexers typically do. The other thing is that you should use ctRegex for this kind of job if you can (i.e. compiler doesn't explode on it). (keep in mind I only glimpsed the source, will post more feedback later)
 To build it, you'll need cmake, and cmaked2 from here:
 http://code.google.com/p/cmaked2/wiki/GettingStarted

 Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
 don't see anything that immediately jumps out as not cross-platform.
 I've been working on it on both Fedora and CentOS.

rdmd for the win!

I cut my teeth on perl, so everything gets compared to perl in my mind. In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this: while(<STDIN>){ if($_ =~ m/<someregex>/){ some work; } } Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different than what you call "setup regex engine". Does/Can D's std.regex offer something like this? If not, I would be interested in why not. -Kai Meyer
Aug 06 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Aug-12 22:52, Philippe Sigaud wrote:
 I cut my teeth on perl, so everything gets compared to perl in my mind.

 In perl, I can 'precompile' a regular expression, so I can avoid some
 overhead. So something like this:

 while(<STDIN>){
      if($_ =~ m/<someregex>/){
          some work;
      }
 }

 Would end up re-compiling the regex on each line from STDIN. Perl uses the
 term "compiling" the regular expression, which may be different than what
 you call "setup regex engine".


Of course regex is first compiled to bytecode (same thing as "compile" in perl). Moreover if you use regex pattern directly it is compiled on first use and put into TLS cache of compiled patterns. From now on it's used in compiled form. (there about 8 entries in cache, don't relay on it too much). What I mean by "setup engine" is extra cost spent on _starting_ matching with compiled pattern. For one thing it adds 1 call to malloc/free. Again I think in Perl the same thing applies. In other words doing continuous search (via foreach(x; match(..., "g" )) ) is much faster then calling match on individual pieces over and over again in cycle.
 Does/Can D's std.regex offer something like this? If not, I would be
 interested in why not.


first (it just provides a shortcut that does it for you behind the scenes).
 D does have compiled regex, it's called ctRegex (meaning compile-time regex):

 http://dlang.org/phobos/std_regex.html#ctRegex

And there is a second version - compiled native code. Unlike perl it's not bytecode and thus usually much faster. Frankly the most slow regex I've seen is in Python, the second most sucky one is PCRE (but is incredibly popular somehow). Perl is not bad but usually slower then top dogs from C++ & std.regex.
 The tests done recently put it among the fastest regex engine known.

Yeah, on top of said chart. Needless to say the test favored my implementation, ctRegex is not the fastest regex engine in general (yet).
 Caution: not every regex extension known to mankind is implemented
 here!

Sure as hell. In fact, the most problematic thing is that parser often fails during CTFE. Also I have a solid plan on enhancing a bunch of things effectively making std.regex v2 but no sooner then October-November. -- Dmitry Olshansky
Aug 06 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Aug-12 23:43, Philippe Sigaud wrote:
 On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Of course regex is first compiled to bytecode (same thing as "compile" in
 perl). Moreover if you use regex pattern directly it is compiled on first
 use and put into TLS cache of compiled patterns. From now on it's used in
 compiled form. (there about 8 entries in cache, don't relay on it too much).

Btw, I wanted to ask you that for a long time: what do you mean by 'compiled to bytecode', for D source?

by this: auto r = regex(some_string); // where some_string can come from user input. r - contains bytecode that matches pattern. Unlike ctRegex which does output D source code and compiles it to native code.
 And there is a second version - compiled native code. Unlike perl it's not
 bytecode and thus usually much faster.

Which?

Compiled native code is faster. Or what you
 Frankly the most slow regex I've seen is in Python, the second most sucky
 one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
 slower then top dogs from C++ & std.regex.

Do people *really* need speed, or a great number of extensions?

They want both. In my experience you can't satisfy everybody. I think features are not what people look for in regex, even basic ECMA-262 stuff is way above what most programmers need to do. Unlike extra speed which even newbies could use :) And while I'm at it - we already have a damn good collection of extensions, way too much I'd say. For example, I've yet to see one regex engine that support Unicode to the same level as std.regex, namely I haven't seen a single one with full set operations (not only union but subtraction, intersection etc.) inside of char class [...]. Funny even ICU one didn't support them a year ago, dunno about now. Also some extensions come from implementations inherent inefficiency e.g. (as in Perl) possessive quantifiers, atomic groups. No wonder it's so hard to make look-behind unrestricted in Perl, the whole thing is a mess.
 Sure as hell. In fact, the most problematic thing is that parser often fails
 during CTFE.

For example?

Ehmn. See bugzilla, search ctRegex. But not yet filed are: said non-union set operations usually fail + the fact that Unicode properties are not readable at CT (thus \p{xxx} fails).
 Also I have a solid plan on enhancing a bunch of things effectively making
 std.regex v2 but no sooner then October-November.

That's good to know.

-- Dmitry Olshansky
Aug 06 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Aug-12 23:59, Dmitry Olshansky wrote:
 And there is a second version - compiled native code. Unlike perl it's not


Which?

Compiled native code is faster. Perl is slower. -- Dmitry Olshansky
Aug 06 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Aug-12 09:11, Philippe Sigaud wrote:
 On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:

 As far as I know, ctRegex _always_ produces machine code. Bytecode is what
 the runtime implementation, i.e. regex, uses.

That's what I had in mind too, but Dmitry seemed to say things are more complicated. Maybe I misunderstood.

Yeah, sorry for confusion - there are only 2 ways to do the job: regex and ctRegex. The first one contains special bytecode, the second one generates native code.
 And do you know why it's called bytecode? I never heard of 'bytecode'
 for D outside of std.regex.

bytecode for D. In other words instruction go as follows: match char, match one of few chars, match one of character set, match any char, jump/loop/option etc.
 Does that just mean 'intermediate form D
 code', as in 'bunch of small and easily optimized instructions.'?

No :) So far there are only 3 cases: RT parsing ---> Bytecode ( { auto x = regex("blah) }) CT parsing ---> Bytecode ( static x = regex("blah"); ) CT parsing ---> Machine code ( static/auto x = ctRegex!"blah" ) It would be nice to have JIT compiler for D at RT but no luck yet. (std.jit?) Then the picture would be complete. -- Dmitry Olshansky
Aug 06 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 31 July 2012 at 21:10:52 UTC, Philippe Sigaud wrote:
 I've only glimpsed at your code. For most languages lexing is 
 far more
 expensive then parsing

Is that so?

I have no numbers. It's a statement that I found in some (1-3) sources about parsing. I'll share if I can find them.
 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It 
 already has D
 output but needs a separate build step.

Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

Yeah.
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

String table is populated by lexer.

The latter, I get. The former, not so much.
 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

Is that so?

It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).

I'd have thought that, seeing as 'the end result' is more complicated (richer, if I may say so) for parsing, then parsing is more expensive. I'm reading on GLR, and was wondering what the practical limits are. Do people use GLR to parse thousands of pages of text?
 Have to agree here if anything a better DFA generator is needed, current
 std.regex can't get as good in this field because of:

 a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in
 ctRegex via codegen)
 b) full unicode features commitment (again expect some improvement here in
 near future)
 c) designed to take multiple captures from matched piece of text.

 I'm not sure when (or even if) std.regex will ever special case all of the
 above.

Well, - for a lexer lookahead is sometimes useful (the Dragon book cite the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if) - Unicode is needed to lex D correctly, no? - multiple captures doesn't seem necessary *for a lexer*
 How does the lexer know when to stop producing a 'comment'
 token?

Call special function skipComment when lexer hits comment_start pseudotoken.

Ah, and this special function must somehow maintain a stack, to 'count' the comment_start and comment_end pseudotokens. So in a way, it quits the regular world to become temporarily more powerful.
 Typically lexeres are regular as it allows them to be fast.

Hmm, but then it has to treat comments a one token. So no Ddoc love?
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 7:48 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 Well,

 - for a lexer lookahead is sometimes useful (the Dragon book cite the
 FORTRAN grammar, for which keywords are not reserved and so when you
 encounter IF, you don't know if (!) it's a function call or a 'real'
 if)

Well while lookahead will help, there are simpler ways. e.g. regex ("IF\ZYOUR_LOOKAHEAD"); \Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but have chosen ECM-262 standard. Critical difference is that lookahead can be used like this: regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

In the FORTRAN case, you indeed *need* to re-lex the stuff after IF, with another regex, once you've determined it's an IF instruction and not some moron who used IF as an identifier. You know, as a first step, I'd be happy to get ctRegex to recognize the \Z flag.
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 8:12 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

String table is populated by lexer.

The latter, I get. The former, not so much.

Okay. Say lexer maps all unique strings that are not keywords to some ID. Then parser creates a stack of scoped symbol tables. These nested symbol tables use only IDs not strings themselves. (Though I expect that only semantic analysis require the use of these tables)

Ah, that! This I know, I misunderstood your initial post. The symbol table can also be pre-charged with keywords, if needed.
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 I cut my teeth on perl, so everything gets compared to perl in my mind.

 In perl, I can 'precompile' a regular expression, so I can avoid some
 overhead. So something like this:

 while(<STDIN>){
     if($_ =~ m/<someregex>/){
         some work;
     }
 }

 Would end up re-compiling the regex on each line from STDIN. Perl uses the
 term "compiling" the regular expression, which may be different than what
 you call "setup regex engine".

 Does/Can D's std.regex offer something like this? If not, I would be
 interested in why not.

D does have compiled regex, it's called ctRegex (meaning compile-time regex): http://dlang.org/phobos/std_regex.html#ctRegex The tests done recently put it among the fastest regex engine known. Caution: not every regex extension known to mankind is implemented here!
Aug 06 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Of course regex is first compiled to bytecode (same thing as "compile" in
 perl). Moreover if you use regex pattern directly it is compiled on first
 use and put into TLS cache of compiled patterns. From now on it's used in
 compiled form. (there about 8 entries in cache, don't relay on it too much).

Btw, I wanted to ask you that for a long time: what do you mean by 'compiled to bytecode', for D source?
 And there is a second version - compiled native code. Unlike perl it's not
 bytecode and thus usually much faster.

Which?
 Frankly the most slow regex I've seen is in Python, the second most sucky
 one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
 slower then top dogs from C++ & std.regex.

Do people *really* need speed, or a great number of extensions?
 Sure as hell. In fact, the most problematic thing is that parser often fails
 during CTFE.

For example?
 Also I have a solid plan on enhancing a bunch of things effectively making
 std.regex v2 but no sooner then October-November.

That's good to know.
Aug 06 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Aug 6, 2012 at 10:00 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 On 06-Aug-12 23:59, Dmitry Olshansky wrote:
 And there is a second version - compiled native code. Unlike perl it's
 not

bytecode and thus usually much faster.

Which?

Compiled native code is faster. Perl is slower.

Sorry, I meant: what second version? How do you distinguish between bytecode-producing and compiled-code producing ctRegex?
Aug 06 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Tuesday, 7 August 2012 at 04:58:05 UTC, Philippe Sigaud wrote:
 Sorry, I meant: what second version? How do you distinguish 
 between
 bytecode-producing and compiled-code producing ctRegex?

As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses. David
Aug 06 2012
prev sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:

 As far as I know, ctRegex _always_ produces machine code. Bytecode is what
 the runtime implementation, i.e. regex, uses.

That's what I had in mind too, but Dmitry seemed to say things are more complicated. Maybe I misunderstood. And do you know why it's called bytecode? I never heard of 'bytecode' for D outside of std.regex. Does that just mean 'intermediate form D code', as in 'bunch of small and easily optimized instructions.'?
Aug 06 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate. - Jonathan M Davis
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).

That was quick! Cool!
Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

Could you please describe the kind of token it produces? Can it build a symbol table? Does it recognize all kind of strings (including q{ } ones)? How does it deal with comments, particularly nested ones? Does it automatically discard whitespaces or produce them as a token? I'd favor this approach, if only because wrapping the lexer in a filter!noWS(tokenRange) is easy. Does it produce a lazy range btw?
Jul 31 2012
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com> 

 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).

That was quick! Cool!

Yeah. Once I started on it, I made a lot of progress really quickly. There's still a fair bit to do (primarily having to do with literals), but it probably won't take all that much longer. Certainly, I'd expect to have it done within a couple of weeks if not sooner, unless something goes wrong.
Writing it has been going surprisingly

 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the
 spec will be more accurate.

Could you please describe the kind of token it produces? Can it build a symbol table? Does it recognize all kind of strings (including q{ } ones)? How does it deal with comments, particularly nested ones? Does it automatically discard whitespaces or produce them as a token? I'd favor this approach, if only because wrapping the lexer in a filter!noWS(tokenRange) is easy. Does it produce a lazy range btw?

Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that. It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see). I'm in the middle of dealing with the named entity stuff at the moment, which unfortunately has revealed a rather nasty compiler bug with regards to template compile times, which I still need to report (I intend to do that this evening). The file generating the table of named entities currently takes over 6 minutes to compile on my Phenom II thanks to that bug, so I'm not quite sure how that's going to affect things. Regardless, the lexer should support _everything_ as far as what's required for fully lexing D goes by the time that I'm done. I don't have the code with me at the moment, but I believe that the token type looks something like struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } struct SourcePos { size_t line; size_ col; size_t tabWidth = 8; } The type is an enum which gives the type of the token (obviously) which includes the various comment types and an error type (so errors are reported by getting a token that was an error token rather than throwing or anything like that, which should make lexing passed malformed stuff easy). str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which in the case of lexing string rather than another range type would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing string very efficient. It may or may not make sense to change that to the range type used rather than string. For nesting block comments, the whole comment is one token (with the token type which is specifically for nested comments), regardless of whether there's any nesting going on. But that could be changed if there were a need to get separate tokens for the comments inside. LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal type (the various string postfixes - c,w, and d - are treated as different token types rather than giving the literal value different string types - the same with the integral and floating point literals). And pos holds the position in the text where the token started, which should make it easy to use for syntax highlighting and the like (as well as indicating where an error occurred). The initial position is passed as an optional argument to the lexing function, so it doesn't have to be 1:1 (though that's the default), and it allows you to select the tabwidth. So, you'll pass a range and an optional starting position to the lexing function, and it'll return a lazy range of Tokens. Whitespace is stripped as part of the lexing process, but if you take the pos properties of two adjacent tokens, you should be able to determine how much whitespace was between them. I _think_ that that's how it currently is, but again, I don't have the code with me at the moment, so it may not be 100% correct. And since it's a work in progress, it's certainly open to changes. - Jonathan M Davis
Jul 31 2012
next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
"Jonathan M Davis" , dans le message (digitalmars.D:173860), a crit:
 struct Token
 {
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
 }
 
 struct SourcePos
 {
  size_t line;
  size_t col;
  size_t tabWidth = 8;
 }

The occurence of tabWidth surprises me. What is col supposed to be ? an index (code unit), a character number (code point), an estimation of where the caracter is supposed to be printed on the line, given the provided tabwidth ? I don't think the lexer can realy try to calculate at what column the character is printed, since it depends on the editor (if you want to use the lexer to syntax highlight for example), how it supports combining characters, zero or multiple column characters, etc. (which you may not want to have to decode). You may want to provide the number of tabs met so far. Note that there are other whitespace that you may want to count, but you shouldn't have a very complicated SourcePos structure. It might be easier to have whitespace, endofline and endoffile tokens, and let the user filter out or take into account what he wants to take into account. Or just let the user look into the original string...
Aug 01 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-01 00:38, Jonathan M Davis wrote:

 I don't have the code with me at the moment, but I believe that the token type
 looks something like

 struct Token
 {
   TokenType type;
   string str;
   LiteralValue value;
   SourcePos pos;
 }

 struct SourcePos
 {
   size_t line;
   size_ col;
   size_t tabWidth = 8;
 }

What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer? -- /Jacob Carlborg
Aug 01 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 10:40, Jonathan M Davis wrote:

 It could certainly be added, but unless the lexer always knows it (and I'm
 pretty sure that it doesn't), then keeping track of that entails extra
 overhead. But maybe it's worth that overhead. I'll have to look at what I have
 and see. Worst case, the caller can just use walkLength on str, but if it has
 to do that all the time, then that's not exactly conducive to good
 performance.

Doing a syntax highlighter and calculate the length (walkLength) for each token would most likely slow it down a bit. -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing. - Jonathan M Davis
Jul 31 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.

OK. Same for standard comment and doc comments? I was wondering how to get the code possibly inside a ---- / ---- block (I never dealt with something like documentation or syntax highlighting), but your solution makes it easy: Toten(TokenType.DocComment, "/** ... */"), Token(TokenType.Module, "module"), ... A small specialised parser can then extract text, DDocs macros and code blocks from inside the comment. Findind and stripping '----' is easy and then the lexer can be locally reinvoked on the slice containing the example code.
Jul 31 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 07:26:07 Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com> 

 On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.

OK. Same for standard comment and doc comments?

From the TokenType enum declaration:

blockComment, /// $(D /* */) lineComment, /// $(D // ) nestingBlockComment, /// $(D /+ +/) There are then functions which operate on Tokens to give you information about them. Among them is isDdocComment, which will return true if the Token type is a comment, and that comment is a ddoc comment (i.e. starts with /**, ///, or /++ rather than /*, //, or /+). So, anything that wants to process ddoc comments can lex them out and process them, and if they want to know what symbols that a ddoc comment applies to, then they look at the tokens that follow (though a full-on parser would be required to do that correctly).
 I was wondering how to get the code possibly inside a ---- / ----
 block (I never dealt with something like documentation or syntax
 highlighting), but your solution makes it easy:
 
 Toten(TokenType.DocComment, "/** ... */"), Token(TokenType.Module,
 "module"), ...
 
 A small specialised parser can then extract text, DDocs macros and
 code blocks from inside the comment. Findind and stripping '----' is
 easy and then the lexer can be locally reinvoked on the slice
 containing the example code.

Yes. The lexer isn't concerned with where the text comes from, and it isn't concerned with lexing comments beyond putting them in a token. But that should be powerful enough to lex the examples if you've already extracted them. - Jonathan M Davis
Jul 31 2012
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
Thanks for taking the time to write such an explanation!


 Yeah. Once I started on it, I made a lot of progress really quickly. There's
 still a fair bit to do (primarily having to do with literals), but it probably
 won't take all that much longer. Certainly, I'd expect to have it done within
 a couple of weeks if not sooner, unless something goes wrong.

Is it based on a prioritized list of regexes?
 Well, it's still a work in progress, so it certainly can be adjusted as
 necessary. I intend it to fully implement the spec (and make sure that both it
 and the spec match what dmd is doing) as far as lexing goes. The idea is that
 you should be able to build a fully compliant D parser on top of it and build
 a fully compliant D compiler on top of that.

My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kind of strings with or without escaping, for example. Your lexer return a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.
 It already supports all of the comment types and several of the string literal
 types. I haven't sorted out q{} yet, but I will before I'm done, and that may
 or may not affect how some things work, since I'm not quite sure how to handle
 q{} yet (it may end up being done with tokens marking the beginning and end of
 the token sequence encompassed by q{}, but we'll see).

I see. My question was because, as I said before I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string, when it knows its a string mixin.
 I don't have the code with me at the moment, but I believe that the token type
 looks something like

 struct Token
 {
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
 }

Seems OK. I guess it can be templated on the string type, because you sure want to lex dstrings.
 The type is an enum which gives the type of the token (obviously) which
 includes the various comment types and an error type (so errors are reported
 by getting a token that was an error token rather than throwing or anything
 like that, which should make lexing passed malformed stuff easy).

And the recovery is done on the next token. Good idea.
 str holds the exact text which was lexed (this includes the entire comment for
 the various comment token types), which in the case of lexing string rather
 than another range type would normally (always? - I don't remember) be a slice
 of the string being lexed, which should help make lexing string very efficient.

Yeahs, you may as well use this excellent feature. But that means the input must be held in memory always, no? If it's an input range (coming from a big file), can you keep slices from it?
 It may or may not make sense to change that to the range type used rather than
 string. For nesting block comments, the whole comment is one token (with the
 token type which is specifically for nested comments), regardless of whether
 there's any nesting going on. But that could be changed if there were a need
 to get separate tokens for the comments inside.

That's OK, as I reply in another message. It's now quite clear for me how to deal with that. IDE/doc generator can parse the comment at their leasure, as they are not long nor critical section of code. They can even easily get the code fragment used as examples inside ---/--- markers.
 LiteralValue is a VariantN of the types that a literal can be (long, ulong,
 real, and string IIRC) and is empty unless the token is a literal type

You can make it an Algebraic!(long, ulong, real, string), as your universe of type is restricted. That can even catch a non-legal value. Arrays literals are not considered literals, since they can contain arbitrary expressions, is that right? And it's a good idea to drop the complex literals once and for all.
 (the
 various string postfixes - c,w, and d - are treated as different token types
 rather than giving the literal value different string types - the same with the
 integral and floating point literals).

That's seem reasonable enough, but can you really store a dstring literal in a string field?
 And pos holds the position in the text where the token started, which should
 make it easy to use for syntax highlighting

Does syntax highlighting need more that a token stream? Without having thought a lot about it, it seems to me IDE tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, that's nice.
 and the like (as well as
 indicating where an error occurred). The initial position is passed as an
 optional argument to the lexing function, so it doesn't have to be 1:1 (though
 that's the default), and it allows you to select the tabwidth.

I really like the idea of having malformed token become error token, with the lexing that continue. As I said before, that enables an IDE to continue to do syntax highlighting.
 So, you'll pass a range and an optional starting position to the lexing
 function, and it'll return a lazy range of Tokens. Whitespace is stripped as
 part of the lexing process, but if you take the pos properties of two adjacent
 tokens, you should be able to determine how much whitespace was between them.

OK, so whitespace is strictly that, and comments are not dropped. That's OK. I guess a D parser would then just filter the comments out of the token range.
 I _think_ that that's how it currently is, but again, I don't have the code
 with me at the moment, so it may not be 100% correct. And since it's a work in
 progress, it's certainly open to changes.

My only quip for now would be the string/dstring thingy. Did you achieve UFCS, wrt floating point values? is 1.f seen as a float or as f called on 1 (int)?
Jul 31 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 07:44, Philippe Sigaud wrote:

 Does syntax highlighting need more that a token stream? Without having
 thought a lot about it, it seems to me IDE tend to highlight based
 just on the token type, not on a parse tree. So that means your lexer
 can be used directly by interested people, that's nice.

Some IDE's do a more advanced syntax highlighting based on the semantic analysis. For example, Eclipse highlights instance variables differently. -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
 Is it based on a prioritized list of regexes?

I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.
 Well, it's still a work in progress, so it certainly can be adjusted as
 necessary. I intend it to fully implement the spec (and make sure that
 both it and the spec match what dmd is doing) as far as lexing goes. The
 idea is that you should be able to build a fully compliant D parser on
 top of it and build a fully compliant D compiler on top of that.

My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kind of strings with or without escaping, for example. Your lexer return a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.

Well, if you want the original text, there's the str property, which contains everything that was in the source code (including whatever escaping there was), whereas the value property includes only the body, and it will have already processed whatever escaped characters needed to be processed. It's the actual value of the literal. So, \` would be ` in the value property, named entities would be their actual code points, etc. So, I don't really see a problem there. Sure, depending on what is done with the processed string, it could cause further problems, but that always happens when doing string processing.
 It already supports all of the comment types and several of the string
 literal types. I haven't sorted out q{} yet, but I will before I'm done,
 and that may or may not affect how some things work, since I'm not quite
 sure how to handle q{} yet (it may end up being done with tokens marking
 the beginning and end of the token sequence encompassed by q{}, but we'll
 see).

I see. My question was because, as I said before I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string, when it knows its a string mixin.

Well, it could very well cause me trouble. I haven't gotten to it yet, and I'll need to study it to fully understand what it does before crafting my solution, but it'll probably end up being a sequence of tokens of some kind given that that's what it is - a string of tokens. Regardless, it'll be done in a way that whatever is using the lexer will be able to know what it is and process is it accordingly.
 str holds the exact text which was lexed (this includes the entire comment
 for the various comment token types), which in the case of lexing string
 rather than another range type would normally (always? - I don't
 remember) be a slice of the string being lexed, which should help make
 lexing string very efficient.

input must be held in memory always, no? If it's an input range (coming from a big file), can you keep slices from it?

Well, whatever is using the lexer is going to have to make sure that what it passes to the lexer continues to exist while it uses the lexer. Given how slicing works in D, it's pretty much a given that lexers and parsers are going to use them heavily, and any code using a lexer or parser is going to have to take that into account. Slicing is one of the major reasons that Tango's XML parser is so insanely fast.
 LiteralValue is a VariantN of the types that a literal can be (long,
 ulong,
 real, and string IIRC) and is empty unless the token is a literal type

You can make it an Algebraic!(long, ulong, real, string), as your universe of type is restricted. That can even catch a non-legal value.

I'll have to look at that to see whether using Algebraic is better. I'm not super-familiar with std.variant, so I may have picked the wrong type. However, VariantN already holds a specific set of types though (unlike Variant), so that part isn't a problem.
 Arrays literals are not considered literals, since they can contain
 arbitrary expressions, is that right?

There is no concept of array literal in the lexing portion of the spec. There may be at the level of the parser, but that's not the lexer's problem.
 And it's a good idea to drop the complex literals once and for all.

I'm not supporting any symbols which are known to be scheduled for deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is to-be-deprecated keywords (e.g. volatile and delete), since they still won't be legal to use as identifiers. And there's a function to query a token as to whether it's using a deprecated or unused keyword so that the program using the lexer can flag it if it wants to.
 (the
 various string postfixes - c,w, and d - are treated as different token
 types rather than giving the literal value different string types - the
 same with the integral and floating point literals).

That's seem reasonable enough, but can you really store a dstring literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type be the semantic analyzer, but the lexer doesn't care.
 Does syntax highlighting need more that a token stream? Without having
 thought a lot about it, it seems to me IDE tend to highlight based
 just on the token type, not on a parse tree. So that means your lexer
 can be used directly by interested people, that's nice.

If all you want to do is highlight specific symbols and keywords, then you don't need a parser. If you want to do fancier stuff related to templates and the like, then you'll need a parser and maybe even a semantic analyzer. But a lexer should be plenty for basic syntax highlighting.
 I really like the idea of having malformed token become error token,
 with the lexing that continue. As I said before, that enables an IDE
 to continue to do syntax highlighting.

That was the idea. You need to be able to continue lexing beyond errors in many cases, and that seemed the natural way to do it.
 Did you achieve UFCS, wrt floating point values? is 1.f seen as a
 float or as f called on 1 (int)?

I haven't tackled floating point literals yet, but since the f in 1.f is not part of the floating point literal, I'll have to handle it. I suspect that that's one area where I'll need to update the spec though, since it probably hasn't been updated for that yet. Basically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up. - Jonathan M Davis
Jul 31 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 08:11, Jonathan M Davis wrote:

 I'm not using regexes at all. It's using string mixins to reduce code
 duplication, but it's effectively hand-written. If I do it right, it should be
 _very_ difficult to make it any faster than it's going to be. It even
 specifically avoids decoding unicode characters and operates on ASCII
 characters as much as possible.

That's good idea. Most code can be treated as ASCII (I assume most people code in english). It would basically only be string literals containing characters outside the ASCII table. BTW, have you seen this: http://woboq.com/blog/utf-8-processing-using-simd.html -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 8:11 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
 Is it based on a prioritized list of regexes?

I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.

OK. It'll more difficult to genericize, then, but that's not your problem (could be mine, though).
 Well, whatever is using the lexer is going to have to make sure that what it
 passes to the lexer continues to exist while it uses the lexer.

Yeah, I realized that just after posting. And anyway, the token are made to be consumed at once, normally.
 I'll have to look at that to see whether using Algebraic is better. I'm not
 super-familiar with std.variant, so I may have picked the wrong type. However,
 VariantN already holds a specific set of types though (unlike Variant), so that
 part isn't a problem.

OK, I forgot about VariantN (I'm not so used to std.variant either)
 I'm not supporting any symbols which are known to be scheduled for deprecation
 (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is
 to-be-deprecated keywords (e.g. volatile and delete), since they still won't
 be legal to use as identifiers. And there's a function to query a token as to
 whether it's using a deprecated or unused keyword so that the program using
 the lexer can flag it if it wants to.

Good idea. A nice bunch of query functions will be a nice complement - isDoc - isComment - isString - isDeprecated ...
 That's seem reasonable enough, but can you really store  a dstring
 literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type be the semantic analyzer, but the lexer doesn't care.

I don't get it. Say I have an literal with non UTF-8 chars, how will it be stored inside the .str field as a string?
 Basically, the lexer that I'm writing needs to be 100% compliant with the D
 spec and dmd (which means updating the spec or fixing dmd in some cases), and
 it needs to be possible to build on top of it anything and everything that dmd
 does that would use a lexer (even if it's not the way that dmd currently does
 it) so that it's possible to build a fully compliant D parser and compiler on
 top of it as well as a ddoc generator and anything else that you'd need to do
 with a lexer for D. So, if you have any questions about what my lexer does (or
 is supposed to do) with regards to the spec, that should answer it. If my
 lexer doesn't match the spec or dmd when I'm done (aside from specific
 exceptions relating to stuff like deprecated symbols), then I screwed up.

That's a lofty goal, but that would indeed be quite good to have an officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.
Jul 31 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:
 OK. It'll more difficult to genericize, then, but that's not your
 problem (could be mine, though).

It was never intended to be even vaguely generic. It's targeting D specifically. If someone can take it and make it generic when I'm done, then great. But it's goal is to lex D as efficiently as possible, and it'll do whatever it takes to do that. From how the main switch statement's cases are constructed though, there's a lot there which could be genericized. I currently have several mixins used to create them, but I'm pretty sure that I can generate a _lot_ of the case statements using a single mixin which just takes the list of symbols and their associated tokens, which I'll probably do before I'm done. So, I'm sure that pieces of what I'm doing could be used to generate a lexer for another language, but plenty of it is very specific to D.
 That's seem reasonable enough, but can you really store  a dstring
 literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type be the semantic analyzer, but the lexer doesn't care.

I don't get it. Say I have an literal with non UTF-8 chars, how will it be stored inside the .str field as a string?

The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change. "hello world" "hello world"c "hello world"w "hello world"d are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file. In either case, a lot of string literals have to be decoded (e.g if they contain escaped characters), so you often can't create them with a slice anyway, and if a range is used which isn't one of the string types, then it's impossible for Token's value property to use the range type whenever it can't use a slice. So, it's just simpliest to always use string. It may be a slight performance hit for lexing wstrings and dstrings, since they _could_ be both sliced and created as new strings (unlike other ranges), but I don't think that it's worth the extra complication to make it so that the string literal's value could be other string types, especially when lexing strings is likely to be the common case.
 Basically, the lexer that I'm writing needs to be 100% compliant with the
 D
 spec and dmd (which means updating the spec or fixing dmd in some cases),
 and it needs to be possible to build on top of it anything and everything
 that dmd does that would use a lexer (even if it's not the way that dmd
 currently does it) so that it's possible to build a fully compliant D
 parser and compiler on top of it as well as a ddoc generator and anything
 else that you'd need to do with a lexer for D. So, if you have any
 questions about what my lexer does (or is supposed to do) with regards to
 the spec, that should answer it. If my lexer doesn't match the spec or
 dmd when I'm done (aside from specific exceptions relating to stuff like
 deprecated symbols), then I screwed up.

officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.

Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually switch to using the lexer and parser from Phobos, and in that sort of situation, it's that much more imperative that they follow the spec exactly. - Jonathan M Davis
Jul 31 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 08:39, Jonathan M Davis wrote:

 Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be
 in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually
 switch to using the lexer and parser from Phobos, and in that sort of
 situation, it's that much more imperative that they follow the spec exactly.

:) -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 07:12:30 Christophe Travert wrote:
 "Jonathan M Davis" , dans le message (digitalmars.D:173860), a =C3=A9=

 struct Token
 {
=20
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
=20
 }
=20
 struct SourcePos
 {
=20
  size_t line;
  size_t col;
  size_t tabWidth =3D 8;
=20
 }

The occurence of tabWidth surprises me. What is col supposed to be ? an index (code unit), a character number=

 (code point), an estimation of where the caracter is supposed to be
 printed on the line, given the provided tabwidth ?

col counts code points. tabWidth is the number of code points that '\t'= is=20 considered to be. That's it. So, in theory, an editor should be able to= use it=20 to indicate where on the line the token starts. If the code using the lexer wants to treat tabs as having the widtho of= a=20 single code point, then all they have to do is pass in a SourcePos with= a=20 tabWidth of 1. But if the lexer doesn't have a way to count tabs differ= ently,=20 then there's no way for the code using the lexer to figure out the tabs= without=20 going back and lexing the whitespace itself. But counting tabs as a dif= ferent=20 width than everything else is so common that it seemed prudent to add i= t.=20 Given that Phobos doesn't support graphemes and that ranges use code po= ints, a=20 code point is the closest to a character that the lexer would be counti= ng, and=20 it just makes sense to count code points. Now, the one thing that might be a problem with treating tabs as a fixe= d width=20 is that it's not uncommon to treat tabs as being up to a particular wid= th=20 rather than having a fixed width such that if there are other character= s before=20 a tab, then the tabs width is the max tab width minus the number of cha= racters=20 since the start of that tab width. e.g. if tabs had a max width of 8, t= hen \t123\t could have the first tab with a width of 8 and the second as only havin= g a=20 width of 5. But that's too complicated to deal with in the lexer IMHO. Maybe the tab width isn't worth having in SourePos and it will ultimate= ly be=20 removed, but it struck me as a useful feature, so I added it. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 10:25:18 Jacob Carlborg wrote:
 On 2012-08-01 00:38, Jonathan M Davis wrote:
 I don't have the code with me at the moment, but I believe that the token
 type looks something like
 
 struct Token
 {
 
   TokenType type;
   string str;
   LiteralValue value;
   SourcePos pos;
 
 }
 
 struct SourcePos
 {
 
   size_t line;
   size_ col;
   size_t tabWidth = 8;
 
 }

What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer?

I'm not sure. I don't think so. It doesn't really keep track of code points. It operates in code units as much as possible, and pos doesn't really help, because any newline that occurred would make it so that subtracting the start col from the end col would be completely bogus (that and tabs would mess that up pretty thoroughly, but as Christophe pointed out, the whole tabWidth thing may not actually have been a good idea anyway). It could certainly be added, but unless the lexer always knows it (and I'm pretty sure that it doesn't), then keeping track of that entails extra overhead. But maybe it's worth that overhead. I'll have to look at what I have and see. Worst case, the caller can just use walkLength on str, but if it has to do that all the time, then that's not exactly conducive to good performance. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 11:14:52 Jacob Carlborg wrote:
 On 2012-08-01 08:11, Jonathan M Davis wrote:
 I'm not using regexes at all. It's using string mixins to reduce code
 duplication, but it's effectively hand-written. If I do it right, it
 should be _very_ difficult to make it any faster than it's going to be.
 It even specifically avoids decoding unicode characters and operates on
 ASCII characters as much as possible.

That's good idea. Most code can be treated as ASCII (I assume most people code in english). It would basically only be string literals containing characters outside the ASCII table.

What's of particular importance is the fact that _all_ of the language constructs are ASCII. So, unicode comes in exclusively with identifiers, string literals, char literals, and whitespace. And with those, ASCII is still going to be far more common, so coding it in a way that makes ASCII faster at slight cost to performance for unicode is perfectly acceptable.
 BTW, have you seen this:
 
 http://woboq.com/blog/utf-8-processing-using-simd.html

No, I'll have to take a look. I know pretty much nothing about SIMD though. I've only heard of it, because Walter implemented some SIMD stuff in dmd not too long ago. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 It was never intended to be even vaguely generic. It's targeting D
 specifically. If someone can take it and make it generic when I'm done, then
 great. But it's goal is to lex D as efficiently as possible, and it'll do
 whatever it takes to do that.

That's exactly what I had in mind. Anyway, we need a D lexer. We also need a generic lexer generator, but as a far-away second choice and we can admit it being less efficient. Of course, any trick used on the D lexer can most probably be reused for Algol-family lexers.
 I don't get it. Say I have an literal with non UTF-8 chars, how will
 it be stored inside the .str field as a string?

The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change. "hello world" "hello world"c "hello world"w "hello world"d are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file.

Everytime I think I understand D strings, you prove me wrong. So, I *still* don't get how that works: say I have auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d; Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF32 characters? Or, are they automatically cut in four nonsense chars each? What about comments containing non-ASCII chars? How can code coming after the lexer make sense of it? As Jacob say, many people code in English. That's right, but 1- they most probably use their own language for internal documentation 2- any in8n part of a code base will have non-ASCII chars 3- D is supposed to accept UTF-16 and UTF-32 source code. So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?
Aug 01 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 14:44, Philippe Sigaud wrote:

 Everytime I think I understand D strings, you prove me wrong. So, I
 *still* don't get how that works:

 say I have

 auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;

 Then, the "..." part is lexed as a string literal. How can the string
 field in the Token magically contain UTF32 characters? Or, are they
 automatically cut in four nonsense chars each? What about comments
 containing non-ASCII chars? How can code coming after the lexer make
 sense of it?

 As Jacob say, many people code in English. That's right, but

 1- they most probably use their own language for internal documentation
 2- any in8n part of a code base will have non-ASCII chars
 3- D is supposed to accept UTF-16 and UTF-32 source code.

 So, wouldn't it make sense to at least provide an option on the lexer
 to specifically store identifier lexemes and comments as a dstring?

I'm not quite sure how it works either. But I'm thinking like this: The string representing what's in the source code can be either UFT-8 or the encoding of the file. I'm not sure if the lexer needs to re-encode the string if it's not in the same encoding as the file. Then there's an other field/function that returns the processed token, i.e. for a token of the type "int" it will return an actual int. This function will return different types of string depending on the type of the string literal the token represents. -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 13:30:31 Jacob Carlborg wrote:
 On 2012-07-31 23:20, Jonathan M Davis wrote:
 I'm actually quite far along with one now - one which is specifically
 written and optimized for lexing D. I'll probably be done with it not too
 long after the 2.060 release (though we'll see). Writing it has been
 going surprisingly quickly actually, and I've already found some bugs in
 the spec as a result (some of which have been fixed, some of which I
 still need to create pull requests for). So, regardless of what happens
 with my lexer, at least the spec will be more accurate.

BTW, do you have to code online somewhere?

No, because I'm still in the middle of working on it. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
 Everytime I think I understand D strings, you prove me wrong. So, I
 *still* don't get how that works:
=20
 say I have
=20
 auto s =3D " - some greek or chinese chars, mathematical symbols, wha=

 "d;
=20
 Then, the "..." part is lexed as a string literal. How can the string=

 field in the Token magically contain UTF32 characters?

It contains unicode. The lexer is lexing whatever encoding the source i= s in,=20 which has _nothing_ to do with the d on the end. It could be UTF-8, or = UTF-16,=20 or UTF-32. If we supported other encodings in ranges, it could be one o= f=20 those. Which of those it is is irrelevant. As far as the value of the l= iteral=20 goes, these two strings are identical: "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88" "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8" The encoding of the source file is irrelevant. By tacking a d on the en= d "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"d "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d you're just telling the compiler that you want the value that it genera= tes to=20 be in UTF-32. The source code could be in any of the supported encoding= s, and=20 the string could be held in any encoding until the object code is actua= lly=20 generated.
 So, wouldn't it make sense to at least provide an option on the lexer=

 to specifically store identifier lexemes and comments as a dstring?

You mean make it so that Token is=20 struct Token(R) { TokenType type; R str; LiteralValue value SourcePos pos; } instead of struct Token { TokenType type; string str; LiteralValue value SourcePos pos; } or do you mean something else? I may do something like that, but I woul= d point=20 out that if R doesn't have slicing, then that doesn't work. So, str can= 't=20 always be the same type as the original range. For ranges with no slici= ng, it=20 would have to be something else (probably either string or=20 typeof(takeExactly(range))). However, making str R _does_ come at the c= ost of=20 complicating code using the lexer, since instead of just using Token, y= ou have=20 to worry about whether it's a Token!string, Token!dstring, etc, and whe= ther=20 it's worth that complication is debatable. By far the most common use c= ase is=20 to lex string, and if str is string, and R is not, then you incur the p= enalty=20 of converting R to string. So, the common use case is fast, and the unc= ommon=20 use case still works but is slower, and the user of the lexer doesn't h= ave to=20 care what the original range type was. It could go either way. I used string on first pass, but as I said, I c= ould=20 change it to R later if that makes more sense. I'm not particularly hun= g up on=20 that little detail at this point, and that's probably one of the things= that=20 can be changed reasonably easily later. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrot=
e:

 "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

 The encoding of the source file is irrelevant.

do you mean I can do: string field =3D "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"; ? Geez, just tested it, it works. even writeln(field) correctly output the japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.
 You mean make it so that Token is

 struct Token(R)
 {
     TokenType    type;
     R       str;
     LiteralValue value
     SourcePos    pos;
 }

 instead of

 struct Token
 {
     TokenType    type;
     string       str;
     LiteralValue value
     SourcePos    pos;
 }

 or do you mean something else?

Right, this.
Aug 01 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-01 19:50, Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 "ウェブサイト"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

 The encoding of the source file is irrelevant.

do you mean I can do: string field = "ウェブサイト"; ? Geez, just tested it, it works. even writeln(field) correctly output the japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.

Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these encodings can store every character in the Unicode standard. What's different is how the characters are stored and how many bytes a single character takes to store in the string. For example: string str = "ö"; The above character will take up two bytes in the string. On the other hand, this won't work: char c = 'ö'; The reason for that is the the above character needs two bytes to be stored but "char" can only store one byte. Therefore you need to store the character in a type where it fits, i.e. "wchar" or "dchar". Or you can use a string where you can store how many bytes you want. Don't know if that makes it clearer. -- /Jacob Carlborg
Aug 01 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Aug-12 01:23, Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
 <andrej.mitrovich gmail.com> wrote:
 On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode

I will, but not yet. I've a few books on parsing and compilers to read before that. I just read http://www.joelonsoftware.com/articles/Unicode.html, though, and I'm a bit disappointed that char 7 (\u007) does not make my computer beep. I remember now having my computer beep on char 7 during the 80s when ASCII was the only thing that existed.

http://unicode.org/cldr/utility/index.jsp I've found these tools to be incredibly useful. -- Dmitry Olshansky
Aug 01 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now). - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Thu, Aug 2, 2012 at 1:29 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now).

I knew about 1-2-4 bytes schemes and such. But, somehow, for me, string == only-almost-ASCII characters. Anyway, it all *clicked* into place right afterwards and your answers are perfectly clear to me now.
Aug 01 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 22:54, Andrej Mitrovic wrote:

 I think many people viewed Unicode this way at first. But there is a
 metric ton of cool info out there if you want to get to know more
 about unicode (this may or may not be interesting reading material),
 e.g.:

 http://www.catch22.net/tuts/introduction-unicode
 http://icu-project.org/docs/papers/forms_of_unicode/
 http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode

 I used to have more of these links but lost them. There's even a
 gigantic book about unicode (Unicode Demystified).

This is a good read as well: http://www.joelonsoftware.com/articles/Unicode.html -- /Jacob Carlborg
Aug 01 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 22:47, Philippe Sigaud wrote:

 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

 Thanks Jacob!

You're welcome :) -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> 

 "ウェブサイト"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
 
 The encoding of the source file is irrelevant.

do you mean I can do: string field = "ウェブサイト"; ? Geez, just tested it, it works. even writeln(field) correctly output the japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.

From http://dlang.org/lex.html

D source text can be in one of the fol­low­ing for­mats: * ASCII * UTF-8 * UTF-16BE * UTF-16LE * UTF-32BE * UTF-32LE So, yes, you can stick unicode characters directly in D code. Though I wonder about the correctness of the spec here. It claims that if there's no BOM, then it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of them have BOM markers, but I can put unicode in a .d file just fine with vim. U should probably study up on BOMs. In any case, the source is read in whatever encoding it's in. String literals then all become UTF-8 in the final object code unless they're marked as specifically being another type via the postfix letter or they're inferred as being another type (e.g. when you assign a string literal to a dstring). Regardless, what's in the final object code is based on the types that the type system marks strings as, not what the encoding of the source code was. So, a lexer shouldn't care about what the encoding of the source is beyond what it takes to covert it to a format that it can deal with and potentially being written in a way which makes handling a particular encoding more efficient. The values of literals and the like are completely unaffected regardless. - Jonathan M Davis
Aug 01 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-01 20:24, Jonathan M Davis wrote:

 D source text can be in one of the fol­low­ing for­mats:
 * ASCII
 * UTF-8
 * UTF-16BE
 * UTF-16LE
 * UTF-32BE
 * UTF-32LE

 So, yes, you can stick unicode characters directly in D code. Though I wonder
 about the correctness of the spec here. It claims that if there's no BOM, then
 it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of
 them have BOM markers, but I can put unicode in a .d file just fine with vim. U
 should probably study up on BOMs.

 In any case, the source is read in whatever encoding it's in. String literals
 then all become UTF-8 in the final object code unless they're marked as
 specifically being another type via the postfix letter or they're inferred as
 being another type (e.g. when you assign a string literal to a dstring).
 Regardless, what's in the final object code is based on the types that the type
 system marks strings as, not what the encoding of the source code was.

 So, a lexer shouldn't care about what the encoding of the source is beyond
 what it takes to covert it to a format that it can deal with and potentially
 being written in a way which makes handling a particular encoding more
 efficient. The values of literals and the like are completely unaffected
 regardless.

But if you read a source file which is encoded using UTF-16 you would need to re-encode that to store it in the "str" filed in your Token struct? If that's the case, wouldn't it be better to make Token a template to be able to store all Unicode encodings without re-encoding? Although I don't know how if that will complicate the rest of the lexer. -- /Jacob Carlborg
Aug 01 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-01 22:10, Jonathan M Davis wrote:

 It may very well be a good idea to templatize Token on range type. It would be
 nice not to have to templatize it, but that may be the best route to go. The
 main question is whether str is _always_ a slice (or the result of
 takeExactly) of the orignal range. I _think_ that it is, but I'd have to make
 sure of that. If it's not and can't be for whatever reason, then that poses a
 problem. If Token _does_ get templatized, then I believe that R will end up
 being the original type in the case of the various string types or a range
 which has slicing, but it'll be the result of takeExactly(range, len) for
 everything else.

To me a string type would be enough. I don't need support for ranges. How about adding a union instead? enum StringType { utf8, utf16, utf32 } struct Token { StringType stringType; union { string strc; wstring strw; dstring strd; } property T str (T = string) () { static if (is(T == string)) { assert(stringType == StringType.utf8); return strc; } ... } } Most use cases would look like this: string s = token.str;
 I just made str a string to begin with, since it was simple, and I was still
 working on a lot of the initial design and how I was going to go about things.
 If it makes more sense for it to be templated, then it'll be changed so that
 it's templated.

I completely understand that. -- /Jacob Carlborg
Aug 01 2012
prev sibling parent travert phare.normalesup.org (Christophe Travert) writes:
"Jonathan M Davis" , dans le message (digitalmars.D:173942), a crit:
 It may very well be a good idea to templatize Token on range type. It would be 
 nice not to have to templatize it, but that may be the best route to go. The 
 main question is whether str is _always_ a slice (or the result of 
 takeExactly) of the orignal range. I _think_ that it is, but I'd have to make 
 sure of that. If it's not and can't be for whatever reason, then that poses a 
 problem.

It can't if it is a simple input range! Like a file read with most 'lazy' methods. Then you need either to transform the input range in a forward range using a range adapter that performs buffering, or perform your own buffering internally. You also have to decide how long the token will be valid (I believe if you want lexing to be blazing fast, you don't want to allocate for each token). Maybe you want you lexer to work with range of strings too, like File.byLine or File.byChunk (the latter require buffering if you split in the middle of a token...). But that may wait until a nice API for files, streams, etc. is found.
 If Token _does_ get templatized, then I believe that R will end up 
 being the original type in the case of the various string types or a range 
 which has slicing, but it'll be the result of takeExactly(range, len) for 
 everything else.

A range which has slicing doesn't necessarily return it's own type when opSlice is used, according to hasSlicing. I'm pretty sure parts of Phobos doesn't take that into account. However, the result of takeExactly will always be the good type, since it uses opSlice when it can, so you can just use that. Making a generic lexer that works with any forward range of dchar and returns a range of tokens without performing decoding of litteral seems to be a good first step.
 I just made str a string to begin with, since it was simple, and I was still 
 working on a lot of the initial design and how I was going to go about things. 
 If it makes more sense for it to be templated, then it'll be changed so that
 it's templated.

string may not be where you want to start, because it is a specialization for which you need to optimize utf-8 decoding. Also, you said in this thread that you only need to consider ansy characters in the lexer because non-ansy characters are only used in non-keyword identifier. That is not entirely true: EndOfLine defines 2 non-ansy characters, namely LINE SEPARATORand PARAGRAPH SEPARATOR. http://dlang.org/lex.html#EndOfLine Maybe they should be dropped, since other non-ansy whitespace are not supported. You may want the line count to be consistent with other programs. I don't know what text-processing programs usualy considers an end of line. -- Christophe
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:
 But if you read a source file which is encoded using UTF-16 you would
 need to re-encode that to store it in the "str" filed in your Token struct?

Currently, yes.
 If that's the case, wouldn't it be better to make Token a template to be
 able to store all Unicode encodings without re-encoding? Although I
 don't know how if that will complicate the rest of the lexer.

It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the orignal range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem. If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else. I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 8:24 PM, Jacob Carlborg <doob me.com> wrote:

 Don't know if that makes it clearer.

It does! Particularly this:
 All these encodings can store *every* character in the Unicode standard. What's
 different is how the characters are stored and how many bytes a single
 character takes to store in the string.

I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess. Thanks Jacob!
Aug 01 2012
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode (this may or may not be interesting reading material), e.g.: http://www.catch22.net/tuts/introduction-unicode http://icu-project.org/docs/papers/forms_of_unicode/ http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode I used to have more of these links but lost them. There's even a gigantic book about unicode (Unicode Demystified).
Aug 01 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:
 On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode

I will, but not yet. I've a few books on parsing and compilers to read before that. I just read http://www.joelonsoftware.com/articles/Unicode.html, though, and I'm a bit disappointed that char 7 (\u007) does not make my computer beep. I remember now having my computer beep on char 7 during the 80s when ASCII was the only thing that existed.
Aug 01 2012
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/1/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 Once you have time to learn some unicode, check out this page:
 http://unicode.org/cldr/utility/index.jsp

 I've found these tools to be incredibly useful.

Didn't know about that one, cool! Also might come in handy: http://people.w3.org/rishida/scripts/uniview/
Aug 01 2012
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:
 "Jonathan M Davis" , dans le message (digitalmars.D:173942), a =C3=A9=

 It may very well be a good idea to templatize Token on range type. =


 would be nice not to have to templatize it, but that may be the bes=


 route to go. The main question is whether str is _always_ a slice (=


 result of takeExactly) of the orignal range. I _think_ that it is, =


 I'd have to make sure of that. If it's not and can't be for whateve=


 reason, then that poses a problem.

It can't if it is a simple input range! Like a file read with most 'lazy' methods. Then you need either to transform the input range in =

 forward range using a range adapter that performs buffering, or perfo=

 your own buffering internally. You also have to decide how long the
 token will be valid (I believe if you want lexing to be blazing fast,=

 you don't want to allocate for each token).

My lexer specifically requires a forward range. The more that I deal wi= th input=20 ranges, the more that I'm convinced that they're nearly useless. If you= need=20 even _one_ character of lookahead, then an input range doesn't fly at a= ll, and=20 considered the performance gains in using slicing (or takeExactly), I j= ust=20 don't think that it makes sense to operate on an input range. Besides, = if=20 anyone wants full performance, they'll probably need to use one of the = built- in string types. Any range of dchar which needs to decode on the call t= o front=20 or popFront will take a serious performance hit. It'll work, but it's n= ot=20 really advisable if you need performance.
 Also, you said in this thread that you only need to consider ansy
 characters in the lexer because non-ansy characters are only used in
 non-keyword identifier. That is not entirely true: EndOfLine defines =

 non-ansy characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
   http://dlang.org/lex.html#EndOfLine
   Maybe they should be dropped, since other non-ansy whitespace are n=

 supported. You may want the line count to be consistent with other
 programs. I don't know what text-processing programs usualy considers=

 end of line.

I'm well aware of all fo that. Almost all of the lexer can operate enti= rely on=20 ASCII, with a few exceptions, and even in some of those cases, decoding= isn't=20 required (e.g. lineSep and paraSep can be dealt with as code units rath= er than=20 having to decode to compare agains them). The lexer that I'm writing wi= ll=20 follow the spec. And any lexer that wants to get into Phobos will need = to do=20 the same. So, stuff like lineSep and the end of file characters that th= e spec=20 has will be supported. - Jonathan M Davis
Aug 02 2012