digitalmars.D - Let's stop parser Hell

Denis Shelomovskij (10/10) Jul 05 2012 There are more and more projects requiring parsing D code (IDE plugins,

Roman D. Boiko (14/31) Jul 05 2012 Visual-D is not Boost-licensed (I think this would be possible to

Roman D. Boiko (4/4) Jul 05 2012 Forgot to add DDMD, which also has been forked and redesigned

Denis Shelomovskij (7/10) Jul 05 2012 Why? Even were they all almost equal we can select any one. The point is...

Roman D. Boiko (7/18) Jul 05 2012 Well, I guess we'd like to select one and not change it later,

Jacob Carlborg (4/5) Jul 05 2012 Haven't we already done that a couple of times.

Roman D. Boiko (5/8) Jul 05 2012 Well, we did something like that for DCT... but I doubt that it

Jacob Carlborg (32/36) Jul 05 2012 I don't know. Here is what I wrote for DCT:

Roman D. Boiko (7/9) Jul 05 2012 OK, fairly complete. Let it be the starting point. Why not put it

Jacob Carlborg (4/5) Jul 05 2012 Aren't you voting on your own project :)

Roman D. Boiko (3/6) Jul 05 2012 Well, I'm going to base parsing on Pegged, after tweaking it to

Philippe Sigaud (32/35) Jul 05 2012 As much as I'm flattered by that, my current impression is Pegged is ver...

Andrei Alexandrescu (12/37) Jul 05 2012 I'll be glad to buy for you any book you might feel you need for this.

Roman D. Boiko (4/21) Jul 05 2012 +10 (can I vote ten times?)
Philippe Sigaud (20/25) Jul 05 2012 That's nice of you, if a bit extreme for a public mailing list :)

Roman D. Boiko (35/71) Jul 05 2012 I have four, from 1 to 7 years old... Wouldn't call them a

Roman D. Boiko (15/28) Jul 05 2012 Check whether an identifier is a keyword using AA (dictionary)

Dmitry Olshansky (9/21) Jul 05 2012 I'd stress the fact that it's a fully unrolled & hard-coded

Philippe Sigaud (3/5) Jul 05 2012 Better not telling my wife. She's making noises about having a fourth.

Andrei Alexandrescu (3/5) Jul 05 2012 Former sux latter rox.
Nick Sabalausky (6/9) Jul 05 2012 FWIW, I've found this one to be pretty helpful:

Philippe Sigaud (5/14) Jul 06 2012 Thanks Nick, I'll add it to my 'will buy one day' list (which grows

Nick Sabalausky (8/26) Jul 06 2012 Yea, it's from an academic publisher unfortunately, so $$$. I got ahold

Roman D. Boiko (8/40) Jul 05 2012 I'm sure it can generate **much** faster code. I'm going to focus
Tobias Pankrath (6/10) Jul 05 2012 If you are interested in parsing, than I wouldn't recommend the

Philippe Sigaud (10/18) Jul 05 2012 rn

Roman D. Boiko (3/6) Jul 05 2012 http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf

deadalnix (2/12) Jul 05 2012 Why not program instead of doing bureaucracy ?

Roman D. Boiko (8/9) Jul 05 2012 To avoid programming things which are not needed or don't fit.

deadalnix (4/11) Jul 05 2012 Well you did the mistake to introduce an over complex mechanism in

Roman D. Boiko (2/6) Jul 05 2012 I'm fine with that, let you doubt.

Denis Shelomovskij (7/21) Jul 05 2012 Just want to mention I'm talking about parser only. Things like

Jonathan M Davis (28/39) Jul 05 2012 It's been discussed before, but there are some obvious use cases such as...

Roman D. Boiko (11/17) Jul 05 2012 Resume: everybody is welcome to join effort of translating DMD

Andrei Alexandrescu (10/26) Jul 05 2012 I'd really want to create a task force on this, it is of strategic

Dmitry Olshansky (11/41) Jul 05 2012 Count me as interested.

Roman D. Boiko (3/9) Jul 05 2012 IMO, first priority is to make code generated by Pegged work

Andrei Alexandrescu (4/13) Jul 05 2012 Well I'd say there are things like supporting large, realistic grammars

Andrei Alexandrescu (6/25) Jul 05 2012 Excellent point. Walter is 100% behind the general strategy and we can

Jacob Carlborg (8/12) Jul 05 2012 I don't see why you are against this. I'm seeing this as basically the
David Piepgrass (41/59) Jul 06 2012 Hi everybody! My name's David and I've been dreaming since around

H. S. Teoh (17/23) Jul 06 2012 Frankly, I don't know what DCT stands for (anyone?), but AA stands for

Roman D. Boiko (7/8) Jul 07 2012 That's a working (not final) name of my project, a.k.a. The D

Roman D. Boiko (6/16) Jul 07 2012 I decided to ask a question about this:

Dmitry Olshansky (8/21) Jul 07 2012 I can tell you that they are not slower then one another in principle.

Roman D. Boiko (5/11) Jul 07 2012 Exactly, that is also my point. I think that Pegged can be

Roman D. Boiko (6/10) Jul 07 2012 Hmm... found an interesting article:

Dmitry Olshansky (5/15) Jul 07 2012 Yup, LL(*) is my favorite so far. As for debugging and error recovery

Roman D. Boiko (6/8) Jul 07 2012 It's interesting, that I also wanted to use DFA for non-terminals
Andrei Alexandrescu (5/19) Jul 07 2012 That's Terence Parr's discovery, right? I've always liked ANTLR, so if

Dmitry Olshansky (5/25) Jul 07 2012 I believe that it may need very few _semantic_ predicates. But there are...
Tobias Pankrath (1/5) Jul 08 2012 We should also consider using GLR if LL(*) doesn't work.

Dmitry Olshansky (4/9) Jul 08 2012 GLR ... are you serious? It still does parsing in n^3 if I'm not mistake...

Tobias Pankrath (9/29) Jul 08 2012 Since GLR can parse any CFG the upper bound is in O(n^3), but the

Tobias Pankrath (1/2) Jul 08 2012 http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html
Dmitry Olshansky (6/32) Jul 08 2012 The problem is that say LL(*) can be easily tweaked in place to run

Tobias Pankrath (11/15) Jul 08 2012 Did you look at Elkhound? According to the tutorial you can use

Roman D. Boiko (3/14) Jul 08 2012 http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some

Tobias Pankrath (7/13) Jul 07 2012 To be picky here:

Roman D. Boiko (3/7) Jul 07 2012 Also any PEG defines a language with linear-time parsing. Some

Tobias Pankrath (2/10) Jul 07 2012 Interesting, I thought that PEG ⊂ CFG holds.

Roman D. Boiko (5/6) Jul 07 2012 See section 3.4 of http://bford.info/pub/lang/peg.pdf

Philippe Sigaud (2/9) Jul 07 2012 Not that anyone is interested in such languages, but still :)

deadalnix (2/26) Jul 07 2012 D isn't 100% CFG. But it is close.

travert phare.normalesup.org (Christophe Travert) (2/3) Jul 09 2012 What makes D fail to be a CFG?

deadalnix (4/7) Jul 10 2012 type[something] <= something can be a type or an expression.

Timon Gehr (3/11) Jul 10 2012 'something' is context-free:

travert phare.normalesup.org (Christophe Travert) (5/19) Jul 11 2012 Do you have to know if something is a type or an expression for a simple...

Timon Gehr (4/24) Jul 11 2012 No. Some token sequences can be both a type and an expression based on

David Piepgrass (4/14) Jul 11 2012 I don't see how "type | expression" is context free. The input

Timon Gehr (8/24) Jul 11 2012 'Context free' means that all the production rule left hand sides do

Philippe Sigaud (9/12) Jul 23 2012 This little post to say a big 'Thanks' to Tobias. I'm reading Grune's

Andrei Alexandrescu (3/15) Jul 23 2012 When will we see the fruit of that all in the form of D libraries?
Roman D. Boiko (2/15) Jul 23 2012 Yes, I'm also reading it and find it amusingly high-quality.
Tobias Pankrath (2/15) Jul 23 2012 I'm glad that you like it.

Andrei Alexandrescu (3/7) Jul 07 2012 Isn't it the case that PEG require more memory for certain grammars?

Roman D. Boiko (25/26) Jul 07 2012 So far it looks like LALR parsers may have lower constant factors

Andrei Alexandrescu (7/33) Jul 07 2012 Isn't also the fact that lexing and parsing are integrated? With

Dmitry Olshansky (6/46) Jul 07 2012 I'll have to point out that the whole point about integrated lexing is

Andrei Alexandrescu (3/6) Jul 07 2012 Interesting. I'll have to ask for more details on why.

Roman D. Boiko (7/15) Jul 07 2012 +1. Personally I associate PEG with a parser that includes a

Dmitry Olshansky (9/23) Jul 07 2012 Well put yet it's still lexer. And most horribly it's not optimized to

Roman D. Boiko (12/16) Jul 07 2012 I think that your conclusions are about statistical evidences of

Dmitry Olshansky (9/23) Jul 07 2012 You may misunderstood me as well, the point is still the same:

Roman D. Boiko (5/18) Jul 07 2012 But PEG itself doesn't require backtracking parsing, does it? So

Dmitry Olshansky (15/31) Jul 07 2012 It does. Specifically the algorithm is to try option A, try option B,

Roman D. Boiko (18/32) Jul 07 2012 There is no algorithm, only __grammar__. It specifies that option

Dmitry Olshansky (23/53) Jul 07 2012 That would not always work, but yes that's what we should do IMHO.

Roman D. Boiko (18/52) Jul 07 2012 I didn't do a deeper research on this, just shared my current

Dmitry Olshansky (29/34) Jul 07 2012 I think I told that before but anyway the main point is:

Roman D. Boiko (5/17) Jul 07 2012 This is the implementation I have in mind as the only sane way to

deadalnix (2/27) Jul 07 2012 Many usages only need a lexer.

Jonathan M Davis (22/26) Jul 06 2012 I'm not even vaguely convinced that having a lexer/parser specifically f...

Jacob Carlborg (16/26) Jul 07 2012 I think the whole point of having a compiler as a library is that the

deadalnix (3/32) Jul 07 2012 I tried that. This is almost impossible, dmd's parser and AST are very

Roman D. Boiko (2/4) Jul 07 2012 +1
Jacob Carlborg (4/6) Jul 08 2012 I suspected that.

Jonathan M Davis (54/83) Jul 07 2012 There are multiple issues here. The one that Andrei is raising is the fa...

Daniel Murphy (7/15) Jul 07 2012 A more realistic alternative is to modify dmd to optionally (at build/li...
Jacob Carlborg (16/68) Jul 08 2012 That should be true for D as well when the spec is complete.

Jonathan M Davis (20/44) Jul 08 2012 True, but Andrei is basically arguing that porting dmd's lexer and parse...

Roman D. Boiko (4/9) Jul 08 2012 Can you provide a *specific* example of performance optimization

Jonathan M Davis (23/33) Jul 08 2012 It's been too long since I was actively working on parsers to give any

Roman D. Boiko (13/25) Jul 08 2012 My aim is to find out any potential bottlenecks and ensure that

Roman D. Boiko (1/3) Jul 08 2012 s/sews/seams
David Piepgrass (49/71) Jul 08 2012 I'm convinced that the output of a parser generator (PG) can be

Roman D. Boiko (4/4) Jul 08 2012 David, I would suggest you to create a proof-of-concept prototype

Jacob Carlborg (5/8) Jul 08 2012 I'm not expert in this field but I've heard that it's harder to get good...

Roman D. Boiko (2/4) Jul 09 2012 Good point!

Jacob Carlborg (4/8) Jul 08 2012 It don't feel like that, didn't we get lambdas or UFCS in the last relea...

Jonathan M Davis (8/15) Jul 08 2012 lambdas are a bit older than that IIRC, and I don't think that UFCS actu...

Jacob Carlborg (6/12) Jul 09 2012 I'm pretty sure UFCS affects lexing or parsing. How else would this be

Jonathan M Davis (16/28) Jul 09 2012 That definitely wouldn't affect lexing, because it doesn't affect the to...

Daniel Murphy (3/11) Jul 09 2012 Not true. This used to be lexed as '4.f' 'oo'. (I think)

Timon Gehr (2/11) Jul 09 2012 Those changes are trivial to incorporate into a well written parser.

Denis Shelomovskij (7/12) Jul 05 2012 I didn't want for someone to code anything at all! I on the contrary

Jonathan M Davis (23/33) Jul 05 2012 eone
Dmitry Olshansky (5/16) Jul 05 2012 It doesn't work like that in OpenSource. No matter what you do people

deadalnix (4/25) Jul 05 2012 You never looked at dmd frontend soure code don't you ? It is a horror

Jonathan M Davis (19/47) Jul 05 2012 files

Chad J (9/10) Jul 07 2012 Interesting, I thought you were hand-writing this stuff.

Roman D. Boiko (29/41) Jul 07 2012 There are many possible customization point which would make it a

Philippe Sigaud (15/29) Jul 07 2012 I added dstrings because

Roman D. Boiko (9/28) Jul 07 2012 I propose to switch code to use S if(isSomeString!S) everywhere.

Chad J (8/22) Jul 07 2012 Yeah, it's good to hear this notion reinforced. I had this suspicion

Roman D. Boiko (6/14) Jul 07 2012 At the very least, we could use DFA instead of backtracking where

Chad J (3/15) Jul 07 2012 These were my thoughts exactly, although somewhat unsubstantiated in my

Dmitry Olshansky (5/21) Jul 07 2012 Yup the nice thing about ANTLR is the usage of DFA. e.g. it's used for

Andrei Alexandrescu (4/16) Jul 07 2012 Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the...

Roman D. Boiko (13/15) Jul 07 2012 Since I didn't understand your question I assume that my

David Piepgrass (9/17) Jul 07 2012 ANTLR 3 doesn't use a DFA unless it needs to. If unlimited
Andrei Alexandrescu (4/13) Jul 07 2012 This should help:

Roman D. Boiko (6/7) Jul 07 2012 Thanks, nice article.

Dmitry Olshansky (6/30) Jul 07 2012 Another idea that I've never realized is to add operator precedence

Roman D. Boiko (5/9) Jul 07 2012 But that's already available by explicitly defining expression

Dmitry Olshansky (5/13) Jul 07 2012 Yes, I've meant something completely different ;)
Timon Gehr (2/10) Jul 07 2012 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climb...

Roman D. Boiko (6/7) Jul 07 2012 OK, at least I didn't misunderstand. So my question was whether

Roman D. Boiko (10/17) Jul 07 2012 OK, now I agree, the need to perform several nested calls in

Jonathan M Davis (5/20) Jul 07 2012 I don't know about this particular case, because I haven't really looked...
Roman D. Boiko (16/32) Jul 10 2012 One disadvantage of Packrat parsers I mentioned was problematic

Philippe Sigaud (7/15) Jul 10 2012 IIRC, that's what I encoded in Pegged (admittedly limited) error

Timon Gehr (2/17) Jul 10 2012 FWIW, this is what most HTML parsers are doing.

Philippe Sigaud (4/9) Jul 10 2012 Ah, right. I can get it for HTML/XML. JSON also, maybe.

Roman D. Boiko (5/20) Jul 10 2012 It would still generate errors. But would enable a lot of useful

Jonathan M Davis (6/30) Jul 10 2012 Which is horrible. You pretty much have to with HTML because of the horr...

Roman D. Boiko (4/14) Jul 10 2012 Not having control over parser or source code causes problems.
Jacob Carlborg (15/19) Jul 10 2012 I'm not sure but I think he was referring to a kind of error reporting

Jonathan M Davis (6/26) Jul 10 2012 Well, giving an error, continuing to parse, and giving a partial result ...

Timon Gehr (3/28) Jul 10 2012 This is actually precisely what many of the more recent curly-brace-
Jacob Carlborg (11/15) Jul 10 2012 No, that is _not_ what happens with HTML. With HTML, the browser _do

Jonathan M Davis (11/26) Jul 10 2012 ??? I guess that I wasn't clear. I mean that with HTML, it ignores error...

Jacob Carlborg (4/13) Jul 11 2012 Ok, I see. It seems we're meaning the same thing.

Dmitry Olshansky (6/35) Jul 10 2012 BTW clang does this and even more of stuff on semantic level. It's known...

David Piepgrass (64/86) Jul 07 2012 Interesting. After trying to use ANTLR-C# several years back, I

Chad J (91/165) Jul 07 2012 I was thinking the same thing.

Roman D. Boiko (13/30) Jul 07 2012 Funny, we've discussed an idea to introduce some hybrid of regex

Chad J (64/77) Jul 07 2012 From some of my earlier scribblings:

Roman D. Boiko (19/31) Jul 07 2012 This would cause wasting space (probably a lot). Definitely it

Chad J (13/42) Jul 07 2012 I see what you mean.

Roman D. Boiko (15/16) Jul 07 2012 And I think that I expressed myself badly. Let me rephrase.

Chad J (2/17) Jul 07 2012 Yep. Makes sense.

David Piepgrass (79/116) Jul 07 2012 I'm glad to hear you like the tree-parsing approach, Chad,

Roman D. Boiko (11/20) Jul 07 2012 I bet that after stage 2 you would have performed almost the same

David Piepgrass (21/44) Jul 07 2012 Hmm, you've got a good point there, although simple

David Piepgrass (18/23) Jul 07 2012 That reminds me, I forgot to write a another advantage I expected

Roman D. Boiko (10/20) Jul 07 2012 There could be multiple errors that compensate each other and

David Piepgrass (10/30) Jul 07 2012 This is all true, but forgetting a brace commonly results in a

Roman D. Boiko (4/13) Jul 07 2012 So you simply admit that error recovery is difficult to

David Piepgrass (4/17) Jul 07 2012 I'm not seeing any tremendous error-handling difficulty in my

Roman D. Boiko (55/58) Jul 07 2012 (I will use the word "you" to refer to some abstract person or

Roman D. Boiko (15/21) Jul 07 2012 Exactly. Semantic analysis is what would benefit from that, as

Chad J (64/97) Jul 07 2012 Yes and yes.

David Piepgrass (32/61) Jul 07 2012 Well, for templates, in general, it would be necessary to

Timon Gehr (16/159) Jul 07 2012 I'd suggest:

Timon Gehr (3/22) Jul 07 2012 Also, I'd like to point out that the transformation is not actually
Chad J (3/21) Jul 07 2012 I wish.

Dmitry Olshansky (6/9) Jul 05 2012 Then do it. It's all about having something so obviously good, fast and

Roman D. Boiko (2/7) Jul 05 2012 Would you help to make Pegged faster?

Dmitry Olshansky (5/11) Jul 05 2012 Well why not.

Roman D. Boiko (5/8) Jul 05 2012 I bet that after you finish with GSOC optimizing Pegged will not
Andrei Alexandrescu (5/15) Jul 05 2012 Good call :o).

Dmitry Olshansky (5/20) Jul 05 2012 It's great that you are not opposed to adjusting scope of project.

Caligo (8/18) Jul 05 2012 Is the actual grammar available somewhere? The online language

Roman D. Boiko (4/7) Jul 05 2012 In PEG format, yes (not necessarily correct):

Kai Meyer (27/34) Jul 31 2012 I know I'm a little late coming into this conversation. This seems like

Tobias Pankrath (15/28) Jul 31 2012 Yes. To make this statement more precise: We need a lexer that

Philippe Sigaud (26/45) Jul 31 2012 I think we need a tokenizer generator and a parser generator. They can

Tobias Pankrath (4/14) Jul 31 2012 I have no numbers. It's a statement that I found in some (1-3)
Dmitry Olshansky (20/61) Jul 31 2012 Parser can use constant IDs for nested tables, IDs point to string table...

Philippe Sigaud (18/43) Jul 31 2012 The latter, I get. The former, not so much.

Timon Gehr (5/9) Jul 31 2012 Ddoc is typically not required. By default it should be treated as

Jonathan M Davis (6/9) Jul 31 2012 That was how I was intending to deal with ddoc. It's just a nested block...
Philippe Sigaud (11/19) Jul 31 2012 OK. Same for standard comment and doc comments?
Dmitry Olshansky (10/20) Jul 31 2012 That's why there is "allows" (not "required") the reason in general is

Dmitry Olshansky (12/61) Jul 31 2012 Well while lookahead will help, there are simpler ways. e.g.

Philippe Sigaud (5/18) Jul 31 2012 In the FORTRAN case, you indeed *need* to re-lex the stuff after IF,

Dmitry Olshansky (8/16) Jul 31 2012 Okay. Say lexer maps all unique strings that are not keywords to some ID...

Philippe Sigaud (3/21) Jul 31 2012 Ah, that! This I know, I misunderstood your initial post. The symbol

Jonathan M Davis (9/11) Jul 31 2012 I'm actually quite far along with one now - one which is specifically wr...

Jacob Carlborg (4/12) Aug 01 2012 That's awesome. I really looking forward to this. Keep up the good work.
Jacob Carlborg (4/11) Aug 01 2012 BTW, do you have to code online somewhere?

Jonathan M Davis (3/13) Aug 01 2012 No, because I'm still in the middle of working on it.

Denis Shelomovskij (5/16) Aug 01 2012 Good. Will wait for announce.

Philippe Sigaud (10/21) Jul 31 2012 Could you please describe the kind of token it produces?

Dmitry Olshansky (27/63) Jul 31 2012 Okay. Critics go as follows:

Kai Meyer (15/86) Aug 06 2012 I cut my teeth on perl, so everything gets compared to perl in my mind.

Philippe Sigaud (5/18) Aug 06 2012 D does have compiled regex, it's called ctRegex (meaning compile-time re...

Dmitry Olshansky (26/48) Aug 06 2012 Of course regex is first compiled to bytecode (same thing as "compile"

Philippe Sigaud (7/20) Aug 06 2012 Btw, I wanted to ask you that for a long time: what do you mean by

Dmitry Olshansky (26/46) Aug 06 2012 by this:

Dmitry Olshansky (4/7) Aug 06 2012 Compiled native code is faster. Perl is slower.

Philippe Sigaud (3/13) Aug 06 2012 Sorry, I meant: what second version? How do you distinguish between

David Nadlinger (4/7) Aug 06 2012 As far as I know, ctRegex _always_ produces machine code.

Philippe Sigaud (6/8) Aug 06 2012 That's what I had in mind too, but Dmitry seemed to say things are

Dmitry Olshansky (17/26) Aug 06 2012 Yeah, sorry for confusion - there are only 2 ways to do the job:

Jonathan M Davis (69/94) Jul 31 2012 Yeah. Once I started on it, I made a lot of progress really quickly. The...

travert phare.normalesup.org (Christophe Travert) (16/30) Aug 01 2012 The occurence of tabWidth surprises me.

Jonathan M Davis (42/66) Aug 01 2012 col counts code points. tabWidth is the number of code points that '\t'=

Jacob Carlborg (8/23) Aug 01 2012 What about the end/length of a token? Token.str.length would give the

Jonathan M Davis (14/42) Aug 01 2012 I'm not sure. I don't think so. It doesn't really keep track of code poi...

Jacob Carlborg (5/11) Aug 01 2012 Doing a syntax highlighter and calculate the length (walkLength) for

Jonathan M Davis (16/41) Jul 31 2012 blockComment, /// $(D /* */)
Philippe Sigaud (43/98) Jul 31 2012 My fear (having encoded the D grammar once) is that 99% of the code

Jacob Carlborg (5/9) Aug 01 2012 Some IDE's do a more advanced syntax highlighting based on the semantic

Jonathan M Davis (64/121) Jul 31 2012 I'm not using regexes at all. It's using string mixins to reduce code

Jacob Carlborg (8/13) Aug 01 2012 That's good idea. Most code can be treated as ASCII (I assume most

Jonathan M Davis (10/23) Aug 01 2012 What's of particular importance is the fact that _all_ of the language

Philippe Sigaud (17/52) Jul 31 2012 OK. It'll more difficult to genericize, then, but that's not your
Jonathan M Davis (40/68) Jul 31 2012 It was never intended to be even vaguely generic. It's targeting D

Jacob Carlborg (4/8) Aug 01 2012 :)

Philippe Sigaud (20/40) Aug 01 2012 That's exactly what I had in mind. Anyway, we need a D lexer. We also

Jacob Carlborg (11/26) Aug 01 2012 I'm not quite sure how it works either. But I'm thinking like this:

Jonathan M Davis (70/82) Aug 01 2012 It contains unicode. The lexer is lexing whatever encoding the source i=
Philippe Sigaud (9/29) Aug 01 2012 do you mean I can do:

Jacob Carlborg (16/27) Aug 01 2012 Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these

Philippe Sigaud (9/13) Aug 01 2012 (emphasis mine)

Jacob Carlborg (4/10) Aug 01 2012 You're welcome :)

Andrej Mitrovic (10/15) Aug 01 2012 I think many people viewed Unicode this way at first. But there is a

Jacob Carlborg (5/14) Aug 01 2012 This is a good read as well:

Philippe Sigaud (8/17) Aug 01 2012 I will, but not yet. I've a few books on parsing and compilers to read

Dmitry Olshansky (6/24) Aug 01 2012 Once you have time to learn some unicode, check out this page:

Andrej Mitrovic (3/6) Aug 01 2012 Didn't know about that one, cool! Also might come in handy:

Jonathan M Davis (8/13) Aug 01 2012 I guess that that would explain why you didn't understand what I was say...
Philippe Sigaud (5/17) Aug 01 2012 I knew about 1-2-4 bytes schemes and such. But, somehow, for me,

Jonathan M Davis (26/42) Aug 01 2012 D source text can be in one of the following formats:

Jacob Carlborg (8/31) Aug 01 2012 But if you read a source file which is encoded using UTF-16 you would

Jonathan M Davis (16/21) Aug 01 2012 It may very well be a good idea to templatize Token on range type. It wo...

Jacob Carlborg (33/46) Aug 01 2012 To me a string type would be enough. I don't need support for ranges.
travert phare.normalesup.org (Christophe Travert) (32/46) Aug 02 2012 It can't if it is a simple input range! Like a file read with most

Jonathan M Davis (43/66) Aug 02 2012 It

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

There are more and more projects requiring parsing D code (IDE plugins, 
DCT and dfmt now).

We definitely need a _standard_ fast D parser (suitable as tokenizer). 
We already have a good one at least in Visual D. Maybe dmd's parser is 
faster. If so, it can be (easily?) rewritten in D. We also already have 


What about to get one and call it standard?

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 12:11:33 UTC, Denis Shelomovskij 
wrote:
 There are more and more projects requiring parsing D code (IDE 
 plugins, DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as 
 tokenizer). We already have a good one at least in Visual D. 
 Maybe dmd's parser is faster. If so, it can be (easily?) 

 from Mono-D etc.).

 What about to get one and call it standard?

Visual-D is not Boost-licensed (I think this would be possible to 
change)

Pegged may eventually become standard, if it will be performance 
optimized and a bit more customizable
Dscanner(https://github.com/Hackerpilot/Dscanner) from Brian 
Schott is pretty good, too.
SDC is another nice option
DIL (http://code.google.com/p/dil/) is very nice but GPL

I plan to try using Pegged inside my DCT project. Probably that 
will require huge modifications though...

Some more links from Pegged readme:

 Hisayuki Mima's CTPG(https://github.com/youkei/ctpg), very 
 similar, also done in D. Have a look!
 Nick Sabalausky's Goldie 
 (http://www.dsource.org/projects/goldie).
 Benjamin Shropshire's dparser 
 (http://dsource.org/projects/scrapple/browser/trunk/dparser).

 Martin Nowak put these gists on the D newsgroup:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

Forgot to add DDMD, which also has been forked and redesigned 
recently by someone (thus two more different compiler frontends).

But overall I doubt that any project could become standard very 
soon.

Jul 05 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

05.07.2012 16:30, Roman D. Boiko пишет:
 Forgot to add DDMD, which also has been forked and redesigned recently
 by someone (thus two more different compiler frontends).

 But overall I doubt that any project could become standard very soon.

Why? Even were they all almost equal we can select any one. The point is 
to select something (anything) to guide a coder who want to do something 
with this task to a good/up-to-date/supported parser.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 12:53:02 UTC, Denis Shelomovskij 
wrote:
 05.07.2012 16:30, Roman D. Boiko пишет:
 Forgot to add DDMD, which also has been forked and redesigned 
 recently
 by someone (thus two more different compiler frontends).

 But overall I doubt that any project could become standard 
 very soon.

 Why? Even were they all almost equal we can select any one. The 
 point is to select something (anything) to guide a coder who 
 want to do something with this task to a 
 good/up-to-date/supported parser.

Well, I guess we'd like to select one and not change it later, 
don't we?
I'm not sure we can decide which is the best currently and will 
stay the best in the future for all major use cases.

Anyway I propose to enumerate major use cases first.

Jul 05 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-05 15:08, Roman D. Boiko wrote:

 Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

-- 
/Jacob Carlborg

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:

 Anyway I propose to enumerate major use cases first.

 Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it 
would fit general needs.

If we had, why haven't they been analyzed, classified, discussed, 
etc.? Or have they?

Jul 05 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-05 18:04, Roman D. Boiko wrote:

 Well, we did something like that for DCT... but I doubt that it would
 fit general needs.

Why wouldn't it.

 If we had, why haven't they been analyzed, classified, discussed, etc.?
 Or have they?

I don't know. Here is what I wrote for DCT:

* IDE integration
* Refactoring tool
* Static analysis
* Compiler
* Doc generating
* Build tool
* DI generating

In general, use cases that can span several compile phases, i.e. lexing, 
parsing, semantic analysis and so on. Some of these use cases can be 
broken in to several new use cases at a lower level. Some examples:

IDE integration:

* Syntax highlighting
* Code completion
* Showing lex, syntax and semantic errors
* Formatter

Refactoring:

* Cross-referencing symbols
* Renaming of symbols
* Extract local variable to instance variable
* Extract variable to function/method
* Extract a piece of code/method into a new class

Build tool:

* Tracking module dependencies

Doc generating:

* Associate a declaration and its documentation

Original post:

http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

-- 
/Jacob Carlborg

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
 Original post:

 http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

OK, fairly complete. Let it be the starting point. Why not put it 
on some wiki, then add some more, discuss, vote, etc.? Meanwhile 
create a matrix of features implemented in each of interesting 
standardization candidates. And pick up one of them as "standard" 
or "reference" parser.

My vote would be for Pegged, I guess.

Jul 05 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.

Aren't you voting on your own project :)

-- 
/Jacob Carlborg

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 16:38:27 UTC, Jacob Carlborg wrote:
 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.

 Aren't you voting on your own project :)

Well, I'm going to base parsing on Pegged, after tweaking it to
my needs.

Jul 05 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.



As much as I'm flattered by that, my current impression is Pegged is very
far from being performant.

As a proof-of-concept that, in D,  it's possible to parse a string and
create a parse tree at compile-time and then generate code from this, it's
also successful. Go D!

As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn
by coding. Hey, I just received the Dragon Book (International Edition),
I'm sure I'll learn many things in there.

So, if anyone is willing to change the code generated by Pegged, I'm game.
The results you showed me on keyword parsing are very interesting!

But, my impression is that the need for a 'D'-only parser and lexer is far
greater and much more imediate that the need for a parser generator. All
the reasons advanced upthread ask for a D parser, not a generic generator.
Parser generators are for those of us interested in having DSLs or macros
in D.
So Pegged or any other generator should *not* get the community focus right
now.

My plan would be as follow:

1- assemble a group of people knowing parsing. I don't think I'm exactly
knowledgeable, but I'm ready to be part of such a group.
2- create a github process.
3- translate an existing parser / adapt a D parser for Phobos. I'm ready to
be part of this (I'm sure I'll learn in the process)
4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
5- submit it to Phobos and have it adopted.

much later:
6- see the way the parser code is organized and tweak a code generator
(Pegged is a possibility if recursive parsing is OK) to produce an
equivalent code when fed the D grammar.

side-effect: maybe a std.tree or std.collection.tree to deal with trees in
a generic way.

Philippe

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 2:16 PM, Philippe Sigaud wrote:
 As much as I'm flattered by that, my current impression is Pegged is
 very far from being performant.

 As a proof-of-concept that, in D,  it's possible to parse a string and
 create a parse tree at compile-time and then generate code from this,
 it's also successful. Go D!

 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I
 learn by coding. Hey, I just received the Dragon Book (International
 Edition), I'm sure I'll learn many things in there.

I'll be glad to buy for you any book you might feel you need for this. 
Again, there are few things more important for D right now than 
exploiting its unmatched-by-competition features to great ends. I don't 
want the lack of educational material to hold you down. Please continue 
working on this and let me know of what you need.

 So, if anyone is willing to change the code generated by Pegged, I'm
 game. The results you showed me on keyword parsing are very interesting!

 But, my impression is that the need for a 'D'-only parser and lexer is
 far greater and much more imediate that the need for a parser generator.

I very strongly disagree. This is our chance to get things right instead 
of having a limited product that destroys competition (much like lex and 
yacc have been for years in the parser generator field).

 All the reasons advanced upthread ask for a D parser, not a generic
 generator. Parser generators are for those of us interested in having
 DSLs or macros in D.
 So Pegged or any other generator should *not* get the community focus
 right now.

Pegged should be the focus.

 My plan would be as follow:

 1- assemble a group of people knowing parsing. I don't think I'm exactly
 knowledgeable, but I'm ready to be part of such a group.
 2- create a github process.
 3- translate an existing parser / adapt a D parser for Phobos. I'm ready
 to be part of this (I'm sure I'll learn in the process)
 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
 5- submit it to Phobos and have it adopted.

Sounds good. Replace 1-2 years with 1-2 months.



Andrei

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 18:28:21 UTC, Andrei Alexandrescu 
wrote:
 On 7/5/12 2:16 PM, Philippe Sigaud wrote:
 So Pegged or any other generator should *not* get the 
 community focus right now.

 Pegged should be the focus.

+10 (can I vote ten times?)

 My plan would be as follow:

 1- assemble a group of people knowing parsing. I don't think 
 I'm exactly
 knowledgeable, but I'm ready to be part of such a group.
 2- create a github process.
 3- translate an existing parser / adapt a D parser for Phobos. 
 I'm ready
 to be part of this (I'm sure I'll learn in the process)
 4- spend 1-2 years fighting over LR parsing and such :) (Just 
 kidding)
 5- submit it to Phobos and have it adopted.

 Sounds good. Replace 1-2 years with 1-2 months.

Well, probably with 3-4 months... :)

Jul 05 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I'll be glad to buy for you any book you might feel you need for this.
 Again, there are few things more important for D right now than exploiting
 its unmatched-by-competition features to great ends. I don't want the lack
 of educational material to hold you down. Please continue working on this
 and let me know of what you need.

That's nice of you, if a bit extreme for a public mailing list :)
Andrei, money is no problem :)
Interest in the field of parsing is no problem.
Interest in D future is no problem.
Having a demanding job, and three children, is a problem. No, scratch
that, you know what I mean.

But hey, Roman is doing interesting things on keyword parsing right
now, and changing the parser generated by Pegged is not difficult. We
will see where this thread lead. (Roman, you should send your results
here, because I'm still taken aback by the built-in AA speed compared
to linear array look-up for 100 keywords).

As Dmitry said, we might hit a CTFE wall: memory consumption,
computation speed, ...
(*channelling Andrei*: then we will correct whatever makes CTFE a
problem. Right)

Philippe

(Hesitating between 'The Art of the Metaobject Protocol' and
'Compilers, Techniques and Tools', right now)

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 19:54:39 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I'll be glad to buy for you any book you might feel you need 
 for this.
 Again, there are few things more important for D right now 
 than exploiting
 its unmatched-by-competition features to great ends. I don't 
 want the lack
 of educational material to hold you down. Please continue 
 working on this
 and let me know of what you need.

 That's nice of you, if a bit extreme for a public mailing list 
 :)
 Andrei, money is no problem :)
 Interest in the field of parsing is no problem.
 Interest in D future is no problem.
 Having a demanding job, and three children, is a problem. No, 
 scratch
 that, you know what I mean.

I have four, from 1 to 7 years old... Wouldn't call them a 
problem, though :)))

 But hey, Roman is doing interesting things on keyword parsing 
 right
 now, and changing the parser generated by Pegged is not 
 difficult. We
 will see where this thread lead. (Roman, you should send your 
 results
 here, because I'm still taken aback by the built-in AA speed 
 compared
 to linear array look-up for 100 keywords).

Well, I wouldn't call those "results" yet. Just some benchmarks. 
Here they are:

isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / 
lookup].
isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / 
lookup].
isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 
[nanosec / lookup].
isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec 
/ lookup].
isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 
[nanosec / lookup].
isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / 
lookup].
isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 
[nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.

isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / 
lookup].
isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / 
lookup].
isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 
[nanosec / lookup].
isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec 
/ lookup].
isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 
[nanosec / lookup].
isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / 
lookup].
isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 
[nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.

 As Dmitry said, we might hit a CTFE wall: memory consumption,
 computation speed, ...
 (*channelling Andrei*: then we will correct whatever makes CTFE 
 a
 problem. Right)

 Philippe

 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

 isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / 
 lookup].

This one calculates a sum of all identifier code units. Included 
for comparison.

 isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].

Check whether an identifier is a keyword using AA (dictionary) 
lookup.

 isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 
 [ns / lookup].

Switch by identifier length at the top, then recursively switch 
by each character. Clearly a winner, but I think it can be 
improved further.

 isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [ns / 
 lookup].

Binary search in an ordered array of keywords.

 isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [ns / 
 lookup].

Ditto, search is linear.

 isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [ns / lookup].

Lookup a keyword in a trie, created by Dmitry. This will be 
improved.

 isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [ns 
 / lookup].

The same, but only determines whether an identifier is a keyword, 
not which exactly.

 Total: 104183 identifiers + 17488 keywords.

Analyzed the largest Phobos file (DateTime? I didn't check the 
name.) Results are consistent for other files, too.

Jul 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Jul-12 00:16, Roman D. Boiko wrote:
 isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / lookup].

 This one calculates a sum of all identifier code units. Included for
 comparison.

 isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].

 Check whether an identifier is a keyword using AA (dictionary) lookup.

 isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [ns /
 lookup].

 Switch by identifier length at the top, then recursively switch by each
 character. Clearly a winner, but I think it can be improved further.

I'd stress the fact that it's a fully unrolled & hard-coded
switch that takes a lot of pages (~72Kbytes). It's easily be perfect.
Sorry, couldn't resist ;)
And I'm not sure how much it could be optimized maybe some measly 10-30%.

 Total: 104183 identifiers + 17488 keywords.

 Analyzed the largest Phobos file (DateTime? I didn't check the name.)
 Results are consistent for other files, too.

It is std.datetime as I've been running this benchmark against it and 
seen the same numbers :)

-- 
Dmitry Olshansky

Jul 05 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Jul 5, 2012 at 10:02 PM, Roman D. Boiko <rb d-coding.com>
wrote (on children)
 I have four, from 1 to 7 years old... Wouldn't call them a problem, though
 :)))

Better not telling my wife. She's making noises about having a fourth.

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 3:54 PM, Philippe Sigaud wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

Former sux latter rox.

Andrei

Jul 05 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Thu, 5 Jul 2012 21:54:29 +0200
Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

FWIW, I've found this one to be pretty helpful:

http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page

It's the closest one I've seen to being more for programmers than
mathematicians.

Jul 05 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky
<SeeWebsiteToContactMe semitwist.com> wrote:
 On Thu, 5 Jul 2012 21:54:29 +0200
 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

 FWIW, I've found this one to be pretty helpful:

 http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page

 It's the closest one I've seen to being more for programmers than
 mathematicians.

Thanks Nick, I'll add it to my 'will buy one day' list (which grows
quite rapidly)
$122? Ouch.

Jul 06 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Fri, 6 Jul 2012 09:50:26 +0200
Philippe Sigaud <philippe.sigaud gmail.com> wrote:

 On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky
 <SeeWebsiteToContactMe semitwist.com> wrote:
 On Thu, 5 Jul 2012 21:54:29 +0200
 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 (Hesitating between 'The Art of the Metaobject Protocol' and
 'Compilers, Techniques and Tools', right now)

 FWIW, I've found this one to be pretty helpful:

 http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page

 It's the closest one I've seen to being more for programmers than
 mathematicians.

 
 Thanks Nick, I'll add it to my 'will buy one day' list (which grows
 quite rapidly)
 $122? Ouch.

Yea, it's from an academic publisher unfortunately, so $$$. I got ahold
of it through the local library systems (All of Ohio's college libraries
are connected in a system called OhioLINK, and books from it can be
put on hold and shipped to most of the public libraries - or at least
most of the public libraries around the Cleveland area.) Maybe there's
something similar out your way?

Jul 06 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 On 2012-07-05 18:32, Roman D. Boiko wrote:

 My vote would be for Pegged, I guess.



 As much as I'm flattered by that, my current impression is 
 Pegged is very
 far from being performant.

 As a proof-of-concept that, in D,  it's possible to parse a 
 string and
 create a parse tree at compile-time and then generate code from 
 this, it's
 also successful. Go D!

 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, 
 as I learn
 by coding. Hey, I just received the Dragon Book (International 
 Edition),
 I'm sure I'll learn many things in there.

 So, if anyone is willing to change the code generated by 
 Pegged, I'm game.
 The results you showed me on keyword parsing are very 
 interesting!

 But, my impression is that the need for a 'D'-only parser and 
 lexer is far
 greater and much more imediate that the need for a parser 
 generator. All
 the reasons advanced upthread ask for a D parser, not a generic 
 generator.
 Parser generators are for those of us interested in having DSLs 
 or macros
 in D.
 So Pegged or any other generator should *not* get the community 
 focus right
 now.

I'm sure it can generate **much** faster code. I'm going to focus 
on its part that generates D parser (i.e., to make it 
significantly faster and able to efficiently parse-as-you-type). 
Actually, I'm sure it will be able to beat any other parser with 
respect to performance. :)

1. So my plan is the following: invite whoever would want to help.
2. Prove my claims above in practice. :-)))))

Jul 05 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, 
 as I learn
 by coding. Hey, I just received the Dragon Book (International 
 Edition),

If you are interested in parsing, than I wouldn't recommend the 
dragon book, but
this one 
http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html

It really is an awesome book.

Jul 05 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrot=
e:
 On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
 As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I lea=


rn
 by coding. Hey, I just received the Dragon Book (International Edition),


 If you are interested in parsing, than I wouldn't recommend the dragon bo=

ok,
 but
 this one http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html

 It really is an awesome book.

Hey, good catch! I saw this one a few months ago and forgot about it.
Having half a book being annotated bibliography (IIUC) scared me.

Hmm 72 =E2=82=AC by Springer, 55 =E2=82=AC on Amazon. Something is not righ=
t.
Paperback vs perfect bound maybe?

Is "Modern Compiler Design" by the same authors any good?

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 20:28:32 UTC, Philippe Sigaud wrote:
 Hmm 72 € by Springer, 55 € on Amazon. Something is not 
 right.
 Paperback vs perfect bound maybe?

http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf
Not sure that it is legal, but definitely free.

Jul 05 2012

deadalnix <deadalnix gmail.com> writes:

Le 05/07/2012 18:32, Roman D. Boiko a écrit :
 On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
 Original post:

 http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

 OK, fairly complete. Let it be the starting point. Why not put it on
 some wiki, then add some more, discuss, vote, etc.? Meanwhile create a
 matrix of features implemented in each of interesting standardization
 candidates. And pick up one of them as "standard" or "reference" parser.

 My vote would be for Pegged, I guess.

Why not program instead of doing bureaucracy ?

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
 Why not program instead of doing bureaucracy ?

To avoid programming things which are not needed or don't fit. 
I've thrown away several implementations already... time to 
reflect a little :)

But, actually, for DCT I do know what I need to do. That 
suggestion was with respect to any candidate for becoming 
standard. Everybody understands that their way (likely 
differently than others).

Jul 05 2012

deadalnix <deadalnix gmail.com> writes:

Le 06/07/2012 00:21, Roman D. Boiko a écrit :
 On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
 Why not program instead of doing bureaucracy ?

 To avoid programming things which are not needed or don't fit. I've
 thrown away several implementations already... time to reflect a little :)

 But, actually, for DCT I do know what I need to do. That suggestion was
 with respect to any candidate for becoming standard. Everybody
 understands that their way (likely differently than others).

Well you did the mistake to introduce an over complex mechanism in 
log(n) to get back to source code when an obvious one in O(1) exists.

Let me doubt.

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 22:32:29 UTC, deadalnix wrote:
 Well you did the mistake to introduce an over complex mechanism 
 in log(n) to get back to source code when an obvious one in 
 O(1) exists.

 Let me doubt.

I'm fine with that, let you doubt.

Jul 05 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

05.07.2012 20:14, Jacob Carlborg пишет:
 On 2012-07-05 18:04, Roman D. Boiko wrote:

 Well, we did something like that for DCT... but I doubt that it would
 fit general needs.

 Why wouldn't it.

 If we had, why haven't they been analyzed, classified, discussed, etc.?
 Or have they?

 I don't know. Here is what I wrote for DCT:

 * IDE integration
 * Refactoring tool
 * Static analysis
 * Compiler
 * Doc generating
 * Build tool
 * DI generating

Just want to mention I'm talking about parser only. Things like 
"Refactoring" have to work perfectly or are unusable so it require a 
full compiler at least as finished as current dmd.


-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Jul 05 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

 
 Haven't we already done that a couple of times.

 
 Well, we did something like that for DCT... but I doubt that it
 would fit general needs.
 
 If we had, why haven't they been analyzed, classified, discussed,
 etc.? Or have they?

It's been discussed before, but there are some obvious use cases such as 
syntax highlighting, code formatting, and manipulation of D source files (e.g. 
to strip out the unit tests).

We need to have a lexer in Phobos which parsers D code and generates a range 
of tokens, and we need to have a parser in Phobos which operates on a range of 
tokens. As discussed previously, there is desire to have the lexer and parser 
ported from dmd's frontend to D for Phobos so that what we get is easily 
maintainable alongside dmd's frontend and produces the same results (and is 
fast). It's _also_ desirable that we get a lexer/parser generator into Phobos 
for generating lexers/parsers for whatever language you might want to generate 
them for. Pegged is a good example of what can be done, and I think that 
Andrei was trying to get Philippe to prepare a submission to Phobos from it 
(though I'm not sure), but regarldess of whether pegged (or anything based on 
it) makes it into Phobos, we definitely want something similar.

So, we have a fair idea of some of the stuff that we want, but it's a question 
of time and effort. I keep intending to port dmd's lexer to D for submission to 
Phobos, but I've been too busy to do it. At least a couple of other people 
have also expressed interest in doing it, but no one has submitted anything 
for Phobos. So, it remains undone, and anything which would need to lex or 
parse D code has to find its own solution. As with most everything around here, 
it's a question of people being having the time and being willing to put in 
the effort to do it. It's all too common for someone to suggest that we should 
do something or implement something without ever attempting to do it 
themselves, and in general, stuff around here gets done because someone really 
wants it done, takes the time to do it, and sees it through until its done and 
in Phobos.

- Jonathan M Davis

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to 
 do it themselves, and in general, stuff around here gets done 
 because someone really wants it done, takes the time to do it, 
 and sees it through until its done and in Phobos.

 - Jonathan M Davis

Resume: everybody is welcome to join effort of translating DMD 
front end, and improving Pegged.


Also I would like to invite those interested in DCT project to 
help me with it. Right now I'm trying to understand whether it is 
possible to incorporate Pegged inside without losing anything 
critical (and I think it is very likely possible), and how 
exactly to do that.

Dmitry proposed to help improve Pegged (or some other compiler's) 
speed.

Anyone else?

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 12:39 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really wants it done, takes the time to do it, and sees it
 through until its done and in Phobos.

 - Jonathan M Davis

 Resume: everybody is welcome to join effort of translating DMD front
 end, and improving Pegged.


 Also I would like to invite those interested in DCT project to help me
 with it. Right now I'm trying to understand whether it is possible to
 incorporate Pegged inside without losing anything critical (and I think
 it is very likely possible), and how exactly to do that.

 Dmitry proposed to help improve Pegged (or some other compiler's) speed.

 Anyone else?

I'd really want to create a task force on this, it is of strategic 
importance to D. In Walter's own words, no new feature is going to push 
us forward since we're not really using the great goodies we've got, and 
CTFE technology is the most important.

I also am actively opposed to a project of just translating D's 
front-end to D and dropping it into Phobos because it would smother (a) 
work on generic parser generators, and (b) strong, dependable 
formalization of D's syntax.


Andrei

Jul 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
 On 7/5/12 12:39 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really wants it done, takes the time to do it, and sees it
 through until its done and in Phobos.

 - Jonathan M Davis

 Resume: everybody is welcome to join effort of translating DMD front
 end, and improving Pegged.


 Also I would like to invite those interested in DCT project to help me
 with it. Right now I'm trying to understand whether it is possible to
 incorporate Pegged inside without losing anything critical (and I think
 it is very likely possible), and how exactly to do that.

 Dmitry proposed to help improve Pegged (or some other compiler's) speed.

 Anyone else?

 I'd really want to create a task force on this, it is of strategic
 importance to D. In Walter's own words, no new feature is going to push
 us forward since we're not really using the great goodies we've got, and
 CTFE technology is the most important.

Count me as interested.
CTFE needs more correctness & speed though. So to put it blantly - no 
it's not possible right NOW.
BUT it doesn't prevent us from planing and doing a proof of concept. 
Pegged seems a good starting point even if we end up re-writing it from 
scratch.

 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

Well put. It shouldn't stop people from doing parsers, IMO the more the 
merrier.


-- 
Dmitry Olshansky

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
 Count me as interested.
 CTFE needs more correctness & speed though. So to put it 
 blantly - no it's not possible right NOW.
 BUT it doesn't prevent us from planing and doing a proof of 
 concept. Pegged seems a good starting point even if we end up 
 re-writing it from scratch.

IMO, first priority is to make code generated by Pegged work 
fast, and second priority - make code generation itself fast.

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 2:38 PM, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
 Count me as interested.
 CTFE needs more correctness & speed though. So to put it blantly - no
 it's not possible right NOW.
 BUT it doesn't prevent us from planing and doing a proof of concept.
 Pegged seems a good starting point even if we end up re-writing it
 from scratch.

 IMO, first priority is to make code generated by Pegged work fast, and
 second priority - make code generation itself fast.

Well I'd say there are things like supporting large, realistic grammars 
without blowing up or taking long compilation times is the first priority.

Andrei

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 2:33 PM, Dmitry Olshansky wrote:
 On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
 I'd really want to create a task force on this, it is of strategic
 importance to D. In Walter's own words, no new feature is going to push
 us forward since we're not really using the great goodies we've got, and
 CTFE technology is the most important.

 Count me as interested.
 CTFE needs more correctness & speed though. So to put it blantly - no
 it's not possible right NOW.
 BUT it doesn't prevent us from planing and doing a proof of concept.
 Pegged seems a good starting point even if we end up re-writing it from
 scratch.

Excellent point. Walter is 100% behind the general strategy and we can 
bring him to work on specific bug reports and enhancement requests if we 
make a case they are important for Pegged.

 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

 Well put. It shouldn't stop people from doing parsers, IMO the more the
 merrier.

Exactly.


Andrei

Jul 05 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-05 20:22, Andrei Alexandrescu wrote:
 I also am actively opposed to a project of just translating D's
 front-end to D and dropping it into Phobos because it would smother (a)
 work on generic parser generators, and (b) strong, dependable
 formalization of D's syntax.

I don't see why you are against this. I'm seeing this as basically the 
only change for DMD every being written in D. Sure I would much rather 
have a completely new compiler written in D, with a module approach and 
designed from scratch to be used as a library, but I'm trying to 
realistic here.

-- 
/Jacob Carlborg

Jul 05 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 Resume: everybody is welcome to join effort of translating DMD 
 front end, and improving Pegged.

 Also I would like to invite those interested in DCT project to 
 help me with it. Right now I'm trying to understand whether it 
 is possible to incorporate Pegged inside without losing 
 anything critical (and I think it is very likely possible), 
 and how exactly to do that.

 Dmitry proposed to help improve Pegged (or some other 
 compiler's) speed.

 Anyone else?

 I'd really want to create a task force on this, it is of 
 strategic importance to D. In Walter's own words, no new 
 feature is going to push us forward since we're not really 
 using the great goodies we've got, and CTFE technology is the 
 most important.

Hi everybody! My name's David and I've been dreaming since around 
1999 of making my own computer language, and never found the time 
for it. The first time I looked at D it was around 2004 or so, 
and it just looked like a "moderately better C++" which I forgot 
about, having more lofty ideas. When I found out about D2's 
metaprogramming facilities I instantly became much more 
interested, although I still wish to accomplish more than is 
possible ATM.

I've been talking to my boss about reducing my working hours, 
mainly in order to have time to work on something related to D. 
My goal is to popularize a language that is efficient (as in 
runtime speed and size), expressive, safe, concise, readable, 
well-documented, easy-to-use, and good at finding errors in your 
code.  In other words, I want a language that is literally all 
things to all people, a language that is effective for any task. 
I want to kill off this preconceived notion that most programmers 
seem to have, that fast code requires a language like C++ that is 
hard to use. The notion that Rapid Application Development is 
incompatible with an efficient executable is nonsense and I want 
to kill it :)

To be honest I have some reservations about D, but of all the 
languages I know, D is currently closest to my ideal.

This work on parsers might be a good place for me to dive in. I 
have an interest in parsers and familiarity with LL, LALR, PEGs, 
and even Pratt parsers (fun!), but I am still inexperienced.

I also like writing documentation and articles, but I always find 
it hard to figure out how other people's code works well enough 
to document it.

I'm having some trouble following this thread due to the 
acronyms: CTFE, DCT, AA. At least I managed to figure out that 
CTFE is Compile Time Function Execution. DCT and AA I already 
know as Discrete Cosine Transform and Anti-Aliasing, of 
course.... but what's it mean to you?

One thing that has always concerned me about PEGs is that they 
always say PEGs are slower than traditional two-phase LALR(1) or 
LL(k) parsers. However, I have never seen any benchmarks. Does 
anyone know exactly how much performance you lose in an 
(optimized) PEG compared to an (optimized) LALR/LL parser + 
LL/regex lexer?

Anyway, it's the weekend, during which I hope I can find a place 
to fit in with you guys.

Jul 06 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jul 07, 2012 at 02:45:35AM +0200, David Piepgrass wrote:
[...]
 I'm having some trouble following this thread due to the acronyms:
 CTFE, DCT, AA. At least I managed to figure out that CTFE is Compile
 Time Function Execution. DCT and AA I already know as Discrete Cosine
 Transform and Anti-Aliasing, of course.... but what's it mean to you?

Frankly, I don't know what DCT stands for (anyone?), but AA stands for
Associative Array (aka hash, dictionary, etc.).


[...]
 Anyway, it's the weekend, during which I hope I can find a place to
 fit in with you guys.

Welcome aboard!!!

D still has some ways to go (there are still warts in some places), but
I also have to admit that it's closest to my ideal language too. I'm so
sick and fed up of the tedium of C and the convolutions of C++, that I
just can't see myself doing any more C/C++ except for code maintenance.
New projects I would do in D. (Personal projects, anyway... work
projects, I don't get to choose.)

Anyway, D still has some ways to go, meaning that more manpower is
always more than welcome.


T

-- 
Elegant or ugly code as well as fine or rude sentences have something in
common: they don't depend on the language. -- Luca De Vitis

Jul 06 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 01:47:21 UTC, H. S. Teoh wrote:
 Frankly, I don't know what DCT stands for (anyone?)

That's a working (not final) name of my project, a.k.a. The D 
Compiler Tools. I've used this acronym in previous threads but 
didn't explain this time. Current state is very far from 
completion, it is in research phase. Link: 
https://github.com/roman-d-boiko/dct, but I'd suggest to contact 
me at rb d-coding.com if anyone is interested.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I 
 have an interest in parsers and familiarity with LL, LALR, 
 PEGs, and even Pratt parsers (fun!), but I am still 
 inexperienced.

...
 One thing that has always concerned me about PEGs is that they 
 always say PEGs are slower than traditional two-phase LALR(1) 
 or LL(k) parsers. However, I have never seen any benchmarks. 
 Does anyone know exactly how much performance you lose in an 
 (optimized) PEG compared to an (optimized) LALR/LL parser + 
 LL/regex lexer?

I decided to ask a question about this:

http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

Don't hesitate to edit it if you would like to clarify some 
aspect.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 13:06, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I have
 an interest in parsers and familiarity with LL, LALR, PEGs, and even
 Pratt parsers (fun!), but I am still inexperienced.

 ...
 One thing that has always concerned me about PEGs is that they always
 say PEGs are slower than traditional two-phase LALR(1) or LL(k)
 parsers. However, I have never seen any benchmarks. Does anyone know
 exactly how much performance you lose in an (optimized) PEG compared
 to an (optimized) LALR/LL parser + LL/regex lexer?

 I decided to ask a question about this:

 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk


 Don't hesitate to edit it if you would like to clarify some aspect.

I can tell you that they are not slower then one another in principle. 
Quality of implementations trumps every theoretical aspect here. LALR is 
usually fast even if implemented by book but they are hard to optimize 
futher and quite restrictive on "semantic extensions".

Proper CFG parsers all are liner-time anyway.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 10:05:30 UTC, Dmitry Olshansky wrote:
 I can tell you that they are not slower then one another in 
 principle. Quality of implementations trumps every theoretical 
 aspect here. LALR is usually fast even if implemented by book 
 but they are hard to optimize futher and quite restrictive on 
 "semantic extensions".

 Proper CFG parsers all are liner-time anyway.

Exactly, that is also my point. I think that Pegged can be 
heavily optimized in performance, and there are no fundamental 
issues which would make it inherently slower than LALR or a 
hand-written D-specific parser.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, 
 and there are no
 fundamental issues which would make it inherently slower than 
 LALR or a hand-written D-specific parser.

Hmm... found an interesting article: 
http://www.antlr.org/papers/LL-star-PLDI11.pdf

It describes some disadvantages of Packrat parsing, like problems 
with debugging and error recovery. These are important for DCT, 
so I'll have to perform additional research.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

 Hmm... found an interesting article:
 http://www.antlr.org/papers/LL-star-PLDI11.pdf

 It describes some disadvantages of Packrat parsing, like problems with
 debugging and error recovery. These are important for DCT, so I'll have
 to perform additional research.

Yup, LL(*) is my favorite so far. As for debugging and error recovery 
they are always a problem.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 11:33:18 UTC, Dmitry Olshansky wrote:
 Yup, LL(*) is my favorite so far. As for debugging and error 
 recovery they are always a problem.

It's interesting, that I also wanted to use DFA for non-terminals 
(similarly to LL(*)), but was considering their usage as pure 
performance optimization. Concerns raised in the article seem to 
be very reasonable, I will likely reconsider my previous plans...

This forum is an endless source of high-quality feed-back for me.

Jul 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
 On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

 Hmm... found an interesting article:
 http://www.antlr.org/papers/LL-star-PLDI11.pdf

 It describes some disadvantages of Packrat parsing, like problems with
 debugging and error recovery. These are important for DCT, so I'll have
 to perform additional research.

 Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if 
PEGs turn out to have issues LL(*) sounds like a promising alternative.

How many semantics hacks does D's syntax need for a LL(*) parser?


Andrei

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 22:25, Andrei Alexandrescu wrote:
 On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
 On 07-Jul-12 15:30, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
 I think that Pegged can be heavily optimized in performance, and there
 are no
 fundamental issues which would make it inherently slower than LALR or
 a hand-written D-specific parser.

 Hmm... found an interesting article:
 http://www.antlr.org/papers/LL-star-PLDI11.pdf

 It describes some disadvantages of Packrat parsing, like problems with
 debugging and error recovery. These are important for DCT, so I'll have
 to perform additional research.

 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked ANTLR, so if
 PEGs turn out to have issues LL(*) sounds like a promising alternative.

 How many semantics hacks does D's syntax need for a LL(*) parser?

I believe that it may need very few _semantic_ predicates. But there are 
cases where infinite lookahead is a must. Can't recall which cases offhand.

 Andrei


-- 
Dmitry Olshansky

Jul 07 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked 
 ANTLR, so if PEGs turn out to have issues LL(*) sounds like a 
 promising alternative.

We should also consider using GLR if LL(*) doesn't work.

Jul 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked ANTLR, so if
 PEGs turn out to have issues LL(*) sounds like a promising alternative.

 We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

-- 
Dmitry Olshansky

Jul 08 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked 
 ANTLR, so if
 PEGs turn out to have issues LL(*) sounds like a promising 
 alternative.

 We should also consider using GLR if LL(*) doesn't work.

 GLR ... are you serious? It still does parsing in n^3 if I'm 
 not mistaken.

Since GLR can parse any CFG the upper bound is in O(n^3), but the 
actual performance seems to depend on the grammar.

 From Elkhound [1]
 It should be competitive with Bison for LALR (fragements of) 
 grammars, and degrade gracefully from there. On the scale of 
 grammar nondeterminism, from none
 (LALR) to some to lots, "some" is the niche Elkhound is going 
 after.
 These goals are driven by Elkhound's primary application, the 
 Elsa C++ Parser. In essence, Elkhound came about because I 
 wanted to apply automatic parsing technology to parsing C++, 
 but found exiting systems inadequate.

So it seems to me that it is not worse than PEG if you have a 
grammar with
reasonable many ambiguities.

Performance is one thing, but you should be able to express your 
language in the underlying formalism without too many hurdles.

Jul 08 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

 From Elkhound [1]

http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html

Jul 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 14:48, Tobias Pankrath wrote:
 On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked ANTLR, so if
 PEGs turn out to have issues LL(*) sounds like a promising alternative.

 We should also consider using GLR if LL(*) doesn't work.

 GLR ... are you serious? It still does parsing in n^3 if I'm not
 mistaken.

 Since GLR can parse any CFG the upper bound is in O(n^3), but the actual
 performance seems to depend on the grammar.

  From Elkhound [1]
 It should be competitive with Bison for LALR (fragements of) grammars,
 and degrade gracefully from there. On the scale of grammar
 nondeterminism, from none
 (LALR) to some to lots, "some" is the niche Elkhound is going after.
 These goals are driven by Elkhound's primary application, the Elsa C++
 Parser. In essence, Elkhound came about because I wanted to apply
 automatic parsing technology to parsing C++, but found exiting systems
 inadequate.

 So it seems to me that it is not worse than PEG if you have a grammar with
 reasonable many ambiguities.

 Performance is one thing, but you should be able to express your
 language in the underlying formalism without too many hurdles.

The problem is that say LL(*) can be easily tweaked in place to run 
various semantic actions as it parse the source. I think GLR is harder 
to make do anything other then "parse & say it's fine".


-- 
Dmitry Olshansky

Jul 08 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

 The problem is that say LL(*) can be easily tweaked in place to 
 run various semantic actions as it parse the source. I think 
 GLR is harder to make do anything other then "parse & say it's 
 fine".

Did you look at Elkhound? According to the tutorial you can use 
actions
in the same way as if you were using an (LA)LR parser. Dunno if 
it's sufficient, but with Elkhound an C++-Parser has been written.

See also 
http://stackoverflow.com/questions/428892/what-parser-genera
or-do-you-recommend 
where Ira Baxter (maybe you know him) makes a case for DMS, which 
uses GLR internally. Here is a google tech talk of Ira explaining 
the DMS http://www.youtube.com/watch?v=C-_dw9iEzhA .

I'm not very experienced with parser generators. Just trying to 
put all options on the table.

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 13:05, Tobias Pankrath wrote:
 Yup, LL(*) is my favorite so far.

 That's Terence Parr's discovery, right? I've always liked 
 ANTLR, so if
 PEGs turn out to have issues LL(*) sounds like a promising 
 alternative.

 We should also consider using GLR if LL(*) doesn't work.

 GLR ... are you serious? It still does parsing in n^3 if I'm 
 not mistaken.

http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some 
disadvantages of GLR with respect to LL(*).

Jul 08 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

 I can tell you that they are not slower then one another in 
 principle. Quality of implementations trumps every theoretical 
 aspect here. LALR is usually fast even if implemented by book 
 but they are hard to optimize futher and quite restrictive on 
 "semantic extensions".

 Proper CFG parsers all are liner-time anyway.

To be picky here:
The languages that can be parsed in linear time are a strict 
subset of CFGs.

However I do agree, that error handling and flexibility are more 
important than raw speed. I don't want to get a concrete syntax 
tree full of unneeded productions, like the ones you get when you 
encode precedence rules in your grammar.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:
 Proper CFG parsers all are liner-time anyway.

 To be picky here:
 The languages that can be parsed in linear time are a strict 
 subset of CFGs.

Also any PEG defines a language with linear-time parsing. Some 
non-context-free languages can be described with PEG.

Jul 07 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Saturday, 7 July 2012 at 15:39:52 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:
 Proper CFG parsers all are liner-time anyway.

 To be picky here:
 The languages that can be parsed in linear time are a strict 
 subset of CFGs.

 Also any PEG defines a language with linear-time parsing. Some 
 non-context-free languages can be described with PEG.

Interesting, I thought that PEG ⊂ CFG holds.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
 Interesting, I thought that PEG ⊂ CFG holds.

See section 3.4 of http://bford.info/pub/lang/peg.pdf

It contains a simple proof that a non-context-free language (a^n) 
(b^n) (c^n) can be described with PEG syntax. There are 
infinitely many such languages.

Jul 07 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sat, Jul 7, 2012 at 6:25 PM, Roman D. Boiko <rb d-coding.com> wrote:
 On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
 Interesting, I thought that PEG =E2=8A=82 CFG holds.

 See section 3.4 of http://bford.info/pub/lang/peg.pdf

 It contains a simple proof that a non-context-free language (a^n) (b^n)
 (c^n) can be described with PEG syntax. There are infinitely many such
 languages.

Not that anyone is interested in such languages, but still :)

Jul 07 2012

deadalnix <deadalnix gmail.com> writes:

On 07/07/2012 12:05, Dmitry Olshansky wrote:
 On 07-Jul-12 13:06, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
 This work on parsers might be a good place for me to dive in. I have
 an interest in parsers and familiarity with LL, LALR, PEGs, and even
 Pratt parsers (fun!), but I am still inexperienced.

 ...
 One thing that has always concerned me about PEGs is that they always
 say PEGs are slower than traditional two-phase LALR(1) or LL(k)
 parsers. However, I have never seen any benchmarks. Does anyone know
 exactly how much performance you lose in an (optimized) PEG compared
 to an (optimized) LALR/LL parser + LL/regex lexer?

 I decided to ask a question about this:

 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk



 Don't hesitate to edit it if you would like to clarify some aspect.

 I can tell you that they are not slower then one another in principle.
 Quality of implementations trumps every theoretical aspect here. LALR is
 usually fast even if implemented by book but they are hard to optimize
 futher and quite restrictive on "semantic extensions".

 Proper CFG parsers all are liner-time anyway.

D isn't 100% CFG. But it is close.

Jul 07 2012

travert phare.normalesup.org (Christophe Travert) writes:

deadalnix , dans le message (digitalmars.D:171330), a �crit�:
 D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?

Jul 09 2012

deadalnix <deadalnix gmail.com> writes:

On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a �crit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

type[something] <= something can be a type or an expression.
typeid(somethning) <= same here
identifier!(something) <= again

Jul 10 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a �crit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

 type[something] <= something can be a type or an expression.
 typeid(somethning) <= same here
 identifier!(something) <= again

'something' is context-free:

something ::= type | expression.

Jul 10 2012

travert phare.normalesup.org (Christophe Travert) writes:

Timon Gehr , dans le message (digitalmars.D:171814), a �crit�:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a �crit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

 type[something] <= something can be a type or an expression.
 typeid(somethning) <= same here
 identifier!(something) <= again

 
 'something' is context-free:
 
 something ::= type | expression.

Do you have to know if something is a type or an expression for a simple 
parsing? The langage would better not require this, otherwise simple 
parsing is not possible without looking at all forward references and 
imported files.

Jul 11 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/11/2012 10:23 AM, Christophe Travert wrote:
 Timon Gehr , dans le message (digitalmars.D:171814), a �crit :
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a �crit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

 type[something]<= something can be a type or an expression.
 typeid(somethning)<= same here
 identifier!(something)<= again

 'something' is context-free:

 something ::= type | expression.

 Do you have to know if something is a type or an expression for a simple
 parsing?

No. Some token sequences can be both a type and an expression based on
the context (the CFG is ambiguous), but that is the analysers business.
Parsing D code does not require any kind of analysis.

 The langage would better not require this, otherwise simple
 parsing is not possible without looking at all forward references and
 imported files.

Jul 11 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a écrit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

 type[something] <= something can be a type or an expression.
 typeid(somethning) <= same here
 identifier!(something) <= again

 'something' is context-free:

 something ::= type | expression.

I don't see how "type | expression" is context free. The input 
"Foo" could be a type or expression, you can't tell which without 
looking at the context.

Jul 11 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/11/2012 11:55 PM, David Piepgrass wrote:
 On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:
 On 07/11/2012 01:16 AM, deadalnix wrote:
 On 09/07/2012 10:14, Christophe Travert wrote:
 deadalnix , dans le message (digitalmars.D:171330), a écrit :
 D isn't 100% CFG. But it is close.

 What makes D fail to be a CFG?

 type[something] <= something can be a type or an expression.
 typeid(somethning) <= same here
 identifier!(something) <= again

 'something' is context-free:

 something ::= type | expression.

 I don't see how "type | expression" is context free. The input "Foo"
 could be a type or expression, you can't tell which without looking at
 the context.

'Context free' means that all the production rule left hand sides do
not have a context, i.e. they consist of a single non-terminal.

What you are describing is the fact that the grammar is ambiguous. The 
parser can just assume whatever works and the analyzer takes care of the
interpretation of the resulting AST. The parser does not have to
give a meaning to the code. It just turns it into a representation that
is easy to work with.

Jul 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend the dragon book,
 but this one

 It really is an awesome book.

This little post to say a big 'Thanks' to Tobias. I'm reading Grune's
Parsing Techniques and find it absolutely wonderful: easy to read,
going into details and limits of the many parsing algos. Very cool!

I'm playing with fully-backtracking recursive-descent parsers right
now, to deal with ambiguous grammars and the small parser works
beautifully. Hey, I just saw he gives the code on his website, nice!

Now, on to LR. Nick, here I come. I'll at last know how Goldie works.


Philippe

Jul 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/23/12 9:22 AM, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath<tobias pankrath.net>  wrote:

 If you are interested in parsing, than I wouldn't recommend the dragon book,
 but this one

 It really is an awesome book.

 This little post to say a big 'Thanks' to Tobias. I'm reading Grune's
 Parsing Techniques and find it absolutely wonderful: easy to read,
 going into details and limits of the many parsing algos. Very cool!

 I'm playing with fully-backtracking recursive-descent parsers right
 now, to deal with ambiguous grammars and the small parser works
 beautifully. Hey, I just saw he gives the code on his website, nice!

 Now, on to LR. Nick, here I come. I'll at last know how Goldie works.

When will we see the fruit of that all in the form of D libraries?

Andrei

Jul 23 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath 
 <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend 
 the dragon book,
 but this one

 It really is an awesome book.

 This little post to say a big 'Thanks' to Tobias. I'm reading 
 Grune's
 Parsing Techniques and find it absolutely wonderful: easy to 
 read,
 going into details and limits of the many parsing algos. Very 
 cool!

Yes, I'm also reading it and find it amusingly high-quality.

Jul 23 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
 On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath 
 <tobias pankrath.net> wrote:

 If you are interested in parsing, than I wouldn't recommend 
 the dragon book,
 but this one

 It really is an awesome book.

 This little post to say a big 'Thanks' to Tobias. I'm reading 
 Grune's
 Parsing Techniques and find it absolutely wonderful: easy to 
 read,
 going into details and limits of the many parsing algos. Very 
 cool!

I'm glad that you like it.

Jul 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 6:05 AM, Dmitry Olshansky wrote:
 I can tell you that they are not slower then one another in principle.
 Quality of implementations trumps every theoretical aspect here. LALR is
 usually fast even if implemented by book but they are hard to optimize
 futher and quite restrictive on "semantic extensions".

Isn't it the case that PEG require more memory for certain grammars?

Andrei

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

So far it looks like LALR parsers may have lower constant factors 
than Packrat.

The difference could be minimized by paying attention to parsing 
of terminal symbols, which was in my plans already. It is not 
necessary to strictly follow Packrat parsing algorithm.

The benefits of Pegged, in my view, are its support of Parsing 
Expression Grammar (PEG) and compile-time evaluation. It is 
easily extensible and modifiable.

When I implemented recursive-descent parser by hand in one of 
early drafts of DCT, I strongly felt the need to generalize code 
in a way which in retrospect I would call PEG-like. The structure 
of my hand-written recursive-descent parser was a one-to-one 
mapping to an implemented subset of D specification, and I 
considered it problematic, because it was needed to duplicate the 
same structure in the resulting AST.

PEG is basically a language that describes both, the 
implementation of parser, and the language syntax. It greatly 
reduces implicit code duplication.

I think that generated code can be made almost as fast as a 
hand-written parser for a particular language (probably, a few 
percent slower). Especially if that language is similar to D 
(context-free, with fine-grained hierarchical grammar). 
Optimizations might require to forget about strictly following 
any theoretical algorithm, but that should be OK.

Jul 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

 So far it looks like LALR parsers may have lower constant factors than
 Packrat.

 The difference could be minimized by paying attention to parsing of
 terminal symbols, which was in my plans already. It is not necessary to
 strictly follow Packrat parsing algorithm.

 The benefits of Pegged, in my view, are its support of Parsing
 Expression Grammar (PEG) and compile-time evaluation. It is easily
 extensible and modifiable.

Isn't also the fact that lexing and parsing are integrated? With 
traditional LALR you need a separate tokenizer.

 When I implemented recursive-descent parser by hand in one of early
 drafts of DCT, I strongly felt the need to generalize code in a way
 which in retrospect I would call PEG-like. The structure of my
 hand-written recursive-descent parser was a one-to-one mapping to an
 implemented subset of D specification, and I considered it problematic,
 because it was needed to duplicate the same structure in the resulting AST.

 PEG is basically a language that describes both, the implementation of
 parser, and the language syntax. It greatly reduces implicit code
 duplication.

 I think that generated code can be made almost as fast as a hand-written
 parser for a particular language (probably, a few percent slower).
 Especially if that language is similar to D (context-free, with
 fine-grained hierarchical grammar). Optimizations might require to
 forget about strictly following any theoretical algorithm, but that
 should be OK.

All that sounds really encouraging. I'm really looking forward to more 
work in that area. If you stumble upon bugs that block you, let us know 
and Walter agreed he'll boost their priority.


Andrei

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
 On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

 So far it looks like LALR parsers may have lower constant factors than
 Packrat.

 The difference could be minimized by paying attention to parsing of
 terminal symbols, which was in my plans already. It is not necessary to
 strictly follow Packrat parsing algorithm.

 The benefits of Pegged, in my view, are its support of Parsing
 Expression Grammar (PEG) and compile-time evaluation. It is easily
 extensible and modifiable.

 Isn't also the fact that lexing and parsing are integrated? With
 traditional LALR you need a separate tokenizer.

I'll have to point out that the whole point about integrated lexing is 
moot. It's more of liability then benefit. At very least it's just 
implementation curiosity not advantage.

 When I implemented recursive-descent parser by hand in one of early
 drafts of DCT, I strongly felt the need to generalize code in a way
 which in retrospect I would call PEG-like. The structure of my
 hand-written recursive-descent parser was a one-to-one mapping to an
 implemented subset of D specification, and I considered it problematic,
 because it was needed to duplicate the same structure in the resulting
 AST.

 PEG is basically a language that describes both, the implementation of
 parser, and the language syntax. It greatly reduces implicit code
 duplication.

 I think that generated code can be made almost as fast as a hand-written
 parser for a particular language (probably, a few percent slower).
 Especially if that language is similar to D (context-free, with
 fine-grained hierarchical grammar). Optimizations might require to
 forget about strictly following any theoretical algorithm, but that
 should be OK.

 All that sounds really encouraging. I'm really looking forward to more
 work in that area. If you stumble upon bugs that block you, let us know
 and Walter agreed he'll boost their priority.


 Andrei


-- 
Dmitry Olshansky

Jul 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

Interesting. I'll have to ask for more details on why.

Andrei

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu 
wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated 
 lexing is
 moot. It's more of liability then benefit. At very least it's 
 just
 implementation curiosity not advantage.

 Interesting. I'll have to ask for more details on why.

 Andrei

+1. Personally I associate PEG with a parser that includes a 
__distributed__ lexer inside, which gives the advantage of having 
to check less alternatives at each step (that's similar to 
deterministic parsing). If lexer is separate, it seems to be more 
difficult to scan for only a subset of possible tokens.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 00:19, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

 Interesting. I'll have to ask for more details on why.

 Andrei

 +1. Personally I associate PEG with a parser that includes a
 __distributed__ lexer inside,


Well put yet it's still lexer. And most horribly it's not optimized to 
DFAs most of the time last time I've checked papers.

  which gives the advantage of having to
 check less alternatives at each step (that's similar to deterministic
 parsing). If lexer is separate, it seems to be more difficult to scan
 for only a subset of possible tokens.

And given the backtracking nature of PEGs you'll do your distributed 
thing many times over or ... spend a lot of RAM to remember not to redo 
it. I recall lexing takes even more then parsing itself.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
 And given the backtracking nature of PEGs you'll do your 
 distributed thing many times over or ... spend a lot of RAM to 
 remember not to redo it. I recall lexing takes even more then 
 parsing itself.

I think that your conclusions are about statistical evidences of 
PEG misuses and poor PEG parser implementations. My point was 
that there is nothing fundamentally worse in having lexer 
integrated with parser, but there are performance advantages of 
having to check less possible cases when the structural 
information is available (so that lexSmth could be called when 
Smth is expected, thus requiring less switch branches if switch 
is used).

As for lexing multiple times, simply use a free list of terminals 
(aka tokens). I still assume that grammar is properly defined, so 
that there is only one way to split source into tokens.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 00:50, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
 And given the backtracking nature of PEGs you'll do your distributed
 thing many times over or ... spend a lot of RAM to remember not to
 redo it. I recall lexing takes even more then parsing itself.

 I think that your conclusions are about statistical evidences of PEG
 misuses and poor PEG parser implementations. My point was that there is
 nothing fundamentally worse in having lexer integrated with parser, but
 there are performance advantages of having to check less possible cases
 when the structural information is available (so that lexSmth could be
 called when Smth is expected, thus requiring less switch branches if
 switch is used).

You may misunderstood me as well, the point is still the same:
there are 2 things - notation and implementation, the fact that lexer is 
integrated in notation like in PEGs is not related to the fact that PEGs 
in their classic definition never use term Token and do backtracking 
parsing essentially on character level.



 As for lexing multiple times, simply use a free list of terminals (aka
 tokens). I still assume that grammar is properly defined, so that there
 is only one way to split source into tokens.

Tokens.. there is no such term in use if we talk about 'pure' PEG.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
 You may misunderstood me as well, the point is still the same:
 there are 2 things - notation and implementation, the fact that 
 lexer is integrated in notation like in PEGs is not related to 
 the fact that PEGs in their classic definition never use term 
 Token and do backtracking parsing essentially on character 
 level.

But PEG itself doesn't require backtracking parsing, does it? So 
that's an implementation detail, or a specific algorithm. Lexer 
separation seems to be independent of this.

 As for lexing multiple times, simply use a free list of 
 terminals (aka tokens). I still assume that grammar is 
 properly defined, so that there is only one way to split 
 source into tokens.

 Tokens.. there is no such term in use if we talk about 'pure' 
 PEG.

Terminal symbols.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 01:24, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
 You may misunderstood me as well, the point is still the same:
 there are 2 things - notation and implementation, the fact that lexer
 is integrated in notation like in PEGs is not related to the fact that
 PEGs in their classic definition never use term Token and do
 backtracking parsing essentially on character level.

 But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically the algorithm is to try option A, try option B, 
try option C...

it's obvious how backtracking comes in ... whan say A fails somewhere in 
the middle of it. Of course there is a fair amount of bookkeeping that 
reduces doing work all over again but it does ... allocate.

This could be avoided via disambiguation a-la LL(k) but that won't be 
true PEG anymore ;)

So that's
 an implementation detail, or a specific algorithm.

No

Lexer separation
 seems to be independent of this.

Yes

 As for lexing multiple times, simply use a free list of terminals
 (aka tokens). I still assume that grammar is properly defined, so
 that there is only one way to split source into tokens.

 Tokens.. there is no such term in use if we talk about 'pure' PEG.

 Terminal symbols.

Characters.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 01:24, Roman D. Boiko wrote:
 But PEG itself doesn't require backtracking parsing, does it?

 It does. Specifically the algorithm is to try option A, try 
 option B, try option C...

There is no algorithm, only __grammar__. It specifies that option 
A has higher priority than option B in expression A / B / C. But 
it doesn't forbid, for example, to analyze all in parallel (track 
state transitions) for each character consequently, as a proper 
DFA implementation would do, until some option succeeds and all 
previous fail. A greedy regex would also have to check all 
successor options (C in the exapmle above) to determine the 
longest one.

 it's obvious how backtracking comes in ... whan say A fails 
 somewhere in the middle of it.

Simply stop tracking respective state. Process others in parallel 
as I described above.

 Of course there is a fair amount of bookkeeping that reduces 
 doing work all over again but it does ... allocate.

The amount of used memory would be proportional to the length of 
input after the branching point times number of rules (actually, 
of states in the DFA, which is likely larger but still finite for 
a particular grammar).

 Tokens.. there is no such term in use if we talk about 'pure' 
 PEG.

 Terminal symbols.

 Characters.

Nope. According to the definition, PEG is a set of non-terminal 
symbols, terminal symbols, production rules, and a starting 
non-terminal. Terminals are not necessarily characters.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 01:49, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
 On 08-Jul-12 01:24, Roman D. Boiko wrote:
 But PEG itself doesn't require backtracking parsing, does it?

 It does. Specifically the algorithm is to try option A, try option B,
 try option C...

 There is no algorithm, only __grammar__. It specifies that option A has
 higher priority than option B in expression A / B / C. But it doesn't
 forbid, for example, to analyze all in parallel (track state
 transitions) for each character consequently, as a proper DFA
 implementation would do, until some option succeeds and all previous
 fail.

That would not always work, but yes that's what we should do IMHO.
It's not what packrats do however (when I say algorithm I mean 
specifically packrat). And PEGs are commonly associated with packrats.

  A greedy regex would also have to check all successor options (C
 in the exapmle above) to determine the longest one.

Yes, yet regex  engine (in simple form) can 'merge' identical parallel 
states*  easily it's not so simple with general CFGs.

*As in NFA states, I like to think in multiple NFA states vs single DFA 
state.

 it's obvious how backtracking comes in ... whan say A fails somewhere
 in the middle of it.

 Simply stop tracking respective state. Process others in parallel as I
 described above.

Yeah it could be done, but it doesn't buy you a thing in a lot of cases 
unlike in regular grammar. The problem is that state is not as simple as 
in regex for instance (even in regex that must capture  some submatches 
DFA won't do BTW). Thus you can't merge seemingly identical states 
because they depend on (different) interpretation of previous part of 
input.

That's why DFA in ANTLR is only used to do lookahead or lexing, it 
doesn't maintain path through states during parsing.

 Of course there is a fair amount of bookkeeping that reduces doing
 work all over again but it does ... allocate.

 The amount of used memory would be proportional to the length of input
 after the branching point times number of rules (actually, of states in
 the DFA, which is likely larger but still finite for a particular grammar).

Finite and proportional to input size? Another way to say why an O(n) 
memory requirement for packrats?

 Tokens.. there is no such term in use if we talk about 'pure' PEG.

 Terminal symbols.

 Characters.

 Nope. According to the definition, PEG is a set of non-terminal symbols,
 terminal symbols, production rules, and a starting non-terminal.
 Terminals are not necessarily characters.

Yup. But I'm not talking definitions here I'm talking the notion of 
"integrated lexing" and lexing is certainly about characters or was it?


-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote:
 That would not always work, but yes that's what we should do 
 IMHO.

I didn't do a deeper research on this, just shared my current 
understanding. When that doesn't work in a particular case, we 
will need to find the problem and solve that.

 It's not what packrats do however (when I say algorithm I mean 
 specifically packrat). And PEGs are commonly associated with 
 packrats.

As Philippe Sigaud admitted, that is an incorrect association.

 it's obvious how backtracking comes in ... whan say A fails 
 somewhere in the middle of it.

 Simply stop tracking respective state. Process others in 
 parallel as I described above.

 Yeah it could be done, but it doesn't buy you a thing in a lot 
 of cases unlike in regular grammar. The problem is that state 
 is not as simple as in regex for instance (even in regex that 
 must capture  some submatches DFA won't do BTW). Thus you can't 
 merge seemingly identical states because they depend on 
 (different) interpretation of previous part of input.

Well, if you must track the path to a particular state, then DFA 
simply doesn't match the problem and there's nothing we can do.

 That's why DFA in ANTLR is only used to do lookahead or lexing, 
 it doesn't maintain path through states during parsing.

See above. Is maintaining path really necessary? I thought that 
DFA with additional states could be created for any particular 
situation of ambiguity, etc. This needs further research.

 Of course there is a fair amount of bookkeeping that reduces 
 doing
 work all over again but it does ... allocate.

 The amount of used memory would be proportional to the length 
 of input
 after the branching point times number of rules (actually, of 
 states in
 the DFA, which is likely larger but still finite for a 
 particular grammar).

 Finite and proportional to input size? Another way to say why 
 an O(n) memory requirement for packrats?

Sorry, I incorrectly included the O(n) multiplier. It should be 
finite. When you go to the next character, you don't need 
previous states any more.

 Yup. But I'm not talking definitions here I'm talking the 
 notion of "integrated lexing" and lexing is certainly about 
 characters or was it?

I thought integrated lexing was about doing it along with parsing 
and specifying its rules directly inside the grammar. From the 
former we get structural context to narrow-down available 
options, and from the latter we have a uniform grammar 
description.

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 08-Jul-12 00:09, Andrei Alexandrescu wrote:
 On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

 Interesting. I'll have to ask for more details on why.

I think I told that before but anyway the main point is:
the reason lexer exists is performance.

It's obvious that one can write grammar without ever using lexer. If the 
notation is good say regex or the one used in PEG (that is not quite the 
same it turns out) then token can be easily describe in grammar. Once we 
got lexer can give only 2 things:
	-manageability, it's jsut splitting grammar in 2 parts that could be 
maintained separately (in fact some "lexers" are not DFA nor regular)
(sometimes it could go the opposite direction - it could be  better to 
have one integrated grammar)
	- speed. The reason it could be faster is because lexer can use highly 
deterministic grammar.

Now back to PEGs and packrats. What I see everywhere I look is 
essentially integration of a backtracking-NFA lexer, that is not fast 
nor is particularly convenient. The notation is the whole other matter.

So my view on the matter: whether or not lexer is integrated is two-fold 
question: notation vs implementation.

E.g. it may make regular lexer-like things a part of notation thus 
having rules like:
	Identifier  -> Alpha (Alpha|Number|_)*

Yet it may or may use the same engine for _all_ _rules_, in fact if 
implementation optimize things by ripping off regular parts of grammar 
to small DFA engines it would make some kind of lexer behind the scenes!

Anyway highly hybrid parsers are all the rage now, and reasons are 
obvious - they runs fast and can be bend by some magic to quite 
convoluted grammars of modern languages ;)

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote:
 So my view on the matter: whether or not lexer is integrated is 
 two-fold question: notation vs implementation.

 E.g. it may make regular lexer-like things a part of notation 
 thus having rules like:
 	Identifier  -> Alpha (Alpha|Number|_)*

 Yet it may or may use the same engine for _all_ _rules_, in 
 fact if implementation optimize things by ripping off regular 
 parts of grammar to small DFA engines it would make some kind 
 of lexer behind the scenes!

This is the implementation I have in mind as the only sane way to 
parse PEG efficiently. Plus specializing DFA to only check for 
those terminals which are allowed according to available 
structural information at the moment of their invocation.

 Anyway highly hybrid parsers are all the rage now, and reasons 
 are obvious - they runs fast and can be bend by some magic to 
 quite convoluted grammars of modern languages ;)

Jul 07 2012

deadalnix <deadalnix gmail.com> writes:

On 07/07/2012 21:17, Dmitry Olshansky wrote:
 On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
 On 7/7/12 6:24 AM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
 http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk

 So far it looks like LALR parsers may have lower constant factors than
 Packrat.

 The difference could be minimized by paying attention to parsing of
 terminal symbols, which was in my plans already. It is not necessary to
 strictly follow Packrat parsing algorithm.

 The benefits of Pegged, in my view, are its support of Parsing
 Expression Grammar (PEG) and compile-time evaluation. It is easily
 extensible and modifiable.

 Isn't also the fact that lexing and parsing are integrated? With
 traditional LALR you need a separate tokenizer.

 I'll have to point out that the whole point about integrated lexing is
 moot. It's more of liability then benefit. At very least it's just
 implementation curiosity not advantage.

Many usages only need a lexer.

Jul 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, July 05, 2012 14:22:35 Andrei Alexandrescu wrote:
 I also am actively opposed to a project of just translating D's 
 front-end to D and dropping it into Phobos because it would smother (a) 
 work on generic parser generators, and (b) strong, dependable 
 formalization of D's syntax.

I'm not even vaguely convinced that having a lexer/parser specifically for D in 
Phobos (regardless of whether it comes from dmd or not) will have a negative 
effect on work for making generic parser generators. People are going to want a 
good parser generator _regardless_ of what the situation with parsing D is. 
And I'd be very surprised if you couldn't make a hand-written parser for D 
which was better than one that you can generate. Parser generation is great, 
because it allows you to quickly and easily put together a parser from a 
grammar, but it's unlikely to be as fast as a hand-written one optimized for a 
particular language. However, as the recent discussion on popFront shows, only 
benchmarking of actual solutions would show that for sure.

Now, the issue of a "strong, dependable formalization of D's syntax" is 
another thing entirely. Porting dmd's lexer and parser to Phobos would keep 
the Phobos implementation in line with dmd much more easily and avoid 
inconsistencies in the language definition and the like. However, if we write a 
new lexer and parser specifically for Phobos which _doesn't_ port the lexer or 
parser from dmd, then that _would_ help drive making the spec match the 
compiler (or vice versa). So, I agree that could be a definite argument for 
writing a lexer and parser from scratch rather than porting the one from dmd, 
but I don't buy the bit about it smothering parser generators at all. I think 
that the use cases are completely different.

- Jonathan M Davis

Jul 06 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-07 03:12, Jonathan M Davis wrote:

 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if we write a
 new lexer and parser specifically for Phobos which _doesn't_ port the lexer or
 parser from dmd, then that _would_ help drive making the spec match the
 compiler (or vice versa). So, I agree that could be a definite argument for
 writing a lexer and parser from scratch rather than porting the one from dmd,
 but I don't buy the bit about it smothering parser generators at all. I think
 that the use cases are completely different.

I think the whole point of having a compiler as a library is that the 
compiler should use the library as well. Otherwise the two will get out 
of sync.

Just look at Clang, LLVM, LLDB and Xcode, they took the correct 
approach. Clang and LLVM (and I think LLDB) are available as libraries. 
Then the compiler, debugger (lldb) and IDE uses these libraries as part 
of their implementation. They don't have their own implementation that 
is similar to the libraries, making it "easy" to stay in sync. They 
_use_ the libraries as libraries.

This is what DMD and Phobos should do as well. If it's too complicated 
to port the lexer/parser to D then it would be better, at least as a 
first step, to modify DMD as needed. Create a C API for DMD and then 
create D bindings to be put into Phobos.

-- 
/Jacob Carlborg

Jul 07 2012

deadalnix <deadalnix gmail.com> writes:

On 07/07/2012 13:05, Jacob Carlborg wrote:
 On 2012-07-07 03:12, Jonathan M Davis wrote:

 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would
 keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if
 we write a
 new lexer and parser specifically for Phobos which _doesn't_ port the
 lexer or
 parser from dmd, then that _would_ help drive making the spec match the
 compiler (or vice versa). So, I agree that could be a definite
 argument for
 writing a lexer and parser from scratch rather than porting the one
 from dmd,
 but I don't buy the bit about it smothering parser generators at all.
 I think
 that the use cases are completely different.

 I think the whole point of having a compiler as a library is that the
 compiler should use the library as well. Otherwise the two will get out
 of sync.

 Just look at Clang, LLVM, LLDB and Xcode, they took the correct
 approach. Clang and LLVM (and I think LLDB) are available as libraries.
 Then the compiler, debugger (lldb) and IDE uses these libraries as part
 of their implementation. They don't have their own implementation that
 is similar to the libraries, making it "easy" to stay in sync. They
 _use_ the libraries as libraries.

 This is what DMD and Phobos should do as well. If it's too complicated
 to port the lexer/parser to D then it would be better, at least as a
 first step, to modify DMD as needed.

I tried that. This is almost impossible, dmd's parser and AST are very 
tightly mixed with dmd's internals.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote:
 I tried that. This is almost impossible, dmd's parser and AST 
 are very tightly mixed with dmd's internals.

+1

Jul 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-07 18:49, deadalnix wrote:

 I tried that. This is almost impossible, dmd's parser and AST are very
 tightly mixed with dmd's internals.

I suspected that.

-- 
/Jacob Carlborg

Jul 08 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, July 07, 2012 13:05:29 Jacob Carlborg wrote:
 On 2012-07-07 03:12, Jonathan M Davis wrote:
 Now, the issue of a "strong, dependable formalization of D's syntax" is
 another thing entirely. Porting dmd's lexer and parser to Phobos would
 keep
 the Phobos implementation in line with dmd much more easily and avoid
 inconsistencies in the language definition and the like. However, if we
 write a new lexer and parser specifically for Phobos which _doesn't_ port
 the lexer or parser from dmd, then that _would_ help drive making the
 spec match the compiler (or vice versa). So, I agree that could be a
 definite argument for writing a lexer and parser from scratch rather than
 porting the one from dmd, but I don't buy the bit about it smothering
 parser generators at all. I think that the use cases are completely
 different.

 
 I think the whole point of having a compiler as a library is that the
 compiler should use the library as well. Otherwise the two will get out
 of sync.
 
 Just look at Clang, LLVM, LLDB and Xcode, they took the correct
 approach. Clang and LLVM (and I think LLDB) are available as libraries.
 Then the compiler, debugger (lldb) and IDE uses these libraries as part
 of their implementation. They don't have their own implementation that
 is similar to the libraries, making it "easy" to stay in sync. They
 _use_ the libraries as libraries.
 
 This is what DMD and Phobos should do as well. If it's too complicated
 to port the lexer/parser to D then it would be better, at least as a
 first step, to modify DMD as needed. Create a C API for DMD and then
 create D bindings to be put into Phobos.

There are multiple issues here. The one that Andrei is raising is the fact 
that D isn't properly formalized. Essentially, the compiler _is_ the spec, and 
while the online spec _mostly_ follows it, it doesn't entirely, and the online 
spec isn't always as precise as it needs to be regardless. With a fully 
formalized spec, it should be possible to fully implement a D compiler from 
the spec alone, and I don't think that that's currently possible.

Writing a D lexer and parser (if not a full-blown compiler) from scratch would 
help highlight the places in the spec which are lacking, and having it be part 
of Phobos would arguably increase Walter's incentive to make sure that the 
spec is in line with the compiler (and vice versa) so that stuff _other_ than 
the compiler which is based on the spec would be able to match the compiler.

Clang is in a _completely_ different situation. It's a C/C++ compiler, and both 
C and C++ already have official, formalized specs which Clang conforms to (or
is 
supposed to anyway). Clang has no control over the spec at all. It merely 
implements it. So, there is no issue of trying to keep other tools or 
compilers in line with Clang due to it being the language's spec like we 
effectively have with dmd. It may help the tools which use Clang to be fully in 
line with Clang and not have to worry about whether Clang implements the spec 
slightly differently, but in theory, if they all follow the spec correctly, 
that isn't in issue (though obviously in practice it can be).

In D's case, all of the major D compilers use the same frontend, which helps 
compatability but harms the spec, because there's less incentive to keep it 
precise and  up-to-date. So, from the perspective of the spec, implementing 
the D lexer and parser for Phobos separately from dmd would be of great 
benefit.

IMHO, the reason that porting dmd's lexer and parser would be of great benefit 
is primarily maintenance. It makes it much easier to keep Phobos' lexer and 
parser in line with dmd, making discrepencies less likely, but it arguably 
reduces the incentive to improve the spec.

The benefits of having a lexer and parser as a library (regardless of whether 
it's from scratch or a port from dmd) are primarly derived from the fact that 
it makes it much easier to create tools which use them. Such tools no longer 
have to write their own lexers or parsers.

If the compiler uses the same library, it has the added benefit of making it so 
that any tool using the library will be in sync with the compiler, but if the 
spec is properly formalized and up-to-date, and the library is kep up-to-date 
with _it_, that shouldn't really be a problem. You still have the debate as to 
whether it's better to have a separate implementation based on the spec 
(thereby making it more likely that the spec is correct) or whether it's 
better to have the compiler share the implementation so that the library is 
guaranteed to match the compiler (though not necessarily the spec), but I 
think that that's a separate debate from whether we should have the lexer and 
parser as a library.

In all honesty though, I would be surprised if you could convince Walter to 
switch dmd's frontend to Phobos' lexer and parser even once they've been 
implemented. So, while I agree that there are benefits in doing so, I'm not 
sure how much chance you have of ever getting any traction with that.

Another thing to consider is that that might make it so that gdc and ldc 
couldn't share the same frontend with dmd (assuming that they have to keep 
their frontends in C or C++ -  I don't know if they do) - but if so, that 
would increase the incentive for the spec to be correct if dmd ever started 
using a D frontend.

- Jonathan M Davis

Jul 07 2012

"Daniel Murphy" <yebblies nospamgmail.com> writes:

"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.163.1341683651.31962.digitalmars-d puremagic.com...
 [snip]

 In all honesty though, I would be surprised if you could convince Walter 
 to
 switch dmd's frontend to Phobos' lexer and parser even once they've been
 implemented. So, while I agree that there are benefits in doing so, I'm 
 not
 sure how much chance you have of ever getting any traction with that.

 - Jonathan M Davis

A more realistic alternative is to modify dmd to optionally (at build/link 
time) use Phobos' lexer and/or parser, which would allow them to be run 
against the test suite.  Parser and lexer changes are relatively rare in the 
compiler, so keeping two implementations in sync is not really that big a 
deal.

Jul 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-07 19:53, Jonathan M Davis wrote:

 There are multiple issues here. The one that Andrei is raising is the fact
 that D isn't properly formalized. Essentially, the compiler _is_ the spec, and
 while the online spec _mostly_ follows it, it doesn't entirely, and the online
 spec isn't always as precise as it needs to be regardless. With a fully
 formalized spec, it should be possible to fully implement a D compiler from
 the spec alone, and I don't think that that's currently possible.

That's been a problem for a long time.

 Writing a D lexer and parser (if not a full-blown compiler) from scratch would
 help highlight the places in the spec which are lacking, and having it be part
 of Phobos would arguably increase Walter's incentive to make sure that the
 spec is in line with the compiler (and vice versa) so that stuff _other_ than
 the compiler which is based on the spec would be able to match the compiler.

 Clang is in a _completely_ different situation. It's a C/C++ compiler, and both
 C and C++ already have official, formalized specs which Clang conforms to (or
is
 supposed to anyway). Clang has no control over the spec at all. It merely
 implements it. So, there is no issue of trying to keep other tools or
 compilers in line with Clang due to it being the language's spec like we
 effectively have with dmd. It may help the tools which use Clang to be fully in
 line with Clang and not have to worry about whether Clang implements the spec
 slightly differently, but in theory, if they all follow the spec correctly,
 that isn't in issue (though obviously in practice it can be).

That should be true for D as well when the spec is complete.

 In D's case, all of the major D compilers use the same frontend, which helps
 compatability but harms the spec, because there's less incentive to keep it
 precise and  up-to-date. So, from the perspective of the spec, implementing
 the D lexer and parser for Phobos separately from dmd would be of great
 benefit.

That might be true.

 IMHO, the reason that porting dmd's lexer and parser would be of great benefit
 is primarily maintenance. It makes it much easier to keep Phobos' lexer and
 parser in line with dmd, making discrepencies less likely, but it arguably
 reduces the incentive to improve the spec.

But then it sounds like the best solution would be not to have a 
lexer/parser based on DMD and instead making sure the spec is correct.

 The benefits of having a lexer and parser as a library (regardless of whether
 it's from scratch or a port from dmd) are primarly derived from the fact that
 it makes it much easier to create tools which use them. Such tools no longer
 have to write their own lexers or parsers.

Of course, that's the whole point.

 If the compiler uses the same library, it has the added benefit of making it so
 that any tool using the library will be in sync with the compiler, but if the
 spec is properly formalized and up-to-date, and the library is kep up-to-date
 with _it_, that shouldn't really be a problem. You still have the debate as to
 whether it's better to have a separate implementation based on the spec
 (thereby making it more likely that the spec is correct) or whether it's
 better to have the compiler share the implementation so that the library is
 guaranteed to match the compiler (though not necessarily the spec), but I
 think that that's a separate debate from whether we should have the lexer and
 parser as a library.

What keeps popping up in my head is the scenario where users are 
complaining about the frontend in Phobos not behaving as their compiler. 
If this is due to they are out of sync, bugs or not following the spec 
doesn't matter.

I still thinks D is changing too much to have two separate 
implementations of the compiler and a library like this.

 In all honesty though, I would be surprised if you could convince Walter to
 switch dmd's frontend to Phobos' lexer and parser even once they've been
 implemented. So, while I agree that there are benefits in doing so, I'm not
 sure how much chance you have of ever getting any traction with that.

That is perhaps true.

 Another thing to consider is that that might make it so that gdc and ldc
 couldn't share the same frontend with dmd (assuming that they have to keep
 their frontends in C or C++ -  I don't know if they do) - but if so, that
 would increase the incentive for the spec to be correct if dmd ever started
 using a D frontend.

-- 
/Jacob Carlborg

Jul 08 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, July 08, 2012 14:13:18 Jacob Carlborg wrote:
 On 2012-07-07 19:53, Jonathan M Davis wrote:
 There are multiple issues here. The one that Andrei is raising is the fact
 that D isn't properly formalized. Essentially, the compiler _is_ the spec,
 and while the online spec _mostly_ follows it, it doesn't entirely, and
 the online spec isn't always as precise as it needs to be regardless.
 With a fully formalized spec, it should be possible to fully implement a
 D compiler from the spec alone, and I don't think that that's currently
 possible.

 
 That's been a problem for a long time.

True, but Andrei is basically arguing that porting dmd's lexer and parser to D 
would reduce the incentive to fix this, whereas having a new lexer/parser would 
encourage fixing it (though he's arguing for purely using a generative parser 
rather than writing a D-specific one).

 IMHO, the reason that porting dmd's lexer and parser would be of great
 benefit is primarily maintenance. It makes it much easier to keep Phobos'
 lexer and parser in line with dmd, making discrepencies less likely, but
 it arguably reduces the incentive to improve the spec.

 
 But then it sounds like the best solution would be not to have a
 lexer/parser based on DMD and instead making sure the spec is correct.

That would be the question. Given enough time, what I'd probably want to do 
would be to port dmd's lexer and parser to D in order to properly figure it all 
out, updating the spec in the process, and _then_ go write one from scratch, 
possibly basing some of it on how dmd did it, possibly not. But I don't really 
have the time for any of that right now, and most of the focus right now from 
people interested in parsing seems to be on pegged and parser generators 
(which are very cool and in some ways much more interesting, but I seriously 
question that that's the performant way to go if you're looking to parse D 
specifically). So, who knows what we're going to get out of this and when we'll 
get it.

 What keeps popping up in my head is the scenario where users are
 complaining about the frontend in Phobos not behaving as their compiler.
 If this is due to they are out of sync, bugs or not following the spec
 doesn't matter.
 
 I still thinks D is changing too much to have two separate
 implementations of the compiler and a library like this.

Well, for the lexer and parser, we're probably okay (or very close to it). As 
Daniel pointed out elsewhere in this thread, that part of the frontend doesn't 
change much (particularly the lexer). There's definitely still some churn, but 
it's nothing like it used to be.

- Jonathan M Davis

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:
 most of the focus right now from people interested in parsing 
 seems to be on pegged and parser generators (which are very 
 cool and in some ways much more interesting, but I seriously 
 question that that's the performant way to go if you're looking 
 to parse D specifically).

Can you provide a *specific* example of performance optimization 
which a custom D compiler would have, but parser generator would 
be unlikely to catch up.

Jul 08 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, July 08, 2012 22:15:26 Roman D. Boiko wrote:
 On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:
 most of the focus right now from people interested in parsing
 seems to be on pegged and parser generators (which are very
 cool and in some ways much more interesting, but I seriously
 question that that's the performant way to go if you're looking
 to parse D specifically).

 
 Can you provide a *specific* example of performance optimization
 which a custom D compiler would have, but parser generator would
 be unlikely to catch up.

It's been too long since I was actively working on parsers to give any 
details, but it is my understanding that because a hand-written parser is 
optimized for a specific grammar, it's going to be faster. Also, look at dmd 
and dmc vs other compilers. They use hand-written parsers and are generally 
much faster than their competitors.

One thing to remember about hand-written parsers vs generative ones though is 
that they usually are completely different in terms of the type of parser that 
you write (e.g. hand-written parsers are generally recursive-decent parser 
whereas generative ones usually use bottom-up parsers). So, that could have a 
large impact on performance as well (in either direction).

To be clear though, I have _no_ problem with having a generative parser in 
Phobos (or having other 3rd party options available). Parsers like pegged are 
extremely cool and extremely useful. It's just that it's my understanding that 
well-written hand-written parsers are faster than generated ones, so I think 
that it would be benecial to have a hand-written parser for D in Phobos _in 
addition_ to a general, generative solution.

But to fully prove that a hand-written one would be faster, we'd of course 
have to have actual solutions to compare. And if the API for a D-specific 
parser in Phobos is designed well enough, and it somehow proved that a 
generative solution was faster, then the hand-written one could be replaced by 
the generative one underneat the hood.

- Jonathan M Davis

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
 It's been too long since I was actively working on parsers to 
 give any details, but it is my understanding that because a 
 hand-written parser is optimized for a specific grammar, it's 
 going to be faster.

My aim is to find out any potential bottlenecks and ensure that 
those are possible to get rid of. So, let's try.

I believe it would not hurt generality or quality of a parser 
generator if it contained sews for inserting custom (optimized) 
code where necessary, including those needed to take advantage of 
some particular aspects of D grammar. Thus I claim that 
optimization for D grammar is possible.

 Also, look at dmd and dmc vs other compilers. They use 
 hand-written parsers and are generally much faster than their 
 competitors.

Due to which particular design or implementation decisions? Would 
it be possible to name a few which are relevant in this context?

 One thing to remember about hand-written parsers vs generative 
 ones though is that they usually are completely different in 
 terms of the type of parser that you write (e.g. hand-written 
 parsers are generally recursive-decent parser whereas 
 generative ones usually use bottom-up parsers).

So far discussion goes in favor of LL(*) parser like ANTLR, which 
is top-down recursive-descent. Either Pegged will be optimized 
with LL(*) algorithms, or a new parser generator created.

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

 I believe it would not hurt generality or quality of a parser 
 generator if it contained sews for inserting custom (optimized)

s/sews/seams

Jul 08 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:
 On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
 It's been too long since I was actively working on parsers to 
 give any details, but it is my understanding that because a 
 hand-written parser is optimized for a specific grammar, it's 
 going to be faster.

 My aim is to find out any potential bottlenecks and ensure that 
 those are possible to get rid of. So, let's try.

 I believe it would not hurt generality or quality of a parser 
 generator if it contained sews for inserting custom (optimized) 
 code where necessary, including those needed to take advantage 
 of some particular aspects of D grammar. Thus I claim that 
 optimization for D grammar is possible.

I'm convinced that the output of a parser generator (PG) can be 
very nearly as fast as hand-written code. ANTLR's output (last I 
checked) was not ideal, but the one I planned to make (a few 
years ago) would have produced faster code.

By default the PG's output will not be the speed of hand-written 
code, but the user can optimize it. Assuming an ANTLR-like PG, 
the user can inspect the original output looking for inefficient 
lookahead, or cases where the parser looks for rare cases before 
common cases, and then improve the grammar and insert ... I 
forget all the ANTLR terminology ... syntactic predicates or 
whatever, to optimize the parser.

 So far discussion goes in favor of LL(*) parser like ANTLR, 
 which is top-down recursive-descent. Either Pegged will be 
 optimized with LL(*) algorithms, or a new parser generator 
 created.

Right, for instance I am interested in writing a top-down PG 
because I understand them better and prefer the top-down approach 
due to its flexibility (semantic actions, allowing custom code) 
and understandability (the user can realistically understand the 
output; in fact readability would be a specific goal of mine)

Roman, regarding what you were saying to me earlier:
In stage 2 you have only performed some basic analysis, like, 
e.g., matched braces to define some hierarchy. This means that 
at the time when you find a missing brace, for example, you 
cannot tell anything more than that braces don't match.

Stage 2 actually can tell more than just "a brace is missing 
somewhere". Because so many languages are C-like. So given this 
situation:

    frob (c &% x)

    }

It doesn't need to know what language this is to tell where the 
brace belongs. Even in a more nebulous case like:

    frob (c &% x) bar   lic

    }

probably the brace belongs at the end of the first line.

Perhaps your point is that there are situations where a parser 
that knows the "entire" grammar could make a better guess about 
where the missing brace/paren belongs. That's certainly true.

However, just because it can guess better, doesn't mean it can 
reinterpret the code based on that guess. I mean, I don't see any 
way to "back up" a parser by an arbitrary amount. A hypothetical 
stage 2 would probably be hand-written and could realistically 
back up and insert a brace/paren anywhere that the heuristics 
dictate, because it is producing a simple data structure and it 
doesn't need to do any semantic actions as it parses. A "full" 
parser, on the other hand, has done a lot of work that it can't 
undo, so the best it can do is report to the user "line 54: 
error: brace mismatch; did you forget a brace on line 13?" The 
heuristic is still helpful, but it has already parsed lines 13 to 
54 in the wrong context (and, in some cases, has already split 
out a series of error messages that are unrelated to the user's 
actual mistake).

 As I demonstrated in some examples, it could get the output 
 which implies incorrect structure

I was unable to find the examples you refer to... this thread's 
getting a little unweildy :)

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

David, I would suggest you to create a proof-of-concept prototype 
and pay attention to some non-trivial use cases. This way you'd 
likely be able to reduce major risks significantly before making 
any serious time investments.

Jul 08 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-08 22:15, Roman D. Boiko wrote:

 Can you provide a *specific* example of performance optimization which a
 custom D compiler would have, but parser generator would be unlikely to
 catch up.

I'm not expert in this field but I've heard that it's harder to get good 
error reporting with a parser generator.

-- 
/Jacob Carlborg

Jul 08 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Monday, 9 July 2012 at 06:35:38 UTC, Jacob Carlborg wrote:
 I'm not expert in this field but I've heard that it's harder to 
 get good error reporting with a parser generator.

Good point!

Jul 09 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-08 22:05, Jonathan M Davis wrote:

 Well, for the lexer and parser, we're probably okay (or very close to it). As
 Daniel pointed out elsewhere in this thread, that part of the frontend doesn't
 change much (particularly the lexer). There's definitely still some churn, but
 it's nothing like it used to be.

It don't feel like that, didn't we get lambdas or UFCS in the last release?

-- 
/Jacob Carlborg

Jul 08 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, July 09, 2012 08:33:31 Jacob Carlborg wrote:
 On 2012-07-08 22:05, Jonathan M Davis wrote:
 Well, for the lexer and parser, we're probably okay (or very close to it).
 As Daniel pointed out elsewhere in this thread, that part of the frontend
 doesn't change much (particularly the lexer). There's definitely still
 some churn, but it's nothing like it used to be.

 
 It don't feel like that, didn't we get lambdas or UFCS in the last release?

lambdas are a bit older than that IIRC, and I don't think that UFCS actually 
affects the lexer or parser. Yes, changes are still made, but they're 
increasingly rare, and most of the stuff being changed is on the semantic level 
(most of which is bug fixes). So, out of all of the parts of the compiler to 
duplicate, those are the least likely to have to have major changes made to 
them later (especially the lexer).

- Jonathan M Davis

Jul 08 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-09 08:39, Jonathan M Davis wrote:

 lambdas are a bit older than that IIRC, and I don't think that UFCS actually
 affects the lexer or parser. Yes, changes are still made, but they're
 increasingly rare, and most of the stuff being changed is on the semantic level
 (most of which is bug fixes). So, out of all of the parts of the compiler to
 duplicate, those are the least likely to have to have major changes made to
 them later (especially the lexer).

I'm pretty sure UFCS affects lexing or parsing. How else would this be 
legal:

4.foo();

-- 
/Jacob Carlborg

Jul 09 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, July 09, 2012 09:16:41 Jacob Carlborg wrote:
 On 2012-07-09 08:39, Jonathan M Davis wrote:
 lambdas are a bit older than that IIRC, and I don't think that UFCS
 actually affects the lexer or parser. Yes, changes are still made, but
 they're increasingly rare, and most of the stuff being changed is on the
 semantic level (most of which is bug fixes). So, out of all of the parts
 of the compiler to duplicate, those are the least likely to have to have
 major changes made to them later (especially the lexer).

 
 I'm pretty sure UFCS affects lexing or parsing. How else would this be
 legal:
 
 4.foo();

That definitely wouldn't affect lexing, because it doesn't affect the tokens at 
all. Whether it affects the parser would depend on the grammar. There's a good 
chance it would affect parsing, but it might not. It depends on how numeric 
literals are treated. In most cases though, UFCS definitely wouldn't affect 
parsing at all, because it's purely a matter of symbol resolution, so if 
changes had to be made to the parser, they would almost certainly have been 
minimal.

Yes, some changes are still being made to the parser, but since the language 
is mostly frozen, almost all of the changes have gone into bug fixes, which are 
generally semantic issues and don't affect lexing or parsing at all. So, while 
there would still be some maintenance issues in keeping a separate lexer and 
parser in line with dmd, I really don't think that it would take much at this 
point. Most of the work would be in getting them to match dmd when they're 
first written.

- Jonathan M Davis

Jul 09 2012

"Daniel Murphy" <yebblies nospamgmail.com> writes:

"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.190.1341818983.31962.digitalmars-d puremagic.com...
 I'm pretty sure UFCS affects lexing or parsing. How else would this be
 legal:

 4.foo();

 That definitely wouldn't affect lexing, because it doesn't affect the 
 tokens at
 all.

Not true.  This used to be lexed as '4.f' 'oo'. (I think)

Jul 09 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/09/2012 08:33 AM, Jacob Carlborg wrote:
 On 2012-07-08 22:05, Jonathan M Davis wrote:

 Well, for the lexer and parser, we're probably okay (or very close to
 it). As
 Daniel pointed out elsewhere in this thread, that part of the frontend
 doesn't
 change much (particularly the lexer). There's definitely still some
 churn, but
 it's nothing like it used to be.

 It don't feel like that, didn't we get lambdas or UFCS in the last release?

Those changes are trivial to incorporate into a well written parser.

Jul 09 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

05.07.2012 20:28, Jonathan M Davis пишет:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because someone really
 wants it done, takes the time to do it, and sees it through until its done and
 in Phobos.

I didn't want for someone to code anything at all! I on the contrary 
want them to stop writing parsers because it results in the only 
consequence: one have to spend more time to find a better parser.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Jul 05 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, July 05, 2012 22:23:00 Denis Shelomovskij wrote:
 05.07.2012 20:28, Jonathan M Davis =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do i=


t
 themselves, and in general, stuff around here gets done because som=


eone
 really wants it done, takes the time to do it, and sees it through =


until
 its done and in Phobos.

=20
 I didn't want for someone to code anything at all! I on the contrary
 want them to stop writing parsers because it results in the only
 consequence: one have to spend more time to find a better parser.

Well, until a lexer and parser for D make it into Phobos, more people a=
re=20
going to keep writing them, and even if/when they _do_ make it into Pho=
bos,=20
people will keep writing them, because some people like to write that k=
ind of=20
thing.

Honestly though, if your complaint is that there's too much choice, I d=
on't=20
have much sympathy for you. In general, we have too little D code out t=
here,=20
not too much. If there's a problem with more parsers being written, I t=
hink=20
that it's almost purely an issue of some people's time probably being b=
etter=20
spent on other projects, but it's their right to work on whatever they =
feel=20
like working on.

- Jonathan M Davis

Jul 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 05-Jul-12 22:23, Denis Shelomovskij wrote:
 05.07.2012 20:28, Jonathan M Davis пишет:
 It's all too common for someone to suggest that we should
 do something or implement something without ever attempting to do it
 themselves, and in general, stuff around here gets done because
 someone really
 wants it done, takes the time to do it, and sees it through until its
 done and
 in Phobos.

 I didn't want for someone to code anything at all! I on the contrary
 want them to stop writing parsers because it results in the only
 consequence: one have to spend more time to find a better parser.

It doesn't work like that in OpenSource. No matter what you do people 
keep writing code ;)

-- 
Dmitry Olshansky

Jul 05 2012

deadalnix <deadalnix gmail.com> writes:

Le 05/07/2012 18:28, Jonathan M Davis a écrit :
 On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

 Haven't we already done that a couple of times.

 Well, we did something like that for DCT... but I doubt that it
 would fit general needs.

 If we had, why haven't they been analyzed, classified, discussed,
 etc.? Or have they?

 It's been discussed before, but there are some obvious use cases such as
 syntax highlighting, code formatting, and manipulation of D source files (e.g.
 to strip out the unit tests).

 We need to have a lexer in Phobos which parsers D code and generates a range
 of tokens, and we need to have a parser in Phobos which operates on a range of
 tokens. As discussed previously, there is desire to have the lexer and parser
 ported from dmd's frontend to D for Phobos so that what we get is easily
 maintainable alongside dmd's frontend and produces the same results (and is
 fast).

You never looked at dmd frontend soure code don't you ? It is a horror 
museum, and if we want to have something in D, I certainly wouldn't copy 
that.

Jul 05 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, July 06, 2012 00:24:11 deadalnix wrote:
 Le 05/07/2012 18:28, Jonathan M Davis a =C3=A9crit :
 On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
 On 2012-07-05 15:08, Roman D. Boiko wrote:
 Anyway I propose to enumerate major use cases first.

=20
 Haven't we already done that a couple of times.

=20
 Well, we did something like that for DCT... but I doubt that it
 would fit general needs.
=20
 If we had, why haven't they been analyzed, classified, discussed,
 etc.? Or have they?

=20
 It's been discussed before, but there are some obvious use cases su=


ch as
 syntax highlighting, code formatting, and manipulation of D source =


files
 (e.g. to strip out the unit tests).
=20
 We need to have a lexer in Phobos which parsers D code and generate=


s a
 range of tokens, and we need to have a parser in Phobos which opera=


tes on
 a range of tokens. As discussed previously, there is desire to have=


 the
 lexer and parser ported from dmd's frontend to D for Phobos so that=


 what
 we get is easily maintainable alongside dmd's frontend and produces=


 the
 same results (and is fast).

=20
 You never looked at dmd frontend soure code don't you ? It is a horro=

r
 museum, and if we want to have something in D, I certainly wouldn't c=

opy
 that.

I have definitely looked at dmd's frontend's source code. The idea behi=
nd=20
copying the lexer and parser from there is then that they'd match the=20=

compiler, which would make it much easier to keep them in sync with cha=
nges to=20
the compiler. Whether that's what someone would have written had they d=
one it=20
purely in D initially is pretty much irrelevant.

- Jonathan M Davis

Jul 05 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/05/2012 08:28 AM, Roman D. Boiko wrote:
 Pegged may eventually become standard, if it will be performanceoptimized and
a bit more customizable

Interesting, I thought you were hand-writing this stuff.

I'm a fan of pegged and made some pull requests, one of which was aimed 
at making it more customizable by allowing the user to define what type 
gets used as a parse tree node, thus allowing one to potentially use 
their parse tree as an AST (and maybe do a couple other things too). 
It's a WIP, but the proof of concept is done: DMD can handle the extra 
templating at compile time, so it works.

What kind of things did you want in terms of customizability?

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 15:42:30 UTC, Chad J wrote:
 On 07/05/2012 08:28 AM, Roman D. Boiko wrote:
 Pegged may eventually become standard, if it will be 
 performanceoptimized and a bit more customizable

 Interesting, I thought you were hand-writing this stuff.

 I'm a fan of pegged and made some pull requests, one of which 
 was aimed at making it more customizable by allowing the user 
 to define what type gets used as a parse tree node, thus 
 allowing one to potentially use their parse tree as an AST (and 
 maybe do a couple other things too). It's a WIP, but the proof 
 of concept is done: DMD can handle the extra templating at 
 compile time, so it works.

 What kind of things did you want in terms of customizability?

There are many possible customization point which would make it a 
better fit for my project while also being useful in general.

The most critical was the one you implemented: ability to define 
a custom parse tree. I also needed the ability to use immutable 
structures (almost) everywhere, while currently they must be 
mutable. Also it would be great to avoid UTF conversions (instead 
use functions and types templated by the string type). I also 
wanted to add ability to reuse parts of previously generated AST 
which correspond to source code that has not been changed, or to 
identical source code parsed previously. (This would allow 
incremental parsing, e.g., while user is typing.) The latter 
would require to track source code positions separately, or even 
generating them on demand (this is the feature actively 
criticized by deadalnix in some previous discussions but central 
to DCT architecture). AST would only hold information about 
widths of its nodes.

I've also written some notes (10-15 suggestions) while studying 
Pegged code, which will be shared later.

However, as I found a few hours ago, Packrat parsing (typically 
used to handle PEG) has serious disadvantages: it complicates 
debugging because of frequent backtracking, it has problems with 
error recovery, and typically disallows to add actions with side 
effects (because of possibility of backtracking). These are 
important enough to reconsider my plans of using Pegged. I will 
try to analyze whether the issues are so fundamental that I (or 
somebody else) will have to create an ANTLR-like parser instead, 
or whether it is possible to introduce changes into Pegged that 
would fix these problems.

Jul 07 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sat, Jul 7, 2012 at 6:06 PM, Roman D. Boiko <rb d-coding.com> wrote:

 The most critical was the one you implemented: ability to define a custom
 parse tree. I also needed the ability to use immutable structures (almost)
 everywhere, while currently they must be mutable. Also it would be great to
 avoid UTF conversions (instead use functions and types templated by the
 string type).

I added dstrings because

1- at the time (a few months ago), the lists here were awash in UTF-32
discussions and I thought that'd be the way to go anyway
2- other D parsing libraries seemed to go to UTF32 also (CTPG)
3- I wanted to be able to parse mathematical notation like nabla,
derivatives, etc. which all have UTF32 symbols.


 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging because of
 frequent backtracking, it has problems with error recovery, and typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of using
 Pegged. I will try to analyze whether the issues are so fundamental that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would fix these
 problems.

Note that PEG does not impose to use packrat parsing, even though it
was developed to use it. I think it's a historical 'accident' that put
the two together: Bryan Ford thesis used the two together.

Note that many PEG parsers do not rely on packrat (Pegged does not).
There are a bunch of articles on Bryan Ford's website by a guy
writting a PEG parser for Java, and who found that storing the last
rules was enought to get a slight speed improvement, buth that doing
anymore sotrage was detrimental to the parser's overall efficiency.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
 I added dstrings because

 1- at the time (a few months ago), the lists here were awash in 
 UTF-32
 discussions and I thought that'd be the way to go anyway
 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
 3- I wanted to be able to parse mathematical notation like 
 nabla,
 derivatives, etc. which all have UTF32 symbols.

I propose to switch code to use S if(isSomeString!S) everywhere. 
Client code would first determine source encoding scheme, and 
then instantiate parsers specifying a string type. This is not a 
trivial change, but I'm willing to help implementing it.

 Note that PEG does not impose to use packrat parsing, even 
 though it was developed to use it. I think it's a historical 
 'accident' that put the two together: Bryan Ford thesis used 
 the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does 
 not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the 
 last rules was enought to get a slight speed improvement, buth 
 that doing anymore sotrage was detrimental to the parser's 
 overall efficiency.

That's great! Anyway I want to understand the advantages and 
limitations of both Pegged and ANTLR, and probably study some 
more techniques. Such research consumes a lot of time but can be 
done incrementally along with development.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:

 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the last
 rules was enought to get a slight speed improvement, buth that doing
 anymore sotrage was detrimental to the parser's overall efficiency.

 That's great! Anyway I want to understand the advantages and limitations
 of both Pegged and ANTLR, and probably study some more techniques. Such
 research consumes a lot of time but can be done incrementally along with
 development.

Yeah, it's good to hear this notion reinforced.  I had this suspicion 
that the packrat parser is not necessarily the best/fastest solution, 
mostly because of the large allocation that has to happen before you get 
O(n) performance.  Thus I figured that pegged might eventually use 
different parsing strategies underneath it all, possibly with a lot of 
special-casing and clever hand-tuned and profiled optimizations.  At 
least that's what makes sense to me.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced.  I had this 
 suspicion that the packrat parser is not necessarily the 
 best/fastest solution, mostly because of the large allocation 
 that has to happen before you get O(n) performance.  Thus I 
 figured that pegged might eventually use different parsing 
 strategies underneath it all, possibly with a lot of 
 special-casing and clever hand-tuned and profiled 
 optimizations.  At least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where 
possible. This is the approach implemented in ANTLR, but I wanted 
to introduce them before I knew about existence of the latter, 
simply because this would likely produce the fastest parsers 
possible.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 01:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

 At the very least, we could use DFA instead of backtracking where
 possible. This is the approach implemented in ANTLR, but I wanted to
 introduce them before I knew about existence of the latter, simply
 because this would likely produce the fastest parsers possible.

These were my thoughts exactly, although somewhat unsubstantiated in my 
case ;)

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 22:29, Chad J wrote:
 On 07/07/2012 01:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

 At the very least, we could use DFA instead of backtracking where
 possible. This is the approach implemented in ANTLR, but I wanted to
 introduce them before I knew about existence of the latter, simply
 because this would likely produce the fastest parsers possible.

 These were my thoughts exactly, although somewhat unsubstantiated in my
 case ;)

Yup the nice thing about ANTLR is the usage of DFA. e.g. it's used for 
disambiguating alternatives that LL(k) could never do.

-- 
Dmitry Olshansky

Jul 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 1:01 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
 Yeah, it's good to hear this notion reinforced. I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you
 get O(n) performance. Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations. At
 least that's what makes sense to me.

 At the very least, we could use DFA instead of backtracking where
 possible. This is the approach implemented in ANTLR, but I wanted to
 introduce them before I knew about existence of the latter, simply
 because this would likely produce the fastest parsers possible.

Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the 
tokenizer?

Andrei

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu 
wrote:
 Doesn't ANTLR use full-fledged character-level LL(*) parsing 
 even in the tokenizer?

Since I didn't understand your question I assume that my 
statement was somehow incorrect (likely because I made some wrong 
assumptions about ANTLR). I didn't know about its existence until 
today and still don't understand it completely. What I think I 
understood is that it uses DFA for deciding which grammar rule to 
apply instead of doing backtracking. I also think that it uses 
DFA for low-level scanning (I'm not sure).

The idea to introduce DFA for both determining which rule to 
apply and lexing of terminal symbols appeared to me much earlier, 
and the suggestion to introduce them into Pegged is one of 
options which I think could extremely improve performance.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 Since I didn't understand your question I assume that my 
 statement was somehow incorrect (likely because I made some 
 wrong assumptions about ANTLR). I didn't know about its 
 existence until today and still don't understand it completely. 
 What I think I understood is that it uses DFA for deciding 
 which grammar rule to apply instead of doing backtracking. I 
 also think that it uses DFA for low-level scanning (I'm not 
 sure).

ANTLR 3 doesn't use a DFA unless it needs to. If unlimited 
lookahead is not called for, it uses standard LL(k) or perhaps it 
uses the modified (approximate? I forget the name) LL(k) from 
ANTLR 2. DFA comes into play, for instance, if you need to check 
what comes after an argument list (of, unlimited, length) before 
you can decide that it *is* an argument list and start the "real" 
parsing (The author says LL(k) is too inefficient so he used a 
restricted form of it; personally I'm not convinced, but I 
digress)

Jul 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/7/12 4:15 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:
 Doesn't ANTLR use full-fledged character-level LL(*) parsing even in
 the tokenizer?

 Since I didn't understand your question I assume that my statement was
 somehow incorrect (likely because I made some wrong assumptions about
 ANTLR). I didn't know about its existence until today and still don't
 understand it completely. What I think I understood is that it uses DFA
 for deciding which grammar rule to apply instead of doing backtracking.
 I also think that it uses DFA for low-level scanning (I'm not sure).

This should help:

http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Andrei

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 22:40:05 UTC, Andrei Alexandrescu
wrote:

 http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Thanks, nice article.

Also found another post:
http://www.antlr.org/wiki/display/~admin/2008/11/30/Example+tree+rewriting+with+patterns,
which is related to some other discussion in this thread :)

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 20:56, Chad J wrote:
 On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:

 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the last
 rules was enought to get a slight speed improvement, buth that doing
 anymore sotrage was detrimental to the parser's overall efficiency.

 That's great! Anyway I want to understand the advantages and limitations
 of both Pegged and ANTLR, and probably study some more techniques. Such
 research consumes a lot of time but can be done incrementally along with
 development.

 Yeah, it's good to hear this notion reinforced.  I had this suspicion
 that the packrat parser is not necessarily the best/fastest solution,
 mostly because of the large allocation that has to happen before you get
 O(n) performance.  Thus I figured that pegged might eventually use
 different parsing strategies underneath it all, possibly with a lot of
 special-casing and clever hand-tuned and profiled optimizations.  At
 least that's what makes sense to me.

Another idea that I've never realized is to add operator precedence 
grammar to pegged. Of course it should fit naturally with traditional 
PEG, for instance taking responsibility for parsing expressions.

-- 
Dmitry Olshansky

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator 
 precedence grammar to pegged. Of course it should fit naturally 
 with traditional PEG, for instance taking responsibility for 
 parsing expressions.

But that's already available by explicitly defining expression 
grammar via nested rules. See for example the examples/dgrammar.d 
in Pegged. This way, for example, multiplication has precedence 
over addition. (It looks like I misunderstood you. Did I?)

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Jul-12 23:35, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator precedence
 grammar to pegged. Of course it should fit naturally with traditional
 PEG, for instance taking responsibility for parsing expressions.

 But that's already available by explicitly defining expression grammar
 via nested rules. See for example the examples/dgrammar.d in Pegged.
 This way, for example, multiplication has precedence over addition. (It
 looks like I misunderstood you. Did I?)

Yes, I've meant something completely different ;)
http://en.wikipedia.org/wiki/Operator-precedence_grammar

-- 
Dmitry Olshansky

Jul 07 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/07/2012 09:35 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:
 Another idea that I've never realized is to add operator precedence
 grammar to pegged. Of course it should fit naturally with traditional
 PEG, for instance taking responsibility for parsing expressions.

 But that's already available by explicitly defining expression grammar
 via nested rules. See for example the examples/dgrammar.d in Pegged.
 This way, for example, multiplication has precedence over addition. (It
 looks like I misunderstood you. Did I?)

http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:
 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method

OK, at least I didn't misunderstand. So my question was whether 
the alternative that I described and which exists in PEG is 
somehow worse than OPP. Wikipedia seems to provide an answer to 
that: OPP tend to be simpler. (I didn't investigate this topic 
further.)

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 19:58:37 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:
 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method

 OK, at least I didn't misunderstand. So my question was whether 
 the alternative that I described and which exists in PEG is 
 somehow worse than OPP. Wikipedia seems to provide an answer to 
 that: OPP tend to be simpler. (I didn't investigate this topic 
 further.)

OK, now I agree, the need to perform several nested calls in 
order to parse some expression is costly enough to consider OPP 
superior to a general PEG for expressions.

At first I was surprised when found that D doesn't define 
operator precedence explicitly, but instead provides a hierarchy 
of expression types. Then I somehow concluded that the approaches 
are equivalent. (I started learning parsing techniques only in 
February '12.) Since then I never reconsidered my past 
conclusions.

Jul 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, July 07, 2012 18:37:54 Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
 I added dstrings because
 
 1- at the time (a few months ago), the lists here were awash in
 UTF-32
 discussions and I thought that'd be the way to go anyway
 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
 3- I wanted to be able to parse mathematical notation like
 nabla,
 derivatives, etc. which all have UTF32 symbols.

 
 I propose to switch code to use S if(isSomeString!S) everywhere.
 Client code would first determine source encoding scheme, and
 then instantiate parsers specifying a string type. This is not a
 trivial change, but I'm willing to help implementing it.

I don't know about this particular case, because I haven't really looked at 
pegged, but in general, string parsing stuff should be taking ranges of dchar 
and then specializing on string type where appropriate for efficiency.

- Jonathan M Davis

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 16:37:56 UTC, Roman D. Boiko wrote:
 Note that PEG does not impose to use packrat parsing, even 
 though it was developed to use it. I think it's a historical 
 'accident' that put the two together: Bryan Ford thesis used 
 the two together.

 Note that many PEG parsers do not rely on packrat (Pegged does 
 not).
 There are a bunch of articles on Bryan Ford's website by a guy
 writting a PEG parser for Java, and who found that storing the 
 last rules was enought to get a slight speed improvement, buth 
 that doing anymore sotrage was detrimental to the parser's 
 overall efficiency.

 That's great! Anyway I want to understand the advantages and 
 limitations of both Pegged and ANTLR, and probably study some 
 more techniques. Such research consumes a lot of time but can 
 be done incrementally along with development.

One disadvantage of Packrat parsers I mentioned was problematic 
error recovery (according to the article from ANTLR website). 
After some additional research, I found that it is not a critical 
problem. To find the exact place of error (from parser's 
perspective, not user's) one only needs to remember the farthest 
successfully parsed position (among several backtracking 
attempts) and the reason that it failed.

It is also possible to rerun parsing with some additional 
heuristics after failing, thus enabling advanced error repair 
scenarios.

Since Pegged doesn't use Packrat algorithm, this solution might 
be either not relevant or not applicable, but I doubt that there 
will be any fundamental problem with error recovery.

Unpleasant debugging experience, however, should be relevant for 
any parser that uses backtracking heavily.

Jul 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko <rb d-coding.com> wrote:


 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find the
 exact place of error (from parser's perspective, not user's) one only needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

IIRC, that's what I encoded in Pegged (admittedly limited) error
reporting: remember the farthest error.

 It is also possible to rerun parsing with some additional heuristics after
 failing, thus enabling advanced error repair scenarios.

Do people really what error-repairing parsers? I want my parsers to
tell me something is bad, and, optionally to advance a possible
repair, but definitely *not* to automatically repair a inferred error
and continue happily.

Jul 10 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com>  wrote:


 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find the
 exact place of error (from parser's perspective, not user's) one only needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

 IIRC, that's what I encoded in Pegged (admittedly limited) error
 reporting: remember the farthest error.

 It is also possible to rerun parsing with some additional heuristics after
 failing, thus enabling advanced error repair scenarios.

 Do people really what error-repairing parsers? I want my parsers to
 tell me something is bad, and, optionally to advance a possible
 repair, but definitely *not* to automatically repair a inferred error
 and continue happily.

FWIW, this is what most HTML parsers are doing.

Jul 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> wrote:

 Do people really what error-repairing parsers? I want my parsers to
 tell me something is bad, and, optionally to advance a possible
 repair, but definitely *not* to automatically repair a inferred error
 and continue happily.


 FWIW, this is what most HTML parsers are doing.

Ah, right. I can get it for HTML/XML. JSON also, maybe.
I was thinking of parsing a programming language (C, D, etc)

Consider me half-convinced :)

Jul 10 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Tuesday, 10 July 2012 at 19:41:29 UTC, Philippe Sigaud wrote:
 On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> 
 wrote:

 Do people really what error-repairing parsers? I want my 
 parsers to
 tell me something is bad, and, optionally to advance a 
 possible
 repair, but definitely *not* to automatically repair a 
 inferred error
 and continue happily.


 FWIW, this is what most HTML parsers are doing.

 Ah, right. I can get it for HTML/XML. JSON also, maybe.
 I was thinking of parsing a programming language (C, D, etc)

 Consider me half-convinced :)

It would still generate errors. But would enable a lot of useful 
functionality: autocompletion, refactoring, symbol documentation 
in a tooltip, displaying method overloads with parameters 
as-you-type, go to definition, etc.

Jul 10 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:
 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find
 the
 exact place of error (from parser's perspective, not user's) one only
 needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

 
 IIRC, that's what I encoded in Pegged (admittedly limited) error
 reporting: remember the farthest error.
 
 It is also possible to rerun parsing with some additional heuristics
 after
 failing, thus enabling advanced error repair scenarios.

 
 Do people really what error-repairing parsers? I want my parsers to
 tell me something is bad, and, optionally to advance a possible
 repair, but definitely *not* to automatically repair a inferred error
 and continue happily.

 
 FWIW, this is what most HTML parsers are doing.

Which is horrible. You pretty much have to with HTML because of the horrid 
decision that it should be parsed so laxly by browsers, but pretty much 
nothing else should do that. Either it's correct or it's not. Having the 
compiler "fix" your code would cause far more problems that it would ever fix.

- Jonathan M Davis

Jul 10 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Tuesday, 10 July 2012 at 20:25:12 UTC, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 FWIW, this is what most HTML parsers are doing.

 Which is horrible. You pretty much have to with HTML because of 
 the horrid
 decision that it should be parsed so laxly by browsers, but 
 pretty much
 nothing else should do that. Either it's correct or it's not. 
 Having the
 compiler "fix" your code would cause far more problems that it 
 would ever fix.

Not having control over parser or source code causes problems. 
Ability to deliver useful functionality (see my post above) is a 
different use case.

Jul 10 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-10 22:25, Jonathan M Davis wrote:

 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever fix.

I'm not sure but I think he was referring to a kind of error reporting 
technique used by compilers. Example:

int foo ()
{
    int a = 3 // note the missing semicolon
    return a;
}

Instead of the parser going completely mad because of the missing 
semicolon. It will basically insert a semicolon, report the error and 
then happily continue parsing. I think this will make it easier to find 
later errors and less likely to report incorrect errors due to a 
previous error.

-- 
/Jacob Carlborg

Jul 10 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:
 On 2012-07-10 22:25, Jonathan M Davis wrote:
 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever
 fix.

 I'm not sure but I think he was referring to a kind of error reporting
 technique used by compilers. Example:
 
 int foo ()
 {
 int a = 3 // note the missing semicolon
 return a;
 }
 
 Instead of the parser going completely mad because of the missing
 semicolon. It will basically insert a semicolon, report the error and
 then happily continue parsing. I think this will make it easier to find
 later errors and less likely to report incorrect errors due to a
 previous error.

Well, giving an error, continuing to parse, and giving a partial result can be 
useful (and you give a prime example of that), but "fixing" the problem (e.g by 
inserting the semicolon) and not considering it to be an error would be a 
_huge_ mistake IMHO. And that's essentially what happens with HTML.

- Jonathan M Davis

Jul 10 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/10/2012 10:53 PM, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:
 On 2012-07-10 22:25, Jonathan M Davis wrote:
 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever
 fix.

 I'm not sure but I think he was referring to a kind of error reporting
 technique used by compilers. Example:

 int foo ()
 {
     int a = 3 // note the missing semicolon
     return a;
 }

 Instead of the parser going completely mad because of the missing
 semicolon. It will basically insert a semicolon, report the error and
 then happily continue parsing. I think this will make it easier to find
 later errors and less likely to report incorrect errors due to a
 previous error.

 Well, giving an error, continuing to parse, and giving a partial result can be
 useful (and you give a prime example of that), but "fixing" the problem (e.g by
 inserting the semicolon) and not considering it to be an error would be a
 _huge_ mistake IMHO.

This is actually precisely what many of the more recent curly-brace-
and-semicolon languages have been doing with regard to semicolons.

Jul 10 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-10 22:53, Jonathan M Davis wrote:

 Well, giving an error, continuing to parse, and giving a partial result can be
 useful (and you give a prime example of that), but "fixing" the problem (e.g by
 inserting the semicolon) and not considering it to be an error would be a
 _huge_ mistake IMHO. And that's essentially what happens with HTML.

No, that is _not_ what happens with HTML. With HTML, the browser _do 
not_ output the error and continues as if it was valid could. As far as 
I know, up until HTML 5, the spec hasn't mentioned what should happen 
with invalid code.

This is just a error handling strategy that is an implementation detail. 
It will not change what is and what isn't valid code. Are you preferring 
getting just the first error when compiling? Fix the error, compile 
again, get a new error and so on.

-- 
/Jacob Carlborg

Jul 10 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, July 11, 2012 08:41:53 Jacob Carlborg wrote:
 On 2012-07-10 22:53, Jonathan M Davis wrote:
 Well, giving an error, continuing to parse, and giving a partial result
 can be useful (and you give a prime example of that), but "fixing" the
 problem (e.g by inserting the semicolon) and not considering it to be an
 error would be a _huge_ mistake IMHO. And that's essentially what happens
 with HTML.

 No, that is _not_ what happens with HTML. With HTML, the browser _do
 not_ output the error and continues as if it was valid could. As far as
 I know, up until HTML 5, the spec hasn't mentioned what should happen
 with invalid code.
 
 This is just a error handling strategy that is an implementation detail.
 It will not change what is and what isn't valid code. Are you preferring
 getting just the first error when compiling? Fix the error, compile
 again, get a new error and so on.

??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The 
browser doesn't spit out errors. It just guesses at what you really meant and 
displays that. It "fixes" the error for you, which is a horrible design IMHO. 
Obviously, we're stuck with it for HTML, but it should not be replicated with 
anything else.

This is in contrast to your example of outputting an error and continuing to 
parse as best it can in order to provide more detail and more error messages 
but _not_ ultimately considering the parsing successful. _That_ is useful. 
HTML's behavior is not.

- Jonathan M Davis

Jul 10 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-11 08:52, Jonathan M Davis wrote:

 ??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The
 browser doesn't spit out errors. It just guesses at what you really meant and
 displays that. It "fixes" the error for you, which is a horrible design IMHO.
 Obviously, we're stuck with it for HTML, but it should not be replicated with
 anything else.

 This is in contrast to your example of outputting an error and continuing to
 parse as best it can in order to provide more detail and more error messages
 but _not_ ultimately considering the parsing successful. _That_ is useful.
 HTML's behavior is not.

Ok, I see. It seems we're meaning the same thing.

-- 
/Jacob Carlborg

Jul 11 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 11-Jul-12 00:25, Jonathan M Davis wrote:
 On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:
 On 07/10/2012 09:14 PM, Philippe Sigaud wrote:
 Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:
 One disadvantage of Packrat parsers I mentioned was problematic error
 recovery (according to the article from ANTLR website). After some
 additional research, I found that it is not a critical problem. To find
 the
 exact place of error (from parser's perspective, not user's) one only
 needs
 to remember the farthest successfully parsed position (among several
 backtracking attempts) and the reason that it failed.

 IIRC, that's what I encoded in Pegged (admittedly limited) error
 reporting: remember the farthest error.

 It is also possible to rerun parsing with some additional heuristics
 after
 failing, thus enabling advanced error repair scenarios.

 Do people really what error-repairing parsers? I want my parsers to
 tell me something is bad, and, optionally to advance a possible
 repair, but definitely *not* to automatically repair a inferred error
 and continue happily.

 FWIW, this is what most HTML parsers are doing.

 Which is horrible. You pretty much have to with HTML because of the horrid
 decision that it should be parsed so laxly by browsers, but pretty much
 nothing else should do that. Either it's correct or it's not. Having the
 compiler "fix" your code would cause far more problems that it would ever fix.

BTW clang does this and even more of stuff on semantic level. It's known 
to won a legions of users because of that (well not only that but good 
diagnostic in general).


-- 
Dmitry Olshansky

Jul 10 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 Note that PEG does not impose to use packrat parsing, even 
 though it
 was developed to use it. I think it's a historical 'accident' 
 that put
 the two together: Bryan Ford thesis used the two together.


got disillusioned because nobody was interested in fixing the 
bugs in it (ANTLR's author is a Java guy first and foremost) and 
the source code of the required libraries didn't have source code 
or a license (wtf.)

So, for awhile I was thinking about how I might make my own 
parser generator that was "better" than ANTLR. I liked the syntax 
of PEG descriptions, but I was concerned about the performance 
hit of packrat and, besides, I already liked the syntax and 
flexibility of ANTLR. So my idea was to make something that was 
LL(k) and mixed the syntax of ANTLR and PEG while using more sane 
(IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 
3 still uses the same semantics today...) All of this is 'water 
under the bridge' now, but I hand-wrote a lexer to help me plan 
out how my parser-generator would produce code. The output code 
was to be both more efficient and significantly more readable 
than ANTLR's output. I didn't get around to writing the 
parser-generator itself but I'll have a look back at my handmade 
lexer for inspiration.

 However, as I found a few hours ago, Packrat parsing 
 (typically used to
 handle PEG) has serious disadvantages: it complicates 
 debugging because of
 frequent backtracking, it has problems with error recovery, 
 and typically
 disallows to add actions with side effects (because of 
 possibility of
 backtracking). These are important enough to reconsider my 
 plans of using
 Pegged. I will try to analyze whether the issues are so 
 fundamental that I
 (or somebody else) will have to create an ANTLR-like parser 
 instead, or
 whether it is possible to introduce changes into Pegged that 
 would fix these
 problems.


I don't like the sound of this either. Even if PEGs were fast, 
difficulty in debugging, error handling, etc. would give me 
pause. I insist on well-rounded tools. For example, even though 
LALR(1) may be the fastest type of parser (is it?), I prefer not 
to use it due to its inflexibility (it just doesn't like some 
reasonable grammars), and the fact that the generated code is 
totally unreadable and hard to debug (mind you, when I learned 
LALR in school I found that it is possible to visualize how it 
works in a pretty intuitive way--but debuggers won't do that for 
you.)

While PEGs are clearly far more flexible than LALR and probably 
more flexible than LL(k), I am a big fan of old-fashioned 
recursive descent because it's very flexible (easy to insert 
actions during parsing, and it's possible to use custom parsing 
code in certain places, if necessary*) and the parser generator's 
output is potentially very straightforward to understand and 
debug. In my mind, the main reason you want to use a parser 
generator instead of hand-coding is convenience, e.g. (1) to 
compress the grammar down so you can see it clearly, (2) have the 
PG compute the first-sets and follow-sets for you, (3) get 
reasonably automatic error handling.

* (If the language you want to parse is well-designed, you'll 
probably not need much custom parsing. But it's a nice thing to 
offer in a general-purpose parser generator.)

I'm not totally sure yet how to support good error messages, 
efficiency and straightforward output at the same time, but by 
the power of D I'm sure I could think of something...

I would like to submit another approach to parsing that I dare 
say is my favorite, even though I have hardly used it at all yet. 
ANTLR offers something called "tree parsing" that is extremely 
cool. It parses trees instead of linear token streams, and 
produces other trees as output. I don't have a good sense of how 
tree parsing works, but I think that some kind of tree-based 
parser generator could become the basis for a very flexible and 
easy-to-understand D front-end. If a PG operates on trees instead 
of linear token streams, I have a sneaky suspicion that it could 
revolutionize how a compiler front-end works.

Why? because right now parsers operate just once, on the user's 
input, and from there you manipulate the AST with "ordinary" 
code. But if you have a tree parser, you can routinely manipulate 
and transform parts of the tree with a sequence of independent 
parsers and grammars. Thus, parsers would replace a lot of things 
for which you would otherwise use a visitor pattern, or 
something. I think I'll try to sketch out this idea in more 
detail later.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 02:23 PM, David Piepgrass wrote:
 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.


 disillusioned because nobody was interested in fixing the bugs in it
 (ANTLR's author is a Java guy first and foremost) and the source code of
 the required libraries didn't have source code or a license (wtf.)

 So, for awhile I was thinking about how I might make my own parser
 generator that was "better" than ANTLR. I liked the syntax of PEG
 descriptions, but I was concerned about the performance hit of packrat
 and, besides, I already liked the syntax and flexibility of ANTLR. So my
 idea was to make something that was LL(k) and mixed the syntax of ANTLR
 and PEG while using more sane (IMO) semantics than ANTLR did at the time
 (I've no idea if ANTLR 3 still uses the same semantics today...) All of
 this is 'water under the bridge' now, but I hand-wrote a lexer to help
 me plan out how my parser-generator would produce code. The output code
 was to be both more efficient and significantly more readable than
 ANTLR's output. I didn't get around to writing the parser-generator
 itself but I'll have a look back at my handmade lexer for inspiration.

 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging
 because of
 frequent backtracking, it has problems with error recovery, and
 typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of
 using
 Pegged. I will try to analyze whether the issues are so fundamental
 that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would
 fix these
 problems.


 I don't like the sound of this either. Even if PEGs were fast,
 difficulty in debugging, error handling, etc. would give me pause. I
 insist on well-rounded tools. For example, even though LALR(1) may be
 the fastest type of parser (is it?), I prefer not to use it due to its
 inflexibility (it just doesn't like some reasonable grammars), and the
 fact that the generated code is totally unreadable and hard to debug
 (mind you, when I learned LALR in school I found that it is possible to
 visualize how it works in a pretty intuitive way--but debuggers won't do
 that for you.)

 While PEGs are clearly far more flexible than LALR and probably more
 flexible than LL(k), I am a big fan of old-fashioned recursive descent
 because it's very flexible (easy to insert actions during parsing, and
 it's possible to use custom parsing code in certain places, if
 necessary*) and the parser generator's output is potentially very
 straightforward to understand and debug. In my mind, the main reason you
 want to use a parser generator instead of hand-coding is convenience,
 e.g. (1) to compress the grammar down so you can see it clearly, (2)
 have the PG compute the first-sets and follow-sets for you, (3) get
 reasonably automatic error handling.

 * (If the language you want to parse is well-designed, you'll probably
 not need much custom parsing. But it's a nice thing to offer in a
 general-purpose parser generator.)

 I'm not totally sure yet how to support good error messages, efficiency
 and straightforward output at the same time, but by the power of D I'm
 sure I could think of something...

 I would like to submit another approach to parsing that I dare say is my
 favorite, even though I have hardly used it at all yet. ANTLR offers
 something called "tree parsing" that is extremely cool. It parses trees
 instead of linear token streams, and produces other trees as output. I
 don't have a good sense of how tree parsing works, but I think that some
 kind of tree-based parser generator could become the basis for a very
 flexible and easy-to-understand D front-end. If a PG operates on trees
 instead of linear token streams, I have a sneaky suspicion that it could
 revolutionize how a compiler front-end works.

 Why? because right now parsers operate just once, on the user's input,
 and from there you manipulate the AST with "ordinary" code. But if you
 have a tree parser, you can routinely manipulate and transform parts of
 the tree with a sequence of independent parsers and grammars. Thus,
 parsers would replace a lot of things for which you would otherwise use
 a visitor pattern, or something. I think I'll try to sketch out this
 idea in more detail later.

I was thinking the same thing.

My intent is to create a kind of regular-expression-of-nodes with 
push/pop operators to recognize ascent and descent on the tree.  Such a 
regular expression would allow one to capture subtrees out of 
generalized patterns and then place them into new trees that then become 
the input for the next pattern or set of patterns.  I think this is much 
closer to how I conceptualize semantic analysis than how semantic 
analysis is done in front ends like DMD: it should to be done with 
pattern recognition and substitution, not with myriad nested 
if-statements and while-loops.

My vision is to have code similar to this in the front-end:

/+
Lower
	while ( boolExpr )
	{
		statements;
	}

Into

	loopAgain:
	if ( !boolExpr )
		goto exitLoop
	statements;
	goto loopAgain
	exitLoop:
+/
void lowerWhileStatement( SyntaxElement* syntaxNode )
{
	auto captures = syntaxNode.matchNodes(
		TOK_WHILE_NODE,
		OP_ENTER_NODE,
			OP_CAPTURE(0),
			OP_BEGIN,
				TOK_EXPRESSION,
			OP_END,
			OP_CAPTURE(1),
			OP_BEGIN,
				TOK_STATEMENT,
			OP_END,
		OP_LEAVE_NODE);
	
	if ( captures is null )
		return;
	
	syntaxNode.replaceWith(
		LabelNode("loopAgain"),
		TOK_IF_STATEMENT,
		OP_INSERT,
		OP_BEGIN,
			TOK_NEGATE,
			OP_INSERT,
			OP_BEGIN,
				captures[0], // Expression
			OP_END,
			GotoStatement("exitLoop"),
		OP_END,
		captures[1], // statements
		GotoStatement("loopAgain"),
		LabelNode("exitLoop")
		);
}


The specifics will easily change.  One problem with the above code is 
that it could probably stand to use more templates and compile-time 
action to allow the front-end to merge patterns happening in the same 
pass to be merged together into one expression, thus preventing any 
unnecessary rescanning.  It all becomes DFAs or DPDAs operating on 
syntax trees.

In this vision I do not use classes and inheritance for my AST.  Instead 
I use structs that contain some kind of nodeType member that would be 
one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the 
above code.  Dynamic dispatch is instead performed by (very fast) DFAs 
recognizing parts of the AST.

This kind of architecture leads to other interesting benefits, like 
being able to assert which symbols a pattern is designed to handle or 
which symbols are allowed to exist in the AST at any point in time. 
Thus if you write a lowering that introduces nodes that a later pass 
can't handle, you'll know very quickly, at least in principle.

I wanted to make such a front-end so that I could easily make a C 
backend.  I believe such a compiler would be able to do that with great 
ease.  I really want a D compiler that can output ANSI C code that can 
be used with few or no OS/CPU dependencies.  I would be willing to lose 
a lot of the nifty parallelism/concurrency stuff and deal with reference 
counting instead of full garbage collection, as long as it lets me 
EASILY target new systems (any phone, console platform, and some 
embedded microcontrollers).  Then what I have is something that's as 
ubiquitous as C, but adds a lot of useful features like exception 
handling, dynamic arrays, templates, CTFE, etc etc.  My ideas for how to 
deal with ASTs in pattern recognition and substitution followed from this.

Needing to use D in places where it isn't available is a real pain-point 
for me right now, and I'll probably find ways to spend time on it 
eventually.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
 I was thinking the same thing.

 My intent is to create a kind of regular-expression-of-nodes 
 with push/pop operators to recognize ascent and descent on the 
 tree.  Such a regular expression would allow one to capture 
 subtrees out of generalized patterns and then place them into 
 new trees that then become the input for the next pattern or 
 set of patterns.  I think this is much closer to how I 
 conceptualize semantic analysis than how semantic analysis is 
 done in front ends like DMD: it should to be done with pattern 
 recognition and substitution, not with myriad nested 
 if-statements and while-loops.

Funny, we've discussed an idea to introduce some hybrid of regex 
and xpath for querying, searching and traversing AST with Dmitry 
a few weeks ago. A custom NDepend-like Code Query Language was 
the major alternative we considered. Discussion started on this 
forum and continued via email.

 In this vision I do not use classes and inheritance for my AST.
  Instead I use structs that contain some kind of nodeType 
 member that would be one of the tokens/symbols in the grammar, 
 like TOK_WHILE_NODE in the above code.  Dynamic dispatch is 
 instead performed by (very fast) DFAs recognizing parts of the 
 AST.

Exactly. This idea first came to me in April after I implemented 
the first top-down recursive descent custom parser for a D 
subset. I tried Visitor pattern before that and wasn't happy. 
There are some subtle difficulties which I believe will be 
possible to overcome, most important being the need to introduce 
a mechanism for hierarchical classification (like a pow 
expression being an assign expression at the same time).

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:

 In this vision I do not use classes and inheritance for my AST.
 Instead I use structs that contain some kind of nodeType member that
 would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
 in the above code. Dynamic dispatch is instead performed by (very
 fast) DFAs recognizing parts of the AST.

 Exactly. This idea first came to me in April after I implemented the
 first top-down recursive descent custom parser for a D subset. I tried
 Visitor pattern before that and wasn't happy. There are some subtle
 difficulties which I believe will be possible to overcome, most
 important being the need to introduce a mechanism for hierarchical
 classification (like a pow expression being an assign expression at the
 same time).

 From some of my earlier scribblings:

enum : SyntaxElement
{
   AST_EXPRESSION          = 0x0001_0000_0000_0000,
     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 | AST_EXPRESSION,
       AST_NEGATE_EXPR     = 0x0000_0000_0001_0000 |  AST_UNARY_EXPR,
       AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 |  AST_UNARY_EXPR,
       AST_POST_ADD_EXPR   = 0x0000_0000_0003_0000 |  AST_UNARY_EXPR,
       AST_POST_SUB_EXPR   = 0x0000_0000_0004_0000 |  AST_UNARY_EXPR,
       AST_PRE_ADD_EXPR    = 0x0000_0000_0005_0000 |  AST_UNARY_EXPR,
       AST_PRE_SUB_EXPR    = 0x0000_0000_0006_0000 |  AST_UNARY_EXPR,
     AST_BINARY_EXPR       = 0x0000_0002_0000_0000 | AST_EXPRESSION,
       AST_AND_EXPR        = 0x0000_0000_0001_0000 |  AST_BINARY_EXPR,
       AST_OR_EXPR         = 0x0000_0000_0002_0000 |  AST_BINARY_EXPR,
       AST_XOR_EXPR        = 0x0000_0000_0003_0000 |  AST_BINARY_EXPR,
       AST_AND_AND_EXPR    = 0x0000_0000_0004_0000 |  AST_BINARY_EXPR,
       AST_OR_OR_EXPR      = 0x0000_0000_0005_0000 |  AST_BINARY_EXPR,
       AST_ADD_EXPR        = 0x0000_0000_0006_0000 |  AST_BINARY_EXPR,
     AST_TRINARY_EXPR      = 0x0000_0003_0000_0000 | AST_EXPRESSION,
     AST_NARY_EXPR         = 0x0000_0004_0000_0000 | AST_EXPRESSION,
   AST_STATEMENT           = 0x0002_0000_0000_0000,
}


bool isA( SyntaxElement leafier, SyntaxElement rootier )
{
	SyntaxElement mask = 0;
	
	if ( rootier & 0x0000_0000_FFFF_FFFF )
	{
		if ( rootier & 0x0000_0000_0000_FFFF )
			mask = 0xFFFF_FFFF_FFFF_FFFF;
		else
			mask = 0xFFFF_FFFF_FFFF_0000;
	}
	else
	{
		if ( rootier & 0x0000_FFFF_FFFF_FFFF )
			mask = 0xFFFF_FFFF_0000_0000;
		else
			mask = 0xFFFF_0000_0000_0000;
	}
	
	return (leafier & mask) == rootier;
}

unittest
{
	assert(  isA( AST_EXPRESSION,  AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) );
	assert(  isA( AST_NEGATE_EXPR, AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) );
	assert( !isA( AST_EXPRESSION,  AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) );
	assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) );
	assert(false);
}

This approach of course has shameful nesting limitations, but I feel 
like these determinations could be fairly well optimized even for the 
general case.  For example: another approach that I might be more 
inclined to take is to give each token/symbol a low-valued index into a 
small inheritance table.

I would expect the regex engine to call the isA function as one of it's 
operations.  Thus placing an AST_EXPRESSION into your expression would 
also match an AST_NEGATE_EXPR too.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
 enum : SyntaxElement
 {
   AST_EXPRESSION          = 0x0001_0000_0000_0000,
     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |

This would cause wasting space (probably a lot). Definitely it 
would not be easy to implement in a parser generator, when 
various language properties are not known beforehand for 
fine-grained tuning.

 This approach of course has shameful nesting limitations, but I 
 feel like these determinations could be fairly well optimized 
 even for the general case.  For example: another approach that 
 I might be more inclined to take is to give each token/symbol a 
 low-valued index into a small inheritance table.

Depending on implementation, that might introduce the multiplier 
overhead of table access per each comparison (and there would be 
many in case of searching for nodes of specific type).

 I would expect the regex engine to call the isA function as one 
 of it's operations.  Thus placing an AST_EXPRESSION into your 
 expression would also match an AST_NEGATE_EXPR too.

But actually it is not so difficult to implement in a very 
similar way to what you described. I was thinking about a lookup 
table, but different from a traditional inheritance table. It 
would be indexed by AST node type (integral enum value), and 
store various classification information as bits. Maybe this is 
what you meant and I misunderstood you... Example is here: 
https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d 
(sorry, it doesn't show how to do classification, and has a 
different context, but I hope you get the idea). The advantage 
over storing hierarchical information directly in each token is 
obviously memory usage.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 06:18 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
 enum : SyntaxElement
 {
 AST_EXPRESSION = 0x0001_0000_0000_0000,
 AST_UNARY_EXPR = 0x0000_0001_0000_0000 |

 This would cause wasting space (probably a lot). Definitely it would not
 be easy to implement in a parser generator, when various language
 properties are not known beforehand for fine-grained tuning.

 This approach of course has shameful nesting limitations, but I feel
 like these determinations could be fairly well optimized even for the
 general case. For example: another approach that I might be more
 inclined to take is to give each token/symbol a low-valued index into
 a small inheritance table.

 Depending on implementation, that might introduce the multiplier
 overhead of table access per each comparison (and there would be many in
 case of searching for nodes of specific type).

 I would expect the regex engine to call the isA function as one of
 it's operations. Thus placing an AST_EXPRESSION into your expression
 would also match an AST_NEGATE_EXPR too.

 But actually it is not so difficult to implement in a very similar way
 to what you described. I was thinking about a lookup table, but
 different from a traditional inheritance table. It would be indexed by
 AST node type (integral enum value), and store various classification
 information as bits. Maybe this is what you meant and I misunderstood
 you... Example is here:
 https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d
 (sorry, it doesn't show how to do classification, and has a different
 context, but I hope you get the idea). The advantage over storing
 hierarchical information directly in each token is obviously memory usage.

I see what you mean.

I'm not sure that I buy that language properties are known before-hand. 
  The front-end knows what the language grammar looks like and it knows 
what kind of things it can find in an AST.

Thus you can make the regex DSL do this transformation and let the DFAs 
handle everything:

expr -> (expr | unary_expr | negate_expr | binary_compliment | incr_expr 
| decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...)

Because what we really want it to do is match any of those expression 
kinds when we place the expression symbol in our regex.

I think the important point here though is that inheritance can be 
reduced to bit-twiddling optimizations in this case.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
 I see what you mean.

And I think that I expressed myself badly. Let me rephrase.

When the memory hierarchy is deep, every level would require at 
least one bit position. Or even every base class would require a 
separate bit. (I think that the former + several bits to 
distinguish among hierarchies.) Otherwise it would not be easy to 
check by a bit-mask. Even if the above is incorrect (and that is 
likely since I didn't try to encode that compactly for the real 
grammar), I think that in general that information would only be 
possible to store in a fairly large integral. Especially if we 
try to generate parser from grammar, and thus can't do 
fine-tuning to pack the information tightly. This overhead would 
be paid per each AST node __instance__. But that is not 
necessary, since we could store information in a lookup table 
only once per node __type__.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 07:04 PM, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
 I see what you mean.

 And I think that I expressed myself badly. Let me rephrase.

 When the memory hierarchy is deep, every level would require at least
 one bit position. Or even every base class would require a separate bit.
 (I think that the former + several bits to distinguish among
 hierarchies.) Otherwise it would not be easy to check by a bit-mask.
 Even if the above is incorrect (and that is likely since I didn't try to
 encode that compactly for the real grammar), I think that in general
 that information would only be possible to store in a fairly large
 integral. Especially if we try to generate parser from grammar, and thus
 can't do fine-tuning to pack the information tightly. This overhead
 would be paid per each AST node __instance__. But that is not necessary,
 since we could store information in a lookup table only once per node
 __type__.

Yep. Makes sense.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 	auto captures = syntaxNode.matchNodes(
 		TOK_WHILE_NODE,
 		OP_ENTER_NODE,
 			OP_CAPTURE(0),
 			OP_BEGIN,
 				TOK_EXPRESSION,
 			OP_END,
 			OP_CAPTURE(1),
 			OP_BEGIN,
 				TOK_STATEMENT,
 			OP_END,
 		OP_LEAVE_NODE);

I'm glad to hear you like the tree-parsing approach, Chad, 
although the particular syntax here looks pretty unfriendly :O -- 
does this represent something that you are working on right now?

 This kind of architecture leads to other interesting benefits, 
 like being able to assert which symbols a pattern is designed 
 to handle or which symbols are allowed to exist in the AST at 
 any point in time. Thus if you write a lowering that introduces 
 nodes that a later pass can't handle, you'll know very quickly, 
 at least in principle.

 I wanted to make such a front-end so that I could easily make a 
 C backend.  I believe such a compiler would be able to do that 
 with great ease.  I really want a D compiler that can output 
 ANSI C code that can be used with few or no OS/CPU 
 dependencies.  I would be willing to lose a lot of the nifty 
 parallelism/concurrency stuff and deal with reference counting 
 instead of full garbage collection, as long as it lets me 
 EASILY target new systems (any phone, console platform, and 
 some embedded microcontrollers).  Then what I have is something 
 that's as ubiquitous as C, but adds a lot of useful features 
 like exception handling, dynamic arrays, templates, CTFE, etc 
 etc.  My ideas for how to deal with ASTs in pattern recognition 
 and substitution followed from this.

I tend to agree that it would be better to have a "general" node 
class with the node type as a property rather than a subtype and 
rather than a myriad of independent types, although in the past I 
haven't been able to figure out how to make this approach 
simultaneously general, efficient, and easy to use. I'm kind of a 
perfectionist which perhaps holds me back sometimes :)

I'd like to add that if we give tree parsing first-class 
treatment, I believe the most logical approach to parsing has 
three or more stages instead of the traditional two (lex+parse):

1. Lexer
2. Tree-ification
3. Parsing to AST (which may itself use multiple stages, e.g. 
parse the declarations first, then parse function bodies later)

The new stage two simply groups things that are in parenthesis 
and braces. So an input stream such as the following:

A man (from a [very ugly] house in the suburbs) was quoted as 
saying {
     I saw Batman (and Robin) last night!
}

Is converted to a tree where everything parenthesized or braced 
gets to be a child:

A man (
    from a [
        very ugly
    ] house in the suburbs
) was quoted as saying {
    ...
}

Some of the things I like about this approach are:

1. It's language-agnostic. Lots of languages and DSLs could 
re-use exactly the same code from stage 2. (Stage 1, also, is 
fairly similar between languages and I wonder if a parameterized 
standard lexer is a worthwhile pursuit.)

2. It mostly eliminates the need for arbitrary-length lookahead 
for things like D's template_functions(...)(...). Of course, the 
tokens will almost always end up getting scanned twice, but hey, 
at least you know you won't need to scan them more than twice, 
right? (er, of course the semantic analysis will scan it several 
times anyway. Maybe this point is moot.)

3. It is very efficient for tools that don't need to examine 
function bodies. Such tools can easily leave out that part of the 
parser simply by not invoking the function-body sub-parser.

4. It leaves open the door to supporting embedded DSLs in the 
future. It's trivial to just ignore a block of text in braces and 
hand it off to a DSL later. It is similar to the way PEGs allow 
several different parties to contribute parts of a grammar, 
except that this approach does not constrain all the parties to 
actually use PEGs; for instance if I am a really lazy DSL author 
and I already have a SQL parser laying around (whether it's 
LL(k), LALR, whatever), I can just feed the original input text 
to that parser (or, better, use the flat token stream, sans 
comments, that came out of the lexer.)

5. It's risky 'cause I've never heard of anyone taking this 
approach before. Bring on the danger!

I have observed that most PLs (Programming Langs) use one of two 
versions of stage 2: (1) C-style, with structure indicated 
entirely with {}, (), [], and possibly <> (shudder), or (2) 
Python-style, with structure indicated by indentation instead of 
{}. My favorite is the Boo language, which combines these two, 
using Python style by default, but also having a WSA parsing mode 
(whitespace-agnostic) with braces, and switching to WSA mode 
inside a Python-style module whenever the user uses an opener 
("(,{,[") (IIRC). So a single Boo-like stage 2 could be re-used 
for almost all PLs, and thus would be a reasonable candidate as 
part of the standard library or a "D Boost library" (there is not 
such a thing, is there?).

 I wanted to make such a front-end so that I could easily make a 
 C backend.  I believe such a compiler would be able to do that 
 with great ease.

 Needing to use D in places where it isn't available is a real 
 pain-point for me right now, and I'll probably find ways to 
 spend time on it eventually.

Yeah, with a tree-transforming parser, I imagine the same thing, 
except my current fetish is to convert a certain subset of D to 
multiple other languages automatically. Then I could write 
libraries that can easily be used by an astonishingly large 
audience. I certainly would like to see D targetting Android, but 
that's best done directly from D to ARM.

Anyway, the devil's in the detail. Originally I wanted to do a 

couldn't seem to work out the details, but D is more flexible and 
is likely better suited to the task.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
 I'd like to add that if we give tree parsing first-class 
 treatment, I believe the most logical approach to parsing has 
 three or more stages instead of the traditional two (lex+parse):

 1. Lexer
 2. Tree-ification
 3. Parsing to AST (which may itself use multiple stages, e.g. 
 parse the declarations first, then parse function bodies later)

 The new stage two simply groups things that are in parenthesis 
 and braces. So an input stream such as the following:

I bet that after stage 2 you would have performed almost the same 
amount of work (in other words, spent almost the same time) as 
you would if you did full parsing. After that you would need to 
iterate the whole tree (possibly multiple times), modify (or 
recreate if the AST is immutable) its nodes, etc. Altogether this 
might be a lot of overhead.

My opinion is that tree manipulation is something that should be 
available to clients of parser-as-a-library or even of 
parser+semantic analyzer, but not necessarily advantageous for 
parser itself.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

On Saturday, 7 July 2012 at 20:39:18 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
 I'd like to add that if we give tree parsing first-class 
 treatment, I believe the most logical approach to parsing has 
 three or more stages instead of the traditional two 
 (lex+parse):

 1. Lexer
 2. Tree-ification
 3. Parsing to AST (which may itself use multiple stages, e.g. 
 parse the declarations first, then parse function bodies later)

 The new stage two simply groups things that are in parenthesis 
 and braces. So an input stream such as the following:

 I bet that after stage 2 you would have performed almost the 
 same amount of work (in other words, spent almost the same 
 time) as you would if you did full parsing. After that you 
 would need to iterate the whole tree (possibly multiple times), 
 modify (or recreate if the AST is immutable) its nodes, etc. 
 Altogether this might be a lot of overhead.

 My opinion is that tree manipulation is something that should 
 be available to clients of parser-as-a-library or even of 
 parser+semantic analyzer, but not necessarily advantageous for 
 parser itself.

Hmm, you've got a good point there, although simple 
tree-ification is clearly less work than standard parsing, since 
statements like "auto x = y + z;" would be quickly "blitted" into 
the same node in phase 2, but would become multiple separate 
nodes in phase 3.

What I like about it is not its performance, but how it matches 
the way we think about languages. Humans tend to see overall 
structure first, and examine the fine details later. The tree 
parsing approach is similarly nonlinear and can be modularized in 
a way that might be more intuitive than traditional EBNF.

On the other hand, one could argue it is /too/ flexible, 
admitting so many different approaches to parsing that a 
front-end based on this approach is not necessarily intuitive to 
follow; and of course, not using a standard EBNF-type grammar 
could be argued to be bad.

Still... it's a fun concept, and even if the initial parsing ends 
up using the good-old lex-parse approach, semantic analysis and 
lowering can benefit from a tree parser. Tree parsing, of course, 
is just a generalization of linear parsing and so a tree parser 
generator (TPG) could work equally well for flat input.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 What I like about it is not its performance, but how it matches 
 the way we think about languages. Humans tend to see overall 
 structure first, and examine the fine details later. The tree 
 parsing approach is similarly nonlinear and can be modularized 
 in a way that might be more intuitive than traditional EBNF.

That reminds me, I forgot to write a another advantage I expected 
the three-phase approach to have, namely, that it seems easier to 
tell what the programmer "meant" with three phases, in the face 
of errors. I mean, phase 2 can tell when braces and parenthesis 
are not matched up properly and then it can make reasonable 
guesses about where those missing braces/parenthesis were meant 
to be, based on things like indentation. That would be especially 
helpful when the parser is used in an IDE, since if the IDE 
guesses the intention correctly, it can still understand broken 
code and provide code completion for it. And since phase 2 is a 
standard tool, anybody's parser can use it.

Example:

void f() {
     if (cond)
         x = y + 1;
         y = z + 1;
     }
} // The error appears to be here, but it's really 4 lines up.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
 it seems easier to tell what the programmer "meant" with three 
 phases, in the face of errors. I mean, phase 2 can tell when 
 braces and parenthesis are not matched up properly and then it 
 can make reasonable guesses about where those missing 
 braces/parenthesis were meant to be, based on things like 
 indentation. That would be especially helpful when the parser 
 is used in an IDE, since if the IDE guesses the intention 
 correctly, it can still understand broken code and provide code 
 completion for it. And since phase 2 is a standard tool, 
 anybody's parser can use it.

There could be multiple errors that compensate each other and 
make your phase 2 succeed and prevent phase 3 from doing proper 
error handling. Even knowing that there is an error, in many 
cases you would not be able to create a meaningful error message. 
And any error would make your phase-2 tree incorrect, so it would 
be difficult to recover from it by inserting an additional token 
or ignoring tokens until parser is able to continue its work 
properly. All this would suffer for the same reason: you loose 
information.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

On Saturday, 7 July 2012 at 22:07:02 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
 it seems easier to tell what the programmer "meant" with three 
 phases, in the face of errors. I mean, phase 2 can tell when 
 braces and parenthesis are not matched up properly and then it 
 can make reasonable guesses about where those missing 
 braces/parenthesis were meant to be, based on things like 
 indentation. That would be especially helpful when the parser 
 is used in an IDE, since if the IDE guesses the intention 
 correctly, it can still understand broken code and provide 
 code completion for it. And since phase 2 is a standard tool, 
 anybody's parser can use it.

 There could be multiple errors that compensate each other and 
 make your phase 2 succeed and prevent phase 3 from doing proper 
 error handling. Even knowing that there is an error, in many 
 cases you would not be able to create a meaningful error 
 message. And any error would make your phase-2 tree incorrect, 
 so it would be difficult to recover from it by inserting an 
 additional token or ignoring tokens until parser is able to 
 continue its work properly. All this would suffer for the same 
 reason: you loose information.

This is all true, but forgetting a brace commonly results in a 
barrage of error messages anyway. Code that guesses what you 
meant obviously won't work all the time, and phase 3 would have 
to take care not to emit an error message about a "{" token that 
doesn't actually exist (that was merely "guessed-in"). But at 
least it's nice for a parser to be /able/ to guess what you 
meant; for a typical parser it would be out of the question, upon 
detecting an error, to back up four source lines, insert a brace 
and try again.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
 This is all true, but forgetting a brace commonly results in a 
 barrage of error messages anyway. Code that guesses what you 
 meant obviously won't work all the time, and phase 3 would have 
 to take care not to emit an error message about a "{" token 
 that doesn't actually exist (that was merely "guessed-in"). But 
 at least it's nice for a parser to be /able/ to guess what you 
 meant; for a typical parser it would be out of the question, 
 upon detecting an error, to back up four source lines, insert a 
 brace and try again.

So you simply admit that error recovery is difficult to 
implement. For me, it is a must-have, and thus throwing away 
information is bad.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

On Saturday, 7 July 2012 at 22:35:37 UTC, Roman D. Boiko wrote:
 On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
 This is all true, but forgetting a brace commonly results in a 
 barrage of error messages anyway. Code that guesses what you 
 meant obviously won't work all the time, and phase 3 would 
 have to take care not to emit an error message about a "{" 
 token that doesn't actually exist (that was merely 
 "guessed-in"). But at least it's nice for a parser to be 
 /able/ to guess what you meant; for a typical parser it would 
 be out of the question, upon detecting an error, to back up 
 four source lines, insert a brace and try again.

 So you simply admit that error recovery is difficult to 
 implement. For me, it is a must-have, and thus throwing away 
 information is bad.

I'm not seeing any tremendous error-handling difficulty in my 
idea. Anyway, I missed the part about information being thrown 
away...?

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:
 I'm not seeing any tremendous error-handling difficulty in my 
 idea. Anyway, I missed the part about information being thrown 
 away...?

(I will use the word "you" to refer to some abstract person or 
compiler.)

Error handling could mean either error reporting and stopping 
after the first error, or reporting several errors and continuing 
parsing + semantic analysis whenever possible, so that the user 
would get partial functionality like code completion, information 
about method overloads, or "go to definition" / find all usages, 
etc.

The first method is the less powerful option, but still requires 
constructing a meaningful error message which could help the user.

There are many possible error recovery strategies. For example, 
you might decide to insert some token according to the parsing 
information available up to the moment when error is discovered, 
if that would fix the problem and allow to continue parsing. 
Another option is to ignore a series of further tokens (treat 
them a whitespace, for example), until parser is able to continue 
its work. Also there are many heuristics for typical error 
situations.

All of these can only be performed if parser gathers the syntax 
information which can be inferred from source code according to 
the grammar rules.

In stage 2 you have only performed some basic analysis, like, 
e.g., matched braces to define some hierarchy. This means that at 
the time when you find a missing brace, for example, you cannot 
tell anything more than that braces don't match. Or, if the user 
inserts a brace in an incorrect location, it is only possible to 
say that it is either missing somewhere and somewhere else 
another brace is missing, or that it is missing in one place, and 
some other brace is redundant. In many cases you won't even 
notice that a brace is incorrectly placed, and pass the resulting 
tree to the 3rd stage. You don't take any hint from grammar about 
the exact locations where some token should be present.

Now, stage 3 heavily depends on the output of stage 2. As I 
demonstrated in some examples, it could get the output which 
implies incorrect structure, even if that has not been found in 
the previous stage. You would need to analyze so much information 
attempting to find the real roots of a problem, that effectively 
it would involve duplicating (implicitly) the logic of previous 
stage, but with combinatorial complexity growth.

The problems you would need to deal with are much more complex 
than I described here. So you wouldn't be able to deliver error 
recovery at all, or (if you could) it would be either trivial or 
would require needlessly complex logic. Breaking the system at 
the wrong boundary (when the number of dependencies that cross 
the boundary is large) always introduces artificial complexity.

Described above is the illustration of what I call information 
loss. I may have described something not as clearly as needed, 
but I didn't have the goal to provide a thorough and verifiable 
analysis. I speculated and simplified a lot. If you decide to 
ignore this, it is not a problem and I don't state that you will 
fail any of your goals. Just admit that __for me__ this approach 
is not a viable option. Everything written above is IMHO and 
there may be many ways to resolve the problems with various 
degrees of success.

Jul 07 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote:
 Still... it's a fun concept, and even if the initial parsing 
 ends up using the good-old lex-parse approach, semantic 
 analysis and lowering can benefit from a tree parser. Tree 
 parsing, of course, is just a generalization of linear parsing 
 and so a tree parser generator (TPG) could work equally well 
 for flat input.

Exactly. Semantic analysis is what would benefit from that, as 
well as client code (for example, refactoring tools, etc.)

Parser would become quite complicated. Probably it would need to 
perform complex tree navigation (which might decrease performance 
proportionally to the average tree depth or even more, if parser 
algorithms were themselves fast). Any non-trivial query would 
require direct character manipulation (like comparing sub-strings 
of captures with string literals, etc.). Probably you would get 
frequent cache misses because of the need to jump throughout the 
(incomplete) AST. It would definitely be problematic to build an 
immutable AST model, thus (IMO) significantly and needlessly 
constraining you and users of your library. And likely you would 
have to deal with many more problems __without__ significant 
gains.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 04:26 PM, David Piepgrass wrote:
 auto captures = syntaxNode.matchNodes(
 TOK_WHILE_NODE,
 OP_ENTER_NODE,
 OP_CAPTURE(0),
 OP_BEGIN,
 TOK_EXPRESSION,
 OP_END,
 OP_CAPTURE(1),
 OP_BEGIN,
 TOK_STATEMENT,
 OP_END,
 OP_LEAVE_NODE);

 I'm glad to hear you like the tree-parsing approach, Chad, although the
 particular syntax here looks pretty unfriendly :O -- does this represent
 something that you are working on right now?

Yes and yes.

I didn't choose this because it because it's pretty.

I chose it because:
(1) It's easy to implement.
(2) Both the implementation and syntax can be altered easily.

I do not have time to write a language for the tree pattern recognition 
and substitution that is needed to do this in an aesthetically pleasing 
way.  I've tried to sketch what it might look like before, and even then 
it is hard to make it nice, much less begin implementing the thing.  I'd 
love to have such a language, but resource constraints exist.

I also think that this approach would allow me to find out what my usage 
patterns look like before I commit to a more complicated 
architecture/tool.  I really think the design of this regex/language/DSL 
thing should be dominated by its usage.  This is a tricky 
chicken-and-egg thing because it's not currently used.  The hacky syntax 
you see is the bootstrapping.

Point (2) is important because, since we don't have existing usage 
patterns, this thing is going to change.  It's going to change /a lot/. 
  I want it to be easy to change.  I think a complete DSL will be harder 
to change quickly.

I also like how it doesn't require a lot of CTFE trickery or pushing DMD 
too far.  D has really cool features, but I find that when I use things 
like CTFE aggressively then I lose productivity because I end up 
spending a lot of time finding compiler bugs.  This leads to my current 
strategy: use the simpler features that work for sure, and only use the 
more advanced stuff when I really need to.  I think my syntax fits this 
strategy and thus contributes to point (1).

That said, it is good that even mostly-working CTFE exists and that a 
powerful template and metaprogramming system exists, because I don't 
think a compiler like this would be very practical to program otherwise. 
  It would be doable in other languages, but could easily suffer from 
performance pessimizations due to being forced to compute everything at 
runtime.

If anyone has an approach that shares the above strengths and looks 
nicer or is more powerful, I'd love to see it.

 5. It's risky 'cause I've never heard of anyone taking this approach
 before. Bring on the danger!

The danger is the fun part! <g>

 I wanted to make such a front-end so that I could easily make a C
 backend. I believe such a compiler would be able to do that with great
 ease.

 Needing to use D in places where it isn't available is a real
 pain-point for me right now, and I'll probably find ways to spend time
 on it eventually.

 Yeah, with a tree-transforming parser, I imagine the same thing, except
 my current fetish is to convert a certain subset of D to multiple other
 languages automatically. Then I could write libraries that can easily be
 used by an astonishingly large audience. I certainly would like to see D
 targetting Android, but that's best done directly from D to ARM.

That does sound very cool.  Possibly difficult though, due to having to 
cater to the lowest-common-denominator in all of your API designs.  No 
templated functions or ranges in your API, that's for sure.  I'm sure 
there are some things where this is very doable though; it probably 
depends on what kind of libraries you are writing.

As for D targeting Android, my intent is really to target X where X is 
any CPU/OS combo you can think of.  I want to be able to get D, the 
language, not necessarily phobos or other niceties, to work on any 
platform, and to do so without much work.  Cross-compiling to a new 
platform that has never been cross-compiled before should require zero 
coding.  Perhaps it might be possible to have a text file with some 
key-value configuration that tells it certain common features are 
available on the target, thus allowing you to have more features with 
almost no effort involved.

Still, I'll always take a crippled but existent D compiler that targets 
Android over a perfect but non-existent D compiler that targets Android.

I think that the D-directly-to-ARM is the current approach for 
cross-compiling.  I critique it for its underwhelming lack of results.

 Anyway, the devil's in the detail. Originally I wanted to do a parser

 out the details, but D is more flexible and is likely better suited to
 the task.

I can easily see how this is the case.  I don't think I'd be interested 
in doing a project like this in any other language.  I imagined trying 

practical.  For instance, I don't think the "use regular expressions to 
match AST structures" would work well in other cases because it would 
either (1) have a bunch of runtime overhead for compiling the 
expressions into DFAs at runtime or (2) suffer from integration problems 
if you try to compile the expressions in separate files before compiling 
the rest of the front-end.

Jul 07 2012

"David Piepgrass" <qwertie256 gmail.com> writes:

 Yeah, with a tree-transforming parser, I imagine the same 
 thing, except
 my current [fantasy] is to convert a certain subset of D to 
 multiple other
 languages automatically. Then I could write libraries that can 
 easily be
 used by an astonishingly large audience. I certainly would 
 like to see D
 targetting Android, but that's best done directly from D to 
 ARM.

 That does sound very cool.  Possibly difficult though, due to 
 having to cater to the lowest-common-denominator in all of your 
 API designs.  No templated functions or ranges in your API, 
 that's for sure.  I'm sure there are some things where this is 
 very doable though; it probably depends on what kind of 
 libraries you are writing.

Well, for templates, in general, it would be necessary to 
instantiate a particular set of templates and explicitly give 
them names in the target language. So for instance, I could 
define a Point!T struct in D, sure, but then I'd have to tell the 
language converter to create target-language specializations: in 

were C++, the template could be translated to a C++ template, 
Point<T>, as long as there aren't any "static ifs" or other 
things that can't be translated. Notably, if a template P!T 
depends on another template Q!T, then P!T cannot be translated to 


Adapting standard libraries could no doubt be a gigantic problem. 
I don't know how to begin to think about doing that.

But for ranges in particular, I think the concept is too 
important to leave out of public interfaces. So I'd port the 
major range data structures to the target languages, most likely 
by hand, so that they could be used by converted code.

 As for D targeting Android, my intent is really to target X 
 where X is any CPU/OS combo you can think of.  I want to be 
 able to get D, the language, not necessarily phobos or other 
 niceties, to work on any platform, and to do so without much 
 work.  Cross-compiling to a new platform that has never been 
 cross-compiled before should require zero coding.

I understand. Conversion to C is an effective last resort. And, 
well, I hear a lot of compilers have even used it as a standard 
practice. I guess you'd be stuck with refcounting, though.

 I think that the D-directly-to-ARM is the current approach for 
 cross-compiling.  I critique it for its underwhelming lack of 
 results.

Yeah. I assume it involves weird object-file formats, calling 
conventions and ABIs. I guess very few want to get involved with 
that stuff, and very few have the slightest clue where to begin, 
myself included.

 (2) suffer from integration problems if you try to compile the 
 expressions in separate files before compiling the rest of the 
 front-end.

Absolutely, I love language-integrated metaprogramming. Without 
it you end up with complicated build environments, and I hate 
those, cuz there isn't a single standard build environment that 
everybody likes. I think people should be able to just load up 
their favorite IDE and add all source files to the project and It 
Just Works. Or on the command line, do dmd *.d or whatever. Oh, 
and the ability to run the same code at meta-compile-time, 
compile-time and run-time, also priceless.

Jul 07 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/07/2012 08:55 PM, Chad J wrote:
 On 07/07/2012 02:23 PM, David Piepgrass wrote:
 Note that PEG does not impose to use packrat parsing, even though it
 was developed to use it. I think it's a historical 'accident' that put
 the two together: Bryan Ford thesis used the two together.


 disillusioned because nobody was interested in fixing the bugs in it
 (ANTLR's author is a Java guy first and foremost) and the source code of
 the required libraries didn't have source code or a license (wtf.)

 So, for awhile I was thinking about how I might make my own parser
 generator that was "better" than ANTLR. I liked the syntax of PEG
 descriptions, but I was concerned about the performance hit of packrat
 and, besides, I already liked the syntax and flexibility of ANTLR. So my
 idea was to make something that was LL(k) and mixed the syntax of ANTLR
 and PEG while using more sane (IMO) semantics than ANTLR did at the time
 (I've no idea if ANTLR 3 still uses the same semantics today...) All of
 this is 'water under the bridge' now, but I hand-wrote a lexer to help
 me plan out how my parser-generator would produce code. The output code
 was to be both more efficient and significantly more readable than
 ANTLR's output. I didn't get around to writing the parser-generator
 itself but I'll have a look back at my handmade lexer for inspiration.

 However, as I found a few hours ago, Packrat parsing (typically used to
 handle PEG) has serious disadvantages: it complicates debugging
 because of
 frequent backtracking, it has problems with error recovery, and
 typically
 disallows to add actions with side effects (because of possibility of
 backtracking). These are important enough to reconsider my plans of
 using
 Pegged. I will try to analyze whether the issues are so fundamental
 that I
 (or somebody else) will have to create an ANTLR-like parser instead, or
 whether it is possible to introduce changes into Pegged that would
 fix these
 problems.


 I don't like the sound of this either. Even if PEGs were fast,
 difficulty in debugging, error handling, etc. would give me pause. I
 insist on well-rounded tools. For example, even though LALR(1) may be
 the fastest type of parser (is it?), I prefer not to use it due to its
 inflexibility (it just doesn't like some reasonable grammars), and the
 fact that the generated code is totally unreadable and hard to debug
 (mind you, when I learned LALR in school I found that it is possible to
 visualize how it works in a pretty intuitive way--but debuggers won't do
 that for you.)

 While PEGs are clearly far more flexible than LALR and probably more
 flexible than LL(k), I am a big fan of old-fashioned recursive descent
 because it's very flexible (easy to insert actions during parsing, and
 it's possible to use custom parsing code in certain places, if
 necessary*) and the parser generator's output is potentially very
 straightforward to understand and debug. In my mind, the main reason you
 want to use a parser generator instead of hand-coding is convenience,
 e.g. (1) to compress the grammar down so you can see it clearly, (2)
 have the PG compute the first-sets and follow-sets for you, (3) get
 reasonably automatic error handling.

 * (If the language you want to parse is well-designed, you'll probably
 not need much custom parsing. But it's a nice thing to offer in a
 general-purpose parser generator.)

 I'm not totally sure yet how to support good error messages, efficiency
 and straightforward output at the same time, but by the power of D I'm
 sure I could think of something...

 I would like to submit another approach to parsing that I dare say is my
 favorite, even though I have hardly used it at all yet. ANTLR offers
 something called "tree parsing" that is extremely cool. It parses trees
 instead of linear token streams, and produces other trees as output. I
 don't have a good sense of how tree parsing works, but I think that some
 kind of tree-based parser generator could become the basis for a very
 flexible and easy-to-understand D front-end. If a PG operates on trees
 instead of linear token streams, I have a sneaky suspicion that it could
 revolutionize how a compiler front-end works.

 Why? because right now parsers operate just once, on the user's input,
 and from there you manipulate the AST with "ordinary" code. But if you
 have a tree parser, you can routinely manipulate and transform parts of
 the tree with a sequence of independent parsers and grammars. Thus,
 parsers would replace a lot of things for which you would otherwise use
 a visitor pattern, or something. I think I'll try to sketch out this
 idea in more detail later.

 I was thinking the same thing.

 My intent is to create a kind of regular-expression-of-nodes with
 push/pop operators to recognize ascent and descent on the tree.  Such a
 regular expression would allow one to capture subtrees out of
 generalized patterns and then place them into new trees that then become
 the input for the next pattern or set of patterns.  I think this is much
 closer to how I conceptualize semantic analysis than how semantic
 analysis is done in front ends like DMD: it should to be done with
 pattern recognition and substitution, not with myriad nested
 if-statements and while-loops.

 My vision is to have code similar to this in the front-end:

 /+
 Lower
      while ( boolExpr )
      {
          statements;
      }

 Into

      loopAgain:
      if ( !boolExpr )
          goto exitLoop
      statements;
      goto loopAgain
      exitLoop:
 +/
 void lowerWhileStatement( SyntaxElement* syntaxNode )
 {
      auto captures = syntaxNode.matchNodes(
          TOK_WHILE_NODE,
          OP_ENTER_NODE,
              OP_CAPTURE(0),
              OP_BEGIN,
                  TOK_EXPRESSION,
              OP_END,
              OP_CAPTURE(1),
              OP_BEGIN,
                  TOK_STATEMENT,
              OP_END,
          OP_LEAVE_NODE);

      if ( captures is null )
          return;

      syntaxNode.replaceWith(
          LabelNode("loopAgain"),
          TOK_IF_STATEMENT,
          OP_INSERT,
          OP_BEGIN,
              TOK_NEGATE,
              OP_INSERT,
              OP_BEGIN,
                  captures[0], // Expression
              OP_END,
              GotoStatement("exitLoop"),
          OP_END,
          captures[1], // statements
          GotoStatement("loopAgain"),
          LabelNode("exitLoop")
          );
 }


 The specifics will easily change.

I'd suggest:

AstOp!`
Lower
     while ( boolExpr )
     {
         statements;
     }

Into
     loopAgain:
     if ( !boolExpr ) {
         statements;
     } else goto exitLoop
     goto loopAgain
     exitLoop:

`.run(syntaxNode);

Jul 07 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/07/2012 10:26 PM, Timon Gehr wrote:
 On 07/07/2012 08:55 PM, Chad J wrote:
  ...

 The specifics will easily change.

 I'd suggest:

 AstOp!`
 Lower
      while ( boolExpr )
      {
          statements;
      }

 Into
      loopAgain:
      if ( !boolExpr ) {
          statements;
      } else goto exitLoop
      goto loopAgain
      exitLoop:

 `.run(syntaxNode);

Also, I'd like to point out that the transformation is not actually 
valid, because there are break/continue.

Jul 07 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 07/07/2012 04:26 PM, Timon Gehr wrote:
 On 07/07/2012 08:55 PM, Chad J wrote:
 The specifics will easily change.

 I'd suggest:

 AstOp!`
 Lower
 while ( boolExpr )
 {
 statements;
 }

 Into
 loopAgain:
 if ( !boolExpr ) {
 statements;
 } else goto exitLoop
 goto loopAgain
 exitLoop:

 `.run(syntaxNode);

I wish.

Can you make it happen?

Jul 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 05-Jul-12 16:11, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).

Then do it. It's all about having something so obviously good, fast and 
flexible that other stuff can be considered only as toys.

I might do one, but I'd rather just help other folks make it faster ;)


-- 
Dmitry Olshansky

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, 
 fast and flexible that other stuff can be considered only as 
 toys.

 I might do one, but I'd rather just help other folks make it 
 faster ;)

Would you help to make Pegged faster?

Jul 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

 Would you help to make Pegged faster?

Well why not.
But first I'll need to deliver some stuff on my GSOC project.

-- 
Dmitry Olshansky

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 13:05:41 UTC, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 Well why not.
 But first I'll need to deliver some stuff on my GSOC project.

I bet that after you finish with GSOC optimizing Pegged will not 
be less relevant than it is now :) And as for DCT, it will take 
another half a year (at least) until it will become usable for 
any real needs (except the most trivial).

Jul 05 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 7/5/12 9:05 AM, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

 Would you help to make Pegged faster?

 Well why not.
 But first I'll need to deliver some stuff on my GSOC project.

Good call :o).

Note that we can discuss tweaking the scope of the project after the 
midterm. Ping me if you have some ideas.


Andrei

Jul 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 05-Jul-12 18:29, Andrei Alexandrescu wrote:
 On 7/5/12 9:05 AM, Dmitry Olshansky wrote:
 On 05-Jul-12 17:04, Roman D. Boiko wrote:
 On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:
 Then do it. It's all about having something so obviously good, fast
 and flexible that other stuff can be considered only as toys.

 I might do one, but I'd rather just help other folks make it faster ;)

 Would you help to make Pegged faster?

 Well why not.
 But first I'll need to deliver some stuff on my GSOC project.

 Good call :o).

 Note that we can discuss tweaking the scope of the project after the
 midterm. Ping me if you have some ideas.

It's great that you are not opposed to adjusting scope of project.
I have a ton of ideas, but let's discuss them after midterm.


-- 
Dmitry Olshansky

Jul 05 2012

Caligo <iteronvexor gmail.com> writes:

Is the actual grammar available somewhere?  The online language
reference is all we got I guess? DMD doesn't seem to be using
yacc/bison either.

On Thu, Jul 5, 2012 at 7:11 AM, Denis Shelomovskij
<verylonglogin.reg gmail.com> wrote:
 There are more and more projects requiring parsing D code (IDE plugins, D=

CT
 and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer). We
 already have a good one at least in Visual D. Maybe dmd's parser is faste=

r.
 If so, it can be (easily?) rewritten in D. We also already have some othe=

r


 What about to get one and call it standard?

 --
 =E4=C5=CE=C9=D3 =F7. =FB=C5=CC=CF=CD=CF=D7=D3=CB=C9=CA
 Denis V. Shelomovskij

Jul 05 2012

"Roman D. Boiko" <rb d-coding.com> writes:

On Thursday, 5 July 2012 at 15:42:22 UTC, Caligo wrote:
 Is the actual grammar available somewhere?  The online language
 reference is all we got I guess? DMD doesn't seem to be using
 yacc/bison either.

In PEG format, yes (not necessarily correct):
https://github.com/roman-d-boiko/Pegged/blob/dct/pegged/examples/dgrammar.d

I don't know about any others, though.

Jul 05 2012

Kai Meyer <kai unixlords.com> writes:

On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have


 What about to get one and call it standard?

I know I'm a little late coming into this conversation. This seems like 
a nice thread to toss myself into. I've started working on a generic 
lexer that is based off of a defined grammar.

As I read through the thread (I unfortunately don't have enough time to 
read every post, but I skimmed through as many as I could, and read the 
ones that seemed important), it seems like we need a parser in D that 
can lex D, and provide a Range of tokens that could be consumed.

With some very minor tweaks, and a properly written Grammar class, I 
basically have it already done. D was going to be one of the first 
languages I would have written a definition for.

https://github.com/kai4785/firstfront

I haven't had time to look through Pegged, but there are some 
differences in my approach that I can immediately see in Pegged's.

1) Pegged does compile-time or run-time code generation for your parser. 
Mine is run-time only and regex based.
2) Pegged uses PEG format, which I have been introduced to through this 
thread. One of my goals was to actually invent something like PEG. This 
will save me time :)

I would love to receive some critical feedback on my approach as well as 
the actual code base.

To build it, you'll need cmake, and cmaked2 from here:
http://code.google.com/p/cmaked2/wiki/GettingStarted

Or just dmd *.d :) I haven't tried to build it on Windows yet, but I 
don't see anything that immediately jumps out as not cross-platform. 
I've been working on it on both Fedora and CentOS.

-Kai Meyer

Jul 31 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This 
 seems like a nice thread to toss myself into. I've started 
 working on a generic lexer that is based off of a defined 
 grammar.

Every helping hand is appreciated :-)

 As I read through the thread (I unfortunately don't have enough 
 time to read every post, but I skimmed through as many as I 
 could, and read the ones that seemed important), it seems like 
 we need a parser in D that can lex D, and provide a Range of 
 tokens that could be consumed.

Yes. To make this statement more precise: We need a lexer that 
provides
a range of tokens and we need a parser which makes it possible to 
build
an AST. Pegged combines both approaches but imposes an overhead 
if you
just need a token list. However I'm not sure if this is a problem.

There are already some working D-Parsers buried in this thread.

 With some very minor tweaks, and a properly written Grammar 
 class, I basically have it already done. D was going to be one 
 of the first languages I would have written a definition for.

 https://github.com/kai4785/firstfront

I've only glimpsed at your code. For most languages lexing is far 
more expensive then parsing and thus the lexer has to be very 
fast and I wouldn't pursue your approach and instead use 
something like ragel. It already has D output but needs a 
separate build step.

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:
 On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This seems like a
 nice thread to toss myself into. I've started working on a generic lexer
 that is based off of a defined grammar.


 Every helping hand is appreciated :-)

Hi Kai, welcome here!

 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the ones
 that seemed important), it seems like we need a parser in D that can lex D,
 and provide a Range of tokens that could be consumed.


 Yes. To make this statement more precise: We need a lexer that provides
 a range of tokens and we need a parser which makes it possible to build
 an AST. Pegged combines both approaches but imposes an overhead if you
 just need a token list. However I'm not sure if this is a problem.

I think we need a tokenizer generator and a parser generator. They can
be grouped or not, but we need both, in Phobos.

We also need to define what's needed in a token. Kai's approach is OK,
but what's the _type field for an operator or and identifier?
Also, a lexer can fill a symbol table and produce identifier tokens
that refer to a particular entry in the symbol table.

I guess creating a tree of symbol tables according to scope visibility
is then more the job of the parser, but I'm not sure.


 There are already some working D-Parsers buried in this thread.

Yes. We also need a D parser (not generic), but no one pushed one for
Phobos inclusion right now.
Anyway, we can put generic parsers in Phobos too (LALR, LL(*), etc),
but I'd say the first priority would be to have a D lexer (producing a
range of tokens) and a D parser (consuming a range of tokens, and
executing semantic actions, like the building of an AST). Generic
parsers can come later on. Having a D parser in the standard
distribution would create much goodwill. Many people want that.


 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

Is that so?

 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It already has D
 output but needs a separate build step.

Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

The only problem I see in Kai's approach (which is the standard one, a
prioritized list of regexes) is how to recognize tokens that are not
regular (I mean, that cannot be encoded as a regex), like nesting
comments? How does the lexer know when to stop producing a 'comment'
token?

Philippe

Jul 31 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 31 July 2012 at 21:10:52 UTC, Philippe Sigaud wrote:
 I've only glimpsed at your code. For most languages lexing is 
 far more
 expensive then parsing

 Is that so?

I have no numbers. It's a statement that I found in some (1-3) 
sources about parsing. I'll share if I can find them.

 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It 
 already has D
 output but needs a separate build step.

 Having std.lexer in Phobos would be quite good. With a 
 pre-compiled lexer for D.

Yeah.

Jul 31 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 01-Aug-12 01:10, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:
 On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

 I know I'm a little late coming into this conversation. This seems like a
 nice thread to toss myself into. I've started working on a generic lexer
 that is based off of a defined grammar.


 Every helping hand is appreciated :-)

 Hi Kai, welcome here!

 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the ones
 that seemed important), it seems like we need a parser in D that can lex D,
 and provide a Range of tokens that could be consumed.


 Yes. To make this statement more precise: We need a lexer that provides
 a range of tokens and we need a parser which makes it possible to build
 an AST. Pegged combines both approaches but imposes an overhead if you
 just need a token list. However I'm not sure if this is a problem.

 I think we need a tokenizer generator and a parser generator. They can
 be grouped or not, but we need both, in Phobos.

 We also need to define what's needed in a token. Kai's approach is OK,
 but what's the _type field for an operator or and identifier?
 Also, a lexer can fill a symbol table and produce identifier tokens
 that refer to a particular entry in the symbol table.

Yeah.

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

Parser can use constant IDs for nested tables, IDs point to string table.
String table is populated by lexer.
 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing

 Is that so?

It usually is. Unless parser is overly generic as GLL/GLR (not to claim 
that it's very slow but GLL/GLR are slower for obvious reasons).

 and thus the lexer has to be very fast and I wouldn't
 pursue your approach and instead use something like ragel. It already has D
 output but needs a separate build step.


Have to agree here if anything a better DFA generator is needed, current 
std.regex can't get as good in this field because of:

a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in 
ctRegex via codegen)
b) full unicode features commitment (again expect some improvement here 
in near future)
c) designed to take multiple captures from matched piece of text.

I'm not sure when (or even if) std.regex will ever special case all of 
the above.

 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for
D.

 The only problem I see in Kai's approach (which is the standard one, a
 prioritized list of regexes) is how to recognize tokens that are not
 regular (I mean, that cannot be encoded as a regex), like nesting
 comments?

See below.

 How does the lexer know when to stop producing a 'comment'
 token?

Call special function skipComment when lexer hits comment_start pseudotoken.

Typically lexeres are regular as it allows them to be fast.


-- 
Dmitry Olshansky

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
<dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

 Parser can use constant IDs for nested tables, IDs point to string table.
 String table is populated by lexer.

The latter, I get. The former, not so much.


 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing


 Is that so?


 It usually is. Unless parser is overly generic as GLL/GLR (not to claim that
 it's very slow but GLL/GLR are slower for obvious reasons).

I'd have thought that, seeing as 'the end result' is more complicated
(richer, if I may say so) for parsing, then parsing is more expensive.
I'm reading on GLR, and was wondering what the practical limits are.
Do people use GLR to parse thousands of pages of text?


 Have to agree here if anything a better DFA generator is needed, current
 std.regex can't get as good in this field because of:

 a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in
 ctRegex via codegen)
 b) full unicode features commitment (again expect some improvement here in
 near future)
 c) designed to take multiple captures from matched piece of text.

 I'm not sure when (or even if) std.regex will ever special case all of the
 above.

Well,

- for a lexer lookahead is sometimes useful (the Dragon book cite the
FORTRAN grammar, for which keywords are not reserved and so when you
encounter IF, you don't know if (!) it's a function call or a 'real'
if)
- Unicode is needed to lex D correctly, no?
- multiple captures doesn't seem necessary *for a lexer*


 How does the lexer know when to stop producing a 'comment'
 token?


 Call special function skipComment when lexer hits comment_start pseudotoken.

Ah, and this special function must somehow maintain a stack, to
'count' the comment_start and comment_end pseudotokens. So in a way,
it quits the regular world to become temporarily more powerful.

 Typically lexeres are regular as it allows them to be fast.

Hmm, but then it has to treat comments a one token. So no Ddoc love?

Jul 31 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Typically lexeres are regular as it allows them to be fast.


Regularity of the language is not required for speed.

 Hmm, but then it has to treat comments a one token. So no Ddoc love?

Ddoc is typically not required. By default it should be treated as
whitespace. If it is required, one token seems reasonable: The
post-processing of the doc comment is best done as a separate step.

Jul 31 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

That was how I was intending to deal with ddoc. It's just a nested block 
comment token. The whole comment string is there, so the ddoc processor can 
use that to do whatever it does. ddoc isn't part of lexing really. It's a 
separate thing.

- Jonathan M Davis

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

 That was how I was intending to deal with ddoc. It's just a nested block
 comment token. The whole comment string is there, so the ddoc processor can
 use that to do whatever it does. ddoc isn't part of lexing really. It's a
 separate thing.

OK. Same for standard comment and doc comments?
I was wondering how to get the code possibly inside a ---- / ----
block (I never dealt with something like documentation or syntax
highlighting), but your solution makes it easy:

Toten(TokenType.DocComment, "/** ... */"), Token(TokenType.Module,
"module"), ...

A small specialised parser can then extract text, DDocs macros and
code blocks from inside the comment. Findind and stripping '----' is
easy and then the lexer can be locally reinvoked on the slice
containing the example code.

Jul 31 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 01-Aug-12 02:54, Timon Gehr wrote:
 On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:
 Typically lexeres are regular as it allows them to be fast.


 Regularity of the language is not required for speed.

That's why there is "allows" (not "required") the reason in general is 
that plain deterministic automation is insanely fast compared with just 
about any CFG parsing scheme.

Yet there are certain grammars/cases where it doesn't matter much. Also 
tweaking DFA by "semantic" actions could make it handle some 
irregularities at no extra cost etc.

 Hmm, but then it has to treat comments a one token. So no Ddoc love?

 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

Yup.

-- 
Dmitry Olshansky

Jul 31 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

 Parser can use constant IDs for nested tables, IDs point to string table.
 String table is populated by lexer.

 The latter, I get. The former, not so much.


 I've only glimpsed at your code. For most languages lexing is far more
 expensive then parsing


 Is that so?


 It usually is. Unless parser is overly generic as GLL/GLR (not to claim that
 it's very slow but GLL/GLR are slower for obvious reasons).

 I'd have thought that, seeing as 'the end result' is more complicated
 (richer, if I may say so) for parsing, then parsing is more expensive.
 I'm reading on GLR, and was wondering what the practical limits are.
 Do people use GLR to parse thousands of pages of text?


 Have to agree here if anything a better DFA generator is needed, current
 std.regex can't get as good in this field because of:

 a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in
 ctRegex via codegen)
 b) full unicode features commitment (again expect some improvement here in
 near future)
 c) designed to take multiple captures from matched piece of text.

 I'm not sure when (or even if) std.regex will ever special case all of the
 above.

 Well,

 - for a lexer lookahead is sometimes useful (the Dragon book cite the
 FORTRAN grammar, for which keywords are not reserved and so when you
 encounter IF, you don't know if (!) it's a function call or a 'real'
 if)

Well while lookahead will help, there are simpler ways. e.g.
regex ("IF\ZYOUR_LOOKAHEAD");

\Z means that you are capturing text only up to \Z. It's still regular. 
I think Boost regex does support this. We could have supported this 
cleanly but have chosen ECM-262 standard.


Critical difference is that lookahead can be used like this:

regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

 - Unicode is needed to lex D correctly, no?

Not that much of unicode, e.g. it could skip decoding stuff most of the 
time.

 - multiple captures doesn't seem necessary *for a lexer*


 How does the lexer know when to stop producing a 'comment'
 token?


 Call special function skipComment when lexer hits comment_start pseudotoken.

 Ah, and this special function must somehow maintain a stack, to
 'count' the comment_start and comment_end pseudotokens. So in a way,
 it quits the regular world to become temporarily more powerful.

 Typically lexeres are regular as it allows them to be fast.

 Hmm, but then it has to treat comments a one token. So no Ddoc love?


-- 
Dmitry Olshansky

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 7:48 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 Well,

 - for a lexer lookahead is sometimes useful (the Dragon book cite the
 FORTRAN grammar, for which keywords are not reserved and so when you
 encounter IF, you don't know if (!) it's a function call or a 'real'
 if)


 Well while lookahead will help, there are simpler ways. e.g.
 regex ("IF\ZYOUR_LOOKAHEAD");

 \Z means that you are capturing text only up to \Z. It's still regular. I
 think Boost regex does support this. We could have supported this cleanly
 but have chosen ECM-262 standard.


 Critical difference is that lookahead can be used like this:

 regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

In the FORTRAN case, you indeed *need* to re-lex the stuff after IF,
with another regex, once you've determined it's an IF instruction and
not some moron who used IF as an identifier.

You know, as a first step, I'd be happy to get ctRegex to recognize the \Z flag.

Jul 31 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

 Parser can use constant IDs for nested tables, IDs point to string table.
 String table is populated by lexer.

 The latter, I get. The former, not so much.

Okay. Say lexer maps all unique strings that are not keywords to some ID.

Then parser creates a stack of scoped symbol tables.
These nested symbol tables use only IDs not strings themselves.
(Though I expect that only semantic analysis require the use of these 
tables)

-- 
Dmitry Olshansky

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 8:12 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 On 01-Aug-12 02:01, Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 I guess creating a tree of symbol tables according to scope visibility
 is then more the job of the parser, but I'm not sure.

 Parser can use constant IDs for nested tables, IDs point to string table.
 String table is populated by lexer.


 The latter, I get. The former, not so much.


 Okay. Say lexer maps all unique strings that are not keywords to some ID.

 Then parser creates a stack of scoped symbol tables.
 These nested symbol tables use only IDs not strings themselves.
 (Though I expect that only semantic analysis require the use of these
 tables)

Ah, that! This I know, I misunderstood your initial post. The symbol
table can also be pre-charged with keywords, if needed.

Jul 31 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

I'm actually quite far along with one now - one which is specifically written 
and optimized for lexing D. I'll probably be done with it not too long after 
the 2.060 release (though we'll see). Writing it has been going surprisingly 
quickly actually, and I've already found some bugs in the spec as a result 
(some of which have been fixed, some of which I still need to create pull 
requests for). So, regardless of what happens with my lexer, at least the spec 
will be more accurate.

- Jonathan M Davis

Jul 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-31 23:20, Jonathan M Davis wrote:

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see). Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

 - Jonathan M Davis

That's awesome. I really looking forward to this. Keep up the good work.

-- 
/Jacob Carlborg

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-07-31 23:20, Jonathan M Davis wrote:

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see). Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

BTW, do you have to code online somewhere?

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 13:30:31 Jacob Carlborg wrote:
 On 2012-07-31 23:20, Jonathan M Davis wrote:
 I'm actually quite far along with one now - one which is specifically
 written and optimized for lexing D. I'll probably be done with it not too
 long after the 2.060 release (though we'll see). Writing it has been
 going surprisingly quickly actually, and I've already found some bugs in
 the spec as a result (some of which have been fixed, some of which I
 still need to create pull requests for). So, regardless of what happens
 with my lexer, at least the spec will be more accurate.

 
 BTW, do you have to code online somewhere?

No, because I'm still in the middle of working on it.

- Jonathan M Davis

Aug 01 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

01.08.2012 1:20, Jonathan M Davis пишет:
 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see). Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

 - Jonathan M Davis

Good. Will wait for announce.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

 I'm actually quite far along with one now - one which is specifically written
 and optimized for lexing D. I'll probably be done with it not too long after
 the 2.060 release (though we'll see).

That was quick! Cool!

Writing it has been going surprisingly
 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the spec
 will be more accurate.

Could you please describe the kind of token it produces?
Can it build a symbol table?
Does it recognize all kind of strings (including q{ } ones)?
How does it deal with comments, particularly nested ones?
Does it automatically discard whitespaces or produce them as a token?
I'd favor this approach, if only because wrapping the lexer in a
filter!noWS(tokenRange) is easy.
Does it produce a lazy range btw?

Jul 31 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 31-Jul-12 20:10, Kai Meyer wrote:
 On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have


 What about to get one and call it standard?

 I know I'm a little late coming into this conversation. This seems like
 a nice thread to toss myself into. I've started working on a generic
 lexer that is based off of a defined grammar.

 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the
 ones that seemed important), it seems like we need a parser in D that
 can lex D, and provide a Range of tokens that could be consumed.

 With some very minor tweaks, and a properly written Grammar class, I
 basically have it already done. D was going to be one of the first
 languages I would have written a definition for.

 https://github.com/kai4785/firstfront

 I haven't had time to look through Pegged, but there are some
 differences in my approach that I can immediately see in Pegged's.

 1) Pegged does compile-time or run-time code generation for your parser.
 Mine is run-time only and regex based.
 2) Pegged uses PEG format, which I have been introduced to through this
 thread. One of my goals was to actually invent something like PEG. This
 will save me time :)

 I would love to receive some critical feedback on my approach as well as
 the actual code base.

Okay. Critics go as follows:

First, as an author of std.regex I'm pleased that you find it suitable 
to use it in lexer.  :)

Still there is a BIG word of warning:

Lexer based on trying out an array of regexes until one will match is 
NOT fast and not even close to fast ones. It is acceptable as proof of 
concept/prototype kind of thing only.

To help this use case I though of making multi-regex matching a-la:
match(text, regex1, regex2, regex3...);
then  lexing would be effectively a loop of form:

foreach(tok; match(text, regex1, regex2, regex3...)){
	switch(tok[0]){//suppose match returns tuple - N of regex matched + the 
usual piece of text
		...
	}
}
(with some glitches on /+ ... +/ and other non-regular stuff).

But until such time it's not to be used seriously in this context.

The reason is that each call to match does have some overhead to setup 
regex engine, it's constant but it's huge compared to what usual lexers 
typically do. The other thing is that you should use ctRegex for this 
kind of job if you can (i.e. compiler doesn't explode on it).


(keep in mind I only glimpsed the source, will post more feedback later)

 To build it, you'll need cmake, and cmaked2 from here:
 http://code.google.com/p/cmaked2/wiki/GettingStarted

 Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
 don't see anything that immediately jumps out as not cross-platform.
 I've been working on it on both Fedora and CentOS.

rdmd for the win!


-- 
Dmitry Olshansky

Jul 31 2012

Kai Meyer <kai unixlords.com> writes:

On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:
 On 31-Jul-12 20:10, Kai Meyer wrote:
 On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
 There are more and more projects requiring parsing D code (IDE plugins,
 DCT and dfmt now).

 We definitely need a _standard_ fast D parser (suitable as tokenizer).
 We already have a good one at least in Visual D. Maybe dmd's parser is
 faster. If so, it can be (easily?) rewritten in D. We also already have


 What about to get one and call it standard?

 I know I'm a little late coming into this conversation. This seems like
 a nice thread to toss myself into. I've started working on a generic
 lexer that is based off of a defined grammar.

 As I read through the thread (I unfortunately don't have enough time to
 read every post, but I skimmed through as many as I could, and read the
 ones that seemed important), it seems like we need a parser in D that
 can lex D, and provide a Range of tokens that could be consumed.

 With some very minor tweaks, and a properly written Grammar class, I
 basically have it already done. D was going to be one of the first
 languages I would have written a definition for.

 https://github.com/kai4785/firstfront

 I haven't had time to look through Pegged, but there are some
 differences in my approach that I can immediately see in Pegged's.

 1) Pegged does compile-time or run-time code generation for your parser.
 Mine is run-time only and regex based.
 2) Pegged uses PEG format, which I have been introduced to through this
 thread. One of my goals was to actually invent something like PEG. This
 will save me time :)

 I would love to receive some critical feedback on my approach as well as
 the actual code base.

 Okay. Critics go as follows:

 First, as an author of std.regex I'm pleased that you find it suitable
 to use it in lexer. :)

 Still there is a BIG word of warning:

 Lexer based on trying out an array of regexes until one will match is
 NOT fast and not even close to fast ones. It is acceptable as proof of
 concept/prototype kind of thing only.

 To help this use case I though of making multi-regex matching a-la:
 match(text, regex1, regex2, regex3...);
 then lexing would be effectively a loop of form:

 foreach(tok; match(text, regex1, regex2, regex3...)){
 switch(tok[0]){//suppose match returns tuple - N of regex matched + the
 usual piece of text
 ...
 }
 }
 (with some glitches on /+ ... +/ and other non-regular stuff).

 But until such time it's not to be used seriously in this context.

 The reason is that each call to match does have some overhead to setup
 regex engine, it's constant but it's huge compared to what usual lexers
 typically do. The other thing is that you should use ctRegex for this
 kind of job if you can (i.e. compiler doesn't explode on it).


 (keep in mind I only glimpsed the source, will post more feedback later)

 To build it, you'll need cmake, and cmaked2 from here:
 http://code.google.com/p/cmaked2/wiki/GettingStarted

 Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
 don't see anything that immediately jumps out as not cross-platform.
 I've been working on it on both Fedora and CentOS.

 rdmd for the win!

I cut my teeth on perl, so everything gets compared to perl in my mind.

In perl, I can 'precompile' a regular expression, so I can avoid some 
overhead. So something like this:

while(<STDIN>){
     if($_ =~ m/<someregex>/){
         some work;
     }
}

Would end up re-compiling the regex on each line from STDIN. Perl uses 
the term "compiling" the regular expression, which may be different than 
what you call "setup regex engine".

Does/Can D's std.regex offer something like this? If not, I would be 
interested in why not.

-Kai Meyer

Aug 06 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 I cut my teeth on perl, so everything gets compared to perl in my mind.

 In perl, I can 'precompile' a regular expression, so I can avoid some
 overhead. So something like this:

 while(<STDIN>){
     if($_ =~ m/<someregex>/){
         some work;
     }
 }

 Would end up re-compiling the regex on each line from STDIN. Perl uses the
 term "compiling" the regular expression, which may be different than what
 you call "setup regex engine".

 Does/Can D's std.regex offer something like this? If not, I would be
 interested in why not.

D does have compiled regex, it's called ctRegex (meaning compile-time regex):

http://dlang.org/phobos/std_regex.html#ctRegex

The tests done recently put it among the fastest regex engine known.
Caution: not every regex extension known to mankind is implemented
here!

Aug 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Aug-12 22:52, Philippe Sigaud wrote:
 I cut my teeth on perl, so everything gets compared to perl in my mind.

 In perl, I can 'precompile' a regular expression, so I can avoid some
 overhead. So something like this:

 while(<STDIN>){
      if($_ =~ m/<someregex>/){
          some work;
      }
 }

 Would end up re-compiling the regex on each line from STDIN. Perl uses the
 term "compiling" the regular expression, which may be different than what
 you call "setup regex engine".


Of course regex is first compiled to bytecode (same thing as "compile" 
in perl). Moreover if you use regex pattern directly it is compiled on 
first use and put into TLS cache of compiled patterns. From now on it's 
used in compiled form. (there about 8 entries in cache, don't relay on 
it too much).

What I mean by "setup engine" is extra cost spent on _starting_ matching 
with compiled pattern. For one thing it adds 1 call to malloc/free.
Again I think in Perl the same thing applies. In other words doing 
continuous search (via foreach(x; match(..., "g" )) ) is much faster 
then calling match on individual pieces over and over again in cycle.

 Does/Can D's std.regex offer something like this? If not, I would be
 interested in why not.


Of course it does, in fact you can't match without compiling pattern 
first (it just provides a shortcut that does it for you behind the scenes).

 D does have compiled regex, it's called ctRegex (meaning compile-time regex):

 http://dlang.org/phobos/std_regex.html#ctRegex

And there is a second version - compiled native code. Unlike perl it's 
not bytecode and thus usually much faster.
Frankly the most slow regex I've seen is in Python, the second most 
sucky one is PCRE (but is incredibly popular somehow). Perl is not bad 
but usually slower then top dogs from C++ & std.regex.

 The tests done recently put it among the fastest regex engine known.

Yeah, on top of said chart. Needless to say the test favored my 
implementation, ctRegex is not the fastest regex engine  in general (yet).

 Caution: not every regex extension known to mankind is implemented
 here!

Sure as hell. In fact, the most problematic thing is that parser often 
fails during CTFE.
Also I have a solid plan on enhancing a bunch of things effectively 
making std.regex v2 but no sooner then October-November.


-- 
Dmitry Olshansky

Aug 06 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Of course regex is first compiled to bytecode (same thing as "compile" in
 perl). Moreover if you use regex pattern directly it is compiled on first
 use and put into TLS cache of compiled patterns. From now on it's used in
 compiled form. (there about 8 entries in cache, don't relay on it too much).

Btw, I wanted to ask you that for a long time: what do you mean by
'compiled to bytecode', for D source?

 And there is a second version - compiled native code. Unlike perl it's not
 bytecode and thus usually much faster.

Which?


 Frankly the most slow regex I've seen is in Python, the second most sucky
 one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
 slower then top dogs from C++ & std.regex.

Do people *really* need speed, or a great number of extensions?

 Sure as hell. In fact, the most problematic thing is that parser often fails
 during CTFE.

For example?

 Also I have a solid plan on enhancing a bunch of things effectively making
 std.regex v2 but no sooner then October-November.

That's good to know.

Aug 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Aug-12 23:43, Philippe Sigaud wrote:
 On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Of course regex is first compiled to bytecode (same thing as "compile" in
 perl). Moreover if you use regex pattern directly it is compiled on first
 use and put into TLS cache of compiled patterns. From now on it's used in
 compiled form. (there about 8 entries in cache, don't relay on it too much).

 Btw, I wanted to ask you that for a long time: what do you mean by
 'compiled to bytecode', for D source?

by this:
auto r = regex(some_string); // where some_string can come from user input.
r - contains bytecode that matches pattern.

Unlike ctRegex which does output D source code and compiles it to native 
code.
 And there is a second version - compiled native code. Unlike perl it's not
 bytecode and thus usually much faster.

 Which?

Compiled native code is faster. Or what you
 Frankly the most slow regex I've seen is in Python, the second most sucky
 one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
 slower then top dogs from C++ & std.regex.

 Do people *really* need speed, or a great number of extensions?

They want both. In my experience you can't satisfy everybody.
I think features are not what people look for in regex, even basic 
ECMA-262 stuff is way above what most programmers need to do. Unlike 
extra speed which even newbies could use :)
And while I'm at it - we already have a damn good collection of 
extensions, way too much I'd say.

For example, I've yet to see one regex engine that support Unicode to 
the same level as std.regex, namely I haven't seen a single one with 
full set operations (not only union but subtraction, intersection etc.) 
inside of char class [...]. Funny even ICU one didn't support them a 
year ago, dunno about now.

Also some extensions come from implementations inherent inefficiency 
e.g. (as in Perl) possessive quantifiers, atomic groups. No wonder it's 
so hard to make look-behind unrestricted in Perl, the whole thing is a mess.


 Sure as hell. In fact, the most problematic thing is that parser often fails
 during CTFE.

 For example?

Ehmn. See bugzilla, search ctRegex.
But not yet filed are: said non-union set operations usually fail + the 
fact that Unicode properties are not readable at CT (thus \p{xxx} fails).

 Also I have a solid plan on enhancing a bunch of things effectively making
 std.regex v2 but no sooner then October-November.

 That's good to know.


-- 
Dmitry Olshansky

Aug 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Aug-12 23:59, Dmitry Olshansky wrote:
 And there is a second version - compiled native code. Unlike perl it's not

 bytecode and thus usually much faster.

 Which?

Compiled native code is faster. Perl is slower.


-- 
Dmitry Olshansky

Aug 06 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Aug 6, 2012 at 10:00 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 On 06-Aug-12 23:59, Dmitry Olshansky wrote:
 And there is a second version - compiled native code. Unlike perl it's
 not

 bytecode and thus usually much faster.


 Which?


 Compiled native code is faster. Perl is slower.

Sorry, I meant: what second version? How do you distinguish between
bytecode-producing and compiled-code producing ctRegex?

Aug 06 2012

"David Nadlinger" <see klickverbot.at> writes:

On Tuesday, 7 August 2012 at 04:58:05 UTC, Philippe Sigaud wrote:
 Sorry, I meant: what second version? How do you distinguish 
 between
 bytecode-producing and compiled-code producing ctRegex?

As far as I know, ctRegex _always_ produces machine code. 
Bytecode is what the runtime implementation, i.e. regex, uses.

David

Aug 06 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:

 As far as I know, ctRegex _always_ produces machine code. Bytecode is what
 the runtime implementation, i.e. regex, uses.

That's what I had in mind too, but Dmitry seemed to say things are
more complicated. Maybe I misunderstood.
And do you know why it's called bytecode? I never heard of 'bytecode'
for D outside of std.regex. Does that just mean 'intermediate form D
code', as in 'bunch of small and easily optimized instructions.'?

Aug 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Aug-12 09:11, Philippe Sigaud wrote:
 On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:

 As far as I know, ctRegex _always_ produces machine code. Bytecode is what
 the runtime implementation, i.e. regex, uses.

 That's what I had in mind too, but Dmitry seemed to say things are
 more complicated. Maybe I misunderstood.

Yeah, sorry for confusion  - there are only 2 ways to do the job:
regex and ctRegex. The first one contains special bytecode, the second 
one generates native code.

 And do you know why it's called bytecode? I never heard of 'bytecode'
 for D outside of std.regex.

It's a special code that encodes pattern matching machine :) It's not a 
bytecode for D. In other words instruction go as follows: match char, 
match one of few chars, match one of character set, match any char, 
jump/loop/option etc.

 Does that just mean 'intermediate form D
 code', as in 'bunch of small and easily optimized instructions.'?

No :)

So far there are only 3 cases:
RT parsing ---> Bytecode ( { auto x = regex("blah) })
CT parsing ---> Bytecode ( static x = regex("blah"); )
CT parsing ---> Machine code ( static/auto  x = ctRegex!"blah" )

It would be nice to have JIT compiler for D at RT but no luck yet. 
(std.jit?) Then the picture would be complete.
-- 
Dmitry Olshansky

Aug 06 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
 On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com> 

wrote:
 On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
 Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
 for D.

 
 I'm actually quite far along with one now - one which is specifically
 written and optimized for lexing D. I'll probably be done with it not too
 long after the 2.060 release (though we'll see).

 
 That was quick! Cool!

Yeah. Once I started on it, I made a lot of progress really quickly. There's 
still a fair bit to do (primarily having to do with literals), but it probably 
won't take all that much longer. Certainly, I'd expect to have it done within 
a couple of weeks if not sooner, unless something goes wrong.

Writing it has been going surprisingly

 quickly actually, and I've already found some bugs in the spec as a result
 (some of which have been fixed, some of which I still need to create pull
 requests for). So, regardless of what happens with my lexer, at least the
 spec will be more accurate.

 
 Could you please describe the kind of token it produces?
 Can it build a symbol table?
 Does it recognize all kind of strings (including q{ } ones)?
 How does it deal with comments, particularly nested ones?
 Does it automatically discard whitespaces or produce them as a token?
 I'd favor this approach, if only because wrapping the lexer in a
 filter!noWS(tokenRange) is easy.
 Does it produce a lazy range btw?

Well, it's still a work in progress, so it certainly can be adjusted as 
necessary. I intend it to fully implement the spec (and make sure that both it 
and the spec match what dmd is doing) as far as lexing goes. The idea is that 
you should be able to build a fully compliant D parser on top of it and build 
a fully compliant D compiler on top of that.

It already supports all of the comment types and several of the string literal 
types. I haven't sorted out q{} yet, but I will before I'm done, and that may 
or may not affect how some things work, since I'm not quite sure how to handle 
q{} yet (it may end up being done with tokens marking the beginning and end of 
the token sequence encompassed by q{}, but we'll see). I'm in the middle of 
dealing with the named entity stuff at the moment, which unfortunately has 
revealed a rather nasty compiler bug with regards to template compile times, 
which I still need to report (I intend to do that this evening). The file 
generating the table of named entities currently takes over 6 minutes to 
compile on my Phenom II thanks to that bug, so I'm not quite sure how that's 
going to affect things. Regardless, the lexer should support _everything_ as 
far as what's required for fully lexing D goes by the time that I'm done.

I don't have the code with me at the moment, but I believe that the token type 
looks something like

struct Token
{
 TokenType type;
 string str;
 LiteralValue value;
 SourcePos pos;
}

struct SourcePos
{
 size_t line;
 size_ col;
 size_t tabWidth = 8;
}

The type is an enum which gives the type of the token (obviously) which 
includes the various comment types and an error type (so errors are reported 
by getting a token that was an error token rather than throwing or anything 
like that, which should make lexing passed malformed stuff easy).

str holds the exact text which was lexed (this includes the entire comment for 
the various comment token types), which in the case of lexing string rather 
than another range type would normally (always? - I don't remember) be a slice 
of the string being lexed, which should help make lexing string very efficient. 
It may or may not make sense to change that to the range type used rather than 
string. For nesting block comments, the whole comment is one token (with the 
token type which is specifically for nested comments), regardless of whether 
there's any nesting going on. But that could be changed if there were a need 
to get separate tokens for the comments inside.

LiteralValue is a VariantN of the types that a literal can be (long, ulong, 
real, and string IIRC) and is empty unless the token is a literal type (the 
various string postfixes - c,w, and d - are treated as different token types 
rather than giving the literal value different string types - the same with the 
integral and floating point literals).

And pos holds the position in the text where the token started, which should 
make it easy to use for syntax highlighting and the like (as well as 
indicating where an error occurred). The initial position is passed as an 
optional argument to the lexing function, so it doesn't have to be 1:1 (though 
that's the default), and it allows you to select the tabwidth.

So, you'll pass a range and an optional starting position to the lexing 
function, and it'll return a lazy range of Tokens. Whitespace is stripped as 
part of the lexing process, but if you take the pos properties of two adjacent 
tokens, you should be able to determine how much whitespace was between them.

I _think_ that that's how it currently is, but again, I don't have the code 
with me at the moment, so it may not be 100% correct. And since it's a work in 
progress, it's certainly open to changes.

- Jonathan M Davis

Jul 31 2012

travert phare.normalesup.org (Christophe Travert) writes:

"Jonathan M Davis" , dans le message (digitalmars.D:173860), a �crit�:
 struct Token
 {
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
 }
 
 struct SourcePos
 {
  size_t line;
  size_t col;
  size_t tabWidth = 8;
 }

The occurence of tabWidth surprises me.
What is col supposed to be ? an index (code unit), a character number 
(code point), an estimation of where the caracter is supposed to be 
printed on the line, given the provided tabwidth ?

I don't think the lexer can realy try to calculate at what column the 
character is printed, since it depends on the editor (if you want to use 
the lexer to syntax highlight for example), how it supports combining 
characters, zero or multiple column characters, etc. (which you may not 
want to have to decode).

You may want to provide the number of tabs met so far. Note that there 
are other whitespace that you may want to count, but you shouldn't have 
a very complicated SourcePos structure. It might be easier to have 
whitespace, endofline and endoffile tokens, and let the user filter out 
or take into account what he wants to take into account. Or just let the 
user look into the original string...

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 07:12:30 Christophe Travert wrote:
 "Jonathan M Davis" , dans le message (digitalmars.D:173860), a =C3=A9=

crit :
 struct Token
 {
=20
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
=20
 }
=20
 struct SourcePos
 {
=20
  size_t line;
  size_t col;
  size_t tabWidth =3D 8;
=20
 }

=20
 The occurence of tabWidth surprises me.
 What is col supposed to be ? an index (code unit), a character number=

 (code point), an estimation of where the caracter is supposed to be
 printed on the line, given the provided tabwidth ?

col counts code points. tabWidth is the number of code points that '\t'=
 is=20
considered to be. That's it. So, in theory, an editor should be able to=
 use it=20
to indicate where on the line the token starts.

If the code using the lexer wants to treat tabs as having the widtho of=
 a=20
single code point, then all they have to do is pass in a SourcePos with=
 a=20
tabWidth of 1. But if the lexer doesn't have a way to count tabs differ=
ently,=20
then there's no way for the code using the lexer to figure out the tabs=
 without=20
going back and lexing the whitespace itself. But counting tabs as a dif=
ferent=20
width than everything else is so common that it seemed prudent to add i=
t.=20
Given that Phobos doesn't support graphemes and that ranges use code po=
ints, a=20
code point is the closest to a character that the lexer would be counti=
ng, and=20
it just makes sense to count code points.

Now, the one thing that might be a problem with treating tabs as a fixe=
d width=20
is that it's not uncommon to treat tabs as being up to a particular wid=
th=20
rather than having a fixed width such that if there are other character=
s before=20
a tab, then the tabs width is the max tab width minus the number of cha=
racters=20
since the start of that tab width. e.g. if tabs had a max width of 8, t=
hen

\t123\t

could have the first tab with a width of 8 and the second as only havin=
g a=20
width of 5. But that's too complicated to deal with in the lexer IMHO.

Maybe the tab width isn't worth having in SourePos and it will ultimate=
ly be=20
removed, but it struck me as a useful feature, so I added it.

- Jonathan M Davis

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 00:38, Jonathan M Davis wrote:

 I don't have the code with me at the moment, but I believe that the token type
 looks something like

 struct Token
 {
   TokenType type;
   string str;
   LiteralValue value;
   SourcePos pos;
 }

 struct SourcePos
 {
   size_t line;
   size_ col;
   size_t tabWidth = 8;
 }

What about the end/length of a token? Token.str.length would give the 
number of bytes (code units?) instead of the number of characters (code 
points?). I'm not entirely sure what's needed when, for example, doing 
syntax highlighting. I assume you would know the length in characters of 
a given token internally inside the lexer?

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 10:25:18 Jacob Carlborg wrote:
 On 2012-08-01 00:38, Jonathan M Davis wrote:
 I don't have the code with me at the moment, but I believe that the token
 type looks something like
 
 struct Token
 {
 
   TokenType type;
   string str;
   LiteralValue value;
   SourcePos pos;
 
 }
 
 struct SourcePos
 {
 
   size_t line;
   size_ col;
   size_t tabWidth = 8;
 
 }

 
 What about the end/length of a token? Token.str.length would give the
 number of bytes (code units?) instead of the number of characters (code
 points?). I'm not entirely sure what's needed when, for example, doing
 syntax highlighting. I assume you would know the length in characters of
 a given token internally inside the lexer?

I'm not sure. I don't think so. It doesn't really keep track of code points. 
It operates in code units as much as possible, and pos doesn't really help, 
because any newline that occurred would make it so that subtracting the start 
col from the end col would be completely bogus (that and tabs would mess that 
up pretty thoroughly, but as Christophe pointed out, the whole tabWidth thing 
may not actually have been a good idea anyway).

It could certainly be added, but unless the lexer always knows it (and I'm 
pretty sure that it doesn't), then keeping track of that entails extra 
overhead. But maybe it's worth that overhead. I'll have to look at what I have 
and see. Worst case, the caller can just use walkLength on str, but if it has 
to do that all the time, then that's not exactly conducive to good 
performance.

- Jonathan M Davis

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 10:40, Jonathan M Davis wrote:

 It could certainly be added, but unless the lexer always knows it (and I'm
 pretty sure that it doesn't), then keeping track of that entails extra
 overhead. But maybe it's worth that overhead. I'll have to look at what I have
 and see. Worst case, the caller can just use walkLength on str, but if it has
 to do that all the time, then that's not exactly conducive to good
 performance.

Doing a syntax highlighter and calculate the length (walkLength) for 
each token would most likely slow it down a bit.

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 07:26:07 Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com> 

wrote:
 On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
 Ddoc is typically not required. By default it should be treated as
 whitespace. If it is required, one token seems reasonable: The
 post-processing of the doc comment is best done as a separate step.

 
 That was how I was intending to deal with ddoc. It's just a nested block
 comment token. The whole comment string is there, so the ddoc processor
 can
 use that to do whatever it does. ddoc isn't part of lexing really. It's a
 separate thing.

 
 OK. Same for standard comment and doc comments?

From the TokenType enum declaration:

    blockComment,         /// $(D /* */)
    lineComment,          /// $(D // )
    nestingBlockComment,  /// $(D /+ +/)

There are then functions which operate on Tokens to give you information about 
them. Among them is isDdocComment, which will return true if the Token type is 
a comment, and that comment is a ddoc comment (i.e. starts with /**, ///, or 
/++ rather than /*, //, or /+). So, anything that wants to process ddoc 
comments can lex them out and process them, and if they want to know what 
symbols that a ddoc comment applies to, then they look at the tokens that 
follow (though a full-on parser would be required to do that correctly).

 I was wondering how to get the code possibly inside a ---- / ----
 block (I never dealt with something like documentation or syntax
 highlighting), but your solution makes it easy:
 
 Toten(TokenType.DocComment, "/** ... */"), Token(TokenType.Module,
 "module"), ...
 
 A small specialised parser can then extract text, DDocs macros and
 code blocks from inside the comment. Findind and stripping '----' is
 easy and then the lexer can be locally reinvoked on the slice
 containing the example code.

Yes. The lexer isn't concerned with where the text comes from, and it isn't 
concerned with lexing comments beyond putting them in a token. But that should 
be powerful enough to lex the examples if you've already extracted them.

- Jonathan M Davis

Jul 31 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

Thanks for taking the time to write such an explanation!


 Yeah. Once I started on it, I made a lot of progress really quickly. There's
 still a fair bit to do (primarily having to do with literals), but it probably
 won't take all that much longer. Certainly, I'd expect to have it done within
 a couple of weeks if not sooner, unless something goes wrong.

Is it based on a prioritized list of regexes?

 Well, it's still a work in progress, so it certainly can be adjusted as
 necessary. I intend it to fully implement the spec (and make sure that both it
 and the spec match what dmd is doing) as far as lexing goes. The idea is that
 you should be able to build a fully compliant D parser on top of it and build
 a fully compliant D compiler on top of that.

My fear (having encoded the D grammar once) is that 99% of the code
constructs are easily done (even new stuff like the => syntax), but
that the remaining 1% is a nightmare. The many different kind of
strings with or without escaping, for example. Your lexer return a
string literal as a string field, right? If so, escaping chars like "
and ` makes me nervous.

 It already supports all of the comment types and several of the string literal
 types. I haven't sorted out q{} yet, but I will before I'm done, and that may
 or may not affect how some things work, since I'm not quite sure how to handle
 q{} yet (it may end up being done with tokens marking the beginning and end of
 the token sequence encompassed by q{}, but we'll see).

I see. My question was because, as I said before I found this to be a
difficult and hairy part of the D grammar, and an IDE might want to
colorize/manipulate the content of a string, when it knows its a
string mixin.


 I don't have the code with me at the moment, but I believe that the token type
 looks something like

 struct Token
 {
  TokenType type;
  string str;
  LiteralValue value;
  SourcePos pos;
 }

Seems OK. I guess it can be templated on the string type, because you
sure want to lex dstrings.


 The type is an enum which gives the type of the token (obviously) which
 includes the various comment types and an error type (so errors are reported
 by getting a token that was an error token rather than throwing or anything
 like that, which should make lexing passed malformed stuff easy).

And the recovery is done on the next token. Good idea.

 str holds the exact text which was lexed (this includes the entire comment for
 the various comment token types), which in the case of lexing string rather
 than another range type would normally (always? - I don't remember) be a slice
 of the string being lexed, which should help make lexing string very efficient.

Yeahs, you may as well use this excellent feature. But that means the
input must be held in memory always, no? If it's an input range
(coming from a big file), can you keep slices from it?

 It may or may not make sense to change that to the range type used rather than
 string. For nesting block comments, the whole comment is one token (with the
 token type which is specifically for nested comments), regardless of whether
 there's any nesting going on. But that could be changed if there were a need
 to get separate tokens for the comments inside.

That's OK, as I reply in another message. It's now quite clear for me
how to deal with that.
IDE/doc generator can parse the comment at their leasure, as they are
not long nor critical section of code.
They can even easily get the code fragment used as examples inside
---/--- markers.

 LiteralValue is a VariantN of the types that a literal can be (long, ulong,
 real, and string IIRC) and is empty unless the token is a literal type


You can make it an Algebraic!(long, ulong, real, string), as your
universe of type is restricted. That can even catch a non-legal value.
Arrays literals are not considered literals, since they can contain
arbitrary expressions, is that right?

And it's a good idea to drop the complex literals once and for all.

 (the
 various string postfixes - c,w, and d - are treated as different token types
 rather than giving the literal value different string types - the same with the
 integral and floating point literals).

That's seem reasonable enough, but can you really store  a dstring
literal in a string field?

 And pos holds the position in the text where the token started, which should
 make it easy to use for syntax highlighting

Does syntax highlighting need more that a token stream? Without having
thought a lot about it, it seems to me IDE tend to highlight based
just on the token type, not on a parse tree. So that means your lexer
can be used directly by interested people, that's nice.

 and the like (as well as
 indicating where an error occurred). The initial position is passed as an
 optional argument to the lexing function, so it doesn't have to be 1:1 (though
 that's the default), and it allows you to select the tabwidth.

I really like the idea of having malformed token become error token,
with the lexing that continue. As I said before, that enables an IDE
to continue to do syntax highlighting.


 So, you'll pass a range and an optional starting position to the lexing
 function, and it'll return a lazy range of Tokens. Whitespace is stripped as
 part of the lexing process, but if you take the pos properties of two adjacent
 tokens, you should be able to determine how much whitespace was between them.

OK, so whitespace is strictly that, and comments are not dropped. That's OK.
I guess a D parser would then just filter the comments out of the token range.


 I _think_ that that's how it currently is, but again, I don't have the code
 with me at the moment, so it may not be 100% correct. And since it's a work in
 progress, it's certainly open to changes.

My only quip for now would be the string/dstring thingy.

Did you achieve UFCS, wrt floating point values? is 1.f seen as a
float or as f called on 1 (int)?

Jul 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 07:44, Philippe Sigaud wrote:

 Does syntax highlighting need more that a token stream? Without having
 thought a lot about it, it seems to me IDE tend to highlight based
 just on the token type, not on a parse tree. So that means your lexer
 can be used directly by interested people, that's nice.

Some IDE's do a more advanced syntax highlighting based on the semantic 
analysis. For example, Eclipse highlights instance variables differently.

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
 Is it based on a prioritized list of regexes?

I'm not using regexes at all. It's using string mixins to reduce code 
duplication, but it's effectively hand-written. If I do it right, it should be 
_very_ difficult to make it any faster than it's going to be. It even 
specifically avoids decoding unicode characters and operates on ASCII 
characters as much as possible.

 Well, it's still a work in progress, so it certainly can be adjusted as
 necessary. I intend it to fully implement the spec (and make sure that
 both it and the spec match what dmd is doing) as far as lexing goes. The
 idea is that you should be able to build a fully compliant D parser on
 top of it and build a fully compliant D compiler on top of that.

 
 My fear (having encoded the D grammar once) is that 99% of the code
 constructs are easily done (even new stuff like the => syntax), but
 that the remaining 1% is a nightmare. The many different kind of
 strings with or without escaping, for example. Your lexer return a
 string literal as a string field, right? If so, escaping chars like "
 and ` makes me nervous.

Well, if you want the original text, there's the str property, which contains 
everything that was in the source code (including whatever escaping there 
was), whereas the value property includes only the body, and it will have 
already processed whatever escaped characters needed to be processed. It's the 
actual value of the literal. So, \` would be ` in the value property, named 
entities would be their actual code points, etc.

So, I don't really see a problem there. Sure, depending on what is done with 
the processed string, it could cause further problems, but that always happens 
when doing string processing.

 It already supports all of the comment types and several of the string
 literal types. I haven't sorted out q{} yet, but I will before I'm done,
 and that may or may not affect how some things work, since I'm not quite
 sure how to handle q{} yet (it may end up being done with tokens marking
 the beginning and end of the token sequence encompassed by q{}, but we'll
 see).

 
 I see. My question was because, as I said before I found this to be a
 difficult and hairy part of the D grammar, and an IDE might want to
 colorize/manipulate the content of a string, when it knows its a
 string mixin.

Well, it could very well cause me trouble. I haven't gotten to it yet, and 
I'll need to study it to fully understand what it does before crafting my 
solution, but it'll probably end up being a sequence of tokens of some kind 
given that that's what it is - a string of tokens. Regardless, it'll be done 
in a way that whatever is using the lexer will be able to know what it is and 
process is it accordingly.

 str holds the exact text which was lexed (this includes the entire comment
 for the various comment token types), which in the case of lexing string
 rather than another range type would normally (always? - I don't
 remember) be a slice of the string being lexed, which should help make
 lexing string very efficient.

 Yeahs, you may as well use this excellent feature. But that means the
 input must be held in memory always, no? If it's an input range
 (coming from a big file), can you keep slices from it?

Well, whatever is using the lexer is going to have to make sure that what it 
passes to the lexer continues to exist while it uses the lexer. Given how 
slicing works in D, it's pretty much a given that lexers and parsers are going 
to use them heavily, and any code using a lexer or parser is going to have to 
take that into account. Slicing is one of the major reasons that Tango's XML 
parser is so insanely fast.

 LiteralValue is a VariantN of the types that a literal can be (long,
 ulong,
 real, and string IIRC) and is empty unless the token is a literal type

 
 You can make it an Algebraic!(long, ulong, real, string), as your
 universe of type is restricted. That can even catch a non-legal value.

I'll have to look at that to see whether using Algebraic is better. I'm not 
super-familiar with std.variant, so I may have picked the wrong type. However, 
VariantN already holds a specific set of types though (unlike Variant), so that 
part isn't a problem.

 Arrays literals are not considered literals, since they can contain
 arbitrary expressions, is that right?

There is no concept of array literal in the lexing portion of the spec. There 
may be at the level of the parser, but that's not the lexer's problem.

 And it's a good idea to drop the complex literals once and for all.

I'm not supporting any symbols which are known to be scheduled for deprecation 
(e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is 
to-be-deprecated keywords (e.g. volatile and delete), since they still won't 
be legal to use as identifiers. And there's a function to query a token as to 
whether it's using a deprecated or unused keyword so that the program using 
the lexer can flag it if it wants to.

 (the
 various string postfixes - c,w, and d - are treated as different token
 types rather than giving the literal value different string types - the
 same with the integral and floating point literals).

 
 That's seem reasonable enough, but can you really store  a dstring
 literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the 
type of the literal. The only difference is the letter tacked onto the end. 
That will be turned into the appropriate string type be the semantic analyzer, 
but the lexer doesn't care.

 Does syntax highlighting need more that a token stream? Without having
 thought a lot about it, it seems to me IDE tend to highlight based
 just on the token type, not on a parse tree. So that means your lexer
 can be used directly by interested people, that's nice.

If all you want to do is highlight specific symbols and keywords, then you 
don't need a parser. If you want to do fancier stuff related to templates and 
the like, then you'll need a parser and maybe even a semantic analyzer. But a 
lexer should be plenty for basic syntax highlighting.

 I really like the idea of having malformed token become error token,
 with the lexing that continue. As I said before, that enables an IDE
 to continue to do syntax highlighting.

That was the idea. You need to be able to continue lexing beyond errors in 
many cases, and that seemed the natural way to do it.

 Did you achieve UFCS, wrt floating point values? is 1.f seen as a
 float or as f called on 1 (int)?

I haven't tackled floating point literals yet, but since the f in 1.f is not 
part of the floating point literal, I'll have to handle it. I suspect that 
that's one area where I'll need to update the spec though, since it probably 
hasn't been updated for that yet.

Basically, the lexer that I'm writing needs to be 100% compliant with the D 
spec and dmd (which means updating the spec or fixing dmd in some cases), and 
it needs to be possible to build on top of it anything and everything that dmd 
does that would use a lexer (even if it's not the way that dmd currently does 
it) so that it's possible to build a fully compliant D parser and compiler on 
top of it as well as a ddoc generator and anything else that you'd need to do 
with a lexer for D. So, if you have any questions about what my lexer does (or 
is supposed to do) with regards to the spec, that should answer it. If my 
lexer doesn't match the spec or dmd when I'm done (aside from specific 
exceptions relating to stuff like deprecated symbols), then I screwed up.

- Jonathan M Davis

Jul 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 08:11, Jonathan M Davis wrote:

 I'm not using regexes at all. It's using string mixins to reduce code
 duplication, but it's effectively hand-written. If I do it right, it should be
 _very_ difficult to make it any faster than it's going to be. It even
 specifically avoids decoding unicode characters and operates on ASCII
 characters as much as possible.

That's good idea. Most code can be treated as ASCII (I assume most 
people code in english). It would basically only be string literals 
containing characters outside the ASCII table.

BTW, have you seen this:

http://woboq.com/blog/utf-8-processing-using-simd.html

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 11:14:52 Jacob Carlborg wrote:
 On 2012-08-01 08:11, Jonathan M Davis wrote:
 I'm not using regexes at all. It's using string mixins to reduce code
 duplication, but it's effectively hand-written. If I do it right, it
 should be _very_ difficult to make it any faster than it's going to be.
 It even specifically avoids decoding unicode characters and operates on
 ASCII characters as much as possible.

 
 That's good idea. Most code can be treated as ASCII (I assume most
 people code in english). It would basically only be string literals
 containing characters outside the ASCII table.

What's of particular importance is the fact that _all_ of the language 
constructs are ASCII. So, unicode comes in exclusively with identifiers, string 
literals, char literals, and whitespace. And with those, ASCII is still going 
to be far more common, so coding it in a way that makes ASCII faster at slight 
cost to performance for unicode is perfectly acceptable.

 BTW, have you seen this:
 
 http://woboq.com/blog/utf-8-processing-using-simd.html

No, I'll have to take a look. I know pretty much nothing about SIMD though. 
I've only heard of it, because Walter implemented some SIMD stuff in dmd not 
too long ago.

- Jonathan M Davis

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 8:11 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
 Is it based on a prioritized list of regexes?

 I'm not using regexes at all. It's using string mixins to reduce code
 duplication, but it's effectively hand-written. If I do it right, it should be
 _very_ difficult to make it any faster than it's going to be. It even
 specifically avoids decoding unicode characters and operates on ASCII
 characters as much as possible.

OK. It'll more difficult to genericize, then, but that's not your
problem (could be mine, though).


 Well, whatever is using the lexer is going to have to make sure that what it
 passes to the lexer continues to exist while it uses the lexer.

Yeah, I realized that just after posting. And anyway, the token are
made to be consumed at once, normally.

 I'll have to look at that to see whether using Algebraic is better. I'm not
 super-familiar with std.variant, so I may have picked the wrong type. However,
 VariantN already holds a specific set of types though (unlike Variant), so that
 part isn't a problem.

OK, I forgot about VariantN (I'm not so used to std.variant either)


 I'm not supporting any symbols which are known to be scheduled for deprecation
 (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is
 to-be-deprecated keywords (e.g. volatile and delete), since they still won't
 be legal to use as identifiers. And there's a function to query a token as to
 whether it's using a deprecated or unused keyword so that the program using
 the lexer can flag it if it wants to.

Good idea. A nice bunch of query functions will be a nice complement

- isDoc
- isComment
- isString
- isDeprecated
...


 That's seem reasonable enough, but can you really store  a dstring
 literal in a string field?

 Yeah. Why not? The string is the same in the source code regardless of the
 type of the literal. The only difference is the letter tacked onto the end.
 That will be turned into the appropriate string type be the semantic analyzer,
 but the lexer doesn't care.

I don't get it. Say I have an literal with non UTF-8 chars, how will
it be stored inside the .str field as a string?


 Basically, the lexer that I'm writing needs to be 100% compliant with the D
 spec and dmd (which means updating the spec or fixing dmd in some cases), and
 it needs to be possible to build on top of it anything and everything that dmd
 does that would use a lexer (even if it's not the way that dmd currently does
 it) so that it's possible to build a fully compliant D parser and compiler on
 top of it as well as a ddoc generator and anything else that you'd need to do
 with a lexer for D. So, if you have any questions about what my lexer does (or
 is supposed to do) with regards to the spec, that should answer it. If my
 lexer doesn't match the spec or dmd when I'm done (aside from specific
 exceptions relating to stuff like deprecated symbols), then I screwed up.

That's a lofty goal, but that would indeed be quite good to have an
officially integrated lexer in Phobos that would (as Andrei said) "be
it". The token spec would be the lexer.

Jul 31 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:
 OK. It'll more difficult to genericize, then, but that's not your
 problem (could be mine, though).

It was never intended to be even vaguely generic. It's targeting D 
specifically. If someone can take it and make it generic when I'm done, then 
great. But it's goal is to lex D as efficiently as possible, and it'll do 
whatever it takes to do that. From how the main switch statement's cases are 
constructed though, there's a lot there which could be genericized. I 
currently have several mixins used to create them, but I'm pretty sure that I 
can generate a _lot_ of the case statements using a single mixin which just 
takes the list of symbols and their associated tokens, which I'll probably do 
before I'm done. So, I'm sure that pieces of what I'm doing could be used to 
generate a lexer for another language, but plenty of it is very specific to D.

 That's seem reasonable enough, but can you really store  a dstring
 literal in a string field?

 
 Yeah. Why not? The string is the same in the source code regardless of the
 type of the literal. The only difference is the letter tacked onto the
 end.
 That will be turned into the appropriate string type be the semantic
 analyzer, but the lexer doesn't care.

 
 I don't get it. Say I have an literal with non UTF-8 chars, how will
 it be stored inside the .str field as a string?

The literal is written in whatever encoding the range is in. If it's UTF-8, 
it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set 
of characters that UTF-32 can. Your range could be UTF-32, but the string 
literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when 
the literal is UTF-32. The characters themselves are in the encoding type of 
the range regardless. It's just the values that the compiler generates which 
change.

"hello world"
"hello world"c
"hello world"w
"hello world"d

are absolutely identical as far as lexing goes save for the trailing 
character. It would be the same regardless of the characters in the strings or 
the encoding used in the source file.

In either case, a lot of string literals have to be decoded (e.g if they 
contain escaped characters), so you often can't create them with a slice 
anyway, and if a range is used which isn't one of the string types, then it's 
impossible for Token's value property to use the range type whenever it can't 
use a slice. So, it's just simpliest to always use string. It may be a slight 
performance hit for lexing wstrings and dstrings, since they _could_ be both 
sliced and created as new strings (unlike other ranges), but I don't think 
that it's worth the extra complication to make it so that the string literal's 
value could be other string types, especially when lexing strings is likely to 
be the common case.

 Basically, the lexer that I'm writing needs to be 100% compliant with the
 D
 spec and dmd (which means updating the spec or fixing dmd in some cases),
 and it needs to be possible to build on top of it anything and everything
 that dmd does that would use a lexer (even if it's not the way that dmd
 currently does it) so that it's possible to build a fully compliant D
 parser and compiler on top of it as well as a ddoc generator and anything
 else that you'd need to do with a lexer for D. So, if you have any
 questions about what my lexer does (or is supposed to do) with regards to
 the spec, that should answer it. If my lexer doesn't match the spec or
 dmd when I'm done (aside from specific exceptions relating to stuff like
 deprecated symbols), then I screwed up.

 That's a lofty goal, but that would indeed be quite good to have an
 officially integrated lexer in Phobos that would (as Andrei said) "be
 it". The token spec would be the lexer.

Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be 
in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually 
switch to using the lexer and parser from Phobos, and in that sort of 
situation, it's that much more imperative that they follow the spec exactly.

- Jonathan M Davis

Jul 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 08:39, Jonathan M Davis wrote:

 Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be
 in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually
 switch to using the lexer and parser from Phobos, and in that sort of
 situation, it's that much more imperative that they follow the spec exactly.

:)

-- 
/Jacob Carlborg

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 It was never intended to be even vaguely generic. It's targeting D
 specifically. If someone can take it and make it generic when I'm done, then
 great. But it's goal is to lex D as efficiently as possible, and it'll do
 whatever it takes to do that.

That's exactly what I had in mind. Anyway, we need a D lexer. We also
need a generic lexer generator, but as a far-away second choice and we
can admit it being less efficient. Of course, any trick used on the D
lexer can most probably be reused for Algol-family lexers.


 I don't get it. Say I have an literal with non UTF-8 chars, how will
 it be stored inside the .str field as a string?

 The literal is written in whatever encoding the range is in. If it's UTF-8,
 it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set
 of characters that UTF-32 can. Your range could be UTF-32, but the string
 literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when
 the literal is UTF-32. The characters themselves are in the encoding type of
 the range regardless. It's just the values that the compiler generates which
 change.

 "hello world"
 "hello world"c
 "hello world"w
 "hello world"d

 are absolutely identical as far as lexing goes save for the trailing
 character. It would be the same regardless of the characters in the strings or
 the encoding used in the source file.

Everytime I think I understand D strings, you prove me wrong. So, I
*still* don't get how that works:

say I have

auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;

Then, the "..." part is lexed as a string literal. How can the string
field in the Token magically contain UTF32 characters? Or, are they
automatically cut in four nonsense chars each? What about comments
containing non-ASCII chars? How can code coming after the lexer make
sense of it?

As Jacob say, many people code in English. That's right, but

1- they most probably use their own language for internal documentation
2- any in8n part of a code base will have non-ASCII chars
3- D is supposed to accept UTF-16 and UTF-32 source code.

So, wouldn't it make sense to at least provide an option on the lexer
to specifically store identifier lexemes and comments as a dstring?

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 14:44, Philippe Sigaud wrote:

 Everytime I think I understand D strings, you prove me wrong. So, I
 *still* don't get how that works:

 say I have

 auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;

 Then, the "..." part is lexed as a string literal. How can the string
 field in the Token magically contain UTF32 characters? Or, are they
 automatically cut in four nonsense chars each? What about comments
 containing non-ASCII chars? How can code coming after the lexer make
 sense of it?

 As Jacob say, many people code in English. That's right, but

 1- they most probably use their own language for internal documentation
 2- any in8n part of a code base will have non-ASCII chars
 3- D is supposed to accept UTF-16 and UTF-32 source code.

 So, wouldn't it make sense to at least provide an option on the lexer
 to specifically store identifier lexemes and comments as a dstring?

I'm not quite sure how it works either. But I'm thinking like this:

The string representing what's in the source code can be either UFT-8 or 
the encoding of the file. I'm not sure if the lexer needs to re-encode 
the string if it's not in the same encoding as the file.

Then there's an other field/function that returns the processed token, 
i.e. for a token of the type "int" it will return an actual int. This 
function will return different types of string depending on the type of 
the string literal the token represents.

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
 Everytime I think I understand D strings, you prove me wrong. So, I
 *still* don't get how that works:
=20
 say I have
=20
 auto s =3D " - some greek or chinese chars, mathematical symbols, wha=

tever -
 "d;
=20
 Then, the "..." part is lexed as a string literal. How can the string=

 field in the Token magically contain UTF32 characters?

It contains unicode. The lexer is lexing whatever encoding the source i=
s in,=20
which has _nothing_ to do with the d on the end. It could be UTF-8, or =
UTF-16,=20
or UTF-32. If we supported other encodings in ranges, it could be one o=
f=20
those. Which of those it is is irrelevant. As far as the value of the l=
iteral=20
goes, these two strings are identical:

"=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

The encoding of the source file is irrelevant. By tacking a d on the en=
d

"=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"d
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d

you're just telling the compiler that you want the value that it genera=
tes to=20
be in UTF-32. The source code could be in any of the supported encoding=
s, and=20
the string could be held in any encoding until the object code is actua=
lly=20
generated.

 So, wouldn't it make sense to at least provide an option on the lexer=

 to specifically store identifier lexemes and comments as a dstring?

You mean make it so that Token is=20

struct Token(R)
{
    TokenType    type;
    R       str;
    LiteralValue value
    SourcePos    pos;
}

instead of

struct Token
{
    TokenType    type;
    string       str;
    LiteralValue value
    SourcePos    pos;
}

or do you mean something else? I may do something like that, but I woul=
d point=20
out that if R doesn't have slicing, then that doesn't work. So, str can=
't=20
always be the same type as the original range. For ranges with no slici=
ng, it=20
would have to be something else (probably either string or=20
typeof(takeExactly(range))). However, making str R _does_ come at the c=
ost of=20
complicating code using the lexer, since instead of just using Token, y=
ou have=20
to worry about whether it's a Token!string, Token!dstring, etc, and whe=
ther=20
it's worth that complication is debatable. By far the most common use c=
ase is=20
to lex string, and if str is string, and R is not, then you incur the p=
enalty=20
of converting R to string. So, the common use case is fast, and the unc=
ommon=20
use case still works but is slower, and the user of the lexer doesn't h=
ave to=20
care what the original range type was.

It could go either way. I used string on first pass, but as I said, I c=
ould=20
change it to R later if that makes more sense. I'm not particularly hun=
g up on=20
that little detail at this point, and that's probably one of the things=
 that=20
can be changed reasonably easily later.

- Jonathan M Davis

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrot=
e:

 "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

 The encoding of the source file is irrelevant.

do you mean I can do:

string field =3D "=E3=82=A6=E3=82=A7=E3=83=96=E3=82=B5=E3=82=A4=E3=83=88";

?

Geez, just tested it, it works. even writeln(field) correctly output
the japanese chars and dmd doesn't choke on it.
Bang, back to state 0: I don't get how D strings work.



 You mean make it so that Token is

 struct Token(R)
 {
     TokenType    type;
     R       str;
     LiteralValue value
     SourcePos    pos;
 }

 instead of

 struct Token
 {
     TokenType    type;
     string       str;
     LiteralValue value
     SourcePos    pos;
 }

 or do you mean something else?

Right, this.

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 19:50, Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 "ウェブサイト"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

 The encoding of the source file is irrelevant.

 do you mean I can do:

 string field = "ウェブサイト";

 ?

 Geez, just tested it, it works. even writeln(field) correctly output
 the japanese chars and dmd doesn't choke on it.
 Bang, back to state 0: I don't get how D strings work.

Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these 
encodings can store every character in the Unicode standard. What's 
different is how the characters are stored and how many bytes a single 
character takes to store in the string. For example:

string str = "ö";

The above character will take up two bytes in the string. On the other 
hand, this won't work:

char c = 'ö';

The reason for that is the the above character needs two bytes to be 
stored but "char" can only store one byte. Therefore you need to store 
the character in a type where it fits, i.e. "wchar" or "dchar". Or you 
can use a string where you can store how many bytes you want.

Don't know if that makes it clearer.

-- 
/Jacob Carlborg

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 8:24 PM, Jacob Carlborg <doob me.com> wrote:

 Don't know if that makes it clearer.

It does! Particularly this:

 All these encodings can store *every* character in the Unicode standard. What's
 different is how the characters are stored and how many bytes a single
 character takes to store in the string.

(emphasis mine)

I somehow thought that with UTF-8 you were limited to a part of
Unicode, and to another, bigger part with UTF-16.
I equated Unicode with UTF-32.
This is what completely warped my vision. It's good to learn something
new everyday, I guess.

Thanks Jacob!

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 22:47, Philippe Sigaud wrote:

 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

 Thanks Jacob!

You're welcome :)

-- 
/Jacob Carlborg

Aug 01 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I think many people viewed Unicode this way at first. But there is a
metric ton of cool info out there if you want to get to know more
about unicode (this may or may not be interesting reading material),
e.g.:

http://www.catch22.net/tuts/introduction-unicode
http://icu-project.org/docs/papers/forms_of_unicode/
http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode

I used to have more of these links but lost them. There's even a
gigantic book about unicode (Unicode Demystified).

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 22:54, Andrej Mitrovic wrote:

 I think many people viewed Unicode this way at first. But there is a
 metric ton of cool info out there if you want to get to know more
 about unicode (this may or may not be interesting reading material),
 e.g.:

 http://www.catch22.net/tuts/introduction-unicode
 http://icu-project.org/docs/papers/forms_of_unicode/
 http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode

 I used to have more of these links but lost them. There's even a
 gigantic book about unicode (Unicode Demystified).

This is a good read as well:

http://www.joelonsoftware.com/articles/Unicode.html

-- 
/Jacob Carlborg

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:
 On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

 I think many people viewed Unicode this way at first. But there is a
 metric ton of cool info out there if you want to get to know more
 about unicode

I will, but not yet. I've a few books on parsing and compilers to read
before that.
I just read http://www.joelonsoftware.com/articles/Unicode.html,
though, and I'm a bit disappointed that char 7 (\u007) does not make
my computer beep. I remember now having my computer beep on char 7
during the 80s when ASCII was the only thing that existed.

Aug 01 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02-Aug-12 01:23, Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
 <andrej.mitrovich gmail.com> wrote:
 On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

 I think many people viewed Unicode this way at first. But there is a
 metric ton of cool info out there if you want to get to know more
 about unicode

 I will, but not yet. I've a few books on parsing and compilers to read
 before that.
 I just read http://www.joelonsoftware.com/articles/Unicode.html,
 though, and I'm a bit disappointed that char 7 (\u007) does not make
 my computer beep. I remember now having my computer beep on char 7
 during the 80s when ASCII was the only thing that existed.

Once you have time to learn some unicode, check out this page:
http://unicode.org/cldr/utility/index.jsp

I've found these tools to be incredibly useful.

-- 
Dmitry Olshansky

Aug 01 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 8/1/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 Once you have time to learn some unicode, check out this page:
 http://unicode.org/cldr/utility/index.jsp

 I've found these tools to be incredibly useful.

Didn't know about that one, cool! Also might come in handy:
http://people.w3.org/rishida/scripts/uniview/

Aug 01 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

I guess that that would explain why you didn't understand what I was saying. I 
was highly confused as to what was confusing about what I was saying, but it 
didn't even occur to me that you had that sort of misunderstanding. You really 
should get a better grip on unicode if you want to be writing code that lexes 
or parses it efficiently (though it sounds like you're reading up on a lot 
already right now).

- Jonathan M Davis

Aug 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Aug 2, 2012 at 1:29 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
 I somehow thought that with UTF-8 you were limited to a part of
 Unicode, and to another, bigger part with UTF-16.
 I equated Unicode with UTF-32.
 This is what completely warped my vision. It's good to learn something
 new everyday, I guess.

 I guess that that would explain why you didn't understand what I was saying. I
 was highly confused as to what was confusing about what I was saying, but it
 didn't even occur to me that you had that sort of misunderstanding. You really
 should get a better grip on unicode if you want to be writing code that lexes
 or parses it efficiently (though it sounds like you're reading up on a lot
 already right now).

I knew about 1-2-4 bytes schemes and such. But, somehow, for me,
string == only-almost-ASCII characters.
Anyway, it all *clicked* into place right afterwards and your answers
are perfectly clear to me now.

Aug 01 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
 On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> 

wrote:
 "ウェブサイト"
 "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
 
 The encoding of the source file is irrelevant.

 
 do you mean I can do:
 
 string field = "ウェブサイト";
 
 ?
 
 Geez, just tested it, it works. even writeln(field) correctly output
 the japanese chars and dmd doesn't choke on it.
 Bang, back to state 0: I don't get how D strings work.

From http://dlang.org/lex.html

D source text can be in one of the following formats: 
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

So, yes, you can stick unicode characters directly in D code. Though I wonder 
about the correctness of the spec here. It claims that if there's no BOM, then 
it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of 
them have BOM markers, but I can put unicode in a .d file just fine with vim. U 
should probably study up on BOMs.

In any case, the source is read in whatever encoding it's in. String literals 
then all become UTF-8 in the final object code unless they're marked as 
specifically being another type via the postfix letter or they're inferred as 
being another type (e.g. when you assign a string literal to a dstring). 
Regardless, what's in the final object code is based on the types that the type 
system marks strings as, not what the encoding of the source code was.

So, a lexer shouldn't care about what the encoding of the source is beyond 
what it takes to covert it to a format that it can deal with and potentially 
being written in a way which makes handling a particular encoding more 
efficient. The values of literals and the like are completely unaffected 
regardless.

- Jonathan M Davis

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 20:24, Jonathan M Davis wrote:

 D source text can be in one of the following formats:
 * ASCII
 * UTF-8
 * UTF-16BE
 * UTF-16LE
 * UTF-32BE
 * UTF-32LE

 So, yes, you can stick unicode characters directly in D code. Though I wonder
 about the correctness of the spec here. It claims that if there's no BOM, then
 it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of
 them have BOM markers, but I can put unicode in a .d file just fine with vim. U
 should probably study up on BOMs.

 In any case, the source is read in whatever encoding it's in. String literals
 then all become UTF-8 in the final object code unless they're marked as
 specifically being another type via the postfix letter or they're inferred as
 being another type (e.g. when you assign a string literal to a dstring).
 Regardless, what's in the final object code is based on the types that the type
 system marks strings as, not what the encoding of the source code was.

 So, a lexer shouldn't care about what the encoding of the source is beyond
 what it takes to covert it to a format that it can deal with and potentially
 being written in a way which makes handling a particular encoding more
 efficient. The values of literals and the like are completely unaffected
 regardless.

But if you read a source file which is encoded using UTF-16 you would 
need to re-encode that to store it in the "str" filed in your Token struct?

If that's the case, wouldn't it be better to make Token a template to be 
able to store all Unicode encodings without re-encoding? Although I 
don't know how if that will complicate the rest of the lexer.

-- 
/Jacob Carlborg

Aug 01 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:
 But if you read a source file which is encoded using UTF-16 you would
 need to re-encode that to store it in the "str" filed in your Token struct?

Currently, yes.

 If that's the case, wouldn't it be better to make Token a template to be
 able to store all Unicode encodings without re-encoding? Although I
 don't know how if that will complicate the rest of the lexer.

It may very well be a good idea to templatize Token on range type. It would be 
nice not to have to templatize it, but that may be the best route to go. The 
main question is whether str is _always_ a slice (or the result of 
takeExactly) of the orignal range. I _think_ that it is, but I'd have to make 
sure of that. If it's not and can't be for whatever reason, then that poses a 
problem. If Token _does_ get templatized, then I believe that R will end up 
being the original type in the case of the various string types or a range 
which has slicing, but it'll be the result of takeExactly(range, len) for 
everything else.

I just made str a string to begin with, since it was simple, and I was still 
working on a lot of the initial design and how I was going to go about things. 
If it makes more sense for it to be templated, then it'll be changed so that
it's templated.

- Jonathan M Davis

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-01 22:10, Jonathan M Davis wrote:

 It may very well be a good idea to templatize Token on range type. It would be
 nice not to have to templatize it, but that may be the best route to go. The
 main question is whether str is _always_ a slice (or the result of
 takeExactly) of the orignal range. I _think_ that it is, but I'd have to make
 sure of that. If it's not and can't be for whatever reason, then that poses a
 problem. If Token _does_ get templatized, then I believe that R will end up
 being the original type in the case of the various string types or a range
 which has slicing, but it'll be the result of takeExactly(range, len) for
 everything else.

To me a string type would be enough. I don't need support for ranges. 
How about adding a union instead?

enum StringType
{
     utf8,
     utf16,
     utf32
}

struct Token
{
     StringType stringType;

     union
     {
         string strc;
         wstring strw;
         dstring strd;
     }

      property T str (T = string) ()
     {
         static if (is(T == string))
         {
             assert(stringType == StringType.utf8);
             return strc;
         }
         ...
     }
}

Most use cases would look like this:

string s = token.str;

 I just made str a string to begin with, since it was simple, and I was still
 working on a lot of the initial design and how I was going to go about things.
 If it makes more sense for it to be templated, then it'll be changed so that
 it's templated.

I completely understand that.

-- 
/Jacob Carlborg

Aug 01 2012

travert phare.normalesup.org (Christophe Travert) writes:

"Jonathan M Davis" , dans le message (digitalmars.D:173942), a �crit�:
 It may very well be a good idea to templatize Token on range type. It would be 
 nice not to have to templatize it, but that may be the best route to go. The 
 main question is whether str is _always_ a slice (or the result of 
 takeExactly) of the orignal range. I _think_ that it is, but I'd have to make 
 sure of that. If it's not and can't be for whatever reason, then that poses a 
 problem.

It can't if it is a simple input range! Like a file read with most 
'lazy' methods. Then you need either to transform the input range in a 
forward range using a range adapter that performs buffering, or perform 
your own buffering internally. You also have to decide how long the 
token will be valid (I believe if you want lexing to be blazing fast, 
you don't want to allocate for each token).

Maybe you want you lexer to work with range of strings too, like 
File.byLine or File.byChunk (the latter require buffering if you split 
in the middle of a token...). But that may wait until a nice API for 
files, streams, etc. is found.

 If Token _does_ get templatized, then I believe that R will end up 
 being the original type in the case of the various string types or a range 
 which has slicing, but it'll be the result of takeExactly(range, len) for 
 everything else.

A range which has slicing doesn't necessarily return it's own type when 
opSlice is used, according to hasSlicing. I'm pretty sure parts of 
Phobos doesn't take that into account. However, the result of 
takeExactly will always be the good type, since it uses opSlice when it 
can, so you can just use that.

Making a generic lexer that works with any forward range of dchar and 
returns a range of tokens without performing decoding of litteral seems 
to be a good first step.

 I just made str a string to begin with, since it was simple, and I was still 
 working on a lot of the initial design and how I was going to go about things. 
 If it makes more sense for it to be templated, then it'll be changed so that
 it's templated.

string may not be where you want to start, because it is a 
specialization for which you need to optimize utf-8 decoding.

Also, you said in this thread that you only need to consider ansy 
characters in the lexer because non-ansy characters are only used in 
non-keyword identifier. That is not entirely true: EndOfLine defines 2 
non-ansy characters, namely LINE SEPARATOR�and PARAGRAPH SEPARATOR. 
  http://dlang.org/lex.html#EndOfLine
  Maybe they should be dropped, since other non-ansy whitespace are not 
supported. You may want the line count to be consistent with other 
programs. I don't know what text-processing programs usualy considers an 
end of line.

-- 
Christophe

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:
 "Jonathan M Davis" , dans le message (digitalmars.D:173942), a =C3=A9=

crit :
 It may very well be a good idea to templatize Token on range type. =


It
 would be nice not to have to templatize it, but that may be the bes=


t
 route to go. The main question is whether str is _always_ a slice (=


or the
 result of takeExactly) of the orignal range. I _think_ that it is, =


but
 I'd have to make sure of that. If it's not and can't be for whateve=


r
 reason, then that poses a problem.

=20
 It can't if it is a simple input range! Like a file read with most
 'lazy' methods. Then you need either to transform the input range in =

a
 forward range using a range adapter that performs buffering, or perfo=

rm
 your own buffering internally. You also have to decide how long the
 token will be valid (I believe if you want lexing to be blazing fast,=

 you don't want to allocate for each token).

My lexer specifically requires a forward range. The more that I deal wi=
th input=20
ranges, the more that I'm convinced that they're nearly useless. If you=
 need=20
even _one_ character of lookahead, then an input range doesn't fly at a=
ll, and=20
considered the performance gains in using slicing (or takeExactly), I j=
ust=20
don't think that it makes sense to operate on an input range. Besides, =
if=20
anyone wants full performance, they'll probably need to use one of the =
built-
in string types. Any range of dchar which needs to decode on the call t=
o front=20
or popFront will take a serious performance hit. It'll work, but it's n=
ot=20
really advisable if you need performance.

 Also, you said in this thread that you only need to consider ansy
 characters in the lexer because non-ansy characters are only used in
 non-keyword identifier. That is not entirely true: EndOfLine defines =

2
 non-ansy characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
   http://dlang.org/lex.html#EndOfLine
   Maybe they should be dropped, since other non-ansy whitespace are n=

ot
 supported. You may want the line count to be consistent with other
 programs. I don't know what text-processing programs usualy considers=

 an
 end of line.

I'm well aware of all fo that. Almost all of the lexer can operate enti=
rely on=20
ASCII, with a few exceptions, and even in some of those cases, decoding=
 isn't=20
required (e.g. lineSep and paraSep can be dealt with as code units rath=
er than=20
having to decode to compare agains them). The lexer that I'm writing wi=
ll=20
follow the spec. And any lexer that wants to get into Phobos will need =
to do=20
the same. So, stuff like lineSep and the end of file characters that th=
e spec=20
has will be supported.

- Jonathan M Davis

Aug 02 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Let's stop parser Hell