digitalmars.D - Lexer and parser generators using CTFE

Andrei Alexandrescu (31/31) Feb 27 2012 I'm starting a new thread on this because I think the matter is of

CTFE-4-the-win (6/41) Feb 28 2012 Definitely, I applaud this initiative! I've long been of the

bls (27/34) Feb 28 2012 Are you aware of Philippe Sigaud's PEG work (may become part of his
Christopher Bergqvist (12/66) Feb 28 2012 I agree that the current direction of D in this area is

Jonathan M Davis (5/17) Feb 28 2012 Well, one cool thing would be that a compiler could effectively compile ...

Piotr Szturmaj (3/19) Feb 28 2012 CTFE code can be much slower than native one, and I would like to see

Marco Leise (4/6) Feb 29 2012 I second this. As a fan of clean optimizations this is one of the things...

Andrei Alexandrescu (6/14) Mar 01 2012 I think this is the kind of development that will be naturally motivated...

James Miller (32/94) Feb 28 2012 d
H. S. Teoh (9/26) Feb 28 2012 [...]
d coder (9/14) Feb 28 2012 CTFE parsing is especially useful for DSEL (Domain Specific Embedded
Philippe Sigaud (40/48) Feb 29 2012 variables/identifiers in the parsed subset of DSL with the mainstream D

Andrei Alexandrescu (4/9) Feb 29 2012 That looks about right, but still has a fair amount of noise. I think

deadalnix (3/12) Mar 01 2012 It make sense so an external file can be imported as string and
Philippe Sigaud (8/14) Mar 01 2012 approach of putting the entire grammar in one string is best.

John Belmonte (5/17) May 27 2012 I'm wondering if people have seen LPeg. It's a Lua library, but

Philippe Sigaud (22/26) May 28 2012 This is exactly the approach followed by C++ Boost::Spirit, with halas

John Belmonte (22/34) May 28 2012 Fair enough. I did notice the following in the markdown PEG

Philippe Sigaud (9/24) May 29 2012 te*

H. S. Teoh (6/22) Mar 01 2012 You could maybe just put D code in the grammar string, which gets
Philippe Sigaud (5/9) Mar 01 2012 I could, but one of my driving goals while beginning this project a mont...

Jacob Carlborg (9/19) Mar 01 2012 Maybe we can take some ideas from the CoffeeScript compiler (written in

d coder (3/7) Feb 29 2012 I am eagerly waiting for you to upload the stuff on GitHub.
Andrei Alexandrescu (20/29) Feb 29 2012 Barrier of entry and granularity of approach, I think.

Christopher Bergqvist (7/31) Feb 29 2012 Thanks for your response. The lowered barrier of entry in
H. S. Teoh (15/24) Feb 29 2012 I see that std.stdio.writef already parses format strings at
Andrej Mitrovic (2/2) Feb 29 2012 Why do you need grammar rules for json? Just use a custom-defined

Hisayuki Mima (33/64) Feb 28 2012 Hello Andrei,

Hisayuki Mima (4/80) Feb 29 2012 A minimum of documentation is now available here:

Andrei Alexandrescu (4/6) Feb 29 2012 Great start, you may want to aggressively add examples. Then people

Andrei Alexandrescu (61/86) Feb 29 2012 Nice! I have a few comments. I should say that they entail a fair amount...

Hisayuki Mima (26/116) Mar 04 2012 Certainly, "built AST" option is very interesting.

Andrei Alexandrescu (12/28) Mar 04 2012 It's your project so you define what's fun to work on. Let me just say

Ben Hanson (10/15) Mar 05 2012 This chimes nicely with the comments made at the Going Native

d coder (32/38) May 27 2012 Hello Youkei

Hisayuki Mima (9/48) May 31 2012 Hello Puneet,

d coder (10/14) May 27 2012 Ok. I see my folly. At compile time, there would not be any "this" since

H. S. Teoh (23/42) Feb 28 2012 CTFE is one of the big features that attracted me to D.

Andrei Alexandrescu (3/10) Feb 29 2012 For example integrated lexer and parser generators!

Nick Sabalausky (8/18) Feb 29 2012 ...With statically-checked symbol names that can include arbitrary

mist (2/2) Feb 28 2012 Something similar to Boost::Spirit2 but with sane syntax and

bcs (3/5) Feb 28 2012 How about not having to encode the syntax as arithmetic expressions? D

Andrei Alexandrescu (3/8) Feb 29 2012 I think that's what he meant by "sane syntax".

Andrei Alexandrescu (4/6) Feb 29 2012 Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear -

Martin Nowak (22/53) Feb 28 2012 I wrote a generic lexer generator some time ago.

Dmitry Olshansky (24/89) Feb 28 2012 To be truly successful it should have some distinct properties like:

H. S. Teoh (38/70) Feb 28 2012 [...]

Dmitry Olshansky (17/84) Feb 29 2012 Well I see no problem in defining a separate OperatorGrammar

Andrei Alexandrescu (6/12) Feb 29 2012 What does stacking grammars mean?

Dmitry Olshansky (32/44) Feb 29 2012 In general it should be read as using one parser as lexer for another

Andrei Alexandrescu (9/26) Feb 29 2012 Yes. But AST creation should be supported without any custom code.

Nick Sabalausky (10/20) Feb 29 2012 Probably LR, too, unless you build the state tables in a separate prior

Nick Sabalausky (3/13) Feb 29 2012 Erm, I mean LALR specifically.
Andrei Alexandrescu (5/15) Feb 29 2012 It's been a while since I read about PEGs (the packrat parser paper),

Nick Sabalausky (13/28) Feb 29 2012 I don't know much about packrat parsers, but what I meant is that genera...

Ben Hanson (29/32) Mar 05 2012 Hi Nick,

Dmitry Olshansky (11/24) Feb 29 2012 It might involve parsing and grammars, unless academia folk seduce me

H. S. Teoh (17/35) Feb 28 2012 Cool! I'll have to take a look at this sometime.

deadalnix (6/10) Feb 28 2012 std.conainer is already a good candidate for allocators.
Martin Nowak (25/43) Feb 28 2012 Yacc does work but at the price of an additional build step and total

H. S. Teoh (8/28) Feb 28 2012 Excellent idea, I like this.

Nick Sabalausky (27/51) Feb 28 2012 In Goldie, I've taken an inverted approach, which IMHO is easier to use:...

Nick Sabalausky (4/61) Feb 28 2012 Hmm, maybe I need to think about what it would take to make Goldie able ...

Nick Sabalausky (8/10) Feb 28 2012 Just gave it a quick shot. It was looking like it might not be too bad, ...

Philippe Sigaud (14/26) Feb 29 2012 but

Nick Sabalausky (4/25) Feb 29 2012 "Philippe Sigaud" wrote in message

Nick Sabalausky (3/15) Mar 07 2012 Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667

Andrei Alexandrescu (3/28) Feb 29 2012 I think this is the right approach.

Nick Sabalausky (3/40) Feb 29 2012 What? We agree? ;)

Andrei Alexandrescu (3/4) Feb 29 2012 It's 2012 after all.

Andrei Alexandrescu (9/29) Feb 29 2012 I'm not sure what a parser combinator is. I think grammar inheritance
Timon Gehr (2/4) Feb 29 2012 Well, it is slower at lexing than DMD at parsing. What is the bottleneck...

Martin Nowak (6/12) Feb 29 2012 No, it's as fast as dmd's lexer.

H. S. Teoh (12/28) Feb 29 2012 This sounds like something I ran into in my D lexer project. Lexing
d coder (1/7) Feb 29 2012 And how different was just a few seconds from 10-15 seconds? :-)

Nick Sabalausky (4/15) Feb 29 2012 ~10 seconds

H. S. Teoh (6/17) Feb 29 2012 OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So
Timon Gehr (2/13) Feb 29 2012 I did that.

Martin Nowak (14/33) Feb 29 2012 Interesting, I've commented it out https://gist.github.com/1262321#L1559...

Timon Gehr (4/38) Feb 29 2012 I get 0.160s for lexing using your lexer.

Martin Nowak (10/52) Feb 29 2012 Mmh, I've retested and you're right dmd's lexer is about 2x faster.

H. S. Teoh (26/34) Feb 29 2012 [...]

bcs (6/37) Feb 28 2012 A bit outdated but....
F i L (71/71) May 27 2012 I'm not sure I follow all the details of what Andrei's suggesting

Jacob Carlborg (8/75) May 28 2012 This is a very cool feature of Nimrod. It allows to move several

Ken (11/46) Jun 01 2012 Great! So what's the next step? Do we wait for the maintainers

Andrei Alexandrescu (7/15) Jun 01 2012 I think this core strength of the language should be properly supplanted...

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I'm starting a new thread on this because I think the matter is of 
strategic importance.

We all felt for a long time that there's a lot of potential in CTFE, and 
potential applications have been discussed more than a few times, 
ranging from formatting strings parsed to DSLs and parser generators.

Such feats are now approaching fruition because a number of factors 
converge:

* Dmitry Olshansky's regex library (now in Phobos) generates efficient D 
code straight from regexen.

* The scope and quality of CTFE has improved enormously, making more 
advanced uses possible and even relatively easy (thanks Don!)

* Hisayuki Mima implemented a parser generator in only 3000 lines of 
code (sadly, no comments or documentation yet :o))

* With the occasion of that announcement we also find out Philippe 
Sigaud has already a competing design and implementation of a parser 
generator.

This is the kind of stuff I've had an eye on for the longest time. I'm 
saying it's of strategic importance because CTFE technology, though not 
new and already available with some languages, has unique powers when 
combined with other features of D. With CTFE we get to do things that 
are quite literally impossible to do in other languages.

We need to have a easy-to-use, complete, seamless, and efficient 
lexer-parser generator combo in Phobos, pronto. The lexer itself could 
use a character-level PEG or a classic automaton, and emit tokens for 
consumption by a parser generator. The two should work in perfect tandem 
(no need for glue code). At the end of the day, defining a complete 
lexer+parser combo for a language should be just a few lines longer than 
the textual representation of the grammar itself.

What do you all think? Let's get this project off the ground!


Thanks,

Andrei

Feb 27 2012

"CTFE-4-the-win" <CTFE 4the.win> writes:

On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
wrote:
 I'm starting a new thread on this because I think the matter is 
 of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs and 
 parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making 
 more advanced uses possible and even relatively easy (thanks 
 Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

Definitely, I applaud this initiative! I've long been of the
opinion that CTFE parsing is D's killer-feature, which would
allow me to "sneak" D into a "nameless above average size
company". ;)

Feb 28 2012

bls <bizprac orange.fr> writes:

On 02/28/2012 12:36 AM, CTFE-4-the-win wrote:
 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

Are you aware of Philippe Sigaud's PEG work (may become part of his 
template book)

Quote  Philippe ..
I recently wrote a parsing expression grammar module in D, also to 
create grammars and parsers at compile-time and parse inputs at CT.

(PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar)

Usage is like this:

mixin Grammar(
       "JSON <- '{' ( Pair ( ',' Pair )* )? '}'"
     , "Pair <- String ':' Value"
     , "Value <- String / Number / JSON / Array / True / False / Null"
     , `True <- "true"`
(..., rest of JSON grammar)
);

enum result = JSON.parse(
`{ "Hello":42,
     "World!":
     {
         "a":[0,1,2,3]
     }
}`);


End Quote
No that bad :) , I think.

Well, I am still a fan of EBNF based parser generators (recursive 
decent) but that's an other story. If I recall correctly BCS has created 
something working. ( BNF , so limited)

Feb 28 2012

"Christopher Bergqvist" <spambox0 digitalpoetry.se> writes:

On Tuesday, 28 February 2012 at 08:36:13 UTC, CTFE-4-the-win 
wrote:
 On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei 
 Alexandrescu
 wrote:
 I'm starting a new thread on this because I think the matter 
 is of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs 
 and parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, 
 making more advanced uses possible and even relatively easy 
 (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

 Definitely, I applaud this initiative! I've long been of the
 opinion that CTFE parsing is D's killer-feature, which would
 allow me to "sneak" D into a "nameless above average size
 company". ;)

I agree that the current direction of D in this area is 
impressive.  However, I fail to see a killer-feature in 
generating a lexer-parser generator at compile-time instead of 
run-time.

A run-time generator would benefit from not having to execute 
within the limited CTFE environment and would always be on-par in 
that respect.  A compile time generator would internalize the 
generation and compilation of the result (with possible 
glue-code), simplifying the build process somewhat.

What am I failing to pick up on?

Feb 28 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is
 impressive. However, I fail to see a killer-feature in
 generating a lexer-parser generator at compile-time instead of
 run-time.
 
 A run-time generator would benefit from not having to execute
 within the limited CTFE environment and would always be on-par in
 that respect. A compile time generator would internalize the
 generation and compilation of the result (with possible
 glue-code), simplifying the build process somewhat.
 
 What am I failing to pick up on?

Well, one cool thing would be that a compiler could effectively compile itself 
- or at least the lexer and parser would compile themselves. You wouldn't have 
to run a program to generate them. It could be self-contained.

- Jonathan M Davis

Feb 28 2012

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Jonathan M Davis wrote:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is
 impressive. However, I fail to see a killer-feature in
 generating a lexer-parser generator at compile-time instead of
 run-time.

 A run-time generator would benefit from not having to execute
 within the limited CTFE environment and would always be on-par in
 that respect. A compile time generator would internalize the
 generation and compilation of the result (with possible
 glue-code), simplifying the build process somewhat.

 What am I failing to pick up on?

 Well, one cool thing would be that a compiler could effectively compile itself
 - or at least the lexer and parser would compile themselves. You wouldn't have
 to run a program to generate them. It could be self-contained.

CTFE code can be much slower than native one, and I would like to see 
some kind of compiler cache for such code.

Feb 28 2012

"Marco Leise" <Marco.Leise gmx.de> writes:

Am 29.02.2012, 02:30 Uhr, schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 CTFE code can be much slower than native one, and I would like to see
 some kind of compiler cache for such code.

I second this. As a fan of clean optimizations this is one of the things I
tossed around my head for a while. It could use a cache file or the compiler
could be started as a daemon process keeping a memory cache. All code that is
called through CTFE would go into the cache, indexed by the internal
representation of the function body and parameters.
But there are a few open questions, like how seamless this could be integrated.
Is it easy to get a hash for function bodies and parameters? How should the
cache be limited? N functions, n bytes or maybe one cache per module (together
with the .o files)? The latter case would mean that if a.d uses CTFE, that
executes code from b.d the results of CTFE would all be cached together with
a.o, because that was the entry point. And if there was a module c.d that does
the same as a.d it would duplicate the b.d part in its own cache. The benefit
is that the cache works with both single module and whole program compilation
and it doesn't need any limits. Instead the caches for each module are always
refreshed with what was last used in the compilation.
In addition to the last compilation, the caches could be aware of versioned
compilation. I usually want to compile debug versions and Linux/Windows release
versions at least, so I wouldn't want to invalidate the caches. For 32-bit vs.
64-bit I assume it is the best to just cache them separately as it could prove
difficult to distinguish two versions of code that uses (void*).sizeof or
something else that isn't declared wrapped in a version statement like size_t
is.

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/1/12 1:54 AM, Marco Leise wrote:
 Am 29.02.2012, 02:30 Uhr, schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 CTFE code can be much slower than native one, and I would like to see
 some kind of compiler cache for such code.

 I second this. As a fan of clean optimizations this is one of the things
 I tossed around my head for a while. It could use a cache file or the
 compiler could be started as a daemon process keeping a memory cache.
 All code that is called through CTFE would go into the cache, indexed by
 the internal representation of the function body and parameters.

I think this is the kind of development that will be naturally motivated 
and pushed by serious CTFE applications pushing the boundary. I'm not 
very worried about it right now as about investigating CTFE potential 
and blockers.

Andrei

Mar 01 2012

James Miller <james aatch.net> writes:

On 29 February 2012 14:16, Christopher Bergqvist
<spambox0 digitalpoetry.se> wrote:
 On Tuesday, 28 February 2012 at 08:36:13 UTC, CTFE-4-the-win wrote:
 On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
 wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, an=



d
 potential applications have been discussed more than a few times, rangi=



ng
 from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient =



D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of co=



de
 (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe Siga=



ud
 has already a competing design and implementation of a parser generator=



.
 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not=



 new
 and already available with some languages, has unique powers when combi=



ned
 with other features of D. With CTFE we get to do things that are quite
 literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could =



use a
 character-level PEG or a classic automaton, and emit tokens for consump=



tion
 by a parser generator. The two should work in perfect tandem (no need f=



or
 glue code). At the end of the day, defining a complete lexer+parser com=



bo
 for a language should be just a few lines longer than the textual
 representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei


 Definitely, I applaud this initiative! I've long been of the
 opinion that CTFE parsing is D's killer-feature, which would
 allow me to "sneak" D into a "nameless above average size
 company". ;)


 I agree that the current direction of D in this area is impressive.
 =C2=A0However, I fail to see a killer-feature in generating a lexer-parse=

r
 generator at compile-time instead of run-time.

 A run-time generator would benefit from not having to execute within the
 limited CTFE environment and would always be on-par in that respect. =C2=

=A0A
 compile time generator would internalize the generation and compilation o=

f
 the result (with possible glue-code), simplifying the build process
 somewhat.

 What am I failing to pick up on?

As far as I can tell, the advantage of compile-time parser-generation
is that you get significantly better start-up times for the program.
Generating a lexer/parser at runtime produces signficant overhead.
Also, if you want to do a yacc-style action system, then it being
created at compile-time means that you don't have the overhead of
using anonymous functions or delegates at runtime, you can just give
it a raw string of D code.

The CTFE environment is not that limited, and has enough functionality
to produce a lexer/parser.

Complex language definitions would be benefited by CTFE
parser-generation since the load times to create the generator can be
quite long, a problem for programs that may not be long-running (a
lint tool for example).

--
James Miller

Feb 28 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Feb 28, 2012 at 08:25:21PM -0500, Jonathan M Davis wrote:
 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.
 
 A run-time generator would benefit from not having to execute within
 the limited CTFE environment and would always be on-par in that
 respect. A compile time generator would internalize the generation
 and compilation of the result (with possible glue-code), simplifying
 the build process somewhat.
 
 What am I failing to pick up on?

 
 Well, one cool thing would be that a compiler could effectively
 compile itself - or at least the lexer and parser would compile
 themselves. You wouldn't have to run a program to generate them. It
 could be self-contained.

[...]

You also eliminate possibly lengthy program startup initialization.
Which can add up, if the program is run many times. Plus, if something
can already be computed at compile-time, why waste time & resources
doing it at runtime?


T

-- 
Без труда не выловишь и рыбку из пруда.

Feb 28 2012

d coder <dlang.coder gmail.com> writes:

 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.



CTFE parsing is especially useful for DSEL (Domain Specific Embedded
Languages) or internal DSLs. The advantages are:

1. Syntactic errors (in the parsed constructs)  are given out at compile
time.
2. D reflections are available only at compile time. Referencing the
variables/identifiers in the parsed subset of DSL with the mainstream D
code is impossible without reflections in place.

Regards
- Puneet

Feb 28 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.




 CTFE parsing is especially useful for DSEL (Domain Specific Embedded

Languages) or internal DSLs. The advantages are:
 1. Syntactic errors (in the parsed constructs)  are given out at compile

time.
 2. D reflections are available only at compile time. Referencing the

variables/identifiers in the parsed subset of DSL with the mainstream D
code is impossible without reflections in place.

One of my  goals while writing a CT grammar generator was to get a
compile-time parse-tree. Since it contains strings, it's easy to walk the
tree, assembling strings as you go and generating the code you want
(if//when you want to write code, that is)

Strings are a D way to represent code, so any way to get structured strings
at compile-time opens whole vistas for code generation.

As for semantic actions, I added them in my code yesterday. I had hopes for
using D's new anonymous syntax (p => p), but by being anonymous, they
cannot be inserted easily in string mixins (other modules do not now about
__lambda1 and co).

Anyway, I now have semantic actions at compile-time, I used them to write a
small (as in, veeery simple) XML parser: I use semantic actions to push
node names while encountering them and pop the last tag while encountering
a closing tag. It seems to work OK.

That looks a bit like this (sorry, writing on a pad)

mixin(Grammar!("Doc <- Node*"
                "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
                "OpeningTag <- '<' Identifier '>'", OpeningAction,
                "ClosingTag <-  `</` Identifier '>'", ClosingAction,
                "Text <- (!(OpeningTag / ClosingTag) _)+"));

The PEG for Text just means: any char, as long as it's not an OpeningTag
nor a ClosingTag. PEG use '.' to say 'Any char', but I wanted to be able to
deal with qualified names, so I chose '_' instead.
When there is no action, it default to NoOp, as is the case for  Doc, Node
and Text.

I also added named captures (and capture comparison), to be able to say: "I
want a sequence of equal chars":

"Equal <- _ first (_=first)*"

That is: any char, store it as "first", then take any number of char, long
as their match is equal to first's match.

All this work at CT.

I'm  afraid being in holidays right now means I do not have easy access to
GitHub (no git on a pad, and the computer I use to code right now does not
have any network connection). I'll put all this online in a few days,
because that must seems like the ramblings of a madman right now...

Philippe

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 3:45 AM, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));

That looks about right, but still has a fair amount of noise. I think 
the approach of putting the entire grammar in one string is best.

Andrei

Feb 29 2012

deadalnix <deadalnix gmail.com> writes:

Le 29/02/2012 17:45, Andrei Alexandrescu a écrit :
 On 2/29/12 3:45 AM, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <- `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));

 That looks about right, but still has a fair amount of noise. I think
 the approach of putting the entire grammar in one string is best.

 Andrei

It make sense so an external file can be imported as string and 
processed at compile time.

Mar 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));


 That looks about right, but still has a fair amount of noise. I think the

approach of putting the entire grammar in one string is best.

Yes, using one string is indeed better. That won't be too difficult to code.

But how to associate actions with a rule, in that case? I mean, some rules
will have actions, some not.

In my case, a parse tree is automatically generated (indeed, it's almost
the only thing returned by a parse call, that with recognized string and
named captures). I have some additional syntax to 'drop' a node ans such,
similar to what you proposed a few posts ago.

Mar 01 2012

"John Belmonte" <john neggie.net> writes:

On Thursday, 1 March 2012 at 15:10:36 UTC, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));


 That looks about right, but still has a fair amount of noise. 
 I think the

 approach of putting the entire grammar in one string is best.

 Yes, using one string is indeed better. That won't be too 
 difficult to code.

I'm wondering if people have seen LPeg.  It's a Lua library, but 
the design is interesting in that patterns are first class 
objects which can be composed with operator overloading.

http://www.inf.puc-rio.br/~roberto/lpeg/

May 27 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, May 27, 2012 at 11:13 PM, John Belmonte <john neggie.net> wrote:
 I'm wondering if people have seen LPeg. =C2=A0It's a Lua library, but the=

 design
 is interesting in that patterns are first class objects which can be
 composed with operator overloading.

 http://www.inf.puc-rio.br/~roberto/lpeg/

This is exactly the approach followed by C++ Boost::Spirit, with halas
the limitations from Algol-language operators and precedence: no
postfix '*' (Kleene Star) operator. It seems that lpeg has the same
problem.

I played with this idea with my own Pegged
(https://github.com/PhilippeSigaud/Pegged), but I wasn't quite
convinced by the result, exactly for the reason above. Also, when
looking at real-world Spirit examples, I was a bit disappointed by the
resulting syntax: it's not *that* readable for complicated
expressions. In fact, that's exactly why I decided to follow the DSL
road with Pegged, so as to obtain exactly the PEG syntax, with the
real operators and their standard precedence.

Btw, if you're interested in expression templates, I uploaded
yesterday a very preliminary project in Github, to construct, and then
manipulate, expressions:

auto e =3D expr(); // The null expression
auto f =3D 2*e + 1- (e~"abc")*3; // f is an expression template that
encodes the right-hand side.

Then, it's up to the user to define what the expression represents.

https://github.com/PhilippeSigaud/Expression-Tree

May 28 2012

"John Belmonte" <john neggie.net> writes:

On Monday, 28 May 2012 at 12:27:09 UTC, Philippe Sigaud wrote:
 I played with this idea with my own Pegged
 (https://github.com/PhilippeSigaud/Pegged), but I wasn't quite
 convinced by the result, exactly for the reason above. Also, 
 when
 looking at real-world Spirit examples, I was a bit disappointed 
 by the
 resulting syntax: it's not *that* readable for complicated
 expressions. In fact, that's exactly why I decided to follow 
 the DSL
 road with Pegged, so as to obtain exactly the PEG syntax, with 
 the
 real operators and their standard precedence.

Fair enough.  I did notice the following in the markdown PEG 
though which could benefit from first class patterns:





HtmlBlockOpenAddress <- '<' Spnl ("address" / "ADDRESS") Spnl 
HtmlAttribute* '>'
HtmlBlockCloseAddress <- '<' Spnl '/' ("address" / "ADDRESS") 
Spnl '>'
HtmlBlockAddress <- HtmlBlockOpenAddress (HtmlBlockAddress /
    !HtmlBlockCloseAddress .)* HtmlBlockCloseAddress

HtmlBlockOpenBlockquote <- '<' Spnl ("blockquote" / "BLOCKQUOTE") 
Spnl
    HtmlAttribute* '>'
HtmlBlockCloseBlockquote <- '<' Spnl '/' ("blockquote" / 
"BLOCKQUOTE") Spnl '>'
HtmlBlockBlockquote <- HtmlBlockOpenBlockquote 
(HtmlBlockBlockquote /
    !HtmlBlockCloseBlockquote .)* HtmlBlockCloseBlockquote

.
.
.

May 28 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, May 28, 2012 at 11:42 PM, John Belmonte <john neggie.net> wrote:
 Fair enough. =C2=A0I did notice the following in the markdown PEG though =

which
 could benefit from first class patterns:





 HtmlBlockOpenAddress <- '<' Spnl ("address" / "ADDRESS") Spnl HtmlAttribu=

te*
 '>'
 HtmlBlockCloseAddress <- '<' Spnl '/' ("address" / "ADDRESS") Spnl '>'
 HtmlBlockAddress <- HtmlBlockOpenAddress (HtmlBlockAddress /
 =C2=A0 !HtmlBlockCloseAddress .)* HtmlBlockCloseAddress

 HtmlBlockOpenBlockquote <- '<' Spnl ("blockquote" / "BLOCKQUOTE") Spnl
 =C2=A0 HtmlAttribute* '>'
 HtmlBlockCloseBlockquote <- '<' Spnl '/' ("blockquote" / "BLOCKQUOTE") Sp=

nl
 '>'
 HtmlBlockBlockquote <- HtmlBlockOpenBlockquote (HtmlBlockBlockquote /
 =C2=A0 !HtmlBlockCloseBlockquote .)* HtmlBlockCloseBlockquote

You're exactly right! Nice catch.

I took this PEG from another github project (I hope I put the
attribution somewhere?) and that was before Pegged accepted
parameterized rules. I could indeed drastically factor the previous
code.

May 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Mar 01, 2012 at 04:10:26PM +0100, Philippe Sigaud wrote:
 mixin(Grammar!("Doc <- Node*"
 "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
 "OpeningTag <- '<' Identifier '>'", OpeningAction,
 "ClosingTag <-  `</` Identifier '>'", ClosingAction,
 "Text <- (!(OpeningTag / ClosingTag) _)+"));


 That looks about right, but still has a fair amount of noise. I
 think the approach of putting the entire grammar in one string is
 best.

 
 Yes, using one string is indeed better. That won't be too difficult to
 code.
 
 But how to associate actions with a rule, in that case? I mean, some
 rules will have actions, some not.

You could maybe just put D code in the grammar string, which gets
compiled as a string mixin by CTFE?


T

-- 
"A man's wife has more power over him than the state has." -- Ralph Emerson

Mar 01 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 But how to associate actions with a rule, in that case? I mean, some
 rules will have actions, some not.

 You could maybe just put D code in the grammar string, which gets
 compiled as a string mixin by CTFE?

I could, but one of my driving goals while beginning this project a month
ao was to use the shiny new lambda syntax directly :)

"rule1", o => o
"rule2", o => o

etc.

Mar 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-01 18:21, Philippe Sigaud wrote:
  > > But how to associate actions with a rule, in that case? I mean, some
  > > rules will have actions, some not.
  >
  > You could maybe just put D code in the grammar string, which gets
  > compiled as a string mixin by CTFE?

 I could, but one of my driving goals while beginning this project a
 month ao was to use the shiny new lambda syntax directly :)

 "rule1", o => o
 "rule2", o => o

 etc.

Maybe we can take some ideas from the CoffeeScript compiler (written in 
CoffeeScript) :

Grammar: 
http://jashkenas.github.com/coffee-script/documentation/docs/grammar.html

Lexer: 
http://jashkenas.github.com/coffee-script/documentation/docs/lexer.html

-- 
/Jacob Carlborg

Mar 01 2012

d coder <dlang.coder gmail.com> writes:

 I'm  afraid being in holidays right now means I do not have easy access to
 GitHub (no git on a pad, and the computer I use to code right now does not
 have any network connection). I'll put all this online in a few days,
 because that must seems like the ramblings of a madman right now...

I am eagerly waiting for you to upload the stuff on GitHub.

Regards
- Puneet

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
 I agree that the current direction of D in this area is impressive.
 However, I fail to see a killer-feature in generating a lexer-parser
 generator at compile-time instead of run-time.

 A run-time generator would benefit from not having to execute within the
 limited CTFE environment and would always be on-par in that respect. A
 compile time generator would internalize the generation and compilation
 of the result (with possible glue-code), simplifying the build process
 somewhat.

 What am I failing to pick up on?

Barrier of entry and granularity of approach, I think.

Currently if one wants to parse some simple grammar, there are options 
such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc.

Parsing by hand has the obvious disadvantages. Using boost::spirit has a 
steep learning curve and tends to create very contorted grammar 
representations, full of representation noise, and scales very poorly. 
Using lex/yacc is hamfisted - there's an additional build step, 
generated files to deal with, and the related logistics, which make 
lex/yacc a viable choice only for "big" grammars.

An efficient, integrated parser generator would lower the barrier of 
entry dramatically - if we play our cards right, even a sprintf 
specifier string could be parsed simpler and faster using an embedded 
grammar, instead of painfully writing the recognizer by hand. Parsing 
config files, XML, JSON, CSV, various custom file formats and many 
others - all would all be a few lines away. Ideally a user who has a 
basic understanding of grammars should have an easier time using a small 
grammar to parse simple custom formats, than writing the parsing code by 
hand.


Andrei

Feb 29 2012

"Christopher Bergqvist" <spambox0 digitalpoetry.se> writes:

On Wednesday, 29 February 2012 at 16:41:22 UTC, Andrei 
Alexandrescu wrote:
 On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
 What am I failing to pick up on?

 Barrier of entry and granularity of approach, I think.

 Currently if one wants to parse some simple grammar, there are 
 options such as (a) do it by hand, (b) use boost::spirit, or 
 (c) use lex/yacc.

 Parsing by hand has the obvious disadvantages. Using 
 boost::spirit has a steep learning curve and tends to create 
 very contorted grammar representations, full of representation 
 noise, and scales very poorly. Using lex/yacc is hamfisted - 
 there's an additional build step, generated files to deal with, 
 and the related logistics, which make lex/yacc a viable choice 
 only for "big" grammars.

 An efficient, integrated parser generator would lower the 
 barrier of entry dramatically - if we play our cards right, 
 even a sprintf specifier string could be parsed simpler and 
 faster using an embedded grammar, instead of painfully writing 
 the recognizer by hand. Parsing config files, XML, JSON, CSV, 
 various custom file formats and many others - all would all be 
 a few lines away. Ideally a user who has a basic understanding 
 of grammars should have an easier time using a small grammar to 
 parse simple custom formats, than writing the parsing code by 
 hand.


 Andrei

Thanks for your response.  The lowered barrier of entry in 
parsing something like a customized JSON format or config files 
is nice, and something I could see myself use.  I'm still 
skeptical about the level of "killer-featureness" but I would be 
glad to be proven wrong.

Feb 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 29, 2012 at 10:41:22AM -0600, Andrei Alexandrescu wrote:
[...]
 An efficient, integrated parser generator would lower the barrier of
 entry dramatically - if we play our cards right, even a sprintf
 specifier string could be parsed simpler and faster using an embedded
 grammar, instead of painfully writing the recognizer by hand.

I see that std.stdio.writef already parses format strings at
compile-time, though the code is somewhat ugly. :) It would be ideal if
that can be done just by a series of ENBF/regex declarations.


 Parsing config files, XML, JSON, CSV, various custom file formats and
 many others - all would all be a few lines away. Ideally a user who
 has a basic understanding of grammars should have an easier time using
 a small grammar to parse simple custom formats, than writing the
 parsing code by hand.

[...]

This is definitely something worth having. It has always been my dream
that an ideal programming language should let you write grammar rules
and pretty much autogen the code for you. There shouldn't need to be
external utilities with the associated messy build rules, auxiliary
files and other such things, esp. if you just need a one-time small
parser for a very specific task.


T

-- 
Why ask rhetorical questions? -- JC

Feb 29 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Why do you need grammar rules for json? Just use a custom-defined
struct and pass it to a toJson function like ae's json module. :)

Feb 29 2012

Hisayuki Mima <youxkei gmail.com> writes:

(2012/02/28 16:59), Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

Hello Andrei,

Certainly, I don't write yet the documentation of my library, ctpg.
(But a few examples available here: 
https://github.com/youkei/ctpg/tree/master/example)
So, I'd like to describe it here.

To be honest, ctpg is inspired by one module of Scala's standard 
library, Parser Combinators.
One of the differences between Parser Combinators and ctpg is that 
Parser Combinators generates parsers in run-time, but ctpg generates 
parsers in compile-time by the power of CTFE and mixin.

A parser generated by ctpg is a recursive descent parser, so it does 
lexical analysis and parsing at a time. And the type of input which it 
can accept is string, wstring, dstring and ForwardRange whose element 
type is char, wchar or dchar.
So, dividing lexical analysis and parsing into two processes is 
difficult in ctpg.

Importantly, a parser generated by ctpg is statically typed as one of 
the examples, "parsing simple arithmetic expression" shows.
Generally a parser generated by other tool and accepting tokens returns 
the abstract syntax tree, but it return the evaluated value in the example.
In other words, it does lexical analysis, parsing and (type) converting 
at a time.
If you want simply abstract syntax tree, it may be a little pain to use 
ctpg.

That's all I'd like to say about ctpg here.

By the way, I wholeheartedly agree with Andrei's opinion. But currently, 
I think, CTFE is limited because of this issue: 
http://d.puremagic.com/issues/show_bug.cgi?id=6498 .
Without dealing with this issue, We couldn't bring out the full 
potential of CTFE.

Thanks,
Hisayuki Mima

Feb 28 2012

Hisayuki Mima <youxkei gmail.com> writes:

(2012/02/28 23:58), Hisayuki Mima wrote:
 (2012/02/28 16:59), Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

 Hello Andrei,

 Certainly, I don't write yet the documentation of my library, ctpg.
 (But a few examples available here:
 https://github.com/youkei/ctpg/tree/master/example)
 So, I'd like to describe it here.

 To be honest, ctpg is inspired by one module of Scala's standard
 library, Parser Combinators.
 One of the differences between Parser Combinators and ctpg is that
 Parser Combinators generates parsers in run-time, but ctpg generates
 parsers in compile-time by the power of CTFE and mixin.

 A parser generated by ctpg is a recursive descent parser, so it does
 lexical analysis and parsing at a time. And the type of input which it
 can accept is string, wstring, dstring and ForwardRange whose element
 type is char, wchar or dchar.
 So, dividing lexical analysis and parsing into two processes is
 difficult in ctpg.

 Importantly, a parser generated by ctpg is statically typed as one of
 the examples, "parsing simple arithmetic expression" shows.
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the example.
 In other words, it does lexical analysis, parsing and (type) converting
 at a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 That's all I'd like to say about ctpg here.

 By the way, I wholeheartedly agree with Andrei's opinion. But currently,
 I think, CTFE is limited because of this issue:
 http://d.puremagic.com/issues/show_bug.cgi?id=6498 .
 Without dealing with this issue, We couldn't bring out the full
 potential of CTFE.

 Thanks,
 Hisayuki Mima

A minimum of documentation is now available here:
https://github.com/youkei/ctpg/wiki/Home-en

Hisayuki Mima

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 10:03 AM, Hisayuki Mima wrote:
 A minimum of documentation is now available here:
 https://github.com/youkei/ctpg/wiki/Home-en

Great start, you may want to aggressively add examples. Then people 
would love using ctpg and will write more documentation, tutorials etc.

Andrei

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 8:58 AM, Hisayuki Mima wrote:
 Certainly, I don't write yet the documentation of my library, ctpg.
 (But a few examples available here:
 https://github.com/youkei/ctpg/tree/master/example)

Nice! I have a few comments. I should say that they entail a fair amount 
of work.

The current setup makes things very difficult for a common case - 
generating a parse tree. The user would need to insert a lot of custom 
code to create the appropriate nodes, but that's very easy for the 
parser because it already has the components. The parser should have a 
"build AST" option, in which case it should build all nodes without much 
effort from the user. That's what ANTLR does, and it has a special 
operator "^" to indicate where the root of the tree should be 
(http://www.antlr2.org/doc/trees.html).

So here's what I think your examples should look like:

The syntax could be changed a bit - it should be less like D and more 
like PEG. The "$" is implicit and shouldn't be needed. The ";" is a 
useful terminator for large productions spanning more than one line, so 
let's keep it. I don't understand the use of "!", for example the PEG 
for expression parsing at 
http://en.wikipedia.org/wiki/Parsing_expression_grammar is:

Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
Expr    ← Sum

but your grammar has:

int primary = !"(" addExp !")" / intLit_p;

whereas it seems to me it should be

int primary = "(" addExp ")" / intLit_p;

No?

Here's how I think a parser that also builds an AST looks like:

mixin generateParsers!(
   Options.createAST,
   q{
     root    <- addExp;
     addExp  <- mulExp (("+" / "-")^ addExp)?
     mulExp  <- primary (("*" / "/")^ mulExp)?
     primary <- "(" addExp ")"% / intLit_p;
   }
);

I used the following notation: a node suffixed with "^" becomes the root 
of the produced AST, and a node suffixed with "%" will be dropped 
altogether (that's because ")" is not needed in the AST after the 
expression has been parsed). With only three characters I informed the 
parser what the shape of the AST will be.

At this point calling parse!root("1+2*3") would return an object of type 
ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs 
node has type ASTNode!"intLit_p" which has inside a value "1". The rhs 
node has type ASTNode!"mulExp", and that in turn has two children nodes 
lhs and rhs, both of type ASTNode!"intLit_p", each with its own value 
("2" and "3", respectively). All these nodes are dynamically created by 
the parsing process and inherit a common type, e.g. ASTNode!"" or some 
type ASTRootNode.

 So, I'd like to describe it here.

 To be honest, ctpg is inspired by one module of Scala's standard
 library, Parser Combinators.
 One of the differences between Parser Combinators and ctpg is that
 Parser Combinators generates parsers in run-time, but ctpg generates
 parsers in compile-time by the power of CTFE and mixin.

 A parser generated by ctpg is a recursive descent parser, so it does
 lexical analysis and parsing at a time.

Oh, I forgot this property of PEGs - integrated lexing and parsing. So 
no need for a separate lexer!

 And the type of input which it
 can accept is string, wstring, dstring and ForwardRange whose element
 type is char, wchar or dchar.

Great. I think we should actually define the notion of BufferedRange. 
I'll get to that topic later.

 So, dividing lexical analysis and parsing into two processes is
 difficult in ctpg.

Yes, sorry I forgot about that!

 Importantly, a parser generated by ctpg is statically typed as one of
 the examples, "parsing simple arithmetic expression" shows.
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the example.
 In other words, it does lexical analysis, parsing and (type) converting
 at a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 That's all I'd like to say about ctpg here.

Why would it be difficult to integrate AST creation with parsing? I'm 
not sure I understand this. My understanding is that you should be able 
to add a couple of simple operators ("^" and "%" as described above) to 
inform AST creation, and you're done!


Thanks,

Andrei

Feb 29 2012

Hisayuki Mima <youxkei gmail.com> writes:

(2012/03/01 1:19), Andrei Alexandrescu wrote:
 On 2/28/12 8:58 AM, Hisayuki Mima wrote:
 Certainly, I don't write yet the documentation of my library, ctpg.
 (But a few examples available here:
 https://github.com/youkei/ctpg/tree/master/example)

 Nice! I have a few comments. I should say that they entail a fair amount
 of work.

 The current setup makes things very difficult for a common case -
 generating a parse tree. The user would need to insert a lot of custom
 code to create the appropriate nodes, but that's very easy for the
 parser because it already has the components. The parser should have a
 "build AST" option, in which case it should build all nodes without much
 effort from the user. That's what ANTLR does, and it has a special
 operator "^" to indicate where the root of the tree should be
 (http://www.antlr2.org/doc/trees.html).

 So here's what I think your examples should look like:

 The syntax could be changed a bit - it should be less like D and more
 like PEG. The "$" is implicit and shouldn't be needed. The ";" is a
 useful terminator for large productions spanning more than one line, so
 let's keep it. I don't understand the use of "!", for example the PEG
 for expression parsing at
 http://en.wikipedia.org/wiki/Parsing_expression_grammar is:

 Value ← [0-9]+ / '(' Expr ')'
 Product ← Value (('*' / '/') Value)*
 Sum ← Product (('+' / '-') Product)*
 Expr ← Sum

 but your grammar has:

 int primary = !"(" addExp !")" / intLit_p;

 whereas it seems to me it should be

 int primary = "(" addExp ")" / intLit_p;

 No?

 Here's how I think a parser that also builds an AST looks like:

 mixin generateParsers!(
 Options.createAST,
 q{
 root <- addExp;
 addExp <- mulExp (("+" / "-")^ addExp)?
 mulExp <- primary (("*" / "/")^ mulExp)?
 primary <- "(" addExp ")"% / intLit_p;
 }
 );

 I used the following notation: a node suffixed with "^" becomes the root
 of the produced AST, and a node suffixed with "%" will be dropped
 altogether (that's because ")" is not needed in the AST after the
 expression has been parsed). With only three characters I informed the
 parser what the shape of the AST will be.

 At this point calling parse!root("1+2*3") would return an object of type
 ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs
 node has type ASTNode!"intLit_p" which has inside a value "1". The rhs
 node has type ASTNode!"mulExp", and that in turn has two children nodes
 lhs and rhs, both of type ASTNode!"intLit_p", each with its own value
 ("2" and "3", respectively). All these nodes are dynamically created by
 the parsing process and inherit a common type, e.g. ASTNode!"" or some
 type ASTRootNode.

 So, I'd like to describe it here.

 To be honest, ctpg is inspired by one module of Scala's standard
 library, Parser Combinators.
 One of the differences between Parser Combinators and ctpg is that
 Parser Combinators generates parsers in run-time, but ctpg generates
 parsers in compile-time by the power of CTFE and mixin.

 A parser generated by ctpg is a recursive descent parser, so it does
 lexical analysis and parsing at a time.

 Oh, I forgot this property of PEGs - integrated lexing and parsing. So
 no need for a separate lexer!

 And the type of input which it
 can accept is string, wstring, dstring and ForwardRange whose element
 type is char, wchar or dchar.

 Great. I think we should actually define the notion of BufferedRange.
 I'll get to that topic later.

 So, dividing lexical analysis and parsing into two processes is
 difficult in ctpg.

 Yes, sorry I forgot about that!

 Importantly, a parser generated by ctpg is statically typed as one of
 the examples, "parsing simple arithmetic expression" shows.
 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the
 example.
 In other words, it does lexical analysis, parsing and (type) converting
 at a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 That's all I'd like to say about ctpg here.

 Why would it be difficult to integrate AST creation with parsing? I'm
 not sure I understand this. My understanding is that you should be able
 to add a couple of simple operators ("^" and "%" as described above) to
 inform AST creation, and you're done!


 Thanks,

 Andrei

Certainly, "built AST" option is very interesting.
I don't know whether building AST is a common case or not because I'm 
unfamiliar with syntax analysis.
But I'd like to complete ctpg as D version of boost::spirit::qi first.
Currently, ctpg can be used probably like boost::spirit::qi. (Both do 
typed syntax analysis.)
There are some points where ctpg is superior to boost::spirit::qi. For 
example, The source code written by using ctpg is simple because it is 
like PEG.
In addition, I'm planning to improve error messages by making excellent 
use of #line and __LINE__.
I'd like to write the library which is functionally same 
boost::spirit::qi and written in D style.
Hence, I'd like to make implementing "built AST" option i.e. untyped 
syntax analysis a low priority.
What is your idea?

P.S. prefix operator "!" is same as postfix operator "%" which you said.
The types which parsers on both sides of "/" operator have should be 
compatible.
The type of "(" addExp ")" is Tuple!(string, int, string) and the type 
of intLit_p is int. They are not compatible.
The type of !"(" addExp !")" is int. So the types of !"(" addExp !")" 
and intLit_p are compatible.

Thanks,
Hisayuki Mima

Mar 04 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/4/12 8:34 AM, Hisayuki Mima wrote:
 Certainly, "built AST" option is very interesting.
 I don't know whether building AST is a common case or not because I'm
 unfamiliar with syntax analysis.
 But I'd like to complete ctpg as D version of boost::spirit::qi first.
 Currently, ctpg can be used probably like boost::spirit::qi. (Both do
 typed syntax analysis.)
 There are some points where ctpg is superior to boost::spirit::qi. For
 example, The source code written by using ctpg is simple because it is
 like PEG.
 In addition, I'm planning to improve error messages by making excellent
 use of #line and __LINE__.
 I'd like to write the library which is functionally same
 boost::spirit::qi and written in D style.
 Hence, I'd like to make implementing "built AST" option i.e. untyped
 syntax analysis a low priority.
 What is your idea?

It's your project so you define what's fun to work on. Let me just say 
that I worked on a lot of parsing-related stuff, and most often projects 
that start as "this grammar is simple enough, let's just do processing 
during parsing" ultimately required a rewrite to do generate AST, 
process it, and then use it for computation.

You can get away without an AST for the simplest grammars (e.g. printf 
etc.) but in that case you compete with hand-written code. I wouldn't 
think of parsing a serious grammar with more than a few productions 
without generating an AST. My intuition is that your work will best 
excel at those grammars.


Andrei

Mar 04 2012

"Ben Hanson" <Ben.Hanson tikit.com> writes:

On Sunday, 4 March 2012 at 21:00:00 UTC, Andrei Alexandrescu
wrote:
 You can get away without an AST for the simplest grammars (e.g. 
 printf etc.) but in that case you compete with hand-written 
 code. I wouldn't think of parsing a serious grammar with more 
 than a few productions without generating an AST. My intuition 
 is that your work will best excel at those grammars.

This chimes nicely with the comments made at the Going Native
conference with regard to clang generating an AST and MSVC and
GCC not. Herb Sutter referred to it as "AST envy" (as in he
wished MSVC took the AST route).

I only mention it as maybe not everyone here watched that vid.
Andrei himself was at the conference of course!

Regards,

Ben

Mar 05 2012

d coder <dlang.coder gmail.com> writes:

 Generally a parser generated by other tool and accepting tokens returns
 the abstract syntax tree, but it return the evaluated value in the example.
 In other words, it does lexical analysis, parsing and (type) converting at
 a time.
 If you want simply abstract syntax tree, it may be a little pain to use
 ctpg.

 Hello Youkei

I am trying to use CTPG for compile time parsing for a DSL I am working on.
I have tried the examples you created in the examples directory.

I would like the parser to effect some side effects. For this purpose, I
tried including the parser mixin into a class, but I got a strange error
saying:

Error: need 'this' to access member parse

I have copied the code I am trying to compile at the end of the email. Let
me know what I could be doing wrong here.

Regards
- Puneet


import ctpg;
import std.array: join;
import std.conv: to;

class Foo
{
  int result;
  mixin(generateParsers(q{
        int root = mulExp $;

        int mulExp =
          primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
        / primary;

        int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
      }));

  void frop() {
    result = parse!root("5*8");
  }
}


void main(){
  Foo foo = new Foo();
  foo.frop();
}

May 27 2012

Hisayuki Mima <youxkei gmail.com> writes:

(2012年05月28日 02:31), d coder wrote:
     Generally a parser generated by other tool and accepting tokens
     returns the abstract syntax tree, but it return the evaluated value
     in the example.
     In other words, it does lexical analysis, parsing and (type)
     converting at a time.
     If you want simply abstract syntax tree, it may be a little pain to
     use ctpg.


   Hello Youkei

 I am trying to use CTPG for compile time parsing for a DSL I am working
 on. I have tried the examples you created in the examples directory.

 I would like the parser to effect some side effects. For this purpose, I
 tried including the parser mixin into a class, but I got a strange error
 saying:

 Error: need 'this' to access member parse

 I have copied the code I am trying to compile at the end of the email.
 Let me know what I could be doing wrong here.

 Regards
 - Puneet


 import ctpg;
 import std.array: join;
 import std.conv: to;

 class Foo
 {
    int result;
    mixin(generateParsers(q{
          int root = mulExp $;

          int mulExp =
            primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
          / primary;

          int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
        }));

    void frop() {
      result = parse!root("5*8");
    }
 }


 void main(){
    Foo foo = new Foo();
    foo.frop();
 }

Hello Puneet,

Thank you for your report. I fixed it. Now CTPG creates a static 
function as a parser.
But I'm afraid this fix doesn't help you because I don't understand what 
a side effect you said is.
Can you show me some examples which include the side effect?

Thanks,
Hisayuki Mima

May 31 2012

d coder <dlang.coder gmail.com> writes:

 I would like the parser to effect some side effects. For this purpose, I
 tried including the parser mixin into a class, but I got a strange error
 saying:

 Error: need 'this' to access member parse

Ok. I see my folly. At compile time, there would not be any "this" since
the class has not been instantiated yet.

I will have to think of other solutions. Basically, I am trying to use the
parser to create functions that I can use at run time. But I wanted to
create two functions from the same parser. I have succeeded with creating
one function. I do not want to create two separate parsers because each of
these parsers would add to memory footprint of the compiler.

Any ideas? Maybe I could use tuples here?

Regards
- Puneet

May 27 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

+1.


 We all felt for a long time that there's a lot of potential in CTFE,
 and potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser
 generators.

CTFE is one of the big features that attracted me to D.


[...]
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.

For example?


 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect
 tandem (no need for glue code). At the end of the day, defining a
 complete lexer+parser combo for a language should be just a few lines
 longer than the textual representation of the grammar itself.

Definitely agree. I frequently work with code that requires some amount
of lexing/parsing; having something like this in Phobos would be
absolutely awesome.


 What do you all think? Let's get this project off the ground!

[...]

+1.

On that note, I'd like to mention that CTFE currently has some parts
that need more work:

(1) We need compile-time alternatives to the functions in std.math so
that functions that are currently only implemented in asm can be used in
CTFE.

(2) It should be possible to generate AA literals from CTFE and use them
to initialize things like parsing/lexing lookup tables, etc.. While this
in itself is not a major critical issue, it would make D look bad if we
tout D's CTFE capabilities yet at the same time AA's have to be created
by static this() at runtime.


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert

Feb 28 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 9:57 AM, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.

 For example?

For example integrated lexer and parser generators!


Andrei

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jiljf7$18em$1 digitalmars.com...
 On 2/28/12 9:57 AM, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
 This is the kind of stuff I've had an eye on for the longest time.
 I'm saying it's of strategic importance because CTFE technology,
 though not new and already available with some languages, has unique
 powers when combined with other features of D. With CTFE we get to do
 things that are quite literally impossible to do in other languages.

 For example?

 For example integrated lexer and parser generators!

...With statically-checked symbol names that can include arbitrary 
characters:

Sym!"+"
Sym!"{"
Sym!"<Foo>"
etc...

Feb 29 2012

"mist" <none none.none> writes:

Something similar to Boost::Spirit2 but with sane syntax and 
better debugging would have been awesome.

Feb 28 2012

bcs <bcs example.com> writes:

On 02/28/2012 08:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.

How about not having to encode the syntax as arithmetic expressions? D 
can do string parsing of eBNF at compile time if you want that.

Feb 28 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 12:07 AM, bcs wrote:
 On 02/28/2012 08:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.

 How about not having to encode the syntax as arithmetic expressions? D
 can do string parsing of eBNF at compile time if you want that.

I think that's what he meant by "sane syntax".

Andrei

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 10:23 AM, mist wrote:
 Something similar to Boost::Spirit2 but with sane syntax and better
 debugging would have been awesome.

Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear - 
using inadequate technology for a specific purpose.

Andrei

Feb 29 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I'm starting a new thread on this because I think the matter is of  
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and  
 potential applications have been discussed more than a few times,  
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors  
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D  
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more  
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of  
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe  
 Sigaud has already a competing design and implementation of a parser  
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm  
 saying it's of strategic importance because CTFE technology, though not  
 new and already available with some languages, has unique powers when  
 combined with other features of D. With CTFE we get to do things that  
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient  
 lexer-parser generator combo in Phobos, pronto. The lexer itself could  
 use a character-level PEG or a classic automaton, and emit tokens for  
 consumption by a parser generator. The two should work in perfect tandem  
 (no need for glue code). At the end of the day, defining a complete  
 lexer+parser combo for a language should be just a few lines longer than  
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

I wrote a generic lexer generator some time ago.
It already let to some compiler O(N^2) optimizations, because the token
declarations sneak into the mangling :(.


https://gist.github.com/1255439 - lexer generator
https://gist.github.com/1262321 - complete and fast D lexer

I've ditched an attempt to write a parser combinator. It was based on
expression templates and ended up at spirit craziness.

<PERSONAL OPINION
The hassle of providing good error messages and synthesizing parse results
in a generic parser outweigh the benefit of a declarative grammar.
/PERSONAL OPINION>

A lot becomes feasible from the CTFE perspective,
despite some bugfixes I only miss exp and log currently.

I do not agree that it's the right moment to write a parser though.
It hits the first of phobos two biggest shortcomings, the lack of a good  
I/O
system and the missing Allocators.
Any parser written now will either risk to not play nice with ranges
or has to come up with it's own buffering again.

Feb 28 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 28.02.2012 22:46, Martin Nowak wrote:
 On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE,
 and potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient
 D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though
 not new and already available with some languages, has unique powers
 when combined with other features of D. With CTFE we get to do things
 that are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect
 tandem (no need for glue code). At the end of the day, defining a
 complete lexer+parser combo for a language should be just a few lines
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


To be truly successful it should have some distinct properties like:
- being faster or identical to handwritten stuff we already have (like 
e.g. std.json ), reminds us of recent INI parser proposal, should be an 
easy pick for general parser.
- be flexible, the result need not be a predefined AST tree nodes, 
Hisayuki Mima's parser shows some nice movement in this direction
- have reasonable compile times and memory consumption (though it will 
only improve over time)


Recalling EBNF parser idea that I run with before finally being dragged 
down by real life. Roughly I thought to follow hybrid LL(*) aproach, 
while I had a solid plan on doing DFA for lexer and parser lookahead, 
the other things were more or less floating.

 Thanks,

 Andrei

 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.


 https://gist.github.com/1255439 - lexer generator


Cool stuff.

 https://gist.github.com/1262321 - complete and fast D lexer

 I've ditched an attempt to write a parser combinator. It was based on
 expression templates and ended up at spirit craziness.

 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>

Error reporting is the weak spot, still no good ideas on solving that.

 A lot becomes feasible from the CTFE perspective,
 despite some bugfixes I only miss exp and log currently.

Reminds me that I have to dig up a wonderful CTFE bug in std.regex 
that's being somehow worked around by duping arrays since August.

 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a good
 I/O
 system and the missing Allocators.
 Any parser written now will either risk to not play nice with ranges
 or has to come up with it's own buffering again.

All good points. There is prototype of interactive regex matcher that 
works directly on stream (buried in std.regex), it even passed dry-run 
unittests back then. Though I had to postpone till I/O is sorted out. I 
really loved Steven's design with it's easy access to buffer and well 
thought out primitives, hope it will come about sometime soon.

-- 
Dmitry Olshansky

Feb 28 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
 On 28.02.2012 22:46, Martin Nowak wrote:
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:


[...]
We need to have a easy-to-use, complete, seamless, and efficient
lexer-parser generator combo in Phobos, pronto. The lexer itself
could use a character-level PEG or a classic automaton, and emit
tokens for consumption by a parser generator. The two should work in
perfect tandem (no need for glue code). At the end of the day,
defining a complete lexer+parser combo for a language should be just
a few lines longer than the textual representation of the grammar
itself.



[...]
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have
 (like e.g. std.json ), reminds us of recent INI parser proposal,
 should be an easy pick for general parser.

We probably have to use string mixins to achieve this.

OTOH, a generic parser generator is unlikely to be able to handle
special case optimizations such as operator-precedence parsers
(http://en.wikipedia.org/wiki/Operator-precedence_parser), because
operator precedence is only implied by grammar rules, and to optimize
for this case requires explicitly specifying each operator's precedence.
There is some price to be paid for generality, unfortunately.


 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction

In my mind, D's delegates are ideal for this sort of thing. Each grammar
rule will have an associated callback. We can provide default callbacks
that generates AST nodes, but the user can override them with his own
delegates to do whatever they want (evaluate expressions/commands on the
fly, build only AST subtrees of interest, etc.).


 - have reasonable compile times and memory consumption (though it will
 only improve over time)

This will be good motivation to improve dmd's CTFE implementation. ;-)


[...]
 Error reporting is the weak spot, still no good ideas on solving that.

This is not specific to D, though, right?


[...]
I do not agree that it's the right moment to write a parser though.
It hits the first of phobos two biggest shortcomings, the lack of a
good I/O system and the missing Allocators.  Any parser written now
will either risk to not play nice with ranges or has to come up with
it's own buffering again.

 
 All good points. There is prototype of interactive regex matcher
 that works directly on stream (buried in std.regex), it even passed
 dry-run unittests back then. Though I had to postpone till I/O is
 sorted out. I really loved Steven's design with it's easy access to
 buffer and well thought out primitives, hope it will come about
 sometime soon.

[...]

Are you talking about std.io? I glanced over that code briefly recently;
it looks very promising. Hope it will make it into Phobos soon. We
*need* to integrate streams, ranges, and the stdio writeln, etc., stuff.
The current situation to an outsider looks quite bad (streams not
compatible with stdio, no easy way to substitute string buffer for
files, etc.).

We also need automatic UTF encoding detection for all input
streams/files/ranges/whatever. Including mismatching endianness (i.e.
auto byte-swapping).  In a recent personal project to write a D lexer
(as an exercise to help me learn D better), I ended up having to roll my
own input stream abstraction in order to handle automatic encoding
detection. Which is quite poorly written, I've to admit. Something like
this needs to be standardized in Phobos with a well-written
implementation.


T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about
all I can take." -- Calvin & Hobbes

Feb 28 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 29.02.2012 0:19, H. S. Teoh wrote:
 On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
 On 28.02.2012 22:46, Martin Nowak wrote:
 On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:


 [...]
 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself
 could use a character-level PEG or a classic automaton, and emit
 tokens for consumption by a parser generator. The two should work in
 perfect tandem (no need for glue code). At the end of the day,
 defining a complete lexer+parser combo for a language should be just
 a few lines longer than the textual representation of the grammar
 itself.



 [...]
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have
 (like e.g. std.json ), reminds us of recent INI parser proposal,
 should be an easy pick for general parser.

 We probably have to use string mixins to achieve this.

 OTOH, a generic parser generator is unlikely to be able to handle
 special case optimizations such as operator-precedence parsers
 (http://en.wikipedia.org/wiki/Operator-precedence_parser), because
 operator precedence is only implied by grammar rules, and to optimize
 for this case requires explicitly specifying each operator's precedence.
 There is some price to be paid for generality, unfortunately.

Well I see no problem in defining a separate OperatorGrammar 
constructor, that will take base Unit production and a bunch of 
operators with arity, plus brackets with their priority is plenty. 
Semantic then actions are trivially defined for each of operators 
respectively.

Someway of stacking grammars is bonus points for the project. As Andrei 
started with lexer+parser, I see no problem with it being 
lexer+parser(+parser)*.

 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction

 In my mind, D's delegates are ideal for this sort of thing. Each grammar
 rule will have an associated callback. We can provide default callbacks
 that generates AST nodes, but the user can override them with his own
 delegates to do whatever they want (evaluate expressions/commands on the
 fly, build only AST subtrees of interest, etc.).

Yup delegates and mixins are the tools. More over lambdas shine here, 
less clutter + type deduction.

 - have reasonable compile times and memory consumption (though it will
 only improve over time)

 This will be good motivation to improve dmd's CTFE implementation. ;-)


 [...]
 Error reporting is the weak spot, still no good ideas on solving that.

 This is not specific to D, though, right?

Such a framework relying on mixins is bound to produce awful error 
messages at compile-time, unless explicitly stuffed up to watterline 
with some kind of static assert(__traits(compiles,...),"Error x + info");

 [...]
 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a
 good I/O system and the missing Allocators.  Any parser written now
 will either risk to not play nice with ranges or has to come up with
 it's own buffering again.

 All good points. There is prototype of interactive regex matcher
 that works directly on stream (buried in std.regex), it even passed
 dry-run unittests back then. Though I had to postpone till I/O is
 sorted out. I really loved Steven's design with it's easy access to
 buffer and well thought out primitives, hope it will come about
 sometime soon.

 [...]

 Are you talking about std.io? I glanced over that code briefly recently;
 it looks very promising. Hope it will make it into Phobos soon. We
 *need* to integrate streams, ranges, and the stdio writeln, etc., stuff.
 The current situation to an outsider looks quite bad (streams not
 compatible with stdio, no easy way to substitute string buffer for
 files, etc.).

Aye, that's it. Not to mention DIP9, though that's another topic in itself.

 We also need automatic UTF encoding detection for all input
 streams/files/ranges/whatever. Including mismatching endianness (i.e.
 auto byte-swapping).  In a recent personal project to write a D lexer
 (as an exercise to help me learn D better), I ended up having to roll my
 own input stream abstraction in order to handle automatic encoding
 detection. Which is quite poorly written, I've to admit. Something like
 this needs to be standardized in Phobos with a well-written
 implementation.


 T


-- 
Dmitry Olshansky

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
 Someway of stacking grammars is bonus points for the project. As Andrei
 started with lexer+parser, I see no problem with it being
 lexer+parser(+parser)*.

What does stacking grammars mean?

Anyhow, one thing that would become important is tree walkers - e.g. you 
have the tree and you define code for evaluating it etc.

 Such a framework relying on mixins is bound to produce awful error
 messages at compile-time, unless explicitly stuffed up to watterline
 with some kind of static assert(__traits(compiles,...),"Error x + info");

We need to figure out ways to make that simpler.


Andrei

Feb 29 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 29.02.2012 20:47, Andrei Alexandrescu wrote:
 On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
 Someway of stacking grammars is bonus points for the project. As Andrei
 started with lexer+parser, I see no problem with it being
 lexer+parser(+parser)*.

 What does stacking grammars mean?

In general it should be read as using one parser as lexer for another 
parser. But there is some nice ways to integrate another parser within 
recursive descent parser.
Let's assume simple recursive descent parser that uses operator 
precedence grammar. Suppose we have the main grammar:
Declaration = Type Identifier;
Statement = 'if' '(' Expr ')' | Expr
Type = int | float | ...

Where Expr is connected to the second grammar expressed as ...
Ok, let's sketch it, skipping semantic actions:
operatorGrammar!q{
Expr
Unit: Constant | Identifier //assume we have lexer
Operators:
	Not, !, prefix, 4
	UnMinus, -, prefix, 4
	Mul, *, binary, 3
	Mul, /, binary, 3
	Plus, +, binary, 2
	BiMinus, -, binary, 2
	Index, [], brace-like postfix
	Braces, (), brace-like
};

Effectively operator grammar for expressions parses "token" Expr. It's 
easy in recursive descent because it can just try to call this parser, 
then try the other alternative if it fails.
I'm not sure this stacking does add that much, but it's something to 
keep in mind.


 Anyhow, one thing that would become important is tree walkers - e.g. you
 have the tree and you define code for evaluating it etc.

Yes, it's a must have if AST is generated generically via annotations.

 Such a framework relying on mixins is bound to produce awful error
 messages at compile-time, unless explicitly stuffed up to watterline
 with some kind of static assert(__traits(compiles,...),"Error x + info");

 We need to figure out ways to make that simpler.


 Andrei


-- 
Dmitry Olshansky

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 To be truly successful it should have some distinct properties like:
 - being faster or identical to handwritten stuff we already have (like
 e.g. std.json ), reminds us of recent INI parser proposal, should be an
 easy pick for general parser.

Yes.

 - be flexible, the result need not be a predefined AST tree nodes,
 Hisayuki Mima's parser shows some nice movement in this direction

Yes. But AST creation should be supported without any custom code.

 - have reasonable compile times and memory consumption (though it will
 only improve over time)

Yes. I guess PEGs have problems there.

 Recalling EBNF parser idea that I run with before finally being dragged
 down by real life. Roughly I thought to follow hybrid LL(*) aproach,
 while I had a solid plan on doing DFA for lexer and parser lookahead,
 the other things were more or less floating.

Well, maybe you could integrate that within your up-and-coming research. 
Grammars have been considered a solved problem every five years since 
1970, and there's always new work coming :o).

 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.

An interactive regex would be a dream come true to me...


Andrei

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)

 Yes. I guess PEGs have problems there.

Probably LR, too, unless you build the state tables in a separate prior 
build step.

 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.

 An interactive regex would be a dream come true to me...

One thing I've been wanting to see in a lexer's regex would be a way to 
match things like D's nested comments or:

qEOS fancy string blah blah EOS

Support for either one would probably render it "not *technically* a regex" 
but it should be entirely possible by just adding some extra data/processing 
to the states.

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:jilmhr$1dul$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)

 Yes. I guess PEGs have problems there.

 Probably LR, too, unless you build the state tables in a separate prior 
 build step.

Erm, I mean LALR specifically.

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 11:15 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)

 Yes. I guess PEGs have problems there.

 Probably LR, too, unless you build the state tables in a separate prior
 build step.

It's been a while since I read about PEGs (the packrat parser paper), 
but think it's a more fundamental issue with PEGs because they need to 
memoize several possible parses of the input.

Andrei

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jilnjg$1eu9$3 digitalmars.com...
 On 2/29/12 11:15 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:jiljsg$193s$1 digitalmars.com...
 On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
 - have reasonable compile times and memory consumption (though it will
 only improve over time)

 Yes. I guess PEGs have problems there.

 Probably LR, too, unless you build the state tables in a separate prior
 build step.

 It's been a while since I read about PEGs (the packrat parser paper), but 
 think it's a more fundamental issue with PEGs because they need to memoize 
 several possible parses of the input.

I don't know much about packrat parsers, but what I meant is that generating 
LALR tables (which are then used by an LALR parser) can be very 
computationally expensive: For simple grammars, it's trivial, but for more 
complex "typical programming langauge" grammars, it can easily take upwards 
of quite a few minutes when *not* done in CTFE (at least on my underpowered 
machine). Although maybe my implementation and GOLD's implementation both 
just have some additional hidden scaling issue that isn't inherent to the 
algorithms?

Actually using an *already*-generated LALR table (which only needs to be 
generated once per grammar, not once per input) to parse doesn't have major 
inherent efficiency issues compared to other parsing algorithms, AFAIK.

Feb 29 2012

"Ben Hanson" <Ben.Hanson tikit.com> writes:

Hi Nick,

On Wednesday, 29 February 2012 at 17:16:44 UTC, Nick Sabalausky
wrote:
 One thing I've been wanting to see in a lexer's regex would be 
 a way to
 match things like D's nested comments or:

I have already implemented a lexer generator that can handle
recursion in the token definitions (using multiple lexer states).
See http://www.benhanson.net/lexertl.html

The library is C++ and generates lexers at runtime, but the
concepts should be easily transferable. Basically I have a
boolean for pop in the end state and a separate column for
push_dfa_:

      if (end_state_)
      {
          // Return longest match
          if (pop_)
          {
              start_state_ = results_.stack.top ().first;
              results_.stack.pop ();
          }
          else if (push_dfa_ != results_.npos ())
          {
              results_.stack.push (typename results::id_type_pair
                  (push_dfa_, id_));
          }
.
.
.

I'm happy to answer any questions etc.

Regards,

Ben

Mar 05 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 29.02.2012 20:31, Andrei Alexandrescu wrote:
 Recalling EBNF parser idea that I run with before finally being dragged
 down by real life. Roughly I thought to follow hybrid LL(*) aproach,
 while I had a solid plan on doing DFA for lexer and parser lookahead,
 the other things were more or less floating.

 Well, maybe you could integrate that within your up-and-coming research.
 Grammars have been considered a solved problem every five years since
 1970, and there's always new work coming :o).

It might involve parsing and grammars, unless academia folk seduce me 
with their ideas of optimization for highly parallel architectures :).

 There is prototype of interactive regex matcher that
 works directly on stream (buried in std.regex), it even passed dry-run
 unittests back then. Though I had to postpone till I/O is sorted out. I
 really loved Steven's design with it's easy access to buffer and well
 thought out primitives, hope it will come about sometime soon.

 An interactive regex would be a dream come true to me...

Where you've been? ;) I mean during GSOC we've spent days with Fawzi 
talking about shoehorning regex matcher to work on buffered streams. He 
actually did that prototype, I was mostly reviewing/fixing the source to 
make sure it fits. Recalling the inner works it was even expected to 
optionally do NFC normalization on the fly (wow, now that was ambitious) 
and all that without copying stuff around 99% of the time.


-- 
Dmitry Olshansky

Feb 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote:
[...]
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.

 
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

Cool! I'll have to take a look at this sometime.


[...]
 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>

But things like lex/yacc have been useful throughout the years. With D's
delegates, lexer/parser action rules should be even cleaner, no?


 A lot becomes feasible from the CTFE perspective, despite some
 bugfixes I only miss exp and log currently.

All of std.math should have CTFE versions so that we can perform complex
calculations at compile-time (e.g., create table of scaled sin/cos
values for use in high-performance 3D apps -- no need to waste time to
generate these tables at startup time).


 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a
 good I/O system and the missing Allocators.  Any parser written now
 will either risk to not play nice with ranges or has to come up with
 it's own buffering again.

OTOH, perhaps once we start writing a lexer/parser generator, that will
force us to fix the I/O and allocator system. Without a good project to
motivate us to fix these tedious issues, it seems that we just lose
inspiration to do it after a while.


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5

Feb 28 2012

deadalnix <deadalnix gmail.com> writes:

Le 28/02/2012 21:02, H. S. Teoh a �crit :
 OTOH, perhaps once we start writing a lexer/parser generator, that will
 force us to fix the I/O and allocator system. Without a good project to
 motivate us to fix these tedious issues, it seems that we just lose
 inspiration to do it after a while.

std.conainer is already a good candidate for allocators.

Back to the topic, this is a great idea, but isn't it building on weak 
bases ? We still have issue with const correctness, or postblit, and on 
lib side, we lack allocators, error handling is poor, containers are 
poor, etc . . .

Feb 28 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Tue, 28 Feb 2012 21:02:24 +0100, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote:
 [...]
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.


 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Cool! I'll have to take a look at this sometime.


 [...]
 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse  
 results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>

 But things like lex/yacc have been useful throughout the years. With D's
 delegates, lexer/parser action rules should be even cleaner, no?

Yacc does work but at the price of an additional build step and total  
automaton obfuscation.
And even at that price the results are still hard to maintain klingon  
sources.
http://supercollider.git.sourceforge.net/git/gitweb.cgi?p=supercollider/supercollider;a=blob;f=lang/LangSource/Bison/lang11d

I won't deny that the combination of CTFE text processing and static  
introspection could
improve on this. It could be made more feasible by some conventions, e.g.  
parse result
always uses structs or classes and built-in arrays.

----

class Module
{
     this(Declaration[]);
}

class StructDeclaration : Declaration
{
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";

     this(Identifier, Declaration[]);
}

...

Parser!(Module, StructDeclaration, ...) parser;
Module m = parser.parse(read("file.d"));

Feb 28 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
[...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.

Excellent idea, I like this.


 class Module
 {
     this(Declaration[]);
 }
 
 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";
 
     this(Identifier, Declaration[]);
 }
 
 ...
 
 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));

I like this! Definitely an improvement over yacc syntax.


T

-- 
Unix was not designed to stop people from doing stupid things, because
that would also stop them from doing clever things. -- Doug Gwyn

Feb 28 2012

"Nick Sabalausky" <a a.a> writes:

"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...
 On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
 [...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.

 Excellent idea, I like this.


 class Module
 {
     this(Declaration[]);
 }

 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";

     this(Identifier, Declaration[]);
 }

 ...

 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));

 I like this! Definitely an improvement over yacc syntax.

In Goldie, I've taken an inverted approach, which IMHO is easier to use: The 
types are automatically generated from the grammar, not the other way 
around. So applying that approach to the above code, it'd be more like this:

mixin genGrammar!("myGrammar", `
    Identifier = [a-zA-Z_][a-zA-Z_0-9]*
    Module = Declaration+
    Declaration = StructDeclaration
    StructDeclaration = 'struct' Identifier '{' Declaration* '}'
`);

Which generates these classes:

Parser!"myGrammar"
Symbol!("myGrammar.Identifier")
Symbol!("myGrammar.Module")
Symbol!("myGrammar.Declaration")
Symbol!("myGrammar.StructDeclaration")

and/or these:

Parser_myGrammar
Symbol_myGrammar!"Identifier"
Symbol_myGrammar!"Module"
Symbol_myGrammar!"Declaration"
Symbol_myGrammar!"StructDeclaration"

would could then be aliased by the user however they wanted:

alias Symbol_myGrammar MySym;

And there can still be hooks (delegates, subclassing, whatever) to add 
customized behavior/functionality.

Feb 28 2012

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:jikcca$1vq7$1 digitalmars.com...
 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message 
 news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...
 On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote:
 [...]
 I won't deny that the combination of CTFE text processing and static
 introspection could improve on this. It could be made more feasible by
 some conventions, e.g. parse result always uses structs or classes and
 built-in arrays.

 Excellent idea, I like this.


 class Module
 {
     this(Declaration[]);
 }

 class StructDeclaration : Declaration
 {
     enum _enbf = "struct $1=Identifier { $2=Declaration* }";

     this(Identifier, Declaration[]);
 }

 ...

 Parser!(Module, StructDeclaration, ...) parser;
 Module m = parser.parse(read("file.d"));

 I like this! Definitely an improvement over yacc syntax.

 In Goldie, I've taken an inverted approach, which IMHO is easier to use: 
 The types are automatically generated from the grammar, not the other way 
 around. So applying that approach to the above code, it'd be more like 
 this:

 mixin genGrammar!("myGrammar", `
    Identifier = [a-zA-Z_][a-zA-Z_0-9]*
    Module = Declaration+
    Declaration = StructDeclaration
    StructDeclaration = 'struct' Identifier '{' Declaration* '}'
 `);

 Which generates these classes:

 Parser!"myGrammar"
 Symbol!("myGrammar.Identifier")
 Symbol!("myGrammar.Module")
 Symbol!("myGrammar.Declaration")
 Symbol!("myGrammar.StructDeclaration")

 and/or these:

 Parser_myGrammar
 Symbol_myGrammar!"Identifier"
 Symbol_myGrammar!"Module"
 Symbol_myGrammar!"Declaration"
 Symbol_myGrammar!"StructDeclaration"

 would could then be aliased by the user however they wanted:

 alias Symbol_myGrammar MySym;

 And there can still be hooks (delegates, subclassing, whatever) to add 
 customized behavior/functionality.

Hmm, maybe I need to think about what it would take to make Goldie able to 
parse at compile-time...

Feb 28 2012

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able to 
 parse at compile-time...

Just gave it a quick shot. It was looking like it might not be too bad, but 
then I hit:

Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 
'interpret.c'

Bleh.

(This was with DMD 2.058)

Feb 28 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

Le 29 f=C3=A9vr. 2012 07:20, "Nick Sabalausky" <a a.a> a =C3=A9crit :
 "Nick Sabalausky" <a a.a> wrote in message
 news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able


to
 parse at compile-time...

 Just gave it a quick shot. It was looking like it might not be too bad,

but
 then I hit:

 Assertion failure: 'ctfeStack.stackPointer() =3D=3D 0' on line 4823 in fi=

le
 'interpret.c'

 Bleh.

 (This was with DMD 2.058)

Yeah, I had the very same yesterday :(

Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing.

Too bad.

In my case, I found a workaround: I was doing

array[] ~=3D SomeStruct(args);

which asserts at CTFE.

But:

auto s =3D SomeStructs(args);
array[] ~=3D s;

works.

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Philippe Sigaud" <philippe.sigaud gmail.com> wrote in message 
news:mailman.229.1330508922.24984.digitalmars-d puremagic.com...
Le 29 f�vr. 2012 07:20, "Nick Sabalausky" <a a.a> a �crit :
 Just gave it a quick shot. It was looking like it might not be too bad,

but
 then I hit:

 Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file
 'interpret.c'

 Bleh.

 (This was with DMD 2.058)

Yeah, I had the very same yesterday :(

Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing.

Too bad.

In my case, I found a workaround: I was doing

array[] ~= SomeStruct(args);

which asserts at CTFE.

But:

auto s = SomeStructs(args);
array[] ~= s;

works.

Ooh, cool! I know exactly where I'm doing that!

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:jikfpi$25cc$1 digitalmars.com...
 "Nick Sabalausky" <a a.a> wrote in message 
 news:jikcit$201o$1 digitalmars.com...
 Hmm, maybe I need to think about what it would take to make Goldie able 
 to parse at compile-time...

 Just gave it a quick shot. It was looking like it might not be too bad, 
 but then I hit:

 Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 
 'interpret.c'

 Bleh.

 (This was with DMD 2.058)

Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667

Mar 07 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 11:15 PM, Nick Sabalausky wrote:
 In Goldie, I've taken an inverted approach, which IMHO is easier to use: The
 types are automatically generated from the grammar, not the other way
 around. So applying that approach to the above code, it'd be more like this:

 mixin genGrammar!("myGrammar", `
      Identifier = [a-zA-Z_][a-zA-Z_0-9]*
      Module = Declaration+
      Declaration = StructDeclaration
      StructDeclaration = 'struct' Identifier '{' Declaration* '}'
 `);

 Which generates these classes:

 Parser!"myGrammar"
 Symbol!("myGrammar.Identifier")
 Symbol!("myGrammar.Module")
 Symbol!("myGrammar.Declaration")
 Symbol!("myGrammar.StructDeclaration")

 and/or these:

 Parser_myGrammar
 Symbol_myGrammar!"Identifier"
 Symbol_myGrammar!"Module"
 Symbol_myGrammar!"Declaration"
 Symbol_myGrammar!"StructDeclaration"

 would could then be aliased by the user however they wanted:

 alias Symbol_myGrammar MySym;

 And there can still be hooks (delegates, subclassing, whatever) to add
 customized behavior/functionality.

I think this is the right approach.

Andrei

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:jilkj6$1a50$2 digitalmars.com...
 On 2/28/12 11:15 PM, Nick Sabalausky wrote:
 In Goldie, I've taken an inverted approach, which IMHO is easier to use: 
 The
 types are automatically generated from the grammar, not the other way
 around. So applying that approach to the above code, it'd be more like 
 this:

 mixin genGrammar!("myGrammar", `
      Identifier = [a-zA-Z_][a-zA-Z_0-9]*
      Module = Declaration+
      Declaration = StructDeclaration
      StructDeclaration = 'struct' Identifier '{' Declaration* '}'
 `);

 Which generates these classes:

 Parser!"myGrammar"
 Symbol!("myGrammar.Identifier")
 Symbol!("myGrammar.Module")
 Symbol!("myGrammar.Declaration")
 Symbol!("myGrammar.StructDeclaration")

 and/or these:

 Parser_myGrammar
 Symbol_myGrammar!"Identifier"
 Symbol_myGrammar!"Module"
 Symbol_myGrammar!"Declaration"
 Symbol_myGrammar!"StructDeclaration"

 would could then be aliased by the user however they wanted:

 alias Symbol_myGrammar MySym;

 And there can still be hooks (delegates, subclassing, whatever) to add
 customized behavior/functionality.

 I think this is the right approach.

What? We agree? ;)

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/29/12 11:05 AM, Nick Sabalausky wrote:
 What? We agree? ;)

It's 2012 after all.

Andrei

Feb 29 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/28/12 12:46 PM, Martin Nowak wrote:
 I wrote a generic lexer generator some time ago.
 It already let to some compiler O(N^2) optimizations, because the token
 declarations sneak into the mangling :(.


 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

Wow, now we have an embarrassment of riches!

 I've ditched an attempt to write a parser combinator. It was based on
 expression templates and ended up at spirit craziness.

I'm not sure what a parser combinator is. I think grammar inheritance 
could actually be interesting to explore.

 <PERSONAL OPINION
 The hassle of providing good error messages and synthesizing parse results
 in a generic parser outweigh the benefit of a declarative grammar.
 /PERSONAL OPINION>

I think synthesizing ASTs is a solved problem, see my post before. Error 
messages are still a hassle, but I missed the point they became good in 
hand-written parsers :o).

 A lot becomes feasible from the CTFE perspective,
 despite some bugfixes I only miss exp and log currently.

 I do not agree that it's the right moment to write a parser though.
 It hits the first of phobos two biggest shortcomings, the lack of a good
 I/O
 system and the missing Allocators.
 Any parser written now will either risk to not play nice with ranges
 or has to come up with it's own buffering again.

Agreed, but that doesn't seem like the largest hurdle to me.


Andrei

Feb 29 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

Well, it is slower at lexing than DMD at parsing. What is the bottleneck?

Feb 29 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Well, it is slower at lexing than DMD at parsing. What is the bottleneck?

No, it's as fast as dmd's lexer.

Writing the tokens to stdout takes a lot of time though.
Just disable the "writeln(tok);" in the main loop.
There is also an --echo option which recreates the complete
source from the tokens.

Feb 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 29, 2012 at 07:28:48PM +0100, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:
 
On 02/28/2012 07:46 PM, Martin Nowak wrote:
https://gist.github.com/1255439 - lexer generator
https://gist.github.com/1262321 - complete and fast D lexer

Well, it is slower at lexing than DMD at parsing. What is the bottleneck?

 
 No, it's as fast as dmd's lexer.
 
 Writing the tokens to stdout takes a lot of time though.
 Just disable the "writeln(tok);" in the main loop.
 There is also an --echo option which recreates the complete
 source from the tokens.

This sounds like something I ran into in my D lexer project. Lexing
Phobos took upwards of 10-15 seconds, which is extremely slow
considering that dmd can *compile* it in just a few seconds. So I did
some profiling, and found out that most of the time was spent formatting
tokens into strings and running writeln. After I commented that out, the
running time dropped to just a few seconds. :-)


T

-- 
The fact that anyone still uses AOL shows that even the presence of
options doesn't stop some people from picking the pessimal one. - Mike
Ellis

Feb 29 2012

d coder <dlang.coder gmail.com> writes:

 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)

And how different was just a few seconds from 10-15 seconds? :-)

Feb 29 2012

"Nick Sabalausky" <a a.a> writes:

"d coder" <dlang.coder gmail.com> wrote in message 
news:mailman.241.1330541814.24984.digitalmars-d puremagic.com...
 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)

 And how different was just a few seconds from 10-15 seconds? :-)

~10 seconds

;)

Feb 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Mar 01, 2012 at 12:19:30AM +0530, d coder wrote:
 This sounds like something I ran into in my D lexer project. Lexing
 Phobos took upwards of 10-15 seconds, which is extremely slow
 considering that dmd can *compile* it in just a few seconds. So I did
 some profiling, and found out that most of the time was spent formatting
 tokens into strings and running writeln. After I commented that out, the
 running time dropped to just a few seconds. :-)

 And how different was just a few seconds from 10-15 seconds? :-)

OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So
it's about an order of magnitude difference. :)


T

-- 
"I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf,
Iraqi Minister of Information

Feb 29 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Well, it is slower at lexing than DMD at parsing. What is the bottleneck?

 No, it's as fast as dmd's lexer.

 Writing the tokens to stdout takes a lot of time though.
 Just disable the "writeln(tok);" in the main loop.

I did that.

Feb 29 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Well, it is slower at lexing than DMD at parsing. What is the  
 bottleneck?

 No, it's as fast as dmd's lexer.

 Writing the tokens to stdout takes a lot of time though.
 Just disable the "writeln(tok);" in the main loop.

 I did that.

Interesting, I've commented it out https://gist.github.com/1262321#L1559  
and get the following.

<<<
PHOBOS=~/Code/D/DPL/phobos
mkdir test_lexer
cd test_lexer
curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
curl https://raw.github.com/gist/1262321/entity.d > entity.d
dmd -O -release -inline dlexer lexer entity
wc -l ${PHOBOS}/std/*.d
time ./dlexer ${PHOBOS}/std/*.d



./dlexer ${PHOBOS}/std/*.d  0.21s user 0.00s system 99% cpu 0.211 total

Feb 29 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/29/2012 09:03 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Well, it is slower at lexing than DMD at parsing. What is the
 bottleneck?

 No, it's as fast as dmd's lexer.

 Writing the tokens to stdout takes a lot of time though.
 Just disable the "writeln(tok);" in the main loop.

 I did that.

 Interesting, I've commented it out https://gist.github.com/1262321#L1559
 and get the following.

 <<<
 PHOBOS=~/Code/D/DPL/phobos
 mkdir test_lexer
 cd test_lexer
 curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
 curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
 curl https://raw.github.com/gist/1262321/entity.d > entity.d
 dmd -O -release -inline dlexer lexer entity
 wc -l ${PHOBOS}/std/*.d
 time ./dlexer ${PHOBOS}/std/*.d



 ./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total

I get 0.160s for lexing using your lexer.
Parsing the same file with DMDs parser takes 0.155 seconds. The 
difference grows with larger files.

Feb 29 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Wed, 29 Feb 2012 21:30:57 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 02/29/2012 09:03 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 On 02/29/2012 07:28 PM, Martin Nowak wrote:
 On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 On 02/28/2012 07:46 PM, Martin Nowak wrote:
 https://gist.github.com/1255439 - lexer generator
 https://gist.github.com/1262321 - complete and fast D lexer

 Well, it is slower at lexing than DMD at parsing. What is the
 bottleneck?

 No, it's as fast as dmd's lexer.

 Writing the tokens to stdout takes a lot of time though.
 Just disable the "writeln(tok);" in the main loop.

 I did that.

 Interesting, I've commented it out https://gist.github.com/1262321#L1559
 and get the following.

 <<<
 PHOBOS=~/Code/D/DPL/phobos
 mkdir test_lexer
 cd test_lexer
 curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
 curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
 curl https://raw.github.com/gist/1262321/entity.d > entity.d
 dmd -O -release -inline dlexer lexer entity
 wc -l ${PHOBOS}/std/*.d
 time ./dlexer ${PHOBOS}/std/*.d



 ./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total

 I get 0.160s for lexing using your lexer.
 Parsing the same file with DMDs parser takes 0.155 seconds. The  
 difference grows with larger files.

Mmh, I've retested and you're right dmd's lexer is about 2x faster.
The main overhead stems from using ranges and enforce.

Quick profiling shows that 25% is spent in popFront and std.utf.stride.
Last time I worked on this I rewrote std.utf.decode to be much faster.
But utf characters are still "decoded" twice, once for front
and then again for popFront. Also stride uses table lookup and
can't be inlined.

If switch tables were implemented on x64 one could use them for
integral ElementType.

Feb 29 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Mar 01, 2012 at 12:04:39AM +0100, Martin Nowak wrote:
[...]
 Mmh, I've retested and you're right dmd's lexer is about 2x faster.
 The main overhead stems from using ranges and enforce.
 
 Quick profiling shows that 25% is spent in popFront and
 std.utf.stride.  Last time I worked on this I rewrote std.utf.decode
 to be much faster.  But utf characters are still "decoded" twice, once
 for front and then again for popFront. Also stride uses table lookup
 and can't be inlined.

[...]

One way to not decode characters twice is by using a single-dchar buffer
in your range. Something like:

	struct MyRange {
		private File src;
		char buf[];
		dchar readahead;

		this(File _src) {
			// ... fill up buf from src here
			popFront();
		}

		 property pure dchar front() {
			return readahead;
		}

		void popFront() {
			int stride;
			readahead = decode(buf, stride);
			buf = buf[stride..$];
			// ... fill up buf more if needed
		}
	}


T

-- 
"A man's wife has more power over him than the state has." -- Ralph Emerson

Feb 29 2012

bcs <bcs example.com> writes:

On 02/27/2012 11:59 PM, Andrei Alexandrescu wrote:
 I'm starting a new thread on this because I think the matter is of
 strategic importance.

 We all felt for a long time that there's a lot of potential in CTFE, and
 potential applications have been discussed more than a few times,
 ranging from formatting strings parsed to DSLs and parser generators.

 Such feats are now approaching fruition because a number of factors
 converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
 code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making more
 advanced uses possible and even relatively easy (thanks Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 lines of
 code (sadly, no comments or documentation yet :o))

A bit outdated but....

http://dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

BTW, that uses very little in the way of CTFE and even less in the way 
of string mixins. Those are, IMHO, very powerful tools that should only 
be combined as weapons of last resort.

 * With the occasion of that announcement we also find out Philippe
 Sigaud has already a competing design and implementation of a parser
 generator.

 This is the kind of stuff I've had an eye on for the longest time. I'm
 saying it's of strategic importance because CTFE technology, though not
 new and already available with some languages, has unique powers when
 combined with other features of D. With CTFE we get to do things that
 are quite literally impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and efficient
 lexer-parser generator combo in Phobos, pronto. The lexer itself could
 use a character-level PEG or a classic automaton, and emit tokens for
 consumption by a parser generator. The two should work in perfect tandem
 (no need for glue code). At the end of the day, defining a complete
 lexer+parser combo for a language should be just a few lines longer than
 the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

Feb 28 2012

"F i L" <witte2008 gmail.com> writes:

I'm not sure I follow all the details of what Andrei's suggesting 
and what's being talked about here, this parser/lexer stuff is 
still very new to me, so this may be a bit off-topic. However, I 
thought I'd weigh in on something I was very impressed with about 
the Nimrod language's direct AST access/manipulation.

Nim has a "template" which is very much like D's mixin templates, 
example:


   template foo(b:string) =
     var bar = b

   block main:
     foo("test")
     assert(bar == "test")

and the equivalent in...

   // D
   mixin template foo(string b) {
     auto bar = b;
   }

   void main() {
     mixin foo("test");
     assert(bar == "test");
   }

which is all very straight forward stuff, the cool part comes 
with Nim's macro's. Nim has a two unique types: expr & stmt 
(expression & statement). They're direct AST structures which can 
be passed to template/macro procedures and arbitrarily mutated. 
Example:

   macro foo(s:stmt): stmt =
     result = newNimNode(nnkStmtList)
     for i in 1 .. s.len-1:
       var str = s[i].toStrLit()
       result.add(newCall("echo", str))

   block main:
     foo:
       bar
       baz

the above code prints:
"
   bar
   baz
"

**Some notes: result is what's returned, and the reason you can 
use "foo" with a statement body is because any macro/template 
who's last parameter is type 'stmt' can be called with block 
semantics; similar to how UFCS works with the first parameter.**

The above *might* look like the following in D:

   macro foo(ASTNode[] stmts...) {
     ASTNode[] result;
     foreach (s; stmts) {
       auto str = s.toASTString();
       result ~= new ASTCall!"writeln"(str);
     }
     return result;
   }

   void main() {
     foo {
       bar;
       baz;
     }
   }

This kind of direct AST manipulation + body semantics opens the 
doors for a lot of cool things. If you read through Nim's lib 
documentation you'll notice many of the "language" features are 
actually just Library procedures in the defaultly included system 
module. Which is great because contributions to the *language* 
can be made very easily. Also, the infrastructure to read/write 
AST is no-doubt immensely useful for IDE support and other such 
dev tools.

I'm not a huge fan of everything in Nimrod, but this is something 
they definitely got right, and I think D could gain from their 
experience.

May 27 2012

Jacob Carlborg <doob me.com> writes:

On 2012-05-27 22:15, F i L wrote:
 I'm not sure I follow all the details of what Andrei's suggesting and
 what's being talked about here, this parser/lexer stuff is still very
 new to me, so this may be a bit off-topic. However, I thought I'd weigh
 in on something I was very impressed with about the Nimrod language's
 direct AST access/manipulation.

 Nim has a "template" which is very much like D's mixin templates, example:


 template foo(b:string) =
 var bar = b

 block main:
 foo("test")
 assert(bar == "test")

 and the equivalent in...

 // D
 mixin template foo(string b) {
 auto bar = b;
 }

 void main() {
 mixin foo("test");
 assert(bar == "test");
 }

 which is all very straight forward stuff, the cool part comes with Nim's
 macro's. Nim has a two unique types: expr & stmt (expression &
 statement). They're direct AST structures which can be passed to
 template/macro procedures and arbitrarily mutated. Example:

 macro foo(s:stmt): stmt =
 result = newNimNode(nnkStmtList)
 for i in 1 .. s.len-1:
 var str = s[i].toStrLit()
 result.add(newCall("echo", str))

 block main:
 foo:
 bar
 baz

 the above code prints:
 "
 bar
 baz
 "

 **Some notes: result is what's returned, and the reason you can use
 "foo" with a statement body is because any macro/template who's last
 parameter is type 'stmt' can be called with block semantics; similar to
 how UFCS works with the first parameter.**

 The above *might* look like the following in D:

 macro foo(ASTNode[] stmts...) {
 ASTNode[] result;
 foreach (s; stmts) {
 auto str = s.toASTString();
 result ~= new ASTCall!"writeln"(str);
 }
 return result;
 }

 void main() {
 foo {
 bar;
 baz;
 }
 }

 This kind of direct AST manipulation + body semantics opens the doors
 for a lot of cool things. If you read through Nim's lib documentation
 you'll notice many of the "language" features are actually just Library
 procedures in the defaultly included system module. Which is great
 because contributions to the *language* can be made very easily. Also,
 the infrastructure to read/write AST is no-doubt immensely useful for
 IDE support and other such dev tools.

 I'm not a huge fan of everything in Nimrod, but this is something they
 definitely got right, and I think D could gain from their experience.

This is a very cool feature of Nimrod. It allows to move several 
language features to the library.

* synchronized
* scope
* foreach (possibly)

-- 
/Jacob Carlborg

May 28 2012

"Ken" <kennethuil gmail.com> writes:

On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu 
wrote:
 I'm starting a new thread on this because I think the matter is 
 of strategic importance.

 We all felt for a long time that there's a lot of potential in 
 CTFE, and potential applications have been discussed more than 
 a few times, ranging from formatting strings parsed to DSLs and 
 parser generators.

 Such feats are now approaching fruition because a number of 
 factors converge:

 * Dmitry Olshansky's regex library (now in Phobos) generates 
 efficient D code straight from regexen.

 * The scope and quality of CTFE has improved enormously, making 
 more advanced uses possible and even relatively easy (thanks 
 Don!)

 * Hisayuki Mima implemented a parser generator in only 3000 
 lines of code (sadly, no comments or documentation yet :o))

 * With the occasion of that announcement we also find out 
 Philippe Sigaud has already a competing design and 
 implementation of a parser generator.

 This is the kind of stuff I've had an eye on for the longest 
 time. I'm saying it's of strategic importance because CTFE 
 technology, though not new and already available with some 
 languages, has unique powers when combined with other features 
 of D. With CTFE we get to do things that are quite literally 
 impossible to do in other languages.

 We need to have a easy-to-use, complete, seamless, and 
 efficient lexer-parser generator combo in Phobos, pronto. The 
 lexer itself could use a character-level PEG or a classic 
 automaton, and emit tokens for consumption by a parser 
 generator. The two should work in perfect tandem (no need for 
 glue code). At the end of the day, defining a complete 
 lexer+parser combo for a language should be just a few lines 
 longer than the textual representation of the grammar itself.

 What do you all think? Let's get this project off the ground!


 Thanks,

 Andrei

Great!  So what's the next step?  Do we wait for the maintainers 
of one of the CTFE parser gen packages to drop it in the Phobos 
Review Queue?  Do a reimplementation for Phobos?

We could attack this in pieces.  Start with a lexer/FSA generator 
(like Ragel but using CTFE) - this will make it much easier to 
consume many wire protocols, for starters (I found it very easy 
to make a simple HTTP client using Ragel), and will be quite 
useful on its own.  Then extend it into a lexer for a parser 
generator.

Jun 01 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/1/12 8:39 AM, Ken wrote:
 Great! So what's the next step? Do we wait for the maintainers of one of
 the CTFE parser gen packages to drop it in the Phobos Review Queue? Do a
 reimplementation for Phobos?

 We could attack this in pieces. Start with a lexer/FSA generator (like
 Ragel but using CTFE) - this will make it much easier to consume many
 wire protocols, for starters (I found it very easy to make a simple HTTP
 client using Ragel), and will be quite useful on its own. Then extend it
 into a lexer for a parser generator.

I think this core strength of the language should be properly supplanted 
by library support, so I'd be very interested to look at Phobos 
proposals. The proposals should come with sample grammars of nontrivial 
size, ideally a parser for the entire D language.

There might be things to fix in the compiler, e.g. memory consumption.


Andrei

Jun 01 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Lexer and parser generators using CTFE