
digitalmars.D - Looking for champion - std.lang.d.lex

reply Walter Bright <newshound2 digitalmars.com> writes:
As we all know, tool support is important for D's success. Making tools easier 
to build will help with that.

To that end, I think we need a lexer for the standard library - std.lang.d.lex. 
It would be helpful in writing color syntax highlighting filters, pretty 
printers, repl, doc generators, static analyzers, and even D compilers.

It should:

1. support a range interface for its input, and a range interface for its output
2. optionally not generate lexical errors, but just try to recover and continue
3. optionally return comments and ddoc comments as tokens
4. the tokens should be a value type, not a reference type
5. generally follow along with the C++ one so that they can be maintained in
tandem

It can also serve as the basis for creating a javascript implementation that can 
be embedded into web pages for syntax highlighting, and eventually an 
std.lang.d.parse.

Anyone want to own this?
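For illustration only, here is a minimal sketch of what a range-based lexer satisfying points 1 and 4 might look like; every name in it (Token, TokenId, DLexer) is hypothetical, not a proposed API:

import std.range;

enum TokenId { identifier, plus }

struct Token            // a value type, per point 4
{
    TokenId id;
    string  text;
}

struct DLexer           // an input range of Tokens over the source text
{
    string src;

    bool empty() { return src.length == 0; }

    Token front()
    {
        if (src[0] == '+')
            return Token(TokenId.plus, src[0 .. 1]);
        // everything else is lumped together here just to keep the sketch short
        return Token(TokenId.identifier, src[0 .. 1]);
    }

    void popFront() { src = src[1 .. $]; }
}

static assert(isInputRange!DLexer);   // point 1: a range interface for the output

unittest
{
    auto toks = DLexer("a+b");
    assert(toks.front.id == TokenId.identifier);
}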
Oct 21 2010
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
and how about

6. ctfe compatible

?
Oct 21 2010
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 21 October 2010 15:12:41 Ellery Newcomer wrote:
 and how about
 
 6. ctfe compatible
 
 ?
That would seem like a good idea (though part of me cringes at the idea of a program specifically running the lexer (and possibly the parser) as part of its own compilation process), but for the main purpose of being used for tools for D, that would seem completely unnecessary. So, I'd say that it would be a good idea to make it CTFE-able if it is at all reasonable to do so but that if making it CTFE-able would harm the design for more typical use, then it shouldn't be made CTFE-able. Personally, I don't have a good feel for exactly what is CTFE-able though, so I have no idea how easy it would be to make it CTFE-able. However, it does seem like a good idea if it's reasonable to do so. And if it's not, hopefully as dmd's CTFE capabilities become more advanced, it will become possible to do so. - Jonathan M Davis
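For a concrete picture of what CTFE compatibility buys, here is a toy sketch; countPlusTokens is a made-up stand-in, not part of any proposed API:

// A toy CTFE-able "lexer": pure, no asm, source available to the compiler.
pure size_t countPlusTokens(string src)
{
    size_t n;
    foreach (c; src)
        if (c == '+')
            ++n;
    return n;
}

// Because the function qualifies for CTFE, it can run during compilation:
static assert(countPlusTokens("a + b + c") == 2);
// ...and of course also at run time, for ordinary tools.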
Oct 21 2010
parent Don <nospam nospam.com> writes:
Jonathan M Davis wrote:
 On Thursday 21 October 2010 15:12:41 Ellery Newcomer wrote:
 and how about

 6. ctfe compatible

 ?
That would seem like a good idea (though part of me cringes at the idea of a program specifically running the lexer (and possibly the parser) as part of its own compilation process), but for the main purpose of being used for tools for D, that would seem completely unnecessary. So, I'd say that it would be a good idea to make it CTFE-able if it is at all reasonable to do so but that if making it CTFE-able would harm the design for more typical use, then it shouldn't be made CTFE-able. Personally, I don't have a good feel for exactly what is CTFE-able though, so I have no idea how easy it would be to make it CTFE-able. However, it does seem like a good idea if it's reasonable to do so. And if it's not, hopefully as dmd's CTFE capabilities become more advanced, it will become possible to do so. - Jonathan M Davis
In the long term, the requirements for CTFE will be pretty much:

1. the function must be safe (eg, no asm).
2. the function must be pure
3. the compiler must have access to the source code

You'll probably satisfy all those requirements anyway.
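A minimal sketch of a routine that meets those three constraints (the name isIdentChar is invented for the example):

// Safe (no asm, no unsafe casts), pure, and compiled from source,
// so it runs fine under CTFE as well as at run time.
@safe pure bool isIdentChar(dchar c)
{
    return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
        || ('0' <= c && c <= '9') || c == '_';
}

static assert(isIdentChar('_') && !isIdentChar('+'));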
Oct 22 2010
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, October 21, 2010 15:01:21 Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and even
 D compilers.
 
 It should:
 
 1. support a range interface for its input, and a range interface for its
 output 2. optionally not generate lexical errors, but just try to recover
 and continue 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
 
 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.
 
 Anyone want to own this?
You mean that you're going to make someone actually pull out their compiler book? ;) I'd love to do this (lexers and parsers are great fun IMHO - it's the code generation that isn't so fun), but I'm afraid that I'm busy enough at the moment that if I take it on, it won't get done very quickly. It is so very tempting though... So, as long as you're not in a hurry, I'm up for it, but I can't guarantee anything even approaching fast delivery. - Jonathan M Davis
Oct 21 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 So, as long as you're not in a hurry, I'm up for it, but I can't guarantee 
 anything even approaching fast delivery.
You may open the project here: http://github.com/ And then other people may help you along the way. Bye, bearophile
Oct 21 2010
next sibling parent Russel Winder <russel russel.org.uk> writes:
On Thu, 2010-10-21 at 19:51 -0400, bearophile wrote:
 Jonathan M Davis:
 
 So, as long as you're not in a hurry, I'm up for it, but I can't guarantee 
 anything even approaching fast delivery.
 
 You may open the project here: http://github.com/ And then other people may help you along the way.
Of course using BitBucket or Launchpad may well be more likely to get support as Mercurial and Bazaar are so much more usable than Git.

-- 
Russel.
=======================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Oct 21 2010
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, October 21, 2010 17:24:34 Russel Winder wrote:
 On Thu, 2010-10-21 at 19:51 -0400, bearophile wrote:
 Jonathan M Davis:
 So, as long as you're not in a hurry, I'm up for it, but I can't
 guarantee anything even approaching fast delivery.
You may open the project here: http://github.com/ And then other people may help you along the way.
Of course using BitBucket or Launchpad may well be more likely to get support as Mercurial and Bazaar are so much more usable that Git.
I've never actually used Mercurial or Bazaar. I do use git all the time though. I quite like it. Now, it could be Mercurial or Bazaar is better (like I said, I haven't used them), but I do find git to be quite useable. The simple fact that I can just create a repository in place instead of having to set up a separate location for a repository (like you have to do with svn) is a _huge_ improvement. I didn't really use source control on my personal projects before git. git actually makes it easy enough to do so that I do it all the time now. - Jonathan M Davis
Oct 21 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Jonathan M Davis wrote:
 You mean that you're going to make someone actually pull out their compiler 
 book? ;)
Not really, you can just use the dmd lexer source as a guide. Should be straightforward.
 So, as long as you're not in a hurry, I'm up for it, but I can't guarantee 
 anything even approaching fast delivery.
As long as it gets done!
Oct 21 2010
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 21 October 2010 15:01:21 Walter Bright wrote:
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
Does this mean that you want a pseudo-port of the C++ front end's lexer to D for this? Or are you looking for just certain pieces of it to be similar? I haven't looked at the front end code yet, so I don't know how it works there, but I wouldn't expect it to use ranges, for instance, so I would expect that the basic design would naturally stray a bit from whatever was done in C++ simply by doing things in fairly idiomatic D. And if I do look at the front end to see how that's done, there's the issue of the license. As I understand it, the front end is LGPL, and Phobos is generally Boost, which would mean that I would be looking at LGPL-licensed code when designing Boost-licensed code, even though it wouldn't really be copying the code per se since it's a change of language (though if you did the whole front end, obviously the license issue can be waived quite easily). License issues aside, however, I do think that it would make sense for std.lang.d.lex to do things similarly to the C++ front end, even if there are a number of basic differences. - Jonathan M Davis
Oct 21 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Jonathan M Davis wrote:
 On Thursday 21 October 2010 15:01:21 Walter Bright wrote:
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
Does this mean that you want a pseudo-port of the C++ front end's lexer to D for this? Or are you looking for just certain pieces of it to be similar?
Yes, but not a straight port. The C++ version has things in it that are unnecessary for the D version: it has an external string table (the D version should use an associative array instead), its support for lookahead can be put in the parser, it doesn't tokenize comments, etc. Essentially I'd like the D lexer to be self-contained in one file.
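As a rough illustration of the associative-array point, a sketch with invented names (TOK, classify), not code taken from dmd:

enum TOK { identifier, kwIf, kwElse }

TOK[string] keywords;   // replaces the C++ version's external string table

static this()
{
    keywords = ["if" : TOK.kwIf, "else" : TOK.kwElse];
}

TOK classify(string ident)
{
    if (auto p = ident in keywords)
        return *p;
    return TOK.identifier;   // not a keyword: an ordinary identifier
}

unittest
{
    assert(classify("if") == TOK.kwIf);
    assert(classify("foo") == TOK.identifier);
}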
 I haven't looked at the front end code yet, so I don't know how it works
there, 
 but I wouldn't expect it to uses ranges, for instance, so I would expect that 
 the basic design would naturally stray a bit from whatever was done in C++ 
 simply by doing things in fairly idiomatic D. And if I do look at the front
end 
 to see how that's done, there's the issue of the license. As I understand it, 
 the front end is LGPL, and Phobos is generally Boost, which would mean that I 
 would be looking at LGPL-licensed code when designing Boost-licensed, even 
 though it wouldn't really be copying the code per se since it's a change of 
 language (though if you did the whole front end, obviously the license issue
can 
 be waved quite easily).
Since the license is mine, I can change the D version to the Boost license, no problem.
 License issues aside, however, I do think that it would make sense for 
 std.lang.d.lex to do things similiarly to the C++ front end, even if there are
a 
 number of basic differences.
Yup. The idea is the D version lexes exactly the same grammar as the dmd one. The easiest way to ensure that is to do equivalent logic.
Oct 21 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 21 October 2010 23:55:42 Walter Bright wrote:
 Jonathan M Davis wrote:
 On Thursday 21 October 2010 15:01:21 Walter Bright wrote:
 5. generally follow along with the C++ one so that they can be
 maintained in tandem
Does this mean that you want a pseudo-port of the C++ front end's lexer to D for this? Or are you looking for just certain pieces of it to be similar?
Yes, but not a straight port. The C++ version has things in it that are unnecessary for the D version, like the external string table (should use an associative array instead), the support for lookahead can be put in the parser, doesn't tokenize comments, etc. Essentially I'd like the D lexer to be self-contained in one file.
 I haven't looked at the front end code yet, so I don't know how it works
 there, but I wouldn't expect it to uses ranges, for instance, so I would
 expect that the basic design would naturally stray a bit from whatever
 was done in C++ simply by doing things in fairly idiomatic D. And if I
 do look at the front end to see how that's done, there's the issue of
 the license. As I understand it, the front end is LGPL, and Phobos is
 generally Boost, which would mean that I would be looking at
 LGPL-licensed code when designing Boost-licensed, even though it
 wouldn't really be copying the code per se since it's a change of
 language (though if you did the whole front end, obviously the license
 issue can be waved quite easily).
Since the license is mine, I can change the D version to the Boost license, no problem.
 License issues aside, however, I do think that it would make sense for
 std.lang.d.lex to do things similiarly to the C++ front end, even if
 there are a number of basic differences.
Yup. The idea is the D version lexes exactly the same grammar as the dmd one. The easiest way to ensure that is to do equivalent logic.
Okay. Good to know. I'll start looking at the C++ front end some time in the next few days, but like I said, I really don't know how much time I'm going to be able to spend on it, so it won't necessarily be quick. However, porting logic should be much faster than doing it from scratch. - Jonathan M Davis
Oct 22 2010
parent Lutger <lutger.blijdestijn gmail.com> writes:
Jonathan M Davis wrote:

...
 Okay. Good to know. I'll start looking at the C++ front end some time in
 the next few days, but like I said, I really don't know how much time I'm
 going to be able to spend on it, so it won't necessarily be quick.
 However, porting logic should be much faster than doing it from scratch.
 
 - Jonathan M Davis
If you are gonna port from the C++ front end, there is already a port called ddmd which may give you a head start: www.dsource.org/projects/ddmd
Oct 22 2010
prev sibling parent dolive <dolive89 sina.com> writes:
Walter Bright wrote:

 Jonathan M Davis wrote:
 On Thursday 21 October 2010 15:01:21 Walter Bright wrote:
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
Does this mean that you want a pseudo-port of the C++ front end's lexer to D for this? Or are you looking for just certain pieces of it to be similar?
Yes, but not a straight port. The C++ version has things in it that are unnecessary for the D version, like the external string table (should use an associative array instead), the support for lookahead can be put in the parser, doesn't tokenize comments, etc. Essentially I'd like the D lexer to be self-contained in one file.
 I haven't looked at the front end code yet, so I don't know how it works
there, 
 but I wouldn't expect it to uses ranges, for instance, so I would expect that 
 the basic design would naturally stray a bit from whatever was done in C++ 
 simply by doing things in fairly idiomatic D. And if I do look at the front
end 
 to see how that's done, there's the issue of the license. As I understand it, 
 the front end is LGPL, and Phobos is generally Boost, which would mean that I 
 would be looking at LGPL-licensed code when designing Boost-licensed, even 
 though it wouldn't really be copying the code per se since it's a change of 
 language (though if you did the whole front end, obviously the license issue
can 
 be waved quite easily).
Since the license is mine, I can change the D version to the Boost license, no problem.
 License issues aside, however, I do think that it would make sense for 
 std.lang.d.lex to do things similiarly to the C++ front end, even if there are
a 
 number of basic differences.
Yup. The idea is the D version lexes exactly the same grammar as the dmd one. The easiest way to ensure that is to do equivalent logic.
Will dmd 2.050 be released in October? Thanks.
Oct 22 2010
prev sibling next sibling parent reply dolive <dolive89 sina.com> writes:
Walter Bright wrote:

 As we all know, tool support is important for D's success. Making tools easier 
 to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
std.lang.d.lex. 
 It would be helpful in writing color syntax highlighting filters, pretty 
 printers, repl, doc generators, static analyzers, and even D compilers.
 
 It should:
 
 1. support a range interface for its input, and a range interface for its
output
 2. optionally not generate lexical errors, but just try to recover and continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained in
tandem
 
 It can also serve as the basis for creating a javascript implementation that
can 
 be embedded into web pages for syntax highlighting, and eventually an 
 std.lang.d.parse.
 
 Anyone want to own this?
Do you have Scintilla for D ?
Oct 22 2010
next sibling parent dolive <dolive89 sina.com> writes:
dolive wrote:

 Walter Bright wrote:
 
 As we all know, tool support is important for D's success. Making tools easier 
 to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
std.lang.d.lex. 
 It would be helpful in writing color syntax highlighting filters, pretty 
 printers, repl, doc generators, static analyzers, and even D compilers.
 
 It should:
 
 1. support a range interface for its input, and a range interface for its
output
 2. optionally not generate lexical errors, but just try to recover and continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained in
tandem
 
 It can also serve as the basis for creating a javascript implementation that
can 
 be embedded into web pages for syntax highlighting, and eventually an 
 std.lang.d.parse.
 
 Anyone want to own this?
Do you have Scintilla for D ?
Scintilla should be ported to D.
Oct 22 2010
prev sibling next sibling parent reply BLS <windevguy hotmail.de> writes:
Why not create a DLL/.so-based lexer/parser based on the existing DMD 
front end? It could always be up to date. Necessary steps: functional 
wrappers around the C++ classes, implementing the visitor pattern (AST), 
creating std.lex and std.parse.

my 2 cents

On 22/10/2010 00:01, Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.

 Anyone want to own this?
Oct 22 2010
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
BLS wrote:
 Why not creating a DLL/so based Lexer/Parser based on the existing DMD 
 front end.? It could be always up to date. Necessary Steps. functional 
 wrappers around C++ classes, Implementing the visitor pattern (AST), 
 create std.lex and std.parse..
I've done things like that before, they're even more work.
Oct 22 2010
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2010-10-22 17:37, BLS wrote:
 Why not creating a DLL/so based Lexer/Parser based on the existing DMD
 front end.? It could be always up to date. Necessary Steps. functional
 wrappers around C++ classes, Implementing the visitor pattern (AST),
 create std.lex and std.parse..

 my 2 cents

 On 22/10/2010 00:01, Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.

 Anyone want to own this?
I think it would be better to create a lexer/parser in D and have it in the standard library. Then one could begin the process of porting the DMD frontend using this library. Then hopefully the DMD frontend will be written in D and use this new library, so there is one code base that is always up to date. -- /Jacob Carlborg
Oct 22 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Jacob Carlborg" <doob me.com> wrote in message 
news:i9spln$lbj$1 digitalmars.com...
 On 2010-10-22 17:37, BLS wrote:
 Why not creating a DLL/so based Lexer/Parser based on the existing DMD
 front end.? It could be always up to date. Necessary Steps. functional
 wrappers around C++ classes, Implementing the visitor pattern (AST),
 create std.lex and std.parse..

 my 2 cents

 On 22/10/2010 00:01, Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.

 Anyone want to own this?
I think it would be better to create a lexer/parser in D and have it in the standard library. Then one could begin the process of porting the DMD frontend using this library. Then hopefully the DMD frontend will be written in D and use this new library, being one code base and will always be up to date.
*cough* DDMD
Oct 22 2010
parent Jacob Carlborg <doob me.com> writes:
On 2010-10-22 22:42, Nick Sabalausky wrote:
 "Jacob Carlborg"<doob me.com>  wrote in message
 news:i9spln$lbj$1 digitalmars.com...
 On 2010-10-22 17:37, BLS wrote:
 Why not creating a DLL/so based Lexer/Parser based on the existing DMD
 front end.? It could be always up to date. Necessary Steps. functional
 wrappers around C++ classes, Implementing the visitor pattern (AST),
 create std.lex and std.parse..

 my 2 cents

 On 22/10/2010 00:01, Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.

 Anyone want to own this?
I think it would be better to create a lexer/parser in D and have it in the standard library. Then one could begin the process of porting the DMD frontend using this library. Then hopefully the DMD frontend will be written in D and use this new library, being one code base and will always be up to date.
*cough* DDMD
I know. I would more than love to see DDMD become the official D compiler, but if that happens I would still like the frontend to be based on the lexer/parser library in Phobos. -- /Jacob Carlborg
Oct 23 2010
prev sibling next sibling parent reply Tomek Sowiński <just ask.me> writes:
On 22-10-2010 at 00:01:21, Walter Bright <newshound2 digitalmars.com> wrote:

 As we all know, tool support is important for D's success. Making tools 
 easier to build will help with that.

 To that end, I think we need a lexer for the standard library - 
 std.lang.d.lex. It would be helpful in writing color syntax highlighting 
 filters, pretty printers, repl, doc generators, static analyzers, and 
 even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for 
 its output
 2. optionally not generate lexical errors, but just try to recover and 
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be 
 maintained in tandem

 It can also serve as the basis for creating a javascript implementation 
 that can be embedded into web pages for syntax highlighting, and 
 eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc., so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI; I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What do you think? -- Tomek
Oct 22 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Tomek Sowiński wrote:
 Interesting idea. Here's another: D will soon need bindings for CORBA, 
 Thrift, etc, so lexers will have to be written all over to grok 
 interface files. Perhaps a generic tokenizer which can be parametrized 
 with a lexical grammar would bring more ROI, I got a hunch D's templates 
 are strong enough to pull this off without any source code generation 
 ala JavaCC. The books I read on compilers say tokenization is a solved 
 problem, so the theory part on what a good abstraction should be is 
 done. What you think?
Lexers are so simple, it is less work to just build them by hand than use lexer generator tools.
Oct 22 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/22/10 14:17 CDT, Walter Bright wrote:
 Tomek Sowiński wrote:
 Interesting idea. Here's another: D will soon need bindings for CORBA,
 Thrift, etc, so lexers will have to be written all over to grok
 interface files. Perhaps a generic tokenizer which can be parametrized
 with a lexical grammar would bring more ROI, I got a hunch D's
 templates are strong enough to pull this off without any source code
 generation ala JavaCC. The books I read on compilers say tokenization
 is a solved problem, so the theory part on what a good abstraction
 should be is done. What you think?
Lexers are so simple, it is less work to just build them by hand than use lexer generator tools.
I wrote a C++ lexer. It wasn't at all easy except if I compared it against the work necessary to build a full compiler. Andrei
Oct 22 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 On 22-10-2010 at 00:01:21, Walter Bright <newshound2 digitalmars.com> wrote:

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.

I have in mind the entire implementation of a simple design, but never had the time to execute on it. The tokenizer would work like this:

alias Lexer!(
    "+", "PLUS",
    "-", "MINUS",
    "+=", "PLUS_EQ",
    ...
    "if", "IF",
    "else", "ELSE"
    ...
) DLexer;

Such a declaration generates numeric values DLexer.PLUS etc. and generates efficient code that extracts a stream of tokens from a stream of text. Each token in the token stream has the ID and the text.

Comments, strings etc. can be handled in one of several ways but that's a longer discussion.

The undertaking is doable but nontrivial.

Andrei
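One conceivable mechanism for generating those numeric ids is a string mixin built at compile time; the following is only a guess at how it could be done, not Andrei's design:

template Lexer(spec...)          // spec: "+", "PLUS", "-", "MINUS", ...
{
    private string makeEnum()
    {
        string code = "enum TokenId { ";
        foreach (i, s; spec)
        {
            static if (i % 2 == 1)   // every second argument is an id name
                code ~= s ~ ", ";
        }
        return code ~ "}";
    }

    mixin(makeEnum());   // generates TokenId.PLUS, TokenId.MINUS, ...

    // the matching code that walks the input would be generated similarly
}

alias Lexer!("+", "PLUS", "-", "MINUS") DLexer;
static assert(DLexer.TokenId.PLUS != DLexer.TokenId.MINUS);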
Oct 22 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i9spsa$ll0$1 digitalmars.com...
 On 10/22/10 14:02 CDT, Tomek Sowinski wrote:
 On 22-10-2010 at 00:01:21, Walter Bright <newshound2 digitalmars.com> wrote:

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
FWIW, I've been converting my Goldie lexing/parsing library/toolset ( http://www.dsource.org/projects/goldie ) to D2/Phobos, and that should have a release sometime in the next couple months or so. I'm not sure it would really be appropriate for Phobos since it isn't really range-ified yet, probably doesn't use Phobos coding conventions, and relies on one of my other libraries/tools. But it does do generalized lexing/parsing (LALR) via the GOLD ( http://www.devincook.com/goldparser/ ) grammar file formats, can optionally generate source files for better compile-time checking (for instance, so Token!"<Statemnt>" will generate a compile-time error), has full documentation, and I'm working on a tool/lib that will compile the grammars without having to use the Windows/GUI-based GOLD Parser Builder tool.
Oct 22 2010
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu Wrote:
 
 I have in mind the entire implementation of a simple design, but never 
 had the time to execute on it. The tokenizer would work like this:
 
 alias Lexer!(
      "+", "PLUS",
      "-", "MINUS",
      "+=", "PLUS_EQ",
      ...
      "if", "IF",
      "else", "ELSE"
      ...
 ) DLexer;
 
 Such a declaration generates numeric values DLexer.PLUS etc. and 
 generates an efficient code that extracts a stream of tokens from a 
 stream of text. Each token in the token stream has the ID and the text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Oct 22 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 I have in mind the entire implementation of a simple design, but never
 had the time to execute on it. The tokenizer would work like this:

 alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
       "if", "IF",
       "else", "ELSE"
       ...
 ) DLexer;

 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see:

1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation.

2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.

Andrei
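A bare-bones sketch of what the second option's hook could look like from the user's side; onUnrecognizedString and Token are invented names used only for the example:

struct Token { int id; string text; }

// Called by the generated lexer when no entry of the compile-time table
// matches; here it just slurps a run of digits as one "unknown" token.
Token onUnrecognizedString(ref string input)
{
    size_t n = 0;
    while (n < input.length && input[n] >= '0' && input[n] <= '9')
        ++n;
    auto len = n ? n : 1;            // always consume at least one character
    auto tok = Token(-1, input[0 .. len]);
    input = input[len .. $];
    return tok;
}

unittest
{
    string src = "123+x";
    assert(onUnrecognizedString(src).text == "123" && src == "+x");
}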
Oct 22 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:
 
 alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
       "if", "IF",
       "else", "ELSE"
       ...
 ) DLexer;
 
 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
Oct 23 2010
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Sean Kelly <sean invisibleduck.org> wrote:
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:
 
 alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
       "if", "IF",
       "else", "ELSE"
       ...
 ) DLexer;
 
 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
Or maybe not. A /* could be CommentBegin. I'll have to think on it a bit more.
Oct 23 2010
parent Sean Kelly <sean invisibleduck.org> writes:
Sean Kelly <sean invisibleduck.org> wrote:
 Sean Kelly <sean invisibleduck.org> wrote:
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:
 
 alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
       "if", "IF",
       "else", "ELSE"
       ...
 ) DLexer;
 
 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from
 a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
Or maybe not. A /* could be CommentBegin. I'll have to think on it a bit more.
I still think it won't work. The stuff inside the comment would come through as a string of random tokens. Also, the // comment is EOL sensitive, and this info isn't normally communicated to the parser.
Oct 23 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/23/10 11:44 CDT, Sean Kelly wrote:
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:

 alias Lexer!(
        "+", "PLUS",
        "-", "MINUS",
        "+=", "PLUS_EQ",
        ...
        "if", "IF",
        "else", "ELSE"
        ...
 ) DLexer;

 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
I was thinking comments could be easily caught by simple routines:

alias Lexer!(
    "+", "PLUS",
    "-", "MINUS",
    "+=", "PLUS_EQ",
    ...
    "/*", q{parseNonNestedComment("*/")},
    "/+", q{parseNestedComment("+/")},
    "//", q{parseOneLineComment()},
    ...
    "if", "IF",
    "else", "ELSE",
    ...
) DLexer;

During compilation, such non-tokens are recognized as code by the lexer generator and called appropriately. A comprehensive library of such routines completes a useful library.

Andrei
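For concreteness, a guess at what one of those canned routines might look like; the name parseNonNestedComment comes from the table above, but the signature and behavior here are assumptions:

import std.string : indexOf;

// Skip a non-nested comment: drop everything up to and including `closer`.
string parseNonNestedComment(string input, string closer)
{
    auto idx = indexOf(input, closer);
    if (idx < 0)
        return null;                        // unterminated comment: consume all
    return input[idx + closer.length .. $]; // resume lexing after the closer
}

unittest
{
    assert(parseNonNestedComment(" a comment */ rest", "*/") == " rest");
}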
Oct 23 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 During compilation, such non-tokens are recognized as code by the lexer 
 generator and called appropriately. A comprehensive library of such 
 routines completes a useful library.
I agree, a set of "canned" and heavily optimized lexing functions for common things like identifiers, numbers, comments, etc., would make a lexing library much more practical. Those will work great for inventing DSLs, but for existing languages, the trouble is that the different languages have subtle variations on how they handle them. For example, D's numeric literals allow embedded underscores. Go doesn't overflow on numeric literals. Javascript has some wacky rules to distinguish a comment from a regex. The \uNNNN letters allowed in identifiers in some languages. So while a general purpose lexing library will be very useful, for lexing D code (and Java, Javascript, etc.) a custom one will probably be much more practical.
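As a small example of the kind of per-language variation mentioned above, a sketch of lexing a D-style decimal literal with embedded underscores (purely illustrative, not dmd's code):

// Consume a run of digits and underscores, ignoring the underscores,
// since D's grammar (unlike many languages') permits 1_000_000.
ulong lexDecimal(ref string s)
{
    ulong value = 0;
    while (s.length && ((s[0] >= '0' && s[0] <= '9') || s[0] == '_'))
    {
        if (s[0] != '_')
            value = value * 10 + (s[0] - '0');
        s = s[1 .. $];
    }
    return value;
}

unittest
{
    string src = "1_000_000;";
    assert(lexDecimal(src) == 1_000_000);
    assert(src == ";");
}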
Oct 23 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/23/10 13:41 CDT, Walter Bright wrote:
 Andrei Alexandrescu wrote:
 During compilation, such non-tokens are recognized as code by the
 lexer generator and called appropriately. A comprehensive library of
 such routines completes a useful library.
I agree, a set of "canned" and heavily optimized lexing functions for common things like identifiers, numbers, comments, etc., would make a lexing library much more practical. Those will work great for inventing DSLs, but for existing languages, the trouble is that the different languages have subtle variations on how they handle them. For example, D's numeric literals allow embedded underscores. Go doesn't overflow on numeric literals. Javascript has some wacky rules to distinguish a comment from a regex. The \uNNNN letters allowed in identifiers in some languages. So while a general purpose lexing library will be very useful, for lexing D code (and Java, Javascript, etc.) a custom one will probably be much more practical.
I don't see these two in tension. "General" does not need entail "unsuitable for subtle particularities". It is more difficult, but not impossible. Again, a general parser that takes care of the 90% of the drudgework and gives enough hooks to do the remaining 10%, all as efficient as hand-written code. Andrei
Oct 23 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 I don't see these two in tension. "General" does not need entail 
 "unsuitable for subtle particularities". It is more difficult, but not 
 impossible. Again, a general parser that takes care of the 90% of the 
 drudgework and gives enough hooks to do the remaining 10%, all as 
 efficient as hand-written code.
In general I agree with you, but it is a major project to do that and make it general, efficient, and easy to use - and then, one has to make a D lexer out of it. In the meantime, we have a lexer for D that would be straightforward to adapt to be a D library module. The only decision that has to be made is what the API to it will be.
Oct 23 2010
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 10/23/10 11:44 CDT, Sean Kelly wrote:
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:
 
 alias Lexer!(
        "+", "PLUS",
        "-", "MINUS",
        "+=", "PLUS_EQ",
        ...
        "if", "IF",
        "else", "ELSE"
        ...
 ) DLexer;
 
 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from
 a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
I was thinking comments could be easily caught by simple routines: alias Lexer!( "+", "PLUS", "-", "MINUS", "+=", "PLUS_EQ", ... "/*", q{parseNonNestedComment("*/")}, "/+", q{parseNestedComment("+/")}, "//", q{parseOneLineComment()}, ... "if", "IF", "else", "ELSE", ... ) DLexer; During compilation, such non-tokens are recognized as code by the lexer generator and called appropriately. A comprehensive library of such routines completes a useful library.
Ah, so the only issue is identifying the first set for a lexical element, in essence. That works.
Oct 23 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i9v8vq$2gvh$1 digitalmars.com...
 On 10/23/10 11:44 CDT, Sean Kelly wrote:
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 On 10/22/10 16:28 CDT, Sean Kelly wrote:
 Andrei Alexandrescu Wrote:
 I have in mind the entire implementation of a simple design, but
 never
 had the time to execute on it. The tokenizer would work like this:

 alias Lexer!(
        "+", "PLUS",
        "-", "MINUS",
        "+=", "PLUS_EQ",
        ...
        "if", "IF",
        "else", "ELSE"
        ...
 ) DLexer;

 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the
 text.
What about, say, floating-point literals? It seems like the first element of a pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers, identifiers) there are at least two possibilities that I see: 1. Go the full route of allowing regexen in the definition. This is very hard because you need to generate an efficient (N|D)FA during compilation. 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the compile-time table matches, just call onUnrecognizedString(). In conjunction with a few simple specialized functions, that makes it very simple to define arbitrarily complex lexers where the bulk of the work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical elements into the parser. For example, a comment may be defined as /**/, which if there is no lexical definition of a comment means that it parses as four distinct valid tokens, div mul mul div.
I was thinking comments could be easily caught by simple routines: alias Lexer!( "+", "PLUS", "-", "MINUS", "+=", "PLUS_EQ", ... "/*", q{parseNonNestedComment("*/")}, "/+", q{parseNestedComment("+/")}, "//", q{parseOneLineComment()}, ... "if", "IF", "else", "ELSE", ... ) DLexer; During compilation, such non-tokens are recognized as code by the lexer generator and called appropriately. A comprehensive library of such routines completes a useful library.
What's wrong with regexes? That's pretty typical for lexers.
Oct 23 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/23/10 16:39 CDT, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i9v8vq$2gvh$1 digitalmars.com...
 What's wrong with regexes? That's pretty typical for lexers.
I mentioned that using regexes is possible but would make it much more difficult to generate good quality lexers. Besides, regexen are IMHO quite awkward at expressing certain things that can be easily parsed by hand, such as comments or recursive comments. Andrei
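To illustrate the point about recursive comments: a hand-written skip routine for D's nesting /+ +/ comments only needs a depth counter, which a classic regex cannot express. A sketch, assuming the input starts just after the opening /+:

size_t skipNestedComment(string s)
{
    size_t i = 0, depth = 1;
    while (i + 1 < s.length && depth)
    {
        if (s[i] == '/' && s[i + 1] == '+')      { ++depth; i += 2; }
        else if (s[i] == '+' && s[i + 1] == '/') { --depth; i += 2; }
        else ++i;
    }
    return i;   // number of characters consumed (excluding the opener)
}

unittest
{
    assert(skipNestedComment("a /+ b +/ c +/ rest") == 14);
}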
Oct 23 2010
parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i9vlep$8ao$1 digitalmars.com...
 On 10/23/10 16:39 CDT, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i9v8vq$2gvh$1 digitalmars.com...
 What's wrong with regexes? That's pretty typical for lexers.
I mentioned that using regexes is possible but would make it much more difficult to generate good quality lexers.
I see. Maybe a lexer 2.0 thing.
 Besides, regexen are IMHO quite awkward at expressing certain things that 
 can be easily parsed by hand, such as comments
//[^\n]*\n
/\*(.|\*[^/])*\*/

Pretty simple as far as regexes go, and I'm far from a regex expert. Plus there's nothing stopping the use of a vastly improved regex syntax like GOLD uses ( http://www.devincook.com/goldparser/doc/grammars/define-terminals.htm ). In that, the two regexes above would look like:

{LineCommentChar} = {Printable} - {LF}
LineComment = '//' {LineCommentChar}* {LF}

{BlockCommentChar} = {Printable} - [*]
{BlockCommentCharNoSlash} = {BlockCommentChar} - [/]
BlockComment = '/*' ({BlockCommentChar} | '*' {BlockCommentCharNoSlash})* '*/'

And further syntactical improvement is easy to imagine, such as in-line character set creation.
 or recursive comments.
Granted, although I think there is precedent for regex engines that can handle matched nested pairs just fine.
Oct 23 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 What's wrong with regexes?
They don't handle recursion.
Oct 23 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:i9vn3l$bd1$2 digitalmars.com...
 Nick Sabalausky wrote:
 What's wrong with regexes?
They don't handle recursion.
Neither do plain-old strings. But regexes will get you farther than plain strings before needing to resort to customized lexing. But I'm a big data-driven fan anyway. If you're not, then I can see why it wouldn't seem as appealing as it does to me. In any case, if I have a chance I might see about adapting my Goldie ( www.dsource.org/projects/goldie ) library to more Phobos-friendly requirements. It's already a fully-usable lexer/parser (and the lexer/parser parts can be used independently), with a complete grammar description language, and I already have misc related tools written. And it's mostly working on D2 already (just need the next DMD because it has a fix for a bug that's a breaker for one of the tools). So if I can get it into a state more suitable for Phobos then that might end up putting things ahead of where they would be if someone just started from scratch. The initial versions might not be completely Phobos-ified, but it could definitely get there (especially if I had some guidance from people with more Phobos2 experience than me). Would Walter & co be interested in this? If not, I won't bother, but if so, then I may give it a shot.
Oct 23 2010
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:ia01q3$1i1a$1 digitalmars.com...
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:i9vn3l$bd1$2 digitalmars.com...
 Nick Sabalausky wrote:
 What's wrong with regexes?
They don't handle recursion.
Neither do plain-old strings. But regexes will get you farther than plain strings before needing to resort to customized lexing. But I'm a big data-driven fan anyway. If you're not than I can see why it wouldn't seem as appealing as it does to me. In any case, if I have a chance I might see about adapting my Goldie ( www.dsource.org/projects/goldie ) library to more Phobos-friendly requirements. It's already a fully-usable lexer/parser (and the lexer/parser parts can be used independantly), with a complete grammar description language and I already have misc related tools written. And it's mostly working on D2 already (just need the next DMD because it has a fix for a bug that's a breaker for one of the tools). So if I can get it into a state more suitable for Phobos then that might end up putting things ahead of where they would be if someone just started from scratch. The initial versions might not be completely Phobos-ified, but it could definitely get there (especially if I had some guidance from people with more Phobos2 experience than me). Would Walter & co be interested in this? If not, I won't bother, but if so, then I may give it a shot.
And FWIW, I was already thinking about making some improvements to Goldie's API anyway.
Oct 23 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:ia01sk$1i7s$1 digitalmars.com...
 "Nick Sabalausky" <a a.a> wrote in message 
 news:ia01q3$1i1a$1 digitalmars.com...
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:i9vn3l$bd1$2 digitalmars.com...
 Nick Sabalausky wrote:
 What's wrong with regexes?
They don't handle recursion.
Neither do plain-old strings. But regexes will get you farther than plain strings before needing to resort to customized lexing. But I'm a big data-driven fan anyway. If you're not, then I can see why it wouldn't seem as appealing as it does to me.

In any case, if I have a chance I might see about adapting my Goldie ( www.dsource.org/projects/goldie ) library to more Phobos-friendly requirements. It's already a fully-usable lexer/parser (and the lexer/parser parts can be used independently), with a complete grammar description language, and I already have misc related tools written. And it's mostly working on D2 already (just need the next DMD because it has a fix for a bug that's a breaker for one of the tools). So if I can get it into a state more suitable for Phobos then that might end up putting things ahead of where they would be if someone just started from scratch. The initial versions might not be completely Phobos-ified, but it could definitely get there (especially if I had some guidance from people with more Phobos2 experience than me).

Would Walter & co be interested in this? If not, I won't bother, but if so, then I may give it a shot.
And FWIW, I was already thinking about making some improvements to Goldie's API anyway.
But that's all if you want generalized lexing or parsing though. If you just want "lexing D code"/"parsing D code", then IMO anything other than adapting parts of DDMD would be the wrong way to go.
Oct 23 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If you just 
 want "lexing D code"/"parsing D code", then IMO anything other than adapting 
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing? Bye, bearophile
Oct 23 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:ia0410$1lju$1 digitalmars.com...
 Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If you 
 just
 want "lexing D code"/"parsing D code", then IMO anything other than 
 adapting
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing?
I'd certainly hope so. If it isn't, then that would probably mean DMD's FE license is incompatible with Phobos. Which would be rather...weird. In any case, I asked that and a couple other Q's here, but haven't gotten an answer yet: http://www.dsource.org/forums/viewtopic.php?t=5627
Oct 23 2010
next sibling parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 24 Oct 2010 06:55:22 +0400, Nick Sabalausky <a a.a> wrote:

 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:ia0410$1lju$1 digitalmars.com...
 Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If you
 just
 want "lexing D code"/"parsing D code", then IMO anything other than
 adapting
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing?
I'd certainly hope so. If it isn't, then that would probably mean DMD's FE license is incompatible with Phobos. Which would be rather...weird. In any case, I asked that and a couple other Q's here, but haven't gotten an answer yet: http://www.dsource.org/forums/viewtopic.php?t=5627
Sorry, I wasn't checking the forum. IIRC DMD license is GPL so DDMD must be GPL too but I'm all for relicensing it as Boost.
Oct 24 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Denis Koroskin" <2korden gmail.com> wrote in message 
news:op.vk2na9bpo7cclz korden-pc...
 On Sun, 24 Oct 2010 06:55:22 +0400, Nick Sabalausky <a a.a> wrote:

 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:ia0410$1lju$1 digitalmars.com...
 Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If you
 just
 want "lexing D code"/"parsing D code", then IMO anything other than
 adapting
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing?
I'd certainly hope so. If it isn't, then that would probably mean DMD's FE license is incompatible with Phobos. Which would be rather...weird. In any case, I asked that and a couple other Q's here, but haven't gotten an answer yet: http://www.dsource.org/forums/viewtopic.php?t=5627
Sorry, I wasn't checking the forum. IIRC DMD license is GPL so DDMD must be GPL too but I'm all for relicensing it as Boost.
According to a random file I picked out of trunk, it's dual-licensed with GPL (not sure which version) and Artistic (also not sure which version) http://www.dsource.org/projects/dmd/browser/trunk/src/access.c
Oct 24 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:ia0v9p$11p$1 digitalmars.com...
 "Denis Koroskin" <2korden gmail.com> wrote in message 
 news:op.vk2na9bpo7cclz korden-pc...
 On Sun, 24 Oct 2010 06:55:22 +0400, Nick Sabalausky <a a.a> wrote:

 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:ia0410$1lju$1 digitalmars.com...
 Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If 
 you
 just
 want "lexing D code"/"parsing D code", then IMO anything other than
 adapting
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing?
I'd certainly hope so. If it isn't, then that would probably mean DMD's FE license is incompatible with Phobos. Which would be rather...weird. In any case, I asked that and a couple other Q's here, but haven't gotten an answer yet: http://www.dsource.org/forums/viewtopic.php?t=5627
Sorry, I wasn't checking the forum. IIRC DMD license is GPL so DDMD must be GPL too but I'm all for relicensing it as Boost.
According to a random file I picked out of trunk, it's dual-licensed with GPL (not sure which version) and Artistic (also not sure which version) http://www.dsource.org/projects/dmd/browser/trunk/src/access.c
That does surprise me though, since I'm pretty sure Phobos is Boost License. Anyone know why the difference?
Oct 24 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 That does surprise me though, since I'm pretty sure Phobos is Boost License. 
 Anyone know why the difference?
Phobos is Boost licensed to enable maximum usage for any purpose. The dmd front end is GPL licensed in order to ensure it stays open source and to discourage closed source forks.
Oct 24 2010
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2010-10-24 04:55, Nick Sabalausky wrote:
 "bearophile"<bearophileHUGS lycos.com>  wrote in message
 news:ia0410$1lju$1 digitalmars.com...
 Nick Sabalausky:

 But that's all if you want generalized lexing or parsing though. If you
 just
 want "lexing D code"/"parsing D code", then IMO anything other than
 adapting
 parts of DDMD would be the wrong way to go.
Is the DDMD licence compatible with the Phobos one? Is the DDMD author(s) willing?
I'd certainly hope so. If it isn't, then that would probably mean DMD's FE license is incompatible with Phobos. Which would be rather...weird. In any case, I asked that and a couple other Q's here, but haven't gotten an answer yet: http://www.dsource.org/forums/viewtopic.php?t=5627
As Walter wrote in the first post of this thread: "generally follow along with the C++ one so that they can be maintained in tandem" and in another post: "Since the license is mine, I can change the D version to the Boost license, no problem." http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D&artnum=120221 -- /Jacob Carlborg
Oct 24 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 Would Walter & co be interested in this? If not, I won't bother, 
 but if so, then I may give it a shot.
The problem is I never have used parser/lexer generators, so I am not really in a good position to review it.
Oct 23 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia0cfv$22kp$1 digitalmars.com...
 Nick Sabalausky wrote:
 Would Walter & co be interested in this? If not, I won't bother, but if 
 so, then I may give it a shot.
The problem is I never have used parser/lexer generators, so I am not really in a good position to review it.
Understandable. FWIW though, Goldie isn't really a lexer/parser generator per se. Traditional lexer/parser generators like lex/yacc or ANTLR will actually generate the source code for a lexer or parser. Goldie just has a single lexer and parser, both already pre-written. They're just completely data-driven:

Compared to the generators, Goldie's lexer is more like a general regex engine that simultaneously matches against multiple pre-compiled "regexes". By pre-compiled, I mean turned into a DFA - which is currently done by a separate non-source-available tool I didn't write, but I'm going to be writing my own version soon. By "regexes", I mean they're functionally regexes, but they're written in a much easier-to-read syntax than the typical PCRE.

Goldie's parser is really just a rather typical (from what I understand) LALR parser. I don't know how much you know about LALRs, but the parser itself is naturally grammar-independent (at least as described in CS texts). Using an LALR involves converting the grammar completely into a table of states and lookaheads (single-token lookahead; unlike LL, any more than that is never really needed), and then the actual parser is directed entirely by that table (much like how regexes are converted to data, ie DFA, and then processed generically), so it's completely grammar-independent. And, of course, the actual lexer and parser can be optimized/rewritten/whatever with minimal impact on everything else.

If anyone's interested, further details are here(1):
http://www.devincook.com/goldparser/

Goldie does have optional code-generation capabilities, but it's entirely for the sake of providing a better statically-checked API tailored to your grammar (ex: to use D's type system to ensure at compile-time, instead of run-time, that token names are valid and that BNF rules you reference actually exist). It doesn't actually affect the lexer/parser in any non-trivial way.

(1): By that site's terminology, Goldie would technically be a "GOLD Engine", plus some additional tools. But, my current work on Goldie will cut the actual "GOLD Parser Builder" program completely out-of-the-loop (but it will still maintain compatibility with it for anyone who wants to use it).
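To make the "data-driven" part a bit more concrete, here's a rough sketch of the maximal-munch loop that kind of table-driven DFA lexer runs. None of these names are Goldie's; it's just the general "tables in, tokens out" shape:

struct DfaEdge  { dchar lo, hi; int target; }
struct DfaState { DfaEdge[] edges; int acceptId = -1; }  // -1: not an accept state
struct Tok      { int terminalId; string text; }

// One step of lexing: run the DFA from 'pos', remember the last accept
// state seen, and return the longest terminal that matched.
Tok nextToken(const(DfaState)[] dfa, string src, ref size_t pos)
{
    size_t start = pos, lastEnd = pos;
    int state = 0, lastId = -1;

    while (pos < src.length)
    {
        dchar c = src[pos];               // ASCII-only for brevity
        int next = -1;
        foreach (e; dfa[state].edges)
            if (c >= e.lo && c <= e.hi) { next = e.target; break; }
        if (next == -1) break;            // no transition: stop scanning
        state = next;
        ++pos;
        if (dfa[state].acceptId != -1)    // remember the longest match so far
        {
            lastId  = dfa[state].acceptId;
            lastEnd = pos;
        }
    }

    pos = lastEnd;
    return Tok(lastId, src[start .. lastEnd]);   // slice of the source, no copy
}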
Oct 23 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 If anyone's interested, further details are here(1):
 http://www.devincook.com/goldparser/
It looks nice, but in clicking around on FAQ, documentation, getting started, etc., I can't find any example code.
Oct 24 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia0pce$2pbk$1 digitalmars.com...
 Nick Sabalausky wrote:
 If anyone's interested, further details are here(1):
 http://www.devincook.com/goldparser/
It looks nice, but in clicking around on FAQ, documentation, getting started, etc., I can't find any example code.
Well, that's because that program (GOLD Parser Builder) is just a tool that takes in a grammar description file and spits out the lexer/parser DFA/LALR tables. Then you use any GOLD-compatible engine in any language (such as Goldie) to load the DFA/LALR tables and use them to lex/parse. (But again, I'm currently working on code that will do that without having to use GOLD Parser Builder.)

Here's some specific links for Goldie, and keep in mind that 1. I already have it pretty much converted to D2/Phobos in trunk (it used to be D1/Tango), 2. The API is not final and definitely open to suggestions (I have a few ideas already), 3. Any suggestions for improvements to the documentation are, of course, welcome too, 4. Like I've said, in the next official release, using "GOLD Parser Builder" won't actually be required.

Main Goldie Project page:
http://www.dsource.org/projects/goldie

Documentation for latest official release:
http://www.semitwist.com/goldiedocs/current/Docs/

Samples directory in trunk:
http://www.dsource.org/projects/goldie/browser/trunk/src/samples

Slightly old documentation for the samples:
http://www.semitwist.com/goldiedocs/current/Docs/SampleApps/

There's two "calculator" samples. They're the same, but correspond to the two different styles Goldie supports. One, "dynamic", doesn't involve any source-code-generation step and can load and use any arbitrary grammar at runtime (neat usages of this are shown in the "ParseAnything" sample and in the "Parse" tool http://www.semitwist.com/goldiedocs/current/Docs/Tools/Parse/ ). The other, "static", does involve generating some source code (via a command-line tool), but that gives you an API that's statically-checked against the grammar. The differences and pros/cons between these two styles are explained here (let me know if it's unclear):
http://www.semitwist.com/goldiedocs/current/Docs/APIOver/StatVsDyn/
Oct 24 2010
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
It looks like a solid engine, and a nice tool. Does it belong as part of
Phobos? 
I don't know. What do other D users think?
Oct 24 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
     http://www.semitwist.com/goldiedocs/current/Docs/APIOver/StatVsDyn/
One question I have is how does it compare with Spirit? That would be its main counterpart in the C++ space.
Oct 24 2010
next sibling parent div0 <div0 sourceforge.net> writes:
On 24/10/2010 18:19, Walter Bright wrote:
 Nick Sabalausky wrote:
 http://www.semitwist.com/goldiedocs/current/Docs/APIOver/StatVsDyn/
One question I have is how does it compare with Spirit? That would be its main counterpart in the C++ space.
Spirit is an LL parser, so it's not really suitable for human-edited input, as doing exact error reporting is tricky.

-- 
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Oct 24 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia1ps7$1fq5$2 digitalmars.com...
 Nick Sabalausky wrote:
     http://www.semitwist.com/goldiedocs/current/Docs/APIOver/StatVsDyn/
One question I have is how does it compare with Spirit? That would be its main counterpart in the C++ space.
Can't say I'm really familiar with Spirit. From a brief look over, these are my impressions of the differences:

Spirit: Grammar is embedded into your source code as actual C++ code.
Goldie: Grammar is defined in a domain-specific language.
But either one could probably have a wrapper to work the other way.

Spirit: Uses (abuses?) operator overloading (Although, apparently SpiritD doesn't inherit Spirit's operator overloading: http://www.sstk.co.uk/spiritd.php )
Goldie: Operator overloading isn't really applicable, because of using a DSL.

As they stand, Spirit seems like it could be pretty handy for simple, quick little DSLs, ex, things for which Goldie might seem like overkill. But Goldie's interface could probably be improved to compete pretty well in those cases. OTOH, Goldie's approach (being based on GOLD) has a deliberate separation between grammar and parsing, which has its own benefits; for instance, grammar definitions can be re-used for any purpose.
Oct 24 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 Can't say I'm really familiar with Spirit. From a brief lookover, these are 
 my impressions of the differences:
 
 Spirit: Grammar is embedded into your source code as actual C++ code.
 Goldie: Grammar is defined in a domain-specific language.
 But either one could probably have a wrapper to work the other way.
 
 Spirit: Uses (abuses?) operator overloading (Although, apparently SpiritD 
 doesn't inherit Spirit's operator overloading: 
 http://www.sstk.co.uk/spiritd.php )
 Goldie: Operator overloading isn't really applicable, because of using a 
 DSL.
 
 As they stand, Spirit seems like it could be pretty handy for simple, quick 
 little DSLs, ex, things for which Goldie might seem like overkill. But 
 Goldie's interface could probably be improved to compete pretty well in 
 those cases. OTOH, Goldie's approach (being based on GOLD) has a deliberate 
 separation between grammar and parsing, which has its own benefits; for 
 instance, grammar definitions can be re-used for any purpose.
 
 
Does Goldie have (like Spirit) a set of canned routines for things like numeric literals? Can the D version of Goldie be turned into one file?
Oct 24 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia2duj$2j7e$1 digitalmars.com...
 Nick Sabalausky wrote:
 Can't say I'm really familiar with Spirit. From a brief lookover, these 
 are my impressions of the differences:

 Spirit: Grammar is embedded into your source code as actual C++ code.
 Goldie: Grammar is defined in a domain-specific language.
 But either one could probably have a wrapper to work the other way.

 Spirit: Uses (abuses?) operator overloading (Although, apparently SpiritD 
 doesn't inherit Spirit's operator overloading: 
 http://www.sstk.co.uk/spiritd.php )
 Goldie: Operator overloading isn't really applicable, because of using a 
 DSL.

 As they stand, Spirit seems like it could be pretty handy for simple, 
 quick little DSLs, ex, things for which Goldie might seem like overkill. 
 But Goldie's interface could probably be improved to compete pretty well 
 in those cases. OTOH, Goldie's approach (being based on GOLD) has a 
 deliberate separation between grammar and parsing, which has its own 
 benefits; for instance, grammar definitions can be re-used for any 
 purpose.
Does Goldie have (like Spirit) a set of canned routines for things like numeric literals?
No, but such things can easily be provided in the docs for simple copy-paste. For instance:

DecimalLiteral = {Number} ({Number} | '_')*
HexLiteral = '0' [xX] ({Number} | [ABCDEFabcdef_])+
Identifier = ('_' | {Letter}) ('_' | {AlphaNumeric})*
{StringChar} = {Printable} - ["]
StringLiteral = '"' ({StringChar} | '\' {Printable})* '"'

All one would need to do to use those is copy-paste them into their grammar definition. Some sort of import mechanism could certainly be added though, to allow for selective import of pre-defined things like that.

There are many pre-defined character sets though (and others can be manually-created, of course):
http://www.devincook.com/goldparser/doc/grammars/character-sets.htm
 Can the D version of Goldie be turned into one file?
Assuming just the library and not the included tools (many of which could be provided as part of the library, though), and not counting files generated for the static-style, then yes, but it would probably be a bit long.
Oct 24 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 Does Goldie have (like Spirit) a set of canned routines for things like 
 numeric literals?
No, but such things can easily be provided in the docs for simple copy-paste. For instance:

DecimalLiteral = {Number} ({Number} | '_')*
HexLiteral = '0' [xX] ({Number} | [ABCDEFabcdef_])+
Identifier = ('_' | {Letter}) ('_' | {AlphaNumeric})*
{StringChar} = {Printable} - ["]
StringLiteral = '"' ({StringChar} | '\' {Printable})* '"'

All one would need to do to use those is copy-paste them into their grammar definition. Some sort of import mechanism could certainly be added though, to allow for selective import of pre-defined things like that.
In the regexp code, I provided special regexes for email addresses and URLs. Those are hard to get right, so it's a large convenience to provide them.

Also, many literals can be fairly complex, and evaluating them can produce errors (such as integer overflow in the numeric literals). Having canned ones makes it much quicker for a user to get going.

I'm guessing that a numeric literal is returned as a string. Is this string allocated on the heap? If so, it's a performance problem. Storage allocation costs figure large when trying to lex millions of lines.
 There are many pre-defined character sets though (and others can be 
 manually-created, of course): 
 http://www.devincook.com/goldparser/doc/grammars/character-sets.htm
 
 Can the D version of Goldie be turned into one file?
Assuming just the library and not the included tools (many of which could be provided as part of the library, though), and not counting files generated for the static-style, then yes, but it would probably be a bit long.
Long files aren't a problem. That's why we have .di files! I worry more about clutter.
Oct 24 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia34up$ldb$1 digitalmars.com...
 In the regexp code, I provided special regexes for email addresses and 
 URLs. Those are hard to get right, so it's a large convenience to provide 
 them.

 Also, many literals can be fairly complex, and evaluating them can produce 
 errors (such as integer overflow in the numeric literals). Having canned 
 ones makes it much quicker for a user to get going.
I'm not sure what exactly you're suggesting in these two paragraphs? (Or just commenting?)
 I'm guessing that a numeric literal is returned as a string. Is this 
 string allocated on the heap? If so, it's a performance problem. Storage 
 allocation costs figure large when trying to lex millions of lines.
Good point. I've just checked and there is allocation going on for each terminal lexed. But thanks to D's awesomeness, I can easily fix that to just use a slice of the original source string. I'll do that...
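For the record, the change amounts to something like this (hypothetical names, not Goldie's actual declarations):

struct Token
{
    string text;   // just a slice into the original source -- no copy
    int    id;
}

// Before: Token(source[start .. end].idup, id)  -- one heap copy per token
// After:  Token(source[start .. end], id)       -- pointer + length, no allocation
Token makeToken(string source, size_t start, size_t end, int id)
{
    return Token(source[start .. end], id);
}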
 Long files aren't a problem. That's why we have .di files! I worry more 
 about clutter.
I really find long files to be a pain to read and edit. It would be nice if modules with a lot of code could be broken down as appropriate for their maintainers without having to bother the users with the "module blah.all" workaround (which Goldie currently uses, but I realize isn't normal Phobos style). AIUI, .di files don't really solve that.

There is one other minor related issue, though. One of my big principles for Goldie is flexibility. So in addition to the basic API that most people would use, I like to expose lower-level APIs for people who might want to sidestep certain parts of Goldie, or provide other less-typical but potentially useful things. But such things shouldn't be automatically imported for typical users, so that sort of stuff would be best left to a separate-but-related module.

Maybe it's just too late over here for me, but can you be more specific on "clutter"? Do you mean like API clutter?
Oct 24 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:ia34up$ldb$1 digitalmars.com...
 In the regexp code, I provided special regexes for email addresses and 
 URLs. Those are hard to get right, so it's a large convenience to provide 
 them.

 Also, many literals can be fairly complex, and evaluating them can produce 
 errors (such as integer overflow in the numeric literals). Having canned 
 ones makes it much quicker for a user to get going.
I'm not sure what exactly you're suggesting in these two paragraphs? (Or just commenting?)
Does Goldie's lexer not convert numeric literals to integer values?
 I'm guessing that a numeric literal is returned as a string. Is this 
 string allocated on the heap? If so, it's a performance problem. Storage 
 allocation costs figure large when trying to lex millions of lines.
Good point. I've just checked and there is allocation going on for each terminal lexed. But thanks to D's awesomeness, I can easily fix that to just use a slice of the original source string. I'll do that...
Are all tokens returned as strings?
 Long files aren't a problem. That's why we have .di files! I worry more 
 about clutter.
I really find long files to be a pain to read and edit. It would be nice if modules with a lot of code could be broken down as appropriate for their maintainers without having to bother the users with the "module blah.all" workaround (which Goldie currently uses, but I realize isn't normal Phobos style). AIUI, .di files don't really solve that. There is one other minor related issue, though. One of my big principles for Goldie is flexibility. So in addition to the basic API that most people would use, I like to expose lower-level APIs for people who might want to sidestep certain parts of Goldie, or provide other less-typical but potentially useful things. But such things shouldn't be automatically imported for typical users, so that sort of stuff would be best left to a separate-but-related module.
If I may suggest, leave the low level stuff out of the api until demand for it justifies it. It's hard to predict just what will be useful, so I suggest conservatism rather than kitchen sink. It can always be added later, but it's really hard to remove.
 Maybe it's just too late over here for me, but can you be more specific on 
 "clutter"? Do you mean like API clutter?
That too, but I meant a clutter of files. Long files aren't a problem with D.
Oct 25 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia3c3r$14k8$1 digitalmars.com...
 Does Goldie's lexer not convert numeric literals to integer values?

 Are all tokens returned as strings?
Goldie's lexer (and parser) are based on the GOLD system ( http://www.devincook.com/goldparser/ ) which is deliberately independent of both grammar and implementation language. As such, it doesn't know anything about what the specific terminals actually represent (There are 4 exceptions though: Comment tokens, Whitespace tokens, an "Error" token (ie, for lex errors), and the EOF token.) So the lexed data is always represented as a string.

Although, the lexer actually returns an array of "class Token" ( http://www.semitwist.com/goldiedocs/current/Docs/APIRef/Token/#Token ). To get the original data that got lexed or parsed into that token, you call "toString()". (BTW, there are currently different "modes" of "toString()" for non-terminals, but I'm considering just ripping them all out and replacing them with a single "return a slice from the start of the first terminal to the end of the last terminal" - unless you think it would be useful to get a representation of the non-terminal's original data sans comments/whitespace, or with comments/whitespace converted to a single space.)

I'm not sure that calling "to!whatever(token.toString())" is really all that much of a problem for user code.
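i.e., something along these lines (the Token class below is just a stand-in for Goldie's, to show the call):

import std.conv : to;

class Token
{
    private string data;
    this(string data) { this.data = data; }
    override string toString() { return data; }
}

void main()
{
    auto tok = new Token("12345");          // what the lexer would hand back
    auto val = to!int(tok.toString());      // user code converts when it cares
    assert(val == 12345);
}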
 If I may suggest, leave the low level stuff out of the api until demand 
 for it justifies it. It's hard to predict just what will be useful, so I 
 suggest conservatism rather than kitchen sink. It can always be added 
 later, but it's really hard to remove.
That may be a good idea.
 That too, but I meant a clutter of files. Long files aren't a problem with 
 D.
Well, again, it may not be a problem with DMD, but I really think reading/editing a long file is a pain regardless of language. Maybe we just have different ideas of "long file"? To put it into numbers: At the moment, Goldie's library (not counting tools and the optional generated "static-mode" files) is about 3200 lines, including comment/blank lines. That size would be pretty unwieldy to maintain as a single source file, particularly since Goldie has a natural internal organization.

Personally, I'd much rather have a clutter of source files than a cluttered source file. (But of course, I don't go to Java extremes and put *every* tiny little thing in a separate file.) As long as the complexity of having multiple files isn't passed along to user code (hence the frequent "module foo.all" idiom), then I can't say I really see a problem.
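(For anyone unfamiliar, the "module foo.all" idiom is nothing more than a file of public imports -- the submodule names below are just for illustration:)

// goldie/all.d
module goldie.all;

public import goldie.token;
public import goldie.lexer;
public import goldie.parser;

// User code then only ever needs:  import goldie.all;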
Oct 25 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:ia3c3r$14k8$1 digitalmars.com...
 Does Goldie's lexer not convert numeric literals to integer values?

 Are all tokens returned as strings?
Goldie's lexer (and parser) are based on the GOLD system ( http://www.devincook.com/goldparser/ ) which is deliberately independent of both grammar and implementation language. As such, it doesn't know anything about what the specific terminals actually represent (There are 4 exceptions though: Comment tokens, Whitespace tokens, an "Error" token (ie, for lex errors), and the EOF token.) So the lexed data is always represented as a string. Although, the lexer actually returns an array of "class Token" ( http://www.semitwist.com/goldiedocs/current/Docs/APIRef/Token/#Token ). To get the original data that got lexed or parsed into that token, you call "toString()". (BTW, there are currently different "modes" of "toString()" for non-terminals, but I'm considering just ripping them all out and replacing them with a single "return a slice from the start of the first terminal to the end of the last terminal" - unless you think it would be useful to get a representation of the non-terminal's original data sans comments/whitespace, or with comments/whitespace converted to a single space.) I'm not sure that calling "to!whatever(token.toString())" is really all that much of a problem for user code.
Consider a string literal, say "abc\"def". With Goldie's method, I infer this string has to be scanned twice. Once to find its limits, and the second to convert it to the actual string. The latter is user code and will have to replicate whatever Goldie did.
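In other words, user code ends up re-walking the slice to undo the escapes -- something like this (assuming, just for the sketch, that only \" and \\ escapes exist):

string unescape(string raw)   // raw still includes the surrounding quotes
{
    string result;
    size_t i = 1;                       // skip the opening quote
    while (i + 1 < raw.length)          // stop before the closing quote
    {
        if (raw[i] == '\\') { result ~= raw[i + 1]; i += 2; }
        else                { result ~= raw[i];     ++i;    }
    }
    return result;
}

unittest
{
    assert(unescape(`"abc\"def"`) == `abc"def`);
}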
 If I may suggest, leave the low level stuff out of the api until demand 
 for it justifies it. It's hard to predict just what will be useful, so I 
 suggest conservatism rather than kitchen sink. It can always be added 
 later, but it's really hard to remove.
That may be a good idea.
What Goldie will be compared against is Spirit. Spirit is a reasonably successful add-on to C++. Goldie doesn't have to do things the same way as Spirit (expression templates - ugh), but it should be as easy to use and at least as powerful.
 That too, but I meant a clutter of files. Long files aren't a problem with 
 D.
Well, again, it may not be a problem with DMD, but I really think reading/editing a long file is a pain regardless of language. Maybe we just have different ideas of "long file"? To put it into numbers: At the moment, Goldie's library (not counting tools and the optional generated "static-mode" files) is about 3200 lines, including comment/blank lines. That size would be pretty unwieldy to maintain as a single source file, particularly since Goldie has a natural internal organization.
Actually, I think 3200 lines is of moderate, not large, size :-)
 Personally, I'd much rather have a clutter of source files than a cluttered 
 source file. (But of course, I don't go to Java extremes and put *every* 
 tiny little thing in a separate file.) As long as the complexity of having 
 multiple files isn't passed along to user code (hence the frequent "module 
 foo.all" idiom), then I can't say I really see a problem.
I tend to just not like having to constantly grep to see which file XXX is in.
Oct 25 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia59si$1r0j$1 digitalmars.com...
 Consider a string literal, say "abc\"def". With Goldie's method, I infer 
 this string has to be scanned twice. Once to find its limits, and the 
 second to convert it to the actual string.
Yea, that is true. With that string in the input, the value given to the user code will be:

assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"});

That's a consequence of the grammar being separated from lexing/parsing implementation. You're right that that does seem less than ideal. Although I'm not sure how to remedy that without losing the independence between grammar and lex/parse implementation that is the main point of the GOLD-based style.

But there's something I don't quite understand about the approach you're suggesting: You seem to be suggesting that a terminal be progressively converted into its final form *as* it's still in the process of being recognized by the DFA. Which means, you don't know *what* you're supposed to be converting it into *while* you're converting it. Which means, you have to be speculatively converting it into all types of tokens that the current DFA state could possibly be on its way towards accepting (also, the DFA would need to contain a record of possible terminals for each DFA state). And then the result is thrown away if it turns out to be a different terminal. Is this correct? If so, is there generally enough lexical difference between the terminals that need such treatment to compensate for the extra processing needed in situations that are closer to worst-case (that is, in comparison to Goldie's current approach)?

If all of that is so, then what would be your thoughts on this approach?:

Suppose Goldie had a way to associate an optional "simultaneous/lockstep conversion" to a type of terminal. For instance:

myLanguage.associateConversion("StringLiteral", new StringLiteralConverter());

Then, 'StringLiteralConverter' would be something that could be either user-provided or offered by Goldie (both ways would be supported). It would be some sort of class or something that had three basic functions:

class StringLiteralConverter : ITerminalConverter
{
    void process(dchar c) {...}

    // Or maybe this to make it possible to minimize allocations
    // in certain circumstances by utilizing slices:
    void process(dchar c, size_t indexIntoSource, string fullOriginalSource) {...}

    Variant emit() {...}
    void clear() {...}
}

Each state in the lexer's DFA would know which terminals it could possibly be processing. And for each of those terminals that has an associated converter, the lexer will call 'process()'. If a terminal is accepted, 'emit' is called to get the final result (and maybe do any needed finalization first), and then 'clear' is called on all converters that had been used.

This feature would preclude the use of the actual "GOLD Parser Builder" program, but since I'm writing a tool to handle that functionality anyway, I'm not too concerned about that.

Do you think that would work? Would its benefits be killed by the overhead introduced? If so, could those overheads be sufficiently reduced without scrapping the general idea?
 What Goldie will be compared against is Spirit. Spirit is a reasonably 
 successful add-on to C++. Goldie doesn't have to do things the same way as 
 Spirit (expression templates - ugh), but it should be as easy to use and 
 at least as powerful.
Understood.
 Personally, I'd much rather have a clutter of source files than a 
 cluttered source file. (But of course, I don't go to Java extremes and 
 put *every* tiny little thing in a separate file.) As long as the 
 complexity of having multiple files isn't passed along to user code 
 (hence the frequent "module foo.all" idiom), then I can't say I really 
 see a problem.
I tend to just not like having to constantly grep to see which file XXX is in.
Diff'rent strokes, I guess. I've only ever had that problem with Tango, which seems to kinda follow from the Java-STD-lib school of API design (no offense intended, Tango guys). But if I'm working on something that involves different sections of a codebase, which is very frequent, then I find it to be quite a pain to constantly scroll all around instead of just Ctrl-Tabbing between open files in different tabs.
Oct 25 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:ia59si$1r0j$1 digitalmars.com...
 Consider a string literal, say "abc\"def". With Goldie's method, I infer 
 this string has to be scanned twice. Once to find its limits, and the 
 second to convert it to the actual string.
Yea, that is true. With that string in the input, the value given to the user code will be: assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"}); That's a consequence of the grammar being separated from lexing/parsing implementation. You're right that that does seem less than ideal. Although I'm not sure how to remedy that without loosing the independence between grammar and lex/parse implementation that is the main point of the GOLD-based style. But there's something I don't quite understand about the approach you're suggesting: You seem to be suggesting that a terminal be progressively converted into its final form *as* it's still in the process of being recognized by the DFA. Which means, you don't know *what* you're supposed to be converting it into *while* you're converting it. Which means, you have to be speculatively converting it into all types of tokens that the current DFA state could possibly be on its way towards accepting (also, the DFA would need to contain a record of possible terminals for each DFA state). And then the result is thrown away if it turns out to be a different terminal. Is this correct? If so, is there generally enough lexical difference between the terminals that need such treatment to compensate for the extra processing needed in situations that are closer to worst-case (that is, in comparison to Goldie's current approach)?
Probably that's why I don't use lexer generators. Building lexers is the simplest part of building a compiler, and I've always been motivated by trying to make it as fast as possible. To specifically answer your question, yes, in the lexers I make, you know you're parsing a string, so you process it as you parse it.
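Roughly along these lines -- not dmd's actual lexer, just the shape of building the value during the one and only scan:

struct StrTok { string value; size_t consumed; }

StrTok lexString(string src)     // src begins at the opening quote
{
    assert(src.length && src[0] == '"');
    char[] value;
    size_t i = 1;
    while (i < src.length && src[i] != '"')
    {
        if (src[i] == '\\' && i + 1 < src.length)
        {
            value ~= src[i + 1];    // a real lexer decodes \n, \t, \xNN, ... here
            i += 2;
        }
        else
            value ~= src[i++];
    }
    // i is at the closing quote; the value is already usable, no second pass
    return StrTok(cast(string) value, i + 1);
}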
 If all of that is so, then what would be your thoughts on this approach?:
 
 Suppose Goldie had a way to associate an optional "simultaneous/lockstep 
 conversion" to a type of terminal. For instance:
 
 myLanguage.associateConversion("StringLiteral", new 
 StringLiteralConverter());
 
 Then, 'StringLiteralConverter' would be something that could be either 
 user-provided or offered by Goldie (both ways would be supported). It would 
 be some sort of class or something that had three basic functions:
 
 class StringLiteralConverter : ITerminalConverter
 {
     void process(dchar c) {...}
 
     // Or maybe this to make it possible to minimize allocations
     // in certain circumstances by utilizing slices:
     void process(dchar c, size_t indexIntoSource, string fullOriginalSource) 
 {...}
 
     Variant emit() {...}
     void clear() {...}
 }
 
 Each state in the lexer's DFA would know which terminals it could possibly 
 be processing. And for each of those terminals that has an associated 
 converter, the lexer will call 'process()'. If a terminal is accepted, 
 'emit' is called to get the final result (and maybe do any needed 
 finalization first), and then 'clear' is called on all converters that had 
 been used.
 
 This feature would preclude the use of the actual "GOLD Parser Builder" 
 program, but since I'm writing a tool to handle that functionality anyway, 
 I'm not too concerned about that.
 
 Do you think that would work? Would its benefits be killed by the overhead 
 introduced? If so, could those overheads be sufficiently reduced without 
 scrapping the general idea?
I don't know. I'd have to study the issue for a while. I suggest taking a look at dmd's lexer and comparing. I'm not sure what Spirit's approach to this is.
 What Goldie will be compared against is Spirit. Spirit is a reasonably 
 successful add-on to C++. Goldie doesn't have to do things the same way as 
 Spirit (expression templates - ugh), but it should be as easy to use and 
 at least as powerful.
Understood.
Oct 25 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia5j41$2bnk$1 digitalmars.com...
 To specifically answer your question, yes, in the lexers I make, you know 
 you're parsing a string, so you process it as you parse it.

...

 I don't know. I'd have to study the issue for a while. I suggest taking a 
 look at dmd's lexer and compare. I'm not sure what Spirit's approach to 
 this is.
I've taken a deeper look at Spirit's docs: In the older Spirit 1.x, the lexing is handled as part of the parsing. The structure of it definitely suggests it should be easy for it to do all token-conversion right as the string is being lexed, although I couldn't tell whether or not it actually did so (I'd have to look at the source). But, since Spirit 1.x doesn't handle lexing separately from parsing, I *think* backtracking (it *is* a backtracking parser) results in re-lexing, even for terminals that never get special processing, such as keywords (But I'm not completely certain because I don't have much experience with LL).

In Spirit 2.x, standard usage involves having the lexing separate from parsing. I didn't see anything at all in the docs for Spirit 2.x that seemed to suggest even the possibility of it processing tokens as they're lexed. However, Spirit is designed with heavy policy-based customizability in mind, so such a thing might still be possible in Spirit 2.x...But if so, it's definitely an advanced feature (or just really poorly documented).

I have thought of another way to get such an ability into Goldie, and it would be very easy-to-use, but it would also be fairly non-trivial to implement. And really, I'm starting to question again how important it would *really* be, at least initially. When I think of typical code, usually only a small amount of it is made up of the sorts of terminals that would need extra processing.

I have to admit, I still have no idea whether or not it would be worth it to get Goldie into Phobos. Maybe, maybe not, I dunno. I think popular opinion would probably be the best gauge of that. It seems like we're the only ones still in this thread, though...maybe that's a bad sign? ;)

I do still think that if your primary goal is to provide parsing of D code through Phobos, then adapting DDMD would be the best bet. Goldie would be more appropriate if customized lexing/parsing is the goal.
Oct 26 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Nick Sabalausky:

 I've taken a deeper look at Spirit's docs:
I have not used Spirit, but from what I have read, it doesn't scale (the compilation becomes too much slower when the system you have built becomes bigger). Bye, bearophile
Oct 26 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:ia6a0h$nst$1 digitalmars.com...
 Nick Sabalausky:

 I've taken a deeper look at Spirit's docs:
I have not used Spirit, but from what I have read, it doesn't scale (the compilation becomes too much slower when the system you have built becomes bigger).
I think that's just because it's C++ though. I'd bet a D lib that worked the same way would probably do a lot better.

In any case, I started writing a comparison of the main fundamental differences between Spirit and Goldie, and it ended up kinda rambling and not so just-the-main-fundamentals. But the gist was: Spirit is very flexible in how grammars are defined and processed, and Goldie is very flexible in what you can do with a given grammar once it's written (ie, how much mileage you can get out of it without changing one line of grammar and without designing it from the start to be flexible). Goldie does get some of that "flexibility in what you can do with it" though by tossing in some features and some limitations/requirements that Spirit leaves as "if you want it, put it in yourself (more or less manually), otherwise you don't pay a price for it."

I think both approaches have their merits. Although I haven't a clue which is best for Phobos, or if Phobos even needs either.
Oct 26 2010
prev sibling parent reply Leandro Lucarella <luca llucax.com.ar> writes:
bearophile, el 26 de octubre a las 06:20 me escribiste:
 Nick Sabalausky:
 
 I've taken a deeper look at Spirit's docs:
I have not used Spirit, but from what I have read, it doesn't scale (the compilation becomes too much slower when the system you have built becomes bigger).
I can confirm that, at least for Spirit 1: for simple things it looks "nice" (on the C++ scale), but for real, more complex things the resulting code is really a mess.

-- 
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
A can of diet coke will float in water
While a can of regular coke will sink
Oct 26 2010
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 26.10.2010 15:55, schrieb Leandro Lucarella:
 bearophile, el 26 de octubre a las 06:20 me escribiste:
  Nick Sabalausky:

  >  I've taken a deeper look at Spirit's docs:

  I have not used Spirit, but from what I have read, it doesn't scale
  (the compilation becomes too much slower when the system you have
  built becomes bigger).
I can confirm that, at least for Spirit 1, and for simple things it looks "nice" (in the C++ scale), but for real more complex things, the resulting code is really a mess.
Yupp - Spirit feels right on the integration side, but becomes more and more evil as things get bigger.

A compile-time EBNF-script parser would do better, especially if the EBNF script comes in through a compile-time file include and can be used/developed from the outside in an IDE, like GOLD's parsers. A compile-time parser could then "generate" the stub code like Spirit does, but without being so deep inside the language itself.
Oct 26 2010
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 26.10.2010 16:48, schrieb dennis luehring:
 Am 26.10.2010 15:55, schrieb Leandro Lucarella:
  bearophile, el 26 de octubre a las 06:20 me escribiste:
   Nick Sabalausky:

   >   I've taken a deeper look at Spirit's docs:

   I have not used Spirit, but from what I have read, it doesn't scale
   (the compilation becomes too much slower when the system you have
   built becomes bigger).
I can confirm that, at least for Spirit 1, and for simple things it looks "nice" (in the C++ scale), but for real more complex things, the resulting code is really a mess.
Yupp - Spirit feels right on the integration side, but becomes more and more evil as things get bigger.

A compile-time EBNF-script parser would do better, especially if the EBNF script comes in through a compile-time file include and can be used/developed from the outside in an IDE, like GOLD's parsers. A compile-time parser could then "generate" the stub code like Spirit does, but without being so deep inside the language itself.
That, combined with compile-time features, could do something like what bsn-goldparser does:

http://code.google.com/p/bsn-goldparser/

I think this is all very doable in D.
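A minimal sketch of the compile-time-file-include half of that idea (the grammar file name is made up, and the real table generation is the part left hypothetical):

// Build with -J<dir> so the compiler is allowed to read the file.
enum grammarText = import("mylang.grm");     // the grammar, pulled in at compile time

// Any ordinary function can then chew on it during compilation:
size_t countRules(string g)
{
    size_t n;
    foreach (c; g)
        if (c == '=') ++n;                   // stand-in for real grammar parsing
    return n;
}

enum ruleCount = countRules(grammarText);    // forced to run in CTFE
pragma(msg, "grammar defines ", ruleCount, " rules (give or take)");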
Oct 26 2010
parent "Nick Sabalausky" <a a.a> writes:
"dennis luehring" <dl.soluz gmx.net> wrote in message 
news:ia6s3b$1q90$1 digitalmars.com...
 Am 26.10.2010 16:48, schrieb dennis luehring:
 Am 26.10.2010 15:55, schrieb Leandro Lucarella:

 yupp - Spirit feels right on the integration-side, but becomes more and
 more evil when stuff gets bigger
Goldie (and any GOLD-based system, really) should scale up pretty well. The only possible scaling-up issues would be:

1. Splitting a large grammar across multiple files is not yet supported (and if I do add support for that in Goldie, and I may, then the "GOLD Parser Builder" IDE wouldn't know how to handle it).

2. Mixed/embedded languages (ex: classic-style ASP, or anything that involves a preprocessing step that hasn't already been done) aren't really supported yet. Spirit 1.x should be able to handle that, at least in some cases. I think Spirit 2.x's separation of lexing and parsing may have some trouble with it though.

3. I haven't had a chance to add any sort of character set optimization yet, so grammars that allow a large number of Unicode characters will probably be slow to generate into tables and slow to lex. At least until I get around to taking care of that.

I've never actually used Spirit, but its scaling-up issues do seem to be a fairly fundamental issue with its design (particularly so since it's C++). Although they do say on their site that some C++ compilers can handle Spirit without compile time growing exponentially in relation to grammar complexity.
 a compiletime-ebnf-script parser would do better, especially when
 the ebnf-script comes through compiletime-file-include and can be
 used/developed from outside in an ide like gold parsers
There's one problem with doing things via CTFE that we D folks often overlook: You can't use a build tool like make/rake/scons to detect when that particular data doesn't need to be recomputed and can thus be skipped. (Although it may be possible to manually make it work that way *if* CTFE gains the ability to access the filesystem.)

I'm not opposed to the idea of making Goldie's compiling-a-grammar (ie, "process a grammar into the appropriate tables") ctfe-able, but it does already work in a way that you only need to compile a grammar into tables when you change the grammar (and changing a grammar is needed less frequently in Goldie than in Spirit because in Goldie no processing code is ever embedded into the grammar.)
 that combined with compiletime-features something like the bsn-parse do

 http://code.google.com/p/bsn-goldparser/

 i think this all is very very doable in D
Yea, I was pretty impressed with BSN. I definitely want to do something like that for Goldie, but I have a somewhat different idea in mind: I'm thinking of enhancing the grammar definition language so that all the information on how to construct an AST is right there in the grammar definition itself, and can thus be completely automated by Goldie. This would be in line with GOLD's philosophy and benefits of keeping the grammar definition separate from the processing code. And it would also be a step towards the idea I've had in mind since before Goldie was Goldie of being able to automate (or partially automate) generalized language translation/transformation.
Oct 26 2010
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2010-10-26 04:44, Nick Sabalausky wrote:
 "Walter Bright"<newshound2 digitalmars.com>  wrote in message
 news:ia59si$1r0j$1 digitalmars.com...
 Consider a string literal, say "abc\"def". With Goldie's method, I infer
 this string has to be scanned twice. Once to find its limits, and the
 second to convert it to the actual string.
Yea, that is true. With that string in the input, the value given to the user code will be: assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"}); That's a consequence of the grammar being separated from lexing/parsing implementation. You're right that that does seem less than ideal. Although I'm not sure how to remedy that without loosing the independence between grammar and lex/parse implementation that is the main point of the GOLD-based style. But there's something I don't quite understand about the approach you're suggesting: You seem to be suggesting that a terminal be progressively converted into its final form *as* it's still in the process of being recognized by the DFA. Which means, you don't know *what* you're supposed to be converting it into *while* you're converting it.
I don't have much knowledge in this area but isn't this what a look-ahead is for? Just look ahead (hopefully) one character and decide what to convert to. -- /Jacob Carlborg
Oct 26 2010
prev sibling next sibling parent Tomek Sowiński <just ask.me> writes:
Dnia 22-10-2010 o 21:48:49 Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> napisał(a):

 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Interesting idea. Here's another: D will soon need bindings for CORBA,
 Thrift, etc, so lexers will have to be written all over to grok
 interface files. Perhaps a generic tokenizer which can be parametrized
 with a lexical grammar would bring more ROI, I got a hunch D's templates
 are strong enough to pull this off without any source code generation
 ala JavaCC. The books I read on compilers say tokenization is a solved
 problem, so the theory part on what a good abstraction should be is
 done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
 generator.

 I have in mind the entire implementation of a simple design, but never=
=
 had the time to execute on it. The tokenizer would work like this:

 alias Lexer!(
      "+", "PLUS",
      "-", "MINUS",
      "+=3D", "PLUS_EQ",
      ...
      "if", "IF",
      "else", "ELSE"
      ...
 ) DLexer;
Yes. One remark: native language constructs scale better for a grammar:

enum TokenDef : string
{
    Digit = "[0-9]",
    Letter = "[a-zA-Z_]",
    Identifier = Letter~'('~Letter~'|'~Digit~')',
    ...
    Plus = "+",
    Minus = "-",
    PlusEq = "+=",
    ...
    If = "if",
    Else = "else",
    ...
}

alias Lexer!TokenDef DLexer;

BTW, there's a bug related: http://d.puremagic.com/issues/show_bug.cgi?id=2950
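One nice property of that style: the member names and the patterns are both available at compile time, which is exactly what a hypothetical Lexer!TokenDef would need in order to build its tables without any codegen step. A tiny (and purely illustrative) demonstration of just the introspection, using a trimmed-down TokenDef:

import std.conv   : to;
import std.stdio  : writefln;
import std.traits : EnumMembers;

enum TokenDef : string
{
    Plus   = "+",
    Minus  = "-",
    PlusEq = "+=",
    If     = "if",
    Else   = "else",
}

void main()
{
    // Unrolled at compile time; each member carries both its name and pattern.
    foreach (m; EnumMembers!TokenDef)
        writefln("%-6s -> %s", to!string(m), cast(string) m);
}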
 Such a declaration generates numeric values DLexer.PLUS etc. and
 generates an efficient code that extracts a stream of tokens from a
 stream of text. Each token in the token stream has the ID and the text.

All good ideas.

 Comments, strings etc. can be handled in one of several ways but that's
 a longer discussion.

The discussion's started anyhow. So what're the options?

-- 
Tomek
Oct 22 2010
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>
 napisał(a):

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
Agreed, of all the things desired for D, a D tokenizer would rank pretty low I think. Another thing, even though a tokenizer generator would be much more desirable, I wonder if it is wise to have that in the standard library? It does not seem to be of wide enough interest to be in a standard library. (Out of curiosity, how many languages have such a thing in their standard library?) -- Bruno Medeiros - Software Engineer
Nov 19 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday 19 November 2010 13:03:53 Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>
=20
 napisa=C5=82(a):
 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?

 Interesting idea. Here's another: D will soon need bindings for CORBA,
 Thrift, etc, so lexers will have to be written all over to grok
 interface files. Perhaps a generic tokenizer which can be parametrized
 with a lexical grammar would bring more ROI, I got a hunch D's templates
 are strong enough to pull this off without any source code generation
 ala JavaCC. The books I read on compilers say tokenization is a solved
 problem, so the theory part on what a good abstraction should be is
 done. What you think?

 Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer
 generator.

 Agreed, of all the things desired for D, a D tokenizer would rank pretty
 low I think.

 Another thing, even though a tokenizer generator would be much more
 desirable, I wonder if it is wise to have that in the standard library?
 It does not seem to be of wide enough interest to be in a standard
 library. (Out of curiosity, how many languages have such a thing in
 their standard library?)

We want to make it easy for tools to be built to work on and deal with D
code. An IDE, for example, needs to be able to tokenize and parse D code. A
program like lint needs to be able to tokenize and parse D code. By providing
a lexer and parser in the standard library, we are making it far easier for
such tools to be written, and they could be of major benefit to the D
community. Sure, the average program won't need to lex or parse D, but some
will, and making it easy to do will make it a lot easier for such programs to
be written.

- Jonathan M Davis
Nov 19 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 21:27, Jonathan M Davis wrote:
 On Friday 19 November 2010 13:03:53 Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright<newshound2 digitalmars.com>

 napisał(a):
 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
Agreed, of all the things desired for D, a D tokenizer would rank pretty low I think. Another thing, even though a tokenizer generator would be much more desirable, I wonder if it is wise to have that in the standard library? It does not seem to be of wide enough interest to be in a standard library. (Out of curiosity, how many languages have such a thing in their standard library?)
We want to make it easy for tools to be built to work on and deal with D code. An IDE, for example, needs to be able to tokenize and parse D code. A program like lint needs to be able to tokenize and parse D code. By providing a lexer and parser in the standard library, we are making it far easier for such tools to be written, and they could be of major benefit to the D community. Sure, the average program won't need to lex or parse D, but some will, and making it easy to do will make it a lot easier for such programs to be written. - Jonathan M Davis
And by providing a lexer and a parser outside the standard library, wouldn't it make it just as easy for those tools to be written? What's the advantage of being in the standard library? I see only disadvantages: to begin with it potentially increases the time that Walter or other Phobos contributors may have to spend on it, even if it's just reviewing patches or making sure the code works. -- Bruno Medeiros - Software Engineer
Nov 19 2010
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, November 19, 2010 13:53:12 Bruno Medeiros wrote:
 On 19/11/2010 21:27, Jonathan M Davis wrote:
 On Friday 19 November 2010 13:03:53 Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>

 napisał(a):
 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover
 and continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
 Interesting idea. Here's another: D will soon need bindings for CORBA,
 Thrift, etc, so lexers will have to be written all over to grok
 interface files. Perhaps a generic tokenizer which can be parametrized
 with a lexical grammar would bring more ROI, I got a hunch D's
 templates are strong enough to pull this off without any source code
 generation ala JavaCC. The books I read on compilers say tokenization
 is a solved problem, so the theory part on what a good abstraction
 should be is done. What you think?

 Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer
 generator.

 Agreed, of all the things desired for D, a D tokenizer would rank pretty
 low I think.

 Another thing, even though a tokenizer generator would be much more
 desirable, I wonder if it is wise to have that in the standard library?
 It does not seem to be of wide enough interest to be in a standard
 library. (Out of curiosity, how many languages have such a thing in
 their standard library?)

 We want to make it easy for tools to be built to work on and deal with D
 code. An IDE, for example, needs to be able to tokenize and parse D code.
 A program like lint needs to be able to tokenize and parse D code. By
 providing a lexer and parser in the standard library, we are making it
 far easier for such tools to be written, and they could be of major
 benefit to the D community. Sure, the average program won't need to lex
 or parse D, but some will, and making it easy to do will make it a lot
 easier for such programs to be written.

 - Jonathan M Davis

 And by providing a lexer and a parser outside the standard library,
 wouldn't it make it just as easy for those tools to be written? What's
 the advantage of being in the standard library? I see only
 disadvantages: to begin with it potentially increases the time that
 Walter or other Phobos contributors may have to spend on it, even if
 it's just reviewing patches or making sure the code works.
If nothing else, it makes it easier to keep in line with dmd itself. Since the dmd front end is LGPL, it's not possible to have a Boost port of it (like the Phobos version will be) without Walter's consent. And I'd be surprised if he did that for a third party library (though he seems to be pretty open on a lot of that kind of stuff). Not to mention, Walter and the core developers are _exactly_ the kind of people that you want working on a lexer or parser of the language itself, because they're the ones who work on it.

- Jonathan M Davis
Nov 19 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 22:02, Jonathan M Davis wrote:
 On Friday, November 19, 2010 13:53:12 Bruno Medeiros wrote:
 On 19/11/2010 21:27, Jonathan M Davis wrote:

 And by providing a lexer and a parser outside the standard library,
 wouldn't it make it just as easy for those tools to be written? What's
 the advantage of being in the standard library? I see only
 disadvantages: to begin with it potentially increases the time that
 Walter or other Phobos contributors may have to spend on it, even if
 it's just reviewing patches or making sure the code works.
If nothing, else, it makes it easier to keep in line with dmd itself. Since the dmd front end is LGPL, it's not possible to have a Boost port of it (like the Phobos version will be) without Walter's consent. And I'd be surprised if he did that for a third party library (though he seems to be pretty open on a lot of that kind of stuff). Not to mention, Walter and the core developers are _exactly_ the kind of people that you want working on a lexer or parser of the language itself, because they're the ones who work on it. - Jonathan M Davis
Eh? That license argument doesn't make sense: if the lexer and parser were to be based on DMD itself, then putting it in the standard library is equivalent (in licensing terms) to licensing the lexer and parser parts of DMD in Boost. More correctly, what I mean by equivalent is that there is no reason why Walter would allow one thing and not the other... (because in both cases he would have to issue that license)

As for your second argument, yes, Walter and the core developers would be the most qualified people to work on it, no question about it. But my point is, I don't think Walter and the Phobos core devs should be working on it, because it takes time away from other things that are much more important. Their time is precious.

I think our main point of disagreement is just how important a D lexer and/or parser would be. I think it would be of very low interest, definitely not a "major benefit to the D community". For starters, regarding its use in IDEs: I think we are *ages* away from the point where an IDE based on D only will be able to compete with IDEs based on Eclipse/Visual-Studio/Xcode/etc. I think much sooner we will have a full D compiler written in D than a (competitive) D IDE written in D. We barely have mature GUI libraries from what I understand. (What may be more realistic is an IDE partially written in D, and otherwise based on Eclipse/Visual-Studio/etc., but even so, I think it would be hard to compete with other non-D IDEs.)

--
Bruno Medeiros - Software Engineer
Nov 19 2010
next sibling parent reply Todd VanderVeen <TDVanderVeen gmail.com> writes:
== Quote from Bruno Medeiros (brunodomedeiros+spam com.gmail)'s article
 I think much sooner we will
 have a full D compiler written in D than a (competitive) D IDE written
 in D.
I agree. I do like the suggestion for developing the D grammar in Antlr though and it is something I would be interested in working on. With this in hand, the prospect of adding D support as was done for C++ to Eclipse or Netbeans becomes much more feasible. Has a complete grammar been defined/compiled or is anyone currently working in this direction? Having a robust IDE seems far more important than whether it is written in D itself.
Nov 19 2010
parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 23:45, Todd VanderVeen wrote:
 == Quote from Bruno Medeiros (brunodomedeiros+spam com.gmail)'s article
 I think much sooner we will
 have a full D compiler written in D than a (competitive) D IDE written
 in D.
I agree. I do like the suggestion for developing the D grammar in Antlr though and it is something I would be interested in working on. With this in hand, the prospect of adding D support as was done for C++ to Eclipse or Netbeans becomes much more feasible. Has a complete grammar been defined/compiled or is anyone currently working in this direction? Having a robust IDE seems far more important than whether it is written in D itself.
See the comment I made below, to Michael Stover. ( news://news.digitalmars.com:119/ic71pa$1lev$1 digitalmars.com ) -- Bruno Medeiros - Software Engineer
Nov 19 2010
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, November 19, 2010 15:17:35 Bruno Medeiros wrote:
 On 19/11/2010 22:02, Jonathan M Davis wrote:
 On Friday, November 19, 2010 13:53:12 Bruno Medeiros wrote:
 On 19/11/2010 21:27, Jonathan M Davis wrote:
 
 And by providing a lexer and a parser outside the standard library,
 wouldn't it make it just as easy for those tools to be written? What's
 the advantage of being in the standard library? I see only
 disadvantages: to begin with it potentially increases the time that
 Walter or other Phobos contributors may have to spend on it, even if
 it's just reviewing patches or making sure the code works.
If nothing, else, it makes it easier to keep in line with dmd itself. Since the dmd front end is LGPL, it's not possible to have a Boost port of it (like the Phobos version will be) without Walter's consent. And I'd be surprised if he did that for a third party library (though he seems to be pretty open on a lot of that kind of stuff). Not to mention, Walter and the core developers are _exactly_ the kind of people that you want working on a lexer or parser of the language itself, because they're the ones who work on it. - Jonathan M Davis
Eh? That license argument doesn't make sense: if the lexer and parser were to be based on DMD itself, then putting it in the standard library is equivalent (in licensing terms) to licensing the lexer and parser parts of DMD in Boost. More correctly, what I mean by equivalent, is that there no reason why Walter would allow one thing and not the other... (because on both cases he would have to issue that license)
It's very different to have a D implementation of something - which is based on a C++ version but definitely different in some respects - be under Boost and generally available, and having the C++ implementation be under Boost - particularly when the C++ version covers far more than just a lexer and parser. Someone _could_ port the D code back to C++ and have that portion usable under Boost, but that's a lot more work than just taking the C++ code and using it, and it's only the portions of the compiler which were ported to D that could be re-used that way. And since the Boost code could be used in a commercial product while the LGPL is more restricted, it could make a definite difference.

I'm not a licensing expert, and I'm not an expert on what Walter does and doesn't want done with his code, but he put the compiler front end under the LGPL, not Boost, and he's given his permission to have the lexer alone ported to D and put under the Boost license in the standard library, which is very different from putting the entire front end under Boost. I expect that the parser will follow eventually, but even if it does, that's still not the entire front end. So, the difference in licenses does have a real impact. And no one can take the LGPL C++ code and port it to D - for the standard library or otherwise - without Walter's permission, because it's his copyright on the code.

As for the usefulness of a D lexer and parser, I've already had several programs or functions which I've wanted to write which would require it, and the lack has made them infeasible. For instance, I was considering posting a version of my datetime code without the unit tests in it, so that it would be easier to read the actual code (given the large number of unit tests), but I found that to accurately do that, you need a lexer for D, so I gave up on it for the time being. Having a function which stripped out unnecessary whitespace (and especially newlines) for string mixins would be great (particularly since line numbers get messed up with multi-line string mixins), but that would require a CTFE-able D lexer to work correctly (though you might be able to hack together something which would mostly work), which we don't have. The D lexer won't be CTFE-able initially (though hopefully it will be once the CTFE capabilities of dmd improve), so you still won't be able to do that once the lexer is done, but it is a case where a lexer would be useful.

The standard libraries of languages like Java and C# are huge. It will take time to get there, and we'll need more developers, but I don't think that it really makes sense to not put things in the standard library because it might take more dev time - particularly when a D lexer is the sort of thing that likely won't need much changing once it's done, since it would only need to be changed when the language changed or when a bug with it was found (which would likely equate to a bug in the compiler anyway), so ultimately, the developer cost is likely fairly low. Additionally, Walter thinks that the development costs will be lower to have it in the standard library with an implementation similar to dmd's rather than having it separate. And it's his call. So, it's going to get done. There are several people around here who lament the lack of a D parser in Phobos at least periodically, and I think that it will be good to have an appropriate lexer and parser for D in Phobos. Having other 3rd party stuff - like antlr - is great too, but that's no reason not to put it in the standard library.
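As an aside on the string-mixin point above, here is a naive, CTFE-usable sketch of such a whitespace stripper (the helper name is made up; it is not a lexer and would mangle string literals and comments, which is exactly why a real CTFE-able D lexer is wanted for this job):

// Naive whitespace squeezer, usable in CTFE. NOT a D lexer: it cannot tell
// code from string literals or comments.
string squeezeWhitespace(string code)
{
    string result;
    bool pendingSpace = false;
    foreach (ch; code)
    {
        if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r')
        {
            pendingSpace = result.length > 0;
        }
        else
        {
            if (pendingSpace)
                result ~= ' ';
            result ~= ch;
            pendingSpace = false;
        }
    }
    return result;
}

// Works at compile time, so the result could be fed to a string mixin:
enum squeezed = squeezeWhitespace("int  x =\n    1;");
static assert(squeezed == "int x = 1;");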
I think that we're just going to have to agree to disagree on this one. - Jonathan M Davis
Nov 19 2010
parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 20/11/2010 01:29, Jonathan M Davis wrote:
 On Friday, November 19, 2010 15:17:35 Bruno Medeiros wrote:
 On 19/11/2010 22:02, Jonathan M Davis wrote:
 On Friday, November 19, 2010 13:53:12 Bruno Medeiros wrote:
 On 19/11/2010 21:27, Jonathan M Davis wrote:

 And by providing a lexer and a parser outside the standard library,
 wouldn't it make it just as easy for those tools to be written? What's
 the advantage of being in the standard library? I see only
 disadvantages: to begin with it potentially increases the time that
 Walter or other Phobos contributors may have to spend on it, even if
 it's just reviewing patches or making sure the code works.
If nothing, else, it makes it easier to keep in line with dmd itself. Since the dmd front end is LGPL, it's not possible to have a Boost port of it (like the Phobos version will be) without Walter's consent. And I'd be surprised if he did that for a third party library (though he seems to be pretty open on a lot of that kind of stuff). Not to mention, Walter and the core developers are _exactly_ the kind of people that you want working on a lexer or parser of the language itself, because they're the ones who work on it. - Jonathan M Davis
Eh? That license argument doesn't make sense: if the lexer and parser were to be based on DMD itself, then putting it in the standard library is equivalent (in licensing terms) to licensing the lexer and parser parts of DMD in Boost. More correctly, what I mean by equivalent, is that there no reason why Walter would allow one thing and not the other... (because on both cases he would have to issue that license)
It's very different to have D implementation of something - which is based on a C++ version but definitely different in some respects - be under Boost and generally available, and having the C++ implementation be under Boost - particularly when the C++ version covers far more than just a lexer and parser. Someone _could_ port the D code back to C++ and have that portion useable under Boost, but that's a lot more work than just taking the C++ code and using it, and it's only the portions of the compiler which were ported to D to which could be re-used that way. And since the Boost code could be used in a commercial product while the LGPL is more restricted, it could make a definite difference. I'm not a licensing expert, and I'm not an expert on what Walter does and doesn't want done with his code, but he put the compiler front end under the LGPL, not Boost, and he's given his permission to have the lexer alone ported to D and put under the Boost license in the standard library, which is very different from putting the entire front end under Boost. I expect that the parser will follow eventually, but even if it does, that's still not the entire front end. So, there is a difference in licenses does have a real impact. And no one can take the LGPL C++ code and port it to D - for the standard library or otherwise - without Walter's permission, because its his copyright on the code.
There are some misunderstandings here. First, the DMD front-end is licenced under the GPL, not LGPL. Second, more importantly, it is actually also licensed under the Artistic license, a very permissible license. This is the basis for me stating that almost certainly Walter would not mind licensing the DMD parser and lexer under Boost, as it's actually not that different from the Artistic license.

 are huge. It will take time to get there, and we'll need more developers, but I
Their standard libraries are indeed bigger than Phobos, and yet they have no functionality for lexing/parsing their own languages (or any other for that matter)! -- Bruno Medeiros - Software Engineer
Nov 24 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/19/10 1:03 PM, Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>
 napisał(a):

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
Agreed, of all the things desired for D, a D tokenizer would rank pretty low I think. Another thing, even though a tokenizer generator would be much more desirable, I wonder if it is wise to have that in the standard library? It does not seem to be of wide enough interest to be in a standard library. (Out of curiosity, how many languages have such a thing in their standard library?)
Even C has strtok. Andrei
Nov 19 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 23:39, Andrei Alexandrescu wrote:
 On 11/19/10 1:03 PM, Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>
 napisał(a):

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
Agreed, of all the things desired for D, a D tokenizer would rank pretty low I think. Another thing, even though a tokenizer generator would be much more desirable, I wonder if it is wise to have that in the standard library? It does not seem to be of wide enough interest to be in a standard library. (Out of curiosity, how many languages have such a thing in their standard library?)
Even C has strtok. Andrei
That's just a fancy splitter, I wouldn't call that a proper tokenizer. I meant something that, at the very least, would tokenize based on regular expressions (and have heterogeneous tokens). -- Bruno Medeiros - Software Engineer
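For illustration (a rough sketch using std.regex; not a proposal for the API being discussed here), the difference between splitting and lexing into heterogeneous tokens:

import std.regex;
import std.stdio : writeln;

void main()
{
    auto src = "x1 = 42 + y";

    // strtok-style: just pieces of text, all of the same "kind".
    foreach (piece; splitter(src, regex(`\s+`)))
        writeln("piece: ", piece);

    // Lexer-style: classify each match by which alternative matched,
    // giving heterogeneous tokens (number / identifier / operator).
    auto tok = regex(`(\d+)|([a-zA-Z_]\w*)|([=+])`);
    foreach (m; matchAll(src, tok))
    {
        if (m[1].length)      writeln("number:     ", m.hit);
        else if (m[2].length) writeln("identifier: ", m.hit);
        else                  writeln("operator:   ", m.hit);
    }
}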
Nov 24 2010
parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 24/11/2010 13:30, Bruno Medeiros wrote:
 On 19/11/2010 23:39, Andrei Alexandrescu wrote:
 On 11/19/10 1:03 PM, Bruno Medeiros wrote:
 On 22/10/2010 20:48, Andrei Alexandrescu wrote:
 On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
 Dnia 22-10-2010 o 00:01:21 Walter Bright <newshound2 digitalmars.com>
 napisał(a):

 As we all know, tool support is important for D's success. Making
 tools easier to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax
 highlighting filters, pretty printers, repl, doc generators, static
 analyzers, and even D compilers.

 It should:

 1. support a range interface for its input, and a range interface for
 its output
 2. optionally not generate lexical errors, but just try to recover
 and
 continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be
 maintained in tandem

 It can also serve as the basis for creating a javascript
 implementation that can be embedded into web pages for syntax
 highlighting, and eventually an std.lang.d.parse.

 Anyone want to own this?
Interesting idea. Here's another: D will soon need bindings for CORBA, Thrift, etc, so lexers will have to be written all over to grok interface files. Perhaps a generic tokenizer which can be parametrized with a lexical grammar would bring more ROI, I got a hunch D's templates are strong enough to pull this off without any source code generation ala JavaCC. The books I read on compilers say tokenization is a solved problem, so the theory part on what a good abstraction should be is done. What you think?
Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.
Agreed, of all the things desired for D, a D tokenizer would rank pretty low I think. Another thing, even though a tokenizer generator would be much more desirable, I wonder if it is wise to have that in the standard library? It does not seem to be of wide enough interest to be in a standard library. (Out of curiosity, how many languages have such a thing in their standard library?)
Even C has strtok. Andrei
That's just a fancy splitter, I wouldn't call that a proper tokenizer. I meant something that, at the very least, would tokenize based on regular expressions (and have heterogenous tokens).
In other words, a lexer, that might be a better term in this context. -- Bruno Medeiros - Software Engineer
Nov 24 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 As we all know, tool support is important for D's success. Making tools easier 
 to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
std.lang.d.lex. 
 It would be helpful in writing color syntax highlighting filters, pretty 
 printers, repl, doc generators, static analyzers, and even D compilers.
This is a quite long talk by Steve Yegge that I've just seen (linked from Reddit):
http://vimeo.com/16069687

I don't suggest you to see it all unless you are very interested in that topic. But the most important thing it says is that, given that big software companies use several languages, and programmers often don't want to change their preferred IDE, there is a problem: given N languages and M editors/IDEs, total toolchain effort is N * M. That means N syntax highlighters, N indenters, N refactoring suites, etc. Result: most languages have bad toolchains and most IDEs manage very well only one or very few languages.

So he has suggested the Grok project, that allows to reduce the toolchain effort to N + M. Each language needs to have one of each service: indenter, highlighter, name resolver, refactory, etc. So each IDE may link (using a standard interface provided by Grok) to those services and use them.

Today Grok is not available yet, and its development is at the first stages, but after this talk I think that it may be positive to add to Phobos not just the D lexer, but also other things, even a bit higher level as an indenter, highlighter, name resolver, refactory, etc. Even if they don't use the standard universal interface used by Grok I think they may speed up the development of the D toolchain.

Bye,
bearophile
Oct 23 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 This is a quite long talk by Steve Yegge that I've just seen (linked from
Reddit):
 http://vimeo.com/16069687
Sorry, the Reddit thread: http://www.reddit.com/r/programming/comments/dvd9x/steve_yegge_on_scalable_programming_language/
Oct 23 2010
prev sibling next sibling parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:i9vs3v$142e$1 digitalmars.com...
 Walter:

 As we all know, tool support is important for D's success. Making tools 
 easier
 to build will help with that.

 To that end, I think we need a lexer for the standard library - 
 std.lang.d.lex.
 It would be helpful in writing color syntax highlighting filters, pretty
 printers, repl, doc generators, static analyzers, and even D compilers.
This is a quite long talk by Steve Yegge that I've just seen (linked from Reddit): http://vimeo.com/16069687 I don't suggest you to see it all unless you are very interested in that topic. But the most important thing it says is that, given that big software companies use several languages, and programmers often don't want to change their preferred IDE, there is a problem: given N languages and M editors/IDEs, total toolchain effort is N * M. That means N syntax highlighters, N indenters, N refactoring suites, etc. Result: most languages have bad toolchains and most IDEs manage very well only one or very few languages. So he has suggested the Grok project, that allows to reduce the toolchain effort to N + M. Each language needs to have one of each service: indenter, highlighter, name resolver, refactory, etc. So each IDE may link (using a standard interface provided by Grok) to those services and use them. Today Grok is not available yet, and its development is at the first stages, but after this talk I think that it may be positive to add to Phobos not just the D lexer, but also other things, even a bit higher level as an indenter, highlighter, name resolver, refactory, etc. Even if they don't use the standard universal interface used by Grok I think they may speed up the development of the D toolchain.
I haven't looked at the video, but that sounds like the direction I've had in mind for Goldie.
Oct 23 2010
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 24/10/2010 00:46, bearophile wrote:
 Walter:

 As we all know, tool support is important for D's success. Making tools easier
 to build will help with that.

 To that end, I think we need a lexer for the standard library - std.lang.d.lex.
 It would be helpful in writing color syntax highlighting filters, pretty
 printers, repl, doc generators, static analyzers, and even D compilers.
This is a quite long talk by Steve Yegge that I've just seen (linked from Reddit): http://vimeo.com/16069687 I don't suggest you to see it all unless you are very interested in that topic. But the most important thing it says is that, given that big software companies use several languages, and programmers often don't want to change their preferred IDE, there is a problem: given N languages and M editors/IDEs, total toolchain effort is N * M. That means N syntax highlighters, N indenters, N refactoring suites, etc. Result: most languages have bad toolchains and most IDEs manage very well only one or very few languages. So he has suggested the Grok project, that allows to reduce the toolchain effort to N + M. Each language needs to have one of each service: indenter, highlighter, name resolver, refactory, etc. So each IDE may link (using a standard interface provided by Grok) to those services and use them. Today Grok is not available yet, and its development is at the first stages, but after this talk I think that it may be positive to add to Phobos not just the D lexer, but also other things, even a bit higher level as an indenter, highlighter, name resolver, refactory, etc. Even if they don't use the standard universal interface used by Grok I think they may speed up the development of the D toolchain. Bye, bearophile
Hum, very interesting topic! A few disjoint comments:

(*) I'm glad to see another person, especially one who is "prominent" in the development community (like Andrei), discuss the importance of the toolchain, specifically IDEs, for emerging languages. Or for any language for that matter. At the beginning of the talk I was like "man, this is spot-on, that's what I've said before, I wish Walter would *hear* this"! LOL, imagine my surprise when I found that Walter was in fact *there*! (When I saw the talk I didn't even know this was at NWCPP, otherwise I might have suspected)

(*) I actually thought about some similar ideas before, for example, I thought about the idea of exposing some (if not all) of the functionality of DDT through the command-line (note that Eclipse can run headless, without any UI). And this would not be just semantic/indexer functionality, so for example:
* DDoc generation, like Descent had at some point (http://www.mail-archive.com/digitalmars-d-announce puremagic.com/msg02734.html)
* build functionality - only really interesting if the DDT builder becomes smarter, ie, does more useful stuff than what it does now.
* semantic functionality: find-ref, code completion.

(*) I wished I was at that talk, I would have liked to ask and discuss some things with Steve Yegge, particularly his comments about Eclipse's indexer. I became curious for details about what he thinks is wrong with Eclipse's indexer. Also, I wonder if he's not conflating "CDT's indexer" with "Eclipse indexer", because actually there is no such thing as an "Eclipse indexer". I'm gonna take a better look at the comments for this one.

(*) As for Grok itself, it looks potentially interesting, but I still have only a very vague impression of what it does (let alone *how*).

--
Bruno Medeiros - Software Engineer
Nov 24 2010
parent reply Andrew Wiley <debio264 gmail.com> writes:
 On 24/10/2010 00:46, bearophile wrote:

 Walter:


  As we all know, tool support is important for D's success. Making tools
 easier
 to build will help with that.

 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex.
 It would be helpful in writing color syntax highlighting filters, pretty
 printers, repl, doc generators, static analyzers, and even D compilers.
This is a quite long talk by Steve Yegge that I've just seen (linked from Reddit): http://vimeo.com/16069687 I don't suggest you to see it all unless you are very interested in that topic. But the most important thing it says is that, given that big software companies use several languages, and programmers often don't want to change their preferred IDE, there is a problem: given N languages and M editors/IDEs, total toolchain effort is N * M. That means N syntax highlighters, N indenters, N refactoring suites, etc. Result: most languages have bad toolchains and most IDEs manage very well only one or very few languages. So he has suggested the Grok project, that allows to reduce the toolchain effort to N + M. Each language needs to have one of each service: indenter, highlighter, name resolver, refactory, etc. So each IDE may link (using a standard interface provided by Grok) to those services and use them. Today Grok is not available yet, and its development is at the first stages, but after this talk I think that it may be positive to add to Phobos not just the D lexer, but also other things, even a bit higher level as an indenter, highlighter, name resolver, refactory, etc. Even if they don't use the standard universal interface used by Grok I think they may speed up the development of the D toolchain. Bye, bearophile
From watching this, I'm reminded that in the Scala world, the compiler can
be used in this way. The Eclipse plugin for Scala (and I assume the Netbeans and IDEA plugins work similarly) is really just a wrapper around the compiler because the compiler can be used as a library, allowing a rich IDE with minimal effort because rather than implementing parsing and semantic analysis, the IDE team can just query the compiler's data structures.
Nov 24 2010
parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 24/11/2010 18:48, Andrew Wiley wrote:
     On 24/10/2010 00:46, bearophile wrote:

         Walter:


             As we all know, tool support is important for D's success.
             Making tools easier
             to build will help with that.

             To that end, I think we need a lexer for the standard
             library - std.lang.d.lex.
             It would be helpful in writing color syntax highlighting
             filters, pretty
             printers, repl, doc generators, static analyzers, and even D
             compilers.


         This is a quite long talk by Steve Yegge that I've just seen
         (linked from Reddit):
         http://vimeo.com/16069687

         I don't suggest you to see it all unless you are very interested
         in that topic. But the most important thing it says is that,
         given that big software companies use several languages, and
         programmers often don't want to change their preferred IDE,
         there is a problem: given N languages and M editors/IDEs, total
         toolchain effort is N * M. That means N syntax highlighters, N
         indenters, N refactoring suites, etc. Result: most languages
         have bad toolchains and most IDEs manage very well only one or
         very few languages.

         So he has suggested the Grok project, that allows to reduce the
         toolchain effort to N + M. Each language needs to have one of
         each service: indenter, highlighter, name resolver, refactory,
         etc. So each IDE may link (using a standard interface provided
         by Grok) to those services and use them.

         Today Grok is not available yet, and its development is at the
         first stages, but after this talk I think that it may be
         positive to add to Phobos not just the D lexer, but also other
         things, even a bit higher level as an indenter, highlighter,
         name resolver, refactory, etc. Even if they don't use the
         standard universal interface used by Grok I think they may speed
         up the development of the D toolchain.

         Bye,
         bearophile


  From watching this, I'm reminded that in the Scala world, the compiler
 can be used in this way. The Eclipse plugin for Scala (and I assume the
 Netbeans and IDEA plugins work similarly) is really just a wrapper
 around the compiler because the compiler can be used as a library,
 allowing a rich IDE with minimal effort because rather than implementing
 parsing and semantic analysis, the IDE team can just query the
 compiler's data structures.
Interesting, very wise of them to do that. But not very surprising, Scala is close to the Java world, so they (the Scala people) must have known how important it would be to have the best toolchain possible, in order to compete (with Java, JDT, also Visual Studio, etc.). -- Bruno Medeiros - Software Engineer
Nov 25 2010
prev sibling next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:i9qd8q$1ls4$1 digitalmars.com...
 4. the tokens should be a value type, not a reference type
I'm curious, is your reason for this purely to avoid allocations during lexing, or are there other reasons too? If it's mainly to avoid allocations during lexing then, maybe I've understood wrong, but isn't D2 getting the ability to construct class objects in-place into pre-allocated memory? (or already has the ability?) If so, do you think just creating the tokens that way would likely be close enough?
Oct 26 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:i9qd8q$1ls4$1 digitalmars.com...
 4. the tokens should be a value type, not a reference type
I'm curious, is your reason for this purely to avoid allocations during lexing, or are there other reasons too?
It's one big giant reason. Storage allocation gets unbelievably costly in a lexer. Another is it makes tokens easy to copy. Another one is that classes are for polymorphic behavior. What kind of polymorphic behavior would one want with tokens?
 If it's mainly to avoid allocations during lexing then, maybe I've 
 understood wrong, but isn't D2 getting the ability to construct class 
 objects in-place into pre-allocated memory?
If you do that, might as well make them value types. The only reason classes exist is to support runtime polymorphism. C++ made a vast mistake in failing to distinguish between value types and reference types. Java made a related mistake by failing to acknowledge that value types have any useful purpose at all (unless they are built-in).
Oct 26 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 Java made a related mistake by failing to acknowledge that 
 value types have any useful purpose at all (unless they are built-in).
Java was designed to be simple! Simple means to have a more uniform semantics. Removing value types was a good idea if you want to simplify a language (and remove a mountain of details from C++). And from the huge success of Java, I a more complex than Java). The Java VM also is now often able to allocate not escaping objects on the stack (escape analysis) regaining some of the lost performance. What I miss more in Java is not single structs (single values), but a way to build an array of values (structs). Because using parallel arrays is not nice at all. Even in Python using numPy you may create an array of structs (compound value items). Bye, bearophile
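A tiny D illustration of that array-of-structs point (the names here are made up):

import std.stdio : writeln;

// Array of structs: each element is one compound value, laid out contiguously.
struct Particle { double x, y, velocity; }

void main()
{
    auto particles = new Particle[3];
    particles[1].velocity = 2.5;        // one logical entity, one index

    // The parallel-array alternative spreads one entity over several arrays.
    auto xs = new double[3];
    auto ys = new double[3];
    auto velocities = new double[3];
    velocities[1] = 2.5;                // entity 1 is implicit across arrays

    writeln(particles[1].velocity, " ", velocities[1]);
}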
Oct 26 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Walter:
 
 Java made a related mistake by failing to acknowledge that value types have
 any useful purpose at all (unless they are built-in).
Java was designed to be simple! Simple means to have a more uniform semantics.
So was Pascal. See the thread about how useless it was as a result. A hatchet is a very simple tool, easy to understand, and I could build a house with nothing but a hatchet. But it would make the house several times as expensive to build, and it would look like it was built with a hatchet.
 Removing value types was a good idea if you want to simplify a
 language (and remove a mountain of details from C++). And from the huge


 often able to allocate not escaping objects on the stack (escape analysis)
 regaining some of the lost performance.
The issue isn't just about lost performance. It's about proper encapsulation of a type. Value types and polymorphic types are different, have different purposes, different behaviors, etc. Conflating the two into the same construct makes for poor and confusing abstractions. It shifts the problem out of the language and onto the programmer. It does NOT make the complexity go away.
 What I miss more in Java is not single structs (single values),
There's a lot more to miss than that. I find Java code tends to be excessively complex, and that's because it lacks expressive power. It was summed up for me by a colleague who said that one needs an IDE to program in Java because with one button it will auto-generate 100 lines of boilerplate.
Oct 26 2010
next sibling parent reply retard <re tard.com.invalid> writes:
Tue, 26 Oct 2010 21:39:32 -0700, Walter Bright wrote:

 bearophile wrote:
 Walter:
 
 Java made a related mistake by failing to acknowledge that value types
 have any useful purpose at all (unless they are built-in).
Java was designed to be simple! Simple means to have a more uniform semantics.
So was Pascal. See the thread about how useless it was as a result.
Blablabla.. this nostalgic lesson reminded me, have you even started studying the list of type system concepts I listed few days ago. A new version with links is coming at some point of time.
 What I miss more in Java is not single structs (single values),
There's a lot more to miss than that. I find Java code tends to be excessively complex, and that's because it lacks expressive power.
Adding structs to Java wouldn't fix that. You probably know that. Unifying structs and classes in a language like D and adding good escape analysis wouldn't worsen the performance that badly in general purpose applications. Java is mostly used for general purpose programming so your claims about usefulness and the need for extreme performance look silly.
Oct 27 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
retard wrote:
 have you even started
 studying the list of type system concepts I listed few days ago.
Java has proved that such things aren't useful in programming languages :-)
 Adding structs to Java wouldn't fix that.  You probably know that.
 Unifying structs and classes in a language like D and adding good escape 
 analysis wouldn't worsen the performance that badly in general purpose 
 applications. Java is mostly used for general purpose programming so your 
 claims about usefulness and the need for extreme performance look silly.
If that were true, why are Java char/int/double types value types, not a reference type derived from Object?
Oct 27 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 So was Pascal. See the thread about how useless it was as a result.
But Java is probably currently the most used language, so I guess they have created a simpler language, but not too much simple as Pascal was.
 Value types and polymorphic types are different, have different
purposes, different behaviors, etc. Right.
Conflating the two into the same construct makes for poor and confusing
abstractions.<
In Python there are (more or less) only objects, and they are managed "by name" (similar to "by reference") and it works well enough.
It shifts the problem out of the language and onto the programmer. It does NOT
make the complexity go away.<
This is partially true. The presence of just objects doesn't solve all problems, so part of the complexity doesn't go away, it goes into the program. On the other hand value semantics introduces a large amount of complexity by itself (in C++ there is a huge amount of stuff about this semantics, and even in D the design is unfinished still after ten years and after all the experience with C++). So in my opinion in the end the net result is that removing structs makes the language+programs simpler.
 There's a lot more to miss than that. I find Java code tends to be excessively
 complex, and that's because it lacks expressive power. It was summed up for me
 by a colleague who said that one needs an IDE to program in Java because with
 one button it will auto-generate 100 lines of boilerplate.
Yes, clearly Java has several faults. It's far from perfect. But introducing structs inside Java is in my opinion not going to solve those problems much.
 [...] If that were true, why are Java char/int/double types value types,
 not a reference type derived from Object?
For performance reasons, because originally Java didn't have the advanced compilation strategies used today. Languages like Clojure that run on the JavaVM use more reference types (for integer numbers too). After all this discussion I want to remind you that I am here because I like D and I like D structs, unions and all that :-) I prefer to use D many times over Java. And I agree that structs (or tagged unions) are better in D for the lexer if you want the lexer to be quite fast. Bye, bearophile
Oct 27 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 After all this discussion I want to remind you that I am here because I like
 D and I like D structs, unions and all that :-) I prefer to use D many times
 over Java. And I agree that structs (or tagged unions) are better in D for
 the lexer if you want the lexer to be quite fast.
So, there is "value" in value types after all. I confess I have no idea why you argue against them.
Oct 27 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 So, there is "value" in value types after all. I confess I have no idea why
you 
 argue against them.
I am not arguing against them in absolute. They are good in some situations and not so good in other situations :-) Compound value types are very useful in a certain imperative low-level language. While if you are designing a simpler language, it's better to not add structs to it (and yes, in practice the world needs simpler languages too, not everyone needs a Ferrari in every moment. And I believe that removing structs makes on average simpler the sum of the language+its programs). Bye, bearophile
Oct 27 2010
prev sibling next sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 27/10/2010 05:39, Walter Bright wrote:
 What I miss more in Java is not single structs (single values),
There's a lot more to miss than that. I find Java code tends to be excessively complex, and that's because it lacks expressive power. It was summed up for me by a colleague who said that one needs an IDE to program in Java because with one button it will auto-generate 100 lines of boilerplate.
I've been hearing that a lot, but I find this to be excessively exaggerated. Can you give some concrete examples? Because regarding excessive verbosity in Java, I can only remember three significant things at the moment (at least disregarding meta programming), and one of them is nearly as verbose in D as in Java:

1) writing getters and setters for fields

2) verbose syntax for closures (need to use an anonymous class, outer variables must be final, and wrapped in an array if write access is needed)

3) writing trivial constructors whose parameters mirror the fields, and then the constructors assign the parameters to the fields.

I don't think 1 and 2 happen often enough to be that much of an annoyance. (unless you're one of those Java persons that thinks that directly accessing the public field of another class is a sin, and instead every single field must have getters/setters and never ever be public...)

As an additional note, I don't think having an IDE auto-generate X lines of boilerplate code is necessarily a bad thing. It's only bad if the alternative of having a better language feature would actually save me coding time (whether initial coding, or subsequent modifications) or improve code understanding. _Isn't this what matters?_

--
Bruno Medeiros - Software Engineer
Nov 19 2010
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 27/10/2010 05:39, Walter Bright wrote:
 bearophile wrote:
 Walter:
 Java was designed to be simple! Simple means to have a more uniform
 semantics.
So was Pascal. See the thread about how useless it was as a result.
There's good simple, and there's bad simple... -- Bruno Medeiros - Software Engineer
Nov 19 2010
prev sibling next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:ia8321$vug$1 digitalmars.com...
 Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message 
 news:i9qd8q$1ls4$1 digitalmars.com...
 4. the tokens should be a value type, not a reference type
I'm curious, is your reason for this purely to avoid allocations during lexing, or are there other reasons too?
It's one big giant reason. Storage allocation gets unbelievably costly in a lexer. Another is it makes tokens easy to copy. Another one is that classes are for polymorphic behavior. What kind of polymorphic behavior would one want with tokens?
Honestly, I'm not entirely certain whether or not Goldie actually needs its tokens to be classes instead of structs, but I'll explain my current usage:

In the basic "dynamic" style, every token in Goldie, terminal or nonterminal, is of type "class Token" no matter what its symbol or part of grammar it represents. But Goldie has an optional "static" style which creates a class hierarchy, for the sake of compile-time type safety, with "Token" as the base.

Suppose you have a grammar named "simple" that's a series of one or more a's and b's separated by plus signs:

<Item> ::= 'a' | 'b'
<List> ::= <List> '+' <Item> | <Item>

So there's three terminals: "a", "b", and "+"
And two nonterminals: "<Item>" and "<List>", each with exactly two possible reductions.

So in dynamic-style, all of those are "class Token", and that's all that exists. But with the optional static-style, the following class hierarchy is effectively created:

class Token;
class Token_simple : Token;
class Token_simple!"a" : Token_simple;
class Token_simple!"b" : Token_simple;
class Token_simple!"+" : Token_simple;
class Token_simple!"<Item>" : Token_simple;
class Token_simple!"<List>" : Token_simple;
class Token_simple!("<Item>", "a") : Token_simple!"<Item>";
class Token_simple!("<Item>", "b") : Token_simple!"<Item>";
class Token_simple!("<List>", "<List>", "+", "<Item>") : Token_simple!"<List>";
class Token_simple!("<List>", "<Item>") : Token_simple!"<List>";

So rules inherit from the nonterminal they reduce to; terminals and nonterminals inherit from a dummy class dedicated specifically to the given grammar; and that inherits from plain old dynamic-style "Token". All of those template parameters are validated at compile-time. (At some point I'd like to make it possible to specify the rule-based tokens as something like: Token!("<List> ::= <List> '+' <Item>"), but I haven't gotten to it yet.)

There's one more trick: The plain old Token exposes a member "subX" which can be numerically indexed to obtain the sub-tokens (for terminals, subX.length==0):

void foo(Token token)
{
    if(token.matches("<List>", "<List>", "+", "<Item>"))
    {
        auto leftSide  = token.subX[0];
        auto rightSide = token.subX[2];
        //auto dummy = token.subX[10]; // Run-time error

        static assert(is(typeof(leftSide)  == Token));
        static assert(is(typeof(rightSide) == Token));
    }
}

Note that it's impossible for the "matches" function to verify at compile-time that its arguments are valid.

All of the static-style token types retain all of that for total compatibility with the dynamic-style. But the static-style nonterminals provide an additional member, "sub", that can be used like this:

void foo(Token_simple!("<List>", "<List>", "+", "<Item>") token)
{
    auto leftSide  = token.sub!0;
    auto rightSide = token.sub!2;
    //auto dummy = token.sub!10; // Compile-time error

    static assert(is(typeof(leftSide)  == Token_simple!"<List>"));
    static assert(is(typeof(rightSide) == Token_simple!"<Item>"));
}

As for whether or not this effect can be reasonably accomplished with structs: I have no idea, I haven't really looked into it.
Oct 26 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 As for whether or not this effect can be reasonably accomplished with 
 structs: I have no idea, I haven't really looked into it.
I use a tagged variant for the token struct. This doesn't make any difference if one is parsing small pieces of code. But when you're trying to stuff millions of lines of code down its maw, avoiding an allocation per token is a big deal. The indirect calls to the member functions of a class also perform poorly relative to tagged variants.
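For illustration, here is a minimal sketch of what a tagged-variant token could look like in D. The names (TOK, Token, the payload fields) are invented for this example and are not DMD's actual lexer types; the point is only that a tag plus a union keeps each token a cheap, allocation-free value.

// Illustrative sketch only: a tagged-variant token as a value type.
// TOK, Token, and the payload fields are hypothetical names.
enum TOK { identifier, intLiteral, stringLiteral, plus, eof }

struct Token
{
    TOK kind;        // the tag selecting which payload below is valid
    size_t line;     // source position for diagnostics
    union            // payload, interpreted according to `kind`
    {
        string ident;     // valid when kind == TOK.identifier
        long   intValue;  // valid when kind == TOK.intLiteral
        string strValue;  // valid when kind == TOK.stringLiteral
    }
}

unittest
{
    Token t;
    t.kind = TOK.intLiteral;
    t.intValue = 42;
    Token copy = t;   // plain value copy, no heap allocation per token
    assert(copy.kind == TOK.intLiteral && copy.intValue == 42);
}

Copying, storing in arrays, and yielding tokens from a range all stay allocation-free with this layout, which is the point being made about stuffing millions of lines through a lexer.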
Oct 26 2010
prev sibling parent reply retard <re tard.com.invalid> writes:
Tue, 26 Oct 2010 19:32:44 -0700, Walter Bright wrote:

 Nick Sabalausky wrote:
 "Walter Bright" <newshound2 digitalmars.com> wrote in message
 news:i9qd8q$1ls4$1 digitalmars.com...
 4. the tokens should be a value type, not a reference type
I'm curious, is your reason for this purely to avoid allocations during lexing, or are there other reasons too?
It's one big giant reason. Storage allocation gets unbelievably costly in a lexer. Another is it makes tokens easy to copy. Another one is that classes are for polymorphic behavior. What kind of polymorphic behavior would one want with tokens?
This is why the basic data structure in functional languages, algebraic data types, suits better for this purpose.
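As a rough sketch of what that shape could look like in D (rather than in a functional language), Phobos's std.variant.Algebraic can stand in for a sum type; the payload structs below are invented purely for illustration and are not part of any real lexer.

// Illustrative sketch only: an ADT-flavoured token via std.variant.Algebraic.
// Identifier, IntLiteral, and Plus are hypothetical payload types.
import std.variant;

struct Identifier { string name; }
struct IntLiteral { long value; }
struct Plus       { }

alias Algebraic!(Identifier, IntLiteral, Plus) Token;

unittest
{
    Token t = Token(IntLiteral(42));
    assert(t.peek!IntLiteral !is null);     // holds an IntLiteral...
    assert(t.peek!IntLiteral.value == 42);
    assert(t.peek!Identifier is null);      // ...and not an Identifier
}

Whether this beats a hand-rolled tag-plus-union in a hot lexing loop is a separate question; the sketch only shows the shape of the sum-type approach.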
Oct 27 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
retard wrote:
 This is why the basic data structure in functional languages, algebraic 
 data types, suits better for this purpose.
I think you recently demonstrated otherwise, as proven by the widespread use of Java :-)
Oct 27 2010
parent reply retard <re tard.com.invalid> writes:
Wed, 27 Oct 2010 12:08:19 -0700, Walter Bright wrote:

 retard wrote:
 This is why the basic data structure in functional languages, algebraic
 data types, suits better for this purpose.
I think you recently demonstrated otherwise, as proven by the widespread use of Java :-)
I don't understand your logic -- Widespread use of Java proves that algebraic data types aren't a better suited way for expressing compiler's data structures such as syntax trees?
Oct 27 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
retard wrote:
 Wed, 27 Oct 2010 12:08:19 -0700, Walter Bright wrote:
 
 retard wrote:
 This is why the basic data structure in functional languages, algebraic
 data types, suits better for this purpose.
I think you recently demonstrated otherwise, as proven by the widespread use of Java :-)
I don't understand your logic -- Widespread use of Java proves that algebraic data types aren't a better suited way for expressing compiler's data structures such as syntax trees?
You told me that widespread use of Java proved that nothing more complex than what Java provides is useful: "Java is mostly used for general purpose programming so your claims about usefulness and the need for extreme performance look silly." I'd be surprised if you seriously meant that, as it implies that Java is the pinnacle of computer language design, but I can't resist teasing you about it. :-)
Oct 27 2010
parent reply retard <re tard.com.invalid> writes:
Wed, 27 Oct 2010 13:52:29 -0700, Walter Bright wrote:

 retard wrote:
 Wed, 27 Oct 2010 12:08:19 -0700, Walter Bright wrote:
 
 retard wrote:
 This is why the basic data structure in functional languages,
 algebraic data types, suits better for this purpose.
I think you recently demonstrated otherwise, as proven by the widespread use of Java :-)
I don't understand your logic -- Widespread use of Java proves that algebraic data types aren't a better suited way for expressing compiler's data structures such as syntax trees?
You told me that widespread use of Java proved that nothing more complex than what Java provides is useful: "Java is mostly used for general purpose programming so your claims about usefulness and the need for extreme performance look silly." I'd be surprised if you seriously meant that, as it implies that Java is the pinnacle of computer language design, but I can't resist teasing you about it. :-)
I only meant that the widespread adoption of Java shows how the public at large cares very little about the performance issues you mentioned. Java is one of the most widely used languages and it's also successful in many fields. Things could be better from programming language theory's point of view, but the business world is more interested in profits, and the large pool of Java coders has brought greater benefits than more expressive languages have. I don't think that says anything against my notes about algebraic data types.
Oct 27 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
retard wrote:
 I only meant that the widespead adoption of Java shows how the public at 
 large cares very little about the performance issues you mentioned  Java
 is one of the most widely used languages and it's also successful in many 
 fields. Things could be better from programming language theory's point 
 of view, but the business world is more interesting in profits and the 
 large pool of Java coders has given better benefits than more expressive 
 languages. I don't think that says anything against my notes about 
 algebraic data types.
Choice of a language has numerous factors, so you cannot dismiss one factor because the other factors still make it an attractive choice. For example: "the widespread adoption of horses shows how the public at large cares very little about the cars you mentioned."
Oct 27 2010
parent retard <re tard.com.invalid> writes:
Wed, 27 Oct 2010 14:15:04 -0700, Walter Bright wrote:

 retard wrote:
 I only meant that the widespead adoption of Java shows how the public
 at large cares very little about the performance issues you mentioned 
 Java is one of the most widely used languages and it's also successful
 in many fields. Things could be better from programming language
 theory's point of view, but the business world is more interesting in
 profits and the large pool of Java coders has given better benefits
 than more expressive languages. I don't think that says anything
 against my notes about algebraic data types.
Choice of a language has numerous factors
I know that.
, so you cannot dismiss one
 factor because the other factors still make it an attractive choice.
I don't think I said anything that contradicts that.
 For example:
 
 "the widespread adoption of horses shows how the public at large cares
 very little about the cars you mentioned."
I meant caring in a way that results in masses of programmers migrating their code from Java to a language with those performance issues solved (e.g. D). Even a layman can draw general conclusions from people switching from Java/C++/C to "new" languages such as Groovy, JavaScript, Python, PHP, and Ruby. People want "simpler" languages. For example, Ruby has terrible performance, but the performance becomes a non-issue once the web service framework is built in a scalable way.
Oct 27 2010
prev sibling next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"retard" <re tard.com.invalid> wrote in message 
news:iaa44v$17sf$2 digitalmars.com...
 I only meant that the widespead adoption of Java shows how the public at
 large cares very little about the performance issues you mentioned.
The public at large is convinced that "Java is fast now, really!". So I'm not certain widespread adoption of Java necessarily indicates they don't care so much about performance. Of course, Java is quickly becoming a legacy language anyway (the next COBOL, IMO), so that throws another wrench into the works.
Oct 27 2010
next sibling parent reply "Todd D. VanderVeen" <tdv part.net> writes:
Legacy in the sense that C is perhaps.

http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html



-----Original Message-----
From: digitalmars-d-bounces puremagic.com
[mailto:digitalmars-d-bounces puremagic.com] On Behalf Of Nick Sabalausky
Sent: Wednesday, October 27, 2010 3:43 PM
To: digitalmars-d puremagic.com
Subject: Re: Looking for champion - std.lang.d.lex

"retard" <re tard.com.invalid> wrote in message 
news:iaa44v$17sf$2 digitalmars.com...
 I only meant that the widespead adoption of Java shows how the public at
 large cares very little about the performance issues you mentioned.
The public at large is convinced that "Java is fast now, really!". So I'm not certain widespread adoption of Java necessarily indicates they don't care so much about performance. Of course, Java is quickly becoming a legacy language anyway (the next COBOL, IMO), so that throws another wrench into the works.
Oct 27 2010
parent reply retard <re tard.com.invalid> writes:
Wed, 27 Oct 2010 16:04:34 -0600, Todd D. VanderVeen wrote:

 Legacy in the sense that C is perhaps.
 
 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Probably the top 10 names are more or less correct there, but some funny notes:

33. D
36. Scratch
40. Haskell
42. JavaFX Script
49. Scala

Scratch is an educational tool. It isn't really suitable for any real-world applications. It slows down considerably with too many expressions. Haskell and Scala both have several books written about them, active mailing lists, and also very many active community projects. Haven't heard much about JavaFX outside Sun/Oracle. These statistics look really weird.
Oct 27 2010
parent reply Don <nospam nospam.com> writes:
retard wrote:
 Wed, 27 Oct 2010 16:04:34 -0600, Todd D. VanderVeen wrote:
 
 Legacy in the sense that C is perhaps.

 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Probably the top 10 names are more or less correct there, but some funny notes: 33. D 36. Scratch 40. Haskell 42. JavaFX Script 49. Scala Scratch is an educational tool. It isn't really suitable for any real world applications. It slows down considerably with too many expressions. There are several books about Haskell and Scala. Both have several books on them, active mailing lists, and also very many active community projects. Haven't heard much about JavaFX outside Sun/Oracle. These statistics look really weird.
I reckon Fortran is the one to look at. If Tiobe's stats were sensible, the Fortran numbers would be solid as a rock. And Ada ought to be pretty stable too. But look at this: http://www.tiobe.com/index.php/paperinfo/tpci/Ada.html Laughable.
Oct 28 2010
parent Matthias Pleh <sufu alter.de> writes:
Am 28.10.2010 16:46, schrieb Don:
 retard wrote:
 Wed, 27 Oct 2010 16:04:34 -0600, Todd D. VanderVeen wrote:

 Legacy in the sense that C is perhaps.

 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Probably the top 10 names are more or less correct there, but some funny notes: 33. D 36. Scratch 40. Haskell 42. JavaFX Script 49. Scala Scratch is an educational tool. It isn't really suitable for any real world applications. It slows down considerably with too many expressions. There are several books about Haskell and Scala. Both have several books on them, active mailing lists, and also very many active community projects. Haven't heard much about JavaFX outside Sun/Oracle. These statistics look really weird.
I reckon Fortran is the one to look at it. If Tiobe's stats were sensible, the Fortran numbers would be solid as a rock. And ADA ought to be pretty stable too. But look at this: http://www.tiobe.com/index.php/paperinfo/tpci/Ada.html Laughable.
There was an article in the c't-Magazin (German) where they took a closer look at these rankings. For example:

- a search for 'C' gets 3080 M results
- a search for 'Java' gets 167 M results

'Java' only competes with the island Java; 'C' competes with C&A, c't-Magazin, C-Quadrat, C+C, char 'c', ... and many, many more. So, to correct for this, only the first *100* (hundred) results are reviewed and the resulting factor is applied to the total number of results. Just look at results 1-100, then at 101-200, 201-300 and so on: you get completely different factors. So these numbers at Tiobe are really lying!

source: http://www.heise.de/developer/artikel/Traue-keiner-Statistik-993137.html

greets
Matthias
Oct 28 2010
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 27/10/2010 22:43, Nick Sabalausky wrote:
 "retard"<re tard.com.invalid>  wrote in message
 news:iaa44v$17sf$2 digitalmars.com...
 I only meant that the widespead adoption of Java shows how the public at
 large cares very little about the performance issues you mentioned.
The public at large is convinced that "Java is fast now, really!". So I'm not certain widespread adoption of Java necessarily indicates they don't care so much about performance. Of course, Java is quickly becoming a legacy language anyway (the next COBOL, IMO), so that throws another wrench into the works.
Java is quickly becoming a legacy language? The next COBOL? SRSLY?... Just two years ago, the now hugely popular Android platform chose Java as its language of choice, and you think Java is becoming legacy?...

The development of the Java language itself has stagnated over the last 6 years or so (especially due to corporate politics, which has now become even worse and more uncertain with all the shit Oracle is doing), but that's a completely different statement from saying Java is becoming legacy.

In fact, all the uproar and concern about the future of Java under Oracle, of the JVM, of the JCP (the body that regulates changes to Java), etc., is a testament to the huge popularity of Java. Otherwise people (and corporations) wouldn't care; they would just let it wither away with much less concern.

--
Bruno Medeiros - Software Engineer
Nov 19 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 Java is quickly becoming a legacy language? the next COBOL? SRSLY?...
 Just two years ago, the now hugely popular Android platform choose Java 
 as it's language of choice, and you think Java is becoming legacy?...
Java on Android is not going well; there is an Oracle->Google lawsuit in progress. Google is interested in using a variant of NaCl on Android too.

Bye,
bearophile
Nov 19 2010
parent reply Andrew Wiley <debio264 gmail.com> writes:
On Fri, Nov 19, 2010 at 4:20 PM, bearophile <bearophileHUGS lycos.com>wrote:

 Bruno Medeiros:

 Java is quickly becoming a legacy language? the next COBOL? SRSLY?...
 Just two years ago, the now hugely popular Android platform choose Java
 as it's language of choice, and you think Java is becoming legacy?...
Java on Adroid is not going well, there is a Oracle->Google lawsuit in progress. Google is interested in using a variant of NaCL on Android too.
I have to agree with Bruno here, Java isn't going anywhere soon. It has an active community, corporations that very actively support it, and an open source effort that's probably the largest of any language (check out the Apache Foundation project lists). Toss in Clojure, Scala, Groovy, and their friends that can build on top of Java libraries, and you wind up with a package that isn't becoming obsolete any time soon.
Nov 19 2010
parent "Nick Sabalausky" <a a.a> writes:
"Andrew Wiley" <debio264 gmail.com> wrote in message 
news:mailman.501.1290205603.21107.digitalmars-d puremagic.com...
 On Fri, Nov 19, 2010 at 4:20 PM, bearophile 
 <bearophileHUGS lycos.com>wrote:

 Bruno Medeiros:

 Java is quickly becoming a legacy language? the next COBOL? SRSLY?...
 Just two years ago, the now hugely popular Android platform choose Java
 as it's language of choice, and you think Java is becoming legacy?...
Java on Adroid is not going well, there is a Oracle->Google lawsuit in progress. Google is interested in using a variant of NaCL on Android too.
I have to agree with Bruno here, Java isn't going anywhere soon. It has an active community, corporations that very actively support it, and an open source effort that's probably the largest of any language (check out the Apache Foundation project lists). Toss in Clojure, Scala, Groovy, and their friends that can build on top of Java libraries, and you wind up with a package that isn't becoming obsolete any time soon.
To be clear, I meant Java the language, not Java the VM. But yea, you're right, I probably overstated my point. What I have noticed though is, like Bruno said, a slowdown in Java language development and I can certainly imagine complications from the Oracle takeover of Sun. I've also been noticing decreasing interest in using Java (the language) for new projects (although, yes, not *completely* stalled interest) compared to 5-10 years ago, sharply increased awareness and recognition of Java's drawbacks compared to 5-10 years ago, and greatly increased interest and usage of other JVM languages besides Java. Ten years from now, I wouldn't at all be surprised to see Java (the language) being used primarily for maintenance on existing software that had already been written in Java. In fact, I'd be surprised if that doesn't end up being the case at that point. But I do imagine seeing things like D, Nemerle, Scala and Python thriving at that point.
Nov 23 2010
prev sibling parent reply Michael Stover <michael.r.stover gmail.com> writes:
As for D lexers and tokenizers, what would be nice is to
A) build an antlr grammar for D
B) build D targets for antlr so that antlr can generate lexers and parsers
in the D language.

For B) I found http://www.mbutscher.de/antlrd/index.html

For A) A good list of antlr grammars is at http://www.antlr.org/grammar/list,
but there isn't a D grammar.

These things wouldn't be an enormous amount of work to create and maintain,
and, if done, anyone could parse D code in many languages, including Java
and C which would make providing IDE features for D development easier in
those languages (eclipse for instance), and you could build lexers and
parsers in D using antlr grammars.

-Mike



On Fri, Nov 19, 2010 at 5:09 PM, Bruno Medeiros
<brunodomedeiros+spam com.gmail> wrote:

 On 27/10/2010 22:43, Nick Sabalausky wrote:

 "retard"<re tard.com.invalid>  wrote in message
 news:iaa44v$17sf$2 digitalmars.com...

 I only meant that the widespead adoption of Java shows how the public at
 large cares very little about the performance issues you mentioned.
The public at large is convinced that "Java is fast now, really!". So I'm not certain widespread adoption of Java necessarily indicates they don't care so much about performance. Of course, Java is quickly becoming a legacy language anyway (the next COBOL, IMO), so that throws another wrench into the works.
Java is quickly becoming a legacy language? the next COBOL? SRSLY?... Just two years ago, the now hugely popular Android platform choose Java as it's language of choice, and you think Java is becoming legacy?... The development of the Java language itself has stagnated over the last 6 years or so (especially due to corporate politics, which now has become even worse and uncertain with all the shit Oracle is doing), but that's a completely different statement from saying Java is becoming legacy. In fact, all the uproar and concern about the future of Java under Oracle, of the JVM, of the JCP (the body that regulates changes to Java),etc., is a testament to the huge popularity of Java. Otherwise people (and corporations) wouldn't care, they would just let it wither away with much less concern. -- Bruno Medeiros - Software Engineer
Nov 19 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 22:25, Michael Stover wrote:
 As for D lexers and tokenizers, what would be nice is to
 A) build an antlr grammar for D
 B) build D targets for antlr so that antlr can generate lexers and
 parsers in the D language.

 For B) I found http://www.mbutscher.de/antlrd/index.html

 For A) A good list of antlr grammars is at
 http://www.antlr.org/grammar/list, but there isn't a D grammar.

 These things wouldn't be an enormous amount of work to create and
 maintain, and, if done, anyone could parse D code in many languages,
 including Java and C which would make providing IDE features for D
 development easier in those languages (eclipse for instance), and you
 could build lexers and parsers in D using antlr grammars.

 -Mike
Yes, that would be much better. It would be directly and immediately useful for the DDT project: "But better yet would be to start coding our own custom parser (using a parser generator like ANTLR for example), that could really be tailored for IDE needs. In the medium/long term, that's probably what needs to be done. " in http://www.digitalmars.com/d/archives/digitalmars/D/ide/Future_of_Descent_and_D_Eclipse_IDE_635.html -- Bruno Medeiros - Software Engineer
Nov 19 2010
parent reply Michael Stover <michael.r.stover gmail.com> writes:
so that was 4 months ago - how do things currently stand on that initiative?

-Mike

On Fri, Nov 19, 2010 at 6:37 PM, Bruno Medeiros
<brunodomedeiros+spam com.gmail> wrote:

 On 19/11/2010 22:25, Michael Stover wrote:

 As for D lexers and tokenizers, what would be nice is to
 A) build an antlr grammar for D
 B) build D targets for antlr so that antlr can generate lexers and
 parsers in the D language.

 For B) I found http://www.mbutscher.de/antlrd/index.html

 For A) A good list of antlr grammars is at
 http://www.antlr.org/grammar/list, but there isn't a D grammar.

 These things wouldn't be an enormous amount of work to create and
 maintain, and, if done, anyone could parse D code in many languages,
 including Java and C which would make providing IDE features for D
 development easier in those languages (eclipse for instance), and you
 could build lexers and parsers in D using antlr grammars.

 -Mike
Yes, that would be much better. It would be directly and immediately useful for the DDT project: "But better yet would be to start coding our own custom parser (using a parser generator like ANTLR for example), that could really be tailored for IDE needs. In the medium/long term, that's probably what needs to be done. " in http://www.digitalmars.com/d/archives/digitalmars/D/ide/Future_of_Descent_and_D_Eclipse_IDE_635.html -- Bruno Medeiros - Software Engineer
Nov 19 2010
next sibling parent Matthias Pleh <gonzo web.at> writes:
Am 20.11.2010 00:56, schrieb Michael Stover:
 so that was 4 months ago - how do things currently stand on that initiative?

 -Mike

 On Fri, Nov 19, 2010 at 6:37 PM, Bruno Medeiros
 <brunodomedeiros+spam com.gmail> wrote:

     On 19/11/2010 22:25, Michael Stover wrote:

         As for D lexers and tokenizers, what would be nice is to
         A) build an antlr grammar for D
         B) build D targets for antlr so that antlr can generate lexers and
         parsers in the D language.

         For B) I found http://www.mbutscher.de/antlrd/index.html

         For A) A good list of antlr grammars is at
         http://www.antlr.org/grammar/list, but there isn't a D grammar.

         These things wouldn't be an enormous amount of work to create and
         maintain, and, if done, anyone could parse D code in many languages,
         including Java and C which would make providing IDE features for D
         development easier in those languages (eclipse for instance),
         and you
         could build lexers and parsers in D using antlr grammars.

         -Mike


     Yes, that would be much better. It would be directly and immediately
     useful for the DDT project:

     "But better yet would be to start coding our own custom parser
     (using a parser generator like ANTLR for example), that could really
     be tailored for IDE needs. In the medium/long term, that's probably
     what needs to be done. "
     in
     http://www.digitalmars.com/d/archives/digitalmars/D/ide/Future_of_Descent_and_D_Eclipse_IDE_635.html

     --
     Bruno Medeiros - Software Engineer
There is a project with an ANTLR D grammar in the works:
http://code.google.com/p/vs-d-integration/
Maybe this can be finished?

matthias
Nov 20 2010
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 19/11/2010 23:56, Michael Stover wrote:
 so that was 4 months ago - how do things currently stand on that initiative?

 -Mike

 On Fri, Nov 19, 2010 at 6:37 PM, Bruno Medeiros
 <brunodomedeiros+spam com.gmail> wrote:

     On 19/11/2010 22:25, Michael Stover wrote:

         As for D lexers and tokenizers, what would be nice is to
         A) build an antlr grammar for D
         B) build D targets for antlr so that antlr can generate lexers and
         parsers in the D language.

         For B) I found http://www.mbutscher.de/antlrd/index.html

         For A) A good list of antlr grammars is at
         http://www.antlr.org/grammar/list, but there isn't a D grammar.

         These things wouldn't be an enormous amount of work to create and
         maintain, and, if done, anyone could parse D code in many languages,
         including Java and C which would make providing IDE features for D
         development easier in those languages (eclipse for instance),
         and you
         could build lexers and parsers in D using antlr grammars.

         -Mike


     Yes, that would be much better. It would be directly and immediately
     useful for the DDT project:

     "But better yet would be to start coding our own custom parser
     (using a parser generator like ANTLR for example), that could really
     be tailored for IDE needs. In the medium/long term, that's probably
     what needs to be done. "
     in
     http://www.digitalmars.com/d/archives/digitalmars/D/ide/Future_of_Descent_and_D_Eclipse_IDE_635.html

     --
     Bruno Medeiros - Software Engineer
I don't know about Ellery; as you can see in that thread he/she(?) mentioned interest in working on that, but I don't know anything more.

As for me, I didn't work on that, nor did I plan to. Nor am I planning to anytime soon; DDT can handle things with the current parser for now (bugs can be fixed in the current code, and perhaps some limitations can be resolved by merging some more code from DMD), so I'll likely work on other, more important features before I go there. For example, I'll likely work on debugger integration and code completion improvements before I would go on to write a new parser from scratch. Plus, it gives more time for someone else to hopefully work on it. :P

Unlike Walter, I can't write a D parser in a weekend... :) Not even in a week, especially since I've never done anything of this kind before.

--
Bruno Medeiros - Software Engineer
Nov 24 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 11/24/2010 09:13 AM, Bruno Medeiros wrote:
 I don't know about Ellery, as you can see in that thread he/she(?)
 mentioned interest in working on that, but I don't know anything more.
Normally I go by 'it'.

Been pretty busy this semester, so I haven't been doing much.

But the bottom line is: yes, I have working ANTLR grammars for D1 and D2, if you don't mind that
1) they're slow
2) they're tied to a hacked-out version of the NetBeans fork of ANTLR2
3) they're tied to some custom Java code
4) I haven't been keeping the tree grammars so up to date

I've not released them for those reasons. The semester will be over in about 3 weeks, though, and I'll have time then.
 As for me, I didn't work on that, nor did I plan to.
 Nor am I planning to anytime soon, DDT can handle things with the
 current parser for now (bugs can be fixed on the current code, perhaps
 some limitations can be resolved by merging some more code from DMD), so
 I'll likely work on other more important features before I go there. For
 example, I'll likely work on debugger integration, and code completion
 improvements before I would go on writing a new parser from scratch.
 Plus, it gives more time to hopefully someone else work on it. :P

 Unlike Walter, I can't write a D parser in a weekend... :) Not even on a
 week, especially since I never done anything of this kind before.
It took me like 3 months to read his parser to figure out what was going on.
Nov 24 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 24/11/2010 16:19, Ellery Newcomer wrote:
 On 11/24/2010 09:13 AM, Bruno Medeiros wrote:
 I don't know about Ellery, as you can see in that thread he/she(?)
 mentioned interest in working on that, but I don't know anything more.
Normally I go by 'it'.
I didn't mean to offend or anything, I was just unsure of that. To me Ellery seems like a female name (but that can be a bias due to English not being my first language, or some other cultural thing). On the other hand, I would be surprised if a person of the female variety would be that interested in D, to the point of contributing in such a way.
 Been pretty busy this semester, so I haven't been doing much.

 But the bottom line is, yes I have working antlr grammars for D1 and D2
 if you don't mind
 1) they're slow
 2) they're tied to a hacked-out version of the netbeans fork of ANTLR2
 3) they're tied to some custom java code
 4) I haven't been keeping the tree grammars so up to date

 I've not released them for those reasons. Semester will be over in about
 3 weeks, though, and I'll have time then.
Hum, doesn't sound like it might be suitable for DDT, but I wasn't counting on that either.
 As for me, I didn't work on that, nor did I plan to.
 Nor am I planning to anytime soon, DDT can handle things with the
 current parser for now (bugs can be fixed on the current code, perhaps
 some limitations can be resolved by merging some more code from DMD), so
 I'll likely work on other more important features before I go there. For
 example, I'll likely work on debugger integration, and code completion
 improvements before I would go on writing a new parser from scratch.
 Plus, it gives more time to hopefully someone else work on it. :P

 Unlike Walter, I can't write a D parser in a weekend... :) Not even on a
 week, especially since I never done anything of this kind before.
It took me like 3 months to read his parser to figure out what was going on.
Not 3 man-months for sure, right? (Man-month in the sense of someone working 40 hours per week during a month.)

--
Bruno Medeiros - Software Engineer
Nov 24 2010
next sibling parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 11/24/2010 02:09 PM, Bruno Medeiros wrote:
 I didn't meant to offend or anything, I was just unsure of that.
None taken; I'm just laughing at you. As I understand it, though, 'Ellery' is a unisex name, so it is entirely ambiguous.
 It took me like 3 months to read his parser to figure out what was going
 on.
Not 3 man-months for sure!, right? (Man-month in the sense of someone working 40 hours per week during a month.)
Probably not
Nov 24 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 On the other hand, I would be surprised if a person of the female variety
 would be that interested in D, to the point of contributing in such way.
In Python newsgroups I have seen a few women now and then, but in the D newsgroup so far... not many. So far D seems to be a male thing; I don't know why. At the university, in the Computer Science course, there is a fair number of female students (and a few female teachers too).

Bye,
bearophile
Nov 24 2010
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
bearophile schrieb:
 Bruno Medeiros:
 
 On the other hand, I would be surprised if a person of the female variety
 would be that interested in D, to the point of contributing in such way.
In Python newsgroups I have seen few women, now and then, but in the D newsgroup so far... not many. So far D seems a male thing. I don't know why. At the university at the Computer Science course there are a good enough number of female students (and few female teachers too). Bye, bearophile
At my university there are *very* few women studying computer science. Most women sitting in CS lectures here are studying maths and have to take some basic CS lectures (I don't think they're the kind that would try D voluntarily). We have two female professors, though.
Nov 24 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Daniel Gibson" <metalcaedes gmail.com> wrote in message 
news:icjv6l$p1r$2 digitalmars.com...
 bearophile schrieb:
 Bruno Medeiros:

 On the other hand, I would be surprised if a person of the female 
 variety
 would be that interested in D, to the point of contributing in such way.
In Python newsgroups I have seen few women, now and then, but in the D newsgroup so far... not many. So far D seems a male thing. I don't know why. At the university at the Computer Science course there are a good enough number of female students (and few female teachers too). Bye, bearophile
At my university there are *very* few woman studying computer science. Most women sitting in CS lectures here are studying maths and have to do some basic CS lectures (I don't think they're the kind that would try D voluntarily). We have two female professors though.
fest.
Nov 25 2010
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 24/11/2010 21:12, Daniel Gibson wrote:
 bearophile schrieb:
 Bruno Medeiros:

 On the other hand, I would be surprised if a person of the female
 variety
 would be that interested in D, to the point of contributing in such way.
In Python newsgroups I have seen few women, now and then, but in the D newsgroup so far... not many. So far D seems a male thing. I don't know why. At the university at the Computer Science course there are a good enough number of female students (and few female teachers too). Bye, bearophile
At my university there are *very* few woman studying computer science. Most women sitting in CS lectures here are studying maths and have to do some basic CS lectures (I don't think they're the kind that would try D voluntarily). We have two female professors though.
It is well known that there is a big gender gap in CS with regard to students and professionals. Something like 5-20% I guess, depending on university, company, etc. But the interesting thing (although also quite unfortunate) is that this gap takes an even greater dip downwards when you consider the communities of FOSS developers/contributors. It must be well below 1%! (Note that I'm not talking about *users* of FOSS software, but only people who actually contribute code, whether to FOSS projects or to their own indie/toy projects.)

--
Bruno Medeiros - Software Engineer
Nov 26 2010
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 27/10/2010 22:04, retard wrote:
 Wed, 27 Oct 2010 13:52:29 -0700, Walter Bright wrote:

 retard wrote:
 Wed, 27 Oct 2010 12:08:19 -0700, Walter Bright wrote:

 retard wrote:
 This is why the basic data structure in functional languages,
 algebraic data types, suits better for this purpose.
I think you recently demonstrated otherwise, as proven by the widespread use of Java :-)
I don't understand your logic -- Widespread use of Java proves that algebraic data types aren't a better suited way for expressing compiler's data structures such as syntax trees?
You told me that widespread use of Java proved that nothing more complex than what Java provides is useful: "Java is mostly used for general purpose programming so your claims about usefulness and the need for extreme performance look silly." I'd be surprised if you seriously meant that, as it implies that Java is the pinnacle of computer language design, but I can't resist teasing you about it. :-)
I only meant that the widespead adoption of Java shows how the public at large cares very little about the performance issues you mentioned. Java is one of the most widely used languages and it's also successful in many fields. Things could be better from programming language theory's point of view, but the business world is more interesting in profits and the large pool of Java coders has given better benefits than more expressive languages. I don't think that says anything against my notes about algebraic data types.
"the widespead adoption of Java shows how the public at large cares very little about the performance issues you mentioned" WTF? The widespead adoption of Java means that _Java developers_ at large don't care about those performance issues (mostly because they work on stuff where they don't need to). But it's no statement about all the pool of developers. Java is hugely popular, but not in a "it's practically the only language people use" way. It's not like Windows on the desktop. -- Bruno Medeiros - Software Engineer
Nov 19 2010
prev sibling parent reply dolive <dolive89 sina.com> writes:
Walter Bright wrote:

 As we all know, tool support is important for D's success. Making tools easier 
 to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
std.lang.d.lex. 
 It would be helpful in writing color syntax highlighting filters, pretty 
 printers, repl, doc generators, static analyzers, and even D compilers.
 
 It should:
 
 1. support a range interface for its input, and a range interface for its
output
 2. optionally not generate lexical errors, but just try to recover and continue
 3. optionally return comments and ddoc comments as tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained in
tandem
 
 It can also serve as the basis for creating a javascript implementation that
can 
 be embedded into web pages for syntax highlighting, and eventually an 
 std.lang.d.parse.
 
 Anyone want to own this?
intense support! Someone to do it?
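As an illustration of the range-based shape asked for in the quoted list (a range in, a range of value-type tokens out), here is a toy sketch in D. Everything in it is hypothetical: the names (Token, WordLexer, lexWords), and the "lexing" itself, which just splits on whitespace so the example stays self-contained; the input is a plain string rather than an arbitrary character range, for brevity. A real std.lang.d.lex would of course implement the actual D token grammar.

// Toy sketch only: a lexer exposed as an input range of value-type tokens.
// WordLexer, Token, and lexWords are invented names; the "lexemes" here are
// just whitespace-separated words, standing in for real D tokens.
import std.ascii : isWhite;

struct Token
{
    string text;    // slice of the source text
    size_t offset;  // position in the source, for diagnostics
}

struct WordLexer
{
    private string src;
    private size_t pos;
    private Token  cur;
    private bool   done;

    this(string source) { src = source; advance(); }

    // Input-range interface: the "range interface for its output".
    @property bool empty()  { return done; }
    @property Token front() { return cur; }
    void popFront()         { advance(); }

    private void advance()
    {
        while (pos < src.length && isWhite(src[pos]))
            ++pos;                          // skip whitespace
        if (pos >= src.length) { done = true; return; }
        auto start = pos;
        while (pos < src.length && !isWhite(src[pos]))
            ++pos;                          // consume one "lexeme"
        cur = Token(src[start .. pos], start);
    }
}

auto lexWords(string source) { return WordLexer(source); }

unittest
{
    import std.algorithm : equal, map;
    auto toks = lexWords("int x = 42;");
    assert(toks.map!(t => t.text).equal(["int", "x", "=", "42;"]));
}

Because Token is a struct and each slice points back into the source text, consuming the whole output range never allocates per token, matching requirement 4 in the quoted list.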
Feb 26 2011
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 26 February 2011 02:06:18 dolive wrote:
 Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.
 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.
 It should:
 1. support a range interface for its input, and a range interface for its
 output 2. optionally not generate lexical errors, but just try to
 recover and continue 3. optionally return comments and ddoc comments as
 tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.
 Anyone want to own this?
 intense support! Someone to do it?
I'm working on it, but I have enough else going on right now that I'm not being very quick about it. I don't know when it will be done.

- Jonathan M Davis
Feb 26 2011
parent dolive <dolive89 sina.com> writes:
Jonathan M Davis wrote:

 On Saturday 26 February 2011 02:06:18 dolive wrote:
 Walter Bright wrote:
 As we all know, tool support is important for D's success. Making tools
 easier to build will help with that.
 
 To that end, I think we need a lexer for the standard library -
 std.lang.d.lex. It would be helpful in writing color syntax highlighting
 filters, pretty printers, repl, doc generators, static analyzers, and
 even D compilers.
 
 It should:
 
 1. support a range interface for its input, and a range interface for its
 output 2. optionally not generate lexical errors, but just try to
 recover and continue 3. optionally return comments and ddoc comments as
 tokens
 4. the tokens should be a value type, not a reference type
 5. generally follow along with the C++ one so that they can be maintained
 in tandem
 
 It can also serve as the basis for creating a javascript implementation
 that can be embedded into web pages for syntax highlighting, and
 eventually an std.lang.d.parse.
 
 Anyone want to own this?
intense support! Someone to do it?
I'm working on it, but I have enough else going on right now that I'm no being very quick about it. I don't know when it will be done. - Jonathan M Davis
thanks, make an all-out effort!
Feb 26 2011