digitalmars.D - More lexer questions


"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

According to the online specs, the lexer tries to tokenize by maximal
matching (except for one exception in the case of ranges like "1..2").
The fact that this exception is stated seems to indicate that it's
permitted to have two literals side-by-side without an intervening
space.

So does that mean "1e2" should be tokenized as (float lit: 1e2) and
"1f2" should be tokenized as (int lit: 1)(identifier: f2)?

Or, for that matter, "123abcdefg" should be tokenized as (int lit:
123)(identifier: abcdefg) whereas "0x123abcdefg" should be tokenized as
(int lit: 0x123abcdef)(identifier: g)?

Or worse, if we still allow octals, "0129" should be tokenized as (octal
lit: 012)(int lit: 9)?

Or do we expect that any integer/float literal will always span the
longest string that has characters permitted in any numerical literal,
and then after the fact the lexer will give an error if the string
cannot be interpreted as a legal literal? IOW, "0129" will first be
scanned in its entirety as a numerical literal, then afterwards the
lexer decides that '9' doesn't belong in an octal so it throws an error
(as opposed to maximally matching "012" as an octal literal followed by
a decimal literal "9").  Or, for that matter, "0123xel.u123" will be
scanned as a numerical literal (since all the characters in it occur in
some kind of numerical literal), and then an error generated after the
fact when the lexer realizes that this string isn't a legal numerical
literal?


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.

Feb 11 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/11/2012 07:42 PM, H. S. Teoh wrote:
 According to the online specs, the lexer tries to tokenize by maximal
 matching (except for one exception in the case of ranges like "1..2").
 The fact that this exception is stated seems to indicate that it's
 permitted to have two literals side-by-side without an intervening
 space.

 So does that mean "1e2" should be tokenized as (float lit: 1e2) and

Yes.

 "1f2" should be tokenized as (int lit: 1)(identifier: f2)?

No. maximal munch:

(float lit: 1f)(int lit: 2)


 Or, for that matter, "123abcdefg" should be tokenized as (int lit:
 123)(identifier: abcdefg)

Yes.

 whereas "0x123abcdefg" should be tokenized as
 (int lit: 0x123abcdef)(identifier: g)?

 Or worse, if we still allow octals, "0129" should be tokenized as (octal
 lit: 012)(int lit: 9)?

DMD treats 0129 as an error. Therefore, the simplest way to handle integer 
literals with a leading 0 is to parse them as decimal and to reject them 
if the value exceeds 7.
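
The rule above can be sketched in a few lines. This is only an illustration, written in Python rather than D, and it is not DMD's implementation. The function name lex_leading_zero_int is made up for the example, and digit separators and suffixes are not handled.

```python
DIGITS = "0123456789"

def lex_leading_zero_int(text, pos=0):
    """Scan a run of decimal digits starting at pos.

    Returns (value, end_position). A literal with a leading 0 is parsed
    as decimal and rejected if its value exceeds 7, following the rule
    stated in the post. Hypothetical helper, not part of any real lexer.
    """
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    lexeme = text[pos:end]
    if not lexeme:
        raise ValueError("no digits at position %d" % pos)
    value = int(lexeme)  # int() accepts leading zeros in a decimal string
    if len(lexeme) > 1 and lexeme[0] == "0" and value > 7:
        raise ValueError("invalid literal with leading 0: " + lexeme)
    return value, end
```

With this rule "07" is accepted as 7, while "0129" is scanned in full and then rejected, instead of being split into "012" and "9".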

 Or do we expect that any integer/float literal will always span the
 longest string that has characters permitted in any numerical literal,
 and then after the fact the lexer will give an error if the string
 cannot be interpreted as a legal literal? IOW, "0129" will first be
 scanned in its entirety as a numerical literal, then afterwards the
 lexer decides that '9' doesn't belong in an octal so it throws an error
 (as opposed to maximally matching "012" as an octal literal followed by
 a decimal literal "9").  Or, for that matter, "0123xel.u123" will be


(int lit: 0123)(identifier: xel)(token: '.')(identifier: u123)

 scanned as a numerical literal (since all the characters in it occur in
 some kind of numerical literal), and then an error generated after the
 fact when the lexer realizes that this string isn't a legal numerical
 literal?


 T

No. For example, that kind of processing would reject the valid token 
string q{0123xel.u123}.
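
The tokenizations given in this post can be reproduced with a small maximal-munch loop. The following is a minimal sketch in Python, not the D grammar: the patterns cover only a simplified subset of literals, the names TOKEN_PATTERNS and tokenize are made up for the example, and the float pattern assumes a digit must follow the decimal point, which sidesteps the "1..2" exception instead of implementing it the way the spec does.

```python
import re

# Simplified token patterns (an assumption for illustration only).
TOKEN_PATTERNS = [
    ("int", re.compile(r"0[xX][0-9a-fA-F]+")),          # hexadecimal
    ("float", re.compile(
        r"[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?[fFL]?"     # 1.5, 1.5e3f
        r"|[0-9]+[eE][+-]?[0-9]+[fFL]?"                 # 1e2
        r"|[0-9]+[fF]")),                               # 1f
    ("int", re.compile(r"[0-9]+")),                     # decimal
    ("identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("token", re.compile(r"\.\.|\.")),
]

def tokenize(text):
    """Split text into (kind, lexeme) pairs using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier pattern wins.
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError("unexpected character at %d" % pos)
        tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens
```

Under these assumptions, tokenize("1f2") yields a float "1f" followed by an int "2", and tokenize("0123xel.u123") yields an int, an identifier, a '.' and another identifier, matching the answers above. The sketch only finds token boundaries; it does not check whether a literal such as 0123 is a legal value.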

Feb 11 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

Just wanted to point you to my working D lexer (needs a CTFE bugfix  
http://d.puremagic.com/issues/show_bug.cgi?id=6815).

https://gist.github.com/1262321 D part
https://gist.github.com/1255439 Generic part

Feb 11 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/11/2012 09:59 PM, Martin Nowak wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).

This seems to do the job:
constfold.c:1566
-        if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
+        if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar))

However, I don't know the compiler's internals at all, therefore it is 
quite possible that the fix is incorrect.


 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

Bug: The lexer cannot handle /++/ and /**/ (without a newline character 
at the end).

Feb 11 2012

simendsjo <simendsjo gmail.com> writes:

On 02/12/2012 12:35 AM, Timon Gehr wrote:
 On 02/11/2012 09:59 PM, Martin Nowak wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).

 This seems to do the job:
 constfold.c:1566
 - if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
 + if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar ||
 tn->ty == Tdchar))

 However, I don't know the compiler's internals at all, therefore it is
 quite possible that the fix is incorrect.


 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

 Bug: The lexer cannot handle /++/ and /**/ (without new line character
 at the end).

Another thing.. Using /+ and +/ in strings gives unexpected results when 
commented out:
/+
auto a = "/+";
+/
everything from this point is commented out.

/+
auto a = "+/";
+/ // already terminated by the string value.

Is this a bug, or as designed? /++/ is meant to comment out code, so it 
would have been nice if it was able to handle this, but I guess it would 
complicate the lexer a great deal.

Feb 11 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Feb 12, 2012 at 01:00:07AM +0100, simendsjo wrote:
[...]
 Another thing.. Using /+ and +/ in strings gives unexpected results
 when commented out:
 /+
 auto a = "/+";
 +/
 everything from this point is commented out.
 
 /+
 auto a = "+/";
 +/ // already terminated by the string value.
 
 Is this a bug, or as designed? /++/ is meant to comment out code, so
 it would have been nice if it was able to handle this, but I guess
 it would complicate the lexer a great deal.

It's by design. At least according to the online specs:

	The contents of strings and comments are not tokenized.
	Consequently, comment openings occurring within a string do not
	begin a comment, and string delimiters within a comment do not
	affect the recognition of comment closings and nested "/+"
	comment openings. With the exception of "/+" occurring within a
	"/+" comment, comment openings within a comment are ignored.

		a = /+ // +/ 1;    // parses as if 'a = 1;'
		a = /+ "+/" +/ 1"; // parses as if 'a = " +/ 1";'
		a = /+ /* +/ */ 3; // parses as if 'a = */ 3;'

For commenting out code, a much better way is to use version(none){...}.
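
The behaviour described in the spec excerpt can be traced with a depth counter. The following is a minimal sketch in Python, not any real D lexer: the function name strip_nesting_comments is made up for the example, and it models only /+ +/ comments. In particular it does not recognise string literals outside a comment, which a real lexer must do.

```python
def strip_nesting_comments(source):
    """Remove /+ ... +/ comments, counting nesting depth.

    Quotes inside a comment are ordinary characters, so a "+/" inside a
    commented-out string still closes one level, and a "/+" opens one.
    """
    out = []
    depth = 0
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "/+":
            depth += 1
            i += 2
        elif pair == "+/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(source[i])  # outside any comment: keep
            i += 1
    if depth > 0:
        raise ValueError("unterminated /+ comment (depth %d)" % depth)
    return "".join(out)
```

Applied to the second spec example, a = /+ "+/" +/ 1";, the sketch returns a = " +/ 1";, as the spec says. For the first case from the question, the "/+" inside the commented-out string raises the depth to 2, the following +/ only brings it back to 1, and the comment is never closed. In the second case the "+/" inside the string closes the comment early, leaving the rest of the string and the stray +/ as code.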


T

-- 
No! I'm not in denial!

Feb 11 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, February 12, 2012 01:00:07 simendsjo wrote:
 Another thing.. Using /+ and +/ in strings gives unexpected results when
 commented out:
 /+
 auto a = "/+";
 +/
 everything from this point is commented out.
 
 /+
 auto a = "+/";
 +/ // already terminated by the string value.
 
 Is this a bug, or as designed? /++/ is meant to comment out code, so it
 would have been nice if it was able to handle this, but I guess it would
 complicate the lexer a great deal.

It's by design. Everything between /+ and +/ is a comment. It doesn't matter 
what it is. There's nothing special about " which would make it ignore the 
characters following it when looking for the +/ to end the comment.

- Jonathan M Davis

Feb 11 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 12 Feb 2012 01:00:07 +0100, simendsjo <simendsjo gmail.com> wrote:

 On 02/12/2012 12:35 AM, Timon Gehr wrote:
 On 02/11/2012 09:59 PM, Martin Nowak wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).

 This seems to do the job:
 constfold.c:1566
 - if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
 + if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar ||
 tn->ty == Tdchar))

 However, I don't know the compiler's internals at all, therefore it is
 quite possible that the fix is incorrect.


 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

 Bug: The lexer cannot handle /++/ and /**/ (without new line character
 at the end).

 Another thing.. Using /+ and +/ in strings gives unexpected results when  
 commented out:
 /+
 auto a = "/+";

/+ comments do nest. So you have opened two levels, and the comment only  
ends after two matching +/.
/* comments do not nest.

 +/
 everything from this point is commented out.

 /+
 auto a = "+/";
 +/ // already terminated by the string value.

 Is this a bug, or as designed? /++/ is meant to comment out code, so it  
 would have been nice if it was able to handle this, but I guess it would  
 complicate the lexer a great deal.

Feb 11 2012

"Alex_Dovhal" <alex_dovhal yahoo.com> writes:

"Martin Nowak" <dawg dawgfoto.de> wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix 
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).

 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

Hi, how should it be compiled? I tried with DMD 2.057:
dmd dlexer.d

and got
c:\Programs\Programming\Lang\dmd2\windows\bin\..\..\src\phobos\std\conv.d(94): 
Error: template instance 
std.format.formatValue!(Appender!(string),defineToken,char) recursive 
expansion

Feb 12 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 12 Feb 2012 11:18:26 +0100, Alex_Dovhal <alex_dovhal yahoo.com>  
wrote:

 "Martin Nowak" <dawg dawgfoto.de> wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).

 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

 Hi, how it should be compiled? I tried with DMD 2.057:
 dmd dlexer.d

 and got
 c:\Programs\Programming\Lang\dmd2\windows\bin\..\..\src\phobos\std\conv.d(94):
 Error: template instance
 std.format.formatValue!(Appender!(string),defineToken,char) recursive
 expansion


I don't know about that one, it doesn't happen with 2.057 for me.
The code triggers a lot of compiler bugs and it won't compile unless some  
CTFE bugs are fixed.

Feb 12 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Feb 11, 2012 at 09:59:06PM +0100, Martin Nowak wrote:
 Just wanted to point you to my working D lexer (needs a CTFE bugfix
 http://d.puremagic.com/issues/show_bug.cgi?id=6815).
 
 https://gist.github.com/1262321 D part
 https://gist.github.com/1255439 Generic part

Cool, thanks!

Looks like you've gone far beyond what I'm doing. :-) But it's still a
good learning exercise for me to get comfortable with coding in D.


T

-- 
"The whole problem with the world is that fools and fanatics are always
so certain of themselves, but wiser people so full of doubts." --
Bertrand Russell.
"How come he didn't put 'I think' at the end of it?" -- Anonymous

Feb 11 2012
