digitalmars.D - A lexical change (a breaking change, but trivial to fix)

Mehrdad (15/15) Jul 07 2012 This might sound silly, but how about if D stopped allowing

Alex Rønne Petersen (7/20) Jul 07 2012 ... why is this even done at the lexical stage? It should be done at the...

Mehrdad (4/6) Jul 07 2012 Well, even better then -- it makes it easier to make a parser.
H. S. Teoh (13/18) Jul 07 2012 [...]

deadalnix (2/16) Jul 07 2012 0. should be banned because of UFCS anyway.

Jonathan M Davis (19/43) Jul 07 2012 as a

H. S. Teoh (16/18) Jul 07 2012 [...]

Mehrdad (9/28) Jul 07 2012 Yeah that's exactly what happened to me lol.

Timon Gehr (20/45) Jul 07 2012 You could go like this:

Mehrdad (11/30) Jul 08 2012 You kinda glossed over the crucial detail in parseNumber(). ;)
Mehrdad (9/56) Jul 08 2012 Right, it's trivial to fix with an extra state variable like

H. S. Teoh (15/23) Jul 09 2012 This is eventually what I did in my own D lexer.

Mehrdad (2/8) Jul 09 2012 Yup, hence my line above. :P

Timon Gehr (10/24) Jul 07 2012 It does not make it easier to create a lexer, because this is not

Timon Gehr (2/30) Jul 07 2012

Jonathan M Davis (5/24) Jul 07 2012 +1
Jonathan M Davis (4/29) Jul 07 2012 There's an existing enhancement request for it:

"Mehrdad" <wfunction hotmail.com> writes:

This might sound silly, but how about if D stopped allowing   
0..2  as a range, and instead just said "invalid floating-point 
number"?

Fixing it en masse would be pretty trivial... just run a regex to 
replace
	"\b(\d+)\.\."
with
	"\1 .. "
and you're good to go.

(Or if you want more accuracy, just take the compiler output and 
feed it back with a fix -- that would work too.)

The benefit, though, is that now you can do maximal munch without 
worrying about this edge case... which sure makes it easier to 
make a lexer.

Thoughts?

Jul 07 2012
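
The regex fix proposed in the post above can be tried out in D itself. This is only a rough sketch: `spaceOutRanges` is a made-up helper name, and the pattern knows nothing about string literals or comments, so it would rewrite matches there as well.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;

/// Puts spaces around ".." wherever it directly follows an integer literal.
/// Hypothetical helper, not part of any existing tool.
string spaceOutRanges(string source)
{
    // Same pattern as in the post; `$1` is how std.regex spells `\1`.
    auto pattern = regex(`\b(\d+)\.\.`);
    return replaceAll(source, pattern, "$1 .. ");
}

void main()
{
    writeln(spaceOutRanges("auto s = arr[0..2];")); // auto s = arr[0 .. 2];
}
```

Because of `\b`, an identifier that merely ends in a digit (say `x1..y`) is left alone.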

Alex Rønne Petersen <alex lycus.org> writes:

On 07-07-2012 23:39, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing 0..2  as a
 range, and instead just said "invalid floating-point number"?

 Fixing it en masse would be pretty trivial... just run a regex to replace
      "\b(\d+)\.\."
 with
      "\1 .. "
 and you're good to go.

 (Or if you want more accuracy, just take the compiler output and feed it
 back with a fix -- that would work too.)

 The benefit, though, is that now you can do maximal munch without
 worrying about this edge case... which sure makes it easier to make a
 lexer.

 Thoughts?

... why is this even done at the lexical stage? It should be done at the 
parsing stage if anything.

-- 
Alex Rønne Petersen
alex lycus.org
http://lycus.org

Jul 07 2012

"Mehrdad" <wfunction hotmail.com> writes:

On Saturday, 7 July 2012 at 21:41:44 UTC, Alex Rønne Petersen 
wrote:
 ... why is this even done at the lexical stage? It should be 
 done at the parsing stage if anything.

Well, even better then -- it makes it easier to make a parser.

That said, what's wrong with doing it in the lexical stage?

Jul 07 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jul 07, 2012 at 11:41:43PM +0200, Alex Rønne Petersen wrote:
 On 07-07-2012 23:39, Mehrdad wrote:
This might sound silly, but how about if D stopped allowing 0..2  as a
range, and instead just said "invalid floating-point number"?


[...]
 ... why is this even done at the lexical stage? It should be done at
 the parsing stage if anything.

[...]

This is because the lexer can mistakenly identify it as "0." followed by
".2" instead of "0" followed by ".." followed by "2".

IMAO, this problem is caused by floating point notational stupidities
like 0. and .1, especially the former. Get rid of the former (and
optionally the latter) will fix a whole bunch of lexer pain in D.


T

-- 
They say that "guns don't kill people, people kill people." Well I think
the gun helps. If you just stood there and yelled BANG, I don't think
you'd kill too many people. -- Eddie Izzard, Dressed to Kill

Jul 07 2012
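
To make the misidentification concrete, here is a minimal sketch of a greedy ("maximal munch") number scanner that accepts both `0.` and `.2` as float literals. The names `munchNumber` and `naiveLex` are hypothetical, and anything that is not a number is treated as a one-character token just to keep the demo short.

```d
import std.ascii : isDigit;
import std.stdio : writeln;

/// Length of the longest prefix of `s` that is a number when
/// "0.", ".5" and "0.5" are all accepted as float literals.
size_t munchNumber(string s)
{
    size_t i = 0;
    while (i < s.length && isDigit(s[i]))
        ++i;                              // integer part, may be empty
    immutable intDigits = i;
    if (i < s.length && s[i] == '.')
    {
        size_t j = i + 1;
        while (j < s.length && isDigit(s[j]))
            ++j;                          // fraction part, may be empty
        // Accept the dot if there is a digit on at least one side of it.
        if (intDigits > 0 || j > i + 1)
            i = j;
    }
    return i;
}

string[] naiveLex(string s)
{
    string[] tokens;
    while (s.length)
    {
        auto n = munchNumber(s);
        if (n == 0)
            n = 1;                        // demo only: one token per character
        tokens ~= s[0 .. n];
        s = s[n .. $];
    }
    return tokens;
}

void main()
{
    writeln(naiveLex("0..2")); // ["0.", ".2"]
}
```

The greedy scanner happily takes `0.` as a complete float, so the range operator never gets a chance to be seen.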

deadalnix <deadalnix gmail.com> writes:

On 08/07/2012 00:04, H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:41:43PM +0200, Alex Rønne Petersen wrote:
 On 07-07-2012 23:39, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing 0..2  as a
 range, and instead just said "invalid floating-point number"?


 [...]
 ... why is this even done at the lexical stage? It should be done at
 the parsing stage if anything.

 [...]

 This is because the lexer can mistakenly identify it as "0." followed by
 ".2" instead of "0" followed by ".." followed by "2".

 IMAO, this problem is caused by floating point notational stupidities
 like 0. and .1, especially the former. Get rid of the former (and
 optionally the latter) will fix a whole bunch of lexer pain in D.


 T

0. should be banned because of UFCS anyway.

Jul 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, July 08, 2012 03:29:37 deadalnix wrote:
 On 08/07/2012 00:04, H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:41:43PM +0200, Alex Rønne Petersen wrote:
 On 07-07-2012 23:39, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing 0..2 as a
 range, and instead just said "invalid floating-point number"?

 [...]

 ... why is this even done at the lexical stage? It should be done at
 the parsing stage if anything.

 [...]

 This is because the lexer can mistakenly identify it as "0." followed by
 ".2" instead of "0" followed by ".." followed by "2".

 IMAO, this problem is caused by floating point notational stupidities
 like 0. and .1, especially the former. Get rid of the former (and
 optionally the latter) will fix a whole bunch of lexer pain in D.

 T

 0. should be banned because of UFCS anyway.

If you do 0.func(), UFCS works. If you do 0.f(), UFCS works (so 0.f as a
literal is illegal), but there's nothing about UFCS preventing 0. from
working as long as there's nothing immediately after it which could be
considered a function (which would just be any letter and _). So, UFCS really
isn't an argument for banning 0. as a literal. It's the facts that it's
ludicrous to accept a partial literal and that it causes parsing problems
which make it so that it should be banned.

- Jonathan M Davis

Jul 07 2012
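
A small illustration of the UFCS cases described above, as a sketch only: `twice` and `f` are hypothetical free functions that exist purely so that there is something to call on an integer literal.

```d
import std.stdio : writeln;

// Hypothetical free functions, only here to have something to call.
int twice(int x) { return 2 * x; }
int f(int x) { return x + 1; }

void main()
{
    // A letter or '_' directly after the dot makes this a call on the
    // integer literal rather than part of a floating-point literal.
    writeln(3.twice()); // 6
    writeln(0.f());     // 1: lexed as `0`, `.`, `f`, not as a float `0.f`
}
```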

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing   0..2
 as a range, and instead just said "invalid floating-point number"?

[...]

I like writing 0..2 as a range. It's especially nice in array slice
notation, where you _want_ to have it as concise as possible.

OTOH, having implemented a D lexer before (just for practice, not
production quality), I do see how ambiguities with floating-point
numbers can cause a lot of code convolutions.

But I'm gonna have to say no to this one; *I* think a better solution
would be to prohibit things like 0. or 1. in a float literal. Either
follow it with a digit, or don't write the dot. This will also save us a
lot of pain in the UFCS department, where 4.sqrt is currently a pain to
lex. Once this is done, 0..2 is no longer ambiguous, and any respectable
DFA lexer should be able to handle it with ease.


T

-- 
If a person can't communicate, the very least he could do is to shut up. -- Tom
Lehrer, on people who bemoan their communication woes with their loved ones.

Jul 07 2012
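
A rough sketch of what the proposal above would mean for a scanner: a dot belongs to the number only when a digit follows it. `munchNumber` and `lex` are hypothetical names; as a simplifying assumption, the leading-dot form (`.1`) is not recognized either, and everything apart from numbers and `..` becomes a one-character token.

```d
import std.ascii : isDigit;
import std.stdio : writeln;

/// Length of the number at the start of `s`, where the dot is part of
/// the literal only if a digit comes right after it.
size_t munchNumber(string s)
{
    size_t i = 0;
    while (i < s.length && isDigit(s[i]))
        ++i;
    // This check peeks one character past the dot.
    if (i > 0 && i + 1 < s.length && s[i] == '.' && isDigit(s[i + 1]))
    {
        i += 2;
        while (i < s.length && isDigit(s[i]))
            ++i;
    }
    return i;
}

string[] lex(string s)
{
    string[] tokens;
    while (s.length)
    {
        size_t n = munchNumber(s);
        if (n == 0)
            n = (s.length >= 2 && s[0 .. 2] == "..") ? 2 : 1;
        tokens ~= s[0 .. n];
        s = s[n .. $];
    }
    return tokens;
}

void main()
{
    writeln(lex("0..2"));   // ["0", "..", "2"]
    writeln(lex("1.5..2")); // ["1.5", "..", "2"]
}
```

Note that the scanner still has to look at the character after the dot before deciding, which is the one-character lookahead discussed elsewhere in this thread.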

"Mehrdad" <wfunction hotmail.com> writes:

On Saturday, 7 July 2012 at 22:00:43 UTC, H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing   
 0..2
 as a range, and instead just said "invalid floating-point 
 number"?

 [...]

 I like writing 0..2 as a range. It's especially nice in array 
 slice notation, where you _want_ to have it as concise as 
 possible.

Hmm... true..

 OTOH, having implemented a D lexer before (just for practice, 
 not production quality), I do see how ambiguities with 
 floating-point numbers can cause a lot of code convolutions.

Yeah that's exactly what happened to me lol.
(Mainly the problem I ran into was that I was REALLY trying to 
avoid extra lookaheads if possible, since I was sticking to the 
range interface of front/popFront, and trying not to consume more 
than I can handle... and this was the edge case that broke it.)

 But I'm gonna have to say no to this one; *I* think a better 
 solution would be to prohibit things like 0. or 1. in a float 
 literal. Either follow it with a digit, or don't write the dot. 
 This will also save us a lot of pain in the UFCS department, 
 where 4.sqrt is currently a pain to lex. Once this is done, 
 0..2 is no longer ambiguous, and any respectable DFA lexer 
 should be able to handle it with ease.

Good idea, I like it too. How about just disallowing trailing 
decimal points then?

Jul 07 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/08/2012 12:23 AM, Mehrdad wrote:
 On Saturday, 7 July 2012 at 22:00:43 UTC, H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing 0..2
 as a range, and instead just said "invalid floating-point number"?

 [...]

 I like writing 0..2 as a range. It's especially nice in array slice
 notation, where you _want_ to have it as concise as possible.

 Hmm... true..

 OTOH, having implemented a D lexer before (just for practice, not
 production quality), I do see how ambiguities with floating-point
 numbers can cause a lot of code convolutions.

 Yeah that's exactly what happened to me lol.
 (Mainly the problem I ran into was that I was REALLY trying to avoid
 extra lookaheads if possible, since I was sticking to the range
 interface of front/popFront, and trying not to consume more than I can
 handle... and this was the edge case that broke it.)

You could go like this:

switch(input.front) {
     case '0'..'9':
         bool consumedtrailingdot;
         output.put(parseNumber(input, consumedtrailingdot));
         if(!consumedtrailingdot) continue;
         if(input.front != '.') {
             output.put(Token("."));
             continue;
         }
         input.popFront();
         if(input.front != '.') {
             output.put(Token(".."));
             continue;
         }
         output.put(Token("..."));
         continue;
}

 But I'm gonna have to say no to this one; *I* think a better solution
 would be to prohibit things like 0. or 1. in a float literal. Either
 follow it with a digit, or don't write the dot. This will also save us
 a lot of pain in the UFCS department, where 4.sqrt is currently a pain
 to lex. Once this is done, 0..2 is no longer ambiguous, and any
 respectable DFA lexer should be able to handle it with ease.

 Good idea, I like it too. How about just disallowing trailing decimal
 points then?

+1.

Jul 07 2012
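
The snippet in the post above leaves `parseNumber` undefined. One possible way to fill it in, using nothing but `empty`, `front` and `popFront`, might look like the sketch below. Everything here is an assumption made for illustration: the string-based tokens, the `lex` driver, and the decision not to treat a trailing-dot form such as `1.` as a float.

```d
import std.ascii : isDigit;
import std.conv : text;
import std.range.primitives : empty, front, popFront;

/// Reads digits and, if a dot is followed by a digit, a fraction.
/// A dot that was consumed but turned out not to belong to the number
/// is reported through `consumedTrailingDot`.
string parseNumber(R)(ref R input, out bool consumedTrailingDot)
{
    string digits;
    while (!input.empty && isDigit(input.front))
    {
        digits ~= input.front;
        input.popFront();
    }
    if (input.empty || input.front != '.')
        return digits;

    input.popFront();                   // the dot is consumed for good
    if (input.empty || !isDigit(input.front))
    {
        consumedTrailingDot = true;     // it was not part of the number
        return digits;
    }
    digits ~= '.';
    while (!input.empty && isDigit(input.front))
    {
        digits ~= input.front;
        input.popFront();
    }
    return digits;
}

string[] lex(R)(R input)
{
    string[] tokens;
    while (!input.empty)
    {
        string dots;
        if (isDigit(input.front))
        {
            bool consumedTrailingDot;
            tokens ~= parseNumber(input, consumedTrailingDot);
            if (consumedTrailingDot)
                dots = ".";             // the dot parseNumber already ate
        }
        else if (input.front != '.')
        {
            tokens ~= text(input.front); // demo only: one token per character
            input.popFront();
            continue;
        }
        // Collect ".", ".." or "...", counting any dot eaten earlier.
        while (dots.length < 3 && !input.empty && input.front == '.')
        {
            dots ~= '.';
            input.popFront();
        }
        if (dots.length)
            tokens ~= dots;
    }
    return tokens;
}

void main()
{
    assert(lex("2..3") == ["2", "..", "3"]);
    assert(lex("1.5..2") == ["1.5", "..", "2"]);
}
```

The flag carries the information that would otherwise need a second character of lookahead: the number parser commits to eating the dot and tells the caller afterwards what it was.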

"Mehrdad" <wfunction hotmail.com> writes:

On Saturday, 7 July 2012 at 22:54:15 UTC, Timon Gehr wrote:
 On 07/08/2012 12:23 AM, Mehrdad wrote:

 You could go like this:

 switch(input.front) {
     case '0'..'9':
         bool consumedtrailingdot;
         output.put(parseNumber(input, consumedtrailingdot));
         if(!consumedtrailingdot) continue;
         if(input.front != '.') {
             output.put(Token("."));
             continue;
         }
         input.popFront();
         if(input.front != '.') {
             output.put(Token(".."));
             continue;
         }
         output.put(Token("..."));
         continue;
 }

You kinda glossed over the crucial detail in parseNumber().  ;)

What happens if it sees   "2..3" ?

Then it /must/ have eaten the first period (it can't see the 
second period otherwise)... in which case now you have no idea 
that happened.

Of course, it's trivial to fix with an extra lookahead, but that 
would require using a forward range instead of an input range. 
(Which, again, is easy to do with an adapter -- what I ended up 
doing -- but the point is, it makes it harder to lex the code 
with just an input range.)

Jul 08 2012
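
The adapter mentioned in the post above is not shown in the thread. A minimal sketch of one way to get a single element of lookahead on top of a plain input range could look like this; `Peekable`, `peek` and `canPeek` are invented names, not Phobos API.

```d
import std.range.primitives : ElementType, isInputRange,
                              empty, front, popFront;

/// Wraps an input range and buffers one extra element so that the
/// element after `front` can be inspected without consuming it.
struct Peekable(R) if (isInputRange!R)
{
    private R src;
    private ElementType!R cur, next;
    private bool hasCur, hasNext;

    this(R r)
    {
        src = r;
        fill(cur, hasCur);
        fill(next, hasNext);
    }

    // Pulls one element out of the wrapped range, if there is one.
    private void fill(ref ElementType!R slot, ref bool has)
    {
        has = !src.empty;
        if (has)
        {
            slot = src.front;
            src.popFront();
        }
    }

    @property bool empty() { return !hasCur; }
    @property ElementType!R front() { return cur; }
    @property bool canPeek() { return hasNext; }
    @property ElementType!R peek() { return next; }

    void popFront()
    {
        cur = next;
        hasCur = hasNext;
        fill(next, hasNext);
    }
}

void main()
{
    auto p = Peekable!string("2..3");
    assert(p.front == '2' && p.peek == '.');
    p.popFront();
    assert(p.front == '.' && p.peek == '.'); // two dots: a range, not "2."
}
```

The cost is that the adapter always reads one element further than the lexer has asked for, which is the dependence on future input that the post objects to.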

"Mehrdad" <wfunction hotmail.com> writes:

On Saturday, 7 July 2012 at 22:54:15 UTC, Timon Gehr wrote:
 On 07/08/2012 12:23 AM, Mehrdad wrote:
 On Saturday, 7 July 2012 at 22:00:43 UTC, H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing 
 0..2
 as a range, and instead just said "invalid floating-point 
 number"?

 [...]

 I like writing 0..2 as a range. It's especially nice in array 
 slice
 notation, where you _want_ to have it as concise as possible.

 Hmm... true..

 OTOH, having implemented a D lexer before (just for practice, 
 not
 production quality), I do see how ambiguities with 
 floating-point
 numbers can cause a lot of code convolutions.

 Yeah that's exactly what happened to me lol.
 (Mainly the problem I ran into was that I was REALLY trying to 
 avoid
 extra lookaheads if possible, since I was sticking to the range
 interface of front/popFront, and trying not to consume more 
 than I can
 handle... and this was the edge case that broke it.)

 You could go like this:

 switch(input.front) {
     case '0'..'9':
         bool consumedtrailingdot;
         output.put(parseNumber(input, consumedtrailingdot));
         if(!consumedtrailingdot) continue;
         if(input.front != '.') {
             output.put(Token("."));
             continue;
         }
         input.popFront();
         if(input.front != '.') {
             output.put(Token(".."));
             continue;
         }
         output.put(Token("..."));
         continue;
 }


Right, it's trivial to fix with an extra state variable like 
'consumedtrailingdot'.

The point was, it requires an extra lookahead character, which I 
was trying to avoid (mainly for fun).

In this case, it doesn't really make a difference in practice -- 
but in general I don't like lookaheads, because depending on 
future data makes it hard for e.g. the user to enter data via the 
console.

Jul 08 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Jul 08, 2012 at 09:59:38AM +0200, Mehrdad wrote:
[...]
 Right, it's trivial to fix with an extra state variable like
 'consumedtrailingdot'.

This is eventually what I did in my own D lexer.

Well, actually, I kinda blasted an ant with an M16... I had a queue of
backlogged tokens which getNext will return if non-empty, and when I
recognized things like 1..4, I would push 2 or 3 tokens onto the backlog
queue, so no extra state is needed (although the backlog queue itself is
really just another form of extra state).


 The point was, it requires an extra lookahead character, which I was
 trying to avoid (mainly for fun).
 
 In this case, it doesn't really make a difference in practice -- but
 in general I don't like lookaheads, because depending on future data
 makes it hard for e.g. the user to enter data via the console.

In my case, the extra lookahead only happens when the lexer sees string
prefixes like "4..", which doesn't usually happen at the end of a line.
In all other cases, no lookahead is actually necessary, so except for
the very rare case, entering data via console actually works just fine.


T

-- 
I am a consultant. My job is to make your job redundant. -- Mr Tom

Jul 09 2012
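
The backlog-queue idea from the post above might be sketched as follows. This is not the poster's actual lexer: the `Lexer` struct, the string tokens and the one-character fallback for non-numbers are all assumptions made to keep the example short.

```d
import std.ascii : isDigit;

struct Lexer
{
    string src;       // remaining input
    string[] backlog; // tokens already recognized but not yet handed out

    /// Returns the next token, or null once the input is exhausted.
    string getNext()
    {
        if (backlog.length == 0)
            scan();
        if (backlog.length == 0)
            return null;
        auto token = backlog[0];
        backlog = backlog[1 .. $];
        return token;
    }

    // Recognizes one or more tokens and appends them to the backlog.
    private void scan()
    {
        if (src.length == 0)
            return;
        size_t i = 0;
        while (i < src.length && isDigit(src[i]))
            ++i;
        if (i == 0)
        {
            push(1);                // demo only: one token per character
            return;
        }
        if (i + 1 < src.length && src[i] == '.' && isDigit(src[i + 1]))
        {
            i += 2;                 // fraction digits: a float like 1.5
            while (i < src.length && isDigit(src[i]))
                ++i;
        }
        push(i);
        if (src.length >= 2 && src[0 .. 2] == "..")
            push(2);                // queue the ".." that followed the number
    }

    private void push(size_t n)
    {
        backlog ~= src[0 .. n];
        src = src[n .. $];
    }
}

void main()
{
    auto lx = Lexer("1..4");
    assert(lx.getNext() == "1");
    assert(lx.getNext() == "..");   // served from the backlog
    assert(lx.getNext() == "4");
    assert(lx.getNext() is null);
}
```

Since `scan` may recognize several tokens in one go, the caller never needs to know that the number and the `..` were examined together.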

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 9 July 2012 at 17:06:44 UTC, H. S. Teoh wrote:
 In this case, it doesn't really make a difference in practice

 In my case, the extra lookahead only happens when the lexer 
 sees string prefixes like "4..", which doesn't usually happen 
 at the end of a line. In all other cases, no lookahead is 
 actually necessary, so except for the very rare case, entering 
 data via console actually works just fine.

Yup, hence my line above. :P

Jul 09 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/07/2012 11:39 PM, Mehrdad wrote:
 This might sound silly,

+1.

 but how about if D stopped allowing 0..2 as a
 range, and instead just said "invalid floating-point number"?

 Fixing it en masse would be pretty trivial... just run a regex to replace
 "\b(\d+)\.\."
 with
 "\1 .. "
 and you're good to go.

 (Or if you want more accuracy, just take the compiler output and feed it
 back with a fix -- that would work too.)

 The benefit, though, is that now you can do maximal munch without
 worrying about this edge case... which sure makes it easier to make a
 lexer.

 Thoughts?

It does not make it easier to create a lexer, because this is not
actually an edge case worth explicitly testing for.

switch(input.front){
     case '0'..'9': ...
     case 'a'..'f', 'A'..'F': ...
     case '.': if('0'>input[1]||input[1]>'9') break;
         ...
}

Jul 07 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 07/08/2012 12:12 AM, Timon Gehr wrote:
 On 07/07/2012 11:39 PM, Mehrdad wrote:
 This might sound silly,

 +1.

 but how about if D stopped allowing 0..2 as a
 range, and instead just said "invalid floating-point number"?

 Fixing it en masse would be pretty trivial... just run a regex to replace
 "\b(\d+)\.\."
 with
 "\1 .. "
 and you're good to go.

 (Or if you want more accuracy, just take the compiler output and feed it
 back with a fix -- that would work too.)

 The benefit, though, is that now you can do maximal munch without
 worrying about this edge case... which sure makes it easier to make a
 lexer.

 Thoughts?

 It does not make it easier to create a lexer, because this is not
 actually an edge case worth explicitly testing for.

 switch(input.front){

I meant input[0]. No need for decoding.

      case '0'..'9': ...
      case 'a'..'f', 'A'..'F': ...
      case '.': if('0'>input[1]||input[1]>'9') break;
          ...
 }

Jul 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, July 07, 2012 15:01:50 H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing   0..2
 as a range, and instead just said "invalid floating-point number"?

 
 [...]
 
 I like writing 0..2 as a range. It's especially nice in array slice
 notation, where you _want_ to have it as concise as possible.
 
 OTOH, having implemented a D lexer before (just for practice, not
 production quality), I do see how ambiguities with floating-point
 numbers can cause a lot of code convolutions.
 
 But I'm gonna have to say no to this one; *I* think a better solution
 would be to prohibit things like 0. or 1. in a float literal. Either
 follow it with a digit, or don't write the dot. This will also save us a
 lot of pain in the UFCS department, where 4.sqrt is currently a pain to
 lex. Once this is done, 0..2 is no longer ambiguous, and any respectable
 DFA lexer should be able to handle it with ease.

+1

I think that it's ridiculous that 1. and .1 are legal. 1.f was fixed, so I was 
shocked to find out recently that 1. and .1 weren't.

- Jonathan M Davis

Jul 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, July 07, 2012 15:20:28 Jonathan M Davis wrote:
 On Saturday, July 07, 2012 15:01:50 H. S. Teoh wrote:
 On Sat, Jul 07, 2012 at 11:39:59PM +0200, Mehrdad wrote:
 This might sound silly, but how about if D stopped allowing   0..2
 as a range, and instead just said "invalid floating-point number"?

 
 [...]
 
 I like writing 0..2 as a range. It's especially nice in array slice
 notation, where you _want_ to have it as concise as possible.
 
 OTOH, having implemented a D lexer before (just for practice, not
 production quality), I do see how ambiguities with floating-point
 numbers can cause a lot of code convolutions.
 
 But I'm gonna have to say no to this one; *I* think a better solution
 would be to prohibit things like 0. or 1. in a float literal. Either
 follow it with a digit, or don't write the dot. This will also save us a
 lot of pain in the UFCS department, where 4.sqrt is currently a pain to
 lex. Once this is done, 0..2 is no longer ambiguous, and any respectable
 DFA lexer should be able to handle it with ease.

 
 +1
 
 I think that it's ridiculous that 1. and .1 are legal. 1.f was fixed, so I
 was shocked to find out recently that 1. and .1 weren't.

There's an existing enhancement request for it:

http://d.puremagic.com/issues/show_bug.cgi?id=6277

- Jonathan M Davis

Jul 07 2012
