www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 1466] New: Spec claims maximal munch technique always works: not for "1..3"

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1466

           Summary: Spec claims maximal munch technique always works: not
                    for "1..3"
           Product: D
           Version: 1.020
          Platform: All
               URL: http://digitalmars.com/d/1.0/lex.html
        OS/Version: All
            Status: NEW
          Keywords: spec
          Severity: minor
          Priority: P3
         Component: www.digitalmars.com
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: deewiant gmail.com


A snippet from http://digitalmars.com/d/1.0/lex.html:

"The source text is split into tokens using the maximal munch technique, i.e.,
the lexical analyzer tries to make the longest token it can."

Relevant parts of the grammar:

Token:
        FloatLiteral
        ..

FloatLiteral:
        Float

Float:
        DecimalFloat

DecimalFloat:
        DecimalDigits .
        . Decimal

DecimalDigits:
        DecimalDigit

DecimalDigit:
        NonZeroDigit

Decimal:
        NonZeroDigit

Based on the above, if a lexer encounters "1..3", for instance in a slice:
"foo[1..3]", it should, using the maximal munch technique, make the longest
possible token from "1..3": this is the Float "1.". Next, it should come up
with the Float ".3".

Of course, this isn't currently happening, and would be problematic if it did.
But, according to the grammar, that's what should happen, unless I'm missing
something.

Either some exception needs to be made or remove the "DecimalDigits ."
possibility from the grammar and the compiler.


-- 
Sep 01 2007
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to d-bugmail puremagic.com,

 http://d.puremagic.com/issues/show_bug.cgi?id=1466
 
 Summary: Spec claims maximal munch technique always works:
 not
 for "1..3"
 Product: D
 Version: 1.020
 Platform: All
 URL: http://digitalmars.com/d/1.0/lex.html
 OS/Version: All
 Status: NEW
 Keywords: spec
 Severity: minor
 Priority: P3
 Component: www.digitalmars.com
 AssignedTo: bugzilla digitalmars.com
 ReportedBy: deewiant gmail.com
 A snippet from http://digitalmars.com/d/1.0/lex.html:
 
 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest token
 it can."
 
 Relevant parts of the grammar:
 
 Token:
 FloatLiteral
 ..
 FloatLiteral:
 Float
 Float:
 DecimalFloat
 DecimalFloat:
 DecimalDigits .
 . Decimal
 DecimalDigits:
 DecimalDigit
 DecimalDigit:
 NonZeroDigit
 Decimal:
 NonZeroDigit
 Based on the above, if a lexer encounters "1..3", for instance in a
 slice: "foo[1..3]", it should, using the maximal munch technique, make
 the longest possible token from "1..3": this is the Float "1.". Next,
 it should come up with the Float ".3".
 
 Of course, this isn't currently happening, and would be problematic if
 it did. But, according to the grammar, that's what should happen,
 unless I'm missing something.
 
 Either some exception needs to be made or remove the "DecimalDigits ."
 possibility from the grammar and the compiler.
 

or make it "DecimalDigits . [^.]" where the ^ production is non consuming.
Sep 01 2007
parent Jascha Wetzel <"[firstname]" mainia.de> writes:
BCS wrote:
 Reply to d-bugmail puremagic.com,
 
 http://d.puremagic.com/issues/show_bug.cgi?id=1466

 Summary: Spec claims maximal munch technique always works:
 not
 for "1..3"
 Product: D
 Version: 1.020
 Platform: All
 URL: http://digitalmars.com/d/1.0/lex.html
 OS/Version: All
 Status: NEW
 Keywords: spec
 Severity: minor
 Priority: P3
 Component: www.digitalmars.com
 AssignedTo: bugzilla digitalmars.com
 ReportedBy: deewiant gmail.com
 A snippet from http://digitalmars.com/d/1.0/lex.html:

 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest token
 it can."

 Relevant parts of the grammar:

 Token:
 FloatLiteral
 ..
 FloatLiteral:
 Float
 Float:
 DecimalFloat
 DecimalFloat:
 DecimalDigits .
 . Decimal
 DecimalDigits:
 DecimalDigit
 DecimalDigit:
 NonZeroDigit
 Decimal:
 NonZeroDigit
 Based on the above, if a lexer encounters "1..3", for instance in a
 slice: "foo[1..3]", it should, using the maximal munch technique, make
 the longest possible token from "1..3": this is the Float "1.". Next,
 it should come up with the Float ".3".

 Of course, this isn't currently happening, and would be problematic if
 it did. But, according to the grammar, that's what should happen,
 unless I'm missing something.

 Either some exception needs to be made or remove the "DecimalDigits ."
 possibility from the grammar and the compiler.

or make it "DecimalDigits . [^.]" where the ^ production is non consuming.

it is possible to parse D using a maximal munch lexer - see the seatd grammar for an example. it's a matter of what lexemes exactly you choose. in this particular case, the float lexemes need to be split, such that those floats with a trailing dot are not matched by a single lexeme.
Sep 01 2007
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to d-bugmail puremagic.com,


 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest token
 it can."
 

another case: actual !isGood -> ! isGood MaxMunch !isGood -> !is Good
Sep 02 2007
parent reply Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
BCS wrote:
 Reply to d-bugmail puremagic.com,
 
 
 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest token
 it can."

another case: actual !isGood -> ! isGood MaxMunch !isGood -> !is Good

I might be wrong, but my guess is that 'is' is always treated as its own entity, so that '!is' is really ('!' 'is'). Its not a bad practice when one has keyword-operators to do this, to avoid MM screwing up user's identifiers. But, as I haven't taken any trips through the DMD frontend source, I might be completely off. -- Chris Nicholson-Sauls
Sep 02 2007
parent reply BCS <ao pathlink.com> writes:
Reply to Chris Nicholson-Sauls,

 BCS wrote:
 
 Reply to d-bugmail puremagic.com,
 
 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest
 token it can."
 

actual !isGood -> ! isGood MaxMunch !isGood -> !is Good

own entity, so that '!is' is really ('!' 'is'). Its not a bad

That's how I spoted it in the first place
 practice when one has keyword-operators to do this, to avoid MM
 screwing up user's identifiers.  But, as I haven't taken any trips
 through the DMD frontend source, I might be completely off.
 

For that to work the lexer has to keep track of whitespace. :-b
Sep 02 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
BCS wrote:
 For that to work the lexer has to keep track of whitespace. :-b

you can also match "(!is)[^_a-zA-Z0-9]", advancing the input only for the submatch. or use a single-character lookahead.
Sep 03 2007
parent BCS <ao pathlink.com> writes:
Reply to Jascha,

 BCS wrote:
 
 For that to work the lexer has to keep track of whitespace. :-b
 

the submatch. or use a single-character lookahead.

That's what I'm hoping to do sooner or later. I already do somthing like that for ".." vs "."
Sep 03 2007
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1466


jascha mainia.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jascha mainia.de




------- Comment #5 from jascha mainia.de  2007-09-03 06:08 -------
(In reply to comment #0)
 A snippet from http://digitalmars.com/d/1.0/lex.html:
 
 "The source text is split into tokens using the maximal munch technique, i.e.,
 the lexical analyzer tries to make the longest token it can."
 
 Relevant parts of the grammar:
 
 Token:
         FloatLiteral
         ..
 
 FloatLiteral:
         Float
 
 Float:
         DecimalFloat
 
 DecimalFloat:
         DecimalDigits .
         . Decimal
 
 DecimalDigits:
         DecimalDigit
 
 DecimalDigit:
         NonZeroDigit
 
 Decimal:
         NonZeroDigit
 
 Based on the above, if a lexer encounters "1..3", for instance in a slice:
 "foo[1..3]", it should, using the maximal munch technique, make the longest
 possible token from "1..3": this is the Float "1.". Next, it should come up
 with the Float ".3".
 
 Of course, this isn't currently happening, and would be problematic if it did.
 But, according to the grammar, that's what should happen, unless I'm missing
 something.
 
 Either some exception needs to be made or remove the "DecimalDigits ."
 possibility from the grammar and the compiler.
 

(In reply to comment #1)
 Reply to d-bugmail puremagic.com,
 
 http://d.puremagic.com/issues/show_bug.cgi?id=1466
 
 Summary: Spec claims maximal munch technique always works:
 not
 for "1..3"
 Product: D
 Version: 1.020
 Platform: All
 URL: http://digitalmars.com/d/1.0/lex.html
 OS/Version: All
 Status: NEW
 Keywords: spec
 Severity: minor
 Priority: P3
 Component: www.digitalmars.com
 AssignedTo: bugzilla digitalmars.com
 ReportedBy: deewiant gmail.com
 A snippet from http://digitalmars.com/d/1.0/lex.html:
 
 "The source text is split into tokens using the maximal munch
 technique, i.e., the lexical analyzer tries to make the longest token
 it can."
 
 Relevant parts of the grammar:
 
 Token:
 FloatLiteral
 ..
 FloatLiteral:
 Float
 Float:
 DecimalFloat
 DecimalFloat:
 DecimalDigits .
 . Decimal
 DecimalDigits:
 DecimalDigit
 DecimalDigit:
 NonZeroDigit
 Decimal:
 NonZeroDigit
 Based on the above, if a lexer encounters "1..3", for instance in a
 slice: "foo[1..3]", it should, using the maximal munch technique, make
 the longest possible token from "1..3": this is the Float "1.". Next,
 it should come up with the Float ".3".
 
 Of course, this isn't currently happening, and would be problematic if
 it did. But, according to the grammar, that's what should happen,
 unless I'm missing something.
 
 Either some exception needs to be made or remove the "DecimalDigits ."
 possibility from the grammar and the compiler.
 

or make it "DecimalDigits . [^.]" where the ^ production is non consuming.

it is possible to parse D using a maximal munch lexer - see the seatd grammar for an example. it's a matter of what lexemes exactly you choose. in this particular case, the float lexemes need to be split, such that those floats with a trailing dot are not matched by a single lexeme. --
Sep 03 2007
prev sibling next sibling parent reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1466





------- Comment #7 from matti.niemenmaa+dbugzilla iki.fi  2007-09-09 12:26
-------
Here's some example code underlining the issue:

class Foo {
        static int opSlice(double a, double b) {
                return 0;
        }
}

void main() {
        // works
        assert (Foo[0. .. 1] == 0);
        // thinks it's [0 ... 1], no maximal munch taking place
        assert (Foo[0... 1] == 0);
}


-- 
Sep 09 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
d-bugmail puremagic.com wrote:
         // thinks it's [0 ... 1], no maximal munch taking place
         assert (Foo[0... 1] == 0);
 }

this *is* maximal munch taking place. because of the ".." lexeme, float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".
Sep 09 2007
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Jascha,

 d-bugmail puremagic.com wrote:
 
 // thinks it's [0 ... 1], no maximal munch taking place
 assert (Foo[0... 1] == 0);
 }

float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".

But is it the the correct way to do it? (Not is is doing what the spec says, but is it doing what it should be designed to do)
Sep 09 2007
parent Jascha Wetzel <"[firstname]" mainia.de> writes:
BCS wrote:
 Reply to Jascha,
 
 d-bugmail puremagic.com wrote:

 // thinks it's [0 ... 1], no maximal munch taking place
 assert (Foo[0... 1] == 0);
 }

float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".

But is it the the correct way to do it? (Not is is doing what the spec says, but is it doing what it should be designed to do)

i think "ptr ! is null" shouldn't be allowed, because it suggests that "!" and "is" are separate operators and "is null" is an unary expression. if your lexer supports lookaheads (a single character is enough), you can match float literals as lexemes. this is also what DMD does.
Sep 09 2007
prev sibling parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Jascha Wetzel wrote:
 d-bugmail puremagic.com wrote:
         // thinks it's [0 ... 1], no maximal munch taking place
         assert (Foo[0... 1] == 0);
 }

this *is* maximal munch taking place. because of the ".." lexeme, float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".

this was formulated poorly. float literals *may* be considered context free to solve this problem...
Sep 09 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Jascha Wetzel wrote:
 Jascha Wetzel wrote:
 d-bugmail puremagic.com wrote:
         // thinks it's [0 ... 1], no maximal munch taking place
         assert (Foo[0... 1] == 0);
 }

this *is* maximal munch taking place. because of the ".." lexeme, float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".

this was formulated poorly. float literals *may* be considered context free to solve this problem...

Exactly. But the way I read the spec, float literals are considered tokens in and of themselves. Maybe I misunderstand, but it could use some clarification in that case: lex.html specifically says that the lexer splits the code into tokens, one of which is "0.", with maximal munch. This isn't a /problem/ per se. In the extreme case, of course it is possible to parse D with maximal munch by considering every character a lexeme of its own and figuring everything else out thereafter. I'm just saying that the spec seems to contradict itself in saying that maximal munch should be used, and that it should match (among other things) "0." as one token. If you do things that way, it doesn't work the way DMD currently does it. If you match "0." as "0" and "." and construct a float literal from that later, it works. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Sep 10 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Matti Niemenmaa wrote:
 Jascha Wetzel wrote:
 Jascha Wetzel wrote:
 d-bugmail puremagic.com wrote:
         // thinks it's [0 ... 1], no maximal munch taking place
         assert (Foo[0... 1] == 0);
 }

float literals are not lexemes. they are context free production rules consisting of multiple lexemes. therefore "0." consists of two lexemes and "..." wins the max munch over ".".

free to solve this problem...

Exactly. But the way I read the spec, float literals are considered tokens in and of themselves. Maybe I misunderstand, but it could use some clarification in that case: lex.html specifically says that the lexer splits the code into tokens, one of which is "0.", with maximal munch. This isn't a /problem/ per se. In the extreme case, of course it is possible to parse D with maximal munch by considering every character a lexeme of its own and figuring everything else out thereafter. I'm just saying that the spec seems to contradict itself in saying that maximal munch should be used, and that it should match (among other things) "0." as one token. If you do things that way, it doesn't work the way DMD currently does it. If you match "0." as "0" and "." and construct a float literal from that later, it works.

i agree that it's not clear. one can argue, though, that if you consider lookaheads as part of the lexeme specification, the maximal munch property remains intact. then "0." can only match if not followed by another "." - no contradiction to the max munch. but that should be stated in the specs, of course.
Sep 10 2007
parent Matti Niemenmaa <see_signature for.real.address> writes:
Jascha Wetzel wrote:
<snip>
 but that should be stated in the specs, of course.

Exactly, which is what the bug is about. :-) -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Sep 10 2007
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1466


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |bugzilla digitalmars.com
         Resolution|                            |FIXED


--- Comment #9 from Walter Bright <bugzilla digitalmars.com> 2010-11-09
19:43:04 PST ---
http://www.dsource.org/projects/phobos/changeset/2148

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 09 2010