D - [BUG] dmd does not implement LR analysis

reply Manfred Nowak <svv1999 hotmail.com> writes:
Although it is not explicitly specified, the usual left-to-right lexical
analysis and parsing of the grammar of D is currently not implemented in dmd.

Currently `2.' and `.4' are legal real numbers. Therefore the look-alike
range `[cast(int)2..4]' is not a range but should be analysed as two
consecutive real numbers, as if it were written as `[cast(int)2. .4]', and
therefore should yield something like:

| found '0.4' when expecting ']'

In the lexical analysis phase of dmd some trickery has been done to
prevent this, i.e. looking ahead and backing up.

On the other hand, this trickery now prevents the legal range
expression `[cast(int)2...4]', which could be written as `[cast(int)2. ..
4]', from being correctly identified by dmd. dmd yields:

| found '...' when expecting ']'
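
The two behaviours described above can be sketched with a toy tokenizer. This is only a minimal sketch in Python, not dmd's actual lexer: the token set, the two patterns and the `tokenize` function with its `lookahead` flag are assumptions made for illustration. The strict variant takes the longest real it can; the lookahead variant refuses to end a real on a `.` that is immediately followed by another `.`.

```python
import re

# Strict longest match: a real may omit its integer or its fractional part.
STRICT = re.compile(r"""
    (?P<real>\d+\.\d*|\.\d+)
  | (?P<int>\d+)
  | (?P<op>\.\.\.|\.\.|\.)      # longest operator first
  | (?P<ws>\s+)
""", re.VERBOSE)

# Lookahead variant (assumed behaviour): a real with an empty fractional
# part is rejected when the next character is another '.'.
LOOKAHEAD = re.compile(r"""
    (?P<real>\d+\.\d+|\d+\.(?!\.)|\.\d+)
  | (?P<int>\d+)
  | (?P<op>\.\.\.|\.\.|\.)
  | (?P<ws>\s+)
""", re.VERBOSE)


def tokenize(text, lookahead=False):
    """Split text into tokens, left to right, dropping whitespace."""
    pattern = LOOKAHEAD if lookahead else STRICT
    pos, tokens = 0, []
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":
            tokens.append(m.group())
        pos = m.end()
    return tokens


print(tokenize("2..4"))                   # two reals
print(tokenize("2..4", lookahead=True))   # integer, range operator, integer
print(tokenize("2...4"))                  # real, range operator, integer
print(tokenize("2...4", lookahead=True))  # integer, '...', integer
```

In this sketch the lookahead variant is what turns `2...4` into an integer followed by `...`, which is consistent with the second error message quoted above.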

So long.
 
Mar 12 2004
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:c2uekl$1995$1 digitaldaemon.com...
 Although it is not explicitly specified, the usual left-to-right lexical
 analysis and parsing of the grammar of D is currently not implemented in dmd.

 Currently `2.' and `.4' are legal real numbers. Therefore the look-alike
 range `[cast(int)2..4]' is not a range but should be analysed as two
 consecutive real numbers, as if it were written as `[cast(int)2. .4]', and
 therefore should yield something like:

 | found '0.4' when expecting ']'

 In the lexical analysis phase of dmd some trickery has been done to
 prevent this, i.e. looking ahead and backing up.

 On the other hand, this trickery now prevents the legal range
 expression `[cast(int)2...4]', which could be written as `[cast(int)2. ..
 4]', from being correctly identified by dmd. dmd yields:

 | found '...' when expecting ']'

 So long.

... is a valid token. You'll need to put the space after the first . to get the meaning you wish. True, the lexer does a bit of lookahead, but why not?
Mar 13 2004
next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter wrote:

 ... is a valid token. You'll need to put the space after the first . to
 get the meaning you wish.

I am not talking about meanings I wish.

I noticed this departure from the norm because the publicly available syntax highlighting extension for D for vim showed me `[2..4]' as two consecutive reals, thereby pointing out to me that my own syntax highlighting extension was wrong, because I had thought that it is illegal to have an empty integer or fractional part in a real.

Then: following the usual left-to-right analysis, it is correct to analyze the construct in question as two consecutive reals, and furthermore there is no way to build an LR highlighter that is able to highlight the construct in question as two integer numbers divided by the range operator `..'.

Even the `d2html' example highlights the construct in question as the real `2.', followed by a `.', followed by the integer `4'.

I do not believe that any syntax highlighter currently out there is able to highlight the construct in question correctly.

 True, the lexer does a bit of lookahead, but why not?

That depends on what DigitalMars has in mind with the language D and the de facto reference compiler dmd.

If the intention of DigitalMars is to tempt a certain amount of computer nerds to the language D by promising an open standard, and at the same time bind them to a proprietary implementation not fully consistent with the proposed standard and its somehow natural interpretation, then it is quite okay to make even more departures than the two I have detected:

- the one which is the matter of this thread, and
- the `cast' operator being optional in dmd.

If the intention of DigitalMars is to keep the language D and the de facto reference compiler dmd in a homogeneous state, then the existence of both exposed deviations is not okay.

There might be more intentions of DigitalMars which I am unable to recognize.

So long.
Mar 13 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Manfred Nowak wrote:
<snip>
 Even the `d2html' example highlights the construct in question as the
 real `2.', followed by a `.', followed by the integer `4'.
 
 I do not believe that any syntax highlighter currently out there is able
 to highlight the construct in question correctly.

You're right that syntax highlighters that are strictly LR have trouble with syntaxes that aren't strictly LR. But see below....

 True, the lexer does a bit of lookahead, but why not?


Depends on whether the lexicality is supposed to be strictly LR. But I did just notice this in the spec:

"There are no digraphs or trigraphs in D. The source text is split into tokens using the maximal munch technique, i.e., the lexical analyzer tries to make the longest token it can. For example >> is a right shift token, not two greater than tokens."

But if that's exactly true, then from the way string literals are specified, surely in

	qwert("yuiop", "asdfg")

a single, 14-character string is being passed?
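
The string-literal concern can be illustrated with two regular expressions. This is a hedged sketch in Python: neither pattern is taken from the D specification, and the names `greedy` and `delimited` are assumptions chosen for the example. The first allows any character in the string body, so longest match runs to the last quote; the second excludes the delimiting quote from the body.

```python
import re

source = 'qwert("yuiop", "asdfg")'

# Body may contain any character, including '"': the longest match
# swallows everything between the first and the last quote.
greedy = re.compile(r'"(.*)"')

# Body excludes the delimiting quote: each literal ends at its own quote.
delimited = re.compile(r'"([^"]*)"')

greedy_bodies = greedy.findall(source)
delimited_bodies = delimited.findall(source)

print(greedy_bodies, len(greedy_bodies[0]))   # one body of 14 characters
print(delimited_bodies)                       # two separate bodies
```

Escape sequences inside strings are left out of this sketch to keep it short.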
 That depends on what DigitalMars has in mind with the language D and the
 de facto reference compiler dmd.

I think what it should have in mind is making the spec clearer. You're right, there's nothing suggesting that 2..4 should be 2 .. 4 and not 2. .4 or even any of the three other possibilities.

Of course it isn't difficult to write a lexer that looks ahead two or three characters. The only trouble is that it's doing so for something that's not clearly specified.
 If the intention of DigitalMars is to tempt a certain amount of computer
 nerds to the language D by promising an open standard and at the same time
 bind them to a proprietary implementation not fully consistent with the
 proposed standard and its somehow natural interpretation, then it is quite
 okay to make even more departures than the two I have detected:
 
 - the one which is the matter of this thread, and
 - the `cast' operator being optional in dmd.

You're right, that's just what I've been thinking for a while. There does seem to be both an inconsistency and a deviation from CFG with casts.

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Mar 16 2004
next sibling parent reply "Matthew" <matthew stlsoft.org> writes:
 If the intention of DigitalMars is to tempt a certain amount of computer
 nerds to the language D by promising an open standard and at the same time
 bind them to a proprietary implementation not fully consistent with the
 proposed standard and its somehow natural interpretation, then it is quite
 okay to make even more departures than the two I have detected:

 - the one which is the matter of this thread, and
 - the `cast' operator being optional in dmd.

You're right, that's just what I've been thinking for a while. There does seem to be both an inconsistency and a deviation from CFG with casts.

I think the cast operator should be mandatory.
Mar 16 2004
parent reply J C Calvarese <jcc7 cox.net> writes:
Matthew wrote:
If the intention of DigitalMars is to tempt a certain amount of computer
nerds to the language D by promising an open standard and at the same time
bind them to a proprietary implementation not fully consistent with the
proposed standard and its somehow natural interpretation, then it is quite
okay to make even more departures than the two I have detected:

- the one which is the matter of this thread, and
- the `cast' operator being optional in dmd.

<snip> You're right, that's just what I've been thinking for a while. There does seem to be both an inconsistency and a deviation from CFG with casts.

I think the cast operator should be mandatory

I absolutely agree. It has to be now, before D 1.0 is set and we have a bunch of legacy code with C-style casts hanging around.

--
Justin
http://jcc_7.tripod.com/d/
Mar 16 2004
parent "Matthew" <matthew stlsoft.org> writes:
"J C Calvarese" <jcc7 cox.net> wrote in message
news:c38i7b$un$1 digitaldaemon.com...
 Matthew wrote:
If the intention of DigitalMars is to tempt a certain amount of computer
nerds to the language D by promising an open standard and at the same time
bind them to a proprietary implementation not fully consistent with the
proposed standard and its somehow natural interpretation, then it is quite
okay to make even more departures than the two I have detected:

- the one which is the matter of this thread, and
- the `cast' operator being optional in dmd.

<snip> You're right, that's just what I've been thinking for a while. There does seem to be both an inconsistency and a deviation from CFG with casts.

 I think the cast operator should be mandatory

I absolutely agree. It has to be now. Before D 1.0 is set and we have a bunch of legacy code with C-style casts hanging around.

Quite right. Let me presumptuously institute a vote.
Mar 17 2004
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Stewart Gordon wrote:

[...]
 "There are no digraphs or trigraphs in D. The source text is split into
 tokens using the maximal munch technique, i.e., the lexical analyzer
 tries to make the longest token it can. For example >> is a right shift
 token, not two greater than tokens."

Thanks for this link.
 But if that's exactly true, then from the way string literals are
 specified, surely in
 
 	qwert("yuiop", "asdfg")
 
 a single, 14-character string is being passed?

Right. It should be specified that the allowed characters do not include the delimiting `"' or ``'.
 I think what it should have in mind is making the spec clearer.  You're
 right, there's nothing suggesting that 2..4 should be 2 .. 4 and not 2.
 .4 or even any of the three other possibilities.

I see five, but only when not using longest match. [...] So long!
Mar 17 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Manfred Nowak wrote:

<snip>
 I think what it should have in mind is making the spec clearer.
 You're right, there's nothing suggesting that 2..4 should be 2 .. 4
 and not 2. .4 or even any of the three other possibilities.

I see five, but only when not using longest match.

	2. . 4
	2 . .4
	2 . . 4

Of course, the character sequence could be split up as

	2.. 4
	2 ..4

but these involve what aren't valid D tokens.

Stewart.
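
The count of possible splits can be checked mechanically. The following is a minimal sketch in Python under stated assumptions: the token shapes in `TOKEN` (integer, real with an optional integer or fractional part, `.`, `..`, `...`) are a simplification chosen for this example, and `splits` is a hypothetical helper that ignores the longest-match rule entirely.

```python
import re

# Assumed token shapes; not the full D token set.
TOKEN = re.compile(r"\d+|\d+\.\d*|\.\d+|\.\.\.|\.\.|\.")


def splits(text):
    """Return every way to cut text into valid tokens, ignoring longest match."""
    if not text:
        return [[]]
    result = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if TOKEN.fullmatch(head):
            for rest in splits(text[end:]):
                result.append([head] + rest)
    return result


for candidate in splits("2..4"):
    print(" ".join(candidate))
```

Under these assumptions the helper finds five splits for `2..4`: the two discussed earlier (`2 .. 4` and `2. .4`) plus the three listed above. Cuts such as `2.. 4` never appear because `2..` and `..4` match none of the assumed token shapes.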
Mar 17 2004
parent Manfred Nowak <svv1999 hotmail.com> writes:
Stewart Gordon wrote:

[...]
 but these involve what aren't valid D tokens.

Agreed. I did not think of this argument. So long!
Mar 17 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
On Sat, 13 Mar 2004 14:28:35 -0800, "Walter" <walter digitalmars.com>
wrote:

"Manfred Nowak" <svv1999 hotmail.com> wrote in message
news:c2uekl$1995$1 digitaldaemon.com...
 Although it is not explicitly specified, the usual left-to-right lexical
 analysis and parsing of the grammar of D is currently not implemented in dmd.

 Currently `2.' and `.4' are legal real numbers. Therefore the look-alike
 range `[cast(int)2..4]' is not a range but should be analysed as two
 consecutive real numbers, as if it were written as `[cast(int)2. .4]', and
 therefore should yield something like:

 | found '0.4' when expecting ']'

 In the lexical analysis phase of dmd some trickery has been done to
 prevent this, i.e. looking ahead and backing up.

 On the other hand, this trickery now prevents the legal range
 expression `[cast(int)2...4]', which could be written as `[cast(int)2. ..
 4]', from being correctly identified by dmd. dmd yields:

 | found '...' when expecting ']'

 So long.

... is a valid token. You'll need to put the space after the first . to get the meaning you wish. True, the lexer does a bit of lookahead, but why not?

Fortran, MATLAB and Python use : for slicing instead of ..
I don't know the history of why, but maybe this parsing issue factored into it. The .. reminds me more of Pascal.

-Ben
Mar 13 2004
parent "C. Sauls" <ibisbasenji yahoo.com> writes:
MOO uses '..' as well, and having recently written a MOO
parser/compiler/driver I can say it's doable. Of course, MOO requires
that floating-point numbers contain both integer and fraction, even if
one is equal to 0, so maybe that makes all the difference.

-C. Sauls
-Invironz

Ben Hinkle wrote:
 Fortran, MATLAB and Python use : for slicing instead of ..
 I don't know the history of why but maybe this parsing issue factored
 into it. The .. reminds me more of Pascal.

Mar 14 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Manfred Nowak wrote:
<snip>
 Currently `2.' and `.4' are legal real numbers. Therefore the look-alike
 range `[cast(int)2..4]' is not a range but should be analysed as two
 consecutive real numbers, as if it were written as `[cast(int)2. .4]', and
 therefore should yield something like:

That's news to me. I'd imagined the tokenisation of D was supposed to be context-free.

Stewart.
Mar 15 2004
next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Stewart Gordon wrote:

[...]
 I'd imagined the tokenisation of D was supposed to be context-free.

Context-free is an attribute that belongs to grammars. If you will, dmd does not have a context-free lexical analysis, because the case "natural number followed by a point" is treated in a special way.

Lexical analysis is usually carried out left to right, by finding the next _longest_ part of the remaining source that belongs to a token. This is called LR analysis. I.e. `return2;' is the identifier `return2', not the keyword `return' followed by the integer number `2', followed by a `;'.

Not having an LR lexical analysis does not change the attribute context-free for the grammar, although it is a convention to have LR lexical analysis with a context-free grammar. If D breaks this convention, it should be explicitly mentioned in the specification.

If the non-LR analysis stays, then the door is open for more implicit deviations from the conventions, like the one I mentioned with the `return2'. Even the suggestion of an operator that overrides the usual LR lexical analysis may arise. I would like `$ ' to be supported then :-)

So long!
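
The `return2` case can be sketched with a small longest-match classifier. This is an illustrative sketch in Python, not dmd code: the one-entry keyword set, the `TOKEN` pattern and the `classify` function are assumptions made for the example. A word is scanned to its full length first and only then compared against the keyword list.

```python
import re

KEYWORDS = {"return"}  # tiny assumed keyword set, for illustration only

TOKEN = re.compile(
    r"(?P<word>[A-Za-z_]\w*)|(?P<int>\d+)|(?P<punct>;)|(?P<ws>\s+)"
)


def classify(text):
    """Return (kind, lexeme) pairs using longest match, left to right."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind == "word":
            # Decide keyword vs identifier only after the whole word is read.
            kind = "keyword" if m.group() in KEYWORDS else "identifier"
        if kind != "ws":
            out.append((kind, m.group()))
        pos = m.end()
    return out


print(classify("return2;"))   # identifier, then ';'
print(classify("return 2;"))  # keyword, integer, then ';'
```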
Mar 15 2004
prev sibling parent larry cowan <larry_member pathlink.com> writes:
In article <c348v1$1o1i$1 digitaldaemon.com>, Stewart Gordon says...
Manfred Nowak wrote:
<snip>
 Currently `2.' and `.4' are legal real numbers. Therefore the look-alike
 range `[cast(int)2..4]' is not a range but should be analysed as two
 consecutive real numbers, as if it were written as `[cast(int)2. .4]', and
 therefore should yield something like:

That's news to me. I'd imagined the tokenisation of D was supposed to be context-free. Stewart.

But I would rather have leading and trailing 0's required for literal floats, doubles, and reals. `-(.5-4.)', `4.-.5', `4.*-8.', `4./.2', `.1/16.', and `04*20.' all look pretty strange at first glance. I think FP literals should be more obviously differentiated from integer literals.
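
A short sketch shows how that requirement would remove the `2..4` ambiguity without any lookahead. It is written in Python under stated assumptions: the rule "a real needs digits on both sides of the point" is the proposal above, not current D, and `TOKEN` and `tokenize_both_digits` are hypothetical names for this example.

```python
import re

# Assumed stricter rule: a real literal needs digits on both sides of '.'.
TOKEN = re.compile(
    r"(?P<real>\d+\.\d+)|(?P<int>\d+)|(?P<op>\.\.\.|\.\.|\.)|(?P<ws>\s+)"
)


def tokenize_both_digits(text):
    """Plain left-to-right longest match; no special case for 'digit then dot'."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":
            tokens.append(m.group())
        pos = m.end()
    return tokens


print(tokenize_both_digits("2..4"))      # integer, range operator, integer
print(tokenize_both_digits("2.5..4.0"))  # real, range operator, real
```

With this rule `2.` and `.4` are no longer reals, so the range reading is the only one left.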
Mar 15 2004