digitalmars.D - String Literal Docs

reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
I've been sketching some grammar diagrams for D2.0, a little like those 
on JSON.org, and of course I didn't get far before I ran into something odd.

In the section of www.digitalmars.com/d/2.0/lex.html on string literals, 
the productions imply that the [c|w|d] "postfix" is allowed on Wysiwyg, 
DoubleQuote and Hex strings and not on either Delimited or Token 
strings, which didn't make a lot of sense to me, so I tested it with DMD 
(v2.046, win)...

---

import std.stdio;

void main(){
	auto t1 = "double quote"d; // OK
	auto t2 = `back tick`d;    // OK
	auto t3 = x"dead beef";    // postfix not allowed on hexstrings!
	auto t4 = q"<delimited/>"d;// OK
	auto t5 = q{if}d;          // OK

	writefln("all literals A-OK!");
}

---

This makes sense to me, HexStrings with wide chars would have made my 
brain scream ><

So, to correct the documentation, the "postfix" needs to be removed from 
HexString and added to DelimitedString and TokenString.

I tried to see if this was already reported in the bug tracker but
couldn't see anything close.

On a slightly quieter note, there is also a spare underscore in the 
definition of HexidecimalDigit as it "extends" DecimalDigit which 
already has an underscore.

I also noticed a bug in the tracker related to initial underscores in
float literals; if the diagrams start getting too puzzling I might look
into that ^^

A...

PS, my copy of tDPL is in the post, yay!
Jun 19 2010
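
The postfix behaviour reported in the post above can be checked at compile time rather than by eye. The following is a minimal sketch, assuming a reasonably recent D2 compiler; the variable names follow the original test, and hex strings are deliberately left out because their handling is the open question in this thread.

```d
import std.stdio : writeln;

void main()
{
    auto t1 = "double quote"d;   // double-quoted string, dchar postfix
    auto t2 = `back tick`w;      // back-tick (wysiwyg) string, wchar postfix
    auto t4 = q"<delimited/>"d;  // delimited string, dchar postfix
    auto t5 = q{if}c;            // token string, char postfix

    // Each postfix selects the element type of the resulting string.
    static assert(is(typeof(t1) == immutable(dchar)[]));
    static assert(is(typeof(t2) == immutable(wchar)[]));
    static assert(is(typeof(t4) == immutable(dchar)[]));
    static assert(is(typeof(t5) == immutable(char)[]));

    writeln("postfix accepted on all four literal kinds");
}
```

If any of these literal kinds rejected the postfix, the file would fail to compile, which is the same observation the original test relies on.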
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/19/2010 03:12 PM, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

 In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
 the productions imply that the [c|w|d] "postfix" is allowed on Wysiwyg,
 DoubleQuote and Hex strings and not on either Delimited or Token
 strings, which didn't make a lot of sense to me, so I tested it with DMD
 (v2.046, win)...

 ---

 import std.stdio;

 void main(){
 auto t1 = "double quote"d; // OK
 auto t2 = `back tick`d; // OK
 auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
 auto t4 = q"<delimited/>"d;// OK
 auto t5 = q{if}d; // OK

 writefln("all literals A-OK!");
 }

 ---

 This makes sense to me, HexStrings with wide chars would have made my
 brain scream ><

http://d.puremagic.com/issues/show_bug.cgi?id=4351

but I'm not so sure about the hex string one. I think you just gave it invalid Unicode. E.g., this compiles fine:

auto w = x"1e1d 1e1f"w;

on dmd 2.047, but what it results in is pretty screwy.
 So, to correct the documentation, the "postfix" needs to be removed from
 HexString and added to DelimitedString and TokenString.

 I tried to see if this was already reporded in the bug tracker but
 couldn't see anything close.

 On a slightly quieter note, there is also a spare underscore in the
 definition of HexidecimalDigit as it "extends" DecimalDigit which
 already has an underscore.

 I also noticed a bug in the tracker related to initial underscores in
 float literals, if the diagrams start getting to puzzling I might look
 into that ^^

What what?
 A...

 PS, my copy of tDPL is in the post, yay!

Jun 19 2010
next sibling parent reply div0 <div0 users.sourceforge.net> writes:
On 19/06/2010 22:16, Ellery Newcomer wrote:
 On 06/19/2010 03:12 PM, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

 In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
 the productions imply that the [c|w|d] "postfix" is allowed on Wysiwyg,
 DoubleQuote and Hex strings and not on either Delimited or Token
 strings, which didn't make a lot of sense to me, so I tested it with DMD
 (v2.046, win)...

 ---

 import std.stdio;

 void main(){
 auto t1 = "double quote"d; // OK
 auto t2 = `back tick`d; // OK
 auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
 auto t4 = q"<delimited/>"d;// OK
 auto t5 = q{if}d; // OK

 writefln("all literals A-OK!");
 }

 ---

 This makes sense to me, HexStrings with wide chars would have made my
 brain scream ><

http://d.puremagic.com/issues/show_bug.cgi?id=4351 but I'm not so sure about the hex string one. I think you just gave it invalid unicode. E.g., this compiles fine:

Hex strings are specifically exempted from the requirement for valid UTF.

--
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Jun 19 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/19/2010 04:26 PM, div0 wrote:
 On 19/06/2010 22:16, Ellery Newcomer wrote:
 On 06/19/2010 03:12 PM, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

 In the section of www.digitalmars.com/d/2.0/lex.html on string literals,
 the productions imply that the [c|w|d] "postfix" is allowed on Wysiwyg,
 DoubleQuote and Hex strings and not on either Delimited or Token
 strings, which didn't make a lot of sense to me, so I tested it with DMD
 (v2.046, win)...

 ---

 import std.stdio;

 void main(){
 auto t1 = "double quote"d; // OK
 auto t2 = `back tick`d; // OK
 auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
 auto t4 = q"<delimited/>"d;// OK
 auto t5 = q{if}d; // OK

 writefln("all literals A-OK!");
 }

 ---

 This makes sense to me, HexStrings with wide chars would have made my
 brain scream ><

http://d.puremagic.com/issues/show_bug.cgi?id=4351 but I'm not so sure about the hex string one. I think you just gave it invalid unicode. E.g., this compiles fine:

Hex strings are specifically exempted from the requirement for valid utf.

All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047
Jun 19 2010
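
One possible reading of the two results above is that the compiler validates the raw bytes of the hex string as UTF-8 before applying the postfix. That is an assumption, not something the thread confirms at this point, but the bytes themselves are consistent with it: 1E 1D 1E 1F are all ASCII control characters, while in DE AD BE EF the byte BE cannot start a UTF-8 sequence. A small sketch using `std.utf.validate` (the helper name `isValidUtf8` is made up for this example):

```d
import std.utf : validate, UTFException;

/// Returns true if `bytes` form a well-formed UTF-8 sequence.
/// Hypothetical helper for illustration; not part of DMD or Phobos.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    immutable ubyte[] deadBeef = [0xDE, 0xAD, 0xBE, 0xEF];
    immutable ubyte[] controls = [0x1E, 0x1D, 0x1E, 0x1F];

    // DE AD decodes as one two-byte sequence, but BE is a stray continuation byte.
    assert(!isValidUtf8(deadBeef));
    // Four single-byte ASCII control characters are valid UTF-8.
    assert(isValidUtf8(controls));
}
```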
parent reply div0 <div0 users.sourceforge.net> writes:
On 19/06/2010 23:17, Ellery Newcomer wrote:
 All I can say is

 auto w = x"dead beef"w;

 results in

 Error: invalid UTF-8 sequence

 on dmd 2.047

Then you've found a bug, you know what to do: http://d.puremagic.com/issues/
Jun 19 2010
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 20/06/2010 01:09, div0 wrote:
 On 19/06/2010 23:17, Ellery Newcomer wrote:
 All I can say is

 auto w = x"dead beef"w;

 results in

 Error: invalid UTF-8 sequence

 on dmd 2.047

Then you've found a bug, you know what to do: http://d.puremagic.com/issues/

Hmm, that would seem to indicate that the postfix is allowed when the hex represents a valid UTF sequence, but not otherwise. I didn't do too much testing myself as I know next to zilch about string internals ><

The text that describes hex strings says that they have to have an even number of digits, but this would seem to imply that they have to have a multiple of 4 or 8 for wstrings and dstrings respectively. That makes sense, but I'm not sure it can be verified in the lexing of a string literal without insane lookahead rules ><

But then I guess that is why the spec says that hex strings are exempt from the valid UTF rule, and in that case hexstrings should really make byte arrays rather than strings or, failing that, always chars and not anything wider.

A...
Jun 20 2010
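
The digit-count arithmetic in the post above can be made concrete. This is only a sketch of the check being discussed, not what DMD does: two hex digits make one byte, so a literal fills whole code units only when its digit count is a multiple of twice the code unit size. The function names are invented for the example.

```d
import std.ascii : isHexDigit;

/// Counts the hex digits in the text of a hex string, skipping whitespace.
/// Hypothetical helper for illustration.
size_t countHexDigits(string text)
{
    size_t n = 0;
    foreach (char c; text)
        if (isHexDigit(c))
            ++n;
    return n;
}

/// True if the digits fill a whole number of code units of `unitBytes` bytes.
bool fillsWholeUnits(string text, size_t unitBytes)
{
    return countHexDigits(text) % (2 * unitBytes) == 0;
}

void main()
{
    // Eight digits: whole for char (1 byte), wchar (2) and dchar (4).
    assert(fillsWholeUnits("dead beef", char.sizeof));
    assert(fillsWholeUnits("dead beef", wchar.sizeof));
    assert(fillsWholeUnits("dead beef", dchar.sizeof));

    // Six digits: an even count, but it leaves half a wchar over.
    assert(fillsWholeUnits("dead be", char.sizeof));
    assert(!fillsWholeUnits("dead be", wchar.sizeof));
}
```

Note that the check needs the postfix before it can decide, and the postfix comes after the closing quote, which is the lookahead concern raised above.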
parent reply div0 <div0 users.sourceforge.net> writes:
On 20/06/2010 11:03, Alix Pexton wrote:
 On 20/06/2010 01:09, div0 wrote:
 On 19/06/2010 23:17, Ellery Newcomer wrote:
 All I can say is

 auto w = x"dead beef"w;

 results in

 Error: invalid UTF-8 sequence

 on dmd 2.047

Then you've found a bug, you know what to do: http://d.puremagic.com/issues/

Hmn, that would seem to indicate to me that the postfix is being allowed when the hex represents a valid UTF sequence, but not otherwise. I didn't do too much testing myself as I know next to zilch about string internals >< The text that describes hex strings says that they have to have an even number of digits, but this would seem to imply that they have to have a multiple of 4 or 8 for wstrings and dstrings respectively, which makes sense, but I'm not sure that can be verified in the lexing of a string literal without insane lookahead rules ><

It says multiple of 2, not even number of digits. To me that implies it's always 2 and the suffix acceptance is just a bug. It could be made more clear though.
 But, then I guess that is why the spec says that hex strings are exempt
 from the valid UTF rule, and in that case hexstrings should really make
 byte arrays rather than strings, but failing that, always chars and not
 anything wider.

 A...

Yeah, hex strings should probably have the type ubyte[].

If you're using them to put arbitrary binary in your program you're almost certainly going to cast the array to something else anyway, so char[], wchar[], dchar[] all seem a bit pointless, and as they allow invalid UTF, making them ?char[] seems wrong.
Jun 20 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"div0" <div0 users.sourceforge.net> wrote in message 
news:hvkrsc$2r5c$1 digitalmars.com...
 It says multiple of 2, not even number of digits.

"multiple of 2" == "even number" "Even" as in "even vs odd"
 Yeah, hex strings should probably have the type ubyte[]

 If you using them to put arbitrary binary in your program you're almost 
 certainly going to cast the array to something else anyway, so char[], 
 wchar[], dchar[] all seem a bit pointless and as they allow invalid utf, 
 making them ?char[] seems wrong.

You have me completely convinced.
Jun 20 2010
parent reply div0 <div0 users.sourceforge.net> writes:
On 20/06/2010 18:55, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>  wrote in message
 news:hvkrsc$2r5c$1 digitalmars.com...
 It says multiple of 2, not even number of digits.

"multiple of 2" == "even number" "Even" as in "even vs odd"

I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.

I read the spec as specifying that the hex characters should be in groups of 2; I also take it as implying that the suffixes are not applicable. You're more than welcome to your own take on it.
Jun 20 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"div0" <div0 users.sourceforge.net> wrote in message 
news:hvlok6$1rfu$1 digitalmars.com...
 On 20/06/2010 18:55, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>  wrote in message
 news:hvkrsc$2r5c$1 digitalmars.com...
 It says multiple of 2, not even number of digits.

"multiple of 2" == "even number" "Even" as in "even vs odd"

I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.

That wasn't my intent, sorry if it came across that way. It sounded to me like you were implying there was a difference between "multiple of 2" and "even number". If that wasn't the case, then I guess I'm just not sure what you were really getting at.
Jun 20 2010
next sibling parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 20/06/2010 20:14, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>  wrote in message
 news:hvlok6$1rfu$1 digitalmars.com...
 On 20/06/2010 18:55, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>   wrote in message
 news:hvkrsc$2r5c$1 digitalmars.com...
 It says multiple of 2, not even number of digits.

"multiple of 2" == "even number" "Even" as in "even vs odd"

I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.

That wan't my intent, sorry if it came across that way. It sounded to me like you were implying there was a difference between "multiple of 2" and "even number". If that wasn't the case, then I guess I'm just not sure what you were really getting at.

From looking at the source, I now know that all string literals can have a postfix, and that as far as lexing goes, all strings are in UTF-8. I've not yet tracked down where the value of the postfix is applied, but I'm fairly certain that it would be easy enough to turn off the UTF verification for the hexstrings at that end.

As far as making my diagrams, I don't think it matters, for now...

A...
Jun 20 2010
prev sibling parent div0 <div0 users.sourceforge.net> writes:
On 20/06/2010 20:14, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>  wrote in message
 news:hvlok6$1rfu$1 digitalmars.com...
 On 20/06/2010 18:55, Nick Sabalausky wrote:
 "div0"<div0 users.sourceforge.net>   wrote in message
 news:hvkrsc$2r5c$1 digitalmars.com...
 It says multiple of 2, not even number of digits.

"multiple of 2" == "even number" "Even" as in "even vs odd"

I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.

That wan't my intent, sorry if it came across that way. It sounded to me like you were implying there was a difference between "multiple of 2" and "even number". If that wasn't the case, then I guess I'm just not sure what you were really getting at.

What I was getting at is that if you use the w suffix, then surely you would expect the number of hex digits to be a multiple of 4, not 2. If there are only 6 digits, what then? Are the missing ones inferred to be 0, is it a compile error, or something else?

Because of the use of the 2, I inferred from the spec that the suffixes were not supposed to be allowed. If it had said even number of digits, I'd have been more inclined to think that the suffixes are legal.

Either way it just highlights that the spec isn't sufficiently clear.
Jun 22 2010
prev sibling parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 19/06/2010 22:16, Ellery Newcomer wrote:
 On 06/19/2010 03:12 PM, Alix Pexton wrote:
 I also noticed a bug in the tracker related to initial underscores in
 float literals, if the diagrams start getting to puzzling I might look
 into that ^^

What what?

Bug 2734 is the underscores in floats issue. Bug 949 also has a shed full of replacement grammar rules that fix escape sequences and some corner cases in floats (and probably some other stuff too!)

A...
Jun 20 2010
prev sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 19/06/2010 21:12, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old C++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s.

This will mean, however, that my little project may take a little longer, sigh...

A...
Jun 20 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/20/2010 03:01 PM, Alix Pexton wrote:
 On 19/06/2010 21:12, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old c++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s. This will mean however that my little project may take a little longer, sigh... A...

Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).
Jun 20 2010
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 20/06/2010 21:37, Ellery Newcomer wrote:
 On 06/20/2010 03:01 PM, Alix Pexton wrote:
 On 19/06/2010 21:12, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old c++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s. This will mean however that my little project may take a little longer, sigh... A...

Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).

of course ^^ A...
Jun 20 2010
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 20/06/2010 22:46, Alix Pexton wrote:
 On 20/06/2010 21:37, Ellery Newcomer wrote:
 On 06/20/2010 03:01 PM, Alix Pexton wrote:
 On 19/06/2010 21:12, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like those
 on JSON.org, and of course I didn't get far before I ran into something
 odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old c++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s. This will mean however that my little project may take a little longer, sigh... A...

Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).

of course ^^ A...

Well, I think I have got my head around lexer.c now, and its various peculiarities, like "000377." being a valid float (although not according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).

The weirdness occurs because some corner cases are handled not by the neat little state machine that validates reals, but in the scanner at the point where it recognises a number beginning with a zero. The productions in lex.html represent the range of inputs that are accepted by the state machine without taking into account that the scanner rejects the sequence "._" (which makes sense as that is the identifier "_" in the outer scope).

Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat, but a strict reading of lex.html would not allow it.

Overall the diagram for hexfloat is much simpler than the one for decimalfloat, which I think will have to be split into 3 ><

A...

PS, octal must die!
Jun 21 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/21/2010 02:21 PM, Alix Pexton wrote:
 On 20/06/2010 22:46, Alix Pexton wrote:
 On 20/06/2010 21:37, Ellery Newcomer wrote:
 On 06/20/2010 03:01 PM, Alix Pexton wrote:
 On 19/06/2010 21:12, Alix Pexton wrote:
 I've been sketching some grammar diagrams for D2.0, a little like
 those
 on JSON.org, and of course I didn't get far before I ran into
 something
 odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old c++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s. This will mean however that my little project may take a little longer, sigh... A...

Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).

of course ^^ A...

Well, I think I have got my head around lexer.c now, and its various peculiarities, like "000377." being a valid float (although not according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).

Oh wow. That's a sweet little diagram. Those dots are hard to see though.
 The weirdness occurs because some of some corner cases are handled not
 by the neat little state state machine that validates reals, but in the
 scanner at the point where it recognises a number beginning with a zero.
 The productions in lex.html represent the range of inputs that are
 accepted by the state machine without taking into account that the
 scanner rejects the sequence "._" (which makes sense as that is the
 identifier "_" in the outer scope).

to hell with lexer.c. I'm not changing anything.
 Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat,
 but a strict reading of lex.html would not allow it.

 Overall the diagram for hexfloat is much simpler than the one for
 decimalfloat, which I think will have to be split into 3 ><

 A...

 PS, octal must die!

I'll settle for modified syntax 0c123. But yeah.

Are your diagrams solely concerned with the lexer? Because I have a (messy) parser grammar which I'm a bit more confident about if you're interested.
Jun 21 2010
next sibling parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 21/06/2010 21:20, Ellery Newcomer wrote:

 Are your diagrams solely concerned with the lexer? Because I have a
 (messy) parser grammar which I'm a bit more confident about if you're
 interested.

So far I have only covered the lexer, and most of it needs redoing in light of the errors in the DMD docs, but I am hoping to cover the whole spec, eventually... The more I do, the quicker I'm able to make them as my workflow evolves, so it's hard to say how long it will take...

A...
Jun 21 2010
prev sibling parent Justin Spahr-Summers <Justin.SpahrSummers gmail.com> writes:
On Mon, 21 Jun 2010 15:20:16 -0500, Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:
 Are your diagrams solely concerned with the lexer? Because I have a 
 (messy) parser grammar which I'm a bit more confident about if you're 
 interested. 

I can't speak for Alix, but I would absolutely be interested. I'm working on an "Objective-D" preprocessor and my parsing still has lots of holes, even besides the stuff I have marked to-do. A strict reading of the website has already turned up a few inaccuracies.
Jun 22 2010