digitalmars.D - Proposal: clean up semantics of array literals vs string literals

Don Clugston <dac nospam.com> writes:

The problem
-----------

String literals in D are a little bit magical; they have a trailing \0. 
This means that it is possible to write,

printf("Hello, World!\n");

without including a trailing \0. This is important for compatibility 
with C. This trailing \0 is mentioned in the spec but only incidentally, 
and generally in connection with printf.
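A minimal sketch of what that terminator looks like from D code, assuming a reasonably recent compiler. The literal's .length does not count the \0, but the byte just past the slice is zero, which is why the pointer can be handed to C. The helper name literalIsTerminated is hypothetical.

```d
import core.stdc.string : strlen;

// Hypothetical helper: inspects a string literal defined inside the function.
bool literalIsTerminated()
{
    string s = "Hello";
    return s.length == 5               // the \0 is not counted in the length
        && s.ptr[s.length] == '\0'     // but it sits immediately after the data
        && strlen(s.ptr) == 5;         // so C functions see a proper C string
}
```

Note that this only says something about a value that comes straight from a literal; it makes no promise about slices or strings built at run time.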

But the semantics are not well defined.

printf("Hello, W" ~ "orld!\n");

Does this have a trailing \0 ? I think it should, because it improves 
readability of string literals that are longer than one line. Currently 
DMD adds a \0, but it is not in the spec.

Now consider array literals.

printf(['H','e', 'l', 'l','o','\n']);

Does this have a trailing \0 ? Currently DMD does not put one in.
How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?

And "Hello " ~ ['W','o','r','l','d','\n']   ?

And "Hello World!" ~ '\n' ?
And  null ~ "Hello World!\n" ?

Currently DMD puts \0 in some cases but not others, and it's rather random.

The root cause is that this trailing zero is not part of the type, it's 
part of the literal. There are no rules for how literals are propagated 
inside expressions, they are just literals. This is a mess.

There is a second difference.
Array literals of char type have completely different semantics from 
string literals. In module scope:

char[] x = ['a'];  // OK -- array literals can have an implicit .dup
char[] y = "b";    // illegal

This is a big problem for CTFE, because for CTFE, a string is just a 
compile-time value, it's neither string literal nor array literal!

See bug 8660 for further details of the problems this causes.
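The asymmetry can be checked at compile time. This is a small sketch using __traits(compiles, ...), written at function scope rather than module scope; the function name is hypothetical, and it reflects how the compilers I know of behave rather than anything the spec spells out.

```d
// Hypothetical check of which literal form may initialise a mutable char[].
bool literalFormsDiffer()
{
    // An array literal of chars may initialise a mutable char[] ...
    enum arrayOk = __traits(compiles, { char[] x = ['a']; });
    // ... but a string literal may not, because it is immutable(char)[].
    enum stringOk = __traits(compiles, { char[] y = "b"; });
    return arrayOk && !stringOk;
}
```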


A proposal to clean up this mess
--------------------------------

Any compile-time value of type immutable(char)[] or const(char)[] 
behaves as string literals currently do, and will have a \0 appended when 
it is stored in the executable.

ie,

enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
printf(hello);

will work.

Any value of type char[], which is generated at compile time, will not 
have the trailing \0, and it will do an implicit dup (as current array 
literals do).

char [] foo()
{
     return "abc";
}

char [] x = foo();

// x does not have a trailing \0, and it is implicitly duped, even
// though it was not declared with an array literal.

-------------------
So that the difference between string literals and char array literals 
would simply be that the latter are polysemous. There would be no 
semantics associated with the form of the literal itself.


We still have this oddity:


void foo(char qqq = 'b') {

    string x = "abc";            // trailing \0
    string y = ['a', 'b', 'c'];  // trailing \0
    string z = ['a', qqq, 'c'];  // no trailing \0
}

This is because we made the (IMHO mistaken) decision to allow variables 
inside array literals.
This is the reason why I listed _compile time value_ in the requirement 
for having a \0, rather than entirely basing it on the type.

We could fix that with a language change: an array literal which 
contains a variable should not be of immutable type. It should be of 
mutable type (or const, in the case where it contains other, immutable 
values).

So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, 
even though w is allocated on the heap).

But that's a separate proposal from the one I'm making here. I just need 
a decision on the main proposal so that I can fix a pile of CTFE bugs.

Oct 02 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a 
 trailing \0. This means that it is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for 
 compatibility with C. This trailing \0 is mentioned in the spec 
 but only incidentally, and generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");

If every string literal is \0-terminated, then there should be 
two \0 in the final string. I guess that's not the case and 
that's actually my preferred behaviour, but the spec should make 
it crystal clear in which situations a string literal gets a 
terminator and in which not.

Oct 02 2012

Don Clugston <dac nospam.com> writes:

On 02/10/12 13:18, Tobias Pankrath wrote:
 On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing
 \0. This means that it is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for compatibility
 with C. This trailing \0 is mentioned in the spec but only
 incidentally, and generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");

 If every string literal is \0-terminated, then there should be two \0 in
 the final string. I guess that's not the case and that's actually my
 preferred behaviour, but the spec should make it crystal clear in which
 situations a
 string literal gets a terminator and in which not.

The \0 is *not* part of the string, it lies after the string.
It's as if all memory is cleared, then the string literals are copied 
into it, with a gap of at least one byte between each. The 'trailing 0' 
is not part of the literal, it's the underlying cleared memory.

At least, that's how I understand it. The spec is very vague.

Oct 02 2012

deadalnix <deadalnix gmail.com> writes:

Well, the whole mess comes from the fact that D conflates C strings and D 
strings.

The first problem comes from the fact that D arrays are implicitly 
convertible to pointers. So calling a D function that expects a char* is 
possible with a D string, even though it is unsafe and will not work in 
the general case.

The fact that D provides tricks that make it work in special cases 
is harmful, as previous discussions have shown (many D programmers assume 
that this will always work because of toy tests they have made, whereas 
in general it won't and toStringz must be used).

The only sane solution I can think of is to:
  - Disallow implicit conversion of slices to pointers. .ptr is made for that.
  - Do not put any trailing 0 in string literals, unless it is specified 
explicitly ( "foobar\0" ).
  - Except if a const(char)* is expected from the string literal. In that 
case it becomes a Cstring literal, with a trailing 0. This is made to 
allow uses like printf("foobar");

In other words, the receiver type is used to decide whether the compiler 
generates a string literal or a Cstring literal.

Other additions of 0 are just confusing, and will make incorrect code 
work in special cases, which is something you usually don't want. Code 
that works by accident often backfires in spectacular ways at the least 
expected moment.
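To make the toStringz point concrete, here is a minimal sketch of the case where the literal's terminator does not help: a slice taken at run time. The helper name cLengthOfPrefix is hypothetical; strlen and toStringz are the usual druntime and Phobos functions.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: the length a C function sees for the first n chars of s.
size_t cLengthOfPrefix(string s, size_t n)
{
    string prefix = s[0 .. n];   // a slice: in general no \0 follows it
    // Handing prefix.ptr to strlen directly would let it run on into the
    // rest of s, so a terminated copy is requested explicitly instead.
    return strlen(toStringz(prefix));
}
```

With s = "foobar" and n = 3 the helper reports 3, whereas the raw pointer would point at memory that continues with "bar".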

Oct 02 2012

Don Clugston <dac nospam.com> writes:

On 02/10/12 13:26, deadalnix wrote:
 Well, the whole mess comes from the fact that D conflates C strings and D
 strings.

 The first problem comes from the fact that D arrays are implicitly
 convertible to pointers. So calling a D function that expects a char* is
 possible with a D string, even though it is unsafe and will not work in
 the general case.

 The fact that D provides tricks that make it work in special cases
 is harmful, as previous discussions have shown (many D programmers assume
 that this will always work because of toy tests they have made, whereas
 in general it won't and toStringz must be used).

 The only sane solution I can think of is to:
   - Disallow implicit conversion of slices to pointers. .ptr is made for that.
   - Do not put any trailing 0 in string literals, unless it is specified
 explicitly ( "foobar\0" ).
   - Except if a const(char)* is expected from the string literal. In that
 case it becomes a Cstring literal, with a trailing 0. This is made to
 allow uses like printf("foobar");

 In other words, the receiver type is used to decide whether the compiler
 generates a string literal or a Cstring literal.

This still doesn't solve the problem of the difference between array 
literals and string literals (the magical implicit .dup), which is the 
key problem I'm trying to solve.

Oct 02 2012

deadalnix <deadalnix gmail.com> writes:

On 02/10/2012 15:12, Don Clugston wrote:
 On 02/10/12 13:26, deadalnix wrote:
 Well, the whole mess comes from the fact that D conflates C strings and D
 strings.

 The first problem comes from the fact that D arrays are implicitly
 convertible to pointers. So calling a D function that expects a char* is
 possible with a D string, even though it is unsafe and will not work in
 the general case.

 The fact that D provides tricks that make it work in special cases
 is harmful, as previous discussions have shown (many D programmers assume
 that this will always work because of toy tests they have made, whereas
 in general it won't and toStringz must be used).

 The only sane solution I can think of is to:
 - Disallow implicit conversion of slices to pointers. .ptr is made for that.
 - Do not put any trailing 0 in string literals, unless it is specified
 explicitly ( "foobar\0" ).
 - Except if a const(char)* is expected from the string literal. In that
 case it becomes a Cstring literal, with a trailing 0. This is made to
 allow uses like printf("foobar");

 In other words, the receiver type is used to decide whether the compiler
 generates a string literal or a Cstring literal.

 This still doesn't solve the problem of the difference between array
 literals and string literals (the magical implicit .dup), which is the
 key problem I'm trying to solve.

OK, in fact we have 2 different and unrelated problems here. I have to 
say I have no idea for the second one.

Oct 02 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 10/2/12, Don Clugston <dac nospam.com> wrote:
 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[]
 behaves as string literals currently do, and will have a \0 appended when
 it is stored in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.

What about these, will these pass?:

enum string x = "foo";
assert(x.length == 3);

void test(string x) { assert(x.length == 3); }
test(x);

If these don't pass the proposal will break code.

Oct 02 2012

Don Clugston <dac nospam.com> writes:

On 02/10/12 14:02, Andrej Mitrovic wrote:
 On 10/2/12, Don Clugston <dac nospam.com> wrote:
 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[]
 behaves as string literals currently do, and will have a \0 appended when
 it is stored in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.

 What about these, will these pass?:

 enum string x = "foo";
 assert(x.length == 3);

 void test(string x) { assert(x.length == 3); }
 test(x);

 If these don't pass the proposal will break code.

Yes, they pass. The \0 is not included in the string length. It's 
effectively in the data segment, not in the string.

Oct 02 2012

kenji hara <k.hara.pg gmail.com> writes:

2012/10/2 Don Clugston <dac nospam.com>:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0. This
 means that it is possible to write,

 printf("Hello, World!\n");

 without including a trailing \0. This is important for compatibility with C.
 This trailing \0 is mentioned in the spec but only incidentally, and
 generally in connection with printf.

 But the semantics are not well defined.

 printf("Hello, W" ~ "orld!\n");

 Does this have a trailing \0 ? I think it should, because it improves
 readability of string literals that are longer than one line. Currently DMD
 adds a \0, but it is not in the spec.

 Now consider array literals.

 printf(['H','e', 'l', 'l','o','\n']);

 Does this have a trailing \0 ? Currently DMD does not put one in.
 How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?

 And "Hello " ~ ['W','o','r','l','d','\n']   ?

 And "Hello World!" ~ '\n' ?
 And  null ~ "Hello World!\n" ?

 Currently DMD puts \0 in some cases but not others, and it's rather random.

 The root cause is that this trailing zero is not part of the type, it's part
 of the literal. There are no rules for how literals are propagated inside
 expressions, they are just literals. This is a mess.

 There is a second difference.
 Array literals of char type have completely different semantics from string
 literals. In module scope:

 char[] x = ['a'];  // OK -- array literals can have an implicit .dup
 char[] y = "b";    // illegal

 This is a big problem for CTFE, because for CTFE, a string is just a
 compile-time value, it's neither string literal nor array literal!

 See bug 8660 for further details of the problems this causes.


 A proposal to clean up this mess
 --------------------------------

 Any compile-time value of type immutable(char)[] or const(char)[] behaves as
 string literals currently do, and will have a \0 appended when it is stored
 in the executable.

 ie,

 enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
 printf(hello);

 will work.

 Any value of type char[], which is generated at compile time, will not have
 the trailing \0, and it will do an implicit dup (as current array literals
 do).

 char [] foo()
 {
     return "abc";
 }

 char [] x = foo();

 // x does not have a trailing \0, and it is implicitly duped, even though it
 // was not declared with an array literal.

 -------------------
 So that the difference between string literals and char array literals would
 simply be that the latter are polysemous. There would be no semantics
 associated with the form of the literal itself.


 We still have this oddity:


 void foo(char qqq = 'b') {

    string x = "abc";            // trailing \0
    string y = ['a', 'b', 'c'];  // trailing \0
    string z = ['a', qqq, 'c'];  // no trailing \0
 }

 This is because we made the (IMHO mistaken) decision to allow variables
 inside array literals.
 This is the reason why I listed _compile time value_ in the requirement for
 having a \0, rather than entirely basing it on the type.

 We could fix that with a language change: an array literal which contains a
 variable should not be of immutable type. It should be of mutable type (or
 const, in the case where it contains other, immutable values).

 So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
 though w is allocated on the heap).

 But that's a separate proposal from the one I'm making here. I just need a
 decision on the main proposal so that I can fix a pile of CTFE bugs.

Maybe your proposal is correct.
I think the key idea is *polysemous typed string literal*.

Based on the Ideal D Interpreter in my brain, the organized rules
would be as follows.

1-1) At the semantic level, D should have just one polysemous string
literal, which is "an array of char".
1-2) At the token level, D has two representations of the polysemous
string literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to
[wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is
not needed, because immutable([wd]?char)[] is implicitly convertible to
it).

3) The result of concatenating polysemous literals is still
polysemous, but its representation depends on both sides
of the operator.

   "str" ~ "str";         // "strstr"
   "str" ~ ['s','t','r']; // ['s','t','r','s','t','r']
   "str" ~ 's';           // "strs"
   ['s','t','r'] ~ 's';   // ['s','t','r','s']
   "str" ~ null;          // "str"
   ['s','t','r'] ~ null;  // ['s','t','r']

4) After semantic analysis _and_ optimization, a polysemous string
literal which is represented as
 4-1) "str" is typed as immutable([wd]?char)[] (the char type
depends on the literal suffix).
 4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type
depends on the common type of its elements).

5) In the object file generation phase, a string literal which is typed as
  5-1) immutable([wd]?char)[] is stored in the executable and
implicitly terminated with \0.
  5-2) [wd]?char[] is stored in the executable as the original image
and implicitly 'dup'ed at runtime.
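For comparison with rules 3 and 4, this is a small sketch of how the two spellings are typed and concatenated by the compilers I am aware of. It describes existing behaviour, not the proposed rules, and the function name sameContents is hypothetical.

```d
// How the two spellings are typed today (illustrative, not part of the proposal).
static assert(is(typeof("str") == string));            // immutable(char)[]
static assert(is(typeof(['s', 't', 'r']) == char[]));  // mutable char[]
static assert(is(typeof("str"w) == wstring));          // the suffix selects the width

// Whatever the spelling, the concatenated contents compare equal.
bool sameContents()
{
    return ("str" ~ "str") == "strstr"
        && ("str" ~ 's') == "strs"
        && (['s', 't', 'r'] ~ 's') == "strs";
}
```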

----
Additionally, in the following cases, both concatenations should generate
polysemous string literals at CT and RT, because after concatenation of
chars and char arrays, the newly allocated strings are *purely immutable*
values and implicitly convertible to mutable.

immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r']));   // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r'];  // BUT, should be allowed

char mc = 'a';
pragma(msg, typeof("st"~mc~"r"));   // prints const(char)[]
char[] s = "st"~mc~"r";             // BUT, should be allowed

Kenji Hara

Oct 02 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
 [SNIP]
 A proposal to clean up this mess
 [SNIP]

While I think it is convenient to be able to write 
'printf("world");', as you point out, I think that the fact that 
it works "inconsistently" (and by that, I mean there are rules 
and exceptions), is even more dangerous.

If at all possible, I'd rather side with consistency than the 
"we got your back... except when we don't" approach, i.e. strings 
are NEVER null terminated.

In theory, how often do you *really* need null terminated 
strings? And when you do, wouldn't it be safer to just write 
'printf("world\0")'? or 'printf(str ~ "world" ~ '\0');' rather 
than "Am I in a case where it is null terminated? Yeah... 90% 
confident I am..."

If you want 0 termination, then make it explicit, that's my 
opinion.
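Spelled out, the explicit style might look like the sketch below. Both helper names are hypothetical; the only point is that the terminator is visible in the source instead of being implied by the literal.

```d
import core.stdc.string : strlen;

// Hypothetical helper: append the terminator explicitly and hand back a pointer.
const(char)* withExplicitZero(string s)
{
    return (s ~ '\0').ptr;   // concatenation allocates a fresh, terminated copy
}

// Length as a C function would measure the terminated copy.
size_t explicitCLength(string s)
{
    return strlen(withExplicitZero(s));
}
```

The cost is one allocation per call, which is the price of not depending on where the string came from.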

Besides, as you said, the null termination is not documented, so 
anything relying on it is a bug really. Just an observation of an 
implementation detail.

Oct 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Tuesday, 2 October 2012 at 14:03:36 UTC, monarch_dodra wrote:
 If you want 0 termination, then make it explicit, that's my 
 opinion.

That ship has long since sailed. You'll break code in an
incredibly dangerous way if you were to change it now.

Oct 04 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/2/12 7:11 AM, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0.

[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses 
an issue that some have, although I never do. With that warning, a few 
candid opinions follow.

First, I think zero-terminated strings shouldn't be needed frequently 
enough in D code to make this necessary.

Second, a simple and workable solution to this would be to address the 
matter dynamically: make toStringz opportunistically look whether 
there's a \0 beyond the end of the string, EXCEPT when the string 
happens to end exactly at a page boundary (in which case accessing 
memory beyond the end of the string may produce a page fault). With this 
simple dynamic test we don't need precise and stringent rules for the 
implementation.
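A rough sketch of the dynamic test described above, under stated assumptions: the page size is hard-coded to 4096 (real code would ask the OS), and all names are hypothetical. It reads one byte past the end of the slice, which is only tolerable under the assumptions made in this paragraph about how the memory is laid out.

```d
// Assumed page size; real code would query the operating system instead.
enum size_t assumedPageSize = 4096;

// True when the first byte past the string would start a new page.
bool endsAtPageBoundary(size_t endAddress, size_t pageSize = assumedPageSize)
{
    return endAddress % pageSize == 0;
}

// Opportunistic check: is there already a \0 right after the slice?
bool looksTerminated(const(char)[] s)
{
    auto end = cast(size_t)(s.ptr + s.length);
    if (endsAtPageBoundary(end))
        return false;                 // peeking could fault, so give up
    return s.ptr[s.length] == '\0';   // reads one byte past the slice
}
```

A toStringz built on this would return s.ptr when looksTerminated(s) holds and fall back to copying otherwise.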

Third, the complex set of rules proposed pushes the number of cases in 
which the \0 is guaranteed, but doesn't make for a clear and easy to 
remember boundary. Therefore people will need to remember some more 
rules to make sure they can, well, avoid a call to toStringz.

On 10/2/12 10:55 AM, Regan Heath wrote:
 Recent discussions on the zero terminated string problems and
 inconsistency of string literals has me, again, wondering why D
 doesn't have a 'type' to represent C's zero terminated strings.  It
 seems to me that having a type, and typing C functions with it would
 solve a lot of problems.

[snip]
 I am probably missing something obvious, or I have forgotten one of
 the array/slice complexities which makes this a nightmare.

You're not missing anything and defining a zero-terminated type is 
something I considered doing and have been highly interested in. My 
interest is motivated by the fact that sentinel-terminated structures 
are a very interesting example of forward ranges that are also 
contiguous. That sets them apart from both singly-linked lists and 
simple arrays, and gives them interesting properties.

I'd be interested in defining the more general:

struct SentinelTerminatedSlice(T, T terminator)
{
     private T* data;
     ...
}

That would be a forward range and the instantiation 
SentinelTerminatedSlice!(char, 0) would be CString.
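One possible way to fill in such a type is sketched below. The post does not show the body, so every member here (the constructor, the range primitives, cstringLength) is an assumption, and the alias is instantiated over immutable(char) with '\0' rather than (char, 0) so that string literals can be wrapped directly.

```d
import std.range.primitives : isForwardRange, walkLength;

struct SentinelTerminatedSlice(T, T terminator)
{
    private T* data;

    this(T* p) { data = p; }

    // Range primitives: the range is exhausted once the sentinel is reached.
    @property bool empty() const { return *data == terminator; }
    @property T front() const { return *data; }
    void popFront() { ++data; }

    // A copy of the pointer is an independent position, hence a forward range.
    @property SentinelTerminatedSlice save() { return this; }
}

alias CString = SentinelTerminatedSlice!(immutable(char), '\0');

static assert(isForwardRange!CString);

// Hypothetical usage: count the characters before the terminator.
size_t cstringLength(immutable(char)* p)
{
    return CString(p).walkLength;
}
```

Because the length is not stored, walkLength has to scan for the sentinel, just as strlen does.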

However, so far I held off of defining such a range because C-strings 
are seldom useful in D code and there are not many other compelling 
examples of sentinel-terminated ranges. Maybe it's time to dust off that 
idea, I'd love it if we gathered enough motivation for it.


Andrei

Oct 02 2012

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
 However, so far I held off of defining such a range because 
 C-strings are seldom useful in D code [...]

I think your view of what is common in D code is not 
representative. You are primarily a library writer, which means 
you rarely have to interface with other code. Please correct me 
if I'm wrong, but I don't believe you've written much 
application-level D code.

For people that write applications, we have the unfortunate chore 
of having to call lots of C APIs to get things done. There's a 
long list of things for which there is no D interface (graphics, 
audio, input, GUI, database, platform APIs, various 3rd party 
libs). Invariably these interfaces require C strings. In short, 
if you write applications in D, you need C strings.

I don't know what the right decision is here, but please do not 
say that C-strings are seldom useful in D code.

Oct 02 2012

Don Clugston <dac nospam.com> writes:

On 02/10/12 17:14, Andrei Alexandrescu wrote:
 On 10/2/12 7:11 AM, Don Clugston wrote:
 The problem
 -----------

 String literals in D are a little bit magical; they have a trailing \0.

 [snip]

 I don't mean to be Debbie Downer on this because I reckon it addresses
 an issue that some have, although I never do. With that warning, a few
 candid opinions follow.

 First, I think zero-terminated strings shouldn't be needed frequently
 enough in D code to make this necessary.

[snip]

You're missing the point, a bit. The zero-terminator is only one symptom 
of the underlying problem: string literals and array literals have the 
same type but different semantics.
The other symptoms are:
* the implicit .dup that happens with array literals, but not string 
literals.
This is a silent performance killer. It's probably the most common 
performance bug we find in our code, and it's completely ungreppable.

* string literals are polysemous with width (c, w, d) but array literals 
are not (they are polysemous with constness).
For example,
"abc" ~ '�'
is legal, but
['a', 'b', 'c'] ~ '�'
is not.
This has nothing to do with the zero terminator.
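The implicit allocation from the first bullet can be observed directly. This is a minimal sketch with hypothetical helper names: an array literal yields a fresh array each time it is evaluated, while a string literal keeps pointing at the same static data.

```d
// Hypothetical helpers contrasting the two literal forms.
char[] fromArrayLiteral() { return ['a', 'b', 'c']; }   // allocates on every call
string fromStringLiteral() { return "abc"; }            // points at static data

bool arrayLiteralAllocatesEachTime()
{
    auto a = fromArrayLiteral();
    auto b = fromArrayLiteral();
    a[0] = 'x';                              // mutating one copy ...
    return a.ptr !is b.ptr && b[0] == 'a';   // ... leaves the other untouched
}

bool stringLiteralIsShared()
{
    return fromStringLiteral().ptr is fromStringLiteral().ptr;
}
```

Nothing in the source marks the allocation, which is why it is hard to find by searching the code.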

Oct 04 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
 First, I think zero-terminated strings shouldn't be needed 
 frequently enough in D code to make this necessary.

My experience has been much different. Interfacing with C occurs
in nearly every D program I write, and I usually end up passing
a string literal. Anecdotes!

Oct 04 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 4 October 2012 at 07:57:16 UTC, Bernard Helyer wrote:
 On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
 wrote:
 First, I think zero-terminated strings shouldn't be needed 
 frequently enough in D code to make this necessary.

 My experience has been much different. Interfacing with C occurs
 in nearly every D program I write, and I usually end up passing
 a string literal. Anecdotes!

Agreed. I'm always happy when I find that the particular C API I 
am working with supports passing strings as a pointer/length pair 
:)

Anyway, toStringz (and the wchar and dchar equivalents in 
std.utf) needs to be fixed regardless - it currently does a 
dangerous optimization if the string is immutable, otherwise it 
unconditionally concatenates. We cannot rely on strings being GC 
allocated based on mutability. Memory is outside the scope of the 
D type system - we cannot make assumptions about memory based on 
types.

Oct 04 2012
