digitalmars.D - Proposal: clean up semantics of array literals vs string literals
- Don Clugston (73/73) Oct 02 2012 The problem
- Tobias Pankrath (6/16) Oct 02 2012 If every string literal is \0-terminated, then there should be
- Don Clugston (6/28) Oct 02 2012 The \0 is *not* part of the string, it lies after the string.
- deadalnix (23/23) Oct 02 2012 Well the whole mess come from the fact that D conflate C string and D
- Don Clugston (4/23) Oct 02 2012 This still doesn't solve the problem of the difference between array
- deadalnix (3/30) Oct 02 2012 OK, infact we have 2 different and unrelated problems here. I have to
- Andrej Mitrovic (7/16) Oct 02 2012 What about these, will these pass?:
- Don Clugston (3/23) Oct 02 2012 Yes, they pass. The \0 is not included in the string length. It's
- kenji hara (46/118) Oct 02 2012 Maybe your proposal is correct.
- monarch_dodra (18/21) Oct 02 2012 While I think it is convenient to be able to write
- Bernard Helyer (3/5) Oct 04 2012 That ship has long since sailed. You'll break code in an
- Andrei Alexandrescu (39/49) Oct 02 2012 [snip]
- Peter Alexander (15/17) Oct 02 2012 I think your view of what is common in D code is not
- Don Clugston (18/29) Oct 04 2012 [snip]
- Bernard Helyer (5/7) Oct 04 2012 My experience has been much different. Interfacing with C occurs
- Jakob Ovrum (11/18) Oct 04 2012 Agreed. I'm always happy when I find that the particular C API I
The problem
-----------

String literals in D are a little bit magical; they have a trailing \0. This means that it is possible to write,

    printf("Hello, World!\n");

without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.

But the semantics are not well defined.

    printf("Hello, W" ~ "orld!\n");

Does this have a trailing \0 ? I think it should, because it improves readability of string literals that are longer than one line. Currently DMD adds a \0, but it is not in the spec.

Now consider array literals.

    printf(['H','e', 'l', 'l','o','\n']);

Does this have a trailing \0 ? Currently DMD does not put one in.

How about ['H','e', 'l', 'l','o'] ~ " World!\n" ?
And "Hello " ~ ['W','o','r','l','d','\n'] ?
And "Hello World!" ~ '\n' ?
And null ~ "Hello World!\n" ?

Currently DMD puts \0 in some cases but not others, and it's rather random.

The root cause is that this trailing zero is not part of the type, it's part of the literal. There are no rules for how literals are propagated inside expressions, they are just literals. This is a mess.

There is a second difference. Array literals of char type have completely different semantics from string literals. In module scope:

    char[] x = ['a'];  // OK -- array literals can have an implicit .dup
    char[] y = "b";    // illegal

This is a big problem for CTFE, because for CTFE, a string is just a compile-time value, it's neither string literal nor array literal! See bug 8660 for further details of the problems this causes.

A proposal to clean up this mess
--------------------------------

Any compile-time value of type immutable(char)[] or const(char)[] behaves as string literals currently do, and will have a \0 appended when it is stored in the executable. ie,

    enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
    printf(hello);

will work.

Any value of type char[], which is generated at compile time, will not have the trailing \0, and it will do an implicit dup (as current array literals do).

    char [] foo() { return "abc"; }
    char [] x = foo();
    // x does not have a trailing \0, and it is implicitly duped, even though it was not declared with an array literal.

-------------------

So the difference between string literals and char array literals would simply be that the latter are polysemous. There would be no semantics associated with the form of the literal itself.

We still have this oddity:

    void foo(char qqq = 'b')
    {
        string x = "abc";            // trailing \0
        string y = ['a', 'b', 'c'];  // trailing \0
        string z = ['a', qqq, 'c'];  // no trailing \0
    }

This is because we made the (IMHO mistaken) decision to allow variables inside array literals. This is the reason why I listed _compile time value_ in the requirement for having a \0, rather than entirely basing it on the type.

We could fix that with a language change: an array literal which contains a variable should not be of immutable type. It should be of mutable type (or const, in the case where it contains other, immutable values). So

    char [] w = ['a', qqq, 'c'];

should compile (it currently doesn't, even though w is allocated on the heap). But that's a separate proposal from the one I'm making here. I just need a decision on the main proposal so that I can fix a pile of CTFE bugs.
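The second difference described above (array literals may initialize a mutable char[], string literals may not) can be checked with a small program. This is only a sketch of the behaviour as described in the post; the variable names are illustrative and not taken from any real code base.

```d
void main()
{
    // An array literal of chars can initialize a mutable array:
    // a fresh copy is allocated behind the scenes.
    char[] x = ['a'];
    x[0] = 'z';
    assert(x == "z");

    // A string literal is immutable(char)[] and does not convert to char[].
    static assert(!__traits(compiles, { char[] bad = "b"; }));

    // With a string literal the copy has to be spelled out.
    char[] y = "b".dup;
    y[0] = 'c';
    assert(y == "c");

    // The array-literal form also converts to string.
    string s = ['a', 'b', 'c'];
    assert(s == "abc");
}
```

The explicit .dup in the third case is what makes the allocation visible in source; the array-literal case performs the same kind of copy without any marker.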
Oct 02 2012
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
> The problem
> -----------
> String literals in D are a little bit magical; they have a trailing \0. This means that is possible to write,
>
>     printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.
>
> But the semantics are not well defined.
>
>     printf("Hello, W" ~ "orld!\n");

If every string literal is \0-terminated, then there should be two \0 in the final string. I guess that's not the case and that's actually my preferred behaviour, but the spec should make it crystal clear in which situations a string literal gets a terminator and in which not.
Oct 02 2012
On 02/10/12 13:18, Tobias Pankrath wrote:
> On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
> > The problem
> > -----------
> > String literals in D are a little bit magical; they have a trailing \0. This means that is possible to write,
> >
> >     printf("Hello, World!\n");
> >
> > without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.
> >
> > But the semantics are not well defined.
> >
> >     printf("Hello, W" ~ "orld!\n");
>
> If every string literal is \0-terminated, then there should be two \0 in the final string. I guess that's not the case and that's actually my preferred behaviour, but the spec should make it crystal clear in which situations a string literal gets a terminator and in which not.

The \0 is *not* part of the string, it lies after the string. It's as if all memory is cleared, then the string literals are copied into it, with a gap of at least one byte between each. The 'trailing 0' is not part of the literal, it's the underlying cleared memory.

At least, that's how I understand it. The spec is very vague.
Oct 02 2012
Well, the whole mess comes from the fact that D conflates C strings and D strings.

The first problem comes from the fact that D arrays are implicitly convertible to pointers. So calling a D function that expects a char* is possible with a D string, even if it is unsafe and will not work in the general case.

The fact that D provides tricks that make it work in special cases is harmful, as previous discussions have shown (many D programmers assume that this will always work because of toy tests they have made, whereas in the general case it won't, and toStringz must be used).

The only sane solution I can think of is to:
- Disallow slices from converting implicitly to pointers. .ptr is made for that.
- Do not put any trailing 0 in a string literal, unless it is specified explicitly ( "foobar\0" ).
- Except if a const(char)* is expected from the string literal, in which case it becomes a Cstring literal, with a trailing 0. This is made to allow uses like printf("foobar");

In other terms, the receiver type is used to decide whether the compiler generates a string literal or a Cstring literal.

Other additions of 0 are just confusing, and will make incorrect code work in special cases, which is something you usually don't want. Code that works by accident often backfires in spectacular ways at the least expected moment.
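To make the "works in toy tests, fails in general" point concrete, here is a minimal sketch, assuming only core.stdc.string.strlen and std.string.toStringz from the standard library. The buffer contents are made up for illustration.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    // Seven chars; the only \0 is the explicit one at the very end.
    char[] buf = "foobar\0".dup;

    // A slice knows its own length, but nothing terminates it in memory.
    char[] slice = buf[0 .. 3];
    assert(slice == "foo");

    // Handing slice.ptr to C code reads on until the next \0,
    // so C sees "foobar" rather than "foo".
    assert(strlen(slice.ptr) == 6);

    // toStringz yields a pointer to a properly terminated copy.
    assert(strlen(slice.toStringz) == 3);
}
```

A literal passed directly would have appeared to work, which is exactly how such code passes a quick test and then fails on sliced or computed strings.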
Oct 02 2012
On 02/10/12 13:26, deadalnix wrote:
> Well, the whole mess comes from the fact that D conflates C strings and D strings.
>
> The first problem comes from the fact that D arrays are implicitly convertible to pointers. So calling a D function that expects a char* is possible with a D string, even if it is unsafe and will not work in the general case.
>
> The fact that D provides tricks that make it work in special cases is harmful, as previous discussions have shown (many D programmers assume that this will always work because of toy tests they have made, whereas in the general case it won't, and toStringz must be used).
>
> The only sane solution I can think of is to:
> - Disallow slices from converting implicitly to pointers. .ptr is made for that.
> - Do not put any trailing 0 in a string literal, unless it is specified explicitly ( "foobar\0" ).
> - Except if a const(char)* is expected from the string literal, in which case it becomes a Cstring literal, with a trailing 0. This is made to allow uses like printf("foobar");
>
> In other terms, the receiver type is used to decide whether the compiler generates a string literal or a Cstring literal.

This still doesn't solve the problem of the difference between array literals and string literals (the magical implicit .dup), which is the key problem I'm trying to solve.
Oct 02 2012
On 02/10/2012 15:12, Don Clugston wrote:
> On 02/10/12 13:26, deadalnix wrote:
> > Well, the whole mess comes from the fact that D conflates C strings and D strings.
> >
> > The first problem comes from the fact that D arrays are implicitly convertible to pointers. So calling a D function that expects a char* is possible with a D string, even if it is unsafe and will not work in the general case.
> >
> > The fact that D provides tricks that make it work in special cases is harmful, as previous discussions have shown (many D programmers assume that this will always work because of toy tests they have made, whereas in the general case it won't, and toStringz must be used).
> >
> > The only sane solution I can think of is to:
> > - Disallow slices from converting implicitly to pointers. .ptr is made for that.
> > - Do not put any trailing 0 in a string literal, unless it is specified explicitly ( "foobar\0" ).
> > - Except if a const(char)* is expected from the string literal, in which case it becomes a Cstring literal, with a trailing 0. This is made to allow uses like printf("foobar");
> >
> > In other terms, the receiver type is used to decide whether the compiler generates a string literal or a Cstring literal.
>
> This still doesn't solve the problem of the difference between array literals and string literals (the magical implicit .dup), which is the key problem I'm trying to solve.

OK, in fact we have 2 different and unrelated problems here. I have to say I have no idea for the second one.
Oct 02 2012
On 10/2/12, Don Clugston <dac nospam.com> wrote:
> A proposal to clean up this mess
> --------------------------------
> Any compile-time value of type immutable(char)[] or const(char)[], behaves a string literals currently do, and will have a \0 appended when it is stored in the executable. ie,
>
>     enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
>     printf(hello);
>
> will work.

What about these, will these pass?:

    enum string x = "foo";
    assert(x.length == 3);

    void test(string x) { assert(x.length == 3); }
    test(x);

If these don't pass the proposal will break code.
Oct 02 2012
On 02/10/12 14:02, Andrej Mitrovic wrote:
> On 10/2/12, Don Clugston <dac nospam.com> wrote:
> > A proposal to clean up this mess
> > --------------------------------
> > Any compile-time value of type immutable(char)[] or const(char)[], behaves a string literals currently do, and will have a \0 appended when it is stored in the executable. ie,
> >
> >     enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> >     printf(hello);
> >
> > will work.
>
> What about these, will these pass?:
>
>     enum string x = "foo";
>     assert(x.length == 3);
>
>     void test(string x) { assert(x.length == 3); }
>     test(x);
>
> If these don't pass the proposal will break code.

Yes, they pass. The \0 is not included in the string length. It's effectively in the data segment, not in the string.
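The answer above can be checked directly. The following is a small sketch built around the code from the question, with core.stdc.string.strlen added to look one char past the end of the literal; it relies on string literals being stored with a terminator, as the thread describes.

```d
import core.stdc.string : strlen;

enum string x = "foo";

void test(string s)
{
    assert(s.length == 3);
}

void main()
{
    // The terminator is not counted in the length.
    static assert(x.length == 3);
    test(x);

    // It sits in memory just past the three counted chars.
    assert(strlen(x.ptr) == 3);
    assert(x.ptr[3] == '\0');
}
```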
Oct 02 2012
2012/10/2 Don Clugston <dac nospam.com>:
> [snip]

Maybe your proposal is correct.

I think the key idea is the *polysemous typed string literal*.

Based on the Ideal D Interpreter in my brain, the organized rules would be as follows.

1-1) At the semantic level, D should have just one polysemous string literal, which is "an array of char".
1-2) At the token level, D has two representations for the polysemous string literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to [wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is not needed, because immutable([wd]?char)[] is implicitly convertible to it).

3) The result of concatenating polysemous literals is still polysemous, but its representation depends on both sides of the operator.

    "str" ~ "str";            // "strstr"
    "str" ~ ['s','t','r'];    // ['s','t','r','s','t','r']
    "str" ~ 's';              // "strs"
    ['s','t','r'] ~ 's';      // ['s','t','r','s']
    "str" ~ null;             // "str"
    ['s','t','r'] ~ null;     // ['s','t','r']

4) After semantic analysis _and_ optimization, a polysemous string literal which is represented as
4-1) "str" is typed as immutable([wd]?char)[] (the char type depends on the literal suffix).
4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type depends on the common type of its elements).

5) In the object file generation phase, a string literal which is typed as
5-1) immutable([wd]?char)[] is stored in the executable and implicitly terminated with \0.
5-2) [wd]?char[] is stored in the executable as the original image and implicitly 'dup'ed at runtime.

----

Additionally, in the following cases, both concatenations should generate polysemous string literals at CT and RT. This is because, after concatenation of chars and char arrays, the newly allocated strings are *purely immutable* values and implicitly convertible to mutable.

    immutable char ic = 'a';
    pragma(msg, typeof(['s', 't', ic, 'r']));   // prints const(char)[]
    immutable(char)[] s = ['s', 't', ic, 'r'];  // BUT, should be allowed

    char mc = 'a';
    pragma(msg, typeof("st"~mc~"r"));           // prints const(char)[]
    char[] s = "st"~mc~"r";                     // BUT, should be allowed

Kenji Hara
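Some of the concatenation cases listed in rule 3 can be spot-checked by value. This is a rough sketch only: it compares contents, not the proposed polysemous representation, and it leaves out the null cases. The type checks reflect how the two literal forms are typed on their own, which is an observation about the compiler rather than part of the rules proposed here.

```d
void main()
{
    // How the two token-level forms are typed when taken on their own.
    static assert(is(typeof("str") == string));
    static assert(is(typeof(['s', 't', 'r']) == char[]));
    static assert(is(typeof("str" ~ "str") == string));

    // Concatenation results, compared by content only.
    assert("str" ~ "str" == "strstr");
    assert("str" ~ 's' == "strs");
    assert(['s', 't', 'r'] ~ 's' == "strs");
    assert("str" ~ ['s', 't', 'r'] == "strstr");
}
```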
Oct 02 2012
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
> [SNIP]
> A proposal to clean up this mess
> [SNIP]

While I think it is convenient to be able to write 'printf("world");', as you point out, I think that the fact that it works "inconsistently" (and by that, I mean there are rules and exceptions) is even more dangerous.

If at all possible, I'd rather side with consistency than with the "we got your back... except when we don't" approach: IE: strings are NEVER null terminated.

In theory, how often do you *really* need null terminated strings? And when you do, wouldn't it be safer to just write 'printf("world\0")'? or 'printf(str ~ "world" ~ '\0');' rather than "Am I in a case where it is null terminated? Yeah... 90% confident I am..."

If you want 0 termination, then make it explicit, that's my opinion.

Besides, as you said, the null termination is not documented, so anything relying on it is a bug really. Just an observation of an implementation detail.
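A minimal sketch of the explicit style suggested above, assuming core.stdc.string.strlen for the check. The variable names are illustrative. Note that once the terminator is written out it becomes part of the array and is counted in .length, unlike the implicit terminator after a literal.

```d
import core.stdc.string : strlen;

void main()
{
    string str = "hello, ";

    // The terminator is appended explicitly, so it is part of the array.
    string z = str ~ "world" ~ '\0';
    assert(z.length == 13);        // 7 + 5 chars, plus the explicit \0

    // C code sees the 12 chars before the terminator.
    assert(strlen(z.ptr) == 12);
}
```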
Oct 02 2012
On Tuesday, 2 October 2012 at 14:03:36 UTC, monarch_dodra wrote:
> If you want 0 termination, then make it explicit, that's my opinion.

That ship has long since sailed. You'll break code in an incredibly dangerous way if you were to change it now.
Oct 04 2012
On 10/2/12 7:11 AM, Don Clugston wrote:
> The problem
> -----------
> String literals in D are a little bit magical; they have a trailing \0.

[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses an issue that some have, although I never do. With that warning, a few candid opinions follow.

First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

Second, a simple and workable solution to this would be to address the matter dynamically: make toStringz opportunistically look whether there's a \0 beyond the end of the string, EXCEPT when the string happens to end exactly at a page boundary (in which case accessing memory beyond the end of the string may produce a page fault). With this simple dynamic test we don't need precise and stringent rules for the implementation.

Third, the complex set of rules proposed pushes the number of cases in which the \0 is guaranteed, but doesn't make for a clear and easy to remember boundary. Therefore people will need to remember some more rules to make sure they can, well, avoid a call to toStringz.

On 10/2/12 10:55 AM, Regan Heath wrote:
> Recent discussions on the zero terminated string problems and inconsistency of string literals has me, again, wondering why D doesn't have a 'type' to represent C's zero terminated strings. It seems to me that having a type, and typing C functions with it would solve a lot of problems.

[snip]

> I am probably missing something obvious, or I have forgotten one of the array/slice complexities which makes this a nightmare.

You're not missing anything and defining a zero-terminated type is something I considered doing and have been highly interested in.

My interest is motivated by the fact that sentinel-terminated structures are a very interesting example of forward ranges that are also contiguous. That sets them apart from both singly-linked lists and simple arrays, and gives them interesting properties.

I'd be interested in defining the more general:

    struct SentinelTerminatedSlice(T, T terminator)
    {
        private T* data;
        ...
    }

That would be a forward range and the instantiation SentinelTerminatedSlice!(char, 0) would be CString.

However, so far I held off of defining such a range because C-strings are seldom useful in D code and there are not many other compelling examples of sentinel-terminated ranges. Maybe it's time to dust off that idea, I'd love it if we gathered enough motivation for it.

Andrei
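For illustration, the elided struct body could be filled in roughly as follows. This is a sketch under stated assumptions, not the design Andrei had in mind: the member functions are guesses at the minimum needed for a forward range, the pointer is const(T)* rather than T* so that it can point at a literal, and the terminator is spelled '\0' instead of 0.

```d
import std.range.primitives : isForwardRange;

struct SentinelTerminatedSlice(T, T terminator)
{
    private const(T)* data;

    // The range is exhausted when the sentinel is reached.
    @property bool empty() const { return *data == terminator; }
    @property T front() const { return *data; }
    void popFront() { ++data; }

    // Copying the pointer is enough to save the position.
    @property SentinelTerminatedSlice save() { return this; }
}

alias CString = SentinelTerminatedSlice!(char, '\0');
static assert(isForwardRange!CString);

void main()
{
    // A string literal is followed by \0, so its .ptr is a valid start.
    auto s = CString("abc".ptr);

    size_t n;
    char last;
    foreach (c; s)      // iterates over a copy; s itself is not advanced
    {
        ++n;
        last = c;
    }
    assert(n == 3);
    assert(last == 'c');
    assert(!s.empty);
}
```

Unlike a slice, such a range has no length available in constant time; finding the end means walking to the sentinel, which is why it models a forward range rather than a random-access one.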
Oct 02 2012
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
> However, so far I held off of defining such a range because C-strings are seldom useful in D code [...]

I think your view of what is common in D code is not representative. You are primarily a library writer, which means you rarely have to interface with other code. Please correct me if I'm wrong, but I don't believe you've written much application-level D code.

For people that write applications, we have the unfortunate chore of having to call lots of C APIs to get things done. There's a long list of things for which there is no D interface (graphics, audio, input, GUI, database, platform APIs, various 3rd party libs). Invariably these interfaces require C strings.

In short, if you write applications in D, you need C strings. I don't know what the right decision is here, but please do not say that C-strings are seldom useful in D code.
Oct 02 2012
On 02/10/12 17:14, Andrei Alexandrescu wrote:
> On 10/2/12 7:11 AM, Don Clugston wrote:
> > The problem
> > -----------
> > String literals in D are a little bit magical; they have a trailing \0.
>
> [snip]
> I don't mean to be Debbie Downer on this because I reckon it addresses an issue that some have, although I never do. With that warning, a few candid opinions follow.
>
> First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

[snip]

You're missing the point, a bit. The zero-terminator is only one symptom of the underlying problem: string literals and array literals have the same type but different semantics.

The other symptoms are:
* the implicit .dup that happens with array literals, but not string literals. This is a silent performance killer. It's probably the most common performance bug we find in our code, and it's completely ungreppable.
* string literals are polysemous with width (c, w, d) but array literals are not (they are polysemous with constness). For example, "abc" ~ 'ü' is legal, but ['a', 'b', 'c'] ~ 'ü' is not. This has nothing to do with the zero terminator.
Oct 04 2012
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
> First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

My experience has been much different. Interfacing with C occurs in nearly every D program I write, and I usually end up passing a string literal. Anecdotes!
Oct 04 2012
On Thursday, 4 October 2012 at 07:57:16 UTC, Bernard Helyer wrote:
> On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
> > First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.
>
> My experience has been much different. Interfacing with C occurs in nearly every D program I write, and I usually end up passing a string literal. Anecdotes!

Agreed. I'm always happy when I find that the particular C API I am working with supports passing strings as a pointer/length pair :)

Anyway, toStringz (and the wchar and dchar equivalents in std.utf) needs to be fixed regardless - it currently does a dangerous optimization if the string is immutable, otherwise it unconditionally concatenates. We cannot rely on strings being GC allocated based on mutability. Memory is outside the scope of the D type system - we cannot make assumptions about memory based on types.
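The "unconditionally concatenates" path mentioned above can be sketched as a helper that never inspects memory beyond the slice. The name toCStringCopy is hypothetical and is not a Phobos function; the sketch only shows the always-copy strategy, at the cost of one allocation per call.

```d
import core.stdc.string : strlen;

// Hypothetical helper: always copies, never looks past the end of s.
const(char)* toCStringCopy(const(char)[] s)
{
    // Concatenation allocates a new array that ends in the terminator.
    return (s ~ '\0').ptr;
}

void main()
{
    char[] buf = "foobar".dup;
    auto p = toCStringCopy(buf[0 .. 3]);
    assert(strlen(p) == 3);

    // The result is an independent copy of the slice.
    buf[0] = 'X';
    assert(p[0] == 'f');
}
```

Because it makes no assumption about where the input lives or what follows it in memory, this approach does not depend on the mutability of the argument.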
Oct 04 2012