digitalmars.D.learn - What exactly are the String literals in D and how do they work?
- rempas (21/26) Aug 14 2021 So when I'm doing something like the following: `string name =
- jfondren (61/88) Aug 15 2021
- jfondren (5/15) Aug 15 2021 In the above case, "John" is a string that's compiled into the
- jfondren (58/75) Aug 15 2021 Probably a more useful way to think about this is to consider
- rempas (12/63) Aug 15 2021 That's interesting!
- rikki cattermole (4/5) Aug 15 2021 pragma is a set of commands to the compiler that may be compiler specifi...
- rempas (2/6) Aug 15 2021 Thanks man!
- jfondren (5/16) Aug 15 2021 dup() isn't aware of the NUL since that's outside the slice of
- rempas (3/7) Aug 15 2021 Is there something bad than just casting it to `char*` that I
- Tejas (3/11) Aug 15 2021 External C libraries expect strings to be null terminated, so if
- Mike Parker (18/23) Aug 15 2021 The D `string` is an alias for `immutable(char)[]`, immutable
- rempas (2/11) Aug 15 2021 Thanks a lot for the info!
- Ali Çehreli (24/29) Aug 15 2021 Lots of great information and pointers already. I will try from another...
- Steven Schveighoffer (29/59) Aug 15 2021 Lots of great responses in this thread!
So when I'm doing something like the following: `string name = "John";` Then what's the actual type of the literal `"John"`?

In the chapter [Calling C functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) in the "Interfacing with C" page, the following is said:

> Strings are not 0 terminated in D. See "Data Type Compatibility" for more information about this. However, string literals in D are 0 terminated.

Which is really interesting and makes me suppose that `"John"` is a string literal, right? However, when I'm writing something like the following: `char *name = "John";`, then D will complain with the following message:

> Error: cannot implicitly convert expression `"John"` of type `string` to `char*`

Which is interesting because this works in C. If I use `const char*` instead, it will work. I suppose that this has to do with the fact that `string` is an alias for `immutable(char[])` but still this has to mean that the actual type of a LITERAL string is of type `string` (aka `immutable(char[])`).

Another thing I can do is cast the literal to a `char*` but I'm wondering what's going on under the hood in this case. Is casting executed at compile time or at runtime? So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal?

I hope all that makes sense and that someone can answer, lol
Aug 14 2021
On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
> So when I'm doing something like the following: `string name = "John";` Then what's the actual type of the literal `"John"`?

```d
unittest {
    pragma(msg, typeof("John")); // string
    pragma(msg, is(typeof("John") == immutable(char)[])); // true
}
```

> In the chapter [Calling C functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) in the "Interfacing with C" page, the following is said:
>
> Strings are not 0 terminated in D. See "Data Type Compatibility" for more information about this. However, string literals in D are 0 terminated.

```d
void zerort(string s) {
    assert(s.ptr[s.length] == '\0');
}

unittest {
    zerort("John"); // assertion success
    string s = "Jo";
    s ~= "hn";
    zerort(s); // assertion failure
}
```

If a function takes a string as a runtime parameter, it might not be NUL terminated. This might be more obvious with substrings:

```d
unittest {
    string j = "John";
    string s = j[0..2];
    assert(s == "Jo");
    assert(s.ptr == j.ptr);
    assert(s.ptr[s.length] == 'h'); // it's h-terminated
}
```

> Which is really interesting and makes me suppose that `"John"` is a string literal right? However, when I'm writing something like the following: `char *name = "John";`, then D will complain with the following message:
>
> Error: cannot implicitly convert expression `"John"` of type `string` to `char*`
>
> Which is interesting because this works in C.

Well, kinda:

```c
void mutate(char *s) {
    s[0] = 'X';
}

int main() {
    char *s = "John";
    mutate(s); // segmentation fault
}
```

`char*` is just the wrong type, it suggests mutability where mutability ain't.

> If I use `const char*` instead, it will work. I suppose that this has to do with the fact that `string` is an alias for `immutable(char[])` but still this has to mean that the actual type of a LITERAL string is of type `string` (aka `immutable(char[])`).
>
> Another thing I can do is cast the literal to a `char*` but I'm wondering what's going on under the hood in this case.

The same thing as in C:

```d
void mutate(char *s) {
    s[0] = 'X';
}

void main() {
    char* s = cast(char*) "John";
    mutate(s); // program killed by signal 11
}
```

> Is casting executed at compile time or at runtime?

Compile-time. std.conv.to is what you'd use at runtime. Here though, what you want is `dup` to get a `char[]`, which you can then take the pointer of if you want:

```d
unittest {
    char* s = "John".dup.ptr;
    s[0] = 'X'; // no segfaults
    assert(s[0..4] == "Xohn"); // ok
}
```

> So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal? I hope all that makes sense and the someone can answer, lol
Aug 15 2021
On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
> On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
>> So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal?
>
> ```d
> unittest {
>     char* s = "John".dup.ptr;
>     s[0] = 'X'; // no segfaults
>     assert(s[0..4] == "Xohn"); // ok
> }
> ```

In the above case, "John" is a string that's compiled into the resulting executable and loaded into read-only memory, and when this code is reached that string is duplicated, at runtime, to create a copy in writable memory.
Aug 15 2021
On Sunday, 15 August 2021 at 07:47:27 UTC, jfondren wrote:
> On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
>> On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
>>> So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal?
>>
>> ```d
>> unittest {
>>     char* s = "John".dup.ptr;
>>     s[0] = 'X'; // no segfaults
>>     assert(s[0..4] == "Xohn"); // ok
>> }
>> ```
>
> In the above case, "John" is a string that's compiled into the resulting executable and loaded into read-only memory, and when this code is reached that string is duplicated, at runtime, to create a copy in writable memory.

Probably a more useful way to think about this is to consider what happens in a loop:

```d
void static_lifetime() @nogc {
    foreach (i; 0 .. 100) {
        string s = "John";
        // some code
    }
}
```

^^ At runtime a slice is created on the stack 100 times, with a pointer to the 'J' of the literal, a length of 4, etc. The cost of this doesn't change with the length of the literal, and the bytes of the literal aren't copied, so this code would be just as fast if the string were megabytes in length.

```d
void dynamically_allocated() { // no @nogc
    foreach (i; 0 .. 100) {
        char[] s = "John".dup;
        // some code
    }
}
```

^^ Here, the literal is copied into freshly GC-allocated memory a hundred times, and a slice is made from that.

And for completeness:

```d
void stack_allocated() @nogc {
    foreach (i; 0 .. 100) {
        char[4] raw = "John";
        char[] s = raw[0..$];
        // some code
    }
}
```

^^ Here, a static array is constructed on the stack a hundred times, and the literal is copied into the array, and then a slice is constructed on the stack with a pointer into the array on the stack, a length of 4, etc. This doesn't use the GC but the stack is limited in size and now you have to worry about the slice getting copied elsewhere and outliving the data on the stack:

```d
char[] stack_allocated() @nogc {
    char[] ret;
    foreach (i; 0 .. 100) {
        char[4] raw = "John";
        char[] s = raw[0 .. $];
        ret = s;
    }
    return ret; // errors with -preview=dip1000
}

void main() {
    import std.stdio : writeln;
    char[] s = stack_allocated();
    writeln(s); // prints garbage
}
```
Aug 15 2021
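The difference between the first two loops above can be observed directly. This is a minimal sketch, assuming only that the same literal expression always refers to the same data in the executable; the variable names are made up for illustration.

```d
void main()
{
    // Assigning the literal only builds a (pointer, length) pair.
    const(char)* first;
    foreach (i; 0 .. 3)
    {
        string s = "John";
        if (i == 0)
            first = s.ptr;
        assert(s.ptr == first);   // same literal bytes on every iteration
        assert(s.length == 4);
    }

    // dup copies the bytes, so each call yields a distinct GC allocation.
    char[] a = "John".dup;
    char[] b = "John".dup;
    assert(a == b);               // equal contents
    assert(a.ptr != b.ptr);       // different memory
}
```

Comparing `.ptr` values is only a way to see where the data lives; ordinary code would compare the slices themselves.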
On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
> ```d
> unittest {
>     pragma(msg, typeof("John")); // string
>     pragma(msg, is(typeof("John") == immutable(char)[])); // true
> }
> ```

Still don't know what "pragma" does but thank you.

> ```d
> void zerort(string s) {
>     assert(s.ptr[s.length] == '\0');
> }
>
> unittest {
>     zerort("John"); // assertion success
>     string s = "Jo";
>     s ~= "hn";
>     zerort(s); // assertion failure
> }
> ```
>
> If a function takes a string as a runtime parameter, it might not be NUL terminated. This might be more obvious with substrings:
>
> ```d
> unittest {
>     string j = "John";
>     string s = j[0..2];
>     assert(s == "Jo");
>     assert(s.ptr == j.ptr);
>     assert(s.ptr[s.length] == 'h'); // it's h-terminated
> }
> ```

That's interesting!

> ```c
> void mutate(char *s) {
>     s[0] = 'X';
> }
>
> int main() {
>     char *s = "John";
>     mutate(s); // segmentation fault
> }
> ```
>
> `char*` is just the wrong type, it suggests mutability where mutability ain't.

I mean that in C, we can assign a string literal into a `char*` and also a `const char*` type without getting a compilation error while in D, we can only assign it to a `const char*` type. I suppose that's because of C doing explicit conversion. I didn't talk about mutating a string literal.

> Compile-time. std.conv.to is what you'd use at runtime. Here though, what you want is `dup` to get a `char[]`, which you can then take the pointer of if you want:
>
> ```d
> unittest {
>     char* s = "John".dup.ptr;
>     s[0] = 'X'; // no segfaults
>     assert(s[0..4] == "Xohn"); // ok
> }
> ```

Well, that one didn't work out really well for me. Using `.dup.ptr` didn't add a null terminating character while `cast(char*)` did. So I suppose the first way is better when you want a C-like `char*` and not a D-like `char[]`.
Aug 15 2021
On 15/08/2021 8:11 PM, rempas wrote:
> Still don't know what "pragma" does but thank you.

pragma is a set of commands to the compiler that may be compiler specific. In the case of the msg command, it tells the compiler to output a message during compilation (DMD writes it to stderr).
Aug 15 2021
On Sunday, 15 August 2021 at 08:17:47 UTC, rikki cattermole wrote:
> pragma is a set of commands to the compiler that may be compiler specific. In the case of the msg command, it tells the compiler to output a message during compilation (DMD writes it to stderr).

Thanks man!
Aug 15 2021
On Sunday, 15 August 2021 at 08:11:39 UTC, rempas wrote:
> On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
>> ```d
>> unittest {
>>     char* s = "John".dup.ptr;
>>     s[0] = 'X'; // no segfaults
>>     assert(s[0..4] == "Xohn"); // ok
>> }
>> ```
>
> Well, that one didn't work out really well for me. Using `.dup.ptr` didn't add a null terminating character

dup() isn't aware of the NUL since that's outside the slice of the string. It only copies the chars in "John". You can use toStringz to ensure NUL termination: https://dlang.org/phobos/std_string.html#.toStringz
Aug 15 2021
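To make the point about `dup` concrete, here is a minimal sketch of building a writable, NUL-terminated copy by hand. `toMutableCString` is a hypothetical helper written for this illustration, not a Phobos function; it assumes a GC-allocated buffer is acceptable.

```d
import core.stdc.string : strlen;

// Hypothetical helper (not part of Phobos): returns a mutable,
// NUL-terminated, GC-allocated copy of `s`.
char* toMutableCString(const(char)[] s)
{
    auto buf = new char[s.length + 1]; // one extra slot for the terminator
    buf[0 .. s.length] = s[];          // copy only the chars inside the slice
    buf[s.length] = '\0';              // add the NUL that dup() never sees
    return buf.ptr;
}

void main()
{
    char[] copy = "John".dup;
    assert(copy.length == 4);          // 'J', 'o', 'h', 'n' and nothing else

    char* cs = toMutableCString("John");
    cs[0] = 'X';                       // writable: GC memory, not the literal
    assert(strlen(cs) == 4);           // the terminator is where C expects it
    assert(cs[0 .. 4] == "Xohn");
}
```

When the C side only reads the string, `toStringz` is the simpler choice; the hand-written copy is only useful when a mutable `char*` is really needed.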
On Sunday, 15 August 2021 at 08:47:39 UTC, jfondren wrote:
> dup() isn't aware of the NUL since that's outside the slice of the string. It only copies the chars in "John". You can use toStringz to ensure NUL termination: https://dlang.org/phobos/std_string.html#.toStringz

Is there anything bad about just casting it to `char*` that I should be aware of?
Aug 15 2021
On Sunday, 15 August 2021 at 08:51:19 UTC, rempas wrote:
> On Sunday, 15 August 2021 at 08:47:39 UTC, jfondren wrote:
>> dup() isn't aware of the NUL since that's outside the slice of the string. It only copies the chars in "John". You can use toStringz to ensure NUL termination: https://dlang.org/phobos/std_string.html#.toStringz
>
> Is there anything bad about just casting it to `char*` that I should be aware of?

External C libraries expect strings to be null terminated, so if you do use `.dup`, use `.toStringz` as well.
Aug 15 2021
On Sunday, 15 August 2021 at 08:53:50 UTC, Tejas wrote:
> External C libraries expect strings to be null terminated, so if you do use `.dup`, use `.toStringz` as well.

Yeah, yeah I got that. My question is whether I should avoid `cast(char*)` and use `.toStringz`, given that both do the exact same thing?
Aug 15 2021
On Sunday, 15 August 2021 at 08:56:07 UTC, rempas wrote:
> On Sunday, 15 August 2021 at 08:53:50 UTC, Tejas wrote:
>> External C libraries expect strings to be null terminated, so if you do use `.dup`, use `.toStringz` as well.
>
> Yeah, yeah I got that. My question is whether I should avoid `cast(char*)` and use `.toStringz`, given that both do the exact same thing?

They don't do the same thing. toStringz always copies, always GC-allocates, and always NUL-terminates. `cast(char*)` only does what you want in the case that you're applying it to a string literal. But in that case you shouldn't cast, you should just

```d
const char* s = "John";
```

If you need to cast the const away to work with a C API, doing that separately, at the point of the call to the C function, makes it clearer what you're doing and what the risks are there (does the C function modify the string? If so this will segfault).
Aug 15 2021
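The two cases described above can be sketched as follows, using C's `strlen` as a stand-in for any C function that only reads its argument. This is an illustration under that assumption, not a recommendation for a particular API.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    // A literal converts implicitly to const(char)* and is NUL-terminated,
    // so no cast and no copy are needed.
    const(char)* lit = "John";
    assert(strlen(lit) == 4);

    // A string built at run time has no guaranteed terminator,
    // so ask toStringz for a NUL-terminated pointer instead.
    string s = "Jo";
    s ~= "hn";
    immutable(char)* z = s.toStringz;
    assert(strlen(z) == 4);
    assert(z[4] == '\0');
}
```

Note that `toStringz` returns `immutable(char)*`, so the result is still not something a C function may write to.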
On Sunday, 15 August 2021 at 09:01:17 UTC, jfondren wrote:
> They don't do the same thing. toStringz always copies, always GC-allocates, and always NUL-terminates. `cast(char*)` only does what you want in the case that you're applying it to a string literal. But in that case you shouldn't cast, you should just
>
> ```d
> const char* s = "John";
> ```
>
> If you need to cast the const away to work with a C API, doing that separately, at the point of the call to the C function, makes it clearer what you're doing and what the risks are there (does the C function modify the string? If so this will segfault).

Yeah I won't cast when having a `const char*`. I already mentioned that it works without cast with `const` variables ;)
Aug 15 2021
On Sunday, 15 August 2021 at 08:11:39 UTC, rempas wrote:
> I mean that in C, we can assign a string literal into a `char*` and also a `const char*` type without getting a compilation error while in D, we can only assign it to a `const char*` type. I suppose that's because of C doing explicit conversion. I didn't talk about mutating a string literal.

The D `string` is an alias for `immutable(char)[]`, immutable contents of a mutable array reference (`immutable(char[])` would mean the array reference is also immutable). You don't want to assign that to a `char*`, because then you'd be able to mutate the contents of the string, thereby violating the contract of immutable. (`immutable` means the data to which it's applied, in this case the contents of an array, will not be mutated through any reference anywhere in the program.)

Assigning it to `const(char)*` is fine, because `const` means the data can't be mutated through that particular reference (pointer in this case). And because strings in C are quite frequently represented as `const(char)*`, especially in function parameter lists, D string literals are implicitly convertible to `const(char)*` and also NUL-terminated. So you can do something like `puts("Something")` without worry.

This blog post may be helpful: https://dlang.org/blog/2021/05/24/interfacing-d-with-c-strings-part-one/
Aug 15 2021
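The rules described above can be checked at compile time. The following is a small sketch using `static assert` and `__traits(compiles, ...)`; the variable names are arbitrary and nothing here runs beyond one rebinding.

```d
void main()
{
    static assert(is(string == immutable(char)[]));

    // Rejected: char* would promise write access to immutable data.
    static assert(!__traits(compiles, { char* p = "John"; }));

    // Accepted: neither pointer type allows writing through it.
    static assert(__traits(compiles, { const(char)* p = "John"; }));
    static assert(__traits(compiles, { immutable(char)* p = "John"; }));

    // string is a mutable reference to immutable contents ...
    string s = "John";
    s = "Jane";                                      // rebinding is fine
    static assert(!__traits(compiles, s[0] = 'X'));  // writing is not

    // ... while immutable(char[]) also freezes the reference itself.
    immutable(char[]) fixed = "John";
    static assert(!__traits(compiles, fixed = "Jane"));
}
```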
On Sunday, 15 August 2021 at 09:06:14 UTC, Mike Parker wrote:
> The D `string` is an alias for `immutable(char)[]`, immutable contents of a mutable array reference (`immutable(char[])` would mean the array reference is also immutable). You don't want to assign that to a `char*`, because then you'd be able to mutate the contents of the string, thereby violating the contract of immutable. (`immutable` means the data to which it's applied, in this case the contents of an array, will not be mutated through any reference anywhere in the program.) [...]

Thanks a lot for the info!
Aug 15 2021
Lots of great information and pointers already. I will try from another angle. :)

On 8/14/21 11:10 PM, rempas wrote:
> So when I'm doing something like the following: `string name = "John";` Then what's the actual type of the literal `"John"`?

As you say and as the code shows, there are two constructs in that line. The right-hand side is a string literal. The left-hand side is a 'string'.

> Strings are not 0 terminated in D. See "Data Type Compatibility" for more information about this. However, string literals in D are 0 terminated.

The string literal is embedded into the compiled program as 5 bytes in this case: 'J', 'o', 'h', 'n', '\0'. That's the right-hand side of your code above.

'string' is an array in D and arrays are stored as the following pair:

    size_t length;    // The number of elements
    T * ptr;          // The pointer to the first element

(This is called a "fat pointer".)

So, if we assume that the literal 'John' was placed at memory location 0x1000, then the left-hand side of your code will satisfy the following conditions:

    assert(name.length == 4);                  // <-- NOT 5
    assert(cast(size_t) name.ptr == 0x1000);

The important part to note is that even though the string literal is stored as 5 bytes, the string's length is 4.

As others said, when we add a character to a string, there is no '\0' involved. Only the newly added char will be added.

Functions in D do not need the '\0' sentinel to know where the string ends. The end is already known from the 'length' property.

Ali
Aug 15 2021
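The length/pointer layout described above can be verified with a few assertions. This is a minimal sketch; it reads one element past the slice through `.ptr`, which is only safe here because `name` is known to come straight from a literal.

```d
void main()
{
    string name = "John";

    // A slice is stored as a length plus a pointer.
    static assert(string.sizeof == size_t.sizeof + (char*).sizeof);

    assert(name.length == 4);       // the terminator is not counted
    assert(name.ptr[0] == 'J');
    assert(name.ptr[4] == '\0');    // but it sits right after the literal's data

    // Appending adds only the new character; no '\0' is involved.
    name ~= '!';
    assert(name.length == 5);
    assert(name == "John!");
}
```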
On 8/15/21 2:10 AM, rempas wrote:
> So when I'm doing something like the following: `string name = "John";` Then what's the actual type of the literal `"John"`?
>
> In the chapter [Calling C functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) in the "Interfacing with C" page, the following is said:
>
> Strings are not 0 terminated in D. See "Data Type Compatibility" for more information about this. However, string literals in D are 0 terminated.
>
> [...]

Lots of great responses in this thread!

I wanted to stress that a string literal is sort of magic. It has extra type information inside the compiler that is not available in the normal type system. Namely that "this is a literal, and so can morph into other things".

To give you some examples:

```d
string s = "John";
immutable(char)* cs = s; // nope
immutable(char)* cs2 = "John"; // OK!
wstring ws = s; // nope
wstring ws2 = "John"; // OK!
```

What is going on? Because the compiler knows this is a string *literal*, it can modify the type (and possibly the data itself) at will to match what you are assigning it to. In the case of zero-terminated C strings, it allows usage as a pointer instead of a D array. In the case of different width strings (wstring uses 16-bit code-units), it can actually transform the underlying data to what you wanted.

Note that even when you do lose that "literal" magic by assigning to a variable, you can still rely on D always putting a terminating zero in the data segment for a string literal. So it's valid to just do:

```d
import core.stdc.stdio : printf;

string s = "John";
printf(s.ptr);
```

As long as you *know* the string came from a literal.

-Steve
Aug 15 2021
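The "morphing" of literals into wider string types can be observed through code-unit counts. This is a small sketch; the non-ASCII character is written as the escape `\u00E9` so the example does not depend on the source file's encoding.

```d
void main()
{
    string s = "John";

    // The literal itself adapts to the target type ...
    wstring w = "John";   // stored as UTF-16 code units
    dstring d = "John";   // stored as UTF-32 code units
    assert(w.length == 4 && d.length == 4);
    static assert(wchar.sizeof == 2 && dchar.sizeof == 4);

    // ... but a string variable has lost that flexibility.
    static assert(!__traits(compiles, { wstring w2 = s; }));
    static assert(!__traits(compiles, { immutable(char)* p = s; }));

    // A non-ASCII character shows that the data really is re-encoded.
    string  u8  = "\u00E9";
    wstring u16 = "\u00E9";
    assert(u8.length == 2);   // two UTF-8 code units
    assert(u16.length == 1);  // one UTF-16 code unit
}
```

Converting a `string` variable to another width at run time needs an explicit conversion such as `std.conv.to!wstring`.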