
digitalmars.D.learn - What exactly are the String literals in D and how do they work?

reply rempas <rempas tutanota.com> writes:
So when I'm doing something like the following: `string name = 
"John";`
Then what's the actual type of the literal `"John"`?
In the chapter [Calling C 
functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) in the
"Interfacing with C" page, the following is said:
 Strings are not 0 terminated in D. See "Data Type 
 Compatibility" for more information about this. However, string 
 literals in D are 0 terminated.
Which is really interesting and makes me suppose that `"John"` is a string literal right? However, when I'm writing something like the following: `char *name = "John";`, then D will complain with the following message:
 Error: cannot implicitly convert expression `"John"` of type 
 `string` to `char*`
Which is interesting because this works in C. If I use `const char*` instead, it will work. I suppose that this has to do with the fact that `string` is an alias for `immutable(char[])` but still this has to mean that the actual type of a LITERAL string is of type `string` (aka `immutable(char[])`).

Another thing I can do is cast the literal to a `char*` but I'm wondering what's going on under the hood in this case. Is casting executed at compile time or at runtime? So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal?

I hope all that makes sense and that someone can answer, lol
Aug 14 2021
next sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
 So when I'm doing something like the following: `string name = 
 "John";`
 Then what's the actual type of the literal `"John"`?
```d
unittest {
    pragma(msg, typeof("John"));  // string
    pragma(msg, is(typeof("John") == immutable(char)[]));  // true
}
```
 In the chapter [Calling C 
 functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) in
the "Interfacing with C" page, the following is said:
 Strings are not 0 terminated in D. See "Data Type 
 Compatibility" for more information about this. However, 
 string literals in D are 0 terminated.
```d
void zerort(string s) {
    assert(s.ptr[s.length] == '\0');
}

unittest {
    zerort("John"); // assertion success
    string s = "Jo";
    s ~= "hn";
    zerort(s); // assertion failure
}
```

If a function takes a string as a runtime parameter, it might not be NUL terminated. This might be more obvious with substrings:

```d
unittest {
    string j = "John";
    string s = j[0..2];
    assert(s == "Jo");
    assert(s.ptr == j.ptr);
    assert(s.ptr[s.length] == 'h'); // it's h-terminated
}
```
 Which is really interesting and makes me suppose that `"John"` 
 is a string literal right?
 However, when I'm writing something like the following: `char 
 *name = "John";`,
 then D will complain with the following message:
 Error: cannot implicitly convert expression `"John"` of type 
 `string` to `char*`
Which is interesting because this works in C.
Well, kinda:

```c
void mutate(char *s) {
    s[0] = 'X';
}

int main() {
    char *s = "John";
    mutate(s); // segmentation fault
}
```

`char*` is just the wrong type, it suggests mutability where mutability ain't.
 If I use `const char*` instead, it will work. I suppose that 
 this has to do with the fact that `string` is an alias for 
 `immutable(char[])` but still this has to mean that the actual 
 type of a LITERAL string is of type `string` (aka 
 `immutable(char[])`).

 Another thing I can do is cast the literal to a `char*` but I'm 
 wondering what's going on under the hood in this case.
The same thing as in C:

```d
void mutate(char *s) {
    s[0] = 'X';
}

void main() {
    char* s = cast(char*) "John";
    mutate(s); // program killed by signal 11
}
```
 Is casting executed at compile time or at runtime?
Compile-time. std.conv.to is what you'd use at runtime. Here though, what you want is `dup` to get a `char[]`, which you can then take the pointer of if you want:

```d
unittest {
    char* s = "John".dup.ptr;
    s[0] = 'X'; // no segfaults
    assert(s[0..4] == "Xohn"); // ok
}
```
 So am I going to have an extra runtime cost having to first 
 construct a `string` and then ALSO cast it to a string literal?

 I hope all that makes sense and that someone can answer, lol
Aug 15 2021
next sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
 On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
 ```d
 unittest {
     char* s = "John".dup.ptr;
     s[0] = 'X'; // no segfaults
     assert(s[0..4] == "Xohn"); // ok
 }
 ```

 So am I going to have an extra runtime cost having to first 
 construct a `string` and then ALSO cast it to a string literal?
In the above case, "John" is a string that's compiled into the resulting executable and loaded into read-only memory, and when this code is reached, that string is duplicated, at runtime, to create a copy in writable memory.
Aug 15 2021
parent jfondren <julian.fondren gmail.com> writes:
On Sunday, 15 August 2021 at 07:47:27 UTC, jfondren wrote:
 On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
 On Sunday, 15 August 2021 at 06:10:53 UTC, rempas wrote:
 ```d
 unittest {
     char* s = "John".dup.ptr;
     s[0] = 'X'; // no segfaults
     assert(s[0..4] == "Xohn"); // ok
 }
 ```

 So am I going to have an extra runtime cost having to first 
 construct a `string` and then ALSO cast it to a string 
 literal?
In the above case, "John" is a string that's compiled into the resulting executable and loaded into read-only memory, and when this code is reached, that string is duplicated, at runtime, to create a copy in writable memory.
Probably a more useful way to think about this is to consider what happens in a loop:

```d
void static_lifetime() @nogc {
    foreach (i; 0 .. 100) {
        string s = "John";
        // some code
    }
}
```

^^ At runtime a slice is created on the stack 100 times, with a pointer to the 'J' of the literal, a length of 4, etc. The cost of this doesn't change with the length of the literal, and the bytes of the literal aren't copied, so this code would be just as fast if the string were megabytes in length.

```d
void dynamically_allocated() { // no @nogc
    foreach (i; 0 .. 100) {
        char[] s = "John".dup;
        // some code
    }
}
```

^^ Here, the literal is copied into freshly GC-allocated memory a hundred times, and a slice is made from that.

And for completeness:

```d
void stack_allocated() @nogc {
    foreach (i; 0 .. 100) {
        char[4] raw = "John";
        char[] s = raw[0..$];
        // some code
    }
}
```

^^ Here, a static array is constructed on the stack a hundred times, and the literal is copied into the array, and then a slice is constructed on the stack with a pointer into the array on the stack, a length of 4, etc. This doesn't use the GC, but the stack is limited in size and now you have to worry about the slice getting copied elsewhere and outliving the data on the stack:

```d
char[] stack_allocated() @nogc {
    char[] ret;
    foreach (i; 0 .. 100) {
        char[4] raw = "John";
        char[] s = raw[0 .. $];
        ret = s;
    }
    return ret; // errors with -preview=dip1000
}

void main() {
    import std.stdio : writeln;

    char[] s = stack_allocated();
    writeln(s); // prints garbage
}
```
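
If the slice really does need to outlive the function, one way out is to return a GC-allocated copy instead of a slice of the stack array. This is only a minimal sketch; `heapCopy` is a made-up name for illustration, and the price is one allocation per call:

```d
// Hypothetical helper: copies the stack array to the GC heap before returning,
// so the returned slice does not point into a dead stack frame.
char[] heapCopy() {
    char[4] raw = "John"; // literal bytes copied into a stack array
    return raw[].dup;     // fresh heap copy; safe to return
}

unittest {
    char[] a = heapCopy();
    char[] b = heapCopy();
    assert(a == "John");
    assert(a.ptr !is b.ptr); // every call produces its own allocation
    a[0] = 'X';              // writable, and independent of the other copy
    assert(b == "John");
}
```

Because of the `dup`, this function cannot be marked `@nogc`.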
Aug 15 2021
prev sibling parent reply rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
 ```d
 unittest {
     pragma(msg, typeof("John"));  // string
     pragma(msg, is(typeof("John") == immutable(char)[]));  // 
 true
 }
 ```
Still don't know what "pragma" does but thank you.
 ```d
 void zerort(string s) {
     assert(s.ptr[s.length] == '\0');
 }

 unittest {
     zerort("John"); // assertion success
     string s = "Jo";
     s ~= "hn";
     zerort(s); // assertion failure
 }
 ```

 If a function takes a string as a runtime parameter, it might 
 not be NUL terminated. This might be more obvious with 
 substrings:

 ```d
 unittest {
     string j = "John";
     string s = j[0..2];
     assert(s == "Jo");
     assert(s.ptr == j.ptr);
     assert(s.ptr[s.length] == 'h'); // it's h-terminated
 }
 ```
That's interesting!
 ```c
 void mutate(char *s) {
     s[0] = 'X';
 }

 int main() {
     char *s = "John";
     mutate(s); // segmentation fault
 }
 ```

 `char*` is just the wrong type, it suggests mutability where 
 mutability ain't.
I mean that in C, we can assign a string literal to a `char*` and also to a `const char*` without getting a compilation error, while in D we can only assign it to a `const char*`. I suppose that's because of C doing implicit conversion. I wasn't talking about mutating a string literal.
 Compile-time. std.conv.to is what you'd use at runtime. Here 
 though, what you want is `dup` to get a `char[]`, which you can 
 then take the pointer of if you want:

 ```d
 unittest {
     char* s = "John".dup.ptr;
     s[0] = 'X'; // no segfaults
     assert(s[0..4] == "Xohn"); // ok
 }
 ```
Well, that one didn't work out really well for me. Using `.dup.ptr` didn't add a null terminator, while `cast(char*)` did. So I suppose the first way is better when you want a C-like `char*` and not a D-like `char[]`.
Aug 15 2021
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 15/08/2021 8:11 PM, rempas wrote:
 Still don't know what "pragma" does but thank you.
pragma is a set of commands to the compiler that may be compiler specific. In the case of the msg command, it tells the compiler to output a message to stdout during compilation.
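
A small sketch of `pragma(msg)` next to `static assert`, to show that both are handled by the compiler rather than by the running program. The name `nameLength` is made up for this example:

```d
// Everything below is evaluated by the compiler while building.
pragma(msg, "typeof(\"John\") is ", typeof("John"));

enum nameLength = "John".length; // compile-time constant
pragma(msg, "length is ", nameLength);

// pragma(msg) only prints; static assert stops the build if the condition is false.
static assert(is(typeof("John") == string));
static assert(nameLength == 4);
```

The messages show up in the compiler's output, so they are handy for checking what type an expression has without running anything.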
Aug 15 2021
parent rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 08:17:47 UTC, rikki cattermole wrote:
 pragma is a set of commands to the compiler that may be 
 compiler specific.

 In the case of the msg command, it tells the compiler to output 
 a message to stdout during compilation.
Thanks man!
Aug 15 2021
prev sibling next sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 15 August 2021 at 08:11:39 UTC, rempas wrote:
 On Sunday, 15 August 2021 at 07:43:59 UTC, jfondren wrote:
 ```d
 unittest {
     char* s = "John".dup.ptr;
     s[0] = 'X'; // no segfaults
     assert(s[0..4] == "Xohn"); // ok
 }
 ```
Well, that one didn't work out really well for me. Using `.dup.ptr` didn't add a null terminator
dup() isn't aware of the NUL since that's outside the slice of the string. It only copies the chars in "John". You can use toStringz to ensure NUL termination: https://dlang.org/phobos/std_string.html#.toStringz
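
A minimal sketch of using it with a C function. `cLength` is a hypothetical wrapper written for this example; `toStringz` and `strlen` are the real Phobos and C library functions:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical wrapper: makes sure the data is NUL-terminated before C sees it.
size_t cLength(const(char)[] s) {
    return strlen(s.toStringz);
}

unittest {
    string s = "Jo";
    s ~= "hn";                     // built at runtime, so no NUL can be assumed
    assert(cLength(s) == 4);
    assert(cLength(s[0 .. 2]) == 2); // the slice is followed by 'h', not NUL
}
```

Passing `s[0 .. 2].ptr` straight to `strlen` would have read on into "hn" and beyond.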
Aug 15 2021
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 08:47:39 UTC, jfondren wrote:
 dup() isn't aware of the NUL since that's outside the slice of 
 the string. It only copies the chars in "John". You can use 
 toStringz to ensure NUL termination:
 https://dlang.org/phobos/std_string.html#.toStringz
Is there something bad about just casting it to `char*` that I should be aware of?
Aug 15 2021
parent reply Tejas <notrealemail gmail.com> writes:
On Sunday, 15 August 2021 at 08:51:19 UTC, rempas wrote:
 On Sunday, 15 August 2021 at 08:47:39 UTC, jfondren wrote:
 dup() isn't aware of the NUL since that's outside the slice of 
 the string. It only copies the chars in "John". You can use 
 toStringz to ensure NUL termination:
 https://dlang.org/phobos/std_string.html#.toStringz
Is there something bad about just casting it to `char*` that I should be aware of?
External C libraries expect strings to be null terminated, so if you do use `.dup`, use `.toStringz` as well.
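
Note that `toStringz` hands back an `immutable(char)*`. If what you need is a *mutable*, NUL-terminated buffer, one option is to build it by hand. A rough sketch, where `mutableCString` is a made-up helper and not a Phobos function:

```d
import core.stdc.string : strlen;

// Hypothetical helper: GC-allocates length + 1 chars, copies the slice,
// and writes the terminating NUL explicitly.
char* mutableCString(const(char)[] s) {
    char[] buf = new char[s.length + 1];
    buf[0 .. s.length] = s[];
    buf[s.length] = '\0';
    return buf.ptr;
}

unittest {
    char* p = mutableCString("John");
    p[0] = 'X';                 // fine: this memory is writable
    assert(strlen(p) == 4);     // and it is NUL-terminated
    assert(p[0 .. 4] == "Xohn");
}
```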
Aug 15 2021
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 08:53:50 UTC, Tejas wrote:
 External C libraries expect strings to be null terminated, so 
 if you do use `.dup`, use `.toStringz` as well.
Yeah, yeah I got that. My question is, if I should avoid `cast(char*)` and use `.toStringz` while both do the exact same thing?
Aug 15 2021
parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 15 August 2021 at 08:56:07 UTC, rempas wrote:
 On Sunday, 15 August 2021 at 08:53:50 UTC, Tejas wrote:
 External C libraries expect strings to be null terminated, so 
 if you do use `.dup`, use `.toStringz` as well.
Yeah, yeah I got that. My question is, if I should avoid `cast(char*)` and use `.toStringz` while both do the exact same thing?
They don't do the same thing. toStringz always copies, always GC-allocates, and always NUL-terminates. `cast(char*)` only does what you want in the case that you're applying it to a string literal. But in that case you shouldn't cast, you should just

```d
const char* s = "John";
```

If you need to cast the const away to work with a C API, doing that separately, at the point of the call to the C function, makes it clearer what you're doing and what the risks are there (does the C function modify the string? If so this will segfault).
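
As a sketch of what "cast at the point of the call" can look like: `legacyLength` below is a hypothetical stand-in for a C function whose prototype takes a non-const `char*` even though it only reads. It is written in D here so that the example is self-contained:

```d
// Hypothetical stand-in for a C API with a non-const parameter. It only reads.
extern (C) size_t legacyLength(char* s) {
    size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}

unittest {
    const char* name = "John"; // no cast needed to get here
    // The cast sits right at the call, where a reader can check that
    // the callee really does not write through the pointer.
    assert(legacyLength(cast(char*) name) == 4);
}
```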
Aug 15 2021
parent rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 09:01:17 UTC, jfondren wrote:
 They don't do the same thing. toStringz always copies, always 
 GC-allocates, and always NUL-terminates. `cast(char*)` only 
 does what you want in the case that you're applying it to a 
 string literal. But in that case you shouldn't cast, you should 
 just

 ```d
 const char* s = "John";
 ```

 If you need to cast the const away to work with a C API, 
 doing that separately, at the point of the call to the C 
 function, makes it clearer what you're doing and what the risks 
 are there (does the C function modify the string? If so this 
 will segfault).
Yeah I won't cast when having a `const char*`. I already mentioned that it works without cast with `const` variables ;)
Aug 15 2021
prev sibling parent reply Mike Parker <aldacron gmail.com> writes:
On Sunday, 15 August 2021 at 08:11:39 UTC, rempas wrote:

 I mean that in C, we can assign a string literal to a `char*` 
 and also to a `const char*` without getting a compilation 
 error, while in D we can only assign it to a `const char*`. 
 I suppose that's because of C doing implicit conversion. 
 I wasn't talking about mutating a string literal.
The D `string` is an alias for `immutable(char)[]`, immutable contents of a mutable array reference (`immutable(char[])` would mean the array reference is also immutable). You don't want to assign that to a `char*`, because then you'd be able to mutate the contents of the string, thereby violating the contract of immutable. (`immutable` means the data to which it's applied, in this case the contents of an array, will not be mutated through any reference anywhere in the program.)

Assigning it to `const(char)*` is fine, because `const` means the data can't be mutated through that particular reference (pointer in this case). And because strings in C are quite frequently represented as `const(char)*`, especially in function parameter lists, D string literals are implicitly convertible to `const(char)*` and also NUL-terminated. So you can do something like `puts("Something")` without worry.

This blog post may be helpful: https://dlang.org/blog/2021/05/24/interfacing-d-with-c-strings-part-one/
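
The rules above can be checked at compile time. This is just a sketch; `lengthViaConstPointer` is a made-up name used to exercise the conversion at runtime:

```d
import core.stdc.string : strlen;

// `string` has immutable elements but a rebindable slice.
static assert(is(string == immutable(char)[]));
static assert(!is(string == immutable(char[])));

// A literal converts to const or immutable char pointers, never to a mutable one.
static assert(__traits(compiles, { const(char)* p = "John"; }));
static assert(__traits(compiles, { immutable(char)* p = "John"; }));
static assert(!__traits(compiles, { char* p = "John"; }));

// A string *variable* gets no such conversion, even to const(char)*.
static assert(!__traits(compiles, { string s = "John"; const(char)* p = s; }));

// Hypothetical helper: hands a literal to a C function with no cast at all.
size_t lengthViaConstPointer() {
    const(char)* p = "John";
    return strlen(p); // relies on the literal's terminating NUL
}

unittest {
    assert(lengthViaConstPointer() == 4);
}
```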
Aug 15 2021
parent rempas <rempas tutanota.com> writes:
On Sunday, 15 August 2021 at 09:06:14 UTC, Mike Parker wrote:
 The D `string` is an alias for `immutable(char)[]`, immutable 
 contents of a mutable array reference (`immutable(char[])` 
 would mean the array reference is also immutable). You don't 
 want to assign that to a `char*`, because then you'd be able to 
 mutate the contents of the string, thereby violating the 
 contract of immutable. (`immutable` means the data to which 
 it's applied, in this case the contents of an array, will not 
 be mutated through any reference anywhere in the program.)

 [...]
Thanks a lot for the info!
Aug 15 2021
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
Lots of great information and pointers already. I will try from another 
angle. :)

On 8/14/21 11:10 PM, rempas wrote:

 So when I'm doing something like the following: `string name = "John";`
 Then what's the actual type of the literal `"John"`?
As you say and as the code shows, there are two constructs in that line. The right-hand side is a string literal. The left-hand side is a 'string'.
 Strings are not 0 terminated in D. See "Data Type Compatibility" for
 more information about this. However, string literals in D are 0
 terminated.
The string literal is embedded into the compiled program as 5 bytes in this case: 'J', 'o', 'h', 'n', '\0'. That's the right-hand side of your code above.

'string' is an array in D and arrays are stored as the following pair:

    size_t length;    // The number of elements
    T * ptr;          // The pointer to the first element

(This is called a "fat pointer".)

So, if we assume that the literal 'John' was placed at memory location 0x1000, then the left-hand side of your code will satisfy the following conditions:

    assert(name.length == 4);    // <-- NOT 5
    assert(name.ptr == 0x1000);

The important part to note is that even though the string literal was stored as 5 bytes, the string's length is 4.

As others said, when we add a character to a string, there is no '\0' involved. Only the newly added char will be added.

Functions in D do not need the '\0' sentinel to know where the string ends. The end is already known from the 'length' property.

Ali
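
A runnable sketch of the same points. The address is left out because the real one is not predictable, and peeking at `name.ptr[4]` is only acceptable here because the slice is known to come straight from a literal:

```d
// A slice is exactly the length/pointer pair: two machine words.
static assert(string.sizeof == size_t.sizeof + (void*).sizeof);

unittest {
    string name = "John";
    assert(name.length == 4);    // the NUL is not counted
    assert(name.ptr[4] == '\0'); // but it sits right behind the literal's data

    string longer = name ~ '!';  // appending adds exactly one char
    assert(longer.length == 5);
    assert(longer == "John!");
}
```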
Aug 15 2021
prev sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 8/15/21 2:10 AM, rempas wrote:
 So when I'm doing something like the following: `string name = "John";`
 Then what's the actual type of the literal `"John"`?
 In the chapter [Calling C 
 functions](https://dlang.org/spec/interfaceToC.html#calling_c_functions) 
 in the "Interfacing with C" page, the following is said:
 Strings are not 0 terminated in D. See "Data Type Compatibility" for 
 more information about this. However, string literals in D are 0 
 terminated.
Which is really interesting and makes me suppose that `"John"` is a string literal right? However, when I'm writing something like the following: `char *name = "John";`, then D will complain with the following message:
 Error: cannot implicitly convert expression `"John"` of type `string` 
 to `char*`
Which is interesting because this works in C. If I use `const char*` instead, it will work. I suppose that this has to do with the fact that `string` is an alias for `immutable(char[])` but still this has to mean that the actual type of a LITERAL string is of type `string` (aka `immutable(char[])`). Another thing I can do is cast the literal to a `char*` but I'm wondering what's going on under the hood in this case. Is casting executed at compile time or at runtime? So am I going to have an extra runtime cost having to first construct a `string` and then ALSO cast it to a string literal? I hope all that makes sense and that someone can answer, lol
Lots of great responses in this thread!

I wanted to stress that a string literal is sort of magic. It has extra type information inside the compiler that is not available in the normal type system. Namely that "this is a literal, and so can morph into other things".

To give you some examples:

```d
string s = "John";
immutable(char)* cs = s; // nope
immutable(char)* cs2 = "John"; // OK!
wstring ws = s; // nope
wstring ws2 = "John"; // OK!
```

What is going on? Because the compiler knows this is a string *literal*, it can modify the type (and possibly the data itself) at will to match what you are assigning it to. In the case of zero-terminated C strings, it allows usage as a pointer instead of a D array. In the case of different width strings (wstring uses 16-bit code-units), it can actually transform the underlying data to what you wanted.

Note that even when you do lose that "literal" magic by assigning to a variable, you can still rely on D always putting a terminating zero in the data segment for a string literal. So it's valid to just do:

```d
import core.stdc.stdio : printf;

string s = "John";
printf(s.ptr);
```

As long as you *know* the string came from a literal.

-Steve
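
To make the "transform the underlying data" part visible, here is a small sketch using a non-ASCII character, written as the escape `\u00e9` so the example does not depend on the source file's encoding. The same literal ends up with a different number of code units depending on the target type:

```d
import std.conv : to;

unittest {
    string  s = "\u00e9"; // UTF-8: two code units
    wstring w = "\u00e9"; // re-encoded by the compiler as UTF-16: one code unit
    dstring d = "\u00e9"; // UTF-32: one code unit
    assert(s.length == 2);
    assert(w.length == 1);
    assert(d.length == 1);

    // Once the literal is stored in a variable the magic is gone...
    static assert(!__traits(compiles, { string t = "x"; wstring u = t; }));

    // ...and the conversion has to be requested explicitly, at runtime.
    wstring converted = s.to!wstring;
    assert(converted == w);
}
```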
Aug 15 2021