digitalmars.D.learn - String Type Usage. String vs DString vs WString

Chris P <a b.com> writes:
Hello,

I'm extremely new to D and have a quick question regarding common 
practice when using strings. Is usage of one type over the others 
encouraged? When using 'string' it appears there is a length 
mismatch between the string length and the char array if large 
Unicode characters are used. So I figured I'd ask.

Thanks in advance,

Chris P - Tampa
Jan 14 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 15/01/2018 2:05 AM, Chris P wrote:
> Hello,
>
> I'm extremely new to D and have a quick question regarding common
> practice when using strings. Is usage of one type over the others
> encouraged? When using 'string' it appears there is a length mismatch
> between the string length and the char array if large Unicode characters
> are used. So I figured I'd ask.
>
> Thanks in advance,
>
> Chris P - Tampa
D's strings are Unicode. Unicode has three main variants: UTF-8, UTF-16 and UTF-32. The size of a code point is 1, 2 or 4 bytes. But here is the thing: what is displayed (a character) could be multiple code points, and these can be combined to form a grapheme. So yes, there will be length mismatches between them :)
Jan 14 2018
Tony <tonytdominguez aol.com> writes:
On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> The size of a code point is 1, 2 or 4 bytes.
I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 (UTF-32) bytes are referred to as "code units" and the size of a code point varies in UTF-8 and UTF-16.
Jan 14 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > The size of a code point is 1, 2 or 4 bytes.
>
> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 (UTF-32) bytes are referred to as "code units" and the size of a code point varies in UTF-8 and UTF-16.
Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them (IIRC) in a code point. For UTF-16, a code unit is 16 bits, and there are either 1 or 2 code units per code point. For UTF-32, a code unit is 32 bits, and there is always 1 code unit per code point.

For better or worse (mostly worse), ranges then treat all strings as ranges of code points and decode them to code points such that you get a range of dchar (which means fun things like isRandomAccessRange!string and hasLength!string are false). As I understand it, each code point is then something which can be physically printed, but either way, it's not necessarily a full character. Multiple code points can then be combined to make a grapheme cluster (which then corresponds to what we'd normally consider a full character - e.g. a letter and an accent can each be a code point which are then combined to create an accented character).

std.uni provides the functionality for operating on graphemes. And std.utf.byCodeUnit can be used to treat strings as ranges of code units instead of code points (and a fair bit of Phobos takes the solution of specializing range-based code for strings to avoid the auto-decoding).

All in all, the whole thing is annoyingly complicated, though at least D is much more explicit about it than most languages, and I suspect that your average D programmer is better educated about Unicode than your average programmer. And having to figure out why the heck strings and wstrings act so bizarrely as ranges does have the positive side effect of putting it even more in your face than it would be otherwise, making it that much more likely that folks are going to learn about Unicode - though I still think that we'd be better off if we could ever figure out how to treat all strings as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis
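
The three "lengths" described above can be compared side by side. What follows is only a minimal sketch: the `Counts` struct and the `countAll` helper are hypothetical names made up for illustration, and the sample text is 'e' followed by a combining acute accent (U+0301), which should display as a single accented character.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

/// Hypothetical helper type: the three ways of counting one UTF-8 string.
struct Counts
{
    size_t codeUnits;   // what .length reports for a string
    size_t codePoints;  // what auto-decoding ranges iterate over
    size_t graphemes;   // what a reader would call "characters"
}

Counts countAll(string s)
{
    return Counts(
        s.byCodeUnit.length,     // no decoding, same as s.length
        s.walkLength,            // decodes to dchar while walking
        s.byGrapheme.walkLength  // groups code points into grapheme clusters
    );
}

void main()
{
    // 'e' (1 code unit) + U+0301 (2 code units in UTF-8)
    string s = "e\u0301";
    assert(s.length == 3);
    assert(countAll(s) == Counts(3, 2, 1));
}
```

The point of the sketch is that none of the three numbers is wrong; they just answer different questions, so it helps to pick the one you mean explicitly.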
Jan 14 2018
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Monday, 15 January 2018 at 04:27:15 UTC, Jonathan M Davis wrote:
> On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> > On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> > > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > > The size of a code point is 1, 2 or 4 bytes.
> >
> > I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 (UTF-32) bytes are referred to as "code units" and the size of a code point varies in UTF-8 and UTF-16.
>
> Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them (IIRC) in a code point.
Nooooooooooo!!! Only 4 maximum for Unicode. Beyond that it's obsolete crap that is not Unicode since version 2 of Unicode.
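
The 4-code-unit ceiling can be checked against Phobos itself. This is a small sketch, assuming only `std.utf.codeLength` and `std.utf.encode`; the wrapper name `utf8Units` is hypothetical. Note that `encode` for UTF-8 writes into a `char[4]` buffer, which matches the limit stated above.

```d
import std.utf : codeLength, encode;

/// Hypothetical wrapper: number of UTF-8 code units needed for one code point.
size_t utf8Units(dchar c)
{
    return codeLength!char(c);
}

void main()
{
    assert(utf8Units('a') == 1);           // U+0061, ASCII
    assert(utf8Units('\u00E9') == 2);      // U+00E9, Latin small e with acute
    assert(utf8Units('\u20AC') == 3);      // U+20AC, euro sign
    assert(utf8Units('\U0001F600') == 4);  // U+1F600, outside the BMP
    assert(utf8Units('\U0010FFFF') == 4);  // highest valid code point

    // The UTF-8 overload of encode takes a fixed buffer of four code units.
    char[4] buf;
    assert(encode(buf, '\U0001F600') == 4);
}
```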
Jan 15 2018
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Hello,
>
> I'm extremely new to D and have a quick question regarding
> common practice when using strings. Is usage of one type over
> the others encouraged? When using 'string' it appears there is
> a length mismatch between the string length and the char array
> if large Unicode characters are used. So I figured I'd ask.
>
> Thanks in advance,
>
> Chris P - Tampa
string  == immutable( char)[],  char == UTF-8
wstring == immutable(wchar)[], wchar == UTF-16
dstring == immutable(dchar)[], dchar == UTF-32

Unless you are dealing with Windows, in which case you may need to consider using wstring, there is very little reason to use anything but string.

N.B. when you iterate over a string there are a number of different "flavours" (for want of a better term) you can iterate over: bytes, Unicode code points and graphemes (I'm possibly forgetting some). Have a look in std.uni and related modules. Iteration in Phobos defaults to code points, I think.

TL;DR: use string.
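
For what it's worth, the alias definitions can be checked with static asserts, and std.conv.to converts between the three types. The following is just a sketch; the `lengths` helper is a hypothetical name, and it reports `.length` (a count of code units) for the same text in each encoding.

```d
import std.conv : to;

// The three string types are plain aliases for immutable character arrays.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

/// Hypothetical helper: code-unit counts as UTF-8, UTF-16 and UTF-32.
size_t[3] lengths(string s)
{
    return [s.length, s.to!wstring.length, s.to!dstring.length];
}

void main()
{
    // U+00E9 needs 2 code units in UTF-8 but only 1 in UTF-16 and UTF-32.
    assert(lengths("h\u00E9llo") == [6, 5, 5]);

    // U+1F600 needs 4 code units in UTF-8, a surrogate pair in UTF-16,
    // and a single code unit in UTF-32.
    assert(lengths("\U0001F600") == [4, 2, 1]);
}
```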
Jan 14 2018
Chris P <a b.com> writes:
On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> > [...]
> [...]
Thank you (and rikki) for replying. Actually, I am using Windows (Doh!) but I now understand. Cheers!
Jan 14 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 15, 2018 02:22:09 Chris P via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> > On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> > > [...]
> > [...]
> Thank you (and rikki) for replying. Actually, I am using Windows (Doh!) but I now understand. Cheers!
Even with Windows, there usually isn't any reason to use wstring. The only reason that wstring might be more desirable on Windows is that you need UTF-16 when dealing with the Windows API calls, and that's normally only going to come up if you're not writing platform-independent code. The common stuff such as file access is already wrapped by Phobos (e.g. in std.file and std.stdio), so most programs don't need to worry about the Windows API calls.

And even if you do, the best practice generally is to use string everywhere in your code and then only convert to a zero-terminated wchar* when making the Windows API calls (either by actually allocating a zero-terminated wchar* or by using a static array with the appropriate wchar set to 0, depending on the context). If you have to do a ton with Windows API calls, at some point it arguably becomes better to just keep them as wstrings to avoid the conversions, but even then, because strings in D aren't zero-terminated, and the C API calls usually require them to be, you're often forced to copy the string to pass it to a Windows API call anyway, in which case you lose most of the benefit of keeping stuff around in wstrings instead of just using strings everywhere.

If you do need to worry about calling a Windows API function, then check out toUTFz in std.utf, since it will allow you to easily convert to zero-terminated strings of any character type (std.string.toStringz handles zero-terminated strings as well, but just for string).

- Jonathan M Davis
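
A rough sketch of the "string everywhere, convert at the boundary" approach, assuming only `std.utf.toUTFz` and `std.string.toStringz`. The helper name `toWideZ` is hypothetical, and no actual Windows function is called here so that the example stays platform-independent; the pointer it returns is the kind of argument a wide-character C API would expect.

```d
import std.string : toStringz;
import std.utf : toUTFz;

/// Hypothetical helper: zero-terminated UTF-16 copy of a UTF-8 D string.
const(wchar)* toWideZ(string s)
{
    return s.toUTFz!(const(wchar)*);
}

void main()
{
    string name = "hi";  // kept as UTF-8 everywhere else in the program

    // Convert only at the point where a wide-character C API would be called.
    const(wchar)* wide = toWideZ(name);
    assert(wide[0] == 'h' && wide[1] == 'i' && wide[2] == 0);

    // For char-based C APIs, toStringz does the same job for string.
    auto narrow = name.toStringz;
    assert(narrow[0] == 'h' && narrow[1] == 'i' && narrow[2] == 0);
}
```

Since the conversion may allocate a copy, it is probably best done right before the call rather than cached throughout the program.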
Jan 14 2018
SimonN <eiderdaus gmail.com> writes:
On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Is usage of one type over the others encouraged?
I would use string (UTF-8) throughout the program, but there seems to be no style guideline for this. Keep in mind two gotchas:

- D's foreach and D's ranges will autodecode and silently iterate over dchar, not char, even when the input is string, not dstring. (It's also possible to explicitly decode strings, see std.utf and std.uni.)
- If you call into the Windows API, some functions require extra care if everything in your program is UTF-8. But I still agree with the approach to keep everything as string in your program, and then wrap the Windows API calls, as the UTF-8 Everywhere manifesto suggests: http://utf8everywhere.org/

-- Simon
Jan 14 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> D's foreach [...] will autodecode and silently iterate over
> dchar, not char, even when the input is string
That's not true. foreach will only decode on demand:

    string s;
    foreach(c; s) { /* c is a char here, it goes over bytes */ }
    foreach(char c; s) { /* c is a char here, same as above */ }
    foreach(dchar c; s) { /* c is a dchar - this decodes */ }

Autodecoding is a Phobos library artifact, NOT something in the D language itself.
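
The difference shows up as soon as the iterations are counted. A minimal sketch, with `countAs` as a hypothetical helper name; the sample string is 'a' followed by U+01A4, which takes two UTF-8 code units.

```d
/// Hypothetical helper: number of foreach iterations over a string
/// when the loop variable is declared with element type T.
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    string s = "a\u01A4";

    assert(countAs!char(s) == 3);   // one iteration per UTF-8 code unit
    assert(countAs!dchar(s) == 2);  // decoding requested: one per code point

    // With no element type given, the loop variable is char-sized,
    // so no decoding takes place.
    foreach (c; s)
        static assert(c.sizeof == char.sizeof);
}
```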
Jan 15 2018
SimonN <eiderdaus gmail.com> writes:
On Monday, 15 January 2018 at 14:44:46 UTC, Adam D. Ruppe wrote:
> On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> > D's foreach [...] will autodecode and silently iterate over
> > dchar, not char, even when the input is string
>
> That's not true. foreach will only decode on demand:
>
>     foreach(c; s) { /* c is a char here, it goes over bytes */ }
Thanks for the correction! Surprised I got foreach(c; s) wrong, its non-decoding iteration is even the prominent example in TDPL.

Even `each`, the template function that implements a foreach, still infers as char:

    "aƤ".each!writeln; // prints a plus two broken characters

Only `map`

When I wrote "D's ranges", I meant Phobos's range-producing templates; a range itself is again encoding-agnostic.
Jan 15 2018
Kagamin <spam here.lot> writes:
On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> D's foreach and D's ranges will autodecode and silently iterate
> over dchar, not char
foreach doesn't do it silently; decoding must be requested from it by explicitly specifying the element type. It can also encode this way.
Jan 15 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 15, 2018 14:56:33 Kagamin via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> > D's foreach and D's ranges will autodecode and silently iterate
> > over dchar, not char
>
> foreach doesn't do it silently, decoding must be requested from it by explicitly specifying element type, it can also encode this way.
Yeah, one of the joys of that is that you have to be careful of using foreach in range-based functions, because if you haven't specialized for strings, and you use foreach without specifying the element type, then when your function template is instantiated with a string, the foreach won't match what front does.

I really don't have any complaints about how foreach does this aside from the fact that it doesn't currently use the replacement character (so, it will throw on invalid Unicode if it's told to decode), but the way that interacts with Phobos is poor. Ideally, we'd get rid of auto-decoding, and we'd get rid of the whole exception on bad Unicode thing and just use the replacement character, but since changing it would break a lot of code... :|

- Jonathan M Davis
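
The mismatch between foreach and front can be sketched with three small range-based templates. All three function names are hypothetical, and the example assumes a Phobos version in which auto-decoding is still in place, i.e. `ElementType!string` is `dchar`.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

// With auto-decoding, the range primitives present a string as dchar.
static assert(is(ElementType!string == dchar));

/// Counts elements with a plain foreach: for a string this walks code units.
size_t countWithForeach(R)(R r) if (isInputRange!R)
{
    size_t n;
    foreach (e; r)
        ++n;
    return n;
}

/// Counts elements with empty/popFront: for a string this walks code points.
size_t countWithPrimitives(R)(R r) if (isInputRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
        ++n;
    return n;
}

/// One possible fix: name the element type so foreach agrees with front.
size_t countWithTypedForeach(R)(R r) if (isInputRange!R)
{
    size_t n;
    foreach (ElementType!R e; r)
        ++n;
    return n;
}

void main()
{
    string s = "a\u01A4";  // 3 UTF-8 code units, 2 code points

    assert(countWithForeach(s) == 3);       // disagrees with front
    assert(countWithPrimitives(s) == 2);
    assert(countWithTypedForeach(s) == 2);  // consistent again
}
```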
Jan 15 2018