digitalmars.D.learn - String Type Usage. String vs DString vs WString

Chris P <a b.com> writes:

Hello,

I'm extremely new to D and have a quick question regarding common 
practice when using strings. Is usage of one type over the others 
encouraged? When using 'string' it appears there is a length 
mismatch between the string length and the char array if large 
Unicode characters are used. So I figured I'd ask.

Thanks in advance,

Chris P - Tampa

Jan 14 2018

rikki cattermole <rikki cattermole.co.nz> writes:

On 15/01/2018 2:05 AM, Chris P wrote:
 Hello,
 
 I'm extremely new to D and have a quick question regarding common 
 practice when using strings. Is usage of one type over the others 
 encouraged? When using 'string' it appears there is a length mismatch 
 between the string length and the char array if large Unicode characters 
 are used. So I figured I'd ask.
 
 Thanks in advance,
 
 Chris P - Tampa

D's strings are Unicode.

Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
The size of a code point is 1, 2 or 4 bytes.
But here is the thing, what is displayed (a character) could be multiple 
code points and these can be combined to form a grapheme.

So yes, there will be length mismatches between them :)
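
A minimal sketch of that mismatch, assuming nothing beyond the built-in string types: the same single code point is stored in each type, and `.length` reports the number of code units. The code point U+1F600 is just an illustrative choice, written as an escape so the result does not depend on the source file's encoding.

```d
import std.stdio : writeln;

void main()
{
    // One code point outside the Basic Multilingual Plane.
    string  s = "\U0001F600";  // UTF-8
    wstring w = "\U0001F600"w; // UTF-16
    dstring d = "\U0001F600"d; // UTF-32

    // .length counts code units, not characters.
    writeln(s.length); // 4
    writeln(w.length); // 2
    writeln(d.length); // 1
}
```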

Jan 14 2018

Tony <tonytdominguez aol.com> writes:

On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole 
wrote:

 Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
 The size of a code point is 1, 2 or 4 bytes.

I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 
(UTF-32) bytes are referred to as "code units" and the size of a 
code point varies in UTF-8 and UTF-16.

Jan 14 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
 On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole

 wrote:
 Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
 The size of a code point is 1, 2 or 4 bytes.

 I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
 (UTF-32) bytes are referred to as "code units" and the size of a
 code point varies in UTF-8 and UTF-16.

Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them
(IIRC) in a code point. For UTF-16, a code unit is 16 bits, and there are
either 1 or 2 code units per code point. For UTF-32, a code unit is 32 bits,
and there is always 1 code unit per code point.

For better or worse (mostly worse), ranges then treat all strings as ranges
of code points and decode them to code points such that you get a range of dchar
(which means fun things like isRandomAccessRange!string and hasLength!string
are false). As I understand it, each code point is then something which can
be physically printed, but either way, it's not necessarily a full
character.

Multiple code points can then be combined to make a grapheme cluster (which
then corresponds to what we'd normally consider a full character - e.g. a
letter and an accent can each be a code point which are then combined to
create an accented character). std.uni provides the functionality for
operating on graphemes.
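
As a rough illustration of the three levels (code units, code points, graphemes), here is a small sketch using `std.uni.byGrapheme`. The sample text, an 'e' followed by the combining acute accent U+0301, is just an assumption chosen for the demonstration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): two code points
    // that are displayed as one accented character.
    string s = "e\u0301";

    writeln(s.length);                // 3 code units (1 + 2 in UTF-8)
    writeln(s.walkLength);            // 2 code points (auto-decoded)
    writeln(s.byGrapheme.walkLength); // 1 grapheme cluster
}
```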

And std.utf.byCodeUnit can be used to treat strings as ranges of code units
instead of code points (and a fair bit of Phobos takes the solution of
specializing range-based code for strings to avoid the auto-decoding).
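
A short sketch of the difference, assuming a Phobos version that still auto-decodes: the static asserts show how `string` looks through the range traits, and `byCodeUnit` gives back a range of plain code units with length and indexing.

```d
import std.range.primitives : ElementType, hasLength,
    isRandomAccessRange, walkLength;
import std.utf : byCodeUnit;

void main()
{
    // Seen as a range, string yields dchar and is not random access.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);

    string s = "a\u00E4"; // 'a' plus U+00E4, which takes two UTF-8 code units
    auto units = s.byCodeUnit;

    // byCodeUnit yields the raw code units and keeps length and indexing.
    static assert(is(ElementType!(typeof(units)) : char));
    static assert(hasLength!(typeof(units)));
    static assert(isRandomAccessRange!(typeof(units)));

    assert(units.length == 3);  // code units
    assert(s.walkLength == 2);  // code points, via auto-decoding
    assert(units[1] == 0xC3);   // first byte of the two-byte sequence
}
```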

All in all, the whole thing is annoyingly complicated, though at least D is
much more explicit about it than most languages, and I suspect that your
average D programmer is better educated about Unicode than your average
programmer. And having to figure out why the heck strings and wstrings act
so bizarrely as ranges does have the positive side effect of putting it even
more in your face than it would be otherwise, making it that much more
likely that folks are going to learn about Unicode - though I still think
that we'd be better off if we could ever figure out how to treat all strings
as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis

Jan 14 2018

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Monday, 15 January 2018 at 04:27:15 UTC, Jonathan M Davis 
wrote:
 On Monday, January 15, 2018 03:14:02 Tony via 
 Digitalmars-d-learn wrote:
 On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole

 wrote:
 Unicode has three main variants, UTF-8, UTF-16 and UTF-32. 
 The size of a code point is 1, 2 or 4 bytes.

 I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
 (UTF-32) bytes are referred to as "code units" and the size of 
 a
 code point varies in UTF-8 and UTF-16.

 Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 
 of them (IIRC) in a code point.

Nooooooooooo!!! Only 4 maximum for Unicode. Beyond that it's 
obsolete crap that is not Unicode since version 2 of Unicode.
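
This can be checked with `std.utf`; a minimal sketch, using the highest valid code point U+10FFFF as the worst case:

```d
import std.stdio : writeln;
import std.utf : codeLength, encode;

void main()
{
    // Highest valid Unicode code point.
    dchar last = '\U0010FFFF';

    writeln(codeLength!char(last));  // 4 UTF-8 code units
    writeln(codeLength!wchar(last)); // 2 UTF-16 code units

    // std.utf.encode for UTF-8 writes into a fixed 4-char buffer.
    char[4] buf;
    writeln(encode(buf, last));      // 4
}
```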

Jan 15 2018

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
 Hello,

 I'm extremely new to D and have a quick question regarding 
 common practice when using strings. Is usage of one type over 
 the others encouraged? When using 'string' it appears there is 
 a length mismatch between the string length and the char array 
 if large Unicode characters are used. So I figured I'd ask.

 Thanks in advance,

 Chris P - Tampa

 string == immutable( char)[],  char == utf8
wstring == immutable(wchar)[], wchar == utf16
dstring == immutable(dchar)[], dchar == utf32

Unless you are dealing with Windows, in which case you may need 
to consider using wstring, there is very little reason to use 
anything but string.

N.B. when you iterate over a string there are a number of 
different "flavours" (for want of a better term) you can iterate 
over: bytes, Unicode code points and graphemes (I'm possibly 
forgetting some). Have a look in std.uni and related modules. 
Iteration in Phobos defaults to code points, I think.

TLDR use string.

Jan 14 2018

Chris P <a b.com> writes:

On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
 On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
 [...]

  string == immutable( char)[],  char == utf8
 wstring == immutable(wchar)[], wchar == utf16
 dstring == immutable(dchar)[], dchar == utf32

 Unless you are dealing with Windows, in which case you may need 
 to consider using wstring, there is very little reason to use 
 anything but string.

 N.B. when you iterate over a string there are a number of 
 different "flavours" (for want of a better term) you can 
 iterate over: bytes, Unicode code points and graphemes (I'm 
 possibly forgetting some). Have a look in std.uni and related 
 modules. Iteration in Phobos defaults to code points, I think.

 TLDR use string.

Thank you (and rikki) for replying. Actually, I am using Windows 
(Doh!) but I now understand. Cheers!

Jan 14 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, January 15, 2018 02:22:09 Chris P via Digitalmars-d-learn wrote:
 On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
 On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
 [...]

  string == immutable( char)[],  char == utf8
 wstring == immutable(wchar)[], wchar == utf16
 dstring == immutable(dchar)[], dchar == utf32

 Unless you are dealing with Windows, in which case you may need
 to consider using wstring, there is very little reason to use
 anything but string.

 N.B. when you iterate over a string there are a number of
 different "flavours" (for want of a better term) you can
 iterate over: bytes, Unicode code points and graphemes (I'm
 possibly forgetting some). Have a look in std.uni and related
 modules. Iteration in Phobos defaults to code points, I think.

 TLDR use string.

 Thank you (and rikki) for replying. Actually, I am using Windows
 (Doh!) but I now understand. Cheers!

Even with Windows, there usually isn't any reason to use wstring. The only
reason that wstring might be more desirable on Windows is that you need
UTF-16 when dealing with the Windows API calls, and that's normally only
going to come up if you're not writing platform-independent code. The common
stuff such as file access is already wrapped by Phobos (e.g. in std.file and
std.stdio), so most programs don't need to worry about the Windows API
calls. And even if you do, the best practice generally is to use string
everywhere in your code and then only convert to a zero-terminated wchar*
when making the Windows API calls (either by actually allocating a
zero-terminated wchar* or using a static array with the appropriate wchar
set to 0, depending on the context).

If you have to do a ton with Windows API calls, at some point, it arguably
becomes better to just keep them as wstrings to avoid the conversions, but
even then, because strings in D aren't zero-terminated, and the C API calls
usually require them to be, you're often forced to copy the string to pass
it to a Windows API call anyway, in which case, you lose most of the benefit
of keeping stuff around in wstrings instead of just using strings
everywhere.

If you do need to worry about calling a Windows API function, then check out toUTFz
in std.utf, since it will allow you to easily convert to zero-terminated
strings of any character type (std.string.toStringz handles zero-terminated
strings as well, but just for string).
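
A minimal sketch of those conversions, kept platform-independent by only inspecting the resulting pointers rather than calling any Windows function. The variable names and the sample text are illustrative assumptions.

```d
import std.string : toStringz;
import std.utf : toUTF16z, toUTFz;

void main()
{
    // Keep the text as an ordinary UTF-8 string inside the program.
    string name = "caf\u00E9"; // 4 code points, all in the BMP

    // Zero-terminated UTF-16, the form the wide Windows API functions take.
    const(wchar)* wide = name.toUTF16z;

    // The same conversion through the generic template.
    const(wchar)* wide2 = name.toUTFz!(const(wchar)*);

    // Zero-terminated UTF-8 for ordinary C APIs.
    immutable(char)* narrow = name.toStringz;

    assert(wide[0] == 'c' && wide[3] == 0xE9);
    assert(wide[4] == 0);   // terminator after 4 UTF-16 code units
    assert(wide2[4] == 0);
    assert(narrow[5] == 0); // terminator after 5 UTF-8 code units
}
```

Converting only at the call site like this keeps the rest of the program on string, which is the approach described above.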

- Jonathan M Davis

Jan 14 2018

SimonN <eiderdaus gmail.com> writes:

On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
 Is usage of one type over the others encouraged?

I would use string (UTF-8) throughout the program, but there 
seems to be no style guideline for this. Keep in mind two gotchas:

D's foreach and D's ranges will autodecode and silently iterate 
over dchar, not char, even when the input is string, not dstring. 
(It's also possible to explicitly decode strings, see std.utf and 
std.uni.)

If you call into the Windows API, some functions require extra 
care if everything in your program is UTF-8. But I still agree 
with the approach to keep everything as string in your program, 
and then wrap the Windows API calls, as the UTF-8 Everywhere 
manifesto suggests:
http://utf8everywhere.org/

-- Simon

Jan 14 2018

Adam D. Ruppe <destructionator gmail.com> writes:

On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
 D's foreach [...] will autodecode and silently iterate over 
 dchar, not char, even when the input is string


That's not true. foreach will only decode on demand:

string s;

foreach(c; s) { /* c is a char here, it goes over bytes */ }
foreach(char c; s) { /* c is a char here, same as above */ }
foreach(dchar c; s) { /* c is a dchar - this decodes */ }



Autodecoding is a Phobos library artifact, NOT something in the D 
language itself.
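
A runnable version of the same point, as a small sketch: it counts the iterations of each loop over a string that contains one non-ASCII character (the sample text is an arbitrary choice).

```d
import std.stdio : writeln;

void main()
{
    string s = "a\u00E4"; // 'a' plus U+00E4: 3 UTF-8 code units, 2 code points

    size_t units, points;

    foreach (c; s)
    {
        // With no element type given, c is a single UTF-8 code unit.
        static assert(typeof(c).sizeof == 1);
        ++units;
    }

    // Asking for dchar makes foreach decode.
    foreach (dchar c; s)
        ++points;

    writeln(units);  // 3
    writeln(points); // 2
}
```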

Jan 15 2018

SimonN <eiderdaus gmail.com> writes:

On Monday, 15 January 2018 at 14:44:46 UTC, Adam D. Ruppe wrote:
 On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
 D's foreach [...] will autodecode and silently iterate over 
 dchar, not char, even when the input is string

 That's not true. foreach will only decode on demand:
 foreach(c; s) { /* c is a char here, it goes over bytes */ }

Thanks for the correction! Surprised I got foreach(c; s) wrong, 
its non-decoding iteration is even the prominent example in TDPL.

Even `each`, the template function that implements a foreach, 
still infers as char:

     "aä".each!writeln; // prints a plus two broken characters

Only `map`



When I wrote "D's ranges", I meant Phobos's range-producing 
templates; a range itself is again encoding-agnostic.

Jan 15 2018

Kagamin <spam here.lot> writes:

On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
 D's foreach and D's ranges will autodecode and silently iterate 
 over dchar, not char

foreach doesn't do it silently, decoding must be requested from 
it by explicitly specifying element type, it can also encode this 
way.
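
A small sketch of the encoding direction: iterating a dstring with an explicit char element type re-encodes each code point to UTF-8 on the fly. The sample text is an arbitrary choice.

```d
void main()
{
    dstring d = "a\u00E4"d; // 2 UTF-32 code units

    // Requesting char from a dstring makes foreach encode to UTF-8.
    char[] utf8;
    foreach (char c; d)
        utf8 ~= c;

    assert(utf8.length == 3);  // 'a' is 1 code unit, U+00E4 is 2
    assert(utf8 == "a\u00E4"); // same text, now as UTF-8
}
```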

Jan 15 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, January 15, 2018 14:56:33 Kagamin via Digitalmars-d-learn wrote:
 On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
 D's foreach and D's ranges will autodecode and silently iterate
 over dchar, not char

 foreach doesn't do it silently, decoding must be requested from
 it by explicitly specifying element type, it can also encode this
 way.

Yeah, one of the joys of that is that you have to be careful of using
foreach in range-based functions, because if you haven't specialized for
strings, and you use foreach without specifying the element type, then when
your function template is instantiated with a string, the foreach won't
match what front does.
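
To make that concrete, here is a sketch with two hypothetical helper templates (countWithForeach and countWithFront are names invented for this example, not Phobos functions). Both look equivalent for ordinary ranges, but they disagree when instantiated with a string.

```d
import std.range.primitives : empty, isInputRange, popFront;

// Hypothetical helper: counts elements with an untyped foreach.
size_t countWithForeach(R)(R r)
if (isInputRange!R)
{
    size_t n;
    foreach (e; r) // for a string this iterates code units
        ++n;
    return n;
}

// Hypothetical helper: counts elements through the range primitives.
size_t countWithFront(R)(R r)
if (isInputRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront()) // for a string this decodes code points
        ++n;
    return n;
}

void main()
{
    string s = "a\u00E4"; // 3 code units, 2 code points

    assert(countWithForeach(s) == 3);
    assert(countWithFront(s) == 2);

    // For a non-string range the two agree.
    assert(countWithForeach([1, 2, 3]) == countWithFront([1, 2, 3]));
}
```

One way around the mismatch is to give the foreach an explicit element type, e.g. `foreach (ElementType!R e; r)`, so that it decodes the same way front does.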

I really don't have any complaints about how foreach does this aside from
the fact that it doesn't currently use the replacement character (so, it
will throw on invalid Unicode if it's told to decode), but the way that
interacts with Phobos is poor.

Ideally, we'd get rid of auto-decoding, and we'd get rid of the whole
exception on bad Unicode thing and just use the replacement character, but
since changing it would break a lot of code... :|
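
A sketch of the two behaviours, under the assumption that the runtime behaves as described above. The invalid input is constructed by hand from raw bytes, and the example catches the general Exception type rather than naming a specific exception class. `std.utf.byDchar` is shown as an existing way to decode with the replacement character instead of throwing.

```d
import std.algorithm.comparison : equal;
import std.utf : byDchar;

void main()
{
    // 'a' followed by 0xFF, a byte that never appears in valid UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF];
    string bad = cast(string) raw;

    // A decoding foreach throws when it reaches the invalid byte.
    bool threw = false;
    try
    {
        foreach (dchar c; bad) {}
    }
    catch (Exception e)
    {
        threw = true;
    }
    assert(threw);

    // byDchar substitutes U+FFFD for the invalid sequence instead.
    assert(bad.byDchar.equal("a\uFFFD"d));
}
```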

- Jonathan M Davis

Jan 15 2018
