
digitalmars.D - Why is char initialized to 0xFF ?

James Blachly <james.blachly gmail.com> writes:
Disclaimer: I am not a unicode expert.

Background: I have added UTF8 character type support to lldb in 
conjunction with adding support for D string/wstring/dstring.

Dlang char is analogous to C++20 char8_t[1] AFAICT.

The default initialization value in C++20 is u8'\0', whereas in D 
char.init is '\xFF'[2]. Likewise, wchar.init is 0xFFFF and dchar.init is 
0x0000FFFF.

char is a UTF8 character, but 0xFF is specifically forbidden[3] by the 
UTF8 specification.
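
As a minimal sketch of the values in question (assuming a recent D compiler with Phobos; `isValidUtf8` is a hypothetical helper name, not a library function), the following checks the `.init` values and shows that a lone 0xFF code unit is rejected by `std.utf.validate`:

```d
import std.utf : validate, UTFException;

// Hypothetical helper (not part of Phobos): true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // The .init values quoted from the language spec.
    static assert(char.init == 0xFF);
    static assert(wchar.init == 0xFFFF);
    static assert(dchar.init == 0x0000FFFF);

    char c; // default-initialized, so it holds char.init
    assert(c == 0xFF);

    assert(!isValidUtf8([c])); // 0xFF never appears in well-formed UTF-8
    assert(isValidUtf8("\0")); // the null character is valid UTF-8
}
```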

What is the reasoning behind this? Is it related to zero-termination of 
C strings? Should it be considered for change?

It is surprising that these do not init to the null value, which is 
valid UTF.

Kind regards
James


[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html
[2] https://dlang.org/spec/type.html
[3] https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
Jun 08 2019
Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 8 June 2019 at 17:55:07 UTC, James Blachly wrote:
 char is a UTF8 character, but 0xFF is specifically forbidden[3] 
 by the UTF8 specification.
And that is exactly why it is the default: the idea here is to make uninitialized variables obvious, because they will hold a predictable but invalid value when they appear. Same reason why floats are NaN and classes are null, btw. `int` is the exception, being default-initialized to something that happens to be really useful. (And arrays are kinda special too: technically they are null, but the runtime will automatically allocate null arrays when needed, so it works transparently anyway... and ends up being super useful.)
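
A small sketch of those defaults, assuming a standard D toolchain (the `Widget` class is hypothetical and exists only to show the null default):

```d
import std.math : isNaN;

class Widget {} // hypothetical class, only here to show the null default

void main()
{
    float f;
    double d;
    Widget w;
    int i;
    int[] arr;

    assert(isNaN(f) && isNaN(d)); // floating point defaults to NaN
    assert(w is null);            // class references default to null
    assert(i == 0);               // int defaults to 0
    assert(arr is null);          // a dynamic array starts out null...

    arr ~= 42;                    // ...but appending allocates transparently
    assert(arr == [42]);
}
```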
Jun 08 2019
Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On Saturday, 8 June 2019 at 18:04:46 UTC, Adam D. Ruppe wrote:
 On Saturday, 8 June 2019 at 17:55:07 UTC, James Blachly wrote:
 char is a UTF8 character, but 0xFF is specifically 
 forbidden[3] by the UTF8 specification.
And that is exactly why it is the default: the idea here is to make uninitialized variables obvious, because they will be a predictable, but invalid value when they appear.
To me they are not really obvious or useful, especially when I interface with C/C++. I pass some default-initialized char or float to a C/C++ library (by mistake), and I get some weird output written in some distant data field. The end result is either broken data somewhere down the line, or garbled output in the UI.

I much prefer default values which are correct for 99% of the intended use-cases. I make full use of the fact integers default-initialize to zero, I think it's a great "feature". If there was a NaN for integers, I'd probably hate it. I would prefer it if the compiler (or a tool!) had a switch --check-use-before-initialize or something of the sort, with code-flow analysis and all that good stuff.
Jun 08 2019
KnightMare <black80 bk.ru> writes:
On Saturday, 8 June 2019 at 18:04:46 UTC, Adam D. Ruppe wrote:
 On Saturday, 8 June 2019 at 17:55:07 UTC, James Blachly wrote:
 char is a UTF8 character, but 0xFF is specifically 
 forbidden[3] by the UTF8 specification.
And that is exactly why it is the default: the idea here is to make uninitialized variables obvious, because they will be a predictable, but invalid value when they appear.
double d;
most compilers fire the error "using uninitialized variable".
on the other side: "I (the D compiler) will tell u nothing about that, but u'll get shit! haha"

ok. let's see structs now
struct S { double d; }
S s;
in most compilers s will contain zeros. in C/C++ - garbage.
people don't come to D as a first language, they have already had trouble with garbage in structs, and they still forget to initialize it right (I do), so the rule "all initialization is zeros" is the best and right thing that can be.
if u don't initialize, use "= void" - that is good too.
but initializing ints as 0, ptrs as null, chars as #FF, doubles as NaN - that was invented under mushrooms.
people come to D and see char=#ff, double=NaN
https://www.youtube.com/watch?v=Qsa41csyNU8
Jun 09 2019
KnightMare <black80 bk.ru> writes:
On Sunday, 9 June 2019 at 07:48:46 UTC, KnightMare wrote:
 double d;
 most compilers fire error "using uninitialized variable".
not exactly in this line, but when we try to read from it first like "d += ..."
Jun 09 2019
Mike Parker <aldacron gmail.com> writes:
On Sunday, 9 June 2019 at 07:48:46 UTC, KnightMare wrote:

 ok. lets see structs now
 struct S { double d; }
 S s;
You can set the default initializer in this case: struct S { double d = 0.0; }
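
A minimal sketch of the difference (the struct names `Raw` and `Zeroed` are made up for illustration):

```d
import std.math : isNaN;

struct Raw    { double d; }       // field keeps double.init, which is NaN
struct Zeroed { double d = 0.0; } // explicit default initializer

void main()
{
    Raw r;
    Zeroed z;

    assert(isNaN(r.d));           // no initializer given: NaN
    assert(z.d == 0.0);           // the per-field default is used
    assert(Zeroed.init.d == 0.0); // and it becomes part of the type's .init
}
```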
 but initialize ints as 0, ptrs as null, chars as #FF, doubles 
 as NaN - is was invented under mushrooms
Not at all. It's quite practical for debugging. Uninitialized variables are a pain in C and C++. Default initializing to invalid values makes them stand out in the debugger. The drawback is that the integrals (and bool) have no invalid value, so we're stuck with 0 (and false).
Jun 09 2019
KnightMare <black80 bk.ru> writes:
On Sunday, 9 June 2019 at 08:26:45 UTC, Mike Parker wrote:

 Not at all. It's quite practical for debugging. Uninitialized 
 variables are a pain in C and C++. Default initializing to 
 invalid values makes them stand out in the debugger. The 
 drawback is that the integrals (and bool) have no invalid 
 value, so we're stuck with 0 (and false).
I agree that memory must be initialized unless otherwise stated. I disagree that the garbage (uninit value) should be FF and NaN. again, "all zeroes" is the best and right thing. people are the main resource, they have expectations, they expect zeroes. u can poll them: "what values should be used for uninitialized vars?" and if u think about it, what will they answer? any man on the street. no, in an IT park.

imo, coz nobody relies on FF and NaN in D code now (so, the default is FF, so I just do "ch += 1" and I've got 00! I am a cool hacker!), we can change it to the most expected values (I think that is zero). In any case we can do a poll among D users for a beginning. or let's set up a tagline for D: "We have our own way, don't boomboom our brain!". joke. maybe a little bit trolled.
Jun 09 2019
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 9 June 2019 at 07:48:46 UTC, KnightMare wrote:
 On Saturday, 8 June 2019 at 18:04:46 UTC, Adam D. Ruppe wrote:
 On Saturday, 8 June 2019 at 17:55:07 UTC, James Blachly wrote:
 char is a UTF8 character, but 0xFF is specifically 
 forbidden[3] by the UTF8 specification.
And that is exactly why it is the default: the idea here is to make uninitialized variables obvious, because they will be a predictable, but invalid value when they appear.
double d; most compilers fire error "using uninitialized variable".
Which is technically not possible in D because D always initializes variables. In C and C++, if you declared double d=0.0; you wouldn't get the "using uninitialized variable" warning either, independently of whether 0 is the right or the wrong init value.
 another side "I(D compiler) will tell u nothing for that, but 
 u'll get a shit! haha"

 ok. lets see structs now
 struct S { double d; }
 S s;
 in most compilers s will contains zeros. in C/C++ - garbage.
 men comes to D not as first language, they has troubles with 
 garbage in structs already, and they still forget initialize it 
 right (I do), so rule "all initialization is zeros" is the best 
 and right thing that can be.
No, by putting NaN in d you have a deterministic error. In C and C++ you will have undefined behaviour that will vary with compiler, version, options, OS version, architecture, position of the moon, etc., and sometimes undetectable bugs.
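
To illustrate the deterministic part, here is a sketch with a hypothetical `sumAll` function whose author forgot to write `double total = 0.0;`:

```d
import std.math : isNaN;

// Hypothetical example: the accumulator was never given a starting value.
double sumAll(const double[] xs)
{
    double total;   // default-initialized to NaN
    foreach (x; xs)
        total += x; // NaN + anything stays NaN
    return total;
}

void main()
{
    // The result is NaN on every run, so the mistake shows up reproducibly
    // instead of depending on whatever happened to be in memory.
    assert(isNaN(sumAll([1.0, 2.0, 3.0])));
}
```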
 if u dont initialize use "= void" - is good too.
 but initialize ints as 0, ptrs as null, chars as #FF, doubles 
 as NaN - is was invented under mushrooms
No. If there were an equivalent of NaN for ints it would also be used (personally, I really would prefer int.init == int.min and uint.init == uint.max). Default initialisation of variables is here to give deterministic behaviour between versions and runs, i.e. to get rid of nasal demons, not to mind-read the appropriate initial value of a variable; that is something the programmer still has the responsibility for.
 men comes to D and see char=#ff,double=NaN
 https://www.youtube.com/watch?v=Qsa41csyNU8
Jun 09 2019
KnightMare <black80 bk.ru> writes:
On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:

I read the bible too. I know the reasons why the leaders decided to use NaN 
and FF.
but
what is the best solution:
do some math and get garbage in C++ or NaN in D?
or the compiler tells you "using uninitialized variable" before any 
math?
Jun 09 2019
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Sunday, June 9, 2019 3:27:39 AM MDT KnightMare via Digitalmars-d wrote:
 On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:

 I read the bible too. I know reasons why leaders decided use NaN
 and FF.
 but
 what is the best solution:
 do some math and get garbage in C++ or NaN in D?
 or compiler will tell "using unitialized variable" before any
 math?
Given how init works in D and how it's used all over the place, it really isn't feasible to have the compiler tell you to initialize the variable. A prime example would be with dynamic arrays. Mutating the length of a dynamic array has to use the init value. e.g.

arr.length += 15;

wouldn't work if init weren't used. Another example would be out parameters. They get assigned the init value for the type when the function is called.

A lot of aspects of D are built around the fact that every type has an init value and the fact that values of that type are always initialized to that init value if they're not given an explicit value. At some point during the language's development, the ability to disable the default initialization of a type was added, but even those types still have an init value. And while it works, disabling default initialization causes all kinds of subtle problems precisely because D was built around the idea that every type could be default-initialized.

Sure, there are some downsides to D's approach (such as getting unexpected NaNs or not having default constructors for structs), but it also solves a whole class of problems that other languages like C and C++ have with garbage values. Even Java has problems with garbage values in spite of it requiring that you initialize variables before using them (e.g. it's quite possible to use a static variable in Java before it's initialized because of circular reference issues). D, on the other hand, never has garbage values unless you use @system features like = void to force it.

- Jonathan M Davis
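
A rough sketch of the three cases mentioned above (growing an array, out parameters, and disabled default initialization). The names `leaveUntouched` and `NoDefault` are hypothetical and used only for illustration:

```d
import std.math : isNaN;

// Hypothetical function: an `out` parameter is reset to its type's .init on entry.
void leaveUntouched(out int x) {}

// Hypothetical type with default construction disabled.
struct NoDefault
{
    int x;
    @disable this();
    this(int x) { this.x = x; }
}

void main()
{
    double[] arr;
    arr.length += 15; // the new elements are filled with double.init
    assert(arr.length == 15);
    foreach (e; arr)
        assert(isNaN(e));

    int v = 42;
    leaveUntouched(v);
    assert(v == 0); // overwritten with int.init even though the body is empty

    // `NoDefault n;` is rejected at compile time...
    static assert(!__traits(compiles, { NoDefault n; }));
    // ...yet the type still has an init value.
    assert(NoDefault.init.x == 0);

    auto ok = NoDefault(7); // explicit construction is still allowed
    assert(ok.x == 7);
}
```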
Jun 09 2019
KnightMare <black80 bk.ru> writes:
On Sunday, 9 June 2019 at 13:45:07 UTC, Jonathan M Davis wrote:
 Mutating the length of a dynamic array has to use the init 
 value. e.g.
 arr.length += 15;
 wouldn't work if init weren't used.
this means that u have to initialize the added elements to some value (usually 0.0) again, coz NaN is not useful at all, there is no case where u can use it. tada! double work!
 A lot of aspects of D are built around the fact that every type 
 has an init value and the fact that values of that type are 
 always initialized to that init value if they're not given an 
 explicit value.
I agree that data should be initialized... but with "all zeroes", not NaN or \xFF.
  disable..
I found only this https://dlang.org/spec/attribute.html#disable I feel that I miss something. what is your point?
 not having default constructors for structs
I accept that, it's only my responsibility to assign some values to the fields. but "all fields are zeroes" is a good choice for me. NaN - means that I should think about small details when I switch do nothing.
Jun 09 2019
Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:
 No, by putting NaN in d you have a deterministic error. In C 
 and C++ you will have undefined behaviour that will vary with 
 compiler, version, options, OS version, architecture, position 
 of the moon, etc. and sometimes undetectable bugs.
I don't think it is undefined though… If something has an arbitrary value, you could still compute with it, if the algorithm takes that into account, assuming that all bit-patterns provide a defined value (which is the case for IEEE floating point bit-patterns).

Anyway, the obvious advantage of having structs default-initialized to all-bits-zero is that you can have an allocator that clears bits in the background (bypassing caches so they are not polluted). Then you have no penalty when allocating an array of one million struct values. Which is very useful. Just allocate memory-chunks that are already set to zero-bits.

You usually want an array of floating point values to be pre-initialized to zeros. You almost never want an array of floating point values to be initialized to NaN.
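
As a sketch of which types could use such pre-zeroed memory as-is, the compiler can be asked whether a type's init value is all-bits-zero. This assumes a compiler that supports `__traits(isZeroInit, T)` (DMD 2.083 or later, as far as I know); the two struct names are hypothetical:

```d
import std.math : isNaN;

struct Particle     { double x, y; }         // hypothetical: fields default to NaN
struct ZeroParticle { double x = 0, y = 0; } // hypothetical: init is all-bits-zero

void main()
{
    static assert( __traits(isZeroInit, int));
    static assert(!__traits(isZeroInit, double)); // NaN is not all-bits-zero
    static assert(!__traits(isZeroInit, char));   // 0xFF
    static assert(!__traits(isZeroInit, Particle));
    static assert( __traits(isZeroInit, ZeroParticle));

    // A freshly allocated double array is filled with NaN,
    // so getting zeros takes a second pass over the memory.
    auto values = new double[](4);
    assert(isNaN(values[0]));
    values[] = 0.0;
    assert(values == [0.0, 0.0, 0.0, 0.0]);
}
```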
Jun 09 2019
James Blachly <james.blachly gmail.com> writes:
On 6/9/19 8:19 AM, Ola Fosheim Grøstad wrote:
 On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:
 No, by putting NaN in d you have a deterministic error. In C and C++ 
 you will have undefined behaviour that will vary with compiler, 
 version, options, OS version, architecture, position of the moon, etc. 
 and sometimes undetectable bugs.
I don't think it is undefined though… If something has an arbitrary value, you could still compute with it, if the algorithm takes that into account, assuming that all bit-patterns provide a defined value (which is the case for IEEE floating point bit-patterns). Anyway, the obvious advantage of having structs default-initialized to all-bits-zero is that you can have an allocator that clears bits in the background (bypassing caches so they are not polluted). Then you have no penalty when allocating an array of one million struct values. Which is very useful. Just allocate memory-chunks that are already set to zero-bits. You usually want an array of floating point values to be pre-initialized to zeros. You almost never want an array of floating point values to be initialized to NaN.
Yes, and further I would suggest that non-zero-bit initializations violate the principle of least surprise. As someone posted upthread, it would be interesting to take a poll of new users (or non users, or perhaps the D-curious) and ask what their best guess is for each default value.
Jun 09 2019
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 9 June 2019 at 12:19:43 UTC, Ola Fosheim Grøstad wrote:
 On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:
 No, by putting NaN in d you have a deterministic error. In C 
 and C++ you will have undefined behaviour that will vary with 
 compiler, version, options, OS version, architecture, position 
 of the moon, etc. and sometimes undetectable bugs.
I don't think it is undefined though…
It is undefined behaviour by the definition of the standard. Undefined behaviour includes behaviour that can be explained.
Jun 09 2019
lithium iodate <whatdoiknow doesntexist.net> writes:
On Sunday, 9 June 2019 at 18:27:07 UTC, Patrick Schluter wrote:
 On Sunday, 9 June 2019 at 12:19:43 UTC, Ola Fosheim Grøstad 
 wrote:
 On Sunday, 9 June 2019 at 08:36:30 UTC, Patrick Schluter wrote:
 No, by putting NaN in d you have a deterministic error. In C 
 and C++ you will have undefined behaviour that will vary with 
 compiler, version, options, OS version, architecture, 
 position of the moon, etc. and sometimes undetectable bugs.
I don't think it is undefined though…
It is undefined behaviour by the definition of the standard. undefined behaviour includes behaviour that can be explained.
To be fair, the C standard (C11) is a bit self-contradictory there. Variables of automatic storage duration that are not explicitly initialized contain an unspecified value (i.e. any valid value) or a trap representation. Types such as integers usually don't have any trap representations, so reading them should be defined on most platforms (unless C permits some sort of compile-time-only trap representation? not sure). Then there's informative(!) annex J, which explicitly lists it as undefined behavior.
Jun 09 2019