digitalmars.D - safe accessing of union members

Q. Schroll <qs.il.paperinik gmail.com> writes:
There was a discussion lately about overlapping pointers in 
unions, @safe and SumType.

Obvious results: It is @safe overlapping some types with others 
and accessing them in @safe code and it is un-@safe to do so with 
other types.

I had an insight that I wanted to share: invariants. A type has 
invariants if there could be bit-patterns that are invalid. (If 
I'm not mistaken, it's really that simple.) In code, this would 
mean: If I had a mutable object obj of type T, and I did

     T obj = ...;
     ubyte[] obj_slice = (cast(ubyte*)cast(void*)(&obj))[0 .. T.sizeof];
     size_t i = ...;
     assert(i < obj_slice.length);
     obj_slice[i] = ...;

would it result in obj being un-@safe to use in the sense that 
using it would result in un-@safe operations? Built-in integer 
and floating point types have no invariants. Every single one of 
the 2³² bit-patterns an int can hold is a valid number; the same 
is true for floats (it's not un-@safe to read a NaN!). Compare 
that to bool and pointers. Basically, bool is ubyte, but with the 
invariant that its value is 0 or 1. Any other bit-pattern is 
invalid for a bool value. At first glance, any bit-pattern is 
valid for a pointer -- but that's not true, because what is a 
valid bit-pattern need not be fixed (like for bool). An invariant 
can be: Apart from null, it must be valid to dereference. The set 
of addresses valid to dereference changes at run-time. (Even if 
the set of valid-to-dereference addresses could become the whole 
address space, it suffices that there could be situations at 
run-time when it isn't.)
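
A minimal sketch of the byte-overwrite experiment described above, assuming a hypothetical helper named `pokeByte` (it is not from any library). It only exercises the two types claimed to have no invariants; the bool and pointer cases are left as a comment because running them is exactly what would break an invariant.

```d
import std.math : isNaN;

// Hypothetical helper: overwrite one byte of obj's in-memory
// representation. It bypasses the type system, hence @system.
void pokeByte(T)(ref T obj, size_t i, ubyte value) @system
{
    ubyte[] bytes = (cast(ubyte*) &obj)[0 .. T.sizeof];
    assert(i < bytes.length);
    bytes[i] = value;
}

void main() @system
{
    // int: set every byte to 0xFF; the result is still an ordinary int.
    int n = 0;
    foreach (i; 0 .. int.sizeof)
        pokeByte(n, i, 0xFF);
    assert(n == -1);

    // float: all bits set is a NaN, which is perfectly fine to read.
    float f = 0;
    foreach (i; 0 .. float.sizeof)
        pokeByte(f, i, 0xFF);
    assert(f.isNaN);

    // Doing the same to a bool (a byte other than 0 or 1) or to a
    // pointer (an address that may not be dereferenceable) is where
    // an invariant would be broken, so it is not done here.
}
```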

Breaking invariants incurs undefined behavior and that is 
un-@safe by definition. Practically, there's no way @trusted 
functions can work if they cannot in general assume that the 
types' invariants they deal with are met.

So, what kinds of union uses are @safe?
Answer: If all members of the union have no invariants.

There are other cases: for example, if only the currently active 
union member is accessed, it's @safe to use it. This check needs 
control-flow analysis in general, but can be watered down to 
checks that only one union member is ever active.
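
As a small illustration of the answer above (a sketch, with the names `Bits`, `fromBits` and `PtrOrInt` made up for this example): a union whose members all accept every bit-pattern can be read through either member in @safe code, while a union that overlaps a pointer with an integer is expected to be rejected by the compiler when the pointer is read in @safe code.

```d
// Both members accept every bit-pattern, so either may be read.
union Bits
{
    uint  u;
    float f;
}

float fromBits(uint raw) @safe
{
    Bits b;
    b.u = raw;
    return b.f;   // reads the member that was NOT written last
}

// A pointer overlapping an integer: the pointer has an invariant.
union PtrOrInt
{
    int*   p;
    size_t n;
}

// Reading the overlapped pointer in @safe code should not compile.
static assert(!__traits(compiles, () @safe { PtrOrInt x; int* q = x.p; }));

void main() @safe
{
    // 0x3F80_0000 is the IEEE 754 single-precision encoding of 1.0
    assert(fromBits(0x3F80_0000) == 1.0f);
}
```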

When should the language (conservatively) assume an aggregate 
type (struct, class, etc.) has invariants? (Or, contrapositively, 
when can the language be sure an aggregate type definitely has no 
invariants?)

1. If the type is an interface or non-final class type.
2. If the type has an explicit invariant block.
3. If the type has a member variable having a type with 
invariants.
4. If the type has padding bytes between member variables.
5. If the type has non-public member variables.

Rationales:
1. Types implementing an interface or inheriting from a class 
could have invariants.
4. For optimization, the compiler should be allowed to assume 
that padding bits are always zero, unless explicitly told the 
opposite. (Cf. assuming a bool is 0 or 1 always.) This is 
debatable.
5. Even if no invariants are stated, the fact that some members 
are encapsulated in some way is a clear indication that an 
invariant likely exists. There are counter-examples, like a 
wrapper that's logging access to its only member using getter and 
setter.

Contrapositive formulation: The language can be (reasonably) 
certain that no invariants exist in an aggregate type when:
1. If the type is a class type, it is final, --- and
2. it has no invariant block, --- and
3. no member variable's type has invariants, --- and
4. the type has no padding bits, --- and
5. all member variables are public (i.e. anyone anywhere could 
write them).

If your type truly has no invariants, but fails condition 4, you 
can introduce ubyte[n] member variables that name the padding. In 
a sense, those padding arrays are implicitly private when 
compiler-generated, i.e. failing condition 5.
If your type truly has no invariants, but fails condition 5, it 
could be mitigated by allowing

     @disable invariant;

to indicate that no implicit invariant arises from private 
members.
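
Condition 4 and the "name the padding" workaround can be sketched as follows. `hasNoPadding` is a hypothetical helper written for this example; it only compares the direct field sizes with the size of the type, so it is a shallow check and does not look inside nested aggregates.

```d
import std.traits : Fields;

// Hypothetical helper: true when the direct field sizes add up to
// T.sizeof, i.e. there are no unnamed padding bytes at this level.
bool hasNoPadding(T)()
{
    size_t total = 0;
    foreach (F; Fields!T)   // unrolled at compile time over field types
        total += F.sizeof;
    return total == T.sizeof;
}

// uint is 4-byte aligned, so 3 padding bytes follow `tag`.
struct Padded { ubyte tag; uint value; }

// Same layout, but the padding is named by an explicit member.
struct Named { ubyte tag; ubyte[3] pad; uint value; }

static assert(!hasNoPadding!Padded());
static assert( hasNoPadding!Named());

void main() {}
```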

The compiler can recognize certain overlappings as valid although 
by the rules stated, they are not. An example would be: 
Overlapping T* and S* where T and S have the same size and both 
have no invariants. The second condition is important; otherwise, 
overlapping could be used to circumvent e.g. T's invariants using 
S which has no or different invariants.
Mar 17
Paul Backus <snarwin gmail.com> writes:
On Wednesday, 17 March 2021 at 16:43:25 UTC, Q. Schroll wrote:
 There was a discussion lately about overlapping pointers in 
 unions, @safe and SumType.

 Obvious results: It is @safe overlapping some types with others 
 and accessing them in @safe code and it is un-@safe to do so 
 with other types.

 I had an insight that I wanted to share: invariants. A type has 
 invariants if there could be bit-patterns that are invalid.
Yes. This is exactly what the "Background" section of the DIP 1035 [1] was trying to say.
 Breaking invariants incurs undefined behavior
Not necessarily. The statement

    int* p = cast(int*) 0xDEADBEEF;

...does not have undefined behavior. You only get undefined behavior if you actually dereference `p`.

In more general terms: undefined behavior doesn't come from the values themselves, but from specific *operations* on those values. The purpose of an invariant is to specify what conditions are necessary to ensure defined behavior for a given operation.
 So, what kinds of union uses are @safe?
 Answer: If all members of the union have no invariants.
This is overly narrow. Unions themselves have no invariants, even when their members do, because access to those members is forbidden in @safe code, and there is no operation you can perform on a union instance *as a whole* in @safe code whose behavior is potentially undefined.

Some examples of things you can do with a union that are always @safe, regardless of its members:

    union U { int* ptr; int num; }

    // Initialization is always @safe
    U a = { num: 123 };
    U b = { ptr: new int };
    // Copying is always @safe
    U c = b;
    // Bitwise comparison is always @safe
    assert(c is b);
    // Casting memory to const(ubyte) is always @safe
    writefln("Raw bytes: %(%02X %)", *cast(const(ubyte)[U.sizeof]*) &c);
 When should the language (conservatively) assume an aggregate 
 type (struct, class, etc.) has invariants?
The rules in the language spec [2] are mostly correct in this regard, though they leave out `bool` (and enum types, though that's a more debatable issue).

[1] https://github.com/dlang/DIPs/blob/c39f6ac62210e0604dcee99b0092c1930839f93a/DIPs/DIP1035.md#background
[2] https://dlang.org/spec/function.html#safe-values
Mar 17
ag0aep6g <anonymous example.com> writes:
On Wednesday, 17 March 2021 at 18:01:43 UTC, Paul Backus wrote:
 On Wednesday, 17 March 2021 at 16:43:25 UTC, Q. Schroll wrote:
[...]
 When should the language (conservatively) assume an aggregate 
 type (struct, class, etc.) has invariants?
The rules in the language spec [2] are mostly correct in this regard, though they leave out `bool` (and enum types, though that's a more debatable issue).
[...]
 [2] https://dlang.org/spec/function.html#safe-values
I left out bool deliberately. The impression I get from Walter is that he only considers indirections to be potentially unsafe [1]. I anticipate him addressing the issue by defining 0 = false, anything else = true. Then all bit patterns are safe, and it's just a matter of fixing code that expects only 0 or 1.

[1] https://github.com/dlang/dlang.org/pull/2260
Mar 17
Paul Backus <snarwin gmail.com> writes:
On Wednesday, 17 March 2021 at 19:09:18 UTC, ag0aep6g wrote:
 On Wednesday, 17 March 2021 at 18:01:43 UTC, Paul Backus wrote:
 On Wednesday, 17 March 2021 at 16:43:25 UTC, Q. Schroll wrote:
[...]
 When should the language (conservatively) assume an aggregate 
 type (struct, class, etc.) has invariants?
The rules in the language spec [2] are mostly correct in this regard, though they leave out `bool` (and enum types, though that's a more debatable issue).
[...]
 [2] https://dlang.org/spec/function.html#safe-values
I left out bool deliberately. The impression I get from Walter is that he only considers indirections to be potentially unsafe [1]. I anticipate him addressing the issue by defining 0 = false, anything else = true. Then all bit patterns are safe, and it's just a matter of fixing code that expects only 0 or 1. [1] https://github.com/dlang/dlang.org/pull/2260
IIRC "code that expects only 0 or 1" includes things like the GCC and LLVM backends, so it may be worth some additional consideration, especially if we want D's bool to interface correctly with C99 _Bool and/or C++ bool.
Mar 17
ag0aep6g <anonymous example.com> writes:
On Wednesday, 17 March 2021 at 19:27:29 UTC, Paul Backus wrote:
 IIRC "code that expects only 0 or 1" includes things like the 
 GCC and LLVM backends, so it may be worth some additional 
 consideration, especially if we want D's bool to interface 
 correctly with C99 _Bool and/or C++ bool.
Oh, absolutely. I don't want to stop anyone from exploring the alternative. I'm just saying allowing values above 1 can work (AFAICT), and it's what I expect Walter to favor. But I might well be wrong on both counts.
Mar 17
Dominikus Dittes Scherkl <dominikus scherkl.de> writes:
On Wednesday, 17 March 2021 at 19:09:18 UTC, ag0aep6g wrote:
 I anticipate him addressing the issue by defining 0 = false, 
 anything else = true. Then all bit patterns are safe, and it's 
 just a matter of fixing code that expects only 0 or 1.
There is another interpretation that would make all patterns safe: lowest bit is 0 = false, lowest bit is 1 = true. This has the advantage that any garbage in higher bits doesn't invalidate the bool. But of course, even values being considered "false" may feel somewhat strange...
Mar 18
Q. Schroll <qs.il.paperinik gmail.com> writes:
On Wednesday, 17 March 2021 at 18:01:43 UTC, Paul Backus wrote:
 On Wednesday, 17 March 2021 at 16:43:25 UTC, Q. Schroll wrote:
 There was a discussion lately about overlapping pointers in 
 unions, @safe and SumType.

 Obvious results: It is @safe overlapping some types with 
 others and accessing them in @safe code and it is un-@safe to 
 do so with other types.

 I had an insight that I wanted to share: invariants. A type 
 has invariants if there could be bit-patterns that are invalid.
Yes. This is exactly what the "Background" section of the DIP 1035 [1] was trying to say.
I missed that. Using DIP 1035's terms, union members must be implicitly @system if they have invariants (term "invariant" in the sense of the DIP, which could be checked using the conditions I stated).

Reading DIP 1035 that you co-authored, I figured my notion of a "type that has invariants" could be helpful. In an example of the DIP, there's a void initialization presented as a reason why a type called ShortString is not memory safe. If you look at my definition of "type with invariants", ShortString would be considered a type with invariants because it has private variables (and has no @disable invariant). As the language currently correctly specifies, a pointer cannot be void initialized in @safe code. Why? Because a pointer has invariants and those could be broken using void initialization. ShortString also has invariants, therefore void initializing one cannot be deemed @safe.

The DIP unfortunately nowhere states how/where to use a @system variable in the introductory example code. That would be helpful.

Where the DIP and this idea diverge is fixing something vs introducing something (that, among other things, helps @trusted review). While @system variables are useful for global variables for sure, I think for types like the DIP introductory examples, @system variables aren't really a solution to the presented problem. @safe should error if a clueless programmer writes and uses it and accidentally introduces UB. This includes writing @trusted functions that are properly written. "If ShortString.length could be marked as @system, this dilemma would not exist." While true, it is not obvious why a clueless programmer would mark `length` @system. The only part of the program that looks fishy is the @trusted function and that one cannot be changed for the better.

Maintaining the invariant of an aggregate type necessarily includes auditing the whole module which has access to its private data. Annotating member variables @system can help with that, reducing the audit to @trusted and @system functions in the module. Unless @system becomes the default for member variables, it cannot be relied upon for cases like ShortString. The DIP points that out in Example: User-Defined Slice: "Instead, every function that touches ptr and length, including the @safe constructor, must be manually checked."

I missed void initialization in my post, but interestingly, void initialization of a type T object is @safe if and only if in the `union { T obj; ubyte[T.sizeof] bytes; }` it is valid to initialize `bytes` arbitrarily and use `obj`.
 Breaking invariants incurs undefined behavior
Not necessarily. The statement int* p = cast(int*) 0xDEADBEEF; ...does not have undefined behavior.
The spec says you're wrong, at least for structs and classes: "If the invariant does not hold, then the program enters an invalid state."
-- https://dlang.org/spec/struct.html#StructInvariant
-- https://dlang.org/spec/class.html#invariants
 You only get undefined behavior if you actually dereference `p`.
Even if that were the case, it'd be irrelevant (it's the spec that decides when UB is encountered, not what one compiler implementation does). You cannot make dereferencing a pointer @system (in general), but you can make it @system to assign a value that is (probably) invalid to dereference. That's what D currently does and it is the right choice IMO. The cast in your code is @system, not the assignment.
 In more general terms: undefined behavior doesn't come from the 
 values themselves, but from specific *operations* on those 
 values.
Technically yes, but that's practically irrelevant as pointed out earlier.
 The purpose of an invariant is to specify what conditions are 
 necessary to ensure defined behavior for a given operation.
I'd say, an invariant is a condition (mainly on an object) such that the specified behavior of that object cannot be guaranteed if that condition is false. Maybe this is mere semantics and you basically meant what I said. I view invariants not through operations but object state. Operations have pre- and post-conditions and an object's invariants are post-conditions for every operation and may be pre-conditions for any operation.
 So, what kinds of union uses are @safe?
 Answer: If all members of the union have no invariants.
This is overly narrow.
Never did I say it's an equivalence. There are cases like SumType that use a union internally and even if a member has an invariant, access is valid, because SumType has invariants that ensure that reading a member only happens after that member was the one assigned last.
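
A minimal sketch of that idea, not SumType's actual implementation: a struct wraps a union that overlaps a pointer with an integer and keeps a tag so that the pointer is only ever read when it was the member assigned last. The names `PtrOrCount`, `holdsPtr` and `pointerOrNull` are made up for this example, and the @trusted annotations are only sound as long as nothing else in the module writes the private fields.

```d
struct PtrOrCount
{
    private union
    {
        int*   ptr;
        size_t count;
    }
    private bool holdsPtr;   // tag: true if `ptr` was assigned last

    // Writing an overlapped pointer is not allowed in @safe code,
    // so the constructors are @trusted and set the tag consistently.
    this(int* p) @trusted { ptr = p; holdsPtr = true; }
    this(size_t n) @trusted { count = n; holdsPtr = false; }

    // Hands out the pointer only when the tag says it is active.
    int* pointerOrNull() @trusted
    {
        return holdsPtr ? ptr : null;
    }
}

void main() @safe
{
    auto a = PtrOrCount(new int(7));
    assert(*a.pointerOrNull() == 7);

    auto b = PtrOrCount(size_t(42));
    assert(b.pointerOrNull() is null);
}
```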
 Unions themselves have no invariants,
They can have them. SumType is an example. While SumType is (probably) implemented as a struct with a union member, it could be a union of structs: int × ∪Ts = ∪(int × Ts). You can literally put invariant blocks in unions.
 even when their members do, because access to those members is 
 forbidden in @safe code,
This is wrong. Accessing union members is sometimes considered @safe currently although it clearly isn't. The compiler detects pointer overlappings as @system, but doesn't for any other invariants types have. What I tried to convey in this post is: pointers having a valid-to-dereference value is an invariant that the @safe mechanism considers, but it does not consider any other form of invariants.
 and there is no operation you can perform on a union instance 
 *as a whole* in @safe code whose behavior is potentially 
 undefined.
A union as a whole, i.e. not accessing a member, is near useless.
 Some examples of things you can do with a union that are always 
 @safe, regardless of its members:

     union U { int* ptr; int num; }

     // Initialization is always @safe
     U a = { num: 123 };
     U b = { ptr: new int };
     // Copying is always @safe
     U c = b;
     // Bitwise comparison is always @safe
     assert(c is b);
     // Casting memory to const(ubyte) is always @safe
     writefln("Raw bytes: %(%02X %)", *cast(const(ubyte)[U.sizeof]*) &c);
The case I'm talking about is accessing union members. You're digressing. I had a section in a draft of my post pointing out that reading a member that has been written last (needs control-flow analysis in general) is okay. I removed it because I deemed it obvious. Bit-wise reading is also obviously a non-problem.
 When should the language (conservatively) assume an aggregate 
 type (struct, class, etc.) has invariants?
The rules in the language spec [2] are mostly correct in this regard, though they leave out `bool` (and enum types, though that's a more debatable issue).
Notice that "mostly correct" in a formal setting is a euphemism for "wrong". The case with bool is an instance of the problem. In my opinion, void-initializing a bool should be @system. If the language specifies that a bool can have any bit-pattern, every use of a bool b (unless proven by some form of value-range propagation) must be checked for b > 1. This is obviously nonsense. In this case, we can just deprecate bool and use ubyte instead. At least, the language would be honest that way.
 [1] 
 https://github.com/dlang/DIPs/blob/c39f6ac62210e0604dcee99b0092c1930839f93a/DIPs/DIP1035.md#background
 [2] https://dlang.org/spec/function.html#safe-values
Sorry for the long post.
Mar 18
Paul Backus <snarwin gmail.com> writes:
On Thursday, 18 March 2021 at 18:24:21 UTC, Q. Schroll wrote:
 Reading DIP 1035 that you co-authored, I figured my notion of a 
 "type that has invariants" could be helpful. In an example of 
 the DIP, there's a void initialization presented as a reason 
 why a type called ShortString is not memory safe. If you look 
 at my definition of "type with invariants", ShortString would 
 be considered a type with invariants because it has private 
 variables (and has no @disable invariant).
Your definition of "type with invariants" is a bad one--not necessarily from a soundness perspective, but from a good-language-design perspective. Rather than have the compiler try to guess the programmer's intent based on things like the use of `private` or the presence of padding bytes (!), and force the programmer to correct the compiler when it guesses wrong (with `@disable invariant`), it is both much simpler and much better UX to let the programmer state explicitly that a particular type needs to have its state protected from uncontrolled mutation in @safe code.
 @safe should error if a clueless programmer writes and uses it 
 and accidentally introduces UB. This includes writing @trusted 
 functions that are properly written. "If ShortString.length 
 could be marked as @system, this dilemma would not exist." 
 While true, it is not obvious why a clueless programmer would 
 mark `length` @system.
There is literally nothing that the language can possibly do to protect a clueless programmer from writing unsound @trusted code. The best we can hope for is that the presence of the @trusted attribute serves as a big, flashing "DANGER" sign to anyone auditing the code.
 I missed void initialization in my post, but interestingly, 
 void initialization of a type T object is @safe if and only if 
 in the `union { T obj; ubyte[T.sizeof] bytes; }` it is valid to 
 initialize `bytes` arbitrarily and use `obj`.
A less roundabout definition: void initialization of an object of type T is @safe if and only if every bit-pattern that fits into T's in-memory representation represents a safe value [1] of type T.
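
A short compile-time check of how this plays out for two built-in types, written as a sketch: it relies on the compiler accepting `= void` for an int and rejecting it for a pointer inside @safe code, which matches the pointer rule mentioned earlier in the thread.

```d
// Every bit-pattern of an int is a safe value, so skipping its
// initialization is accepted in @safe code.
static assert( __traits(compiles, () @safe { int x = void; x = 1; }));

// A pointer holding leftover bits may not be valid to dereference,
// so void-initializing one is rejected in @safe code.
static assert(!__traits(compiles, () @safe { int* p = void; }));

void main() {}
```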
 Not necessarily. The statement

     int* p = cast(int*) 0xDEADBEEF;

 ...does not have undefined behavior.
The spec says you're wrong, at least for structs and classes: "If the invariant does not hold, then the program enters an invalid state."
I am not talking about structs and classes. I am talking about a specific statement, whose behavior is explicitly well-defined according to the language spec [2].
 Unions themselves have no invariants,
They can have them. SumType is an example
Point taken. What I meant to say was that union *types*, themselves, have no invariants--which is true. Specific union *variables* may indeed have invariants imposed upon them by the programmer above and beyond what their types imply (although due to issue 20941 [3], @trusted code that relies on such invariants is currently unsound).
 even when their members do, because access to those members is 
 forbidden in @safe code,
This is wrong. Accessing union members is sometimes considered @safe currently although it clearly isn't. The compiler detects pointer overlappings as @system, but doesn't for any other invariants types have.
I agree that this is a bug. [4]
 The case with bool is an instance of the problem.
 In my opinion, void-initializing a bool should be @system.
Again, I agree.

[1] https://dlang.org/spec/function.html#safe-values
[2] https://dlang.org/spec/type.html#pointer-conversions
[3] https://issues.dlang.org/show_bug.cgi?id=20941
[4] https://issues.dlang.org/show_bug.cgi?id=21665
Mar 18
ag0aep6g <anonymous example.com> writes:
On Thursday, 18 March 2021 at 18:24:21 UTC, Q. Schroll wrote:
 Maintaining the invariant of an aggregate type necessarily 
 includes auditing the whole module which has access to its 
 private data. Annotating member variables @system can help with 
 that reducing the audit to @trusted and @system functions in 
 the module. Unless @system becomes the default for member 
 variables, it cannot be relied upon for cases like ShortString.
You're saying (correct me if I'm misrepresenting you): I might forget putting @system on a safety-critical variable, and then I can still mess with it from @safe code. So I have to check all @safe code anyway, to see if it touches a safety-critical variable that wasn't marked.

I think that's missing an important aspect of @trusted: @trusted code is not allowed to rely on such variables for safety. The ShortString example in the DIP shows invalid code. It only becomes valid when DIP 1035 is implemented and `length` is marked @system.

Condensed into a piece of code:

    char[15] data;
    ubyte length_1;
    @system ubyte length_2;
    @trusted char[] f() { return data.ptr[0 .. length_1]; }
    @trusted char[] g() { return data.ptr[0 .. length_2]; }

`f` is always invalid. `g` is invalid without DIP 1035. It becomes valid with DIP 1035.
 The DIP points that out in Example: User-Defined Slice: 
 "Instead, every function that touches ptr and length, including 
 the @safe constructor, must be manually checked."
Allowing a @safe constructor to set a @system variable might be a mistake. This was pointed out in the DIP review. But auditing all @safe constructors is still less work than auditing all @safe code.
Mar 18
Paul Backus <snarwin gmail.com> writes:
On Thursday, 18 March 2021 at 23:37:26 UTC, ag0aep6g wrote:
 Allowing a @safe constructor to set a @system variable might 
 be a mistake. This was pointed out in the DIP review. But 
 auditing all @safe constructors is still less work than 
 auditing all @safe code.
It's definitely not ideal, but it's necessary given the language's current limitations. Constructors are subject to special restrictions [1] that do not apply to most D code, and those restrictions make proper use of @trusted escapes impossible. I believe this is mentioned somewhere in the DIP, but it probably deserves more attention, since this is a relatively obscure corner of the language.

[1] https://dlang.org/spec/struct.html#field-init
Mar 18