www.digitalmars.com         C & C++   DMDScript  

digitalmars.dip.ideas - Type state analysis

reply Richard (Rikki) Andrew Cattermole <richard cattermole.co.nz> writes:
This proposal introduces a new semantic pass, that uses an IR to 
analyse memory transfers and the different states it can be in.

It is the bleading edge on program security, that would enable 
other memory analysis solutions such as isolated (immutable 
references ala Midori).

It enables:

- To prevent reads to uninitialized variables
- Prevent reads, writes, method calls to null objects
- Logic errors such as trying to read from an unopened file
- Is a framework to analyse unreachable vs reachable variables

It does not affect the type system.
Additional information must be provided by the user if they want 
to go beyond the defaults.
The default type state of each variable is meant to be as close 
to provable by the compiler, with user assistance like null 
checks.

A key premise is that newer edition `` safe`` function will not 
be able to call the older `` safe`` function.
This drastically simplifies decisions.

The end goal of this is to enable D to become temporally safe, 
which is critical to D's future (see the recent US government 
report on memory safety with specific mention towards temporal 
safety).
Without this, the framework required to do temporal analysis does 
not exist, and what would otherwise be very clear decisions 
become arguable at best, at worst complete unknowns.

For each variable declaration I am proposing a default type 
state, along with manual specification of the input and output 
states it may be in by using the UAX31 Medial character 
``identifier'input'output``.

This proposal was born out of necessaity after failure to solve 
issues surrounding isolated, and it too required a similar IR to 
be built.

I recognize that this could be a bit costly, but due to the 
potential performance wins as well as memory and logic safety, 
this is an absolute must if we want to be competitive.

Latest: 
https://gist.github.com/rikkimax/eed86a7061445a93f214e41fb6445e40
Current: 
https://gist.github.com/rikkimax/eed86a7061445a93f214e41fb6445e40/a8fffb5725904c6f5d74052d9c974a8f5d453fb0
Mar 16
next sibling parent reply claptrap <clap trap.com> writes:
On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 This proposal introduces a new semantic pass, that uses an IR 
 to analyse memory transfers and the different states it can be 
 in.
Any examples of languages that use this and what it provides?
Mar 17
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 18/03/2024 12:25 AM, claptrap wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 This proposal introduces a new semantic pass, that uses an IR to 
 analyse memory transfers and the different states it can be in.
Any examples of languages that use this and what it provides?
From Wikipedia:
 Typestate is an experimental concept that has not yet crossed over 
into mainstream programming languages. However, many academic projects actively investigate how to make it more useful as an everyday programming technique. Two examples are the Plaid and Obsidian languages, which are being developed by Jonathan Aldrich's group at Carnegie Mellon University.[16][17] Other examples include the Clara[18] language research framework, earlier versions of the Rust language, and the >> keyword in ATS.[19] This is the bleeding edge of research. https://en.wikipedia.org/wiki/Typestate_analysis If we get this, we'll be writing the literature.
Mar 17
prev sibling next sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 Latest: 
 https://gist.github.com/rikkimax/eed86a7061445a93f214e41fb6445e40
 Current: 
 https://gist.github.com/rikkimax/eed86a7061445a93f214e41fb6445e40/a8fffb5725904c6f5d74052d9c974a8f5d453fb0
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
Mar 17
parent reply Atila Neves <atila.neves gmail.com> writes:
On Sunday, 17 March 2024 at 15:35:40 UTC, Paul Backus wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 [...]
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
I second all of the above.
Apr 08
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 09/04/2024 6:18 AM, Atila Neves wrote:
 On Sunday, 17 March 2024 at 15:35:40 UTC, Paul Backus wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 [...]
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
I second all of the above.
Just so we are clear, DIP1000 is not able to provide memory safety on a single thread, let alone multiple. It is missing the child/parent relationship to prevent: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ``` This happens to be one of the things that the DFA is capable of tracking, that is not currently possible to implement (Dennis has tried). It also does not provide any way of tracking memory from creation to ensure it can be transferred safely to another thread in its entirety to prevent needing things like locks.
Apr 08
parent reply Atila Neves <atila.neves gmail.com> writes:
On Monday, 8 April 2024 at 20:10:24 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 On 09/04/2024 6:18 AM, Atila Neves wrote:
 On Sunday, 17 March 2024 at 15:35:40 UTC, Paul Backus wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 [...]
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
I second all of the above.
Just so we are clear, DIP1000 is not able to provide memory safety on a single thread, let alone multiple. It is missing the child/parent relationship to prevent: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ```
DIP1000 depends on lifetimes and RAII, and in a language like D with constructors and destructors I struggle to find a reason to write `parent.destroy`. This works as intended, in the sense that it fails to compile: ``` Field* child; { Parent parent; child = &parent.field; } Field value = *child; ``` It's the same reason why I don't understand "Logic errors such as trying to read from an unopened file" - why would a file be unopened?
Apr 10
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 11/04/2024 7:17 AM, Atila Neves wrote:
 On Monday, 8 April 2024 at 20:10:24 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 On 09/04/2024 6:18 AM, Atila Neves wrote:
 On Sunday, 17 March 2024 at 15:35:40 UTC, Paul Backus wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 [...]
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
I second all of the above.
Just so we are clear, DIP1000 is not able to provide memory safety on a single thread, let alone multiple. It is missing the child/parent relationship to prevent: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ```
DIP1000 depends on lifetimes and RAII, and in a language like D with constructors and destructors I struggle to find a reason to write `parent.destroy`. This works as intended, in the sense that it fails to compile: ```     Field* child;     {         Parent parent;         child = &parent.field;     }     Field value = *child; ```
Yes, because of the explicit scope creation. The compiler can see it cross scopes, but not within a scope this requires tracking of values and their order of invalidation.
 It's the same reason why I don't understand "Logic errors such as trying 
 to read from an unopened file" - why would a file be unopened?
Accidental, different scope states (conditional/function calls), explicit closing, events handled automatically by OS. A lot of reasons.
Apr 10
parent reply Atila Neves <atila.neves gmail.com> writes:
On Wednesday, 10 April 2024 at 19:23:17 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 11/04/2024 7:17 AM, Atila Neves wrote:
 On Monday, 8 April 2024 at 20:10:24 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 On 09/04/2024 6:18 AM, Atila Neves wrote:
 On Sunday, 17 March 2024 at 15:35:40 UTC, Paul Backus wrote:
 On Sunday, 17 March 2024 at 06:51:23 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 [...]
Yet another proposal for a humongous language feature to solve a problem that can already be solved without it. It is possible to achieve temporal safety in D already with `scope` (DIP 1000) and ` system` variables (DIP 1035). The ergonomics are not great, but they can be improved (e.g., with built-in move-on-last-use). This proposal specifically has the same problem as ` live`: it splits ` safe` code into separate "new ` safe`" and "old ` safe`" dialects, which are mutually incompatible. The ideas themselves are not terrible. I would be interested to see what this looks like implemented in a research language. But I do not think it has any place in D.
I second all of the above.
Just so we are clear, DIP1000 is not able to provide memory safety on a single thread, let alone multiple. It is missing the child/parent relationship to prevent: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ```
DIP1000 depends on lifetimes and RAII, and in a language like D with constructors and destructors I struggle to find a reason to write `parent.destroy`. This works as intended, in the sense that it fails to compile: ```     Field* child;     {         Parent parent;         child = &parent.field;     }     Field value = *child; ```
Yes, because of the explicit scope creation. The compiler can see it cross scopes, but not within a scope this requires tracking of values and their order of invalidation.
Unless I'm missing something, that's exactly what DIP1000 does? How else would "order of invalidation" work except through scopes? Explicit destruction? Why do that??
 It's the same reason why I don't understand "Logic errors such 
 as trying to read from an unopened file" - why would a file be 
 unopened?
Accidental, different scope states (conditional/function calls),
What would an example of that look like?
 explicit closing
Doctor, doctor, it hurts when...
, events handled automatically by OS.
Example?
 A lot of reasons.
I haven't encountered any; I'm not saying they don't exist, I'm saying that in my experience they're rare enough that they're approximately zero. I don't think that's ever happened since I stopped writing C. If one creates a file, it's open. If the file is no longer in scope, it's closed.
Apr 10
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 11/04/2024 7:31 AM, Atila Neves wrote:
 On Wednesday, 10 April 2024 at 19:23:17 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 Yes, because of the explicit scope creation. The compiler can see it 
 cross scopes, but not within a scope this requires tracking of values 
 and their order of invalidation.
Unless I'm missing something, that's exactly what DIP1000 does? How else would "order of invalidation" work except through scopes? Explicit destruction? Why do that??
See my earlier example: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ``` That simply shouldn't compile. One single scope, and it absolutely should be under DIP1000's purview. This is critical for owning data structures (which can include memory allocators) for it to be solved. Note: destruction also includes assignments to the variable that had contained the data structure. ```d DS ds = ...; Item* item = ds[...]; ds = ...; Item value = *item; ``` Same exact pattern.
 It's the same reason why I don't understand "Logic errors such as 
 trying to read from an unopened file" - why would a file be unopened?
Accidental, different scope states (conditional/function calls),
What would an example of that look like?
```d mayCloseOnSpecificInput(file); file.read(); // may or may not be closed ``` Add an if/else block to it as well, well... Fun.
 explicit closing
Doctor, doctor, it hurts when...
And like goto's we don't allow you to skip variable declarations. Don't explicitly close and not handle the branches correctly. Doesn't mean we shouldn't have compiler assistance to prevent you from making this mistake.
 , events handled automatically by OS.
Example?
For a socket, peer closed connection. A normally occurring event, that kills the handle automaticaly.
 A lot of reasons.
I haven't encountered any; I'm not saying they don't exist, I'm saying that in my experience they're rare enough that they're approximately zero. I don't think that's ever happened since I stopped writing C. If one creates a file, it's open. If the file is no longer in scope, it's closed.
For context: a file representation is the classic example from the 80's for type state analysis. However there was a giant thread only a couple weeks back wanting nullability type state checks. Notice how a lot of languages now have nullability analysis in them? Yeah. That would be a much better example from a practical stand point.
Apr 10
parent Dukc <ajieskola gmail.com> writes:
On Wednesday, 10 April 2024 at 19:47:03 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 11/04/2024 7:31 AM, Atila Neves wrote:
 Unless I'm missing something, that's exactly what DIP1000 
 does? How else would "order of invalidation" work except 
 through scopes? Explicit destruction? Why do that??
See my earlier example: ```d Parent* parent; Field* child = &parent.field; parent.destroy; Field value = *child; ``` That simply shouldn't compile.
Assuming you meant `destroy(*parent);` instead of destroying only the pointer to it. `destroy` runs the destructor of the object, but it doesn't free it. It even sets it to it's `.init` value so that it's known exactly what any further calls on the object will do. So your example is safe. I know, what you're saying is that one can't give ` safe` access to a field that is deallocated when the struct is destroyed even with a scoped reference, because the struct can be destroyed before the reference expires. We've probably been through this before: one can, but only inside a callback such as `borrow` for `SafeRefCounted`, that locks the data structure preventing destruction while the callback is executed.
Apr 13
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
The  live functions do type state analysis - data flow analysis is used to 
determine if a variable is 'live' or not.

It is indeed costly to do dfa in the front end, that's one reason why it's 
restricted to  live functions.

The dfa could be extended for null checking, but in practice, null checking is 
not that effective:

```
class C { void xx(); }

struct S { C c; }

C mars(C c) { return null; }

void phobos(ref S s)
{
     C c;
     c.xx();       // detected
     mars(c).xx(); // needs whole program DFA to detect
     s.c.xx();     // cannot be detected
}
```

This is why  live functions won't work without scope, ref, and return 
annotations on the functions it interfaces with.  live functions, like Rust, 
also severely limit the kinds of data structures possible.

Other type state analysis currently done in D is:

1. Value Range Propagation

2. whether a field is initialized or not in a constructor is tracked
Mar 29
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 30/03/2024 6:10 AM, Walter Bright wrote:
 The  live functions do type state analysis - data flow analysis is used 
 to determine if a variable is 'live' or not.
 
 It is indeed costly to do dfa in the front end, that's one reason why 
 it's restricted to  live functions.
 
 The dfa could be extended for null checking, but in practice, null 
 checking is not that effective:
 
 ```
 class C { void xx(); }
 
 struct S { C c; }
 
 C mars(C c) { return null; }
 
 void phobos(ref S s)
 {
      C c;
      c.xx();       // detected
      mars(c).xx(); // needs whole program DFA to detect
      s.c.xx();     // cannot be detected
 }
 ```
 
 This is why  live functions won't work without scope, ref, and return 
 annotations on the functions it interfaces with.  live functions, like 
 Rust, also severely limit the kinds of data structures possible.
 
 Other type state analysis currently done in D is:
 
 1. Value Range Propagation
 
 2. whether a field is initialized or not in a constructor is tracked
Yes, typical applications of DFA need to be run on the whole program to work as far as I'm aware. ML family compilers do this almost exclusively, they cannot do multi-step builds. D however is the opposite. Given that what I'm suggesting is not the norminal use case for DFA (quite the opposite), and we already have an approach thanks to DIP1000 I am proposing to go function by function. You must annotate the type states input and output for functions parameters and return value if the defaults are less than what you are wanting guaranteed. It acts purely in the form of verification against this. Inferring only happens within the body of a function, it does not affect the signature including for templates (not part of type system). Because of these decisions it can be parallelized without concern (unless you need to run semantic on newly inserted statements for cleanup). See my recently added example that Rust cannot check since it hasn't got type state analysis in production. https://gist.github.com/rikkimax/eed86a7061445a93f214e41fb6445e40 ```d T* makeNull(T)() safe { return null; } void useNull() safe { int* var = makeNull!int(); // var is in type state initialized as per makeNull return state *var = 42; // segfault due to var being null } ``` What we want to happen instead: ```d T* makeNull(T)(/* return'initialized */) safe { return null; // type state default is more than the type state initialized // so it is accepted } void useNull() safe { int* var = makeNull!int(); // var is in type state initialized as per MakeNull return state // perform load via var variable // this will error due to initialized is less than the nonnull type state // Error: Variable var is in type state initialized which could be null, cannot write to it *var = 42; } ``` To fix, simply check for null! ```d void useNull() safe { int* var = makeNull!int(); // var is in type state initialized as per MakeNull return state if (var !is null) { // in scope, assume var is in type state nonnull *var = 42; } } ```
Mar 29
parent reply Walter Bright <newshound2 digitalmars.com> writes:
In order to make this work, it will need typestate annotations on all:

1. function returns
2. aggregate fields
3. array element types
4. pointed to types
5. function parameters

I suspect that would make it bulletproof. Isn't it a big ask to get people to 
add all those annotations?
Mar 29
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 30/03/2024 3:30 PM, Walter Bright wrote:
 Isn't it a big ask to get people to add all those annotations?
If people had to write annotations yes. The key will be making sure that they don't have to, the defaults should be good enough, or at bare minimum match what we assume today. What I am unsure about currently is related to pointers (that are not the this pointer). Should we require them to be nonnull by default and allow people to opt out like in other languages, or do we go in the inverse and go with what we have now and stick to initialized and must be opt-in to the stronger type state of nonnull. Slices are quite often null on purpose, what about them? What about consistency between safe and system? There are some unanswered questions that may be better answered after experience with it. Although I'm in favor of doing things differently than other languages and make people think about their assumptions for their code, rather than assume stronger guarantees that may not have been considered.
Mar 29