digitalmars.D - Feature Request: Hashed Based Assertion

reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
I brought this topic in "Learn" a while ago, but I want to talk 
about it again.

You are in a big team or working with a big code base. APIs are 
being defined/modified, configuration constants are 
defined/modified, structures are defined/modified for data.

You are coding on the business-logic side and rely entirely on 
the current APIs, configuration, and data structures. Part of the 
code has been updated on the API side, but you are not aware of 
it, or time has passed and you simply assume that your code still 
works properly. Nobody checks every single part of the business 
logic line by line.

At runtime you get unexpected results and lose some hair until 
you find where the problem is. Tracking down unexpected results 
in a long-running process causes even more trouble.

---

What I do currently is that: I calculate the hash of API code 
(function, configuration, etc together) with a hash function, and 
store it where the API is defined as a constant.

public enum HASH_OF_THIS_API = 0x1234;

// Hash is calculated from here
public void my_api_function(){}

public enum my_api_constant = 5;
// till here

Then wherever I use that API, I insert a "static assert( 
HASH_OF_THIS_API == 0x1234 );".

Whoever modifies the API then recalculates the hash of the 
latest code and updates the constant. This lets the compiler 
warn the business-logic programmer about changes to the API 
code, so the affected parts can be reviewed and changed if 
required.

---

Here comes the feature request: the API programmer may forget 
to update the hash value in the code. Also, comments in the code 
shouldn't affect the hash value. This needs to be automated at 
compile time: the compiler calculates the hash value of the code 
automatically and makes it readable at compile time. Then no 
constant is required to store the hash value.

What is needed is to be able to bind a hash value to any block 
with a name.
Nov 26 2015
next sibling parent reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 [...]
One applicable solution: __traits( hashOf, apiFunctionName/structName/variableName/className )
Nov 26 2015
parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 26 November 2015 at 11:14:54 UTC, tcak wrote:
 One applicable solution: __traits( hashOf, apiFunctionName/structName/variableName/className )
Can't you calculate hash of involved files at compile time?
Nov 26 2015
parent reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
On Thursday, 26 November 2015 at 11:18:19 UTC, Andrea Fontana 
wrote:
 Can't you calculate hash of involved files at compile time?
One file can contain many API functions. If it has 50 functions and only one of them has been modified, the whole hash changes, and the compiler cannot tell which API has changed. The purpose is to take the burden off the programmer and put it on the compiler.
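
The granularity problem described above can be sketched as follows. This is only an illustration under stated assumptions: `fnv1a` is a hand-written FNV-1a hash chosen because it is simple enough to evaluate at compile time, and the `addUser`/`delUser` source strings are made-up stand-ins for real API code. In a real build the text of a whole file could be pulled in with a string import, `import("api.d")`, together with the compiler's `-J` switch.

```d
/// 32-bit FNV-1a; small enough to be evaluated at compile time (CTFE).
uint fnv1a(string s) pure nothrow @safe @nogc
{
    uint h = 2166136261u;      // FNV offset basis
    foreach (char c; s)
    {
        h ^= c;
        h *= 16777619u;        // FNV prime
    }
    return h;
}

// Sanity check against the published FNV-1a test vector for "a".
static assert(fnv1a("a") == 0xE40C292C);

// Hypothetical stand-ins for the source text of two API functions.
enum addUserV1 = "void addUser(string name) { store(name, 0); }";
enum addUserV2 = "void addUser(string name) { store(name, 1); }";
enum delUser   = "void delUser(string name) { erase(name); }";

// File-level hash: it changes, but does not say which function changed.
static assert(fnv1a(addUserV1 ~ delUser) != fnv1a(addUserV2 ~ delUser));

// Per-declaration hashes: only addUser is flagged, delUser is unaffected.
static assert(fnv1a(addUserV1) != fnv1a(addUserV2));
static assert(fnv1a(delUser) == fnv1a(delUser));

void main() {}
```

Getting the per-declaration source text automatically is the part that still needs tooling or compiler support; the sketch only shows why a single hash per file is too coarse.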
Nov 26 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-11-26 12:24, tcak wrote:

 One file can consist of many API functions. If there are 50 functions in
 it, and only 1 of them has been modified, whole hash will change.
 Compiler cannot tell which API has been changed then. Purpose is to
 decrease the burden on programmer, and put it onto compiler.
With a complete D front end working at compile time it would at least be possible in theory. -- /Jacob Carlborg
Nov 26 2015
prev sibling next sibling parent qznc <qznc web.de> writes:
On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 You are in a big team or working with a big code base. APIs are 
 being defined/modified, configuration constants are 
 defined/modified, structures are defined/modified for data.

 You are coding on business logic side, and relying everything 
 based on current APIs, configuration, and data structures. A 
 part of codes have been updated on API side, but you are not 
 aware of it, or time has passed, and you assume that your code 
 will work properly. Nobody would be checking every single part 
 of business logic line by line.
This is the job of the type checker, isn't it? What would a hash provide that a type checker does not?
Nov 26 2015
prev sibling next sibling parent Idan Arye <GenericNPC gmail.com> writes:
On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 [...]
So it's not just the function's signature you want to hash, but its code as well? What about functions called from the API function? Or functions that set data that'll later be used by the API functions?

If anything, I would have hashed the unittests of the API function. If the behavior of the API function changes in a fashion that requires a modification of the unittest, then you might need to alert the business logic programmers. Anything less than that is just useless noise that'll hide the actual changes you want to be warned about among the endless clutter created by trivial changes.
Nov 26 2015
prev sibling next sibling parent bitwise <bitwise.pvt gmail.com> writes:
On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 [...]
I'm wondering if a diff tool could be somehow combined with a parser to create a list of functions/symbols which may have experienced behavioural changes between versions of dmd.

What I'm suggesting is a diff tool which is aware of a symbol's dependencies, so that even if a function body wasn't changed, its dependent symbols could be checked as well.

If such a tool existed, it could be run against each new release of dmd and produce a comma separated list of functions that may have experienced behavioural changes. With that list in hand, one could then simply grep for each symbol in their own repository each time they upgrade dmd.

I hereby place this idea in the public domain ;)

Bit
Nov 26 2015
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
I see many solutions here that do not require any language change. 
To start, have a linter yell at the programmer when (s)he submits 
a diff. Devs commit directly? What the fuck are you doing? Do 
code review and get a linter.

Alternatively, generate a di file and hash it. You can have a bot 
do it and commit with a commit hook.

DMD can dump info about the program in JSON format. Hash this 
and run with it.

You may also change your strategy in terms of source control: 
https://www.youtube.com/watch?v=W71BTkUbdqE . Unified source code 
alleviates these kinds of issues completely, to boot.
Nov 26 2015
parent reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
On Friday, 27 November 2015 at 05:33:52 UTC, deadalnix wrote:
 I see many solution here that do not require any language 
 change. To start, have a linter yell at the programmer when 
 (s)he submit a diff. Dev commit directly ? What the fuck are 
 you doing ? Do code review and get a linter.

 Alternatively, generate a di file and hash it. You can have a 
 bot do it and commit with a commit hook.

 DMD can dump infos about the program in json format. hash this 
 and run with it.

 You may also change your strategy in term of source control: 
 https://www.youtube.com/watch?v=W71BTkUbdqE . Unified source 
 code aleviate completely these kind of issues to boot.
Not one of your solutions gives a simple solution like:

static assert( __traits( hashOf, std.file.read ) == 0x1234, "They have changed implementation again." );

static assert( __traits( hashOf, facebook.apis.addUser ) == 0x5543, "Check API documentation again for addUser." );

A di file wouldn't work. It doesn't contain the implementation code. Also, all APIs are in it. We need a specific hash for each API, so that it doesn't take a long time to find where the problem is.

JSON is the same as di. No difference.

Yours are not helping, making everything more complex.
Nov 27 2015
next sibling parent reply bitwise <bitwise.pvt gmail.com> writes:
On Friday, 27 November 2015 at 08:09:27 UTC, tcak wrote:
 Yours are not helping, making everything more complex.
Yes, because to achieve what you're asking for, you NEED a complex solution.

The code WILL change with every release. That's the point of a release. So any hashing mechanism like you're describing will just trigger every time, making it useless. Even if this was not the case, you still wouldn't know where the changes were.

Bit
Nov 27 2015
parent reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
On Friday, 27 November 2015 at 16:18:52 UTC, bitwise wrote:
 [...]
Let me explain: it is not complex. What makes it complex is that you envision something very detailed.

Hash of a Function = MD5( Token List of Function /* but ignore comments */ );

You do not have to know where the changes are. You need to know what has changed and, briefly, how it acts now. If the behaviour of the code changes, it is good that you know about it. With the hashing method above, a piece of code that hasn't changed always has the same hash value.

And if you do not like it, don't check the hash value. Just continue writing your code as you wish. But from a business perspective, if the software's consistency is worth millions of dollars, a software engineer would want an error whenever the code changes. Do we want D to be a child language, or to have more useful features?
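
A rough sketch of the "hash the code but ignore comments" idea, done outside the compiler. Everything here is an assumption made for illustration: `normalize` is a hypothetical helper, not a real D lexer (it does not understand string literals or nested `/+ +/` comments), and it only approximates a token list by dropping `//` and `/* */` comments and collapsing whitespace. MD5 comes from Phobos' `std.digest.md`.

```d
import std.digest : toHexString;
import std.digest.md : md5Of;

/// Naive normalisation: drops // and /* */ comments, collapses whitespace.
string normalize(string src) pure @safe
{
    string result;
    size_t i = 0;
    bool pendingSpace = false;
    while (i < src.length)
    {
        if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '/')
        {
            // Line comment: skip to end of line, treat as whitespace.
            while (i < src.length && src[i] != '\n') ++i;
            pendingSpace = true;
        }
        else if (i + 1 < src.length && src[i] == '/' && src[i + 1] == '*')
        {
            // Block comment: skip to the closing */ (or to end of input).
            i += 2;
            while (i + 1 < src.length && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            i = (i + 1 < src.length) ? i + 2 : src.length;
            pendingSpace = true;
        }
        else if (src[i] == ' ' || src[i] == '\t' || src[i] == '\n' || src[i] == '\r')
        {
            pendingSpace = true;
            ++i;
        }
        else
        {
            // Emit one space between chunks, never at the start.
            if (pendingSpace && result.length > 0) result ~= ' ';
            pendingSpace = false;
            result ~= src[i];
            ++i;
        }
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    enum a = "void f() { g(); } // old comment";
    enum b = "void f()  {  g();  } /* new comment */";

    // Comment and spacing edits do not change the digest.
    assert(normalize(a) == normalize(b));
    assert(md5Of(normalize(a)) == md5Of(normalize(b)));
    writeln(toHexString(md5Of(normalize(a))));
}
```

A version built on a real lexer would also treat `if(1)` and `if (1)` as identical; this sketch does not.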
Nov 27 2015
parent reply bitwise <bitwise.pvt gmail.com> writes:
On Friday, 27 November 2015 at 18:51:54 UTC, tcak wrote:
 [...]
Your approach is prone to false positives.

if(1) doSomething();

if(1) { doSomething(); }

Same behaviour, different code. I hope you have a heck of a coding standard written up ;)

Worse still, consider the following example:

void foo() { if(bar()) deleteSomeFiles(); }
int bar() { return 0; }

Your proposed approach would not notify you that foo(), a potentially dangerous function, has changed its behaviour if someone made bar() return 1.

*insert witty comeback to your comment about "business perspective" here*

Bit
Nov 27 2015
parent reply tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:
On Friday, 27 November 2015 at 20:00:16 UTC, bitwise wrote:
 [...]
Question: has the behaviour of foo changed? If foo cares about bar's behaviour, foo checks bar's hash value.

--

if(1) doSomething();
if(1) { doSomething(); }

You are correct here about the hash calculation, but unless someone touches the code, this never happens and no hash change is seen. If someone does touch it as in your example, checking the documentation to see what has happened would be the correct approach. How important a behaviour change is depends on perception; the computer cannot know that.
Nov 27 2015
parent qznc <qznc web.de> writes:
On Friday, 27 November 2015 at 20:19:40 UTC, tcak wrote:
 if(1) doSomething();
 if(1) { doSomething(); }

 You are correct here about hash calculation, but unless someone 
 touches to codes, this never happens, and no hash changes would 
 be seen. If someone is touching it as you exampled, checking 
 the documentation about what has happened would be the correct 
 approach. Importance of behaviour change is perceptional, 
 computer cannot know that already.
If you really want to integrate this into the language, you should consider future improvements. Hashing the tokens is a conservative approximation of "behavior change", as the example above shows. Another example would be variable renames. The specification of the hash algorithm should provide the freedom that both variants above get the same hash, but still be correct in the sense that different behavior always yields different hashes.

Overall, I'm not convinced that this needs to be a language extension or trait. It could simply be a static analysis tool independent of the compiler.
Nov 27 2015
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On Friday, 27 November 2015 at 08:09:27 UTC, tcak wrote:
 [...]
If the API signature changes, the type system will yell at you. All the proposed solutions will work.

If the implementation changes, you can apply the same solution to the binary, tadaaa! If you want fewer hash changes, a good idea can be to dump LLVM IR from ldc and run canonicalization on it using opt.

Also, if you have so much code that relies on implementation details that aren't in the API, to the extent that it is such a problem that you need a language extension to handle it, you are doing something very, very wrong.

Indeed I'm not helping. You think you need a language extension, when it is quite obvious you have a methodology problem on your side and refuse to reconsider.

What about, I know it is crazy, using a unified repository, having tests and continuous integration, and submitting diffs with code review? If someone changes an API in a way that breaks the client code, the client will fail and the CI tool will warn the developer that he needs to fix the client code or rework his API change. If the client code was not tested, then the problem is clearly not the API hash.

Not only does this not require a language extension, it solves way more problems than the one you want to solve here.

Now, don't get me wrong, I know how it is. Companies with a broken work culture won't change anything unless they are on the edge of bankruptcy. I understand. This is how it works. Please understand that, on the other side, it doesn't seem like the right move to export a broken work environment as language features.
Nov 27 2015
prev sibling parent Nordlöw <per.nordlow gmail.com> writes:
On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 What is needed is to be able to bind a hash value to any block 
 with a name.
I've thought about this too in the past and asked on the forums, but I haven't gotten any response. It is possible. The problem is easier in dynamic languages. See for instance the following solution in a specific Python runtime: http://pgbovine.net/incpy.html

`hashOf` is for AAs, not for content digests.

I believe the only realistic solution to this problem is to implement a specific pass in the D compiler that recursively calculates hash digests (hash chains) for all the code and data involved in a function call. It should probably only work for pure functions. AFAICT, it is possible, but it's far from easy to get 100% correct :)

DMD pull requests would be very welcome, at least by me ;)

See also: https://en.wikipedia.org/wiki/Hash_chain
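
The recursive digest idea can be illustrated with a small runtime sketch. It is not a compiler pass, and it makes several assumptions: the `Decl` table with each function's source text and callee list is written by hand here (a real implementation would get it from the compiler), the call graph is assumed to be acyclic, and `fnv1a`/`mix` are simple stand-ins for a proper digest. It reuses the foo()/bar() example from earlier in the thread.

```d
/// 32-bit FNV-1a over the source text of one declaration.
uint fnv1a(string s) pure nothrow @safe @nogc
{
    uint h = 2166136261u;
    foreach (char c; s)
    {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

/// Folds a callee's hash into the caller's hash.
uint mix(uint h, uint v) pure nothrow @safe @nogc
{
    return (h ^ v) * 16777619u;
}

/// Hypothetical record: a function's source text and the functions it calls.
struct Decl
{
    string source;
    string[] callees;
}

/// Hash of a function chained with the hashes of everything it calls.
/// Assumes the call graph has no cycles.
uint chainedHash(string name, const Decl[string] decls)
{
    auto d = decls[name];
    uint h = fnv1a(d.source);
    foreach (callee; d.callees)
        h = mix(h, chainedHash(callee, decls));
    return h;
}

void main()
{
    Decl[string] v1 = [
        "foo": Decl("void foo() { if (bar()) deleteSomeFiles(); }", ["bar"]),
        "bar": Decl("int bar() { return 0; }", [])
    ];

    // Same program, except that bar() now returns 1.
    auto v2 = v1.dup;
    v2["bar"] = Decl("int bar() { return 1; }", []);

    // foo's own text is unchanged, so a plain per-function hash sees nothing.
    assert(fnv1a(v1["foo"].source) == fnv1a(v2["foo"].source));

    // The chained hash of foo changes because bar changed.
    assert(chainedHash("foo", v1) != chainedHash("foo", v2));
}
```

Global data, function pointers, and recursion are exactly the cases that make a complete version hard, which is why restricting it to pure functions is suggested above.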
Nov 27 2015