written by Walter Bright
March 19, 2008
The software industry is gripped with a plague of malware. Although there are many forms of malware, this article is about malware that gets in by exploiting software bugs that enable specially crafted user data to be executed as code. It seems there's almost a daily alert that some software package or other is vulnerable to some exploit. Whole industries have sprung up to defend our computers from it, but even anti-virus software itself is sometimes a vector for malware.
As software programmers, we have a responsibility to do what we can to make our software invulnerable to exploits. We do our best, but the parade of vulnerabilities continues. We try to be better programmers, and some have even suggested government certification of programmers to try to mitigate the problem. But obviously, something isn't working in the way we write software.
What makes software vulnerable to exploits? The most common weakness seems to be buffer overflows, or something equivalent. Carefully constructed data is fed to the program, it overflows a buffer somewhere, and then the user data is executed as code. How do we stop this? We go through our programs line by line, vetting them, looking for buffer overflows, failure to check for error conditions, wild pointers, etc. There are some automated code scanners like Coverity and Fortify, but while they find many patterns of bugs, they are not capable of guaranteeing the code is memory safe. A further problem is that they are expensive and not readily available.
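The overflow mechanism described above can be sketched in C. This is a hypothetical illustration (the function names are invented for this example, not taken from any real codebase): the first function copies user data into a fixed-size buffer with no length check, so a long enough, carefully crafted input overwrites adjacent stack memory, including, potentially, a return address; the second bounds the copy so the worst case is truncation.

```c
#include <string.h>

/* Classic overflow pattern: user input copied into a fixed-size buffer
 * with no length check. If input is longer than the buffer, strcpy
 * writes past the end of buf, corrupting adjacent memory -- which is
 * how carefully crafted user data can end up executed as code. */
void copy_unchecked(char *buf, const char *input) {
    strcpy(buf, input);  /* no bounds check */
}

/* Memory-safe variant: the copy is bounded by the destination's size
 * and the result is always NUL-terminated, so overlong input is
 * truncated instead of overflowing. */
void copy_checked(char *buf, size_t bufsize, const char *input) {
    strncpy(buf, input, bufsize - 1);
    buf[bufsize - 1] = '\0';
}
```

The point of the safe subset idea discussed below is that a checker can mechanically reject the first pattern, rather than relying on a human reviewer to notice it.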
What if it were possible to mathematically prove a program was not subject to malware? By this I mean it would be impossible for any user data to be executed as code. This property is called memory safety. If we could prove such a thing, we can just mechanically run source code through the prover, and if it got the green light, we're golden. No more line by line vetting, no more praying we didn't miss anything.
Doesn't this sound like a pipe dream? It does to me; how could one possibly prove such a thing? But it turns out it has been done for some languages, languages like Java. Being able to prove memory safety for a language means that proving a program memory safe is as straightforward as successfully running it through the compiler. Java has been mathematically proven to be type safe, of which memory safety is a subset. Java programs cannot have buffer overflow exploits (as long as those Java programs are 100% Java programs, and do not do things like call native DLLs that themselves might be subject to malware, and as long as the Java VM itself doesn't have bugs in it).
But Java isn't for everyone. How can we prove memory safety in other languages, particularly the D programming language? Being a systems programming language, D has things like arbitrary pointers, so obviously memory safety is impossible. Or maybe it is possible. Bartosz Milewski has suggested that since D is a powerful programming language even without pointers, it may be practical to define a safe D subset. We are examining each feature of D to find the largest possible subset of the language that is memory safe. Then, if this subset is itself a practical language, D can be a major contribution towards eliminating malware.
Obviously some applications will need to use non-memory-safe parts of D or call into libraries that are themselves not guaranteed memory safe. But the more those applications use the safe subset of D, the less code has to undergo extra scrutiny for these sorts of problems.
Thanks to Brad Roberts for his insightful comments on this.