written by Walter Bright
December 22, 2009
C is arguably the world's most successful programming language. Its success has, of course, endlessly tempted people to improve upon it. Thus, C is probably the patriarch of the longest list of languages. Notable among these are C++, the D programming language, and most recently, Go. There are endless discussion threads on how to fix C, going back to the 80's.
So this is well-trodden ground. What could possibly be added to this soup? I posit that most such discussions center around detail. More interesting is the question of what the largest fundamental mistake is. We should take into account the context of the times that spawned C, the problems it was trying to solve, and the environment in which it was intended to be used. Keep in mind it was developed for a 16-bit machine, with extremely limited resources available. I'd like to dismiss complaints that it doesn't do garbage collection, functional programming, dynamic typing, or OOP. Those aren't problems C attempted to address, so the lack of them is not a mistake.
What mistake has caused more grief, more bugs, more workarounds, more endless hours consumed, etc., than any other? Many people would say null pointers. I don't agree.
Conflating pointers with arrays.
I don't mean them using the same syntax, or the implicit conversion of arrays to pointers. I mean the inability to pass an array to a function as an array, even if it is declared to be an array. C will silently convert the array to be a pointer, and will rewrite the function declaration so it is semantically a pointer:
void foo(char a[])
is exactly equivalent to:
void foo(char *a)
This seemingly innocuous convenience feature is the root of endless evil. It means that once arrays leave the scope in which they are defined, they become pointers, and lose the information which gives the extent of the array — the array dimension. What are the consequences of losing this information?
An alternative must be used. For strings, it's the whole reason for the 0 terminator. For other arrays, it is inferred programmatically from the context. Naturally, every situation is different, and so an endless array (!) of bugs ensues.
The trainwreck just unfolds in slow motion from there.
The galaxy of C string functions, from the unsafe strcpy() to sprintf() onwards, is a direct result. There are various attempts at fixing this, such as the Safe C Library. Then there are all the buffer overflows, because functions handed a pointer have no idea what the limits are, and no array bounds checking is possible.
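The loss of the dimension described above can be observed directly. The following is a minimal sketch, not taken from the original text; the function names `size_seen_by_callee` and `length_recovered_by_scan` are made up for illustration. The prototype uses array syntax and the definition uses pointer syntax, and the compiler accepts the pair because the two declarations have the same type.

```c
#include <stddef.h>
#include <string.h>

/* Prototype written with array syntax... */
size_t size_seen_by_callee(char a[]);

/* ...and the definition written with pointer syntax. The compiler accepts
   the pair because both declare a parameter of type `char *`. */
size_t size_seen_by_callee(char *a)
{
    return sizeof a;    /* size of a pointer, whatever array was passed */
}

/* The usual workaround for strings: rediscover the length by scanning
   for the 0 terminator. Nothing here knows the capacity of the buffer. */
size_t length_recovered_by_scan(char a[])
{
    return strlen(a);
}
```

Given `char buf[64] = "hello";` in the caller, `sizeof buf` is 64 there, while `size_seen_by_callee(buf)` returns `sizeof(char *)` (commonly 8 on 64-bit platforms) and `length_recovered_by_scan(buf)` returns 5. The 64 is gone, which is why a function such as strcpy() cannot check it.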
This problem was inherited in toto by C++, which consequently spawned 10+ years of attempts to create a usable string class. Even the eventual std::string result is compromised by its need to be compatible with C 0-terminated strings. C++ addressed the more general array problem by inventing std::vector.
C99 attempted to fix this problem, but the fatal error it made was still not combining the array dimension with the array pointer into one type.
But all isn't lost. C can still be fixed. All it needs is a little new syntax:
void foo(char a[..])
meaning an array is passed as a so-called “fat pointer”, i.e. a pair consisting of a pointer to the start of the array, and a size_t of the array dimension. Of course, this won't fix any existing code, but it will enable new code to be written correctly and robustly. Over time, the syntax:
void foo(char a[])
can be deprecated by convention and by compilers. Even better, transitioning to the new way can be done by making the declarations binary compatible with older code:
#if NEWC
extern void foo(char a[..]);
#elif C99
extern void foo(size_t dim, char a[dim]);
#else
extern void foo(size_t dim, char *a);
#endif
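Since `char a[..]` is proposed syntax and no current C compiler accepts it, the idea can only be approximated today. The following is a rough sketch of a fat pointer emulated with a struct, under the assumption that the pair is laid out as a pointer plus a size_t; the names `char_slice`, `SLICE_OF`, `slice_set`, and `fill` are hypothetical and are not part of the proposal.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a `char a[..]` parameter:
   the pointer and the dimension travel together as one value. */
typedef struct {
    char  *ptr;   /* start of the array */
    size_t dim;   /* number of elements */
} char_slice;

/* Build a slice from a real array while its dimension is still known.
   Only valid on an actual array, not on a pointer. */
#define SLICE_OF(arr) ((char_slice){ (arr), sizeof (arr) / sizeof (arr)[0] })

/* Bounds-checked store: refuses to write past the end. */
static bool slice_set(char_slice s, size_t i, char c)
{
    if (i >= s.dim)
        return false;
    s.ptr[i] = c;
    return true;
}

/* The callee needs no separate length argument and no terminator. */
static size_t fill(char_slice s, char c)
{
    size_t n = 0;
    while (slice_set(s, n, c))
        n++;
    return n;   /* number of elements written, equal to s.dim */
}
```

With `char buf[16];`, the call `fill(SLICE_OF(buf), 'x')` writes exactly 16 characters and returns 16, and `slice_set(SLICE_OF(buf), 16, 'y')` is rejected. What the struct cannot provide is the compiler support the proposal asks for: the dimension is captured by a macro the programmer must remember to use, rather than by the language itself.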
This change isn't going to transform C into a modern language with all the shiny bells and whistles. It'll still be C, in spirit as well as practice. It will just relieve C programmers of dealing with one particular constant, pernicious source of bugs.
- The relevant text from K+R's The C Programming Language 5.3 is “When an array name is passed to a function, what is passed is the location of the beginning of the array. Within the called function, this argument is a variable, just like any other variable, and so an array name argument is truly a pointer, that is, a variable containing an address.”
- The relevant text from the C99 Standard 6.7.5.3 paragraph 7 is “A declaration of a parameter as ‘array of type’ shall be adjusted to ‘qualified pointer to type’”
- From Stroustrup's The C++ Programming Language first edition 7.1 “and as usual, array names are converted to pointers.”
Thanks to Bartosz Milewski for reviewing a draft of this.