written by Walter Bright
April 4, 2009
I recently replaced the hard disk drive in my crusty old laptop with an SSD (Solid State Drive). It really seems barbaric in our digital age that we all have motors and levers and gears and spinning things in our boxes. I was anxious to step into the next revolution.
The improvement was dramatic. It boots up promptly, and applications load crisply. It’s like a new machine. Of course, I can’t be sure this was entirely due to the SSD, as I also reinstalled Windows thereby (possibly) removing legions of viruses, trojans, malware, adware, and all the usual barnacles that accumulate on the hull of Windows.
But it still did appear to be fundamentally much faster, leading me to think about how disk I/O slows things down. I thought back to a technique Steve Russell told me about many years ago, of doing all the disk I/O in a separate thread while doing the computations in the main thread, and the dramatic throughput speedups he’d get with it. While the D programming language is designed to be friendly to being compiled in parallel threads, the dmd compiler itself was still stolidly single threaded. I was just itching to try out some parallel processing.
The first thing to try was Steve’s idea, as that didn’t require much re-engineering of the compiler. All the files to read were stacked up into an array, and a thread was fired off to read those files in the array sequentially, setting an event at the completion of each file read. That thread was also set to high priority, so it would hurry up and block on I/O. Meanwhile, back at the ranch, the foreground thread would grind away at the lexing and parsing of the previous read file in parallel with the reading of the next file.
My first test was building the D runtime library, Phobos, which was about 40 files passed on the command line (dmd can build the library in one operation). It takes about 8 seconds. Switching the code between multithreaded and single threaded and timing the results, I was frustrated by there being no measurable speedup. Zip. Nada. Nothing.
Time to start instrumenting the code. I discovered that all 40 files were read in something like one hundredth of a second. In an 8 second build, that’s not even consistently measurable. I realized that the operating system must have cached all those files I was repeatedly reading into fast ram. So I had to find a way to disable the cache in order to get any illuminating numbers. My first thought was rebooting Windows between runs, but my main dev machine takes forever to reboot, and I’m not up for that. It then occurred to me that I could load all the files on an SD card and plug it into a USB port. To flush the cache, all I needed to do was pull the card out and plug it back in in between timing tests.
Of the 8 second build time, which is an optimized build, about 1.5 seconds is the lex and parse time. With the multithread I/O turned on, this dropped to a bit over a second. Finally, I was seeing an improvement.
It’s nothing to crow about, but then, it’s not like I spent a lot of time on it, either. But it does point to the lexing and parsing time being around a second, so presumably if one had 40 cores that could be cut to 1/40 of a second. That’s some real savings. So my next foray into parallelizing the compiler should be to run the lex/parse of each module in a separate thread. Then, as the user adds more cores, the compile accelerates in a satisfying manner.
P.S. As Andrei Alexandrescu pointed out, if the compiler is fast enough then you can use a native compiler like it was an interpreter. dmd is already fast enough to use it that way (see rdmd), but it’s fun to see how fast we can make it fly.
Thanks to Bartosz Milewski and Andrei Alexandrescu for reviewing a draft of this.