Grouping functions that call each other has other benefits, even on machines that have sufficient memory to run the application without swapping:
- Called functions will be more likely to be in the memory cache already.
- When Windows loads a program or DLL, it doesn't actually read it into memory. It simply maps an address range onto the file. Then, when the program executes, pages of the program are swapped into memory as they are referenced. Grouping functions that call each other will minimize the amount of the program that actually has to be read in from disk, thus minimizing load time.
What is necessary is to gather this information at runtime, preferably while running a typical application of the program. Such dynamic profiling information can then be processed to yield an order in which functions should be placed by the linker.
Dynamic Profiling With DMC++This capability is now built in to Digital Mars C++. It is currently only implemented for Win32 systems, but the results should be directly applicable to a 16 bit version of the same program.
Setting UpIt's enabled by two switches working together: -gt and -Nc.
-gt by itself inserts profiling code into every function, and when the program terminates, it prints out a log showing for each function:
- What other functions call it and how often (called the "fan in")
- What functions it calls and how often (the "fan out")
#define _DEBUG_TRACE 1If -Nc (make all functions COMDATs) is thrown as well, profiling code is only inserted into global COMDAT functions, since it is only global COMDAT functions that the linker can adjust the link order on.
If there are many static functions, and those are to be profiled and grouped as well, throw the -gg (make static functions global) also. This assumes, of course, that all static function names in the -gg modules are distinct from one another and all global names.
The program is linked normally.
Dynamic Profiling With DMD
The D compiler can also do this. Compile with the -profile switch.
Gathering the DataRun the program in the manner it is typically used. The trace code will accumulate the information, and when the program is terminated, the information will be written to files trace.log and trace.def. trace.log is a readable text file with a list of all the traced functions, along with who calls them (fan in) and who they call (fan out), and each time this occurs.
Following this is listed timing data. The data are:
|Num Calls||Number of times the function was called.|
|Tree Time||Time used by function plus the tree times of all the functions it calls.|
|Func Time||Time used by a function excluding the time used by the functions it calls (times for functions it calls that are not compiled with -gt, like RTL functions, are included in Func Time).|
|Per Call||(Func Time) / (Num Calls)|
If multiple runs are done with the program, the profiling data is merged and summed with any previously existing trace.log file.
The final output is the trace.def file, which is meant to be included in the linker module definition file. It contains the list of the functions in the order the linker should link them.
Modifying Dynamic Profiling BehaviorThe behavior of the profiler can be modified with several functions that can be explicitly called in the program. They are prototyped in trace.h. They are implemented aa macros that expand to nothing if -gt is not thrown.
void trace_term()Normally, this function is called when the program exits as part of the static destructor list. If this list is not called, or you wish to stop gathering data before the end of the program (for instance, to optimize for fast load), then insert a call to trace_term() at the point where it is to stop.
int trace_setdeffilename(const char *name)Set file name to use for .def file instead of trace.def. Returns: 0 for success, non-zero for failure
int trace_setlogfilename(const char *name)Set file name to use for .def file instead of trace.log. Returns: 0 for success, non-zero for failure
Optimizing Program Load TimeTo optimize the program for fast loading, rather than fast running, gather the profile data only as far as the program is loaded. Then, call trace_term() to terminate the data gathering and write the output files.
- The library does not have profiling instrumentation in it, so the library functions are not grouped by the linker, nor is any profiling information collected for library functions. The extra code linked in to do the profiling makes use of the library functions, so if the library functions are instrumented, some infinite loops can occur. So don't recompile the library with -gt.
- Functions declared with __declspec(naked) never get profiling instrumentation added to them.
- The profile data gathering code will get confused if non-local jumps occur when things like setjmp/longjmp, Windows NT structured exceptions, or C++ exceptions are used. If this profiling tool is well used by customers, we'll address these shortcomings.
- The timing is based on the Win32 performance timing API, which is accurate to about one microsecond. Unfortunately, any paging or any other tasks running can erratically change the timings. There is no real way around this besides just running the tests several times and averaging the results.
- The overhead of the instrumentation code can distort the measurements when the function times are very short. The profiler tries to compensate for this, but the overhead times are not always consistent.
- Profiling of multithreaded apps is not supported.
- Multiple EXEs and DLLs can be profiled simultaneously, but the log files and def files must all be set to different names, or else they will step on each other.