Infrastructure for Profile Driven Optimizations

August 29, 2001

Jan Hubicka, SuSE Labs, together with Richard Henderson, Red Hat, and Andreas Jaeger, SuSE Labs, has contributed infrastructure for profile driven optimizations.

In numerous cases, the optimizations performed by a compiler trade one objective for another one, for instance size for speed, or speed in one execution path for speed on other. Poor choices on such decisions may lead to programs that are even bigger and slower than if compiled with a less sophisticated compiler.

To assist optimizers in those decisions, it is useful to have information about the runtime behavior of a program (expected probabilities of branches and frequencies of executions of given program blocks) - the so called program profile. Jan Hubicka has extended GCC so that profile information is more readily available to various optimizers in GCC and modified some existing optimizers to use this information.

Profile feedback

The most natural way to determine a program profile is to compile an instrumented program, execute it, store the information about the runtime behavior and then recompile it again reading the data back into the compiler. GCC has had this feature for several years now, originally contributed by James E. Wilson and Bob Manson, Cygnus Support.

This information has not been widely used by the compiler - until GCC version 3.0 it has been only used to generate branch prediction hints for some architectures. GCC 3.0 also uses it for the basic block reordering pass. Both optimization passes only used information about branch direction, but not about the frequency of execution of a basic block. The latter information could not be used since it was not directly available to these passes and subsequent optimization passes destroyed the profile.

The first contribution to this project was to revamp large parts of the GCC back end to maintain a consistent copy of the control flow graph so that all optimization passes can easily get correct profile information. As part of this rework, the jump optimization pass has been rewritten. The new jump optimization should be faster and easier to maintain and also perform more simplifications. The profiler has been revamped too and execution counts are now attached directly to the basic block and edges.

Additionally, the profile counters now use 64-bit, so they do not overflow on fast 32-bit machines (this affects the gcov utility too) and several bugs have been fixed.

Static program profile

A major problem of profile driven optimization is ease of use. Users are required to compile their program twice. Additionally, designing good training runs is a complex issue and nearly impossible for some programs. As this issue is very important for GCC - commonly used for free software development - we took great care to provide an alternative.

GCC contains a static branch predictor which is able to guess the common direction of any branch without an experimental run based on [1]. The predictor consists of a set of simple heuristics that expose common behavior of programs, for instance that loops usually loop more than once, pointers are non-null and integers usually positive. The original predictor has been contributed by Stan Cox and Jason Eckhardt for the basic block reordering pass.

For this project the predictor has been extended to use Dempster-Shaffer theory [2] to combine the used heuristics to give the expected branch probability and a new mechanism has been added for other optimization passes of GCC to annotate branches. For instance, the loop optimizer is sometimes able to compute the exact number of iterations of a given loop and now it can add the prediction, so the branch predictor will be aware of this information, too.

Finally, a new pass has been implemented to propagate the edge probabilities into expected frequencies of executions of the individual basic blocks, so that the static profile looks identical to the feedback driven profile for the rest of the compiler. Wu and Larus [2] report that this algorithm can accurately identify hot spots in a program even at intraprocedural level.

Automatic tester

In order to further analyze the used heuristics, information about the chosen heuristics is output optionally to a debug file. The AWK script analyze_branches (part of GCC's contrib directory) can be used to compare the guesses with a real profile. Andreas Jaeger has set up an automated tester performing daily SPEC2000 runs with and without profile driven feedback available at http://www.suse.de/~aj/SPEC/. (Update 2007-08-21: please refer to our benchmarking page for up-to-date pointers to automated testers.)

While analyzing the branch predictors we discovered some severe bugs in their implementation, e.g. loop exits had been predicted as taken. The result of the analyze_branches script on the SPECint2000 runs is manually fed back into GCC as basic probabilities and used thereby to compute the expected outcome of the implemented heuristics for branches.

The experimental results show that the current implementation of branch predictors successfully guesses about 76% of the branches (compared to 70% reported by [1]). About half of the branches are guessed with 90% success. A perfect branch predictor based on the profile feedback guesses 94% of the branches correctly.

It is natural that the static profile is inferior to the profile used by a feedback driven compilation. Feedback driven compilation produces faster and smaller programs, but the difference is not that drastic at the moment. In the case of SPEC2000 integer tests, it gives an overall difference of about 3%. We hope to enlarge this gap in the future by better use of the profile information and by implementing better static predictors. As reported in [3], the benefit for real world applications is higher than for benchmarks, as applications tend to have larger working sets and benefit more from reduced code size.

Optimizations performed

The register allocator has been modified to use frequencies to compute the expected cost for choosing proper register classes and for computing priorities for the allocation itself. The experimental runs have shown that the current register allocator correctly incorporates the new information and works considerably better than with the original set of heuristics, especially on register starved architectures, such as ia32.

The reg-stack pass has been enhanced to optimize the common paths of code at the expense of uncommon paths. This decreased random peaks in the benchmark results, but did not bring as large improvements as the previous change.

Code alignment decisions are now based on profile information, avoiding useless alignment of infrequently executed regions, e.g. loops that iterate only a few times. For i386, the savings is about 7% of final executable size.

The basic block reordering and if-conversion passes, which already made use of probability information, benefit from more accurate information.

Future work

Still there are some patches awaiting review: A patch for better choice between slow and small or big and fast prologue/epilogue code on i386 and x86-64 architectures and also some small improvements to the branch prediction algorithms.

A number of further optimizations are possible. For instance [3] describes superblock formation, loop peeling, loop inlining and some other minor optimizations. Work continues on a separate branch to introduce better infrastructure for control flow graph manipulation (such as code duplication) that will make implementing such optimizations easier.

It would also be nice to modify the current loop optimizer to preserve the flow graph and use this information to control the optimizations performed, such as loop unrolling, peeling or strength reduction [8], [3].

The splitting, peephole2 and final passes can be modified to produce smaller code in infrequently executed parts of functions to avoid code bloat and cache pollution.

The basic block reordering algorithm can be considerably improved and extended for code replication, as described in [5] and [6], to optimize branch prediction, cache and instruction fetch performance.

The predicated execution framework can be used for hyperblock formation [8] and possible reverse if-conversion [9] on architectures not supporting predicated execution.

There is room from improvement in branch prediction. Patterson describes branch prediction using an improved value range propagation pass [4] that has significantly better accuracy. A number of other simple heuristics can be added.

The basic block profiler needs to be extended to work for multi-threaded programs.

Credits

Richard Henderson has done some excellent work on reviewing all patches, sending lots of valuable suggestions and helping to fix a number of problems created by the control-flow graph revamp and jump pass rewrite. He also contributed the if-conversion pass replacing lots of functionality of the original jump pass and the new CFG infrastructure. Without his contributions this project would have been much more difficult, if not impossible.

Jan Hubicka contributed the patches to GCC including the analyze_branches script.

Andreas Jaeger has created the automatic tester based on scripts written by Diego Novillo, Red Hat, for his SPEC95 tester. Such testers are an important step to increase code quality in future GCC releases. For more information about these testers, check our benchmarks page.


References

[1]
Branch Prediction for Free; Ball and Larus; PLDI '93.
[2]
Static Branch Frequency and Program Profile Analysis; Wu and Larus; MICRO-27.
[3]
Design and Analysis of Profile-Based Optimization in Compaq's Compilation Tools for Alpha; Journal of Instruction-Level Parallelism 3 (2000) 1-25
[4]
Accurate Static Branch Prediction by Value Range Propagation; Jason R. C. Patterson (jasonp@fit.qut.edu.au), 1995
[5]
Near-optimal Intraprocedural Branch Alignment; Cliff Young, David S. Johnson, David R. Karger, Michael D. Smith, ACM 1997
[6]
Software Trace Cache; International Conference on Supercomputing, 1999
[7]
Using Profile Information to Assist Classic Code Optimizations; Pohua P. Chang, Scott A. Mahlke, and Wen-mei W. Hwu, 1991
[8]
Hyperblock Performance Optimizations For ILP Processors; David Isaac August, 1996
[9]
Reverse If-Conversion; Nancy J. Warter, Scott A. Mahlke, Wen-mei W. Hwu, B. Ramakrishna Rau; ACM SIGPLAN Notices, 1993