Improving GCC's Interprocedural Optimizaion Infrastructure

This page describes ongoing work to improve GCC's infrastructure for tree-based interprocedural optimizers. The work is done on a branch in GCC's CVS repository called tree-profiling-branch.

The short-term goals of the branch are to bring profiling heuristics and profile feedback information to the tree level, to add support for managing multiple control flow graphs, and to implement profile guided function inlining. At the heart of this project is the call graph code added for GCC 3.4, and the effort to maintain the control flow graph and profile information over expanding from GIMPLE, which is the tree-based intermediate representation, to RTL, the intermediate representation of GCC's back ends.

Longer-term goals include adding support for doing some optimizations (such as CCP and DCE) before inlining, and implementing a framework for other interprocedural optimizations, such as interprocedural constant propagation and side-effects analysis.

As the patches stabilize, they will be submitted for review and acceptance into the mainline development tree. Please contact Jan Hubicka <jh@suse.cz> if you would like to contribute.

Bringing profile information to the tree optimization framework

Rationale

When the compiler knows where optimizations are beneficial, it can do a better job at generating good code for an application. GCC has a framework for profile-guided optimizations, but the profile is not available early enough for several optimizations that could benefit from it. Specifically, function inlining, code generation for multiway branches, various loop transformations, and value histogram based optimizations are done, or could be done better, in the new Tree SSA framework where profile information is not available. Making it available to these passes would allow GCC to generate better code.

Current situation in GCC

As of June 2004, GCC does support profile feedback and branch prediction and several algorithms use the profile to improve performance. The compiler can obtain profile information by trying to predict all branches appearing in the intermediate representation for the application during the compilation process, or from profile feedback collected during test runs of the application.

Unfortunately, the profile is not available until very late in the compilation process. To understand why, you need to understand what profile information is to the compiler, and how it is represented.

While the true execution profile is a program property, for the compiler it is a control flow graph annotation. Therefore, the profile fed back from test runs is in reality the output of the application instrumented with code on control flow graph edges and basic blocks to write out execution counts. This means that the compiler must read the fed back profile at the same point in the compilation process where the control flow graph was previously instrumented for the test runs, because otherwise the profile data may not match with the control flow graph for the intermediate representation at that point in the compilation process.

After instrumenting or reading, the compiler has to update the profile for each modification it does to the control flow graph. So it has to instrument the application for profiling somewhere between the first point in the compilation process from where the CFG will be kept up to date, and the passes for which having profile information available is the most profitable.

In GCC, this means that profile information cannot be made available to passes that run early in the compilation process. The old loop passes do not preserve the control flow graph, and when the CFG is destroyed, any profile information available at that point is also destroyed. The passes which benefit the most from profile information available, such as scheduling and block reordering,all have to run after the loop optimizer. So in GCC we have to instrument the control flow graph for for profiling between running the old loop optimizer and the final scheduling pass.

To make things worse, the control flow graph is also destroyed before the GIMPLE tree is expanded to RTL, and a profiling infrastructure for trees is not available. So simply replacing the old loop is not enough to bring profile information to the Tree SSA passes. GCC needs a way to preserve profile information when expanding from trees to RTL.

Status on the branch

The old loop optimizer has been disabled in anticipation of its replacement by the rewritten loop optimizers from the cfg-branch and the lno-branch. This means that all RTL passes now maintain the CFG from the start of rest_of_compilation all the way through the reorg pass. Also, GCC now knows how to maintain the control flow graph when expanding from GIMPLE to RTL. The CFG-aware expand code can be found in the new file cfgexpand.c. Code to instrument the GIMPLE tree for profiling is being worked on also.

Profile guided inlining

Rationale

Judicious inlining of functions is important for good performance. Inlining avoids function call overhead (register saves and restores, parameter passing ABI (using specific registers, setting up the stack), the call itself) and exposes larger functions to the optimizers for for deeper enhancements. However, inlining may cause exponential code growth. Therefore, inlining of nontrivial functions called from rarely executed basic blocks is almost always a loss.

This implies that the compiler should have some means to estimate the profitability of inlining using information extracted from an execution profile. The profile may be estimated from branch prediction heuristics, or obtained using trial runs and fed back to the compiler. Profiling can reveal details about the actual usage of functions in the application to direct the compiler to inline better and more effectively.

Current situation in GCC

Even if a profile was available, the current inliner would destroy it for inlined functions because profile information is inherently associated with the control flow graph, which is also not built until after inlining. So a new inliner is needed that can inline function control flow graphs instead of functions as trees. The compiler will have to construct and maintain a CFG for every function, instead of only for the function that is currently being compiled.

Status on the branch

Support for multiple CFGs has recently been committed, and work is under way on a new inliner that inlines a function's CFG instead of a GIMPLE tree. Our goal is to have this work ready in time for inclusion in the next release of GCC.

Framework for Interprocedural Analysis and Optimizations

Rationale

Function calls hide information about side effects from the optimizers. By inlining a function into a caller, all such information is made available to the caller, but at the cost of increased code size and degraded instruction locality. Also it is not always possible to inline a function, for example when it has unpredictable side effects (e.g. it calls setjmp or it has non-local gotos).

Interprocedural data flow analyses (IPA) attempt to provide a different means to overcome the information-hiding problem of function calls. They typically analyse the intraprocedural properties of a function, and propagate that information across the call graph. The compiler can then use that information to improve intraprocedural information, such as alias and dependency information, and generate better code for any single function. It can also choose to specialize a function, to remove paths from a function that will never be taken, or to inline functions for which intraprocedural analysis can not prove that inlining is profitable.

Experience from other compilers shows that interprocedural constant propagation and interprocedural side-effects analysis may help disambiguate memory dependencies, and that function specialization by cloning can significantly reduce the cost of function calls in highly abstracted applications (such as C++ applications using the STL). For GCC we believe that implementing these interprocedural analyses would make it possible to greatly improve the generated code in many applications.

Current situation in GCC

At present, function inlining is the only interprocedural optimization implemented in GCC. Until recently, a framework for IPA was lacking, and the intermediate representation was too low-level to allow interprocedural analyses to be effective.

With the contribution of code to construct and analyze the call graph, and with the new optimization framework and higher level intermediate representations from the Tree SSA project, an IPA framework is within reach. The major remaining technical obstacles for IPA are inefficient data structures for the tree intermediate representations, which may cause memory consumption to rise excessively when used as-is in an IPA framework. Other obstacles of non-technical nature also make it it impossible at present to implement across-file interprocedural optimizations other than those provided by the intermodule framework, which also suffers from the inefficient structure of the tree representations.

Status on the branch

At present, no interprocedural work is being done on the branch, but a patch for interprocedural constant propagation is available and will probably be included on the branch soonish. Otherwise, plans exist for reorganisation of the compiler passes as follows:

  1. For each SCC in the call graph in reverse topological order
  2. Perform interprocedural analyses and optimizations (at least constant propagation, side-effects analysis, and function cloning and specialization).
  3. For each SCC in the call graph in reverse topological order