Improving GCC's Infrastructure (Control Flow Graph)

This page describes ongoing work to improve GCC's infrastructure for RTL based optimizers. The work is done on a branch in GCC's CVS tree called cfg-branch, which is available to experiment in. As the patches stabilize, they will be submitted for review and acceptance into the mainline development tree. Please contact Jan Hubicka <jh@suse.cz> if you would like to contribute.

Background & Rationale

The purpose of work on the CFG branch is to update GCC's older optimization passes to use the modern infrastructure which was not available when they were written. A second purpose is to continue improving the infrastructure for RTL-based optimizations.

Currently, work concentrates on the control flow graph (CFG) data structure, originally contributed by Richard Henderson. We would like to convert all of the RTL-based optimizer passes to use it and keep it accurate. This will allow us to store more information directly in this structure, instead of using "notes" in the instruction stream which are difficult to keep accurate during optimization.

Goals

In addition to improving the infrastructure and updating the existing RTL-based optimization passes, we are implementing new optimization passes. We plan to implement optimizations that perform code duplication first such as superblock formation and loop peeling. We also plan to reimplement the loop unrolling pass.

Detailed Description

Some ideas that the cfg-branch will implement are known only in the compiler research literature but are new concepts for GCC. We'd like to explain the concepts briefly, give an overview of what we want to achieve in GCC and describe the current status.

Loop Peeling

This work is done by Jan Hubicka.

Theory

Loop Peeling is a form of loop unrolling where the first iterations from a loop are unrolled. It removes i iterations from the beginning of a loop and adds i copies of the loop body directly before the start of the loop.

Implementation in GCC

The loop peeling optimization will be done as part of the superblock formation. The tracer should peel as many iterations as can be predicted. Zdenek has also implemented simple loop peeling as part of the new loop optimizer.

Status

The code is on the branch.

The tracer peels loops with one predicted iteration. We should try to peel more than one iteration.

Loop optimizer re-implementation

This work is done by Zdenek Dvorak.

Theory

The current loop optimizer uses information passed by the front end to discover loop constructs to simplify flow analysis. It is difficult to keep the information up-to-date and nowday it is easy to implement the loop discovery code on CFG.

Implementation in GCC

We want to use the Michael Hayes' loop discovery code and slowly replace existing features of loop optimizer by new one. So far Zdenek has modified the datastructure to allow easier updating and implemented loop unrolling, peeling and unswitching on the top of it.

Status

The main changes the loop infrastructure has been merged to mainline. Rest is being tunned on the branch.

Software Trace Cache

This work is done by Josef Zlomek.

Theory

Software Trace Cache is a basic block layout algorithm that also does code duplication during the process. While optimal code layout is an NP complete problem, allowing a limited amount of duplication should eliminate the common drawbacks of current algorithms and keep code size under control. The purpose of this pass is to minimize the number of branches and cache misses. It uses code duplication to avoid jumps. A trivial example is copying the return instruction instead of jumping to it. See [6] for a detailed description.

Implementation in GCC

The implementation will only do intraprocedural optimizations since interprocedural optimizations are out of reach of current GCC.

Status

The code is in the branch. The exact benefits needs to be measured but on non-PDO runs it brings 5% to eon benchmark (from CPU2000).

Web Construction Pass

This pass constructs webs as commonly used for register allocation purposes. After that each web gets assigned individual pseudo. This allows our register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover.

While future SSA passes will have same effect on the low-level RTL generation, we still believe this pass to be useful right before the register allocation pass or after the code duplication passes that need to unshare temporaries, such as tracer or loop unroller.

Implementation in GCC

The implementation of the web pass uses Michael Hayes's dataflow module to construct du-chains and unionfind to produce webs.

Status

The code is on the branch.

This pass improves SPECint2000 performance by roughly 1%. We need to work on updating debug information in this pass.

Register coalescing Pass

This pass coalesces multiple registers into single one in order to avoid register to register copies that our register allocator is not able to deal with very well. It is a kind of temporary solution until the new register allocator is in place. At that point, the benefits are questionable, but still it is more effective (and probably less expensive) than the current copy propagation implementation.

Implementation in GCC

It is designed as stand alone pass constructing conflict graph and coalescing registers run after GCSE.

Status

The code is on the branch.

This pass improves some of SPECint2000 tests and degrades others basically because of lack of live range splitting and increased register pressure in final program. It helps by about 1 point on peak SPECint runs.

Jump Combining Pass

This work is done by Jan Hubicka.

Theory

This optimization is very close to instruction combining. The compiler discovers conditionals with common expression like patterns (a || b) and (a && b) and attempts to create compound conditions.

When using condition registers arithmetics this is always possible, but not always profitable for all architectures. Trivial, still common, cases can be handled. Some examples for these conversions are: (a || b) can be converted to (a | b), (a == b || c == d) to (a^b)|(c^d) and, (a == 3 || a == 4) to ((unsigned)(a - 3) < 2).

Implementation in GCC

This is implemented as part of the if-conversion pass. It basically finds the patterns of two jumps in the CFG that can be combined and then attempts to to apply several simplifications. Profile data is used to control the amount of optimizations.

Status

The code is on the branch. Benefits for Athlon CPU are about 0.5%. The code is being reviewed for inclusion by Richard Henderson.

Thread Safe Profiling

This work is done by Zdenek Dvorak.

Theory

GCC currently does not support profiling of threads.

Implementation in GCC

This is only needed for machines that have no atomic adds and for SMP machines. We currently plan to do this with per-thread counters. At the start of each function, a function from GCC's runtime environment will be called that returns the pointer.

Status

A first version has been written and is installed to the branch. The merging to mainline is problematic and the code may be dropped

Procedure splitting

To shorten branches and avoid pollution it is desirable to put frequently executed procedures together and split their infrequently executed parts.

Status

We don't split the function body, but we do categorize functions into different parts. A first version has been written and has been installed on the branch. It adds about 1% to spec2000 runs with profile feedback.

MIDlevel RTL

This work is done by Jan Hubicka.

Theory

The current RTL representation is too low level for a number of optimizations that we want to implement. This makes the implementation difficult and even ineffective in some cases. MidRTL will be essentially a RTL representation of a program lacking the tons of architectural details making the transformations of instruction chain much easier and therefore being the basis for new optimization passes.

Implementation in GCC

The following outline has been written by Jeff Law:

Particularly for the task of building the machine description for the generic
RTL and the translation from generic RTL into target specific RTL.  Those two
(closely related) tasks are the biggest hunks of infrastructure we need.

Conceptually what we want is a md file which describes the set of generic
RTL patterns.  We're not precisely sure on what predicates to use for
operands, so invent something like "generic_operand" which we will more
precisely define later.  Make all operands use "generic_operand".  At
least initially accept MEMs, REGs and all constants.

The only "tricky" part of "generic_operand" is memory addresses; at least
at this stage, generic_operand should accept a MEM with a "generic_address".
A "generic address" should be any "generic operand", except another MEM.
(ie we don't allow (MEM (MEM)) at this stage.  We will likely refine this
later.

Once you've got a rough framework in place, you have to build out some
infrastructure for lowering.  Specifically we'll need the ability to have
two recognizers.  Early in compilation we'll only recognize the generic
patterns.  When we start lowering we want to recognize the target specific
expanders, patterns, splitters, etc.  So you'll probably have to hack up
genrecog to build two insn-recog files with different prefixes for anything
with external scope and some infrastructure to switch between them based
on a state variable which indicates we've started the lowering process.

For the actual lowering process, we want to avoid twiddling the existing
back ends as much as possible.  If I recall, the general idea Richard and
I came up with was:

For each insn:

  Try to see if it matches a pattern in the target's md file.  If it does,
  then you're done.  Similarly, if it matches a splitter, then run the
  splitter, match the resulting insns in the target md file and your done.

  Else generic.md should call into optabs.c to expand operations -- this will
  in large part allow target specific lowering using existing mechanisms.
  (ie, lowering of operands, running target expanders, etc etc).  It's likely
  we'll need to refine this some, but this is the general idea.

Status

A first version bootstraps and will be hopefully included in the branch soon. It is far from being complete thought.

Miscellaneous Patches

Achievements that have been merged into mainline already:

Easier to use CFG routines
Before branching, we reorganized and cleaned up the CFG routines to make them easier to use. Each RTL instruction now contains a pointer to its basic block. It is therefore easy to adjust basic block boundary notes when instructions are reordered.
Profile based optimization infrastructure

Patches that are waiting for review and integration into mainline:

Subprojects already merged into mainline tree in 3.2 development period

Tracer / Superblock Formation

Theory

The tracer also known as tail duplication or superblock formation pass identifies commonly executed sequences of basic blocks, so called traces and duplicates the blocks in order to avoid side entrances to those sequences - forming so-called superblocks. The superblocks are easier to optimize by conventional passes such as CSE and scheduler. In case no extra optimizations apply, our cross-jumping pass merges the code paths again after register allocation.

Implementation in GCC

We plan to do loop peeling and superblock formation in single pass as described in [3].

Enhancements that are currently only in the cfg-branch:

High Level Branch Prediction

This work is done by Zdenek Dvorak.

Theory

This is an infrastructure enhancement to propagate predictions from the trees representation to the RTL-level.

Implementation in GCC

Flexible and Safe GCOV Format

This work is done by Pavel Nejedly.

Theory

Currently the profile is fragile, since there is no verification that the compiled program matches the profiled data. Since the profiler eliminates the redundancy in data, the mismatch is often not discovered at all causing completely nonsensical results. nonsense.

Implementation in GCC

We plan to calculate a checksum (CRC) of each graph when it is constructed. Also the format will be extended so that more information can be recorded, such as histograms for the number of iterations for loops that can then be used for loop peeling and loop unrolling.

We also want to add versioning and further information to make it look like a real file format.

There are a few references to much more advanced profiling systems in [3].

This is done using NOTE_INSN_PREDICTION emitted in the stream converted to REG_PREDICTION later. For instance, the compiler can now predict branches using goto as infrequent. It also allows to handle and specially predict constructs such as return of (negative) constants, out of range code and, exception handling where the front end knows that those are infrequent, but the back end has no information about it.

We can also use it to preserve more information about high-level loops. For instance loop a with continue statement looks like two nested loops to the back end, but the optimizer may be able to estimate that the internal (continue) loop is not that frequent.

Branch Status

The branch was created from the development mainline on 12 November 2001. Its CVS tag is called cfg-branch.

Contributing

Check out the cfg-branch branch with the command

cvs co -r cfg-branch gcc

then configure and build in the normal way.

Please post patches in the usual way to the gcc-patches list, marked [cfg-branch]. As this is a branch the usual maintainer rules do not apply. This branch is maintained by Jan Hubicka <jh@suse.cz>. Approval from the usual maintainers will be needed when submitting patches from the branch for consideration for the mainline.

Readings

Lots of useful information is present at the IMPACT compiler group homepage. Some other papers:

[1]
Branch Prediction for Free; Ball and Larus; PLDI '93.
[2]
Static Branch Frequency and Program Profile Analysis; Wu and Larus; MICRO-27.
[3]
Design and Analysis of Profile-Based Optimization in Compaq's Compilation Tools for Alpha; Journal of Instruction-Level Parallelism 3 (2000) 1-25.
[4]
Accurate Static Branch Prediction by Value Range Propagation; Jason R. C. Patterson (jasonp@fit.qut.edu.au), 1995
[5]
Near-optimal Intraprocedural Branch Alignment; Cliff Young, David S. Johnson, David R. Karger, Michael D. Smith, ACM 1997
[6]
Software Trace Cache; International Conference on Supercomputing, 1999
[7]
Using Profile Information to Assist Classic Code Optimizations; Pohua P. Chang, Scott A. Mahlke, and Wen-mei W. Hwu, 1991
[8]
Hyperblock Performance Optimizations For ILP Processors; David Isaac August, 1996 (Master Thesis)
[9]
Reverse If-Conversion; Nancy J. Warter, Scott A. Mahlke, Wen-mei W. Hwu, B. Ramakrishna Rau; ACM SIGPLAN Notices, 1993