Treegion Scheduling

Table of Contents

Latest News

2005-02-03

Treegion formation with tail duplication based on ICSE checked in.

2005-12-23

Natural Treegion formation checked in.

2005-12-16

Creation of sched-treegion-branch.

Introduction

Instruction scheduling is a critical phase of compilation for extracting large amounts of ILP from a program. In this work we present the status of the implementation of an architecture-independent, aggressive global instruction scheduler based on Treegions. We also present a technique for efficiently tail duplicating treegions considering the tradeoff between codes size and ILP. The implementation currently resides on the sched-treegion-branch.

Natural Treegion Formation

A treegion is a non-linear, single-entry, multiple-exit region of code containing basic blocks that constitute a subgraph of the control-flow-graph (CFG). Building large regions are critical for enabling a compiler to exploit large amounts of parallelism through speculation. Unlike other region formation algorithms, such as traces and superblocks, treegions take into account multiple execution paths, producing larger regions and more opportunities for speculation. In addition, treegions do not require special architectural features (e.g., predication for Hyperblocks) for region formation. GCC currently supports both linear regions, in the form of superblocks (tracer.c) and extended basic blocks (sched-ebb.c), as well as non-linear regions (sched-rgn.c), which are limited to loop-free procedures and reducible inner loops. Treegions have the advantage that unlike superblocks and extended basic blocks, treegion formation is based on the CFG and does not require profile information. Further, treegion formation typically results in much larger regions as compared to the current region formation implementation.

Natural treegion formation (i.e., treegions formed without tail duplication) begins at the entry point of a procedure, which forms the root of a new treegion. Starting at this root, the CFG is traversed and successor basic blocks are absorbed into the treegion if they are not a merge point (i.e., have multiple predecessor edges). Once all possible blocks have been added to the treegion, the leaf nodes, all of which are merge points, are added to a saplings list. These saplings form the roots of other treegions. For each sapling the same process is applied until all basic blocks in the CFG have been consumed. Treegion formation requires only a single pass over the CFG. Treegions also simplify the calculation of dominators since all basic blocks dominate all successor blocks.

Tail Duplication

In this section we discuss our method for efficient tail duplication, with treegions being the unit of duplication. Our heuristic considers the increase in code size relative to the increase in ILP from tail duplication. This metric, referred to as the Instantaneous Code Size Efficiency (ICSE), is defined as the change in IPC relative to the increase in code size after tail duplication.

To calculate the ICSE of a tail duplicate candidate the IPC of a region must be known at compile time. Since this information is not available we've developed a heuristic to calculate the estimated execution time of a treegion. We define the estimated execution time of a multi-path treegion as the sum of the expected execution time of each path through treegion biased by the execution frequency of each path. The execution frequency of each path is determined through profiling or other heuristics. The expected execution time of any path is the maximum of the data dependence bound and the resource bound. The data dependence bound is calculated as the height of the true data dependence bound of the data dependence graph (DDG) for a given treegion. The resource bound is computed as the number of instructions in the treegion divided by the issue width of the target machine.

The tail duplication process begins by calculating the ICSE for each tail duplication candidate. Each control edge between a parent and child treegion is a potential candidate with the child being the target for duplication. After the ICSE has been calculated for each candidate, the best candidate is chosen for tail duplication. After duplication the effected treegions are reformed and the candidate list is updated. Tail duplication continues until either no more candidates exist or no more candidates are above the ICSE threshold. Discussion of this optimal threshold can be found in [1]. Under code-size or compile time constraints, treegion size may also be limited by the number of basic blocks and/or the number of instructions contained with in the treegion.

Treegion scheduling - Tree Traversal Scheduling

Due to the acyclic nature of treegions, the Haifa scheduler does not require any modifications to schedule treegions. We do however have a number of enhancements currently in development to improve the performance of treegion scheduling.

The goal of Tree Traversal Scheduling (TTS) is to speedup every execution path through the treegion. This is accomplished by prioritizing speculative instructions from different paths which compete for limited resources. Profile information is used to prioritize the scheduling of basic blocks within a treegion. Tree traversal scheduling consists of two steps (1) construction of the control/data dependence graph to perform instruction ordering and (2) scheduling of the instructions in the treegion.

For each treegion instructions are prioritized based on a.) execution frequency, b.) exit count heuristic to resolve ties from (a), and c.) data dependence height to resolve ties from (b). Heuristic (a) gives priority to the most frequently executed path. Heuristic (b) gives priority to instructions that help more exits, which has the potential to make more speculative instructions ready for execution.

The algorithm for treegion scheduling is as follows: (1) For a treegion, sort the basic blocks according to a depth-first traversal order with the child block selected with the highest execution frequency. (2) Begin list scheduling blocks at the root basic block. (3) During the scheduling of a basic block, consider speculation for instructions dominated by this basic block. (4) Repeat step 3 until all basic blocks in the treegion have been scheduled.

The implementation currently resides on the branch.

Contributing

Checkout the sched-treegion branch.

When posting to the development lists, please mark messages and patches with [sched-treegion] in the subject. The usual contribution and testing rules apply. This branch is maintained by Chad Rosier.

Readings

Lots of useful information is present at the TINKER Microarchitecture and Compiler Research homepage. More relevant papers:

[1]
H. Zhou, and T.M. Conte, Code Size Efficiency in Global Scheduling for ILP Processors, Proceedings of the 6th Annual Workshop on the Interaction between Compilers and Computer Architectures (INTERACT-6), Cambridge, MA, February 2002.
[2]
H. Zhou, M. D. Jennings, and T. M. Conte, Tree Traversal Scheduling: A Global Scheduling Technique for VLIW/EPIC Processors, Proceedings of the 14th Annual Workshop on Languages and Compilers for Parallel Computing (LCPC'01), Cumberland Falls, KY, August 2001.
[3]
W. A. Havanki, S. Banerjia, and T. M. Conte, Treegion scheduling for wide-issue processors, Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA-4), Las Vegas, Feb. 1998.
[4]
S. Banerjia, W.A. Havanki, and T.M. Conte, Treegion scheduling for highly parallel processors, Proceedings of the 3rd International Euro-Par Conference (Euro-Par'97), Passau, Germany, pp.1074-1078, Aug. 1997.