This page lists projects that are expected to improve the performance of the code that GCC generates for IA-64, more properly known as IPF (Itanium Processor Family). The lists originally came out of the GCC IA-64 Summit that was held June 6, 2001, and many of the comments are from that summit. Later updates are from discussions among people working in this area. Additions and corrections are always welcome.
During the June 2001 summit, developers of proprietary IA-64 compilers stressed that interactions between optimizations for IA-64 can be very significant, more so than with other architectures. People contributing IA-64 improvements are highly encouraged to work closely with people working on related improvements so that adverse interactions can be detected early.
At the summit in June 2001, Ross Towle said that some optimizations for IA-64 fall out nicely if data dependence information is as perfect as it can be. At that time GCC did not keep track of this information well at all, and experienced GCC developers reported that alias analysis in GCC during scheduling was extremely weak; it could even lose track of which addresses are supposed to come from the stack frame. It was weak in general and even weaker on IA-64. Alias analysis is a general infrastructure problem; GCC has no knowledge of cross-block scheduling.
Since then, Richard Kenner has checked in several patches to track memory origins. His changes link each MEM to the declaration it comes from, so that alias analysis can know that two MEMs from different declarations can't conflict. This also allows other properties, such as alignment, to be recorded in a MEM. He has also added functionality to prove that two MEMs cannot conflict.
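As a source-level illustration of the idea (a sketch only, not GCC internals), the two functions below contrast references whose declarations are known with references through pointers of unknown origin. The names table_a, table_b, distinct_decls and unknown_origin are made up for this example.

```c
/* Two distinct file-scope arrays: a store to one can never change the other. */
static int table_a[4];
static int table_b[4];

/* The store to table_b[i] cannot modify table_a[i], because the two
   references come from different declarations.  A compiler that tracks
   memory origin may keep 'before' in a register and skip the reload.  */
int distinct_decls(int i, int v)
{
    table_a[i] = v;
    int before = table_a[i];
    table_b[i] = v + 1;
    return table_a[i] - before;   /* always 0 */
}

/* Through pointers the origin is unknown: p and q may point at the same
   object, so the second read of *p really has to be performed.  */
int unknown_origin(int *p, int *q, int v)
{
    *p = v;
    int before = *p;
    *q = v + 1;
    return *p - before;           /* 1 if p == q, otherwise 0 */
}
```

In the first function a scheduler is free to move the load of table_a[i] across the store to table_b[i]; in the second it can do so only if it can prove that p and q differ.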
Now that better alias information is available, GCC should make use of it.
What kinds of projects could now make use of Richard's memory origin work? Is the new information available during scheduling? What other optimizations could use it?
At the GCC IA-64 Summit in June 2001, developers of other IA-64 compilers said that optimizations involving compiler generated data prefetch are important for IA-64 performance.
GCC 3.1 includes a prefetch RTL pattern that supports data prefetch on a variety of GCC targets, a __builtin_prefetch function, and -fprefetch-loop-arrays. General information about data prefetch and about the data prefetch instructions supported by a variety of GCC targets is described in the Data Prefetch Support section of the Projects pages.
Janis Johnson is experimenting with the heuristics used for the -fprefetch-loop-arrays optimization to get better performance on IA-64.
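For readers who have not used the builtin, here is a minimal sketch of a manual prefetch in a simple reduction loop. It relies on __builtin_prefetch as provided by GCC; the function name sum_with_prefetch and the PREFETCH_AHEAD distance are illustrative assumptions, and the distance is not a tuned value for any Itanium implementation.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; an illustrative value, not tuned. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting that data a few iterations ahead will be read. */
long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* Second argument 0 = prefetch for read;
               third argument 1 = low temporal locality. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        total += data[i];
    }
    return total;
}
```

With -fprefetch-loop-arrays the compiler attempts to insert comparable prefetches itself; whether either form helps depends on the access pattern and the target.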
There is dependence distance code already checked into the compiler that no one uses. That information could be hooked into the loop unroller and the prefetcher, for example to check that references to two different array elements within a loop iteration don't conflict. See the code in dependence.c to determine whether it uses the MEM tracking information and whether the dependence distance code itself is ever used in any loop optimization or could be used there. This could also be hooked up to the MEM info struct and used for iteration distance.
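To make "dependence distance" concrete, the following sketch handles only the simplest case: a write to a[i + write_off] and a read of a[i + read_off] in a loop that counts i upward by one. It is not the code in the compiler; struct dep_result and dependence_distance are names invented for this example.

```c
#include <stdbool.h>

/* Result of comparing a write to a[i + write_off] with a read of
   a[i + read_off] inside "for (i = 0; i < trip_count; i++)".  */
struct dep_result {
    bool dependent;   /* do the two references ever touch the same element? */
    long distance;    /* iterations from the write to the read of that element;
                         negative means the read happens first */
};

struct dep_result dependence_distance(long write_off, long read_off,
                                      long trip_count)
{
    struct dep_result r = { false, 0 };
    /* Iteration iw writes element iw + write_off and iteration ir reads
       element ir + read_off.  They coincide when
       ir - iw == write_off - read_off.  */
    long d = write_off - read_off;
    long magnitude = d < 0 ? -d : d;
    /* Both iterations must fall inside the loop's iteration space. */
    if (trip_count > 0 && magnitude < trip_count) {
        r.dependent = true;
        r.distance = d;
    }
    return r;
}
```

For a loop body such as a[i] = a[i - 3] + 1, the call dependence_distance(0, -3, n) reports a distance of 3, which suggests that up to three consecutive iterations are independent of one another and could be overlapped by an unroller.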
Code locality is even more important for IA-64 than for other architectures on which it has shown a benefit.
There is an article by Karl Pettis and Bob Hansen about how to order functions based on a call graph: "Profile guided code positioning", http://acm.proxy.nova.edu/pubs/articles/proceedings/pldi/93542/p16-pettis/p16-pettis.pdf.
Steve Christiansen tried using gprof output to create a linker script that orders functions based on run-time call graphs and call counts, but couldn't show that it made a difference, based on SPEC CPU2000 results.
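The general shape of call-graph-based function ordering can be sketched as a greedy merge of chains along the heaviest call edges. The code below is a simplified illustration in that spirit, not the algorithm from the article and not the tool described above; struct call_edge, order_functions and MAX_FUNCS are assumptions made for the example, and nfuncs must not exceed MAX_FUNCS.

```c
#include <stdlib.h>

#define MAX_FUNCS 64

struct call_edge {
    int caller, callee;   /* function indices in 0..nfuncs-1 */
    long count;           /* calls observed at run time */
};

/* qsort comparator: heaviest edges first. */
static int by_count_desc(const void *a, const void *b)
{
    const struct call_edge *x = a, *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* Fill order[0..nfuncs-1] with a layout in which functions joined by
   heavy call edges tend to be adjacent.  Sorts 'edges' in place.  */
void order_functions(struct call_edge *edges, int nedges, int nfuncs,
                     int *order)
{
    int chain_of[MAX_FUNCS];              /* chain each function is in */
    int next[MAX_FUNCS];                  /* next function in chain, or -1 */
    int head[MAX_FUNCS], tail[MAX_FUNCS]; /* indexed by chain id */

    for (int f = 0; f < nfuncs; f++) {
        chain_of[f] = head[f] = tail[f] = f;
        next[f] = -1;
    }

    qsort(edges, (size_t)nedges, sizeof *edges, by_count_desc);

    for (int e = 0; e < nedges; e++) {
        int a = chain_of[edges[e].caller];
        int b = chain_of[edges[e].callee];
        if (a == b)
            continue;                     /* already placed together */
        next[tail[a]] = head[b];          /* append chain b to chain a */
        tail[a] = tail[b];
        for (int f = head[b]; f != -1; f = next[f])
            chain_of[f] = a;
        head[b] = -1;                     /* chain b no longer exists */
    }

    int n = 0;
    for (int c = 0; c < nfuncs; c++)      /* emit the surviving chains */
        for (int f = head[c]; f != -1; f = next[f])
            order[n++] = f;
}
```

The resulting order could then be turned into a linker script or a list of section names; that step is left out here.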
Jan Hubicka, together with Richard Henderson and Andreas Jaeger, made several changes to the profile-directed block ordering in GCC for GCC 3.1. This functionality is available through -fbranch-probabilities using data generated by first compiling with -fprofile-arcs. This is described in Profile Driven Optimizations, which also lists items for future work.
The following items came out of the June 2001 summit as issues to investigate:
Look into SGI's tool CORD to determine whether its techniques can be used with GCC.
Some of this was done in the summer of 2001 and is in GCC 3.1. There might be more work that could be done here.
Validate that the machine model in GCC is accurate. This would be most useful when specific problems are noticed in generated code, rather than making a full pass through it.
Look into incorporating information from Intel's KAPI library into the machine model in GCC.
The machine model should guide instruction bundling, but bundling is currently done using ad hoc methods.
To evaluate instruction bundling, look at nop density.
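One simple way to put a number on nop density is to count the fraction of instruction slots that hold a nop. The sketch below assumes the mnemonics have already been extracted from disassembly output, one string per slot with any qualifying predicate stripped; nop_density is a name made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Fraction of instruction slots filled with a nop.  Each entry of
   'mnemonics' is one slot's mnemonic, e.g. "ld8", "nop.i", "add".  */
double nop_density(const char *const *mnemonics, size_t nslots)
{
    size_t nops = 0;
    for (size_t i = 0; i < nslots; i++)
        if (strncmp(mnemonics[i], "nop.", 4) == 0)   /* nop.m, nop.i, ... */
            nops++;
    return nslots ? (double)nops / (double)nslots : 0.0;
}
```

Comparing this figure before and after a bundling change, on the same source, gives a rough indication of whether slots are being used better.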
The register allocator needs to know that there is some cost in allocating additional stack registers because there's the danger of hidden spilling in the Register Stack Engine (RSE) at the time of a call.
This doesn't require recovery code and is quite simple.
Turning off the current post-increment support actually produces faster code for IA-64, since post-increment tends to create extra dependencies. To be used effectively, post-increment could be generated after the second scheduling pass, which would then require a third pass.
Post-increment could be used when optimizing for size.
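The source pattern in question looks like the second function below: a load whose address register is advanced immediately afterwards. This is only a C-level illustration with invented names (sum_indexed, sum_postinc); which addressing form the compiler actually emits for either function depends on its passes and options.

```c
#include <stddef.h>

/* Indexed form: each element is addressed as base plus index. */
long sum_indexed(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Pointer-walking form: each load is followed by advancing the same
   pointer, which is the pattern a post-increment load can express in a
   single instruction.  The cost is that every load now depends on the
   pointer update made by the previous one.  */
long sum_postinc(const long *a, size_t n)
{
    long total = 0;
    const long *end = a + n;
    while (a < end)
        total += *a++;
    return total;
}
```

The second form saves an instruction per iteration, which is why it is attractive for size, while the chain of pointer updates is the kind of extra dependency that can constrain the scheduler.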
Exploit opportunities for non-loop induction variables.
It's necessary to measure the trade-offs between alignment and code size.
This isn't turned on for IA-64; again, measure the trade-offs.
Tuning for Itanium 2, controlled by -mtune, should be added.
Jan Hubicka added support to the mainline (to become 3.2) to do branch combining of chained branches having the same destination, with hooks for target-specific tricks. Such tricks are expected to be worthwhile for IA-64; see the thread in the gcc-patches archives.
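At the source level, branch combining of chained branches with a common destination amounts to the rewrite sketched below. The function names chained and combined are invented for the example, and the code shows the transformation's effect rather than how GCC implements it.

```c
/* Two chained branches with the same destination. */
int chained(int a, int b)
{
    if (a == 0)
        goto fail;
    if (b == 0)
        goto fail;
    return 1;
fail:
    return 0;
}

/* Combined form: both conditions are evaluated unconditionally and
   merged, leaving a single branch.  This is valid here because the
   second test has no side effects and cannot trap.  */
int combined(int a, int b)
{
    if ((a == 0) | (b == 0))
        return 0;
    return 1;
}
```

On IA-64 the merged condition is a natural candidate for the architecture's parallel compare forms, which is one reason target-specific hooks are of interest here.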
John Sias explains: "Region formation is a way of coping with either limitations of the machine or limitations of the compiler / compile time. 'Regions' are control-flow-subgraphs, formed by various heuristics, usually to perform transformations (i.e. hyperblock formation) or to do register allocation or other work-intensive things. For hyperblock formation, for example, region formation heuristics are critical---selecting too much unrelated code wastes resources; conversely, missing important paths that interact well with each other defeats the purpose of the transformation. Large functions are sometimes broken heuristically into regions for compilation, with the goal of reducing compile time."
Richard Henderson says we could rip out the Haifa scheduler's CFG detection, use regular data structures, and fix region detection.
Now that the tree-ssa branch has been merged into mainline, we can perform cool optimizations that require more information than is available in RTL.
The infrastructure for this is not yet available.
The lno-branch can perform many high-level loop optimizations.
This requires highly predicated code.
There is little or no knowledge of predication outside of the if-cvt.c file, so there are a number of optimization passes that are suboptimal when predicated code is present. None of the optimization passes up to and including register allocation know how to handle predication from a correctness standpoint.
PQS (Predicate Query System) is a database of known relationships between predicates. It would underlie predicate-aware dataflow, and therefore dependence drawing and register allocation.
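As a rough idea of what such a database answers, the sketch below records only one kind of relationship, namely that a single compare sets two predicate registers to opposite values, and supports one query. A real predicate query system tracks far more than this; struct pqs and the pqs_* functions are hypothetical names, and predicate numbers are assumed to be below MAX_PREDS.

```c
#include <stdbool.h>

#define MAX_PREDS 64

/* complement_of[p] is the predicate known to be the exact opposite of p,
   or -1 when nothing is known about p.  */
struct pqs {
    int complement_of[MAX_PREDS];
};

void pqs_init(struct pqs *q)
{
    for (int p = 0; p < MAX_PREDS; p++)
        q->complement_of[p] = -1;
}

/* Record that one compare set p_true and p_false to opposite values. */
void pqs_record_compare(struct pqs *q, int p_true, int p_false)
{
    q->complement_of[p_true] = p_false;
    q->complement_of[p_false] = p_true;
}

/* True when instructions guarded by p1 and p2 are known never to
   execute together.  */
bool pqs_disjoint(const struct pqs *q, int p1, int p2)
{
    return q->complement_of[p1] == p2;
}
```

A predicate-aware register allocator could, for instance, let two values share a register when the instructions defining and using them are guarded by predicates for which pqs_disjoint returns true.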
Bernd Schmidt made an unsuccessful attempt to add data speculation. Completing the patch won't be worthwhile until there is a sufficient amount of ILP.
The IBM IA-64 compiler team saw code in important applications that could have benefited from very local data speculation; see comments by Jim McInnes in the minutes of the GCC IA-64 Summit.
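The kind of very local opportunity meant here can be modelled in C as follows. The second function mimics what an advanced load followed by a check does: the load is moved above a store that might alias it, and redone only if the store turned out to hit the same address. This is a model of the idea with invented names (store_then_load, load_then_check), not generated code; on IA-64 the address comparison is done by hardware rather than by an explicit test.

```c
/* Original order: the load of *q must wait for the store to *p,
   because p and q might point at the same object.  */
int store_then_load(int *p, const int *q, int v)
{
    *p = v;
    return *q * 2;
}

/* Data-speculative order, modelled in C. */
int load_then_check(int *p, const int *q, int v)
{
    int early = *q;   /* models an advanced load moved above the store */
    *p = v;
    if (p == q)       /* models the check: the store hit the load's address */
        early = *q;   /* recovery: redo the load */
    return early * 2;
}
```

Both functions return the same result for every pair of valid pointers; the second simply starts the load earlier in the common case where the pointers differ.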
Control speculation is more important than data speculation. It needs cross-block scheduling, since the compiler doesn't see the opportunity or need within a basic block. Both require generating recovery code, which introduces new instructions and new register definitions and uses. It might be difficult to build in.
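Control speculation can be modelled in a similar way. In the sketch below a load is hoisted above the branch that guards it; the helper stands in for a speculative load that defers a fault by marking its result instead of trapping, and the later test stands in for the check that branches to recovery code. All names (struct spec_value, speculative_load, guarded_load, speculated_load) are invented for this illustration, and the model treats a null pointer as the only possible fault.

```c
#include <stdbool.h>
#include <stddef.h>

/* A loaded value plus a flag modelling a deferred fault. */
struct spec_value {
    int value;
    bool deferred_fault;
};

/* Models a speculative load: never traps, marks the result instead. */
static struct spec_value speculative_load(const int *p)
{
    struct spec_value r = { 0, p == NULL };
    if (!r.deferred_fault)
        r.value = *p;
    return r;
}

/* Original: the load happens only when the guard passes. */
int guarded_load(const int *p, int fallback)
{
    if (p != NULL)
        return *p + 1;
    return fallback;
}

/* Speculated: the load is issued before the branch is resolved. */
int speculated_load(const int *p, int fallback)
{
    struct spec_value s = speculative_load(p);  /* hoisted above the branch */
    if (p != NULL) {
        if (s.deferred_fault)   /* models the check on the speculated value */
            s.value = *p;       /* recovery code: redo the load normally */
        return s.value + 1;
    }
    return fallback;            /* the speculated value is simply unused */
}
```

The recovery path is what introduces the new instructions and register uses mentioned above, and the hoisting itself only becomes visible once the scheduler can look across the branch.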
Some people at Red Hat tried unsuccessfully to tie control speculation into the Haifa scheduler, but the effort showed that alias analysis in GCC during scheduling is extremely weak. One problem was that it couldn't even tell which addresses are from the stack frame and so it would speculate too much. This project was tried quite quickly, though, and with more time such a project might be successful. Since then, Richard Kenner has added support for tracking memory origin, so this might be more successful now.
Bernd Schmidt might have an unfinished patch that could be picked up.
Stan Cox also had an unfinished control speculation patch.
This is difficult if an exception is involved.
Dwarf2 is the only debugging format that can handle this.
The infrastructure doesn't currently handle this. This is related to memory optimizations.
Data prefetching is mentioned under short-term projects. Instruction prefetching requires additional infrastructure.
It might be difficult to keep track of this in the machine-independent part of GCC.
Avoid reloads of GP when it is not necessary. The compiler needs more information than is currently available.
Jason Merrill invented cool stuff, e.g. thunks for multiple inheritance, that hasn't been done yet.
It's possible to inline stubs.
This would be for information like DLL import/export; it is not machine independent.
If GCC defined such an attribute, glibc would probably use it.
One of the projects identified at the GCC IA-64 Summit is measuring the performance of GCC on IPF, comparing it to other IPF compilers, and identifying the reasons for performance differences. This would enable the limited developer resources to be spent on those improvements that are most likely to affect the performance of the applications that are identified as being important.
This project can be broken up into a number of tasks that can be performed by separate teams to best utilize the experience and strengths of each team.
Run benchmarks with GCC for IPF with a variety of options for specific optimizations to determine which ones should be included with gcc -O2.
Profile the kernel and look for hot spots where better code generation or optimization would make a significant difference.
Gary Hade at IBM has been collecting profile and coverage data for 2.4.18 IA-64 Linux kernels built with prerelease versions of GCC 3.1. Profile data collection utilizes the SGI-donated Kernprof facility. Coverage data collection utilizes the IBM-donated GCOV kernel facility. The data is being generated under various system loads including parallel Linux kernel builds, AIM Suite VII Benchmark, and the SPEC SDET Benchmark. Much of the data has already been collected but it still needs to be analyzed.
Information and code for the data collection facilities and workloads mentioned above is available from these sources:
Steve Christiansen wrote a dispersal analysis tool that uses objdump disassembly output. It uses McKinley rules and cannot be distributed outside of IBM.
Gary Hade added Itanium 1 support to Steve's dispersal analysis code and integrated the code into GNU Binutils source so it can be invoked and controlled from objdump using IA-64 specific disassembler options. The results of the optional dispersal analysis are added to the disassembly output. Gary submitted a patch supporting Itanium 1 to the Binutils project via the bug-binutils mailing list on June 6, 2002. An update adding Itanium 2 support can be provided after Intel makes the McKinley information public.
Developers of proprietary IPF compilers who have identified key code fragments from real applications where IPF optimizations make a big difference could share these with GCC developers.
This would allow a tool that uses profiling output to order functions to be used with a wider variety of applications.
Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.