This page lists places where GCC's code generation is suboptimal. Although the examples are small, the problems are usually quite deep.
Note: unless otherwise specified, all examples have been compiled with the current CVS tree as of the date of the example, on x86, with -O2 -fomit-frame-pointer. (The x86 back end disables -fschedule-insns, which is something that should be revisited, because it almost always gives better code when I turn it back on.)
If a function has been placed in a special section via attributes, we may want to put its static data and string constants in a special section too. But which one? (Being able to specify a section for string constants would be useful for the Linux kernel.)
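As a concrete illustration of the problem, here is a hypothetical sketch, assuming an ELF target; the section name .boot.text and the function are invented for the example:

```c
#include <string.h>
#include <assert.h>

/* The function body is placed in the special section .boot.text, but
   there is currently no attribute to say where its string constants
   and static data should go: they land in the default .rodata/.data
   sections as usual. */
__attribute__((section(".boot.text")))
const char *boot_message(int warm)
{
    static const char *const table[] = {"cold", "warm"};  /* static data */
    return warm ? table[1] : "booting";  /* string constants */
}
```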
Perhaps we should have an un-cse step right after cse, which tries to replace a reg with its value if the value can be substituted for the reg everywhere, if that looks like an improvement; roughly, it is one if the reg is used only a few times. Use rtx_cost to determine whether the change is really an improvement.
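The decision could be sketched as follows (a hypothetical cost model, not GCC code; the function name is invented and the costs stand in for rtx_cost):

```c
#include <assert.h>

/* A hypothetical sketch of the un-cse decision.  Substituting the
   value back into every use duplicates its cost once per use, but
   avoids keeping the register live; in GCC the costs would come
   from rtx_cost. */
int worth_substituting(int value_cost, int reg_cost, int n_uses)
{
    int cost_with_reg   = value_cost + n_uses * reg_cost;  /* compute once, read n times */
    int cost_substituted = n_uses * value_cost;            /* recompute at each use */
    return cost_substituted <= cost_with_reg;
}
```

With a cheap value (say a small constant) substitution wins even with several uses; with an expensive value it only wins when there are very few uses.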
The scheme is that each value has just one hash entry. The first_same_value and next_same_value chains are no longer needed.
For arithmetic, each hash table elt has the following slots:

Operation. This is an rtx code.
Mode. This is a machine mode.
Operands 0, 1 and 2. These are pointers to other hash table elts.
So, if we want to enter (plus:SI (reg:SI 30) (const_int 104)), we first enter (const_int 104) and find the entry that (reg:SI 30) now points to. Then we put these elts into operands 0 and 1 of a new elt. We put PLUS and SI into the new elt.
Registers and mem refs would never be entered into the table as such. However, the values they contain would be entered. There would be a table indexed by regno which points at the hash entry for the value in that reg.
The hash entry index now plays the role of a qty number. We still need qty_table and reg_eqv_table to record which regs share a particular qty.
When a reg is used whose contents are unknown, we need to create a hash table entry whose contents say "unknown", as a place holder for whatever the reg contains. If that reg is added to something, then the hash entry for the sum will refer to the "unknown" entry. Use UNKNOWN for the rtx code in this entry. This replaces make_new_qty.
For a constant, a unique hash entry would be made based on the value of the constant.
What about MEM? Each time a memory address is referenced, we need a qty (a hash table elt) to represent what is in it. (Just as for a register.) If this isn't known, create one, just as for a reg whose contents are unknown.
We need a way to find all mem refs that still contain a certain value. Do this with a chain of hash elts (for memory addresses) that point to locations that hold the value. The hash elt for the value itself should point to the start of the chain. It would be good for the hash elt for an address to point to the hash elt for the contents of that address (but this ptr can be null if the contents have never been entered).
With this data structure, nothing need ever be invalidated except the lists of which regs or mems hold a particular value. It is easy to see if there is a reg or mem that is equiv to a particular value. If the value is constant, it is always explicitly constant.
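The scheme described above can be sketched in a few lines of C. This is a hypothetical illustration: all names are invented, real RTL has many more codes and modes, and a real implementation would hash instead of searching linearly:

```c
#include <assert.h>
#include <stddef.h>

enum code { UNKNOWN, CONST_INT, PLUS };

struct elt {
    enum code code;         /* operation, or UNKNOWN placeholder */
    int mode;               /* machine mode (0 stands for SImode) */
    long value;             /* used when code == CONST_INT */
    struct elt *op0, *op1;  /* operand slots: pointers to other elts */
};

static struct elt table[64];
static int n_elts;

/* Enter a value, reusing an existing elt when an identical one exists.
   UNKNOWN elts are placeholders for unknown register contents, so each
   one entered is distinct. */
struct elt *enter(enum code c, int mode, long value,
                  struct elt *op0, struct elt *op1)
{
    if (c != UNKNOWN)
        for (int i = 0; i < n_elts; i++)
            if (table[i].code == c && table[i].mode == mode
                && table[i].value == value
                && table[i].op0 == op0 && table[i].op1 == op1)
                return &table[i];
    table[n_elts] = (struct elt){c, mode, value, op0, op1};
    return &table[n_elts++];
}

/* Walk the example from the text: enter (plus:SI (reg:SI 30)
   (const_int 104)) twice and check both occurrences share one elt,
   i.e. they are known to compute the same value. */
int demo(void)
{
    struct elt *reg30 = enter(UNKNOWN, 0, 0, NULL, NULL);  /* contents unknown */
    struct elt *c104  = enter(CONST_INT, 0, 104, NULL, NULL);
    struct elt *sum   = enter(PLUS, 0, 0, reg30, c104);
    return enter(PLUS, 0, 0, reg30, c104) == sum;
}
```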
Strength reduction and iteration variable elimination could be smarter. They should know how to decide which iteration variables are not worth making explicit because they can be computed as part of an address calculation. Based on this information, they should decide when it is desirable to eliminate one iteration variable and create another in its place.
It should be possible to compute what the value of an iteration variable will be at the end of the loop, and eliminate the variable within the loop by computing that value at the loop end.
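The two transformations can be illustrated at the source level (an invented example; the compiler would do this on RTL):

```c
#include <assert.h>

/* The loop as written: the iteration variable is scaled on every use. */
long sum_every_other(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[2 * i];
    return sum;
}

/* After the transformations described: the multiplication is strength-
   reduced into a pointer that steps by two elements, and the value of
   i at the loop end (here simply n) is computed directly afterwards
   rather than being maintained inside the loop. */
long sum_every_other_reduced(const long *a, long n, long *i_at_exit)
{
    long sum = 0;
    for (const long *p = a; p < a + 2 * n; p += 2)
        sum += *p;
    *i_at_exit = n;   /* closed-form final value of the eliminated i */
    return sum;
}
```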
Many operations could be simplified based on knowledge of the minimum and maximum possible values of a register at any particular time. These limits could come from the data types in the tree, via rtl generation, or they can be deduced from operations that are performed. For example, the result of an and instruction one of whose operands is 7 must be in the range 0 to 7. Compare instructions also tell something about the possible values of the operand, in the code beyond the test.
Value constraints can be used to determine the results of a further
comparison. They can also indicate that certain
operations are redundant. Constraints might permit a decrement and
branch instruction that checks zeroness to be used when the user has
specified to exit if negative.
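Both points can be illustrated with a small invented example:

```c
#include <assert.h>

/* The result of x & 7 is always in the range 0 to 7, so the bounds
   check below is provably dead: a compiler tracking value ranges
   could remove it entirely. */
int bucket(unsigned x)
{
    unsigned b = x & 7;
    if (b > 7)          /* never true: range of b is [0, 7] */
        return -1;
    return (int)b;
}

/* The loop exits when the counter would go negative; with range
   information, a decrement-and-branch instruction that tests for
   zero could implement this exit condition. */
int count_down(int n)
{
    int iters = 0;
    while (--n >= 0)
        iters++;
    return iters;
}
```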
Sometimes a variable is declared as int, it is assigned only once from a value of type char, and then it is used only by comparison against constants. On many machines, better code would result if the variable had type char. If the compiler could detect this case, it could change the declaration of the variable and change all the places that use it.
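The pattern looks like this (an invented example):

```c
#include <assert.h>

/* c is declared int, assigned exactly once from a char value, and
   then only compared against constants, so it could just as well
   have been declared char. */
int classify(const char *s)
{
    int c = s[0];               /* candidate for narrowing to char */
    if (c == 'a') return 1;
    if (c == 'z') return 2;
    return 0;
}
```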
There may be cases where it would be better to compile a switch statement to use a fixed hash table rather than the current combination of jump tables and binary search.
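A hypothetical sketch of what such a compilation could look like, with an invented sparse switch and an invented hash function:

```c
#include <assert.h>

/* A sparse switch over three magic values... */
int dispatch_switch(unsigned key)
{
    switch (key) {
    case 3:    return 10;
    case 100:  return 20;
    case 7777: return 30;
    default:   return -1;
    }
}

/* ...and a fixed-hash-table compilation of the same switch.  key % 8
   happens to be collision-free for these keys (3 -> 3, 100 -> 4,
   7777 -> 1); unused slots carry the default result. */
int dispatch_hash(unsigned key)
{
    static const struct { unsigned key; int result; } table[8] = {
        {0, -1}, {7777, 30}, {0, -1}, {3, 10},
        {100, 20}, {0, -1}, {0, -1}, {0, -1},
    };
    unsigned h = key % 8;
    return table[h].key == key ? table[h].result : -1;
}
```

The lookup is a constant number of instructions regardless of the key distribution, where a binary search would take a logarithmic number of compares and a jump table would be impractically large for keys this sparse.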
It might be possible to make better code by paying attention to the order in which to generate code for subexpressions of an expression.
The C expression
*(X + 4 * (Y + C)) compiles better on
certain machines if rewritten as
*(X + 4*C + 4*Y) because
of known addressing modes. It may be tricky to determine when, and
for which machines, to use each alternative.
Some work has been done on this, in combine.c.
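The two forms from the text, written out so their equivalence is explicit (C stands in for the compile-time constant):

```c
#include <assert.h>

#define C 3   /* the compile-time constant in the text */

/* *(X + 4*(Y + C)) as the programmer wrote it. */
int load_nested(const int *x, long y) { return *(x + 4 * (y + C)); }

/* The reassociated *(X + 4*C + 4*Y).  The two always load the same
   element, but in this form 4*C is a compile-time constant that can
   fold into the displacement, and 4*Y can map onto a scaled-index
   addressing mode on machines that have one. */
int load_flat(const int *x, long y)   { return *(x + 4 * C + 4 * y); }
```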
Although GCC implements numerous optimizations of the standard C library's string, math and I/O functions, there are still plenty more transformations that could be implemented.
optab entries, like the ones for strlen, could be provided for several more functions.

glibc optimizes strcmp (and has code, not enabled, to optimize memcmp) where one string is constant, to compare successive bytes to known constant ones inline. For lengths longer than about 4, the inline comparisons could test the prefix before calling the library function (or inline a suitable instruction) for the remainder of the strings.

glibc optimizes strncpy from a string constant, where the maximum length is constant and greater than the length of the string constant including the terminating null character, by using an optimization not implemented in GCC. It makes use of its own __mempcpy function to copy and return the address just past the data copied, to pass into memset. This could be implemented if GCC provided a built-in mempcpy.

glibc optimizes strcat into a call to strchr followed by a copy of the appended string.

glibc optimizes strpbrk where the string of characters that need to be matched is constant and of length not greater than three.
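The strncpy expansion described above can be sketched as follows. This is an illustration, not glibc's actual macro: glibc uses __mempcpy, whose return value (the address just past the copied data) feeds straight into memset; plain memcpy plus an offset is used here to keep the sketch portable, and the constant "hi" and the function name are invented:

```c
#include <assert.h>
#include <string.h>

/* Expansion of strncpy(dest, "hi", n) for constant n >= sizeof "hi":
   copy the constant (including its NUL), then zero-fill the rest,
   since strncpy pads the destination out to n bytes. */
char *strncpy_hi(char *dest, size_t n)
{
    memcpy(dest, "hi", sizeof "hi");                   /* 3 bytes, NUL included */
    memset(dest + sizeof "hi", '\0', n - sizeof "hi"); /* pad to n */
    return dest;
}
```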
The GNU libc also currently contains macros to optimize calls to some string functions with constant arguments and those that can be implemented by processor-specific instructions. These transformations are better performed in GCC, both to reduce the overhead of macro expansion and to take advantage of function attributes, for example to avoid a second call to a pure function altogether. The use of these macros tends to cause huge blowup in the size of preprocessed source if nested; for example, each nested call to such a macro can expand the source 20-fold, with four nested calls having an expansion ten megabytes in size. GCC then consumes a huge amount of memory compiling such expressions. The remaining optimizations need to be implemented in GCC and then disabled in glibc, with benefits to other systems as well, and the potential to use information GCC has that the macros do not.
All the string functions act as if they access individual
characters, so care may need to be taken that no
-fstrict-aliasing problems occur when internal uses of
other types are generated. Also, the arguments to the string function
must be evaluated exactly once each (if they have any side effects),
even though the call to the string function might be optimized away.
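The evaluate-exactly-once requirement is precisely what naive macro expansions get wrong. A deliberately broken example (the macro and function are invented for illustration):

```c
#include <assert.h>
#include <string.h>

/* A naive string-equality macro of the kind described: it expands its
   first argument twice, so an argument with a side effect is evaluated
   more than once.  A GCC built-in expansion must not do this, even
   when it optimizes the call away entirely. */
#define NAIVE_STREQ(a, b) \
    (strlen(a) == strlen(b) && memcmp((a), (b), strlen(b)) == 0)

/* Returns the number of times the i++ side effect ran: 2, not 1. */
int double_evaluation_count(void)
{
    const char *strings[] = {"abc", "def"};
    int i = 0;
    (void)NAIVE_STREQ(strings[i++], "abc");
    return i;
}
```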
Care must be taken that any optimizations in GCC are standards-conforming in terms of not possibly accessing beyond the arrays involved (possibly within a function call created in the optimization); whereas the glibc macros know the glibc implementation and how much memory it might access, GCC optimizations can't.
There are some further optimizations in glibc not covered here, that either optimize calls to non-ISO C functions or call glibc internal functions in the expansion of the macro.
Many of these optimizations should not be applied if
-Os is specified.
Loads from memory can take many cycles if the loaded data is not in a cache line. Cache misses can bring a CPU to a halt for several hundred cycles, so minimizing cache misses is one of the most important jobs of a modern compiler. Data prefetch instructions are one tool the compiler can use for this.
Data prefetch instructions allow a compiler to minimize cache-miss latency by loading data into a cache before it is accessed. A separate page describes data prefetch support and optimizations that are in development or already supported by GCC.
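At the source level, GCC already exposes prefetching through the __builtin_prefetch built-in; the kind of code the compiler could emit automatically looks like this (the prefetch distance of 16 elements is an arbitrary choice for illustration):

```c
#include <stddef.h>
#include <assert.h>

/* __builtin_prefetch has no effect on the result; it only hints that
   a[i + 16] will be read soon, so its cache line can be fetched while
   the current iterations execute. */
long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw= */ 0, /* locality= */ 3);
        s += a[i];
    }
    return s;
}
```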
Almost all code transformations implemented in GCC are target independent by design, but how well they work depends on how accurately an architecture is modelled. For example, combine may fail to combine some instructions if a machine description does not provide a pattern, or if the pattern exists but the predicates are too strong. Another example is the target cost function which is used to determine if a transformation is profitable. And sometimes a transformation that some specific target could benefit from is simply missing in the target independent implementation.
Optimization deficiencies like these are listed per target in the target specific pages below. An issue listed for one target may also be an issue for other targets, but the problems in the target specific pages are problems with, and ideas for, the machine descriptions of particular targets.
Note: if an issue is listed in a target specific issues page but is clearly a target independent issue, please move it to a page discussing target independent issues.
Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.