Loop optimization
Encyclopedia
In compiler theory, loop optimization plays an important role in improving cache performance, making effective use of parallel processing capabilities, and reducing overheads associated with executing loops. Most execution time of a scientific program is spent on loops. Thus a lot of compiler analysis and compiler optimization
techniques have been developed to make the execution of loops faster.
if it is to preserve the result of the program (i.e., be a legal transformation). Evaluating the benefit of a transformation or sequence of transformations can be quite difficult within this approach, as the application of one beneficial transformation may require the prior use of one or more other transformations that, by themselves, would result in reduced performance.
to describe the combined result of a sequence of many of the above transformations. Central to this approach is the view of the set of all executions of a statement within n loops as a set of integer points in an n-dimensional space, with the points being executed in lexicographical order
. For example, the executions of a statement nested inside an outer loop with index i and an inner loop with index j can be associated with the pairs of integers (i, j). The application of a unimodular transformation corresponds to the multiplication of the points within this space by the matrix. For example, the interchange of two loops corresponds to the matrix .
A unimodular transformation is legal if it preserves the temporal sequence of all dependencies
; measuring the performance impact of a unimodular transformation is more difficult. Imperfectly nested loops and some transformations (such as tiling) do not fit easily into this framework.
handles a wider class of programs and transformations than the unimodular framework. The set of executions of a set of statements within a possibly-imperfectly nested set of loops is seen as the union of a set of polytopes representing the executions of the statements. Affine transformation
s are applied to these polytopes, producing a description of a new execution order. The boundaries of the polytopes, the data dependencies, and the transformations are often described using systems of constraints, and this approach is often referred to as a constraint-based approach to loop optimization. For example, a single statement within an outer loop 'for i := 0 to n' and an inner loop 'for j := 0 to i+2' is executed once for each (i, j) pair such that 0 <= i <= n and 0 <= j <= i+2.
Once again, a transformation is legal if it preserves the temporal sequence of all dependencies
. Estimating the benefits of a transformation, or finding the best transformation for a given code on a given computer, remain the subject of ongoing research as of the time of this writing (2010).
Compiler optimization
Compiler optimization is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied...
techniques have been developed to make the execution of loops faster.
Representation of computation and transformations
Since instructions inside loops can be executed repeatedly, it is frequently not possible to give a bound on the number of instruction executions that will be impacted by a loop optimization. This presents challenges when reasoning about the correctness and benefits of a loop optimization, specifically the representations of the computation being optimized and the optimization(s) being performed.Optimization via a sequence of loop transformations
Loop optimization can be viewed as the application of a sequence of specific loop transformations (listed below or in ) to the source code or Intermediate representation, with each transformation having an associated test for legality. A transformation (or sequence of transformations) generally must preserve the temporal sequence of all dependenciesData dependency
A data dependency in computer science is a situation in which a program statement refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.There are three types of dependencies: data, name, and...
if it is to preserve the result of the program (i.e., be a legal transformation). Evaluating the benefit of a transformation or sequence of transformations can be quite difficult within this approach, as the application of one beneficial transformation may require the prior use of one or more other transformations that, by themselves, would result in reduced performance.
Common loop transformations
- fission/distributionLoop fissionLoop fission is a compiler optimization technique attempting to break a loop into multiple loops over the same index range but each taking only a part of the loop's body. The goal is to break down large loop body into smaller ones to achieve better utilization of locality of reference. It is the...
: Loop fission attempts to break a loop into multiple loops over the same index range but each taking only a part of the loop's body. This can improve locality of referenceLocality of referenceIn computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...
, both of the data being accessed in the loop and the code in the loop's body. - fusion/combiningLoop fusionLoop fusion, also called loop jamming, is a compiler optimization, a loop transformation, which replaces multiple loops with a single one.- Example in C : int i, a[100], b[100]; for Loop fusion, also called loop jamming, is a compiler optimization, a loop transformation, which replaces multiple...
: Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data. - interchange/permutationLoop interchangeIn compiler theory, loop interchange is the process of exchanging the order of two iteration variables.For example, in the code fragment: for i from 0 to 10 for j from 0 to 20 a[i,j] = i + jloop interchange would result in: for j from 0 to 20...
: These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of referenceLocality of referenceIn computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...
, depending on the array's layout. - inversionLoop inversionLoop inversion is a compiler optimization, a loop transformation, which replaces a while loop by an if block containing a do..while loop.-Example in C: int i, a[100]; i = 0; while Loop inversion is a compiler optimization, a loop transformation, which replaces a while loop by an if block...
: This technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if guard can be skipped. - loop-invariant code motionLoop-invariant code motionIn computer programming, loop-invariant code consists of statements or expressions which can be moved outside the body of a loop without affecting the semantics of the program...
: If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversionLoop inversionLoop inversion is a compiler optimization, a loop transformation, which replaces a while loop by an if block containing a do..while loop.-Example in C: int i, a[100]; i = 0; while Loop inversion is a compiler optimization, a loop transformation, which replaces a while loop by an if block...
, because not all code is safe to be hoisted outside the loop. - parallelization : a special case for Automatic parallelizationAutomatic parallelizationAutomatic parallelization, also auto parallelization, autoparallelization, or parallelization, the last one of which implies automation when used in context, refers to converting sequential code into multi-threaded or vectorized code in order to utilize multiple processors simultaneously in a...
focusing on loops, restructuring them to run efficiently on multiprocessor systems. It can be done automatically by compilers (named automatic parallelizationAutomatic parallelizationAutomatic parallelization, also auto parallelization, autoparallelization, or parallelization, the last one of which implies automation when used in context, refers to converting sequential code into multi-threaded or vectorized code in order to utilize multiple processors simultaneously in a...
) or manually (inserting parallel directives like OpenMPOpenMPOpenMP is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems, including Linux, Unix, AIX, Solaris, Mac OS X, and Microsoft Windows platforms...
). - reversal : Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependenciesDependence analysisIn compiler theory, dependence analysis produces execution-order constraints between statements/instructions. Broadly speaking, a statement S2 depends on S1 if S1 must be executed before S2...
and thus enable other optimizations. Also, certain architectures utilize looping constructs at Assembly languageAssembly languageAn assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...
level that count in a single direction only (e.g. decrement-jump-if-not-zero (DJNZ)). - schedulingLoop schedulingIn parallel computing, loop scheduling is the problem of assigning proper iterations of parallelizable loops among n processors to achieve load balancing and maintain data locality with minimum dispatch overhead.Typical loop scheduling methods are:...
- skewing : Loop skewing takes a nested loop iterating over a multidimensional array, where each iteration of the inner loopInner loopIn computer programs, an important form of control flow is the loop. For example, this small pseudo-code program uses two nested loops to iterate over all the entries of an n×n matrix, changing their values so that the matrix becomes an identity matrix: for a in 1..n for b in 1..n ...
depends on previous iterations, and rearranges its array accesses so that the only dependencies are between iterations of the outer loop. - Software pipeliningSoftware pipeliningIn computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor...
: a type of out-of-order executionOut-of-order executionIn computer engineering, out-of-order execution is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay...
of loop iterations to hide the latencies of processor function units. - splitting/peelingLoop splittingLoop splitting is a compiler optimization technique. It attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range.-Loop peeling:...
: Loop splitting attempts to simplify a loop or eliminate dependenciesDependence analysisIn compiler theory, dependence analysis produces execution-order constraints between statements/instructions. Broadly speaking, a statement S2 depends on S1 if S1 must be executed before S2...
by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop. - tiling/blockingLoop tilingLoop tiling, also known as loop blocking, or strip mine and interchange, is a loop optimization used by compilers to make the execution of certain types of loops more efficient.- Overview :...
: Loop tiling reorganizes a loop to iterate over blocks of data sized to fit in the cache. - vectorization
- unrolling: Duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which may degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches & increased program load time), but requires that the number of iterations be known at compile time (except in the case of JITJITJIT may refer to:* Various meanings of Just In Time:** Just-in-time compilation - a technique for improving the performance of virtual machines in computing.** Just-in-time - a business inventory strategy.* Jabber ICQ Transport....
compilers). Care must also be taken to ensure that multiple re-calculation of indexed variables is not a greater overhead than advancing pointers within the original loop. - unswitchingLoop unswitchingLoop unswitching is a compiler optimization. It moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional. This can improve the parallelization of the loop. Since modern processors can...
: Unswitching moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.
Other loop optimizations
- sectioning : First introduced for vectorizers, loop-sectioning (also known as strip-mining) is a loop-transformation technique for enabling SIMD-encodings of loops and improving memory performance. This technique involves each vector operation being done for a size less than or equal to the maximum vector length on a given vector machine. http://software.intel.com/en-us/articles/strip-mining-to-optimize-memory-use-on-32-bit-intel-architecture/ http://download.oracle.com/docs/cd/E19205-01/819-5262/aeugr/index.html
The unimodular transformation framework
The unimodular transformation approach uses a single unimodular matrixUnimodular matrix
In mathematics, a unimodular matrix M is a square integer matrix with determinant +1 or −1. Equivalently, it is an integer matrix that is invertible over the integers: there is an integer matrix N which is its inverse...
to describe the combined result of a sequence of many of the above transformations. Central to this approach is the view of the set of all executions of a statement within n loops as a set of integer points in an n-dimensional space, with the points being executed in lexicographical order
Lexicographical order
In mathematics, the lexicographic or lexicographical order, , is a generalization of the way the alphabetical order of words is based on the alphabetical order of letters.-Definition:Given two partially ordered sets A and B, the lexicographical order on...
. For example, the executions of a statement nested inside an outer loop with index i and an inner loop with index j can be associated with the pairs of integers (i, j). The application of a unimodular transformation corresponds to the multiplication of the points within this space by the matrix. For example, the interchange of two loops corresponds to the matrix .
A unimodular transformation is legal if it preserves the temporal sequence of all dependencies
Data dependency
A data dependency in computer science is a situation in which a program statement refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.There are three types of dependencies: data, name, and...
; measuring the performance impact of a unimodular transformation is more difficult. Imperfectly nested loops and some transformations (such as tiling) do not fit easily into this framework.
The polyhedral or constraint-based framework
The polyhedral modelPolytope model
The polyhedral model is a mathematical framework for loop nest optimization in program optimization. The polytope method treats each loop iteration within nested loops as lattice points inside mathematical objects called polytopes, performs affine transformations or more general non-affine...
handles a wider class of programs and transformations than the unimodular framework. The set of executions of a set of statements within a possibly-imperfectly nested set of loops is seen as the union of a set of polytopes representing the executions of the statements. Affine transformation
Affine transformation
In geometry, an affine transformation or affine map or an affinity is a transformation which preserves straight lines. It is the most general class of transformations with this property...
s are applied to these polytopes, producing a description of a new execution order. The boundaries of the polytopes, the data dependencies, and the transformations are often described using systems of constraints, and this approach is often referred to as a constraint-based approach to loop optimization. For example, a single statement within an outer loop 'for i := 0 to n' and an inner loop 'for j := 0 to i+2' is executed once for each (i, j) pair such that 0 <= i <= n and 0 <= j <= i+2.
Once again, a transformation is legal if it preserves the temporal sequence of all dependencies
Data dependency
A data dependency in computer science is a situation in which a program statement refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.There are three types of dependencies: data, name, and...
. Estimating the benefits of a transformation, or finding the best transformation for a given code on a given computer, remain the subject of ongoing research as of the time of this writing (2010).
See also
- Loop nest optimizationLoop nest optimizationLoop nest optimization applies a set of loop transformations for the purpose of locality optimization or parallelization or other loop overhead reduction of the loop nests...
- Polytope modelPolytope modelThe polyhedral model is a mathematical framework for loop nest optimization in program optimization. The polytope method treats each loop iteration within nested loops as lattice points inside mathematical objects called polytopes, performs affine transformations or more general non-affine...
- Scalable parallelismScalable parallelismSoftware is said to exhibit scalable parallelism if it can make use of additional processors to solve larger problems,i.e. this term refers to software for which Gustafson's law holds.Consider a program whose execution time is dominated by one or more loops,...
- Scalable localityScalable localitySoftware is said to exhibit scalable locality if it can continue to make use of processors that out-pace their memory systems, to solve ever larger problems....