Software pipelining - AbsoluteAstronomy.com

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, software pipelining is a technique used to optimize

Optimization (computer science)

In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...

loops, in a manner that parallels hardware pipelining

Instruction pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

. Software pipelining is a type of out-of-order execution

Out-of-order execution

In computer engineering, out-of-order execution is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay...

, except that the reordering is done by a compiler

Compiler

A compiler is a computer program that transforms source code written in a programming language into another computer language...

(or in the case of hand written assembly code, by the programmer) instead of the processor

Central processing unit

The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

. Some computer architecture

Computer architecture

In computer science and engineering, computer architecture is the practical art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals and the formal modelling of those systems....

s have explicit support for software pipelining, notably Intel's IA-64 architecture.

It is important to distinguish software pipelining which is a target code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops.
Software pipelining has been known to assembly language programmers of machines with instruction level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser.
Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo renaming is widely used in practice.
Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper. This paper has a
good set of references on the topic.

Example

Consider the following loop:
for (i = 1) to bignumber
A(i)
B(i)
C(i)
end

In this example, let A(i), B(i), C(i), be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register

Processor register

In computer architecture, a processor register is a small amount of storage available as part of a CPU or other digital processor. Such registers are addressed by mechanisms other than main memory and can be accessed more quickly...

, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations will execute in the following sequence:
A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...
Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 7 cycles to complete (3 + 3 + 1, because A(i+1) does not have to wait for C(i)).

Now consider the following sequence of instructions (with software pipelining):
A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...
It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):
for i = 1 to (bignumber - 2) step 3
A(i)
A(i+1)
A(i+2)
B(i)
B(i+1)
B(i+2)
C(i)
C(i+1)
C(i+2)
end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem. It should be noted, however, that software pipelining prevents the use of Duff's device

Duff's device

In computer science, Duff's device is an optimized implementation of a serial copy that uses a technique widely applied in assembly language for loop unwinding. Its discovery is credited to Tom Duff in November of 1983, who at the time was working for Lucasfilm. It is perhaps the most dramatic...

, a widely known and efficient solution to this problem.

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency

Latency (engineering)

Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...

. For example, the following code:
for i = 1 to bignumber
A(i) ; 3 cycle latency
B(i) ; 3
C(i) ; 12(perhaps a floating point operation)
D(i) ; 3
E(i) ; 3
F(i) ; 3
end
would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache

Cache

In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere...

performance, see code bloat
Code bloat
Code bloat is the production of code that is perceived as unnecessarily long, slow, or otherwise wasteful of resources. Code bloat can be caused by inadequacies in the language in which the code is written, inadequacies in the compiler used to compile the code, or by a programmer...

). Even worse, the prolog (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prolog code, rendering the software pipelining optimization ineffectual.

Here is an alternative implementation of software pipelining for our example (the prolog and postlog will be explained later):
prolog
for i = 1 to (bignumber - 6)
A(i+6)
B(i+5)
C(i+4)
D(i+2) ; note that we skip i+3
E(i+1)
F(i)
end
postlog
Before getting to the prolog and postlog, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)

Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)

Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)

Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)

Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)

Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)

Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoid the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prolog and postlog handle iterations at the beginning and end of the loop. Here is a possible prolog for our example above:
; loop prolog (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2)
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)
Each line above corresponds to an iteration of the pipelined loop, with instructions for iterations that have not yet begun removed. The postlog would look similar.

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The postlog would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) (ISBN 0-89838-300-5). She calls it Modulo Renaming. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the latter is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

A(i+2); B(i)
A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

r1 = A(i+2); B(i) = r1
r2 = A(i+3); B(i+1) = r2
i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop (eventually looping back around to the beginning). This makes the extra instructions inserted in the previous example unnecessary.
Predicates (used to "predicate" instructions; see Branch predication
Branch predication
Branch predication is a strategy in computer architecture design for mitigating the costs usually associated with conditional branches, particularly branches to short sections of code...

) that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prolog and postlog unnecessary.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.