Branch predictor
Encyclopedia
In computer architecture
, a branch predictor is a digital circuit that tries to guess which way a branch
(e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline
. Branch predictors are crucial in today's pipelined microprocessor
s for achieving high performance
.
Two-way branching is usually implemented with a conditional jump
instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code which follows immediately after the conditional jump - or it can be "taken" and jump to a different place in program memory where the second branch of code is stored.
It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline (see fig. 1).
Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed
. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the higher the need for a good branch predictor.
The first time a conditional jump instruction is encountered, there is not much information to base a prediction on. But the branch predictor keeps records of whether branches are taken or not taken. When it encounters a conditional jump that has been seen several times before then it can base the prediction on the past history. The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
Branch prediction is not the same as branch target prediction
. Branch prediction attempts to guess whether a conditional jump will be taken or not. Branch target prediction attempts to guess the target of a taken conditional or unconditional jump before it is computed by decoding and executing the instruction itself. Branch prediction and branch target prediction are often combined into the same circuitry.
The early implementations of SPARC
and MIPS
(two of the first commercial RISC architectures) used single direction static branch prediction: they always predicted that a conditional jump would not be taken, so they always fetched the next sequential instruction. Only when the branch or jump was evaluated and found to be taken did the instruction pointer get set to a non-sequential address.
Both CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Both architectures defined branch delay slot
s in order to utilize these fetched instructions.
A more complex form of static prediction assumes that backwards branches will be taken, and forward-pointing branches will not be taken. A backwards branch is one that has a target address that is lower than its own address. This technique can help with prediction accuracy of loops, which are usually backward-pointing branches, and are taken more often than not taken.
Some processors allow branch prediction hints to be inserted into the code to tell whether the static prediction should be taken or not taken. The Intel Pentium 4
accepts branch prediction hints while this feature is abandoned in later processors.
Static prediction is used as a fall-back technique in some processors with dynamic branch prediction when there isn't any information for dynamic predictors to use. Both the Motorola MPC7450 (G4e) and the Intel Pentium 4
use this technique as a fall-back.
processors (MIPS R8000
, Alpha 21264
and Alpha 21464
(EV8)) fetch each line of instructions with a pointer to the next line. This next line predictor handles branch target prediction
as well as branch direction prediction.
When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
When a branch is evaluated, the corresponding state machine is updated. Branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The advantage of the two-bit counter over a one-bit scheme is that a conditional jump has to deviate twice from what it has done most in the past before the prediction changes. For example, a loop-closing conditional jump is mispredicted once rather than twice.
The original, non-MMX Intel Pentium processor uses a saturating counter, though with an imperfect implementation.
On the SPEC
'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.
The predictor table is indexed with the instruction address
bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
Consider the example of n = 2. This means that the last two occurrences of the branch are stored in a 2-bit shift register
. This branch history register can have 4 different binary
values: 00, 01, 10, and 11; where 0 means "not taken" and 1 means "taken". Now, we make a pattern history table with four entries, one for each of the 2n = 4 possible branch histories. Each entry in the pattern history table contains a 2-bit saturating counter of the same type as in figure 2. The branch history register is used for choosing which of the four saturating counters to use. If the history is 00 then the first counter is used. If the history is 11 then the last of the four counters is used.
Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a 0. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones.
The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences
are different.
The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern. This method was invented by T.-Y. Yeh and Yale Patt
of the University of Michigan.
Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors today.
The Intel Pentium MMX, Pentium II
and Pentium III
have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump.
On the SPEC
'89 benchmarks, very large local predictors saturate at 97.1% correct.
between different conditional jumps is utilized in the prediction making. The disadvantage is that the history is diluted by irrelevant information in case the different conditional jumps are uncorrelated, and that the history buffer may not include any bits from the same branch in case there are many other branches in between. It may use a two-level adaptive predictor.
This scheme is only better than the saturating counter scheme for large table sizes, and it is rarely as good as local prediction. The history buffer must be longer in order to make a good prediction. The size of the pattern history table grows exponentially
with the size of the history buffer. Hence, the big pattern history table must be shared between all conditional jumps.
A two-level adaptive predictor with globally shared history buffer and pattern history table is called a "gshare" predictor if it xors
the global history and branch PC, and "gselect" if it concatenates
them. Global branch prediction is used in AMD
microprocessors and in Intel Pentium M
, Core
and Core 2
.
On the SPEC'89 benchmarks, very large gshare predictors saturate at 96.6% correct, which is just a little worse than large local predictors.
local and global branch histories, possibly with some bits from the program counter as well. Tests indicate that the VIA Nano
processor may be using this technique.
The agree predictor was used in the first version of the Intel Pentium 4
, but was later abandoned.
Scott McFarling proposed combined branch prediction in his 1993
paper.
On the SPEC'89 benchmarks, such a predictor is about as good as the
local predictor.
Predictors like gshare use multiple table entries to track the
behavior of any particular branch. This multiplication of entries
makes it much more likely that two branches will map to the same table
entry (a situation called aliasing), which in turn makes it much more
likely that prediction accuracy will suffer for those branches. Once
you have multiple predictors, it is beneficial to arrange that each
predictor will have different aliasing patterns, so that it is more
likely that at least one predictor will have no aliasing. Combined
predictors with different indexing functions for the different
predictors are called gskew predictors, and are analogous to
skewed associative caches used for data and instruction caching.
Many microprocessors today have loop predictors.
instruction can choose between more than two branches. Newer processors from Intel and AMD can predict indirect branches by using a two-level adaptive predictor. An indirect jump that chooses between more than two branches contributes more than one bit to the history buffer.
Processors without this mechanism will simply predict an indirect jump to go to the same target as it did last time.
will normally return to where it is called from. The return instruction
is an indirect jump that reads its target address from the call stack
. Many microprocessors have a separate prediction mechanism for return instructions. This mechanism is based on a so-called return stack buffer, which is a local mirror of the call stack. The size of the return stack buffer is typically 4 - 16 entries.
between fast branch prediction and good branch prediction is sometimes dealt with by having two branch predictors. The first branch predictor is fast and simple. The second branch predictor, which is slower, more complicated, and with bigger tables, will override a possibly wrong prediction made by the first predictor.
The Alpha 21264 and Alpha EV8 microprocessors used a fast single-cycle next line predictor to handle the branch target recurrence and provide a simple and fast branch prediction. Because the next line predictor is so inaccurate, and the branch resolution recurrence takes so long, both cores have two-cycle secondary branch predictors which can override the prediction of the next line predictor at the cost of a single lost fetch cycle.
The Intel Core i7 has two branch target buffers
and possibly two or more branch predictors.
, Romania
), in his paper entitled "Towards a High Performance Neural Branch Predictor", Proceedings of The International Joint Conference on Neural Networks - IJCNN '99, Washington DC, USA, 1999. The neural branch predictor research was strongly developed further by Prof. Daniel Jimenez (Rutgers University
, USA). In 2001, (HPCA Conference) it was the first presented perceptron predictor
that was feasible to implement in hardware
.
The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor, see http://cava.cs.utsa.edu/pdfs/micro03_dist.pdf. He also used a gshare/perceptron overriding hybrid predictors.
The main disadvantage of the perceptron predictor is its high latency. Even after taking advantage of high-speed arithmetic tricks, the computation latency is relatively high compared to the clock period of many modern microarchitectures. In order to reduce the prediction latency, Jimenez proposed in 2003 the fast-path neural predictor, where the perceptron predictor chooses its weights according to the current branch’s path, rather than according to the branch’s PC. Many other researchers developed this concept (A. Seznec, M. Monchiero, D. Tarjan & K. Skadron, V. Desmet, Akkary et al., K. Aasaraai, Michael Black, etc.)
The neural branch predictor concept is very promising. Most of the state of the art branch predictors are using a perceptron predictor (see Intel's "Championship Branch Prediction Competition" ). Intel already implements this idea in one of the IA-64's simulators (2003).
Two-bit predictors were introduced by Tom McWilliams and Curt Widdoes in 1977 for the Lawrence Livermore National Lab S-1 supercomputer and independently by Jim Smith in 1979 at CDC.
Microprogrammed processors, popular from the 1960s to the 1980s and beyond, took multiple cycles per instruction, and generally did not require branch prediction. However, along with the IBM 3090, there are several examples of microprogrammed designs that incorporated branch prediction.
The Burroughs B4900, a microprogrammed COBOL machine released in ~1982 was pipelined and used branch prediction. The B4900 branch prediction history state was stored back into the in-memory instructions during program execution. The B4900 implemented 4-state branch prediction by using 4 semantically equivalent branch opcodes to represent each branch operator type. The opcode used indicated the history of that particular branch instruction. If the hardware determined that the branch prediction state of a particular branch needed to be updated, it would rewrite the opcode with the semantically equivalent opcode that hinted the proper history. This scheme obtained a 93% hit rate. US patent 4,435,756 and others were granted on this scheme.
The VAX 9000
, announced in 1989, was both microprogrammed and pipelined, and performed branch prediction.
The first commercial RISC processors, the MIPS
R2000
and R3000
and the earlier SPARC
processors, did only trivial "not-taken" branch prediction. Because they used branch delay slots, fetched just one instruction per cycle, and executed in-order, there was no performance loss. Later, the R4000
used the same trivial "not-taken" branch prediction, and lost two cycles to each taken branch because the branch resolution recurrence was four cycles long.
Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064
, the MIPS R8000
, and the IBM POWER
series. These processors all relied on one-bit or simple bimodal predictors.
The DEC Alpha 21264
(EV6) uses a next-line predictor overridden by a combined local predictor and global predictor, where the combining choice is made by a bimodal predictor.
The AMD K8
has a combined bimodal and global predictor, where the combining choice is another bimodal predictor. This processor caches the base and choice bimodal predictor counters in bits of the L2 cache otherwise used for ECC. As a result, it has effectively very large base and choice predictor tables, and parity rather than ECC on instructions in the L2 cache. Parity is just fine, since any instruction suffering a parity error can be invalidated and refetched from memory.
The Alpha 21464
(EV8, cancelled late in design) had a minimum branch misprediction penalty of 14 cycles. It was to use a complex but fast next line predictor overridden by a combined bimodal and majority-voting predictor. The majority vote was between the bimodal and two gskew predictors.
Computer architecture
In computer science and engineering, computer architecture is the practical art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals and the formal modelling of those systems....
, a branch predictor is a digital circuit that tries to guess which way a branch
Branch (computer science)
A branch is sequence of code in a computer program which is conditionally executed depending on whether the flow of control is altered or not . The term can be used when referring to programs in high level languages as well as program written in machine code or assembly language...
(e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....
. Branch predictors are crucial in today's pipelined microprocessor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
s for achieving high performance
Computer performance
Computer performance is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.Depending on the context, good computer performance may involve one or more of the following:...
.
Two-way branching is usually implemented with a conditional jump
Branch (computer science)
A branch is sequence of code in a computer program which is conditionally executed depending on whether the flow of control is altered or not . The term can be used when referring to programs in high level languages as well as program written in machine code or assembly language...
instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code which follows immediately after the conditional jump - or it can be "taken" and jump to a different place in program memory where the second branch of code is stored.
It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline (see fig. 1).
Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed
Speculative execution
Speculative execution in computer systems is doing work, the result of which may not be needed. This performance optimization technique is used in pipelined processors and other systems.-Main idea:...
. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the higher the need for a good branch predictor.
The first time a conditional jump instruction is encountered, there is not much information to base a prediction on. But the branch predictor keeps records of whether branches are taken or not taken. When it encounters a conditional jump that has been seen several times before then it can base the prediction on the past history. The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
Branch prediction is not the same as branch target prediction
Branch target predictor
In computer architecture, a branch target predictor is the part of a processor that predicts the target of a taken conditional branch or an unconditional branch instruction before the target of the branch instruction is computed by the execution unit of the processor.Branch target prediction is not...
. Branch prediction attempts to guess whether a conditional jump will be taken or not. Branch target prediction attempts to guess the target of a taken conditional or unconditional jump before it is computed by decoding and executing the instruction itself. Branch prediction and branch target prediction are often combined into the same circuitry.
Static prediction
Static prediction is the simplest branch prediction technique because it does not rely on information about the dynamic history of code executing. Instead it predicts the outcome of a branch based solely on the branch instruction.The early implementations of SPARC
SPARC
SPARC is a RISC instruction set architecture developed by Sun Microsystems and introduced in mid-1987....
and MIPS
MIPS architecture
MIPS is a reduced instruction set computer instruction set architecture developed by MIPS Technologies . The early MIPS architectures were 32-bit, and later versions were 64-bit...
(two of the first commercial RISC architectures) used single direction static branch prediction: they always predicted that a conditional jump would not be taken, so they always fetched the next sequential instruction. Only when the branch or jump was evaluated and found to be taken did the instruction pointer get set to a non-sequential address.
Both CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Both architectures defined branch delay slot
Branch delay slot
In computer architecture, a delay slot is an instruction slot that gets executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction will execute even if...
s in order to utilize these fetched instructions.
A more complex form of static prediction assumes that backwards branches will be taken, and forward-pointing branches will not be taken. A backwards branch is one that has a target address that is lower than its own address. This technique can help with prediction accuracy of loops, which are usually backward-pointing branches, and are taken more often than not taken.
Some processors allow branch prediction hints to be inserted into the code to tell whether the static prediction should be taken or not taken. The Intel Pentium 4
Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...
accepts branch prediction hints while this feature is abandoned in later processors.
Static prediction is used as a fall-back technique in some processors with dynamic branch prediction when there isn't any information for dynamic predictors to use. Both the Motorola MPC7450 (G4e) and the Intel Pentium 4
Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...
use this technique as a fall-back.
Next line prediction
Some superscalarSuperscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
processors (MIPS R8000
R8000
The R8000 is a microprocessor chipset developed by MIPS Technologies, Inc. , Toshiba, and Weitek. It was the first implementation of the MIPS IV instruction set architecture. The R8000 is also known as the TFP, for Tremendous Floating-Point, its name during development.-History:Development of the...
, Alpha 21264
Alpha 21264
The Alpha 21264 was a Digital Equipment Corporation RISC microprocessor introduced in October, 1996. The 21264 implemented the Alpha instruction set architecture .- Description :...
and Alpha 21464
Alpha 21464
The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8 or Araña, the latter being its code-name...
(EV8)) fetch each line of instructions with a pointer to the next line. This next line predictor handles branch target prediction
Branch target predictor
In computer architecture, a branch target predictor is the part of a processor that predicts the target of a taken conditional branch or an unconditional branch instruction before the target of the branch instruction is computed by the execution unit of the processor.Branch target prediction is not...
as well as branch direction prediction.
When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
Saturating counter
A saturating counter or bimodal predictor is a state machine with four states:- Strongly not taken
- Weakly not taken
- Weakly taken
- Strongly taken
When a branch is evaluated, the corresponding state machine is updated. Branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The advantage of the two-bit counter over a one-bit scheme is that a conditional jump has to deviate twice from what it has done most in the past before the prediction changes. For example, a loop-closing conditional jump is mispredicted once rather than twice.
The original, non-MMX Intel Pentium processor uses a saturating counter, though with an imperfect implementation.
On the SPEC
Spec
-Specification:* Specification , an explicit set of requirements to be satisfied by a material, product, or service** "Spec sheet" or datasheet used to describe something technical...
'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.
The predictor table is indexed with the instruction address
Memory address
A digital computer's memory, more specifically main memory, consists of many memory locations, each having a memory address, a number, analogous to a street address, at which computer programs store and retrieve, machine code or data. Most application programs do not directly read and write to...
bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
Two-level adaptive predictor
Conditional jumps that are taken every second time or have some other regularly recurring pattern are not predicted well by the saturating counter. A two-level adaptive predictor remembers the history of the last n occurrences of the branch and use one saturating counter for each of the possible 2n history patterns. This method is illustrated in figure 3.Consider the example of n = 2. This means that the last two occurrences of the branch are stored in a 2-bit shift register
Shift register
In digital circuits, a shift register is a cascade of flip flops, sharing the same clock, which has the output of any one but the last flip-flop connected to the "data" input of the next one in the chain, resulting in a circuit that shifts by one position the one-dimensional "bit array" stored in...
. This branch history register can have 4 different binary
Binary numeral system
The binary numeral system, or base-2 number system, represents numeric values using two symbols, 0 and 1. More specifically, the usual base-2 system is a positional notation with a radix of 2...
values: 00, 01, 10, and 11; where 0 means "not taken" and 1 means "taken". Now, we make a pattern history table with four entries, one for each of the 2n = 4 possible branch histories. Each entry in the pattern history table contains a 2-bit saturating counter of the same type as in figure 2. The branch history register is used for choosing which of the four saturating counters to use. If the history is 00 then the first counter is used. If the history is 11 then the last of the four counters is used.
Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a 0. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones.
The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences
Subsequence
In mathematics, a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements...
are different.
The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern. This method was invented by T.-Y. Yeh and Yale Patt
Yale Patt
Yale Nance Patt is an American professor of electrical and computer engineering at The University of Texas at Austin. He holds the Ernest Cockrell, Jr. Centennial Chair in Engineering. In 1965, Patt introduced the WOS module, the first complex logic gate implemented on a single piece of silicon...
of the University of Michigan.
Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors today.
Local branch prediction
A local branch predictor has a separate history buffer for each conditional jump instruction. It may use a two-level adaptive predictor. The history buffer is separate for each conditional jump instruction, while the pattern history table may be separate as well or it may be shared between all conditional jumps.The Intel Pentium MMX, Pentium II
Pentium II
The Pentium II brand refers to Intel's sixth-generation microarchitecture and x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors, the Pentium II featured an improved version of the first P6-generation core of the Pentium Pro, which contained 5.5 million...
and Pentium III
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture introduced on February 26, 1999. The brand's initial processors were very similar to the earlier Pentium II-branded microprocessors...
have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump.
On the SPEC
Spec
-Specification:* Specification , an explicit set of requirements to be satisfied by a material, product, or service** "Spec sheet" or datasheet used to describe something technical...
'89 benchmarks, very large local predictors saturate at 97.1% correct.
Global branch prediction
A global branch predictor does not keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. The advantage of a shared history is that any correlationCorrelation
In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence....
between different conditional jumps is utilized in the prediction making. The disadvantage is that the history is diluted by irrelevant information in case the different conditional jumps are uncorrelated, and that the history buffer may not include any bits from the same branch in case there are many other branches in between. It may use a two-level adaptive predictor.
This scheme is only better than the saturating counter scheme for large table sizes, and it is rarely as good as local prediction. The history buffer must be longer in order to make a good prediction. The size of the pattern history table grows exponentially
Exponential function
In mathematics, the exponential function is the function ex, where e is the number such that the function ex is its own derivative. The exponential function is used to model a relationship in which a constant change in the independent variable gives the same proportional change In mathematics,...
with the size of the history buffer. Hence, the big pattern history table must be shared between all conditional jumps.
A two-level adaptive predictor with globally shared history buffer and pattern history table is called a "gshare" predictor if it xors
XOR gate
The XOR gate is a digital logic gate that implements an exclusive or; that is, a true output results if one, and only one, of the inputs to the gate is true . If both inputs are false or both are true , a false output results. Its behavior is summarized in the truth table shown on the right...
the global history and branch PC, and "gselect" if it concatenates
Concatenation
In computer programming, string concatenation is the operation of joining two character strings end-to-end. For example, the strings "snow" and "ball" may be concatenated to give "snowball"...
them. Global branch prediction is used in AMD
Advanced Micro Devices
Advanced Micro Devices, Inc. or AMD is an American multinational semiconductor company based in Sunnyvale, California, that develops computer processors and related technologies for commercial and consumer markets...
microprocessors and in Intel Pentium M
Pentium M
The Pentium M brand refers to a family of mobile single-core x86 microprocessors introduced in March 2003 , and forming a part of the Intel Carmel notebook platform under the then new Centrino brand...
, Core
Intel Core
Yonah was the code name for Intel's first generation of 65 nm process mobile microprocessors, based on the Banias/Dothan-core Pentium M microarchitecture. SIMD performance has been improved through the addition of SSE3 instructions and improvements to SSE and SSE2 implementations, while integer...
and Core 2
Intel Core 2
Core 2 is a brand encompassing a range of Intel's consumer 64-bit x86-64 single-, dual-, and quad-core microprocessors based on the Core microarchitecture. The single- and dual-core models are single-die, whereas the quad-core models comprise two dies, each containing two cores, packaged in a...
.
On the SPEC'89 benchmarks, very large gshare predictors saturate at 96.6% correct, which is just a little worse than large local predictors.
Alloyed branch prediction
An alloyed branch predictor combines the local and global prediction principles by concatenatingConcatenation
In computer programming, string concatenation is the operation of joining two character strings end-to-end. For example, the strings "snow" and "ball" may be concatenated to give "snowball"...
local and global branch histories, possibly with some bits from the program counter as well. Tests indicate that the VIA Nano
VIA Nano
The VIA Nano is a 64-bit CPU for personal computers. The VIA Nano was released by VIA Technologies in 2008 after five years of development by its CPU division, Centaur Technology...
processor may be using this technique.
Agree predictor
An agree predictor is a two-level adaptive predictor with globally shared history buffer and pattern history table, and an additional local saturating counter. The outputs of the local and the global predictors are XORed with each other to give the final prediction. The purpose is to reduce contentions in the pattern history table where two branches with opposite prediction happen to share the same entry in the pattern history table.The agree predictor was used in the first version of the Intel Pentium 4
Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...
, but was later abandoned.
Hybrid predictor
A hybrid predictor, also called combined predictor, implements more than one prediction mechanism. The final prediction is based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or a majority vote function based on an odd number of different predictors.Scott McFarling proposed combined branch prediction in his 1993
paper.
On the SPEC'89 benchmarks, such a predictor is about as good as the
local predictor.
Predictors like gshare use multiple table entries to track the
behavior of any particular branch. This multiplication of entries
makes it much more likely that two branches will map to the same table
entry (a situation called aliasing), which in turn makes it much more
likely that prediction accuracy will suffer for those branches. Once
you have multiple predictors, it is beneficial to arrange that each
predictor will have different aliasing patterns, so that it is more
likely that at least one predictor will have no aliasing. Combined
predictors with different indexing functions for the different
predictors are called gskew predictors, and are analogous to
skewed associative caches used for data and instruction caching.
Loop predictor
A conditional jump that controls a loop is best predicted with a special loop predictor. A conditional jump in the bottom of a loop that repeats N times will be taken N-1 times and then not taken once. If the conditional jump is placed at the top of the loop, it will be not taken N-1 times and then taken once. A conditional jump that goes many times one way and then the other way once is detected as having loop behavior. Such a conditional jump can be predicted easily with a simple counter. A loop predictor is part of a hybrid predictor where a meta-predictor detects whether the conditional jump has loop behavior.Many microprocessors today have loop predictors.
Prediction of indirect jumps
An indirect jumpIndirect branch
An indirect branch is a type of program control instruction present in some machine language instruction sets. Rather than specifying the address of the next instruction to execute, as in a direct branch, the argument specifies where the address is located...
instruction can choose between more than two branches. Newer processors from Intel and AMD can predict indirect branches by using a two-level adaptive predictor. An indirect jump that chooses between more than two branches contributes more than one bit to the history buffer.
Processors without this mechanism will simply predict an indirect jump to go to the same target as it did last time.
Prediction of function returns
A functionSubroutine
In computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....
will normally return to where it is called from. The return instruction
Return statement
In computer programming, a return statement causes execution to leave the current subroutine and resume at the point in the code immediately after where the subroutine was called, known as its return address. The return address is saved, usually on the process's call stack, as part of the operation...
is an indirect jump that reads its target address from the call stack
Call stack
In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This kind of stack is also known as an execution stack, control stack, run-time stack, or machine stack, and is often shortened to just "the stack"...
. Many microprocessors have a separate prediction mechanism for return instructions. This mechanism is based on a so-called return stack buffer, which is a local mirror of the call stack. The size of the return stack buffer is typically 4 - 16 entries.
Overriding branch prediction
The trade-offTrade-off
A trade-off is a situation that involves losing one quality or aspect of something in return for gaining another quality or aspect...
between fast branch prediction and good branch prediction is sometimes dealt with by having two branch predictors. The first branch predictor is fast and simple. The second branch predictor, which is slower, more complicated, and with bigger tables, will override a possibly wrong prediction made by the first predictor.
The Alpha 21264 and Alpha EV8 microprocessors used a fast single-cycle next line predictor to handle the branch target recurrence and provide a simple and fast branch prediction. Because the next line predictor is so inaccurate, and the branch resolution recurrence takes so long, both cores have two-cycle secondary branch predictors which can override the prediction of the next line predictor at the cost of a single lost fetch cycle.
The Intel Core i7 has two branch target buffers
Branch target predictor
In computer architecture, a branch target predictor is the part of a processor that predicts the target of a taken conditional branch or an unconditional branch instruction before the target of the branch instruction is computed by the execution unit of the processor.Branch target prediction is not...
and possibly two or more branch predictors.
Neural branch predictors
The first dynamic neural branch predictors (LVQ-predictors and perceptrons) were proposed by Prof. Lucian Vintan (Lucian Blaga University of SibiuLucian Blaga University of Sibiu
The Lucian Blaga University of Sibiu was named after the philosopher, poet, and playwright, Lucian Blaga. It was founded in 1990 with five schools: Letters, History and Law, Medicine, Food and Textile Processing Technology, Engineering and Sciences...
, Romania
Romania
Romania is a country located at the crossroads of Central and Southeastern Europe, on the Lower Danube, within and outside the Carpathian arch, bordering on the Black Sea...
), in his paper entitled "Towards a High Performance Neural Branch Predictor", Proceedings of The International Joint Conference on Neural Networks - IJCNN '99, Washington DC, USA, 1999. The neural branch predictor research was strongly developed further by Prof. Daniel Jimenez (Rutgers University
Rutgers University
Rutgers, The State University of New Jersey , is the largest institution for higher education in New Jersey, United States. It was originally chartered as Queen's College in 1766. It is the eighth-oldest college in the United States and one of the nine Colonial colleges founded before the American...
, USA). In 2001, (HPCA Conference) it was the first presented perceptron predictor
Perceptron
The perceptron is a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier.- Definition :...
that was feasible to implement in hardware
Hardware
Hardware is a general term for equipment such as keys, locks, hinges, latches, handles, wire, chains, plumbing supplies, tools, utensils, cutlery and machine parts. Household hardware is typically sold in hardware stores....
.
The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor, see http://cava.cs.utsa.edu/pdfs/micro03_dist.pdf. He also used a gshare/perceptron overriding hybrid predictors.
The main disadvantage of the perceptron predictor is its high latency. Even after taking advantage of high-speed arithmetic tricks, the computation latency is relatively high compared to the clock period of many modern microarchitectures. In order to reduce the prediction latency, Jimenez proposed in 2003 the fast-path neural predictor, where the perceptron predictor chooses its weights according to the current branch’s path, rather than according to the branch’s PC. Many other researchers developed this concept (A. Seznec, M. Monchiero, D. Tarjan & K. Skadron, V. Desmet, Akkary et al., K. Aasaraai, Michael Black, etc.)
The neural branch predictor concept is very promising. Most of the state of the art branch predictors are using a perceptron predictor (see Intel's "Championship Branch Prediction Competition" ). Intel already implements this idea in one of the IA-64's simulators (2003).
History
The IBM Stretch, designed in the late 1950s, pre-executed all unconditional branches and any conditional branches that depended on the index registers. For other conditional branches, the first two production models implemented predict untaken; subsequent models were changed to implement predictions based on the current values of the indicator bits (corresponding to today's condition codes). The Stretch designers had considered static hint bits in the branch instructions early in the project but decided against them. Misprediction recovery was provided by the lookahead unit on Stretch, and part of Stretch's reputation for less-than-stellar performance was blamed on the time required for misprediction recovery. Subsequent IBM large computer designs did not use branch prediction with speculative execution until the IBM 3090 in 1985.Two-bit predictors were introduced by Tom McWilliams and Curt Widdoes in 1977 for the Lawrence Livermore National Lab S-1 supercomputer and independently by Jim Smith in 1979 at CDC.
Microprogrammed processors, popular from the 1960s to the 1980s and beyond, took multiple cycles per instruction, and generally did not require branch prediction. However, along with the IBM 3090, there are several examples of microprogrammed designs that incorporated branch prediction.
The Burroughs B4900, a microprogrammed COBOL machine released in ~1982 was pipelined and used branch prediction. The B4900 branch prediction history state was stored back into the in-memory instructions during program execution. The B4900 implemented 4-state branch prediction by using 4 semantically equivalent branch opcodes to represent each branch operator type. The opcode used indicated the history of that particular branch instruction. If the hardware determined that the branch prediction state of a particular branch needed to be updated, it would rewrite the opcode with the semantically equivalent opcode that hinted the proper history. This scheme obtained a 93% hit rate. US patent 4,435,756 and others were granted on this scheme.
The VAX 9000
VAX 9000
The VAX 9000, code named Aridus, was a family of high-end minicomputers developed and manufactured by Digital Equipment Corporation using processors implementing the VAX instruction set architecture . The VAX 9000 was positioned by Digital as its first mainframe...
, announced in 1989, was both microprogrammed and pipelined, and performed branch prediction.
The first commercial RISC processors, the MIPS
MIPS Technologies
MIPS Technologies, Inc. , formerly MIPS Computer Systems, Inc., is most widely known for developing the MIPS architecture and a series of pioneering RISC chips. MIPS provides processor architectures and cores for digital home, networking and mobile applications.MIPS Computer Systems Inc. was...
R2000
R2000 (microprocessor)
The R2000 is a microprocessor chip set developed by MIPS Computer Systems that implemented the MIPS I instruction set architecture . Introduced in January 1986, it was the first commercial implementation of the MIPS architecture and the first merchant RISC processor available to all companies...
and R3000
R3000
The R3000 is a microprocessor chip set developed by MIPS Computer Systems that implemented the MIPS I instruction set architecture . Introduced in June 1988, it was the second MIPS implementation, succeeding the R2000 as the flagship MIPS microprocessor...
and the earlier SPARC
SPARC
SPARC is a RISC instruction set architecture developed by Sun Microsystems and introduced in mid-1987....
processors, did only trivial "not-taken" branch prediction. Because they used branch delay slots, fetched just one instruction per cycle, and executed in-order, there was no performance loss. Later, the R4000
R4000
The R4000 is a microprocessor developed by MIPS Computer Systems that implemented the MIPS III instruction set architecture . Officially announced on 1 October 1991, it was one of the first 64-bit microprocessors and the first MIPS III implementation...
used the same trivial "not-taken" branch prediction, and lost two cycles to each taken branch because the branch resolution recurrence was four cycles long.
Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064
Alpha 21064
The Alpha 21064 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture . It was introduced as the DECchip 21064 before it was renamed in 1994. The 21064 is also known by its code name, EV4...
, the MIPS R8000
R8000
The R8000 is a microprocessor chipset developed by MIPS Technologies, Inc. , Toshiba, and Weitek. It was the first implementation of the MIPS IV instruction set architecture. The R8000 is also known as the TFP, for Tremendous Floating-Point, its name during development.-History:Development of the...
, and the IBM POWER
IBM POWER
POWER is a reduced instruction set computer instruction set architecture developed by IBM. The name is an acronym for Performance Optimization With Enhanced RISC....
series. These processors all relied on one-bit or simple bimodal predictors.
The DEC Alpha 21264
Alpha 21264
The Alpha 21264 was a Digital Equipment Corporation RISC microprocessor introduced in October, 1996. The 21264 implemented the Alpha instruction set architecture .- Description :...
(EV6) uses a next-line predictor overridden by a combined local predictor and global predictor, where the combining choice is made by a bimodal predictor.
The AMD K8
AMD K8
The AMD K8 is a computer processor microarchitecture designed by AMD as the successor to the AMD K7 microarchitecture. The K8 was the first implementation of the AMD64 64-bit extension to the x86 processor architecture.Processors based on the K8 core include:...
has a combined bimodal and global predictor, where the combining choice is another bimodal predictor. This processor caches the base and choice bimodal predictor counters in bits of the L2 cache otherwise used for ECC. As a result, it has effectively very large base and choice predictor tables, and parity rather than ECC on instructions in the L2 cache. Parity is just fine, since any instruction suffering a parity error can be invalidated and refetched from memory.
The Alpha 21464
Alpha 21464
The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8 or Araña, the latter being its code-name...
(EV8, cancelled late in design) had a minimum branch misprediction penalty of 14 cycles. It was to use a complex but fast next line predictor overridden by a combined bimodal and majority-voting predictor. The majority vote was between the bimodal and two gskew predictors.
See also
- Branch target predictorBranch target predictorIn computer architecture, a branch target predictor is the part of a processor that predicts the target of a taken conditional branch or an unconditional branch instruction before the target of the branch instruction is computed by the execution unit of the processor.Branch target prediction is not...
- Branch prediction analysis attacks (relating to RSA public-key cryptographyPublic-key cryptographyPublic-key cryptography refers to a cryptographic system requiring two separate keys, one to lock or encrypt the plaintext, and one to unlock or decrypt the cyphertext. Neither key will do both functions. One of these keys is published or public and the other is kept private...
) - Instruction unitInstruction unitThe instruction unit in a central processing unit is responsible for organising for program instructions to be fetched from memory, and executed, in an appropriate order...
External links
- Seznec et al. (1996). "Multiple-Block Ahead Branch Predictors" — demonstrates prediction accuracy is not impaired by indexing with previous branch address.
- Seznec et al. (2002). "Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor" — describes the Alpha EV8 branch predictor. This paper does an excellent job discussing how they arrived at their design from various hardware constraints and simulation studies.
- Jimenez (2003). "Reconsidering Complex Branch Predictors" — describes the EV6 and K8 branch predictors, and pipelining considerations.