Memory disambiguation
Memory disambiguation is a set of techniques employed by high-performance out-of-order execution microprocessors that execute memory access instructions (loads and stores) out of program order. The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated. They also eliminate spurious memory dependencies and allow for greater instruction-level parallelism by allowing safe out-of-order execution of loads and stores.
Dependencies
When attempting to execute instructions out of order, a microprocessor must respect true dependencies between instructions. For example, consider a simple true dependence:

1: add $1, $2, $3 # R1 <= R2 + R3
2: add $5, $1, $4 # R5 <= R1 + R4 (dependent on 1)
In this example, the add instruction on line 2 is dependent on the add instruction on line 1 because the register R1 is a source operand of the addition operation on line 2. The add on line 2 cannot execute until the add on line 1 completes. In this case, the dependence is static and easily determined by a microprocessor, because the sources and destinations are registers. The destination register of the add instruction on line 1 (R1) is part of the instruction encoding, and so can be determined by the microprocessor early on, during the decode stage of the pipeline. Similarly, the source registers of the add instruction on line 2 (R1 and R4) are also encoded into the instruction itself and are determined in decode. To respect this true dependence, the microprocessor's scheduler logic will issue these instructions in the correct order (instruction 1 first, followed by instruction 2) so that the results of 1 are available when instruction 2 needs them.

Complications arise when the dependence is not statically determinable. Such non-static dependencies arise with memory instructions (loads and stores) because the location of the operand may be indirectly specified as a register operand rather than directly specified in the instruction encoding itself.
1: store ($2), $1 # Mem[R2] <= R1
2: load $3, ($4) # R3 <= Mem[R4] (possibly dependent on 1)
Here, the store instruction writes a value to the memory location specified by the value in R2, and the load instruction reads the value at the memory location specified by the value in R4. The microprocessor cannot statically determine, prior to execution, if the memory locations specified in these two instructions are different, or are the same location, because the locations depend on the values in R2 and R4. If the locations are different, the instructions are independent and can be successfully executed out of order. However, if the locations are the same, then the load instruction is dependent on the store to produce its value. This is known as an ambiguous dependence.
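As a concrete illustration, the decision the hardware faces can be written as a hypothetical Python sketch over the register file (the register numbers follow the example above); the comparison can only be made at execution time, once the values in R2 and R4 are available:

def load_depends_on_store(regfile):
    store_addr = regfile[2]   # store ($2), $1 writes Mem[R2]
    load_addr = regfile[4]    # load $3, ($4) reads Mem[R4]
    # Knowable only once both address registers have been read.
    return store_addr == load_addr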
Out-of-order execution and memory access operations
Executing loads and stores out of order can produce incorrect results if a dependent load/store pair was executed out of order. Consider the following code snippet, given in MIPS assembly:
1: div $27, $20
2: sw ($30), $27
3: lw $08, ($31)
4: sw ($30), $26
5: lw $09, ($31)
Assume that the scheduling logic will issue an instruction to the execution unit when all of its register operands are ready. Further assume that registers $30 and $31 are ready: the values in $30 and $31 were computed a long time ago and have not changed. However, assume $27 is not ready: its value is still in the process of being computed by the div (integer divide) instruction. Finally, assume that registers $30 and $31 hold the same value, and thus all the loads and stores in the snippet access the same memory word.

In this situation, the sw ($30), $27 instruction on line 2 is not ready to execute, but the lw $08, ($31) instruction on line 3 is ready. If the processor allows the lw instruction to execute before the sw, the load will read an old value from the memory system; however, it should have read the value that was just written there by the sw. The load and store executed out of program order, but there was a memory dependence between them that was violated.

Similarly, assume that register $26 is ready. The sw ($30), $26 instruction on line 4 is also ready to execute, and it may execute before the preceding lw $08, ($31) on line 3. If this occurs, the lw $08, ($31) instruction will read the wrong value from the memory system, since a later store instruction wrote its value there before the load executed.

Characterization of memory dependencies
Memory dependencies come in three flavors:

- Read-After-Write (RAW) dependencies: Also known as true dependencies, RAW dependencies arise when a load operation reads a value from memory that was produced by the most recent preceding store to the same address.
- Write-After-Read (WAR) dependencies: Also known as anti dependencies, WAR dependencies arise when a store operation writes to a memory location that a preceding load reads from.
- Write-After-Write (WAW) dependencies: Also known as output dependencies, WAW dependencies arise when two store operations write values to the same memory address.
The three dependencies are shown in the preceding code segment (reproduced for clarity):
1: div $27, $20
2: sw ($30), $27
3: lw $08, ($31)
4: sw ($30), $26
5: lw $09, ($31)
- The lw $08, ($31) instruction on line 3 has a RAW dependence on the sw ($30), $27 instruction on line 2, and the lw $09, ($31) instruction on line 5 has a RAW dependence on the sw ($30), $26 instruction on line 4. Both load instructions read the memory address that the preceding stores wrote. The stores were the most recent producers to that memory address, and the loads are reading that memory address's value.
- The sw ($30), $26 instruction on line 4 has a WAR dependence on the lw $08, ($31) instruction on line 3, since it writes the memory address that the preceding load reads from.
- The sw ($30), $26 instruction on line 4 has a WAW dependence on the sw ($30), $27 instruction on line 2, since both stores write to the same memory address.
Memory disambiguation mechanisms
Modern microprocessors use the following mechanisms, implemented in hardware, to resolve ambiguous dependences and recover when a dependence was violated.
Avoiding WAR and WAW dependencies
Values from store instructions are not committed to the memory system (in modern microprocessors, the CPU cache) when they execute. Instead, the store instructions, including the memory address and store data, are buffered in a store queue until they reach the retirement point. When a store retires, it then writes its value to the memory system. This avoids the WAR and WAW dependence problems shown in the code snippet above, where an earlier load receives an incorrect value from the memory system because a later store was allowed to execute before the earlier load.
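In outline, this buffering discipline can be sketched in a few lines of Python (an illustrative toy model, not any real core's design: entries are allocated in program order at dispatch, filled in when the store executes, and drained to the memory system only at retirement):

from collections import deque

class StoreQueue:
    def __init__(self):
        self.entries = deque()   # program order: oldest entry first

    def dispatch_store(self, seq):
        # Entries are allocated in program order, before the store's
        # address and data are known.
        self.entries.append({"seq": seq, "addr": None, "data": None})

    def execute_store(self, seq, addr, data):
        # Executing a store only fills in its queue entry; the memory
        # system is not touched, so a store cannot clobber the value an
        # earlier, still-waiting load needs to read.
        for entry in self.entries:
            if entry["seq"] == seq:
                entry["addr"], entry["data"] = addr, data

    def retire_store(self, memory):
        # At the retirement point, the oldest store finally commits its
        # value to the memory system (modeled here as a plain dict).
        entry = self.entries.popleft()
        memory[entry["addr"]] = entry["data"]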
Additionally, buffering stores until retirement allows processors to speculatively execute store instructions that follow an instruction that may produce an exception (such as a load of a bad address, divide by zero, etc.) or a conditional branch instruction whose direction (taken or not taken) is not yet known. If the exception-producing instruction has not executed or the branch direction was predicted incorrectly, the processor will have fetched and executed instructions on a "wrong path." These instructions should not have been executed at all; the exception condition should have occurred before any of the speculative instructions executed, or the branch should have gone the other direction and caused different instructions to be fetched and executed. The processor must "throw away" any results from the bad-path, speculatively executed instructions when it discovers the exception or branch misprediction. The complication for stores is that any stores on the bad or mispredicted path should not have committed their values to the memory system; if the stores had committed their values, it would be impossible to "throw away" the commit, and the memory state of the machine would be corrupted by data from a store instruction that should not have executed.
Thus, without store buffering, stores cannot execute until all previous possibly-exception-causing instructions have executed (and not caused an exception) and all previous branch directions are known. Forcing stores to wait until branch directions and exceptions are known significantly reduces the out-of-order aggressiveness and limits instruction-level parallelism (ILP) and performance. With store buffering, stores can execute ahead of exception-causing or unresolved branch instructions, buffering their data in the store queue but not committing their values until retirement. This prevents stores on mispredicted or bad paths from committing their values to the memory system while still offering the increased ILP and performance from full out-of-order execution of stores.
Store to load forwarding
Buffering stores until retirement avoids WAW and WAR dependencies but introduces a new issue. Consider the following scenario: a store executes and buffers its address and data in the store queue. A few instructions later, a load executes that reads from the same memory address to which the store just wrote. If the load reads its data from the memory system, it will read an old value that would have been overwritten by the preceding store. The data obtained by the load will be incorrect.

To solve this problem, processors employ a technique called store-to-load forwarding using the store queue. In addition to buffering stores until retirement, the store queue serves a second purpose: forwarding data from completed but not-yet-retired ("in-flight") stores to later loads. Rather than a simple FIFO queue, the store queue is really a content-addressable memory (CAM) searched using the memory address. When a load executes, it searches the store queue for in-flight stores to the same address that are logically earlier in program order. If a matching store exists, the load obtains its data value from that store instead of the memory system. If there is no matching store, the load accesses the memory system as usual; any preceding, matching stores must have already retired and committed their values. This technique allows loads to obtain correct data if their producer store has completed but not yet retired.
Multiple stores to the load's memory address may be present in the store queue. To handle this case, the store queue is priority encoded to select the latest store that is logically earlier than the load in program order. The determination of which store is "latest" can be achieved by attaching some sort of timestamp to the instructions as they are fetched and decoded, or alternatively by knowing the relative position (slot) of the load with respect to the oldest and newest stores within the store queue.
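Putting the CAM search and the priority selection together, a hedged sketch of store-to-load forwarding might look as follows. It reuses the StoreQueue sketch above, models the CAM search as a scan, and assumes (as real hardware cannot) that every older store has already computed its address:

def execute_load(load_seq, addr, store_queue, memory):
    matches = [e for e in store_queue.entries
               if e["addr"] == addr and e["seq"] < load_seq]
    if matches:
        # Forward from the latest store that is logically earlier than
        # the load (the priority-encoding step).
        return max(matches, key=lambda e: e["seq"])["data"]
    # No in-flight producer: any matching store has already retired and
    # committed, so the memory system holds the correct value.
    return memory.get(addr, 0)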
Detecting RAW dependence violations
Modern out-of-order CPUs can use a number of techniques to detect a RAW dependence violation, but all techniques require tracking in-flight loads from execution until retirement. When a load executes, it accesses the memory system and/or store queue to obtain its data value, and then its address and data are buffered in a load queue until retirement. The load queue is similar in structure and function to the store queue, and in fact in some processors may be combined with the store queue in a single structure called a load-store queue, or LSQ. The following techniques are used or have been proposed to detect RAW dependence violations:

Load queue CAM search
With this technique, the load queue, like the store queue, is a CAM searched using the memory access address, and keeps track of all in-flight loads. When a store executes, it searches the load queue for completed loads from the same address that are logically later in program order. If such a matching load exists, it must have executed before the store and thus read an incorrect, old value from the memory system/store queue. Any instructions that used the load's value have also used bad data. To recover if such a violation is detected, the load is marked as "violated" in the retirement buffer. The store remains in the store queue and retirement buffer and retires normally, committing its value to the memory system when it retires. However, when the violated load reaches the retirement point, the processor flushes the pipeline and restarts execution from the load instruction. At this point, all previous stores have committed their values to the memory system. The load instruction will now read the correct value from the memory system, and any dependent instructions will re-execute using the correct value.
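A sketch of this search, assuming the load queue is simply a list of entries with "seq", "addr", and "violated" fields that are filled in as loads execute:

def on_store_execute(store_seq, store_addr, load_queue):
    # Any completed, logically later load to the store's address must
    # have read a stale value; mark it so the pipeline is flushed when
    # that load reaches the retirement point.
    for load in load_queue:
        if load["addr"] == store_addr and load["seq"] > store_seq:
            load["violated"] = True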
This technique requires an associative search of the load queue on every store execution, which consumes circuit power and can prove to be a difficult timing path for large load queues. However, it does not require any additional memory (cache) ports or create resource conflicts with other loads or stores that are executing.
Disambiguation at retirement
With this technique, load instructions that have executed out-of-order are re-executed (they access the memory system and read the value from their address a second time) when they reach the retirement point. Since the load is now the retiring instruction, it has no dependencies on any instruction still in-flight; all stores ahead of it have committed their values to the memory system, and so any value read from the memory system is guaranteed to be correct. The value read from memory at re-execution time is compared to the value obtained when the load first executed. If the values are the same, the original value was correct and no violation has occurred. If the re-execution value differs from the original value, a RAW violation has occurred and the pipeline must be flushed because instructions dependent on the load have used an incorrect value.
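A sketch of the retire-time check follows; the recovery hook here is hypothetical, standing in for the real pipeline's squash-and-refetch machinery:

def flush_pipeline_and_restart_at(seq):
    # Hypothetical recovery hook: squash younger instructions and
    # refetch beginning with the violating load.
    pass

def retire_load(load, memory):
    # All older stores have committed by now, so this re-read returns
    # the architecturally correct value.
    correct = memory.get(load["addr"], 0)
    if correct != load["value"]:
        # RAW violation: dependent instructions consumed a stale value.
        flush_pipeline_and_restart_at(load["seq"])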
This technique is conceptually simpler than the load queue search, and it eliminates a second CAM and its power-hungry search (the load queue can now be a simple FIFO queue). Since the load must re-access the memory system just before retirement, the access must be very fast, so this scheme relies on a fast cache. No matter how fast the cache is, however, the second memory system access for every out-of-order load instruction does increase instruction retirement latency and increases the total number of cache accesses that must be performed by the processor. The additional retire-time cache access can be satisfied by re-using an existing cache port; however, this creates port resource contention with other loads and stores in the processor trying to execute, and thus may cause a decrease in performance. Alternatively, an additional cache port can be added just for load disambiguation, but this increases the complexity, power, and area of the cache. Some recent work (Roth 2005) has shown ways to filter many loads from re-executing if it is known that no RAW dependence violation could have occurred; such a technique would help reduce or eliminate such latency and resource contention.
A minor benefit of this scheme (compared to a load-queue search) is that it will not flag a RAW dependence violation and trigger a pipeline flush if a store that would have caused a RAW dependence violation (the store's address matches an in-flight load's address) has a data value that matches the data value already in the cache. In the load-queue search scheme, an additional data comparison would need to be added to the load-queue search hardware to prevent such a pipeline flush.
Avoiding RAW dependence violations
CPUs that fully support out-of-order execution of loads and stores must be able to detect RAW dependence violations when they occur. However, many CPUs avoid this problem by forcing all loads and stores to execute in-order, or by supporting only a limited form of out-of-order load/store execution. This approach offers lower performance compared to supporting full out-of-order load/store execution, but it can significantly reduce the complexity of the execution core and caches.

The first option, making loads and stores go in-order, avoids RAW dependences because there is no possibility of a load executing before its producer store and obtaining incorrect data. Another possibility is to effectively break loads and stores into two operations: address generation and cache access. With these two separate but linked operations, the CPU can allow loads and stores to access the memory system only once all previous loads and stores have had their address generated and buffered in the LSQ. After address generation, there are no longer any ambiguous dependences, since all addresses are known, and so dependent loads will not be executed until their corresponding stores complete. This scheme still allows for some "out-of-orderness": the address generation operations for any in-flight loads and stores can execute out of order, and once addresses have been generated, the cache accesses for each load or store can happen in any order that respects the (now known) true dependences.
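The gating rule of the second option reduces to a single predicate, sketched here under the assumption that the LSQ is a list of entries whose "addr" field remains None until the corresponding address-generation operation has executed:

def can_access_cache(op, lsq):
    # A load or store may access the cache only once every older memory
    # operation has generated its address, so no ambiguity remains.
    return all(e["addr"] is not None
               for e in lsq if e["seq"] < op["seq"])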
Memory dependence prediction
Processors that fully support out-of-order load/store execution can use an additional, related technique, called memory dependence prediction, to attempt to predict true dependences between loads and stores before their addresses are known. Using this technique, the processor can prevent loads that are predicted to be dependent on an in-flight store from executing before that store completes, avoiding a RAW dependence violation and thus avoiding the pipeline flush and the performance penalty that is incurred. See the memory dependence prediction article for more details.
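To give a flavor of the idea only, a toy predictor can be as simple as remembering which load program counters have caused violations in the past; real predictors, such as store sets, are considerably more sophisticated:

violating_load_pcs = set()   # PCs of loads that have violated before

def predict_dependent(load_pc):
    # A load predicted dependent is held until older stores complete.
    return load_pc in violating_load_pcs

def train_on_violation(load_pc):
    # Called when the RAW-violation detection hardware catches this load.
    violating_load_pcs.add(load_pc)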
See also
- Instruction pipeline
- Out-of-order execution
- CPU cache
- Memory dependence prediction