Cache pollution
Cache pollution describes situations where an executing computer program loads data into the CPU cache unnecessarily, causing other, still-needed data to be evicted from the cache into lower levels of the memory hierarchy, potentially all the way down to main memory, and thereby degrading performance.
Example
Consider the following illustration:

    T[0] = T[0] + 1;
    for i in 0..SIZEOF(CACHE)
        C[i] = C[i] + 1;
    T[0] = T[0] + C[SIZEOF(CACHE)-1];

(The assumptions here are that the cache is composed of only one level, it is unlocked, the replacement policy is pseudo-LRU, all data is cacheable, the set associativity of the cache is N (where N > 1), and at most one processor register is available to contain program values).
Right before the loop starts, T[0] is fetched from memory into the cache and its value is updated. However, as the loop executes, the data elements it references fill the whole cache to capacity, so the cache block containing T[0] has to be evicted. Thus, the next time the program requests T[0] to be updated, the cache misses, and the cache controller has to request the data bus to bring the corresponding cache block from main memory again.
In this case the cache is said to be "polluted": the loop's data displaces a block that is needed again shortly afterwards. Simply changing the pattern of data accesses (e.g. moving the first update of T[0] so that it sits between the loop and the second update) eliminates the inefficiency.
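The effect of reordering the accesses can be checked with a toy cache model. The following Python sketch is an illustration only, under simplifying assumptions that are not part of the example above: true LRU replacement instead of pseudo-LRU, one data element per cache block, and a hypothetical cache of 4 sets with 2 ways. The names LRUCache and count_t0_misses are invented for this sketch.

```python
from collections import OrderedDict

NUM_SETS, WAYS = 4, 2                  # hypothetical cache geometry
CAPACITY = NUM_SETS * WAYS             # number of blocks the cache can hold
T0 = 0                                 # block address of T[0]
C = [1 + i for i in range(CAPACITY)]   # block addresses of C[0..CAPACITY-1]


class LRUCache:
    """Set-associative cache with true LRU replacement, one element per block."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Touch block addr; return True on a hit, False on a miss."""
        s = self.sets[addr % self.num_sets]
        if addr in s:
            s.move_to_end(addr)        # becomes most recently used
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)      # evict the least recently used block
        s[addr] = None
        return False


def count_t0_misses(trace):
    """Replay a trace of block addresses and count the misses on T[0]."""
    cache = LRUCache(NUM_SETS, WAYS)
    misses = 0
    for addr in trace:
        hit = cache.access(addr)
        if addr == T0 and not hit:
            misses += 1
    return misses


# Original order: update T[0], run the loop, then read C[last] and update T[0].
original = [T0] + C + [C[-1], T0]
# Reordered: run the loop first, then perform both updates of T[0].
reordered = C + [T0, C[-1], T0]

print(count_t0_misses(original))   # 2: T[0] is evicted by the loop and refetched
print(count_t0_misses(reordered))  # 1: only the unavoidable first miss remains
```

In this model the reordered trace still pays one miss on T[0], because the block has to be brought in at least once; what disappears is the second, avoidable miss.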
Solutions
Other solutions to this problem involve the use of specialized hardware instructions such as "lvxl", provided by PowerPC AltiVec. This instruction loads a 128-bit value into a register and marks the corresponding cache block as "least recently used", i.e. as the prime candidate for eviction when a block has to be evicted from its cache set. To use that instruction appropriately in the context of the above example, the data elements referenced by the loop would have to be loaded using this instruction. Implemented in this manner, cache pollution would not take place, since executing the loop would not cause premature eviction of T[0] from the cache. As the loop progresses, the elements of C that map to a given cache set keep reusing the same way, leaving the actually older (but not marked as "least recently used") data intact in the other way(s). Only the oldest data (not pertinent to the example given) would be evicted from the cache, and T[0] is not part of it, since its update occurs right before the loop starts.
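The behaviour described above can be imitated in a small simulation. The Python sketch below is not a model of the actual PowerPC hardware; it assumes true LRU replacement, one element per cache block, and a hypothetical cache of 4 sets with 2 ways. The mark_lru flag is an invented stand-in for a load that leaves its block in the least-recently-used position, and t0_misses is a hypothetical helper name.

```python
from collections import OrderedDict

NUM_SETS, WAYS = 4, 2                          # hypothetical cache geometry
T0 = 0                                         # block address of T[0]
C = [1 + i for i in range(NUM_SETS * WAYS)]    # block addresses of C


def t0_misses(use_hint):
    """Replay the example and count misses on T[0].

    If use_hint is True, the loop's loads leave their block in the
    least-recently-used position (an imitation of the hint described above).
    """
    # Each set is ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0

    def access(addr, mark_lru=False):
        nonlocal misses
        s = sets[addr % NUM_SETS]
        if addr in s:
            s.move_to_end(addr)                # hit: most recently used
        else:
            if addr == T0:
                misses += 1
            if len(s) >= WAYS:
                s.popitem(last=False)          # evict least recently used
            s[addr] = None
        if mark_lru:
            s.move_to_end(addr, last=False)    # demote to eviction candidate

    access(T0)                                 # first update of T[0]
    for addr in C:                             # the loop over C
        access(addr, mark_lru=use_hint)
    access(C[-1])                              # read C[last] for the final sum
    access(T0)                                 # second update of T[0]
    return misses


print(t0_misses(use_hint=False))  # 2: the loop evicts T[0]
print(t0_misses(use_hint=True))   # 1: C blocks replace each other, T[0] survives
```

With the hint enabled, each new C block in a set evicts the previous C block in that set rather than T[0], which is the effect the paragraph above attributes to the instruction.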
Yet other possible solutions involve the operating system. For example, the pages in main memory that correspond to the C data array can be marked as "caching inhibited" or, in other words, non-cacheable. However, in the above example such manipulations do not appear to improve execution performance, since the resulting run-time overhead is overwhelmingly larger than any gain achievable by avoiding cache pollution (unless the memory region was non-cacheable to begin with).
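A rough cost model makes the imbalance concrete. The numbers in the Python sketch below are illustrative assumptions only (hit and memory latencies, elements per block, and array size are all invented, and write-back traffic is ignored); the point is the order of magnitude, not the exact values.

```python
HIT_COST = 1          # assumed cycles for a cache hit (illustrative)
MEM_COST = 100        # assumed cycles for a main-memory access (illustrative)
ELEMS_PER_BLOCK = 8   # assumed number of C elements per cache block
N_ELEMS = 1024        # assumed number of C elements touched by the loop


def loop_cost_cached():
    """Loop cost when C is cacheable: one miss per block, the rest hit."""
    blocks = N_ELEMS // ELEMS_PER_BLOCK
    accesses = 2 * N_ELEMS                 # one load and one store per element
    return blocks * MEM_COST + (accesses - blocks) * HIT_COST


def loop_cost_uncached():
    """Loop cost when C is non-cacheable: every access goes to main memory."""
    return 2 * N_ELEMS * MEM_COST


# Gain from avoiding pollution: the second update of T[0] hits instead of missing.
gain = MEM_COST - HIT_COST
# Price paid for it: the extra cost of running the loop without the cache.
overhead = loop_cost_uncached() - loop_cost_cached()

print(gain)      # 99
print(overhead)  # 190080
```

Under these assumed figures the overhead exceeds the gain by more than three orders of magnitude, consistent with the observation that making C non-cacheable does not pay off in this example.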
In practice the cache is often composed of more than one level (called "L1", "L2", etc.). Therefore, "cache pollution" is well-defined only in situations where the term "cache" is unambiguous. Otherwise, it is necessary to specify which level of cache is involved.
Increasing importance
Cache pollution control has been increasing in importance because the penalties caused by the so-called "memory wall" keep on growing. Chip manufacturers continue devising new tricks to overcome the ever-increasing relative memory-to-CPU latency. They do that by increasing cache sizes and by providing useful ways for software engineers to control the way data arrives and stays at the CPU. Cache pollution control is one of the numerous devices available to the (mainly embedded) programmer. However, other methods, most of which are proprietary and highly hardware- and application-specific, are used as well.

Sometimes, however, even the most evolved and involved software is insufficient, or the very effort of software optimization crosses cost tolerance thresholds or profitability ratios, as explained by the law of diminishing returns. In those cases the ball shifts back to the court of hardware engineers, and the "trick game" starts again. The relatively recent surge of interest in the system-on-a-chip concept (such as the Cell processor) is fueled by these issues.