Compare-and-swap
Encyclopedia
In computer science
, the compare-and-swap (CAS) CPU
instruction is a special instruction that atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple Boolean
response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it). In the x86 and Itanium
architectures this is the compare and exchange (CMPXCHG) instruction, though here the lock prefix should be there to make it really atomic.
(and all successor) architectures since 1970. The operating systems which run on these architectures make extensive use of this instruction to facilitate process (i.e., system and user tasks) and processor (i.e., central processors) parallelism
while eliminating, to the greatest degree possible, the "disabled spin locks" which were employed in earlier IBM operating systems. In these operating systems, new units of work may be instantiated "globally", into the Global Service Priority List, or "locally", into the Local Service Priority List, by the execution of a single Compare-and-Swap instruction. This dramatically improved the responsiveness of these operating systems.
old_reg_val is always returned, but it can be tested following the compare_and_swap operation to see if it matches oldval, as it may be different,
meaning that another process has managed to succeed in a competing compare_and_swap to change the reg value from oldval.
For example,
an election protocol can be done, where every process returns the result of compare_and_swap with its say PID (=newval), except the process
which finds the compare_and_swap returned the initial non-pid value (e.g. zero), in which case it returns its own PID.
This is the logic in the Intel Software Manual Vol 2A.
s and mutexes, as well as more sophisticated lock-free and wait-free algorithms. Maurice Herlihy
(1991) proved that CAS can implement more of these algorithms than atomic read, write, and fetch-and-add
, and that, assuming a fairly large amount of memory, it can implement all of them.
Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again.
. It's possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value. The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.
A general solution to this is to use a double-length CAS (e.g. on a 32 bit system, a 64 bit CAS). The second half is used to hold a counter. The compare part of the operation compares the previously read value of the pointer *and* the counter, to the current pointer and counter. If they match, the swap occurs - the new value is written - but the new value has an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same (for a 32 bit value, a multiple of 2^32 operations would have had to occurred, causing the counter to wrap and at that moment, the pointer value would have to also by chance be the same).
An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist, rather than a full pointer, e.g. with a 32 bit CAS, use a 16 bit index and a 16 bit counter. However, the reduced counter lengths begin to make ABA - especially at modern CPU speeds - likely.
One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.
A more complicated but more effective solution is to implement SMR, Safe Memory Reclamation. This is in effect lock-free garbage collection. The upshot of using SMR is that you can know that a given pointer will only be present once at any one time in the data structure, so the ABA problem is completely solved. (Without SMR, something like a freelist will be in use, to ensure that all data elements can be accessed safely (no memory access violations) even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed).
es.
In multiprocessor systems, it is usually impossible to disable interrupts on all processors at the same time. Even if it were possible, two or more processors could be attempting to access the same semaphore's memory at the same time, and thus atomicity would not be achieved. The compare-and-swap instruction allows any processor to atomically test and modify a memory location, preventing such multiple-processor collisions.
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, the compare-and-swap (CAS) CPU
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...
instruction is a special instruction that atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple Boolean
Boolean logic
Boolean algebra is a logical calculus of truth values, developed by George Boole in the 1840s. It resembles the algebra of real numbers, but with the numeric operations of multiplication xy, addition x + y, and negation −x replaced by the respective logical operations of...
response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it). In the x86 and Itanium
Itanium
Itanium is a family of 64-bit Intel microprocessors that implement the Intel Itanium architecture . Intel markets the processors for enterprise servers and high-performance computing systems...
architectures this is the compare and exchange (CMPXCHG) instruction, though here the lock prefix should be there to make it really atomic.
History of use
Compare-and-Swap (and Compare-and-Swap-Double) has been an integral part of the IBM 370System/370
The IBM System/370 was a model range of IBM mainframes announced on June 30, 1970 as the successors to the System/360 family. The series maintained backward compatibility with the S/360, allowing an easy migration path for customers; this, plus improved performance, were the dominant themes of the...
(and all successor) architectures since 1970. The operating systems which run on these architectures make extensive use of this instruction to facilitate process (i.e., system and user tasks) and processor (i.e., central processors) parallelism
Parallelism
Parallelism may refer to:* Angle of parallelism, the angle at one vertex of a right hyperbolic triangle that has two hyperparallel sides* Conscious parallelism, price-fixing between competitors in an oligopoly that occurs without an actual spoken agreement between the parties* Parallel computing,...
while eliminating, to the greatest degree possible, the "disabled spin locks" which were employed in earlier IBM operating systems. In these operating systems, new units of work may be instantiated "globally", into the Global Service Priority List, or "locally", into the Local Service Priority List, by the execution of a single Compare-and-Swap instruction. This dramatically improved the responsiveness of these operating systems.
Implementation in C
The following C function shows the basic behavior of a compare-and-swap variant that returns the old value of the specified memory location; however, this version does not provide the crucial guarantees of atomicity that a real compare-and-swap operation would:old_reg_val is always returned, but it can be tested following the compare_and_swap operation to see if it matches oldval, as it may be different,
meaning that another process has managed to succeed in a competing compare_and_swap to change the reg value from oldval.
For example,
an election protocol can be done, where every process returns the result of compare_and_swap with its say PID (=newval), except the process
which finds the compare_and_swap returned the initial non-pid value (e.g. zero), in which case it returns its own PID.
This is the logic in the Intel Software Manual Vol 2A.
Usage
CAS is used to implement synchronization primitives like semaphoreSemaphore (programming)
In computer science, a semaphore is a variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel programming environment....
s and mutexes, as well as more sophisticated lock-free and wait-free algorithms. Maurice Herlihy
Maurice Herlihy
Maurice Herlihy is a computer scientist active in the field of multiprocessor synchronization. Herlihy has contributed to the design of concurrent algorithms, and in particular to the exposition and quantification of the properties and uses of hardware synchronization operations...
(1991) proved that CAS can implement more of these algorithms than atomic read, write, and fetch-and-add
Fetch-and-add
In computer science, the fetch-and-add CPU instruction is a special instruction that atomically modifies the contents of a memory location. It is used to implement mutual exclusion and concurrent algorithms in multiprocessor systems, a generalization of semaphores.In uniprocessor systems, it is...
, and that, assuming a fairly large amount of memory, it can implement all of them.
Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again.
ABA Problem
Some of these algorithms are affected by and must handle the problem of a false positive match, or the ABA problemABA problem
In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed"...
. It's possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value. The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.
A general solution to this is to use a double-length CAS (e.g. on a 32 bit system, a 64 bit CAS). The second half is used to hold a counter. The compare part of the operation compares the previously read value of the pointer *and* the counter, to the current pointer and counter. If they match, the swap occurs - the new value is written - but the new value has an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same (for a 32 bit value, a multiple of 2^32 operations would have had to occurred, causing the counter to wrap and at that moment, the pointer value would have to also by chance be the same).
An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist, rather than a full pointer, e.g. with a 32 bit CAS, use a 16 bit index and a 16 bit counter. However, the reduced counter lengths begin to make ABA - especially at modern CPU speeds - likely.
One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.
A more complicated but more effective solution is to implement SMR, Safe Memory Reclamation. This is in effect lock-free garbage collection. The upshot of using SMR is that you can know that a given pointer will only be present once at any one time in the data structure, so the ABA problem is completely solved. (Without SMR, something like a freelist will be in use, to ensure that all data elements can be accessed safely (no memory access violations) even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed).
Benefits
CAS, and other atomic instructions, are sometimes thought to be unnecessary in uniprocessor systems, because the atomicity of any sequence of instructions can be achieved by disabling interrupts while executing it. However, disabling interrupts has numerous downsides. For example, code that is allowed to do so must be trusted not to be malicious and monopolize the CPU, as well as to be correct and not accidentally hang the machine in an infinite loop. Further, disabling interrupts is often deemed too expensive to be practical. Thus, even programs only intended to run on uniprocessor machines will benefit from atomic instructions, as in the case of Linux's futexFutex
A futex is a Linux construct that can be used to implement basic locking, or as a building block for higher-level locking abstractions such as semaphores and POSIX mutexes or condition variables....
es.
In multiprocessor systems, it is usually impossible to disable interrupts on all processors at the same time. Even if it were possible, two or more processors could be attempting to access the same semaphore's memory at the same time, and thus atomicity would not be achieved. The compare-and-swap instruction allows any processor to atomically test and modify a memory location, preventing such multiple-processor collisions.
Extensions
Since CAS operates on a single pointer-sized memory location, while most lock-free and wait-free algorithms need to modify multiple locations, several extensions have been implemented.- Double compare-and-swap compares two unrelated memory locations with two expected values, and if they're equal, sets both locations to new values. This operation only exists on Motorola 680x0 processors, but in early papers was often required.
- Double-wide compare-and-swap operates on two adjacent pointer-sized locations (or, equivalently, one location twice as big as a pointer). On later x86 processors, the CMPXCHG8B and CMPXCHG16B instructions serve this role, although early 64-bit AMD CPUs did not support CMPXCHG16B (modern AMD CPUs do).
- Single compare, double swap compares one pointer but writes two. The Itanium's cmp8xchg16 instruction implements this, where the two written pointers are adjacent.
- Multi-word compare-and-swap is a generalisation of normal compare-and-swap. It can be used to atomically swap an arbitrary number of arbitrarily located memory locations. Usually, multi-word compare-and-swap is implemented in software using normal double-wide compare-and-swap operations. The drawback of this approach is a lack of scalability.
See also
- Fetch-and-addFetch-and-addIn computer science, the fetch-and-add CPU instruction is a special instruction that atomically modifies the contents of a memory location. It is used to implement mutual exclusion and concurrent algorithms in multiprocessor systems, a generalization of semaphores.In uniprocessor systems, it is...
- Load-link/store-conditionalLoad-Link/Store-ConditionalIn computer science, load-link and store-conditional are a pair of instructions that together implement a lock-free atomic read-modify-write operation....
- Non-blocking synchronizationNon-blocking synchronizationIn computer science, a non-blocking algorithm ensures that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion...
- Conditional Put and Delete in AmazonAmazon.comAmazon.com, Inc. is a multinational electronic commerce company headquartered in Seattle, Washington, United States. It is the world's largest online retailer. Amazon has separate websites for the following countries: United States, Canada, United Kingdom, Germany, France, Italy, Spain, Japan, and...
's SimpleDB
External links
- compare_and_swap Kernel Service
- "Lock-Free and Practical Deques using Single-Word Compare-And-Swap" by Håkan Sundell, Philippas Tsigas
- " Fast, Reactive and Lock-free: Multi-word Compare-and-swap Algorithms" by Phuong Ha-Hoai, Philippas Tsigas
- "A Practical Multi-Word Compare-And-Swap Operation" by Timothy L. Harris, Keir Fraser, Ian A. Pratt
- "Lock-Free Linked Lists Using Compare-and-Swap" by John D. Valois
- "A Nonblocking Algorithm for Shared Queues Using Compare-and-Swap" by S. Prakash, Yann Hang Lee, T. Johnson
- A FOSSFossFoss may refer toPeople*Foss , people with the last name Foss*Foss Shanahan , New Zealand diplomat*Foss Westcott , English bishop...
(BSD) C# implementation of a Lock-Free Queue - Discussion "Lock-Free using cmpxchg8b..."
- Java package
- Java method