Race condition
Encyclopedia
A race condition or race hazard
Hazard (logic)
In digital logic, a hazard in a system is an undesirable effect caused by either a deficiency in the system or external influences. Logic hazards are manifestations of a problem in which changes in the input variables do not change the output correctly due to some form of delay caused by logic...

is a flaw in an electronic system
System
System is a set of interacting or interdependent components forming an integrated whole....

 or process whereby the output or result of the process is unexpectedly and critically dependent on the sequence or timing
Timing
Timing is the time when something happens or the spacing of events in time. Some typical uses are:* The act of measuring the elapsed time of something or someone, often at athletic events such as swimming or running, where participants are timed with a device such as a stopwatch...

 of other events. The term originates with the idea of two signals racing each other to influence the output
Output
Output is the term denoting either an exit or changes which exit a system and which activate/modify a process. It is an abstract concept, used in the modeling, system design and system exploitation.-In control theory:...

 first.

Race conditions can occur in electronics
Electronics
Electronics is the branch of science, engineering and technology that deals with electrical circuits involving active electrical components such as vacuum tubes, transistors, diodes and integrated circuits, and associated passive interconnection technologies...

 systems, especially logic circuits
Logic gate
A logic gate is an idealized or physical device implementing a Boolean function, that is, it performs a logical operation on one or more logic inputs and produces a single logic output. Depending on the context, the term may refer to an ideal logic gate, one that has for instance zero rise time and...

, and in computer
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...

 software, especially multithreaded
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

 or distributed programs.

Electronics

A typical example of a race condition may occur in a system of logic gate
Logic gate
A logic gate is an idealized or physical device implementing a Boolean function, that is, it performs a logical operation on one or more logic inputs and produces a single logic output. Depending on the context, the term may refer to an ideal logic gate, one that has for instance zero rise time and...

s, where inputs vary. If a particular output depends on the state of the inputs, it may only be defined for steady-state signals. As the inputs change state, a small delay will occur before the output changes, due to the physical nature of the electronic system. For a brief period, the output may change to an unwanted state before settling back to the designed state. Certain systems can tolerate such glitch
Glitch
A glitch is a short-lived fault in a system. It is often used to describe a transient fault that corrects itself, and is therefore difficult to troubleshoot...

es
, but if for example this output functions as a clock signal
Clock signal
In electronics and especially synchronous digital circuits, a clock signal is a particular type of signal that oscillates between a high and a low state and is utilized like a metronome to coordinate actions of circuits...

 for further systems that contain memory, the system can rapidly depart from its designed behaviour (in effect, the temporary glitch becomes permanent glitch).

For example, consider a two input AND gate
AND gate
The AND gate is a basic digital logic gate that implements logical conjunction - it behaves according to the truth table to the right. A HIGH output results only if both the inputs to the AND gate are HIGH . If neither or only one input to the AND gate is HIGH, a LOW output results...

 fed with a logic signal A on one input and its negation, NOT A, on another input. In theory, the output (A AND NOT A) should never be high. However, if changes in the value of A take longer to propagate to the second input than the first when A changes from false to true, a brief period will ensue during which both inputs are true, and so the gate's output will also be true.

Proper design techniques (e.g. Karnaugh map
Karnaugh map
The Karnaugh map , Maurice Karnaugh's 1953 refinement of Edward Veitch's 1952 Veitch diagram, is a method to simplify Boolean algebra expressions...

s) encourage designers to recognize and eliminate race conditions before they cause problems.

As well as these problems, some logic elements can enter metastable state
Metastability in electronics
Metastability in electronics is the ability of a digital electronic system to persist for an unbounded time in an unstable equilibrium or metastable state....

s, which create further problems for circuit designers.

Critical and non-critical race conditions

A critical race occurs when the order in which internal variables are changed determines the eventual state that the state machine
Finite state machine
A finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...

 will end up in.

A non-critical race occurs when the order in which internal variables are changed does not alter the eventual state. In other words, a non-critical race occurs when moving to a desired state means that more than one internal state variable must be changed at once, but no matter in what order these internal state variables change, the resultant state will be the same.

Static, dynamic, and essential race conditions

Static race conditions : These are caused when a signal and its complement are combined together.
Dynamic race conditions : These result in multiple transitions when only one is intended. They are due to interaction between gates (Dynamic race conditions can be eliminated by using not more than two levels of gating).
Essential race conditions : These are caused when an input has two transitions in less than the total feedback propagation time. Sometimes they are cured using inductive delay-line elements to effectively increase the time duration of an input signal.

Computing

Race conditions arise in software when separate processes or threads
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

 of execution depend on some shared state. Operations upon shared states are critical section
Critical section
In concurrent programming a critical section is a piece of code that accesses a shared resource that must not be concurrently accessed by more than one thread of execution. A critical section will usually terminate in fixed time, and a thread, task or process will have to wait a fixed time to...

s that must be mutually exclusive
Mutually exclusive
In layman's terms, two events are mutually exclusive if they cannot occur at the same time. An example is tossing a coin once, which can result in either heads or tails, but not both....

 in order to avoid harmful collision between processes or threads that share those states.

As a simple example let us assume that two threads T1 and T2 each want to increment the value of a global integer by one. Ideally, the following sequence of operations would take place:
  1. Integer i = 0; (memory)
  2. T1 reads the value of i from memory into register1: 0
  3. T1 increments the value of i in register1: (register1 contents) + 1 = 1
  4. T1 stores the value of register1 in memory: 1
  5. T2 reads the value of i from memory into register2: 1
  6. T2 increments the value of i in register2: (register2 contents) + 1 = 2
  7. T2 stores the value of register2 in memory: 2
  8. Integer i = 2; (memory)


In the case shown above, the final value of i is 2, as expected. However, if the two threads run simultaneously without locking or synchronization, the outcome of the operation could be wrong. The alternative sequence of operations below demonstrates this scenario:
  1. Integer i = 0; (memory)
  2. T1 reads the value of i from memory into register1: 0
  3. T2 reads the value of i from memory into register2: 0
  4. T1 increments the value of i in register1: (register1 contents) + 1 = 1
  5. T2 increments the value of i in register2: (register2 contents) + 1 = 1
  6. T1 stores the value of register1 in memory: 1
  7. T2 stores the value of register2 in memory: 1
  8. Integer i = 1; (memory)


The final value of i is 1 instead of the expected result of 2. This occurs because the increment operations of the second case are not mutually exclusive. Mutually-exclusive operations are those that cannot be interrupted while accessing some resource such as a memory location. In the first case, T1 was not interrupted while accessing the variable i, so its operation was mutually-exclusive.
For another example, consider the following two tasks, in pseudocode
Pseudocode
In computer science and numerical computation, pseudocode is a compact and informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a programming language, but is intended for human reading rather than machine reading...

:

global int A = 0

// increments the value of A and print "RX"
// activated whenever an interrupt is received from the serial controller
task Received
A = A + 1
print "RX"
end task

// prints out only the even numbers
// is activated every second
task Timeout
if (A is divisible by 2)
print A
end if
end task


Output would look something like:

0
0
0
RX
RX
2
RX
RX
4
4

Now consider this chain of events, which might occur next:
  1. timeout occurs, activating task Timeout
  2. task Timeout evaluates A and finds it is divisible by 2, so elects to execute the "print A" next.
  3. data is received on the serial port, causing an interrupt and a switch to task Received
  4. task Received runs to completion, incrementing A and printing "RX"
  5. control returns to task Timeout
  6. task Timeout executes print A, using the current value of A, which is 5.


Mutex
Mutual exclusion
Mutual exclusion algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of computer code called critical sections. A critical section is a piece of code in which a process or thread accesses a common resource...

es are used to address this problem in concurrent programming.

File systems

In file system
File system
A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device which contain it. A file system organizes data in an efficient manner and is tuned to the...

s, two or more programs may "collide" in their attempts to modify or access a file, which could result in data corruption. File locking
File locking
File locking is a mechanism that restricts access to a computer file by allowing only one user or process access at any specific time. Systems implement locking to prevent the classic interceding update scenario ....

 provides a commonly-used solution. A more cumbersome remedy involves organizing the system in such a way that one unique process (running a daemon
Daemon (computer software)
In Unix and other multitasking computer operating systems, a daemon is a computer program that runs as a background process, rather than being under the direct control of an interactive user...

 or the like) has exclusive access to the file, and all other processes that need to access the data in that file do so only via interprocess communication with that one process (which of course requires synchronization at the process level).

A different form of race condition exists in file systems where unrelated programs may affect each other by suddenly using up available resources such as disk space (or memory, or processor cycles). Software not carefully designed to anticipate and handle this race situation may then become quite fragile and unpredictable. Such a risk may be overlooked for a long time in a system that seems very reliable. But eventually enough data may accumulate or enough other software may be added to critically destabilize many parts of a system. Probably the best known example of this occurred with the near-loss of the Mars Rover "Spirit"
Spirit rover
Spirit, MER-A , is a robotic rover on Mars, active from 2004 to 2010. It was one of two rovers of NASA's ongoing Mars Exploration Rover Mission. It landed successfully on Mars at 04:35 Ground UTC on January 4, 2004, three weeks before its twin, Opportunity , landed on the other side of the planet...

 not long after landing, but this is a commonly overlooked hazard in many computer systems. A solution is for software to request and reserve all the resources it will need before beginning a task; if this request fails then the task is postponed, avoiding the many points where failure could have occurred. (Alternately, each of those points can be equipped with error handling, or the success of the entire task can be verified afterwards, before continuing on.) A more common but incorrect approach is to simply verify that enough disk space (for example) is available before starting a task; this is not adequate because in complex systems the actions of other running programs can be unpredictable.

Networking

In networking, consider a distributed chat network like IRC, where a user acquires channel-operator privileges in any channel he starts. If two users on different servers, on different ends of the same network, try to start the same-named channel at the same time, each user's respective server will grant channel-operator privileges to each user, since neither server will yet have received the other server's signal that it has allocated that channel. (Note that this problem has been largely solved by various IRC server implementations.)

In this case of a race condition, the concept of the "shared resource
Resource (computer science)
A resource, or system resource, is any physical or virtual component of limited availability within a computer system. Every device connected to a computer system is a resource. Every internal system component is a resource...

" covers the state of the network (what channels exist, as well as what users started them and therefore have what privileges), which each server can freely change as long as it signals the other servers on the network about the changes so that they can update their conception of the state of the network. However, the latency
Latency (engineering)
Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...

 across the network makes possible the kind of race condition described. In this case, heading off race conditions by imposing a form of control over access to the shared resource—say, appointing one server to control who holds what privileges—would mean turning the distributed network into a centralized one (at least for that one part of the network operation).

Life-critical systems

Software flaws in life-critical system
Life-critical system
A life-critical system or safety-critical system is a system whose failure ormalfunction may result in:* death or serious injury to people, or* loss or severe damage to equipment or* environmental harm....

s can be disastrous. Race conditions were among the flaws in the Therac-25
Therac-25
The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited after the Therac-6 and Therac-20 units ....

 radiation therapy
Radiation therapy
Radiation therapy , radiation oncology, or radiotherapy , sometimes abbreviated to XRT or DXT, is the medical use of ionizing radiation, generally as part of cancer treatment to control malignant cells.Radiation therapy is commonly applied to the cancerous tumor because of its ability to control...

 machine, which led to the death of at least three patients and injuries to several more. Another example is the Energy Management System provided by GE Energy and used by Ohio
Ohio
Ohio is a Midwestern state in the United States. The 34th largest state by area in the U.S.,it is the 7th‑most populous with over 11.5 million residents, containing several major American cities and seven metropolitan areas with populations of 500,000 or more.The state's capital is Columbus...

-based FirstEnergy Corp. (among other power facilities). A race condition existed in the alarm subsystem; when three sagging power lines were tripped simultaneously, the condition prevented alerts from being raised to the monitoring technicians, delaying their awareness of the problem. This software flaw eventually led to the North American Blackout of 2003. GE Energy later developed a software patch to correct the previously undiscovered error.

Computer security

A specific kind of race condition involves checking for a predicate (e.g. for authentication
Authentication
Authentication is the act of confirming the truth of an attribute of a datum or entity...

), then acting on the predicate, while the state can change between the time of check and the time of use. When this kind of bug exists in security
Computer security
Computer security is a branch of computer technology known as information security as applied to computers and networks. The objective of computer security includes protection of information and property from theft, corruption, or natural disaster, while allowing the information and property to...

-conscious code, a security vulnerability called a time-of-check-to-time-of-use
Time-of-check-to-time-of-use
In software development, time-of-check-to-time-of-use is a class of software bug caused by changes in a system between the checking of a condition and the use of the results of that check...

 (TOCTTOU) bug is created.

See also

  • Concurrency control
    Concurrency control
    In information technology and computer science, especially in the fields of computer programming , operating systems , multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.Computer...

  • Deadlock
    Deadlock
    A deadlock is a situation where in two or more competing actions are each waiting for the other to finish, and thus neither ever does. It is often seen in a paradox like the "chicken or the egg"...

  • Synchronization
    Synchronization
    Synchronization is timekeeping which requires the coordination of events to operate a system in unison. The familiar conductor of an orchestra serves to keep the orchestra in time....

  • Linearizability
    Linearizability
    In concurrent programming, an operation is atomic, linearizable, indivisible or uninterruptible if it appears to the rest of the system to occur instantaneously. Atomicity is a guarantee of isolation from concurrent processes...


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK