Fault-tolerant system
Fault-tolerance or graceful degradation is the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naïvely-designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems. In web design, a newer approach is progressive enhancement.
Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount.
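The endpoint behaviour described above can be illustrated with a much-simplified receiver. This is a sketch only, not TCP's actual algorithm or wire format: the `Segment` fields, the `Receiver` class and the use of CRC-32 in place of TCP's checksum are all assumptions made for the example. Corrupted and duplicate segments are dropped, out-of-order segments are buffered, and data is delivered strictly in sequence.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seq: int        # position of this segment in the stream (hypothetical field)
    payload: bytes
    checksum: int   # CRC-32 of the payload, standing in for a real checksum


def make_segment(seq: int, payload: bytes) -> Segment:
    return Segment(seq, payload, zlib.crc32(payload))


class Receiver:
    """Toy receiver that tolerates loss, duplication, reordering and corruption."""

    def __init__(self) -> None:
        self.next_seq = 0      # next sequence number we can deliver
        self.buffer = {}       # out-of-order segments waiting for a gap to fill
        self.delivered = []    # payloads handed to the application, in order

    def receive(self, seg: Segment) -> int:
        """Process one segment and return the next sequence number still needed."""
        if zlib.crc32(seg.payload) != seg.checksum:
            return self.next_seq   # corrupted: drop it, sender must retransmit
        if seg.seq < self.next_seq or seg.seq in self.buffer:
            return self.next_seq   # duplicate: already delivered or buffered
        self.buffer[seg.seq] = seg.payload
        # Deliver every segment that is now contiguous with what came before.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq
```

The returned value plays the role of an acknowledgement: a sender that keeps retransmitting whatever the receiver still needs will eventually get the whole stream through, at reduced throughput but with the data intact.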
Data formats may also be designed to degrade gracefully. HTML, for example, is designed to be forward compatible, allowing new HTML elements to be ignored by Web browsers which do not understand them without causing the document to be unusable.
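A minimal sketch of this behaviour, using Python's standard `html.parser` module: a toy renderer that understands only a small, hypothetical set of tags (`KNOWN_TAGS`) skips any tag it does not recognise but still passes the enclosed text through. Real browsers are far more elaborate; the example only illustrates the forward-compatibility principle.

```python
from html.parser import HTMLParser

# Hypothetical: the only tags this toy renderer understands.
KNOWN_TAGS = {"p", "b", "i"}


class TolerantRenderer(HTMLParser):
    """Keeps known tags and all text; silently skips unknown tags."""

    def __init__(self) -> None:
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            self.out.append(f"<{tag}>")
        # Unknown tags are ignored; their text still arrives via handle_data.

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def render(markup: str) -> str:
    renderer = TolerantRenderer()
    renderer.feed(markup)
    renderer.close()   # flush any buffered text
    return "".join(renderer.out)
```

For instance, `render("<p>Hello <futuretag>new</futuretag> world</p>")` keeps the paragraph and its text while dropping the unrecognised `futuretag` markup.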
Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.
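Roll-back recovery with checkpointing might be sketched as follows. The `CheckpointedStore` class and its method names are hypothetical, chosen only for illustration. Every operation is of the form "set key to value", which is idempotent, so replaying the operations logged since the checkpoint is safe even if some of them had already taken effect.

```python
import copy


class CheckpointedStore:
    """Toy key-value store with checkpoint, roll-back and replay."""

    def __init__(self) -> None:
        self.state = {}
        self._checkpoint = {}
        self._log = []   # operations applied since the last checkpoint

    def checkpoint(self) -> None:
        """Record the current state as known-good and start a fresh log."""
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def apply(self, key, value) -> None:
        """Idempotent operation: applying it twice has the same effect as once."""
        self.state[key] = value
        self._log.append((key, value))

    def roll_back(self) -> None:
        """Revert to the last checkpoint, discarding everything after it."""
        self.state = copy.deepcopy(self._checkpoint)

    def roll_back_and_replay(self) -> None:
        """Revert to the checkpoint, then move forward by replaying the log."""
        self.roll_back()
        for key, value in self._log:
            self.state[key] = value
```

If the operations were not idempotent (for example "add 1 to a counter"), replaying them after a partial failure could apply some of them twice, which is why the idempotence requirement matters.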
Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequences of a system failure are catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.
In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.
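To make the percentage concrete, the short calculation below converts an availability figure into the downtime it permits per year. The function names are illustrative, and the year length of 365.25 days is an assumption; under it, five nines corresponds to roughly 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60   # assumes an average year of 365.25 days


def availability_percent(uptime: float, downtime: float) -> float:
    """Availability as the percentage of total time the service was up."""
    return 100.0 * uptime / (uptime + downtime)


def allowed_downtime_minutes_per_year(percent: float) -> float:
    """Downtime per year that is compatible with the given availability."""
    return (1.0 - percent / 100.0) * MINUTES_PER_YEAR
```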
Fault-tolerant systems are typically based on the concept of redundancy.
All implementations of RAID (redundant array of independent disks), except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy.
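One common form of data redundancy in RAID is parity, as used for example in RAID 5. The sketch below is an illustration of the XOR parity idea only, not of any particular RAID implementation; the function names are hypothetical and all blocks are assumed to have the same length. Because XOR is its own inverse, any single missing block equals the XOR of all the surviving blocks, parity included.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Reconstruct the one lost block from the survivors (data plus parity)."""
    return xor_blocks(surviving_blocks)
```

RAID 0, by contrast, stripes data without storing any such redundant information, which is why it is excluded above.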
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed Dual Modular Redundant (DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed Triple Modular Redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
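The voting logic can be sketched in software as a simple majority function. This is an illustration rather than a description of any real voting circuit, and the name `vote` is hypothetical. With three replications a single dissenter is outvoted and identified; with two, a mismatch can only be detected, which the sketch signals by raising an error.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_dissenting_replicas).

    Raises ValueError when no value has a strict majority, i.e. when a
    mismatch is detected but cannot be corrected by voting alone.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("mismatch detected, but no majority to correct it")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

The list of dissenters is what would let a system drop the faulty replication and continue in DMR mode with the remaining two.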
Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
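The selection logic of pair-and-spare can be sketched as below. The function names are hypothetical, and the hardware comparison is modelled as a plain equality test. Each pair reports its output together with an error flag, and the final stage takes the output of the first pair that does not report an error.

```python
def pair_output(a, b):
    """One lockstep pair: return (output, error_flag)."""
    return a, a != b


def pair_and_spare(pair1, pair2):
    """Select the output of a pair that does not report an error."""
    for a, b in (pair1, pair2):
        output, error = pair_output(a, b)
        if not error:
            return output
    raise RuntimeError("both pairs report an error")
```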
Fault tolerance requirements
The basic characteristics of fault tolerance require:
- No single point of failure
- Fault isolation to the failing component
- Fault containment to prevent propagation of the failure
- Availability of reversion modes
Fault-tolerance by replication
Spare components address the first fundamental characteristic of fault-tolerance in three ways:
- Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
- Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
- Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
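The failover case in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: each instance is modelled as a plain callable, any raised exception is treated as a failure of that instance, and the name `call_with_failover` is hypothetical.

```python
def call_with_failover(instances, request):
    """Try each instance in order and return the first successful result."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:   # any exception counts as an instance failure
            last_error = exc
    raise RuntimeError(f"all {len(instances)} instances failed") from last_error
```

Unlike replication with a quorum, only one instance handles a given request here, so failover masks crashes and timeouts but not an instance that returns a wrong answer.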
No single point of repair
If a system experiences a failure, it must continue to operate without interruption during the repair process.
Fault isolation to the failing component
When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on Locality, Cause, Duration and Effect.
Fault containment
Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter", which can swamp legitimate communication in a system and cause overall system failure. Mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
See also
- Byzantine fault tolerance
- Intrusion tolerance
- Capillary routing
- Cluster (computing)
- Control reconfiguration
- Defence in depth
- Data redundancy
- Elegant degradation
- Error detection and correction
- Fail-safe
- Fail soft
- Failure transparency
- Fall back and forward
- Fault-tolerant design
- Fault-tolerant computer systems
- Progressive enhancement
- Separation of protection and security
- Rollback
- Resilience (ecology)
- Resilience (network)
- List of system quality attributes