Reliability, Availability and Serviceability

Reliability, availability, and serviceability are computer hardware engineering terms. The phrase originated at IBM, which used it to advertise the robustness of its mainframe computers. The concept is often known by the acronym RAS.

Mainframe computers have a multitude of features that help them stay available for long periods of time without failure. This uptime is a selling point for mainframes and fault-tolerant systems, with some computer vendors offering uptimes on the order of years.

Reliability means features that help avoid and detect such faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it corrects the corruption when possible, or else stops and reports the corruption.
Availability is the amount of time a device is actually operating, expressed as a percentage of the total time it should be operating. In high availability applications, availability may be reported as minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational.
Serviceability takes the form of various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some of IBM's systems could automatically call an IBM service center (without human intervention) when the system experienced a fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
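
The relationship between availability as a percentage and downtime per year can be illustrated with a little arithmetic. The following is a minimal sketch, not a standard formula from any vendor: the function names are hypothetical, and a year is assumed to be 365 days with the system expected to run continuously.

```python
# Assumption: a non-leap year of continuous scheduled operation.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def availability_percent(uptime_hours: float, scheduled_hours: float) -> float:
    """Availability as a percentage of the time the system should be operating."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return 100.0 * uptime_hours / scheduled_hours


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime implied by a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available -> about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% availability allows a little under nine hours of downtime per year, while 99.999% allows only a few minutes.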

RAS features might include:

Parity or ECC protection of memory components as well as buses.
Cyclic redundancy check (CRC) checksums for data transmission and data storage.
RAID configurations for magnetic disk storage.
Journaling file systems for file repair after crashes.
Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, and vibration.
Duplication of computing components running in lock-step to perform master-checker or voting schemes.
Duplication of components to avoid single points of failure (for example, power supplies).
Hot swapping of components.
Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
Computer clustering capability.
Virtual machines to decrease the severity of operating system software faults.
Temperature sensors to throttle operating frequency when temperature goes out of specification.
Surge protector, uninterruptible power supply, auxiliary power.
Failover capability.
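
Two of the error-detection features listed above, parity and CRC checksums, can be compared in a small software sketch. This is only an illustration: real systems implement these checks in hardware or firmware, the helper names (`protect`, `flip_bit`, and so on) are hypothetical, and CRC-32 from Python's standard `zlib` module stands in for whichever CRC a given device actually uses.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the extra bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def protect(data: bytes) -> tuple[int, int]:
    """Compute the (parity bit, CRC-32) pair stored or sent alongside the data."""
    return even_parity_bit(data), zlib.crc32(data)


def parity_ok(data: bytes, parity: int) -> bool:
    """True if the data still matches the recorded parity bit."""
    return even_parity_bit(data) == parity


def crc_ok(data: bytes, crc: int) -> bool:
    """True if the data still matches the recorded CRC-32."""
    return zlib.crc32(data) == crc


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with bit `index` inverted, simulating a fault."""
    corrupted = bytearray(data)
    corrupted[index // 8] ^= 1 << (index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = b"RAS"
    parity, crc = protect(message)

    one_error = flip_bit(message, 3)      # single-bit fault
    two_errors = flip_bit(one_error, 10)  # second fault in another byte

    print("1 bit flipped:  parity ok?", parity_ok(one_error, parity),
          " crc ok?", crc_ok(one_error, crc))
    print("2 bits flipped: parity ok?", parity_ok(two_errors, parity),
          " crc ok?", crc_ok(two_errors, crc))
```

A single parity bit catches any odd number of flipped bits but misses an even number, which is why the two-bit fault passes the parity check while the CRC still rejects it. Neither check can repair the data; correction requires an error-correcting code such as ECC.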

Fault-tolerant designs extended the idea by making RAS the defining feature of computers for applications such as stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers, which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.
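
The idea of duplicate components running in lock-step can be sketched in software, although real fault-tolerant machines compare results in hardware on every clock cycle. The names below (`master_checker`, `majority_vote`, `LockstepMismatch`) are hypothetical, and ordinary Python functions stand in for redundant processors.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class LockstepMismatch(Exception):
    """Raised when redundant units disagree and the fault cannot be masked."""


def master_checker(master: Callable[[int], int],
                   checker: Callable[[int], int],
                   x: int) -> int:
    """Run the same computation on two units and stop if the results differ."""
    result = master(x)
    if checker(x) != result:
        # Two units can detect a fault but cannot tell which one is wrong.
        raise LockstepMismatch("master and checker disagree")
    return result


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of redundant units."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise LockstepMismatch("no majority among redundant units")
    return value


if __name__ == "__main__":
    healthy = lambda x: x * x
    faulty = lambda x: x * x + 1  # simulated hardware fault

    print(master_checker(healthy, healthy, 7))

    try:
        master_checker(healthy, faulty, 7)
    except LockstepMismatch as err:
        print("halted:", err)

    # With three units, one faulty result is outvoted and operation continues.
    print(majority_vote([unit(7) for unit in (healthy, faulty, healthy)]))
```

The sketch also hints at the cost trade-off: a master-checker pair doubles the hardware just to detect a fault, and masking one requires at least three units.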

See also