Reliability, Availability and Serviceability
Reliability, availability, and serviceability are computer hardware engineering terms. The phrase originated at IBM, which used it to advertise the robustness of its mainframe computers. The concept is often known by the acronym RAS.

Mainframe computers have a multitude of features that help them stay available for long periods of time without failure. This uptime is a selling point for mainframes and fault-tolerant systems, with some computer vendors offering uptimes on the order of years.
  • Reliability means features that help avoid and detect such faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data; instead, it corrects the corruption when possible, or else stops and reports the corruption.
  • Availability is the time a device is actually operating, expressed as a percentage of the total time it should be operating. In high-availability applications, availability may be reported as minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at reduced capacity. In contrast, a less capable system might crash and become totally nonoperational.
  • Serviceability takes the form of various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some of IBM's systems could automatically call an IBM service center (without human intervention) when the system experienced a fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
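
The availability definition above is simple arithmetic, and a short sketch can make the link between a percentage and yearly downtime concrete. The function names here are illustrative assumptions, not part of any standard library, and the year is taken to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability_percent(uptime_minutes: float, scheduled_minutes: float) -> float:
    """Time actually operating as a percentage of the time it should be operating."""
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    return 100.0 * uptime_minutes / scheduled_minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available: {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.999% availability corresponds to about 5.3 minutes of downtime per year, which is why high-availability figures are often quoted as downtime rather than as a percentage.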


RAS features might include:
  • Parity or ECC protection of memory components as well as buses.
  • Cyclic redundancy check checksums for data transmission and data storage.
  • RAID configurations for magnetic disk storage.
  • Journaling file systems for file repair after crashes.
  • Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, and vibration.
  • Duplication of computing components running in lock-step to perform master-checker or voting schemes.
  • Duplication of components to avoid single points of failure (for example, power supplies).
  • Hot swapping of components.
  • Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
  • Computer clustering capability.
  • Virtual machines to decrease the severity of operating system software faults.
  • Temperature sensors to throttle operating frequency when temperature goes out of specification.
  • Surge protectors, uninterruptible power supplies, and auxiliary power.
  • Failover capability.
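
Two of the features listed above, parity and cyclic redundancy checks, can be illustrated with a minimal sketch. It uses `zlib.crc32` from the Python standard library; the helper names and the sample data are assumptions made for illustration. Both techniques only detect corruption. Correcting it, as ECC does, needs a stronger code that is not shown here.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def parity_ok(data: bytes, parity_bit: int) -> bool:
    """Check received data against the parity bit stored with it."""
    return even_parity_bit(data) == parity_bit


def crc_ok(data: bytes, expected_crc: int) -> bool:
    """Check received data against the CRC-32 stored with it."""
    return zlib.crc32(data) == expected_crc


if __name__ == "__main__":
    original = b"RAS"
    parity = even_parity_bit(original)
    crc = zlib.crc32(original)

    # Flip one bit of the first byte to simulate a single-bit fault.
    one_bit_error = bytes([original[0] ^ 0x01]) + original[1:]
    # Flip two bits of the first byte: the parity is unchanged, the CRC is not.
    two_bit_error = bytes([original[0] ^ 0x03]) + original[1:]

    print("one bit flipped: ", parity_ok(one_bit_error, parity), crc_ok(one_bit_error, crc))
    print("two bits flipped:", parity_ok(two_bit_error, parity), crc_ok(two_bit_error, crc))
```

A single parity bit misses any error that flips an even number of bits, which is one reason CRCs are preferred for data transmission and storage.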


Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers, which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.
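
As a rough software analogy for duplicate components running in lock-step, the sketch below runs the same computation on several replicas and takes a majority vote. It is a simplified illustration, not how any particular fault-tolerant computer implements it in hardware. The names `vote`, `run_in_lockstep`, and `ReplicaMismatch` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


class ReplicaMismatch(Exception):
    """Raised when no strict majority of replicas agrees on a result."""


def vote(results: Sequence[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No majority: stop and report instead of passing on a doubtful result.
        raise ReplicaMismatch(f"no majority among {list(results)}")
    return value


def run_in_lockstep(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every replica and vote on the outputs."""
    return vote([replica(x) for replica in replicas])


if __name__ == "__main__":
    def healthy(x: int) -> int:
        return x * x

    def faulty(x: int) -> int:
        return x * x + 1  # hypothetical replica producing a corrupted result

    # Three replicas: the faulty one is outvoted and operation continues.
    print(run_in_lockstep([healthy, healthy, faulty], 7))

    # Two replicas (master-checker style): a mismatch can only be detected.
    try:
        run_in_lockstep([healthy, faulty], 7)
    except ReplicaMismatch as err:
        print("halted:", err)
```

With two replicas a disagreement can be detected but not resolved, so the system stops and reports. With three or more, a single faulty replica can be masked, at the cost of the extra hardware that makes such designs expensive.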

See also

  • Machine check architecture
  • Integrated logistics support

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.