ECC memory
Encyclopedia
Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the more common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing and as servers.
ECC memory maintains a memory system effectively free from single-bit errors: the data read from each word is always the same as the data that had been written to it, even if a single bit actually stored, or more in some cases, has been flipped to the wrong state. Some non-ECC memory with parity support allows errors to be detected, but not corrected; otherwise errors that may occur are not detected.
") errors in DRAM chips occur as a result of background radiation
, chiefly neutron
s from cosmic ray
secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them. There was some concern that as DRAM density increases further, and thus the components on DRAM chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently—since lower energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI
may make individual cells less susceptible and so counteract, or even reverse this trend. Recent studies show that single event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.
Several approaches have been developed to deal with these unwanted bit-flips:
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity
or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error correcting code, a SECDED Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
Seymour Cray
famously said "parity is for farmers" when asked why he left this out of the CDC 6600
. He included parity in the CDC 7600
. The original IBM PC
and all PCs until the early 1990s used parity checking. Later ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.
An ECC-capable memory controller as used in many modern PCs can typically detect and correct errors of a single bit per 64-bit "word" (the unit of bus
transfer), and detect (but not correct) errors of two bits per 64-bit word. Some systems also 'scrub' the errors, by writing the corrected version back to memory. The BIOS
in some computers, and operating systems such as Linux
, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Typically machines intended for server use support ECC. Most mainboards intended for desktop (rather than server) machines do not support ECC, and those that do support ECC are shipped with it disabled (but easily enabled, usually by a change to BIOS
setup), to allow use of cheaper non-ECC memory. Most modern PCs do not support ECC at all as can be seen by examining computer and motherboard specifications; those whose motherboards do are often supplied with memory modules that do not support ECC. It may be that most users opt for non-ECC systems and memory even when ECC is available. The most important reasons for this are:
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits are in the same chip. This weakness does not seem to be widely addressed; one exception is Chipkill
.
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from , roughly one bit error, per hour, per gigabyte of memory to one bit error, per century, per gigabyte of memory. A very large-scale study based on Google
's very large number of servers was presented at the SIGMETRICS/Performance’09 conference. The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 3–10×10−9 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.
In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications.
DRAM
memory may provide increased protection against soft error
s by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation
.
Error-correcting memory controllers traditionally use Hamming code
s, although some use triple modular redundancy
.
Interleaving
allows distributing the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset
(SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.
Some ECC memory uses triple modular redundancy
hardware (rather than the more common Hamming code
), because triple modular redundancy hardware is faster than Hamming error correction hardware. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction.
An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth bit; then a change is made to the spreadsheet and it is saved. However, the "8" (00111000 binary) has silently become a "9" (00111001).
ECC memory maintains a memory system effectively free from single-bit errors: the data read from each word is always the same as the data that had been written to it, even if a single bit actually stored, or more in some cases, has been flipped to the wrong state. Some non-ECC memory with parity support allows errors to be detected, but not corrected; otherwise errors that may occur are not detected.
History
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off ("softSoft error
In electronics and computing, a soft error is an error in a signal or datum which is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to...
") errors in DRAM chips occur as a result of background radiation
Background radiation
Background radiation is the ionizing radiation constantly present in the natural environment of the Earth, which is emitted by natural and artificial sources.-Overview:Both Natural and human-made background radiation varies by location....
, chiefly neutron
Neutron
The neutron is a subatomic hadron particle which has the symbol or , no net electric charge and a mass slightly larger than that of a proton. With the exception of hydrogen, nuclei of atoms consist of protons and neutrons, which are therefore collectively referred to as nucleons. The number of...
s from cosmic ray
Cosmic ray
Cosmic rays are energetic charged subatomic particles, originating from outer space. They may produce secondary particles that penetrate the Earth's atmosphere and surface. The term ray is historical as cosmic rays were thought to be electromagnetic radiation...
secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them. There was some concern that as DRAM density increases further, and thus the components on DRAM chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently—since lower energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI
Silicon on insulator
Silicon on insulator technology refers to the use of a layered silicon-insulator-silicon substrate in place of conventional silicon substrates in semiconductor manufacturing, especially microelectronics, to reduce parasitic device capacitance and thereby improving performance...
may make individual cells less susceptible and so counteract, or even reverse this trend. Recent studies show that single event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.
Several approaches have been developed to deal with these unwanted bit-flips:
- Immunity-aware programming
- RAM parityRAM parityRAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.The parity bit was originally...
memory - ECC memory
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity
RAM parity
RAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.The parity bit was originally...
or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error correcting code, a SECDED Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
Seymour Cray
Seymour Cray
Seymour Roger Cray was an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research which would build many of these machines. Called "the father of supercomputing," Cray has been credited...
famously said "parity is for farmers" when asked why he left this out of the CDC 6600
CDC 6600
The CDC 6600 was a mainframe computer from Control Data Corporation, first delivered in 1964. It is generally considered to be the first successful supercomputer, outperforming its fastest predecessor, IBM 7030 Stretch, by about three times...
. He included parity in the CDC 7600
CDC 7600
The CDC 7600 was the Seymour Cray-designed successor to the CDC 6600, extending Control Data's dominance of the supercomputer field into the 1970s. The 7600 ran at 36.4 MHz and had a 65 Kword primary memory using core and variable-size secondary memory...
. The original IBM PC
IBM PC
The IBM Personal Computer, commonly known as the IBM PC, is the original version and progenitor of the IBM PC compatible hardware platform. It is IBM model number 5150, and was introduced on August 12, 1981...
and all PCs until the early 1990s used parity checking. Later ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.
An ECC-capable memory controller as used in many modern PCs can typically detect and correct errors of a single bit per 64-bit "word" (the unit of bus
Computer bus
In computer architecture, a bus is a subsystem that transfers data between components inside a computer, or between computers.Early computer buses were literally parallel electrical wires with multiple connections, but the term is now used for any physical arrangement that provides the same...
transfer), and detect (but not correct) errors of two bits per 64-bit word. Some systems also 'scrub' the errors, by writing the corrected version back to memory. The BIOS
BIOS
In IBM PC compatible computers, the basic input/output system , also known as the System BIOS or ROM BIOS , is a de facto standard defining a firmware interface....
in some computers, and operating systems such as Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Typically machines intended for server use support ECC. Most mainboards intended for desktop (rather than server) machines do not support ECC, and those that do support ECC are shipped with it disabled (but easily enabled, usually by a change to BIOS
BIOS
In IBM PC compatible computers, the basic input/output system , also known as the System BIOS or ROM BIOS , is a de facto standard defining a firmware interface....
setup), to allow use of cheaper non-ECC memory. Most modern PCs do not support ECC at all as can be seen by examining computer and motherboard specifications; those whose motherboards do are often supplied with memory modules that do not support ECC. It may be that most users opt for non-ECC systems and memory even when ECC is available. The most important reasons for this are:
- Lack of knowledge of the existence of the problem of memory corruption and its solution;
- Applications which do not suffer seriously from occasional small-scale memory corruption.
- The higher cost of ECC memory (each bank is 9 memory chips compared to 8 for non-ECC memory, and more importantly there is more volume for non-ECC. In some cases the price ratio reduces to 9/8, as an example, on 2008/11/30, on Crucial.com, an ECC CL=5 unbuffered 2GB DDR2-667 DIMM cost $30 while the corresponding non-ECC part cost $28, a difference of 1/15, however some ECC modules cost twice as much as their non-ECC equivalents [Crucial CT12872Z40B and CT12864Z40B, Jan 2009]);
- The higher cost of motherboards, chipsetChipsetA chipset, PC chipset, or chip set refers to a group of integrated circuits, or chips, that are designed to work together. They are usually marketed as a single product.- Computers :...
s, and processors that support ECC functionality in RAM; - A performance decrease of around 0.5–2 percent, depending on application, due to the additional time needed for ECC memory controllers to perform error checking;
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits are in the same chip. This weakness does not seem to be widely addressed; one exception is Chipkill
Chipkill
Chipkill is IBM's trademark for a form of advanced error checking and correcting computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip...
.
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from , roughly one bit error, per hour, per gigabyte of memory to one bit error, per century, per gigabyte of memory. A very large-scale study based on Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...
's very large number of servers was presented at the SIGMETRICS/Performance’09 conference. The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 3–10×10−9 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.
In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications.
DRAM
Dynamic random access memory
Dynamic random-access memory is a type of random-access memory that stores each bit of data in a separate capacitor within an integrated circuit. The capacitor can be either charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1...
memory may provide increased protection against soft error
Soft error
In electronics and computing, a soft error is an error in a signal or datum which is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to...
s by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation
Cosmic ray
Cosmic rays are energetic charged subatomic particles, originating from outer space. They may produce secondary particles that penetrate the Earth's atmosphere and surface. The term ray is historical as cosmic rays were thought to be electromagnetic radiation...
.
Error-correcting memory controllers traditionally use Hamming code
Hamming code
In telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming-code invented by Richard Hamming in 1950. Hamming codes can detect up to two and correct up to one bit errors. By contrast, the simple parity code cannot correct errors, and can detect only...
s, although some use triple modular redundancy
Triple modular redundancy
In computing, triple modular redundancy is a fault tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the...
.
Interleaving
Interleaving
In computer science and telecommunication, interleaving is a way to arrange data in a non-contiguous way to increase performance.It is typically used:* In error-correction coding, particularly within data transmission, disk storage, and computer memory....
allows distributing the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset
Single event upset
A single event upset is a change of state caused by ions or electro-magnetic radiation striking a sensitive node in a micro-electronic device, such as in a microprocessor, semiconductor memory, or power transistors. The state change is a result of the free charge created by ionization in or close...
(SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.
Some ECC memory uses triple modular redundancy
Triple modular redundancy
In computing, triple modular redundancy is a fault tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the...
hardware (rather than the more common Hamming code
Hamming code
In telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming-code invented by Richard Hamming in 1950. Hamming codes can detect up to two and correct up to one bit errors. By contrast, the simple parity code cannot correct errors, and can detect only...
), because triple modular redundancy hardware is faster than Hamming error correction hardware. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction.
Effects of memory corruption
The consequence of a memory error is system-dependent. In systems without ECC an error can lead either to to a crash or to corruption of data: in large-scale production sites memory errors are one of the most common hardware causes of machine crashes. Memory errors can cause security vulnerabilities. A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved.An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth bit; then a change is made to the spreadsheet and it is saved. However, the "8" (00111000 binary) has silently become a "9" (00111001).