Chipkill
Encyclopedia
Chipkill is IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

's trademark for a form of advanced error checking and correcting (ECC) computer memory
Computer memory
In computing, memory refers to the physical devices used to store programs or data on a temporary or permanent basis for use in a computer or other digital electronic device. The term primary memory is used for the information in physical systems which are fast In computing, memory refers to the...

 technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code
Hamming code
In telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming-code invented by Richard Hamming in 1950. Hamming codes can detect up to two and correct up to one bit errors. By contrast, the simple parity code cannot correct errors, and can detect only...

 ECC word across multiple memory chips, such that the failure of any one memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code
BCH code
In coding theory the BCH codes form a class of parameterised error-correcting codes which have been the subject of much academic attention in the last fifty years. BCH codes were invented in 1959 by Hocquenghem, and independently in 1960 by Bose and Ray-Chaudhuri...

, that can correct multiple bits with less overhead. The equivalent system from Sun Microsystems
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...

 is called Extended ECC. The equivalent system from HP is called Chipspare. A similar system from Intel is called SDDC.

Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or has exceeded a threshold of bit errors), another, spare, memory chip is used to replace the failed chip. The concept is similar to that of RAID
RAID
RAID is a storage technology that combines multiple disk drive components into a logical unit...

, which protects against disk failure, except that now the concept is applied to individual memory chips. The technology was developed by the IBM Corporation in the early and middle 1990s. An important RAS
Reliability, Availability and Serviceability
reliability, availability, and serviceability are computer hardware engineering terms. It originated from IBM to advertise the robustness of their mainframe computers. The concept is often known by the acronym RAS....

 feature, Chipkill technology is deployed primarily on SSD
Solid-state drive
A solid-state drive , sometimes called a solid-state disk or electronic disk, is a data storage device that uses solid-state memory to store persistent data with the intention of providing access in the same manner of a traditional block i/o hard disk drive...

s, mainframes
Mainframe computer
Mainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...

 and midrange servers.

In a 2009 paper using data from Google's datacentres , provided evidence that demonstrated that in the Google systems DRAM errors were recurrent at the same location, and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month". DIMMs with chip-kill error correction showed a lower fraction of DIMMs reporting uncorrectable errors compared to DIMMs with error correcting codes that can only correct single-bit errors.

External links

  • DRAM study turns assumptions about errors upside down, Ars Technica
    Ars Technica
    Ars Technica is a technology news and information website created by Ken Fisher and Jon Stokes in 1998. It publishes news, reviews and guides on issues such as computer hardware and software, science, technology policy, and video games. Ars Technica is known for its features, long articles that go...

    October 7, 2009.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK