Multiply-accumulate
In computing, especially digital signal processing, the multiply–accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC, or MAC unit); the operation itself is also often called a MAC or a MAC operation. The MAC operation modifies an accumulator a:
a ← a + (b × c)
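As a rough illustration of the operation a ← a + (b × c), the following C sketch applies repeated MAC steps to two buffers of 16-bit samples. The function names mac and mac_sum, the 16-bit operand width and the 64-bit accumulator are assumptions chosen for the example, not properties of any particular MAC unit.

```c
#include <stddef.h>
#include <stdint.h>

/* One MAC step: a <- a + (b * c). The 16-bit operands are widened
   before multiplying, so the product is computed in 64 bits. */
static int64_t mac(int64_t a, int16_t b, int16_t c)
{
    return a + (int64_t)b * (int64_t)c;
}

/* Sum of products of two sample buffers, built from repeated MAC steps. */
static int64_t mac_sum(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;                 /* the accumulator a */
    for (size_t i = 0; i < n; i++)
        acc = mac(acc, x[i], h[i]);
    return acc;
}
```

Calling mac_sum on the buffers {1, 2, 3} and {4, 5, 6} accumulates 4 + 10 + 18 = 32.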
When done with floating point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).
Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers. The first processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.
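For comparison, the shift-and-add method mentioned above can be sketched in C as a loop that handles one multiplier bit per step, where a combinational multiplier would produce the whole product at once. The function name shift_add_mul and the 32-bit width are assumptions made for this sketch.

```c
#include <stdint.h>

/* Shift-and-add multiplication: examine one multiplier bit per step and
   add the correspondingly shifted multiplicand when that bit is set.
   The result is taken modulo 2^32, like ordinary unsigned arithmetic. */
static uint32_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t product = 0;
    while (multiplier != 0) {
        if (multiplier & 1u)
            product += multiplicand;
        multiplicand <<= 1;
        multiplier >>= 1;
    }
    return product;
}
```

The loop needs up to 32 iterations for a 32-bit multiplier, which is the per-product cost that a dedicated multiplier avoids.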
In floating-point arithmetic
When done with integers, the operation is typically exact (computed modulo some power of 2). However, floating-point numbers have only finite precision, so the result of each operation may have to be rounded. As a consequence, digital floating-point arithmetic is generally not associative or distributive. (See Floating point § Accuracy problems.)
Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add).
Fused multiply–add
A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:
- Dot product
- Matrix multiplication
- Polynomial evaluation (e.g., with Horner's rule)
- Newton's method for evaluating functions
Fused multiply–add can usually be relied on to give more accurate results. However, Kahan has pointed out that it can give problems if used unthinkingly. If x² − y² is evaluated as ((x × x) − y × y) using fused multiply–add, then the result may be negative even when x = y, due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.
When implemented inside a microprocessor, an FMA can actually be faster than a multiply operation followed by an add, even though standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.
A useful benefit of including this instruction is that it allows an efficient software implementation of division and square root operations, thus eliminating the need for dedicated hardware for those operations.
The FMA operation is included in IEEE 754-2008.
The DEC VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of fused multiply–add steps. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.
The 1999 standard of the C programming language supports the FMA operation through the fma standard math library function.
The fused multiply–add operation was introduced as multiply–add fused in the IBM POWER1 processor, but has been added to numerous other processors since then:
- Fujitsu SPARC64 VI (2007) and above
- HP PA-8000 (1996) and above
- SCE-Toshiba Emotion Engine (1999)
- Intel Itanium (2001)
- STI Cell (2006)
- (MIPS-compatible) Loongson-2F (2008)
- ARM with VFPv4 (which is optional)
An FMA instruction will be implemented in newer AMD CPUs such as Bulldozer, with FMA4 support. Intel plans to implement FMA3 in processors using its Haswell microarchitecture, due in late 2012.
FMA capability is also present in the NVIDIA GeForce 200 Series (GTX 200), GeForce 400 Series and GeForce 500 Series GPUs, and in the NVIDIA Tesla C1060 Computing Processor and C2050 / C2070 GPU Computing Processor GPGPUs. FMA has been added to the AMD Radeon line with the HD 5000 series.