Decimal floating point
Encyclopedia
Decimal floating point arithmetic refers to both a representation and operations on decimal
floating point
numbers. Working directly with decimal (base 10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base 2) fractions.
The advantage of decimal floating-point representation over decimal fixed-point
and integer
representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates eight decimal digits and two decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with eight decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm
can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
, slide rule
, the Smallwood calculator, and some other calculator
s that support entries in scientific notation
. In the case of the mechanical calculators, the exponent is often treated as side information that is accounted for separately.
Some computer languages have implementations of decimal floating point arithmetic, including Java
with big decimal, emacs
with calc, python
, and in Unix
the bc
and dc
calculators.
In 1987, the IEEE released IEEE 854, a standard for computing with decimal floating point, which lacked a specification for how floating point data should be encoded for interchange with other systems. This is being addressed in IEEE 754-2008 which standardizes the encodings of decimal floating point data, albeit with two different alternative encodings.
IBM POWER6
includes DFP in hardware, as does the IBM System z9. SilMinds offers SilAx; a configurable vector DFP coprocessor
. IEEE 754-2008 defines this in more detail.
Microsoft C#, or .NET, uses System.Decimal.
. Unlike binary floating-point, numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×102=0.1×103=0.01×104, etc. When the significand is zero, the exponent can be any value at all.
The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.
Two different representations are defined:
Both alternatives provide exactly the same range of representable values.
The most significant two bits of the exponent are limited to the range of 0−2, and the most significant 4 bits of the significand are limited to the range of 0−9. The 30 possible combinations are encoded in a 5-bit field, along with special forms for infinity and NaN
.
If the most significant 4 bits of the significand are between 0 and 7, the encoded value begins as follows:
s 00 xxxx Exponent begins with 00, significand with 0mmm
s 01 xxxx Exponent begins with 01, significand with 0mmm
s 10 xxxx Exponent begins with 10, significand with 0mmm
If the leading 4 bits of the significand are binary 1000 or 1001 (decimal 8 or 9), the number begins as follows:
s 1100 xx Exponent begins with 00, significand with 100m
s 1101 xx Exponent begins with 01, significand with 100m
s 1110 xx Exponent begins with 10, significand with 100m
The leading bit (s in the above) is a sign bit, and the following bits (xxx in the above) encode the additional exponent bits and the remainder of the most significant digit, but the details vary depending on the encoding alternative used.
The final combinations are used for infinities and NaNs, and are the same for both alternative encodings:
s 11110 x ±Infinity (see Extended real number line
)
s 111110 quiet NaN (sign bit ignored)
s 111111 signaling NaN (sign bit ignored)
In the latter cases, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to NaNs by filling it with a single byte value.
As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (00002 to 01112), or higher (10002 or 10012).
If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 8 bits following the sign bit (the 2 bits mentioned plus 6 bits of "exponent continuation field"), and the significand is the remaining 23 bits, with an implicit leading 0 bit, shown here in parentheses:
s 00eeeeee (0)TTTtttttttttttttttttttt
s 01eeeeee (0)TTTtttttttttttttttttttt
s 10eeeeee (0)TTTtttttttttttttttttttt
This includes subnormal numbers where the leading significand digit is 0.
If the 4 bits after the sign bit are "1100", "1101", or "1110", then the 8-bit exponent field is shifted 2 bits to the right (after both the sign bit and the "11" bits thereafter), and the represented significand is in the remaining 21 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" in the true significand:
s 11 00eeeeee (100)Ttttttttttttttttttttt
s 11 01eeeeee (100)Ttttttttttttttttttttt
s 11 10eeeeee (100)Ttttttttttttttttttttt
The "11" 2-bit sequence after the sign bit indicates that there is an implicit "100" 3-bit prefix to the significand.
Note that the leading bits of the significand field do not encode the most significant decimal digit; they are simply part of a larger pure-binary number. For example, a significand of 8,000,000 is encoded as binary 0111 1010 0001 0010 0000 0000, with the leading 4 bits encoding 7; the first significand which requires a 24th bit (and thus the second emcoding form) is 223 = 8,388,608.
In the above cases, the value represented is:
Decimal64 and Decimal128 operate analogously, but with larger exponent continuation and significand fields. For Decimal128, the second encoding form is actually never used; the largest valid significand of 1034−1 =
encoding.
Unlike the binary integer significand version, where the exponent changed position and came before the significand, this encoding combines the leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand into the five bits that follow the sign bit. This is followed by a fixed-offset exponent continuation field.
Finally, the significand continuation field made of 2, 5, or 11 10-bit "declets", each encoding 3 decimal digits.
If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits after that are interpreted as the leading decimal digit (0 to 7):
Comb. Exponent Significand
s 00 TTT (00)eeeeee (TTT)[tttttttttt][tttttttttt]
s 01 TTT (01)eeeeee (TTT)[tttttttttt][tttttttttt]
s 10 TTT (10)eeeeee (TTT)[tttttttttt][tttttttttt]
If the 4 bits after the sign bit are "1100", "1101", or "1110", then the second two bits are the leading bits of the exponent, and the last bit is prefixed with "100" to form the leading decimal digit (8 or 9):
Comb. Exponent Significand
s 1100 T (00)eeeeee (100T)[tttttttttt][tttttttttt]
s 1101 T (01)eeeeee (100T)[tttttttttt][tttttttttt]
s 1110 T (10)eeeeee (100T)[tttttttttt][tttttttttt]
The remaining two combinations (11110 and 11111) of the 5-bit field are used to represent ±infinity and NaNs, respectively.
For ease of presentation and understanding, 7 digit precision will be used in the examples. The fundamental principles are the same in any precision.
The following example is decimal means base is simply 10.
123456.7 = 1.234567 * 10^5
101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5 simply
Hence:
123456.7 + 101.7654 = (1.234567 * 10^5) + (1.017654 * 10^2) =
= (1.234567 * 10^5) + (0.001017654 * 10^5) =
= 10^5 * ( 1.234567 + 0.001017654 ) = 10^5 * 1.235584654. simply
This is nothing else as converting to engineering notation.
In detail:
e=5; s=1.234567 (123456.7)
+ e=2; s=1.017654 (101.7654)
e=5; s=1.234567
+ e=5; s=0.001017654 (after shifting)
--------------------
e=5; s=1.235584654 (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
e=5; s=1.235585 (final sum: 123558.5)
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error
. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
e=5; s=1.234567
+ e=-3; s=9.876543
e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
e=5; s=1.23456709876543 (true sum)
e=5; s=1.234567 (after rounding/normalization)
Another problem of loss of significance occurs when two close numbers are subtracted.
e=5; s=1.234571 and e=5; s=1.234567 are representations of the rationals 123457.1467 and 123456.659.
e=5; s=1.234571
- e=5; s=1.234567
----------------
e=5; s=0.000004
e=-1; s=4.000000 (after rounding/normalization)
The best representation of this difference is e=-1; s=4.877000, which differs more than 20% from e=-1; s=4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation
illustrates the danger in assuming that all of the digits of a computed result are meaningful.
Dealing with the consequences of these errors are topics in numerical analysis
.
e=3; s=4.734612
× e=5; s=5.417242
-----------------------
e=8; s=25.648538980104 (true product)
e=8; s=25.64854 (after rounding)
e=9; s=2.564854 (after normalization)
Division is done similarly, but that is more complicated.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex.
Decimal
The decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations....
floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...
numbers. Working directly with decimal (base 10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base 2) fractions.
The advantage of decimal floating-point representation over decimal fixed-point
Fixed-point arithmetic
In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point...
and integer
Integer (computer science)
In computer science, an integer is a datum of integral data type, a data type which represents some finite subset of the mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values....
representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates eight decimal digits and two decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with eight decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm
Kahan summation algorithm
In numerical analysis, the Kahan summation algorithm significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach...
can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
Implementations
Early mechanical uses of decimal floating point are evident in the abacusAbacus
The abacus, also called a counting frame, is a calculating tool used primarily in parts of Asia for performing arithmetic processes. Today, abaci are often constructed as a bamboo frame with beads sliding on wires, but originally they were beans or stones moved in grooves in sand or on tablets of...
, slide rule
Slide rule
The slide rule, also known colloquially as a slipstick, is a mechanical analog computer. The slide rule is used primarily for multiplication and division, and also for functions such as roots, logarithms and trigonometry, but is not normally used for addition or subtraction.Slide rules come in a...
, the Smallwood calculator, and some other calculator
Calculator
An electronic calculator is a small, portable, usually inexpensive electronic device used to perform the basic operations of arithmetic. Modern calculators are more portable than most computers, though most PDAs are comparable in size to handheld calculators.The first solid-state electronic...
s that support entries in scientific notation
Scientific notation
Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. Scientific notation has a number of useful properties and is commonly used in calculators and by scientists, mathematicians, doctors, and engineers.In scientific...
. In the case of the mechanical calculators, the exponent is often treated as side information that is accounted for separately.
Some computer languages have implementations of decimal floating point arithmetic, including Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
with big decimal, emacs
Emacs
Emacs is a class of text editors, usually characterized by their extensibility. GNU Emacs has over 1,000 commands. It also allows the user to combine these commands into macros to automate work.Development began in the mid-1970s and continues actively...
with calc, python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
, and in Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
the bc
Bc programming language
bc, for bench calculator, is "an arbitrary precision calculator language" with syntax similar to the C programming language. bc is typically used as either a mathematical scripting language or as an interactive mathematical shell....
and dc
Dc (Unix)
dc is a cross-platform reverse-polish desk calculator which supports arbitrary-precision arithmetic. It is one of the oldest Unix utilities, predating even the invention of the C programming language; like other utilities of that vintage, it has a powerful set of features but an extremely terse...
calculators.
In 1987, the IEEE released IEEE 854, a standard for computing with decimal floating point, which lacked a specification for how floating point data should be encoded for interchange with other systems. This is being addressed in IEEE 754-2008 which standardizes the encodings of decimal floating point data, albeit with two different alternative encodings.
IBM POWER6
POWER6
The POWER6 is a microprocessor developed by IBM that implemented the Power ISA v.2.03. When it became available in systems in 2007, it succeeded the POWER5+ as IBM's flagship Power microprocessor...
includes DFP in hardware, as does the IBM System z9. SilMinds offers SilAx; a configurable vector DFP coprocessor
Coprocessor
A coprocessor is a computer processor used to supplement the functions of the primary processor . Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, string processing, or encryption. By offloading processor-intensive tasks from the main processor,...
. IEEE 754-2008 defines this in more detail.
Microsoft C#, or .NET, uses System.Decimal.
IEEE 754-2008 encoding
The IEEE 754-2008 standard defines 32-, 64- and 128-bit decimal floating-point representations. Like the binary floating-point formats, the number is divided into a sign, and exponent, and a significandSignificand
The significand is part of a floating-point number, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction.-Examples:...
. Unlike binary floating-point, numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×102=0.1×103=0.01×104, etc. When the significand is zero, the exponent can be any value at all.
decimal32 | decimal64 | decimal128 | decimal(32k) | | Format |
---|---|---|---|---|
1 | 1 | 1 | 1 | Sign field (bits) |
5 | 5 | 5 | 5 | Combination field (bits) |
6 | 8 | 12 | w = 2×k + 4 | Exponent continuation field (bits) |
20 | 50 | 110 | t = 30×k−10 | Coefficient continuation field (bits) |
32 | 64 | 128 | 32×k | Total size (bits) |
7 | 16 | 34 | p = 3×t/10+1 = 9×k−2 | Coefficient size (decimal digits) |
192 | 768 | 12288 | 3×2w = 48×4k | Exponent range |
96 | 384 | 6144 | Emax = 3×2w−1 | Largest value is 9.99...×10Emax |
−95 | −383 | −6143 | Emin = 1−Emax | Smallest normalized value is 1.00...×10Emin |
−101 | −398 | −6176 | Etiny = 2−p−Emax | Smallest non-zero value is 1×10Etiny |
The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.
Two different representations are defined:
- One with a binary integer significand field encodes the significand as a large binary integer between 0 and 10p−1. This is expected to be more convenient for software implementations using a binary ALUArithmetic logic unitIn computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...
. - Another with a densely packed decimal significand field encodes decimal digits more directly. This makes conversion to and from binary floating-point form faster, but requires specialized hardware to manipulate efficiently. This is expected to be more convenient for hardware implementations.
Both alternatives provide exactly the same range of representable values.
The most significant two bits of the exponent are limited to the range of 0−2, and the most significant 4 bits of the significand are limited to the range of 0−9. The 30 possible combinations are encoded in a 5-bit field, along with special forms for infinity and NaN
NaN
In computing, NaN is a value of the numeric data type representing an undefined or unrepresentable value, especially in floating-point calculations...
.
If the most significant 4 bits of the significand are between 0 and 7, the encoded value begins as follows:
s 00 xxxx Exponent begins with 00, significand with 0mmm
s 01 xxxx Exponent begins with 01, significand with 0mmm
s 10 xxxx Exponent begins with 10, significand with 0mmm
If the leading 4 bits of the significand are binary 1000 or 1001 (decimal 8 or 9), the number begins as follows:
s 1100 xx Exponent begins with 00, significand with 100m
s 1101 xx Exponent begins with 01, significand with 100m
s 1110 xx Exponent begins with 10, significand with 100m
The leading bit (s in the above) is a sign bit, and the following bits (xxx in the above) encode the additional exponent bits and the remainder of the most significant digit, but the details vary depending on the encoding alternative used.
The final combinations are used for infinities and NaNs, and are the same for both alternative encodings:
s 11110 x ±Infinity (see Extended real number line
Extended real number line
In mathematics, the affinely extended real number system is obtained from the real number system R by adding two elements: +∞ and −∞ . The projective extended real number system adds a single object, ∞ and makes no distinction between "positive" or "negative" infinity...
)
s 111110 quiet NaN (sign bit ignored)
s 111111 signaling NaN (sign bit ignored)
In the latter cases, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to NaNs by filling it with a single byte value.
Binary integer significand field
This format uses a binary significand from 0 to 10p−1. For example, the Decimal32 significand can be up to 107−1 = 9,999,999 = 98967F16 = 1001 1000 1001 0110 0111 11112. While the encoding can represent larger significands, they are illegal and the standard requires implementations to treat them as 0, if encountered on input.As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (00002 to 01112), or higher (10002 or 10012).
If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 8 bits following the sign bit (the 2 bits mentioned plus 6 bits of "exponent continuation field"), and the significand is the remaining 23 bits, with an implicit leading 0 bit, shown here in parentheses:
s 00eeeeee (0)TTTtttttttttttttttttttt
s 01eeeeee (0)TTTtttttttttttttttttttt
s 10eeeeee (0)TTTtttttttttttttttttttt
This includes subnormal numbers where the leading significand digit is 0.
If the 4 bits after the sign bit are "1100", "1101", or "1110", then the 8-bit exponent field is shifted 2 bits to the right (after both the sign bit and the "11" bits thereafter), and the represented significand is in the remaining 21 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" in the true significand:
s 11 00eeeeee (100)Ttttttttttttttttttttt
s 11 01eeeeee (100)Ttttttttttttttttttttt
s 11 10eeeeee (100)Ttttttttttttttttttttt
The "11" 2-bit sequence after the sign bit indicates that there is an implicit "100" 3-bit prefix to the significand.
Note that the leading bits of the significand field do not encode the most significant decimal digit; they are simply part of a larger pure-binary number. For example, a significand of 8,000,000 is encoded as binary 0111 1010 0001 0010 0000 0000, with the leading 4 bits encoding 7; the first significand which requires a 24th bit (and thus the second emcoding form) is 223 = 8,388,608.
In the above cases, the value represented is:
- (−1)sign × 10exponent−101 × significand
Decimal64 and Decimal128 operate analogously, but with larger exponent continuation and significand fields. For Decimal128, the second encoding form is actually never used; the largest valid significand of 1034−1 =
0x1ED09BEAD87C0378D8E63FFFFFFFF
can be represented in 113 bits.Densely packed decimal significand field
In this version, the significand is stored as a series of decimal digits. The leading digit is between 0 and 9 (3 or 4 binary bits), and the rest of the significand uses the densely packed decimalDensely Packed Decimal
Densely packed decimal is a system of binary encoding for decimal digits.The traditional system of binary encoding for decimal digits, known as Binary-coded decimal , uses four bits to encode each digit, resulting in significant wastage of binary data bandwidth...
encoding.
Unlike the binary integer significand version, where the exponent changed position and came before the significand, this encoding combines the leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand into the five bits that follow the sign bit. This is followed by a fixed-offset exponent continuation field.
Finally, the significand continuation field made of 2, 5, or 11 10-bit "declets", each encoding 3 decimal digits.
If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits after that are interpreted as the leading decimal digit (0 to 7):
Comb. Exponent Significand
s 00 TTT (00)eeeeee (TTT)[tttttttttt][tttttttttt]
s 01 TTT (01)eeeeee (TTT)[tttttttttt][tttttttttt]
s 10 TTT (10)eeeeee (TTT)[tttttttttt][tttttttttt]
If the 4 bits after the sign bit are "1100", "1101", or "1110", then the second two bits are the leading bits of the exponent, and the last bit is prefixed with "100" to form the leading decimal digit (8 or 9):
Comb. Exponent Significand
s 1100 T (00)eeeeee (100T)[tttttttttt][tttttttttt]
s 1101 T (01)eeeeee (100T)[tttttttttt][tttttttttt]
s 1110 T (10)eeeeee (100T)[tttttttttt][tttttttttt]
The remaining two combinations (11110 and 11111) of the 5-bit field are used to represent ±infinity and NaNs, respectively.
Floating point arithmetic operations
The usual rule for performing floating point arithmetic is that the exact mathematical value is calculated, and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, under normal rounding behavior and in the absence of exceptional conditions.For ease of presentation and understanding, 7 digit precision will be used in the examples. The fundamental principles are the same in any precision.
Addition
A simple method to add floating point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits. We proceed with the usual addition method:The following example is decimal means base is simply 10.
123456.7 = 1.234567 * 10^5
101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5 simply
Hence:
123456.7 + 101.7654 = (1.234567 * 10^5) + (1.017654 * 10^2) =
= (1.234567 * 10^5) + (0.001017654 * 10^5) =
= 10^5 * ( 1.234567 + 0.001017654 ) = 10^5 * 1.235584654. simply
This is nothing else as converting to engineering notation.
In detail:
e=5; s=1.234567 (123456.7)
+ e=2; s=1.017654 (101.7654)
e=5; s=1.234567
+ e=5; s=0.001017654 (after shifting)
--------------------
e=5; s=1.235584654 (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
e=5; s=1.235585 (final sum: 123558.5)
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error
Round-off error
A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate this error when using approximation equations and/or algorithms, especially when using finitely many...
. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
e=5; s=1.234567
+ e=-3; s=9.876543
e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
e=5; s=1.23456709876543 (true sum)
e=5; s=1.234567 (after rounding/normalization)
Another problem of loss of significance occurs when two close numbers are subtracted.
e=5; s=1.234571 and e=5; s=1.234567 are representations of the rationals 123457.1467 and 123456.659.
e=5; s=1.234571
- e=5; s=1.234567
----------------
e=5; s=0.000004
e=-1; s=4.000000 (after rounding/normalization)
The best representation of this difference is e=-1; s=4.877000, which differs more than 20% from e=-1; s=4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation
Loss of significance
Loss of significance is an undesirable effect in calculations using floating-point arithmetic. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error, for example in subtracting two large and nearly equal numbers. The effect is that...
illustrates the danger in assuming that all of the digits of a computed result are meaningful.
Dealing with the consequences of these errors are topics in numerical analysis
Numerical analysis
Numerical analysis is the study of algorithms that use numerical approximation for the problems of mathematical analysis ....
.
Multiplication
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.e=3; s=4.734612
× e=5; s=5.417242
-----------------------
e=8; s=25.648538980104 (true product)
e=8; s=25.64854 (after rounding)
e=9; s=2.564854 (after normalization)
Division is done similarly, but that is more complicated.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex.
Further reading
- Decimal Floating-Point: Algorism for Computers, Proceedings of the 16th IEEE Symposium on Computer Arithmetic (Cowlishaw, M. F.Mike CowlishawMike Cowlishaw is a retired IBM Fellow, a Visiting Professor at the Department of Computer Science at the University of Warwick, and is a Fellow of the Royal Academy of Engineering , the Institute of Engineering and Technology , and the British Computer Society.- Career at IBM :Cowlishaw joined IBM...
, 2003)