IEEE 754r
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard established by the Institute of Electrical and Electronics Engineers (IEEE) and the most widely used standard for floating-point computation, followed by many hardware (CPU and FPU) and software implementations. Many computer languages allow or require that some or all arithmetic be carried out using IEEE 754 formats and operations. The current version is IEEE 754-2008, which was published in August 2008; it includes nearly all of the original IEEE 754-1985 (which was published in 1985) and the IEEE Standard for Radix-Independent Floating-Point Arithmetic (IEEE 854-1987). The international standard ISO/IEC/IEEE 60559:2011 has been approved for adoption through JTC1/SC 25 under the ISO/IEEE PSDO Agreement and published.

The standard defines
  • arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)
  • interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form
  • rounding algorithms: methods to be used for rounding numbers during arithmetic and conversions
  • operations: arithmetic and other operations on arithmetic formats
  • exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)


The standard also includes extensive recommendations for advanced exception handling, additional operations (such as trigonometric functions), expression evaluation, and for achieving reproducible results.

The standard is derived from and replaces IEEE 754-1985, the previous version, following a seven-year revision process, chaired by Dan Zuras and edited by Mike Cowlishaw. The binary formats in the original standard are included in the new standard along with three new basic formats (one binary and two decimal). To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.

Formats

Formats in IEEE 754 describe sets of floating-point data and encodings for interchanging them.

A given format comprises:
  • Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is most simply described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is
      $(-1)^s \times c \times b^q$
    where b is the base (2 or 10). For example, if the sign is 1 (indicating negative), the significand is 12345, the exponent is −3, and the base is 10, then the value of the number is −12.345.

  • Two infinities: +∞ and −∞.

  • Two kinds of NaN (quiet and signaling). A NaN may also carry a payload, intended for diagnostic information indicating the source of the NaN. The sign of a NaN has no meaning, but it may be predictable in some circumstances.
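
As a rough illustration of the value formula for finite numbers, the following Python sketch evaluates (−1)^s × c × b^q exactly using rational arithmetic. The function name `finite_value` is hypothetical and exists only for this example; it is not an operation defined by the standard.

```python
from fractions import Fraction

def finite_value(s: int, c: int, q: int, b: int) -> Fraction:
    """Exact value of the finite number (s, c, q) in base b (illustrative helper)."""
    if s not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if b not in (2, 10):
        raise ValueError("base must be 2 or 10")
    # Fraction keeps the result exact even when the exponent q is negative.
    return (-1) ** s * c * Fraction(b) ** q

# The example from the text: sign 1, significand 12345, exponent -3, base 10.
example = finite_value(1, 12345, -3, 10)
print(example, float(example))
```

Exact rationals are used here so that the illustration does not itself depend on any floating-point rounding.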


The possible finite values that can be represented in a given format are determined by the base (b), the number of digits in the significand (precision, p), and the exponent parameter emax:
  • c must be an integer in the range zero through $b^p - 1$ (e.g., if b=10 and p=7 then c is 0 through 9999999)
  • q must be an integer such that $1 - \mathrm{emax} \le q + p - 1 \le \mathrm{emax}$ (e.g., if p=7 and emax=96 then q is −101 through 90).


Hence (for the example parameters) the smallest non-zero positive number that can be represented is $1 \times 10^{-101}$ and the largest is $9999999 \times 10^{90}$ ($9.999999 \times 10^{96}$), and the full range of numbers is $-9.999999 \times 10^{96}$ through $9.999999 \times 10^{96}$. The numbers $-b^{1-\mathrm{emax}}$ and $b^{1-\mathrm{emax}}$ (here, $-1 \times 10^{-95}$ and $1 \times 10^{-95}$) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
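
The limits quoted for the example parameters can be derived mechanically from b, p, and emax. The sketch below does so in Python; `format_limits` and its dictionary keys are hypothetical names introduced for this illustration only.

```python
from fractions import Fraction

def format_limits(b: int, p: int, emax: int) -> dict:
    """Derive the limits of a format from its parameters (illustrative helper)."""
    q_min = 1 - emax - (p - 1)   # lower bound: 1 - emax <= q + p - 1
    q_max = emax - (p - 1)       # upper bound: q + p - 1 <= emax
    c_max = b ** p - 1           # largest significand
    return {
        "q_min": q_min,
        "q_max": q_max,
        "c_max": c_max,
        "smallest_positive": Fraction(b) ** q_min,    # c = 1, q = q_min
        "largest": c_max * Fraction(b) ** q_max,      # c = c_max, q = q_max
        "smallest_normal": Fraction(b) ** (1 - emax),
    }

# Example parameters from the text: b = 10, p = 7, emax = 96.
limits = format_limits(b=10, p=7, emax=96)
print(limits["q_min"], limits["q_max"], limits["c_max"])
```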

Zero values are finite values with significand 0. These are signed zeros: the sign bit specifies whether a zero is +0 (positive zero) or −0 (negative zero).
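
A small Python sketch of how the two zeros behave, assuming (as on common platforms) that Python's `float` is an IEEE 754 binary64 value:

```python
import math

pos_zero = 0.0
neg_zero = -0.0

# The two zeros compare as equal under the ordinary comparison...
print(pos_zero == neg_zero)

# ...but the sign is still observable, for example through copysign.
print(math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))
```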

Basic formats

The standard defines five basic formats, named using their base and the number of bits used to encode them. A conforming implementation must fully implement at least one of the basic formats. There are three binary floating-point basic formats (which can be encoded using 32, 64 or 128 bits) and two decimal floating-point basic formats (which can be encoded using 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985.

The precision of each binary format is one greater than the width of its significand field, because there is an implied (hidden) 1 bit.
Name       | Common name         | Base | Digits | E min  | E max  | Notes              | Decimal digits | Decimal E max
binary16   | Half precision      | 2    | 10+1   | −14    | +15    | storage, not basic | 3.31           | 4.51
binary32   | Single precision    | 2    | 23+1   | −126   | +127   |                    | 7.22           | 38.23
binary64   | Double precision    | 2    | 52+1   | −1022  | +1023  |                    | 15.95          | 307.95
binary128  | Quadruple precision | 2    | 112+1  | −16382 | +16383 |                    | 34.02          | 4931.77
decimal32  |                     | 10   | 7      | −95    | +96    | storage, not basic | 7              | 96
decimal64  |                     | 10   | 16     | −383   | +384   |                    | 16             | 384
decimal128 |                     | 10   | 34     | −6143  | +6144  |                    | 34             | 6144


Decimal digits is digits × $\log_{10}$ base; this gives the approximate precision in decimal digits.

Decimal E max is E max × $\log_{10}$ base; this gives the maximum exponent in decimal.
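
The two derived columns can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the parameters shown are those listed in the table for binary32, binary64, and decimal64.

```python
import math

def decimal_digits(base: int, digits: int) -> float:
    """Precision expressed in decimal digits: digits * log10(base)."""
    return digits * math.log10(base)

def decimal_emax(base: int, emax: int) -> float:
    """Maximum exponent expressed in decimal: emax * log10(base)."""
    return emax * math.log10(base)

# (base, precision in digits including the hidden bit, emax) from the table
formats = {
    "binary32": (2, 24, 127),
    "binary64": (2, 53, 1023),
    "decimal64": (10, 16, 384),
}
for name, (base, p, emax) in formats.items():
    print(name, round(decimal_digits(base, p), 2), round(decimal_emax(base, emax), 2))
```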

All the basic formats are available in both hardware and software implementations.

Arithmetic formats

A format that is just to be used for arithmetic and other operations need not have an encoding associated with it (that is, an implementation can use whatever internal representation it chooses); all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent) that it can represent.

Interchange formats

Interchange formats are intended for the exchange of floating-point data using a fixed-length bit-string for a given format.

For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥128 are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).

The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by w exponent bits that describe the exponent offset by a bias, and p−1 bits that describe the significand. The width of the exponent field for a k-bit format is computed as $w = \mathrm{round}(4 \log_2 k) - 13$. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8) than this formula would provide (3 and 7, respectively).
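
To make the field layout concrete, the following Python sketch splits a binary32 encoding into its sign bit, its 8 exponent bits, and its 23 trailing significand bits. The function name `decode_binary32` is hypothetical; the sketch relies on the `struct` module's `f` format code, which packs a value as IEEE 754 binary32, and on the binary32 exponent bias of 127.

```python
import struct

def decode_binary32(x: float) -> tuple:
    """Split the binary32 encoding of x into (sign, biased exponent, trailing significand)."""
    # Pack as a big-endian binary32, then reinterpret the 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    biased_exp = (bits >> 23) & 0xFF  # w = 8 exponent bits
    fraction = bits & 0x7FFFFF        # p - 1 = 23 trailing significand bits
    return sign, biased_exp, fraction

BIAS = 127  # exponent bias of binary32

sign, biased_exp, fraction = decode_binary32(-6.5)
# For a normal number the hidden leading 1 bit is restored before scaling.
value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (biased_exp - BIAS)
print(sign, biased_exp - BIAS, hex(fraction), value)
```

The reconstruction on the last lines only handles normal numbers; zeros, subnormals, infinities, and NaNs use the reserved exponent field values and would need separate cases.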

As with IEEE 754-1985, there is some flexibility in the encoding of signaling NaNs.

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined.

The encoding scheme for the decimal interchange formats similarly encodes the sign, exponent, and significand, but uses a more complex approach to allow the significand to be encoded as a compressed sequence of decimal digits (using Densely Packed Decimal) or as a binary integer. In either case the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and signaling NaNs have a unique encoding (and the same set of possible payloads).

Rounding algorithms

The standard defines five rounding algorithms. The first two round to a nearest value; the others are called directed roundings:

Roundings to nearest

  • Round to nearest, ties to even – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time; this is the default algorithm for binary floating-point and the recommended default for decimal
  • Round to nearest, ties away from zero – rounds to the nearest value; if the number falls midway it is rounded to the nearest value above (for positive numbers) or below (for negative numbers)

Directed roundings

  • Round toward 0 – directed rounding towards zero (also known as truncation).
  • Round toward +∞ – directed rounding towards positive infinity (also known as rounding up or ceiling).
  • Round toward −∞ – directed rounding towards negative infinity (also known as rounding down or floor).
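
The five algorithms can be tried out with Python's `decimal` module, whose rounding constants correspond to them. This is a sketch under a simplifying assumption: it rounds to an integer rather than to the precision of a particular format, and the mapping names in `MODES` are labels chosen for this example.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

# Labels for the five rounding algorithms mapped to decimal-module constants.
MODES = {
    "ties to even": ROUND_HALF_EVEN,
    "ties away from zero": ROUND_HALF_UP,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}

def round_to_integer(text: str, mode: str) -> Decimal:
    """Round the decimal string to an integer using the named algorithm."""
    return Decimal(text).quantize(Decimal("1"), rounding=MODES[mode])

for value in ("2.5", "-2.5", "2.7", "-2.7"):
    print(value, {name: str(round_to_integer(value, name)) for name in MODES})
```

The midway cases 2.5 and −2.5 are where the two round-to-nearest algorithms differ; the directed roundings differ on any inexact value.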

Operations

Required operations for a supported arithmetic format (including the basic formats) include:
  • Arithmetic operations (add, subtract, multiply, divide, square root, fused multiply–add, remainder, etc.)
  • Conversions (between formats, to and from strings, etc.)
  • Scaling and (for decimal) quantizing
  • Copying and manipulating the sign (abs, negate, etc.)
  • Comparisons and total ordering
  • Classification and testing for NaNs, etc.
  • Testing and setting flags
  • Miscellaneous operations.

Total-ordering predicate

The standard provides a predicate totalOrder which defines a total ordering for all floating-point numbers in each format. The predicate agrees with the normal comparison operations when they say one floating-point number is less than another. The normal comparison operations, however, treat NaNs as unordered and compare −0 and +0 as equal. The totalOrder predicate orders these cases, and it also distinguishes between different representations of NaNs and between the same decimal floating-point number encoded in different ways.
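
One common way to approximate such an ordering for binary64 values is to compare transformed bit patterns. The Python sketch below takes that approach; it is an illustration rather than the standard's definition, the names `_order_key` and `total_order` are hypothetical, and it assumes that Python's `float` is binary64.

```python
import math
import struct

def _order_key(x: float) -> int:
    """Map a binary64 value to an integer whose natural order follows the total order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Negative sign: flip every bit so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Positive sign: set the top bit so all positives sort above all negatives.
    return bits | (1 << 63)

def total_order(x: float, y: float) -> bool:
    """Return True if x precedes or equals y in the total order (sketch)."""
    return _order_key(x) <= _order_key(y)

pos_nan = math.copysign(float("nan"), 1.0)
neg_nan = math.copysign(float("nan"), -1.0)

print(total_order(-0.0, 0.0), total_order(0.0, -0.0))
print(total_order(float("inf"), pos_nan), total_order(neg_nan, float("-inf")))
```

With this key, negative NaNs sort below −∞, −0 sorts below +0, and positive NaNs sort above +∞, while ordinary comparisons still report −0 and +0 as equal and NaNs as unordered.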

Exception handling

The standard defines five exceptions, each of which has a corresponding status flag that (except in certain cases of underflow) is raised when the exception occurs. No other action is required, but alternatives are recommended (see below).

The five possible exceptions are:
  • Invalid operation (e.g., square root of a negative number)
  • Division by zero
  • Overflow (a result is too large to be represented correctly)
  • Underflow (a result is very small (outside the normal range) and is inexact)
  • Inexact.


These are the same five exceptions as were defined in IEEE 754-1985.
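
The default behaviour (raise a status flag and continue) can be observed with Python's `decimal` module, which provides signals with these five names. The sketch below is an illustration under stated assumptions: the context uses decimal32-like parameters (precision 7, exponent range −95 to 96), all traps are disabled, and the particular operations are chosen only to provoke each exception.

```python
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow)

# decimal32-like parameters; traps=[] means exceptions only raise flags.
ctx = Context(prec=7, Emin=-95, Emax=96, traps=[])

cases = {
    InvalidOperation: lambda: ctx.sqrt(Decimal(-1)),
    DivisionByZero: lambda: ctx.divide(Decimal(1), Decimal(0)),
    Overflow: lambda: ctx.multiply(Decimal("9E96"), Decimal(10)),
    Underflow: lambda: ctx.divide(Decimal("1E-95"), Decimal(3)),
    Inexact: lambda: ctx.divide(Decimal(1), Decimal(3)),
}

results = {}
for signal, operation in cases.items():
    ctx.clear_flags()                 # start each case with no flags raised
    value = operation()               # a result is still delivered
    results[signal.__name__] = (str(value), bool(ctx.flags[signal]))
    print(signal.__name__, value, ctx.flags[signal])
```

Each operation returns a value (such as a NaN or an infinity) and leaves the corresponding flag raised for later inspection.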

Alternate exception handling

The standard recommends optional exception handling in various forms, including traps (exceptions that change the flow of control in some way) and other exception handling models which interrupt the flow, such as try/catch. The traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.
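
As one possible illustration of a trap combined with try/catch-style handling, Python's `decimal` module lets a context enable a trap so that the exception interrupts the flow of control as a Python exception. This sketch shows that library's mechanism, not a mechanism mandated by the standard.

```python
from decimal import Context, Decimal, DivisionByZero

# Enabling a trap turns the exception into a raised Python exception.
trapping = Context(prec=7, traps=[DivisionByZero])

try:
    trapping.divide(Decimal(1), Decimal(0))
    outcome = "no trap"
except DivisionByZero:
    outcome = "trapped"

print(outcome)
```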

Recommended operations

A new clause in the standard recommends fifty operations, including log, power, and trigonometric functions, that language standards should define. These are all optional (none are required in order to conform to the standard). The operations include some on dynamic modes for attributes, and also a set of reduction operations (sum, scaled product, etc.). All are required to supply a correctly rounded result, but they do not have to detect or report inexactness.

Expression evaluation

The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result.

Reproducibility

IEEE 754-1985 allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has tightened up many of these, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

Character representation

The standard requires operations to convert between basic formats and external character sequence formats. Conversions to and from a decimal character format are required for all formats. Conversion to an external character sequence must be such that conversion back using round to even will recover the original number. There is no requirement to preserve the payload of a NaN or signaling NaN, and conversion from the external character sequence may turn a signaling NaN into a quiet NaN.

The original binary value will be preserved by converting to decimal and back again using:
  • 5 decimal digits for binary16
  • 9 decimal digits for binary32
  • 17 decimal digits for binary64
  • 36 decimal digits for binary128


For other binary formats the required number of decimal digits is
$1 + \lceil p \log_{10} 2 \rceil$


where p is the number of significant bits in the binary format, e.g. 24 bits for binary32.
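
The digit counts listed above follow from this formula, and the binary64 case can be checked directly. In the Python sketch below, `round_trip_digits` is a hypothetical helper name, and the round-trip check assumes that Python's `float` is binary64.

```python
import math

def round_trip_digits(p: int) -> int:
    """Decimal digits sufficient to round-trip a binary format with p significant bits."""
    return 1 + math.ceil(p * math.log10(2))

for name, p in {"binary16": 11, "binary32": 24, "binary64": 53, "binary128": 113}.items():
    print(name, round_trip_digits(p))

# Convert a binary64 value to decimal text and back with 17 and with 16 digits.
x = 0.1 + 0.2
seventeen = format(x, ".17g")
sixteen = format(x, ".16g")
print(seventeen, float(seventeen) == x)
print(sixteen, float(sixteen) == x)
```

For this particular value, 16 significant digits are not enough to recover the original binary64 number, while 17 are.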

Correct rounding is only guaranteed for the number of decimal digits above plus 3 for the largest binary format supported. For instance, if binary32 is the largest supported binary format, then a conversion from a decimal external sequence with 12 decimal digits is guaranteed to be correctly rounded when converted to binary32, but conversion of a sequence of 13 decimal digits is not.

When using a decimal floating point format the decimal representation will be preserved using:
  • 7 decimal digits for decimal32
  • 16 decimal digits for decimal64
  • 34 decimal digits for decimal128

See also

  • IEEE 754-1985
  • Minifloat, low-precision binary floating-point formats following IEEE 754 principles
  • Half precision – Single precision – Double precision – Quadruple precision
  • IBM System z9, the first CPU to implement IEEE 754-2008 (using hardware microcode)
  • z10, a CPU that implements IEEE 754-2008 fully in hardware
  • POWER6, a CPU that implements IEEE 754-2008 fully in hardware
  • Coprocessor
  • strictfp, a keyword in the Java programming language that ensures IEEE 754 implementation.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 