Floating Point Numbers (The GNU C Library)

20.3 Floating Point Numbers

Most computer hardware has support for two different kinds of numbers: integers (…-3, -2, -1, 0, 1, 2, 3…) and floating-point numbers. Floating-point numbers have three parts: the mantissa, the exponent, and the sign bit. The real number represented by a floating-point value is given by (s ? -1 : 1) · 2^e · M where s is the sign bit, e the exponent, and M the mantissa. See Floating Point Representation Concepts, for details. (It is possible to have a different base for the exponent, but nearly all modern hardware uses 2.)

Floating-point numbers can represent only a finite subset of the real numbers. While this subset is large enough for most purposes, it is important to remember that the only reals that can be represented exactly are rational numbers whose terminating binary expansion fits within the width of the mantissa. Even simple fractions such as 1/5 can only be approximated in floating point.

Mathematical operations and functions frequently need to produce values that are not representable. Often these values can be approximated closely enough for practical purposes, but sometimes they can’t. Historically, there was no way to tell when the results of a calculation were inaccurate. Modern computers implement the IEEE 754 standard for numerical computations, which defines a framework for indicating to the program when the results of a calculation are not trustworthy. This framework consists of a set of exceptions that indicate why a result could not be represented, and the special values infinity and not-a-number (NaN).