The rest of this chapter uses a number of terms. Here are some informal definitions that should help you work your way through the material here:

*Accuracy* A floating-point calculation’s accuracy is how close it comes to the real (paper and pencil) value.

*Error* The difference between what the result of a computation “should be” and what it actually is. It is best to minimize error as much as possible.

*Exponent* The order of magnitude of a value; some number of bits in a floating-point value store the exponent.

*Inf* A special value representing infinity. Most operations that combine infinity with a finite number, such as addition, produce infinity.

*NaN* “Not a number.” A special value that results from attempting a calculation that has no answer as a real number. See Floating Point Values They Didn’t Talk About In School for more information about infinity and not-a-number values.

*Normalized* How the significand (see later in this list) is usually stored. The value is adjusted so that the first bit is one, and then that leading one is assumed instead of physically stored. This provides one extra bit of precision.

*Precision* The number of bits used to represent a floating-point number. The more bits, the more digits you can represent. Binary and decimal precisions are related approximately, according to the formula:

`prec` = 3.322 * `dps`

Here, *prec* denotes the binary precision (measured in bits) and *dps* (short for decimal places) is the number of decimal digits.

*Rounding mode* How numbers are rounded up or down when necessary. More details are provided later.

*Significand* A floating-point value consists of the significand multiplied by a base raised to the power of the exponent. The base is 10 in decimal notation and 2 in the binary formats described below. For example, in `1.2345e67`, the significand is `1.2345`.

*Stability* From the Wikipedia article on numerical stability: “Calculations that can be proven not to magnify approximation errors are called *numerically stable*.”

See the Wikipedia article on accuracy and precision for more information on some of those terms.
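To make the precision formula and the idea of error concrete, here is a minimal sketch in Python, chosen only for illustration. It assumes that Python's `float` is an IEEE 754 double, which is the case on common platforms. The constant 3.322 in the formula is approximately log2(10); the helper name `decimal_digits` is hypothetical.

```python
import math

# The factor 3.322 in the formula is an approximation of log2(10).
BITS_PER_DIGIT = math.log2(10)

def decimal_digits(prec):
    """Approximate number of decimal digits for `prec` bits of precision."""
    return prec / BITS_PER_DIGIT

for name, prec in [("single", 24), ("double", 53), ("quadruple", 113)]:
    print(f"{name:9s} {prec:3d} bits is about {decimal_digits(prec):.1f} decimal digits")

# Error: the difference between what the result "should be" and what it is.
# Note that 0.3 here is itself the nearest double to the decimal value 0.3.
computed = 0.1 + 0.2
error = abs(computed - 0.3)
print(f"0.1 + 0.2 gives {computed!r}; it differs from 0.3 by {error:.3g}")
```

The error is tiny but not zero, which is why comparing floating-point results for exact equality is usually a mistake.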

On modern systems, floating-point hardware uses the representation and
operations defined by the IEEE 754 standard.
Three of the standard IEEE 754 types are 32-bit single precision,
64-bit double precision, and 128-bit quadruple precision.
The standard also specifies extended precision formats
to allow greater precisions and larger exponent ranges.
(`awk` uses only the 64-bit double-precision format.)

Table 16.3 lists the precision and exponent field values for the basic IEEE 754 binary formats.

Name | Total bits | Precision (bits) | Minimum exponent | Maximum exponent |
---|---|---|---|---|
Single | 32 | 24 | −126 | +127 |
Double | 64 | 53 | −1022 | +1023 |
Quadruple | 128 | 113 | −16382 | +16383 |

NOTE: The precision numbers include the implied leading one that gives them one extra bit of significand.
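The double-precision row of the table can be checked by taking a value apart bit by bit. The following is a sketch only, again in Python and again assuming that `float` is a 64-bit IEEE 754 double. The function names `decode_double` and `describe` are made up for this illustration, and `describe` handles normalized values only.

```python
import struct
import sys

def decode_double(x):
    """Split a double into its sign, biased exponent, and fraction fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF # 11 bits
    fraction = bits & ((1 << 52) - 1)      # 52 bits physically stored
    return sign, biased_exponent, fraction

def describe(x):
    """Return (exponent, significand bit count) for a normalized double."""
    _sign, biased, fraction = decode_double(x)
    exponent = biased - 1023               # remove the exponent bias
    significand = (1 << 52) | fraction     # restore the implied leading one
    return exponent, significand.bit_length()

for label, value in [("one", 1.0),
                     ("largest finite", sys.float_info.max),
                     ("smallest normalized", sys.float_info.min)]:
    exponent, bits = describe(value)
    print(f"{label:20s} exponent {exponent:+5d}, significand {bits} bits")
```

Only 52 significand bits are stored; restoring the implied leading one yields the 53 bits of precision listed in the table, and the two extreme values show the exponent limits of +1023 and −1022.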