#### 4.7.2 Flonum Operations

A flonum is an inexact real number that is implemented as a floating-point number. In MIT/GNU Scheme, all inexact real numbers are flonums. For this reason, constants such as `0.` and `2.3` are guaranteed to be flonums.

MIT/GNU Scheme follows the IEEE 754-2008 floating-point standard, using binary64 arithmetic for flonums. All floating-point values are classified into:

normal

Numbers of the form

```r^e (1 + f/r^(p-1))
```

where r, the radix, is a positive integer, here always 2; p, the precision, is a positive integer, here always 53; e, the exponent, is an integer within a limited range, here always -1022 to 1023 (inclusive); and f, the fractional part of the significand, is a (p-1)-bit unsigned integer.

subnormal

Fixed-point numbers near zero that allow for gradual underflow. Every subnormal number is an integer multiple of the smallest subnormal number. Subnormals were also historically called “denormal”.

zero

There are two distinguished zero values, one with “negative” sign bit and one with “positive” sign bit.

The two zero values are considered numerically equal, but they serve to distinguish paths converging to zero along different branch cuts, so some operations yield different results for differently signed zero values.

infinity

There are two distinguished infinity values, negative infinity or `-inf.0` and positive infinity or `+inf.0`, representing overflow on the real line.

NaN

There are 4 r^{p-2} - 2 distinguished not-a-number values, representing invalid operations or uninitialized data, distinguished by their negative/positive sign bit, a quiet/signalling bit, and a (p-2)-digit unsigned integer payload which must not be zero for signalling NaNs.

Arithmetic on quiet NaNs propagates them without raising any floating-point exceptions. In contrast, arithmetic on signalling NaNs raises the floating-point invalid-operation exception. Quiet NaNs are written `+nan.123`, `-nan.0`, etc. Signalling NaNs are written `+snan.123`, `-snan.1`, etc. The notation `+snan.0` and `-snan.0` is not allowed: what would be the encoding for them actually means `+inf.0` and `-inf.0`.

procedure: flo:flonum? object

Returns `#t` if object is a flonum; otherwise returns `#f`.

procedure: flo:= flonum1 flonum2
procedure: flo:< flonum1 flonum2
procedure: flo:<= flonum1 flonum2
procedure: flo:> flonum1 flonum2
procedure: flo:>= flonum1 flonum2
procedure: flo:<> flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”. When floating-point exception traps are disabled, they return false when any argument is NaN.

Every pair of floating-point numbers — excluding NaN — exhibits ordered trichotomy: they are related either by `flo:=`, `flo:<`, or `flo:>`.

procedure: flo:safe= flonum1 flonum2
procedure: flo:safe< flonum1 flonum2
procedure: flo:safe<= flonum1 flonum2
procedure: flo:safe> flonum1 flonum2
procedure: flo:safe>= flonum1 flonum2
procedure: flo:safe<> flonum1 flonum2
procedure: flo:unordered? flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates do not raise floating-point exceptions, and simply return false on NaN arguments, except `flo:unordered?` which returns true iff at least one argument is NaN; in other words, they are “unordered comparisons”.

Every pair of floating-point values — including NaN — exhibits unordered tetrachotomy: they are related either by `flo:safe=`, `flo:safe<`, `flo:safe>`, or `flo:unordered?`.
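
The tetrachotomy can be illustrated with a small dispatcher. This is only a sketch: `flo-relation` is a hypothetical helper name, not part of MIT/GNU Scheme, and it uses nothing beyond the predicates documented above.

```scheme
;; Hypothetical helper: names the one relation that holds between
;; two flonums, NaN included, without raising any exception.
(define (flo-relation x y)
  (cond ((flo:unordered? x y) 'unordered)
        ((flo:safe< x y) 'less)
        ((flo:safe> x y) 'greater)
        (else 'equal)))

(flo-relation 1. 2.)       ; ⇒ less
(flo-relation 2. 1.)       ; ⇒ greater
(flo-relation -0. 0.)      ; ⇒ equal
(flo-relation +nan.0 1.)   ; ⇒ unordered
```

Testing `flo:unordered?` first is a stylistic choice; since the other predicates simply return false on NaN, the order of the clauses affects only readability.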

procedure: flo:zero? flonum
procedure: flo:positive? flonum
procedure: flo:negative? flonum

Each of these procedures compares its argument to zero. When compiled, they do not check the type of their argument. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”.

```(flo:zero? -0.)                ⇒ #t
(flo:negative? -0.)            ⇒ #f
(flo:negative? -1.)            ⇒ #t

(flo:zero? 0.)                 ⇒ #t
(flo:positive? 0.)             ⇒ #f
(flo:positive? 1.)             ⇒ #t

(flo:zero? +nan.123)           ⇒ #f  ; (raises invalid-operation)
```
procedure: flo:normal? flonum
procedure: flo:subnormal? flonum
procedure: flo:safe-zero? flonum
procedure: flo:infinite? flonum
procedure: flo:nan? flonum

Floating-point classification predicates. For any flonum, exactly one of these predicates returns true. These predicates never raise floating-point exceptions.

```(flo:normal? 1.23)             ⇒ #t
(flo:subnormal? 4e-324)        ⇒ #t
(flo:safe-zero? -0.)           ⇒ #t
(flo:infinite? +inf.0)         ⇒ #t
(flo:nan? -nan.123)            ⇒ #t
```
procedure: flo:finite? flonum

Equivalent to:

```(or (flo:safe-zero? flonum)
    (flo:subnormal? flonum)
    (flo:normal? flonum))
; or
(and (not (flo:infinite? flonum))
     (not (flo:nan? flonum)))
```

True for normal, subnormal, and zero floating-point values; false for infinity and NaN.

procedure: flo:classify flonum

Returns a symbol representing the classification of the flonum, one of `normal`, `subnormal`, `zero`, `infinity`, or `nan`.
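
As an illustration, the following sketch applies `flo:classify` to one value of each class. The expected symbols follow directly from the description above.

```scheme
(flo:classify 1.23)       ; ⇒ normal
(flo:classify 4e-324)     ; ⇒ subnormal
(flo:classify -0.)        ; ⇒ zero
(flo:classify +inf.0)     ; ⇒ infinity
(flo:classify -nan.123)   ; ⇒ nan
```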

procedure: flo:sign-negative? flonum

Returns true if the sign bit of flonum is negative, and false otherwise. Never raises a floating-point exception.

```(flo:sign-negative? +0.)       ⇒ #f
(flo:sign-negative? -0.)       ⇒ #t
(flo:sign-negative? -1.)       ⇒ #t
(flo:sign-negative? +inf.0)    ⇒ #f
(flo:sign-negative? +nan.123)  ⇒ #f

(flo:negative? -0.)            ⇒ #f
(flo:negative? +nan.123)       ⇒ #f  ; (raises invalid-operation)
```
procedure: flo:+ flonum1 flonum2
procedure: flo:- flonum1 flonum2
procedure: flo:* flonum1 flonum2
procedure: flo:/ flonum1 flonum2

These procedures are the standard arithmetic operations on flonums. When compiled, they do not check the types of their arguments.
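
A few small examples, chosen so that every result is exactly representable. Because compiled code does not check argument types, the last line converts an exact integer with the standard `exact->inexact` before passing it in; that conversion step is a suggested precaution, not something this section prescribes.

```scheme
(flo:+ 1.5 2.25)                  ; ⇒ 3.75
(flo:- 5.5 2.)                    ; ⇒ 3.5
(flo:* 1.5 4.)                    ; ⇒ 6.
(flo:/ 9. 4.)                     ; ⇒ 2.25

;; Exact arguments should be converted to flonums first.
(flo:+ (exact->inexact 1) 2.5)    ; ⇒ 3.5
```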

procedure: flo:*+ flonum1 flonum2 flonum3
procedure: flo:fma flonum1 flonum2 flonum3
procedure: flo:fast-fma?

Fused multiply-add: `(flo:*+ u v a)` computes uv+a correctly rounded, with no intermediate overflow or underflow arising from uv. In contrast, `(flo:+ (flo:* u v) a)` may have two rounding errors, and can overflow or underflow if uv is too large or too small even if uv + a is normal. `Flo:fma` is an alias for `flo:*+` with the more familiar name used in other languages like C.

`Flo:fast-fma?` returns true if the implementation of fused multiply-add is supported by fast hardware, and false if it is emulated using Dekker’s double-precision algorithm in software.

```(flo:+ (flo:* 1.2e100 2e208) -1.4e308)
⇒ +inf.0  ; (raises overflow)
(flo:*+ 1.2e100 2e208  -1.4e308)
⇒ 1e308
```
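
Because the fused operation rounds only once, it can also recover the rounding error of an ordinary product. The sketch below uses the well-known error-free product transformation; `two-product` is a hypothetical helper name, and the identity is assumed to hold only when neither the product nor its error overflows or underflows.

```scheme
;; Hypothetical helper: returns the rounded product p and the exact
;; rounding error e, so that a*b = p + e in exact arithmetic.
(define (two-product a b)
  (let* ((p (flo:* a b))
         (e (flo:*+ a b (flo:negate p))))
    (values p e)))

;; a = 1 + 2^-30, so a*a = 1 + 2^-29 + 2^-60 exactly.
(define a (flo:+ 1. (expt 2. -30)))

(call-with-values (lambda () (two-product a a))
  (lambda (p e)
    (list (= p (flo:+ 1. (expt 2. -29)))   ; rounded product
          (= e (expt 2. -60)))))           ; term lost to rounding
; ⇒ (#t #t)
```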
procedure: flo:negate flonum

This procedure returns the negation of its argument. When compiled, it does not check the type of its argument.

This is not equivalent to `(flo:- 0. flonum)`:

```(flo:negate 1.2)               ⇒ -1.2
(flo:negate -nan.123)          ⇒ +nan.123
(flo:negate +inf.0)            ⇒ -inf.0
(flo:negate 0.)                ⇒ -0.
(flo:negate -0.)               ⇒ 0.

(flo:- 0. 1.2)                 ⇒ -1.2
(flo:- 0. -nan.123)            ⇒ -nan.123
(flo:- 0. +inf.0)              ⇒ -inf.0
(flo:- 0. 0.)                  ⇒ 0.
(flo:- 0. -0.)                 ⇒ 0.
```
procedure: flo:abs flonum
procedure: flo:exp flonum
procedure: flo:log flonum
procedure: flo:sin flonum
procedure: flo:cos flonum
procedure: flo:tan flonum
procedure: flo:asin flonum
procedure: flo:acos flonum
procedure: flo:atan flonum
procedure: flo:sinh flonum
procedure: flo:cosh flonum
procedure: flo:tanh flonum
procedure: flo:asinh flonum
procedure: flo:acosh flonum
procedure: flo:atanh flonum
procedure: flo:sqrt flonum
procedure: flo:cbrt flonum
procedure: flo:expt flonum1 flonum2
procedure: flo:erf flonum
procedure: flo:erfc flonum
procedure: flo:hypot flonum1 flonum2
procedure: flo:j0 flonum
procedure: flo:j1 flonum
procedure: flo:jn flonum
procedure: flo:y0 flonum
procedure: flo:y1 flonum
procedure: flo:yn flonum
procedure: flo:gamma flonum
procedure: flo:lgamma flonum
procedure: flo:floor flonum
procedure: flo:ceiling flonum
procedure: flo:truncate flonum
procedure: flo:round flonum
procedure: flo:floor->exact flonum
procedure: flo:ceiling->exact flonum
procedure: flo:truncate->exact flonum
procedure: flo:round->exact flonum

These procedures are flonum versions of the corresponding procedures. When compiled, they do not check the types of their arguments.

procedure: flo:expm1 flonum
procedure: flo:log1p flonum

Flonum versions of `expm1` and `log1p` with restricted domains: `flo:expm1` is defined only on inputs whose magnitude is below log(2), and `flo:log1p` is defined only on inputs whose magnitude is below 1 - sqrt(1/2). Outside these ranges, callers must use `(- (flo:exp x) 1)` or `(flo:log (+ 1 x))`, respectively.
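
One way to honor the domain restriction is to dispatch on the magnitude of the argument. This is a minimal sketch: the names `expm1-limit`, `log1p-limit`, `expm1-any`, and `log1p-any` are hypothetical, and the thresholds are taken from the description above.

```scheme
;; Hypothetical thresholds, taken from the stated domains.
(define expm1-limit (flo:log 2.))
(define log1p-limit (flo:- 1. (flo:sqrt 0.5)))

;; Use flo:expm1 inside its domain, the plain formula outside.
(define (expm1-any x)
  (if (flo:safe< (flo:abs x) expm1-limit)
      (flo:expm1 x)
      (flo:- (flo:exp x) 1.)))

;; Likewise for flo:log1p.
(define (log1p-any x)
  (if (flo:safe< (flo:abs x) log1p-limit)
      (flo:log1p x)
      (flo:log (flo:+ 1. x))))
```

The comparison uses `flo:safe<` so that a NaN argument falls through to the plain formula instead of raising invalid-operation in the range test itself.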

procedure: flo:atan2 flonum1 flonum2

This is the flonum version of `atan` with two arguments. When compiled, it does not check the types of its arguments.

procedure: flo:signed-lgamma x

Returns two values,

```m = log(|Gamma(x)|)

and

s = sign(Gamma(x)),
```

respectively a flonum and an exact integer either `-1` or `1`, so that

```Gamma(x) = s * e^m.
```
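
The two values can be consumed with `call-with-values`. The helper below, `gamma-from-signed-lgamma`, is a hypothetical name used only to illustrate the identity above; it is a sketch and makes no claim about accuracy or overflow behavior for large arguments.

```scheme
;; Hypothetical helper: rebuilds Gamma(x) as s * e^m.
(define (gamma-from-signed-lgamma x)
  (call-with-values (lambda () (flo:signed-lgamma x))
    (lambda (m s)
      (* s (flo:exp m)))))

(gamma-from-signed-lgamma 5.)     ; approximately 24, since Gamma(5) = 4!
(gamma-from-signed-lgamma -0.5)   ; negative, since Gamma(-1/2) < 0
```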
procedure: flo:min x1 x2
procedure: flo:max x1 x2

Returns the min or max of two floating-point numbers. If either argument is NaN, raises the floating-point invalid-operation exception and returns the other one if it is not NaN, or the first argument if they are both NaN.

procedure: flo:min-mag x1 x2
procedure: flo:max-mag x1 x2

Returns the argument that has the smallest or largest magnitude, as in minNumMag or maxNumMag of IEEE 754-2008. If either argument is NaN, raises the floating-point invalid-operation exception and returns the other one if it is not NaN, or the first argument if they are both NaN.

procedure: flo:ldexp x1 x2
procedure: flo:scalbn x1 x2

`Flo:ldexp` scales by a power of two; `flo:scalbn` scales by a power of the floating-point radix.

```ldexp x e := x * 2^e,
scalbn x e := x * r^e.
```

In MIT/GNU Scheme, these procedures are the same; they are both provided to make it clearer which operation is meant.

procedure: flo:logb x

For nonzero finite x, returns floor(log(x)/log(r)) as an exact integer, where r is the floating-point radix.

For all other inputs, raises invalid-operation and returns `#f`.
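
For positive arguments in radix 2, the formula above gives, for example:

```scheme
(flo:logb 1.)      ; ⇒ 0    since 2^0  <= 1    < 2^1
(flo:logb 8.)      ; ⇒ 3    since 2^3  <= 8    < 2^4
(flo:logb 10.)     ; ⇒ 3    since 2^3  <= 10   < 2^4
(flo:logb 0.75)    ; ⇒ -1   since 2^-1 <= 0.75 < 2^0
```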

procedure: flo:nextafter x1 x2

Returns the next floating-point number after x1 in the direction of x2.

```(flo:nextafter 0. -1.)         ⇒ -4.9406564584124654e-324
```
procedure: flo:copysign x1 x2

Returns a floating-point number with the magnitude of x1 and the sign of x2.

```(flo:copysign 123. 456.)       ⇒ 123.
(flo:copysign +inf.0 -1.)      ⇒ -inf.0
(flo:copysign 0. -1.)          ⇒ -0.
(flo:copysign -0. 0.)          ⇒ 0.
(flo:copysign -nan.123 0.)     ⇒ +nan.123
```
constant: flo:radix
constant: flo:radix.
constant: flo:precision

Floating-point system parameters. `Flo:radix` is the floating-point radix as an integer, and `flo:precision` is the floating-point precision as an integer; `flo:radix.` is the floating-point radix as a flonum.

constant: flo:error-bound
constant: flo:log-error-bound
constant: flo:ulp-of-one
constant: flo:log-ulp-of-one

`Flo:error-bound`, sometimes called the machine epsilon, is the maximum relative error of rounding to nearest:

```max |x - fl(x)|/|x| = 1/(2 r^(p-1)),
```

where r is the floating-point radix and p is the floating-point precision.

`Flo:ulp-of-one` is the distance from 1 to the next larger floating-point number, and is equal to 1/r^{p-1}.

`Flo:error-bound` is half `flo:ulp-of-one`.

`Flo:log-error-bound` is the logarithm of `flo:error-bound`, and `flo:log-ulp-of-one` is the logarithm of `flo:ulp-of-one`.

procedure: flo:ulp flonum

Returns the distance from flonum to the next floating-point number larger in magnitude with the same sign. For zero, this returns the smallest subnormal. For infinities, this returns positive infinity. For NaN, this returns the same NaN.

```(flo:ulp 1.)                    ⇒ 2.220446049250313e-16
(= (flo:ulp 1.) flo:ulp-of-one) ⇒ #t
```
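
The relationships among `flo:ulp`, `flo:ulp-of-one`, `flo:error-bound`, and `flo:nextafter` can be checked directly. This sketch relies only on the definitions given in this section.

```scheme
;; The gap between 1 and its successor is flo:ulp-of-one.
(= (flo:- (flo:nextafter 1. 2.) 1.) flo:ulp-of-one)   ; ⇒ #t

;; flo:error-bound is half of flo:ulp-of-one.
(= (flo:* 2. flo:error-bound) flo:ulp-of-one)         ; ⇒ #t

;; The spacing doubles at each power of the radix.
(= (flo:ulp 2.) (flo:* 2. flo:ulp-of-one))            ; ⇒ #t
```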
constant: flo:normal-exponent-max
constant: flo:normal-exponent-min
constant: flo:subnormal-exponent-min

Largest and smallest integer exponents of the radix in normal and subnormal floating-point numbers.

• `Flo:normal-exponent-max` is the largest positive integer such that `(expt flo:radix. flo:normal-exponent-max)` does not overflow.
• `Flo:normal-exponent-min` is the smallest (most negative) integer such that `(expt flo:radix. flo:normal-exponent-min)` is a normal floating-point number.
• `Flo:subnormal-exponent-min` is the smallest (most negative) integer such that `(expt flo:radix. flo:subnormal-exponent-min)` is nonzero; the resulting value is also the smallest positive floating-point number.
constant: flo:largest-positive-normal
constant: flo:smallest-positive-normal
constant: flo:smallest-positive-subnormal

Smallest and largest normal and subnormal numbers in magnitude.
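
A short sketch of how these constants relate to the classification procedures and to `flo:nextafter` described earlier in this section:

```scheme
(flo:classify flo:largest-positive-normal)       ; ⇒ normal
(flo:classify flo:smallest-positive-normal)      ; ⇒ normal
(flo:classify flo:smallest-positive-subnormal)   ; ⇒ subnormal

;; The first flonum above zero is the smallest subnormal.
(= (flo:nextafter 0. 1.) flo:smallest-positive-subnormal)   ; ⇒ #t
```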

constant: flo:greatest-normal-exponent-base-e
constant: flo:greatest-normal-exponent-base-2
constant: flo:greatest-normal-exponent-base-10
constant: flo:least-normal-exponent-base-e
constant: flo:least-normal-exponent-base-2
constant: flo:least-normal-exponent-base-10
constant: flo:least-subnormal-exponent-base-e
constant: flo:least-subnormal-exponent-base-2
constant: flo:least-subnormal-exponent-base-10

Least and greatest exponents of normal and subnormal floating-point numbers, as floating-point numbers. For example, `flo:greatest-normal-exponent-base-2` is the greatest floating-point number such that `(expt 2. flo:greatest-normal-exponent-base-2)` does not overflow and is a normal floating-point number.

procedure: flo:total< x1 x2
procedure: flo:total-mag< x1 x2
procedure: flo:total-order x1 x2
procedure: flo:total-order-mag x1 x2

These procedures implement the IEEE 754-2008 total ordering on floating-point values and their magnitudes. Here the “magnitude” of a floating-point value is a floating-point value with positive sign bit and everything else the same; e.g., `+nan.123` is the “magnitude” of `-nan.123` and `0.0` is the “magnitude” of `-0.0`.

The total ordering has little to no numerical meaning and should be used only when an arbitrary choice of total ordering is required for some non-numerical reason.

• `Flo:total<` returns true if x1 precedes x2.
• `Flo:total-mag<` returns true if the magnitude of x1 precedes the magnitude of x2.
• `Flo:total-order` returns -1 if x1 precedes x2, 0 if they are the same floating-point value (including sign of zero, or sign and payload of NaN), and +1 if x1 follows x2.
• `Flo:total-order-mag` returns -1 if the magnitude of x1 precedes the magnitude of x2, etc.
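
The examples below sketch how the total ordering differs from the numerical comparisons. The expected results follow the IEEE 754-2008 totalOrder rules, under which negative zero precedes positive zero and positive NaNs follow positive infinity.

```scheme
(flo:= -0. 0.)               ; ⇒ #t   numerically equal
(flo:total< -0. 0.)          ; ⇒ #t   but -0. precedes 0.
(flo:total-order -0. 0.)     ; ⇒ -1
(flo:total-order 1. 1.)      ; ⇒ 0
(flo:total-order 2. 1.)      ; ⇒ 1
(flo:total< +inf.0 +nan.0)   ; ⇒ #t   positive NaN follows +inf.0
(flo:total-mag< -2. 1.)      ; ⇒ #f   compares magnitudes 2. and 1.
```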
procedure: flo:make-nan negative? quiet? payload
procedure: flo:nan-quiet? nan

`Flo:make-nan` creates a NaN given the sign bit, quiet bit, and payload. Negative? and quiet? must be booleans, and payload must be an unsigned (p-2)-bit integer, where p is the floating-point precision. If quiet? is false, payload must be nonzero.

```(flo:sign-negative? (flo:make-nan negative? quiet? payload))
⇒ negative?