Flonum Operations (MIT/GNU Scheme 12.1)

4.8.2 Flonum Operations

A flonum is an inexact real number that is implemented as a floating-point number. In MIT/GNU Scheme, all inexact real numbers are flonums. For this reason, constants such as 0. and 2.3 are guaranteed to be flonums.

MIT/GNU Scheme follows the IEEE 754-2008 floating-point standard, using binary64 arithmetic for flonums. All floating-point values are classified into:

normal ¶

Numbers of the form

+/- r^e (1 + f/r^{p-1})

where r, the radix, is a positive integer, here always 2; p, the precision, is a positive integer, here always 53; e, the exponent, is an integer within a limited range, here always -1022 to 1023 (inclusive); and f, the fractional part of the significand, is a (p-1)-bit unsigned integer.

subnormal ¶

Fixed-point numbers near zero that allow for gradual underflow. Every subnormal number is an integer multiple of the smallest subnormal number. Subnormals were also historically called “denormal”.

zero ¶

There are two distinguished zero values, one with “negative” sign bit and one with “positive” sign bit.

The two zero values are considered numerically equal, but they serve to distinguish paths converging to zero along different branch cuts, so some operations yield different results for differently signed zero values.
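As a minimal sketch of how the sign of zero can be observed, the following uses flo:=, flo:sign-negative?, and flo:atan2, all described later in this section. The names above-cut and below-cut are chosen here for illustration only.

```scheme
;; The two zeros compare equal numerically...
(flo:= 0. -0.)                         ; ⇒ #t
;; ...but their sign bits differ.
(flo:sign-negative? 0.)                ; ⇒ #f
(flo:sign-negative? -0.)               ; ⇒ #t

;; On the negative real axis, the two-argument arctangent depends on
;; the sign of a zero first argument: about +pi when approaching the
;; axis from above, about -pi when approaching it from below.
(define above-cut (flo:atan2 0. -1.))
(define below-cut (flo:atan2 -0. -1.))
(flo:positive? above-cut)              ; ⇒ #t
(flo:negative? below-cut)              ; ⇒ #t
```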

infinity ¶

There are two distinguished infinity values, negative infinity or -inf.0 and positive infinity or +inf.0, representing overflow on the real line.

NaN ¶

There are 4 r^{p-2} - 2 distinguished not-a-number values, representing invalid operations or uninitialized data, distinguished by their negative/positive sign bit, a quiet/signalling bit, and a (p-2)-digit unsigned integer payload which must not be zero for signalling NaNs.

Arithmetic on quiet NaNs propagates them without raising any floating-point exceptions. In contrast, arithmetic on signalling NaNs raises the floating-point invalid-operation exception. Quiet NaNs are written +nan.123, -nan.0, etc. Signalling NaNs are written +snan.123, -snan.1, etc. The notations +snan.0 and -snan.0 are not allowed: the bit patterns that would encode them instead represent +inf.0 and -inf.0.

procedure: flo:flonum? object ¶: Returns #t if object is a flonum; otherwise returns #f.

procedure: flo:= flonum1 flonum2 ¶

procedure: flo:< flonum1 flonum2 ¶

procedure: flo:<= flonum1 flonum2 ¶

procedure: flo:> flonum1 flonum2 ¶

procedure: flo:>= flonum1 flonum2 ¶

procedure: flo:<> flonum1 flonum2 ¶

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”. When floating-point exception traps are disabled, they return false when any argument is NaN.

Every pair of floating-point numbers — excluding NaN — exhibits ordered trichotomy: they are related either by flo:=, flo:<, or flo:>.

procedure: flo:safe= flonum1 flonum2 ¶

procedure: flo:safe< flonum1 flonum2 ¶

procedure: flo:safe<= flonum1 flonum2 ¶

procedure: flo:safe> flonum1 flonum2 ¶

procedure: flo:safe>= flonum1 flonum2 ¶

procedure: flo:safe<> flonum1 flonum2 ¶

procedure: flo:unordered? flonum1 flonum2 ¶

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates do not raise floating-point exceptions, and simply return false on NaN arguments, except flo:unordered? which returns true iff at least one argument is NaN; in other words, they are “unordered comparisons”.

Every pair of floating-point values — including NaN — exhibits unordered tetrachotomy: they are related either by flo:safe=, flo:safe<, flo:safe>, or flo:unordered?.
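The tetrachotomy can be illustrated with a small sketch. The helper flo-relation is hypothetical, not part of MIT/GNU Scheme; it only combines the predicates listed above and so never raises a floating-point exception.

```scheme
;; Hypothetical helper: name the one relation that holds between
;; two flonums, using only the non-raising predicates.
(define (flo-relation x y)
  (cond ((flo:safe< x y) 'less)
        ((flo:safe> x y) 'greater)
        ((flo:safe= x y) 'equal)
        (else 'unordered)))   ; reached only when an argument is NaN

(flo-relation 1. 2.)          ; ⇒ less
(flo-relation 0. -0.)         ; ⇒ equal
(flo-relation 1. +nan.0)      ; ⇒ unordered
(flo:unordered? 1. +nan.0)    ; ⇒ #t
```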

procedure: flo:zero? flonum ¶

procedure: flo:positive? flonum ¶

procedure: flo:negative? flonum ¶

Each of these procedures compares its argument to zero. When compiled, they do not check the type of their argument. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”.

(flo:zero? -0.)                ⇒  #t
(flo:negative? -0.)            ⇒  #f
(flo:negative? -1.)            ⇒  #t

(flo:zero? 0.)                 ⇒  #t
(flo:positive? 0.)             ⇒  #f
(flo:positive? 1.)             ⇒  #t

(flo:zero? +nan.123)           ⇒  #f  ; (raises invalid-operation)

procedure: flo:normal? flonum ¶

procedure: flo:subnormal? flonum ¶

procedure: flo:safe-zero? flonum ¶

procedure: flo:infinite? flonum ¶

procedure: flo:nan? flonum ¶

Floating-point classification predicates. For any flonum, exactly one of these predicates returns true. These predicates never raise floating-point exceptions.

(flo:normal? 1.23)             ⇒  #t
(flo:subnormal? 4e-324)        ⇒  #t
(flo:safe-zero? -0.)           ⇒  #t
(flo:infinite? +inf.0)         ⇒  #t
(flo:nan? -nan.123)            ⇒  #t

procedure: flo:finite? flonum ¶

Equivalent to:

(or (flo:safe-zero? flonum)
    (flo:subnormal? flonum)
    (flo:normal? flonum))
; or
(and (not (flo:infinite? flonum))
     (not (flo:nan? flonum)))

True for normal, subnormal, and zero floating-point values; false for infinity and NaN.

procedure: flo:classify flonum ¶: Returns a symbol representing the classification of the flonum, one of normal, subnormal, zero, infinity, or nan.
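For instance, classifying one value of each kind might look like the following sketch. The variable name sample-classes is illustrative only.

```scheme
;; One representative of each class, in the order
;; normal, subnormal, zero, infinity, NaN.
(define sample-classes
  (map (lambda (x) (flo:classify x))
       (list 1.23 4e-324 -0. -inf.0 +nan.123)))

sample-classes   ; ⇒ (normal subnormal zero infinity nan)
```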

procedure: flo:sign-negative? flonum ¶

Returns true if the sign bit of flonum is negative, and false otherwise. Never raises a floating-point exception—not even for signalling NaN.

(flo:sign-negative? +0.)       ⇒  #f
(flo:sign-negative? -0.)       ⇒  #t
(flo:sign-negative? -1.)       ⇒  #t
(flo:sign-negative? +inf.0)    ⇒  #f
(flo:sign-negative? +nan.123)  ⇒  #f

(flo:negative? -0.)            ⇒  #f
(flo:negative? +nan.123)       ⇒  #f  ; (raises invalid-operation)

procedure: flo:+ flonum1 flonum2 ¶
procedure: flo:- flonum1 flonum2 ¶
procedure: flo:* flonum1 flonum2 ¶
procedure: flo:/ flonum1 flonum2 ¶: These procedures are the standard arithmetic operations on flonums. When compiled, they do not check the types of their arguments.

procedure: flo:*+ flonum1 flonum2 flonum3 ¶

procedure: flo:*- flonum1 flonum2 flonum3 ¶

procedure: flo:fma flonum1 flonum2 flonum3 ¶

procedure: flo:fast-fma? ¶

Fused multiply-add: (flo:*+ u v a) computes uv+a correctly rounded, with no intermediate overflow or underflow arising from uv. In contrast, (flo:+ (flo:* u v) a) may have two rounding errors, and can overflow or underflow if uv is too large or too small even if uv + a is normal. Flo:fma is an alias for flo:*+ with the more familiar name used in other languages like C.

(flo:*- u v s) computes uv-s correctly rounded, equivalent to (flo:*+ u v (flo:negate s)).

Flo:fast-fma? returns true if the implementation of fused multiply-add is supported by fast hardware, and false if it is emulated using Dekker’s double-precision algorithm in software.

(flo:+ (flo:* 1.2e100 2e208) -1.4e308)
                               ⇒  +inf.0  ; (raises overflow)
(flo:*+ 1.2e100 2e208  -1.4e308)
                               ⇒  1e308
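Because the fused operation rounds only once, it can also recover the rounding error of an ordinary product exactly. The following is a sketch of that well-known technique, not something prescribed by this manual; the names x, prod, and prod-err are illustrative.

```scheme
;; x = 1 + 2^-30 is exactly representable in binary64.
(define x (flo:+ 1. (exact->inexact (expt 2 -30))))

;; The exact square is 1 + 2^-29 + 2^-60; the 2^-60 term is lost
;; when the product is rounded to 53 bits of precision.
(define prod (flo:* x x))

;; x*x - prod is computed with a single rounding, and the exact
;; difference 2^-60 is representable, so it is recovered exactly.
(define prod-err (flo:*- x x prod))

(flo:= prod-err (exact->inexact (expt 2 -60)))   ; ⇒ #t
```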

procedure: flo:negate flonum ¶

This procedure returns the negation of its argument. When compiled, it does not check the type of its argument. Never raises a floating-point exception—not even for signalling NaN.

This is not equivalent to (flo:- 0. flonum):

(flo:negate 1.2)               ⇒  -1.2
(flo:negate -nan.123)          ⇒  +nan.123
(flo:negate +inf.0)            ⇒  -inf.0
(flo:negate 0.)                ⇒  -0.
(flo:negate -0.)               ⇒  0.

(flo:- 0. 1.2)                 ⇒  -1.2
(flo:- 0. -nan.123)            ⇒  -nan.123
(flo:- 0. +inf.0)              ⇒  -inf.0
(flo:- 0. 0.)                  ⇒  0.
(flo:- 0. -0.)                 ⇒  0.

procedure: flo:abs flonum ¶
procedure: flo:copysign flonum1 flonum2 ¶
procedure: flo:exp flonum ¶
procedure: flo:exp2 flonum ¶
procedure: flo:exp10 flonum ¶
procedure: flo:expm1 flonum ¶
procedure: flo:exp2m1 flonum ¶
procedure: flo:exp10m1 flonum ¶
procedure: flo:log flonum ¶
procedure: flo:log2 flonum ¶
procedure: flo:log10 flonum ¶
procedure: flo:log1p flonum ¶
procedure: flo:logp1 flonum ¶
procedure: flo:log2p1 flonum ¶
procedure: flo:log10p1 flonum ¶
procedure: flo:sin flonum ¶
procedure: flo:cos flonum ¶
procedure: flo:tan flonum ¶
procedure: flo:asin flonum ¶
procedure: flo:acos flonum ¶
procedure: flo:atan flonum ¶
procedure: flo:sin-pi* flonum ¶
procedure: flo:cos-pi* flonum ¶
procedure: flo:tan-pi* flonum ¶
procedure: flo:asin/pi flonum ¶
procedure: flo:acos/pi flonum ¶
procedure: flo:atan/pi flonum ¶
procedure: flo:versin flonum ¶
procedure: flo:exsec flonum ¶
procedure: flo:aversin flonum ¶
procedure: flo:aexsec flonum ¶
procedure: flo:versin-pi* flonum ¶
procedure: flo:exsec-pi* flonum ¶
procedure: flo:aversin/pi flonum ¶
procedure: flo:aexsec/pi flonum ¶
procedure: flo:sinh flonum ¶
procedure: flo:cosh flonum ¶
procedure: flo:tanh flonum ¶
procedure: flo:asinh flonum ¶
procedure: flo:acosh flonum ¶
procedure: flo:atanh flonum ¶
procedure: flo:sqrt flonum ¶
procedure: flo:cbrt flonum ¶
procedure: flo:rsqrt flonum ¶
procedure: flo:sqrt1pm1 flonum ¶
procedure: flo:expt flonum1 flonum2 ¶
procedure: flo:compound flonum1 flonum2 ¶
procedure: flo:compoundm1 flonum1 flonum2 ¶
procedure: flo:erf flonum ¶
procedure: flo:erfc flonum ¶
procedure: flo:hypot flonum1 flonum2 ¶
procedure: flo:j0 flonum ¶
procedure: flo:j1 flonum ¶
procedure: flo:jn flonum ¶
procedure: flo:y0 flonum ¶
procedure: flo:y1 flonum ¶
procedure: flo:yn flonum ¶
procedure: flo:gamma flonum ¶
procedure: flo:lgamma flonum ¶
procedure: flo:floor flonum ¶
procedure: flo:ceiling flonum ¶
procedure: flo:truncate flonum ¶
procedure: flo:round flonum ¶
procedure: flo:floor->exact flonum ¶
procedure: flo:ceiling->exact flonum ¶
procedure: flo:truncate->exact flonum ¶
procedure: flo:round->exact flonum ¶: These procedures are flonum versions of the corresponding generic numeric procedures (abs, exp, log, sin, floor, and so on). When compiled, they do not check the types of their arguments.

procedure: flo:atan2 flonum1 flonum2 ¶
procedure: flo:atan2/pi flonum1 flonum2 ¶: These are the flonum versions of atan and atan/pi with two arguments. When compiled, they do not check the types of their arguments.

procedure: flo:signed-lgamma x ¶

Returns two values,

m = log(|Gamma(x)|)

and

s = sign(Gamma(x)),

respectively a flonum and an exact integer either -1 or 1, so that

Gamma(x) = s * e^m.
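As a sketch of how the two results fit together, the hypothetical helper below rebuilds Gamma(x) from m and s. It assumes the two values are delivered as ordinary multiple values that call-with-values can receive.

```scheme
;; Hypothetical helper: recombine the log-magnitude m and the sign s
;; returned by flo:signed-lgamma into Gamma(x) = s * e^m.
(define (gamma-from-signed-lgamma x)
  (call-with-values (lambda () (flo:signed-lgamma x))
    (lambda (m s)
      (* s (flo:exp m)))))

(gamma-from-signed-lgamma 5.)     ; approximately 24, i.e. 4!
(gamma-from-signed-lgamma -0.5)   ; negative, approximately -3.5449
```

Working with m and s separately is mainly useful when Gamma(x) itself would overflow but its logarithm is still representable.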

procedure: flo:min x1 x2 ¶

procedure: flo:max x1 x2 ¶

Returns the min or max of two floating-point numbers. -0. is considered less than +0. for the purposes of flo:min and flo:max.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns a quiet NaN. In other words, flo:min and flo:max propagate NaN.

These are the minimum and maximum operations of IEEE 754-2019.

procedure: flo:min-mag x1 x2 ¶

procedure: flo:max-mag x1 x2 ¶

Returns the argument that has the smallest or largest magnitude, or the min or max if the magnitude is the same.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns a quiet NaN. In other words, flo:min-mag and flo:max-mag propagate NaN.

These are the minimumMagnitude and maximumMagnitude operations of IEEE 754-2019.

procedure: flo:min-num x1 x2 ¶

procedure: flo:max-num x1 x2 ¶

Returns the min or max of two floating-point numbers. -0. is considered less than +0. for the purposes of flo:min-num and flo:max-num.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns the other one if it is not NaN, or the first argument if they are both NaN. In other words, flo:min-num and flo:max-num treat NaN as missing data and ignore it if possible.

These are the minimumNumber and maximumNumber operations of IEEE 754-2019, formerly called minNum and maxNum in IEEE 754-2008.
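The difference between the NaN-propagating and NaN-ignoring variants can be seen in a short sketch, based on the descriptions above and using only quiet NaNs so that no exception is raised:

```scheme
;; flo:min propagates NaN; flo:min-num and flo:max-num ignore it.
(flo:nan? (flo:min 1. +nan.0))          ; ⇒ #t
(flo:min-num 1. +nan.0)                 ; ⇒ 1.
(flo:max-num +nan.0 2.)                 ; ⇒ 2.

;; Both families order -0. below +0.
(flo:sign-negative? (flo:min 0. -0.))   ; ⇒ #t
(flo:sign-negative? (flo:max 0. -0.))   ; ⇒ #f
```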

procedure: flo:min-mag-num x1 x2 ¶

procedure: flo:max-mag-num x1 x2 ¶

Returns the argument that has the smallest or largest magnitude, or the min or max if the magnitude is the same.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns the other one if it is not NaN, or the first argument if they are both NaN. In other words, flo:min-mag-num and flo:max-mag-num treat NaN as missing data and ignore it if possible.

These are the minimumMagnitudeNumber and maximumMagnitudeNumber operations of IEEE 754-2019, formerly called minNumMag and maxNumMag in IEEE 754-2008.

procedure: flo:ldexp x1 x2 ¶

procedure: flo:scalbn x1 x2 ¶

Flo:ldexp scales by a power of two; flo:scalbn scales by a power of the floating-point radix.

ldexp x e := x * 2^e,
scalbn x e := x * r^e.

In MIT/GNU Scheme, these procedures are the same; they are both provided to make it clearer which operation is meant.

procedure: flo:logb x ¶

For nonzero finite x, returns floor(log(|x|)/log(r)) as an exact integer, where r is the floating-point radix.

For all other inputs, raises invalid-operation and returns #f.

procedure: flo:nextafter x1 x2 ¶

Returns the next floating-point number after x1 in the direction of x2.

(flo:nextafter 0. -1.)         ⇒  -4.9406564584124654e-324
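Two further checks, offered as a sketch, relate flo:nextafter to the constants flo:ulp-of-one and flo:smallest-positive-subnormal documented later in this section:

```scheme
;; Stepping up from 1 lands exactly one ulp above 1.
(flo:= (flo:nextafter 1. 2.)
       (flo:+ 1. flo:ulp-of-one))            ; ⇒ #t

;; Stepping up from zero lands on the smallest positive subnormal.
(flo:= (flo:nextafter 0. 1.)
       flo:smallest-positive-subnormal)      ; ⇒ #t
```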

constant: flo:radix ¶
constant: flo:radix. ¶
constant: flo:precision ¶: Floating-point system parameters. Flo:radix is the floating-point radix as an integer, and flo:precision is the floating-point precision as an integer; flo:radix. is the floating-point radix as a flonum.

constant: flo:error-bound ¶

constant: flo:log-error-bound ¶

constant: flo:ulp-of-one ¶

constant: flo:log-ulp-of-one ¶

Flo:error-bound, sometimes called the machine epsilon, is the maximum relative error of rounding to nearest:

max |x - fl(x)|/|x| = 1/(2 r^(p-1)),

where r is the floating-point radix and p is the floating-point precision.

Flo:ulp-of-one is the distance from 1 to the next larger floating-point number, and is equal to 1/r^{p-1}.

Flo:error-bound is half flo:ulp-of-one.

Flo:log-error-bound is the logarithm of flo:error-bound, and flo:log-ulp-of-one is the logarithm of flo:ulp-of-one.
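The relationship between these constants and rounding can be checked with a short sketch. It assumes the default rounding mode, round to nearest with ties to even.

```scheme
;; The error bound is half an ulp of one.
(flo:= flo:error-bound (flo:/ flo:ulp-of-one 2.))   ; ⇒ #t

;; 1 + ulp-of-one is the next flonum after 1, so it differs from 1.
(flo:> (flo:+ 1. flo:ulp-of-one) 1.)                ; ⇒ #t

;; 1 + error-bound lies exactly halfway between 1 and that next
;; flonum; with ties to even, the sum rounds back down to 1.
(flo:= (flo:+ 1. flo:error-bound) 1.)               ; ⇒ #t
```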

procedure: flo:ulp flonum ¶

Returns the distance from flonum to the next floating-point number larger in magnitude with the same sign. For zero, this returns the smallest subnormal. For infinities, this returns positive infinity. For NaN, this returns the same NaN.

(flo:ulp 1.)                    ⇒  2.220446049250313e-16
(= (flo:ulp 1.) flo:ulp-of-one) ⇒  #t

constant: flo:normal-exponent-max ¶

constant: flo:normal-exponent-min ¶

constant: flo:subnormal-exponent-min ¶

Largest and smallest integer exponents of the radix in normal and subnormal floating-point numbers. Both minimum exponents are negative.

Flo:normal-exponent-max is the largest integer such that (expt flo:radix. flo:normal-exponent-max) does not overflow.
Flo:normal-exponent-min is the smallest integer such that (expt flo:radix. flo:normal-exponent-min) is a normal floating-point number.
Flo:subnormal-exponent-min is the smallest integer such that (expt flo:radix. flo:subnormal-exponent-min) is nonzero; the result is also the smallest positive floating-point number.

constant: flo:largest-positive-normal ¶
constant: flo:smallest-positive-normal ¶
constant: flo:smallest-positive-subnormal ¶: The largest positive normal number, the smallest positive normal number, and the smallest positive subnormal number.

constant: flo:greatest-normal-exponent-base-e ¶
constant: flo:greatest-normal-exponent-base-2 ¶
constant: flo:greatest-normal-exponent-base-10 ¶
constant: flo:least-normal-exponent-base-e ¶
constant: flo:least-normal-exponent-base-2 ¶
constant: flo:least-normal-exponent-base-10 ¶
constant: flo:least-subnormal-exponent-base-e ¶
constant: flo:least-subnormal-exponent-base-2 ¶
constant: flo:least-subnormal-exponent-base-10 ¶: Least and greatest exponents of normal and subnormal floating-point numbers, as floating-point numbers. For example, flo:greatest-normal-exponent-base-2 is the greatest floating-point number such that (expt 2. flo:greatest-normal-exponent-base-2) does not overflow and is a normal floating-point number.

procedure: flo:total< x1 x2 ¶

procedure: flo:total-mag< x1 x2 ¶

procedure: flo:total-order x1 x2 ¶

procedure: flo:total-order-mag x1 x2 ¶

These procedures implement the IEEE 754-2008 total ordering on floating-point values and their magnitudes. Here the “magnitude” of a floating-point value is a floating-point value with positive sign bit and everything else the same; e.g., +nan.123 is the “magnitude” of -nan.123 and 0.0 is the “magnitude” of -0.0.

The total ordering has little to no numerical meaning and should be used only when an arbitrary choice of total ordering is required for some non-numerical reason.

Flo:total< returns true if x1 precedes x2.
Flo:total-mag< returns true if the magnitude of x1 precedes the magnitude of x2.
Flo:total-order returns -1 if x1 precedes x2, 0 if they are the same floating-point value (including sign of zero, or sign and payload of NaN), and +1 if x1 follows x2.
Flo:total-order-mag returns -1 if the magnitude of x1 precedes the magnitude of x2, etc.
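A short sketch of how the total ordering differs from the numerical one, using only quiet NaNs. The use of sort and the name sorted-values are illustrative choices, not part of the flonum interface.

```scheme
;; Unlike flo:<, the total ordering separates the two zeros.
(flo:< -0. 0.)                   ; ⇒ #f
(flo:total< -0. 0.)              ; ⇒ #t

;; NaNs take part in the ordering: negative NaNs precede -inf.0,
;; and positive NaNs follow +inf.0.
(flo:total< -nan.0 -inf.0)       ; ⇒ #t
(flo:total< +inf.0 +nan.0)       ; ⇒ #t

(flo:total-order 1. 1.)          ; ⇒ 0
(flo:total-order 0. -0.)         ; ⇒ 1
(flo:total-order-mag 0. -0.)     ; ⇒ 0

;; An arbitrary but consistent order for a list containing NaN.
(define sorted-values
  (sort (list +nan.0 1. 0. -0. -inf.0) flo:total<))
sorted-values                    ; ⇒ (-inf.0 -0. 0. 1. +nan.0)
```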

procedure: flo:make-nan negative? quiet? payload ¶

procedure: flo:nan-quiet? nan ¶

procedure: flo:nan-payload nan ¶

Flo:make-nan creates a NaN given the sign bit, quiet bit, and payload. Negative? and quiet? must be booleans, and payload must be an unsigned (p-2)-bit integer, where p is the floating-point precision. If quiet? is false, payload must be nonzero.

(flo:sign-negative? (flo:make-nan negative? quiet? payload))
                               ⇒  negative?
(flo:nan-quiet? (flo:make-nan negative? quiet? payload))
                               ⇒  quiet?
(flo:nan-payload (flo:make-nan negative? quiet? payload))
                               ⇒  payload

(flo:make-nan #t #f 42)        ⇒  -snan.42
(flo:sign-negative? +nan.123)  ⇒  #f
(flo:nan-quiet? +nan.123)      ⇒  #t
(flo:nan-payload +nan.123)     ⇒  123