4.8.2 Flonum Operations

A flonum is an inexact real number that is implemented as a floating-point number. In MIT/GNU Scheme, all inexact real numbers are flonums. For this reason, constants such as 0. and 2.3 are guaranteed to be flonums.

MIT/GNU Scheme follows the IEEE 754-2008 floating-point standard, using binary64 arithmetic for flonums. All floating-point values are classified into:

normal

Numbers of the form

+/- r^e (1 + f/r^{p-1})

where r, the radix, is a positive integer, here always 2; p, the precision, is a positive integer, here always 53; e, the exponent, is an integer within a limited range, here always -1022 to 1023 (inclusive); and f, the fractional part of the significand, is a (p-1)-bit unsigned integer.

subnormal

Fixed-point numbers near zero that allow for gradual underflow. Every subnormal number is an integer multiple of the smallest subnormal number. Subnormals were also historically called “denormal”.

zero

There are two distinguished zero values, one with “negative” sign bit and one with “positive” sign bit.

The two zero values are numerically equal, but they serve to distinguish paths converging to zero along different branch cuts, so some operations yield different results for differently signed zero values.

infinity

There are two distinguished infinity values, negative infinity or -inf.0 and positive infinity or +inf.0, representing overflow on the real line.

NaN

There are 4 r^{p-2} - 2 distinguished not-a-number values, representing invalid operations or uninitialized data, distinguished by their negative/positive sign bit, a quiet/signalling bit, and a (p-2)-digit unsigned integer payload which must not be zero for signalling NaNs.

Arithmetic on quiet NaNs propagates them without raising any floating-point exceptions. In contrast, arithmetic on signalling NaNs raises the floating-point invalid-operation exception. Quiet NaNs are written +nan.123, -nan.0, etc. Signalling NaNs are written +snan.123, -snan.1, etc. The notation +snan.0 and -snan.0 is not allowed: what would be the encoding for them actually means +inf.0 and -inf.0.
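
The form given for normal numbers can be checked with exact arithmetic. The following is a minimal sketch, assuming binary64 (r = 2, p = 53) and scaling the 52-bit fraction f by 2^52 so that the significand lies in [1, 2); normal-flonum is a hypothetical helper for illustration, not part of MIT/GNU Scheme.

```scheme
;; Hypothetical helper: build the normal binary64 value 2^e (1 + f/2^52)
;; from an exponent e in [-1022, 1023] and a 52-bit unsigned fraction f.
;; The arithmetic is exact; exact->inexact performs the only conversion.
(define (normal-flonum e f)
  (exact->inexact (* (expt 2 e) (+ 1 (/ f (expt 2 52))))))

(normal-flonum 0 0)                    ; ⇒ 1.
(normal-flonum 0 (expt 2 51))          ; ⇒ 1.5
(normal-flonum 3 0)                    ; ⇒ 8.
(flo:normal? (normal-flonum -1022 0))  ; ⇒ #t
```

Every value built this way is exactly representable, so the final conversion does not round.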

procedure: flo:flonum? object

Returns #t if object is a flonum; otherwise returns #f.

procedure: flo:= flonum1 flonum2
procedure: flo:< flonum1 flonum2
procedure: flo:<= flonum1 flonum2
procedure: flo:> flonum1 flonum2
procedure: flo:>= flonum1 flonum2
procedure: flo:<> flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”. When floating-point exception traps are disabled, they return false when any argument is NaN.

Every pair of floating-point numbers — excluding NaN — exhibits ordered trichotomy: they are related either by flo:=, flo:<, or flo:>.

procedure: flo:safe= flonum1 flonum2
procedure: flo:safe< flonum1 flonum2
procedure: flo:safe<= flonum1 flonum2
procedure: flo:safe> flonum1 flonum2
procedure: flo:safe>= flonum1 flonum2
procedure: flo:safe<> flonum1 flonum2
procedure: flo:unordered? flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates do not raise floating-point exceptions, and simply return false on NaN arguments, except flo:unordered? which returns true iff at least one argument is NaN; in other words, they are “unordered comparisons”.

Every pair of floating-point values — including NaN — exhibits unordered tetrachotomy: they are related either by flo:safe=, flo:safe<, flo:safe>, or flo:unordered?.
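
The tetrachotomy can be turned into a four-way comparison. This is a small sketch using only the predicates documented above; flonum-relation is a hypothetical name introduced for illustration.

```scheme
;; Hypothetical helper: classify the relation between two flonums
;; without raising floating-point exceptions, even for NaN arguments.
(define (flonum-relation x y)
  (cond ((flo:unordered? x y) 'unordered)   ; at least one NaN
        ((flo:safe< x y) 'less)
        ((flo:safe> x y) 'greater)
        (else 'equal)))                     ; only flo:safe= remains

(flonum-relation 1. 2.)        ; ⇒ less
(flonum-relation 0. -0.)       ; ⇒ equal
(flonum-relation 1. +nan.0)    ; ⇒ unordered
```

Testing flo:unordered? first means the remaining branches only ever see ordered pairs.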

procedure: flo:zero? flonum
procedure: flo:positive? flonum
procedure: flo:negative? flonum

Each of these procedures compares its argument to zero. When compiled, they do not check the type of their argument. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”.

(flo:zero? -0.)                ⇒  #t
(flo:negative? -0.)            ⇒  #f
(flo:negative? -1.)            ⇒  #t

(flo:zero? 0.)                 ⇒  #t
(flo:positive? 0.)             ⇒  #f
(flo:positive? 1.)             ⇒  #t

(flo:zero? +nan.123)           ⇒  #f  ; (raises invalid-operation)

procedure: flo:normal? flonum
procedure: flo:subnormal? flonum
procedure: flo:safe-zero? flonum
procedure: flo:infinite? flonum
procedure: flo:nan? flonum

Floating-point classification predicates. For any flonum, exactly one of these predicates returns true. These predicates never raise floating-point exceptions.

(flo:normal? 1.23)             ⇒  #t
(flo:subnormal? 4e-324)        ⇒  #t
(flo:safe-zero? -0.)           ⇒  #t
(flo:infinite? +inf.0)         ⇒  #t
(flo:nan? -nan.123)            ⇒  #t

procedure: flo:finite? flonum

Equivalent to:

(or (flo:safe-zero? flonum)
    (flo:subnormal? flonum)
    (flo:normal? flonum))
; or
(and (not (flo:infinite? flonum))
     (not (flo:nan? flonum)))

True for normal, subnormal, and zero floating-point values; false for infinity and NaN.

procedure: flo:classify flonum

Returns a symbol representing the classification of the flonum, one of normal, subnormal, zero, infinity, or nan.
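
Because flo:classify returns a symbol, it works well with case. The calls below follow from the classification described earlier; describe-flonum is a hypothetical helper added only for illustration.

```scheme
(flo:classify 1.23)       ; ⇒ normal
(flo:classify 4e-324)     ; ⇒ subnormal
(flo:classify -0.)        ; ⇒ zero
(flo:classify -inf.0)     ; ⇒ infinity
(flo:classify +nan.123)   ; ⇒ nan

;; Hypothetical helper: collapse the five classes into three.
(define (describe-flonum x)
  (case (flo:classify x)
    ((normal subnormal zero) 'finite)
    ((infinity) 'infinite)
    (else 'not-a-number)))

(describe-flonum 4e-324)  ; ⇒ finite
```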

procedure: flo:sign-negative? flonum

Returns true if the sign bit of flonum is negative, and false otherwise. Never raises a floating-point exception—not even for signalling NaN.

(flo:sign-negative? +0.)       ⇒  #f
(flo:sign-negative? -0.)       ⇒  #t
(flo:sign-negative? -1.)       ⇒  #t
(flo:sign-negative? +inf.0)    ⇒  #f
(flo:sign-negative? +nan.123)  ⇒  #f

(flo:negative? -0.)            ⇒  #f
(flo:negative? +nan.123)       ⇒  #f  ; (raises invalid-operation)

procedure: flo:+ flonum1 flonum2
procedure: flo:- flonum1 flonum2
procedure: flo:* flonum1 flonum2
procedure: flo:/ flonum1 flonum2

These procedures are the standard arithmetic operations on flonums. When compiled, they do not check the types of their arguments.

procedure: flo:*+ flonum1 flonum2 flonum3
procedure: flo:*- flonum1 flonum2 flonum3
procedure: flo:fma flonum1 flonum2 flonum3
procedure: flo:fast-fma?

Fused multiply-add: (flo:*+ u v a) computes uv+a correctly rounded, with no intermediate overflow or underflow arising from uv. In contrast, (flo:+ (flo:* u v) a) may have two rounding errors, and can overflow or underflow if uv is too large or too small even if uv + a is normal. Flo:fma is an alias for flo:*+ with the more familiar name used in other languages like C.

(flo:*- u v s) computes uv-s correctly rounded, equivalent to (flo:*+ u v (flo:negate s)).

Flo:fast-fma? returns true if the implementation of fused multiply-add is supported by fast hardware, and false if it is emulated using Dekker’s double-precision algorithm in software.

(flo:+ (flo:* 1.2e100 2e208) -1.4e308)
                               ⇒  +inf.0  ; (raises overflow)
(flo:*+ 1.2e100 2e208  -1.4e308)
                               ⇒  1e308

procedure: flo:negate flonum

This procedure returns the negation of its argument. When compiled, it does not check the type of its argument. Never raises a floating-point exception—not even for signalling NaN.

This is not equivalent to (flo:- 0. flonum):

(flo:negate 1.2)               ⇒  -1.2
(flo:negate -nan.123)          ⇒  +nan.123
(flo:negate +inf.0)            ⇒  -inf.0
(flo:negate 0.)                ⇒  -0.
(flo:negate -0.)               ⇒  0.

(flo:- 0. 1.2)                 ⇒  -1.2
(flo:- 0. -nan.123)            ⇒  -nan.123
(flo:- 0. +inf.0)              ⇒  -inf.0
(flo:- 0. 0.)                  ⇒  0.
(flo:- 0. -0.)                 ⇒  0.

procedure: flo:abs flonum
procedure: flo:copysign flonum1 flonum2
procedure: flo:exp flonum
procedure: flo:exp2 flonum
procedure: flo:exp10 flonum
procedure: flo:expm1 flonum
procedure: flo:exp2m1 flonum
procedure: flo:exp10m1 flonum
procedure: flo:log flonum
procedure: flo:log2 flonum
procedure: flo:log10 flonum
procedure: flo:log1p flonum
procedure: flo:logp1 flonum
procedure: flo:log2p1 flonum
procedure: flo:log10p1 flonum
procedure: flo:sin flonum
procedure: flo:cos flonum
procedure: flo:tan flonum
procedure: flo:asin flonum
procedure: flo:acos flonum
procedure: flo:atan flonum
procedure: flo:sin-pi* flonum
procedure: flo:cos-pi* flonum
procedure: flo:tan-pi* flonum
procedure: flo:asin/pi flonum
procedure: flo:acos/pi flonum
procedure: flo:atan/pi flonum
procedure: flo:versin flonum
procedure: flo:exsec flonum
procedure: flo:aversin flonum
procedure: flo:aexsec flonum
procedure: flo:versin-pi* flonum
procedure: flo:exsec-pi* flonum
procedure: flo:aversin/pi flonum
procedure: flo:aexsec/pi flonum
procedure: flo:sinh flonum
procedure: flo:cosh flonum
procedure: flo:tanh flonum
procedure: flo:asinh flonum
procedure: flo:acosh flonum
procedure: flo:atanh flonum
procedure: flo:sqrt flonum
procedure: flo:cbrt flonum
procedure: flo:rsqrt flonum
procedure: flo:sqrt1pm1 flonum
procedure: flo:expt flonum1 flonum2
procedure: flo:compound flonum1 flonum2
procedure: flo:compoundm1 flonum1 flonum2
procedure: flo:erf flonum
procedure: flo:erfc flonum
procedure: flo:hypot flonum1 flonum2
procedure: flo:j0 flonum
procedure: flo:j1 flonum
procedure: flo:jn flonum
procedure: flo:y0 flonum
procedure: flo:y1 flonum
procedure: flo:yn flonum
procedure: flo:gamma flonum
procedure: flo:lgamma flonum
procedure: flo:floor flonum
procedure: flo:ceiling flonum
procedure: flo:truncate flonum
procedure: flo:round flonum
procedure: flo:floor->exact flonum
procedure: flo:ceiling->exact flonum
procedure: flo:truncate->exact flonum
procedure: flo:round->exact flonum

These procedures are flonum versions of the corresponding procedures. When compiled, they do not check the types of their arguments.

procedure: flo:atan2 flonum1 flonum2
procedure: flo:atan2/pi flonum1 flonum2

These are the flonum versions of atan and atan/pi with two arguments. When compiled, they do not check the types of their arguments.
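
Two-argument arctangent is one place where the sign of zero is observable. The following sketch assumes flo:atan2 follows the usual IEEE 754 and C atan2 conventions for signed zeros along the negative real axis, and that flo:copysign takes the magnitude of its first argument and the sign of its second.

```scheme
(flo:= 0. -0.)            ; ⇒ #t, the zeros are numerically equal
(flo:copysign 1. -0.)     ; ⇒ -1., but the sign bit is still there
(flo:atan2 0. -1.)        ; ⇒ 3.141592653589793
(flo:atan2 -0. -1.)       ; ⇒ -3.141592653589793
```

The last two results approach the branch cut from above and from below, respectively.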

procedure: flo:signed-lgamma x

Returns two values,

m = log(|Gamma(x)|)

and

s = sign(Gamma(x)),

respectively a flonum and an exact integer either -1 or 1, so that

Gamma(x) = s * e^m.
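
The two return values can be recombined as the identity above suggests. This is only a sketch: gamma-from-signed-lgamma is a hypothetical helper, and the result overflows for arguments where Gamma(x) itself is too large, which is the situation flo:signed-lgamma is meant to avoid.

```scheme
;; Hypothetical helper: rebuild Gamma(x) = s * e^m from the two values
;; returned by flo:signed-lgamma.
(define (gamma-from-signed-lgamma x)
  (call-with-values
      (lambda () (flo:signed-lgamma x))
    (lambda (m s)
      (* s (exp m)))))

(gamma-from-signed-lgamma 5.)    ; close to 24.
(gamma-from-signed-lgamma -0.5)  ; close to -3.5449, sign taken from s
```
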
procedure: flo:min x1 x2
procedure: flo:max x1 x2

Returns the min or max of two floating-point numbers. -0. is considered less than +0. for the purposes of flo:min and flo:max.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns a quiet NaN. In other words, flo:min and flo:max propagate NaN.

These are the minimum and maximum operations of IEEE 754-2019.

procedure: flo:min-mag x1 x2
procedure: flo:max-mag x1 x2

Returns the argument that has the smallest or largest magnitude, or the min or max if the magnitude is the same.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns a quiet NaN. In other words, flo:min-mag and flo:max-mag propagate NaN.

These are the minimumMagnitude and maximumMagnitude operations of IEEE 754-2019.

procedure: flo:min-num x1 x2
procedure: flo:max-num x1 x2

Returns the min or max of two floating-point numbers. -0. is considered less than +0. for the purposes of flo:min-num and flo:max-num.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns the other one if it is not NaN, or the first argument if they are both NaN. In other words, flo:min-num and flo:max-num treat NaN as missing data and ignore it if possible.

These are the minimumNumber and maximumNumber operations of IEEE 754-2019, formerly called minNum and maxNum in IEEE 754-2008.

procedure: flo:min-mag-num x1 x2
procedure: flo:max-mag-num x1 x2

Returns the argument that has the smallest or largest magnitude, or the min or max if the magnitude is the same.

If either argument is NaN, raises the floating-point invalid-operation exception if it is a signalling NaN, and returns the other one if it is not NaN, or the first argument if they are both NaN. In other words, flo:min-mag-num and flo:max-mag-num treat NaN as missing data and ignore it if possible.

These are the minimumMagnitudeNumber and maximumMagnitudeNumber operations of IEEE 754-2019, formerly called minNumMag and maxNumMag in IEEE 754-2008.
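
The differences among these four families show up with signed zeros, mixed signs, and NaN. The calls below are a sketch of the behavior described above, using only quiet NaNs so that no exception is raised.

```scheme
;; Signed zeros: -0. is treated as less than +0.
(flo:min -0. 0.)                 ; ⇒ -0.
(flo:max -0. 0.)                 ; ⇒ 0.

;; Magnitude variants compare absolute values first.
(flo:min-mag -1. 2.)             ; ⇒ -1.
(flo:max-mag -3. 2.)             ; ⇒ -3.

;; NaN handling: propagate versus treat as missing data.
(flo:nan? (flo:min 1. +nan.0))   ; ⇒ #t
(flo:min-num 1. +nan.0)          ; ⇒ 1.
(flo:max-mag-num +nan.0 -2.)     ; ⇒ -2.
```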

procedure: flo:ldexp x1 x2
procedure: flo:scalbn x1 x2

Flo:ldexp scales by a power of two; flo:scalbn scales by a power of the floating-point radix.

ldexp x e := x * 2^e,
scalbn x e := x * r^e.

In MIT/GNU Scheme, these procedures are the same; they are both provided to make it clearer which operation is meant.
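
A brief sketch of the scaling, assuming the second argument is passed as an exact integer exponent (the text above does not state its type):

```scheme
(flo:ldexp 1. 10)     ; ⇒ 1024.   (1 * 2^10)
(flo:ldexp 3. -1)     ; ⇒ 1.5     (3 * 2^-1)
(flo:scalbn 1. 10)    ; ⇒ 1024.   (same result, since r = 2)
```

Scaling by a power of the radix changes only the exponent, so it is exact unless the result overflows or becomes subnormal.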

procedure: flo:logb x

For nonzero finite x, returns floor(log(|x|)/log(r)) as an exact integer, where r is the floating-point radix.

For all other inputs, raises invalid-operation and returns #f.
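
A few values worked out from the formula above, with r = 2. The last line is a sketch that assumes flo:ldexp accepts an exact integer exponent; it normalizes a value into the range [1, 2).

```scheme
(flo:logb 8.)      ; ⇒ 3    (8 = 2^3)
(flo:logb 1.)      ; ⇒ 0
(flo:logb 0.75)    ; ⇒ -1   (2^-1 <= 0.75 < 2^0)
(flo:logb 10.)     ; ⇒ 3    (2^3 <= 10 < 2^4)

;; Divide out the power of two found by flo:logb.
(flo:ldexp 10. (- (flo:logb 10.)))   ; ⇒ 1.25
```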

procedure: flo:nextafter x1 x2

Returns the next floating-point number after x1 in the direction of x2.

(flo:nextafter 0. -1.)         ⇒  -4.9406564584124654e-324

constant: flo:radix
constant: flo:radix.
constant: flo:precision

Floating-point system parameters. Flo:radix is the floating-point radix as an integer, and flo:precision is the floating-point precision as an integer; flo:radix. is the floating-point radix as a flonum.

constant: flo:error-bound
constant: flo:log-error-bound
constant: flo:ulp-of-one
constant: flo:log-ulp-of-one

Flo:error-bound, sometimes called the machine epsilon, is the maximum relative error of rounding to nearest:

max |x - fl(x)|/|x| = 1/(2 r^(p-1)),

where r is the floating-point radix and p is the floating-point precision.

Flo:ulp-of-one is the distance from 1 to the next larger floating-point number, and is equal to 1/r^{p-1}.

Flo:error-bound is half flo:ulp-of-one.

Flo:log-error-bound is the logarithm of flo:error-bound, and flo:log-ulp-of-one is the logarithm of flo:ulp-of-one.
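
The relationships among these constants can be checked directly. This sketch assumes the default rounding mode, round to nearest with ties to even.

```scheme
;; The error bound is half an ulp of one.
(flo:= flo:error-bound (flo:/ flo:ulp-of-one 2.))        ; ⇒ #t

;; Adding a whole ulp reaches the next flonum above 1.
(flo:= (flo:+ 1. flo:ulp-of-one) (flo:nextafter 1. 2.))  ; ⇒ #t

;; Adding half an ulp is a tie, which rounds back to 1.
(flo:= (flo:+ 1. flo:error-bound) 1.)                    ; ⇒ #t
```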

procedure: flo:ulp flonum

Returns the distance from flonum to the next floating-point number larger in magnitude with the same sign. For zero, this returns the smallest subnormal. For infinities, this returns positive infinity. For NaN, this returns the same NaN.

(flo:ulp 1.)                    ⇒  2.220446049250313e-16
(= (flo:ulp 1.) flo:ulp-of-one) ⇒  #t

constant: flo:normal-exponent-max
constant: flo:normal-exponent-min
constant: flo:subnormal-exponent-min

Largest and smallest integer exponents of the radix in normal and subnormal floating-point numbers.

  • Flo:normal-exponent-max is the largest integer such that (expt flo:radix. flo:normal-exponent-max) does not overflow.
  • Flo:normal-exponent-min is the smallest (most negative) integer such that (expt flo:radix. flo:normal-exponent-min) is a normal floating-point number.
  • Flo:subnormal-exponent-min is the smallest (most negative) integer such that (expt flo:radix. flo:subnormal-exponent-min) is nonzero; the value of that expression is also the smallest positive floating-point number.

constant: flo:largest-positive-normal
constant: flo:smallest-positive-normal
constant: flo:smallest-positive-subnormal

The largest positive normal, smallest positive normal, and smallest positive subnormal floating-point numbers, respectively.
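
These constants sit at the boundaries between classes, which can be seen by stepping one flonum to either side. The results below are what the definitions imply for binary64; treat this as a sketch rather than a specification.

```scheme
;; One step down from +inf.0 is the largest finite flonum.
(flo:= (flo:nextafter +inf.0 0.) flo:largest-positive-normal)   ; ⇒ #t

;; One step up from zero is the smallest subnormal.
(flo:= (flo:nextafter 0. 1.) flo:smallest-positive-subnormal)   ; ⇒ #t

;; Stepping below the smallest normal enters the subnormal range.
(flo:classify flo:smallest-positive-normal)                     ; ⇒ normal
(flo:classify (flo:nextafter flo:smallest-positive-normal 0.))  ; ⇒ subnormal
```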

constant: flo:greatest-normal-exponent-base-e
constant: flo:greatest-normal-exponent-base-2
constant: flo:greatest-normal-exponent-base-10
constant: flo:least-normal-exponent-base-e
constant: flo:least-normal-exponent-base-2
constant: flo:least-normal-exponent-base-10
constant: flo:least-subnormal-exponent-base-e
constant: flo:least-subnormal-exponent-base-2
constant: flo:least-subnormal-exponent-base-10

Least and greatest exponents of normal and subnormal floating-point numbers, as floating-point numbers. For example, flo:greatest-normal-exponent-base-2 is the greatest floating-point number such that (expt 2. flo:greatest-normal-exponent-base-2) does not overflow and is a normal floating-point number.

procedure: flo:total< x1 x2
procedure: flo:total-mag< x1 x2
procedure: flo:total-order x1 x2
procedure: flo:total-order-mag x1 x2

These procedures implement the IEEE 754-2008 total ordering on floating-point values and their magnitudes. Here the “magnitude” of a floating-point value is a floating-point value with positive sign bit and everything else the same; e.g., +nan.123 is the “magnitude” of -nan.123 and 0.0 is the “magnitude” of -0.0.

The total ordering has little to no numerical meaning and should be used only when an arbitrary choice of total ordering is required for some non-numerical reason.

  • Flo:total< returns true if x1 precedes x2.
  • Flo:total-mag< returns true if the magnitude of x1 precedes the magnitude of x2.
  • Flo:total-order returns -1 if x1 precedes x2, 0 if they are the same floating-point value (including sign of zero, or sign and payload of NaN), and +1 if x1 follows x2.
  • Flo:total-order-mag returns -1 if the magnitude of x1 precedes the magnitude of x2, etc.
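
One non-numerical use is sorting a list that contains signed zeros and infinities into a reproducible order. A minimal sketch, assuming the standard MIT/GNU Scheme sort procedure, which takes a sequence and a less-than predicate:

```scheme
;; In the total order, -0. precedes +0. even though they are flo:= .
(sort (list 1. +inf.0 0. -0. -1. -inf.0) flo:total<)
                                 ; ⇒ (-inf.0 -1. -0. 0. 1. +inf.0)

(flo:total-order -0. 0.)         ; ⇒ -1
(flo:total-order 0. 0.)          ; ⇒ 0
(flo:total< 0. -0.)              ; ⇒ #f
(flo:total-mag< -2. 1.)          ; ⇒ #f  (compares magnitudes 2. and 1.)
```
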
procedure: flo:make-nan negative? quiet? payload
procedure: flo:nan-quiet? nan
procedure: flo:nan-payload nan

Flo:make-nan creates a NaN given the sign bit, quiet bit, and payload. Negative? and quiet? must be booleans, and payload must be an unsigned (p-2)-bit integer, where p is the floating-point precision. If quiet? is false, payload must be nonzero.

(flo:sign-negative? (flo:make-nan negative? quiet? payload))
                               ⇒  negative?
(flo:nan-quiet? (flo:make-nan negative? quiet? payload))
                               ⇒  quiet?
(flo:nan-payload (flo:make-nan negative? quiet? payload))
                               ⇒  payload

(flo:make-nan #t #f 42)        ⇒  -snan.42
(flo:sign-negative? +nan.123)  ⇒  #f
(flo:nan-quiet? +nan.123)      ⇒  #t
(flo:nan-payload +nan.123)     ⇒  123
