Bitwise Functions (The GNU Awk User’s Guide)

Next: Getting Type Information, Previous: Time Functions, Up: Built-in Functions [Contents][Index]

9.1.7 Bit-Manipulation Functions ¶

I can explain it for you, but I can’t understand it for you.

— Anonymous

Many languages provide the ability to perform bitwise operations on two integer numbers. In other words, the operation is performed on each successive pair of bits in the operands. Three common operations are bitwise AND, OR, and XOR. The operations are described in Table 9.6.

                Bit operator
          |  AND  |   OR  |  XOR
          |---+---+---+---+---+---
Operands  | 0 | 1 | 0 | 1 | 0 | 1
----------+---+---+---+---+---+---
    0     | 0   0 | 0   1 | 0   1
    1     | 0   1 | 1   1 | 1   0

Table 9.6: Bitwise operations

As you can see, the result of an AND operation is 1 only when both bits are 1. The result of an OR operation is 1 if either bit is 1. The result of an XOR operation is 1 if either bit is 1, but not both. The next operation is the complement; the complement of 1 is 0 and the complement of 0 is 1. Thus, this operation “flips” all the bits of a given value.

Finally, two other common operations are to shift the bits left or right. For example, if you have a bit string ‘10111001’ and you shift it right by three bits, you end up with ‘00010111’.⁶² If you start over again with ‘10111001’ and shift it left by three bits, you end up with ‘11001000’. The following list describes gawk’s built-in functions that implement the bitwise operations. Optional parameters are enclosed in square brackets ([ ]):

and(v1, v2 [, …]): Return the bitwise AND of the arguments. There must be at least two.
compl(val): Return the bitwise complement of val.
lshift(val, count) ¶: Return the value of val, shifted left by count bits.
or(v1, v2 [, …]): Return the bitwise OR of the arguments. There must be at least two.
rshift(val, count) ¶: Return the value of val, shifted right by count bits.
xor(v1, v2 [, …]): Return the bitwise XOR of the arguments. There must be at least two.

CAUTION: Beginning with gawk version 4.2, negative operands are not allowed for any of these functions. A negative operand produces a fatal error. See the sidebar “Beware The Smoke and Mirrors!” for more information as to why.

Here is a user-defined function (see User-Defined Functions) that illustrates the use of these functions:

# bits2str --- turn an integer into readable ones and zeros

function bits2str(bits,        data, mask)
{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
}


BEGIN {
    printf "123 = %s\n", bits2str(123)
    printf "0123 = %s\n", bits2str(0123)
    printf "0x99 = %s\n", bits2str(0x99)
    comp = compl(0x99)
    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
    shift = lshift(0x99, 2)
    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
    shift = rshift(0x99, 2)
    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
}

This program produces the following output when run:

$ gawk -f testbits.awk
-| 123 = 01111011
-| 0123 = 01010011
-| 0x99 = 10011001
-| compl(0x99) = 0x3fffffffffff66 =
-| 00111111111111111111111111111111111111111111111101100110
-| lshift(0x99, 2) = 0x264 = 0000001001100100
-| rshift(0x99, 2) = 0x26 = 00100110

The bits2str() function turns a binary number into a string. Initializing mask to one creates a binary value where the rightmost bit is set to one. Using this mask, the function repeatedly checks the rightmost bit. ANDing the mask with the value indicates whether the rightmost bit is one or not. If so, a "1" is concatenated onto the front of the string. Otherwise, a "0" is added. The value is then shifted right by one bit and the loop continues until there are no more one bits.

If the initial value is zero, it returns a simple "0". Otherwise, at the end, it pads the value with zeros to represent multiples of 8-bit quantities. This is typical in modern computers.

The main code in the BEGIN rule shows the difference between the decimal and octal values for the same numbers (see Octal and Hexadecimal Numbers), and then demonstrates the results of the compl(), lshift(), and rshift() functions.

Beware The Smoke and Mirrors!

Beware The Smoke and Mirrors!
It other languages, bitwise operations are performed on integer values, not floating-point values. As a general statement, such operations work best when performed on unsigned integers. `gawk` attempts to treat the arguments to the bitwise functions as unsigned integers. For this reason, negative arguments produce a fatal error. In normal operation, for all of these functions, first the double-precision floating-point value is converted to the widest C unsigned integer type, then the bitwise operation is performed. If the result cannot be represented exactly as a C `double`, leading nonzero bits are removed one by one until it can be represented exactly. The result is then converted back into a C `double`.⁶³ However, when using arbitrary precision arithmetic with the `-M` option (see Arithmetic and Arbitrary-Precision Arithmetic with `gawk`), the results may differ. This is particularly noticeable with the `compl()` function: $ `gawk 'BEGIN { print compl(42) }'` -\| 9007199254740949 $ `gawk -M 'BEGIN { print compl(42) }'` -\| -43 What’s going on becomes clear when printing the results in hexadecimal: $ `gawk 'BEGIN { printf "%#x\n", compl(42) }'` -\| 0x1fffffffffffd5 $ `gawk -M 'BEGIN { printf "%#x\n", compl(42) }'` -\| 0xffffffffffffffd5 When using the `-M` option, under the hood, `gawk` uses GNU MP arbitrary precision integers which have at least 64 bits of precision. When not using `-M`, `gawk` stores integral values in regular double-precision floating point, which only maintain 53 bits of precision. Furthermore, the GNU MP library treats (or at least seems to treat) the leading bit as a sign bit; thus the result with `-M` in this case is a negative number. In short, using `gawk` for any but the simplest kind of bitwise operations is probably a bad idea; caveat emptor!

It other languages, bitwise operations are performed on integer values, not floating-point values. As a general statement, such operations work best when performed on unsigned integers.

gawk attempts to treat the arguments to the bitwise functions as unsigned integers. For this reason, negative arguments produce a fatal error.

In normal operation, for all of these functions, first the double-precision floating-point value is converted to the widest C unsigned integer type, then the bitwise operation is performed. If the result cannot be represented exactly as a C double, leading nonzero bits are removed one by one until it can be represented exactly. The result is then converted back into a C double.⁶³

However, when using arbitrary precision arithmetic with the -M option (see Arithmetic and Arbitrary-Precision Arithmetic with gawk), the results may differ. This is particularly noticeable with the compl() function:

$ gawk 'BEGIN { print compl(42) }'
-| 9007199254740949
$ gawk -M 'BEGIN { print compl(42) }'
-| -43

What’s going on becomes clear when printing the results in hexadecimal:

$ gawk 'BEGIN { printf "%#x\n", compl(42) }'
-| 0x1fffffffffffd5
$ gawk -M 'BEGIN { printf "%#x\n", compl(42) }'
-| 0xffffffffffffffd5

When using the -M option, under the hood, gawk uses GNU MP arbitrary precision integers which have at least 64 bits of precision. When not using -M, gawk stores integral values in regular double-precision floating point, which only maintain 53 bits of precision. Furthermore, the GNU MP library treats (or at least seems to treat) the leading bit as a sign bit; thus the result with -M in this case is a negative number.

In short, using gawk for any but the simplest kind of bitwise operations is probably a bad idea; caveat emptor!

Footnotes

(62)

This example shows that zeros come in on the left side. For gawk, this is always true, but in some languages, it’s possible to have the left side fill with ones.

(63)

If you don’t understand this paragraph, the upshot is that gawk can only store a particular range of integer values; numbers outside that range are reduced to fit within the range.