12.3.2 Sorting Array Values and Indices with gawk

In most awk implementations, sorting an array requires writing a sort() function. This can be educational for exploring different sorting algorithms, but usually that’s not the point of the program. gawk provides the built-in asort() and asorti() functions (see String-Manipulation Functions) for sorting arrays. For example:

populate the array data
n = asort(data)
for (i = 1; i <= n; i++)
    do something with data[i]

After the call to asort(), the array data is indexed from 1 to some number n, the total number of elements in data. (This count is asort()’s return value.) data[1] <= data[2] <= data[3], and so on. The default comparison is based on the type of the elements (see Variable Typing and Comparison Expressions). All numeric values come before all string values, which in turn come before all subarrays.

An important side effect of calling asort() is that the array’s original indices are irrevocably lost. As this isn’t always desirable, asort() accepts a second argument:

populate the array source
n = asort(source, dest)
for (i = 1; i <= n; i++)
    do something with dest[i]

In this case, gawk copies the source array into the dest array and then sorts dest, destroying its indices. However, the source array is not affected.

Often, what’s needed is to sort on the values of the indices instead of the values of the elements. To do that, use the asorti() function. The interface and behavior are identical to that of asort(), except that the index values are used for sorting and become the values of the result array:

{ source[$0] = some_func($0) }

END {
    n = asorti(source, dest)
    for (i = 1; i <= n; i++) {
        Work with sorted indices directly:
        do something with dest[i]
        ...
        Access original array via sorted indices:
        do something with source[dest[i]]
    }
}

So far, so good. Now it starts to get interesting. Both asort() and asorti() accept a third string argument to control comparison of array elements. When we introduced asort() and asorti() in String-Manipulation Functions, we ignored this third argument; however, now is the time to describe how this argument affects these two functions.

Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with PROCINFO["sorted_in"], this argument may be one of the predefined names that gawk provides (see Using Predefined Array Scanning Orders with gawk), or it may be the name of a user-defined function (see Controlling Array Traversal).

In the latter case, the function can compare elements in any way it chooses, taking into account just the indices, just the values, or both. This is extremely powerful.

Once the array is sorted, asort() takes the values in their final order and uses them to fill in the result array, whereas asorti() takes the indices in their final order and uses them to fill in the result array.

NOTE: Copying array indices and elements isn’t expensive in terms of memory. Internally, gawk maintains reference counts to data. For example, when asort() copies the first array to the second one, there is only one copy of the original array elements’ data, even though both arrays use the values.

You may use the same array for both the first and second arguments to asort() and asorti(). Doing so only makes sense if you are also supplying the third argument, since awk doesn’t provide a way to pass that third argument without also passing the first and second ones.

Because IGNORECASE affects string comparisons, the value of IGNORECASE also affects sorting for both asort() and asorti(). Note also that the locale’s sorting order does not come into play; comparisons are based on character values only.86

The following example demonstrates the use of a comparison function with asort(). The comparison function, case_fold_compare(), maps both values to lowercase in order to compare them ignoring case.

# case_fold_compare --- compare as strings, ignoring case

function case_fold_compare(i1, v1, i2, v2,    l, r)
{
    l = tolower(v1)
    r = tolower(v2)

    if (l < r)
        return -1
    else if (l == r)
        return 0
    else
        return 1
}

And here is the test program for it:

# Test program

BEGIN {
    Letters = "abcdefghijklmnopqrstuvwxyz" \
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    split(Letters, data, "")

    asort(data, result, "case_fold_compare")

    j = length(result)
    for (i = 1; i <= j; i++) {
        printf("%s", result[i])
        if (i % (j/2) == 0)
            printf("\n")
        else
            printf(" ")
    }
}

When run, we get the following:

$ gawk -f case_fold_compare.awk
-| A a B b c C D d e E F f g G H h i I J j k K l L M m
-| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z

NOTE: “Under the hood,” gawk uses the C library qsort() function to manage the sorting. qsort() can call itself recursively. This means that when you write a comparison function, you should be careful to avoid the use of global variables and arrays; use only local variables and arrays that you declare as additional parameters to the comparison function. Otherwise, you are likely to cause unintentional memory corruption in your global arrays and possibly cause gawk itself to fail.


Footnotes

(86)

This is true because locale-based comparison occurs only when in POSIX-compatibility mode, and because asort() and asorti() are gawk extensions, they are not available in that case.