Previous: , Up: Array Sorting   [Contents][Index]


12.2.2 Sorting Array Values and Indices with gawk

In most awk implementations, sorting an array requires writing a sort() function. While this can be educational for exploring different sorting algorithms, usually that’s not the point of the program. gawk provides the built-in asort() and asorti() functions (see String Functions) for sorting arrays. For example:

populate the array data
n = asort(data)
for (i = 1; i <= n; i++)
    do something with data[i]

After the call to asort(), the array data is indexed from 1 to some number n, the total number of elements in data. (This count is asort()’s return value.) data[1] <= data[2] <= data[3], and so on. The default comparison is based on the type of the elements (see Typing and Comparison). All numeric values come before all string values, which in turn come before all subarrays.

An important side effect of calling asort() is that the array’s original indices are irrevocably lost. As this isn’t always desirable, asort() accepts a second argument:

populate the array source
n = asort(source, dest)
for (i = 1; i <= n; i++)
    do something with dest[i]

In this case, gawk copies the source array into the dest array and then sorts dest, destroying its indices. However, the source array is not affected.

Often, what’s needed is to sort on the values of the indices instead of the values of the elements. To do that, use the asorti() function. The interface and behavior are identical to that of asort(), except that the index values are used for sorting, and become the values of the result array:

{ source[$0] = some_func($0) }

END {
    n = asorti(source, dest)
    for (i = 1; i <= n; i++) {
        Work with sorted indices directly:
        do something with dest[i]
        …
        Access original array via sorted indices:
        do something with source[dest[i]]
    }
}

So far, so good. Now it starts to get interesting. Both asort() and asorti() accept a third string argument to control comparison of array elements. In String Functions, we ignored this third argument; however, the time has now come to describe how this argument affects these two functions.

Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with PROCINFO["sorted_in"], this argument may be one of the predefined names that gawk provides (see Controlling Scanning), or it may be the name of a user-defined function (see Controlling Array Traversal).

In the latter case, the function can compare elements in any way it chooses, taking into account just the indices, just the values, or both. This is extremely powerful.

Once the array is sorted, asort() takes the values in their final order, and uses them to fill in the result array, whereas asorti() takes the indices in their final order, and uses them to fill in the result array.

NOTE: Copying array indices and elements isn’t expensive in terms of memory. Internally, gawk maintains reference counts to data. For example, when asort() copies the first array to the second one, there is only one copy of the original array elements’ data, even though both arrays use the values.

Because IGNORECASE affects string comparisons, the value of IGNORECASE also affects sorting for both asort() and asorti(). Note also that the locale’s sorting order does not come into play; comparisons are based on character values only.80 Caveat Emptor.


Footnotes

(80)

This is true because locale-based comparison occurs only when in POSIX compatibility mode, and since asort() and asorti() are gawk extensions, they are not available in that case.


Previous: , Up: Array Sorting   [Contents][Index]