Constant-Time Array Access (Persistent-Memory gawk User Manual)

4.1 Constant-Time Array Access

pm-gawk preserves the efficiency of data access when data structures are created by one process and later re-used by a different process.

Consider the associative arrays used to analyze Mark Twain’s books in Examples. We created arrays ts[] and hf[] by reading files sawyer.txt and finn.txt. If N denotes the total volume of data in these files, building the associative arrays typically requires time proportional to N, or “O(N) expected time” in the lingo of asymptotic analysis. If W is the number of unique words in the input files, the size of the associative arrays will be proportional to W, or O(W). Accessing individual array elements requires only constant or O(1) expected time, not O(N) or O(W) time, because gawk implements arrays as hash tables.

The performance advantage of pm-gawk arises when different processes create and access associative arrays. Accessing an element of a persistent array created by a previous pm-gawk process, as we did earlier in BEGIN{print ts["river"], hf["river"]}, still requires only O(1) time, which is asymptotically far superior to the alternatives. Naïvely reconstructing arrays by re-ingesting all raw inputs in every gawk process that accesses the arrays would of course require O(N) time—a profligate cost if the text corpus is large. Dumping arrays to files and re-loading them as needed would reduce the preparation time for access to O(W). That can be a substantial improvement in practice; N is roughly 19 times larger than W in our Twain corpus. Nonetheless O(W) remains far slower than pm-gawk’s O(1). As we’ll see in Results, the difference is not merely theoretical.

The persistent memory implementation beneath pm-gawk enables it to avoid work proportional to N or W when accessing an element of a persistent associative array. Under the hood, pm-gawk stores script-defined AWK variables such as associative arrays in a persistent heap laid out in a memory-mapped file (the heap file). When an AWK script accesses an element of an associative array, pm-gawk performs a lookup on the corresponding hash table, which in turn accesses memory on the persistent heap. Modern operating systems implement memory-mapped files in such a way that these memory accesses trigger the bare minimum of data movement required: Only those parts of the heap file containing needed data are “paged in” to the memory of the pm-gawk process. In the worst case, the heap file is not in the file system’s in-memory cache, so the required pages must be faulted into memory from storage. Our asymptotic analysis of efficiency applies regardless of whether the heap file is cached or not. The entire heap file is not accessed merely to access an element of a persistent associative array.

Persistent memory thus enables pm-gawk to offer the flexibility of de-coupling data ingestion from analytic queries without the fuss and overhead of serializing and loading data structures and without sacrificing constant-time access to the associative arrays that make AWK scripting convenient and productive.