
Our first example uses pm-`gawk` to streamline analysis of a prose
corpus, Mark Twain’s Tom Sawyer and Huckleberry Finn from
https://gutenberg.org/files/74/74-0.txt and
https://gutenberg.org/files/76/76-0.txt.
We first convert non-alphabetic characters to newlines (so each line
has at most one word) and convert to lowercase:

    $ tr -c a-zA-Z '\n' < 74-0.txt | tr A-Z a-z > sawyer.txt
    $ tr -c a-zA-Z '\n' < 76-0.txt | tr A-Z a-z > finn.txt

It’s easy to count word frequencies with AWK’s associative arrays.
pm-`gawk` makes these arrays persistent, so we need not re-ingest the
entire corpus every time we ask a new question (“read once, analyze
happily ever after”):

    $ truncate -s 100M twain.pma
    $ export GAWK_PERSIST_FILE=twain.pma
    $ gawk '{ts[$1]++}' sawyer.txt                    # ingest
    $ gawk 'BEGIN{print ts["work"], ts["play"]}'      # query
    92 11
    $ gawk 'BEGIN{print ts["necktie"], ts["knife"]}'  # query
    2 27

The `truncate` command above creates a heap file large enough
to store all of the data it must eventually contain, with plenty of
room to spare. (As we’ll see in Sparse Heap Files, this isn’t
wasteful.) The `export` command ensures that subsequent `gawk`
invocations activate pm-`gawk`. The first pm-`gawk` command stores
Tom Sawyer’s word frequencies in associative array `ts[]`.
Because this array is persistent, subsequent pm-`gawk` commands can
access it without having to parse the input file again.
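For contrast, here is a minimal sketch of the same kind of query without a heap file. It uses plain `awk` and a made-up three-word input (an assumption for illustration, not the real corpus). Because the array disappears when the process exits, ingestion and query must share a single invocation, and every new question re-reads the input:

```shell
# Hypothetical miniature corpus, one lowercase word per line,
# standing in for sawyer.txt.
out=$(printf 'work\nplay\nwork\n' |
      awk '{ ts[$1]++ } END { print ts["work"], ts["play"] }')
echo "$out"   # prints: 2 1
```

With a persistent heap, the `END` block above becomes a separate `BEGIN`-only command that needs no input file.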

Expanding our analysis to encompass a second book is easy. Let’s
populate a new associative array `hf[]` with the word frequencies
in Huckleberry Finn:

    $ gawk '{hf[$1]++}' finn.txt

Now we can freely intermix accesses to both books’ data conveniently and efficiently, without the overhead and coding fuss of repeated input parsing:

    $ gawk 'BEGIN{print ts["river"], hf["river"]}'
    26 142

By making AWK more interactive, pm-`gawk` invites casual conversations
with data. If we’re curious what words in Finn are absent from
Sawyer, answers (including “flapdoodle,” “yellocution,” and
“sockdolager”) are easy to find:

    $ gawk 'BEGIN{for(w in hf) if (!(w in ts)) print w}'
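The set-difference loop is ordinary AWK and can be tried without a heap file. The sketch below assumes two tiny made-up word lists in temporary files and uses the common `NR == FNR` idiom to tell the first input file from the second. Plain `awk` must parse both files on every run, which is the work the persistent arrays avoid:

```shell
# Hypothetical stand-ins for sawyer.txt and finn.txt.
dir=$(mktemp -d)
printf 'river\nraft\n'       > "$dir/ts.txt"
printf 'river\nflapdoodle\n' > "$dir/hf.txt"

# NR == FNR holds only while reading the first file.
out=$(awk 'NR == FNR { ts[$1]++; next }
           { hf[$1]++ }
           END { for (w in hf) if (!(w in ts)) print w }' \
          "$dir/ts.txt" "$dir/hf.txt")
echo "$out"   # prints: flapdoodle
rm -rf "$dir"
```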

Rumors of Twain’s death may be exaggerated. If he publishes new books in the future, it will be easy to incorporate them into our analysis incrementally. The performance benefits of incremental processing for common AWK chores such as log file analysis are discussed in https://queue.acm.org/detail.cfm?id=3534855 and the companion paper cited therein, and below in Performance.

Exercise: The “Markov” AWK script on page 79 of Kernighan & Pike’s
The Practice of Programming generates random text reminiscent
of a given corpus using a simple statistical modeling technique. This
script consists of a “learning” or “training” phase followed by an
output-generation phase. Use pm-`gawk` to de-couple the two phases and
to allow the statistical model to incrementally ingest additions to
the input corpus.

Our second example considers another domain that plays to AWK’s strengths, data analysis. For simplicity we’ll create two small input files of numeric data.

    $ printf '1\n2\n3\n4\n5\n' > A.dat
    $ printf '5\n6\n7\n8\n9\n' > B.dat

A conventional *non*-persistent AWK script can compute basic
summary statistics:

    $ cat summary_conventional.awk
    1 == NR  { min = max = $1 }
    min > $1 { min = $1 }
    max < $1 { max = $1 }
             { sum += $1 }
    END      { print "min: " min " max: " max " mean: " sum/NR }
    $ gawk -f summary_conventional.awk A.dat B.dat
    min: 1 max: 9 mean: 5

To use pm-`gawk` for the same purpose, we first create a heap file for
our AWK script variables and tell pm-`gawk` where to find it via the
usual environment variable:

    $ truncate -s 10M stats.pma
    $ export GAWK_PERSIST_FILE=stats.pma

pm-`gawk` requires changing the above script to ensure that `min`
and `max` are initialized exactly once, when the heap file is
first used, and *not* every time the script runs. Furthermore,
whereas script-defined variables such as `min` retain their
values across pm-`gawk` executions, built-in AWK variables such as
`NR` are reset to zero every time pm-`gawk` runs, so we can’t use
them in the same way. Here’s a modified script for pm-`gawk`:

    $ cat summary_persistent.awk
    ! init   { min = max = $1; init = 1 }
    min > $1 { min = $1 }
    max < $1 { max = $1 }
             { sum += $1; ++n }
    END      { print "min: " min " max: " max " mean: " sum/n }

Note the different pattern on the first line and the introduction of
`n` to supplant `NR`. When used with pm-`gawk`, this new
initialization logic supports the same kind of cumulative processing
that we saw in the text-analysis scenario. For example, we can ingest
our input files separately:

    $ gawk -f summary_persistent.awk A.dat
    min: 1 max: 5 mean: 3
    $ gawk -f summary_persistent.awk B.dat
    min: 1 max: 9 mean: 5

As expected, after the second pm-`gawk` invocation consumes the
second input file, the output matches that of the non-persistent
script that read both files at once.
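The modified script remains valid ordinary AWK. As a sanity check, the sketch below feeds the same ten values to plain `awk` in a single run with no heap file. Supplying the data on standard input rather than from `A.dat` and `B.dat` is an assumption made to keep the example self-contained. The `! init` pattern fires only on the first record, and `n` counts records just as `NR` would:

```shell
# Same values as A.dat followed by B.dat, in one pass.
out=$(printf '1\n2\n3\n4\n5\n5\n6\n7\n8\n9\n' | awk '
! init   { min = max = $1; init = 1 }   # runs once; init stays set afterward
min > $1 { min = $1 }
max < $1 { max = $1 }
         { sum += $1; ++n }             # n replaces NR as the running count
END      { print "min: " min " max: " max " mean: " sum/n }')
echo "$out"   # prints: min: 1 max: 9 mean: 5
```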

Exercise: Amend the AWK scripts above to compute the median and
mode(s) using both conventional `gawk` and pm-`gawk`. (The median is the
number in the middle of a sorted list; if the length of the list is
even, average the two numbers at the middle. The modes are the values
that occur most frequently.)

Our third and final set of examples shows that pm-`gawk` allows us to
bundle both script-defined data and also user-defined *functions*
in a persistent heap that may be passed freely between unrelated AWK
scripts.

The following shell transcript repeatedly invokes pm-`gawk` to create and
then employ a user-defined function. These separate invocations
involve several different AWK scripts that communicate via the heap
file. Each invocation can add user-defined functions and add or
remove data from the heap that subsequent invocations will access.

    $ truncate -s 10M funcs.pma
    $ export GAWK_PERSIST_FILE=funcs.pma
    $ gawk 'function count(A,t) {for(i in A)t++; return ""==t?0:t}'
    $ gawk 'BEGIN { a["x"] = 4; a["y"] = 5; a["z"] = 6 }'
    $ gawk 'BEGIN { print count(a) }'
    3
    $ gawk 'BEGIN { delete a["x"] }'
    $ gawk 'BEGIN { print count(a) }'
    2
    $ gawk 'BEGIN { delete a }'
    $ gawk 'BEGIN { print count(a) }'
    0
    $ gawk 'BEGIN { for (i=0; i<47; i++) a[i]=i }'
    $ gawk 'BEGIN { print count(a) }'
    47

The first pm-`gawk` command creates user-defined function `count()`,
which returns the number of entries in a given associative array; note
that variable `t` is local to `count()`, not global. The
next pm-`gawk` command populates a persistent associative array with
three entries; not surprisingly, the `count()` call in the
following pm-`gawk` command finds these three entries. The next two
pm-`gawk` commands respectively delete an array entry and print the
reduced count, 2. The two commands after that delete the entire array
and print a count of zero. Finally, the last two pm-`gawk` commands
populate the array with 47 entries and count them.
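The `""==t?0:t` expression in `count()` deserves a note. The extra parameter `t` starts out uninitialized on every call, and if the array is empty it is never incremented; returning it directly would then print an empty string rather than 0. The sketch below exercises the same function in a single plain `awk` process with no heap file, using a made-up two-entry array:

```shell
out=$(awk '
# t is an extra parameter, so it is local and starts uninitialized.
function count(A, t) { for (i in A) t++; return "" == t ? 0 : t }
BEGIN {
    a["x"] = 4; a["y"] = 5
    print count(a)      # two entries
    delete a["x"]
    print count(a)      # one entry left
    delete a["y"]
    print count(a)      # empty array: t was never incremented
}')
echo "$out"   # prints 2, 1 and 0 on separate lines
```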

The following shell script invokes pm-`gawk` repeatedly to create a
collection of user-defined functions that perform basic operations on
quadratic polynomials: evaluation at a given point, computing the
discriminant, and using the quadratic formula to find the roots. It
then factorizes *x^2 + x - 12* into *(x - 3)(x + 4)*.

    #!/bin/sh
    rm -f poly.pma
    truncate -s 10M poly.pma
    export GAWK_PERSIST_FILE=poly.pma
    gawk 'function q(x) { return a*x^2 + b*x + c }'
    gawk 'function p(x) { return "q(" x ") = " q(x) }'
    gawk 'BEGIN { print p(2) }'                        # evaluate & print
    gawk 'BEGIN{ a = 1; b = 1; c = -12 }'              # new coefficients
    gawk 'BEGIN { print p(2) }'                        # eval/print again
    gawk 'function d(s) { return s * sqrt(b^2 - 4*a*c)}'
    gawk 'BEGIN{ print "discriminant (must be >=0): " d(1)}'
    gawk 'function r(s) { return (-b + d(s))/(2*a)}'
    gawk 'BEGIN{ print "root: " r( 1) " " p(r( 1)) }'
    gawk 'BEGIN{ print "root: " r(-1) " " p(r(-1)) }'
    gawk 'function abs(n) { return n >= 0 ? n : -n }'
    gawk 'function sgn(x) { return x >= 0 ? "- " : "+ " } '
    gawk 'function f(s) { return "(x " sgn(r(s)) abs(r(s))}'
    gawk 'BEGIN{ print "factor: " f( 1) ")" }'
    gawk 'BEGIN{ print "factor: " f(-1) ")" }'
    rm -f poly.pma
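To check the arithmetic independently of the heap file, here is a condensed sketch that defines comparable functions in one plain `awk` process. It folds the signed square root into `r()` instead of keeping a separate `d()`, which is a simplification for illustration and not the script's own layout. With a = 1, b = 1, c = -12, the quantity b^2 - 4ac is 49 and its square root is 7, so the roots are 3 and -4, and the polynomial evaluates to zero at both:

```shell
out=$(awk '
function q(x) { return a*x^2 + b*x + c }
# s is +1 or -1 and selects which root the quadratic formula yields.
function r(s) { return (-b + s * sqrt(b^2 - 4*a*c)) / (2*a) }
BEGIN {
    a = 1; b = 1; c = -12
    print r(1), r(-1), q(r(1)), q(r(-1))
}')
echo "$out"   # prints: 3 -4 0 0
```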
