Persistent Memory (The GNU Awk User’s Guide)


12.7 Preserving Data Between Runs

Starting with version 5.2, gawk supports persistent memory. This experimental feature stores the values of all of gawk’s variables, arrays and user-defined functions in a persistent heap, which resides in a file in the filesystem. When persistent memory is not in use (the normal case), gawk’s data resides in ephemeral system memory.

Persistent memory is enabled on certain 64-bit systems supporting the mmap() and munmap() system calls. gawk must be compiled as a non-PIE (Position Independent Executable) binary, since the persistent store ends up holding pointers to functions held within the gawk executable. This also means that to use the persistent memory, you must use the same gawk executable from run to run.

You can see if your version of gawk supports persistent memory like so:

$ gawk --version
-| GNU Awk 5.2.2, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.1.0, GNU MP 6.2.1)
-| Copyright (C) 1989, 1991-2023 Free Software Foundation.
...

If you see ‘PMA’ followed by a version indicator, then persistent memory is supported.
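If a script needs to make this check for itself, something like the following may serve as a starting point. It is only a sketch: it assumes that ‘PMA’ appears in the first line of the version banner in the same form as in the output above, and the function name supports_pma is hypothetical.

```shell
# supports_pma BANNER
# Succeed if a `gawk --version` banner line mentions PMA.
# Assumption: the banner has the ", PMA <version>," form shown above.
supports_pma() {
    case "$1" in
        *", PMA "*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Read the first banner line; it is empty if gawk is not installed.
banner=$(gawk --version 2>/dev/null | head -n 1)

if supports_pma "$banner"; then
    echo "persistent memory: supported"
else
    echo "persistent memory: not supported"
fi
```

Keeping the string test in a function separate from the call to gawk makes it possible to try the check against a saved banner from another machine.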

As of this writing, persistent memory has only been tested on GNU/Linux, Cygwin, Solaris 2.11, Intel architecture macOS systems, FreeBSD 13.1 and OpenBSD 7.1. On all others, persistent memory is disabled by default. You can force it to be enabled by exporting the shell variable REALLY_USE_PERSIST_MALLOC with a nonempty value before running configure (see Compiling gawk for Unix-Like Systems). If you do so and all the tests pass, please let the maintainer know.
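On an untested platform, the override might be applied as sketched below. This assumes that you run the commands from the top of an unpacked gawk source tree and that ‘make check’ runs the test suite there; the value 1 is arbitrary, since any nonempty value should do.

```shell
# Force-enable persistent memory on a platform where it is
# disabled by default. Any nonempty value works; 1 is arbitrary.
export REALLY_USE_PERSIST_MALLOC=1

# Assumption: the current directory is the top of the gawk source tree.
if [ -x ./configure ]; then
    ./configure && make && make check
else
    echo "run this from the top of the gawk source tree" >&2
fi
```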

To use persistent memory, follow these steps:

Create a new, empty sparse file of the desired size, for example, four gigabytes. On a GNU/Linux system, you can use the truncate utility:
```
$ truncate -s 4G data.pma
```
It is recommended (but not required) to change the permissions on the file so that only the owner can read and write it:
```
$ chmod 0600 data.pma
```
Provide the path to the data file in the GAWK_PERSIST_FILE environment variable. This is best done by placing the value in the environment just for the run of gawk, like so:
```
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
1
```
Use the same data file in subsequent runs to use the preserved data values:
```
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
2
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
3
```
As shown, in subsequent runs using the same data file, the values of gawk’s variables are preserved. However, gawk’s special variables, such as NR, are reset upon each run. Only the variables defined by the program are preserved across runs.

Interestingly, the program that you execute need not be the same from run to run; the persistent store only maintains the values of variables, arrays, and user-defined functions, not the totality of gawk’s internal state. This lets you share data between unrelated programs, eliminating the need for scripts to communicate via text files.
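As an illustration of two unrelated programs sharing data, consider the following sketch. The backing file name shared.pma and the array name count are made up for this example, and the commands are skipped unless gawk reports PMA support and you are not running as root.

```shell
# Hypothetical backing file name; any path will do.
store=shared.pma

# Assumption: a PMA-enabled gawk, a non-root user, and a
# truncate utility that supports the -s option.
if gawk --version 2>/dev/null | head -n 1 | grep -q 'PMA' &&
   [ "$(id -u)" -ne 0 ]; then
    truncate -s 4G "$store"
    chmod 0600 "$store"

    # Program 1: record some counts in a global array.
    GAWK_PERSIST_FILE=$store gawk 'BEGIN { count["apple"] = 3; count["pear"] = 5 }'

    # Program 2: a different program reads the same array and sums it.
    GAWK_PERSIST_FILE=$store gawk 'BEGIN { for (k in count) total += count[k]; print total }'
fi
```

If everything works as described above, the second program should print the sum of the counts stored by the first. Note that total and k are ordinary program variables too, so they are also preserved in the backing file for later runs.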

Terence Kelly, the author of the persistent memory allocator gawk uses, provides the following advice about the backing file:

Regarding backing file size, I recommend making it far larger than all of the data that will ever reside in it, assuming that the file system supports sparse files. The “pay only for what you use” aspect of sparse files ensures that the actual storage resource footprint of the backing file will meet the application’s needs but will be as small as possible. If the file system does not support sparse files, there’s a dilemma: Making the backing file too large is wasteful, but making it too small risks memory exhaustion, i.e., pma_malloc() returns NULL. But persistent gawk should still work even without sparse files.

You can disable the use of the persistent memory allocator in gawk with the --disable-pma option to the configure command at the time that you build gawk (see Compiling and Installing gawk on Unix-Like Systems).

You can set the PMA_VERBOSITY environment variable to a value between zero and three to control how much debugging and error information the persistent memory allocator prints. gawk sets the default to one. See the support/pma.c source code to learn what the different verbosity levels mean.

There are a few constraints on the use of persistent memory:

If you use MPFR mode (the -M option) on the first run of a program using persistent memory, you must continue to use it on all subsequent runs. Similarly, if you don’t use -M on the first run, do not use it on any subsequent runs.
Mixing and matching MPFR mode and regular mode with the same backing file is not allowed. gawk detects such a situation and issues a fatal error message.
The GNU/Linux CIFS filesystem is known not to work well with the PMA allocator. Don’t use a backing file on a CIFS filesystem.
If gawk is run by the root user, then persistent memory is not allowed. This is to avoid the possibility of private data “leaking” into the backing file and being recovered later by an attacker.
Over time, the backing file will be filled with memory “leaked” by gawk as it runs. Most notably this is the memory used to compile your program into an internal form before running it, which happens each time, but there are other leakages as well. (For an extreme example of this, see this thread in the “bug-gawk at gnu.org” mailing list archives.) It is up to you to use ‘du -sh pmafile’ occasionally to monitor how full the file is, and arrange to dump any data you may need before the backing file becomes full.
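The monitoring suggested in the last item can be scripted. The following is a minimal sketch, not a definitive tool: the function name pma_usage is hypothetical, and it assumes that du reports the space actually allocated to a sparse file, as it does on GNU/Linux.

```shell
# pma_usage FILE
# Print two numbers, both in kilobytes: the disk space actually
# allocated to FILE, and the apparent size of FILE. For a sparse
# backing file, the first number grows as the persistent heap fills.
pma_usage() {
    used_kb=$(du -k "$1" | cut -f1)
    size_kb=$(( $(wc -c < "$1") / 1024 ))
    echo "$used_kb $size_kb"
}

# Report on the backing file from the earlier steps, if it exists.
if [ -f data.pma ]; then
    set -- $(pma_usage data.pma)
    echo "data.pma: $1 KB allocated of $2 KB"
fi
```

When the first number approaches the second, it is time to save whatever data you still need and start over with a fresh backing file.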

Terence Kelly has provided a separate Persistent-Memory gawk User Manual document, which is included in the gawk distribution. It is worth reading.

Here are additional articles and web links that provide more information about persistent memory and why it’s useful in a scripting language like gawk.

https://web.eecs.umich.edu/~tpkelly/pma/: This is the canonical source for Terence Kelly’s Persistent Memory Allocator (PMA). The latest source code and user manual will always be available at this location. Kelly may be reached directly at any of the following email addresses: “tpkelly AT acm.org”, “tpkelly AT cs.princeton.edu”, or “tpkelly AT eecs.umich.edu”.
Persistent Memory Allocation: Terence Kelly, Zi Fan Tan, Jianan Li, and Haris Volos, ACM Queue magazine, Vol. 20 No. 2 (March/April 2022), PDF, HTML. This paper explains the design of the PMA allocator used in persistent gawk.
Persistent Scripting: Zi Fan Tan, Jianan Li, Haris Volos, and Terence Kelly, Non-Volatile Memory Workshop (NVMW) 2022, http://nvmw.ucsd.edu/program/. This paper motivates and describes a research prototype of persistent gawk and presents performance evaluations on Intel Optane non-volatile memory; note that the interface differs slightly.
Persistent Memory Programming on Conventional Hardware: Terence Kelly, ACM Queue magazine Vol. 17 No. 4 (July/Aug 2019), PDF, HTML. This paper describes simple techniques for persistent memory for C/C++ code on conventional computers that lack non-volatile memory hardware.
Is Persistent Memory Persistent?: Terence Kelly, ACM Queue magazine Vol. 18 No. 2 (March/April 2020), PDF, HTML. This paper describes a simple and robust testbed for testing software against real power failures.
Crashproofing the Original NoSQL Key/Value Store: Terence Kelly, ACM Queue magazine Vol. 19 No. 4 (July/Aug 2021), PDF, HTML. This paper describes a crash-tolerance feature added to GNU DBM (gdbm).

When Terence Kelly published his papers, his collaborators produced a prototype integration of PMA with gawk. That version used a (mandatory!) option --persist=file to specify the file for storing the persistent heap. If you give this option to the current gawk, it produces a fatal error message instructing you to use the GAWK_PERSIST_FILE environment variable instead. Apart from this paragraph, the option is undocumented.

The prototype only supported persistent data; it did not support persistent functions.

As noted earlier, support for persistent memory is experimental. If it becomes burdensome,⁸⁹ then the feature will be removed.

Footnotes

(89)

Meaning, there are too many bug reports, or too many strange differences in behavior from when gawk is run normally.