

4.2 Virtual Memory and Big Data

Small data sets seldom spoil the delights of AWK by causing performance troubles, with or without persistence. As the size of the gawk interpreter’s internal data structures approaches the capacity of physical memory, however, acceptable performance requires understanding modern operating systems and sometimes tuning them. Fortunately pm-gawk offers new degrees of control for performance-conscious users tackling large data sets. A terse mnemonic captures the basic principle: Precluding paging promotes peak performance and prevents perplexity.

Modern operating systems feature virtual memory that strives to appear both larger than installed DRAM (which is small) and faster than installed storage devices (which are slow). As a program’s memory footprint approaches the capacity of DRAM, the virtual memory system transparently pages (moves) the program’s data between DRAM and a swap area on a storage device. Paging can degrade performance mildly or severely, depending on the program’s memory access patterns. Random accesses to large data structures can trigger excessive paging and dramatic slowdown. Unfortunately, the hash tables beneath AWK’s signature associative arrays inherently require random memory accesses, so large associative arrays can be problematic.

Persistence changes the rules in our favor: The OS pages data to pm-gawk’s heap file instead of the swap area. This won’t help performance much if the heap file resides in a conventional storage-backed file system. On Unix-like systems, however, we may place the heap file in a DRAM-backed file system such as /dev/shm/, which entirely prevents paging to slow storage devices. Temporarily placing the heap file in such a file system is a reasonable expedient, with two caveats: First, keep in mind that DRAM-backed file systems perish when the machine reboots or crashes, so you must copy the heap file to a conventional storage-backed file system when your computation is done. Second, pm-gawk’s memory footprint can’t exceed available DRAM if you place the heap file in a DRAM-backed file system.
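The staging described above might be sketched as the following pair of shell functions. This is an illustrative sketch, not part of pm-gawk: the function names, the heap file name heap.pma, and the choice of $HOME as the durable location are all assumptions, and the three variables can be overridden from the environment.

```shell
#!/bin/sh
# Sketch only: all names and locations below are illustrative assumptions.
SHM_DIR="${SHM_DIR:-/dev/shm}"        # DRAM-backed file system
DURABLE_DIR="${DURABLE_DIR:-$HOME}"   # conventional storage-backed file system
HEAP_NAME="${HEAP_NAME:-heap.pma}"    # hypothetical heap file name

# Copy an existing heap file into DRAM before the computation starts.
stage_in() {
    cp -- "$DURABLE_DIR/$HEAP_NAME" "$SHM_DIR/$HEAP_NAME"
}

# Copy the heap file back to durable storage, then remove the DRAM copy.
# The copy is written under a temporary name and renamed afterwards, so an
# interrupted copy does not clobber the previous durable heap file.
stage_out() {
    cp -- "$SHM_DIR/$HEAP_NAME" "$DURABLE_DIR/$HEAP_NAME.tmp" &&
    mv -- "$DURABLE_DIR/$HEAP_NAME.tmp" "$DURABLE_DIR/$HEAP_NAME" &&
    rm -- "$SHM_DIR/$HEAP_NAME"
}
```

A session would then call stage_in, run pm-gawk with the GAWK_PERSIST_FILE environment variable pointing at the copy under /dev/shm/, and call stage_out when the computation is done. Until stage_out completes, the only up-to-date copy of the heap lives in DRAM.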

Tuning OS paging parameters may improve performance if you decide to run pm-gawk with a heap file in a conventional storage-backed file system. Some OSes have unhelpful default habits regarding modified (“dirty”) memory backed by files. For example, the OS may write dirty memory pages to the heap file periodically and/or when the OS believes that “too much” memory is dirty. Such “eager” writeback can degrade performance noticeably and brings no benefit to pm-gawk. Fortunately, some OSes allow paging defaults to be overridden so that writeback is “lazy” rather than eager. For Linux see the discussion of the dirty_* parameters at https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html. Changing these parameters can prevent wasteful eager paging (2):

        $ echo 100    | sudo tee /proc/sys/vm/dirty_background_ratio
        $ echo 100    | sudo tee /proc/sys/vm/dirty_ratio
        $ echo 300000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
        $ echo 50000  | sudo tee /proc/sys/vm/dirty_writeback_centisecs

Tuning paging parameters can help non-persistent gawk as well as pm-gawk. [Disclaimer: OS tuning is an occult art, and your mileage may vary.]
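Because these settings affect the whole machine, it may be prudent to record the current values before overriding them. The sketch below reads the four parameters (which needs no privileges) and prints, without executing, the tee commands that would restore them. The function names and the overridable VM_DIR variable are assumptions made for illustration. If your system configures the dirty_bytes and dirty_background_bytes counterparts rather than the ratios, this sketch would not capture them.

```shell
#!/bin/sh
# Sketch only: VM_DIR is overridable so the functions can be tried on a
# scratch directory; the function names are illustrative.
VM_DIR="${VM_DIR:-/proc/sys/vm}"
PARAMS="dirty_background_ratio dirty_ratio dirty_expire_centisecs dirty_writeback_centisecs"

# Print one "name value" line per parameter.
save_dirty_params() {
    for p in $PARAMS; do
        printf '%s %s\n' "$p" "$(cat "$VM_DIR/$p")"
    done
}

# Read "name value" lines on standard input and print the commands that
# would restore those values. Nothing is executed here.
print_restore_commands() {
    while read -r name value; do
        printf 'echo %s | sudo tee %s/%s\n' "$value" "$VM_DIR" "$name"
    done
}
```

For example, run save_dirty_params with its output redirected to a file before tuning, and later feed that file to print_restore_commands to review the restoring commands before running them by hand.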


Footnotes

(2)

The tee rigmarole is explained at https://askubuntu.com/questions/1098059/which-is-the-right-way-to-drop-caches-in-lubuntu.

