

4.3 Sparse Heap Files

To be frugal with storage resources, pm-gawk’s heap file should be created as a sparse file: a file whose logical size is larger than its storage resource footprint. Modern file systems support sparse files, which are easy to create using the truncate tool shown in our examples.

Let’s first create a conventional non-sparse file using echo:

        $ echo hi > dense
        $ ls -l dense
        -rw-rw-r--. 1 me me 3 Aug  5 23:08 dense
        $ du -h dense
        4.0K    dense

The ls utility reports that file dense is three bytes long (two for the letters in “hi” plus one for the newline). The du utility reports that this file consumes 4 KiB of storage: one block of disk, as small as a non-sparse file’s storage footprint can be.

Now let’s use truncate to create a logically enormous sparse file and check its physical size:

        $ truncate -s 1T sparse
        $ ls -l sparse
        -rw-rw-r--. 1 me me 1099511627776 Aug  5 22:33 sparse
        $ du -h sparse
        0       sparse

Whereas ls reports the logical file size that we expect (one TiB or 2 raised to the power 40 bytes), du reveals that the file occupies no storage whatsoever. The file system will allocate physical storage resources beneath this file as data is written to it; reading unwritten regions of the file yields zeros.
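
Both properties, zero physical footprint and zeros on read, can be checked directly. The following is a minimal sketch, assuming GNU coreutils (truncate, stat -c, tr, wc) and a file system that supports sparse files; the file name holes and the scratch directory are illustrative only.

```shell
# Work in a scratch directory (illustrative; any writable directory will do).
dir=$(mktemp -d)
cd "$dir"

truncate -s 1M holes            # 1 MiB logical size, nothing written yet

logical=$(stat -c %s holes)     # logical size in bytes
blocks=$(stat -c %b holes)      # allocated blocks (block unit is given by stat -c %B)
echo "logical=$logical bytes, allocated blocks=$blocks"

# Read the whole file, delete every zero byte, and count what is left.
nonzero=$(tr -d '\0' < holes | wc -c)
echo "nonzero bytes read: $nonzero"
```

On a file system with sparse-file support the allocated block count should be zero or very close to it, while the logical size is the full 1 MiB and every byte read back is zero.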

The “pay as you go” storage cost of sparse files offers both convenience and control for pm-gawk users. If your file system supports sparse files, go ahead and create lavishly capacious heap files for pm-gawk. Their logical size costs nothing and persistent memory allocation within pm-gawk won’t fail until physical storage resources beneath the file system are exhausted. But if instead you want to prevent a heap file from consuming too much storage, simply set its initial size to whatever bound you wish to enforce; it won’t eat more disk than that. Copying sparse files with GNU cp creates sparse copies by default.
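
The “pay as you go” behavior and sparse copying can be observed with a small experiment. This is only a sketch, assuming GNU coreutils (dd with status=none, du --apparent-size, cp --sparse) on a file system that supports sparse files; the file names heap and heap.copy and the 64 MiB bound are hypothetical choices for illustration, and the dd write merely stands in for data that a program would store in the file.

```shell
dir=$(mktemp -d)
cd "$dir"

truncate -s 64M heap            # logical size 64 MiB; this is also the upper bound

# Write one 4 KiB block of data at block offset 1000, leaving the rest untouched.
# conv=notrunc keeps dd from truncating the file to the end of the write.
dd if=/dev/urandom of=heap bs=4096 count=1 seek=1000 conv=notrunc status=none

du -h heap                      # physical footprint: roughly what was written
du -h --apparent-size heap      # logical size: still 64M

# Request a sparse copy explicitly rather than relying on cp's default heuristic.
cp --sparse=always heap heap.copy
cmp heap heap.copy && echo "copy is byte-for-byte identical"

size_orig=$(stat -c %s heap)
size_copy=$(stat -c %s heap.copy)
```

Only the region that was actually written should consume storage, and the copy keeps the same logical size and contents as the original.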

File-system encryption can preclude sparse files: If the cleartext of a byte offset range within a file is all zero bytes, the corresponding ciphertext probably shouldn’t be all zeros! Encrypting at the storage layer instead of the file system layer may offer acceptable security while still permitting file systems to implement sparse files.

Sometimes you might prefer a dense heap file backed by pre-allocated storage resources, for example to increase the likelihood that pm-gawk’s internal memory allocation will succeed until the persistent heap occupies the entire heap file. The fallocate utility will do the trick:

        $ fallocate -l 1M mibi
        $ ls -l mibi
        -rw-rw-r--. 1 me me 1048576 Aug  5 23:18 mibi
        $ du -h mibi
        1.0M    mibi

We get the MiB we asked for, both logically and physically.
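
To see the contrast between the two approaches side by side, a sketch along the following lines may help. It assumes GNU coreutils and the util-linux fallocate utility; the file names sparse1m and dense1m are illustrative, and the dd fallback for file systems that reject preallocation is an addition for this example, not something pm-gawk requires.

```shell
dir=$(mktemp -d)
cd "$dir"

truncate -s 1M sparse1m         # sparse: logical size only

if fallocate -l 1M dense1m 2>/dev/null; then
    prealloc=yes                # storage was reserved up front
else
    # Assumption: preallocation is unsupported here, so write zeros instead.
    dd if=/dev/zero of=dense1m bs=1M count=1 status=none
    prealloc=no
fi
echo "preallocated with fallocate: $prealloc"

for f in sparse1m dense1m; do
    # %s = logical bytes, %b = allocated blocks, %B = bytes per reported block
    stat -c '%n: logical=%s bytes, allocated=%b blocks of %B bytes' "$f"
done

sparse_size=$(stat -c %s sparse1m)
dense_size=$(stat -c %s dense1m)
```

Both files report the same logical size; on a typical file system only dense1m should show a substantial number of allocated blocks.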

