

5 Data Integrity

Mishaps including power outages, OS kernel panics, scripting bugs, and command-line typos can harm your data, but precautions can mitigate these risks. In scripting scenarios it usually suffices to create safe backups of important files at appropriate times. As simple as this sounds, care is needed to achieve genuine protection and to reduce the costs of backups. Here’s a prudent yet frugal way to back up a heap file between uses:

        $ backup_base=heap_bk_`date +%s`
        $ cp --reflink=always heap.pma $backup_base.pma
        $ chmod a-w $backup_base.pma
        $ sync
        $ touch $backup_base.done
        $ chmod a-w $backup_base.done
        $ sync
        $ ls -l heap*
        -rw-rw-r--. 1 me me 4096000 Aug  6 15:53 heap.pma
        -r--r--r--. 1 me me       0 Aug  6 16:16 heap_bk_1659827771.done
        -r--r--r--. 1 me me 4096000 Aug  6 16:16 heap_bk_1659827771.pma

Timestamps in backup filenames make it easy to find the most recent copy if the heap file is damaged, even if the files’ last-modified times are inadvertently altered.

The cp command’s --reflink option reduces both the storage footprint of the copy and the time required to make it. Just as sparse files provide “pay as you go” storage footprints, reflink copying offers “pay as you change” storage costs (5). A reflink copy shares storage with the original file. The file system ensures that subsequent changes to either file don’t affect the other. Reflink copying is not available on all file systems; XFS, Btrfs, and OCFS2 currently support it (6). Fortunately you can install, say, an XFS file system inside an ordinary file on some other file system, such as ext4 (7).
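The fallback mentioned in footnote 6 can be wrapped in a small helper. What follows is a minimal sketch, assuming GNU cp; the function name copy_frugal is hypothetical and is not part of any tool described in this manual.

```shell
# Hypothetical helper: copy $1 to $2 as cheaply as the file system allows.
# Assumes GNU cp, which provides --reflink and --sparse.
copy_frugal() {
    # Try a reflink copy first; this fails on file systems without support.
    if cp --reflink=always "$1" "$2" 2>/dev/null; then
        return 0
    fi
    # Fall back to an ordinary copy that keeps the result sparse.
    # Any real error (for example a missing source) is reported here.
    cp --sparse=always "$1" "$2"
}
```

With this helper, the second command of the backup procedure above could become copy_frugal heap.pma $backup_base.pma. The fallback copy costs full time and space for the non-hole parts of the file, so it is a stopgap rather than a substitute for a reflink-capable file system.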

After creating a backup copy of the heap file we use sync to force it down to durable media. Otherwise the copy may reside only in volatile DRAM (the file system’s cache), where an OS crash or power failure could corrupt it (8). After syncing the backup we create and sync a “success indicator” file with extension .done to address a nasty corner case: Power may fail while a backup is being copied from the primary heap file, leaving either file, or both, corrupt on storage. This possibility is particularly worrisome for jobs that run unattended. Upon reboot, each .done file attests that the corresponding backup succeeded, making it easy to identify the most recent successful backup.
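Recovery after a crash can then be automated. The sketch below assumes the heap_bk_TIMESTAMP naming used in the example above; the function name latest_good_backup is hypothetical. It ranks backups by the timestamp embedded in the filename rather than by last-modified time, and it ignores any backup that lacks a .done file.

```shell
# Hypothetical helper: print the newest backup in directory $1 (default:
# the current directory) that has a matching .done file.  Returns non-zero
# if no such backup exists.
latest_good_backup() {
    dir=${1:-.}
    best=
    best_stamp=
    for done_file in "$dir"/heap_bk_*.done; do
        [ -e "$done_file" ] || continue      # glob matched nothing
        pma=${done_file%.done}.pma
        [ -f "$pma" ] || continue            # indicator without a backup
        stamp=${done_file##*/heap_bk_}       # strip directory and prefix
        stamp=${stamp%.done}                 # strip extension
        case $stamp in
            ''|*[!0-9]*) continue ;;         # not a numeric timestamp
        esac
        if [ -z "$best" ] || [ "$stamp" -gt "$best_stamp" ]; then
            best=$pma
            best_stamp=$stamp
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

The printed name can be passed to cp to restore the heap file. Because the backups were made read-only, a freshly restored copy may need chmod u+w before it is used again.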

Finally, if you’re serious about tolerating failures you must “train as you would fight” by testing your hardware/software stack against realistic failures. For realistic power-failure testing, see https://queue.acm.org/detail.cfm?id=3400902.


Footnotes

(5)

The system call that implements reflink copying is described in man ioctl_ficlone.

(6)

The --reflink option creates copies as sparse as the original. If reflink copying is not available, --sparse=always should be used.

(7)

See https://www.usenix.org/system/files/login/articles/login_winter19_08_kelly.pdf.

(8)

On some OSes sync provides very weak guarantees, but on Linux sync returns only after all file system data are flushed down to durable storage. If your sync is unreliable, write a little C program that calls fsync() to flush a file. To be safe, also call fsync() on every enclosing directory on the file’s realpath() up to the root.
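Where writing a C program is inconvenient, a shell approximation is possible with a different technique: the sync command in GNU coreutils accepts file operands and flushes each named file individually. The sketch below relies on that behavior, which is an assumption about your system’s sync; the function names enclosing_dirs and durable_sync are hypothetical. It does not handle path names that contain newlines.

```shell
# Hypothetical helper: print every directory enclosing file $1, innermost
# first, ending with the root directory.
enclosing_dirs() {
    dir=$(dirname -- "$(realpath -- "$1")")
    while :; do
        printf '%s\n' "$dir"
        [ "$dir" = / ] && break
        dir=$(dirname -- "$dir")
    done
}

# Hypothetical helper: flush file $1, then each enclosing directory.
# Assumes a sync(1) that accepts file operands, as GNU coreutils does.
durable_sync() {
    sync -- "$1" || return 1
    enclosing_dirs "$1" | while IFS= read -r dir; do
        sync -- "$dir" || exit 1     # leaves the pipeline's subshell
    done
}
```

For example, durable_sync $backup_base.pma would take the place of the first sync in the backup procedure. Flushing a directory requires permission to open it for reading, so the function fails if any enclosing directory is unreadable.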

