8.4 Making tar Archives More Reproducible

Sometimes it is important for an archive to be reproducible, so that one can easily verify that it was derived solely from its input. We call an archive reproducible if another archive created from the same set of input files with the same command-line options is byte-for-byte identical to the original one.

However, two archives created by GNU tar from two sets of input files can differ even if the input files have the same contents and GNU tar was invoked the same way on both sets of input. This can happen if the inputs have different modification times or other metadata, or if the input directories’ entries are in different orders.
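
As a rough illustration of the problem, the following sketch archives two copies of a file that differ only in modification time. The scratch directory and file names are invented for the example, and it assumes GNU tar, `cksum`, and a `touch` that accepts ISO 8601 dates.

```shell
# Hypothetical scratch layout: two trees with identical contents.
work=$(mktemp -d)
mkdir "$work/a" "$work/b"
printf 'hello\n' > "$work/a/file.txt"
printf 'hello\n' > "$work/b/file.txt"

# Give the two copies different modification times.
touch -d 2021-01-01T00:00:00Z "$work/a/file.txt"
touch -d 2022-01-01T00:00:00Z "$work/b/file.txt"

# Archive each copy the same way and checksum the result.
sum_a=$(tar -C "$work/a" -cf - file.txt | cksum)
sum_b=$(tar -C "$work/b" -cf - file.txt | cksum)

if [ "$sum_a" = "$sum_b" ]; then
  echo "archives are identical"
else
  echo "archives differ"
fi
rm -rf "$work"
```

The checksums differ because the modification time recorded in each member header differs, even though the file contents are the same.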

To avoid this problem when creating an archive, and thus make the archive reproducible, you can run GNU tar in the C locale with some or all of the following options:

--sort=name

Omit irrelevant information about directory entry order.

--format=posix

Avoid problems with large files or files with unusual timestamps. This also enables ‘--pax-option’ options mentioned below.

--pax-option='exthdr.name=%d/PaxHeaders/%f'

Omit the process ID of tar. This option is needed only if POSIXLY_CORRECT is set in the environment.

--pax-option='delete=atime,delete=ctime'

Omit irrelevant information about file access or status change time.

--clamp-mtime --mtime="$SOURCE_EPOCH"

Omit irrelevant information about file timestamps after ‘$SOURCE_EPOCH’, which should be a time no less than any timestamp of any source file.

--numeric-owner

Omit irrelevant information about user and group names.

--owner=0
--group=0

Omit irrelevant information about file ownership and group.

--mode='go+u,go-w'

Omit irrelevant information about file permissions.
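
The options above can be tried together as in the following sketch, which compares two trees that have the same contents but different timestamps, permissions, and creation order. The tree layout and the SOURCE_EPOCH value are illustrative assumptions: the files stand in for generated files whose timestamps are all later than SOURCE_EPOCH, so every timestamp is clamped. It assumes a GNU tar recent enough to support ‘--sort’ and ‘--clamp-mtime’.

```shell
# Hypothetical scratch layout: two trees with the same contents.
work=$(mktemp -d)
mkdir -p "$work/a/src" "$work/b/src"
printf 'one\n' > "$work/a/src/one.txt"
printf 'two\n' > "$work/a/src/two.txt"
printf 'two\n' > "$work/b/src/two.txt"   # created in the opposite order
printf 'one\n' > "$work/b/src/one.txt"

# Make the metadata differ: timestamps and group/other permissions.
touch -d 2021-01-01T00:00:00Z "$work/a/src/one.txt" "$work/a/src/two.txt"
touch -d 2022-01-01T00:00:00Z "$work/b/src/one.txt" "$work/b/src/two.txt"
chmod 644 "$work/a/src/one.txt"
chmod 600 "$work/b/src/one.txt"

# Illustrative value, older than every file above.
SOURCE_EPOCH=2020-01-01T00:00:00Z
TARFLAGS="
  --sort=name --format=posix
  --pax-option=exthdr.name=%d/PaxHeaders/%f
  --pax-option=delete=atime,delete=ctime
  --clamp-mtime --mtime=$SOURCE_EPOCH
  --numeric-owner --owner=0 --group=0
  --mode=go+u,go-w
"

# Without the options, the differing metadata leaks into the archive.
plain_a=$(tar -C "$work/a" -cf - src | cksum)
plain_b=$(tar -C "$work/b" -cf - src | cksum)

# With the options, both trees should yield the same bytes.
repro_a=$(LC_ALL=C tar $TARFLAGS -C "$work/a" -cf - src | cksum)
repro_b=$(LC_ALL=C tar $TARFLAGS -C "$work/b" -cf - src | cksum)

echo "plain:        $plain_a / $plain_b"
echo "reproducible: $repro_a / $repro_b"
rm -rf "$work"
```

Comparing checksums of the uncompressed stream keeps the check independent of any compressor. In a real project, timestamps of source files older than SOURCE_EPOCH are preserved rather than clamped, so they must themselves be reproducible.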

When creating a reproducible archive from version-controlled source files, it can be useful to set each file’s modification time to that of its last commit, so that the timestamps are reproducible from the version-control repository. If these timestamps are all on integer second boundaries, if you use ‘--format=posix --pax-option='delete=atime,delete=ctime' --clamp-mtime --mtime="$SOURCE_EPOCH"’ where $SOURCE_EPOCH is the time of the most recent commit, and if all non-source files have timestamps greater than $SOURCE_EPOCH, then GNU tar should generate an archive in ustar format: no POSIX extensions will be needed, so the archive will fall within the ustar subset of posix format.
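
One informal way to check this is to look for pax extended header members in the resulting archive. The sketch below is a heuristic, not an official test: it assumes GNU tar names extended header members with the string "PaxHeaders" (its default naming scheme), uses a made-up file name and SOURCE_EPOCH, and adds ‘--owner=0 --group=0 --numeric-owner’ so that large user or group IDs cannot force an extended header. The `has_pax_headers` helper is hypothetical.

```shell
work=$(mktemp -d)
printf 'hello\n' > "$work/file.txt"
# Integer-second timestamp, standing in for a commit time.
touch -d 2021-01-01T00:00:00Z "$work/file.txt"
SOURCE_EPOCH=2021-01-01T00:00:00Z

# Plain posix format: atime and ctime are expected in extended headers.
tar -C "$work" --format=posix \
  --owner=0 --group=0 --numeric-owner \
  -cf "$work/pax.tar" file.txt

# Same, but dropping atime/ctime and clamping mtime.
tar -C "$work" --format=posix \
  --pax-option=delete=atime,delete=ctime \
  --clamp-mtime --mtime="$SOURCE_EPOCH" \
  --owner=0 --group=0 --numeric-owner \
  -cf "$work/ustar.tar" file.txt

# Hypothetical helper: does the raw archive mention an extended header?
has_pax_headers() {
  if LC_ALL=C grep -aq PaxHeaders "$1"; then echo yes; else echo no; fi
}

plain_has=$(has_pax_headers "$work/pax.tar")
reduced_has=$(has_pax_headers "$work/ustar.tar")
echo "extended headers without the options: $plain_has"
echo "extended headers with the options:    $reduced_has"
rm -rf "$work"
```

If the second archive still reports extended headers in your project, likely causes include sub-second source timestamps or member names too long for ustar headers.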

Also, if compressing, use a reproducible compression format; e.g., with gzip you should use the ‘--no-name’ (‘-n’) option.
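
To see why ‘--no-name’ matters, the following sketch compresses two files with identical contents but different names and timestamps. It assumes GNU gzip, which by default records the input file’s name and modification time in the compressed header. The file names are placeholders standing in for archives.

```shell
work=$(mktemp -d)
# Identical payloads under different names and timestamps.
printf 'same payload\n' > "$work/one.dat"
printf 'same payload\n' > "$work/two.dat"
touch -d 2021-01-01T00:00:00Z "$work/one.dat"
touch -d 2022-01-01T00:00:00Z "$work/two.dat"

# Default: the gzip header embeds the original name and timestamp.
default_one=$(gzip -c "$work/one.dat" | cksum)
default_two=$(gzip -c "$work/two.dat" | cksum)

# --no-name: the header omits both, leaving only the payload.
noname_one=$(gzip --no-name -c "$work/one.dat" | cksum)
noname_two=$(gzip --no-name -c "$work/two.dat" | cksum)

echo "default:   $default_one / $default_two"
echo "--no-name: $noname_one / $noname_two"
rm -rf "$work"
```

The compressed output can still vary with the compression level and the gzip version, so those should be held fixed as well when comparing builds.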

Here is an example set of shell commands to produce a reproducible tarball with git and gzip, which you can tailor to your project’s needs.

function get_commit_time() {
  # format-local makes git honor TZ, so the trailing Z (UTC) is accurate.
  TZ=UTC0 git log -1 \
    --format=tformat:%cd \
    --date=format-local:%Y-%m-%dT%H:%M:%SZ \
    "$@"
}
#
# Set each source file timestamp to that of its latest commit.
git ls-files | while IFS= read -r file; do
  commit_time=$(get_commit_time "$file") &&
  touch -md "$commit_time" "$file"
done
#
# Set timestamp of each directory under $FILES
# to the latest timestamp of any descendant.
find $FILES -depth -type d -exec sh -c \
  'touch -r "$0/$(ls -At "$0" | head -n 1)" "$0"' \
  {} ';'
#
# Create $ARCHIVE.tgz from $FILES, pretending that
# the modification time for each newer file
# is that of the most recent commit of any source file.
SOURCE_EPOCH=$(get_commit_time)
TARFLAGS="
  --sort=name --format=posix
  --pax-option=exthdr.name=%d/PaxHeaders/%f
  --pax-option=delete=atime,delete=ctime
  --clamp-mtime --mtime=$SOURCE_EPOCH
  --numeric-owner --owner=0 --group=0
  --mode=go+u,go-w
"
GZIPFLAGS="--no-name --best"
LC_ALL=C tar $TARFLAGS -cf - $FILES |
  gzip $GZIPFLAGS > $ARCHIVE.tgz

This document was generated on August 23, 2023 using texi2html 5.0.