GNU tar 1.35: 8.1 Using Less Space through Compression

8.1 Using Less Space through Compression

8.1.1 Creating and Reading Compressed Archives

GNU tar is able to create and read compressed archives. It supports a wide variety of compression programs, namely: gzip, bzip2, lzip, lzma, lzop, zstd, xz and traditional compress. The latter is supported mostly for backward compatibility, and we recommend against using it, because it is far less effective than the other compression programs(21).

Creating a compressed archive is simple: you just specify a compression option along with the usual archive creation commands. Available compression options are summarized in the table below:

Long	Short	Archive format
‘`--gzip`’	‘`-z`’	`gzip`
‘`--bzip2`’	‘`-j`’	`bzip2`
‘`--xz`’	‘`-J`’	`xz`
‘`--lzip`’		`lzip`
‘`--lzma`’		`lzma`
‘`--lzop`’		`lzop`
‘`--zstd`’		`zstd`
‘`--compress`’	‘`-Z`’	`compress`

For example:

$ tar czf archive.tar.gz .

You can also let GNU tar select the compression program based on the suffix of the archive file name. This is done using the ‘--auto-compress’ (‘-a’) command line option. For example, the following invocation will use bzip2 for compression:

$ tar caf archive.tar.bz2 .

whereas the following one will use lzma:

$ tar caf archive.tar.lzma .

For a complete list of file name suffixes recognized by GNU tar, see auto-compress.
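One way to confirm which compressor ‘-a’ selected is to inspect the first bytes of the resulting file. The following is a minimal sketch, assuming gzip is installed; the scratch directory and file names are illustrative only.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/src"
echo "hello" > "$workdir/src/file.txt"

# The .tgz suffix makes --auto-compress (-a) choose gzip.
tar -C "$workdir/src" -caf "$workdir/archive.tgz" .

# A gzip stream starts with the two bytes 1f 8b.
magic=$(od -An -tx1 -N2 "$workdir/archive.tgz" | tr -d ' \n')
echo "magic bytes: $magic"
```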

Reading a compressed archive is even simpler: you don’t need to specify any additional options, as GNU tar recognizes its format automatically. Thus, the following commands will list and extract the archive created in the previous example:

# List the compressed archive
$ tar tf archive.tar.gz
# Extract the compressed archive
$ tar xf archive.tar.gz

The format recognition algorithm is based on signatures: special byte sequences at the beginning of a file that are specific to certain compression formats. If this approach fails, tar falls back to using the archive name suffix to determine its format (see auto-compress, for a list of recognized suffixes).
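To illustrate that recognition relies on the signature rather than the name, the sketch below stores a gzip-compressed archive under a name with no recognizable suffix and lists it without any compression option. It assumes gzip is installed; ‘backup.bin’ and ‘notes.txt’ are made-up names.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo "payload" > "$workdir/notes.txt"

# Create a gzip-compressed archive under a name with no known suffix.
tar -C "$workdir" -czf "$workdir/backup.bin" notes.txt

# No -z here: tar identifies gzip from the leading signature bytes.
listing=$(tar -tf "$workdir/backup.bin")
echo "$listing"
```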

Some compression programs are able to handle different compression formats. GNU tar takes advantage of this if the principal decompressor for the given format is not available. For example, if compress is not installed, tar will try to use gzip. As of version 1.35 the following alternatives are tried(22):

Format	Main decompressor	Alternatives
compress	compress	gzip
lzma	lzma	xz
bzip2	bzip2	lbzip2

The only case when you have to specify a decompression option while reading the archive is when reading from a pipe or from a tape drive that does not support random access. However, in this case GNU tar will indicate which option you should use. For example:

$ cat archive.tar.gz | tar tf -
tar: Archive is compressed.  Use -z option
tar: Error is not recoverable: exiting now

If you see such diagnostics, just add the suggested option to the invocation of GNU tar:

$ cat archive.tar.gz | tar tzf -
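The two usual ways of reading a compressed archive from a pipe can be sketched as follows: either name the decompression option explicitly, or decompress first and hand tar a plain archive stream. This assumes gzip is installed; the file names are illustrative.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo "payload" > "$workdir/notes.txt"
tar -C "$workdir" -czf "$workdir/archive.tar.gz" notes.txt

# Option 1: tell tar which decompressor to use when reading from a pipe.
from_pipe=$(cat "$workdir/archive.tar.gz" | tar -tzf -)

# Option 2: decompress explicitly and pass tar an uncompressed stream.
from_gzip=$(gzip -dc "$workdir/archive.tar.gz" | tar -tf -)

echo "$from_pipe"
echo "$from_gzip"
```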

Note also that there are several restrictions on operations on compressed archives. First of all, compressed archives cannot be modified, i.e., you cannot update (‘--update’, alias ‘-u’) them, delete (‘--delete’) members from them, or add (‘--append’, alias ‘-r’) members to them. Likewise, you cannot append another tar archive to a compressed archive using ‘--concatenate’ (‘-A’). Secondly, multi-volume archives cannot be compressed.
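One common workaround for the first restriction is to decompress the archive, modify the plain archive, and compress it again. The sketch below assumes gzip is installed and that there is room for the uncompressed copy; all file names are illustrative.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
echo "one" > first.txt
echo "two" > second.txt
tar -czf archive.tar.gz first.txt

# Appending directly to archive.tar.gz is not possible, so decompress first.
gzip -d archive.tar.gz          # leaves archive.tar
tar -rf archive.tar second.txt  # --append works on the uncompressed archive
gzip archive.tar                # recompress; recreates archive.tar.gz

members=$(tar -tf archive.tar.gz)
echo "$members"
```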

The following options allow you to select a particular compressor program:

‘-z’
‘--gzip’
‘--ungzip’: Filter the archive through gzip.
‘-J’
‘--xz’: Filter the archive through xz.
‘-j’
‘--bzip2’: Filter the archive through bzip2.
‘--lzip’: Filter the archive through lzip.
‘--lzma’: Filter the archive through lzma.
‘--lzop’: Filter the archive through lzop.
‘--zstd’: Filter the archive through zstd.
‘-Z’
‘--compress’
‘--uncompress’: Filter the archive through compress.

When any of these options is given, GNU tar searches for the compressor binary in the current path and invokes it. The name of the compressor program is specified at compilation time using a corresponding ‘--with-compname’ option to configure, e.g. ‘--with-bzip2’ to select a specific bzip2 binary. See section Using lbzip2 with GNU tar, for a detailed discussion.

The output produced by tar --help shows the actual compressor names along with each of these options.

You can use any of these options on physical devices (tape drives, etc.) and remote files as well as on normal files; data to or from such devices or remote files is reblocked by another copy of the tar program to enforce the specified (or default) record size. The default compression parameters are used. You can override them by using the ‘-I’ option (see below), e.g.:

$ tar -cf archive.tar.gz -I 'gzip -9 -n' subdir

A more traditional way to do this is to use a pipe:

$ tar cf - subdir | gzip -9 -n > archive.tar.gz

Compressed archives are easily corrupted, because compressed files have little redundancy. The adaptive nature of the compression scheme means that the compression tables are implicitly spread all over the archive. If you lose a few blocks, the dynamic construction of the compression tables becomes unsynchronized, and there is little chance that you could recover later in the archive.
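Because damage is hard to recover from, it can be worth testing a compressed archive right after writing it. The sketch below uses ‘gzip -t’, which checks the compressed stream without extracting it, and then simulates lost data by truncating a copy. It assumes gzip and a ‘head’ that accepts ‘-c’; the names are illustrative.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo "payload" > "$workdir/notes.txt"
tar -C "$workdir" -czf "$workdir/archive.tar.gz" notes.txt

# gzip -t verifies the integrity of the compressed stream.
if gzip -t "$workdir/archive.tar.gz"; then
    status=intact
else
    status=damaged
fi
echo "original: $status"

# Simulate lost blocks by keeping only the first 40 bytes of a copy.
head -c 40 "$workdir/archive.tar.gz" > "$workdir/damaged.tar.gz"
if gzip -t "$workdir/damaged.tar.gz" 2>/dev/null; then
    damaged_status=intact
else
    damaged_status=damaged
fi
echo "truncated copy: $damaged_status"
```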

Other compression options provide better control over creating compressed archives. These are:

‘--auto-compress’

‘-a’

Select the compression program based on the archive file name suffix. The following suffixes are recognized:

Suffix	Compression program
‘`.gz`’	`gzip`
‘`.tgz`’	`gzip`
‘`.taz`’	`gzip`
‘`.Z`’	`compress`
‘`.taZ`’	`compress`
‘`.bz2`’	`bzip2`
‘`.tz2`’	`bzip2`
‘`.tbz2`’	`bzip2`
‘`.tbz`’	`bzip2`
‘`.lz`’	`lzip`
‘`.lzma`’	`lzma`
‘`.tlz`’	`lzma`
‘`.lzo`’	`lzop`
‘`.xz`’	`xz`
‘`.zst`’	`zstd`
‘`.tzst`’	`zstd`

‘--use-compress-program=command’

‘-I command’

Use external compression program command. Use this option if you want to specify options for the compression program, or if you are not happy with the compression program associated with the suffix at compile time, or if you have a compression program that GNU tar does not support. The command argument is a valid command invocation, as you would type it at the command line prompt, with any additional options as needed. Enclose it in quotes if it contains white space (see section Running External Commands).

The command should follow two conventions:

First, when invoked without additional options, it should read data from standard input, compress it and output it on standard output.

Secondly, if invoked with the additional ‘-d’ option, it should do exactly the opposite, i.e., read the compressed data from the standard input and produce uncompressed data on the standard output.

The latter requirement means that you must not use the ‘-d’ option as a part of the command itself.
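A filter that satisfies both conventions can be very small. The sketch below writes a hypothetical wrapper named ‘gz9’ (not part of GNU tar) that compresses with ‘gzip -9 -n’ by default and decompresses when given ‘-d’, then uses it for both creating and listing an archive. It assumes gzip is installed and that the scratch directory allows executing scripts.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical filter "gz9": compress by default, decompress with -d.
cat > "$workdir/gz9" <<'EOF'
#!/bin/sh
case $1 in
-d) exec gzip -d -c ;;
'') exec gzip -9 -n -c ;;
*)  echo "gz9: unknown option $1" >&2; exit 1 ;;
esac
EOF
chmod +x "$workdir/gz9"

echo "payload" > "$workdir/notes.txt"
# tar runs the filter without extra options to compress ...
tar -C "$workdir" -cf "$workdir/archive.tar.gz9" -I "$workdir/gz9" notes.txt
# ... and adds -d when it needs the data decompressed.
listing=$(tar -tf "$workdir/archive.tar.gz9" -I "$workdir/gz9")
echo "$listing"
```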

The ‘--use-compress-program’ option, in particular, lets you implement your own filters, not necessarily dealing with compression/decompression. For example, suppose you wish to implement PGP signing on top of compression, using gpg (see gpg - encryption and signing tool in GNU Privacy Guard Manual). The following script does that:

#! /bin/sh
case $1 in
-d) gpg --decrypt - | gzip -d -c;;
'') gzip -c | gpg -s;;
*)  echo "Unknown option $1">&2; exit 1;;
esac

Suppose you name it ‘gpgz’ and save it somewhere in your PATH. Then the following command will create a compressed archive signed with your private key:

$ tar -cf foo.tar.gpgz -Igpgz .

Likewise, the command below will list its contents:

$ tar -tf foo.tar.gpgz -Igpgz .

8.1.1.1 Using lbzip2 with GNU `tar`.

Lbzip2 is a multithreaded utility for handling ‘bzip2’ compression, written by Laszlo Ersek. It makes use of multiple processors to speed up its operation and in general works considerably faster than bzip2. For a detailed description of lbzip2 see http://freshmeat.net/projects/lbzip2 and lbzip2: parallel bzip2 utility.

Recent versions of lbzip2 are mostly command line compatible with bzip2, which makes it possible to automatically invoke it via the ‘--bzip2’ GNU tar command line option. To do so, GNU tar must be configured with the ‘--with-bzip2’ command line option, like this:

$ ./configure --with-bzip2=lbzip2 [other-options]

Once configured and compiled this way, tar --help will show the following:

$ tar --help | grep -- --bzip2
  -j, --bzip2                filter the archive through lbzip2

which means that running tar --bzip2 will invoke lbzip2.
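If rebuilding tar is not practical, a similar effect can be obtained for a single invocation with the ‘-I’ option described earlier. This is only a sketch: it assumes lbzip2 is installed under that name and skips the archive step otherwise.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo "payload" > "$workdir/notes.txt"

if command -v lbzip2 >/dev/null 2>&1; then
    # Use lbzip2 for this invocation only, without reconfiguring tar.
    tar -C "$workdir" -cf "$workdir/archive.tar.bz2" -I lbzip2 notes.txt
    result=$(tar -tf "$workdir/archive.tar.bz2" -I lbzip2)
else
    result="lbzip2 not installed"
fi
echo "$result"
```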

8.1.2 Archiving Sparse Files

Files in the file system occasionally have holes. A hole in a file is a section of the file’s contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive that takes more space than the original file occupies on disk.

To have tar attempt to recognize the holes in a file, use ‘--sparse’ (‘-S’). When you use this option, then, for any file using less disk space than would be expected from its length, tar searches the file for holes. It then records in the archive where the holes (consecutive stretches of zeros) are, and archives only the “real contents” of the file. On extraction (‘--sparse’ is not needed on extraction), holes are recreated in any such file wherever they were found. Thus, if you use ‘--sparse’, tar archives won’t take more space than the original.

GNU tar uses two methods for detecting holes in sparse files. These methods are described later in this subsection.
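The effect of ‘--sparse’ on archive size can be observed with a small experiment. The sketch below assumes a file system that supports holes and the ‘truncate’ utility from GNU coreutils; the file names and the 8 MiB size are arbitrary.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build a file with a little data followed by a hole of nearly 8 MiB.
echo "real data" > sparse.img
truncate -s 8M sparse.img

tar -cf plain.tar sparse.img      # holes are archived as literal zeros
tar -cSf sparse.tar sparse.img    # holes are recorded, not stored

plain_size=$(wc -c < plain.tar | tr -d ' ')
sparse_size=$(wc -c < sparse.tar | tr -d ' ')
echo "without --sparse: $plain_size bytes"
echo "with --sparse:    $sparse_size bytes"
```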

‘-S’

‘--sparse’

This option instructs tar to test each file for sparseness before attempting to archive it. If the file is found to be sparse, it is treated specially, which reduces the amount of space used by its image in the archive.

This option is meaningful only when creating or updating archives. It has no effect on extraction.

Consider using ‘--sparse’ when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use ‘--sparse’ while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). See section Using tar to Perform Incremental Dumps.

However, be aware that the ‘--sparse’ option may present a serious drawback. Namely, in order to determine the positions of holes in a file, tar may have to read it before trying to archive it, so in total the file may be read twice. This may happen when your OS or your FS does not support the SEEK_HOLE/SEEK_DATA feature in lseek (see ‘--hole-detection’, below).

When using the ‘POSIX’ archive format, GNU tar is able to store sparse files in three distinct ways, called sparse formats. A sparse format is identified by its number, consisting, as usual, of two decimal numbers delimited by a dot. By default, format ‘1.0’ is used. If, for some reason, you wish to use an earlier format, you can select it using the ‘--sparse-version’ option.

‘--sparse-version=version’: Select the format to store sparse files in. Valid version values are: ‘0.0’, ‘0.1’ and ‘1.0’. See section Storing Sparse Files, for a detailed description of each format.

Using the ‘--sparse-version’ option implies ‘--sparse’.
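As an illustration, the sketch below selects the older ‘0.1’ sparse format with ‘--sparse-version’ and checks that the file comes back at its original length. It assumes ‘truncate’ from GNU coreutils; the file names and the 1 MiB size are arbitrary.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
echo "real data" > sparse.img
truncate -s 1M sparse.img

# Sparse formats apply to the POSIX archive format; pick the 0.1 layout.
# No explicit --sparse is given, since --sparse-version implies it.
tar --format=posix --sparse-version=0.1 -cf v01.tar sparse.img

mkdir out
tar -xf v01.tar -C out
restored_size=$(wc -c < out/sparse.img | tr -d ' ')
echo "restored size: $restored_size bytes"
```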

‘--hole-detection=method’

Enforce a specific hole detection method. Before the real contents of a sparse file are stored, tar needs to gather knowledge about the file’s sparseness. This is because it needs to have the file’s map of holes stored in the tar header before it starts archiving the file contents. Currently, two methods of hole detection are implemented:

‘--hole-detection=seek’ Seek through the file for data and holes. This method uses an extension of the lseek system call (SEEK_HOLE and SEEK_DATA) that reuses the file system’s own knowledge about sparse file contents, so detection is usually very fast. To use this feature, your file system and operating system must support it. At the time of this writing (2015) this feature, in spite of not being accepted by POSIX, is fairly widely supported by different operating systems.
‘--hole-detection=raw’ Read the whole sparse file byte by byte before archiving it. This method detects holes as consecutive stretches of zeros. Compared to the previous method, it is usually much slower, although more portable.

When no ‘--hole-detection’ option is given, tar uses the ‘seek’ method if it is supported by the operating system.

Using the ‘--hole-detection’ option implies ‘--sparse’.
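Both methods can be tried on the same file to compare their results. The sketch below assumes a tar recent enough to accept ‘--hole-detection’, a file system that supports holes, and ‘truncate’ from GNU coreutils; the names and sizes are arbitrary.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
echo "real data" > sparse.img
truncate -s 8M sparse.img

# Ask the file system where the holes are (fast where SEEK_HOLE works).
tar --hole-detection=seek -cf seek.tar sparse.img
# Scan the file contents for runs of zeros (slower, but portable).
tar --hole-detection=raw -cf raw.tar sparse.img

seek_size=$(wc -c < seek.tar | tr -d ' ')
raw_size=$(wc -c < raw.tar | tr -d ' ')
echo "seek: $seek_size bytes, raw: $raw_size bytes"
```

On a file system that reports holes, both archives would be expected to be far smaller than the 8 MiB file.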

This document was generated on August 23, 2023 using texi2html 5.0.
