
8.1 Using Less Space through Compression



8.1.1 Creating and Reading Compressed Archives

GNU tar can create and read compressed archives. It supports the gzip, bzip2, xz, lzma and lzop compression programs. For backward compatibility, it also supports the compress command, although we strongly recommend against using it, because it is far less effective than the other compression programs(17).

Creating a compressed archive is simple: you just specify a compression option along with the usual archive creation commands. The compression option is ‘-z’ (‘--gzip’) to create a gzip compressed archive, ‘-j’ (‘--bzip2’) to create a bzip2 compressed archive, ‘-J’ (‘--xz’) to create an XZ archive, ‘--lzma’ to create an LZMA compressed archive, ‘--lzop’ to create an LZOP archive, and ‘-Z’ (‘--compress’) to use the compress program. For example:

 
$ tar cfz archive.tar.gz .

You can also let GNU tar select the compression program based on the suffix of the archive file name. This is done using the ‘--auto-compress’ (‘-a’) command line option. For example, the following invocation will use bzip2 for compression:

 
$ tar cfa archive.tar.bz2 .

whereas the following one will use lzma:

 
$ tar cfa archive.tar.lzma .

For a complete list of file name suffixes recognized by GNU tar, see auto-compress.

Reading a compressed archive is even simpler: you don't need to specify any additional options, as GNU tar recognizes its format automatically. Thus, the following commands will list and extract the archive created in the previous example:

 
# List the compressed archive
$ tar tf archive.tar.gz
# Extract the compressed archive
$ tar xf archive.tar.gz

The format recognition algorithm is based on signatures, special byte sequences at the beginning of a file that are specific to certain compression formats. If this approach fails, tar falls back to using the archive name suffix to determine its format (see auto-compress, for a list of recognized suffixes).
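To make the notion of a signature concrete, the following sketch creates a small gzip-compressed archive and prints its first two bytes, which for any gzip stream are 1f 8b. The scratch directory and the file names are illustrative assumptions, not anything tar requires.

```shell
#!/bin/sh
# Sketch: inspect the signature that format recognition relies on.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir subdir
echo "hello" > subdir/file.txt
tar -czf archive.tar.gz subdir

# Dump the first two bytes of the archive in hexadecimal.
magic=$(od -An -tx1 -N2 archive.tar.gz | tr -d ' \n')
echo "signature: $magic"

# No compression option is given when listing the archive.
listing=$(tar -tf archive.tar.gz)
echo "$listing"
```

The same check with a bzip2 or xz archive would show a different leading byte sequence, which is how the formats are told apart.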

The only case when you have to specify a decompression option while reading the archive is when reading from a pipe or from a tape drive that does not support random access. However, in this case GNU tar will indicate which option you should use. For example:

 
$ cat archive.tar.gz | tar tf -
tar: Archive is compressed.  Use -z option
tar: Error is not recoverable: exiting now

If you see such diagnostics, just add the suggested option to the invocation of GNU tar:

 
$ cat archive.tar.gz | tar tfz -
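As a self-contained sketch of the pipe case, the script below builds a throwaway archive and lists it from standard input with the decompression option stated explicitly. The directory and file names are made up for the illustration.

```shell
#!/bin/sh
# Sketch: list a gzip-compressed archive that arrives on a pipe.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir subdir
echo "data" > subdir/a.txt
tar -czf archive.tar.gz subdir

# The archive name is '-', so tar reads standard input; -z names the
# decompressor explicitly instead of relying on automatic recognition.
listing=$(cat archive.tar.gz | tar -tzf -)
echo "$listing"
```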

Note also that there are several restrictions on operations on compressed archives. First of all, compressed archives cannot be modified, i.e., you cannot update (‘--update’ (‘-u’)) them, delete (‘--delete’) members from them, or add (‘--append’ (‘-r’)) members to them. Likewise, you cannot append another tar archive to a compressed archive using ‘--concatenate’ (‘-A’). Secondly, multi-volume archives cannot be compressed.
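One possible way around the first restriction is to decompress the archive, modify the plain tar file, and compress it again. The sketch below does this for ‘--append’; it assumes gzip compression and uses invented file names.

```shell
#!/bin/sh
# Sketch: add a member to a compressed archive by way of the
# uncompressed archive.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir subdir
echo "one" > subdir/one.txt
echo "two" > two.txt
tar -czf archive.tar.gz subdir

gzip -d archive.tar.gz            # leaves archive.tar
tar -rf archive.tar two.txt       # appending works on the plain archive
gzip archive.tar                  # recreates archive.tar.gz

members=$(tar -tf archive.tar.gz)
echo "$members"
```

This costs a full decompression and recompression of the archive, so it is only practical when the archive is of moderate size.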

The following table summarizes compression options used by GNU tar.

--auto-compress
-a

Select a compression program based on the archive file name suffix. The following suffixes are recognized:

Suffix    Compression program
.gz       gzip
.tgz      gzip
.taz      gzip
.Z        compress
.taZ      compress
.bz2      bzip2
.tz2      bzip2
.tbz2     bzip2
.tbz      bzip2
.lzma     lzma
.tlz      lzma
.lzo      lzop
.xz       xz
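The suffix lookup can be pictured as a simple pattern match on the archive name. The function below is a hypothetical helper written for this illustration (it is not part of tar); it only restates the suffix list above as a shell case statement.

```shell
#!/bin/sh
# Hypothetical helper, not part of tar: print the compression program
# that the suffix list associates with an archive name.
compressor_for() {
    case $1 in
        *.gz|*.tgz|*.taz)          echo gzip ;;
        *.Z|*.taZ)                 echo compress ;;
        *.bz2|*.tz2|*.tbz2|*.tbz)  echo bzip2 ;;
        *.lzma|*.tlz)              echo lzma ;;
        *.lzo)                     echo lzop ;;
        *.xz)                      echo xz ;;
        *)                         echo none ;;
    esac
}

compressor_for archive.tar.bz2
compressor_for backup.tgz
compressor_for notes.txt
```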

-z
--gzip
--gunzip
--ungzip

Filter the archive through gzip.

You can use ‘--gzip’ and ‘--gunzip’ on physical devices (tape drives, etc.) and remote files as well as on normal files; data to or from such devices or remote files is reblocked by another copy of the tar program to enforce the specified (or default) record size. The default compression parameters are used; if you need to override them, set the GZIP environment variable, e.g.:

 
$ GZIP=--best tar cfz archive.tar.gz subdir

Another way would be to avoid the ‘--gzip’ (‘--gunzip’, ‘--ungzip’, ‘-z’) option and run gzip explicitly:

 
$ tar cf - subdir | gzip --best -c - > archive.tar.gz
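The explicit pipe makes it easy to try different compression levels on the same input. The following sketch writes one archive with ‘--best’ and one with ‘--fast’ and prints both sizes; the sample data and file names are invented for the illustration. Some newer gzip releases may describe the GZIP variable as obsolescent, in which case the pipe form is likely the more durable of the two; check this against the gzip version you have.

```shell
#!/bin/sh
# Sketch: choose the gzip level by running gzip explicitly.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir subdir
# Generate some repetitive sample data so there is something to compress.
i=0
while [ "$i" -lt 200 ]; do
    echo "line $i of sample data"
    i=$((i + 1))
done > subdir/data.txt

tar -cf - subdir | gzip --best -c > best.tar.gz
tar -cf - subdir | gzip --fast -c > fast.tar.gz

# Verify both streams, then report their sizes in bytes.
gzip -t best.tar.gz
gzip -t fast.tar.gz
best_size=$(wc -c < best.tar.gz | tr -d ' ')
fast_size=$(wc -c < fast.tar.gz | tr -d ' ')
echo "best: $best_size bytes, fast: $fast_size bytes"
```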

About corrupted compressed archives: to achieve maximum compression, gzip'ed files carry no redundancy. The adaptive nature of the compression scheme means that the compression tables are implicitly spread all over the archive. If you lose a few blocks, the dynamic construction of the compression tables becomes unsynchronized, and there is little chance that you could recover later in the archive.

There are pending suggestions for having per-volume or per-file compression in GNU tar. This would allow for viewing the contents without decompression, and for resynchronizing decompression at every volume or file, in case of corrupted archives. Doing so, we might lose some compressibility. But this would make recovery easier. So, there are pros and cons. We'll see!
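Until such a feature exists, the effect of per-file compression can be approximated by hand: compress each file separately and store the results in an uncompressed archive. This is only a sketch of the idea, with invented file names, and not a feature of GNU tar.

```shell
#!/bin/sh
# Sketch: per-file compression done by hand, outside of tar.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir subdir
echo "alpha" > subdir/a.txt
echo "beta" > subdir/b.txt

# Compress every member on its own, then archive without compression.
gzip subdir/a.txt subdir/b.txt
tar -cf perfile.tar subdir

# The member list can be read without decompressing anything.
members=$(tar -tf perfile.tar)
echo "$members"

# A single member can be restored independently of the others.
restored=$(tar -xOf perfile.tar subdir/b.txt.gz | gzip -d -c)
echo "$restored"
```

The trade-off is the one described above: each member is compressed in isolation, so redundancy shared between files is not exploited.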

-J
--xz

Filter the archive through xz. Otherwise like ‘--gzip’.

-j
--bzip2

Filter the archive through bzip2. Otherwise like ‘--gzip’.

--lzma

Filter the archive through lzma. Otherwise like ‘--gzip’.

--lzop

Filter the archive through lzop. Otherwise like ‘--gzip’.

-Z
--compress
--uncompress

Filter the archive through compress. Otherwise like ‘--gzip’.

--use-compress-program=prog
-I prog

Use the external compression program prog. Use this option if you have a compression program that GNU tar does not support. The program prog must satisfy two requirements:

First, when called without options, it should read data from standard input, compress it and output it on standard output.

Secondly, if called with ‘-d’ argument, it should do exactly the opposite, i.e., read the compressed data from the standard input and produce uncompressed data on the standard output.

The ‘--use-compress-program’ option, in particular, lets you implement your own filters, not necessarily dealing with compression/decompression. For example, suppose you wish to implement PGP signing on top of compression, using gpg (see section `gpg -- encryption and signing tool' in GNU Privacy Guard Manual). The following script does that:

 
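#! /bin/sh
# Filter for tar: compress and sign when writing, unwrap when reading.
case $1 in
# tar passes -d when reading: unwrap the signed data, then decompress.
-d) gpg --decrypt - | gzip -d -c;;
# No argument when writing: compress, then sign.
'') gzip -c | gpg -s;;
*)  echo "Unknown option $1" >&2; exit 1;;
esac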

Suppose you name it ‘gpgz’ and save it somewhere in your PATH. Then the following command will create a compressed archive signed with your private key:

 
$ tar -cf foo.tar.gpgz -Igpgz .

Likewise, the command below will list its contents:

 
$ tar -tf foo.tar.gpgz -Igpgz .
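A filter that needs no key setup may be easier to experiment with. The sketch below writes a hypothetical wrapper named gz9 that satisfies the two requirements (no arguments: compress at the highest gzip level; ‘-d’: decompress) and uses it to create and list an archive. The wrapper name, its location and the sample files are assumptions made for this illustration.

```shell
#!/bin/sh
# Sketch: a minimal custom filter for --use-compress-program.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Write the hypothetical filter; it reads stdin and writes stdout.
cat > gz9 <<'EOF'
#! /bin/sh
case $1 in
-d) exec gzip -d -c ;;
'') exec gzip -9 -c ;;
*)  echo "Unknown option $1" >&2; exit 1 ;;
esac
EOF
chmod +x gz9

mkdir subdir
echo "payload" > subdir/file.txt

# Refer to the filter by absolute path, since it is not in PATH here.
tar -cf demo.tar.gz9 --use-compress-program="$workdir/gz9" subdir
members=$(tar -tf demo.tar.gz9 --use-compress-program="$workdir/gz9")
echo "$members"
```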


8.1.2 Archiving Sparse Files

Files in the file system occasionally have holes. A hole in a file is a section of the file's contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive longer than the original.

To have tar attempt to recognize the holes in a file, use ‘--sparse’ (‘-S’). When you use this option, then, for any file using less disk space than would be expected from its length, tar searches the file for consecutive stretches of zeros. It then records in the archive where those stretches are, and archives only the “real contents” of the file. On extraction (‘--sparse’ is not needed on extraction), any such files have holes created wherever the stretches of zeros were found. Thus, if you use ‘--sparse’, tar archives won't take more space than the original.

-S
--sparse

This option instructs tar to test each file for sparseness before attempting to archive it. If the file is found to be sparse, it is treated specially, which reduces the amount of space used by its image in the archive.

This option is meaningful only when creating or updating archives. It has no effect on extraction.

Consider using ‘--sparse’ when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use ‘--sparse’ while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). See section Using tar to Perform Incremental Dumps.

However, be aware that the ‘--sparse’ option has a serious drawback. Namely, in order to determine whether a file is sparse, tar has to read it before trying to archive it, so in total the file is read twice. So, always bear in mind that the time needed to process all files with this option is roughly twice the time needed to archive them without it.
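The effect of ‘--sparse’ can be observed with a small experiment. The sketch below creates a file consisting of a 1 MiB hole followed by a few bytes of data, archives it with and without ‘-S’, and prints both archive sizes. It assumes that the file system holding the scratch directory supports holes; where it does not, the two sizes may simply be equal. All file names are illustrative.

```shell
#!/bin/sh
# Sketch: compare archive sizes with and without --sparse.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Extend the file to 1 MiB without writing any data, then add a tail.
dd if=/dev/zero of=sparse.img bs=1024 count=0 seek=1024 2>/dev/null
printf 'tail data\n' >> sparse.img

tar -cf plain.tar sparse.img      # holes are archived as zeros
tar -cSf sparse.tar sparse.img    # holes are recorded, not stored

plain_size=$(wc -c < plain.tar | tr -d ' ')
sparse_size=$(wc -c < sparse.tar | tr -d ' ')
echo "without -S: $plain_size bytes, with -S: $sparse_size bytes"

# Extraction needs no special option and reproduces the same contents.
mkdir out
tar -xf sparse.tar -C out
cmp sparse.img out/sparse.img
```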

When using the ‘POSIX’ archive format, GNU tar is able to store sparse files in three distinct ways, called sparse formats. A sparse format is identified by its number, which consists, as usual, of two decimal numbers separated by a dot. By default, format ‘1.0’ is used. If, for some reason, you wish to use an earlier format, you can select it using the ‘--sparse-version’ option.

--sparse-version=version

Select the format to store sparse files in. Valid version values are: ‘0.0’, ‘0.1’ and ‘1.0’. See section Storing Sparse Files, for a detailed description of each format.

Using the ‘--sparse-version’ option implies ‘--sparse’.
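As an illustration, the sketch below selects sparse format ‘0.1’ for a ‘POSIX’ archive and checks that the file survives a round trip. The archive format is requested explicitly with ‘--format=posix’ on the assumption that your tar does not default to it; file names are invented.

```shell
#!/bin/sh
# Sketch: store a sparse file using an earlier sparse format.
set -eu
workdir=$(mktemp -d)              # illustrative scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# A 1 MiB hole followed by a short run of real data.
dd if=/dev/zero of=sparse.img bs=1024 count=0 seek=1024 2>/dev/null
printf 'tail data\n' >> sparse.img

# --sparse-version implies --sparse, so -S is not repeated here.
tar --format=posix --sparse-version=0.1 -cf sparse01.tar sparse.img

mkdir out
tar -xf sparse01.tar -C out
cmp sparse.img out/sparse.img && echo "round trip ok"
```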



This document was generated by Sergey Poznyakoff on March, 29 2009 using texi2html 1.78.