GNU tar 1.35: 8.3 Making tar Archives More Portable

8.3 Making `tar` Archives More Portable

Creating a tar archive on a particular system that is meant to be useful later on many other machines and with other versions of tar is more challenging than you might think. tar archive formats have been evolving since the first versions of Unix. Many such formats are around, and are not always compatible with each other. This section discusses a few problems, and gives some advice about making tar archives more portable.

One golden rule is simplicity. For example, limit your tar archives to contain only regular files and directories, avoiding other kind of special files. Do not attempt to save sparse files or contiguous files as such. Let’s discuss a few more problems, in turn.

8.3.1 Portable Names

Use portable file and member names. A name is portable if it contains only ASCII letters and digits, ‘/’, ‘.’, ‘_’, and ‘-’; it cannot be empty, start with ‘-’ or ‘//’, or contain ‘/-’. Avoid deep directory nesting. For portability to old Unix hosts, limit your file name components to 14 characters or less.

If you intend to have your tar archives to be read on case-insensitive file systems like FAT32, you should not rely on case distinction for file names.

8.3.2 Symbolic Links

Normally, when tar archives a symbolic link, it writes a block to the archive naming the target of the link. In that way, the tar archive is a faithful record of the file system contents. When ‘--dereference’ (‘-h’) is used with ‘--create’ (‘-c’), tar archives the files symbolic links point to, instead of the links themselves.

When creating portable archives, use ‘--dereference’ (‘-h’): some systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.

When reading from an archive, the ‘--dereference’ (‘-h’) option causes tar to follow an already-existing symbolic link when tar writes or reads a file named in the archive. Ordinarily, tar does not follow such a link, though it may remove the link before writing a new file. See section Options Controlling the Overwriting of Existing Files.

The ‘--dereference’ option is unsafe if an untrusted user can modify directories while tar is running. See section Security.

8.3.3 Hard Links

Normally, when tar archives a hard link, it writes a block to the archive naming the target of the link (a ‘1’ type block). In that way, the actual file contents is stored in file only once. For example, consider the following two files:

$ ls -l
-rw-r--r-- 2 gray staff 4 2007-10-30 15:11 one
-rw-r--r-- 2 gray staff 4 2007-10-30 15:11 jeden

Here, ‘jeden’ is a link to ‘one’. When archiving this directory with a verbose level 2, you will get an output similar to the following:

$ tar cvvf ../archive.tar .
drwxr-xr-x gray/staff        0 2007-10-30 15:13 ./
-rw-r--r-- gray/staff        4 2007-10-30 15:11 ./jeden
hrw-r--r-- gray/staff        0 2007-10-30 15:11 ./one link to ./jeden

The last line shows that, instead of storing two copies of the file, tar stored it only once, under the name ‘jeden’, and stored file ‘one’ as a hard link to this file.

It may be important to know that all hard links to the given file are stored in the archive. For example, this may be necessary for exact reproduction of the file system. The following option does that:

‘--check-links’
‘-l’: Check the number of links dumped for each processed file. If this number does not match the total number of hard links for the file, print a warning message.

For example, trying to archive only file ‘jeden’ with this option produces the following diagnostics:

$ tar -c -f ../archive.tar -l jeden
tar: Missing links to 'jeden'.

Although creating special records for hard links helps keep a faithful record of the file system contents and makes archives more compact, it may present some difficulties when extracting individual members from the archive. For example, trying to extract file ‘one’ from the archive created in previous examples produces, in the absence of file ‘jeden’:

$ tar xf archive.tar ./one
tar: ./one: Cannot hard link to './jeden': No such file or directory
tar: Error exit delayed from previous errors

The reason for this behavior is that tar cannot seek back in the archive to the previous member (in this case, ‘one’), to extract it(23). If you wish to avoid such problems at the cost of a bigger archive, use the following option:

‘--hard-dereference’: Dereference hard links and store the files they refer to.

For example, trying this option on our two sample files, we get two copies in the archive, each of which can then be extracted independently of the other:

$ tar -c -vv -f ../archive.tar --hard-dereference .
drwxr-xr-x gray/staff        0 2007-10-30 15:13 ./
-rw-r--r-- gray/staff        4 2007-10-30 15:11 ./jeden
-rw-r--r-- gray/staff        4 2007-10-30 15:11 ./one

8.3.4 Old V7 Archives

Certain old versions of tar cannot handle additional information recorded by newer tar programs. To create an archive in V7 format (not ANSI), which can be read by these old versions, specify the ‘--format=v7’ option in conjunction with the ‘--create’ (‘-c’) (tar also accepts ‘--portability’ or ‘--old-archive’ for this option). When you specify it, tar leaves out information about directories, pipes, fifos, contiguous files, and device files, and specifies file ownership by group and user IDs instead of group and user names.

When updating an archive, do not use ‘--format=v7’ unless the archive was created using this option.

In most cases, a new format archive can be read by an old tar program without serious trouble, so this option should seldom be needed. On the other hand, most modern tars are able to read old format archives, so it might be safer for you to always use ‘--format=v7’ for your distributions. Notice, however, that ‘ustar’ format is a better alternative, as it is free from many of ‘v7’’s drawbacks.

8.3.5 Ustar Archive Format

The archive format defined by the POSIX.1-1988 specification is called ustar. Although it is more flexible than the V7 format, it still has many restrictions (see section ustar, for the detailed description of ustar format). Along with V7 format, ustar format is a good choice for archives intended to be read with other implementations of tar.

To create an archive in ustar format, use the ‘--format=ustar’ option in conjunction with ‘--create’ (‘-c’).

8.3.6 GNU and old GNU `tar` format

GNU tar was based on an early draft of the POSIX 1003.1 ustar standard. GNU extensions to tar, such as the support for file names longer than 100 characters, use portions of the tar header record which were specified in that POSIX draft as unused. Subsequent changes in POSIX have allocated the same parts of the header record for other purposes. As a result, GNU tar format is incompatible with the current POSIX specification, and with tar programs that follow it.

In the majority of cases, tar will be configured to create this format by default. This will change in future releases, since we plan to make ‘POSIX’ format the default.

To force creation a GNU tar archive, use option ‘--format=gnu’.

8.3.7 GNU `tar` and POSIX `tar`

Starting from version 1.14 GNU tar features full support for POSIX.1-2001 archives.

A POSIX conformant archive will be created if tar was given ‘--format=posix’ (‘--format=pax’) option. No special option is required to read and extract from a POSIX archive.

8.3.7.1 Controlling Extended Header Keywords

‘--pax-option=keyword-list’: Handle keywords in PAX extended headers. This option is equivalent to ‘-o’ option of the pax utility.

Keyword-list is a comma-separated list of keyword options, each keyword option taking one of the following forms:

delete=pattern

When used with one of archive-creation commands, this option instructs tar to omit from extended header records that it produces any keywords matching the string pattern. If the pattern contains shell metacharacters like ‘*’, it should be quoted to prevent the shell from expanding the pattern before tar sees it.

When used in extract or list mode, this option instructs tar to ignore any keywords matching the given pattern in the extended header records. In both cases, matching is performed using the pattern matching notation described in POSIX 1003.2, 3.13 (see section Wildcards Patterns and Matching). For example:

--pax-option 'delete=security.*'

would suppress security-related information.

exthdr.name=string

This keyword allows user control over the name that is written into the ustar header blocks for the extended headers. The name is obtained from string after making the following substitutions:

Meta-character	Replaced By
%d	The directory name of the file, equivalent to the result of the `dirname` utility on the translated file name.
%f	The name of the file with the directory information stripped, equivalent to the result of the `basename` utility on the translated file name.
%p	The process ID of the `tar` process.
%%	A ‘`%`’ character.

Any other ‘%’ characters in string produce undefined results.

If no option ‘exthdr.name=string’ is specified, tar will use the following default value:

%d/PaxHeaders/%f

This default helps make the archive more reproducible. See section Making tar Archives More Reproducible. POSIX recommends using ‘%d/PaxHeaders.%p/%f’ instead, which means the two archives created with the same set of options and containing the same set of files will be byte-to-byte different. This default will be used if the environment variable POSIXLY_CORRECT is set.

exthdr.mtime=value

This keyword defines the value of the ‘mtime’ field that is written into the ustar header blocks for the extended headers. By default, the ‘mtime’ field is set to the modification time of the archive member described by that extended header (or to the value of the ‘--mtime’ option, if supplied).

globexthdr.name=string

This keyword allows user control over the name that is written into the ustar header blocks for global extended header records. The name is obtained from the contents of string, after making the following substitutions:

Meta-character	Replaced By
%n	An integer that represents the sequence number of the global extended header record in the archive, starting at 1.
%p	The process ID of the `tar` process.
%%	A ‘`%`’ character.

Any other ‘%’ characters in string produce undefined results.

If no option ‘globexthdr.name=string’ is specified, tar will use the following default value:

$TMPDIR/GlobalHead.%n

If the environment variable POSIXLY_CORRECT is set, the following value is used instead:

$TMPDIR/GlobalHead.%p.%n

In both cases, ‘$TMPDIR’ stands for the value of the TMPDIR environment variable. If TMPDIR is not set, tar uses ‘/tmp’.

globexthdr.mtime=value

This keyword defines the value of the ‘mtime’ field that is written into the ustar header blocks for the global extended headers. By default, the ‘mtime’ field is set to the time when tar was invoked.

keyword=value

When used with one of archive-creation commands, these keyword/value pairs will be included at the beginning of the archive in a global extended header record. When used with one of archive-reading commands, tar will behave as if it has encountered these keyword/value pairs at the beginning of the archive in a global extended header record.

keyword:=value

When used with one of archive-creation commands, these keyword/value pairs will be included as records at the beginning of an extended header for each file. This is effectively equivalent to keyword=value form except that it creates no global extended header records.

When used with one of archive-reading commands, tar will behave as if these keyword/value pairs were included as records at the end of each extended header; thus, they will override any global or file-specific extended header record keywords of the same names. For example, in the command:

tar --format=posix --create \
    --file archive --pax-option gname:=user .

the group name will be forced to a new value for all files stored in the archive.

In any of the forms described above, the value may be a string enclosed in curly braces. In that case, the string between the braces is understood either as a textual time representation, as described in Date input formats, or a name of the existing file, starting with ‘/’ or ‘.’. In the latter case, the modification time of that file is used.

For example, to set all modification times to the current date, you use the following option:

--pax-option 'mtime:={now}'

As another example, the following option helps make the archive more reproducible. See section Making tar Archives More Reproducible.

--pax-option delete=atime

If you extract files from such an archive and recreate the archive from them, you will also need to eliminate changes due to ctime:

--pax-option 'delete=atime,delete=ctime'

Normally tar saves an mtime value with subsecond resolution in an extended header for any file with a timestamp that is not on a one-second boundary. This is in addition to the traditional mtime timestamp in the header block. Although you can suppress subsecond timestamp resolution with ‘--pax-option delete=mtime’, this hack will not work for timestamps before 1970 or after 2242-03-16 12:56:31 UTC.

If the environment variable POSIXLY_CORRECT is set, two POSIX archives created using the same options on the same set of files might not be byte-to-byte equivalent even with the above options. This is because the POSIX default for extended header names includes the tar process ID, which typically differs at each run. To produce byte-to-byte equivalent archives in this case, either unset POSIXLY_CORRECT, or use the following option, which can be combined with the above options:

--pax-option exthdr.name=%d/PaxHeaders/%f

8.3.8 Checksumming Problems

SunOS and HP-UX tar fail to accept archives created using GNU tar and containing non-ASCII file names, that is, file names having characters with the eighth bit set, because they use signed checksums, while GNU tar uses unsigned checksums while creating archives, as per POSIX standards. On reading, GNU tar computes both checksums and accepts either of them. It is somewhat worrying that a lot of people may go around doing backup of their files using faulty (or at least non-standard) software, not learning about it until it’s time to restore their missing files with an incompatible file extractor, or vice versa.

GNU tar computes checksums both ways, and accepts either of them on read, so GNU tar can read Sun tapes even with their wrong checksums. GNU tar produces the standard checksum, however, raising incompatibilities with Sun. That is to say, GNU tar has not been modified to produce incorrect archives to be read by buggy tar’s. I’ve been told that more recent Sun tar now read standard archives, so maybe Sun did a similar patch, after all?

The story seems to be that when Sun first imported tar sources on their system, they recompiled it without realizing that the checksums were computed differently, because of a change in the default signing of char’s in their compiler. So they started computing checksums wrongly. When they later realized their mistake, they merely decided to stay compatible with it, and with themselves afterwards. Presumably, but I do not really know, HP-UX has chosen their tar archives to be compatible with Sun’s. The current standards do not favor Sun tar format. In any case, it now falls on the shoulders of SunOS and HP-UX users to get a tar able to read the good archives they receive.

8.3.9 Large or Negative Values

(This message will disappear, once this node revised.)

The above sections suggest to use ‘oldest possible’ archive format if in doubt. However, sometimes it is not possible. If you attempt to archive a file whose metadata cannot be represented using required format, GNU tar will print error message and ignore such a file. You will than have to switch to a format that is able to handle such values. The format summary table (see section Controlling the Archive Format) will help you to do so.

In particular, when trying to archive files 8 GiB or larger, or with timestamps not in the range 1970-01-01 00:00:00 through 2242-03-16 12:56:31 UTC, you will have to chose between GNU and POSIX archive formats. When considering which format to choose, bear in mind that the GNU format uses two’s-complement base-256 notation to store values that do not fit into standard ustar range. Such archives can generally be read only by a GNU tar implementation. Moreover, they sometimes cannot be correctly restored on another hosts even by GNU tar. For example, using two’s complement representation for negative time stamps that assumes a signed 32-bit time_t generates archives that are not portable to hosts with differing time_t representations.

On the other hand, POSIX archives, generally speaking, can be extracted by any tar implementation that understands older ustar format. The exceptions are files 8 GiB or larger, or files dated before 1970-01-01 00:00:00 or after 2242-03-16 12:56:31 UTC

8.3.10 How to Extract GNU-Specific Data Using Other `tar` Implementations

In previous sections you became acquainted with various quirks necessary to make your archives portable. Sometimes you may need to extract archives containing GNU-specific members using some third-party tar implementation or an older version of GNU tar. Of course your best bet is to have GNU tar installed, but if it is for some reason impossible, this section will explain how to cope without it.

When we speak about GNU-specific members we mean two classes of them: members split between the volumes of a multi-volume archive and sparse members. You will be able to always recover such members if the archive is in PAX format. In addition split members can be recovered from archives in old GNU format. The following subsections describe the required procedures in detail.

8.3.10.1 Extracting Members Split Between Volumes

If a member is split between several volumes of an old GNU format archive most third party tar implementation will fail to extract it. To extract it, use tarcat program (see section Concatenate Volumes into a Single Archive). This program is available from GNU tar home page. It concatenates several archive volumes into a single valid archive. For example, if you have three volumes named from ‘vol-1.tar’ to ‘vol-3.tar’, you can do the following to extract them using a third-party tar:

$ tarcat vol-1.tar vol-2.tar vol-3.tar | tar xf -

You could use this approach for most (although not all) PAX format archives as well. However, extracting split members from a PAX archive is a much easier task, because PAX volumes are constructed in such a way that each part of a split member is extracted to a different file by tar implementations that are not aware of GNU extensions. More specifically, the very first part retains its original name, and all subsequent parts are named using the pattern:

%d/GNUFileParts/%f.%n

where symbols preceded by ‘%’ are macro characters that have the following meaning:

Meta-character	Replaced By
%d	The directory name of the file, equivalent to the result of the `dirname` utility on its full name.
%f	The file name of the file, equivalent to the result of the `basename` utility on its full name.
%p	The process ID of the `tar` process that created the archive.
%n	Ordinal number of this particular part.

For example, if the file ‘var/longfile’ was split during archive creation between three volumes, then the member names will be:

var/longfile
var/GNUFileParts/longfile.1
var/GNUFileParts/longfile.2

When you extract your archive using a third-party tar, these files will be created on your disk, and the only thing you will need to do to restore your file in its original form is concatenate them in the proper order, for example:

$ cd var
$ cat GNUFileParts/longfile.1 \
  GNUFileParts/longfile.2 >> longfile
$ rm -f GNUFileParts

Notice, that if the tar implementation you use supports PAX format archives, it will probably emit warnings about unknown keywords during extraction. They will look like this:

Tar file too small
Unknown extended header keyword 'GNU.volume.filename' ignored.
Unknown extended header keyword 'GNU.volume.size' ignored.
Unknown extended header keyword 'GNU.volume.offset' ignored.

You can safely ignore these warnings.

If your tar implementation is not PAX-aware, you will get more warnings and more files generated on your disk, e.g.:

$ tar xf vol-1.tar
var/PaxHeaders/longfile: Unknown file type 'x', extracted as
normal file
Unexpected EOF in archive
$ tar xf vol-2.tar
tmp/GlobalHead.1: Unknown file type 'g', extracted as normal file
GNUFileParts/PaxHeaders/sparsefile.1: Unknown file type
'x', extracted as normal file

Ignore these warnings. The ‘PaxHeaders.*’ directories created will contain files with extended header keywords describing the extracted files. You can delete them, unless they describe sparse members. Read further to learn more about them.

8.3.10.2 Extracting Sparse Members

Any tar implementation will be able to extract sparse members from a PAX archive. However, the extracted files will be condensed, i.e., any zero blocks will be removed from them. When we restore such a condensed file to its original form, by adding zero blocks (or holes) back to their original locations, we call this process expanding a compressed sparse file.

To expand a file, you will need a simple auxiliary program called xsparse. It is available in source form from GNU tar home page.

Let’s begin with archive members in sparse format version 1.0(24), which are the easiest to expand. The condensed file will contain both file map and file data, so no additional data will be needed to restore it. If the original file name was ‘dir/name’, then the condensed file will be named ‘dir/GNUSparseFile.n/name’, where n is a decimal number(25).

To expand a version 1.0 file, run xsparse as follows:

$ xsparse ‘cond-file’

where ‘cond-file’ is the name of the condensed file. The utility will deduce the name for the resulting expanded file using the following algorithm:

If ‘cond-file’ does not contain any directories, ‘../cond-file’ will be used;
If ‘cond-file’ has the form ‘dir/t/name’, where both t and name are simple names, with no ‘/’ characters in them, the output file name will be ‘dir/name’.
Otherwise, if ‘cond-file’ has the form ‘dir/name’, the output file name will be ‘name’.

In the unlikely case when this algorithm does not suit your needs, you can explicitly specify output file name as a second argument to the command:

$ xsparse ‘cond-file’ ‘out-file’

It is often a good idea to run xsparse in dry run mode first. In this mode, the command does not actually expand the file, but verbosely lists all actions it would be taking to do so. The dry run mode is enabled by ‘-n’ command line argument:

$ xsparse -n /home/gray/GNUSparseFile.6058/sparsefile
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to
'/home/gray/sparsefile'
Finished dry run

To actually expand the file, you would run:

$ xsparse /home/gray/GNUSparseFile.6058/sparsefile

The program behaves the same way all UNIX utilities do: it will keep quiet unless it has something important to tell you (e.g. an error condition or something). If you wish it to produce verbose output, similar to that from the dry run mode, use ‘-v’ option:

$ xsparse -v /home/gray/GNUSparseFile.6058/sparsefile
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to
'/home/gray/sparsefile'
Done

Additionally, if your tar implementation has extracted the extended headers for this file, you can instruct xstar to use them in order to verify the integrity of the expanded file. The option ‘-x’ sets the name of the extended header file to use. Continuing our example:

$ xsparse -v -x /home/gray/PaxHeaders/sparsefile \
  /home/gray/GNUSparseFile/sparsefile
Reading extended header file
Found variable GNU.sparse.major = 1
Found variable GNU.sparse.minor = 0
Found variable GNU.sparse.name = sparsefile
Found variable GNU.sparse.realsize = 217481216
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to
'/home/gray/sparsefile'
Done

An extended header is a special tar archive header that precedes an archive member and contains a set of variables, describing the member properties that cannot be stored in the standard ustar header. While optional for expanding sparse version 1.0 members, the use of extended headers is mandatory when expanding sparse members in older sparse formats: v.0.0 and v.0.1 (The sparse formats are described in detail in Storing Sparse Files.) So, for these formats, the question is: how to obtain extended headers from the archive?

If you use a tar implementation that does not support PAX format, extended headers for each member will be extracted as a separate file. If we represent the member name as ‘dir/name’, then the extended header file will be named ‘dir/PaxHeaders/name’.

Things become more difficult if your tar implementation does support PAX headers, because in this case you will have to manually extract the headers. We recommend the following algorithm:

Consult the documentation of your tar implementation for an option that prints block numbers along with the archive listing (analogous to GNU tar’s ‘-R’ option). For example, star has ‘-block-number’.

Obtain verbose listing using the ‘block number’ option, and find block numbers of the sparse member in question and the member immediately following it. For example, running star on our archive we obtain:

$ star -t -v -block-number -f arc.tar
…
star: Unknown extended header keyword 'GNU.sparse.size' ignored.
star: Unknown extended header keyword 'GNU.sparse.numblocks' ignored.
star: Unknown extended header keyword 'GNU.sparse.name' ignored.
star: Unknown extended header keyword 'GNU.sparse.map' ignored.
block        56:  425984 -rw-r--r--  gray/users Jun 25 14:46 2006 GNUSparseFile.28124/sparsefile
block       897:   65391 -rw-r--r--  gray/users Jun 24 20:06 2006 README
…

(as usual, ignore the warnings about unknown keywords.)

Let size be the size of the sparse member, Bs be its block number and Bn be the block number of the next member. Compute:
```
N = Bs - Bn - size/512 - 2
```
This number gives the size of the extended header part in tar blocks. In our example, this formula gives: 897 - 56 - 425984 / 512 - 2 = 7.
Use dd to extract the headers:
```
dd if=archive of=hname bs=512 skip=Bs count=N
```
where archive is the archive name, hname is a name of the file to store the extended header in, Bs and N are computed in previous steps.

In our example, this command will be
```
$ dd if=arc.tar of=xhdr bs=512 skip=56 count=7
```

Finally, you can expand the condensed file, using the obtained header:

$ xsparse -v -x xhdr GNUSparseFile.6058/sparsefile
Reading extended header file
Found variable GNU.sparse.size = 217481216
Found variable GNU.sparse.numblocks = 208
Found variable GNU.sparse.name = sparsefile
Found variable GNU.sparse.map = 0,2048,1050624,2048,…
Expanding file 'GNUSparseFile.28124/sparsefile' to 'sparsefile'
Done

This document was generated on August 23, 2023 using texi2html 5.0.

8.3.10.1 Extracting Members Split Between Volumes		Members Split Between Volumes
8.3.10.2 Extracting Sparse Members		Sparse Members

8.3 Making tar Archives More Portable