GNU tar 1.35: Storing Sparse Files

Storing Sparse Files

The notion of a sparse file and the ways of handling it from the point of view of a GNU tar user are described in detail in Archiving Sparse Files. This chapter describes the internal formats GNU tar uses to store such files.

The support for sparse files in GNU tar has a long history. The earliest version featuring this support that I was able to find was 1.09, released in November 1990. The format introduced back then is called the old GNU sparse format. Despite its many design flaws, it was the only format GNU tar supported until version 1.14 (May 2004), which introduced initial support for sparse files in PAX archives (see section GNU tar and POSIX tar). This format was not free from design flaws either, and it was subsequently improved in versions 1.15.2 (November 2005) and 1.15.92 (June 2006).

In addition to the GNU sparse formats, GNU tar is able to read and extract sparse files archived by star.

The following subsections describe each format in detail.

E.0.1 Old GNU Format

The format introduced in November 1990 (v. 1.09) was designed on top of standard ustar headers in such an unfortunate way that some of its fields overwrote fields required by POSIX.

An old GNU sparse header is designated by type ‘S’ (GNUTYPE_SPARSE) and has the following layout:

Offset	Size	Name	Data type	Contents
0	345		N/A	Not used.
345	12	atime	Number	`atime` of the file.
357	12	ctime	Number	`ctime` of the file.
369	12	offset	Number	For multivolume archives: the offset of the start of this volume.
381	4		N/A	Not used.
385	1		N/A	Not used.
386	96	sp	`sparse_header`	(4 entries) File map.
482	1	isextended	Bool	`1` if an extension sparse header follows, `0` otherwise.
483	12	realsize	Number	Real size of the file.

Each sparse_header object at offset 386 describes a single data chunk. It has the following structure:

Offset	Size	Data type	Contents
0	12	Number	Offset of the beginning of the chunk.
12	12	Number	Size of the chunk.

If the member contains more than four chunks, the isextended field of the header has the value 1 and the main header is followed by one or more extension headers. Each such header has the following structure:

Offset	Size	Name	Data type	Contents
0	504	sp	`sparse_header`	(21 entries) File map.
504	1	isextended	Bool	`1` if an extension sparse header follows, or `0` otherwise.

A header with isextended=0 ends the map.
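The layout above can be illustrated with a minimal Python sketch that walks a main header and any extension headers. It rests on several assumptions not stated in this section: numeric fields are NUL- or space-padded octal ASCII (the base-256 encoding GNU tar can use for large values is not handled), unused map slots are left blank, and the isextended flag counts as set when the byte is neither NUL nor the character `0`. The function names are hypothetical.

```python
BLOCK = 512   # size of one tar header block
ENTRY = 24    # one sparse_header: 12-byte offset + 12-byte size


def parse_number(field: bytes) -> int:
    """Decode a numeric field, assuming NUL/space-padded octal ASCII."""
    text = field.strip(b" \0")
    return int(text, 8) if text else 0


def parse_entries(block: bytes, start: int, count: int):
    """Read up to `count` sparse_header entries starting at `start`."""
    entries = []
    for i in range(count):
        pos = start + i * ENTRY
        raw_offset = block[pos:pos + 12]
        raw_size = block[pos + 12:pos + 24]
        # Assumption: a slot whose two fields are both blank is unused.
        if not raw_offset.strip(b" \0") and not raw_size.strip(b" \0"):
            break
        entries.append((parse_number(raw_offset), parse_number(raw_size)))
    return entries


def flag_set(byte: int) -> bool:
    """Assumption: the flag is set unless the byte is NUL or ASCII '0'."""
    return byte not in (0, ord("0"))


def read_old_gnu_sparse_map(blocks):
    """`blocks` holds the main header followed by its extension headers."""
    main = blocks[0]
    entries = parse_entries(main, 386, 4)      # sp: 4 entries at offset 386
    realsize = parse_number(main[483:495])     # realsize: 12 bytes at 483
    more = flag_set(main[482])                 # isextended in the main header
    index = 1
    while more:
        ext = blocks[index]
        entries += parse_entries(ext, 0, 21)   # sp: 21 entries at offset 0
        more = flag_set(ext[504])              # isextended at offset 504
        index += 1
    return entries, realsize
```

The function returns a list of (offset, size) pairs together with the real file size; the caller is expected to have read the 512-byte blocks from the archive already.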

E.0.2 PAX Format, Versions 0.0 and 0.1

There are two formats available in this branch. Version 0.0 is the initial version of the sparse format, used by tar versions 1.14–1.15.1. The sparse file map is kept in extended (x) PAX header variables:

GNU.sparse.size: Real size of the stored file;
GNU.sparse.numblocks: Number of blocks in the sparse map;
GNU.sparse.offset: Offset of the data block;
GNU.sparse.numbytes: Size of the data block.

The latter two variables repeat for each data block, so the overall structure is like this:

GNU.sparse.size=size
GNU.sparse.numblocks=numblocks
repeat numblocks times
  GNU.sparse.offset=offset
  GNU.sparse.numbytes=numbytes
end repeat
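As an illustration of this structure, the following Python sketch rebuilds the map from the header records. It assumes the extended header has already been decoded into an ordered list of (keyword, value) string pairs with decimal values; a plain dictionary would not do, because the offset and numbytes keywords repeat. The function name is hypothetical.

```python
def sparse_map_v00(records):
    """Rebuild a version 0.0 sparse map from ordered (keyword, value) pairs.

    Returns (real_size, [(offset, numbytes), ...]).
    """
    size = None
    numblocks = None
    offsets = []
    sizes = []
    for key, value in records:
        if key == "GNU.sparse.size":
            size = int(value)
        elif key == "GNU.sparse.numblocks":
            numblocks = int(value)
        elif key == "GNU.sparse.offset":
            offsets.append(int(value))
        elif key == "GNU.sparse.numbytes":
            sizes.append(int(value))
    # Every offset must be paired with a numbytes, numblocks times in total.
    if numblocks is None or len(offsets) != numblocks or len(sizes) != numblocks:
        raise ValueError("inconsistent GNU.sparse.* records")
    return size, list(zip(offsets, sizes))
```

For example, records describing two data blocks yield a real size and a two-element list of (offset, numbytes) pairs.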

This format presented the following two problems:

Whereas the POSIX specification allows a variable to appear multiple times in a header, it requires that only the last occurrence be meaningful. Thus, multiple occurrences of GNU.sparse.offset and GNU.sparse.numbytes conflict with the POSIX specification.
Attempting to extract such archives using a third-party tar results in extraction of sparse files in condensed form. If the tar implementation in question does not support the POSIX format, it will also extract a file containing the extension header attributes. This file can be used to expand the file to its original state. However, POSIX-aware tars will usually ignore the unknown variables, which makes restoring the file more difficult. See Extraction of sparse members in v.0.0 format, for a detailed description of how to restore such members using non-GNU tars.

GNU tar 1.15.2 introduced sparse format version 0.1, which attempted to solve these problems. Like its predecessor, this format stores the sparse map in the extended POSIX header. It retains the GNU.sparse.size and GNU.sparse.numblocks variables, but instead of GNU.sparse.offset/GNU.sparse.numbytes pairs it uses a single variable:

GNU.sparse.map: Map of non-null data chunks. It is a string consisting of comma-separated values "offset,size[,offset-1,size-1...]"
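A minimal Python sketch of decoding such a value is shown here. It assumes the numbers are decimal, as is usual for PAX header values, and the function name is hypothetical.

```python
def parse_sparse_map_v01(map_value):
    """Split a GNU.sparse.map value into a list of (offset, size) pairs."""
    numbers = [int(item) for item in map_value.split(",")] if map_value else []
    if len(numbers) % 2 != 0:
        raise ValueError("GNU.sparse.map must hold offset,size pairs")
    # Even positions are offsets, odd positions are the matching sizes.
    return list(zip(numbers[0::2], numbers[1::2]))
```

The number of pairs returned should agree with GNU.sparse.numblocks; a reader may want to check that as well.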

To address the second problem, the name field in the ustar header is replaced with a special name, constructed using the following pattern:

%d/GNUSparseFile.%p/%f

The real name of the sparse file is stored in the variable GNU.sparse.name. Thus, tar implementations that are not aware of GNU extensions will at least extract the files into separate directories, giving the user the possibility of expanding them afterwards. See Extraction of sparse members in v.0.1 format, for a detailed description of how to restore such members using non-GNU tars.
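The name pattern can be sketched in Python as follows. This section does not define the placeholders, so the sketch assumes the meanings used elsewhere in the GNU tar manual for header name templates: %d is the directory part of the member name, %p is the process ID of the tar process, and %f is the file name without its directory. The function name is hypothetical.

```python
import posixpath


def sparse_member_name(real_name, pid):
    """Expand %d/GNUSparseFile.%p/%f for a member name (assumed semantics)."""
    # Like the dirname utility, use "." when the name has no directory part.
    directory = posixpath.dirname(real_name) or "."
    base = posixpath.basename(real_name)
    return f"{directory}/GNUSparseFile.{pid}/{base}"
```

Under these assumptions, a member named dir/data.img archived by process 1234 would be stored as dir/GNUSparseFile.1234/data.img, with the original name kept in GNU.sparse.name.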

The resulting GNU.sparse.map string can be very long. Although POSIX does not impose any limit on the length of an x header variable, such long values may confuse some tars.

E.0.3 PAX Format, Version 1.0

Version 1.0 of the sparse format was introduced with GNU tar 1.15.92. Its main objective was to make the resulting file extractable with little effort even by non-POSIX-aware tar implementations. Starting from this version, the extended header preceding a sparse member always contains the following variables that identify the format being used:

GNU.sparse.major: Major version
GNU.sparse.minor: Minor version

The name field in the ustar header contains a special name, constructed using the following pattern:

%d/GNUSparseFile.%p/%f

The real name of the sparse file is stored in the variable GNU.sparse.name. The real size of the file is stored in the variable GNU.sparse.realsize.

The sparse map itself is stored in the file data block, preceding the actual file data. It consists of a series of decimal numbers delimited by newlines. The map is padded with nulls to the nearest block boundary.

The first number gives the number of entries in the map. It is followed by the map entries, each consisting of two numbers giving the offset and size of the data block it describes.
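To make the layout concrete, here is a minimal Python sketch that serializes a map in this form. It assumes the block size is the usual 512-byte tar block and that every number, including the last, ends with a newline; the function name is hypothetical.

```python
def encode_sparse_map_v10(entries, block_size=512):
    """Serialize (offset, size) pairs as a version 1.0 sparse map."""
    lines = [str(len(entries))]          # first number: entry count
    for offset, size in entries:
        lines.append(str(offset))        # then offset and size, one per line
        lines.append(str(size))
    data = ("\n".join(lines) + "\n").encode("ascii")
    # Pad with NULs up to the next block boundary (no padding if aligned).
    padding = -len(data) % block_size
    return data + b"\0" * padding
```

The file data for the described chunks would follow immediately after the returned bytes.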

The format is designed in such a way that non-POSIX-aware tars and tars not supporting GNU.sparse.* keywords will extract each sparse file in its condensed form with the file map prepended and will place it into a separate directory. A simple program can then expand the file to its original form even without GNU tar. See section Extracting Sparse Members, for detailed information on how to extract sparse members without GNU tar.
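A "simple program" of this kind might look like the Python sketch below. It is not the procedure from Extracting Sparse Members; it assumes a 512-byte block size, works entirely in memory (so it only suits small files), and expects the caller to supply the real size taken from GNU.sparse.realsize. The function name is hypothetical.

```python
def expand_sparse_v10(condensed, realsize, block_size=512):
    """Expand a condensed version 1.0 member (map + data) to full size."""
    pos = 0

    def next_number():
        # Read one newline-terminated decimal number from the map.
        nonlocal pos
        end = condensed.index(b"\n", pos)
        value = int(condensed[pos:end])
        pos = end + 1
        return value

    count = next_number()
    entries = []
    for _ in range(count):
        offset = next_number()
        size = next_number()
        entries.append((offset, size))

    # The data starts at the first block boundary at or after the map.
    src = -(-pos // block_size) * block_size
    out = bytearray(realsize)            # holes are left as zero bytes
    for offset, size in entries:
        if offset + size > realsize:
            raise ValueError("map entry extends past the real file size")
        out[offset:offset + size] = condensed[src:src + size]
        src += size
    return bytes(out)
```

A real tool would more likely seek in the output file and write each chunk in place, so that the holes stay unallocated on disk.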

This document was generated on August 23, 2023 using texi2html 5.0.