System File Format (PSPP)

1 System File Format

An SPSS system file holds a set of cases and dictionary information that describes how they may be interpreted. The system file format dates back more than 40 years and has evolved greatly over that time to support new features, but in a way that facilitates interchange between even the oldest and newest versions of the software. This chapter describes the system file format.

System files use four data types: 8-bit characters, 32-bit integers, 64-bit integers, and 64-bit floating-point numbers, called here char, int32, int64, and flt64, respectively. Data is not necessarily aligned on a word or double-word boundary: the long variable names record (see Long Variable Names Record) and the very long string record (see Very Long String Record) have arbitrary byte lengths and can therefore cause all data that follows them in the file to be misaligned.

Integer data in system files may be big-endian or little-endian. A reader may detect the endianness of a system file by examining layout_code in the file header record (see layout_code).
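The endianness check described above can be sketched as follows. This is a minimal illustration, not PSPP's implementation: it assumes the caller has already extracted the 4 raw bytes of layout_code from the header, and it assumes the expected values are 2 (the usual one) or 3, which come from the description of the file header record rather than from this section. The function name is hypothetical.

```python
import struct

# Assumption: a valid layout_code decodes to 2 (or, rarely, 3) when read
# with the correct byte order. See the file header record description.
EXPECTED_LAYOUT_CODES = (2, 3)


def detect_integer_endianness(layout_code_bytes: bytes) -> str:
    """Return "little" or "big" given the 4 raw bytes of layout_code."""
    if len(layout_code_bytes) != 4:
        raise ValueError("layout_code is a 4-byte int32")
    # Try each byte order and keep the one that yields an expected value.
    for name, fmt in (("little", "<i"), ("big", ">i")):
        (value,) = struct.unpack(fmt, layout_code_bytes)
        if value in EXPECTED_LAYOUT_CODES:
            return name
    raise ValueError("unrecognized layout_code; possibly not a system file")
```

Because 2 and 3 are small values, their byte-swapped forms are very large integers, so the two interpretations cannot both match.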

Floating-point data in system files may nominally be in IEEE 754, IBM, or VAX formats. A reader may detect the floating-point format in use by examining bias in the file header record (see bias).

PSPP detects big-endian and little-endian integer formats in system files and translates as necessary. PSPP also detects the floating-point format in use, as well as the endianness of IEEE 754 floating-point numbers, and translates as needed. However, only IEEE 754 numbers with the same endianness as integer data in the same file have actually been observed in system files, and it is likely that other formats are obsolete or were never used.
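A reader that handles only the IEEE 754 case could detect the byte order of floating-point data along these lines. The sketch assumes that bias ordinarily has the value 100, as stated in the description of the file header record, and that the caller supplies the 8 raw bytes of the field. The function name and the returned labels are hypothetical, and IBM and VAX formats are deliberately reported as unknown rather than decoded.

```python
import struct

# Assumption: bias is ordinarily 100.0 (see the file header record).
ASSUMED_BIAS = 100.0


def detect_float_format(bias_bytes: bytes) -> str:
    """Classify the 8 raw bytes of bias as little- or big-endian IEEE 754."""
    if len(bias_bytes) != 8:
        raise ValueError("bias is an 8-byte flt64")
    for name, fmt in (("ieee-little", "<d"), ("ieee-big", ">d")):
        (value,) = struct.unpack(fmt, bias_bytes)
        if value == ASSUMED_BIAS:
            return name
    # Not IEEE 754 with the assumed bias: IBM, VAX, or an unusual bias value.
    return "unknown"
```

A file whose bias differs from 100 would also be reported as unknown here, so a production reader would likely fall back to the integer endianness in that case.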

System files use a few floating-point values for special purposes:

SYSMIS: The system-missing value is represented by the most negative finite number in the floating-point format (-DBL_MAX).
HIGHEST: HIGHEST is used as the high end of a missing value range with an unbounded maximum. It is represented by the largest finite positive number (DBL_MAX).
LOWEST: LOWEST is used as the low end of a missing value range with an unbounded minimum. It was originally represented by the second most negative finite number (in IEEE 754 format, 0xffeffffffffffffe). System files written by SPSS 21 and later instead use the most negative finite number (-DBL_MAX), the same value as SYSMIS. This does not lead to ambiguity because LOWEST appears in system files only in missing value ranges, which never contain SYSMIS.
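For IEEE 754 files, the three special values can be written down concretely. The sketch below uses Python's sys.float_info.max as the equivalent of DBL_MAX. The constant and helper names are hypothetical, and is_lowest_bound is meaningful only when applied to the low end of a missing value range, since elsewhere -DBL_MAX means SYSMIS.

```python
import struct
import sys

SYSMIS = -sys.float_info.max   # -DBL_MAX
HIGHEST = sys.float_info.max   # DBL_MAX
# Original LOWEST encoding: the bit pattern given in the text above.
(LOWEST_LEGACY,) = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))


def be_hex(value: float) -> str:
    """Big-endian IEEE 754 bit pattern of value, as a hex string."""
    return struct.pack(">d", value).hex()


def is_lowest_bound(low_end: float) -> bool:
    """True if the low end of a missing value range denotes LOWEST.

    Accepts both the original encoding and the SPSS 21+ encoding,
    which coincides with SYSMIS.
    """
    return low_end == LOWEST_LEGACY or low_end == SYSMIS
```

Accepting both encodings on input keeps a reader compatible with files written before and after SPSS 21.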

System files may use most character encodings based on an 8-bit unit. UTF-16 and UTF-32, based on wider units, appear to be unacceptable. rec_type in the file header record is sufficient to distinguish between ASCII-based and EBCDIC-based encodings. The best way to determine the specific encoding in use is to consult the character encoding record (see Character Encoding Record), if present, and, failing that, the character_code in the machine integer info record (see Machine Integer Info Record). The same encoding should be used for the dictionary and the data in the file, although it is possible to artificially synthesize files that use different encodings (see Character Encoding Record).
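The order of preference described above might be expressed as follows. This is only a sketch. The function name, the fallback encoding, and the small character_code table are assumptions made for illustration. The table lists a few values described for the machine integer info record and is not complete.

```python
from typing import Optional

# Illustrative subset only (assumption): character_code values mapped to
# Python codec names. See the machine integer info record for details.
ILLUSTRATIVE_CODE_PAGES = {
    2: "ascii",
    1252: "cp1252",
    28591: "iso-8859-1",
    65001: "utf-8",
}


def choose_encoding(
    encoding_record: Optional[str],
    character_code: Optional[int],
    fallback: str = "cp1252",  # hypothetical default, not mandated by the format
) -> str:
    """Pick an encoding name using the order of preference in the text."""
    # 1. The character encoding record, if present, takes precedence.
    if encoding_record:
        return encoding_record.strip()
    # 2. Otherwise use character_code when it is a known value.
    if character_code in ILLUSTRATIVE_CODE_PAGES:
        return ILLUSTRATIVE_CODE_PAGES[character_code]
    # 3. Neither source is usable: fall back to a caller-chosen default.
    return fallback
```

For example, choose_encoding("UTF-8", 1252) honors the encoding record and ignores the conflicting character_code.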