Character Encoding Record (PSPP)


1.14 Character Encoding Record

This record, if present, indicates the character encoding for string data, long variable names, variable labels, value labels and other strings in the file.

/* Header. */
int32               rec_type;
int32               subtype;
int32               size;
int32               count;

/* Exactly count bytes of data. */
char                encoding[];

int32 rec_type;
    Record type. Always set to 7.

int32 subtype;
    Record subtype. Always set to 20.

int32 size;
    The size of each element in the encoding member. Always set to 1.

int32 count;
    The total number of bytes in encoding.

char encoding[];
    The name of the character encoding. Normally this will be an official IANA character set name or alias. See http://www.iana.org/assignments/character-sets. Character set names are not case-sensitive, but SPSS appears to write them in all-uppercase.
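
As an illustration of the layout above, the following is a minimal sketch of reading this record from a memory buffer. It is not taken from PSPP: the names read_le32 and parse_encoding_record are hypothetical, the buffer is assumed to begin at rec_type, and integers are assumed to be little-endian (a real reader would take the byte order from elsewhere in the file).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode a 32-bit integer stored little-endian.
   Little-endian byte order is an assumption of this sketch. */
int32_t read_le32(const unsigned char *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Parse a character encoding record that starts at buf (len bytes available).
   On success, copy the encoding name into out as a NUL-terminated string
   and return 0.  Return -1 if the header values are unexpected, the data
   is truncated, or the name does not fit in out. */
int parse_encoding_record(const unsigned char *buf, size_t len,
                          char *out, size_t out_size)
{
    assert(buf != NULL && out != NULL);

    if (len < 16)
        return -1;                  /* The header is four int32 fields. */

    int32_t rec_type = read_le32(buf);
    int32_t subtype  = read_le32(buf + 4);
    int32_t size     = read_le32(buf + 8);
    int32_t count    = read_le32(buf + 12);

    if (rec_type != 7 || subtype != 20 || size != 1)
        return -1;                  /* Not a character encoding record. */
    if (count < 0 || (size_t)count > len - 16)
        return -1;                  /* Data would run past the buffer. */
    if ((size_t)count >= out_size)
        return -1;                  /* No room for the name plus NUL. */

    /* Exactly count bytes of data follow the header. */
    memcpy(out, buf + 16, (size_t)count);
    out[count] = '\0';
    return 0;
}
```

The name is copied without a terminator in the file, so the sketch adds its own NUL byte after count bytes.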

This record is not present in files generated by older software. See also the character_code field in the machine integer info record.

When the character encoding record and the machine integer info record are both present, all system files observed in practice indicate the same character encoding, e.g. 1252 as character_code and windows-1252 as encoding, 65001 and UTF-8, etc.
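
A reader that wants to detect the unusual case where the two records disagree could compare them along the following lines. This is only a sketch: the function names and the lookup table are hypothetical, and the table contains just the two pairs mentioned above rather than a complete mapping. Names are compared without regard to case, since character set names are not case-sensitive.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Compare two character set names, ignoring ASCII case.
   Returns 1 if they match, 0 otherwise. */
int charset_names_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Illustrative table only: the two pairs named in the text. */
struct code_name {
    int character_code;
    const char *name;
};

static const struct code_name known_pairs[] = {
    { 1252,  "windows-1252" },
    { 65001, "UTF-8" },
};

/* Returns 1 if character_code and encoding name the same character set,
   0 if they differ, and -1 if character_code is not in the table. */
int encodings_agree(int character_code, const char *encoding)
{
    size_t n = sizeof known_pairs / sizeof known_pairs[0];
    for (size_t i = 0; i < n; i++)
        if (known_pairs[i].character_code == character_code)
            return charset_names_equal(known_pairs[i].name, encoding);
    return -1;
}
```

Because the sketch does no alias resolution, an IANA alias for the same character set would be reported as a mismatch; a fuller version would normalize aliases before comparing.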

If, for testing purposes, a file is crafted so that character_code and encoding disagree, character_code appears to control the encoding of all strings in the system file before the dictionary termination record, including strings in data (e.g. string missing values), while encoding appears to control the encoding of strings that follow the dictionary termination record.