
6 Internationalisation

Internationalisation in pspp is complicated. The most troublesome aspect is character encoding. This chapter describes the problems and the ways in which they are currently addressed.

6.1 The working locales

Pspp has three “working” locales: the user interface locale, the output locale and the data locale.

Each of these locales may, at different times, take separate (or identical) values. For example, a French statistician can use pspp to prepare a report in the English language, using a data file which was created by a Japanese researcher and hence uses a Japanese character set.

It’s rarely, if ever, necessary to interrogate the system to find out the values of the three locales. However, it’s important to be aware of the source (destination) locale when reading (writing) string data. When transferring data between a source and a destination, the appropriate recoding must be performed.

6.1.1 The user interface locale

This is the locale which is visible to the person using pspp. Error messages and confidence indications are written in this locale. For example “Cannot open file” will be written in the user interface locale.

This locale is set from the environment of the user who starts pspp (or psppire), or from the system locale if the environment does not set one.
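As a rough illustration of how a program picks up a locale from the user's environment, the sketch below uses only the standard setlocale and nl_langinfo calls. The helper name ui_codeset is hypothetical and is not part of pspp; the fallback to the "C" locale is likewise an assumption made for the example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of pspp: adopt the locale named by the
   user's environment (LANG, LC_ALL, ...) and return the name of its
   character set. */
static const char *
ui_codeset (void)
{
  /* An empty locale name means "take the locale from the environment".
     If the environment names a locale that is not installed, setlocale
     returns NULL; this sketch then falls back to the "C" locale. */
  if (setlocale (LC_ALL, "") == NULL)
    setlocale (LC_ALL, "C");

  const char *codeset = nl_langinfo (CODESET);
  assert (codeset != NULL);
  return codeset;
}
```

The returned name depends on the platform and environment (a UTF-8 locale typically reports "UTF-8"), so callers should treat it as an opaque encoding name rather than compare it against fixed strings.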

6.1.2 The output locale

This locale is the one that should be visible to the person reading a report generated by pspp. Strings not related to the data (e.g. “Page number”, “Standard Deviation”) will appear in this locale.

6.1.3 The data locale

This locale is the one associated with the data being analysed with pspp. The only important aspect of this locale is the character encoding.(1) The dictionary pertaining to the data contains a field denoting the encoding. Any string data stored in a union value will be encoded in the dictionary’s character set.

6.2 System files

*.sav files contain a field which is supposed to identify the encoding of the data they contain (see Machine Integer Info Record). However, many files produced by early versions of SPSS set this to “2” (ASCII) regardless of the encoding of the data. Later versions contain an additional record (see Character Encoding Record) describing the encoding. When a system file is read, the dictionary’s encoding is set using information gleaned from the system file. If the encoding cannot be determined, or would be unreliable, then it remains unset.
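The decision described above might be sketched as follows. This is not pspp's actual reader code: the function name choose_sysfile_encoding, its parameters, and in particular the rule that any character code above 4 is treated as a code page number and rendered as "CP<n>" are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of pspp.  ENCODING_RECORD is the text of
   the character encoding record, or NULL if the file has none.
   CHARACTER_CODE is the value from the machine integer info record.
   Returns an encoding name, or NULL if the encoding should stay unset. */
static const char *
choose_sysfile_encoding (const char *encoding_record, int character_code,
                         char *buf, size_t buf_size)
{
  assert (buf != NULL && buf_size > 0);

  /* A character encoding record, when present, names the encoding
     outright, so it takes priority. */
  if (encoding_record != NULL && strlen (encoding_record) > 0)
    return encoding_record;

  /* Many early files claim "2" (ASCII) whatever they contain, so this
     value is treated as unreliable. */
  if (character_code == 2)
    return NULL;

  /* Assumption for illustration only: larger values are code page
     numbers. */
  if (character_code > 4)
    {
      snprintf (buf, buf_size, "CP%d", character_code);
      return buf;
    }

  /* Anything else: leave the encoding unset. */
  return NULL;
}
```

Returning NULL rather than guessing mirrors the rule that an undeterminable or unreliable encoding remains unset.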

6.3 GUI

The psppire graphical user interface is written using the GTK+ API. All strings passed to GTK+/GLib library functions (except for filenames) must be encoded in UTF-8, otherwise errors will occur. Thus, for the purposes of programming psppire, the user interface locale should be assumed to be UTF-8, even if setlocale and/or nl_langinfo indicate otherwise.

6.3.1 Filenames

The GLib API has some special functions for dealing with filenames. Strings returned from functions like gtk_file_chooser_get_filename are not, in general, encoded in UTF-8, but in “filename” encoding. If such a filename is passed to another GLib function which expects a filename, no conversion is necessary. If it is passed to a function for the purpose of displaying it (e.g. in a window’s title bar), it must first be converted to UTF-8; there are special functions for this: g_filename_display_name and g_filename_display_basename. If, however, a filename needs to be passed outside of GTK+/GLib (for example to fopen), it must be converted to the local system encoding.

6.4 Existing locale handling functions

The major aspect of locale handling which the programmer has to consider is that of character encoding.

The following function is used to recode strings:

Function: char * recode_string (const char *to, const char *from, const char *text, int len);

Converts the string text, which is encoded in from, to a new string encoded in to. If len is not -1, then it must be the number of bytes in text. It is the caller’s responsibility to free the returned string when it is no longer required.
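To make the contract above concrete, here is a minimal sketch of a function with the same shape, built on the POSIX iconv interface. It is not pspp's implementation of recode_string: the name recode_sketch, the fixed output-buffer estimate, and the choice to return NULL on any conversion error are all assumptions of this example. It also assumes a byte-oriented target encoding, since it appends a single NUL byte.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for recode_string.  Returns a newly allocated,
   NUL-terminated string which the caller must free, or NULL on failure. */
static char *
recode_sketch (const char *to, const char *from, const char *text, int len)
{
  assert (to != NULL && from != NULL && text != NULL);

  /* A length of -1 means TEXT is NUL-terminated. */
  size_t in_left = len == -1 ? strlen (text) : (size_t) len;

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Assumption: four output bytes per input byte, plus some slack, is
     ample for common encodings. */
  size_t out_size = in_left * 4 + 16;
  char *out = malloc (out_size);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *in_p = (char *) text;   /* iconv does not modify the input bytes. */
  char *out_p = out;
  size_t out_left = out_size - 1;   /* Reserve room for the terminator. */

  size_t rc = iconv (cd, &in_p, &in_left, &out_p, &out_left);
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &out_p, &out_left);   /* Flush shift state. */
  iconv_close (cd);

  if (rc == (size_t) -1)
    {
      free (out);
      return NULL;
    }

  *out_p = '\0';
  return out;
}
```

A caller would write, for instance, char *s = recode_sketch ("UTF-8", "ISO-8859-1", label, -1); and later free (s), checking for NULL in between.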

In order to minimise the number of conversions required, and to simplify design, PSPP attempts to store all internal strings in UTF-8 encoding. Thus, when reading system and portable files (or any other data source), the following items are immediately converted to UTF-8 encoding:

Conversely, when writing system files, these are converted back to the encoding of that system file.

String data stored in union values are left in their original encoding. These will be converted by the data_in/data_out functions.

6.5 Quirks

For historical reasons, not all locale handling follows POSIX conventions. This makes it difficult (impossible?) to handle the issues elegantly. For example, it would make sense for the GUI’s datasheet to display numbers formatted according to the LC_NUMERIC category of the data locale. Instead, however, the data_out function (see Obtaining Properties of Format Types) takes its decimal separator from the settings_get_decimal_char function rather than from the locale. Similarly, monetary values are formatted in a pspp/spss-specific fashion instead of according to the LC_MONETARY category.
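The difference between the two approaches can be sketched in plain C. The helper below is hypothetical and is not pspp's data_out: it formats a number using a separator supplied by the application (standing in for whatever settings_get_decimal_char would return) and overrides the one that the LC_NUMERIC category would otherwise provide. It assumes the locale's own separator is a single byte.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of pspp.  Formats X with DECIMALS
   fraction digits into BUF, using DECIMAL_CHAR as the separator instead
   of the one belonging to the current LC_NUMERIC locale. */
static void
format_with_decimal_char (double x, int decimals, char decimal_char,
                          char *buf, size_t buf_size)
{
  assert (buf != NULL && buf_size > 0);
  assert (decimal_char == '.' || decimal_char == ',');

  /* snprintf uses the separator of the current LC_NUMERIC locale. */
  snprintf (buf, buf_size, "%.*f", decimals, x);

  /* Replace the locale's separator with the application's choice.  Only
     single-byte separators are handled in this sketch. */
  const char *locale_point = localeconv ()->decimal_point;
  if (strlen (locale_point) == 1)
    {
      char *p = strchr (buf, locale_point[0]);
      if (p != NULL)
        *p = decimal_char;
    }
}
```

With this design the output is the same whatever locale the process runs in, which is convenient for reproducible reports but means the POSIX locale categories are bypassed.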



(1) It might also be desirable for the LC_COLLATE category to be used for the purposes of sorting data.
