

1.3 Locale encodings

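A locale is a set of cultural conventions. According to POSIX, a program has, at any moment, exactly one locale designated as the “current locale”. (Actually, POSIX also supports one locale per thread, but this feature is not yet universally implemented and not widely used.) The locale is partitioned into several aspects, called the “categories” of the locale. The main categories are LC_CTYPE (character encoding and character properties), LC_COLLATE (sorting rules for text), LC_MESSAGES (language-specific translations of messages), LC_NUMERIC (formatting rules for numbers, such as the decimal separator), LC_MONETARY (formatting rules for amounts of money), and LC_TIME (formatting of date and time).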

In particular, the LC_CTYPE category of the current locale determines the character encoding. This is the encoding of ‘char *’ strings. We also call it the “locale encoding”. GNU libunistring has a function, locale_charset, that returns a standardized (platform-independent) name for this encoding.
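
As a rough illustration, the following C sketch queries the encoding of the LC_CTYPE category using only POSIX calls (setlocale and nl_langinfo). The helper name current_locale_codeset is hypothetical and not part of libunistring. Unlike locale_charset, nl_langinfo (CODESET) returns whatever spelling of the encoding name the platform uses, so the result may differ between systems for the same encoding.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of libunistring.
   Activates LOCALE_NAME for the LC_CTYPE category and returns the
   platform's own name for its character encoding, or NULL if the locale
   cannot be activated.  Passing "" selects the locale from the
   environment variables (LC_ALL, LC_CTYPE, LANG).  */
const char *
current_locale_codeset (const char *locale_name)
{
  /* setlocale (..., NULL) would only query the locale, not set it.  */
  assert (locale_name != NULL);

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return NULL;
  return nl_langinfo (CODESET);
}
```

A program would typically call current_locale_codeset ("") once at startup. The returned string may be overwritten by later calls to setlocale or nl_langinfo, so it should be copied if it is needed for longer.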

All locale encodings used on glibc systems are essentially ASCII compatible: Most graphic ASCII characters have the same representation, as a single byte, in that encoding as in ASCII.

Among the possible locale encodings are UTF-8 and GB18030. Both can represent any Unicode character as a sequence of bytes. UTF-8 is used in most of the world, whereas GB18030 is used in the People’s Republic of China, because it is backward compatible with the GB2312 encoding that was used there earlier.
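
To make “a sequence of bytes” concrete for UTF-8, here is a minimal sketch of the standard UTF-8 encoding rule in C. The function name encode_utf8 is hypothetical and is shown only for illustration; in a program that uses libunistring, the library's own u8_uctomb function from <unistr.h> serves this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libunistring.
   Stores the UTF-8 encoding of the Unicode character UC in BUF, which
   must have room for 4 bytes, and returns the number of bytes written.
   Returns 0 if UC is not a Unicode character, that is, if it is a
   surrogate code point or is larger than 0x10FFFF.  */
size_t
encode_utf8 (uint32_t uc, unsigned char *buf)
{
  assert (buf != NULL);

  if (uc < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      buf[0] = (unsigned char) uc;
      return 1;
    }
  if (uc < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (uc >> 6));
      buf[1] = (unsigned char) (0x80 | (uc & 0x3F));
      return 2;
    }
  if (uc < 0x10000)
    {
      if (uc >= 0xD800 && uc < 0xE000)
        return 0;               /* surrogates are not characters */
      buf[0] = (unsigned char) (0xE0 | (uc >> 12));
      buf[1] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (uc & 0x3F));
      return 3;
    }
  if (uc < 0x110000)
    {
      buf[0] = (unsigned char) (0xF0 | (uc >> 18));
      buf[1] = (unsigned char) (0x80 | ((uc >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (uc & 0x3F));
      return 4;
    }
  return 0;                     /* beyond the Unicode range */
}
```

For example, the Euro sign U+20AC is encoded as the three bytes 0xE2 0x82 0xAC, while the letter A (U+0041) remains the single byte 0x41.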

The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in some places, though.

UTF-16 and UTF-32 are not used as locale encodings, because they are not ASCII compatible.
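
One practical consequence of this incompatibility can be sketched in C: in UTF-16, every ASCII character is accompanied by a zero byte, and byte-oriented ‘char *’ functions treat a zero byte as the end of the string. The names hi_utf16le and apparent_length below are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The two characters "Hi" encoded as UTF-16LE, followed by a 16-bit
   terminator.  Each ASCII character occupies two bytes, one of which
   is zero.  */
static const char hi_utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

/* Hypothetical helper: the length that byte-oriented 'char *' functions
   see when handed BYTES.  It stops at the first zero byte.  */
size_t
apparent_length (const char *bytes)
{
  assert (bytes != NULL);
  return strlen (bytes);
}
```

Here apparent_length (hi_utf16le) yields 1 rather than 2, because the zero byte that belongs to the first character is taken as the string terminator. The same text in an ASCII compatible encoding, "Hi", has the expected length 2.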

