1.5 ‘char *’ strings

The classical C strings, with their C library support standardized by ISO C and POSIX, can be used in internationalized programs with some precautions. The problem with this API is that many of the C library functions for strings do not work correctly on strings in locale encodings, leading to bugs that only people in some cultures of the world will experience.

The first problem with the C library API is the support of multibyte locales. Depending on the locale encoding, every character is in general represented by one or more bytes (up to 4 bytes in practice, but use MB_LEN_MAX instead of the number 4 in code). When every character is represented by only 1 byte, we speak of a “unibyte locale”, otherwise of a “multibyte locale”. It is important to realize that the majority of Unix installations nowadays use UTF-8 or GB18030 as locale encoding; therefore, the majority of users are using multibyte locales.

The important fact to remember is:

A ‘char’ is a byte, not a character.

As a consequence:
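
For instance, strlen returns the number of bytes of a string, not its number of characters. The following is a minimal sketch in standard C, not part of this library; the function name count_characters is hypothetical. It counts characters by decoding the string with mbrtowc according to the current LC_CTYPE locale, so the caller is assumed to have selected a locale with setlocale beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.
   Count the characters (not the bytes) of the multibyte string S,
   decoded according to the current LC_CTYPE locale.
   Return (size_t) -1 if S contains an invalid or incomplete sequence.  */
size_t
count_characters (const char *s)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  size_t remaining = strlen (s);      /* number of bytes, not characters */
  size_t count = 0;

  while (remaining > 0)
    {
      /* Decode one character; the wide character itself is not needed.  */
      size_t ret = mbrtowc (NULL, s, remaining, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;           /* invalid or incomplete sequence */
      if (ret == 0)
        break;                        /* NUL; not expected before the end */
      s += ret;
      remaining -= ret;
      count++;
    }
  return count;
}
```

In a UTF-8 locale, the word “été” occupies 5 bytes, so strlen reports 5, while this function would report 3 characters. Neither value is the number of screen columns the string occupies.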

Workarounds for these problems can be found in GNU gnulib, http://www.gnu.org/software/gnulib/.

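The second problem with the C library API is that it has some assumptions built in that are not valid in some languages: it assumes that every character has exactly two forms, uppercase and lowercase, and that changing the case of one character yields exactly one character. Neither holds universally; for example, the German sharp s (ß) becomes ‘SS’ when uppercased.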

The correct way to deal with this problem is

  1. to provide functions for titlecasing, as well as for upper- and lowercasing,
  2. to view case transformations as functions that operate on strings, rather than on characters.

This is implemented in this library, through the functions declared in <unicase.h>, see unicase.h.
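
To illustrate why case transformations need to operate on strings, here is a small self-contained sketch in C. It is not the API of <unicase.h>: the function upper_sketch is hypothetical, assumes UTF-8 input, and handles only ASCII letters and the German sharp s (U+00DF, bytes 0xC3 0x9F). The real functions cover the full Unicode repertoire and language-specific rules.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Uppercase the UTF-8 string S, handling just ASCII letters and
   U+00DF (sharp s), which becomes the two characters "SS".
   Return a freshly allocated string, or NULL if out of memory.  */
char *
upper_sketch (const char *s)
{
  assert (s != NULL);

  size_t n = strlen (s);
  /* The 2-byte sharp s becomes 2 bytes "SS" and every other byte is
     copied or replaced one for one, so N + 1 bytes always suffice.  */
  char *result = malloc (n + 1);
  if (result == NULL)
    return NULL;

  size_t i = 0;
  size_t j = 0;
  while (i < n)
    {
      unsigned char c = (unsigned char) s[i];
      if (c == 0xC3 && i + 1 < n && (unsigned char) s[i + 1] == 0x9F)
        {
          /* One input character yields two output characters.  */
          result[j++] = 'S';
          result[j++] = 'S';
          i += 2;
        }
      else if (c >= 'a' && c <= 'z')
        {
          result[j++] = (char) (c - 'a' + 'A');
          i++;
        }
      else
        {
          result[j++] = (char) c;     /* everything else is left as is */
          i++;
        }
    }
  result[j] = '\0';
  return result;
}
```

Applied to “Straße” (6 characters), this sketch returns “STRASSE” (7 characters). A function that maps one character to one character, such as toupper, cannot express this result.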