

16.1.1 The C string representation

The classical representation of a string in C is a sequence of characters, where each character takes up one or more bytes, followed by a terminating NUL byte. This representation is used for strings that are passed by the operating system (in the argv argument of main, for example) and for strings that are passed to the operating system (in system calls such as open). The C type to hold such strings is ‘char *’ or, in places where the string shall not be modified, ‘const char *’. There are many C library functions, standardized by ISO C and POSIX, that assume this representation of strings.

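A character encoding, or encoding for short, describes how the elements of a character set are represented as a sequence of bytes. For example, in the ASCII encoding, the UNDERSCORE character is represented by a single byte, with value 0x5F. As another example, the COPYRIGHT SIGN character is represented by the single byte 0xA9 in the ISO-8859-1 encoding, by the two bytes 0xC2 0xA9 in the UTF-8 encoding, and by the four bytes 0x81 0x30 0x84 0x38 in the GB18030 encoding.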

Note: The ‘char’ type may be signed or unsigned, depending on the platform. When we talk about the "byte 0xA9" we actually mean the char object whose value is (char) 0xA9; we omit the cast to char in this documentation, for brevity.
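As an illustration of this note, the following minimal sketch tests whether the first byte of a string is the byte 0xA9, in two ways that work regardless of whether ‘char’ is signed. The helper names are hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compare against (char) 0xA9, the char object
   that this documentation abbreviates as "byte 0xA9".  */
static bool
first_byte_is_a9_char (const char *s)
{
  assert (s != NULL);
  return s[0] == (char) 0xA9;
}

/* Hypothetical helper: convert the byte to unsigned char first, so
   that it can be compared with the plain integer 0xA9.  */
static bool
first_byte_is_a9_uchar (const char *s)
{
  assert (s != NULL);
  return (unsigned char) s[0] == 0xA9;
}
```

A plain comparison s[0] == 0xA9 would be false on platforms where ‘char’ is signed, because s[0] is then promoted to a negative int.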

In POSIX, the character encoding is determined by the locale. The locale is a set of environment settings that the user can choose, typically through environment variables such as LANG and LC_ALL.

Depending on the encoding, in general, every character is represented by one or more bytes (up to 4 bytes in practice – but use MB_LEN_MAX instead of the number 4 in the code). When every character is represented by only 1 byte, we speak of an “unibyte locale”, otherwise of a “multibyte locale”.
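The advice about MB_LEN_MAX can be sketched as follows. This is only an illustration, assuming the standard wcrtomb function from <wchar.h>; the helper name encode_wide_char is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character to its multibyte
   representation in the current locale.  BUF must have room for at
   least MB_LEN_MAX bytes, rather than a hard-coded 4.  Returns the
   number of bytes stored, or (size_t) -1 if WC cannot be represented.  */
static size_t
encode_wide_char (wchar_t wc, char *buf)
{
  mbstate_t state;

  assert (buf != NULL);
  memset (&state, 0, sizeof state);   /* start in the initial state */
  return wcrtomb (buf, wc, &state);
}
```

A caller would declare the buffer as char buf[MB_LEN_MAX]; and pass it to the helper. For an ASCII character such as L'A' the result is 1 byte.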

It is important to realize that the majority of Unix installations nowadays use UTF-8 as locale encoding; therefore, the majority of users are using multibyte locales.

Four important facts to remember are:

A ‘char’ is a byte, not a character.

As a consequence, functions that count or index bytes, such as strlen, do not count or index characters.
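To illustrate with strlen, the following minimal sketch compares the byte count of a string with its character count. It assumes that the string is valid UTF-8 and ignores the locale entirely; the helper name utf8_char_count is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters of a string that is
   assumed to be valid UTF-8.  Every character starts with exactly one
   byte that is not a continuation byte (10xxxxxx), so counting those
   bytes counts the characters.  */
static size_t
utf8_char_count (const char *s)
{
  size_t count = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      count++;
  return count;
}
```

For the COPYRIGHT SIGN in UTF-8, written "\xC2\xA9" in C source, strlen returns 2 while utf8_char_count returns 1.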

Multibyte does not imply UTF-8 encoding.

While UTF-8 is the most common multibyte encoding, GB18030 is also a supported locale encoding on GNU systems (mostly because it is a Chinese government standard, last revised in 2022).

Searching for a character in a string is not the same as searching for a byte in the string.

Take the above example of COPYRIGHT SIGN in the GB18030 encoding: A byte search will find the bytes '0' and '8' in this string. But a search for the character "0" or "8" in the string "©" must, of course, report “not found”.
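The false match can be reproduced with a minimal sketch. It assumes that the GB18030 representation of COPYRIGHT SIGN is the four bytes 0x81 0x30 0x84 0x38 and that the execution character set is ASCII-compatible; the helper name byte_offset_of is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* COPYRIGHT SIGN in the GB18030 encoding: a single character that
   occupies four bytes.  */
static const char copyright_gb18030[] = "\x81\x30\x84\x38";

/* Hypothetical helper: byte offset of the first byte equal to C in S,
   or -1 if there is none.  This is a byte search, not a character
   search.  */
static ptrdiff_t
byte_offset_of (const char *s, char c)
{
  const char *p;

  assert (s != NULL);
  p = strchr (s, c);
  return p != NULL ? p - s : -1;
}
```

Here byte_offset_of (copyright_gb18030, '0') returns 1 and byte_offset_of (copyright_gb18030, '8') returns 3, although the string contains neither the character "0" nor the character "8".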

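As a consequence, byte-oriented search functions such as strchr can report spurious matches in encodings such as GB18030.

Workarounds can be found in Gnulib, in the form of mbs* API functions such as mbschr and mbsstr.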

A C string can contain encoding errors.

Not every NUL-terminated byte sequence represents a valid multibyte string. Byte sequences can contain encoding errors, that is, bytes or byte sequences that are invalid and do not represent characters.

String functions like mbscasecmp and strcoll whose behavior depends on encoding have unspecified behavior on strings containing encoding errors, unless the behavior is specifically documented. If an application needs a particular behavior on these strings it can iterate through them itself, as described in the next subsection.
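As a rough sketch of such an iteration, the following function reports whether a string contains an encoding error with respect to the current locale. It uses only the standard mbrtowc function, assumes a locale encoding without shift states, and is not necessarily the approach that Gnulib recommends; the helper name has_encoding_error is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return true if S contains an invalid or
   incomplete multibyte sequence in the encoding of the current locale
   (as selected by setlocale).  */
static bool
has_encoding_error (const char *s)
{
  mbstate_t state;
  size_t remaining;

  assert (s != NULL);
  memset (&state, 0, sizeof state);   /* start in the initial state */
  remaining = strlen (s);
  while (remaining > 0)
    {
      size_t n = mbrtowc (NULL, s, remaining, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        return true;                  /* invalid or incomplete sequence */
      if (n == 0)
        break;                        /* decoded a null character */
      s += n;
      remaining -= n;
    }
  return false;
}
```

The result depends on the locale in effect: the same byte sequence can be valid in one encoding and an encoding error in another. A program would normally call setlocale (LC_ALL, "") before using such a function.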

