Representation of Strings (The GNU C Library)

5.1 Representation of Strings

This section is a quick summary of string concepts for beginning C programmers. It describes how strings are represented in C and some common pitfalls. If you are already familiar with this material, you can skip this section.

A string is a null-terminated array of bytes of type char, including the terminating null byte. String-valued variables are usually declared to be pointers of type char *. Such variables do not include space for the contents of a string; that has to be stored somewhere else: in an array variable, a string constant, or dynamically allocated memory (see Allocating Storage For Program Data). It’s up to you to store the address of the chosen memory space into the pointer variable. Alternatively, you can store a null pointer in the pointer variable. The null pointer does not point anywhere, so attempting to access a string through it is an error; the behavior is undefined, and in practice the program usually crashes.
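As a minimal sketch of the point that a char * variable holds only an address, the following hypothetical helper (copy_string is an illustrative name, not a library function) allocates separate storage for the bytes and treats a null pointer as "no string":

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of SRC, or NULL if SRC is a null
   pointer or the allocation fails.  The caller must free the result.
   (Hypothetical helper, for illustration only.)  */
char *
copy_string (const char *src)
{
  if (src == NULL)              /* A null pointer points at no string.  */
    return NULL;

  size_t len = strlen (src);
  char *copy = malloc (len + 1);        /* +1 for the terminating null byte.  */
  if (copy != NULL)
    memcpy (copy, src, len + 1);        /* Copies the null byte too.  */
  return copy;
}
```

The pointer returned here refers to dynamically allocated memory; the same pointer type could equally hold the address of an array variable or of a string constant.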

A multibyte character is a sequence of one or more bytes that represents a single character using the locale’s encoding scheme; a null byte always represents the null character. A multibyte string is a string that consists entirely of multibyte characters. In contrast, a wide string is a null-terminated sequence of wchar_t objects. A wide-string variable is usually declared to be a pointer of type wchar_t *, by analogy with string variables and char *. See Introduction to Extended Characters.

By convention, the null byte, '\0', marks the end of a string and the null wide character, L'\0', marks the end of a wide string. For example, in testing to see whether the char * variable p points to a null byte marking the end of a string, you can write !*p or *p == '\0'.
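The end-of-string test described above can be sketched as a simple scanning loop. The function names count_bytes and count_wide are hypothetical and exist only for this illustration; in real code you would normally call strlen or wcslen instead.

```c
#include <stddef.h>
#include <wchar.h>

/* Count the bytes before the terminating null byte.  */
size_t
count_bytes (const char *p)
{
  size_t n = 0;
  while (*p != '\0')            /* Equivalently: while (*p)  */
    {
      n++;
      p++;
    }
  return n;
}

/* The same loop for a wide string, which ends at L'\0'.  */
size_t
count_wide (const wchar_t *p)
{
  size_t n = 0;
  while (*p != L'\0')
    {
      n++;
      p++;
    }
  return n;
}
```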

A null byte is quite different conceptually from a null pointer, although both are represented by the integer constant 0.

A string literal appears in C program source as a multibyte string between double-quote characters (‘"’). If the initial double-quote character is immediately preceded by a capital ‘L’ (ell) character (as in L"foo"), it is a wide string literal. String literals can also contribute to string concatenation: "a" "b" is the same as "ab". For wide strings one can use either L"a" L"b" or L"a" "b". Modification of string literals is not allowed by the GNU C compiler, because literals are placed in read-only storage.
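A small sketch of literal concatenation follows. The variable and function names are made up for this example; the point is only that adjacent literals are joined by the compiler into a single string.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Adjacent literals are joined at compile time.  */
static const char greeting[] = "Hello, " "world";

/* For wide strings, either every piece or only the first piece
   may carry the L prefix.  */
static const wchar_t wide_greeting[] = L"Hello, " L"world";
static const wchar_t mixed_greeting[] = L"Hello, " "world";

/* Return nonzero if the concatenated literals equal the
   corresponding single literals.  */
int
literals_match (void)
{
  return strcmp (greeting, "Hello, world") == 0
         && wcscmp (wide_greeting, L"Hello, world") == 0
         && wcscmp (mixed_greeting, L"Hello, world") == 0;
}

/* Size in bytes of the array initialized from the joined literal,
   including its terminating null byte.  */
size_t
greeting_size (void)
{
  return sizeof greeting;
}
```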

Arrays that are declared const cannot be modified either. It’s generally good style to declare non-modifiable string pointers to be of type const char *, since this often allows the C compiler to detect accidental modifications as well as providing some amount of documentation about what your program intends to do with the string.
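The following sketch contrasts a read-only view of a literal with modifiable storage. All names (default_name, upcase_in_place, upcased_default) are hypothetical. It relies on toupper in the default "C" locale, so only ASCII letters are affected.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A pointer to a literal, declared const so that the compiler
   rejects any attempt to write through it.  */
static const char *const default_name = "guest";

/* Upper-case a string in place.  The non-const parameter documents
   that the caller must pass modifiable storage, never a literal.  */
void
upcase_in_place (char *s)
{
  for (; *s != '\0'; s++)
    *s = (char) toupper ((unsigned char) *s);
}

/* Copy the default name into BUF (an array of SIZE bytes) and
   upper-case the copy.  Return 0 on success, or -1 if BUF is too
   small to hold the name plus its terminating null byte.  */
int
upcased_default (char *buf, size_t size)
{
  if (strlen (default_name) + 1 > size)
    return -1;
  strcpy (buf, default_name);   /* Safe: the size was checked above.  */
  upcase_in_place (buf);
  return 0;
}
```

Calling upcase_in_place (default_name) would draw a compiler diagnostic because it discards the const qualifier, which is exactly the kind of accidental modification the qualifier is meant to catch.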

The amount of memory allocated for a byte array may extend past the null byte that marks the end of the string that the array contains. In this document, the term allocated size is always used to refer to the total amount of memory allocated for an array, while the term length refers to the number of bytes up to (but not including) the terminating null byte. Wide strings are similar, except their sizes and lengths count wide characters, not bytes.
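To make the distinction between allocated size and length concrete, here is a small sketch using two arrays of 32 elements that each contain a five-character string. The array and function names are illustrative only.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

static char buffer[32] = "hello";
static wchar_t wide_buffer[32] = L"hello";

/* Allocated size of the byte array, in bytes.  */
size_t
buffer_allocated_size (void)
{
  return sizeof buffer;
}

/* Length of the string stored in it: bytes before the null byte.  */
size_t
buffer_length (void)
{
  return strlen (buffer);
}

/* Allocated size of the wide array, counted in wide characters.  */
size_t
wide_buffer_allocated_size (void)
{
  return sizeof wide_buffer / sizeof wide_buffer[0];
}

/* Length of the wide string, also counted in wide characters.  */
size_t
wide_buffer_length (void)
{
  return wcslen (wide_buffer);
}
```

Both arrays have an allocated size of 32 elements and hold a string of length 5; the remaining elements are zero-initialized and unused.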

A notorious source of program bugs is trying to put more bytes into a string than fit in its allocated size. When writing code that extends strings or moves bytes into a pre-allocated array, you should be very careful to keep track of the length of the string and make explicit checks for overflowing the array. Many of the library functions do not do this for you! Remember also that you need to allocate an extra byte to hold the null byte that marks the end of the string.
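One way to make the explicit overflow check described above is sketched below. The function append_checked is a hypothetical helper, not part of the library; it refuses to append when the result, including the terminating null byte, would not fit in the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Append SRC to the string in DST, an array of DST_SIZE bytes that
   already holds a null-terminated string.  Return 0 on success, or
   -1 (leaving DST unchanged) if the result would not fit.  */
int
append_checked (char *dst, size_t dst_size, const char *src)
{
  size_t dst_len = strlen (dst);
  size_t src_len = strlen (src);

  /* Room is needed for both strings plus one terminating null byte,
     that is, dst_len + src_len + 1 <= dst_size.  The test is written
     with a subtraction so that the arithmetic cannot wrap around.  */
  if (dst_len >= dst_size || src_len >= dst_size - dst_len)
    return -1;

  memcpy (dst + dst_len, src, src_len + 1);     /* Includes the null byte.  */
  return 0;
}
```

With an 8-byte array holding "abcde", appending "fg" succeeds (7 bytes plus the null byte), while appending "fgh" is rejected.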

Originally strings were sequences of bytes where each byte represented a single character. This is still true today if the strings are encoded using a single-byte character encoding. Things are different if the strings are encoded using a multibyte encoding (for more information on encodings see Introduction to Extended Characters). There is no difference in the programming interface for these two kinds of strings; the programmer has to be aware of this and interpret the byte sequences accordingly.

Because no separate interface takes care of these differences, the byte-based string functions are sometimes hard to use. The count parameters of these functions specify bytes, so a call to memcpy could cut a multibyte character in the middle and put an incomplete (and therefore unusable) byte sequence in the target buffer.
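The sketch below illustrates the problem and one possible workaround. It assumes the string is valid UTF-8, which is only one of many multibyte encodings; the helper names are hypothetical. In UTF-8, the letter "é" is the two bytes 0xC3 0xA9, so the string "caf\xC3\xA9" is five bytes long, and copying only its first four bytes would leave a dangling 0xC3.

```c
#include <stddef.h>
#include <string.h>

/* In UTF-8, continuation bytes have the bit pattern 10xxxxxx.  */
static int
is_utf8_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Return the largest count <= LIMIT such that the first count bytes
   of S (a valid UTF-8 string of LEN bytes) end on a character
   boundary.  */
size_t
utf8_safe_prefix (const char *s, size_t len, size_t limit)
{
  if (limit >= len)
    return len;
  /* If the byte just past the cut is a continuation byte, the cut
     falls inside a character, so back up.  */
  while (limit > 0 && is_utf8_continuation ((unsigned char) s[limit]))
    limit--;
  return limit;
}

/* Copy as much of SRC as fits in DST (DST_SIZE bytes, including the
   null byte) without splitting a character.  Return the number of
   bytes copied, not counting the null byte.  */
size_t
copy_utf8_truncated (char *dst, size_t dst_size, const char *src)
{
  if (dst_size == 0)
    return 0;
  size_t n = utf8_safe_prefix (src, strlen (src), dst_size - 1);
  memcpy (dst, src, n);
  dst[n] = '\0';
  return n;
}
```

Copying "caf\xC3\xA9" into a 5-byte array with this helper stores "caf" rather than "caf" followed by an orphaned lead byte. A locale-independent program would need the conversion functions for the locale's actual encoding instead of this UTF-8-specific test.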

To avoid these problems, later versions of the ISO C standard introduce a second set of functions that operate on wide characters (see Introduction to Extended Characters). These functions don’t have the problems the single-byte versions have, since every wide character is a legal, interpretable value. This does not mean that cutting wide strings at arbitrary points is without problems. It normally causes no trouble for alphabet-based languages (except for non-normalized text), but languages based on syllables still have the problem that more than one wide character is necessary to complete a logical unit. This is a higher-level problem which the C library functions are not designed to solve. But it is at least good that no invalid byte sequences can be created. Also, higher-level functions can operate much more easily on wide characters than on multibyte characters, so a common strategy is to use wide characters internally whenever text is more than simply copied.
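As an illustration of the remaining higher-level problem, the sketch below stores "é" in non-normalized form: the base letter e followed by U+0301 (combining acute accent). The names are hypothetical. Cutting this wide string after one element yields a perfectly valid wide string, but the accent is lost.

```c
#include <stddef.h>
#include <wchar.h>

/* One logical unit written as two wide characters: a base letter
   plus a combining accent.  */
static const wchar_t decomposed[] = L"e\u0301";

/* Length of the decomposed string, in wide characters.  */
size_t
decomposed_length (void)
{
  return wcslen (decomposed);
}

/* Copy at most COUNT wide characters of SRC into DST, an array of
   DST_COUNT wide characters, and null-terminate the result.  Return
   the number of wide characters copied.  */
size_t
wide_prefix (wchar_t *dst, size_t dst_count,
             const wchar_t *src, size_t count)
{
  if (dst_count == 0)
    return 0;
  size_t n = wcslen (src);
  if (n > count)
    n = count;
  if (n > dst_count - 1)
    n = dst_count - 1;
  wmemcpy (dst, src, n);
  dst[n] = L'\0';
  return n;
}
```

Taking a one-element prefix of the decomposed string gives L"e": no invalid sequence is created, yet the text no longer means the same thing.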

The remainder of this chapter discusses the functions for handling wide strings in parallel with the discussion of strings, since there is almost always an exact equivalent available.