In-memory representation (GNU libunistring)

1.4 Choice of in-memory representation of strings

There are three ways of representing strings in the memory of a running program.

* As ‘char *’ strings. Such strings are represented in locale encoding. This approach is employed when not much text processing is done by the program. When some Unicode-aware processing is to be done, a string is converted to Unicode on the fly and back to locale encoding afterwards.
* As UTF-8 or UTF-16 or UTF-32 strings. This implies that conversion from locale encoding to Unicode is performed on input, and in the opposite direction on output. This approach is employed when the program does a significant amount of text processing, or when the program has multiple threads operating on the same data but in different locales.
* As ‘wchar_t *’, a.k.a. “wide strings”. This approach is misguided; see “The wchar_t mess”.
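
As an illustration of the second approach, the following is a minimal sketch of the conversion performed on input, from locale encoding to UTF-32. It deliberately uses only the standard C11 function mbrtoc32 from <uchar.h> rather than GNU libunistring, so that it compiles without the library; libunistring offers its own conversion functions in <uniconv.h> (for example u8_strconv_from_locale), which are likely the better choice in real code. The helper name locale_to_u32 is hypothetical, and the sketch assumes an implementation where char32_t values are UTF-32 code points (that is, one that defines __STDC_UTF_32__).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper, not part of libunistring.
   Decodes the NUL-terminated string IN, which is in the encoding of the
   current locale, into at most CAP code points stored in OUT.
   Returns the number of code points written, or (size_t) -1 if the input
   is invalid or truncated, or if OUT is too small.
   Assumption: char32_t values are UTF-32 (__STDC_UTF_32__ is defined). */
size_t locale_to_u32(const char *in, uint32_t *out, size_t cap)
{
    mbstate_t state;
    size_t remaining = strlen(in);
    size_t count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    while (remaining > 0) {
        char32_t c;
        size_t used = mbrtoc32(&c, in, remaining, &state);

        /* (size_t) -1: invalid sequence; (size_t) -2: incomplete sequence;
           (size_t) -3: output without consuming input, which this sketch
           does not support and treats as an error. */
        if (used == (size_t) -1 || used == (size_t) -2 || used == (size_t) -3)
            return (size_t) -1;
        if (used == 0)                 /* decoded a NUL character: stop */
            break;
        if (count == cap)              /* no room left in OUT */
            return (size_t) -1;

        out[count++] = c;
        in += used;
        remaining -= used;
    }
    return count;
}
```

After such a conversion, all further processing works on uint32_t elements and no longer depends on the locale; the opposite conversion is applied only when the text is written out.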

Of course, a ‘char *’ string can, in some cases, be encoded in UTF-8. Choose the data type according to what you can guarantee about the encoding: if a string is encoded in the locale encoding, or if you don’t know how it’s encoded, use ‘char *’. If, on the other hand, you can guarantee that it is UTF-8 encoded, then you can use the UTF-8 string type, uint8_t *, for it.
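
One way to obtain such a guarantee is to check the bytes before they are given the UTF-8 type. The following is a minimal sketch under that assumption; the names is_valid_utf8 and utf8_from_checked are hypothetical and are not part of the library. GNU libunistring provides its own well-formedness check, u8_check in <unistr.h>, which would normally be used instead of a hand-written one. The bytes are copied with memcpy, so no pointer cast between char * and uint8_t * is needed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if the N bytes at S are well-formed UTF-8. */
bool is_valid_utf8(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        size_t len;
        uint32_t cp;

        if (b < 0x80) { i++; continue; }                    /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { len = 3; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
        else return false;                                  /* invalid lead byte */

        if (n - i < len)
            return false;                                   /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                               /* not a continuation byte */
            cp = (cp << 6) | (uint32_t) (s[i + k] & 0x3F);
        }
        if (len == 3 && cp < 0x800)
            return false;                                   /* overlong encoding */
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
            return false;                                   /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                                   /* surrogate code point */
        i += len;
    }
    return true;
}

/* Hypothetical helper: return a freshly allocated, NUL-terminated uint8_t
   copy of S if S is well-formed UTF-8, or NULL otherwise (also on
   allocation failure). The caller frees the result. */
uint8_t *utf8_from_checked(const char *s)
{
    size_t n = strlen(s);
    uint8_t *copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n + 1);            /* byte copy, no pointer cast involved */
    if (!is_valid_utf8(copy, n)) {
        free(copy);
        return NULL;
    }
    return copy;
}
```

A string whose encoding is unknown stays a ‘char *’; only the result of a successful check is carried around as uint8_t *.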

The five types char *, uint8_t *, uint16_t *, uint32_t *, and wchar_t * are incompatible types at the C level. Therefore, ‘gcc -Wall’ will produce a warning if, by mistake, your code contains a mismatch between these types. In the context of using GNU libunistring, even a warning about a mismatch between char * and uint8_t * is a sign of a bug in your code that you should not try to silence through a cast.
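
The following minimal sketch shows the kind of mismatch meant here. The function utf8_count_code_points and the array greeting are hypothetical names chosen for illustration. The call that is commented out passes a ‘char *’ where a uint8_t * is expected; with GCC this is typically reported by the -Wpointer-sign warning, which -Wall enables for C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function that accepts only UTF-8 typed strings.
   It counts every byte that is not a continuation byte (10xxxxxx),
   which equals the number of code points in well-formed UTF-8. */
size_t utf8_count_code_points(const uint8_t *s)
{
    size_t count = 0;
    for (; *s != 0; s++)
        if ((*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* "hé" in UTF-8, stored with the UTF-8 string element type. */
static const uint8_t greeting[] = { 'h', 0xC3, 0xA9, 0 };

size_t count_greeting(void)
{
    /* Correct: the argument has type const uint8_t *. */
    return utf8_count_code_points(greeting);

    /* Mismatch, shown commented out: a string in locale encoding passed
       where UTF-8 is expected. The compiler warning points at a real
       problem, namely a missing conversion.

       const char *locale_text = "hello";
       utf8_count_code_points(locale_text);
    */
}
```

The appropriate fix for the commented-out call is to convert or validate the string first, so that a genuine uint8_t * is available, rather than to add a cast.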