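The wchar_t type

The ISO C and POSIX standard creators made an attempt to overcome the
dead end regarding the char type. They introduced the wchar_t type,
wide-string functions declared in <wchar.h>, and character
classification functions declared in <wctype.h> that were meant to
supplant the ones in <ctype.h>.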
Unfortunately, this API and its implementations have numerous problems:
On some platforms (notably Windows), wchar_t is a 16-bit type. This
means that it cannot accommodate every Unicode character. Either the
wchar_t * strings are limited to characters in UCS-2 (the “Basic
Multilingual Plane” of Unicode), or, if wchar_t * strings are encoded
in UTF-16, a wchar_t represents only half of a character in the worst
case, making the <wctype.h> functions pointless.
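To make the "half of a character" case concrete, here is a minimal sketch of UTF-16 encoding. The helper name utf16_encode is hypothetical and not part of any library; it only illustrates that a code point outside the Basic Multilingual Plane needs two 16-bit units, so neither unit alone is a character that a <wctype.h> function could classify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of 16-bit units written to OUT (1 or 2).
   Hypothetical helper, for illustration only.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  /* Precondition: CP is a Unicode scalar value (not a surrogate).  */
  assert (cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  if (cp < 0x10000)
    {
      /* Basic Multilingual Plane: fits in a single unit.  */
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Supplementary planes: split the 20-bit offset into two halves.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

With a 16-bit wchar_t, U+20AC (EURO SIGN) occupies one element of a wide string, whereas U+1F600 occupies two surrogate elements.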
On some platforms, the wchar_t encoding is locale dependent and
undocumented. This means that if you want to know any property of a
wchar_t character other than the properties defined by <wctype.h>
(such as whether it’s a dash, currency symbol, paragraph separator,
or similar), you have to convert it to char * encoding first, by use
of the function wctomb.
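The conversion step mentioned above might look like the following sketch. The wrapper name wide_to_multibyte is an assumption made for illustration; wctomb and MB_LEN_MAX are standard C. The result is in the encoding of the current locale, so it varies with the locale in effect.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert one wide character to the current locale's multibyte
   encoding.  BUF must have room for MB_LEN_MAX bytes.
   Returns the number of bytes stored, or -1 if WC cannot be
   represented in the current locale.
   Hypothetical wrapper around the standard wctomb.  */
int
wide_to_multibyte (wchar_t wc, char buf[MB_LEN_MAX])
{
  assert (buf != NULL);

  /* A null first argument resets wctomb's internal shift state.  */
  (void) wctomb (NULL, L'\0');

  return wctomb (buf, wc);
}
```

Note that wctomb keeps hidden state and is therefore not thread-safe; wcrtomb with an explicit mbstate_t is the usual alternative.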
When you read a stream of wide characters through the functions
fgetwc and fgetws, and the input stream/file is not in the expected
encoding, you have no way to determine the invalid byte sequence and
take corrective action. If you use these functions, your program
becomes “garbage in, more garbage out” or “garbage in, abort”.
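The following sketch shows how little information a caller gets from fgetwc. The function name count_wide_chars is hypothetical. On an encoding error, fgetwc returns WEOF and sets errno to EILSEQ, but the bytes that caused the failure are not handed back, so the loop can only stop.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Count the wide characters readable from FP.
   Sets *ENCODING_ERROR to 1 if reading stopped because of an
   invalid byte sequence, to 0 otherwise.
   Hypothetical helper, for illustration only.  */
long
count_wide_chars (FILE *fp, int *encoding_error)
{
  long n = 0;

  assert (fp != NULL && encoding_error != NULL);
  *encoding_error = 0;
  errno = 0;

  while (fgetwc (fp) != WEOF)
    n++;

  /* WEOF means end of file or an error.  EILSEQ identifies an
     encoding error, but not the offending bytes.  */
  if (ferror (fp) && errno == EILSEQ)
    *encoding_error = 1;

  return n;
}
```

A caller that sees *encoding_error set can report the failure or give up, but has no portable way to inspect or skip the malformed input.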
As a consequence, it is better to use multibyte strings. Such multibyte
strings can bypass limitations of the wchar_t type, if you use
functions defined in Gnulib and GNU libunistring for text processing.
They can also faithfully transport malformed characters that were
present in the input, without requiring the program to produce garbage
or abort.
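As a rough illustration of processing a multibyte string without losing malformed input, the sketch below walks a char buffer with the standard mbrtowc function and treats each undecodable byte as an opaque unit of its own. It does not use the Gnulib or libunistring APIs; the function name mb_count_units is hypothetical, and the one-byte recovery policy is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the units in the multibyte string S of LEN bytes, where a
   unit is either one validly encoded character or one byte that
   could not be decoded.  Stores the number of undecodable bytes in
   *INVALID.  The input bytes themselves are never altered.
   Hypothetical helper, for illustration only.  */
size_t
mb_count_units (const char *s, size_t len, size_t *invalid)
{
  mbstate_t state;
  size_t units = 0;

  assert (s != NULL && invalid != NULL);
  *invalid = 0;
  memset (&state, 0, sizeof state);

  while (len > 0)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, s, len, &state);

      if (r == (size_t) -1 || r == (size_t) -2)
        {
          /* Invalid or truncated sequence: keep one byte as an
             opaque unit and restart from the initial state.  */
          (*invalid)++;
          r = 1;
          memset (&state, 0, sizeof state);
        }
      else if (r == 0)
        r = 1;  /* a null character; assumed to occupy one byte */

      s += r;
      len -= r;
      units++;
    }

  return units;
}
```

Because the buffer stays a plain char array, the malformed bytes remain available to be copied to the output or reported precisely.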