Collating Elements vs. Characters (GNU Gnulib)

18.2.3 Collating Elements vs. Characters

POSIX generalizes the notion of a character to that of a collating element. It defines a collating element to be “a sequence of one or more bytes defined in the current collating sequence as a unit of collation.”

This generalizes the notion of a character in two ways. First, a single character can map to two or more collating elements. For example, the German “ß” collates as the collating element ‘s’ followed by another collating element ‘s’. Second, two or more characters can map to one collating element. For example, the Czech ‘ch’ is a single collating element that collates after ‘h’ and before ‘i’.
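The two cases above can be illustrated with a minimal Python sketch. It is a toy model only: the names `CZECH_ORDER`, `czech_elements`, `czech_key`, and `german_elements` are hypothetical, the collation table covers just the unaccented lowercase letters, and nothing here is taken from a real locale definition or from Gnulib itself.

```python
import string

# Toy Czech-like ordering (an assumption for illustration): the unaccented
# lowercase alphabet with the two-character element "ch" placed after "h".
CZECH_ORDER = list(string.ascii_lowercase)
CZECH_ORDER.insert(CZECH_ORDER.index("h") + 1, "ch")
CZECH_WEIGHT = {elem: weight for weight, elem in enumerate(CZECH_ORDER)}


def czech_elements(text):
    """Split text into collating elements, treating 'ch' as one unit."""
    elems = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            # Two characters map to a single collating element.
            elems.append("ch")
            i += 2
        else:
            elems.append(text[i])
            i += 1
    return elems


def czech_key(text):
    """Sort key: the weight of each collating element, in order.

    Raises KeyError for anything outside the toy table.
    """
    return [CZECH_WEIGHT[elem] for elem in czech_elements(text)]


def german_elements(text):
    """Expand each 'ß' into two 's' collating elements."""
    elems = []
    for ch in text:
        if ch == "ß":
            # One character maps to two collating elements.
            elems.extend(["s", "s"])
        else:
            elems.append(ch)
    return elems
```

With these definitions, `sorted(["i", "ch", "h", "c"], key=czech_key)` places ‘ch’ between ‘h’ and ‘i’, and `german_elements("fuß")` yields four collating elements for a three-character string. A real implementation would rely on the locale's collation data, for example through `strcoll` or `strxfrm`, rather than a hand-written table.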

Since POSIX’s “collating element” preserves the essential idea of a “character,” we use the latter, more familiar, term in this document.