Bytes vs. Characters (The GNU Awk User’s Guide)


11.2.7.1 Modern Character Sets

In the early days of computing, single bytes were used for storing characters. The most common character sets were ASCII and EBCDIC, which each provided all the English upper- and lowercase letters, the 10 Hindu-Arabic numerals from 0 through 9, and a number of other standard punctuation and control characters.

Today, the most popular character set in use is Unicode (of which ASCII is a pure subset). Unicode provides well over a hundred thousand unique characters (called code points) to cover most existing human languages (living and dead) and a number of nonhuman ones as well (such as Klingon and J.R.R. Tolkien’s elvish languages).

To save space in files, Unicode code points are stored in a variable-length encoding rather than as fixed-width values. UTF-8 is possibly the most popular of such multibyte encodings; in it, each character takes from one to four bytes in the file.
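As a rough illustration of the one-to-four-byte range, the following sketch maps a Unicode code point to the number of bytes UTF-8 uses for it. The function name utf8_bytes is hypothetical and chosen only for this example; it expects the code point as a decimal number and does not check that the value is a valid code point.

```sh
# utf8_bytes CODEPOINT
# Print how many bytes UTF-8 needs for the given code point (decimal).
# Hypothetical helper for illustration; no validity checking is done.
utf8_bytes() {
    awk -v cp="$1" 'BEGIN {
        if (cp < 128)            # U+0000 through U+007F (ASCII)
            n = 1
        else if (cp < 2048)      # U+0080 through U+07FF
            n = 2
        else if (cp < 65536)     # U+0800 through U+FFFF
            n = 3
        else                     # U+10000 and above
            n = 4
        print n
    }'
}

utf8_bytes 65        # "A",                 prints 1
utf8_bytes 233       # e with acute accent, prints 2
utf8_bytes 8364      # euro sign,           prints 3
utf8_bytes 128512    # an emoji (U+1F600),  prints 4
```

Because the thresholds are written in decimal and only numeric comparison is used, this should behave the same in any POSIX awk, regardless of locale.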

The POSIX standard requires that awk function in terms of characters, not bytes. Thus in gawk, length(), substr(), split(), match(), and the other string functions (see String-Manipulation Functions) all work in terms of characters in the local character set, and not in terms of bytes. (Not all awk implementations do so, though.)
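The difference can be observed from the shell. The sketch below assumes that gawk is installed and relies on its -b (--characters-as-bytes) command-line option, which is specific to gawk and applies to the whole run, so it does not let a single program obtain both counts. The variable names are chosen only for this example.

```sh
# An accented "e" given as its two UTF-8 bytes (octal escapes), so that
# this script does not depend on the encoding of the file it is saved in.
s=$(printf '\303\251')

# Character count: the result depends on the current locale.
chars=$(printf '%s\n' "$s" | gawk '{ print length($0) }')

# Byte count: -b makes gawk treat every byte as one character.
bytes=$(printf '%s\n' "$s" | gawk -b '{ print length($0) }')

echo "characters: $chars, bytes: $bytes"
```

In a UTF-8 locale, the character count should be 1 while the byte count is 2; in the C locale, both counts would be expected to be 2.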

There is no standard, built-in way to distinguish characters from bytes in an awk program. For an awk implementation of wc, which needs to make such a distinction, we will have to use an external extension.