23.19 Charsets

In Emacs, charset is short for “character set”. Emacs supports most popular charsets (such as ascii, iso-8859-1, cp1250, big5, and unicode), in addition to some charsets of its own (such as emacs, unicode-bmp, and eight-bit). All supported characters belong to one or more charsets.

Emacs normally does the right thing with respect to charsets, so that you don’t have to worry about them. However, it is sometimes helpful to know some of the underlying details about charsets.

One example is font selection (see Fonts). Each language environment (see Language Environments) defines a priority list for the various charsets. When searching for a font, Emacs initially attempts to find one that can display the highest-priority charsets. For instance, in the Japanese language environment, the charset japanese-jisx0208 has the highest priority, so Emacs tries to use a font whose registry property is ‘JISX0208.1983-0’.

There are two commands that can be used to obtain information about charsets. The command M-x list-charset-chars prompts for a charset name, and displays all the characters in that character set. The command M-x describe-character-set prompts for a charset name, and displays information about that charset, including its internal representation within Emacs.

M-x list-character-sets displays a list of all supported charsets. The list gives the names of charsets and additional information to identity each charset; for more details, see the ISO International Register of Coded Character Sets to be Used with Escape Sequences (ISO-IR) maintained by the Information Processing Society of Japan/Information Technology Standards Commission of Japan (IPSJ/ITSCJ). In this list, charsets are divided into two categories: normal charsets are listed first, followed by supplementary charsets. A supplementary charset is one that is used to define another charset (as a parent or a subset), or to provide backward-compatibility for older Emacs versions.

To find out which charset a character in the buffer belongs to, put point before it and type C-u C-x = (see Introduction to International Character Sets).