26.19 Charsets

Emacs defines most popular character sets (e.g., ascii, iso-8859-1, cp1250, big5, unicode) as charsets, along with a few charsets of its own (e.g., emacs, unicode-bmp, eight-bit). Every supported character belongs to one or more charsets. Usually you don't need to concern yourself with charsets, but knowing about them may help you understand the behavior of Emacs in some cases.

One example is font selection. Each language environment assigns different priorities to charsets. When looking for a font, Emacs first tries one that matches the charsets of higher priority. For instance, in the Japanese language environment, the charset japanese-jisx0208 has the highest priority (see Describe Language Environment), so Emacs tries to use a font whose registry property is “JISX0208.1983-0” for characters belonging to that charset.
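The effect of a charset priority list can be sketched outside Emacs. The following is a toy model in Python, not Emacs's actual font-selection algorithm: Python codecs stand in for Emacs charsets (euc_jp loosely playing the role of japanese-jisx0208), and both the `FONT_REGISTRY` table and the `preferred_charset` helper are hypothetical names introduced only for this illustration.

```python
# Toy model of priority-based charset choice; not Emacs's real algorithm.
# Python codecs stand in for Emacs charsets (an assumption for illustration).
FONT_REGISTRY = {          # hypothetical charset -> font registry table
    "euc_jp": "JISX0208.1983-0",
    "big5": "big5-0",
}

def preferred_charset(char, priority):
    """Return the first charset in `priority` that contains `char`, else None."""
    for charset in priority:
        try:
            char.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    return None

japanese_priority = ["euc_jp", "big5"]   # Japanese-like environment
chinese_priority = ["big5", "euc_jp"]    # Chinese-like environment

han = "\u4e2d"  # a Han character present in both character sets
print(FONT_REGISTRY[preferred_charset(han, japanese_priority)])
print(FONT_REGISTRY[preferred_charset(han, chinese_priority)])
```

The same character ends up with a different font registry depending only on the order of the priority list, which is the behavior the paragraph above describes.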

Another example is the use of the charset text property. When Emacs reads a file encoded in a coding system that uses escape sequences to switch charsets (e.g., iso-2022-int-1), the buffer text keeps the information about the original charset in the charset text property. Using this information, Emacs can write the file back with the same byte sequence as the original.
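To see what escape sequences that switch charsets look like, here is a small Python sketch. It uses Python's iso2022_jp codec as a stand-in, which is an assumption for illustration: it is a simpler relative of iso-2022-int-1 and has nothing to do with Emacs's own implementation.

```python
# Illustration with Python's iso2022_jp codec (not Emacs's iso-2022-int-1).
ESC = b"\x1b"

text = "abc \u65e5\u672c\u8a9e xyz"   # ASCII, then Japanese, then ASCII
encoded = text.encode("iso2022_jp")

# Escape sequences mark each switch between character sets.
print(encoded)
print(encoded.count(ESC))   # one switch into JIS X 0208, one back to ASCII

# Decoding and re-encoding reproduces the original bytes exactly.
roundtrip = encoded.decode("iso2022_jp").encode("iso2022_jp")
print(roundtrip == encoded)
```

In this simple case the round trip works without extra bookkeeping. A coding system such as iso-2022-int-1 can switch among many more charsets, and a character may belong to more than one of them, so Emacs needs the charset text property to remember which one the original file used.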

There are two commands for obtaining information about Emacs charsets. The command M-x list-charset-chars prompts for a charset name, and displays all the characters in that character set. The command M-x describe-character-set prompts for a charset name and displays information about that charset, including its internal representation within Emacs.

To display a list of all the supported charsets, type M-x list-character-sets. The list gives the names of charsets and additional information to identify each charset (see the International Register of Coded Character Sets at <http://www.itscj.ipsj.or.jp/ISO-IR/> for details). In the list, charsets are divided into two categories: normal charsets are listed first, followed by supplementary charsets. A supplementary charset is one that is used to define another charset (as a parent or a subset), or that was used only in older versions of Emacs.

To find out which charset a character in the buffer belongs to, put point before it and type C-u C-x =.