
22.1 Introduction to International Character Sets

The users of international character sets and scripts have established many more-or-less standard coding systems for storing files. These coding systems are typically multibyte, meaning that sequences of two or more bytes are used to represent individual non-ASCII characters.
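
To make "multibyte" concrete, here is a minimal sketch in Python (used only for illustration; it is not part of Emacs). It counts how many bytes a few characters occupy in several common encodings. The codec names are Python's own and do not necessarily match Emacs coding system names, and the helper `encoded_length` is hypothetical.

```python
# Illustration only: Python codecs stand in for coding systems here.
SAMPLES = [
    ("c", "utf-8"),            # ASCII letter
    ("\u00c0", "utf-8"),       # LATIN CAPITAL LETTER A WITH GRAVE
    ("\u00c0", "latin-1"),     # same character in a single-byte encoding
    ("\u3042", "utf-8"),       # HIRAGANA LETTER A
    ("\u3042", "shift_jis"),
    ("\u3042", "euc_jp"),
]


def encoded_length(char, codec):
    """Return how many bytes CHAR occupies when encoded with CODEC."""
    return len(char.encode(codec))


for char, codec in SAMPLES:
    print(f"U+{ord(char):04X} in {codec}: {encoded_length(char, codec)} byte(s)")
```

The ASCII letter takes one byte in every encoding shown, while the non-ASCII characters take two or three bytes in the multibyte encodings.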

Internally, Emacs uses its own multibyte character encoding, which is a superset of the Unicode standard. This internal encoding allows characters from almost every known script to be intermixed in a single buffer or string. Emacs translates between the multibyte character encoding and various other coding systems when reading and writing files, and when exchanging data with subprocesses.

The command C-h h (view-hello-file) displays the file etc/HELLO, which illustrates various scripts by showing how to say “hello” in many languages. If some characters can't be displayed on your terminal, they appear as ‘?’ or as hollow boxes (see Undisplayable Characters).

Keyboards, even in the countries where these character sets are used, generally don't have keys for all the characters in them. You can insert characters that your keyboard does not support by using C-q (quoted-insert) or C-x 8 <RET> (insert-char). See Inserting Text. Emacs also supports various input methods, typically one for each script or language, which make it easier to type characters in the script. See Input Methods.

The prefix key C-x <RET> is used for commands that pertain to multibyte characters, coding systems, and input methods.

The command C-x = (what-cursor-position) shows information about the character at point. In addition to the character position, which was described in Position Info, this command displays how the character is encoded. For instance, it displays the following line in the echo area for the character ‘c’:

     Char: c (99, #o143, #x63) point=28062 of 36168 (78%) column=53

The four values after ‘Char:’ describe the character that follows point, first by showing it and then by giving its character code in decimal, octal and hex. For a non-ASCII multibyte character, these are followed by ‘file’ and the character's representation, in hex, in the buffer's coding system, if that coding system encodes the character safely and with a single byte (see Coding Systems). If the character's encoding is longer than one byte, Emacs shows ‘file ...’.
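
The rules above can be sketched in Python (for illustration only; this is not how Emacs itself is implemented). The hypothetical function `char_summary` builds the parenthesized part of the line from a character and a Python codec name that stands in for the buffer's coding system. How a character that the codec cannot encode is handled is an assumption of this sketch: the ‘file’ part is simply omitted.

```python
def char_summary(char, codec):
    """Build a 'Char:' summary loosely modeled on the C-x = echo-area line."""
    code = ord(char)
    # Character code in decimal, octal and hex.
    summary = f"Char: {char} ({code}, #o{code:o}, #x{code:x}"
    if code > 127:  # non-ASCII: try to describe the external encoding
        try:
            encoded = char.encode(codec)
        except UnicodeEncodeError:
            encoded = None  # assumption: omit 'file' when not encodable
        if encoded is not None:
            if len(encoded) == 1:
                summary += f", file #x{encoded[0]:02X}"
            else:
                summary += ", file ..."  # encoding longer than one byte
    return summary + ")"


print(char_summary("c", "utf-8"))
print(char_summary("\u00c0", "latin-1"))
print(char_summary("\u00c0", "utf-8"))
```

For ‘c’ this yields the same parenthesized values as the example line above. The character À gets a single-byte ‘file’ code under Latin-1 but only ‘file ...’ under UTF-8, where its encoding is two bytes long.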

As a special case, if the character lies in the range 128 (0200 octal) through 159 (0237 octal), it stands for a “raw” byte that does not correspond to any specific displayable character. Such a “character” lies within the eight-bit-control character set, and is displayed as an escaped octal character code. In this case, C-x = shows ‘part of display ...’ instead of ‘file’.
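
As a small illustration of the escaped octal form, the following Python sketch (not Emacs code; the function name `display_raw_byte` is hypothetical) renders codes in the range described above as a backslash followed by the octal digits, and returns None for anything outside that range.

```python
def display_raw_byte(code):
    """Return an escaped-octal rendering for codes 128..159, else None."""
    if 128 <= code <= 159:
        return f"\\{code:o}"
    return None


print(display_raw_byte(128))  # lower end of the range (0200 octal)
print(display_raw_byte(159))  # upper end of the range (0237 octal)
```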

With a prefix argument (C-u C-x =), this command displays a detailed description of the character in a window.

Here's an example showing the Latin-1 character A with grave accent, in a buffer whose coding system is utf-8-unix:

                  position: 1 of 1 (0%), column: 0
                 character: À (displayed as À) (codepoint 192, #o300, #xc0)
         preferred charset: unicode (Unicode (ISO10646))
     code point in charset: 0xC0
                    syntax: w       which means: word
                  category: .:Base, L:Left-to-right (strong),
                            j:Japanese, l:Latin, v:Viet
               buffer code: #xC3 #x80
                 file code: not encodable by coding system undecided-unix
                   display: by this font (glyph code)
         xft:-unknown-DejaVu Sans Mono-normal-normal-
             normal-*-13-*-*-*-m-0-iso10646-1 (#x82)
     
     Character code properties: customize what to show
       name: LATIN CAPITAL LETTER A WITH GRAVE
       old-name: LATIN CAPITAL LETTER A GRAVE
       general-category: Lu (Letter, Uppercase)
       decomposition: (65 768) ('A' '`')
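
Several of the values in this description come from the Unicode character database and from the UTF-8 encoding of the character, so they can be cross-checked outside Emacs. The following Python sketch uses the standard `unicodedata` module; the function `describe` and its dictionary keys are hypothetical and chosen only to mirror the labels above.

```python
import unicodedata


def describe(char):
    """Collect a few properties comparable to the C-u C-x = description."""
    # unicodedata.decomposition returns hex code points separated by spaces,
    # optionally preceded by a tag such as '<compat>', which is skipped here.
    parts = unicodedata.decomposition(char).split()
    return {
        "codepoint": ord(char),
        "name": unicodedata.name(char),
        "general-category": unicodedata.category(char),
        "decomposition": [int(p, 16) for p in parts if not p.startswith("<")],
        "utf-8 bytes": [f"#x{b:02X}" for b in char.encode("utf-8")],
    }


info = describe("\u00c0")  # LATIN CAPITAL LETTER A WITH GRAVE
for key, value in info.items():
    print(f"{key}: {value}")
```

The decomposition (65 768) is the letter ‘A’ followed by the combining grave accent, and the two UTF-8 bytes match the ‘buffer code’ line in the example.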