Emacs can convert unibyte text to multibyte; it can also convert multibyte text to unibyte, though this conversion loses information. In general these conversions happen when inserting text into a buffer, or when putting text from several strings together in one string. You can also explicitly convert a string's contents to either representation.
Emacs chooses the representation for a string based on the text that it is constructed from. The general rule is to convert unibyte text to multibyte text when combining it with other multibyte text, because the multibyte representation is more general and can hold whatever characters the unibyte text has.
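As a small illustration of this rule, the following sketch concatenates strings and checks the representation of the result with multibyte-string-p. It assumes that an ASCII-only string literal is read as a unibyte string; the helper name my-concat-is-multibyte-p is hypothetical, and string-to-multibyte is used only to obtain a multibyte operand.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-concat-is-multibyte-p (&rest strings)
  "Return non-nil if concatenating STRINGS yields a multibyte string."
  (multibyte-string-p (apply #'concat strings)))

;; Unibyte + unibyte: the result stays unibyte.
(my-concat-is-multibyte-p "abc" "def")                        ; => nil

;; Unibyte + multibyte: the result is multibyte.
(my-concat-is-multibyte-p "abc" (string-to-multibyte "def"))  ; => t
```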
When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
enable-multibyte-characters in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.
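The sketch below shows the buffer keeping its own representation when multibyte text is inserted. It is only an illustration: the helper name my-insert-into-unibyte-buffer is hypothetical, and the example deliberately inserts ASCII-only text so that no characters are lost in the conversion.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-insert-into-unibyte-buffer (text)
  "Insert TEXT into a temporary unibyte buffer.
Return a list (BUFFER-MULTIBYTE-P CONTENTS-MULTIBYTE-P CONTENTS)."
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; make the buffer unibyte
    (insert text)                       ; TEXT is converted to the buffer's representation
    (list enable-multibyte-characters
          (multibyte-string-p (buffer-string))
          (buffer-string))))

;; The inserted string is multibyte, but the buffer and its contents
;; remain unibyte.
(my-insert-into-unibyte-buffer (string-to-multibyte "abc"))
;; => (nil nil "abc")
```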
Converting unibyte text to multibyte text leaves ASCII characters
unchanged, and likewise character codes 128 through 159. It converts
the non-ASCII codes 160 through 255 by adding the value
nonascii-insert-offset to each character code. By setting this
variable, you specify which character set the unibyte characters
correspond to (see Character Sets). For example, if
nonascii-insert-offset is 2048, which is
(- (make-char 'latin-iso8859-1) 128), then the unibyte non-ASCII
characters correspond to Latin 1. If it is 2688, which is
(- (make-char 'greek-iso8859-7) 128), then they correspond to Greek
letters.
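The arithmetic of this rule can be modeled on plain integers. The following is a sketch of the rule as stated above, not Emacs's actual conversion routine; the function name my-model-unibyte-to-multibyte is hypothetical, and its offset argument stands in for nonascii-insert-offset.

```elisp
;; Hypothetical model of the rule described above; arithmetic only.
(defun my-model-unibyte-to-multibyte (code offset)
  "Return the multibyte code for unibyte CODE (0-255) under OFFSET.
OFFSET plays the role of `nonascii-insert-offset'."
  (if (and (>= code 160) (<= code 255))
      (+ code offset)   ; non-ASCII codes 160-255 are shifted by OFFSET
    code))              ; ASCII and codes 128-159 are left unchanged

;; With the Latin 1 offset of 2048:
(my-model-unibyte-to-multibyte ?A 2048)   ; => 65
(my-model-unibyte-to-multibyte 140 2048)  ; => 140
(my-model-unibyte-to-multibyte 233 2048)  ; => 2281
```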
Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code. If nonascii-insert-offset
has a reasonable value, corresponding to the beginning of some character
set, this conversion is the inverse of the other: converting unibyte
text to multibyte and back to unibyte reproduces the original unibyte
text.
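The round trip can be checked with the same kind of integer model. This sketch follows the two rules exactly as stated above and uses the Latin 1 offset 2048, which is a multiple of 256, so keeping the low 8 bits undoes the addition. The function name my-model-round-trip is hypothetical, and the code does not call Emacs's own conversion routines.

```elisp
;; Hypothetical model of the two conversions described above; arithmetic only.
(defun my-model-round-trip (code offset)
  "Convert unibyte CODE to multibyte and back, following the stated rules.
OFFSET plays the role of `nonascii-insert-offset'."
  (let ((multibyte (if (and (>= code 160) (<= code 255))
                       (+ code offset)  ; unibyte -> multibyte
                     code)))
    (logand multibyte 255)))            ; multibyte -> unibyte: low 8 bits

;; 233 becomes 2281, and the low 8 bits of 2281 are 233 again.
(my-model-round-trip 233 2048)          ; => 233
```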
The variable nonascii-insert-offset specifies the amount to add to a non-ASCII character when converting unibyte text to multibyte. It also applies when self-insert-command inserts a character in the unibyte non-ASCII range, 128 through 255. However, the functions insert and insert-char do not perform this conversion.
The right value to use to select character set cs is (- (make-char cs) 128). If the value of nonascii-insert-offset is zero, then conversion actually uses the value for the Latin 1 character set, rather than zero.
The variable nonascii-translation-table provides a more general alternative to nonascii-insert-offset. You can use it to specify independently how to translate each code in the range of 128 through 255 into a multibyte character. The value should be a char-table, or nil. If this is non-nil, it overrides nonascii-insert-offset.
The next three functions either return the argument string, or a newly created string with no text properties.
The function string-make-unibyte converts the text of string to unibyte representation, if it isn't already, and returns the result. If string is a unibyte string, it is returned unchanged. Multibyte character codes are converted to unibyte according to nonascii-translation-table or, if that is nil, using nonascii-insert-offset. If the lookup in the translation table fails, this function takes just the low 8 bits of each character.
The function string-make-multibyte converts the text of string to multibyte representation, if it isn't already, and returns the result. If string is a multibyte string or consists entirely of ASCII characters, it is returned unchanged. In particular, if string is unibyte and entirely ASCII, the returned string is unibyte. (When the characters are all ASCII, Emacs primitives will treat the string the same way whether it is unibyte or multibyte.) If string is unibyte and contains non-ASCII characters, the function unibyte-char-to-multibyte is used to convert each unibyte character to a multibyte character.
The function string-to-multibyte returns a multibyte string containing the same sequence of character codes as string. Unlike string-make-multibyte, this function unconditionally returns a multibyte string. If string is a multibyte string, it is returned unchanged.
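To compare the three conversion functions side by side, the sketch below applies string-make-unibyte, string-make-multibyte and string-to-multibyte to the same argument and reports the representation of each result. The helper name my-conversion-results is hypothetical, and the example assumes that the ASCII-only literal "abc" is read as a unibyte string. Newer Emacs releases may report string-make-unibyte and string-make-multibyte as obsolete.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-conversion-results (string)
  "Return the representation of STRING after each conversion.
The result is a list of three values, non-nil meaning multibyte, for
`string-make-unibyte', `string-make-multibyte' and
`string-to-multibyte', in that order."
  (mapcar #'multibyte-string-p
          (list (string-make-unibyte string)
                (string-make-multibyte string)
                (string-to-multibyte string))))

;; For a unibyte, all-ASCII string, only `string-to-multibyte'
;; produces a multibyte result.
(my-conversion-results "abc")           ; => (nil nil t)
```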