Kawa: Characters

Characters

Characters are objects that represent human-readable characters such as letters and digits. More precisely, a character represents a Unicode scalar value. Each character has an integer value in the range 0 to #x10FFFF (excluding the range #xD800 to #xDFFF used for Surrogate Code Points).

Note: Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that’s sensitive to surrounding characters). Furthermore, different sequences of scalar values sometimes correspond to the same character. The relationships among scalar values, characters, and glyphs are subtle and complex.

Despite this complexity, most things that a literate human would call a “character” can be represented by a single Unicode scalar value (although several sequences of Unicode scalar values may represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category.

Unicode scalar values exclude the range #xD800 to #xDFFF, which is part of the range of Unicode code points. However, the Unicode code points in this range, the so-called surrogates, are an artifact of the UTF-16 encoding. They can only appear in specific Unicode encodings, and even then only in pairs that encode scalar values. Consequently, all characters represent code points, but the surrogate code points do not have representations as characters.
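The range rule above can be sketched as a small predicate. This is only an illustration: scalar-value? is a hypothetical helper name, not a Kawa or R7RS procedure, and the sketch assumes exact-integer? is available (it is part of R7RS).

```scheme
;; Hypothetical helper (not part of Kawa): is n a Unicode scalar value?
;; A scalar value is any code point outside the surrogate range.
(define (scalar-value? n)
  (and (exact-integer? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

(scalar-value? #x41)      ; ⇒ #t  (LATIN CAPITAL LETTER A)
(scalar-value? #xD800)    ; ⇒ #f  (first surrogate code point)
(scalar-value? #xDFFF)    ; ⇒ #f  (last surrogate code point)
(scalar-value? #x10FFFF)  ; ⇒ #t  (largest scalar value)
(scalar-value? #x110000)  ; ⇒ #f  (beyond the Unicode range)
```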

Type: character

A Unicode code point - normally a Unicode scalar value, but could be a surrogate. This is implemented using a 32-bit int. When an object is needed (i.e. the boxed representation), it is implemented as an instance of gnu.text.Char.

Type: character-or-eof

A character or the special #!eof value (used to indicate end-of-file when reading from a port). This is implemented using a 32-bit int, where the value -1 indicates end-of-file. When an object is needed, it is implemented as an instance of gnu.text.Char or the special #!eof object.

Type: char

A UTF-16 code unit. Same as Java primitive char type. Considered to be a sub-type of character. When an object is needed, it is implemented as an instance of java.lang.Character. Note the unfortunate inconsistency (for historical reasons) of char boxed as Character vs character boxed as Char.

Characters are written using the notation #\character (which stands for the given character), #\xhex-scalar-value (the character whose scalar value is the given hex integer), or #\character-name (a character with a given name):

character ::= #\any-character
        | #\ character-name
        | #\x hex-scalar-value
        | #\X hex-scalar-value

The following character-name forms are recognized:

#\alarm: #\x0007 - the alarm (bell) character
#\backspace: #\x0008
#\delete
#\del
#\rubout: #\x007f - the delete or rubout character
#\escape
#\esc: #\x001b
#\newline
#\linefeed: #\x000a - the linefeed character
#\null
#\nul: #\x0000 - the null character
#\page: #\x000c - the formfeed character
#\return: #\x000d - the carriage return character
#\space: #\x0020 - the preferred way to write a space
#\tab: #\x0009 - the tab character
#\vtab: #\x000b - the vertical tabulation character
#\ignorable-char: A special character value, but it is not a Unicode code point. It is a special value returned when an index refers to the second char (code point) of a surrogate pair, and which should be ignored. (When writing a character to a string or file, it will be written as one or two char values. The exception is #\ignorable-char, for which zero char values are written.)
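As a brief illustration of the three notations, the expressions below compare a literal character, a hex scalar value, and a named character. They assume the Kawa default environment; #\linefeed is one of the Kawa names listed above and may not be recognized by other Scheme implementations.

```scheme
;; The same character written in different notations.
(char=? #\A #\x41)             ; ⇒ #t  (literal vs. hex scalar value)
(char=? #\space #\x20)         ; ⇒ #t  (named vs. hex scalar value)
(char=? #\newline #\linefeed)  ; ⇒ #t  (two names for one character)

;; Named and hex characters are ordinary characters.
(char->integer #\tab)          ; ⇒ 9
(char->integer #\x3BB)         ; ⇒ 955  (GREEK SMALL LETTER LAMDA)
```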

Procedure: char? obj

Return #t if obj is a character, #f otherwise. (The obj can be any character, not just a 16-bit char.)

Procedure: char->integer char

Procedure: integer->char sv

sv should be a Unicode scalar value, i.e., a non-negative exact integer object in [0, #xD7FF] union [#xE000, #x10FFFF]. (Kawa also allows values in the surrogate range.)

Given a character, char->integer returns its Unicode scalar value as an exact integer object. For a Unicode scalar value sv, integer->char returns its associated character.
(integer->char 32)                     ⇒ #\space
(char->integer (integer->char 5000))   ⇒ 5000
(integer->char #\xD800)                ⇒ throws ClassCastException
Performance note: A call to char->integer is compiled as casting the argument to a character, and then re-interpreting that value as an int. A call to integer->char is compiled as casting the argument to an int, and then re-interpreting that value as a character. If the argument is the right type, no code is emitted: the value is just re-interpreted as the result type.
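One common use of this pair of procedures is arithmetic on scalar values. The sketch below is illustrative only: char-offset is a hypothetical helper, not a Kawa procedure, and it does no range checking, so the caller must keep the result within the valid range.

```scheme
;; Hypothetical helper: the character whose scalar value is
;; `offset` greater than that of `ch`.
(define (char-offset ch offset)
  (integer->char (+ (char->integer ch) offset)))

(char-offset #\a 1)   ; ⇒ #\b
(char-offset #\A 25)  ; ⇒ #\Z
(char-offset #\0 7)   ; ⇒ #\7
```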

Procedure: char=? char₁ char₂ char₃ …

Procedure: char<? char₁ char₂ char₃ …

Procedure: char>? char₁ char₂ char₃ …

Procedure: char<=? char₁ char₂ char₃ …

Procedure: char>=? char₁ char₂ char₃ …

These procedures impose a total ordering on the set of characters according to their Unicode scalar values.
(char<? #\z #\ß)      ⇒ #t
(char<? #\z #\Z)      ⇒ #f
Performance note: This is compiled as if converting each argument using char->integer (which requires no code) and then using the corresponding int comparison.
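Because these procedures accept more than two arguments, a chained comparison can express a range test directly. In the sketch below, ascii-lower? is a hypothetical helper name used only for illustration.

```scheme
;; Hypothetical helper: is ch one of the ASCII letters a to z?
;; The three-argument form checks #\a <= ch and ch <= #\z.
(define (ascii-lower? ch)
  (char<=? #\a ch #\z))

(ascii-lower? #\m)    ; ⇒ #t
(ascii-lower? #\M)    ; ⇒ #f  (#x4D is less than #x61)
(ascii-lower? #\ß)    ; ⇒ #f  (#xDF is greater than #x7A)

;; Every adjacent pair must satisfy the comparison.
(char<? #\a #\b #\c)  ; ⇒ #t
(char<? #\a #\c #\b)  ; ⇒ #f
```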

Procedure: digit-value char

This procedure returns the numeric value (0 to 9) of its argument if it is a numeric digit (that is, if char-numeric? returns #t), or #f on any other character.
(digit-value #\3)        ⇒ 3
(digit-value #\x0664)    ⇒ 4
(digit-value #\x0AE6)    ⇒ 0
(digit-value #\x0EA6)    ⇒ #f
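Since digit-value returns #f for anything that is not a digit, it can drive a simple decimal parser. The following is a minimal sketch, assuming the Kawa default environment (other R7RS systems would need to import (scheme base) and (scheme char)); digits->integer is a hypothetical helper, not a Kawa procedure.

```scheme
;; Hypothetical helper: convert a string of decimal digits to an
;; exact integer, or return #f if any character is not a digit.
(define (digits->integer str)
  (let loop ((i 0) (acc 0))
    (if (= i (string-length str))
        acc
        (let ((d (digit-value (string-ref str i))))
          ;; `and` stops with #f as soon as a non-digit is seen.
          (and d (loop (+ i 1) (+ (* acc 10) d)))))))

(digits->integer "2024")            ; ⇒ 2024
(digits->integer "12a")             ; ⇒ #f
(digits->integer "\x0664;\x0662;")  ; ⇒ 42  (Arabic-Indic digits 4 and 2)
```

Note that this sketch returns 0 for the empty string; a stricter parser might return #f in that case.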